KAPOW

Welcome to the Kapow forum. Here you can get help, use your skills to help others and enjoy hanging out in the company of other Kapow Robot Developers.


CRAWL PAGES

Shyam Kumar

Posts : 73
Points : 1622
Join date : 2013-07-05
Location : Kerala, India

CRAWL PAGES

Post by Shyam Kumar on Tue Sep 27, 2016 12:43 pm

   

 The Crawl Pages action loops through the pages of a web site. In effect, it crawls the web site one web page at a time. Hence, the first iteration crawls the first page, the second iteration crawls the second page, and so on.


The Crawl Pages action accepts a loaded page as part of the input, such as the start page of the web site. The output contains the next crawled web page.


The Crawl Pages action results in each traversed link being fully loaded and any JavaScript on the page being executed. You can verify this behavior by creating a robot that crawls a few levels of a website that you know uses JavaScript.
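Kapow configures crawling in the GUI rather than in code, but the behavior described above can be sketched as an ordinary breadth-first link traversal. The Python below is a conceptual sketch only, not Kapow's API; `get_links` and `should_crawl` are hypothetical stand-ins for the links extracted from a loaded page and the robot's crawling rules.

```python
from collections import deque

def crawl_pages(start_url, get_links, should_crawl):
    """Conceptual sketch of the Crawl Pages loop: each iteration takes
    the next page, and every traversed link is visited exactly once."""
    visited = set()
    queue = deque([start_url])
    output = []
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        # In Kapow this is the point where the page is fully loaded
        # and its JavaScript is executed before links are extracted.
        output.append(url)
        for link in get_links(url):
            if should_crawl(link):
                queue.append(link)
    return output
```

Each pass through the loop corresponds to one iteration of the Crawl Pages step: the first iteration outputs the first page, the second iteration the next, and so on.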


Turn on the BROWSER TRACER tool (TOOLS > Open Browser Tracer, then press the RED button to start recording traffic).


As you step through the robot, you can use the BROWSER TRACER tool to verify the HTTP traffic as well as which JavaScript was executed.


If you wish to have CRAWL PAGES traverse links/pages without executing any JavaScript, you can turn off EXECUTE JAVASCRIPT at the global configuration level for the entire robot (File > Configure Robot > BASIC tab > [CONFIGURE] > JavaScript Execution).


How to Crawl an Entire Site


In this example, we wish to crawl an entire site.

  1. Add a step with the Load Page action that loads the main page.
  2. Add a new step and choose the Crawl Pages action.
  3. On the Rules tab, add a Crawling Rule that applies to all pages in the site, e.g. by specifying the domain that the pages belong to or by making a pattern that the URL should match. For these pages, the rule should specify "Crawl Entire Page" and "Output the Page".
  4. On the Rules tab, set the "For all Other Pages" property to "Do Not Crawl".
  5. After the step with the Crawl Pages action, add steps to handle each page, e.g. by extracting information into returned variables.
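The Crawling Rule in step 3 is configured in the GUI, but conceptually it is just a predicate over URLs. The Python below is illustrative only; `same_site_rule` is a hypothetical name for a rule that matches pages by domain.

```python
from urllib.parse import urlparse

def same_site_rule(domain):
    """Crawling Rule from step 3: pages on the given domain are crawled
    ('Crawl Entire Page' / 'Output the Page'); any other URL falls
    through to step 4's 'Do Not Crawl'."""
    def should_crawl(url):
        return urlparse(url).netloc == domain
    return should_crawl
```

Specifying a URL pattern instead of a domain would work the same way: the rule either accepts a page for crawling or rejects it.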


How to Crawl a Popup Menu


In this example, we wish to discover all the pages that a popup menu links directly to. We do not wish to continue crawling from these pages.

  1. Add a step with the Load Page action that loads the main page.
  2. Add a new step and choose the Crawl Pages action.
  3. Select the menu bar as the named tag.
  4. Notice that the "Automatically Handle Popup Menus" option on the Crawling tab is checked.
  5. On the Rules tab, add a Crawling Rule saying that for "All URLs" we "Do Not Crawl", but "Output the Page".
  6. After the step with the Crawl Pages action, add steps to handle each page, e.g. by extracting information into returned variables.
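In other words, this is a depth-one crawl: every page the menu links to is loaded and output, but none of its own links are followed. A minimal Python sketch of that idea (illustrative only; `load_page` stands in for Kapow loading the linked page):

```python
def crawl_menu(menu_links, load_page):
    """Depth-one crawl: output each page the popup menu links to,
    without continuing to crawl from those pages
    ('Do Not Crawl' + 'Output the Page')."""
    seen = set()
    pages = []
    for url in menu_links:
        if url in seen:
            continue  # each linked page is output only once
        seen.add(url)
        pages.append(load_page(url))
    return pages
```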

E^E_2016

Posts : 10
Points : 351
Join date : 2016-08-29

Crawl pages

Post by E^E_2016 on Fri Oct 21, 2016 9:18 am

Hi,

I'm really new to this, and there isn't much helpful resource for Kapow, so I'm hoping to get some urgent help here.
I need to crawl pages from a Google search for a given keyword.

Problem is, I notice it is mentioned that you use the Crawl Pages action step. How come I couldn't find this in the action list?
I am using Kapow Design Studio 9.6.2. See the attached screenshot below.

Also, how can I crawl the content of every URL in the Google search results as plain HTML? How can I verify that the data is extracted properly from each crawled URL when the pages have different structures? In other words, how can I configure the robot to crawl data from pages whose structure differs from URL to URL? Can you show me this with a screenshot, please?

That will be very helpful. Thanks.





Shyam Kumar

Re: CRAWL PAGES

Post by Shyam Kumar on Tue Oct 25, 2016 9:33 am

Hi,
You can add a Crawl Pages action step using either of the following methods:

1. Select an action, then choose Loop, then select the "Crawl Pages" step.



2. Select an action, then choose All, then select the "Crawl Pages" step.



Thank you,

Regards,


Shyam Kumar P

E^E_2016


Re: CRAWL PAGES

Post by E^E_2016 on Wed Oct 26, 2016 7:53 am

Hi Shyam,

Here's my screenshot feedback. Under Loop and All, there is no "Crawl Pages" function to be seen. I am using Design Studio 9.6.2. Why don't I see it? Is there any alternative way to crawl Google search pages? Please advise what I can do to fix this. Thanks.

Screenshot 1 - I click Select an Action and look under "Loop". I do not see the "Crawl Pages" function.


Screenshot 2 - I click Select an Action and look under "All". I do not see the "Crawl Pages" function either.

Shyam Kumar

Re: CRAWL PAGES

Post by Shyam Kumar on Wed Oct 26, 2016 4:23 pm

Hi,

Maybe that action step is not available in version 9.6.2.

If you need to extract data from a website, you do not need to use the Crawl Pages action step; you can choose alternative steps for grabbing the data.


Thank you

Regards,

Shyam kumar

E^E_2016


Re: CRAWL PAGES

Post by E^E_2016 on Wed Oct 26, 2016 5:13 pm

Hi Shyam,

So what is the alternative method if I can't use the "Crawl Pages" function? Especially if each URL has a different structure, how can I create robots to meet this requirement?

My requirement is basically to extract the list of all URLs that appear in the Google search pages for a given keyword, plus the content of each URL in plain HTML format, and store it all in a database. (The plain HTML content would include the title, URL, date, and body content.)

Hope you can advise me on this. Thank you so much.
