Web Data Extraction Part II

Web data extraction can also be performed on a live Internet Explorer window: keep the "Extract data from Web Page" action open and move your mouse pointer over the page of interest.

When you click in the web page, the "Live Web Helper- Extract Data From Web Page" window pops up. In this window you can preview the extracted data.

Extracting a List:

Let's say that you wish to extract the titles of all available results on a web page.

With the "Extract data from Web Page" action open, hover your mouse over the page (or click on a blank area). Then right-click the first result and extract its Text, as in the screenshot below:

Extract1.png

Do the same for the second result, and a list of all the items' text will be extracted automatically. Click the "Advanced Settings" icon to review the CSS selector, which you can modify to make it more efficient.

1. As you can see, when extracting a list we have the Base Selector and the CSS selector. The Base Selector is the root element in the HTML code under which the items of the list are found. This means that the extraction starts from ".....div:eq(1) > ul > li".

2. For each item matched by "...div:eq(1) > ul > li", the extraction then gets the "h3 > a" element.

3. The attribute that you are extracting is "Own Text". It can be changed to "Title", "Href", "SourceLink", "Exists" or any other attribute that is available in the HTML code of the page for this element.

4. You also have the option to apply Regular Expressions to the extracted text, in order to get just a part of it.
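The four points above can be sketched outside the tool. The following is a minimal illustration, not the tool's own engine: it assumes a small, well-formed, hypothetical page fragment and uses Python's standard `xml.etree.ElementTree` paths as a stand-in for the CSS selectors (`./div[2]/ul/li` plays the role of "div:eq(1) > ul > li", and `./h3/a` the role of "h3 > a"). The sample titles and the regular expression are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page fragment; the real page behind the screenshots is not shown here.
PAGE = """
<div>
  <div>sidebar</div>
  <div>
    <ul>
      <li><h3><a href="/item/1" title="First">First product</a></h3></li>
      <li><h3><a href="/item/2" title="Second">Second product</a></h3></li>
      <li><h3><a href="/item/3" title="Third">Third product</a></h3></li>
    </ul>
  </div>
</div>
""".strip()

root = ET.fromstring(PAGE)

# Base selector: yields one element per list item.
# ElementTree positions are 1-based, so div[2] roughly corresponds to div:eq(1).
items = root.findall("./div[2]/ul/li")

# Relative selector, evaluated inside each item; "Own Text" maps to .text here.
titles = [item.find("./h3/a").text for item in items]

# Optional regular expression applied to each extracted text:
# keep only the leading word of every title.
first_words = [re.match(r"\w+", title).group(0) for title in titles]

print(titles)
print(first_words)
```

Switching the extracted attribute, as in point 3, would amount to replacing `.text` with a lookup such as `.get("href")` or `.get("title")` in this sketch.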

After changing the selector by hand, click the "Recalculate now" button (webdataextractionrecalculatebutton.png) to see the extraction result.

Extract2.png

Extracting a Table:

To extract more than one piece of information for each result, you have to extract a table.

Let's say that we want to extract the Title of the product, the link behind it and the price.

For the first result, we right-click the title and extract its "Text", right-click it again to extract the "Href", and finally right-click the price element to extract its "Text".

We then do the same for the second result/product, and the table is created automatically in the extraction preview window.

As with the list, the table has a Base CSS Selector, which is the root element in the HTML code under which the data of each result/product exists. This means that the extraction starts from ".....div:eq(1) > ul > li", and then for each item we extract:

  • h3 > a Attribute "Own Text"

  • h3 > a Attribute "Href"

  • ul:eq(0) > li:eq(0) > span Attribute "Own Text"

Extract3.png
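The table extraction described above can be approximated as one row per base-selector match and one column per relative selector. This is only a sketch under stated assumptions: the page fragment, product names, links and prices are hypothetical, and `xml.etree.ElementTree` paths stand in for the CSS selectors (`./ul[1]/li[1]/span` for "ul:eq(0) > li:eq(0) > span").

```python
import xml.etree.ElementTree as ET

# Hypothetical result list with a title link and a price for each product.
PAGE = """
<div>
  <div>sidebar</div>
  <div>
    <ul>
      <li>
        <h3><a href="/item/1">First product</a></h3>
        <ul><li><span>$10.00</span></li></ul>
      </li>
      <li>
        <h3><a href="/item/2">Second product</a></h3>
        <ul><li><span>$24.50</span></li></ul>
      </li>
    </ul>
  </div>
</div>
""".strip()

root = ET.fromstring(PAGE)

rows = []
# Base selector: only the top-level result items, not the nested price <li> elements.
for item in root.findall("./div[2]/ul/li"):
    link = item.find("./h3/a")                # h3 > a
    price = item.find("./ul[1]/li[1]/span")   # ul:eq(0) > li:eq(0) > span
    rows.append((link.text, link.get("href"), price.text))

for row in rows:
    print(row)
```

Each tuple corresponds to one row of the preview table: title text, link target, and price text.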

Attributes to extract:

In the Attribute field of the "Advanced Settings" of the "Extraction Preview" window, you can specify any attribute that the element has, not only those listed in the drop-down list. For example, if an element in the HTML code of the page is:

<li class="sresult lvresult clearfix li shic" id="item463b90d307" _sp="p2045573.m1686.l2210" r="3" listingid="301647057671">.......</li>

Then in the attribute drop-down list you can type "class" to extract its class, "id" to extract its id, and so on.
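As a rough illustration of what typing an attribute name amounts to, the sketch below parses the element shown above with Python's standard `xml.etree.ElementTree` and looks up attributes by name. The helper `extract_attribute` is hypothetical and exists only for this example; the element's elided content is kept as a placeholder.

```python
import xml.etree.ElementTree as ET

# The element from the example above, with its elided content left as a placeholder.
ELEMENT = (
    '<li class="sresult lvresult clearfix li shic" id="item463b90d307" '
    '_sp="p2045573.m1686.l2210" r="3" listingid="301647057671">...</li>'
)
li = ET.fromstring(ELEMENT)


def extract_attribute(element, name):
    """Hypothetical helper: return the named attribute's value, or None if absent."""
    return element.get(name)


print(extract_attribute(li, "class"))
print(extract_attribute(li, "id"))
print(extract_attribute(li, "listingid"))
print(extract_attribute(li, "href"))  # this element has no href, so the lookup yields None
```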

NOTE

  • To extract the plain HTML code of the element and all its child elements, type "outerhtml".

  • To extract the plain HTML code of only the element's child elements, type "innerhtml".

This is very helpful if you want to extract a piece of information that resides in the HTML of this element, by applying Regular Expressions to the extracted code.
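The difference between the two values, and the way a Regular Expression can pull a detail out of the extracted code, can be sketched as follows. This is an approximation rather than the tool's implementation: the element, the `ref` query parameter and the helpers `outer_html` and `inner_html` are all hypothetical, built on Python's standard `xml.etree.ElementTree` and `re` modules.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical element whose link carries a value that is not visible as text.
ELEMENT = (
    '<li id="item1"><h3><a href="/item/1?ref=42">First product</a></h3>'
    '<span>$10.00</span></li>'
)
li = ET.fromstring(ELEMENT)


def outer_html(element):
    """Markup of the element itself plus everything inside it."""
    return ET.tostring(element, encoding="unicode")


def inner_html(element):
    """Markup of the element's content only, without its own opening and closing tags."""
    children = "".join(ET.tostring(child, encoding="unicode") for child in element)
    return (element.text or "") + children


outer = outer_html(li)
inner = inner_html(li)

# Regular expressions applied to the extracted code.
ref = re.search(r"ref=(\d+)", inner).group(1)
price = re.search(r"\$(\d+\.\d{2})", inner).group(1)

print(outer)
print(inner)
print(ref, price)
```

Here the outer markup starts with the `<li ...>` tag, while the inner markup starts at `<h3>`; the regular expressions then isolate the `ref` value and the numeric part of the price.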