Web Data Extraction Part I

Extracting data from web pages is a big part of Web Automation. In WinAutomation there are four actions dedicated to this task, with "Extract Data from Web Page" being the most important and versatile.

The other three actions allow you to take a screenshot of a web page element, retrieve details of a web page such as its title or its HTML source, and retrieve any HTML attribute of any web page element or even its text.

Very often, however, you want to retrieve information displayed on the page in the form of tables or lists, rather than technical values. This is where the "Extract Data from Web Page" action comes into play:

webdataextractioni1.png

As with any other web-related action, you first need to specify the web browser instance containing the page you want to extract data from. The next step is to specify the data itself, and the last is to select where the extracted data will be stored. By default the data is written into a newly generated Excel spreadsheet, which requires Microsoft Excel to be installed on your computer.

Alternatively, you can have the data stored in a variable for further processing by later actions. Note that the extracted data can be in any of the following forms:

1. Single Value:

Say that from a web page containing info about a product, you extract the product name only. In this case, if the extracted data is stored in a variable, this variable will contain a text value.

2. Handpicked (multiple) values:

Say that, in the previous example, you choose to extract not only the product name, but also its description and its price. In this case three separate values will be extracted and the resulting variable will hold a value of type DataRow.

You can access each of the retrieved values using the form %DataFromWebPage[...]%, where the brackets contain either a number or the name of the value.

3. Lists:

You are no longer in the page containing the product info, but in a page containing the list of all products. If you choose to retrieve all the product names displayed in the page then you'll end up with a list. Subsequently, the variable holding the extracted data will be of type List.

4. Tables:

In the previous example of the web page containing a list of products you select to retrieve both the name and the price for each product. In this case the resulting variable will hold a DataTable with a product in each row and two columns (with the product name stored in the first column and the product price in the second one).
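The four result shapes above can be pictured with ordinary Python values. This is only an analogy to show how the shapes differ: it is not WinAutomation syntax, the product data is made up, the helper name row_get is hypothetical, and the zero-based positions follow Python's convention rather than anything stated in this help file.

```python
# Python analogues of the four result shapes (illustrative only).

# 1. Single value: plain text.
single_value = "Acme Anvil"

# 2. Handpicked values: one row whose fields can be read by position or by name.
row_names = ["Name", "Description", "Price"]
row_values = ["Acme Anvil", "Drop-forged steel anvil", "$49.99"]


def row_get(key):
    """Return a field by position (int, zero-based here) or by name (str)."""
    if isinstance(key, int):
        return row_values[key]
    return row_values[row_names.index(key)]


# 3. List: the same kind of value repeated for every product on the page.
product_names = ["Acme Anvil", "Acme Rocket", "Acme Magnet"]

# 4. Table: one row per product, one column per extracted value.
product_table = [
    ["Acme Anvil", "$49.99"],
    ["Acme Rocket", "$199.00"],
    ["Acme Magnet", "$12.50"],
]

print(row_get(0), "|", row_get("Price"))
print(len(product_table), "rows x", len(product_table[0]), "columns")
```

The shape you get depends on what you select in the page, as described in the next section.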

Selecting the Data to extract from a Web Page

To specify which data you want to extract from the web page you will need to use a Data Extraction Web Helper. The target data can be specified either by using the live version of the Web Helper, or the standard one.

Live Web Helpers conveniently work on an existing Internet Explorer window. Just have the "Extract Data from Web Page" action open in the designer and click on the Internet Explorer window of interest.

The standard Web Helper, on the other hand, is a browser window itself and opens when you press the "Specify Web Data to Extract" button.

Should you click on the "Specify Data to Extract" button the Web Helper Window will appear.

webdataextractioni2.png

This window consists of two parts: the left pane, which is the web browser, and the right sidebar, which displays a preview of the data selected for extraction.

As with the Web Helpers, the first step is to enter the URL in the address bar and navigate to the page containing the data to be extracted.

Next, all you have to do is right click on any element of the page that you want to retrieve and select the property you want to extract. Most often you will want to extract the text of the element, but you can also choose any HTML attribute to retrieve.

webdataextractioni3.png

At any point you can press the "Accept" button and finish the process of selecting the data you want to extract, or you may continue by selecting more elements. Depending on the elements you select, the Web Helper may or may not expand the selection. For example, if the next element you select is the URL of the element chosen in the previous screenshot (shown in green in the screenshots), you will have just two elements selected:

webdataextractioni4.png

If, however, the second element you selected was another search result title, WinAutomation would detect that you are extracting a list and would expand the selection to all items of the list:

webdataextractioni5.png

On the right sidebar you see the preview of the data to be extracted in the form of a list.
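One way to picture how two similar selections can expand into a whole list is to compare the paths of the two elements and keep only what they have in common. The sketch below is a conceptual illustration in Python, not a description of WinAutomation's actual detection logic; the path representation and the function names generalize and matches are hypothetical.

```python
def generalize(path_a, path_b):
    """Merge two element paths into a pattern; None marks "any position".

    Each path is a list of (tag, position) steps from the page root.
    Returns None when the paths differ in length or in tag names, i.e. the
    two elements do not sit at the same place in a repeating structure.
    """
    if len(path_a) != len(path_b):
        return None
    pattern = []
    for (tag_a, pos_a), (tag_b, pos_b) in zip(path_a, path_b):
        if tag_a != tag_b:
            return None
        # Keep the position where both paths agree, otherwise accept any.
        pattern.append((tag_a, pos_a if pos_a == pos_b else None))
    return pattern


def matches(pattern, path):
    """Return True if an element path fits the generalized pattern."""
    if len(pattern) != len(path):
        return False
    return all(
        p_tag == tag and (p_pos is None or p_pos == pos)
        for (p_tag, p_pos), (tag, pos) in zip(pattern, path)
    )


# Titles of the first and third search results (made-up paths).
first = [("ul", 1), ("li", 1), ("a", 1)]
third = [("ul", 1), ("li", 3), ("a", 1)]
pattern = generalize(first, third)

# The second result was never clicked, yet it fits the pattern.
second = [("ul", 1), ("li", 2), ("a", 1)]
print(pattern, matches(pattern, second))
```

In this picture, selecting a title and its URL within the same result gives two unrelated paths and no generalization, which corresponds to the "just two elements selected" case described earlier.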

At this point you have specified that you want to extract a list. If you select an additional element, WinAutomation will extract the corresponding data for each item already in the list, returning the result in the form of a table:

webdataextractioni6.png

Selecting yet another element simply adds another column to the table. You can edit the column names by clicking on them in the preview sidebar.

If the data is spread over multiple pages, there will usually be a "Next" link somewhere that points to the next page. You can right click on that link and select "Set This Element As Pager". This way, at runtime WinAutomation will not stop after the first page but will continue and retrieve the same data from the following pages too.
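The pager behaviour amounts to a simple loop: extract from the current page, follow the "Next" link, and repeat until there is no further link. The Python sketch below illustrates that loop on a made-up, in-memory set of pages; the PAGES structure, the URLs and the function name extract_all are assumptions for illustration and do not reflect how WinAutomation is implemented.

```python
# A made-up three-page product listing; "next" plays the role of the pager link.
PAGES = {
    "/products?page=1": {"items": ["Anvil", "Rocket"], "next": "/products?page=2"},
    "/products?page=2": {"items": ["Magnet", "Spring"], "next": "/products?page=3"},
    "/products?page=3": {"items": ["Glue"], "next": None},
}


def extract_all(start_url, pages):
    """Collect items from start_url and every page reachable via "next"."""
    results = []
    visited = set()
    url = start_url
    # Stop when there is no next link, or when a link loops back to a seen page.
    while url is not None and url not in visited:
        visited.add(url)
        page = pages[url]
        results.extend(page["items"])
        url = page.get("next")
    return results


all_items = extract_all("/products?page=1", PAGES)
print(all_items)
```

The visited set is a defensive choice for sites whose last page links back to an earlier one; without it such a loop would never end.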

At any point you can press the "Reset" button webdataextractionresetbutton.png to discard the selection made so far and start over. You can also examine and modify the CSS Selectors generated by the Web Helper that specify which info needs to be extracted by pressing the "Advanced Settings" button webdataextractionadvancedsettingsbutton.png.
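To give a feel for what selector-driven extraction means, the sketch below uses Python's standard html.parser module to mimic two simple class selectors (.name and .price) and pair their matches into table rows. The HTML sample, the class names and the helper names are invented for illustration; the selectors WinAutomation generates are not shown in this help file and may look quite different. The parser is deliberately minimal and assumes well-formed markup with no void elements such as <br> inside the matched elements.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element carrying a given class,
    roughly what the CSS selector ".classname" would match."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # > 0 while inside a matching element
        self.buffer = []    # text pieces of the current match
        self.results = []   # finished matches, in document order

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1             # nested tag inside a match
        elif self.class_name in classes:
            self.depth = 1              # a new match starts here
            self.buffer = []

    def handle_endtag(self, tag):
        if self.depth > 0:
            self.depth -= 1
            if self.depth == 0:         # the matching element just closed
                self.results.append("".join(self.buffer).strip())

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def select_text(html, class_name):
    """Return the text of all elements with the given class."""
    parser = ClassTextExtractor(class_name)
    parser.feed(html)
    parser.close()
    return parser.results


SAMPLE = """
<ul>
  <li><span class="name">Acme Anvil</span> <span class="price">$49.99</span></li>
  <li><span class="name">Acme <b>Rocket</b></span> <span class="price">$199.00</span></li>
</ul>
"""

names = select_text(SAMPLE, "name")
prices = select_text(SAMPLE, "price")
table = [list(pair) for pair in zip(names, prices)]  # one row per product
print(table)
```

Each selector yields one column, and pairing the columns row by row gives the table shape described earlier. Adjusting a selector therefore changes which elements end up in that column.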

Finally you can press the "Recalculate Now" button webdataextractionrecalculatebutton.png to highlight which data will be extracted from a web page based on the current selection. This can be useful if, for example, you select some elements to extract from a web page containing info about a product. You can then visit a page containing info on another product and click the "Recalculate Now" button to make sure that appropriate info will be retrieved from the second page too.

As mentioned before, once you have finished selecting the data you want to extract, you can press the "Accept" button to return to the action's properties dialog.