Recently I was contacted by a visitor to this site who asked me to put together a tutorial on using Talend for web data crawling. This interested me, as I have come across situations where I have used other software to scrape websites for data (links, pictures, email addresses). While it is not difficult to find software to do this, it usually comes at a cost or is very limited in what it can do. After a few minutes of Googling, I came across several Java libraries which offer this functionality. It then dawned on me that this tutorial could "kill two birds with one stone": I can talk about using third party libraries AND web scraping in one tutorial. If this is not entirely what you were asking for, Gabriele, I apologise, but I believe from looking at your website that you are more than capable of extrapolating from this. For people interested in Big Data, you may wish to visit Gabriele's website which is here.
One of the things that makes Talend so powerful is the ability to use third party Java libraries to enhance the existing functionality. No matter how many pieces of functionality are supplied with a tool like this, there will always be something that can't be done out of the box. However, with a bit of Java knowledge, there is almost nothing (that I can think of) that cannot be achieved in the data integration world. This tutorial will show you how you can use an existing Java API to scrape data from a webpage and use it in Talend. The website I have chosen is a Formula 1 statistics site (http://www.statsf1.com/en/2014.aspx). I am a fan of F1 and love playing around with the stats, only to find that they give absolutely no insight into making race predictions whatsoever :-). But it is useful to be able to consume the data straight from the webpage without having to do anything by hand. The Java API I have chosen to use is HTMLParser. This API enables you to parse HTML in order to extract data. In this tutorial we are simply getting data from <table> tags. However, you can use this API to get hold of practically anything you may require from a webpage. You will need a bit of Java and HTML knowledge to make effective use of it though. For this tutorial, you will need to download the API from the HTMLParser site.
Before we look at the Job (which is relatively simple for this example), I will take you through the Java that I have used to extract the <table> data. This Java is found in a code routine called "WebScraperUtilities" which can be found in the Job export linked to at the bottom of the page.
This routine makes use of the HTMLParser API to extract data from the <table> tags in the web page we will be scraping. The JavaDocs for this API are useful for understanding the code below and extrapolating from it.
The code below provides a static method called "parseF1Table". This method takes a String parameter which holds the HTML to be processed. It returns an ArrayList of ArrayLists containing Strings. The outer ArrayList represents the table and holds the rows (the contained ArrayLists). Each row is an ArrayList of Strings, one String per column. In a more complicated example you could keep the data in specific data types (Integer, String, Double, etc), but this is just a simple example to plant a seed.... which you can then build upon.
In order to gather the data, this method instantiates the NodeVisitor class and overrides its "visitTag" method. This method is called for every HTML tag in the provided html String. We have overridden it to simply look for TableTag objects and, once they are found, to collect the column and row data using the ArrayLists nested in an ArrayList.
Essentially this method will collect all of the table data on the F1 page that we are using. There are some specifics in this method that I should point out. Collecting header information can be tricky if the table header tags are not used. They are not used here, so I have had to use some logic that I worked out from looking at the html. There are 22 columns of data in the table, but only 21 columns in the header rows. This is how I am identifying the header rows. For column header rows I am adding an empty string as the first column. This ensures that when the data is used, each column header is correctly aligned with its data. This is not ideal, but it is an example of how a Talend Job that scrapes webpage data will likely need to be tailored to the individual page. A one size fits all solution is not going to be easy and it will likely lead to massive inefficiencies.
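To make the return shape and the header padding concrete, here is a minimal sketch. It is not the routine from the Job export: it swaps HTMLParser for a small regular expression so that it runs without the jar, it will not cope with nested tables, and the class name "TableShapeSketch" and the "dataColumns" parameter are names I have made up for illustration.

```java
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the routine: a regex replaces HTMLParser here,
// purely so the example is self-contained.
final class TableShapeSketch {

    private static final Pattern ROW = Pattern.compile(
            "<tr\\b[^>]*>(.*?)</tr>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern CELL = Pattern.compile(
            "<t[dh]\\b[^>]*>(.*?)</t[dh]>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the table as rows (outer list) of columns (inner lists).
    // A row that is exactly one cell short of dataColumns is treated as a
    // header row and padded with a leading empty string, as described above.
    static ArrayList<ArrayList<String>> parseTable(String html, int dataColumns) {
        ArrayList<ArrayList<String>> table = new ArrayList<>();
        Matcher row = ROW.matcher(html);
        while (row.find()) {
            ArrayList<String> columns = new ArrayList<>();
            Matcher cell = CELL.matcher(row.group(1));
            while (cell.find()) {
                // Drop any tags nested inside the cell and keep the text only
                columns.add(cell.group(1).replaceAll("<[^>]+>", "").trim());
            }
            if (columns.size() == dataColumns - 1) {
                columns.add(0, "");
            }
            table.add(columns);
        }
        return table;
    }
}
```

For the F1 page you would pass 22 as "dataColumns", so that the 21-column header rows line up with the data rows.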
The code carrying out this logic can be seen below.....
//Define an ArrayList to hold the data found by the NodeVisitor class
//Instantiate an instance of the Parse class
Before the code above will compile, we need to link it to the HTMLParser.Jar. In the real world, this would normally be done before any code is written. The following section shows how to link Jars to a Java routine. These steps will need to be undertaken after downloading and unpacking the HTMLParser.Jar file, in order to use the tutorial code which is linked to at the bottom of the page.
1) Right click on the code routine
....and select the "Edit Routine Libraries".
The "Import External Library" window will appear.
2) Click on the "New" button
.....to bring up the "New Module" window.
3) Click on the "Browse a library file" radio button
......and then click on the "Browse" button circled in blue below.
4) Find where you downloaded the HTMLParser.Jar
......and select it. Click on the "Open" button.
5) On the "New Module" window, click "OK"
6) On the "Import External Library" window, click on "Finish"
The Java library is now linked to the routine.
Now that we have the routine code explained and we know how to link an external Jar to a routine, we can look at the Job that makes use of the library's functionality.
The WebScraper Job
This is a very simple Job that has been put together to show off using the code routine with a third party library. As such, all it does is get an HTML string from a URL, process it using the routine above and output the table data to a file. In the real world I imagine you will want to do something a bit more useful than that, but this Job gives you a starting point to extrapolate from. The Job can be seen below....
1) Create the Context variables
This Job needs just two context variables: OutputFile and WebsiteURL. These can be seen below.
The values that are used for these variables are simply to identify where the output file will be written to and which website to process. As this example was tailored to the website shown below, it would be a good idea to test it with that. Once you are familiar with what is going on, then you may wish to try some modified code on other sites.
The website URL is: http://www.statsf1.com/en/2014.aspx
2) "Load website" (tHttpRequest)
This component is used to load the website and retrieve the html string. It doesn't require much configuration. All that needs to be done is for the context variable "WebsiteURL" to be used in the "URI" field. This is seen below, circled in red.
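If you are wondering what this step amounts to outside of Talend, it is roughly a plain HTTP GET whose response body is read into a String. The sketch below is my own illustration using only the standard Java library, not what the component generates; the class and method names are hypothetical, and UTF-8 is assumed for the response encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the Job: fetches a page as one String.
final class LoadWebsiteSketch {

    // Reads the whole stream into a String (UTF-8 is an assumption).
    static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Issues a GET against the given URL and returns the response body.
    static String fetch(String websiteUrl) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(websiteUrl).toURL().openConnection();
            conn.setRequestMethod("GET");
            try {
                return readAll(conn.getInputStream());
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling fetch with the value of the "WebsiteURL" context variable would give you a String comparable to the one the component passes on.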
3) "Store html" (tSetGlobalVar)
This component is used to store the html string that is passed from the previous component. It stores the html in the Talend "globalMap" hashmap with a key of "html". The value is taken from row1.ResponseContent.
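The globalMap is simply a Java Map of String keys to Object values, so storing and retrieving the html comes down to a put and a cast on get. The sketch below imitates that with an ordinary HashMap; in a real Job the "globalMap" variable is provided for you, and the class and method names here are only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in: in a Job, Talend supplies globalMap for you.
final class GlobalMapSketch {

    static final Map<String, Object> globalMap = new HashMap<>();

    // What the tSetGlobalVar step does: key "html", value from the previous row
    static void storeHtml(String responseContent) {
        globalMap.put("html", responseContent);
    }

    // How a later component reads it back; the cast is needed
    // because the map holds plain Objects
    static String loadHtml() {
        return (String) globalMap.get("html");
    }
}
```

Inside a Talend component the read side is usually written inline as ((String)globalMap.get("html")).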
4) "Call Parse F1 Site tables" (tJavaFlex)
This is the most important component in this Job. The name doesn't make much sense (an error I just spotted), but it is supposed to indicate that the "parseF1Table" method from the routine is being called here. The tJavaFlex component allows you to create loop functionality with your code. The "setup code" is placed in the "Start code" section, the "looping code" is placed in the "Main code" section and the "End code" section is self explanatory. In this example, we call the "parseF1Table" method of the "WebScraperUtilities" routine in the "Start code" section and then loop through the ArrayList rows in the "Main code" section. The code will be shown below.
But before we look at the code, we need to think about creating a schema for the data that is returned. In this example we are dealing entirely with string data. There are 22 columns of data in the table, so we need 22 string columns created. To do this you need to click on the "Edit schema" button which is circled in red. The columns created can be seen below....
You can create columns by clicking on the green plus sign circled in red.
The code for each of the tJavaFlex sections can be seen below.....
The start code is shown below....
// start part of your Java code
This code creates an ArrayList called "list" and assigns it the result of processing the html string using the "parseF1Table" method from the "WebScraperUtilities" routine. Once the ArrayList has been set, an Iterator is created to enable iteration through the ArrayList. The iterator is used in the "while" loop.
The main code is shown below....
// here is the main part of the component,
java.util.ArrayList<String> columns = it.next();
This section is repeated for every iteration of the iterator created in the start code section. For every row in the "list" ArrayList, a new ArrayList "columns" is assigned a value. This ArrayList holds the column data. That data is passed to the columns of the schema created above. The "try" and "catch" sections are there to catch any IndexOutOfBoundsException thrown when a row has fewer columns than the schema.
The end code is shown below....
// end of the component, outside/closing the loop
This section closes the "while" loop created in the "start code" section
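It can help to see the three sections stitched together as one piece of ordinary Java, since that is effectively what Talend does with them. The sketch below is my own approximation rather than the code from the Job: it uses two columns instead of 22, and it collects the values into a list of Strings where the real component would assign them to the output row (the names "col0" and "col1" are made up).

```java
import java.util.ArrayList;
import java.util.Iterator;

// Hypothetical illustration of how the Start, Main and End code fit together.
final class FlexLoopSketch {

    static ArrayList<String> run(ArrayList<ArrayList<String>> list) {
        ArrayList<String> output = new ArrayList<>();

        // --- "Start code": create the iterator and open the loop ---
        Iterator<ArrayList<String>> it = list.iterator();
        while (it.hasNext()) {

            // --- "Main code": runs once per row ---
            ArrayList<String> columns = it.next();
            String col0 = null;
            String col1 = null;
            try {
                col0 = columns.get(0);
                col1 = columns.get(1);
            } catch (IndexOutOfBoundsException e) {
                // Short row: the columns not reached stay null
            }
            output.add(col0 + "|" + col1);

        // --- "End code": close the loop opened in the Start code ---
        }
        return output;
    }
}
```

Notice that the opening brace of the "while" lives in one section and its closing brace in another, which is why the End code cannot be left empty.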
5) "Write table to file" (tFileOutputDelimited)
This component is used to write the data scraped from the website to a file. It is pretty much left as its default state, apart from the "File Name" which is set to the context variable "OutputFile".
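In plain Java terms, the component joins each row's columns with a field separator and ends each row with a row separator. The sketch below shows that idea only; it is not the component's generated code, the names are hypothetical, and the ";" used in the usage note is an assumption about the separator configured in the Job.

```java
import java.util.ArrayList;

// Hypothetical illustration of delimited output; builds the text in memory
// rather than writing to the OutputFile path.
final class DelimitedSketch {

    static String toDelimited(ArrayList<ArrayList<String>> table, String fieldSeparator) {
        StringBuilder sb = new StringBuilder();
        for (ArrayList<String> row : table) {
            // One line per row, columns joined by the separator
            sb.append(String.join(fieldSeparator, row)).append("\n");
        }
        return sb.toString();
    }
}
```

Calling toDelimited(table, ";") on the scraped table would give you text along the lines of what ends up in the output file.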
Running the Job
To run this Job simply click on the "Run" button on the "Run" tab. As there is a tLogRow component in the Job you will see the data appearing in the Run Window as well as in the file that is produced at the end. An example of what should be seen in the Run Window can be seen below....
[statistics] connecting to socket on port 4070
A copy of the completed tutorial can be found here. Remember that you will need to download the HTMLParser Java library and link it to the routine (described above). It was built using Talend 5.5.1 but can be imported into subsequent versions. It cannot be imported into earlier versions, so you will either need to upgrade or recreate it following the tutorial.