Using a third party Java library to scrape the content of a table on a web page

Recently I was contacted by a visitor to this site who asked me to put together a tutorial on using Talend for web data crawling purposes. This interested me as I have myself come across situations where I have used other software to scrape websites for data (links, pictures, email addresses). While it is not difficult to find software to do this, it usually comes with a cost or is very limited in what you can do. After a few minutes of Googling, I came across several Java libraries which offered this functionality. It was then that it dawned on me that this tutorial could "kill two birds with one stone": I can talk about using third party libraries AND web scraping in one tutorial. If this is not entirely what you were asking for, Gabriele, I apologise, but I believe from looking at your website that you are more than capable of extrapolating from this. For people interested in Big Data, you may wish to visit Gabriele's website, which is here.

One of the things that makes Talend so powerful is the ability to use third party Java libraries to enhance the existing functionality. No matter how many pieces of functionality are supplied with a tool like this, there will always be something that can't be done out of the box. However, with a bit of Java knowledge, there is almost nothing (that I can think of) that cannot be achieved in the data integration world. This tutorial will show you how to use an existing Java API to scrape data from a webpage and use that data in Talend. The website I have chosen is a Formula 1 statistics site (http://www.statsf1.com/en/2014.aspx). I am a fan of F1 and love playing around with the stats ......to find that they give absolutely no insight into making race predictions whatsoever :-). But it is useful to be able to consume the data straight from the webpage without having to do anything by hand. The Java API I have chosen to use is HTMLParser, an API that enables you to parse HTML in order to extract data. In this tutorial we are simply getting data from <table> tags; however, you can use this API to get hold of practically anything you may require from a webpage. You will need a bit of Java and HTML knowledge to make effective use of it, though. For this tutorial, you will need to download the API from the HTMLParser site.

Before we look at the Job (which is relatively simple for this example), I will take you through the Java that I have used to extract the <table> data. This Java lives in a code routine called "WebScraperUtilities", which is included in the Job export linked to at the bottom of the page.

 

WebScraperUtilities

This routine makes use of the HTMLParser API to extract data from the <table> tags in the web page we will be scraping. The JavaDocs for this API are useful for understanding the code below and extrapolating from it. 

The code below provides a static method called "parseF1Table". This method takes a String parameter which holds the html to be processed. It returns an ArrayList of ArrayLists containing Strings: the outer ArrayList represents the table, each inner ArrayList represents a row, and the Strings within each row hold the column values. In a more complicated example you could keep the data in specific data types (Integer, String, Double, etc), but this is just a simple example to plant a seed.... which you can then build upon.
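As a quick illustration of that structure (a hypothetical fragment of my own, not part of the routine), this is how an individual cell would be read from the returned value....

java.util.ArrayList<java.util.ArrayList<String>> table = routines.WebScraperUtilities.parseF1Table(html);

//The second column of the first row of the table
String cell = table.get(0).get(1);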

In order to gather the data, this method instantiates an instance of the NodeVisitor class and overrides its "visitTag" method. This method is called for every html tag in the provided html String. We have overridden it to look for TableTag objects and, once one is found, to collect the row and column data in the nested ArrayList structure described above.

Essentially this method will collect all of the table data on the F1 page that we are using. There are some specifics in this method that I should point out. Collecting header information can be tricky if the table header tags are not used, and they are not used here. So I have had to use some logic derived from looking at the html. There are 22 columns of data in the table, but only 21 columns in the header rows, and this is how I identify the header rows. For column header rows I add an empty string as the first column, which ensures that when the data is used, the column headers are correctly aligned with the data. This is not ideal, but it is an example of how a Talend Job that scrapes webpage data will likely need to be tailored to the individual page. A one-size-fits-all solution is not going to be easy and would likely lead to massive inefficiencies.

The code carrying out this logic can be seen below.....

package routines;

import java.util.ArrayList;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.tags.TableRow;
import org.htmlparser.tags.TableColumn;
import org.htmlparser.visitors.NodeVisitor;
import org.htmlparser.Tag;
import org.htmlparser.tags.TableTag;

/*
 * user specification: the function's comment should contain keys as follows: 1. write about the function's comment.but
 * it must be before the "{talendTypes}" key.
 * 
 * 2. {talendTypes} 's value must be talend Type, it is required . its value should be one of: String, char | Character,
 * long | Long, int | Integer, boolean | Boolean, byte | Byte, Date, double | Double, float | Float, Object, short |
 * Short
 * 
 * 3. {Category} define a category for the Function. it is required. its value is user-defined .
 * 
 * 4. {param} 's format is: {param} <type>[(<default value or closed list values>)] <name>[ : <comment>]
 * 
 * <type> 's value should be one of: string, int, list, double, object, boolean, long, char, date. <name>'s value is the
 * Function's parameter name. the {param} is optional. so if you the Function without the parameters. the {param} don't
 * added. you can have many parameters for the Function.
 * 
 * 5. {example} gives a example for the Function. it is optional.
 */
public class WebScraperUtilities {

    /**
     * parseF1Table: a method to process <table> data on the http://www.statsf1.com/en/2014.aspx website.
     * This method can easily be extended to be used for other sites.
     * 
     * 
     * {talendTypes} ArrayList
     * 
     * {Category} User Defined
     * 
     * {param} String("html") input: A string representation of the website
     * 
     */
    public static ArrayList<ArrayList<String>> parseF1Table(String html){

        //Define an ArrayList to hold the data found by the NodeVisitor class
        final ArrayList<ArrayList<String>> rows = new ArrayList<ArrayList<String>>();
       
        //if the html string is not null
        if(html!=null){
        
            //Instantiate an instance of the NodeVisitor class and override its visitTag method
            final NodeVisitor linkVisitor = new NodeVisitor() {
                    
                @Override
                public void visitTag(Tag tag) {
                    
                    //If the tag is an instance of the TableTag, process the data
                    if(tag instanceof TableTag ){
                        TableTag tt = (TableTag)tag;
                        TableRow[] trows = tt.getRows();
                
                        //For each row, get the column data
                        for(int i = 0; i<trows.length; i++){
                            TableRow tr = trows[i];
                            TableColumn[] tcols = tr.getColumns();
                            ArrayList<String> tmpRow = new ArrayList<String>();
               
                            //If the row has only 21 columns, we know it is a header row, so pad it with an empty first column
                            if(tcols.length==21){
                                tmpRow.add("");
                            }
                            
                            //if this row has more than 1 column
                            if(tcols.length>1){
                                //Get each column value
                                for(int x = 0; x<tcols.length; x++){
                                    TableColumn tc = tcols[x];
                                    String columnVal = tc.toPlainTextString().trim();
                                    
                                    //Remove "&nbsp;" strings from the column values
                                    columnVal = columnVal.replaceAll("&nbsp;", "");
                                    tmpRow.add(columnVal);
                                }
                                rows.add(tmpRow);
                            }
                        }
                    }
                }
            };

            //Instantiate an instance of the Parser class
            Parser parser = Parser.createParser(html, null);
            //Instantiate an empty instance of the NodeList class
            NodeList list;
            
            try {
                //Parse the html into a NodeList
                list = parser.parse(null);
                //Visit every node in the list with the linkVisitor defined above
                list.visitAllNodesWith(linkVisitor);
                return rows;
            } catch (ParserException e) {
                // Could not parse the html, so return null
                return null;
            }
        }else{
            return rows;
        }
        
    }

}
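If you want to give the routine a quick test outside of Talend, a minimal sketch like the one below (my own test harness, not part of the tutorial Job) will fetch the page and print each row. It assumes the WebScraperUtilities routine and the HTMLParser.Jar are on the classpath....

package routines;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;

public class WebScraperTest {

    public static void main(String[] args) throws Exception {

        //Read the raw html from the F1 statistics page
        URL url = new URL("http://www.statsf1.com/en/2014.aspx");
        BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
        StringBuilder html = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            html.append(line).append("\n");
        }
        reader.close();

        //Parse the tables and print each row of column values
        for (ArrayList<String> row : WebScraperUtilities.parseF1Table(html.toString())) {
            System.out.println(row);
        }
    }
}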

 

Adding the HTMLParser Java library

Before the code above will compile, we need to link it to the HTMLParser.Jar. In the real world this would normally be done before any code is written. The following section shows how to link Jars to a Java routine. These steps will need to be carried out after downloading and unpacking the HTMLParser.Jar file in order to use the tutorial code linked to at the bottom of the page.

1) Right click on the code routine

....and select the "Edit Routine Libraries" option.

The "Import External Library" window will appear.

 

2) Click on the "New" button

.....to bring up the "New Module" window.

3) Click on the "Browse a library file" radio button

......and then click on the "Browse" button circled in blue below.

4) Find where you downloaded the HTMLParser.Jar

......and select it. Click on the "Open" button.

5) On the "New Module" window, click "OK"

6) On the "Import External Library" window, click on "Finish"

The Java library is now linked to the routine.

 

Now that we have the routine code explained and we know how to link an external Jar to a routine, we can look at the Job that makes use of the library's functionality.

 

The WebScraper Job

This Job is a very simple Job that has been put together to demonstrate using the code routine with a third party library. As such, all it does is get an html string from a URL, process it using the routine above and output the table data to a file. In the real world I imagine you will want to do something a bit more useful than that, but this Job gives you a starting point to extrapolate from. The Job can be seen below....

1) Create the Context variables

This Job needs just two context variables: OutputFile and WebsiteURL. These can be seen below.

The values that are used for these variables simply identify where the output file will be written and which website to process. As this example was tailored to the website shown below, it would be a good idea to test it with that site first. Once you are familiar with what is going on, you may wish to try some modified code on other sites.

The website URL is: http://www.statsf1.com/en/2014.aspx
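For reference, any Java code in the Job can read these variables through the generated "context" object. A quick sketch of how the two variables above would be accessed....

//Reading the context variables from Java code within the Job
String websiteURL = context.WebsiteURL;
String outputFile = context.OutputFile;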

2) "Load website" (tHttpRequest)

This component is used to load the website and retrieve the html string. It doesn't require much configuration. All that needs to be done is for the context variable "WebsiteURL" to be used in the "URI" field. This is seen below, circled in red.

3) "Store html" (tSetGlobalVar)

This component is used to store the html string that is passed from the previous component. It stores the html in the Talend "globalMap" hashmap with a key of "html". The value is taken from row1.ResponseContent.
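If it helps to see this in code, what the component does is conceptually the same as the snippet below (a sketch of what you could write in a tJava component; tSetGlobalVar does the "put" for you)....

//Store the html string returned by the "Load website" component...
globalMap.put("html", row1.ResponseContent);

//...and retrieve it later, as the tJavaFlex start code will do
String html = (String)globalMap.get("html");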

 

4) "Call Parse F1 Site tables" (tJavaFlex)

This component is the most important component in this Job. The name doesn't make much sense (an error I just spotted), but it is supposed to indicate that the "parseF1Table" method from the code in the routine is being called here. The tJavaFlex component allows you to create loop functionality with your code. The "setup code" is placed in the "Start code" section, the "looping code" is placed in the "Main code" section and the "End code" section is self explanatory. In this example, we are calling the "parseF1Table" method of the "WebScraperUtilities" routine in the "Start code" section and then looping through the ArrayList rows in the "Main code" section. The code is shown below.

But before we look at the code, we need to think about creating a schema for the data that is returned. In this example we are dealing entirely with String data. There are 22 columns of data in the table, so we need to create 22 String columns. To do this, you need to click on the "Edit schema" button which is circled in red. The columns created can be seen below....

You can create columns by clicking on the green plus sign circled in red.

The code for each of the tJavaFlex sections can be seen below.....

Start code

The start code is shown below....

// start part of your Java code
java.util.ArrayList<java.util.ArrayList<String>> list = routines.WebScraperUtilities.parseF1Table((String)globalMap.get("html"));
java.util.Iterator<java.util.ArrayList<String>> it = list.iterator();


while(it.hasNext()){

This code creates an ArrayList called "list" and assigns it the result of processing the html string using the "parseF1Table" method from the "WebScraperUtilities" routine. Once the ArrayList has been set, an Iterator is created to enable iteration through the ArrayList. The Iterator drives the "while" loop, which is deliberately left open here; it is closed in the "End code" section.

Main code

The main code is shown below....

// here is the main part of the component,
// a piece of code executed in the row
// loop

java.util.ArrayList<String> columns = it.next();

try{
        row2.col1 = columns.get(0);
        row2.col2 = columns.get(1);
        row2.col3 = columns.get(2);
        row2.col4 = columns.get(3);
        row2.col5 = columns.get(4);
        row2.col6 = columns.get(5);
        row2.col7 = columns.get(6);
        row2.col8 = columns.get(7);
        row2.col9 = columns.get(8);
        row2.col10 = columns.get(9);
        row2.col11 = columns.get(10);
        row2.col12 = columns.get(11);
        row2.col13 = columns.get(12);
        row2.col14 = columns.get(13);
        row2.col15 = columns.get(14);
        row2.col16 = columns.get(15);
        row2.col17 = columns.get(16);
        row2.col18 = columns.get(17);
        row2.col19 = columns.get(18);
        row2.col20 = columns.get(19);
        row2.col21 = columns.get(20);
        row2.col22 = columns.get(21);
    }catch(java.lang.IndexOutOfBoundsException iobe){
        //Ignore rows with fewer than 22 columns; any remaining schema columns are left unset
    }

This section is repeated for every iteration of the Iterator created in the start code section. For every row in the "list" ArrayList, a new ArrayList "columns" is assigned a value. This ArrayList holds the column data, which is passed to the columns of the schema created above. The "try" and "catch" sections are there to catch the "IndexOutOfBoundsException" that is thrown when a row has fewer than 22 columns.
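If you would rather not rely on the catch block, an alternative (just a sketch) is to pad short rows out to the full 22 columns before the assignments....

//Pad the row with empty strings so that columns.get(0) to columns.get(21) always exist
while(columns.size() < 22){
    columns.add("");
}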

End code

The end code is shown below....

// end of the component, outside/closing the loop
}

This section closes the "while" loop that was opened in the "Start code" section.
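To make it clear how the three sections fit together, Talend effectively concatenates them in its generated code. Conceptually, the complete loop looks like this (a simplified sketch of my own, with the 22 column assignments abbreviated)....

//Start code: parse the html and set up the Iterator
java.util.ArrayList<java.util.ArrayList<String>> list = routines.WebScraperUtilities.parseF1Table((String)globalMap.get("html"));
java.util.Iterator<java.util.ArrayList<String>> it = list.iterator();

while(it.hasNext()){

    //Main code: executed once per row, populating the schema columns
    java.util.ArrayList<String> columns = it.next();
    try{
        row2.col1 = columns.get(0);
        //... col2 through col21 ...
        row2.col22 = columns.get(21);
    }catch(java.lang.IndexOutOfBoundsException iobe){
        //Rows with fewer than 22 columns are passed on partially populated
    }

//End code: close the loop
}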

 

5) "Write table to file" (tFileOutputDelimited)

This component is used to write the data scraped from the website to a file. It is pretty much left in its default state, apart from the "File Name" field, which is set to the context variable "OutputFile".

 

Running the Job

To run this Job, simply click on the "Run" button on the "Run" tab. As there is a tLogRow component in the Job, you will see the data appearing in the Run Window as well as in the file that is produced at the end. An example of what should be seen in the Run Window can be seen below....

[statistics] connecting to socket on port 4070
[statistics] connected
|Drivers|1AUS|2MYS|3BHR|4CHN|5ESP|6MCO|7CAN|8AUT|9GBR|10DEU|11HUN|12BEL|13ITA|14SGP|15JPN|16RUS|17USA|18BRA|19ABD|Pts
1.|L. HAMILTON|-|25|25|25|25|18|-|18|25|15|15|-|25|25|25|25|25|18|50|384.00
2.|N. ROSBERG|25|18|18|18|18|25|18|25|-|25|12|18|18|-|18|18|18|25|-|317.00
3.|D. RICCIARDO|-|-|12|12|15|15|25|4|15|8|25|25|10|15|12|6|15|-|24|238.00
4.|V. BOTTAS|10|4|4|6|10|-|6|15|18|18|4|15|12|-|8|15|10|1|30|186.00
5.|S. VETTEL|-|15|8|10|12|-|15|-|10|12|6|10|8|18|15|4|6|10|8|167.00
6.|F. ALONSO|12|12|2|15|8|12|8|10|8|10|18|6|-|12|-|8|8|8|4|161.00
7.|F. MASSA|-|6|6|-|-|6|-|12|-|-|10|-|15|10|6|-|12|15|36|134.00
8.|J. BUTTON|15|8|-|-|-|8|12|-|12|4|1|8|4|-|10|12|-|12|20|126.00
9.|N. HULKENBERG|8|10|10|8|1|10|10|2|4|6|-|1|-|2|4|-|-|4|16|96.00
10.|S. PEREZ|1|-|15|2|2|-|-|8|-|1|-|4|6|6|1|1|-|-|12|59.00
11.|K. MAGNUSSEN|18|2|-|-|-|1|2|6|6|2|-|-|1|1|-|10|4|2|-|55.00
12.|K. RAIKKONEN|6|-|1|4|6|-|1|1|-|-|8|12|2|4|-|2|-|6|2|55.00
13.|J. VERGNE|4|-|-|-|-|-|4|-|1|-|2|-|-|8|2|-|1|-|-|22.00
14.|R. GROSJEAN|-|-|-|-|4|4|-|-|-|-|-|-|-|-|-|-|-|-|-|8.00
15.|D. KVYAT|2|1|-|1|-|-|-|-|2|-|-|2|-|-|-|-|-|-|-|8.00
16.|P. MALDONADO|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|2|-|-|2.00
17.|J. BIANCHI|-|-|-|-|-|2|-|-|-|-|-|-|-|-|-|||||2.00
18.|A. SUTIL|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|0.00
19.|M. ERICSSON|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-||||0.00
20.|E. GUTIERREZ|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|0.00
21.|M. CHILTON|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-||||0.00
22.|K. KOBAYASHI|-|-|-|-|-|-|-|-|-|-|-||-|-|-|-|||-|0.00
23.|W. STEVENS|||||||||||||||||||-|0.00
|Constructors|1AUS|2MYS|3BHR|4CHN|5ESP|6MCO|7CAN|8AUT|9GBR|10DEU|11HUN|12BEL|13ITA|14SGP|15JPN|16RUS|17USA|18BRA|19ABD|Pts
1.|Mercedes|25|43|43|43|43|43|18|43|25|40|27|18|43|25|43|43|43|43|50|701.00
2.|Red Bull Renault|-|15|20|22|27|15|40|4|25|20|31|35|18|33|27|10|21|10|32|405.00
3.|Williams Mercedes|10|10|10|6|10|6|6|27|18|18|14|15|27|10|14|15|22|16|66|320.00
4.|Ferrari|18|12|3|19|14|12|9|11|8|10|26|18|2|16|-|10|8|14|6|216.00
5.|McLaren Mercedes|33|10|-|-|-|9|14|6|18|6|1|8|5|1|10|22|4|14|20|181.00
6.|Force India Mercedes|9|10|25|10|3|10|10|10|4|7|-|5|6|8|5|1|-|4|28|155.00
7.|Toro Rosso Renault|6|1|-|1|-|-|4|-|3|-|2|2|-|8|2|-|1|-|-|30.00
8.|Lotus Renault|-|-|-|-|4|4|-|-|-|-|-|-|-|-|-|-|2|-|-|10.00
9.|Marussia Ferrari|-|-|-|-|-|2|-|-|-|-|-|-|-|-|-|-||||2.00
10.|Sauber Ferrari|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|0.00
11.|Caterham Renault|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|||-|0.00
[statistics] disconnected
Job WebScraper ended at 00:47 27/03/2015. [exit code=0]

 

 

A copy of the completed tutorial can be found here. Remember that you will need to download the HTMLParser Java library and link it to the routine (as described above). The Job was built using Talend 5.5.1, but it can be imported into subsequent versions. It cannot be imported into earlier versions, so you will either need to upgrade or recreate it by following the tutorial.
