Load XML files in batches of records

There are many things that Talend is great at doing with XML, and also many things that it is not so great at. One example of the latter is loading multiple XML files with batches of data. Say you have 100 records and you want to write several records to each XML file, with a limit on the number of records per file. How would you do that with the tXMLMap component? It will let you produce multiple XML files with a single record each, or a single XML file with all of the records in it, but you can't specify a limit by number or by some sort of grouping mechanism. It is one or the other.

To get the functionality described above, you have to find another way. Thankfully, Talend allows us to do a lot more than just join components and configure them. It allows us to add our own Java code and make use of some of the Java code that is generated by its components. You just need to know how. This tutorial shows how we can easily make use of some of the auto-generated Java code to enable the scenario above.

The XMLBatches Job

This job demonstrates how we can batch up data into sets and create an XML file per set.

I will describe each of the numbered components so that this can be created in any Talend version.

 

Context Variables

For this job, I use 2 context variables: "batchSize" to set the size of the batches, and "fileoutputpath" to set the path for the XML files that will be generated.

 

 

1) "Generate data" (tRowGenerator)

This component is simply used to generate data for this tutorial. I have set it to generate 100 rows using the settings shown in this screenshot....

2) "Load data to batches" (tJavaFlex)

Since the tJavaFlex component is just a component with 3 sections (Start Code, Main Code and End Code), I will show the code that I have used in each of these sections rather than show a screenshot.

Start Code

Below is the code for the Start Code.

//Create variables that will be used
int recordCount = 0;
int batchSize = context.batchSize;

//An ArrayList to store the batches
java.util.ArrayList<java.util.ArrayList<row1Struct>> batches = new java.util.ArrayList<java.util.ArrayList<row1Struct>>();

//An ArrayList to store the data rows
java.util.ArrayList<row1Struct> batch = new java.util.ArrayList<row1Struct>();

Here we define the variables that will be used during the lifetime of this component. The "row1Struct" class is actually created by Talend when the schema for row1 is defined. You can find this class for your job by searching the Talend Code tab for the code that corresponds to your row name. The row we are dealing with here is "row1", so I was looking for a public static class starting with "row1" and ending in "Struct" in the code. Below is the code that Talend has automatically generated for this class.....

public static class row1Struct implements
            routines.system.IPersistableRow<row1Struct> {
        final static byte[] commonByteArrayLock_LOCAL_PROJECT_XMLBatches = new byte[0];
        static byte[] commonByteArray_LOCAL_PROJECT_XMLBatches = new byte[0];

        public Integer recordId;

        public Integer getRecordId() {
            return this.recordId;
        }

        public String firstname;

        public String getFirstname() {
            return this.firstname;
        }

        public String surname;

        public String getSurname() {
            return this.surname;
        }

        public Integer house_number;

        public Integer getHouse_number() {
            return this.house_number;
        }

        public String street;

        public String getStreet() {
            return this.street;
        }

        public String city;

        public String getCity() {
            return this.city;
        }

        public String state;

        public String getState() {
            return this.state;
        }

        private Integer readInteger(ObjectInputStream dis) throws IOException {
            Integer intReturn;
            int length = 0;
            length = dis.readByte();
            if (length == -1) {
                intReturn = null;
            } else {
                intReturn = dis.readInt();
            }
            return intReturn;
        }

        private void writeInteger(Integer intNum, ObjectOutputStream dos)
                throws IOException {
            if (intNum == null) {
                dos.writeByte(-1);
            } else {
                dos.writeByte(0);
                dos.writeInt(intNum);
            }
        }

        private String readString(ObjectInputStream dis) throws IOException {
            String strReturn = null;
            int length = 0;
            length = dis.readInt();
            if (length == -1) {
                strReturn = null;
            } else {
                if (length > commonByteArray_LOCAL_PROJECT_XMLBatches.length) {
                    if (length < 1024
                            && commonByteArray_LOCAL_PROJECT_XMLBatches.length == 0) {
                        commonByteArray_LOCAL_PROJECT_XMLBatches = new byte[1024];
                    } else {
                        commonByteArray_LOCAL_PROJECT_XMLBatches = new byte[2 * length];
                    }
                }
                dis.readFully(commonByteArray_LOCAL_PROJECT_XMLBatches, 0,
                        length);
                strReturn = new String(
                        commonByteArray_LOCAL_PROJECT_XMLBatches, 0, length,
                        utf8Charset);
            }
            return strReturn;
        }

        private void writeString(String str, ObjectOutputStream dos)
                throws IOException {
            if (str == null) {
                dos.writeInt(-1);
            } else {
                byte[] byteArray = str.getBytes(utf8Charset);
                dos.writeInt(byteArray.length);
                dos.write(byteArray);
            }
        }

        public void readData(ObjectInputStream dis) {

            synchronized (commonByteArrayLock_LOCAL_PROJECT_XMLBatches) {

                try {

                    int length = 0;

                    this.recordId = readInteger(dis);

                    this.firstname = readString(dis);

                    this.surname = readString(dis);

                    this.house_number = readInteger(dis);

                    this.street = readString(dis);

                    this.city = readString(dis);

                    this.state = readString(dis);

                } catch (IOException e) {
                    throw new RuntimeException(e);

                }

            }

        }

        public void writeData(ObjectOutputStream dos) {
            try {

                // Integer

                writeInteger(this.recordId, dos);

                // String

                writeString(this.firstname, dos);

                // String

                writeString(this.surname, dos);

                // Integer

                writeInteger(this.house_number, dos);

                // String

                writeString(this.street, dos);

                // String

                writeString(this.city, dos);

                // String

                writeString(this.state, dos);

            } catch (IOException e) {
                throw new RuntimeException(e);
            }

        }

        public String toString() {

            StringBuilder sb = new StringBuilder();
            sb.append(super.toString());
            sb.append("[");
            sb.append("recordId=" + String.valueOf(recordId));
            sb.append(",firstname=" + firstname);
            sb.append(",surname=" + surname);
            sb.append(",house_number=" + String.valueOf(house_number));
            sb.append(",street=" + street);
            sb.append(",city=" + city);
            sb.append(",state=" + state);
            sb.append("]");

            return sb.toString();
        }

        /**
         * Compare keys
         */
        public int compareTo(row1Struct other) {

            int returnValue = -1;

            return returnValue;
        }

        private int checkNullsAndCompare(Object object1, Object object2) {
            int returnValue = 0;
            if (object1 instanceof Comparable && object2 instanceof Comparable) {
                returnValue = ((Comparable) object1).compareTo(object2);
            } else if (object1 != null && object2 != null) {
                returnValue = compareStrings(object1.toString(),
                        object2.toString());
            } else if (object1 == null && object2 != null) {
                returnValue = 1;
            } else if (object1 != null && object2 == null) {
                returnValue = -1;
            } else {
                returnValue = 0;
            }

            return returnValue;
        }

        private int compareStrings(String string1, String string2) {
            return string1.compareTo(string2);
        }

    }

Main Code

Below is the code for the Main Code.

//Check to see if the recordCount is an exact multiple of the batchSize
if(recordCount%batchSize==0){
    //Check to see if this is after the first batch
    if(recordCount>0){
        //Add the current batch to the batches ArrayList
        batches.add(batch);
        //Instantiate a new batch
        batch = new java.util.ArrayList<row1Struct>();
    }
    
}

//Create a temporary "row1Struct" object to copy row details to
row1Struct tmpRow = new row1Struct();

tmpRow.recordId = row1.recordId;
tmpRow.firstname = row1.firstname;
tmpRow.surname = row1.surname;
tmpRow.house_number = row1.house_number;
tmpRow.street = row1.street;
tmpRow.city = row1.city;
tmpRow.state = row1.state;

//Add tmpRow to the batch ArrayList
batch.add(tmpRow);

//Increment the recordCount by 1
recordCount++;

The logic at the top of this section saves the current "batch" into the "batches" ArrayList once it reaches the size set in the "batchSize" context variable. The rest of the code creates a copy of the incoming row and adds it to the current "batch". At the end, "recordCount" is incremented by 1.

End Code

Below is the code for the End Code.

//Add the last batch to the batches ArrayList
batches.add(batch);

//Add the batches ArrayList to the globalMap
globalMap.put("batches", batches);

This section adds the last "batch" that was processed to the "batches" ArrayList. The "batches" ArrayList is then added to the globalMap to make it available later in the job. Note that if no rows were received, a single empty batch is still stored.
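To see the batching logic in isolation, here is a minimal standalone sketch of the same Start/Main/End pattern outside of Talend. The class and method names are hypothetical, and Integer stands in for the generated row1Struct class.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of the Start/Main/End logic; Integer stands in for row1Struct.
class BatchSplitSketch {

    static List<List<Integer>> split(List<Integer> rows, int batchSize) {
        // "Start Code": counter and containers
        int recordCount = 0;
        List<List<Integer>> batches = new ArrayList<>();
        List<Integer> batch = new ArrayList<>();

        for (Integer row : rows) {
            // "Main Code": store the current batch once a full multiple of batchSize has been seen
            if (recordCount % batchSize == 0 && recordCount > 0) {
                batches.add(batch);
                batch = new ArrayList<>();
            }
            batch.add(row);
            recordCount++;
        }

        // "End Code": store the final (possibly partial) batch
        batches.add(batch);
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 1; i <= 25; i++) {
            rows.add(i);
        }
        List<List<Integer>> batches = split(rows, 10);
        // prints "3 batches, last has 5 rows"
        System.out.println(batches.size() + " batches, last has "
                + batches.get(batches.size() - 1).size() + " rows");
    }
}
```

With 25 rows and a batch size of 10, the first two batches are full and the third holds the remaining 5 rows.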

3) "Loop through batches" (tLoop)

This component is used to drive the iteration through the batches in the "batches" ArrayList. The configuration of this component can be seen below....

This is set with a "While" loop. The component is left pretty much as standard, with the addition of the following "Condition"....

i< ((java.util.ArrayList<java.util.ArrayList<row1Struct>>)globalMap.get("batches")).size()

This basically says that "i" must be less than the size of the "batches" ArrayList.

4) "Retrieve batch" (tJavaFlex)

As with the "Load data to batches" component, I will show the code used in each of the three tJavaFlex sections rather than a screenshot. This component retrieves a single batch of records at a time and iterates through the batch until all of its records have been released.

The output schema of this component is a direct copy of the input schema of the "Load data to batches" component.

Start Code

Below is the code for the Start Code.

//Create variables to be used by this component
//The batch number
int batchNum = ((Integer)globalMap.get("tLoop_1_CURRENT_ITERATION")).intValue()-1;

//The batch
java.util.ArrayList<row1Struct> batch = ((java.util.ArrayList<java.util.ArrayList<row1Struct>>)globalMap.get("batches")).get(batchNum);

//An Iterator for the batch
java.util.Iterator<row1Struct> it = batch.iterator();

//The start of a while loop used to iterate through the batch
while(it.hasNext()){

This section creates the variables for this component. You can see that the "batches" ArrayList is retrieved from the globalMap, and the "batch" within that ArrayList is retrieved using the "batchNum" variable, which is linked to the current iteration of the tLoop component.

Main Code

Below is the code for the Main Code.

//Retrieve the row from the batch
row1Struct myRow = it.next();

//Set the output row fields from the stored row
row3.recordId = myRow.recordId;
row3.firstname = myRow.firstname;
row3.surname = myRow.surname;
row3.house_number = myRow.house_number;
row3.street = myRow.street;
row3.city = myRow.city;
row3.state = myRow.state;

In this section we simply retrieve the row1Struct object and set the output row values to the values held by the row1Struct object.

End Code

Below is the code for the End Code.

//End of While loop
}

In this section we simply close the "While" loop.
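It may help to see how the three sections fit together once they are placed one after another. The sketch below is only an approximation of that arrangement, not the code Talend actually generates. The class and method names are hypothetical, String stands in for row1Struct, and a plain HashMap stands in for Talend's globalMap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Sketch only: String stands in for row1Struct, and a plain HashMap for Talend's globalMap.
class RetrieveBatchSketch {

    @SuppressWarnings("unchecked")
    static List<String> emitBatch(Map<String, Object> globalMap) {
        List<String> emitted = new ArrayList<>();

        // --- Start Code: runs once per tLoop iteration ---
        // The original subtracts 1, so iteration numbers are assumed to start at 1
        int batchNum = ((Integer) globalMap.get("tLoop_1_CURRENT_ITERATION")).intValue() - 1;
        List<String> batch = ((List<List<String>>) globalMap.get("batches")).get(batchNum);
        Iterator<String> it = batch.iterator();
        while (it.hasNext()) {

            // --- Main Code: runs once per pass of the while loop ---
            String myRow = it.next();
            emitted.add(myRow); // stands in for populating row3 and passing it downstream

        // --- End Code: closes the loop opened in Start Code ---
        }
        return emitted;
    }

    public static void main(String[] args) {
        Map<String, Object> globalMap = new HashMap<>();
        List<List<String>> batches = new ArrayList<>();
        batches.add(Arrays.asList("a", "b"));
        batches.add(Arrays.asList("c"));
        globalMap.put("batches", batches);
        globalMap.put("tLoop_1_CURRENT_ITERATION", 2);
        System.out.println(emitBatch(globalMap)); // prints [c]
    }
}
```

The point of the sketch is that the "while" opened in the Start Code and the brace in the End Code bracket the Main Code, so each stored row is emitted once per pass.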

5) "Create XML" (tXMLMap)

This component is used to create the output XML. This is pretty simple, but has a couple of details that need to be noted. A screenshot of this component's configuration can be seen below. The red and green ovals highlight settings that will be discussed below.....

The red oval highlights the "All in one" setting. This needs to be set to true and allows multiple records to be added to the same XML Document.
The green oval shows that the "record" entity is the looping entity. This means that this "complex type" will be looped within the XML Document. The "loop" setting must be correctly placed for the XML to be generated properly. This can be set by right-clicking on the entity that you want the looping set on, and selecting "As loop element".

6) "Output to file" (tFileOutputXML)

This component is used to write the XML document generated by the tXMLMap component to a file. It has been configured for simplicity and can be seen below....

The file path is set using the "fileoutputpath" context variable and the file name is simply set to the number of the current iteration (from the tLoop component), with ".xml" added. This is very simple and you may want something a little bit more descriptive for your files.
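Since the component's settings are not reproduced here, the following is only a guess at what such a file name expression could look like. It assumes that "fileoutputpath" ends with a path separator and that the tLoop component is named tLoop_1, as in the "Retrieve batch" code. The class and method names are hypothetical, and a plain HashMap stands in for Talend's globalMap.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: the parameters stand in for context.fileoutputpath and Talend's globalMap.
class FileNameSketch {

    static String fileName(String fileoutputpath, Map<String, Object> globalMap) {
        // Assumed form of the expression: path + current iteration number + ".xml"
        return fileoutputpath + ((Integer) globalMap.get("tLoop_1_CURRENT_ITERATION")) + ".xml";
    }

    public static void main(String[] args) {
        Map<String, Object> globalMap = new HashMap<>();
        globalMap.put("tLoop_1_CURRENT_ITERATION", 3);
        System.out.println(fileName("/tmp/xmlbatches/", globalMap)); // prints /tmp/xmlbatches/3.xml
    }
}
```

A more descriptive name could be built the same way, for example by adding a prefix before the iteration number.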

Running the XMLBatches Job

To run this job, simply set the context variables as you require them and press the "Run" button in the Run tab. There is no output to view in the System Out window. With the 100 generated rows and "batchSize" set to 10, it will simply produce 10 XML files in the directory you select for the "fileoutputpath" context variable.
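The number of files follows from the row count and the batch size. As a rough check, assuming at least one row is generated, it is the row count divided by the batch size, rounded up. The class and method names below are hypothetical.

```java
// Sketch only: estimates how many XML files the job should write.
class FileCountSketch {

    static int expectedFiles(int rowCount, int batchSize) {
        // Integer ceiling division: a partial final batch still gets its own file.
        // Assumes rowCount >= 1 and batchSize >= 1.
        return (rowCount + batchSize - 1) / batchSize;
    }

    public static void main(String[] args) {
        System.out.println(expectedFiles(100, 10)); // prints 10
        System.out.println(expectedFiles(100, 30)); // prints 4
    }
}
```

So 100 rows with a batch size of 30 would give three full files and a fourth holding the last 10 rows.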

 

A copy of the completed tutorial can be found here. This tutorial was built using Talend ESB 6.1.1, but it should also import into Talend DI 6.1.1. It can be imported into subsequent versions. It cannot be imported into earlier versions, so you will either need to upgrade or recreate it by following the tutorial. You will need to set the context variables according to your system before running it.

 
