Load Data into a Dynamic Number of Files

A question I have seen multiple times on forums, and have been asked several times while on site with customers, is "How can I load data into a dynamic number of files when I don't know the total number in advance?". When I first heard this I have to admit I thought the person was just trying to pick holes in what Talend can do. Why would you want to load to files you know nothing about? But then it occurred to me: with Big Data we are seeing a resurgence of flat file data storage and usage. Maybe there is a requirement to split files into smaller chunks by a key within those files? A phonebook file of several GB might need to be split into one file per surname, for example. I decided to look into how Talend could do this, and it turns out to be quite simple.


This tutorial demonstrates how simple it is to split a data source into multiple files based on a key within that data source.

The LoadIntoDynamicNumberOfFiles Job

The layout of this Job is shown below.

I will describe each of the numbered components so that this Job can be recreated in any version of Talend.
 

Context Variables

For this Job I use one context variable, called "outputFolder". It is used to set the folder in which the generated files will be created.
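Context variables are exposed to the Java code inside a Job via the "context" object. As a quick illustration (the "Smith.csv" file name here is just a made-up example, and the trailing path separator is assumed to be part of the variable's value, e.g. "/tmp/output/"):

//Reference the context variable from any Java code in the Job
String targetFile = context.outputFolder + "Smith.csv";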

1) "tRowGenerator_1" (tRowGenerator)

This component is used to generate 10,000,000 rows of random data with a person's name as the row key. It wasn't renamed to something more meaningful before I took a screenshot of the Job, which is the only reason it has a rubbish name :-). Below is a screenshot of the configuration of this component. I chose to generate random data rather than use a real source purely because of the overhead of hosting a file with 10,000,000 rows in it.

As you can see, the schema has been set up to return random ASCII strings for the most part, but the "keyField" column is populated using the TalendDataGenerator routine's "getFirstName()" method.
This component has been set to randomly generate 10,000,000 rows. This is set using the number circled in red.
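To give a feel for what each generated row looks like, here is a rough plain-Java sketch of the per-column functions the component is configured with. It assumes the standard Talend system routines (TalendDataGenerator and TalendString) are available, and the length of 10 passed to getAsciiRandomString() is illustrative rather than the exact value used in the screenshot:

//keyField: a random first name from Talend's built-in name list
String keyField = routines.TalendDataGenerator.getFirstName();

//field1 to field6: random ASCII strings (length shown is illustrative)
String field1 = routines.TalendString.getAsciiRandomString(10);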

2) "Distribute data" (tJavaFlex)

This component is where all of the logic takes place. A tJavaFlex component is essentially made up of three sections of Java code: Start, Main, and End. Instead of taking screenshots of these, I will include the code below.

Start Code

//Create a HashMap object to contain potentially multiple FileOutputStream objects
java.util.HashMap<String, java.io.FileOutputStream> fileOutputStreamMap = new java.util.HashMap<String, java.io.FileOutputStream>();

In this section we simply instantiate a HashMap to hold our FileOutputStream objects. 

Main Code

//Set the groupKey variable
String groupKey = row1.keyField;

//Create String row from columns
String tmpVal = groupKey +";"+
    row1.field1 +";"+
    row1.field2 +";"+
    row1.field3 +";"+
    row1.field4 +";"+
    row1.field5 +";"+
    row1.field6+"\n"; 

//Convert the String row to a byte array    
byte[] contentInBytes = tmpVal.getBytes();

//Check to see if a FileOutputStream for the groupKey already exists
if (fileOutputStreamMap.containsKey(groupKey)) {
    //Write the byte array to the file associated with the FileOutputStream
    fileOutputStreamMap.get(groupKey).write(contentInBytes);

} else {
    //Create a new FileOutputStream for the groupKey and add it to the HashMap
    fileOutputStreamMap.put(groupKey, new java.io.FileOutputStream(context.outputFolder + groupKey + ".csv", false));

    //Write the byte array to the file associated with the FileOutputStream
    fileOutputStreamMap.get(groupKey).write(contentInBytes);

}

In this section we concatenate the column data (using a semicolon as a separator and a new line at the end), convert the String to a byte array, and use a FileOutputStream to write the line to the required file. If a FileOutputStream corresponding to the keyField does not yet exist, we create a new one and add it to the HashMap. A HashMap lets us look up objects by a key value, and we use the keyField value as our key. This way we can be sure we write to the correct file and only create FileOutputStream objects when they are needed.
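As an aside, on Java 8 or later the if/else above could be collapsed using the HashMap's computeIfAbsent() method. This is just an alternative sketch, not what the Job uses; note that the FileOutputStream constructor throws a checked exception, which has to be handled inside the lambda:

//Alternative to the if/else: create the stream only on first sight of a key
java.io.FileOutputStream out = fileOutputStreamMap.computeIfAbsent(groupKey, key -> {
    try {
        return new java.io.FileOutputStream(context.outputFolder + key + ".csv", false);
    } catch (java.io.FileNotFoundException e) {
        //Wrap the checked exception so it can escape the lambda
        throw new RuntimeException(e);
    }
});
out.write(contentInBytes);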

End Code

//Instantiate Iterator from the HashMap keyset
java.util.Iterator<String> it = fileOutputStreamMap.keySet().iterator();

//Iterate over keyset
while(it.hasNext()){
    
    //Close each FileOutputStream
    java.io.FileOutputStream tmpStream = fileOutputStreamMap.get(it.next());
    tmpStream.close();

}

In this section we simply close all of the FileOutputStreams that have been created.
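The same cleanup can also be written a little more compactly by iterating over the map's values rather than its key set; a minimal equivalent sketch:

//Close every FileOutputStream held in the map
for (java.io.FileOutputStream tmpStream : fileOutputStreamMap.values()) {
    tmpStream.close();
}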

...and that is it!

While this is a very simple example, it hopefully shows that not having a dedicated Talend component for generating files dynamically doesn't stop us from doing this.

 

Running the LoadIntoDynamicNumberOfFiles Job

To run this Job, simply set the "outputFolder" context variable to a path that suits your requirements, go to the Run tab, and click "Run". There is no output to the System.out window, but you will see CSV files appear in the folder you specified.
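The result is one CSV file per distinct keyField value. The exact file names depend on the first names the TalendDataGenerator routine happens to produce, so treat the listing below as illustrative only:

outputFolder/
    Andrew.csv
    Bill.csv
    Catherine.csv
    ...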

 

A copy of the completed tutorial can be found here. This tutorial was built using Talend ESB 6.1.1, but it should import into Talend DI 6.1.1 as well, and into subsequent versions. It cannot be imported into earlier versions, so you will either need to upgrade or recreate it by following the tutorial. You will need to set the context variables according to your system before running it.
