Dynamic column order

This tutorial was inspired by questions I get asked a lot when out on sites, I have also seen it asked on forums. The question is "Is there a way to deal with files that have the same columns, but in different orders?" or "Can I idnetify the column order from the header row?". There are several ways in which this can be achieved and this is just one of them. Some ways may required much more complicated logic and maybe a bit of Java. This way makes use of the tMap component and the ordered processing of variables in that component. As with always, I have included a copy of the completed Job with this tutorial and a couple of example files to try it out on.

This Job will cater for text files with four columns. These columns are headed "one", "two", "three" and "four". They can occur in any order and the Job will order these columns in the order "one", "two", "three", "four. An example of the formats are shown below.....

The required output order

ONETWOTHREEFOUR
1234
11223344

An example incorrect order

TWOONEFOURtHREE
2143
22114433

Example files with varying orders are included.

 

The DynamicColumnOrder Job

Below we can see the layout of the "DynamicColumnOrder" Job. This is a very basic example which will hopefully allow you to extrapolate from it to solve your dynamic column issue. One thing that should be thought about here is the column type. When taking data in from unknown column types, it is best practice to dump that data into a String type. You can convert from that type once you are sure of the actual data type held by that column. For this reason, this example will work entirely with Strings. However, hopefully you will see that as soon as you know the data type, it is quite easy to cast to that type.

There are no context variables use in this tutorial and all paths to files are hardcoded. If you download the Job you will need to change these.

1) "File List" (tFileList)

This component is used to list through the files in a folder. In this example we are assuming that every file in the folder is suitable. This will need to be looked at more seriously in a real life example. Some filtering at the very least will be required. The image below shows the configuration of this component. It will need changing for your environment.

The text in the red square will need changing to the path to where you keep the files.

2) "Input File" (tFileInputDelimited)

This component is used to read the contents of a file. The configuration of this component is its default configuration, other than the "File name/Stream" parameter. This is set to the "Current_Filepath" global variable that is set for each file that is found by the tFileList component. As this component is linked by an "Iterator", every file that is found by the tFileList initiates this component again. This means that every file has its content read in by this component. The configuration can be seen below.....

 

3) "Fix order" (tMap)

This component is used to identify the order and change it to the expected order. The configuration of this component can be seen below....

The input "row1" will not have any columns when you start as we did not configure the tFileInputDelimited component's schema. To do this, we will click on the green plus symbol (in a green square) four times and create 4 String columns as shown above. This schema will be passed back to the tFileInputDelimited component.

The next thing we need to do is to create 9 variables by clicking on the green plus symbol (in the red square). This is where all of the work takes place. It is important to realised that for each row from the input file, these variables are processed in order from top to botton. This allows us to do some clever processing of the data. The configuration of these variables can be seen in the table below....
 

ExpressiontypevariableDescription
Var.rowcount+1introwcountThis variable is to keep track of the rows per file. The first row is a header, this variable comes in handy when working out if we are looking at the header row or not.
Var.col1Name==null ? row1.one : Var.col1Name  Stringcol1NameThis is used to store the name of the first column when the header row is passed through. The assumption being that the header is the first row and therefore the existing value for the variable will be Null. Variables keep their values between rows.
Var.col2Name==null ? row1.one : Var.col2Name  Stringcol2NameSee "col1Name" description
Var.col3Name==null ? row1.one : Var.col3Name  Stringcol3NameSee "col1Name" description
Var.col4Name==null ? row1.one : Var.col4Name  Stringcol4NameSee "col1Name" description
Var.rowcount>1 ? Var.col1Name.compareToIgnoreCase("one")==0 ? row1.one : Var.col2Name.compareToIgnoreCase("one")==0 ? row1.two : Var.col3Name.compareToIgnoreCase("one")==0 ? row1.three : Var.col4Name.compareToIgnoreCase("one")==0  ? row1.four : null : null  Stringcol1ValueThis variable checks to see that "rowcount" is greater than 1 (ensuring it is a data row) then checks each of the "colName" variables to see if it is named "one" (in this case). If it is, we know that this is the correct column and its value is used.
Var.rowcount>1 ? Var.col1Name.compareToIgnoreCase("two")==0 ? row1.one : Var.col2Name.compareToIgnoreCase("two")==0 ? row1.two : Var.col3Name.compareToIgnoreCase("two")==0 ? row1.three : Var.col4Name.compareToIgnoreCase("two")==0  ? row1.four : null : null  Stringcol2ValueThis variable checks to see that "rowcount" is greater than 1 (ensuring it is a data row) then checks each of the "colName" variables to see if it is named "two" (in this case). If it is, we know that this is the correct column and its value is used.
Var.rowcount>1 ? Var.col1Name.compareToIgnoreCase("three")==0 ? row1.one : Var.col2Name.compareToIgnoreCase("three")==0 ? row1.two : Var.col3Name.compareToIgnoreCase("three")==0 ? row1.three : Var.col4Name.compareToIgnoreCase("three")==0  ? row1.four : null : null  Stringcol3ValueThis variable checks to see that "rowcount" is greater than 1 (ensuring it is a data row) then checks each of the "colName" variables to see if it is named "three" (in this case). If it is, we know that this is the correct column and its value is used.
Var.rowcount>1 ? Var.col1Name.compareToIgnoreCase("four")==0 ? row1.one : Var.col2Name.compareToIgnoreCase("four")==0 ? row1.two : Var.col3Name.compareToIgnoreCase("four")==0 ? row1.three : Var.col4Name.compareToIgnoreCase("four")==0  ? row1.four : null : null  Stringcol4ValueThis variable checks to see that "rowcount" is greater than 1 (ensuring it is a data row) then checks each of the "colName" variables to see if it is named "four" (in this case). If it is, we know that this is the correct column and its value is used.

The logic above allows the tMap to select the appropriate value for the column that is expected. The "colValue" columns are all passed to the tMap output in the expected order ("col11Value", "col12Value", "col13Value", "col14Value").

To create the "out1" output column, click on the green plus symbol (in a blue square above) and it creates the output. Then you need to connect up the "colValue" variables in numeric order (top to bottom) as shown. The last step is to ensure that no header values are passed through. This is done using the output filter. We have "Var.rowcount>1" to ensure that the row is only passed on when it is greater than row number 1.

 

4) "tLogRow_1" (tLogRow)

This component is used to print out the result to the System.out. This needs no configuration apart from connecting to the last component.

 

Running the Job

Once you have downloaded the example files (or tailored this to your requirements and located the files in the same folder), you can run this by clicking on the "Run" button on the "Run" tab. The expected output (from the included files) is seen below......

Starting job DynamicColumnOrder at 13:34 30/01/2015.

[statistics] connecting to socket on port 3673
[statistics] connected
1|2|3|4
11|22|33|44
111|222|333|444
1|2|3|4
11|22|33|44
111|222|333|444
1|2|3|4
11|22|33|44
111|222|333|444
[statistics] disconnected
Job DynamicColumnOrder ended at 13:34 30/01/2015. [exit code=0]

 

A copy of the completed tutorial code can be found here. The example files can be downloaded here. This tutorial was built using Talend 5.5.1 but can be imported into subsequent versions. It cannot be imported into earlier versions, so you will either need to upgrade or recreate it following the tutorial.

If you have any questions about this tutorial, please post them in the comments below so that others (who will inevitably have similar problems) can benefit from the discussion.

Tutorial: 
Talend Version: 
Type of content: