Unless you are very lucky, you will not have plentiful test data for all of your data sources when you start working on a Data Integration project. Building a Data Integration solution usually starts before some of the systems to be integrated are available, so we need to emulate their data outputs as accurately as possible. That way, when the real systems are introduced, it is like slotting a puzzle piece into place. It is rarely that easy, but creating test data based on whatever is known about the actual data makes the pieces more likely to fit together.
In this tutorial we will create a job to build some test data based on the metadata we created in the "Creating a piece of simple metadata" tutorial.
This will cover several Talend concepts, such as:
- Creating a Talend Data Integration Job
- Using metadata
- Using Java routines
1) Open the Talend Open Studio for ESB application and find "Job Designs" in the Project Tree. Right click on this and select "Create job".
2) Give the job a name (following any naming conventions that apply) and a description. A job is often built and then left untouched for some time, so it is a good idea to give complicated jobs a detailed description. This makes it easier for other people to inherit and work on them. Once a suitable name and description have been written, hit Finish.
3) So long as a unique name has been chosen, the job will be created and a blank workspace will be loaded, as below.
4) Click on the workspace and type "tRowGenerator". As you type each letter you will notice that a list of components will be narrowed down until you are left with the tRowGenerator. Click on the component name to select it.
5) Next, find the WebsiteOrderData schema that was built in the "Creating a piece of simple metadata" tutorial and drag it to the workspace. A components window will appear. Click on the "tFileOutputDelimited" component to select it. This component will be used to output the test data using the selected schema.
6) Another way of selecting components is to go to the Palette. This is a good place to discover components that you are not familiar with. You can search for components using the search box. In this case we are searching for a "tLogRow" component. When found, drag and drop it to the workspace.
7) We now have 3 components on the workspace. At the moment they are not connected and will do nothing in this state.
8) At step 5 we added a tFileOutputDelimited component that was associated with the WebsiteOrderData schema. In this step we will link the tLogRow component to the tFileOutputDelimited component by right clicking on the tLogRow component, selecting "Row" and then "Main", and dragging the row to the tFileOutputDelimited component.
9) After dragging the row to the tFileOutputDelimited component, a window appears asking whether to get the schema of the target component. The target component is configured with the schema we shall be using, so click "Yes" to format the tLogRow with the same schema.
10) Repeat steps 8 and 9 to connect the tRowGenerator component with the tLogRow component, making use of the same schema.
11) The tFileOutputDelimited component is set to create a file called "WebsiteOrderData.dat". This is the file that was used to build the schema in the metadata tutorial. To change this, click on the "File Name" field and select "Change to built-in property" when the "Edit parameter using repository" window appears.
12) Add the path and filename you wish to be produced by this job. In this case, we have selected "C:\temp\PrepTestData\MyTestOutput.dat".
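A note on writing that path: Talend generates Java code, and a component field such as "File Name" holds a Java expression, so a literal path is entered as a quoted Java string. In a Java string a single backslash starts an escape sequence, so a Windows path needs either doubled backslashes or forward slashes. The sketch below only illustrates that the two spellings name the same location; the class name is hypothetical and not part of the job.

```java
// Hypothetical class, used only to illustrate Java string escaping for paths.
class OutputPathSketch {
    // Doubled backslashes: the string value is C:\temp\PrepTestData\MyTestOutput.dat
    static final String ESCAPED = "C:\\temp\\PrepTestData\\MyTestOutput.dat";

    // Forward slashes: also accepted by Java file APIs on Windows.
    static final String FORWARD = "C:/temp/PrepTestData/MyTestOutput.dat";

    public static void main(String[] args) {
        System.out.println(ESCAPED);
        System.out.println(FORWARD);
        // Both spellings describe the same path once the separators are normalised.
        System.out.println(ESCAPED.replace('\\', '/').equals(FORWARD)); // prints true
    }
}
```

Forward slashes are usually the less error-prone choice, since a forgotten second backslash either fails to compile or silently becomes a control character such as a tab.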
13) Now that the tFileOutputDelimited component is configured, click on the tRowGenerator component. In the component tab below, a "RowGenerator Editor" button appears. Click this to create the logic that builds the data.
14) The editor window will appear showing the schema that data needs to be built for. Here we can select functions to provide the data. The number of rows of data this component will produce is set using the "Number of Rows for RowGenerator" value. For this example, it is set at 100. The first column that we will supply data for is the "Id" column. This needs to be a numeric sequence. Click on the "Functions" cell and select "Numeric.Sequence".
15) Ensure that the "Start value" and "Step" values are both 1. These indicate that the sequence will start at 1 and increment by 1. The sequence is automatically named "s1". This can be changed. The purpose of this name is so that the same sequence can be used in several different places.
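To make the role of the sequence name concrete, here is a minimal sketch of how a named sequence could behave. It is not Talend's implementation; the class and method names are hypothetical. The idea it shows is that the name identifies a shared counter, so every call that uses "s1" continues the same run of numbers.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for a named sequence; not Talend's own code.
class SequenceSketch {
    // One counter per sequence name, so the same name can be shared between columns.
    private static final Map<String, Integer> NEXT_VALUES = new HashMap<>();

    // Returns the next value of the named sequence.
    // The first call for a name returns startValue; each later call adds step.
    static synchronized int next(String name, int startValue, int step) {
        int value = NEXT_VALUES.getOrDefault(name, startValue);
        NEXT_VALUES.put(name, value + step);
        return value;
    }

    public static void main(String[] args) {
        // Start value 1 and step 1, as configured in this step.
        for (int row = 0; row < 3; row++) {
            System.out.println(next("s1", 1, 1)); // prints 1, then 2, then 3
        }
    }
}
```

With a start value of 1 and a step of 1, the 100 generated rows would receive Ids 1 to 100.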
16) Talend is shipped with several functions for generating random data. The "Firstname" and "Surname" data will use the "TalendDataGenerator.getFirstName" and "TalendDataGenerator.getLastName" functions. These can be selected from the "Functions" cell drop down.
17) Some of the other data that needs to be generated is specific to the example. Talend supplies some generic data generator functions, but cannot possibly supply functions for every use case. It does, however, allow you to create Java functions to generate your own data. To do this, right click on the "Routines" object in the Project Tree and click on "Create routine".
18) Give the routine a name and a description. Click "Finish".
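19) A skeleton Java class is created for you. This is where you can build your functions. Where possible, declare your functions as public static so that they can be called directly as RoutineName.functionName() from component expressions. This also avoids creating a new object for every row, which helps to prevent memory issues.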
20) One of the fields in the schema that needs to be populated with data is a car registration field. Below is a screenshot of the Java function that was written to accommodate the need for random car registration numbers.
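For readers who cannot see the screenshot, the following is a minimal sketch of what such a function might look like. It is not the function from the original job. The class name, the method name and the registration format (two letters, two digits, a space, three letters) are all assumptions made for illustration.

```java
import java.util.Random;

// Hypothetical routine-style class; in Talend it would live in the routines package.
class CarRegistrationSketch {
    private static final String LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private static final Random RANDOM = new Random();

    // Convenience overload using a shared random source.
    static String getCarRegistration() {
        return getCarRegistration(RANDOM);
    }

    // Builds a registration in the assumed format "LLDD LLL", for example "AB12 CDE".
    static String getCarRegistration(Random random) {
        StringBuilder registration = new StringBuilder(8);
        appendLetters(registration, random, 2);
        registration.append(random.nextInt(10)).append(random.nextInt(10));
        registration.append(' ');
        appendLetters(registration, random, 3);
        return registration.toString();
    }

    // Appends the requested number of random upper-case letters.
    private static void appendLetters(StringBuilder target, Random random, int count) {
        for (int i = 0; i < count; i++) {
            target.append(LETTERS.charAt(random.nextInt(LETTERS.length())));
        }
    }

    public static void main(String[] args) {
        System.out.println(getCarRegistration());
    }
}
```

Passing the Random in as a parameter is a design choice that makes the function repeatable in tests when a fixed seed is used.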
21) After the Java routine has been created and saved, its functions can be used in the tRowGenerator editor. In this example several functions were created to accommodate the schema. To use one, click on the row of the column that you want the function to populate. This will reveal an empty "Value" field in the "Function parameters" tab below. Click on the "Value" field and a button is revealed. Click this to open an expression builder.
22) In the expression builder you can reference any function that you have created. In the screenshot below we can see a "getTown" function belonging to the "RilhiaDataGenerator" routine being used.
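As a rough idea of what sits behind such a call, here is a minimal sketch of a town generator. It is not the code of the "RilhiaDataGenerator" routine. The class name and the list of towns are assumptions, and whether the real "getTown" function takes any arguments is not stated here.

```java
import java.util.Random;

// Hypothetical stand-in for a routine that returns a random town name.
class TownGeneratorSketch {
    // Sample values chosen for illustration only.
    private static final String[] TOWNS = {
        "Leeds", "Bristol", "Cardiff", "Glasgow", "Norwich"
    };
    private static final Random RANDOM = new Random();

    // Picks one town at random; every town is equally likely.
    static String getTown() {
        return TOWNS[RANDOM.nextInt(TOWNS.length)];
    }

    public static void main(String[] args) {
        // In the expression builder the equivalent would be a single
        // expression of the form RoutineName.getTown().
        System.out.println(getTown());
    }
}
```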
23) Work through every column of the schema and assign a method of generating data. This can be standard Talend functionality or bespoke functionality built using Java. Pure Java can also be used in the expression builder but it is better to create a function in a routine to make it easier to reuse. Once all columns have been configured, you can click on the "Preview" tab to preview the data, as seen below.
24) If no problems are experienced in previewing the data, we are ready to test the job. To do this, click on the "Run" tab and then click on the "Run" button. You will see the output window is populated with data, as below.
25) The output window was populated with data thanks to the tLogRow component. This component shows every data row that passes between the tRowGenerator and the tFileOutputDelimited components. While useful in debugging, this can slow jobs down quite a bit. To switch this functionality off, simply right click on the tLogRow component and select "Deactivate tLogRow". This leaves the component in place if you want to use it in the future, but switches it off so that it will not slow down the job.
26) If we go to the location that was specified in the tFileOutputDelimited component, we can see the file that was produced. The file has been generated with random data but, in this example, without a header row.
27) To include a header row, go to the tFileOutputDelimited component and click on the "Include Header" tick box in the "Component" tab. Re-run the job.
28) Find the generated file and open it up. You will see that this time there is a header row.
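The effect of the "Include Header" option can be pictured with a small sketch in plain Java. This is not what Talend generates. The class name, the ";" separator and the sample values are assumptions, and only the three columns named earlier in this tutorial are used.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of delimited output with an optional header row.
class DelimitedOutputSketch {
    // Column names taken from the tutorial; the real schema has more columns.
    private static final String[] COLUMNS = {"Id", "Firstname", "Surname"};
    private static final String SEPARATOR = ";"; // assumed field separator

    // Turns rows into delimited lines, prefixed by the column names when requested.
    static List<String> toLines(List<String[]> rows, boolean includeHeader) {
        List<String> lines = new ArrayList<>();
        if (includeHeader) {
            lines.add(String.join(SEPARATOR, COLUMNS));
        }
        for (String[] row : rows) {
            lines.add(String.join(SEPARATOR, row));
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"1", "Ada", "Lovelace"}); // made-up sample data
        rows.add(new String[] {"2", "Alan", "Turing"});
        // With the header switched on, the first line is Id;Firstname;Surname.
        toLines(rows, true).forEach(System.out::println);
    }
}
```

The data rows are identical in both cases; the option only decides whether the column names are written as the first line of the file.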
The source files can be downloaded using the link below....