Using OAuth 2.0 with Talend to Access Google APIs

This tutorial deals with a reasonably complicated process, and the Talend DI part is arguably the simple bit. Because of this, I will not go through each step in as much detail as in some of the other tutorials. It is assumed that anyone who needs this functionality has already mastered, or at least understood, most of the Talend DI basics.

Google make use of the OAuth 2.0 protocol for authentication for (I think, but am happy to be corrected) all of their services. They do a pretty good job of describing the protocol, but do start out by saying that it is "...a relatively simple protocol". Don't worry if you don't agree. They have some pretty smart cookies working there. However, once you have read through the documentation you should have a better idea of what you are doing, which should make this easier. The Google documentation is here.

The first thing that needs to be done is to create a Google Project. This is described below....

Create a Google Project

Creating a Google Project is pretty simple. First of all, you need to go to the Google Developers Console.

Once you have logged in, you should see a page with content as below....

To create a project, click on the "Create Project" button circled in red. A popup will appear for you to fill out the "Project Name" and "Project Id". It might be a good idea to leave the "Project Id" as the randomly generated value you are given, as none of the values I have ever changed it to have worked.

This tutorial will be built as if it is going to be used to access Google Drive. Accessing other Google tools is done in the same way. You need to specify which APIs you require when you register a project.

Ensure that the "I have read and agree..." tick box is ticked.....oh and remember to read it.

Then press the "Create" button. 

After a few seconds you will see the following screen appear....

Next we need to give the project access to certain APIs. In this case we will give the project access to Google Drive. To do this, click on the "APIs & auth" link (circled in red). When the tree expands, click on the "APIs" link. The next screen will appear...

You may see some other APIs automatically selected. You can leave those or (as I have done) remove them. Then select the "Drive API" and "Drive SDK".....or whichever Google products you wish to use. As I said earlier, this tutorial is an example of how to access Google Drive. But it can be followed for giving access to any of the Google applications.

Once the APIs that are required have been selected, click on the "Credentials" link to see the screen below....

Here is where we create our OAuth 2.0 Client ID. To start the process click on the "Create new Client ID" button (circled in red). This will reveal the screen below....

In this example we are using the "Web application" application type. While this isn't necessarily the best type to choose for Talend, it doesn't have any limitations as to what can be used with it. It does mean that a user will need to log in the first time, but a "refresh token" can be used to ensure that future "access tokens" can be created from that. This is explained later when describing the Talend Job.

You can see that the "AUTHORIZED JAVASCRIPT ORIGINS" and "AUTHORIZED REDIRECT URI" both contain "http://localhost". We are not using any JavaScript, so we don't need to worry too much about the "AUTHORIZED JAVASCRIPT ORIGINS". The "AUTHORIZED REDIRECT URI" is the URI that Google redirects the browser to after authorisation, with the authorisation code attached as a query parameter. This is described later.

Once you click on the "Create Client ID" button, you will see the following section appear on the screen. This holds all of the details you will need for your Talend Job...

The "CLIENT ID" and "CLIENT SECRET" are required by the Talend Job in order to get refresh tokens and access tokens.

Now we can look at building the Talend Job.

 

The "Retrieve Google Access Code" Job

This Job is intended to be used as a child job by other Talend Jobs that require access to Google products. The purpose of this Job is simply to return an access token.

This Job isn't terribly complicated, but there are lots of IF Conditions to control the data flow. This is to accommodate several scenarios that might be hit when retrieving an access token. A screen shot of the Job can be seen below....

As explained earlier, this tutorial will not go into so much detail about configuring components. Each of the numbers in the screenshot above corresponds to an area that needs a bit of detail. If there is anything that you feel is not described adequately, please feel free to leave a comment or question below and I will get back to you.

 

1) Reading Context Variables

This subjob is used to read Context variables in from a flat file. It makes use of a tFileInputDelimited component and a tContextLoad component.

The tFileInputDelimited component makes use of a Context variable called "context_file" to point to the correct file location and requires a schema of two columns; "key" and "value".

This enables the population of the following Context variables which have been created for this Job....

Name            Type     Default Value
access_token    String
client_id       String
client_secret   String
context_file    String   "C:/GoogleDrive/Config/contextGoogle.csv"
redirect_uri    String
refresh_token   String
scope           String

 

The only Context variable that needs a default value is the "context_file" variable. The rest are handled within the file referenced by that variable.

2) "Access Token Empty And Refresh Token Not Empty" (Run If)

This "Run If" link tests to see if the "access_token" Context variable is empty and the "refresh_token" is not empty. If this test is "true", then the next phase is to generate an access token from the refresh token. The code used is below...

(context.access_token==null||context.access_token.compareToIgnoreCase("")==0)&&(context.refresh_token!=null&&context.refresh_token.compareToIgnoreCase("")!=0)

 

3) "Access Token Not Empty And Refresh Token Not Empty" (Run If)

This "Run If" link tests to see if the "access_token" Context variable is not empty and the "refresh_token" is not empty. If this test is "true", then the next phase is to test the access token. The code used is below...

(context.access_token!=null&&context.access_token.compareToIgnoreCase("")!=0)&&(context.refresh_token!=null&&context.refresh_token.compareToIgnoreCase("")!=0)

 

4) "Access Token Empty And Refresh Token Empty" (Run If)

This "Run If" link tests to see if the "access_token" Context variable is empty and the "refresh_token" is empty. If this test is "true", then the next phase is to generate a new refresh token and access token. The code used is below...

(context.access_token==null||context.access_token.compareToIgnoreCase("")==0)&&(context.refresh_token==null||context.refresh_token.compareToIgnoreCase("")==0)
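The three Run If conditions are easier to check when the "is empty" test is pulled out into one place. The following is only a sketch of that idea, not part of the original Job: the class name "TokenState" and its method names are made up for illustration, and it assumes you would add the class to your project as a custom routine.

```java
// Hypothetical helper; the class and method names are not part of the original Job.
final class TokenState {

    // True when the value is null or an empty string.
    static boolean isEmpty(String value) {
        return value == null || value.isEmpty();
    }

    // Link 2: no access token, but a refresh token exists, so refresh.
    static boolean needsRefresh(String accessToken, String refreshToken) {
        return isEmpty(accessToken) && !isEmpty(refreshToken);
    }

    // Link 3: both tokens exist, so test the access token.
    static boolean needsTest(String accessToken, String refreshToken) {
        return !isEmpty(accessToken) && !isEmpty(refreshToken);
    }

    // Link 4: neither token exists, so authorise from scratch.
    static boolean needsAuthorisation(String accessToken, String refreshToken) {
        return isEmpty(accessToken) && isEmpty(refreshToken);
    }
}
```

With such a routine in place, a Run If condition could read TokenState.needsRefresh(context.access_token, context.refresh_token). Note that the three conditions do not cover an access token that exists without a refresh token.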

 

5) "Get Access Token using Refresh Token" (tRESTClient)

This component is used to retrieve an access token using an existing refresh token. This will only be carried out if the "Access Token Empty And Refresh Token Not Empty" Run If link condition is true.

To configure this component copy the configuration shown below. Ensure that the sections circled in red are set correctly. To add the "Query parameters" use the green plus symbol circled in red.

The values required can be seen above, but you can find them below so that you can copy and paste them.....

URL: "https://accounts.google.com/o/oauth2/token"

Name              Value
"refresh_token"   context.refresh_token
"client_id"       context.client_id
"client_secret"   context.client_secret
"grant_type"      "refresh_token"

 

It should be noted that although we are receiving JSON back, this component will automatically convert it to a DOM document with the JSON wrapped with a "ROOT" element by default. This is explained here. You will need a "MyTalend" account. This can be acquired for free.
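If you want to see roughly what these four parameters look like once they are URL-encoded and joined together, the sketch below builds the parameter string in plain Java. It does not make the HTTP call and it is not how the tRESTClient is implemented internally; the class name "RefreshRequestParams" and the sample values are assumptions for illustration only.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; it only builds the parameter string, it sends nothing.
final class RefreshRequestParams {

    // Joins name/value pairs as name=value&name=value, URL-encoding each part.
    static String encode(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (sb.length() > 0) {
                    sb.append('&');
                }
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex); // UTF-8 is always available
        }
        return sb.toString();
    }

    // The four parameters used when exchanging a refresh token for an access token.
    static String refreshParams(String refreshToken, String clientId, String clientSecret) {
        Map<String, String> p = new LinkedHashMap<String, String>();
        p.put("refresh_token", refreshToken);
        p.put("client_id", clientId);
        p.put("client_secret", clientSecret);
        p.put("grant_type", "refresh_token");
        return encode(p);
    }
}
```

A LinkedHashMap is used so that the parameters come out in the order they were added, which makes the result easy to compare by eye.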

6) "tExtractXMLField_2" (tExtractXMLField)

This component is used to retrieve the access token from the returned JSON string which has been converted to an XML document. The configuration of this component can be seen below.....

 

An output schema is required. To set this up click on the "Edit schema" button circled in red. A single column called "access_token" is required.

Ensure that the areas circled in red are configured as seen above. 

The "XPath query" required for the column that is being output is "./access_token". 
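To show what the relative XPath query is doing, here is a small standalone sketch using the standard Java XML APIs. The sample document is an assumption about what the converted response looks like (in particular the name and case of the wrapping element), and the class name "TokenXPathSketch" is made up; check the real output of your tRESTClient before relying on it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch; it mimics a loop XPath plus a relative field XPath.
final class TokenXPathSketch {

    // Evaluates fieldXPath relative to the node selected by loopXPath.
    // Returns null if the loop node is missing, "" if the field is missing.
    static String extract(String xml, String loopXPath, String fieldXPath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node loopNode = (Node) xpath.evaluate(loopXPath, doc, XPathConstants.NODE);
            if (loopNode == null) {
                return null;
            }
            return xpath.evaluate(fieldXPath, loopNode);
        } catch (Exception ex) {
            throw new IllegalStateException("Could not read the XML document", ex);
        }
    }
}
```

For example, with an assumed document of <root><access_token>ya29.example</access_token></root>, calling extract(xml, "/root", "./access_token") picks out the token text.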

 

7) "IF more than 0 rows from tLogRow 4" (Run If)

This "Run If" link tests to see if any rows have come from the tLogRow_4 component. This is done to prevent the code from following this path if no rows were output. The code used is below...

((Integer)globalMap.get("tLogRow_4_NB_LINE"))>0

 

8) "tJavaRow_2" (tJavaRow)

This component is used to take the "access_token" column from the previous component and set it as the current value of the Context variable "access_token". The code to do this is below....

String atoken = input_row.access_token;

context.setProperty("access_token", atoken);

 

The "input_row.access_token" bit of code represents the value coming in. The "context.setProperty(..." section assigns the "access_token" value.

 

9) "IF more than 0 rows from tLogRow 2" (Run If)

This "Run If" link tests to see if any rows have come from the tLogRow_2 component. This is done to prevent the code from following this path if no rows were output. The code used is below...

((Integer)globalMap.get("tLogRow_2_NB_LINE"))>0

 

10) "tJava_3" (tJava)

This component is used to reset the "access_token" and "refresh_token" to an empty string if trying to acquire an access token from the refresh token fails. It also points the user to where to go to revoke access to the Talend Job so that the process can be started again from scratch. This should not be a common situation, but needs to be handled. The code for this can be seen below....

context.setProperty("refresh_token", "");
context.setProperty("access_token", "");
System.out.println("The tokens do not exist. Revoke access using this URL https://accounts.google.com/b/0/IssuedAuthSubTokens and then run the job again");

 

11+18) Writing the Context variables to file

These subjobs are used to take the current values held by the Context variables and output them to the file that holds those values. The tContextDump component needs no configuration. The tFileOutputDelimited component needs basic configuration which can be seen below...

The "File Name" value is set to the "context_file" Context variable. This is the only Context variable with a default set in the Job.

The schema needs to be a copy of the tContextDump. This is achieved by clicking on the "Edit schema" button and copying the input schema to the output schema. 

 

12) "tJava_1" (tJava)

This component is used to build a URI to be sent to the user to place in a web browser. It is made up of several Context variables which must be set in the Context variable file. This URI is described by Google here.

String uri = "https://accounts.google.com/o/oauth2/auth?";
uri = uri + "scope="+ context.scope + "&";
uri = uri + "state=123456789qwertyui&";
uri = uri + "redirect_uri="+ context.redirect_uri + "&";
uri = uri + "response_type=code&";
uri = uri + "client_id=" + context.client_id + "&";
uri = uri + "approval_prompt=auto&";
uri = uri + "include_granted_scopes=true&";
uri = uri + "access_type=offline";

System.out.println(uri);

 

13) "tMsgBox_1" (tMsgBox)

This component is used to retrieve the value of the redirect URL that is returned after a successful authorisation via a web browser. This is demonstrated later. The configuration of this component can be seen below...

Ensure that the "Buttons" drop down is set as "Question".

 

14) "tJava_2" (tJava)

This component is used to receive the result from the tMsgBox component and extract the authorization code from it. This is used by the next component to authorise the request for an access token. This process is described by Google here.

The code used in this component is below....

String code = ((String)globalMap.get("tMsgBox_1_RESULT"));
code = code.substring(code.indexOf("code=")+5);
int amp = code.indexOf("&");
if(amp >= 0){
    code = code.substring(0, amp); //drop any parameters that follow the code
}
System.out.println(code); //can be removed if an output is not required
globalMap.put("code", code);
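If you would rather not rely on string positions, the redirect URL can be split into its query parameters instead. The sketch below is one possible alternative, not the code used in the Job: the class name "AuthCodeParser" is made up, and it assumes the code may arrive URL-encoded (for example with %2F in place of a slash), so it decodes the value before returning it.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper for pulling the "code" parameter out of a redirect URL.
final class AuthCodeParser {

    // Returns the decoded "code" query parameter, or null if it is absent.
    static String extractCode(String redirectUrl) {
        if (redirectUrl == null) {
            return null;
        }
        int q = redirectUrl.indexOf('?');
        if (q < 0) {
            return null;
        }
        String query = redirectUrl.substring(q + 1);
        int hash = query.indexOf('#');
        if (hash >= 0) {
            query = query.substring(0, hash); // ignore any fragment
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals("code")) {
                try {
                    return URLDecoder.decode(pair.substring(eq + 1), "UTF-8");
                } catch (UnsupportedEncodingException ex) {
                    throw new IllegalStateException(ex); // UTF-8 is always available
                }
            }
        }
        return null;
    }
}
```

Because the method returns null when there is no "code" parameter, the Job could check for that and stop with a clear message rather than passing an empty code on to the next component.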

 

15) "Get Access Token and Refresh Token" (tRESTClient)

This component is used to retrieve an access token and refresh token using the authorisation code retrieved from the component before. 

To configure this component copy the configuration shown below. Ensure that the sections circled in red are set correctly. To add the "Query parameters" use the green plus symbol circled in red.

The values required can be seen above, but you can find them below so that you can copy and paste them.....

URL: "https://accounts.google.com/o/oauth2/token"

Name              Value
"code"            ((String)globalMap.get("code"))
"client_id"       context.client_id
"client_secret"   context.client_secret
"redirect_uri"    "http://localhost"
"grant_type"      "authorization_code"

 

It should be noted that although we are receiving JSON back, this component will automatically convert it to a DOM document with the JSON wrapped with a "ROOT" element by default. This is explained here. You will need a "MyTalend" account. This can be acquired for free.

 

16) "tExtractXMLField_1" (tExtractXMLField)

This component is used to retrieve the access token and refresh token from the returned JSON string which has been converted to an XML document. The configuration of this component can be seen below.....

 

An output schema is required. To set this up click on the "Edit schema" button circled in red. Two columns called "access_token" and "refresh_token" are required.

Ensure that the areas circled in red are configured as seen above. 

The "XPath query" required for the access_token column is "./access_token".

The "XPath query" required for the refresh_token column is "./refresh_token". 

 

17) "tJavaRow_1" (tJavaRow)

This component is used to take the "access_token" and "refresh_token" column values from the previous component and set them as the current values of the Context variables "access_token" and "refresh_token". The code to do this is below....

String atoken = input_row.access_token;
String rtoken = input_row.refresh_token;

context.setProperty("access_token", atoken);
if(rtoken!=null){
    context.setProperty("refresh_token", rtoken);
}

 

The "input_row...." bits of code represent the values coming in. The "context.setProperty(..." sections assign the "access_token" and "refresh_token" values. 

An "IF Condition" is used to cover situations where a "refresh_token" is not received. This should not happen, but this code prevents the Job from falling over if it does. 

 

19) Read the newly set Context variables into the Job and output just the Access Token

This subjob is run at the end of the Job. It will always run, no matter which path the code has taken. It is used to return the access token that has been retrieved/generated. As it has no idea where the access token has come from, it reads the latest value from the Context variable file. As ALL Context variables will be returned from this file, a tMap component is used to filter the return values.

The tFileInputDelimited component points to the Context variable file using the context_file Context variable. It also has the schema that can be seen in the tMap "row14" table. This needs to be configured.

The tMap component can be seen below....

The filter that is used in the "access_token_return" table can be seen below...

row14.key.compareToIgnoreCase("access_token")==0

 

Remember that "row14" might be named differently in a version you write. If you have errors here, check the input row name.
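For anyone who finds it easier to read as plain Java, the subjob is roughly equivalent to the sketch below: read the "key;value" lines and keep only the row whose key is "access_token". This is an illustration only; the class name "ContextFileFilter" is made up and the lines are passed in as a list rather than read from the real file.

```java
import java.util.List;

// Hypothetical sketch of the key/value filter performed by the tMap.
final class ContextFileFilter {

    // Returns the value for the given key from "key;value" lines, or null if not found.
    static String valueFor(List<String> lines, String wantedKey) {
        for (String line : lines) {
            String[] parts = line.split(";", 2); // split on the first ';' only
            if (parts.length == 2 && parts[0].compareToIgnoreCase(wantedKey) == 0) {
                return parts[1];
            }
        }
        return null;
    }
}
```

Splitting with a limit of 2 keeps an empty value (such as "refresh_token;") as an empty string instead of dropping it.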

 

20) "Test List Files Services" (tRESTClient) 

This component is to simply test the access_token that is said to exist. If it tests successfully, the Job will end. If it fails, the error trigger will be used and the Job will attempt to generate a new one.

To configure this component copy the configuration shown below. Ensure that the sections circled in red are set correctly. To add the "Query parameters" use the green plus symbol circled in red.

The values required can be seen above, but you can find them below so that you can copy and paste them.....

URL: "https://www.googleapis.com/drive/v2/files"

Name       Value
"corpus"   "DEFAULT"
"q"        "modifiedDate < '2000-01-01T00:00:00'"

 

This HTTP request is described by Google here. I have used a query to search for files with a modified date less than 2000/01/01. This has been done so that a successful response will return no data.

In order for the HTTP request to work, we need to provide the access token. This is done via the "Advanced Settings" tab as can be seen below....

The access token is provided by the HTTP header "Authorization". Its value must be a combination of the word "Bearer " (with a space) and the access token that has been supplied. 
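As a rough illustration of the two pieces that make up this test request, the sketch below builds the header value and the URL-encoded test URL in plain Java. It sends nothing, and the class name "DriveTestRequest" is made up. In the Talend component itself, the header value would most likely be entered as an expression along the lines of "Bearer " + context.access_token, but check that against your own configuration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical sketch; it builds strings only and makes no HTTP call.
final class DriveTestRequest {

    // Value for the "Authorization" HTTP header.
    static String authorizationHeader(String accessToken) {
        return "Bearer " + accessToken; // note the single space after "Bearer"
    }

    // The test URL with the "corpus" and "q" query parameters URL-encoded.
    static String testUrl() {
        try {
            return "https://www.googleapis.com/drive/v2/files"
                + "?corpus=" + URLEncoder.encode("DEFAULT", "UTF-8")
                + "&q=" + URLEncoder.encode("modifiedDate < '2000-01-01T00:00:00'", "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex); // UTF-8 is always available
        }
    }
}
```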

The Context Variable File

Below we can see an example of what the Context variable file will need to look like when it is first run. The variables that are assigned values here must be assigned values in your version. When the Job has been run for the first time, all of the values will be populated.

client_secret;YIMgcQ24ghjt65GHy8wTtiSpn8
refresh_token;
redirect_uri;http%3A%2F%2Flocalhost
scope;https://www.googleapis.com/auth/drive
context_file;C:/Talend/OpenSource/5.5.1/Studio/workspace/contextGoogle.csv
client_id;689878354248.apps.googleusercontent.com
access_token;

 

Notice that the "context_file" variable has been set. It MUST point to its own location.

The "redirect_uri" variable is "http://localhost" where the value has been URL encoded. This could be done inside the Job if you prefer to leave this as a natural value.
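If you do want to keep the natural value in the file and encode it inside the Job, the standard java.net.URLEncoder class can do it. The following is a minimal sketch; the class name "RedirectUriEncoding" is made up, and you would need to call something like it wherever the Job builds the authorisation URI.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper showing how the natural value could be encoded in the Job.
final class RedirectUriEncoding {

    // URL-encodes a value so it can be used as a query parameter.
    static String encode(String rawValue) {
        try {
            return URLEncoder.encode(rawValue, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex); // UTF-8 is always available
        }
    }
}
```

Calling RedirectUriEncoding.encode("http://localhost") gives "http%3A%2F%2Flocalhost", which matches the value in the example file above.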

The "scope" variable is described by Google here.

 

Running the Job for the first time

This Job can be run on its own to demonstrate that it works. It will print the access token to the System.out. It can also be used as a child Job that returns a key/value pair holding the access_token to be used by the parent. This section will demonstrate the Job being run as a standalone Job.

1) Running the Job

When running for the first time, we need to make sure that the Context variable file is fully configured minus values for the refresh_token and access_token (as seen above). Once that is sorted, load the Job and click on the "Run" button (circled in red).

This will produce a string in the System.out. Copy this string (circled in red) and paste it into a web browser.

2) Authorise the Talend Job with Google

When the Google authentication page loads, click on the "Accept" button. As below....

3) Copy the Authorisation Redirect URL

If the authorisation has worked, a redirect URL (like below) will be returned. Copy it.

4) Pass the Redirect URL back to the Talend Job

Paste the value copied from the web browser address bar into the message box and click "OK". The Job will then continue.

5) The Access Token is Generated

As can be seen below, the Access Token will be displayed at the bottom of the System.out (circled in red). It will also be added to the Context variable file along with the Refresh Token.

 

Refreshing the Access Token from the Refresh Token

After the Refresh Token and Access Token have been generated for the first time, there should be no need for future human interaction unless the Refresh Token has been lost. To show this, open the Context variable file and add a few random characters to the Access_Token variable. Then run the Job as above. You will notice that there is no user interaction required and that a new Access_Token is generated.

Resetting the Refresh Token

You may find that for whatever reason the Refresh Token is not working or has been lost. If this is the case then the Talend Job will need its authorisation revoked before the Job can be run again from scratch. This is an unusual situation, but needs to be covered. To emulate this, open the Context variable file and alter some of the characters of the Access_Token and Refresh_Token. Then run the Job. You will see a screen like below telling you to revoke the access and giving you a URL to use....

Open the URL in a web browser and revoke access to your Talend Job (using the name you specified when you created the Google Project). Then start from scratch. 

 

Running the Job as a child Job

This Job can be run as any child Job in Talend. Ensure that you remember to configure a schema for the child Job that returns exactly what is output by the tBufferOutput component.

 

A copy of the completed tutorial can be found here. An example Context variable file can be found here. It was built using Talend 5.5.1 but can be imported into subsequent versions. It cannot be imported into earlier versions, so you will either need to upgrade or recreate it following the tutorial. 

 