Friday, December 4, 2009

A simple example of using GDAL/OGR together with Talend, or how to get rid of scripts

In this post, you can replace any mention of Talend with Spatial Data Integrator, since the latter is an extension of the original Talend.

One goal of ETL tools like Talend is to avoid writing long, complex code and to speed up the design and execution of integration processes. All operations are designed in a workspace: the administrator picks the components he needs from a palette and links them. Behind the scenes, Java or Perl code is generated, but the casual administrator does not need to read or modify it. In a way, Talend offers a graphical, friendly way of programming.

Spatial Data Integrator is simply Talend with spatial components and functions added. The underlying Java libraries are GeoTools, the Java Topology Suite and Sextante. Despite this rich functionality, some components needed for specific operations are missing. In particular, Talend SDI doesn't support as many formats as OGR, so you can't convert your files between certain formats. Besides, converting files in Talend SDI requires knowing the structure of your files (their schemas) in advance, a prerequisite that limits bulk format conversion.

Fortunately, Talend is flexible enough to let the administrator enrich the application with additional Java libraries and launch command-line tools. This makes it easy to integrate GDAL/OGR operations into complex processes with only one line of code. In fact, depending on the utility, that single line is the only code you need to know.

Many people ask on forums how to run GDAL/OGR over a series of files, for instance how to convert a batch of ESRI shapefiles into KML files. The solutions usually given require some knowledge of shell or batch scripting. Depending on the case, the script can become quite long and therefore difficult to maintain (even more so when the lines of code were copied and pasted without really being understood).

Let's look at the solution given by Tim Sutton, a well-known developer in the OSGeo world, for converting a directory of TIFFs to ECW. The SHP-to-KML script would look much the same.
#!/bin/bash
mkdir ecw
for FILE in *.tif
do
  BASENAME=$(basename $FILE .tif)
  OUTFILE=ecw/${BASENAME}.ecw
  echo "Processing: ${BASENAME}.tif"
  if [ -f $OUTFILE ] #skip if exists
  then
    echo "Skipping: $OUTFILE"
  else
    /usr/local/bin/gdal_translate -of ECW -co LARGE_OK=YES $FILE $OUTFILE
  fi
done

As you can see, this clearly requires some programming skills.

Designing the conversion process in Talend is quite easy, as it uses only two components. The first one lists the files inside a folder and the second runs the ogr2ogr command on each of them. This is one of the simplest examples of integrating GDAL/OGR command-line tools into Talend.
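
To picture what these two steps amount to, here is a minimal hand-written sketch in plain Java. It is only an illustration, not the code Talend generates: the class and method names are made up, it assumes ogr2ogr is available on the PATH, and it writes each KML file next to its shapefile.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustration only: a hand-written stand-in for the two-component job,
// not the code that Talend generates.
final class ShpToKmlLoopSketch {

    // Step 1 (the role played by tFileList): list the shapefiles in a folder.
    static List<File> listShapefiles(File folder) {
        File[] found = folder.listFiles(
                (dir, name) -> name.toLowerCase(Locale.ROOT).endsWith(".shp"));
        List<File> result = new ArrayList<>();
        if (found != null) {        // listFiles returns null when folder is not a directory
            Arrays.sort(found);     // process the files in a predictable order
            result.addAll(Arrays.asList(found));
        }
        return result;
    }

    // Step 2 (the role played by tSystem): build the ogr2ogr call for one file.
    // ogr2ogr expects the destination first, then the source.
    static List<String> ogr2ogrCommand(File shp) {
        String path = shp.getPath();
        String kml = path.substring(0, path.length() - ".shp".length()) + ".kml";
        return Arrays.asList("ogr2ogr", "-f", "KML", kml, path);
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Folder to scan: first argument, or the current directory by default.
        File folder = new File(args.length > 0 ? args[0] : ".");
        for (File shp : listShapefiles(folder)) {
            Process p = new ProcessBuilder(ogr2ogrCommand(shp)).inheritIO().start();
            System.out.println(shp.getName() + " -> exit code " + p.waitFor());
        }
    }
}
```

In the Talend job, none of this has to be typed: the loop and the file listing come from tFileList, and only the command itself is written by hand.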

Let's examine the job:


Let's look at the tFileList component:

No explanation needed: the component properties speak for themselves, far more clearly than multiple lines of code.

Now let's look at the properties of the tSystem component, from which we launch the ogr2ogr command:

The command is as follows:
"ogr2ogr -f \"KML\" "+((String)globalMap.get("tFileList_1_CURRENT_FILEPATH")).replace("SHP","KML")+" "+((String)globalMap.get("tFileList_1_CURRENT_FILEPATH"))

  • ((String)globalMap.get("tFileList_1_CURRENT_FILEPATH")) is the complete path of the ESRI file returned by the tFileList component. You can insert this global variable with the Ctrl-Space shortcut.
  • ((String)globalMap.get("tFileList_1_CURRENT_FILEPATH")).replace("SHP","KML") is the same path in which "SHP" is replaced by "KML". The KML file will therefore be generated in the same folder as the ESRI GIS files. Note that replace is case-sensitive and replaces every occurrence, so this expression assumes an upper-case .SHP extension and no other "SHP" in the path.
  • \" escapes the " character.
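
If your files do not follow that naming, a stricter variant is to swap only the trailing extension, in any letter case. The sketch below is one way to do it; KmlPathSketch and toKmlPath are hypothetical names used for illustration, not part of Talend.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Talend: derives the KML output path
// from a shapefile path by swapping only the trailing extension.
final class KmlPathSketch {

    // Matches a trailing ".shp" in any letter case (".shp", ".SHP", ".Shp").
    private static final Pattern SHP_EXTENSION =
            Pattern.compile("\\.shp$", Pattern.CASE_INSENSITIVE);

    static String toKmlPath(String shpPath) {
        // Paths that do not end in ".shp" are returned unchanged.
        return SHP_EXTENSION.matcher(shpPath).replaceFirst(".kml");
    }

    public static void main(String[] args) {
        // Only the extension changes; the "SHP" folder name is left alone.
        System.out.println(toKmlPath("/data/SHP/roads.shp"));            // /data/SHP/roads.kml
        // The plain replace("SHP","KML") call rewrites every upper-case "SHP".
        System.out.println("/data/SHP/roads.SHP".replace("SHP", "KML")); // /data/KML/roads.KML
    }
}
```

Inside the tSystem command field, the same idea fits in a single call on the global variable: .replaceAll("(?i)\\.shp$", ".kml") in place of .replace("SHP","KML").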

This job shows the main principle for using a command-line tool like GDAL/OGR together with Talend. You could do the same with the gdalwarp or gdal_translate commands. As the method is mainly graphical and intuitive, it's easier to develop and maintain than a shell or batch script. This kind of operation can of course become more complex when it is part of a job that includes other Talend SDI components.
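
As an illustration of the same principle with a raster tool, the sketch below builds a gdal_translate command with the options used in the shell script quoted earlier. The class and method names are hypothetical, the output is assumed to go next to the input file rather than into an ecw subfolder, and, like the ogr2ogr expression, it does not quote paths, so it assumes paths without spaces.

```java
// Hypothetical helper mirroring what a tSystem command expression would build
// for a raster conversion; the names are made up for illustration.
final class GdalTranslateCommandSketch {

    // Builds a gdal_translate call converting one TIFF to ECW,
    // with the same options as the shell script quoted earlier.
    static String ecwCommand(String tifPath) {
        // Swap a trailing ".tif" or ".tiff" (any letter case) for ".ecw".
        String ecwPath = tifPath.replaceAll("(?i)\\.tiff?$", ".ecw");
        return "gdal_translate -of ECW -co LARGE_OK=YES " + tifPath + " " + ecwPath;
    }

    public static void main(String[] args) {
        System.out.println(ecwCommand("/data/ortho.tif"));
        // gdal_translate -of ECW -co LARGE_OK=YES /data/ortho.tif /data/ortho.ecw
    }
}
```

In a real job, the argument would be ((String)globalMap.get("tFileList_1_CURRENT_FILEPATH")), exactly as in the ogr2ogr command above, with the tFileList component filtering on TIFF files.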

The possibility of passing the result of a command on to other components can lead to powerful processes that are complex, yet at the same time easy to improve and maintain. We'll see an example of handling command output in a future post.