camptocamp SA / Foss4g2008 / www.camptocamp.com / info@camptocamp.com
Practical introduction to Spatial Data Integrator, powered by Talend
Open Source Spatial ETL
Agenda:
Camptocamp and Talend presentation
Spatial Data Integrator (SDI) powered by Talend overview
Tutorial
What's next?
Spatial Data Integrator project
Community website: http://spatialdataintegrator.org (forum, wiki, tutorials)
Developers: SVN repository, prototype with uDig (thanks Jesse), sandbox for Sextante (thanks Victor), sandbox for GRASS
Data integration is a key process
Data volumes in exponential growth
Diversity and heterogeneity of data sources
Data processing plays a major role in implementing GIS projects
Consolidating and aggregating spatial data with data from other sources is often required
GIS data integration situation
Command-line tools or hand-made scripts built on various tools and libraries
(gdal/ogr commands, FWTools, PostGIS commands, ...)
Proprietary spatial ETL such as FME
No comprehensive open source geospatial data integrator
SDI prototyped in 2007 and presented at Foss4g 2007
Foss4g2008 / Francois Prunayre
Camptocamp, an Open Source Base Camp!
45 employees in Switzerland & France
About 50 to 70% growth per year since 2002
3 activity domains: spatial solutions, business solutions, infrastructure solutions
4 service areas: consulting, engineering, support, training
Geospatial solutions: webmapping, GIS / metadata, spatial data infrastructures, web services
Business solutions: ERP, business intelligence, ETL
Infrastructure solutions: security, Linux server, VoIP
Talend overview
Talend is the first provider of open source data integration software
Located in France, USA, Germany, China
70 employees
First product release: 2006
Leader in open source data integration
Rivals large established proprietary players
What is ETL? Extract / Transform / Load
ETL is a process in data warehousing. It answers the question « How do we get data in? ».
Extract: extract data from the source system where the data originates.
Transform: apply a series of rules or functions to the extracted data (selecting, translating, joining, ...).
Load: once the data is transformed and cleaned, load it into a data warehouse.
More info: http://en.wikipedia.org/wiki/Extract,_transform,_load
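The three steps can be sketched in a few lines of plain Python. This is only an illustration of the Extract / Transform / Load idea, not how Talend implements it: the function names, the sample rows and the in-memory SQLite "warehouse" are all assumptions made for the example.

```python
import csv
import io
import sqlite3


def extract(text):
    """Extract: read rows from the source system (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: select, clean and convert the extracted rows."""
    cleaned = []
    for row in rows:
        if not row["pop"]:  # drop rows without a population value
            continue
        cleaned.append((row["name"].strip(), int(row["pop"])))
    return cleaned


def load(rows, connection):
    """Load: write the transformed rows into the target store."""
    connection.execute("CREATE TABLE IF NOT EXISTS place (name TEXT, pop INTEGER)")
    connection.executemany("INSERT INTO place VALUES (?, ?)", rows)
    connection.commit()


# Hypothetical sample data standing in for a real source system.
SOURCE = "name,pop\n Alpha ,12000\nBeta,\nGamma,800\n"

warehouse = sqlite3.connect(":memory:")  # assumed stand-in for a data warehouse
load(transform(extract(SOURCE)), warehouse)
loaded = warehouse.execute("SELECT name, pop FROM place ORDER BY name").fetchall()
```

Each step only depends on the output of the previous one, which is the property a graphical ETL tool exposes as components linked by flows.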
Spatial Data Integration
[Diagram: Extract, Transform and Load Data. Elements shown: external data files; production databases (Parcel, Roads Network, SoE); central geodata warehouse; geospatial database; datamarts; eCommerce; Govt agency]
Synchronize and check the integrity of your applications' data
Migrate legacy applications
Replicate a subset of data into subject-matter datamarts (DM)
Exchange / share data with customers or suppliers
Spatial Data Integrator is one component of a spatial data infrastructure (SDI), useful for:
Data manipulation (extraction, quality checking, conversion, projection)
Data & metadata production (vector and raster analysis)
Data & metadata management (network files and database manipulation, archiving)
Data dissemination (WWW publication, deploying jobs as web services)
Data reporting (indicators, analysis, ...)
End-user tools to define common tasks (i.e. jobs, processes, scripts) that are usually done by hand or by scripting in a desktop GIS.
Productivity & Ease of Use
Graphical development
* Dramatically increased productivity & ramp-up
* Combined graphical & technical views
* Drag-and-drop mapping interface
* Large library of components & connectors
Leverage industry-standard languages
* Java, Perl, SQL
Key features
* Business-oriented process modeling
* Graphical development
* Robust and scalable execution
* Broadest connectivity to support all systems
* Project repository for design and execution
* Real-time debugging
A high adoption rate
* 100,000 product downloads
* 20% register as users
Active community
* 1,000 beta testers
* 500 forum contributors
Highest performance, robust and scalable execution
* Grid-distributed processing
* Industry-standard code generated (Java or Perl)
* Leverage both ETL and ELT architectures
* Process data closest to the source
[Screenshot of the Talend workspace: job panel, components palette, job start/stop, component properties tab]
Broadest connectivity to support all systems 100+ connectors available out of the box
RDBMS: Oracle, PostgreSQL, MySQL, DB2, SQL Server, Sybase, Ingres, …
Web: Web Services, FTP, HTTP, POP, SMTP…
Files: Delimited, positional, XML, Excel…
Business Applications: SugarCRM, SalesForce.com, LDAP…
Geospatial: GIS formats (Shapefile, MIF, WFS, GPX, PostGIS), geometry manipulation based on JTS (buffer, relation, transformation)
GeoSpatial components
Component families: feature manipulation, formats, raster processing*, metadata, geocoding, viewer
* next release (version 1.3)
Tutorial: Places around natural reserves
Unzip the application
Start Talend
Create a connection (set the email address)
Create a new project in Java (Create button)
Using GeoNames.org data for South Africa (isoCode ZA), take the nature reserves (fcode=RESN) and search for populated places (more than 10 000 inhabitants) within a .2-wide buffer (in lon/lat units) around them.
GeoSpatial components are only available in Java
Slides and datasets are available here: http://spatialdataintegrator.org/foss4g2008/data.zip
1. Create a new job
The first step is to create a new job. On the repository tab, in Job design, click on create a new job.
Job name: getData
This will create a new view to design your job.
2. Download the data & unzip
1. DOWNLOAD THE DATA
From the palette view, add an Internet/tFileFetch component to the workspace.
In the component properties tab, set URI to "http://download.geonames.org/export/dump/ZA.zip" to get the geonames.org data for South Africa.
Set the destination directory (eg. "/tmp/" or "c:/temp/") and filename ("ZA.zip").
2. UNZIP
Add a File/FileManagement/tFileUnarchive component.
Right click the tFileFetch and link its « OnComponentOk » trigger to the tFileUnarchive.
Set the archive file (eg. "/tmp/ZA.zip") and the extraction directory (eg. "/tmp").
3. RUN THE JOB (F6) and check that a ZA.txt file and a Readme.txt are now in your temp directory.
NOTE: In Talend, all strings must be quoted in component properties.
NOTE: In component names, the first letter « t » stands for Talend's original components, « s » for spatial ones, « u » for user ones.
HINT: If a panel cannot be found in the Talend workspace (eg. « Palette »), click on the menu « Window>Show view », and then search for the « Palette ».
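For readers who want to see what this first job amounts to outside the graphical designer, here is a rough plain-Python equivalent. It is not the code Talend generates (that is Java); the function names are hypothetical, and only the URI, file name and directories come from the steps above.

```python
import urllib.request
import zipfile
from pathlib import Path


def fetch(uri, directory, filename):
    """Download `uri` into `directory/filename` (the tFileFetch step)."""
    target = Path(directory) / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(uri) as response:
        target.write_bytes(response.read())
    return target


def unarchive(archive, directory):
    """Extract every member of `archive` into `directory` (the tFileUnarchive step)."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(directory)
        return sorted(zf.namelist())


def get_data(uri, directory):
    """Unzip only if the download succeeded, mimicking the « OnComponentOk » trigger."""
    archive = fetch(uri, directory, "ZA.zip")  # raises on failure, so unarchive is skipped
    return unarchive(archive, directory)


# Example call (needs network access, so it is left commented out):
# get_data("http://download.geonames.org/export/dump/ZA.zip", "/tmp")
```

The exception raised by a failed download plays the role of the trigger: the second step never runs unless the first one completed.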
3. Define a schema to use later in jobs
In the repository tab, Metadata is used to store information that can be reused in all the jobs of your workspace. Let's create one for the geonames text file format.
Right click File Delimited and create a file delimited
Set name (eg. Geonames)
Click next to go to step 2
* Browse to select the file you unzipped in the previous step
Click next to go to step 3
* Set Field separator to tabulation
Click next to go to step 4
* Talend automatically detects column types (but change the type of column6 to String instead of Character, because ESRI Shapefile doesn't support that datatype)
* Rename the columns so that they are easier to use later. Geonames provides a Readme.txt with information about column names (column list: geonameid, name, asciiname, altname, lat, lon, fclass, fcode, country, cc2, adm1, adm2, adm3, adm4, pop, elevation, gtopo, tz, date)
NOTE: you could also import a schema from an XML file (http://spatialdataintegrator.org/foss4g/)
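The schema defined in this wizard boils down to a list of column names, a separator and a few type conversions. A minimal Python sketch of the same idea follows; the column list is the one given above, while the choice of which columns to convert and the sample row are assumptions made for the example.

```python
import csv
import io

# Column names from the list above (GeoNames Readme.txt).
COLUMNS = [
    "geonameid", "name", "asciiname", "altname", "lat", "lon", "fclass",
    "fcode", "country", "cc2", "adm1", "adm2", "adm3", "adm4", "pop",
    "elevation", "gtopo", "tz", "date",
]

# Assumed conversions for the columns used later; the others stay strings.
CONVERTERS = {"geonameid": int, "lat": float, "lon": float, "pop": int}


def read_geonames(stream):
    """Yield one dict per line of a tab-delimited GeoNames dump."""
    # QUOTE_NONE: quote characters inside place names are ordinary text.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for values in reader:
        row = dict(zip(COLUMNS, values))
        for column, convert in CONVERTERS.items():
            row[column] = convert(row[column]) if row[column] else None
        yield row


# Hypothetical sample line with the 19 expected fields.
SAMPLE = "\t".join([
    "1", "Sample Reserve", "Sample Reserve", "", "-30.5", "25.25", "L",
    "RESN", "ZA", "", "", "", "", "", "0", "", "1200",
    "Africa/Johannesburg", "2008-01-01",
]) + "\n"

rows = list(read_geonames(io.StringIO(SAMPLE)))
```

Naming the columns once and reusing the definition in every job is what the repository Metadata entry provides.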
4. Convert text file to MapInfo format
Create a new job named « convertToGIS ».
Drag & drop the geonames file delimited object from the repository into this new job view.
Add a Geo/Manipulators/s2DPointReplacer.
* Right click the tFileDelimited component and create a row/main link between the 2 components.
* In the s2DPointReplacer, select the lon and lat columns to define the X and Y used to create a point geometry. If you can't see the column list in the properties, click the « Sync Column » button to update the columns of the current component.
Add a Geo/File/Output/sMapinfoOutput component.
* Create a row/main link between the s2DPointReplacer and this component.
* Set the file name (eg. "/tmp/ZA.mif")
Run the job (F6). Try turning on statistics in the run job tab and run the job again.
Display the new layer in a GIS (eg. QGIS).
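What the s2DPointReplacer step does conceptually is turn two numeric columns into a point geometry. The sketch below shows that idea in plain Python, writing the geometry as WKT text. The function names are hypothetical, and the `the_geom` column name is borrowed from the tMap expression used later in the tutorial; the real component produces a geometry object rather than text.

```python
def to_point_wkt(row, x_column="lon", y_column="lat"):
    """Return a WKT point built from two numeric columns of `row`."""
    x = float(row[x_column])  # X is the longitude
    y = float(row[y_column])  # Y is the latitude
    return f"POINT ({x} {y})"


def add_point_geometry(rows, x_column="lon", y_column="lat"):
    """Yield a copy of each row with an extra `the_geom` column."""
    for row in rows:
        yield {**row, "the_geom": to_point_wkt(row, x_column, y_column)}


# Hypothetical sample row.
sample = [{"name": "Sample Reserve", "lon": "25.25", "lat": "-30.5"}]
with_geometry = list(add_point_geometry(sample))
```

Note the order: longitude is X and latitude is Y. Swapping them is a common mistake and puts the points in the wrong place on the map.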
5. Filter natural reserves ...
The objective is to filter the geonames data to extract the natural reserves.
Add a Processing/tReplicate component after the s2DPointReplacer to duplicate the flow. Remove the connection between s2DPointReplacer and sMapinfoOutput. Connect s2DPointReplacer to tReplicate, and tReplicate to sMapinfoOutput.
Add a tFilterRow component and connect the tReplicate to the tFilterRow. In the tFilterRow properties, set a filter on the « fcode » column to get only the natural reserves of South Africa (ie. InputColumn: fcode, value="RESN"; do not forget the quotes).
Add a Geo/File/Output/sShapefileOutput component and set the filename (eg. "/tmp/resn.shp"). Connect the tFilterRow (filter flow) to the new output component. Click the « Sync Column » button to synchronise columns between the 2 components.
Run the job (F6) (optionally with statistics).
Display the new layer in a GIS.
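The replicate-then-filter pattern used here can be sketched in plain Python as follows. The helper names and the sample rows are hypothetical; only the « fcode » column and the "RESN" value come from the tutorial.

```python
import itertools


def replicate(rows, copies=2):
    """Duplicate a flow of rows so that several branches can consume it."""
    return itertools.tee(rows, copies)


def filter_rows(rows, column, value):
    """Keep only the rows whose `column` equals `value`."""
    return (row for row in rows if row[column] == value)


# Hypothetical rows standing in for the GeoNames flow.
flow = iter([
    {"name": "Reserve A", "fcode": "RESN"},
    {"name": "Town B", "fcode": "PPL"},
    {"name": "Reserve C", "fcode": "RESN"},
])

all_branch, filter_branch = replicate(flow)
all_places = list(all_branch)  # branch that would go to the MapInfo output
reserves = list(filter_rows(filter_branch, "fcode", "RESN"))  # branch for resn.shp
```

Without the replicate step, the filter would consume the only flow and the unfiltered output would receive nothing.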
... and compute buffer
The objective is to compute a buffer around the natural reserves for later use.
Add a Processing/tReplicate component after the tFilterRow to duplicate the flow.
Add a Geo/Manipulators/sBufferCalculator component. Connect the tReplicate to the sBufferCalculator. In the sBufferCalculator properties, set the distance property (eg. « .2 ») to define the width of the buffer around the natural reserves.
Add a Geo/File/Output/sShapefileOutput component and set the filename (eg. "/tmp/buffer.shp"). Connect the sBufferCalculator to the new output component.
Run the job (F6).
Display the new layer in a GIS.
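Since the reserves in the GeoNames file are points, their buffer is simply a circle approximated by a polygon. The sketch below illustrates that in plain Python; it is not the JTS buffer code the component relies on. The function name and the 8 segments per quarter circle are assumptions made for the example, and the distance is taken to be in the same units as the coordinates (degrees here).

```python
import math


def buffer_point(x, y, distance, quadrant_segments=8):
    """Approximate the buffer of a point as a closed ring of (x, y) vertices."""
    steps = 4 * quadrant_segments
    ring = [
        (
            x + distance * math.cos(2 * math.pi * i / steps),
            y + distance * math.sin(2 * math.pi * i / steps),
        )
        for i in range(steps)
    ]
    ring.append(ring[0])  # repeat the first vertex to close the ring
    return ring


# Hypothetical reserve located at lon 25.0, lat -30.0.
ring = buffer_point(25.0, -30.0, 0.2)
```

With lon/lat coordinates, a distance of .2 is .2 degrees, which is roughly 22 km in the north-south direction and somewhat less east-west at South African latitudes.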
6. Search for populated place
Search for populated places with a population greater than 10 000 inhabitants.
Create a new job named « intersect ».
Drag & drop the geonames file delimited object into this new job.
Add a Geo/Manipulators/s2DPointReplacer and set its properties (X and Y columns used to create the point geometry, as in the first job).
Add a Processing/tMap component and connect the main flow from the s2DPointReplacer to the tMap.
Add a Geo/File/Input/sShapefileInput. Set the filename to the file containing the buffer around the natural reserves (eg. "/tmp/buffer.shp"). Connect it to the tMap.
Double click the tMap...
7. Search for populated place
Intersect buffers with places, then filter on population
The tMap component is one of the more advanced components in Talend. It allows complex join, mapping and filter actions.
On the left side of the tMap panel, the INPUT components are listed (row2 should be the geonames text file and row3 the buffer).
1. CREATE A JOIN TO INTERSECT BUFFER AND GEONAMES DATA
In the buffer input (row3), click the « activate filter expression » button.
Click in the expression filter section, and set the join « GeoOperation.INTERSECTS(row2.the_geom , row3.the_geom) » (use CTRL+SPACE to turn on autocompletion).
2. CREATE THE OUTPUT FLOW
On the right side, click the « Add output table » button. Set the name (eg. « place »).
Drag & drop the columns from the left to the right (eg. geonameid, name, the_geom, pop) to set the schema of the output.
Click in the expression filter section, and set the filter on the population column « row2.pop > 10000 ».
Click ok.
3. CREATE OUTPUT
Add a sShapefileOutput component, set the filename (eg. "/tmp/place.shp") and connect the tMap output flow (named « place ») to this new component.
Run the job & display in a GIS.
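The join and the filter configured in the tMap can be pictured as a nested loop: keep a place when its population passes the filter and its point falls inside at least one buffer. The sketch below makes two simplifying assumptions: each buffer is treated as an exact circle around a reserve point (so "intersects" becomes a distance test), and each place is emitted once even if it lies in several buffers. The function names and sample rows are hypothetical.

```python
import math


def intersects(point, center, distance):
    """True when `point` lies inside or on the circle of radius `distance` around `center`."""
    return math.hypot(point[0] - center[0], point[1] - center[1]) <= distance


def places_near_reserves(places, reserves, distance=0.2, min_pop=10000):
    """Return places above `min_pop` that fall inside at least one reserve buffer."""
    result = []
    for place in places:  # main flow (row2)
        if not place["pop"] > min_pop:  # output filter: row2.pop > 10000
            continue
        position = (place["lon"], place["lat"])
        # lookup flow (row3): stand-in for GeoOperation.INTERSECTS
        if any(intersects(position, (r["lon"], r["lat"]), distance) for r in reserves):
            result.append({key: place[key] for key in ("geonameid", "name", "pop")})
    return result


# Hypothetical sample data.
reserves = [{"name": "Reserve A", "lon": 25.0, "lat": -30.0}]
places = [
    {"geonameid": 1, "name": "Near and large", "lon": 25.1, "lat": -30.0, "pop": 20000},
    {"geonameid": 2, "name": "Near but small", "lon": 25.1, "lat": -30.0, "pop": 5000},
    {"geonameid": 3, "name": "Large but far", "lon": 26.0, "lat": -30.0, "pop": 50000},
    {"geonameid": 4, "name": "Exactly 10000", "lon": 25.0, "lat": -30.1, "pop": 10000},
]
selected = places_near_reserves(places, reserves)
```

Applying the cheap population test before the geometric test avoids needless distance computations, which matters when the lookup flow is large.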
9. Run the full process in a job of sub-jobs
Create a new job.
From the repository > Job, drag & drop the first job (getData) into the job designer.
Drag & drop the convertToGIS job.
Drag & drop the intersect job.
Then connect the 3 jobs: from each job, trigger an onSubJobOk event to the next one, so that the 3 jobs are processed in the correct order.
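The chaining logic of this master job can be sketched as a loop that runs each job in turn and stops at the first failure. This is a plain-Python picture of the trigger semantics, not Talend code. The `run_chain` helper is hypothetical and the three lambdas are stand-ins for the real jobs.

```python
def run_chain(jobs):
    """Run (name, callable) pairs in order, stopping at the first failure.

    Returns the names of the jobs that completed.
    """
    completed = []
    for name, job in jobs:
        try:
            job()
        except Exception as error:  # a failed job must not trigger the next one
            print(f"{name} failed: {error}")
            break
        completed.append(name)
    return completed


# Stand-in jobs that only record what the real jobs would produce.
log = []
chain = [
    ("getData", lambda: log.append("download and unzip ZA.zip")),
    ("convertToGIS", lambda: log.append("write ZA.mif, resn.shp, buffer.shp")),
    ("intersect", lambda: log.append("write place.shp")),
]
completed = run_chain(chain)
```

The order matters because each job reads files written by the previous one: intersect needs buffer.shp, and convertToGIS needs the unzipped ZA.txt.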
New release coming soon
Version 1.3 changes
New components: WFS, GPX, simplify, transform, ...
Added Geometry type
Added wizard for automatic schema creation
Embed uDig to view data
Linked to Sextante library to process raster
Demo
Changes
* Update to TOS 2.4.1
* Update to GeoTools 2.5
* 7 new Components
** sWfsInput:
** sGpxInput:
** sSimplify: Use the DouglasPeuckerSimplifier and TopologyPreservingSimplifier algorithms to simplify geometries
** sSimpleGeomToMulti: Convert simple geometry (POINT, LINESTRING, POLYGON) to collection (MULTIPOINT, MULTILINESTRING, MULTIPOLYGON).
** sGraticuleBuilder: Create grid
** sChangeLineDirection: Invert line direction
** sTransform: Compute affine transformation (translation, rotation and scaling)
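To give an idea of what a Douglas-Peucker simplification does, here is a compact plain-Python version of the classic algorithm. It is a sketch for illustration only, not the DouglasPeuckerSimplifier code the component uses, and it makes no attempt to preserve topology. The function names and the sample line are hypothetical.

```python
import math


def _distance_to_segment(point, start, end):
    """Distance from `point` to the segment between `start` and `end`."""
    (px, py), (sx, sy), (ex, ey) = point, start, end
    dx, dy = ex - sx, ey - sy
    if dx == 0 and dy == 0:  # degenerate segment
        return math.hypot(px - sx, py - sy)
    # Position of the projection along the segment, clamped to its ends.
    t = ((px - sx) * dx + (py - sy) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (sx + t * dx), py - (sy + t * dy))


def simplify(points, tolerance):
    """Drop vertices that lie within `tolerance` of the simplified line."""
    if len(points) < 3:
        return list(points)
    distances = [
        _distance_to_segment(p, points[0], points[-1]) for p in points[1:-1]
    ]
    index = max(range(len(distances)), key=distances.__getitem__) + 1
    if distances[index - 1] <= tolerance:
        return [points[0], points[-1]]  # every vertex is close enough
    # Keep the farthest vertex and simplify both halves recursively.
    left = simplify(points[: index + 1], tolerance)
    right = simplify(points[index:], tolerance)
    return left[:-1] + right


# Hypothetical line with one nearly collinear vertex at (1, 0.05).
line = [(0, 0), (1, 0.05), (2, 0), (3, 2), (4, 0)]
simplified = simplify(line, 0.1)
```

A larger tolerance removes more vertices; with a tolerance above the height of the peak, only the two end points of the sample line remain.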