Tutorial for the GeoProcessing extension

BalazsBarany (Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert; Posts: 876; Unicorn)
edited February 2020 in Knowledge Base
There is a new extension called GeoProcessing in the RapidMiner Marketplace. To give you an idea of what you can do with this extension, here is a tutorial using some of the operators.
Our fictional scenario: We're working with the city of Vienna, Austria, to celebrate the long history of Vienna and the river Danube. For the celebrations, we would like to organize a boat race and a running event for children. We are working with geodata from the Open Data server of Vienna.
In the 1970s Vienna built an artificial island inside the Danube, called Donauinsel (Danube Island). Since then there's the Danube (left arm in the picture) and the New Danube (right). Here's a map to give you an idea:

We are only interested in the parts of the Danube and the New Danube that flow through Vienna. These are highlighted in the next map:

The boat race should be in the longest part of the Danube (or New Danube) through Vienna, so we want to determine the length of the river parts.
For the children's running event, we want to select the two bridges with the shortest distance between them. All bridges in Vienna are of course also available on the Open Data server:

We are obviously only interested in the bridges over the Danube, not every bridge in Vienna. So we will filter the data accordingly:

Then we will calculate the distance between every pair of bridges and select the pair with the shortest distance (ignoring the very short distances between the parts of multi-part bridges).
In order to make RapidMiner capable of doing all this, install the GeoProcessing extension from the Marketplace. Make sure that you see the Geoprocessing folder in your Extensions in the Operators panel.

Some background knowledge

Earth is an irregular ellipsoid, but we like to look at maps in two dimensions, as these are more suitable for computer screens or paper. This transformation to two dimensions also allows the application of geometry calculations like distance, length, area and so on.
We express global coordinates in latitude and longitude degrees (counted from the equator and from the international 0 meridian in Greenwich). These are angles, so the distance between coordinates depends on the geographic position. We can't use these coordinates for calculating absolute sizes in our favourite measurement system (meters, yards, miles, ...).
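To make the point about angles concrete, here is a minimal sketch that estimates how long one degree of longitude is at different latitudes. It assumes a perfectly spherical Earth with a mean radius of 6,371 km, which is only an approximation; the function name is made up for this illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumption: spherical Earth with mean radius


def meters_per_degree_longitude(latitude_deg: float) -> float:
    """Approximate east-west length of one degree of longitude at a latitude."""
    one_degree_rad = math.radians(1.0)
    return one_degree_rad * EARTH_RADIUS_M * math.cos(math.radians(latitude_deg))


# Vienna lies at roughly 48.2 degrees north.
for label, lat in [("equator", 0.0), ("Vienna", 48.2), ("60 deg north", 60.0)]:
    km = meters_per_degree_longitude(lat) / 1000
    print(f"{label:13s} {km:6.1f} km per degree of longitude")
```

The same difference in degrees therefore corresponds to very different distances on the ground, which is why lengths and distances are calculated on projected coordinates instead.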
The process of transforming coordinates to a new coordinate system (CRS, coordinate reference system) is called projection or reprojection. You can think of it as taking a photo from an airplane or a satellite to transform the three-dimensional earth surface to a two-dimensional picture. The projected coordinates can be measured in meters or other units, and geometry functions will give us the expected measurements.
Coordinate systems are referred to by EPSG codes. You can check epsg.io to find an appropriate coordinate system for the area you're working on.
It's not always necessary to reproject coordinates. If we only want to know if a geometry contains or touches another geometry, we can calculate that in the original coordinate system (if we ignore problems spanning the line between longitudes -180° and 180°).

Getting the data

The Vienna open data server contains geodata in many formats. We can easily use the CSV version in RapidMiner. The example process loads the data directly from the web; you could of course save it locally if you need it more often.

This process contains standard RapidMiner operators only; the extension is not yet in use. The Read CSV operators are set up with the comma as the separator and UTF-8 encoding, but otherwise with the default settings. The attribute names come from the first line, and the data format is determined automatically.
We only keep a few attributes (the geometry and the object name) and rename them for later use. For example, the river geometry is renamed to riverGeom.
The standard for expressing geometries in textual form is called WKT, Well Known Text. The open data server delivers the geometries in this format, and this is also the format used by the GeoProcessing operators. If you have GIS data in a database, you can use ST_AsText in SQL to get them in this format.
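To give an idea of what WKT looks like, here is a minimal sketch that reads two of the simplest WKT forms, a 2D POINT and a 2D LINESTRING, into coordinate tuples. It is only an illustration: the function name is made up, it is not how the extension parses geometries, and real WKT has many more geometry types (POLYGON, MULTILINESTRING, ...) that this sketch rejects.

```python
import re


def parse_wkt_simple(wkt: str):
    """Parse a 2D POINT or LINESTRING from WKT into (type, [(x, y), ...])."""
    match = re.fullmatch(
        r"\s*(POINT|LINESTRING)\s*\(([^()]*)\)\s*", wkt, flags=re.IGNORECASE
    )
    if match is None:
        raise ValueError(f"unsupported or malformed WKT: {wkt!r}")
    geom_type = match.group(1).upper()
    coords = []
    for pair in match.group(2).split(","):
        x, y = pair.split()  # exactly two numbers per coordinate expected
        coords.append((float(x), float(y)))
    if geom_type == "POINT" and len(coords) != 1:
        raise ValueError("a POINT must have exactly one coordinate pair")
    return geom_type, coords


# Made-up example coordinates (longitude latitude) near Vienna.
print(parse_wkt_simple("POINT (16.37 48.21)"))
print(parse_wkt_simple("LINESTRING (16.30 48.20, 16.40 48.25)"))
```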

The tutorial process


After reading the data, we first extract the parts of the Danube inside the boundaries of Vienna. We use Calculate Geometry Relation for this (Danube inside Vienna in the process). It has one input, so we need both the Vienna and the Danube coordinates in one example set. The easiest way to achieve this is a Cartesian join (it combines every row from the first example set with every row from the second one). We use the intersection function of Calculate Geometry Relation for getting the result. It returns the common part of the two geometries (a polygon and a line) as another geometry, in our case a shorter line (just the part of the Danube inside the Vienna polygon).
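Outside of RapidMiner, the Cartesian join described above could be sketched like this in plain Python. The attribute names cityName, cityGeom and riverName and the WKT strings are made-up placeholders (only riverGeom comes from the tutorial), and the sketch stops before the intersection step.

```python
from itertools import product


def cartesian_join(left, right):
    """Combine every row of `left` with every row of `right` (rows are dicts)."""
    return [{**l_row, **r_row} for l_row, r_row in product(left, right)]


# Placeholder data: one city polygon and two river lines.
city = [{"cityName": "Wien", "cityGeom": "POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))"}]
rivers = [
    {"riverName": "Donau", "riverGeom": "LINESTRING (-5 5, 15 5)"},
    {"riverName": "Neue Donau", "riverGeom": "LINESTRING (-5 6, 15 6)"},
]

joined = cartesian_join(city, rivers)
for row in joined:
    # Every row now carries both geometries, ready for a relation function.
    print(row["cityName"], "x", row["riverName"])
```

With one city row and two river rows the result has two rows, each holding both geometries side by side.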
We then filter out the New Danube for the bridges, but keep both parts for the river part length calculation.
We want to get the length in meters here, not in ellipsoid degrees. So we reproject the original coordinates to a projection commonly used in Austria, ETRS89/Austria (EPSG code: 3416). This projection is appropriate here. If you work in a different geographical area, be sure to select an appropriate projection. (Choosing a wrong projection will lead to big distortions in the calculated measures.)
After reprojecting to EPSG:3416, we can calculate the length of the river arms with Calculate measures on a geometry (called Calculate river length here).
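Once the coordinates are in a meter-based system, the length of a line is just the sum of its straight segment lengths. The following is a minimal sketch of that idea, not the extension's implementation; the function name and the coordinates are made up.

```python
import math


def linestring_length(coords):
    """Sum of Euclidean segment lengths; units follow the coordinate system."""
    return sum(math.dist(a, b) for a, b in zip(coords, coords[1:]))


# Made-up projected coordinates in meters.
river_part = [(0.0, 0.0), (3000.0, 4000.0), (3000.0, 10000.0)]
print(linestring_length(river_part), "meters")
```

Applied to unprojected longitude/latitude values, the same formula would return a number in degrees with no fixed meaning in meters, which is why the reprojection step comes first.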

Now on to the bridges.
First we want to find bridges that cross the Danube. This is a geographic join operation if we apply it on two example sets.

We select the function crosses here. Other functions include contains/containedBy, intersects, overlaps, touches, etc. The function parameter stays empty here; it is only used by isWithinDistance.
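For intuition about what a crosses test does for two lines, here is a simplified sketch based on the classic orientation test for line segments. It is an assumption-laden illustration, not the extension's logic: it only handles lines given as coordinate lists, and it deliberately ignores the touching and collinear cases that a full implementation has to distinguish.

```python
def _orientation(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1 left, -1 right, 0 collinear."""
    value = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (value > 0) - (value < 0)


def segments_cross(a1, a2, b1, b2):
    """True if the two segments intersect in a single interior point."""
    return (
        _orientation(a1, a2, b1) * _orientation(a1, a2, b2) < 0
        and _orientation(b1, b2, a1) * _orientation(b1, b2, a2) < 0
    )


def lines_cross(line_a, line_b):
    """True if any segment of line_a properly crosses any segment of line_b."""
    return any(
        segments_cross(a1, a2, b1, b2)
        for a1, a2 in zip(line_a, line_a[1:])
        for b1, b2 in zip(line_b, line_b[1:])
    )


# Made-up coordinates: a river along the x axis and two candidate bridges.
river = [(0, 0), (10, 0)]
print(lines_cross(river, [(5, -1), (5, 1)]))  # bridge spans the river
print(lines_cross(river, [(5, 1), (5, 2)]))   # bridge lies on one bank
```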
Now we can create a distance "matrix" (not formatted as a matrix) for all the selected bridges. This happens in a subprocess.
To calculate distances, we will of course reproject the bridge coordinates to the Austrian meter-based coordinate system. We join the bridge table with itself using a Cartesian join so we get a row for every combination of bridges, but remove the row if it compares the bridge with itself.
Then we use Calculate Geometry Relation with the distance function on the projected geometries.

Then we filter out all distances less than 100 meters to avoid returning irrelevant combinations (some smaller parts of the bridges are separate entries in the data).
Now we can sort the data by distance and return the first row. According to our data, "Steg an der Nordbahnbrücke" and "Floridsdorfer Brücke" would be the nearest ones, with a distance of 481 meters.
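The distance steps described above (self-join, dropping self-pairs, filtering short distances, sorting) can be sketched as follows. This is a simplified illustration, assuming each bridge is reduced to a single point in projected meter coordinates; the real data contains line geometries, and the bridge names and coordinates here are made up.

```python
import math
from itertools import product


def nearest_pair(bridges, min_distance=100.0):
    """Return (distance, name_a, name_b) for the closest pair of bridges.

    `bridges` maps a name to an (x, y) point in meters. Pairs closer than
    `min_distance` are dropped, mirroring the filter for multi-part bridges.
    """
    pairs = [
        (math.dist(point_a, point_b), name_a, name_b)
        for (name_a, point_a), (name_b, point_b) in product(bridges.items(), repeat=2)
        if name_a != name_b  # remove rows comparing a bridge with itself
    ]
    pairs = [pair for pair in pairs if pair[0] >= min_distance]
    return sorted(pairs)[0] if pairs else None


bridges = {
    "Bridge A": (0.0, 0.0),
    "Bridge A ramp": (30.0, 0.0),  # separate entry belonging to Bridge A
    "Bridge B": (500.0, 0.0),
    "Bridge C": (2000.0, 0.0),
}
print(nearest_pair(bridges))
```

Like the Cartesian join in the process, this produces every pair twice (A to B and B to A); sorting and taking the first row still returns a single result.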
That's it, we are done with the analysis. We imported geodata from the Web, transformed coordinates, combined different example sets with different methods and calculated real-world measures on the geometries.
Some directions you could go from there:
- Use the operator Geometry to Coordinates to visualize data (it works best with point geometries, or if you have a large number of geometries)
- Try different ways to geographically join example sets
- Try out the different functions in Calculate Geometry Relation and Calculate measures on a geometry

I'm looking forward to your questions and remarks on the GeoProcessing extension and this tutorial.

Comments

  • BalazsBarany (Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert; Posts: 876; Unicorn)
    A downloadable version of the processes is attached here.
  • sgenzer (Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator; Posts: 2,959; Community Manager)
    edited February 2020
    And if you have RapidMiner 9.6+ running, you can click on these links to open the processes directly:

    Get Data (1st process shown above)

    Calculating Distances (2nd process shown above)
  • BalazsBarany (Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert; Posts: 876; Unicorn)
    I also added the process in the 9.5 repository under /Community Data Science. (Extension Example ...)
  • ivane (Member; Posts: 2; Contributor I)
    Hi Balazs, I created a number of Groovy scripts using your zip archive of libraries from geoscript and geotools. These processes have been working since you gave us (TCA) that geoscript library bundle in 2016 when you came down here to Melbourne. Now it seems that adding these jar files to the RapidMiner Studio/lib folder causes the SQL connector to break in the latest version, 9.10.008. Is there a revised library bundle I can use to continue using processes with Groovy scripts in them?

    Strangely enough, it is still working on Studio version 9.10.001. However, when executed on AI Hub, there is an issue with the connection to the SQL database on version 9.10.001: after 1 hour and 45 minutes of execution, the following error gets thrown: java.lang.IllegalAccessError: tried to access class com.microsoft.sqlserver.jdbc.SQLServerDriverIntProperty from class com.microsoft.sqlserver.jdbc.SQLServerDriver.

    RapidMiner support advised me to upgrade to 9.10.008, but when I add the bundle of geoscript jar files the SQL connection breaks. Any help would be much appreciated. Note that I've also developed scripts making use of geohash and interpolation for quicker data matching, so I would really need to keep using the Groovy scripts with geoscript (unless there are also geohash and interpolation operators available as an extension).
  • BalazsBarany (Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert; Posts: 876; Unicorn)
    Hi @ivane,

    nice to hear from you after such a long time.

    I haven't looked into updating the geotools and geoscript lately. However, I'm actively using the GeoProcessing extension which should have newer libraries, and accessing MySQL and PostgreSQL databases is not a problem in the latest Studio. I don't have MS SQL to test.

    I guess that updating the geo* library jars one by one to current versions is the best approach. Maybe some common logging or utility library is too old, it gets loaded when Studio starts, and then the MSSQL driver breaks.

    Regards,
    Balázs
  • ivane (Member; Posts: 2; Contributor I)
    Hi @BalazsBarany

    I managed to get geoscript working without breaking the SQL connector: I only added the 113 jar files below out of the 142 jar files you have in the package. The SQL connections still work (both in Studio and AI Hub).

    bufr-4.6.2.jar
    c3p0-0.9.1.1.jar
    cdm-4.6.2.jar
    commons-beanutils-1.7.0.jar
    commons-dbcp-1.4.jar
    commons-jxpath-1.3.jar
    commons-pool-1.5.4.jar
    core-0.26.jar
    eastwood-1.1.1-20090908.jar
    ecore-2.6.1.jar
    ehcache-1.6.2.jar
    fop-0.94.jar
    gdal-1.11.2.jar
    geodb-0.7-RC2.jar
    GeographicLib-Java-1.44.jar
    geoscript-groovy-1.6.0.jar
    gt-api-14.0.jar
    gt-app-schema-resolver-14.0.jar
    gt-arcgrid-14.0.jar
    gt-brewer-14.0.jar
    gt-complex-14.0.jar
    gt-coverage-14.0.jar
    gt-coverage-api-14.0.jar
    gt-cql-14.0.jar
    gt-css-14.0.jar
    gt-data-14.0.jar
    gt-epsg-wkt-14.0.jar
    gt-geobuf-14.0.jar
    gt-geojson-14.0.jar
    gt-geopkg-14.0.jar
    gt-graph-14.0.jar
    gt-grassraster-14.0.jar
    gt-grid-14.0.jar
    gt-gtopo30-14.0.jar
    gt-jdbc-14.0.jar
    gt-jdbc-h2-14.0.jar
    gt-jdbc-mysql-14.0.jar
    gt-jdbc-postgis-14.0.jar
    gt-jdbc-spatialite-14.0.jar
    gt-main-14.0.jar
    gt-metadata-14.0.jar
    gt-ogr-core-14.0.jar
    gt-ogr-jni-14.0.jar
    gt-opengis-14.0.jar
    gt-process-14.0.jar
    gt-process-feature-14.0.jar
    gt-process-geometry-14.0.jar
    gt-process-raster-14.0.jar
    gt-property-14.0.jar
    gt-referencing-14.0.jar
    gt-shapefile-14.0.jar
    gt-swing-14.0.jar
    gt-transform-14.0.jar
    gt-wfs-ng-14.0.jar
    gt-wms-14.0.jar
    gt-xml-14.0.jar
    gt-xsd-core-14.0.jar
    gt-xsd-fes-14.0.jar
    gt-xsd-filter-14.0.jar
    gt-xsd-gml2-14.0.jar
    gt-xsd-gml3-14.0.jar
    gt-xsd-kml-14.0.jar
    gt-xsd-ows-14.0.jar
    gt-xsd-sld-14.0.jar
    jai_codec-1.1.3.jar
    jai_core-1.1.3.jar
    jai_imageio-1.1.jar
    json-simple-1.1.jar
    jsr-275-1.0-beta-2.jar
    jt-affine-1.0.6.jar
    jt-algebra-1.0.6.jar
    jt-attributeop-1.4.0.jar
    jt-bandcombine-1.0.6.jar
    jt-bandmerge-1.0.6.jar
    jt-bandselect-1.0.6.jar
    jt-binarize-1.0.6.jar
    jt-border-1.0.6.jar
    jt-buffer-1.0.6.jar
    jt-classifier-1.0.6.jar
    jt-colorconvert-1.0.6.jar
    jt-colorindexer-1.0.6.jar
    jt-contour-1.4.0.jar
    jt-crop-1.0.6.jar
    jt-errordiffusion-1.0.6.jar
    jt-format-1.0.6.jar
    jt-iterators-1.0.6.jar
    jt-jiffle-language-0.2.0.jar
    jt-jiffleop-0.2.0.jar
    jt-lookup-1.0.6.jar
    jt-mosaic-1.0.6.jar
    jt-nullop-1.0.6.jar
    jt-orderdither-1.0.6.jar
    jt-piecewise-1.0.6.jar
    jt-rangelookup-1.4.0.jar
    jt-rescale-1.0.6.jar
    jt-rlookup-1.0.6.jar
    jt-scale-1.0.6.jar
    jt-stats-1.0.6.jar
    jt-translate-1.0.6.jar
    jt-utilities-1.0.6.jar
    jt-utils-1.4.0.jar
    jt-vectorbin-1.0.6.jar
    jt-vectorbinarize-1.4.0.jar
    jt-vectorize-1.4.0.jar
    jt-warp-1.0.6.jar
    jt-zonal-1.0.6.jar
    jt-zonalstats-1.4.0.jar
    jts-1.13.jar
    net.opengis.fes-14.0.jar
    net.opengis.ows-14.0.jar
    net.opengis.wcs-14.0.jar
    net.opengis.wfs-14.0.jar
    netcdf4-4.6.2.jar