Geospatial Data Integration Using Matillion ETL
Spatial data adds a unique aspect to analysis: the ability to generate insights based on location awareness. Location is one of a small number of universal correlation attributes that apply to almost all kinds of data. Uncovering spatial patterns and relationships can lead to uniquely enhanced decision making.
Spatial data can be challenging to work with: there are dedicated storage and processing techniques, and an entire associated vocabulary. But the opportunities and risks identified through spatial analysis would be impossible to obtain in other ways.
As an example, here is a small test: There are four events listed in the table below. Three are quite unremarkable, but one of them is a very rare outlier. Can you spot it?
The answer has to do with spatial proximity: in this case, how close the coordinates are to a geological boundary line (1). Without that location based insight, spotting the outlier is really just guesswork, and using the data without that knowledge might lead to flawed conclusions.
(1) The answer is in “Sources and Samples” at the end
In this article I will try to bring this to life, with an end-to-end spatial proximity analysis example using real data. The table above contains just four records from tens of thousands more.
Snowflake’s GEOMETRY data type and geospatial SQL functions are the foundations for running this kind of analysis. Matillion ETL for Snowflake is an enabling technology. It helps you design and orchestrate data integration challenges such as these inside Snowflake on an enterprise scale.
I have a set of events: earthquakes, and a set of reference data: tectonic plate boundaries. I’d like to know which plate boundary every earthquake is associated with. The earthquakes are Point types, because they occur at a certain latitude and longitude. The boundaries are LineString types, since every boundary follows a defined path over the earth’s surface.
Clearly if an earthquake occurs exactly on a tectonic plate boundary, it is associated with that boundary. But that never happens! There’s always some finite distance between them, known as the minimum geodesic distance.
The diagram above shows the minimum geodesic distance between the Point and the LineString. It’s part way along the fourth segment of the boundary. It would be complicated and time-consuming to work that out by hand. Happily, among its numerous geospatial functions, Snowflake has the built-in ST_DISTANCE to perform that calculation in SQL.
Tectonic plate boundaries cover the earth’s surface like baseball seams. Depending how you count them, there are roughly a few hundred boundaries in total. Here’s a small map showing the Caribbean plate, and highlighting its boundary with the North American plate.
There are also lots of earthquakes! To find the distance between every earthquake and every boundary requires a spatial join.
In Matillion ETL’s graphical user interface a spatial join looks like this:
It will not be helpful to know how far an earthquake was from a plate boundary on the other side of the world. So to make the spatial join efficient I impose a condition to only include cases where the minimum geodesic distance is less than 1000 km (or one million meters). The SQL in the join component is like this:
ST_DISTANCE("q"."geo", "pb"."geo") < 1000000
You can see the two tables stg_quakes and stg_pb2002 on the screenshot above. Those are the source tables with the GEOGRAPHY columns, and I’ll talk about them next.
Handling GEOGRAPHY Data
All the individual earthquake events are held in the stg_quakes table. There’s a unique identifier column, plus a depth and magnitude and a GEOGRAPHY type that holds the latitude and longitude of the earthquake as a Point. In 2018 there were 62,224 in total, as you can see from the Sample in Matillion ETL.
Earthquakes below magnitude 2.0 usually can not be felt, and can be caused by human activities such as fracking. I don’t want them in this analysis so I have added a filter component to remove them.
Most of the records are higher magnitude, and the filter leaves 61,884 rows.
All the tectonic plate boundaries are held in the stg_pb2002 table. The “geo” column is a GEOGRAPHY type and contains the LineString data of the individual boundaries. The data is rather too bulky to preview in great detail, but you can get an idea of the composition of the LineStrings through Matillion ETL’s previewer.
There are fragments of two boundaries in the screenshot above. You can see, for example, that the second one begins at the exact point where the first one ended.
Proximity detection data transformations
Several transformations are useful for proximity detection. We have already seen the first of those in the spatial join: the distance calculation between earthquake and plate boundary.
Adding a calculated field named km_to_boundary uses the same formula, with the addition of a rounding function to take it to the nearest kilometer.
ROUND(ST_DISTANCE("q_geo", "pb_geo") / 1000)
This formula, along with others, are added in the “Spatial Derivations” Calculator component that you can see in the screenshot below.
The spatial join will actually result in a partial Cartesian product, relating all earthquakes to all boundaries within 1000km. What about finding the closest boundary to every earthquake?
This requires two separate, related steps.
- The “Rank prox” Rank component, which generates a ROW_NUMBER query in Snowflake. It puts the plate boundaries into ordinal sequence according to their proximity to every earthquake
- The “Only Closest” Filter component, which filters on the generated ordinal. It selects the closest boundary to every earthquake
The combination of rank and filter is a widely used technique, for example to detect duplicates, or while handling time variance. In this example it is the same method but used in a spatial context.
No boundary nearby
One thing to note is that most earthquakes do occur near plate boundaries. But there are also a few that occur nowhere near a boundary, for example around Hawaii. These are caused by other processes (such as hotspots or plumes) and should be labelled as intraplate events.
In data integration those cases are handled by using a left join. Those earthquakes will not have a plate boundary name, but it is an easy data transformation to give them a default name using a CASE statement:
CASE WHEN "pb_id" IS NULL THEN 'Intraplate' ELSE INITCAP(COALESCE(NULLIF("type", ''), 'boundary')) END
After the integration it is worth running a quick check to make sure none have been lost and that there is no duplication:
Preparations for visualization
I’d like to store the data with maximum accessibility in mind. So instead of the GEOGRAPHY datatype copied from source, I have used the ST_X and ST_Y spatial functions to split out the longitude and latitude coordinates:
Most earthquakes occur at quite a shallow depth. The deeper they are, the more difficult it is to capture precise measurements. Consequently the source data contains what looks like an unrealistic accuracy for deep events. The margin of error means occasional events appear with negative depth, which obviously can not happen.
So the following SQL expression is used to fix the outliers, and round all depth measurements to the nearest 10km.
Among its sync back capabilities, Matillion ETL enables you to export the results of a data transformation as a CSV file. From there it can be used by a number of different tools for further processing or visualization.
There are two steps to this. First, in the transformation job save the data as a table:
Then, in the orchestration job, run an Unload component to save the data to a file:
This is a small example of an aggregated data model. The aggregate does not add anything that does not already exist in the source data. It exists to remove the complication of the spatial join, to represent the data in the simplest, most consumable way.
Sources and samples
The Matillion jobs used in this example can be downloaded from here. The orchestration job should look like this once you have imported it using the API.
The above job will download the source datafiles for itself. But you can also download them independently yourself from here. They are gzipped, pipe-delimited CSV files ready to load into Snowflake.
- PB2002-LineString.csv – the tectonic plate boundaries
- EMSC-2018-Events.csv.gz – the earthquakes in 2018
You can check a visualization of this data here on Tableau Public.
So, which of the coordinates in the original challenge is the outlier? If you look at the visualization, you should be able to find an earthquake at N65.72, E23.08 on the Baltic coast, and immediately see that it is unusually far from any plate boundary. The other three are very close to their respective plate boundaries.
The Tableau Public visualization has a slider that allows you to decide how near a tectonic plate boundary an earthquake has to be before it is associated with that boundary. In the screenshot below the slider is set to 200km.
See what Matillion ETL can help you do with your data
Want to see how Matillion ETL can help you derive more insight from your data? Get a demo.