One of the world’s leading tourism groups, the TUI group umbrella globally incorporates 1600 travel agencies, five airlines and over 400 hotels. All these operational systems capture large amounts of raw data while providing a top class experience to holidaymakers.
Trying to accumulate this raw data to obtain business insights, Dr. Michael Schäfers – Senior Data Engineer in Hotels & Resorts Analytics – found that significant information gaps were preventing TUI from realizing its full potential. In particular, usable geographic region information was only being captured 20% of the time. This severely limited the analytic power of the data.
Using a data fabric combining Matillion ETL and Snowflake, and adding various public data sets, Dr. Schäfers was able to fully plug that information gap, and make 100% of the data available to TUI for capacity planning and marketing. This article will explain how.
Address Cleansing Architecture
Dr. Schäfers found that the core of the issue was high variability in the formatting and level of detail of the address data being recorded.
With service delivery points in 140 different countries, data recorded locally usually lacked the country-specific insight needed to assign the correct geographical district. In addition, road and town names were often misspelled – Straße / Strasse – or shortened inconsistently – St. / Street.
To solve this problem, Dr. Schäfers validated and geocoded the raw addresses using OpenStreetMap data – which itself required several stages of metadata enrichment before use.
The OpenStreetMap (OSM) data is mirrored by Geofabrik and can be downloaded country by country in its native, highly compressed PBF format. Conversion to CSV/WKT is needed before the data can be processed further. To perform this format conversion cost effectively, Dr. Schäfers used GDAL running on an AWS Batch spot instance, launched from a Matillion ETL Python Script.
Spatial data enrichment
OSM data is tagged with the geographic hierarchy of specific interest to TUI. This was loaded into Snowflake alongside the spatial data using a Matillion S3 Load component.
However, OSM’s tagging is patchy, with the building > street > postcode > city > region hierarchy not always fully present. The solution was to perform spatial intersect joins in a Matillion ETL transformation job, at increasing levels of granularity to the OSM polygons for street, postcode, city, and region.
An Extract Nested Data component helped extract the metadata tags by taking advantage of Snowflake’s JSON handling capabilities.
Address geocoding with fuzzy matching
Geocoding the raw, operational address data with simple lookups into the OSM reference data proved to be ineffective due to the frequent misspellings and inconsistencies. With this approach, almost no matches appeared at all.
Instead, Dr. Schäfers implemented a fuzzy matching solution using a Snowflake user defined function (UDF). The UDF created simplified, lowest common denominator versions of address fields – for example by removing shorthand, correcting spelling mistakes, and removing the accents from characters. Passing both sets of address data through the same UDF removed the differences and mismatches caused by human error.
This was done in a second Matillion ETL transformation job, with joins at four levels with decreasing confidence: house, street, city and region. Below is a screenshot of the geocoding job with the UDF-based fuzzy matching for street names.
A world of spatial analytics
Once the missing information became available, TUI were able to use it to gain a wide variety of reliable insight into customer behaviours, including:
- Regional variation within a single country, unlocking the potential for much more effectively targeted sales and marketing, and for more accurate capacity planning
- Geographical confirmation of the increase of consumer confidence to pre COVID levels
TUI took the analytics one step further by adding weather data to the spatial join. This unlocked the ability to send real-time travel recommendations for sunny destinations to customers who are currently experiencing rainy weather (blue areas) at home.
TUI are actively recruiting right now, including in the Digital and Tech functions.
For more information about how Matillion ETL and Snowflake work together in a modern enterprise data stack, download our ebook, Optimizing Snowflake; A Real-World Guide.
Read more details about Geospatial Data Integration Using Matillion ETL.