Matillion loves to feature regular contributions from our consulting partners about the business value they’ve been able to achieve for their clients with Matillion. We also love bacon. This week, Aron Clymer, CEO and Founder of Data Clymer gives us a little of both. This blog post originally appeared on Data Clymer’s blog on September 30, 2020.
A small team of expert consultants at Data Clymer implemented a Panoply / Looker data stack at one of our many e-commerce clients to centralize data and democratize access across the employee base. We spent the majority of our time on this project modeling sales, inventory, and accounting data into a set of easy-to-use analytical tables. That data is now available across a 500-person organization, so everyone can answer their own business questions self-service.
Panoply’s solution provides a hosted, smart data warehouse combined with data connectors to popular systems. Looker is a best-in-class business intelligence platform for self-service analytics and dashboarding. Although the overall Panoply / Looker solution was working well with nightly batch loads, end users were looking for high velocity data (synced every 30-60 min). It was time to wake up and smell the bacon! There were enough blockers to high velocity data with the existing solution that we needed to eggspand the data stack with a tool that would give us more control over the data ingestion and transformation. We chose Matillion because it is a powerful extract, load, and transform (ELT) tool with an easy-to-use graphical interface that expedites development.
Below I’ll explain 4 of the main blockers that Matillion helped us resolve.
1. True incremental load
We typically ingest data using Panoply native connectors. For this project we connected 13 different data sources, many of which are common for e-commerce companies (Google AdWords, Facebook Ads, Instagram, etc.). One data source was challenging with Panoply: Microsoft Dynamics NAV ERP (MS NAV). We ingest MS NAV data from a mirrored Microsoft SQL Server database. Binlogs are not available in this case, so we must issue SQL statements against the source tables to pull data.
Robust incremental loads on many tables were not possible because the only reliable incremental key on the MS NAV tables has a data type not supported by Panoply. This key, consistently named “timestamp” on every table, is not actually a timestamp but an obscure data type called rowversion, which is unique to MS SQL Server. So, in Panoply we had to implement a truncate-and-reload (full refresh) process to sync these tables accurately, and each run was taking 4-5 hours. A few tables were too large to fully reload every day, so we had to relegate those jobs to run only on weekends. All of this made it impossible to deliver high velocity data.
Matillion allowed us to cut the fat on load times because it gave us full control over the data load process. We can now perform data type conversions that allow us to query a rowversion field in SQL Server and convert it to a BIGINT in Panoply, and vice versa. As a result, we have implemented true incremental loads using the timestamp incremental key.
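The watermark pattern described above can be sketched as follows. This is a hypothetical illustration, not our production pipeline: sqlite3 stands in for both SQL Server and Panoply, and the table and column names (`sales_line`, `row_version`) are invented for the example. In the real job the key is a rowversion cast to BIGINT (e.g. `CAST([timestamp] AS BIGINT)` in T-SQL); here it is already an integer.

```python
import sqlite3

# Hypothetical sketch of a watermark-based incremental load.
# Source and target are both sqlite3 in-memory databases for illustration.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE sales_line (id INTEGER PRIMARY KEY, amount REAL, row_version INTEGER)")
src.executemany("INSERT INTO sales_line VALUES (?, ?, ?)",
                [(1, 10.0, 100), (2, 20.0, 101), (3, 30.0, 102)])

tgt = sqlite3.connect(":memory:")
tgt.execute("CREATE TABLE sales_line (id INTEGER PRIMARY KEY, amount REAL, row_version INTEGER)")

def incremental_load(src, tgt):
    # Read the high-water mark from the target, then pull only newer rows
    # from the source -- no truncate, no full reload.
    (watermark,) = tgt.execute(
        "SELECT COALESCE(MAX(row_version), 0) FROM sales_line").fetchone()
    rows = src.execute(
        "SELECT id, amount, row_version FROM sales_line WHERE row_version > ?",
        (watermark,)).fetchall()
    tgt.executemany(
        "INSERT INTO sales_line VALUES (?, ?, ?) "
        "ON CONFLICT(id) DO UPDATE SET amount = excluded.amount, "
        "row_version = excluded.row_version", rows)
    tgt.commit()
    return len(rows)

incremental_load(src, tgt)           # initial run loads all 3 rows
src.execute("UPDATE sales_line SET amount = 25.0, row_version = 103 WHERE id = 2")
loaded = incremental_load(src, tgt)  # second run picks up only the changed row
```

Because each run touches only rows newer than the watermark, a sync that once took 4-5 hours shrinks to the time needed to move the delta.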
2. “Always On” tables
We built some fairly complex transformations in Looker because a) it is an effective rapid-prototyping approach and b) Panoply doesn’t have a transformation layer. However, Looker must truncate and reload every transform step (materialized view), and while this is happening end users cannot query the data because their queries are queued. Some transforms take 10 to 15 minutes to build, which wouldn’t be an issue with 24-hour batch loads, but the high data velocity requirement couldn’t be met with 15-minute queuing delays every time the data refreshed. Because of this, we had to limit data velocity to six-hour sync periods.
With Matillion, we were able to implement one of our favorite design patterns: something we call “Always On” tables. Using more complex UPSERT and DELETE logic against the final target tables eliminates the need for a truncate-and-reload process. Hence, we now have target tables that are “always on”, so end users can query these tables with Looker at any time with no delays. The data is simply as fresh as it can be.
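The “Always On” refresh can be sketched in miniature. This is a hypothetical illustration under assumed names (`target`, `staging`): a staging snapshot of the source drives an UPSERT plus a DELETE of rows that have vanished, so the target table is never truncated and stays queryable throughout.

```python
import sqlite3

# Hypothetical "Always On" refresh: no truncate-and-reload.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, qty INTEGER)")
db.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, qty INTEGER)")
db.executemany("INSERT INTO target VALUES (?, ?)", [(1, 5), (2, 7), (3, 9)])
# Latest source snapshot: row 2 changed, row 3 was deleted, row 4 is new.
db.executemany("INSERT INTO staging VALUES (?, ?)", [(1, 5), (2, 8), (4, 1)])

def refresh_always_on(db):
    # UPSERT: insert new rows, update changed ones in place.
    # (sqlite needs the WHERE clause to parse INSERT..SELECT with ON CONFLICT.)
    db.execute(
        "INSERT INTO target SELECT id, qty FROM staging WHERE true "
        "ON CONFLICT(id) DO UPDATE SET qty = excluded.qty")
    # DELETE: drop rows that no longer exist in the source snapshot.
    db.execute("DELETE FROM target WHERE id NOT IN (SELECT id FROM staging)")
    db.commit()

refresh_always_on(db)
```

Because the target is modified row-by-row inside a transaction rather than rebuilt from scratch, end-user queries against it are never queued behind a multi-minute reload.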
3. One-time cleanup of source data
In one dataset, there are tens of thousands of duplicate or otherwise inaccurate records due to issues with a system upgrade last year. Sadly, it was impossible to correct this data in the source, so we had to clean up this pigsty in the data warehouse. There is no way to handle this as a one-time, pre-processed cleanup with Panoply or Looker; it would have to be handled on every query generated by Looker, which impacts performance.
With Matillion, we developed a one-time cleanup script that resulted in a “clean” version of the target tables. We then modified our data ingestion pipelines to update those clean target tables with incremental data, so incremental load times remain fast and there are no downstream performance impacts.
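A one-time dedup of that kind can be sketched as follows. The table and column names (`orders`, `order_no`, `loaded_at`) are hypothetical stand-ins for the client’s schema: the script keeps the most recently loaded row per business key and deletes the rest, once, so downstream queries never pay for the cleanup again.

```python
import sqlite3

# Hypothetical one-time cleanup: keep the latest row per business key.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_no TEXT, loaded_at INTEGER, total REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    ("A-1", 1, 10.0),
    ("A-1", 2, 10.0),   # duplicate introduced by the faulty upgrade
    ("A-2", 1, 99.0),
    ("A-2", 3, 42.0),   # corrected value loaded later
])

def dedupe(db):
    # Rank rows per order_no by load time (rowid breaks any ties),
    # then delete everything except the top-ranked row.
    db.execute("""
        DELETE FROM orders
        WHERE rowid NOT IN (
            SELECT rowid FROM (
                SELECT rowid, ROW_NUMBER() OVER (
                    PARTITION BY order_no ORDER BY loaded_at DESC, rowid DESC
                ) AS rn FROM orders
            ) WHERE rn = 1
        )""")
    db.commit()

dedupe(db)
```

After this runs once, the incremental pipelines described above simply keep the clean table up to date.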
4. Leading zeros
We encountered some source fields that contain leading zeros in their values, but Panoply strips them out because it automatically converts numeric-looking values to numeric data types. There was no real workaround here because it is impossible to detect which rows had leading zeros after they were removed, so this was causing major headaches for report writers. You might say it was a real boar. (Oh, Aron. -eds.)
Again, the fine control that Matillion affords us in the data ingestion and transformation process was the solution. We built new data pipelines that maintain a VARCHAR (string) datatype throughout the ingestion process, and leading zeros are now preserved.
And that’s one of many ways in which Matillion has saved our bacon. Contact us to see how Data Clymer and Matillion might save yours.
About the author
Aron Clymer is CEO and Founder of Data Clymer, a premier boutique consulting firm specializing in data culture transformation.