Blog| Matillion ETL for BigQuery

    How we accelerated our marketing analytics and reporting with BigQuery and Matillion ETL

    Editor’s note: Today’s post comes from Johnathan Romero, Lead Analytics at the7stars. The7stars is a leading UK digital marketing agency using BigQuery and Matillion ETL to accelerate their marketing analytics and better serve their customers.

    Because many of our clients are global, we work with a lot of data—billions of rows, in fact, with more added each day. In the past, we’d tried to manage all of it manually. But with many different data sources and numerous CSV files, day-to-day reporting was costing us thousands of hours each year. We knew we needed a solution that would allow for quick and easy reporting and analysis, and act as a single source of truth for all our customer marketing data.

    Our goal from the beginning was to do more than just streamline our data ingestion process—we wanted to improve our entire reporting workflow. Each of our customers, however, had a list of unique requirements and preferences when it came to reporting and visualization. After considering our options, we decided on Google BigQuery with Matillion ETL. BigQuery can consume data from common sources like Doubleclick, AdWords, Criteo, and Facebook, and with BigQuery we can use Tableau and Data Studio to create bespoke visualizations tailored for each customer. To speed up reporting, we needed pre-aggregated data sets with transformations already done that could then be fed into BI tools. Matillion ETL for Google BigQuery was our answer to this problem and gave us additional data integration components to meet clients’ individual needs.

    Here’s a little more on how we accelerated our marketing analytics and reporting with BigQuery and Matillion ETL and how it works.

    How does it work?

    This sample architecture shows where the data originates and how it gets into BigQuery.

     

    matillion the7stars architecture diagram

     

    The above diagram shows how we get different data into BigQuery. On the left-hand side you can see four data sources Matillion helps to bring into BigQuery. Matillion loads data by streaming it to Cloud Storage and then loading the data directly into BigQuery. Data, however, doesn’t go via the Matillion instance. For Google Sheets at the bottom of the diagram, Matillion will create a Federated Data Source table in BigQuery, which will query the data directly in Google Sheets. Other Federated Data Sources can be created in BigQuery as shown on the right. For us, this needs set up outside of Matillion.

    Matillion runs as a virtual machine instance (VM) inside our existing cloud infrastructure, providing us with an added level of privacy and security. Our data never goes to Matillion servers, in fact, it never leaves our GCP environment. Since it’s browser-based we can access it from any location so teams can work together across the globe, in real time.

    Bringing in the data

    Within Matillion, we create what is called an ‘Orchestration job’ to bring the data into BigQuery from different sources. Matillion can pull or link to data from Google hosted sources (Google Drive, DoubleClick, Analytics, etc.) and from non-Google sources (on-premise database using a JDBC driver, Facebook, and REST/SOAP APIs).

    To create an Orchestration job I can just drag and drop a variety of different pre-built load and unload components making this fast, easy and virtually code-free. Even better still, as I build out my job the components validate in real time. This means if I fail, I fail fast before I even run my ETL job.

    An example of our job to bring in Google Drive and Facebook data is below. We use the Google Drive Table component to create a linked table in Google BigQuery to a Google Sheets file containing Google Analytics data and also the Facebook Query component to bring in the relevant Facebook data into new tables in BigQuery.

     

    matillion the7stars orchestration job

     

    This job starts by checking if new data is available. If there isn’t updated data then no re-run is required and the job ends successfully. With that one simple sanity-check, we save time and money by incrementally loading data as needed. If there is new data, it’s then loaded in parallel, taking advantage of Matillion’s multithreaded approach.

    As part of the Orchestration job, ‘Transformations jobs’, which give we use the aggregate the data before feeding it into a BI tool, can be added and called so the data is transformed immediately after the load.

    Matillion has all of the flow control components you would expect in an ETL tool. You can run tasks in parallel and wait for them to finish before performing the next task with an AND gate as seen above. What we find really useful is the Retry component which will ‘retry’ a certain number of times with different back-off options if a job fails.

    Making the most of our data

    As you can see in the Orchestration job, there are a number of Transformation jobs have been added after the data load activities. A Transformation job transforms the data into new tables or views before we send it to Tableau or Google Data Studio for visualization. An example Transformation job is below. This is a standard reporting job which combines and normalizes data from DoubleClick, Facebook and Google Analytics:

     

    matillion the7stars transformation job

     

    I will break down the transformations performed in the job:

    1. It takes the DoubleClick data, filters out the specific advertisers and activities for the customer.
      matillion the7stars transformation job1
    2. Aggregates it to give the total number of events, total revenue, costs, and conversions and then brings this together. By aggregating the data, the job is reducing the number of rows which need to be joined and then finally loaded by Data Studio or Tableau, both reducing the cost of the BigQuery query runs and improving the speed of the queries and the speed of BI tools. The aggregates were taking billions of rows down to between 20k and 30k.
      matillion the7stars transformation job2
    3. The data is then denormalized by bringing in the associated site, campaign names, etc.
      matillion the7stars transformation job3
    4. Some calculations are performed on the data to apply business rules to calculate the true cost per site by matching against the site names.
    5. In parallel, the daily Facebook data is brought in. This is a table which has been transformed by another Matillion job which includes Facebook data from AdInsights and AdInsightActions.
      matillion the7stars transformation job5
    6. Also in parallel, the daily Google Analytics data is brought in. Again, this has been transformed by another Matillion job.
      matillion the7stars transformation job6
    7. Lastly, all data sets from the 3 sources are combined into one large table and a timestamp is added. This highly denormalized table is in the ideal format for users to use in Tableau or Google Studio to visualize the data.
      matillion the7stars transformation job7

    This is just one job for one client, but as I mentioned previously, we have global clients each with their own reporting requirements. Since most Matillion Transformation jobs can be used as templates and given our data schemas are normally similar for each client/data source – we can just quickly plug in new accounts for a new client. Using Matillion we can onboard a new client and get a report out to them in a matter of days instead of months. So it’s very scalable. We recently went from signing a new client to having a Data Studio report which had aggregated data from all their data sources, DCM, Facebook, Display Advertising and others, all done within 1 week.

    Our big benefit: productivity

    The following are some ways this solution has helped us boost our productivity and streamline our data reporting process.

    Scheduling Jobs

    Matillion gives us 3 options for running jobs:

    • Manually through the user interface
    • Programmatically using the API
    • At a given time using the Scheduler.

    At the7stars we use the 3rd option which utilizes the Scheduler to regularly load data into BigQuery and then transform it. We opted for this method for two reasons. First, we generate reports at regular intervals for clients (i.e. weekly, fortnightly, monthly reporting). Having the jobs scheduled means our reports are ready when they due to clients without oversight. The second use case for the Scheduler is batch loading our data.

    Batch loading

    The second use case for the Scheduler is batch loading our data. We use the Scheduler to conduct loads every five minutes on weekdays so we have essentially real-time data to work with and can closely monitor campaigns. No one has to manually run a job every five minutes. This gives us control over what data we are bringing in, and when.

    An example of a campaign monitoring job scheduled to run every five minutes on weekdays is below:

     

    matillion the7stars batch load

     

    This schedule automatically loads new data without any manual intervention and if for some reason the job fails a notification is automatically sent to the Matillion administrator via email. This gives us near real-time updates which can provide essential insights while running specific campaigns.

    Automating

    The automation provided by Matillion ETL and BigQuery seamlessly links into Tableau replacing what was once a very manual process. Previously, we had to conduct manual transformations to the data in Excel. This is saving hundreds if not thousands of work hours a year and allowing more time for commentary and business insights in our reports. We are redirecting our attention to data exploration and finding insights that might not otherwise have been investigated.

    Our top Matillion tips

    1. Use Matillion ETL for BigQuery to create a copy of your source data in BigQuery. BigQuery storage is cheap and this enables traceability for all of your data all within the one place.
    2. Incrementally load data using Matillion ETL, rather than doing a full reload each time. This will decrease job time because you will process less data and will also mean no user downtime and no data will be deleted – new data can simply be appended into your existing table.
    3. Configure Matillion ETL to send you notifications. Every component within a Matillion Orchestration job has success and failure links. These can be used in conjunction with the Python Script component to send an email when a job fails. Furthermore, you can use Matillion variables to send the Task Message when the job fails in the email. This means any failures can be immediately assessed and mitigated.

    Conclusion

    We were fortunate enough to stumble across Matillion at an event and get a demo on the spot. It was a lightbulb moment for us really. In fact, I have worked a long time in this career and I don’t think I have ever had a moment where a software solution came along and literally did exactly what I needed it to do, exactly when I needed it. If you are going to a NEXT event head to the Matillion booth or sign up for a remote demo.

     

    For more BigQuery and Matillion tips, download a free 60+ page guide on‘Optimizing Google BigQuery’

    Matillion-Optimizing Google BigQuery - Email banner