Building Data Pipelines with Matillion’s Data Productivity Cloud
Data integration tools are nothing new. They have been around for decades, assisting developers in creating complex integration solutions for organizations to gain insights from their data and simplify and streamline business processes. However, not all data integration tools are designed the same. As technology changes and data volumes explode, you need a modern solution that embraces those changes and delivers more value faster than ever. This is where Matillion Data Productivity Cloud comes in with a cloud-first platform that makes data integration accessible to all data team members, fostering team-wide collaboration while delivering best-in-class capabilities to solve even the most complex data integration challenges.
In this blog, I will discuss the capabilities of Matillion Data Productivity Cloud Designer and explain the process of building a complete, end-to-end data integration solution that highlights Matillion’s strengths in moving, transforming, and orchestrating data across the organization.
In Matillion Data Productivity Cloud, almost all new data integration efforts start with an Orchestration Pipeline to collect data from various sources and move it into the connected cloud data warehouse. This is the ‘E’ and the ‘L’ (Extract and Load) of ‘ELT.’ Starting in Matillion Designer, create a new Orchestration pipeline by selecting Orchestration Pipeline from the Add dropdown within the Pipelines Pane. Give the pipeline a meaningful name and select Add to confirm. The new pipeline will open on the Designer Canvas. Components within the Components Pane can now be added to the canvas, connected, and configured appropriately to move data into the Cloud Data Warehouse.
The Components Pane is a dynamic pane that provides a complete list of available components to build an Orchestration pipeline, including Matillion-defined and custom-defined connectors, as well as Data Dictionary Language (DDL) components, Logic components and others. The Smart Search bar within the Components Pane can be used to search for specific components based on name, type, or capability, so it is easy to find the desired component. To add an Excel Query component to the canvas, type ‘Excel’ into the Search of the Components Pane, then drag and drop the component onto the canvas to the right of the Start Component.
The Excel Query component is a Source Component in Matillion Data Productivity Cloud. It can read data from Excel documents saved in a cloud-based storage location such as an Amazon S3 Bucket or an Azure Blog. A component's properties are listed in the Properties Pane. Configuring the component properties lets Matillion’s execution engine, PipelineOS, know where to locate the file, how to select the appropriate data, and where to load the data. This is where the value of each parameter can be specified. The Properties pane is also dynamic, displaying more or less parameters based on specific parameter inputs. As parameter values are updated within a component’s Properties Pane, Matillion will automatically validate the pipeline, providing instant feedback on the status of a component and the pipeline.
By connecting the Start Component to a Source Component (such as the Excel Query component) and setting the appropriate parameter values, a basic Orchestration pipeline is created. It can now be executed within Matillion Data Productivity Cloud Designer to move data from the source into the cloud data warehouse, all with a single component.
Matillion is a true ELT solution. This means that all data transformation is completed in and on the connected Cloud Data Warehouse, leveraging the full scalability of the cloud to complete the task. Now that data has been loaded into the Cloud Data Warehouse, it can be manipulated, massaged, and enriched through a Transformation Pipeline to deliver valuable business insights. To create a Transformation Pipeline, select Transformation Pipeline from the Add dropdown within the Pipelines Pane. Give the Transformation Pipeline a meaningful name and click Add to open it on the Designer Canvas. The Components Pane will now display only Transformation components.
All Transformation Pipelines start with a Read Component, and Matillion provides easy filtering within the Components Pane to quickly find the available component by type. By clicking on the ‘Read’ filter within the Components Pane, the five main inputs to a Transformation Pipeline are displayed. Drag and drop an appropriate Read component onto the canvas and set the component properties parameter values to the desired Target Table and Column Names. Because Matillion is constantly connected to the Cloud Data Warehouse, sampling the data from the Sample tab ensures the correct dataset is selected and is accurate. If additional datasets need to be added to the pipeline, those read components can be added to the canvas and configured accordingly. Further, a Join Component can be added to the canvas, connecting multiple datasets into a single stream for data transformation.
Once the dataset within the cloud data warehouse is selected, Matillion provides an extensive list of components to cleanse, enrich, and transform data in the cloud data warehouse. Some of the most common and powerful components include the Calculate Component and the Aggregate Component. The Calculate Component allows users to perform simple and/or complex data calculations on the data, all within a single component. Like many components within Matillion Data Productivity Cloud, it is designed for both High- and Low-code users, allowing Low-code users to point and click their way to data calculations. At the same time, Matillion automatically generates the code in proper syntax. In contrast, High-code users can use the open-text format to type out complex algorithms and leverage native functions specific to the cloud data warehouse.
Similarly, the Aggregate Component comes pre-loaded with an extensive list of aggregation functions that are specific to the cloud data warehouse, ensuring native code is generated and passed down to the CDW at execution time. Finally, adding any of the available Write Components to the Transformation Pipeline ensures the data is available for any BI or data analytics tool to read and display.
Transformation pipelines in Matillion are designed to execute on the cloud data warehouse. Therefore, every component that is added on the design canvas and configured is simply generating native SQL code. The graphical interface makes it easy to design and visualize how the data is flowing. But behind the scenes, the end result of the visual pipeline is a single SQL statement that is executed on the CDW. This code can be reviewed at each component, and data can be sampled along the pipeline to ensure the data is being manipulated as expected.
The final steps in building an end-to-end data integration solution within Matillion Data Productivity Cloud include finalizing the Orchestration of the pipelines. The most important thing to know about Orchestration within Matillion is that every single component within an Orchestration pipeline is simply a task that must be managed. This includes Transfomation pipelines and can even include other Orchestration pipelines. This provides unlimited flexibility to ensure data integration processes achieve their desired results.
Using the same Orchestration pipeline used to load the data into the Cloud Data Warehouse, the Transformation pipeline outlined above can be placed onto the Design canvas and managed along with the data load task(s) and/or any other required tasks. It is also important to note that each component within an Orchestration Pipeline has three possible outcomes - Success, Failure, or Unconditional. Properly handling these outcomes is critical to delivering a complete solution for pipeline orchestration. Once all the tasks in an Orchestration pipeline are connected and configured, the pipeline can be scheduled to execute as often as needed to ensure the organization has the most up-to-date information available.
Get Started with Matillion Today!
In this blog, I explained how to build a complete, end-to-end integration with Matillion Data Productivity Cloud. The intuitive and easy-to-use interface makes it possible for nearly anyone, regardless of technical skill set, to collaborate with their data team to move, transform, and orchestrate data across the organization for faster insights.
To learn more about Matillion and the Data Productivity Cloud, visit www.matillion.com to explore interactive demos, sign up for a 14-day trial, or participate in a Virtual Hands-on Lab, building your own version of the pipelines described above.
About the Author
With nearly 20 years in the IT industry, I have a wide range of experience across a number of different industry verticals, including Telecom, Pharma, Retail, and Software. Across these different experiences, one thing has remained constant - my passion for data. Now, as I have shifted into a marketing-focused role, I have not lost sight of my technical background and believe now more than ever, in the importance of quality data to drive growth.
Director - Technical Product Marketing
Maximizing Matillion: Real-world Insights into Elevating Data Productivity
Understanding the real value that can be achieved with a modern ...Blog
Elevate Your Data Game With Matillion's Data Productivity Cloud
Getting the most out of your business data is key to success. But ...Blog
Simplifying Complex Data Architecture with Matillion’s Data Productivity Cloud
Data engineers and analysts understand the power of a streamlined ...