Depending on your architecture and data requirements, you may handle some of your data ingestion needs with Azure Data Factory (ADF), while handling the rest of your orchestration and data transformations workflows in Matillion ETL, because of its purpose-built design and simple graphical user interface.
When this is the case, it’s possible to integrate Matillion ETL for Azure Synapse and Azure Data Factory to:
- Have ADF start a Matillion ETL job via the API
- Have ADF poll Matillion and wait while the Matillion ETL job is still running
- Have ADF unload a table created by Matillion ETL from Azure Synapse to Azure Blob storage in parquet format
Integrating Matillion ETL for Synapse and Azure Data Factory: what you’ll need
Before you get started you’ll need:
- A Matillion ETL for Azure Synapse instance connected to your Azure Synapse Cluster, that is publicly available*. See our documentation to Get Started.
- An Azure Data Factory instance
- Two ADF linked services configured within your Azure Data Factory instance:
- Azure Synapse database
- Blob Storage account (or Gen 2 Data lake)
How to trigger ETL jobs in Azure Data Factory
In order to trigger ETL jobs in Azure Data Factory, the high level steps are:
- The ADF pipeline calls the Matillion API to start a Matillion job – recording the job_id returned by the Matillion API
- Matillion ETL will:
- Create a table in the Synapse Database
- Populate the table with example data
- ADF will poll the Matillion API using the job_id until the job completes
- ADF will then unload the database table to Parquet format into a new container matillion-adf-demo within your storage account
Starting a Matillion ETL job via the API
In this scenario, the ADF pipeline will call the Matillion API to start a Matillion ETL job – recording the job_id returned by the Matillion API. The Matillion ETL job will create a table in Azure Synapse and then populate that table with example data.
To get started, within ADF add a new “Pipeline from Template” and select the attached zip file.
Next, select your linked services when prompted, for Azure Synapse and Blob Storage. This will import the ADF job.
You’ll now need to change variables within the pipeline to correctly find your Matillion ETL job:
- mat_base url – the base URL of your Matillion instance
- mat_group – your Matillion project group
- mat_project – the Matillion project
- mat_job_name – this can be left as it is if you’re using the example job
- mat_version – this can be left as it is if you’re running in the default Matillion version
You will also need to set a username and password for accessing your Matillion instance (in order to call the API) within:
- The Start Matillion Job activity
- The Get Task Status activity within the Until activity
Switching over to Matillion ETL within your Orchestration Job, import the attached JSON file.
All of the components have a Green border, indicating that they are configured correctly, but to test the job we can run it. The success run writes a table called matillion_demo_adf in your Azure Synapse data warehouse.
Within ADF, publish the changes so the datasets are available. You should now be able to run the ADF Pipeline, and see that ADF calls the Matillion API and starts the Orchestration Job. This job will write data to the demo Azure Synapse table.
Azure Data Factory Poll
With the above set up, ADF will poll the Matillion API using the job_id until the job completes, using the ‘Until’ activity.
Unload table to Blob Storage
Once the Matillion ETL job completes, ADF will unload the table to Parquet format into a new container matillion-adf-demo within the Azure Storage Account selected in your Linked Service:
Speed up analytics within Azure Synapse
Being able to trigger your Matillion ETL jobs from ADF streamlines your workflows and ensures that your data users will have access to clean and transformed data within Azure Synapse to make data analysis even faster.
Download the files you need to set up this feature:
*Note: Microsoft makes no guarantee that the Web component can make calls to resources private to your Azure account. One method to work around this would be to use an Azure Function to call Matillion ETL, and have ADF call the Azure Function.
The post Automate your Matillion ETL Jobs: Triggering ETL jobs from Azure Data Factory appeared first on Matillion.