Skip to main content

Integrating Matillion ETL and Apache Airflow

 

Matillion ETL is a cloud platform that helps you to extract, migrate and integrate your data into your chosen cloud data platform (for example, Snowflake or Databricks), in order to gain business insights. It’s a complete, cloud-native ELT solution. One of the best things about Matillion ETL is its flexibility and extensibility. That means it can integrate with some great open source tools which offer complementary features. Take Apache Airflow

 

Apache Airflow is an open source workflow management platform. It’s great for programmatically managing pipelines using Python.

 

When to use Matillion ETL and Apache Airflow

 

Both Matillion ETL and Apache Airflow have job-scheduling and orchestration capabilities. However, unlike Airflow, Matillion ETL is also specifically designed to perform data transformation and integration. There may be use cases when you’ll want to use the two together. Here are few examples: 

 

  • You have some Apache Airflow logic already, especially if it uses an event driven pattern such as a FileSensor, a SqlSensor or an ExternalTaskSensor
  • You are processing significant amounts of data – that is, more than XComs is designed to handle. Perhaps you are using a SparkSubmit task somewhere
  • You have a cloud data platform, such as Snowflake or Databricks, and wish to take advantage of its scalability for data processing, along with Matillion ETL’s speed and simplicity in terms of job design and maintenance

 

In these cases, you can actually launch a Matillion Job from Apache Airflow, using only Airflow’s core operators. Let’s look at how that’s done.

 

Using the Matillion REST API

Matillion ETL has a large REST API, which is documented here. The API provides the extensibility that Apache Airflow can use to launch jobs.

 

To launch a Matillion Job using the API, you need several pieces of information:

 

  • The address of your Matillion ETL instance
  • The Group name
  • The Project name
  • The Version name
  • The Environment name
  • The Job name
  • Login credentials of a user who has API access to Matillion

 

In the screenshot below, I’ve pixelated the address of my Matillion ETL instance, but yours will be the IP address or hostname that appears in your browser address bar.

 

After logging in to Matillion ETL, you can find the Group, Project, and Version names from your web browser like this:

 

In this case the Group name is Matillion, the Project name is Demo and the Version name is default.  The Version name also appears in the left sidebar.

 

You can find the Environment name in a couple of different ways. The first is by opening the Job you want to run, and context clicking the background canvas:

 

 

In this case, the Environment name is Demo, coincidentally the same as the Project name.

 

Alternatively you can find all your Environment names in the panel at the bottom left of the web user interface.

 

 

The Job name is simply the name of the Orchestration Job that you want to launch.

 

(Note that you can not directly launch a Transformation Job or a Shared Job using the Matillion ETL REST API. To do that, you would simply need to create a new Orchestration Job that runs your Transformation Job or Shared Job.)

 

You will need to authenticate into the Matillion REST API. This is governed by your choice of User Configuration, which you can do in several ways. The key thing is that the user must be authorized to use the API. If you have Internal Users, they need a tick in the API column like this:

 

 

Once you have all of those pieces of information, it’s a good idea to run a quick test to make sure the REST API call is working. It’s helpful if you have cURL, because then you can copy and paste parts of the command, like this:

 

curl -X POST -u "ian:s3cr3t" 
'http://172.32.7.73/rest/v1/group/name/Matillion/project/name/Demo/v
ersion/name/default/job/name/TheMETLJob/run?environmentName=Demo'

Two things should happen immediately, signalling that it worked:

 
  1. You should get back a JSON fragment containing “success”: true. This is an asynchronous job launch endpoint so it will not wait for the Matillion job to finish.
  2. In the Matillion user interface, provided you are logged into the same Group, Project and Version, you should see the job launch appear in the Tasks panel at the bottom right of the screen

 

 

If the command just hangs, you probably need to look at your network configuration, especially the firewall on your Matillion server. More on that subject later.

 

If you get a “failure” or “unauthorized” message back, check your username, password, and whether that user is permitted to use the API.

 

Once you’ve got it working, it’s time to move onto the next section and set this up in Apache Airflow. This will be much more secure, as it does not require you to copy and paste a username and password.

 

Configuring Apache Airflow 

 

There are two main parts to this configuration:

 

  • Adding a Connection to your Matillion ETL instance
  • Using a SimpleHttpOperator in your DAG

 

Connection to Matillion ETL

An Airflow Connection object abstracts two pieces of information that you need to use the Matillion ETL REST API:

 

  • The address of your Matillion ETL instance
  • Login credentials of a user that  has API access to Matillion

 

Add an http connection to your Matillion ETL instance, for example naming it Matillion ETL, like this:

This does require the Airflow Provider Package for http if you have not installed it already. Both of these are a once-only configuration.

 

My Airflow server is in the same subnet as my Matillion server, so I can use the Matillion server’s private IP address for the Host. If you’re using different subnets, or maybe using a mixture of cloud providers, then you’ll need to use the public IP address.

 

If you prefer to use TLS then set the Schema to https; otherwise, leave it blank.

 

SimpleHttpOperator Matillion ETL job

If you use a SimpleHttpOperator, as I do in this example, you won’t need to install any packages from the Airflow Providers framework.

 

SimpleHttpOperator uses the Python “requests” HTTP library internally. One common gotcha with requests is that the default timeout is None, meaning that it will hang forever if it can’t open a connection.

 

I want to allow just 10 seconds for the Matillion API to respond (which should be plenty). So inside the DAG I’ll create a dictionary containing the relevant requests option:

requests_opts = {
    "timeout": 10

}

If you’re using Matillion ETL’s default, self-signed TLS certificate, you’ll need to add another option, setting “verify” to False. Without that extra setting, the connection will fail with a SSLCertVerificationError.

 

Now the Task itself can be added to the DAG, like this. For now, I’ve hardcoded everything just to make sure it works.

 

   metl_job_launch = SimpleHttpOperator(
        task_id='launch_job',
        http_conn_id='Matillion ETL',

endpoint='rest/v1/group/name/Matillion/project/name/Demo/version/nam
e/default/job/name/TheMETLJob/run?environmentName=Demo',
        method='POST',
        log_response=True,
        headers={"Content-Type": "application/json"},
        extra_options=requests_opts,
        response_check=lambda response: response.json()['success'] 
== True,
        dag=dag)

A few notes on this step:

 
  • Matillion ETL is the name of the Connection I created above
  • The Group name, Project name, Version name, and Job name are used to build the path to the endpoint. I hardcoded them all for this example. You’ll recognize that it’s the same path as the cURL command in the previous section
  • This is a POST operation, although the Environment name is supplied as a URL query string and no POST data is needed
  • The requests options are passed in under the “extra_options” parameter
  • The lambda function in the response check is the same result we were hoping for in the cURL command, except this time it’s automated
 

At this point you should be able to successfully run the DAG task in isolation using an “airflow tasks test” command.

 

I did not remove the default XComs, so once you start to schedule the DAG you should see the information build up in your Airflow user interface.

 

 

 

Networking considerations

I’ve been using the Matillion REST API over HTTP. Your Matillion instance is always protected by a cloud firewall, so you’ll need to make sure that port 80 is open to the Airflow client. If you chose to use TLS instead, then the port you need is 443.

 

So if you’re using a single node Airflow with a LocalExecutor, it just means opening that port to your Airflow server.

 

If you’re using a remote executor, you’ll need to make sure that all of the nodes are permitted to open that connection. You may find the administration simplest if you use the same subnet for all of them.

 

See how Matillion ETL integrates with your most important tools

 

Apache Airflow is just one example of the way Matillion’s extensibility options make it possible to integrate with some of the most commonly used data tools. To see how Matillion ETL integrates with your existing data ecosystem, request a demo.

 

Request a demo



The post Integrating Matillion ETL and Apache Airflow appeared first on Matillion.