Blog

Setting Up Change Data Capture (CDC) in Matillion ETL

 

CDC: sychronized trees representing synchronized data

 

Want to ensure that Amazon Redshift or Snowflake cloud data warehouse on AWS stays up to date with your source database? Change Data Capture (CDC) is your new best friend.  As we mentioned in our latest ebook, Accelerate Time to Insight, Matillion offers CDC functionality for AWS-hosted instances. CDC works directly with several AWS products: DMS, Lambda, SQS and S3. Setting up Change Data Capture takes a few steps. But once it’s ready to go, you no longer need to update with complex transformations, as CDC will keep your cloud data warehouse (CDW) and your source database in sync.

 

Setting up Change Data Capture: How it works in Matillion

 

CDC within Matillion follows your source system logs in order to track changes. This method is less resource intensive on your source database server than using a High-Water Mark or History Triggers. It’s especially helpful if you have a transactional production database tied to your application.

 

Matillion’s CDC functionality currently supports the following databases:

 

  • MySQL
  • PostgreSQL
  • Oracle
  • Microsoft SQL Server

 

 

If you want to immerse yourself in CDC, we have a detailed 3-part series article and YouTube video that reviews the requirements and setup for CDC. But, as a quick example here, we’ll do a quick dive into setting up CDC with an RDS MySQL database as a source. Snowflake is the target CDW in this case. But similar steps apply for Amazon Redshift.

 

Setting up the source database

 

Replication is a special use case and requires some additional configuration so the source system can support it. When using RDS MySQL as a source database, please refer to the AWS documentation for following the guidelines to set up the parameters needed. Keep in mind that the source database tables must have primary keys.

 

Enable backups

 

In the AWS RDS console, select your MySQL instance to adjust the following (some of these parameters may already be in place depending on your configuration):

 

  • In the RDS MySQL console, select “Modify”
  • Enable Automatic Backups – The backup retention period and backup window can be set to what you need for your environment.

Setting up CDC backup rds screen

Create or edit parameter group for bin logging

 

The next few steps involve setting up the Parameter group to enable binary logging and the additional bin log properties needed for CDC. If you already have a parameter group setup for your instance, you can simply edit it.

 

Note: You can’t modify a default parameter group so a new one will need to be created if the default is tied to your instance.

 

  • In the RDS Console, select “Parameter groups” and create a new parameter group for your MySQL instance. Select the appropriate parameter group for the version of MySQL you are running. In this setup, we are working with MySQL 5.7 so the “parameter group family” should be mysql5.7. Provide the new parameter a group name and description. Set the following parameter options:
    1. binlog_format = ROW
    2. binlog_checksum = NONE
    3. In addition, you will need to run the following code in a SQL tool or via command line:

 

call mysql.rds_set_configuration('binlog retention hours', 24); 

 

This will set the binlog retention period to a timeframe based on your needs. Otherwise, RDS MySQL instances purge binary logs as soon as possible. For this example, we will set the period to 24 hours.

  • If this is a new parameter group, you will need to add it to your MySQL instance. You can edit this in the RDS MySQL console by selecting “Modify”. Under Database options, select the new parameter group from the “DB parameter group” drop-down menu. Save these changes. Note: You will need to reboot the instance in order for the parameter group to be applied.

Setting up CDC: database options screen

Setting up the DMS replication instance

 

Now that we have the RDS MySQL instance configured to allow for CDC, we can now set up our DMS replication instance by following these steps:

 

  • In the AWS DMS console, select “Replication Instances”

CDC screen: replication instances

 

 

  • Create a new replication instance by by selecting “Create replication instance.” Use the default options. Or you can also modify them based on your needs. You will need to name your replication instance and choose the VPC that your Matillion instance is located in. (Note: Remember the instance name, as we will reference it in the CDC configuration)
  • The RDS MySQL instance will need to allow inbound access for this new DMS replication instance. Edit the security group attached to the MySQL instance, as well as the Inbound group. Add the private IP address for the DMS replication instance.

 

CDC DMS replication instance

Create an SQS FIFO queue and S3 bucket

 

Next, set up the SQS FIFO queue and S3 bucket. Note that this SQS queue should not be used for any other purpose other than CDC for a specific project.

 

  • In the AWS SQS console, select “Create New Queue”. Select the “FIFO Queue” option, give the queue a name and select “Quick-Create Queue”.

 

CDC create new queue

 

  • In the AWS S3 console, create a new S3 bucket. You can use an existing S3 bucket if you already have one setup.

 

CDC S3 buckets

 

 

 

CDC Credentials

We are almost finished with the AWS portion of setting up CDC. Last steps are creating two IAM roles for S3 and Lambda and an IAM user. The resources for the policies attached to the user/roles can be adjusted as needed, but the referenced policy actions are what’s required for CDC to work with Matillion.

Create IAM user

 

The credentials requested in “Manage CDC” within the Matillion UI can be either Instance Credentials or User-Defined Credentials. The recommendation is to create a new IAM user since these credentials need only be used for the CDC Configuration.

 

  • In the AWS IAM console, select “Users” and select “Add User”. Give the user a name relevant to CDC and select “programmatic access”. This enables an access key ID and secret access key. 
  • The next step is adding permissions to the IAM user. Select “Attach existing policies directly”. 
    1. A new policy will need to be created by selecting “Create policy”. This will open a new browser tab.
    2. Select the “JSON” tab and enter the policy. See file “cdc_user_policy” for the actions needed. Reference the Credentials for CDC section in Snowflake support for more information on what each action performs.
    3. Review the policy, give it an appropriate name and create the policy.
    4. Once you’ve created the policy, go back to the IAM Users browser tab and attach the policy to the IAM user. Select “Create User”. You can apply tags, but it is optional.

Create Lambda IAM role

  • In the AWS IAM console, select “Roles” and “Create role”. 
  • Select “Lambda” as the service. 
  • Next, add permissions to the Lambda IAM role. Select “Next: Permissions”. 
    1. You will need to create a new policy by selecting “Create policy”. This will open a new browser tab.
    2. Select the “JSON” tab and enter the policy. See file “lambda_iam_policy” for the actions needed. 
    3. Review the policy, give it an appropriate name, and create the policy.
    4. Once the above is completed, go back to the IAM Roles browser tab and attach the newly created policy to the IAM role. Select “Create role”. You can apply tags, but it is optional.

Create S3 IAM role

  • In the AWS IAM console, select “Roles” and “Create role”. 
  • Select “DMS” as the service. 
  • Next, add permissions to the S3 IAM role. Select “Next: Permissions”. 
    1. You will need to create a new policy by selecting “Create policy”. This will open a new browser tab.
    2. Select the “JSON” tab and enter the policy. See file “s3_iam_policy” for the actions needed. 
    3. Review the policy, give it an appropriate name and create the policy.
    4. Once the above is completed, go back to the IAM Roles browser tab and attach the newly created policy to the IAM role. Select “Create role”. Again, you can apply tags, but it is optional.

Setting up CDC in Matillion

We are almost there! The final portion of the CDC setup is configuring CDC in Matillion. Let’s review those steps.

 

Manage Change Data Capture

 

  • From the “Project” menu in Matillion, select “Manage Change Data Capture”. This will open a Change Data Capture canvas in Matillion. Then select “Manage”.

manage cdc

 

 

  • A new window will open and the following needs to be referenced: Select “Enable CDC.” For “Region”, select the appropriate region where your SQS FIFO queue is located. For “Listen Queue”, select the SQS FIFO Queue. For “Lambda Role ARN”, select the Lambda role that was created in the previous steps. Once complete, select “OK”. 

manage cdc configuration

Creating CDC Task

 

Now that we have CDC configured, let’s create a new CDC task.

  • In the CDC screen, select the “+” icon to create a new task. 

create new task

 

 

  • A new menu will open for task configuration. Enter a new “Task Name” and select the “Replication Instance” that was created in the previous steps. Once complete, select “Next”.

cdc setup

 

 

  • Create the source endpoint by entering the following information: Host Address, Port, Database, SSL Mode (if needed), Username, Password, JDBC Connection Attributes (if needed), and DMS Connection Attributes (if needed). The default DMS Connection Attribute, “unloadTimeout,” is set to 36000. You can change it based on what’s needed per use case. Select “Next”. 

CDC source endpoint

 

  • It will take a couple of minutes to create the source endpoint. When complete, the next screen will display the “Status” as “Connection successful” when complete. Select “Next”.
  • Create the target endpoint by entering the following information: S3 Role ARN and Bucket Name (this is the S3 role and bucket we created earlier). Everything else should be auto-populated (Endpoint Name, Batch Interval and Enable Logging). The default Batch Interval setting can be changed based on what’s needed per use case. The “Table Selection” is also needed. Select the schema and table/s needed for CDC. A “%” can be used to select all tables. Select “Next”. 

 

target endpoint configuration

 

  • It will take a couple of minutes to create the target endpoint. The next screen will display the “Status” as “Connection successful” when complete. Select “Next”.
  • The last configuration step is to configure the job. Enter the Environment, Staging Database, Staging Schema, Staging Table Prefix (default is “tmp_”), Target Database, Target Schema, Target Table Prefix, and Job Folder (this is where you want to store your CDC job). Lastly, you can select the “Add Last Updated” check box if it’s needed or required. Select “Finish”.

finish cdc setup

 

 

Congratulations!

At this point, you are finished with setting up CDC within Matillion! After you’re done, a prompt will ask if you want to start the CDC task now. You can start it, or wait to start it at a later time.