- Blog
- 08.05.2024
- Leveraging AI, Data Fundamentals, Product
Crafting a real-time recommendation engine with CDC and APIs

In this two-part series, we'll explore how simple it can be to set up a live data stream to create an on-the-fly hyper-personalized marketing campaign.
This campaign will identify products to upsell based on a key purchasing indicator. Our fictional company sells products directly to consumers worldwide, and we've determined that the local weather of their buyers is the most significant factor affecting purchase amounts and timing.
This isn't far-fetched; just take a look at your own inbox to see seasonal and highly targeted marketing campaigns from your competitors and the companies you buy from.
For this project, we will need two data sources
- Weather APIs (I’m using a specific one).
- A transactions database (I’m using an on-premise PostgreSQL database, but you could use your own database, whether it's Oracle, SQL Server, MS SQL, MySQL, or IBM Db2).
If your transaction data is stored in CRM systems such as HubSpot and Salesforce or databases like Snowflake, Databricks, and BigQuery, there's no need to worry. Matillion integrates with all of these platforms as well.
To start this series, we'll explore how to connect to our transaction database using a technique known as Change Data Capture (CDC). To do that, we'll illustrate how to synchronize these changes in near real-time with our target database, Snowflake.
Next, we’ll demonstrate the ease and speed of creating a connector to load data from any REST API source. For this example, we'll use a Weather API, but the same principles apply to any REST API you need to connect with.
Configuring the Change Data Capture
First, I created a replication slot on my PostgreSQL database following the instructions from the PostgreSQL documentation. Next, I installed an agent in my AWS environment using a CloudFormation template provided by Matillion. This agent is configured to write data to an S3 bucket.
Now that I can see the data streaming from my source database to the data lake every 60 seconds, Matillion also allows me to take snapshots on demand if needed.
This data is currently in AVRO file format in my data lake, but I want to use these log-based changes to replicate my source data into my Snowflake storage. By doing this, I can create a duplicate of my on-premises transactions, making it easier for others in the business to access this valuable data source across my cloud infrastructure.
Matillion Exchange offers a convenient “Sync All Tables” pipeline to set up this synchronization. This feature uses external staging from Snowflake, which includes the data lake where logs are streamed and specifies the target destination as key inputs for the configuration.

Now that I have everything set up, I'm going to join all these tables in Snowflake. Matillion allows me to perform multi-table joins easily with a JOIN component. This feature saves me from writing SQL code and lets me join multiple inputs in a single component, significantly reducing the amount of SQL code needed—fantastic!
With this data source published, I can start building reports in my cloud data warehouse, allowing my team to see live transaction data across the organization.
I've scheduled this pipeline to run every 15 minutes to minimize warehouse compute costs, but you can easily adjust the schedule to sync every minute, every day, or whatever suits your requirements using Matillion's scheduling functionality.
The "Sync All Tables" pipeline gives control to data teams responsible for data-as-a-product while putting cost optimization and refresh rate decisions in the hands of the data designer.
When the business requires changes in refresh rates, Matillion makes it easy to configure. Since we're using the data lake to store AVRO files, there's no need to keep the destination warehouse running constantly (resulting in cost savings), and we still capture all changes from the source database every 60 seconds.
This means we can access both soft and hard deleted information at any desired frequency to report back to our organization. Additionally, it handles schema drift effortlessly, so there's no need to worry when a new column is added or the datatype changes in the source database—awesome!
Adding Another Data Source
Now that I've configured my CDC, it's time to add another data source. As I mentioned earlier, this company's sales are significantly impacted by the weather in the customers' locations. Therefore, I'll be adding this new data source to the custom connectors so that all of my Matillion projects can use it.

While constructing this, Matillion assists by providing helpful messages related to issues such as incorrectly configured Query and Header parameters. I'm not only interested in data for Manchester; I want information for all my customers' locations.
Therefore, when I use this connector in the pipeline, I'll use a pipeline variable that loops through my customers' locations, eliminating the need to create multiple pipelines. Adding a variable couldn't be easier.




In the Query parameter where I previously had Manchester, I'm now referencing the variable ${jv_City}. This productivity-enhancing feature is driven by our table iterator, which loops through a column in my target cloud warehouse containing all of my customers' locations.
Other iterators provide even more impactful ways to leverage metadata, such as the file iterator, which loops through files in our data lakes, for example. Or below where the Table Iterator is looping through metadata in my target and feeding new variables into a child orchestration:

This metadata-driven iterator and variable combination allows me to avoid the hassle of adding new lines of code to an execution script.
Instead, another pipeline can automatically update my customer details table, which is then read into this pipeline each time it runs. This setup creates new tables for me whenever customer locations change or a new location is added, significantly boosting productivity.
I'm configuring this component to write to a new table as well. Each parameter control in our components can use variables. Once you start using these variables, there are virtually no limits to how much you can optimize your data engineering pipelines.
Combining all this data requires just a single Join component in our transformation pipeline. Now, I'm ready to curate my hyper-personalized marketing campaign.
Joe Herbet
Enterprise Sales Engineer
Featured Resources
Big Data London 2025: Key Takeaways and Maia Highlights
There’s no doubt about it – Maia dominated at Big Data London. Over the two-day event, word spread quickly about Maia’s ...
Learn more BlogSay Hello to Ask Matillion, Your New AI Assistant for Product Answers
We’re excited to introduce a powerful new addition to the Matillion experience: Ask Matillion.
Learn more BlogRethinking Data Pipeline Pricing
Discover how value-based data pipeline pricing improves ROI, controls costs, and scales data processing without billing surprises.
Learn more
Share: