The Agentic Advantage series: Real-world AI success stories from leading CDAOs

Register Now

Using Matillion’s Database Query and Fixed Iterator components to scale loads from a SQL Server

This article will demonstrate a data engineering technique to split up a single, large data processing operation into multiple, smaller operations.

Together, the smaller operations achieve the same result.

This technique is valuable when the large data processing operation is too big for the source system to handle and causes timeouts or errors. This might happen during a one-off historical load, for example. Often a smaller operation enables the source system to take advantage of some built-in performance mechanism like indexing or partitioning.

It's also a potentially useful way to parallelize the smaller operations, thus gaining explicit control over the massively parallel processing.

Splitting a large query using a Fixed Iterator

Fixed Iterator allows you to partition the data into multiple concurrent loads using the Database Query Component. Before starting, check the table in SQL Server to see if there is a column by which you can logically partition the data. 

For example, I have 8.6 million rows in a SQL Server table, each with a column called seq_no, which will be chunked into 4 roughly equal partitions. The max sequence is 8,8640,340, which, broken up into 4 streams, is about 2,160,084 records in a stream.

Now, I’ll open a new Orchestration Pipeline Canvas and define 2 Pipeline variables:

  • jv_seq_start
  • Jv_seq_end

Next, I drag the Fixed Iterator Component onto the canvas, connect it to the Start, and configure the connector. Here's how:

  • Set the Concurrency to Concurrent
  • Select the 2 Pipeline variables you just created as the Variables To Iterate
  • Select Iteration Values and add 4 rows; notice the variables jv_seq_start and jv_seq_end at the top. 

After entering the 4 rows, we’ll click Save

After that, your Fixed Iterator Component should look like this:

Now, drag and drop the Database Query Component onto the canvas and configure it as follows:

  • Database Type: Microsoft SQL Server (there are 2 SQL Server Drivers, make sure it is this one)
  • Connection URL: jdbc:jtds:sqlserver://<host>/<database> (jdbc:jtds provides better performance)
  • Fill in your Username and Password.
  • Basic Advanced Mode: Advanced
  • SQL Query:

select * from terrell.flights_sequenced
where seq_no >= ${jv_seq_start} and seq_no < ${jv_seq_end}
  • Target Table: {your table name}
  • Stage Platform: Snowflake Managed

Next, click on Load Options under Advanced Settings and set the following parameters:

  • Recreate Target Table: Off
  • Compression type: None

In the end, your configuration should look like this:  

 

Now, we need to connect the Fixed Iterator to the Database Query Component by clicking on the Blue circle at the bottom of the Fixed Iterator and dragging the line to the Database Query Component.

And that’s it! Finish by right-clicking on the canvas and running the pipeline.

Key takeaways

Use the Fixed Iterator and “chunk” the data into roughly equal groups.

For the Database Query Component:

  • Define the Pipeline variables to use in the select statement
  • Use the Microsoft SQL Server jdbc:jtds driver
  • Use Snowflake Managed for the Stage Platform
  • In Advanced Settings under Load Options:
    • Turn Recreate Target Table OFF (each stream will then add to the table)
    • Turn Compression OFF (compression is an extra step using CPU)
Alan Goodrich
Alan Goodrich

Enterprise Solution Engineer

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.