Integrate data from Microsoft SQL Server to Amazon Redshift using Matillion

Our Microsoft SQL Server to Redshift connector enables seamless and timely data transfers to Redshift, ensuring your information remains current—no manual coding or complex ETL scripting required.


Extracting data from Microsoft SQL Server to Amazon Redshift

Extracting data from Microsoft SQL Server is a vital step for organizations seeking to leverage their existing relational data in powerful, cloud-based analytics environments such as Amazon Redshift. Whether you are migrating legacy data, establishing a hybrid-cloud architecture, or enabling advanced business intelligence, a well-designed data extraction and loading process forms the foundation of your data pipeline.

In this article, we will guide you through the key stages of transferring data from Microsoft SQL Server to Amazon Redshift. We will begin with creating an identity in Microsoft SQL Server to ensure your data extraction processes have the appropriate permissions. For Matillion users, this guide will explain how to check for, or acquire, the necessary JDBC driver to facilitate connectivity. We will also discuss establishing secure and reliable network connectivity between your source and target systems—a crucial consideration in any data migration. Lastly, we'll explore methods for querying your data, both for initial full data loads and for ongoing incremental updates.

By the end of this article, you will have a clear understanding of the essential steps required to successfully extract data from Microsoft SQL Server and load it into Redshift, supporting your organization's broader data strategy.


What is Microsoft SQL Server?

Microsoft SQL Server is a powerful, enterprise-grade relational database management system (RDBMS) supporting diverse data workloads, from transactions to analytics. Using Transact-SQL (T-SQL), it delivers advanced security, high availability, disaster recovery, in-memory processing, and integrates with various languages. It includes tools for integration (SSIS), reporting (SSRS), and analytics (SSAS), plus robust management via SQL Server Management Studio (SSMS). Flexible licensing and support for on-premises, cloud (Azure SQL), and hybrid deployments make SQL Server a versatile, comprehensive solution for organizations to manage and leverage their data effectively.


What is Amazon Redshift?

Amazon Redshift is a fully managed, petabyte-scale data warehousing service in the cloud, designed for high-performance analytical querying and reporting. It leverages columnar storage, massive parallel processing (MPP), and advanced data compression techniques to efficiently execute complex queries across large datasets. Redshift integrates tightly with the AWS ecosystem, supporting ETL workflows via services like AWS Glue and data lake architectures via Redshift Spectrum, which enables direct querying of data stored in Amazon S3. Automated backups, encryption at rest and in transit, workload management, and cluster scaling are built-in, facilitating secure and elastic analytics. Its SQL interface is PostgreSQL-compatible, enabling seamless migration and familiarity for developers and data engineers.

Why Move Data from Microsoft SQL Server into Amazon Redshift

Unlocking Analytics: The Case for Copying Data from SQL Server to Amazon Redshift

A data engineer or architect may seek to copy data from Microsoft SQL Server into Amazon Redshift for several compelling reasons. Microsoft SQL Server frequently stores data that is valuable for business intelligence, analytics, and reporting initiatives. By integrating this data with information from other sources within a centralized Redshift environment, organizations can unlock deeper insights and a more comprehensive view of their operations. Additionally, performing data integration tasks and complex queries in Redshift, rather than on the SQL Server itself, offloads workloads that would otherwise cause query contention. SQL Server can continue to support transactional and operational processes efficiently, while Redshift's analytics-optimized architecture handles data integration and analytics needs.

Creating a User in Microsoft SQL Server

Creating a user in SQL Server is a two-step process: first, you must have a login at the server level, and then you create a database user mapped to that login. Below are instructions and example scripts.

1. Create a SQL Server Login

A login is required to allow authentication at the SQL Server instance level. You can create either a SQL Server authenticated login or a Windows authenticated login.

a) SQL Server Authenticated Login:

CREATE LOGIN [sample_login] WITH PASSWORD = 'StrongPassword123!';

b) Windows Authenticated Login:

CREATE LOGIN [DOMAIN\username] FROM WINDOWS;

2. Create a Database User

After creating a login, you need to create a corresponding user in a specific database. Connect to the desired database and execute:

USE [YourDatabaseName];
GO

CREATE USER [sample_user] FOR LOGIN [sample_login];

For a Windows user:

USE [YourDatabaseName];
GO

CREATE USER [DOMAIN\username] FOR LOGIN [DOMAIN\username];

3. Assign Database Roles (Optional)

To give the user privileges, assign them to a database role, such as db_datareader, db_datawriter, or db_owner:

ALTER ROLE [db_datareader] ADD MEMBER [sample_user];
ALTER ROLE [db_datawriter] ADD MEMBER [sample_user];

Notes

  • Remember to use strong passwords, especially in production environments.
  • Logins are server-level; users are database-level.
  • Appropriate permissions can be granted using roles or direct GRANT statements.
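Steps 1–3 can also be scripted. The sketch below generates the T-SQL statements in the order described above; the login, database, and role names are the same illustrative placeholders used in the examples, and in production you would use a strong password and run the statements through your usual deployment tooling rather than string formatting.

```python
def build_user_setup_sql(login, password, database, roles=("db_datareader",)):
    """Generate the T-SQL for the steps above: a server-level login,
    a database user mapped to it, and optional role memberships.

    Note: simple string formatting is for illustration only; do not
    interpolate untrusted input into SQL.
    """
    statements = [
        f"CREATE LOGIN [{login}] WITH PASSWORD = '{password}';",
        f"USE [{database}];",
        f"CREATE USER [{login}] FOR LOGIN [{login}];",
    ]
    statements += [f"ALTER ROLE [{role}] ADD MEMBER [{login}];" for role in roles]
    return statements

for stmt in build_user_setup_sql("sample_login", "StrongPassword123!", "SalesDB",
                                 roles=("db_datareader", "db_datawriter")):
    print(stmt)
```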

Checking the JDBC driver

The JDBC driver for Microsoft SQL Server is already installed within Matillion Data Productivity Cloud. You do not need to download or install this driver separately.

To confirm that the JDBC driver is present in your environment, follow these steps:

  1. Create an Orchestration Pipeline:
    In Matillion Data Productivity Cloud, create a new Orchestration pipeline within a project of your choice.

  2. Add the Database Query Component:
    Drag and drop the "Database Query" component into your pipeline workspace.

  3. Check the Database Types Dropdown:
    Open the settings panel for the Database Query component. In the "Database Type" dropdown menu, look for Microsoft SQL Server. The presence of this option confirms that the required JDBC driver is installed.

If you want to check the precise version of the installed driver, please refer to the instructions at the following URL:
https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/#database-driver-versions

Checking network connectivity

Before connecting Matillion Data Productivity Cloud to your Microsoft SQL Server database, you must ensure that the database allows incoming network connections based on your deployment configuration:

  • Full SaaS Agent Configuration:
    If you are using a Full SaaS agent, the Microsoft SQL Server database must accept incoming connections from all IP addresses listed in the official documentation: Allowing IP addresses. Be sure to update any firewall rules or network access controls to permit these addresses.

  • Hybrid SaaS Deployment:
    For Hybrid SaaS deployments, configure the Microsoft SQL Server to allow inbound connections from your own Virtual Private Cloud (VPC), where the Matillion Data Productivity Cloud components are hosted. To assist in verifying network access and connectivity from your VPC, you can use the utilities available at Check Network Access.

Note:
If you are configuring your Microsoft SQL Server connection using a DNS hostname rather than a static IP address, ensure that the Full SaaS or Hybrid SaaS agent is able to resolve the DNS name to the correct IP address. This may require appropriate DNS configuration or permissions within your deployment environment.
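A quick way to verify both DNS resolution and firewall rules from the machine hosting your agent is a plain TCP connection test. This minimal sketch assumes SQL Server is listening on its default port 1433; the hostname shown is a placeholder for your own server.

```python
import socket

def can_connect(host, port=1433, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds.

    This exercises both DNS resolution of the hostname and any
    firewall or network access rules along the path.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False

# Hypothetical hostname; substitute your own SQL Server endpoint.
# can_connect("sqlserver.internal.example.com")
```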

Querying Data from Microsoft SQL Server

This guide provides instructions and examples for querying data from a Microsoft SQL Server database, with considerations for data loading patterns and interoperability with Redshift.


1. Example SELECT Statements

To retrieve data from a SQL Server database, use the SELECT statement. Below are some common query patterns:

a. Select All Columns

SELECT * FROM dbo.Employees;

b. Select Specific Columns

SELECT EmployeeID, FirstName, LastName, HireDate
FROM dbo.Employees;

c. Select with Filter

SELECT EmployeeID, FirstName, LastName
FROM dbo.Employees
WHERE Department = 'Sales';

d. Select with Aggregation

SELECT Department, COUNT(*) AS NumEmployees
FROM dbo.Employees
GROUP BY Department;

e. Using Date Filter (for incremental loads)

SELECT EmployeeID, FirstName, LastName, LastModified
FROM dbo.Employees
WHERE LastModified > '2024-06-18 00:00:00';


2. Datatype Conversion Between SQL Server and Redshift

When transferring data from SQL Server to Amazon Redshift, you might encounter datatype differences. Here are a few common datatype mappings:

SQL Server Datatype    Redshift Equivalent
INT                    INTEGER
VARCHAR(n)             VARCHAR(n)
DATETIME               TIMESTAMP
BIT                    BOOLEAN
DECIMAL(p,s)           DECIMAL(p,s)

Be aware of potential conversion issues, such as:

  • SQL Server's DATETIME2 supports higher fractional-second precision (100 nanoseconds) than Redshift's TIMESTAMP (microseconds), so sub-microsecond detail may be truncated.
  • SQL Server UNIQUEIDENTIFIER can be cast to Redshift VARCHAR(36).
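The mapping table above can be expressed as a small lookup. This is an illustrative sketch, not an exhaustive or official type map; parameterized types such as VARCHAR(n) and DECIMAL(p,s) carry their arguments across unchanged.

```python
import re

# Illustrative subset of the mapping table; extend as needed.
TYPE_MAP = {
    "INT": "INTEGER",
    "DATETIME": "TIMESTAMP",
    "BIT": "BOOLEAN",
    "UNIQUEIDENTIFIER": "VARCHAR(36)",
}

def to_redshift_type(sqlserver_type):
    """Translate a SQL Server type declaration to a Redshift equivalent."""
    match = re.fullmatch(r"(\w+)(\(.*\))?", sqlserver_type.strip().upper())
    base, args = match.group(1), match.group(2) or ""
    if base in ("VARCHAR", "DECIMAL"):
        return base + args  # keep length/precision arguments as-is
    return TYPE_MAP.get(base, base)  # unknown types pass through unchanged

print(to_redshift_type("DATETIME"))       # TIMESTAMP
print(to_redshift_type("DECIMAL(10,2)"))  # DECIMAL(10,2)
```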


3. Recommended Data Loading Pattern

The best practice for loading data from SQL Server (e.g., using Matillion ETL) is an initial full load followed by incremental loads, both leveraging the same Database Query Component.

a. Initial Load

  • Description: All data is loaded for the first time.
  • Query Pattern: No filter clause.

SELECT * FROM dbo.Employees;

b. Incremental Load

  • Description: Only new or changed data is loaded, based on a filter (e.g., last modification date or incrementing key).
  • Query Pattern: Add a WHERE clause to filter for new/updated records.

SELECT * FROM dbo.Employees
WHERE LastModified > @LastLoadTimestamp;

Note: Replace @LastLoadTimestamp with the actual timestamp value that marks the end of the previous successful load.

  • Reference: More details on incremental loading and data replication strategy are available on the Matillion Exchange.

By following this pattern—an initial full extract, then efficient incremental updates—you ensure robust, scalable, and reliable data replication between SQL Server and your destination database (such as Redshift).
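The two query patterns above differ only in the presence of a watermark filter, so a single helper can build both. This is a hedged sketch using the illustrative table and column names from the examples; in Matillion the equivalent filter is typically configured on the Database Query component rather than hand-built.

```python
def build_extract_query(table, watermark_column=None, last_load=None):
    """Build the extraction query for the pattern above: a full load
    when no watermark has been recorded yet, an incremental load
    otherwise."""
    query = f"SELECT * FROM {table}"
    if watermark_column and last_load:
        query += f" WHERE {watermark_column} > '{last_load}'"
    return query

# Initial load: no previous run, so no filter.
print(build_extract_query("dbo.Employees"))
# Incremental load: only rows modified since the last successful run.
print(build_extract_query("dbo.Employees", "LastModified", "2024-06-18 00:00:00"))
```

After each successful run, store the highest LastModified value seen and pass it as last_load on the next run.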

Data Integration Architecture

Loading data in advance of integration is a hallmark of the Extract, Load, and Transform (ELT) architecture, offering a "divide and conquer" approach by handling ingestion and transformation as two discrete steps. Once raw data is ingested into Amazon Redshift, data integration activities, such as cleaning, joining, and reformatting, are performed as subsequent transformation pipelines within the database itself. These pipelines are essential for producing high-quality, analytics-ready data. A key advantage of the ELT approach is that it allows all transformation and integration operations to happen inside the target Redshift environment, providing speed, scalability, and the ability to run transformations on demand. This native processing model also eliminates the need to provision and maintain separate data processing infrastructure, reducing both operational overhead and cost.
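The two discrete ELT steps can be sketched as a pair of SQL statements executed in order against Redshift. The table names, S3 path, and IAM role below are hypothetical placeholders, not values from this article.

```python
# Illustration of the ELT split described above: ingestion first
# (COPY raw data from S3 into a staging table), then transformation
# run as SQL inside Redshift itself.
ELT_STEPS = [
    # Extract/Load: land raw data untouched in a staging table.
    """COPY staging.employees
FROM 's3://my-bucket/exports/employees/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load'
FORMAT AS CSV;""",
    # Transform: clean, reshape, and integrate inside the warehouse.
    """INSERT INTO analytics.employees
SELECT employee_id, TRIM(first_name), TRIM(last_name), hire_date
FROM staging.employees
WHERE employee_id IS NOT NULL;""",
]

for step in ELT_STEPS:
    print(step)  # in practice, execute each statement against Redshift in order
```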

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.