Extracting data from Microsoft SQL Server to Snowflake
Extracting data from Microsoft SQL Server is a common requirement for organizations seeking to leverage cloud-based analytics platforms such as Snowflake. Efficiently migrating both initial and ongoing (incremental) data is critical to ensuring timely and reliable availability of enterprise data where it is needed most. This article will guide you through the key steps involved in this process. First, we will explain how to create and configure a suitable identity in Microsoft SQL Server, ensuring secure access to the required data. For users employing Matillion ETL, we will outline how to check for—or acquire—the appropriate JDBC driver needed to facilitate the database connection. Next, we’ll review best practices for ensuring reliable network connectivity between your SQL Server source and your Snowflake target environment. Finally, we will distinguish between methods used for initial full data extraction and ongoing incremental data loading, providing a clear framework for efficient data migration. By following this guide, you will be well equipped to establish a robust, repeatable process for transferring data from Microsoft SQL Server to Snowflake.
What is Microsoft SQL Server?
Microsoft SQL Server is a relational database management system (RDBMS) by Microsoft, supporting on-premises and cloud applications. It manages structured data using T-SQL, and features in-memory processing, high availability, and robust security (encryption, auditing). SQL Server integrates with development tools, reporting services, and analytics, enabling enterprise-level data warehousing, business intelligence, and transactional workloads. Its flexible deployment options—including containers and managed services on Azure—allow scalable, secure database solutions for diverse needs.
What is Snowflake?
Snowflake is a cloud-based data warehouse designed for high performance, scalability, and ease of use in data storage, processing, and analytics. Its architecture decouples compute and storage, enabling independent scaling and supporting multiple concurrent workloads. Leveraging multi-cluster, cloud-native architecture, Snowflake offers automatic scaling, on-demand provisioning, and near-instant elasticity. It natively supports structured and semi-structured data (e.g., JSON, Avro, Parquet), with SQL compatibility and advanced features like secure data sharing, time travel, and zero-copy cloning. As a fully managed service, Snowflake reduces administrative overhead, letting organizations focus on insights instead of infrastructure management.
Why Move Data from Microsoft SQL Server into Snowflake
The Benefits of Transferring Data from Microsoft SQL Server to Snowflake for Advanced Analytics
A data engineer or architect may wish to copy data from Microsoft SQL Server into Snowflake for several reasons. Firstly, Microsoft SQL Server often contains business-critical and potentially valuable data that, when used effectively, can drive strategic insights and decision-making. However, the true value of this data is often realized only when it is integrated with information from other systems, such as customer relationship management platforms, financial databases, or external data sources. By bringing these diverse datasets together in a unified environment, organizations can perform comprehensive analyses that are not possible when data remains siloed. Leveraging Snowflake as the integration platform offers distinct advantages. It enables powerful, scalable analytics without imposing additional processing overhead on the SQL Server environment. This separation ensures that operational workloads on SQL Server remain unaffected, while Snowflake efficiently handles complex data transformations, integrations, and analyses.
Creating an Identity in Microsoft SQL Server: Adding a User to a Database
To grant individuals or applications access to a Microsoft SQL Server database, you must create a login at the server level and a user within the target database. Below are instructions and example scripts demonstrating this process.
1. Create a SQL Server Login
A login grants access to the SQL Server instance. There are two common types: SQL Authentication logins and Windows Authentication logins.
a. SQL Authentication Example
```
-- Replace with appropriate login name and strong password
CREATE LOGIN [example_login] WITH PASSWORD = 'Str0ngP@ssw0rd!';
```
b. Windows Authentication Example
```
-- Replace with appropriate Windows domain and username
CREATE LOGIN [DOMAIN\username] FROM WINDOWS;
```
2. Create a User in a Specific Database
A database user links the login to permissions within a particular database.
```
USE [your_database]; -- Replace with your database name
GO

-- For SQL Authentication login:
CREATE USER [example_user] FOR LOGIN [example_login];

-- For Windows Authentication login:
-- CREATE USER [DOMAIN\username] FOR LOGIN [DOMAIN\username];
```
3. Grant Roles or Permissions (Optional)
Often, you must assign database roles to the new user. For example, to grant `db_datareader` and `db_datawriter`:

```
ALTER ROLE db_datareader ADD MEMBER [example_user];
ALTER ROLE db_datawriter ADD MEMBER [example_user];
```
4. Verify the New User
List the users in the current database:
```
SELECT name, type_desc
FROM sys.database_principals
WHERE type IN ('S', 'U', 'G');
```
Note:
Only users with sufficient privileges (such as members of the `sysadmin` or `securityadmin` server roles) can perform these actions. Adjust the examples to match your security policies and naming conventions.
Checking the JDBC driver
The Microsoft SQL Server JDBC driver comes pre-installed with Matillion Data Productivity Cloud—there’s no need to download or install it yourself. If you want to check which version of the JDBC driver is currently available, detailed instructions are provided here: Database driver versions documentation.
To confirm that the JDBC driver for Microsoft SQL Server is present and available for use, follow these steps:
- In Matillion, create a new Orchestration pipeline in your workspace.
- Add a Database Query component to the pipeline from the component panel.
- Open the properties of the Database Query component.
- Locate the Database Type dropdown menu.
- Check that Microsoft SQL Server appears as an option in this list.
If you see Microsoft SQL Server listed, the JDBC driver is already installed and ready to use in your data workflows.
Checking network connectivity
To ensure successful connectivity between Matillion Data Productivity Cloud and your Microsoft SQL Server database, you must configure the database to allow incoming connections from the appropriate sources. The configuration depends on your deployment type:
- **Full SaaS Agent Configuration:** You must configure your SQL Server to allow incoming connections from the fixed IP addresses used by Matillion Data Productivity Cloud's Full SaaS agents. The list of required IP addresses is available at: https://docs.matillion.com/data-productivity-cloud/security/docs/allowing-ip-addresses/
- **Hybrid SaaS Deployment:** You must allow incoming connections from the network address range (CIDR block) of your own Virtual Private Cloud (VPC) where the Hybrid SaaS agent is deployed. If you need to verify which IP addresses or network ranges are in use, or to check SQL Server network accessibility, you can use the utilities available here: https://exchange.matillion.com/data-productivity-cloud/pipeline/check-network-access/
- **DNS Name Resolution:** If your Microsoft SQL Server database is referenced by its DNS hostname (rather than an IP address), ensure that the Matillion Full SaaS or Hybrid SaaS agent has the network permissions required to resolve that hostname; the agent must be able to resolve it before it can connect to your SQL Server instance.
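The reachability and DNS checks above can be sketched in a few lines of Python. This is an illustrative helper, not a Matillion utility; the host name is a placeholder, and port 1433 is SQL Server's default TCP port.

```python
import socket


def check_sql_server_reachable(host: str, port: int = 1433, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout.

    create_connection() also resolves the DNS name, so a failure here may mean
    either a resolution problem or a blocked/closed port.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, connection refused, and timeouts alike.
        return False
```

For example, `check_sql_server_reachable("sqlserver.example.com")` returns a simple boolean you can log before attempting the JDBC connection.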
Querying Data from Microsoft SQL Server
This section outlines how to query data from a Microsoft SQL Server database, with practical examples of SELECT statements, guidance on handling datatype conversion when targeting Snowflake, and best practices for initial and incremental data loads.
Example SQL SELECT Statements
Below are common examples of SQL Server queries you might use with a Database Query component:
Simple SELECT

```
SELECT * FROM Sales.Orders;
```

SELECT with Specific Columns

```
SELECT OrderID, CustomerName, OrderDate, TotalAmount
FROM Sales.Orders;
```

SELECT with WHERE Filter (for Incremental Load)

```
SELECT * FROM Sales.Orders
WHERE LastModifiedDate > '2024-06-01 00:00:00';
```

SELECT Using Aggregate Functions

```
SELECT CustomerID, COUNT(*) AS OrderCount
FROM Sales.Orders
GROUP BY CustomerID;
```
Handling Datatype Conversion: SQL Server to Snowflake
When querying a Microsoft SQL Server database with the intention to load the result set into Snowflake, be mindful of datatype differences. For example:
| SQL Server | Snowflake | Notes |
|---|---|---|
| DATETIME | TIMESTAMP_NTZ | Timezone considerations may apply |
| BIT | BOOLEAN | |
| UNIQUEIDENTIFIER | VARCHAR | Requires explicit casting |
| NVARCHAR | VARCHAR | Unicode strings |
| MONEY | NUMBER(19,4) | Numeric precision differences |
Tip: Use explicit type casts in your SQL to ensure compatibility:
```
SELECT
    CAST(OrderID AS VARCHAR(50)) AS OrderID_Str,
    CONVERT(VARCHAR(36), UniqueIdentifierColumn) AS UID_Str
FROM Sales.Orders;
```
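When generating target DDL or validating schemas programmatically, the mapping table above can be captured in code. The Python sketch below is illustrative: the dictionary mirrors the table, and the fallback to VARCHAR for unlisted types is an assumption for this example, not a Snowflake rule.

```python
# Type mapping taken from the table above; extend it for the types in your schema.
SQLSERVER_TO_SNOWFLAKE = {
    "DATETIME": "TIMESTAMP_NTZ",
    "BIT": "BOOLEAN",
    "UNIQUEIDENTIFIER": "VARCHAR",
    "NVARCHAR": "VARCHAR",
    "MONEY": "NUMBER(19,4)",
}


def map_type(sqlserver_type: str) -> str:
    """Map a SQL Server type name to a Snowflake type.

    Falls back to VARCHAR for unlisted types (an assumption for this sketch).
    """
    return SQLSERVER_TO_SNOWFLAKE.get(sqlserver_type.upper(), "VARCHAR")
```

A lookup like `map_type("datetime")` then yields `TIMESTAMP_NTZ` regardless of the case used in the source metadata.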
Load Patterns: Initial vs. Incremental
A recommended best practice for data replication is a one-off initial load followed by ongoing incremental loads. Refer to the Incremental Load Data Replication Strategy for details.
Database Query Component Usage Patterns:
| Load Type | SQL WHERE Filter? | Description |
|---|---|---|
| Initial Load | No filter clause | Loads the entire dataset. Example: SELECT * FROM dbo.TableName; |
| Incremental Load | Filtered by key column | Loads only new/updated records. Example: SELECT * FROM dbo.TableName WHERE ModifiedDate > ?; |
Example:

- Initial Load:

```
SELECT *
FROM dbo.TableName;
```

- Incremental Load:

```
SELECT *
FROM dbo.TableName
WHERE ModifiedDate > '2024-06-01 00:00:00';
```
Use the same Database Query component for both patterns; adjust only the filter condition to differentiate between a full (initial) and incremental data load.
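One way to drive both patterns from a single component is to build the SELECT statement from an optional high-watermark value. The sketch below is illustrative Python, not Matillion functionality; in production, prefer a parameterized query (the `?` placeholder shown earlier) over string interpolation.

```python
from datetime import datetime
from typing import Optional


def build_extract_query(table: str,
                        watermark_column: str = "ModifiedDate",
                        last_watermark: Optional[datetime] = None) -> str:
    """Return a full-load SELECT when no watermark is given, otherwise an incremental SELECT."""
    query = f"SELECT * FROM {table}"
    if last_watermark is not None:
        # Incremental load: only rows changed since the last successful extract.
        # NOTE: interpolating values is for illustration only; use bind
        # parameters in real pipelines to avoid SQL injection.
        query += f" WHERE {watermark_column} > '{last_watermark:%Y-%m-%d %H:%M:%S}'"
    return query + ";"
```

Calling the function without a watermark produces the initial-load query; passing the timestamp of the last successful run produces the incremental variant, matching the two rows of the table above.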
Data Integration Architecture
Loading data in advance of integration exemplifies the "divide and conquer" strategy by splitting the data processing workflow into two distinct steps: first, loading raw data into the Snowflake environment, and then performing integration and transformation tasks directly within the database. This sequencing is a core advantage of the Extract, Load, and Transform (ELT) architecture, as it enables organizations to readily adapt to changing business requirements while optimizing resource utilization. Data integration inherently involves transformation, and employing dedicated data transformation pipelines is the most effective way to standardize, cleanse, and enrich disparate data sources before unifying them. With ELT, these transformations and integrations happen within the target Snowflake database itself, delivering significant benefits: transformations are executed quickly and can be triggered on demand, thanks to Snowflake’s inherent scalability. Moreover, this in-database processing model eliminates the need for additional data processing infrastructure, reducing operational complexity and costs.