Extracting data from Microsoft SQL Server to Databricks
Extracting data from Microsoft SQL Server is a crucial task for organizations moving towards scalable cloud analytics platforms such as Databricks. Ensuring data is seamlessly and securely transferred between these systems requires a clear understanding of identity management, driver configuration, connectivity, and incremental data loading techniques. In this article, we will guide you through the essential steps to ensure a successful and reliable data extraction process. We begin by outlining how to create or specify an appropriate identity in Microsoft SQL Server, establishing the correct permissions. For users leveraging Matillion, we will review how to check for or acquire the necessary JDBC driver to facilitate the connection between systems. We will also discuss network connectivity requirements, including any configuration needed to ensure secure and robust data transfer from source to target. Finally, we will cover strategies for querying data—both for an initial bulk load and for efficient incremental data extraction over time. By following these steps, you will be positioned to architect a consistent and robust data migration pipeline from Microsoft SQL Server to Databricks.
What is Microsoft SQL Server?
Microsoft SQL Server is an enterprise-grade RDBMS developed by Microsoft for both on-premises and cloud use. It offers advanced transaction processing, in-memory technology, strong security (e.g., Transparent Data Encryption, Always Encrypted), analytics, reporting, and business intelligence tools. SQL Server supports high availability and disaster recovery via Always On availability groups and transactional replication. With SQL Server Management Studio (SSMS) and Azure integration (Azure SQL Database), it scales efficiently for mission-critical applications. Developers interact using T-SQL and can connect via multiple programming languages, making SQL Server a powerful, flexible choice for data-driven solutions.
What is Databricks?
Databricks, built on the scalable Lakehouse Platform, unifies data warehousing and data lake features, enabling management of structured, semi-structured, and unstructured data. Utilizing Delta Lake, it offers ACID transactions, schema enforcement, and data versioning on cloud storage. Databricks boosts query performance through indexing, caching, and advanced planning, supporting large-scale analytics, machine learning, and streaming. Integration with Apache Spark provides reliable distributed processing, while its collaborative workspace streamlines development in Python, SQL, Scala, and R. This makes Databricks a comprehensive, modern solution for data engineering and analytics.
Why Move Data from Microsoft SQL Server into Databricks
Unlocking Analytics: The Benefits of Copying Data from Microsoft SQL Server to Databricks
Microsoft SQL Server often contains vital business data that, when leveraged appropriately, can deliver significant value to organizations. However, this value is more fully realized when the data is integrated with other sources—such as cloud-based data lakes, third-party APIs, or streaming data—enabling more comprehensive analytics and business insights. By copying data from Microsoft SQL Server into Databricks, data engineers and architects can utilize Databricks’ scalable compute and collaborative environment for advanced data integration, transformation, and analytics. Importantly, this approach reduces the processing burden on SQL Server itself, as heavy analytics and integration workloads are offloaded to Databricks, thereby preserving SQL Server’s performance and stability for transactional workloads.
Creating a User in Microsoft SQL Server
The following instructions demonstrate how to create a user in a Microsoft SQL Server database. These steps cover both the authentication at the SQL Server instance level (login) and the authorization at the database level (user).
1. Create a Login at the Server Level
A login is required for authentication to connect to SQL Server. You can create either a SQL Server-authenticated login or a Windows-authenticated login.
Example 1: SQL Server Authentication
CREATE LOGIN my_new_login WITH PASSWORD = 'StrongPassword!2024';
Example 2: Windows Authentication
CREATE LOGIN [DOMAIN\username] FROM WINDOWS;
2. Create a Database User for the Login
Once the login exists, create a corresponding database user in a specific database, which allows the login to connect and access that database.
```
-- Switch context to your database
USE [YourDatabase];
GO

CREATE USER [my_new_user] FOR LOGIN [my_new_login];
```
Note: The database user name (my_new_user) does not have to match the login name, but matching them is common practice for clarity.
3. (Optional) Grant Permissions or Assign Roles
You may grant permissions directly or add the user to a predefined database role.
Example: Add User to db_datareader Role
ALTER ROLE db_datareader ADD MEMBER [my_new_user];
Example: Grant Specific Permissions
GRANT SELECT, INSERT ON dbo.YourTable TO [my_new_user];
4. Verify the User
You can verify the newly created user with the following query:
SELECT name, type_desc FROM sys.database_principals WHERE name = 'my_new_user';
Cautions:
- Always use a strong password for SQL Server authentication.
- Restrict privileges based on the principle of least privilege.
- Database names, logins, and users are case-sensitive under some collations, so use consistent naming.
Checking the JDBC driver
The JDBC driver for Microsoft SQL Server is already installed in Matillion Data Productivity Cloud, so there is no need to download or install it manually.
To confirm that the driver is present:
- Open Matillion Data Productivity Cloud and create a new Orchestration pipeline.
- Add a Database Query component to your pipeline.
- When configuring the Database Query component, click the dropdown list next to Database Type.
- Check the list for Microsoft SQL Server. If it appears in the dropdown, the JDBC driver is installed and available for use.
If you need to check the version of the installed JDBC driver, refer to the instructions provided at the following URL:
https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/#database-driver-versions
Checking network connectivity
To ensure successful connectivity between Matillion Data Productivity Cloud and your Microsoft SQL Server database, you must configure the database to accept incoming connections from the appropriate sources, depending on your deployment:
- Full SaaS agent configuration: Allow incoming connections from the IP addresses listed at Matillion Security Documentation – Allowing IP Addresses. These IP ranges are used by Matillion's Fully Managed (SaaS) agents to establish connections to your database.
- Hybrid SaaS deployment: Allow incoming connections from your own virtual private cloud (VPC). The specific IP addresses will depend on your VPC configuration. For assistance checking connectivity from your VPC, you can use the network utilities available at Matillion Exchange – Check Network Access.
Additionally, if your Microsoft SQL Server database is referenced using a hostname or DNS address, ensure that the Full SaaS or Hybrid SaaS agent can successfully resolve that DNS address to the appropriate IP. This is necessary for establishing a successful connection.
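As a quick sanity check from a machine in the same network as the agent, DNS resolution and TCP reachability can be verified with a short script. This is an illustrative helper, not part of Matillion: the hostname below is a placeholder for your own server, and 1433 is the SQL Server default TCP port.

```python
import socket
from typing import Optional


def check_dns(hostname: str) -> Optional[str]:
    """Resolve a hostname to an IPv4 address; return None if resolution fails."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None


def check_port(hostname: str, port: int = 1433, timeout: float = 5.0) -> bool:
    """Attempt a TCP connection to the host and port (1433 is the SQL Server default)."""
    try:
        with socket.create_connection((hostname, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False


if __name__ == "__main__":
    host = "sqlserver.example.com"  # placeholder: replace with your server's DNS name
    ip = check_dns(host)
    if ip is None:
        print(f"DNS resolution failed for {host}")
    elif check_port(host):
        print(f"{host} resolved to {ip} and port 1433 is reachable")
    else:
        print(f"{host} resolved to {ip} but port 1433 is not reachable")
```

If DNS resolution fails here but the IP address connects directly, the agent's DNS configuration is the likely culprit rather than the firewall rules.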
Querying Data from Microsoft SQL Server
This guide provides best practices for querying data from a Microsoft SQL Server database, with an emphasis on one-off initial loads and incremental loads—common strategies when integrating with data platforms such as Databricks. Example SQL queries are provided for both scenarios.
1. Sample SQL SELECT Statements
```
-- Query all columns from the Employees table
SELECT * FROM dbo.Employees;

-- Query specific columns with a filter
SELECT EmployeeID, FirstName, LastName
FROM dbo.Employees
WHERE Department = 'Sales';

-- Aggregate data: count employees per department
SELECT Department, COUNT(*) AS EmployeeCount
FROM dbo.Employees
GROUP BY Department;

-- Using ORDER BY and limiting results
SELECT TOP 10 * FROM dbo.Employees
ORDER BY HireDate DESC;
```
2. Datatype Conversion Considerations
When extracting data from SQL Server and loading it into Databricks, be aware that datatype conversion may take place. Some SQL Server datatypes (e.g., datetime, nvarchar, money) may not map 1:1 in Databricks (which is typically based on Apache Spark's datatypes).

- Best Practice: Always review schema mappings post-ingestion to ensure that types like DATE, TIMESTAMP, and encoded strings are preserved or transformed as needed.
- Reference: Databricks documentation on Data Type Mapping
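To make the schema review concrete, the sketch below encodes a few common SQL Server to Spark SQL type correspondences. The table is illustrative and deliberately incomplete—verify the actual mappings against the Databricks documentation for your runtime before relying on them.

```python
# Illustrative (NOT exhaustive) mapping of SQL Server types to Spark SQL types
# as used by Databricks. Verify against the Databricks docs for your runtime.
SQLSERVER_TO_SPARK = {
    "bit": "BOOLEAN",
    "int": "INT",
    "bigint": "BIGINT",
    "decimal": "DECIMAL(p,s)",   # precision and scale carried over
    "money": "DECIMAL(19,4)",    # money has a fixed scale of 4 in SQL Server
    "float": "DOUBLE",
    "date": "DATE",
    "datetime": "TIMESTAMP",     # sub-second precision differs between systems
    "datetime2": "TIMESTAMP",
    "varchar": "STRING",
    "nvarchar": "STRING",        # Spark strings are UTF-8; no separate N-type
    "varbinary": "BINARY",
}


def map_type(sqlserver_type: str) -> str:
    """Return the suggested Spark SQL type, falling back to STRING when unknown."""
    return SQLSERVER_TO_SPARK.get(sqlserver_type.lower(), "STRING")
```

Falling back to STRING for unknown types is a conservative choice: it never loses data, but it does defer proper typing to a later transformation step.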
3. Initial Load vs. Incremental Load Pattern
The recommended approach is to perform a full (initial) load of the data first, followed by subsequent incremental loads to maintain current data with minimal processing.
a. Initial Load (No Filter)
For the initial load, retrieve all data (no WHERE clause).
SELECT * FROM dbo.Orders;
Database Query Component:
- Table: dbo.Orders
- Filter clause: None
b. Incremental Load (With Filter)
Incremental loads fetch only new or changed records using a filter—typically based on a datetime or incrementing key.
Example using a timestamp column (LastModifiedDate):
SELECT * FROM dbo.Orders WHERE LastModifiedDate > '2024-06-01T00:00:00';
- Update the filter value (watermark) after each successful load so that only new records are fetched in future increments.
Database Query Component:
- Table: dbo.Orders
- Filter clause: LastModifiedDate > ?
See Matillion's article on incremental load strategies for details on filter patterns.
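The watermark bookkeeping around each incremental run can be sketched as follows. The file-based state store here is an illustrative stand-in for whatever your pipeline actually uses (a Matillion pipeline variable, a control table, etc.); the LastModifiedDate column name matches the Orders example above.

```python
WATERMARK_FILE = "orders_watermark.txt"  # stand-in for a control table or pipeline variable


def read_watermark(path: str = WATERMARK_FILE) -> str:
    """Return the last saved watermark, or a floor value for the very first run."""
    try:
        with open(path) as f:
            return f.read().strip()
    except FileNotFoundError:
        return "1900-01-01T00:00:00"  # earlier than any real LastModifiedDate


def advance_watermark(loaded_rows, path: str = WATERMARK_FILE) -> str:
    """After a load commits, persist the max LastModifiedDate that was extracted.

    ISO 8601 timestamp strings sort lexicographically, so max() on the raw
    strings is safe as long as the format is consistent.
    """
    new_watermark = max(row["LastModifiedDate"] for row in loaded_rows)
    with open(path, "w") as f:
        f.write(new_watermark)
    return new_watermark
```

Only advance the watermark after the load has fully committed; advancing it before confirming success risks silently skipping rows if a load fails partway through.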
4. Using the Same Database Query Component
Use the same Database Query component for both initial and incremental loads.
- During the initial load, execute the query without a WHERE clause.
- For each incremental run, add the filter (e.g., WHERE condition based on the high-watermark column).
- This ensures consistency in your pipeline and simplifies management.
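The single-component pattern amounts to one query builder whose filter is simply optional: omitted for the initial load, supplied for every incremental run. A minimal sketch, reusing the hypothetical Orders example from above:

```python
from typing import Optional


def build_extract_query(table: str,
                        watermark_column: Optional[str] = None,
                        watermark_value: Optional[str] = None) -> str:
    """One builder for both modes: no filter -> initial load, filter -> incremental."""
    query = f"SELECT * FROM {table}"
    if watermark_column and watermark_value:
        # In a real pipeline, bind the value as a query parameter rather than
        # interpolating it, to avoid quoting and injection issues.
        query += f" WHERE {watermark_column} > '{watermark_value}'"
    return query
```

Usage: `build_extract_query("dbo.Orders")` produces the initial full load, while `build_extract_query("dbo.Orders", "LastModifiedDate", "2024-06-01T00:00:00")` produces the incremental variant—one code path, two behaviors.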
Data Integration Architecture
Loading data in advance of integration—an approach central to the Extract, Load, Transform (ELT) architecture—enables teams to divide and conquer the data integration challenge by splitting it into two manageable steps. By first loading raw data into the target Databricks database and subsequently performing integration and transformation, organizations can decouple the challenges of data movement from those of data enrichment. Data integration inherently requires transformation processes—ranging from data cleaning and normalization to more complex business logic—and the most effective way to manage these is through automated data transformation pipelines. A key advantage of the ELT architecture lies in performing both transformation and integration steps directly inside the Databricks environment. This not only eliminates the need for additional data processing infrastructure, but also leverages the inherent speed, flexibility, and scalability of Databricks, allowing transformations to be executed on-demand and at scale.