Integrate data from SAP HANA to Databricks using Matillion

Our SAP HANA to Databricks connector streamlines data transfer, ensuring your information is updated in Databricks within minutes—no manual coding or complex ETL processes required.


Extracting data from SAP HANA to Databricks

Extracting data from SAP HANA is a critical step for organizations seeking to leverage cloud-based analytics platforms such as Databricks. Whether you are performing an initial data migration or setting up ongoing data synchronization, establishing a reliable and efficient data pipeline is essential. This article provides practical guidance on moving data from SAP HANA to Databricks. We will begin by walking you through the process of creating a secure identity in SAP HANA, ensuring your extraction processes are both authorized and auditable. For teams utilizing Matillion, we will review how to check for, or acquire, the necessary JDBC driver to establish connectivity between the systems. Additionally, we will cover important considerations regarding network connectivity to ensure secure and performant data transfers. Finally, you will learn how to query and extract data—both for initial full loads and for incremental updates—laying the foundation for robust analytics in Databricks. Whether you are new to data integration or refining an established workflow, this article will equip you with the essential steps to connect SAP HANA and Databricks effectively.


What is SAP HANA?

SAP HANA is an in-memory, column-oriented relational database by SAP that processes high transactional and analytical workloads in real time by storing data in main memory, drastically reducing latency. Its architecture features data compression, parallel processing, predictive analytics, spatial data analysis, and multi-model data management. Dynamic tiering enables effective data aging and lifecycle management. With native SQL and support for multiple programming languages, SAP HANA underpins SAP business applications and serves as a robust platform for developing custom, data-intensive solutions by rapidly processing large volumes of structured data from varied sources.


What is Databricks?

Databricks provides a unified analytics platform optimized for big data processing and AI workflows, built atop Apache Spark. At its core, the Databricks database leverages scalable cloud storage and a managed Delta Lake, which brings ACID transactions, schema enforcement, and robust data versioning to data lakes. This architecture enables users to efficiently store, query, and analyze massive datasets using SQL, Python, R, or Scala, supporting both batch and real-time workloads. Tight integration with the Databricks Workspace allows seamless collaboration, while interoperability with a variety of data sources—such as cloud object stores and traditional databases—enables flexible, end-to-end data engineering and machine learning pipelines.

Why Move Data from SAP HANA into Databricks

Unlocking Analytical Power: The Case for Copying SAP HANA Data to Databricks

A data engineer or architect might choose to copy data from SAP HANA into Databricks for several compelling reasons. SAP HANA often stores highly valuable data collected from critical business processes, but its true potential is realized when this data is integrated with information from other sources—such as CRM systems, IoT platforms, or unstructured data repositories. By moving data into Databricks, organizations can leverage the platform’s scalable analytics and machine learning capabilities to blend and analyze these disparate datasets. Importantly, this approach ensures that the demanding workloads of advanced analytics and data processing are handled by Databricks rather than SAP HANA itself, thereby minimizing performance impacts and preserving HANA’s responsiveness for core transactional operations.

Creating a User in SAP HANA Database

To create a user in an SAP HANA database, you must have the necessary system privileges, such as USER ADMIN.

Below are the basic steps and an example SQL script for creating a new database user.

Prerequisites

  • You must be connected to the SAP HANA database as a user with the USER ADMIN privilege.
  • SQL commands can be executed using SAP HANA Studio, SAP HANA Database Explorer, or any client with SQL access to the database.

Example: Creating a New User

```sql
-- Replace 'NEWUSER' and 'UserSecurePassword123!' with your user and password of choice.
CREATE USER NEWUSER PASSWORD "UserSecurePassword123!";

-- Optionally, force a password change on first login.
ALTER USER NEWUSER FORCE PASSWORD CHANGE;

-- (Optional) Grant standard privileges to the new user.
-- For example, grant the ability to create objects in a schema:
GRANT CREATE ANY ON SCHEMA MYSCHEMA TO NEWUSER;

-- Grant user access to specific schemas if needed:
GRANT SELECT ON SCHEMA MYSCHEMA TO NEWUSER;
```

Important Notes

  • Usernames are not case-sensitive unless enclosed in double quotes.
  • Passwords must comply with the security policy defined for your SAP HANA system.
  • Additional privileges can be assigned to the user as required by their role.

Granting Roles

To grant predefined roles or custom roles to your new user:

```sql
-- Grant the MODELING predefined role to the user
GRANT MODELING TO NEWUSER;

-- Grant a custom role
GRANT MY_CUSTOM_ROLE TO NEWUSER;
```

Remember to follow your organization's security guidelines when assigning user privileges.

Installing the JDBC driver

At the time of writing, the SAP HANA JDBC driver is not bundled with Matillion Data Productivity Cloud due to licensing and redistribution restrictions. This means that users are responsible for manually obtaining and installing the driver before establishing connectivity to SAP HANA databases.

Step 1: Download the SAP HANA JDBC Driver

  1. Navigate to the official SAP HANA tools download page: https://tools.hana.ondemand.com/
  2. Locate the section for client drivers and libraries.
  3. Search specifically for the “Type 4 JDBC driver” for SAP HANA, as this is the preferred driver type that is 100% Java-based and platform-independent.
  4. Download the appropriate JDBC driver JAR file (commonly named `ngdbc.jar`).
  5. You may need an SAP account to access the download.

Step 2: Install the JDBC Driver in Matillion Data Productivity Cloud

After downloading the SAP HANA JDBC driver, you must upload it to your Matillion environment. Matillion provides instructions for uploading external drivers, which can be found here: https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/

Follow those instructions to ensure the driver is correctly uploaded and available to your Matillion Agent(s). Completing this step allows the Data Productivity Cloud to utilize the SAP HANA connection functionality.

Step 3: Reference and Use the JDBC Driver

Once installation is complete, refer to Matillion’s usage documentation to configure database query components with the SAP HANA driver: https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/

Be sure to follow the guidance in the documentation to test connectivity and confirm successful integration with your SAP HANA instance.
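When configuring the connection, the driver expects a JDBC URL. The sketch below (Python, purely illustrative) assembles a URL of the standard `jdbc:sap://<host>:<port>/?<options>` form; the hostname, port, and option names such as `databaseName` and `encrypt` are placeholder assumptions that should be checked against your driver version's documentation.

```python
# Illustrative helper for assembling an SAP HANA JDBC URL.
# Host, port, and option names here are example values, not
# authoritative defaults for your system.

def build_hana_jdbc_url(host: str, port: int, **options: str) -> str:
    url = f"jdbc:sap://{host}:{port}"
    if options:
        opts = "&".join(f"{k}={v}" for k, v in sorted(options.items()))
        url += f"/?{opts}"
    return url

# Example (hypothetical host and tenant database name):
print(build_hana_jdbc_url("hana.example.com", 30015,
                          databaseName="HDB", encrypt="true"))
```

Keeping URL construction in one place makes it easier to switch hosts or toggle options like encryption without editing every component that uses the connection.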

Checking network connectivity

To connect Matillion Data Productivity Cloud to your SAP HANA database, you must ensure that the SAP HANA instance is configured to accept incoming network connections from the Matillion agent; the specific firewall and allowlisting requirements depend on whether you run a Full SaaS or Hybrid SaaS deployment.

Additionally, if your SAP HANA connection string uses a DNS name, the agent (whether Full SaaS or Hybrid SaaS) must be able to resolve the hostname to the correct IP address. Ensure any necessary DNS records and resolution are in place for the agent environment.
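A quick pre-flight check from the agent environment can confirm both requirements, DNS resolution and TCP reachability, before you troubleshoot inside Matillion. The sketch below is illustrative; substitute your own hostname and SQL port (port numbers vary by SAP HANA instance number, so treat any specific value as an assumption).

```python
# Illustrative pre-flight check: can this environment resolve the
# SAP HANA hostname and open a TCP connection to its SQL port?
import socket

def check_connectivity(host: str, port: int, timeout: float = 5.0):
    try:
        ip = socket.gethostbyname(host)  # DNS resolution step
    except socket.gaierror as exc:
        return False, f"DNS resolution failed: {exc}"
    try:
        # TCP reachability step (no SQL handshake, just the socket)
        with socket.create_connection((ip, port), timeout=timeout):
            return True, f"Connected to {ip}:{port}"
    except OSError as exc:
        return False, f"TCP connection failed: {exc}"

# Example usage with a hypothetical host and port:
# ok, detail = check_connectivity("hana.example.com", 30015)
```

A failure in the first step points at DNS configuration; a failure in the second points at firewalls, security groups, or the HANA listener itself.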

Querying Data from SAP HANA Database

This guide explains how to query data from an SAP HANA database using SQL statements, with important considerations for data integration tools, incremental loads, and datatype compatibility with platforms like Databricks.

Example SAP HANA SQL SELECT Queries

Below are example SQL queries used to retrieve data from SAP HANA tables:

```sql
-- Retrieve all columns and rows from a table
SELECT * FROM SALES_DATA;

-- Select specific columns
SELECT ORDER_ID, CUSTOMER_NAME, ORDER_AMOUNT FROM SALES_DATA;

-- Use filters to retrieve only relevant records
SELECT * FROM SALES_DATA
WHERE ORDER_DATE >= '2024-01-01' AND STATUS = 'SHIPPED';

-- Aggregate data (e.g., total order value per customer)
SELECT CUSTOMER_ID, SUM(ORDER_AMOUNT) AS TOTAL_ORDERS
FROM SALES_DATA
GROUP BY CUSTOMER_ID;
```

Datatype Conversion Considerations

When extracting data from SAP HANA and loading it into a different platform such as Databricks, be aware that datatype conversion may occur. For example:

  • SAP HANA `NVARCHAR` may map to Databricks `STRING`.
  • SAP HANA `DECIMAL(10,2)` may map to Databricks `DECIMAL(10,2)` or `FLOAT`, depending on your ETL configuration.
  • SAP HANA `DATE` and `TIMESTAMP` should be checked for compatibility with your target system (e.g., Databricks `DATE` and `TIMESTAMP`).

Always validate datatype mappings to prevent runtime errors and data truncation.
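One lightweight way to make that validation repeatable is to keep the expected mappings in a small lookup table and check extracted column metadata against it. The sketch below is illustrative only: the mappings shown are typical defaults, not authoritative, and the mapping actually applied depends on your ETL tool's configuration.

```python
# Illustrative SAP HANA -> Databricks type mapping table.
# These entries are common defaults, not guaranteed behavior;
# validate against a sample load before relying on them.
HANA_TO_DATABRICKS = {
    "NVARCHAR": "STRING",
    "VARCHAR": "STRING",
    "DECIMAL(10,2)": "DECIMAL(10,2)",
    "INTEGER": "INT",
    "BIGINT": "BIGINT",
    "DATE": "DATE",
    "TIMESTAMP": "TIMESTAMP",
}

def map_type(hana_type: str) -> str:
    """Look up a target type, defaulting to STRING as a safe fallback."""
    return HANA_TO_DATABRICKS.get(hana_type.upper(), "STRING")
```

Falling back to `STRING` avoids load failures for unexpected types, at the cost of deferring proper typing to a later transformation step.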

Pattern: Initial Load and Incremental Load

The best practice for loading data involves:

  1. One-off Initial Load: Load the entire dataset from SAP HANA without filters.
  2. Incremental Loads: Periodically load only the new or changed records since the last successful extraction (using a filter).

Both scenarios can use the same Database Query component in your ETL tool (such as Matillion).

Initial Load Example

No filter—extract the full dataset:

```sql
SELECT * FROM SALES_DATA;
```

Incremental Load Example

Use a filter with a timestamp or ID to only retrieve new or updated records:

```sql
SELECT * FROM SALES_DATA
WHERE LAST_MODIFIED > '2024-06-07 00:00:00';
```

Alternatively, if using an increasing ID:

```sql
SELECT * FROM SALES_DATA
WHERE ORDER_ID > 123456;
```

For more in-depth guidance, see the article:
Incremental Load Data Replication Strategy
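The incremental pattern above depends on remembering the highest watermark seen so far. The sketch below shows that high-water-mark bookkeeping in Python; it is illustrative only, reusing the `SALES_DATA`/`LAST_MODIFIED` names from the examples and keeping state in a local JSON file, whereas a production pipeline would typically use a control table or a pipeline variable.

```python
# Illustrative high-water-mark bookkeeping for incremental loads.
# State is stored in a local file purely for demonstration.
import json
import pathlib

STATE_FILE = pathlib.Path("watermark.json")

def load_watermark(default: str = "1900-01-01 00:00:00") -> str:
    """Return the last successfully loaded timestamp, or a default."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["last_modified"]
    return default

def save_watermark(value: str) -> None:
    """Persist the new watermark after a successful load."""
    STATE_FILE.write_text(json.dumps({"last_modified": value}))

def incremental_query(watermark: str) -> str:
    """Build the filtered extraction query for records after the watermark."""
    return ("SELECT * FROM SALES_DATA "
            f"WHERE LAST_MODIFIED > '{watermark}'")
```

The essential discipline is to advance the watermark only after the load commits successfully, so a failed run simply re-extracts the same window on the next attempt.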

Data Integration Architecture

Loading data prior to integration represents an effective “divide and conquer” strategy, achieved by splitting the process into two distinct steps: extraction/loading and subsequent transformation. This approach characterizes the ELT (Extract, Load, Transform) architecture, where data is first loaded in its raw form into the Databricks database. Crucially, data integration inherently requires transformation—for which purpose well-designed data transformation pipelines offer the best practice, ensuring modular, repeatable, and robust processing of large datasets. Another significant advantage of the ELT approach is that all data transformation and integration activities are performed directly within the target Databricks environment. This not only offers on-demand execution and seamless scalability, but also eliminates the need for separate, often costly, external data processing infrastructure. The result is a faster and more efficient integration workflow that capitalizes on the underlying power of Databricks.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.