Extracting data from DynamoDB to Databricks
Extracting data from DynamoDB is an essential step for organizations looking to unlock deeper analytics or cross-platform integration, especially when using advanced platforms such as Databricks. This article guides you through the complete process of transferring data from DynamoDB to Databricks, ensuring that each critical requirement is addressed for a successful and secure transfer.

We will begin by outlining how to create an identity in DynamoDB, ensuring your extraction process has the appropriate permissions. For users leveraging Matillion, we explain how to check for an existing JDBC driver or acquire the correct version to enable connectivity. Network connectivity between your source (DynamoDB) and target (Databricks) is a key consideration, and we will detail best practices to configure secure communication. Finally, we will cover how to query and load data—starting with an initial extraction and then implementing incremental data loads for ongoing synchronization. By following these steps, you will be able to effectively export your data from DynamoDB and leverage it within Databricks for advanced processing and analytics.
What is DynamoDB?
Amazon DynamoDB is a fully managed NoSQL database offering fast, predictable single-digit millisecond performance at any scale. It supports key-value and document data models, handling massive data volumes across distributed systems. DynamoDB automates hardware provisioning, replication, partitioning, sharding, and provides on-demand backup, restore, and in-memory caching with DAX. It offers fine-grained access control using AWS IAM, native ACID transactions, and integrates with AWS services for event-driven workflows and analytics. These features enable developers to optimize performance and reduce operational overhead with seamless scalability and flexibility.
What is Databricks?
Databricks, built on the Lakehouse Platform and powered by Delta Lake, unifies data warehouse reliability with data lake scalability. It supports ACID transactions, scalable metadata, and unified batch/stream processing in an open format. Integration with Spark SQL, MLlib, and popular BI tools enables seamless analytics and machine learning workflows. Key management features include automated optimization, fine-grained access controls, and lineage tracking. These ensure secure, performant, large-scale data operations across both cloud-native and multi-cloud environments, while maintaining robust connectivity to external data sources.
Why Move Data from DynamoDB into Databricks
Unlocking Advanced Analytics: Copying Data from DynamoDB to Databricks
A data engineer or architect may choose to copy data from DynamoDB into Databricks for several compelling reasons. Firstly, DynamoDB often stores data that is potentially valuable for analytics, machine learning, and business intelligence efforts. However, the full value of this data is best realized when it is integrated and analyzed alongside information from other sources. Databricks provides a unified analytics platform that enables seamless data integration and advanced analytics, allowing users to derive richer insights by combining DynamoDB data with other datasets. Furthermore, transferring data into Databricks for processing ensures that the computational workload required for complex queries and transformations does not impact the performance or availability of the operational DynamoDB database, thereby maintaining the reliability of production systems while still enabling robust analytics.
Creating a User Identity in DynamoDB
DynamoDB is a NoSQL database and—as such—does not implement user accounts within tables, as found in relational databases. Instead, user management in AWS is handled via IAM (Identity and Access Management) for access controls, while user-related data must be modeled explicitly as part of your application-layer data (i.e., stored in a "Users" table you define). This guide describes how to create a "Users" table and insert a new user record, using both AWS Management Console and AWS CLI. SQL scripts are provided by way of PartiQL, AWS’s SQL-compatible query language for DynamoDB.
1. Create a "Users" Table
You may define a table with a primary key suitable for distinguishing user identities (such as userId or email).
Using AWS Management Console
- Sign in to the DynamoDB Console.
- Click Create table.
- Set Table name to Users.
- Set Partition key to a unique identifier, such as UserId (Type: String).
- (Optional) Configure additional settings as needed.
- Click Create table.
Using AWS CLI
aws dynamodb create-table \
--table-name Users \
--attribute-definitions AttributeName=UserId,AttributeType=S \
--key-schema AttributeName=UserId,KeyType=HASH \
--provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5
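The same table can also be created programmatically. A minimal sketch using boto3, the AWS SDK for Python (assuming boto3 is installed and AWS credentials are configured), mirroring the CLI call above:

```python
def users_table_spec(table_name="Users"):
    """Build create_table parameters matching the CLI example above."""
    return {
        "TableName": table_name,
        "AttributeDefinitions": [
            {"AttributeName": "UserId", "AttributeType": "S"},
        ],
        "KeySchema": [
            {"AttributeName": "UserId", "KeyType": "HASH"},
        ],
        "ProvisionedThroughput": {
            "ReadCapacityUnits": 5,
            "WriteCapacityUnits": 5,
        },
    }


def create_users_table():
    """Create the table; requires credentials allowing dynamodb:CreateTable."""
    import boto3  # imported here so the spec builder stays dependency-free

    client = boto3.client("dynamodb")
    return client.create_table(**users_table_spec())
```

Separating the parameter dictionary from the API call keeps the table definition easy to review and reuse.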
2. Insert a User Record
You can now insert an item representing a new user identity.
Using PartiQL (via AWS Console, CLI, or API)
INSERT INTO "Users" VALUE {
'UserId': 'u123456789',
'Email': '[email protected]',
'Name': 'Alice Smith',
'CreatedAt': '2024-06-26T11:15:00Z'
}
In the AWS Console
- Select the Users table.
- Choose the Explore table items tab.
- Click PartiQL editor.
- Paste the above INSERT statement.
- Click Run statement.
Using AWS CLI
aws dynamodb put-item \
--table-name Users \
--item '{
"UserId": {"S": "u123456789"},
"Email": {"S": "[email protected]"},
"Name": {"S": "Alice Smith"},
"CreatedAt": {"S": "2024-06-26T11:15:00Z"}
}'
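The nested `{"S": ...}` structure in the CLI call is DynamoDB's wire-level attribute-value format. A small illustrative helper (a sketch covering only strings, numbers, and booleans; boto3's higher-level Table resource performs this conversion automatically) can build it from a plain Python dict:

```python
def to_dynamodb_item(record):
    """Marshal a flat dict into DynamoDB attribute-value format.

    Handles strings (S), numbers (N), and booleans (BOOL) only.
    """
    item = {}
    for key, value in record.items():
        if isinstance(value, bool):           # check bool before int/float
            item[key] = {"BOOL": value}
        elif isinstance(value, (int, float)):
            item[key] = {"N": str(value)}     # DynamoDB transmits numbers as strings
        else:
            item[key] = {"S": str(value)}
    return item


user_item = to_dynamodb_item({"UserId": "u123456789", "Name": "Alice Smith"})
```

The resulting dictionary can be passed as the `Item` argument of a low-level `put_item` call.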
Note: For application-level user management and authentication, consider integrating AWS Cognito or other authentication solutions. DynamoDB's role is to persist user data defined by your schema; controlling access to this data is handled separately via IAM policies or AWS services.
Installing the JDBC Driver
The Matillion Data Productivity Cloud enables connectivity to a wide range of data sources using JDBC drivers. However, due to licensing and redistribution restrictions, the JDBC driver for Amazon DynamoDB is not included by default in the platform. To leverage Matillion’s database orchestration capabilities with DynamoDB, you must manually download and install the appropriate JDBC driver.
Step 1: Download the DynamoDB JDBC Driver
- Visit the Simba Technologies website at: https://www.simba.com/drivers/dynamodb-jdbc-odbc/
- Locate the download section for the DynamoDB driver.
- Select the Type 4 JDBC driver option (this is a pure Java driver with no client-side dependencies and is recommended for Matillion deployments).
- Complete any registration or agreement forms required by Simba before downloading.
- Download the JDBC driver JAR file to your local machine.
Step 2: Upload the JDBC Driver to Matillion Data Productivity Cloud
Matillion provides clear instructions on how to add external JDBC drivers to your workspace. Please follow the official Matillion documentation for the most up-to-date steps:
https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/
- Prepare the previously downloaded JAR file for upload.
- Follow the instructions to upload the JDBC driver to your Matillion Agent or specified driver repository.
- If prompted, restart the Matillion Agent or Designer session to register the new driver.
Step 3: Connect to DynamoDB in Matillion
Once the JDBC driver is installed, you can reference and use it within database orchestrations and data loading components.
Consult the Matillion documentation for guidance on configuring and utilizing JDBC-enabled connectors:
https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/
This will assist you in setting up the connection properties, authentication, and crafting queries for DynamoDB in Matillion.
Reminder: Always verify you are using the most recent and compatible version of both the JDBC driver and Matillion platform for best results.
Checking network connectivity
To enable proper communication between Matillion Data Productivity Cloud and your DynamoDB database, you must ensure that the DynamoDB instance accepts incoming connections according to your deployment configuration:
- Full SaaS Agent Configuration: Make sure your DynamoDB database permits incoming connections from the IP addresses used by Matillion’s Full SaaS agents. You can find the current list of IP addresses that require access here: Matillion Allowed IP Addresses.
- Hybrid SaaS Deployment: For Hybrid SaaS configurations, your DynamoDB database should allow incoming connections from the IP ranges associated with your own Virtual Private Cloud (VPC). To assist with verifying this connectivity and identifying your outbound addresses, you can use the tools provided at: Matillion Exchange: Check Network Access.
Additionally, if your DynamoDB database is addressed using DNS, it is essential that the Full SaaS or Hybrid SaaS agent can resolve (i.e., perform DNS lookups for) the provided address. Make sure that your selected configuration supports DNS resolution for the DynamoDB endpoint.
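A quick way to verify both requirements from the agent's host is a short script that first resolves the endpoint's DNS name and then attempts a TCP connection. A sketch (the commented endpoint follows the standard regional DynamoDB hostname pattern; substitute your own region and endpoint):

```python
import socket


def check_endpoint(host, port=443, timeout=5):
    """Check DNS resolution, then TCP reachability, for an endpoint."""
    try:
        ip = socket.gethostbyname(host)  # the DNS-resolution requirement
    except socket.gaierror:
        return {"resolved": False, "reachable": False, "ip": None}
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return {"resolved": True, "reachable": True, "ip": ip}
    except OSError:
        return {"resolved": True, "reachable": False, "ip": ip}


# Example: the regional DynamoDB endpoint is served over HTTPS (port 443)
# status = check_endpoint("dynamodb.us-east-1.amazonaws.com")
```

A result of `resolved: False` points to a DNS configuration issue, while `reachable: False` points to a firewall or allow-list problem.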
Querying Data from DynamoDB
This guide explains how to query data from an Amazon DynamoDB database for technical users, providing SQL-like query examples, discussing datatype conversion considerations, and describing best practices for initial and incremental loads (following Matillion's guidance on incremental load strategies).
1. DynamoDB Query Syntax Compared With SQL SELECT
DynamoDB uses its own query API, with efficient access typically keyed on the primary key. For integration with tools like Databricks or ETL platforms, however, SQL-style SELECT statements are shown alongside each example for clarity:
Example table structure:
- Table: Orders
- Primary Key: OrderID (string)
- Sort Key: OrderDate (string)
- Attributes: CustomerID, TotalAmount, Status
a. Selecting All Records
SQL Equivalent:
SELECT * FROM Orders;

DynamoDB Query (Python):

response = table.scan()

For integrations via ODBC/JDBC connectors or Databricks, the SQL SELECT statement can be used directly.
b. Filtering on Partition Key
SQL Equivalent:
SELECT * FROM Orders WHERE OrderID = '12345';

DynamoDB Query (Python):

from boto3.dynamodb.conditions import Key

response = table.query(
    KeyConditionExpression=Key('OrderID').eq('12345')
)
c. Filtering by Other Attributes (using a Filter Expression)
SQL Equivalent:
SELECT * FROM Orders WHERE Status = 'SHIPPED';

DynamoDB Query (Python):

from boto3.dynamodb.conditions import Attr

response = table.scan(
    FilterExpression=Attr('Status').eq('SHIPPED')
)
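One caveat with scans: each call returns at most 1 MB of data, so a full extraction must follow the LastEvaluatedKey pagination token until it disappears from the response. A sketch of a helper that exhausts the scan (`table` is a boto3 Table resource, as in the examples above):

```python
def scan_all(table, **scan_kwargs):
    """Exhaust a DynamoDB scan, following LastEvaluatedKey pagination.

    Each scan call returns at most 1 MB of data, so a full export
    loops until no LastEvaluatedKey is present in the response.
    """
    items = []
    response = table.scan(**scan_kwargs)
    items.extend(response.get("Items", []))
    while "LastEvaluatedKey" in response:
        response = table.scan(
            ExclusiveStartKey=response["LastEvaluatedKey"], **scan_kwargs
        )
        items.extend(response.get("Items", []))
    return items
```

Any FilterExpression is passed through `scan_kwargs`, so the same helper serves both full and filtered extractions.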
2. Datatype Conversion: DynamoDB ⇆ Databricks
When moving data between DynamoDB (a NoSQL, schema-less database) and Databricks (typically Spark with strict schemas), automatic datatype conversion may occur:
- DynamoDB types: String, Number, Boolean, Binary, List, Map, Null
- Databricks/Spark SQL types: StringType, IntegerType, DoubleType, BooleanType, ArrayType, MapType, NullType
Example:
DynamoDB does not have a Date type; dates are stored as String or Number (timestamp), which must be converted explicitly into date/timestamp types in Databricks.
-- In Databricks SQL
SELECT CAST(OrderDate AS DATE) AS Order_Date
FROM orders;
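If the conversion happens before loading rather than in Databricks SQL, the same cast can be sketched in Python. This helper assumes dates are stored either as ISO-8601 strings (as in the examples above) or as epoch-second numbers, and normalizes both to a timezone-aware datetime:

```python
from datetime import datetime, timezone


def parse_order_date(value):
    """Normalize a DynamoDB-stored date to a timezone-aware datetime.

    DynamoDB has no native date type: accepts ISO-8601 strings
    (e.g. '2024-06-26T11:15:00Z') or epoch-second numbers.
    """
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc)
    # fromisoformat() before Python 3.11 does not accept a trailing 'Z'
    return datetime.fromisoformat(str(value).replace("Z", "+00:00"))
```

Normalizing to one representation before the load keeps the Databricks-side schema simple and the CAST reliable.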
3. Best Pattern: Initial and Incremental Loads
The recommended loading pattern is:
- Once-off Initial Load: Extract all records from the DynamoDB table.
- Incremental Load(s): After the initial load, extract only records that have changed or been added since the last extraction.
Both load types use the same Database Query component (e.g., Matillion's Database Query).
The difference is in the WHERE/filter clause.
a. Initial Load
- No filter clause: fetches all records for the first import.
- Use when the target warehouse is empty.
SQL Equivalent:
SELECT * FROM Orders;

DynamoDB Example (Full Scan, Python):

response = table.scan()

In Matillion: no WHERE clause.
b. Incremental Load
- Includes a filter clause—generally using a "last updated" or timestamp column—to fetch only new or updated records.
- Efficient for routine ETL loads.
SQL Equivalent:
SELECT * FROM Orders WHERE OrderDate > '2024-01-01';

DynamoDB Example (with Filter, Python):

from boto3.dynamodb.conditions import Attr

response = table.scan(
    FilterExpression=Attr('OrderDate').gt('2024-01-01')
)
In Matillion, set a WHERE clause based on the tracked column.
For more detailed incremental load strategies, refer to Matillion's exchange article.
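The incremental pattern above can be sketched end-to-end in plain Python. This is illustrative only (hypothetical helper names, not a Matillion API); it assumes the tracked column holds ISO-8601 timestamp strings, which sort lexicographically in the same order as the dates they represent:

```python
def filter_incremental(items, watermark, column="OrderDate"):
    """Keep only records strictly newer than the watermark.

    ISO-8601 timestamps compare correctly as strings, so this matches
    the SQL filter OrderDate > '<watermark>'.
    """
    return [item for item in items if item.get(column, "") > watermark]


def next_watermark(items, previous, column="OrderDate"):
    """After a successful load, advance the watermark to the newest value seen."""
    values = [item[column] for item in items if column in item]
    return max(values + [previous])
```

Persisting the watermark between runs (e.g. in a control table) is what turns repeated extractions into true incremental loads.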
Data Integration Architecture
Loading data in advance of integration is a hallmark of the ELT (Extract, Load, Transform) architecture, offering a divide-and-conquer strategy by splitting the integration process into two distinct steps: first loading the raw data into the Databricks database, then performing transformation and integration. This staged approach enables more flexible and efficient data processing, as transformations can be applied as needed after the data has been stored in the target system.

Data integration invariably involves data transformation, and the most effective method for handling these operations is through purpose-built data transformation pipelines. These pipelines ensure reliability, maintainability, and scalability as enterprise data volumes grow. Another significant advantage of the ELT approach is that all transformation and integration steps occur inside the target Databricks environment itself. This means data transformations are fast, can be performed on demand, and take advantage of Databricks’ scalable compute resources; as a result, there is no need to maintain separate data processing infrastructure, keeping costs and operational complexity to a minimum.