Extracting data from DynamoDB to Snowflake
Extracting data from DynamoDB is a critical task for organizations seeking to leverage cloud-native NoSQL data in downstream analytics platforms such as Snowflake. A smooth and secure data transfer process enables deeper insights by combining operational data with other sources in a centralized data warehouse environment. In this article, we will guide you through the essential steps required to extract data from Amazon DynamoDB and load it efficiently into Snowflake. We will start by describing how to create an identity in DynamoDB to ensure proper authentication and permissions. For those using Matillion ETL, we will cover how to check for, or obtain, the required JDBC driver for connectivity. We will also discuss best practices for ensuring stable and secure network connectivity between your source (DynamoDB) and target (Snowflake) environments. Finally, we will address the mechanisms needed for extracting data—both initial, full-load processes, and efficient incremental querying—to keep your Snowflake environment refreshed with the latest changes. Follow along to master each stage of this data integration workflow.
What is DynamoDB?
Amazon DynamoDB is a fully managed, serverless NoSQL database offering high performance at any scale. It supports key-value and document data models, predictable low-latency access, automatic scaling, and high availability with multi-AZ replication. With strong or eventual consistency options, DynamoDB abstracts infrastructure management, and developers access it via RESTful APIs or AWS SDKs. Key features include fine-grained AWS IAM access control, DynamoDB Streams for event processing, and easy AWS Lambda integration. These capabilities make DynamoDB ideal for rapid, scalable data requirements in use cases like IoT, gaming, mobile backends, and real-time analytics.
What is Snowflake?
Snowflake is a cloud-native data platform designed to provide scalable, performant, and secure data warehousing and analytics services. Built on a multi-cluster, shared data architecture, Snowflake separates storage and compute, enabling independent scaling of resources to accommodate varying workload demands without downtime. It supports both structured and semi-structured data (such as JSON, Avro, and Parquet) and offers ANSI-standard SQL interfaces for seamless data processing and querying. Snowflake also features robust data sharing, advanced security controls, near-zero administration, and automatic maintenance, making it particularly suitable for organizations seeking rapid deployment and integration with modern data ecosystems across major cloud providers (AWS, Azure, and Google Cloud).
Why Move Data from DynamoDB into Snowflake
The Value of Replicating DynamoDB Data to Snowflake for Advanced Analytics
A data engineer or architect may wish to copy data from DynamoDB into Snowflake for several compelling reasons. First, DynamoDB often contains operational or transactional data that holds significant business value, yet its full potential can be realized only when integrated with data from other sources. By consolidating this valuable data within Snowflake, organizations can perform advanced analytics and gain richer insights that are simply not possible within the isolated environment of DynamoDB. Moreover, leveraging Snowflake’s scalable compute and robust data integration capabilities allows for such analytics without burdening the DynamoDB service, thus maintaining application performance and reliability. In summary, offloading data to Snowflake enables comprehensive analysis and integration—unlocking the value of data in DynamoDB—while preserving the efficiency of operational databases.
Similar connectors
Creating an Identity in DynamoDB
DynamoDB is an AWS-managed NoSQL database and does not have its own concept of "users" inside the database, as a relational database might. Instead, authentication and resource access are managed outside DynamoDB using AWS Identity and Access Management (IAM). A "user" is typically an IAM entity, and access to tables, items, or attributes in DynamoDB is controlled through IAM policies.
The typical process to enable a user (such as an application, developer, or external service) to interact with DynamoDB involves the following steps:
1. Create an IAM User
You create a DynamoDB "user" by defining an IAM user within AWS.
Using the AWS CLI
```bash
aws iam create-user --user-name MyDynamoDBUser
```
2. Attach a Policy Granting DynamoDB Access
Attach a managed or custom policy to the user. Example policy to allow access to a single DynamoDB table:
```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "dynamodb:GetItem",
      "dynamodb:PutItem",
      "dynamodb:Query",
      "dynamodb:UpdateItem",
      "dynamodb:DeleteItem",
      "dynamodb:Scan"
    ],
    "Resource": "arn:aws:dynamodb:us-east-1:YOUR_ACCOUNT_ID:table/YourTableName"
  }]
}
```
You can attach this policy using the AWS CLI (save the above JSON into a file named `policy.json` beforehand):

```bash
aws iam put-user-policy --user-name MyDynamoDBUser \
  --policy-name DynamoDBAccessPolicy \
  --policy-document file://policy.json
```
3. (Optional) Create Access Keys
To programmatically access DynamoDB (e.g., via the AWS SDK), generate an access key and secret:
```bash
aws iam create-access-key --user-name MyDynamoDBUser
```
Take note of the `AccessKeyId` and `SecretAccessKey` returned. These will be used for API/SDK access.
4. Use IAM User Credentials to Interact With DynamoDB
With the access keys, SDKs like boto3 (Python), AWS SDK for JavaScript, or AWS CLI can authenticate and interact with the table according to the attached policy.
Example with AWS CLI
```bash
aws dynamodb list-tables --profile my_dynamodb_user_profile
```

or specify credentials directly:

```bash
AWS_ACCESS_KEY_ID=AKIA... AWS_SECRET_ACCESS_KEY=ABCD... \
  aws dynamodb scan --table-name YourTableName
```
Note:
- DynamoDB does not support SQL for user management because user authentication and authorization are entirely managed by IAM, outside the database itself.
- For fine-grained access control, consider using DynamoDB Condition Keys in your IAM policies.
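As an illustration of the condition-key approach, the sketch below builds a policy document using `dynamodb:LeadingKeys`, which restricts access to items whose partition key matches the caller's identity. The table ARN and the use of the `${aws:userid}` variable are illustrative assumptions, not a prescription.

```python
# Sketch: a fine-grained IAM policy using the dynamodb:LeadingKeys condition
# key, restricting reads to items whose partition key equals the caller's
# IAM user id. The table ARN and policy variable are illustrative.
import json

def leading_keys_policy(table_arn: str) -> dict:
    """Build a policy allowing reads only on items keyed by the caller's id."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["dynamodb:GetItem", "dynamodb:Query"],
            "Resource": table_arn,
            "Condition": {
                "ForAllValues:StringEquals": {
                    "dynamodb:LeadingKeys": ["${aws:userid}"]
                }
            },
        }],
    }

policy = leading_keys_policy(
    "arn:aws:dynamodb:us-east-1:YOUR_ACCOUNT_ID:table/YourTableName")
print(json.dumps(policy, indent=2))
```

The generated document can be attached with `aws iam put-user-policy`, exactly like the broader policy shown earlier.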
Installing the JDBC driver
At the time of writing, the JDBC driver for Amazon DynamoDB is not shipped with Matillion Data Productivity Cloud by default. This decision is due to licensing or redistribution restrictions imposed by the driver vendor. Therefore, users are required to manually download and install the driver before establishing a JDBC connection to DynamoDB within Matillion.
To obtain the JDBC driver, visit the official Simba website: https://www.simba.com/drivers/dynamodb-jdbc-odbc/. When selecting a driver, prioritize the Type 4 JDBC version. Type 4 drivers are pure Java-based and do not depend on native libraries, which enhances compatibility and simplifies deployment within the Matillion environment.
After downloading the appropriate driver (typically delivered as a JAR file), you need to upload it to your Matillion Data Productivity Cloud instance. Detailed installation instructions are provided at: https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/
Once the JDBC driver has been successfully uploaded and installed, you can proceed to configure and use it within your Matillion projects. For step-by-step usage guidance, including how to set up database queries against DynamoDB using the installed JDBC driver, refer to: https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/
Checking network connectivity
To ensure Matillion Data Productivity Cloud can connect to your DynamoDB database, you must verify that the database allows incoming network connections from the correct sources, depending on your deployment configuration:
- Full SaaS Agent Configuration: When using a Full SaaS agent, you need to permit connectivity from the IP addresses used by Matillion servers. The list of IP addresses you must allow is available at: Matillion IP Addresses.
- Hybrid SaaS Deployment: For a Hybrid SaaS configuration, you must allow connections from your own Virtual Private Cloud (VPC), since the agent runs within your network environment. To verify connectivity and assist with setup, you can use the utilities provided here: Matillion Exchange – Check Network Access.
In addition, if you reference your DynamoDB database using a DNS hostname, ensure that the agent—whether Full SaaS or Hybrid SaaS—can successfully resolve the DNS address. This is necessary for establishing a reliable connection to the DynamoDB endpoint.
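A quick way to confirm the DNS-resolution requirement is a small resolution check from the environment where the agent runs. This is a generic sketch; the commented hostname follows the standard regional DynamoDB endpoint pattern and should be replaced with the endpoint you actually use.

```python
# Quick DNS-resolution check, as a stand-in for verifying that the agent's
# environment can resolve the DynamoDB endpoint's hostname.
import socket

def can_resolve(hostname: str) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, 443)) > 0
    except socket.gaierror:
        return False

# Regional DynamoDB endpoints follow the pattern below; substitute your region:
# can_resolve("dynamodb.us-east-1.amazonaws.com")
```

If resolution fails here, fix DNS in the agent's network before troubleshooting the JDBC connection itself.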
How to Query Data from a DynamoDB Database
This guide explains how to query data from a DynamoDB database using SQL-like syntax and outlines best practices for loading this data into analytical environments.
1. SQL SELECT Statement Examples for DynamoDB
While DynamoDB does not natively support SQL, you can use tools and connectors (e.g., AWS Glue, Athena, or 3rd party ETL platforms) to query DynamoDB tables using SQL-like syntax. Here are example queries as they would be expressed in SQL:
```sql
-- Select all items from a DynamoDB table
SELECT * FROM dynamodb_table;

-- Select specific columns (attributes)
SELECT user_id, created_at, status FROM dynamodb_table;

-- Query with a filter (e.g., status is 'active')
SELECT * FROM dynamodb_table WHERE status = 'active';

-- Query with a sort key and partition key (if used)
SELECT * FROM dynamodb_table
WHERE user_id = '12345' AND created_at > '2023-11-01T00:00:00Z';
```

Note: These SQL statements illustrate the kinds of queries you might express via a connector (such as Matillion, AWS Glue, or Snowflake external tables) that supports access to DynamoDB.
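It is worth noting that DynamoDB does offer a SQL-compatible dialect of its own, PartiQL, issued through the ExecuteStatement API. The sketch below builds a PartiQL statement equivalent to the filtered query above; the table and attribute names are the same placeholders used throughout this section.

```python
# Build a PartiQL statement (DynamoDB's SQL-compatible dialect, issued via
# the ExecuteStatement API). Table and attribute names are placeholders.

def active_since(table: str, ts: str) -> str:
    """Return a PartiQL SELECT filtering on status and a timestamp attribute."""
    return (
        f'SELECT * FROM "{table}" '
        f"WHERE status = 'active' AND created_at > '{ts}'"
    )

statement = active_since("dynamodb_table", "2023-11-01T00:00:00Z")
# With boto3 this could be executed as (sketch, requires credentials):
#   boto3.client("dynamodb").execute_statement(Statement=statement)
```

For production use, prefer the API's parameter placeholders over string interpolation to avoid malformed statements.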
2. Datatype Conversion Considerations
When extracting data from DynamoDB and ingesting it into platforms like Snowflake, some datatype conversions commonly occur:
| DynamoDB Type | Typical Snowflake Conversion |
|---|---|
| String (S) | VARCHAR |
| Number (N) | NUMBER, FLOAT, INTEGER |
| Boolean (BOOL) | BOOLEAN |
| String Set (SS) | ARRAY |
| Number Set (NS) | ARRAY |
| Map (M) | VARIANT (JSON object) |
| List (L) | ARRAY or VARIANT |
| Binary (B) | BINARY, VARIANT |
Some transformation may be needed to align attribute types between DynamoDB and Snowflake, especially for collection or JSON-like types (Map, List).
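The conversions in the table above can be sketched as a small unwrapping function over DynamoDB's wire-format attribute values (e.g. `{"N": "42"}`). This is a minimal, pure-Python illustration of what a connector does internally; the item shown is invented for the example, and Binary (B) handling is omitted.

```python
# Minimal sketch of the datatype conversions above: unwrap DynamoDB
# wire-format attribute values into plain Python values that map
# naturally onto Snowflake types. Binary (B) is omitted for brevity.
from decimal import Decimal

def unwrap(av: dict):
    """Convert one DynamoDB attribute value into a plain Python value."""
    (tag, value), = av.items()
    if tag == "S":
        return value                       # -> VARCHAR
    if tag == "N":
        return Decimal(value)              # -> NUMBER
    if tag == "BOOL":
        return value                       # -> BOOLEAN
    if tag in ("SS", "NS"):
        return list(value)                 # -> ARRAY
    if tag == "L":
        return [unwrap(v) for v in value]  # -> ARRAY / VARIANT
    if tag == "M":
        return {k: unwrap(v) for k, v in value.items()}  # -> VARIANT
    if tag == "NULL":
        return None
    raise ValueError(f"unhandled DynamoDB type tag: {tag}")

item = {"user_id": {"S": "12345"}, "score": {"N": "42"},
        "tags": {"SS": ["a", "b"]}, "meta": {"M": {"active": {"BOOL": True}}}}
plain = {k: unwrap(v) for k, v in item.items()}
```

Nested Maps and Lists unwrap recursively, which is why they land in Snowflake as VARIANT or ARRAY rather than scalar columns.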
3. Recommended Data Loading Patterns
The recommended approach for analytics is to perform an initial once-off full load, followed by regular incremental loads. This ensures that all data is ingested, and subsequently, only changes (new or updated records) are fetched.
- Use the same Database Query component for both types of loads.
- Refer to Matillion's best practices on Incremental Load & Data Replication Strategy.
Initial Load
- Description: Loads all data from DynamoDB into your target system.
- Implementation: Use the Database Query component with no filter clause.
- SQL Representation: `SELECT * FROM dynamodb_table;`
Incremental Load
- Description: Loads only new or changed records since the last load, reducing transfer and processing time.
- Implementation: Use the Database Query component with a filter clause (WHERE) based on a suitable column (e.g., timestamp or modification attribute).
- SQL Representation: `SELECT * FROM dynamodb_table WHERE updated_at > '2024-03-01T00:00:00Z';`
- The value in the filter should be dynamically replaced with the last successful load timestamp.
Refer to further guidance from Matillion's Exchange Article.
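The incremental pattern above can be sketched as a simple high-water-mark mechanism: persist the timestamp of the last successful load and substitute it into the filter on the next run. The file name, table name, and `updated_at` column are placeholder assumptions from the examples above, not part of any Matillion API.

```python
# Sketch of the incremental-load pattern: keep a high-water mark (the last
# successful load's timestamp) and build the next extraction query from it.
# File, table, and column names are illustrative placeholders.
import json
import pathlib

STATE_FILE = pathlib.Path("last_load.json")

def load_watermark(default: str = "1970-01-01T00:00:00Z") -> str:
    """Read the last successful load's timestamp, or fall back to the epoch."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["updated_at"]
    return default

def incremental_query(watermark: str) -> str:
    """Build the filtered extraction query from the stored watermark."""
    return f"SELECT * FROM dynamodb_table WHERE updated_at > '{watermark}'"

def save_watermark(ts: str) -> None:
    """Persist the new watermark after a successful load."""
    STATE_FILE.write_text(json.dumps({"updated_at": ts}))

# The first run falls back to the epoch default, i.e. a full load:
query = incremental_query(load_watermark())
```

Because the default watermark is the epoch, the very first run naturally behaves as the initial full load, unifying both patterns in one component, as recommended above.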
Data Integration Architecture
An important advantage of the ELT (Extract, Load, Transform) architecture is its "divide and conquer" approach, where raw data is first loaded into the Snowflake database in advance of integration. By splitting the process into two distinct steps—loading and then integrating—organizations gain flexibility and efficiency in managing large, complex datasets. Data integration inevitably requires data transformation, and the most effective way to achieve this within Snowflake is through purpose-built data transformation pipelines. These pipelines automate and orchestrate the required data preparations, such as cleansing, normalization, and enrichment, in a repeatable manner. Another significant benefit of the ELT model is that the transformation and integration occur directly inside the Snowflake database itself. This enables fast, on-demand, and highly scalable processing, while also eliminating the need to pay for and maintain additional external data processing infrastructure.