Integrate data from DynamoDB to Snowflake using Matillion

Our DynamoDB to Snowflake connector enables seamless data transfer to Snowflake within minutes, ensuring up-to-date information without the need for manual coding or complex ETL processes.


Extracting data from DynamoDB to Snowflake

Extracting data from DynamoDB is a critical task for organizations seeking to leverage cloud-native NoSQL data in downstream analytics platforms such as Snowflake. A smooth and secure data transfer process enables deeper insights by combining operational data with other sources in a centralized data warehouse environment. In this article, we will guide you through the essential steps required to extract data from Amazon DynamoDB and load it efficiently into Snowflake. We will start by describing how to create an identity in DynamoDB to ensure proper authentication and permissions. For those using Matillion ETL, we will cover how to check for, or obtain, the required JDBC driver for connectivity. We will also discuss best practices for ensuring stable and secure network connectivity between your source (DynamoDB) and target (Snowflake) environments. Finally, we will address the mechanisms needed for extracting data—both initial, full-load processes, and efficient incremental querying—to keep your Snowflake environment refreshed with the latest changes. Follow along to master each stage of this data integration workflow.


What is DynamoDB?

Amazon DynamoDB is a fully managed, serverless NoSQL database offering high performance at any scale. It supports key-value and document data models, predictable low-latency access, automatic scaling, and high availability with multi-AZ replication. With strong or eventual consistency options, DynamoDB abstracts infrastructure management, and developers access it via RESTful APIs or AWS SDKs. Key features include fine-grained AWS IAM access control, DynamoDB Streams for event processing, and easy AWS Lambda integration. These capabilities make DynamoDB ideal for rapid, scalable data requirements in use cases like IoT, gaming, mobile backends, and real-time analytics.


What is Snowflake?

Snowflake is a cloud-native data platform designed to provide scalable, performant, and secure data warehousing and analytics services. Built on a multi-cluster, shared data architecture, Snowflake separates storage and compute, enabling independent scaling of resources to accommodate varying workload demands without downtime. It supports both structured and semi-structured data (such as JSON, Avro, and Parquet) and offers ANSI-standard SQL interfaces for seamless data processing and querying. Snowflake also features robust data sharing, advanced security controls, near-zero administration, and automatic maintenance, making it particularly suitable for organizations seeking rapid deployment and integration with modern data ecosystems across major cloud providers (AWS, Azure, and Google Cloud).

Why Move Data from DynamoDB into Snowflake

The Value of Replicating DynamoDB Data to Snowflake for Advanced Analytics

A data engineer or architect may wish to copy data from DynamoDB into Snowflake for several compelling reasons. First, DynamoDB often contains operational or transactional data that holds significant business value, yet its full potential can be realized only when integrated with data from other sources. By consolidating this valuable data within Snowflake, organizations can perform advanced analytics and gain richer insights that are simply not possible within the isolated environment of DynamoDB. Moreover, leveraging Snowflake’s scalable compute and robust data integration capabilities allows for such analytics without burdening the DynamoDB service, thus maintaining application performance and reliability. In summary, offloading data to Snowflake enables comprehensive analysis and integration—unlocking the value of data in DynamoDB—while preserving the efficiency of operational databases.

Creating an Identity in DynamoDB

DynamoDB is an AWS-managed NoSQL database and does not have its own concept of "users" inside the database, as a relational database might. Instead, authentication and resource access are managed outside DynamoDB using AWS Identity and Access Management (IAM). A "user" is typically an IAM entity, and access to tables, items, or attributes in DynamoDB is controlled through IAM policies.

The typical process to enable a user (such as an application, developer, or external service) to interact with DynamoDB involves the following steps:


1. Create an IAM User

You create a DynamoDB "user" by defining an IAM user within AWS.

Using the AWS CLI

```bash
aws iam create-user --user-name MyDynamoDBUser
```


2. Attach a Policy Granting DynamoDB Access

Attach a managed or custom policy to the user. Example policy to allow access to a single DynamoDB table:

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "dynamodb:GetItem",
      "dynamodb:PutItem",
      "dynamodb:Query",
      "dynamodb:UpdateItem",
      "dynamodb:DeleteItem",
      "dynamodb:Scan"
    ],
    "Resource": "arn:aws:dynamodb:us-east-1:YOUR_ACCOUNT_ID:table/YourTableName"
  }]
}
```

You can attach this policy using the AWS CLI:

```bash
aws iam put-user-policy --user-name MyDynamoDBUser \
  --policy-name DynamoDBAccessPolicy \
  --policy-document file://policy.json
```

(Save the above JSON into a file named policy.json beforehand.)
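If you prefer to generate the policy file programmatically, the document above can be produced with a short Python sketch. The account ID and table name below are placeholders, not values from this guide:

```python
import json

def build_dynamodb_policy(account_id: str, table_name: str, region: str = "us-east-1") -> dict:
    """Build an IAM policy document granting single-table access, mirroring the JSON above."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": [
                "dynamodb:GetItem",
                "dynamodb:PutItem",
                "dynamodb:Query",
                "dynamodb:UpdateItem",
                "dynamodb:DeleteItem",
                "dynamodb:Scan",
            ],
            # Scope the policy to exactly one table ARN.
            "Resource": f"arn:aws:dynamodb:{region}:{account_id}:table/{table_name}",
        }],
    }

# Placeholder account ID and table name -- substitute your own values.
policy = build_dynamodb_policy("123456789012", "YourTableName")
policy_json = json.dumps(policy, indent=2)  # contents to write out as policy.json
```

Writing `policy_json` to a file gives you the `policy.json` referenced by the `aws iam put-user-policy` command.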


3. (Optional) Create Access Keys

To programmatically access DynamoDB (e.g., via the AWS SDK), generate an access key and secret:

```bash
aws iam create-access-key --user-name MyDynamoDBUser
```

Take note of the AccessKeyId and SecretAccessKey returned. These will be used for API/SDK access.


4. Use IAM User Credentials to Interact With DynamoDB

With the access keys, SDKs like boto3 (Python), AWS SDK for JavaScript, or AWS CLI can authenticate and interact with the table according to the attached policy.

Example with AWS CLI

```bash
aws dynamodb list-tables --profile my_dynamodb_user_profile
```

or specify credentials directly:

```bash
AWS_ACCESS_KEY_ID=AKIA... AWS_SECRET_ACCESS_KEY=ABCD... \
  aws dynamodb scan --table-name YourTableName
```


Note:
- DynamoDB does not support SQL for user management because user authentication and authorization are entirely managed by IAM outside the database itself.
- For fine-grained access control, consider using DynamoDB Condition Keys in your IAM policies.
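As an illustration of condition keys, the dynamodb:LeadingKeys key can restrict each caller to items whose partition key matches their own identity. The account ID and table name below are placeholders:

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["dynamodb:GetItem", "dynamodb:Query"],
    "Resource": "arn:aws:dynamodb:us-east-1:YOUR_ACCOUNT_ID:table/YourTableName",
    "Condition": {
      "ForAllValues:StringEquals": {
        "dynamodb:LeadingKeys": ["${aws:userid}"]
      }
    }
  }]
}
```

With this condition in place, a request is allowed only when every partition key it touches equals the caller's IAM user ID.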

Installing the JDBC driver

At the time of writing, the JDBC driver for Amazon DynamoDB is not shipped with Matillion Data Productivity Cloud by default. This decision is due to licensing or redistribution restrictions imposed by the driver vendor. Therefore, users are required to manually download and install the driver before establishing a JDBC connection to DynamoDB within Matillion.

To obtain the JDBC driver, visit the official Simba website: https://www.simba.com/drivers/dynamodb-jdbc-odbc/. When selecting a driver, prioritize the Type 4 JDBC version. Type 4 drivers are pure Java-based and do not depend on native libraries, which enhances compatibility and simplifies deployment within the Matillion environment.

After downloading the appropriate driver (typically delivered as a JAR file), you need to upload it to your Matillion Data Productivity Cloud instance. Detailed installation instructions are provided at: https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/

Once the JDBC driver has been successfully uploaded and installed, you can proceed to configure and use it within your Matillion projects. For step-by-step usage guidance, including how to set up database queries against DynamoDB using the installed JDBC driver, refer to: https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/

Checking network connectivity

To ensure Matillion Data Productivity Cloud can connect to your DynamoDB database, you must verify that the database allows incoming network connections from the correct sources, depending on your deployment configuration:

  • Full SaaS Agent Configuration:
    When using a Full SaaS agent, you need to permit connectivity from the IP addresses used by Matillion servers. The list of IP addresses you must allow is available at: Matillion IP Addresses.

  • Hybrid SaaS Deployment:
    For a Hybrid SaaS configuration, you must allow connections from your own Virtual Private Cloud (VPC), since the agent runs within your network environment. To verify connectivity and assist with setup, you can use the utilities provided here: Matillion Exchange – Check Network Access.

In addition, if you reference your DynamoDB database using a DNS hostname, ensure that the agent—whether Full SaaS or Hybrid SaaS—can successfully resolve the DNS address. This is necessary for establishing a reliable connection to the DynamoDB endpoint.
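A quick way to confirm that DNS resolution works from the agent's environment is a small Python check. The endpoint shown is the standard regional form; adjust the region to wherever your tables live:

```python
import socket

def can_resolve(hostname: str) -> bool:
    """Return True if DNS resolution succeeds for the given hostname."""
    try:
        socket.getaddrinfo(hostname, 443)
        return True
    except socket.gaierror:
        return False

# Illustrative regional endpoint; substitute your own region.
print(can_resolve("dynamodb.us-east-1.amazonaws.com"))
```

If this returns False from the machine running the agent, fix DNS resolution before troubleshooting anything else in the connection.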

How to Query Data from a DynamoDB Database

This guide explains how to query data from a DynamoDB database using SQL-like syntax and outlines best practices for loading this data into analytical environments.


1. SQL SELECT Statement Examples for DynamoDB

While DynamoDB does not natively support full ANSI SQL (it offers PartiQL, a SQL-compatible query language, for basic operations), you can use tools and connectors (e.g., AWS Glue, Athena, or third-party ETL platforms) to query DynamoDB tables using SQL-like syntax. Here are example queries as they would be expressed in SQL:

```sql
-- Select all items from a DynamoDB table
SELECT * FROM dynamodb_table;

-- Select specific columns (attributes)
SELECT user_id, created_at, status FROM dynamodb_table;

-- Query with a filter (e.g., status is 'active')
SELECT * FROM dynamodb_table WHERE status = 'active';

-- Query with a sort key and partition key (if used)
SELECT * FROM dynamodb_table
WHERE user_id = '12345' AND created_at > '2023-11-01T00:00:00Z';
```

Note: These SQL statements illustrate the kinds of queries you might express via a connector (such as Matillion, AWS Glue, or Snowflake external tables) that supports access to DynamoDB.


2. Datatype Conversion Considerations

When extracting data from DynamoDB and ingesting it into platforms like Snowflake, some datatype conversions commonly occur:

| DynamoDB Type | Typical Snowflake Conversion |
| --- | --- |
| String (S) | VARCHAR |
| Number (N) | NUMBER, FLOAT, INTEGER |
| Boolean (BOOL) | BOOLEAN |
| String Set (SS) | ARRAY or VARIANT |
| Number Set (NS) | ARRAY or VARIANT |
| Map (M) | VARIANT (JSON object) |
| List (L) | ARRAY or VARIANT |
| Binary (B) | BINARY, VARIANT |

Some transformation may be needed to align attribute types between DynamoDB and Snowflake, especially for collection or JSON-like types (Map, List).
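To make the mappings concrete, here is a minimal Python sketch (not Matillion's implementation) that converts DynamoDB's low-level attribute-value format, e.g. {"S": "x"}, into plain values ready to land in VARCHAR, NUMBER, ARRAY, or VARIANT columns. Binary (B) handling is omitted for brevity:

```python
from decimal import Decimal

def from_dynamodb(av: dict):
    """Convert one DynamoDB attribute value (e.g. {"S": "x"}) to a plain Python value."""
    (tag, value), = av.items()   # each attribute value has exactly one type tag
    if tag == "S":
        return value                                  # String (S)   -> VARCHAR
    if tag == "N":
        return Decimal(value)                         # Number (N)   -> NUMBER
    if tag == "BOOL":
        return value                                  # Boolean      -> BOOLEAN
    if tag == "NULL":
        return None
    if tag == "SS":
        return list(value)                            # String Set   -> ARRAY
    if tag == "NS":
        return [Decimal(v) for v in value]            # Number Set   -> ARRAY
    if tag == "L":
        return [from_dynamodb(v) for v in value]      # List         -> ARRAY/VARIANT
    if tag == "M":
        return {k: from_dynamodb(v) for k, v in value.items()}  # Map -> VARIANT
    raise ValueError(f"Unhandled DynamoDB type tag: {tag}")

# Hypothetical item in DynamoDB wire format.
item = {
    "user_id": {"S": "12345"},
    "score": {"N": "42"},
    "tags": {"SS": ["a", "b"]},
    "profile": {"M": {"active": {"BOOL": True}}},
}
flat = {k: from_dynamodb(v) for k, v in item.items()}
```

Nested Map and List values convert recursively, which is why they typically land in a VARIANT column rather than a scalar type.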


3. Recommended Data Loading Patterns

The recommended approach for analytics is to perform an initial one-off full load, followed by regular incremental loads. This ensures that all data is ingested once in full, and that subsequent runs fetch only changes (new or updated records).

Initial Load

  • Description: Loads all data from DynamoDB into your target system.
  • Implementation: Use the Database Query component with no filter clause.
  • SQL Representation:
    SELECT * FROM dynamodb_table;

Incremental Load

  • Description: Loads only new or changed records since the last load, reducing transfer and processing time.
  • Implementation: Use the Database Query component with a filter clause (WHERE) based on a suitable column (e.g., timestamp or modification attribute).
  • SQL Representation:
    SELECT * FROM dynamodb_table
    WHERE updated_at > '2024-03-01T00:00:00Z';
  • The value in the filter should be dynamically replaced with the last successful load timestamp.
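The high-water-mark pattern above can be sketched in Python. The table and column names are illustrative; in a real pipeline the watermark would be persisted in a pipeline variable or a control table and advanced after each successful load:

```python
from typing import Optional

def build_incremental_query(table: str, column: str, last_loaded: Optional[str]) -> str:
    """Return a full-load query when no watermark exists, else an incremental one."""
    if last_loaded is None:
        return f"SELECT * FROM {table}"
    return f"SELECT * FROM {table} WHERE {column} > '{last_loaded}'"

# First run: no watermark recorded yet, so extract everything.
q_initial = build_incremental_query("dynamodb_table", "updated_at", None)

# Later runs: filter on the timestamp captured after the last successful load.
q_incremental = build_incremental_query("dynamodb_table", "updated_at", "2024-03-01T00:00:00Z")
```

Capturing the watermark at the start of each extraction (rather than its end) avoids missing rows that change while the load is running.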

Refer to further guidance from Matillion's Exchange Article.

Data Integration Architecture

An important advantage of the ELT (Extract, Load, Transform) architecture is its "divide and conquer" approach, where raw data is first loaded into the Snowflake database in advance of integration. By splitting the process into two distinct steps—loading and then integrating—organizations gain flexibility and efficiency in managing large, complex datasets. Data integration inevitably requires data transformation, and the most effective way to achieve this within Snowflake is through purpose-built data transformation pipelines. These pipelines automate and orchestrate the required data preparations, such as cleansing, normalization, and enrichment, in a repeatable manner. Another significant benefit of the ELT model is that the transformation and integration occur directly inside the Snowflake database itself. This enables fast, on-demand, and highly scalable processing, while also eliminating the need to pay for and maintain additional external data processing infrastructure.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.