Integrate data from DynamoDB to Amazon Redshift using Matillion

Our DynamoDB to Redshift connector enables seamless, real-time data transfers to Redshift within minutes, eliminating the need for manual coding or complex ETL scripts.


Extracting data from DynamoDB to Amazon Redshift

Extracting data from DynamoDB is a common requirement for organizations looking to leverage powerful data warehousing capabilities, such as those offered by Amazon Redshift. Migrating data from a NoSQL database like DynamoDB to a columnar store enables complex analytics, reporting, and business intelligence operations that go beyond DynamoDB’s native querying functionalities.

This article serves as a practical guide for moving data from DynamoDB to Redshift, with a particular focus on users leveraging Matillion for orchestration and transformation. We will begin by outlining the process of creating an identity in DynamoDB to ensure secure and authorized access. Next, for Matillion users, we will discuss how to check for an existing JDBC driver, as well as the steps needed to acquire and install it if necessary. Understanding and establishing reliable network connectivity between your DynamoDB source and Redshift target will be covered in detail, helping you prevent common data transfer issues.

Finally, we will review methods for querying the required data from DynamoDB, exploring strategies for both initial full loads and efficient incremental updates. By following the steps outlined in this guide, you’ll be able to set up a robust workflow for extracting and loading DynamoDB data into Redshift, optimizing your data pipeline for both accuracy and performance.


What is DynamoDB?

Amazon DynamoDB is a fully managed, serverless NoSQL database offering single-digit millisecond performance at any scale. It automatically partitions and replicates data across AWS availability zones, supporting flexible key-value and document models for diverse use cases like web apps, real-time analytics, and IoT. Features include built-in encryption at rest, fine-grained access control, point-in-time recovery, and ACID transaction support, ensuring data security and reliability. Integration with AWS Lambda and DynamoDB Streams enables event-driven processing, making DynamoDB an excellent choice for building modern, scalable application backends.


What is Amazon Redshift?

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the AWS cloud, designed for fast query performance on large volumes of structured and semi-structured data. Architected atop PostgreSQL, Redshift employs columnar storage, data compression, and massively parallel processing (MPP) to optimize analytic query execution. It supports standard SQL along with integrations for business intelligence tools, and it provides scalability through seamless node expansion and concurrency scaling. Redshift’s architecture separates storage and compute, integrating with AWS services such as S3 via Redshift Spectrum for federated querying. This makes it suitable for advanced analytics, reporting workloads, and operational business intelligence at scale.

Why Move Data from DynamoDB into Amazon Redshift

Unlocking Advanced Analytics: The Case for Copying Data from DynamoDB to Redshift

A data engineer or architect might choose to copy data from DynamoDB into Redshift for several compelling reasons. First, DynamoDB often stores data that is potentially valuable for business insights, such as user activity logs or transactional information. However, the true value of this data is frequently realized only when it is integrated with other data sources—such as customer relationship management (CRM) systems, third-party analytics, or sales databases—to create a more comprehensive view for analysis or reporting. By moving this process to Amazon Redshift, an environment specifically optimized for complex query operations and large-scale data integration, organizations can perform robust and resource-intensive analytics without placing additional workload or operational risk on their DynamoDB tables, which are primarily designed for real-time operational access rather than analytical queries. This approach therefore facilitates richer insights while maintaining the performance and reliability of both systems.

Creating an Identity in DynamoDB

DynamoDB is a fully managed NoSQL database service provided by AWS. Unlike relational databases, DynamoDB does not support SQL-based user creation within the database itself. Access to DynamoDB is controlled externally using AWS Identity and Access Management (IAM). Instead of creating a user in the database, you create AWS IAM users or roles and grant them permissions to access and perform operations on DynamoDB tables.

Below are step-by-step instructions for creating a user and granting them DynamoDB permissions using the AWS Management Console and the AWS CLI, along with example IAM policy JSON.

1. Creating an IAM User with DynamoDB Permissions

Using the AWS Management Console

  1. Navigate to the AWS IAM Console.

  2. Create a New User:

    • Click "Users" in the sidebar, then "Add users."
    • Enter the user name (e.g., dynamodb_app_user).
    • Select "Programmatic access" for access type.
  3. Set Permissions:

    • Click "Attach existing policies directly."
    • Search for AmazonDynamoDBFullAccess to give full access, or create a custom policy for granular access.
  4. Review and Create User:

    • Review the user summary, click "Create user."
    • Make note of the "Access key ID" and "Secret access key."

Using AWS CLI

To create a user and attach the AmazonDynamoDBFullAccess policy:

```
# Create IAM user
aws iam create-user --user-name dynamodb_app_user

# Attach policy to user
aws iam attach-user-policy \
  --user-name dynamodb_app_user \
  --policy-arn arn:aws:iam::aws:policy/AmazonDynamoDBFullAccess
```

Example: Custom Policy for Table-Specific Permissions

You can restrict permissions to specific tables by attaching a custom IAM policy. The following example policy grants CRUD (Create, Read, Update, Delete) access to the table UsersTable:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "dynamodb:DeleteItem",
        "dynamodb:Query",
        "dynamodb:Scan"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/UsersTable"
    }
  ]
}
```

To create and attach this policy:

```
# Save the above JSON policy to policy.json

# Create the custom policy
aws iam create-policy \
  --policy-name DynamoDBUsersTableCRUD \
  --policy-document file://policy.json

# Attach the custom policy to the user
aws iam attach-user-policy \
  --user-name dynamodb_app_user \
  --policy-arn arn:aws:iam::123456789012:policy/DynamoDBUsersTableCRUD
```

2. Authenticating and Accessing DynamoDB

Once credentials (Access Key ID and Secret Access Key) are provisioned, the user (or application) authenticates using these credentials to interact with DynamoDB using AWS SDKs, CLI, or API endpoints.


Note:
Unlike SQL-based databases, DynamoDB itself does not natively store user identities or credentials; all access is managed through AWS IAM. Access granularity, authentication, and authorization are controlled centrally at the AWS account/IAM level.

Installing the JDBC Driver

The DynamoDB JDBC driver is required to connect Matillion Data Productivity Cloud to your DynamoDB database. As of the time of writing, this driver is not included by default in Matillion Data Productivity Cloud due to licensing or redistribution restrictions. You must manually download and install this driver to enable JDBC-based connectivity for your workflows.

Step 1: Download the JDBC Driver

  • Navigate to the Simba drivers page: https://www.simba.com/drivers/dynamodb-jdbc-odbc/
  • On the page, look for the Type 4 JDBC driver for DynamoDB. Type 4 drivers are fully implemented in Java and do not require native libraries, making them the preferred choice for cloud platforms such as Matillion.
  • Complete any required registrations or end-user license agreements required by Simba before accessing the driver downloads.
  • Download the appropriate JDBC .jar file to your local machine.

Step 2: Upload and Install the Driver into Matillion Data Productivity Cloud

  • Refer to the official Matillion documentation for uploading external drivers: Uploading External Drivers
  • In the Matillion Data Productivity Cloud user interface, navigate to the section for managing custom JDBC drivers.
  • Follow the instructions provided in the documentation to upload the downloaded .jar file to your Matillion Agent or platform.
  • After the upload process, ensure the driver is available in the list of installed JDBC drivers.

Step 3: Using the JDBC Driver

  • Detailed guidance for configuring and using external database drivers (such as for DynamoDB) is available here: Database Query Component Usage
  • Within your Matillion workflow, select the database query component and configure it to use your newly installed DynamoDB JDBC driver.
  • Input the required connection parameters as directed by the product and the driver documentation.

Once the JDBC driver is installed and configured, you can leverage the capabilities of Matillion Data Productivity Cloud to query and manage your DynamoDB data.

Checking network connectivity

To ensure successful connectivity between Matillion Data Productivity Cloud and DynamoDB, you must configure your networking environment so that the agent can reach the DynamoDB endpoint, according to your deployment type:

If you are referencing your DynamoDB database using a DNS address (hostname), ensure that the Full SaaS or Hybrid SaaS agent is able to resolve this DNS address correctly. Network-level DNS resolution must be permitted so that the agent can obtain the necessary IP address for connecting to DynamoDB.
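These two checks, DNS resolution and TCP reachability, can be verified from any host on the agent's network. Below is a minimal sketch using only the Python standard library; the DynamoDB regional endpoint shown in the comment is an example, so substitute the endpoint your pipelines actually use:

```python
import socket

def resolve_host(hostname):
    """Return the IP addresses that DNS resolution yields for a hostname,
    or an empty list if resolution fails."""
    try:
        infos = socket.getaddrinfo(hostname, None)
        return sorted({info[4][0] for info in infos})
    except socket.gaierror:
        return []

def can_connect(host, port, timeout=3.0):
    """Attempt a TCP connection to confirm the endpoint is reachable
    at the network level."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (placeholder endpoint): DynamoDB regional endpoints serve HTTPS
# on port 443, e.g.:
#   resolve_host("dynamodb.us-east-1.amazonaws.com")
#   can_connect("dynamodb.us-east-1.amazonaws.com", 443)
```

If `resolve_host` returns an empty list, fix DNS resolution for the agent before troubleshooting anything else; if resolution succeeds but `can_connect` fails, look at firewalls, security groups, or proxy rules between the agent and the endpoint.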

Querying Data from DynamoDB

This guide explains how to query data from an Amazon DynamoDB database, focusing on best practices for data loading and providing SQL SELECT statement examples. These instructions will aid in understanding query patterns, handling datatype conversions, and implementing both initial and incremental data loads, as outlined in the Incremental Load (Data Replication) Strategy guide.


DynamoDB Query Examples: SQL SELECT Equivalents

While DynamoDB is a NoSQL database and does not natively use SQL, you may use SQL-like syntax with certain integration tools or when loading from DynamoDB into Amazon Redshift or other relational systems.

Example 1: Select all records (Initial Load)

SELECT * FROM dynamodb_table;

This selects all items from the DynamoDB table. In a native DynamoDB context, you would typically use a Scan operation for this.

Example 2: Select specific attributes

SELECT item_id, price, updated_at FROM dynamodb_table;

Example 3: Incremental load based on a date filter

SELECT * FROM dynamodb_table WHERE updated_at > '2024-06-01T00:00:00Z';
For DynamoDB, you can achieve this using a Scan or Query operation with a filter expression on the updated_at attribute.
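To make the cutoff logic concrete, here is a small, self-contained sketch in plain Python (the updated_at attribute name is taken from the examples above) that applies the same condition a server-side filter expression would. ISO-8601 UTC timestamps sort chronologically as plain strings, so no date parsing is needed:

```python
def incremental_filter(items, cutoff_iso):
    """Keep only items whose updated_at is strictly after the cutoff.
    Items without an updated_at attribute are excluded."""
    return [item for item in items
            if item.get("updated_at", "") > cutoff_iso]

items = [
    {"item_id": "a1", "updated_at": "2024-05-30T12:00:00Z"},
    {"item_id": "b2", "updated_at": "2024-06-02T08:30:00Z"},
    {"item_id": "c3"},  # no updated_at attribute: excluded
]
new_items = incremental_filter(items, "2024-06-01T00:00:00Z")
# new_items contains only the b2 record
```

In production you would push this predicate down to DynamoDB (as a filter expression) rather than filtering client-side, so that only matching items are returned.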


Datatype Conversion Considerations

When moving data between DynamoDB and Amazon Redshift (or other relational data stores), be aware that datatype conversions may occur. DynamoDB supports a limited set of types (String, Number, Binary, Boolean, Null, List, Map), whereas Redshift supports typical SQL data types (VARCHAR, INT, TIMESTAMP, etc.).

Common conversions:

  • DynamoDB String → Redshift VARCHAR
  • DynamoDB Number → Redshift DECIMAL / BIGINT
  • DynamoDB Boolean → Redshift BOOLEAN
  • DynamoDB Map / List → Redshift SUPER (the native semi-structured data type), or flattened into regular columns

Always review the schema mapping and adjust as necessary to ensure data integrity.
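The mapping above can be sketched as a small conversion routine. This example works on DynamoDB's low-level wire format, where each attribute value is tagged with its type (e.g. `{"N": "42"}`), and is a simplified sketch: it omits Binary and Set types and leaves the final SUPER-or-flatten decision to your loader:

```python
from decimal import Decimal

def from_dynamodb(attr):
    """Convert one attribute in DynamoDB's low-level wire format
    into a Redshift-friendly Python value."""
    (tag, value), = attr.items()
    if tag == "S":                     # String  -> VARCHAR
        return value
    if tag == "N":                     # Number  -> DECIMAL / BIGINT
        return Decimal(value)          # numbers arrive as strings
    if tag == "BOOL":                  # Boolean -> BOOLEAN
        return value
    if tag == "NULL":                  # Null    -> NULL
        return None
    if tag == "L":                     # List    -> SUPER or flattened
        return [from_dynamodb(v) for v in value]
    if tag == "M":                     # Map     -> SUPER or flattened
        return {k: from_dynamodb(v) for k, v in value.items()}
    raise ValueError(f"Unhandled DynamoDB type tag: {tag}")
```

Note that DynamoDB serializes numbers as strings on the wire, which is why the sketch converts them through Decimal rather than float, preserving precision for monetary values.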


Data Loading Pattern: Initial vs Incremental Load

The best practice for loading data from DynamoDB into a data warehouse (e.g., Redshift) is to:

  1. Perform a once-off initial load, where all data is extracted.
    • The Database Query component does not include a filter clause.
    • Example: SELECT * FROM dynamodb_table;
  2. Subsequently schedule incremental loads, where only new or modified data since the last load is extracted.
    • The Database Query component includes a filter clause, usually based on a timestamp or change-tracking attribute.
    • Example: SELECT * FROM dynamodb_table WHERE updated_at > '{{$previous_run_end_time}}';

Tip: Use the same Database Query component for both loading patterns—simply adjust the WHERE clause or filter condition.
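Since the two patterns differ only in the filter clause, a pipeline can generate both from one template. The helper below is a hypothetical sketch of that parameterization (the function and argument names are illustrative, not part of any Matillion API):

```python
def build_extract_query(table, watermark_column=None, watermark_value=None):
    """Build the extraction SELECT: a full scan when no watermark is
    supplied (initial load), a filtered query otherwise (incremental)."""
    query = f"SELECT * FROM {table}"
    if watermark_column is not None and watermark_value is not None:
        query += f" WHERE {watermark_column} > '{watermark_value}'"
    return query + ";"

# Initial load: no watermark
initial = build_extract_query("dynamodb_table")

# Incremental load: filter on the last successful run's high-water mark
incremental = build_extract_query(
    "dynamodb_table", "updated_at", "2024-06-01T00:00:00Z")
```

The watermark value would typically come from pipeline state, such as the `{{$previous_run_end_time}}` variable shown above, so each run picks up exactly where the last one finished.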

For more details, refer to the official Matillion Exchange article: Incremental Load (Data Replication) Strategy.


Note: Replace dynamodb_table, item_id, price, and updated_at with the actual table and attribute names in your schema. Also, adjust date/timestamp formats as required by your ETL tool or destination system.

Data Integration Architecture

Loading data in advance of integration is a core principle of ELT (Extract, Load, Transform) architecture, as it allows the overall data integration challenge to be divided into manageable phases: first, bulk loading raw data into the Amazon Redshift database, and second, performing integration and transformation inside the database itself. This approach provides key advantages. Data integration always requires data to be cleansed, standardized, and merged—these transformation processes can be efficiently handled using data transformation pipelines, which automate and optimize sequences of SQL-based data modifications. With ELT in Redshift, all transformation and integration activities occur directly inside the target database, leveraging its computing power. This makes the process both fast and scalable, delivering insights on demand while eliminating the need to provision or pay for separate, external data processing infrastructure.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.