Integrate data from MongoDB to Amazon Redshift using Matillion

Our MongoDB to Redshift connector streamlines data transfer to Redshift within minutes, ensuring your information remains current—no manual coding or complicated ETL scripts required.


Extracting data from MongoDB to Amazon Redshift

Extracting data from MongoDB is a common requirement for organizations seeking to leverage the analytical capabilities of cloud data warehouses such as Amazon Redshift. Migrating or synchronizing data between these platforms enables more advanced reporting, business intelligence, and integration with other enterprise datasets. This article provides a step-by-step guide for successfully transferring data from MongoDB into Redshift, with a particular focus on users of Matillion ETL. You will learn how to:

- **Create a suitable user identity in MongoDB** with the necessary privileges for data extraction
- **Check for or acquire the correct MongoDB JDBC driver** for use within Matillion jobs
- **Configure secure and efficient network connectivity** between your MongoDB source and Redshift target via Matillion
- **Query and transfer your data**, handling both initial full loads and ongoing incremental updates

By following these steps, you can ensure the smooth and reliable movement of data from MongoDB to Redshift, empowering your organization with timely and actionable insights.


What is MongoDB?

MongoDB is a widely adopted, open-source NoSQL database designed for high performance, scalability, and flexibility. Unlike traditional relational databases, MongoDB stores data in BSON (Binary JSON) documents, allowing for dynamic, schema-less data modeling that adapts to evolving application requirements. Its architecture facilitates horizontal scaling through sharding and supports high availability with replica sets, making it suitable for large-scale distributed systems. Core features include powerful aggregation frameworks, secondary indexes, full-text search, and rich query capabilities that enable developers to efficiently manage and analyze complex datasets. MongoDB’s native drivers and ecosystem integration further streamline development across numerous programming languages and environments.


What is Amazon Redshift?

Amazon Redshift is a fully managed, petabyte-scale data warehouse built on PostgreSQL, optimized for analytic workloads on large datasets. It uses columnar storage and parallel query execution for high performance with complex queries. Redshift integrates seamlessly with AWS, supports standard SQL, and offers flexible deployment, including Redshift Serverless for on-demand scaling. It features strong security, automated backups, and supports easy data ingest from diverse sources. These capabilities make Redshift a popular choice for business intelligence, reporting, and big data analytics solutions in the cloud, particularly for organizations familiar with relational databases.

Why Move Data from MongoDB into Amazon Redshift

Unlocking Analytical Potential: The Case for Copying Data from MongoDB to Redshift

A data engineer or architect might choose to copy data from MongoDB into Amazon Redshift for several compelling reasons. Firstly, MongoDB often stores data that is potentially valuable for analytics, reporting, or machine learning use cases. However, the full value of this data is often realized when it is integrated with information from other sources, such as relational databases or enterprise data warehouses. By loading MongoDB data into Redshift, professionals can leverage Redshift’s robust querying and analytical capabilities to combine and analyze data from multiple origins efficiently. Importantly, performing data integration and intensive queries within Redshift—as opposed to directly on MongoDB—prevents the addition of unnecessary workload and performance bottlenecks on the operational MongoDB system, ensuring its continued responsiveness for transactional workloads.

Creating an Identity in MongoDB

This guide explains how to create a user in a MongoDB database using the MongoDB shell (`mongosh`).

Prerequisites

  • MongoDB server (any supported version)
  • Access to `mongosh` (MongoDB Shell)
  • A user with appropriate privileges (e.g., `userAdmin` on the target database)

Step-by-Step Instructions

1. Connect to MongoDB

Open your terminal and connect to the MongoDB instance:

```shell
mongosh
```

If your MongoDB instance requires authentication, use:

```shell
mongosh "mongodb://username@host:port"
```

2. Switch to the Target Database

Select the database where you want to create the new user. For example, to use the `mydatabase` database:

```javascript
use mydatabase
```

3. Create a New User

Use the `db.createUser()` method to create a new user. You must specify at least a username, password, and an array of roles.

Example: Creating a user with read and write permissions

```javascript
db.createUser({
  user: "appUser",
  pwd: "securePassword123",
  roles: [
    { role: "readWrite", db: "mydatabase" }
  ]
})
```

Parameters:

  • `user`: Username of the new user.
  • `pwd`: Password of the new user.
  • `roles`: An array of roles granted to the user. Each role document specifies the role and the associated database.

Available Built-in Database Roles:

  • `read`: Read data from the database.
  • `readWrite`: Read and write data in the database.
  • `dbAdmin`: Perform administrative tasks specific to the database.
  • `userAdmin`: Manage users and roles in the database.

Example: Creating an admin user

To grant a user administrative access to all databases:

```javascript
use admin

db.createUser({
  user: "superAdmin",
  pwd: "superSecretPwd!",
  roles: [
    { role: "userAdminAnyDatabase", db: "admin" },
    { role: "dbAdminAnyDatabase", db: "admin" }
  ]
})
```

4. Verify User Creation

List all users in the current database:

```javascript
db.getUsers()
```

Notes

  • Passwords are case-sensitive.
  • Role assignments should follow the principle of least privilege.
  • Restart application connections as needed after modifying authentication.

For additional options and guidance, refer to the MongoDB documentation on user management.

Installing the JDBC driver

At the time of writing, the JDBC driver for MongoDB is not included by default in Matillion Data Productivity Cloud, due to licensing or redistribution restrictions. As such, if you intend to connect to MongoDB from Matillion, you will need to manually obtain and upload the appropriate driver.

1. Download the MongoDB JDBC Driver
Visit the official MongoDB BI Connector page (https://www.mongodb.com/products/bi-connector) to download the JDBC driver. When presented with a choice, look specifically for the Type 4 JDBC driver, as this is a pure-Java implementation that does not require native libraries and is generally preferred for compatibility and portability within data integration platforms.

2. Prepare the Driver File
After downloading, locate the JAR file for the MongoDB JDBC driver on your local machine. Ensure that you are using a version that is compatible with your environment and the version of Matillion Data Productivity Cloud you are running.

3. Upload the Driver to Matillion
Follow the official installation instructions on uploading external drivers to Matillion Data Productivity Cloud:
https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/

This guide provides step-by-step details on uploading the JDBC driver through the Matillion Data Productivity Cloud interface. It is important to strictly follow these guidelines to ensure the platform recognizes and utilizes the new driver correctly.

4. Use the JDBC Driver in Your Workflows
Once the driver is successfully installed, you can begin using it to create MongoDB connections within your Matillion Data Productivity Cloud workflows. Refer to the database query usage instructions found here:
https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/

This will guide you through configuring your database queries and making use of the newly installed MongoDB JDBC driver.

Checking network connectivity

To enable connectivity between Matillion Data Productivity Cloud and your MongoDB database, you must ensure that your MongoDB instance is configured to allow incoming connections specific to your deployment type:

  • Full SaaS Agent Configuration:
    If you are using a Full SaaS agent configuration, your MongoDB database must permit incoming connections from the required Matillion IP addresses. These addresses are documented and updated at the following link:
    Allowing IP Addresses – Matillion Documentation.

  • Hybrid SaaS Deployment:
    For Hybrid SaaS deployments, incoming connections should be allowed from your own Virtual Private Cloud (VPC), where the Matillion agent is hosted. To help you verify the network path and ensure connectivity, refer to the utilities available here:
    Pipeline Network Access Check – Matillion Exchange.

Additionally, if your MongoDB database is referenced using a DNS hostname, the Matillion agent—whether Full SaaS or Hybrid SaaS—must be able to resolve this DNS address. Ensure that the relevant DNS records are accessible and the network configuration supports DNS resolution from the agent’s environment.
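As a quick sanity check of both requirements (DNS resolution and TCP reachability), a short script can probe the MongoDB endpoint from the environment where the agent runs. This is an illustrative sketch, not a Matillion utility: the function name and return shape are our own, and MongoDB's default port 27017 is assumed.

```python
import socket

def check_mongodb_reachable(host: str, port: int = 27017, timeout: float = 5.0) -> dict:
    """Check DNS resolution and TCP reachability of a MongoDB endpoint."""
    result = {"host": host, "port": port, "dns_ok": False, "tcp_ok": False}
    try:
        # Step 1: confirm this environment can resolve the hostname.
        ip = socket.gethostbyname(host)
        result["dns_ok"] = True
        result["ip"] = ip
    except socket.gaierror:
        return result
    try:
        # Step 2: confirm a TCP connection can be opened to the port.
        with socket.create_connection((ip, port), timeout=timeout):
            result["tcp_ok"] = True
    except OSError:
        pass  # firewall block, refused connection, or timeout
    return result
```

If `dns_ok` is false, fix DNS records or resolver configuration first; if only `tcp_ok` is false, review firewall rules or the allowed IP ranges described above.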

Querying Data from a MongoDB Database

Below are instructions on querying data from a MongoDB database, especially for users familiar with SQL. These guidelines also consider best practices for data loading into downstream systems like Amazon Redshift, including data type conversion and incremental load patterns.


Writing MongoDB Queries vs. SQL SELECT Statements

While MongoDB uses JSON-like query syntax, many logical concepts are similar to SQL:

| SQL (Redshift) Example | Equivalent MongoDB Query |
|---|---|
| `SELECT * FROM users;` | `db.users.find({})` |
| `SELECT name, email FROM users WHERE age > 21;` | `db.users.find({ age: { $gt: 21 } }, { name: 1, email: 1 })` |
| `SELECT COUNT(*) FROM orders WHERE status = 'NEW';` | `db.orders.countDocuments({ status: 'NEW' })` |
| `SELECT * FROM users ORDER BY created_at DESC;` | `db.users.find({}).sort({ created_at: -1 })` |

Datatype Conversion: MongoDB to Redshift

  • MongoDB supports flexible documents (JSON/BSON), while Redshift uses tabular, strongly-typed columns.
  • Strings (MongoDB `String`) map to VARCHAR in Redshift.
  • Numbers: `int32`/`int64`/`double` in MongoDB map to INTEGER, BIGINT, or DOUBLE PRECISION.
  • Boolean: direct mapping to Redshift BOOLEAN.
  • Dates (`ISODate`) map to TIMESTAMP in Redshift.
  • Arrays and objects may be flattened or serialized to VARCHAR in Redshift.
  • Some MongoDB data types (e.g., `ObjectId`, `Decimal128`) may require conversion to text or numeric formats during processing.

Data Loading Patterns

The recommended pattern for loading data from MongoDB to Redshift or similar platforms is:

1. Once-off Initial Load

  • Approach: Load all data from the MongoDB collection into Redshift.
  • Query example: No filter clause—select everything.

    ```javascript
    // MongoDB initial load: fetch all documents
    db.collection.find({})
    // Redshift bulk insert follows after extraction/transformation
    ```

  • Process:
    Use the Database Query component with a query fetching all rows (no filter).

  • Purpose: This establishes a base dataset in the destination.
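For large collections, the initial load is often broken into key ranges so each batch can be extracted and loaded independently. The sketch below illustrates the idea in plain Python over an already-sorted list of unique keys; in practice each range would translate into an `_id` filter on the MongoDB query. The function name and range convention are our own, not a Matillion feature.

```python
def keyset_batches(sorted_ids, batch_size):
    """Yield (lower_exclusive, upper_inclusive) key ranges for batched extraction.

    Assumes sorted_ids is sorted and unique, and batch_size > 0. Each range
    corresponds to a MongoDB filter such as
    {"_id": {"$gt": lower, "$lte": upper}} (drop the $gt clause when lower is None).
    """
    for start in range(0, len(sorted_ids), batch_size):
        chunk = sorted_ids[start:start + batch_size]
        lower = sorted_ids[start - 1] if start > 0 else None
        yield (lower, chunk[-1])
```

Keyset ranges like these are restartable: if a batch fails, re-run only that range rather than the whole extract.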

2. Ongoing Incremental Loads

  • Approach: Load only new or updated records since the last extract—using an appropriate filter, often on a field like `_id` or a timestamp (e.g., `updated_at`).
  • Query example: Add a filter.

    ```javascript
    // MongoDB incremental load: fetch documents updated after a cutoff
    db.collection.find({ updated_at: { $gt: ISODate("2024-06-20T00:00:00Z") } })
    // Use a similar filter when pulling new data into Redshift staging
    ```

  • Process:
    Use the same Database Query component, but configure a parameterized filter to restrict to new or changed data.

  • Purpose: This pattern minimizes data movement and keeps destinations in sync.

  • Read more: Refer to Incremental Load Data Replication Strategy for detailed guidance and examples.
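The two patterns above can be unified behind a single high-watermark helper: with no stored watermark the filter is empty (a full load), and afterwards it selects only documents changed since the last run. This is a minimal sketch; the function name is our own, and it assumes the tracking field is a BSON date.

```python
from datetime import datetime
from typing import Optional

def incremental_filter(last_watermark: Optional[datetime],
                       field: str = "updated_at") -> dict:
    """Build a MongoDB find() filter selecting documents changed after the watermark.

    Returns {} when no watermark exists yet, which matches the
    once-off initial load pattern (select everything).
    """
    if last_watermark is None:
        return {}
    # Equivalent to db.collection.find({ field: { $gt: last_watermark } })
    return {field: {"$gt": last_watermark}}
```

After each successful load, persist the maximum `updated_at` value seen so the next run resumes from it.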


Example: Synchronizing MongoDB with Redshift

Initial Load:

```javascript
db.orders.find({})
```

Extract full data, convert data types as required, and bulk load to Redshift.

Incremental Load (using a last-modified timestamp field):

```javascript
db.orders.find({ last_modified: { $gt: ISODate("2024-06-20T00:00:00Z") } })
```

Extract only recent changes, apply datatype mapping, and upsert or merge as needed in Redshift.


Tip: Always standardize type conversions during ETL/ELT to maintain schema compatibility and avoid runtime errors. Explore parameterized queries and change tracking fields (`_id`, `updated_at`) for robust incremental loading.
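Since Redshift columns are tabular while MongoDB documents can nest, a common preparation step is to flatten nested objects into underscore-joined columns and serialize arrays to JSON text, as noted in the datatype section above. This is a simplified sketch under those assumptions; the function name and separator are our own.

```python
import json

def flatten_document(doc: dict, parent_key: str = "", sep: str = "_") -> dict:
    """Flatten nested objects into underscore-joined columns; serialize arrays to JSON text."""
    row = {}
    for key, value in doc.items():
        col = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects: {"b": {"c": 2}} becomes column "b_c".
            row.update(flatten_document(value, col, sep))
        elif isinstance(value, list):
            # Arrays land in a VARCHAR column as JSON text.
            row[col] = json.dumps(value)
        else:
            row[col] = value
    return row
```

Applying the same flattening rule on every load keeps column names stable between the initial and incremental runs.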

Data Integration Architecture

Loading data into Amazon Redshift in advance of data integration exemplifies the "divide and conquer" approach central to the ELT (Extract, Load, Transform) architecture. By splitting the workflow into two distinct steps—loading raw source data first and then performing integration and transformation—the process becomes more manageable and flexible. Effective data integration requires data transformation, which is typically best managed through purpose-built data transformation pipelines. These pipelines automate and standardize the transformation of raw source data into formats suitable for analytics and reporting. A further advantage of the ELT architecture is that both data transformation and integration take place directly within the Redshift data warehouse. This approach eliminates the need for separate processing environments, allowing you to leverage Redshift's speed, scalability, and on-demand capabilities without incurring additional infrastructure costs for external data processing.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.