Integrate data from MongoDB to Snowflake using Matillion

Our MongoDB to Snowflake connector enables seamless, efficient data transfer into Snowflake within minutes, keeping your data current without manual coding or complicated ETL processes.


Extracting data from MongoDB to Snowflake

Extracting data from MongoDB is a critical step in enabling organizations to fully leverage their operational data in advanced analytics platforms such as Snowflake. By transferring data from MongoDB to Snowflake, organizations can take advantage of Snowflake’s powerful data processing and querying capabilities while maintaining MongoDB as the source of operational truth. This article will guide you through the essential steps involved in this data extraction and loading process. We will begin by creating an identity in MongoDB that is authorized to access the data you wish to extract. For users working with Matillion, we will cover how to check for the presence of – or acquire – the requisite JDBC driver needed for connectivity between the two systems. Next, we will review the necessary network connectivity considerations to ensure a secure and efficient transfer from source to target. Finally, we will explain how to query data from MongoDB, with a focus on strategies for both initial full data loads and subsequent incremental updates. By following this guide, you will be well-equipped to set up a robust and reliable process for extracting data from MongoDB and loading it into Snowflake.


What is MongoDB?

MongoDB is a popular open-source, document-oriented NoSQL database designed for scalability, flexibility, and high performance. It stores data in BSON (binary JSON) documents, which enables support for rich and dynamic schemas, allowing for seamless modeling of complex and evolving data structures without predefined schemas. MongoDB offers robust indexing, powerful querying capabilities—including support for ad hoc queries and aggregation pipelines—as well as horizontal scaling through sharding and high availability via replica sets. Widely used in modern application development, it is well-suited for handling large volumes of unstructured or semi-structured data, and it integrates effectively with a variety of programming languages and ecosystems.
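To make the schema flexibility concrete, the sketch below (in Python, with purely illustrative collection and field names) shows two documents that could coexist in the same MongoDB collection even though their fields differ, something a relational table would not allow without schema changes:

```python
# Two documents in the same hypothetical "products" collection.
# Unlike rows in a relational table, they need not share a schema:
# the second document adds an embedded document and an array field.
book = {
    "_id": "664f1a2b9c8d7e6f5a4b3c2d",  # ObjectId, shown here as its hex string
    "name": "MongoDB in Action",
    "price": 39.99,
}

laptop = {
    "_id": "664f1a2b9c8d7e6f5a4b3c2e",
    "name": "Ultrabook 14",
    "price": 1299.00,
    "specs": {"cpu": "8-core", "ram_gb": 16},  # embedded document
    "tags": ["electronics", "computers"],      # array field
}

# Both documents live side by side; fields can vary per document.
for doc in (book, laptop):
    print(sorted(doc.keys()))
```

This per-document flexibility is exactly what makes the data type mapping step (covered later in this article) important when loading into a strongly typed warehouse like Snowflake.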


What is Snowflake?

Snowflake is a cloud-native data warehouse platform that decouples compute and storage, allowing for highly scalable, concurrent data processing. Built on a multi-cluster shared data architecture, it supports semi-structured and structured data using a SQL-based interface, making it suitable for diverse analytics workloads. Snowflake’s fully managed service automates infrastructure management, scaling, tuning, and security, thus enabling organizations to focus on data engineering and analytics tasks. Its role-based access control, support for ACID transactions, and integration with multiple cloud providers (AWS, Azure, GCP) make it a robust solution for enterprise data warehousing, enabling seamless data sharing and collaboration across teams and partners.

Why Move Data from MongoDB into Snowflake

The Benefits of Transferring Data from MongoDB to Snowflake for Advanced Analytics

Data engineers and architects may choose to copy data from MongoDB into Snowflake for several strategic reasons. MongoDB often stores critical and potentially valuable datasets, such as user activity logs, product catalog information, or application-generated events. However, the true value of this data is frequently realized only when it is integrated with other enterprise data, such as sales records, customer profiles, or financial transactions, which may reside in different systems or formats. By transferring MongoDB data into Snowflake—an advanced, cloud-based data warehousing platform—organizations can efficiently consolidate and analyze information from multiple sources. This approach enables richer analytics and business insights without imposing extra operational workloads or risks on the MongoDB environment, as complex queries and processing are offloaded to Snowflake's scalable architecture.

Creating a User in MongoDB

This guide outlines how to create a user in a MongoDB database using the `mongo` shell (or its modern replacement, `mongosh`; the commands below work in both). MongoDB manages access to databases through user accounts and role-based privileges.

Prerequisites

  • Access to the `mongo` shell or a MongoDB client.
  • Administrative privileges (typically an existing user with the `userAdmin` or `userAdminAnyDatabase` role).

Step-by-Step Instructions

  1. Connect to the MongoDB Instance

```shell
mongo --host <hostname> --port <port> -u <admin_user> -p --authenticationDatabase admin
```

Replace `<hostname>`, `<port>`, and `<admin_user>`, and provide the admin password when prompted.

  2. Switch to the Target Database

Users are created in the context of a specific database. Replace `<mydatabase>` below with your target database name.

```javascript
use <mydatabase>
```

  3. Create the User

Use the `db.createUser()` method to add a new user. The following example creates a user with `readWrite` access on the current database.

```javascript
db.createUser({
    user: "newuser",
    pwd: "user_password", // choose a strong password
    roles: [
        { role: "readWrite", db: "<mydatabase>" }
    ]
})
```

  • Replace `newuser` and `user_password` with the desired username and password.
  • Adjust roles as necessary. Common roles: `readWrite`, `read`, `dbAdmin`, `userAdmin`.

Example

Here is a complete example for a database named `testdb`:

```javascript
use testdb

db.createUser({
    user: "appuser",
    pwd: "Str0ngPa$$w0rd!",
    roles: [
        { role: "readWrite", db: "testdb" }
    ]
})
```

Reference: Common Roles

| Role      | Description                                |
|-----------|--------------------------------------------|
| read      | Allows read-only operations.               |
| readWrite | Allows read and write operations.          |
| dbAdmin   | Administrative tasks (indexes, profiling). |
| userAdmin | Manage users and roles.                    |

For full details, see the MongoDB documentation on user management.

Installing the JDBC driver

At the time of writing, the JDBC driver for MongoDB is not included by default with Matillion Data Productivity Cloud due to licensing or redistribution restrictions. Therefore, users must manually download and install the JDBC driver before connecting Matillion to MongoDB databases.

To obtain the JDBC driver, visit the MongoDB BI Connector download page: https://www.mongodb.com/products/bi-connector. When selecting the driver, ensure you choose the Type 4 JDBC driver, as this provides native connectivity and is the preferred option for integration with Matillion Data Productivity Cloud.

Once you have downloaded the appropriate JDBC driver for your environment, you’ll need to upload and register the driver within your Matillion agent. Detailed installation instructions are available in the official documentation here: https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/. Carefully follow these steps, ensuring that the JDBC driver’s files are placed and referenced according to the documentation requirements.

After the driver has been installed and the Matillion Data Productivity Cloud agent has recognized it, you will be able to use the driver to connect to your MongoDB database from within Matillion. For further guidance on using database connections via JDBC drivers within the Matillion Data Productivity Cloud Designer, please refer to the usage instructions here: https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/.

By completing these steps, you will enable JDBC connectivity between Matillion Data Productivity Cloud and your MongoDB database, leveraging the preferred and supported driver method.

Checking network connectivity

To ensure successful connectivity between Matillion Data Productivity Cloud and your MongoDB database, you must configure your database to accept incoming connections from the Matillion agent, according to your deployment scenario (Full SaaS or Hybrid SaaS agent).

Additionally, if your MongoDB database connection is configured using a DNS hostname (rather than a direct IP address), ensure that the Matillion agent—whether Full SaaS or Hybrid SaaS—is able to resolve the DNS address correctly to connect to your database. This may require DNS configuration or allowing appropriate outbound DNS traffic from your environment.
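As a quick sanity check before configuring the connection in Matillion, you can verify both requirements, DNS resolution and TCP reachability, from the network where the agent runs. The sketch below is a minimal Python example; the hostname is illustrative, and MongoDB listens on port 27017 by default:

```python
import socket

def check_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Check the two things the Matillion agent must be able to do:
    resolve the hostname via DNS, then open a TCP connection."""
    try:
        # Resolve the hostname to an address (raises gaierror on failure).
        addr = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)[0][4]
    except socket.gaierror:
        print(f"DNS resolution failed for {host}")
        return False
    try:
        # Attempt the actual TCP connection.
        with socket.create_connection(addr[:2], timeout=timeout):
            return True
    except OSError:
        print(f"{host}:{port} resolved but is not accepting connections")
        return False

# Example (replace with your actual MongoDB host):
# check_reachable("mongo.example.internal", 27017)
```

If DNS resolution fails here, it will also fail for the agent, pointing to the DNS configuration issues described above rather than a firewall problem.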

Querying Data from a MongoDB Database

When working with MongoDB, you can retrieve data using query expressions that are conceptually similar to SQL `SELECT` statements. Below are instructions and best practices for querying your MongoDB database, especially when integrating with analytics platforms like Snowflake and using ETL tools such as Matillion.


1. Example Queries

SQL `SELECT` Statements vs. MongoDB

| SQL | MongoDB |
|-----|---------|
| `SELECT * FROM users;` | `db.users.find({});` |
| `SELECT * FROM users WHERE age > 30;` | `db.users.find({ age: { $gt: 30 } });` |
| `SELECT name, age FROM users WHERE active=1;` | `db.users.find({ active: 1 }, { name: 1, age: 1, _id: 0 });` |
| `SELECT COUNT(*) FROM orders;` | `db.orders.countDocuments({});` |
| `SELECT * FROM sales WHERE date > '2024-01-01';` | `db.sales.find({ date: { $gt: new Date('2024-01-01') } });` |

Note: In MongoDB, queries are written in JSON-like syntax. The `.find()` method is analogous to SQL's `SELECT`.
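If you script against MongoDB from Python (for example with a driver such as PyMongo), these filters and projections are just plain dictionaries. The sketch below builds two of the filters from the comparison above and, since no live database is assumed here, applies a tiny stand-in for the `$gt` comparison to a sample document; all names are illustrative:

```python
# MongoDB filter and projection documents expressed as Python dicts,
# mirroring the SQL-vs-MongoDB comparison above.

# SELECT * FROM users WHERE age > 30;
age_filter = {"age": {"$gt": 30}}

# SELECT name, age FROM users WHERE active = 1;
active_filter = {"active": 1}
projection = {"name": 1, "age": 1, "_id": 0}

# With a live connection (e.g. PyMongo) these would be passed as:
#   db.users.find(age_filter)
#   db.users.find(active_filter, projection)

def matches_gt(doc, field, bound):
    """Tiny stand-in for MongoDB's $gt comparison on a single field."""
    return field in doc and doc[field] > bound

sample = {"name": "Ada", "age": 36, "active": 1}
print(matches_gt(sample, "age", 30))  # True: age 36 > 30
```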


2. Data Type Conversion Considerations

When integrating MongoDB data with Snowflake, be aware that not all MongoDB data types map directly to Snowflake data types. For example:

  • ObjectId: MongoDB’s `_id` field (an ObjectId) is typically converted to a string in Snowflake.
  • Dates: MongoDB’s `Date` type converts to Snowflake’s `TIMESTAMP`.
  • Arrays/Objects: Arrays and embedded documents in MongoDB may need to be serialized (to JSON strings) or unnested, depending on your schema design and integration requirements.

Always review how your ETL tool or connector is mapping data types during transfer to avoid data truncation or loss.
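The mappings above can be sketched in a few lines of Python. This is a simplified illustration with hypothetical field names, not what any particular connector does internally; a real connector or ETL tool handles these conversions for you:

```python
import json
from datetime import datetime, timezone

# One hypothetical MongoDB document (ObjectId shown as its hex string).
mongo_doc = {
    "_id": "664f1a2b9c8d7e6f5a4b3c2d",
    "createdAt": datetime(2024, 1, 15, 12, 30, tzinfo=timezone.utc),
    "tags": ["new", "sale"],                             # array
    "address": {"city": "Manchester", "country": "UK"},  # embedded document
}

def to_snowflake_row(doc):
    """Map MongoDB values onto Snowflake-friendly column values."""
    return {
        "ID": str(doc["_id"]),                       # ObjectId -> VARCHAR
        "CREATED_AT": doc["createdAt"].isoformat(),  # Date -> TIMESTAMP text
        "TAGS": json.dumps(doc["tags"]),             # array -> JSON string
        "ADDRESS": json.dumps(doc["address"]),       # embedded doc -> JSON string
    }

row = to_snowflake_row(mongo_doc)
print(row["TAGS"])  # '["new", "sale"]'
```

In Snowflake, the serialized JSON strings could then be loaded into `VARIANT` columns and queried or flattened with Snowflake's semi-structured data functions.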


3. Initial vs. Incremental Loads in ETL

The best pattern for data replication from MongoDB to Snowflake (or another analytics destination) is:

  • Initial Load: A one-time, full-data load with no filter.
  • Incremental Loads: Ongoing, partial loads that fetch only new or changed data, using a filtration mechanism.

Using the Database Query Component

Whether performing an initial or incremental load, the Matillion Database Query component can be used to execute your query and return the data.

a) Initial Load

  • Query: No filter/where clause; load all documents.
  • Example:

```javascript
db.sales.find({})
```

(In Matillion, the component query would often be empty or simply `{}` in the filter parameter.)

b) Incremental Load

  • Query: Includes a filter based on a field that lets you identify new or updated records, such as `dateCreated` or an `_id` field that includes a timestamp.
  • Example:

```javascript
db.sales.find({ dateCreated: { $gt: <last_loaded_date> } })
```

Or (using the timestamp embedded in a MongoDB ObjectId):

```javascript
db.sales.find({ _id: { $gt: ObjectId("<last_loaded_id>") } })
```

The filter should be dynamically set to only retrieve records inserted or updated since the last load.
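The `_id`-based approach works because the first 4 bytes of every ObjectId are a big-endian Unix timestamp of its creation. The Python sketch below shows how a watermark datetime can be turned into an ObjectId lower bound for such a filter; how you persist the watermark between runs (a control table, a pipeline variable) is up to your pipeline design:

```python
from datetime import datetime, timezone

def objectid_lower_bound(ts: datetime) -> str:
    """Build the smallest ObjectId hex string whose embedded creation
    timestamp is >= ts. An ObjectId is 12 bytes: a 4-byte big-endian
    Unix timestamp followed by 8 bytes we pad with zeros, giving a
    safe lower bound for an {_id: {$gt: ObjectId(...)}} filter."""
    seconds = int(ts.timestamp())
    return format(seconds, "08x") + "00" * 8  # 24 hex characters

# Watermark from the previous run (illustrative value).
last_loaded = datetime(2024, 1, 1, tzinfo=timezone.utc)

# Filter on an application-maintained date field...
date_filter = {"dateCreated": {"$gt": last_loaded}}

# ...or on _id via the ObjectId's embedded creation timestamp.
id_filter = {"_id": {"$gt": objectid_lower_bound(last_loaded)}}

print(id_filter["_id"]["$gt"])
```

Note that the `_id` timestamp reflects when a document was created, not when it was last updated, so it only captures inserts; use a maintained date field if you also need updated records.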

For more on this strategy, refer to Incremental Load Data Replication Strategy on Matillion Exchange.


Tip: Always test your incremental logic to ensure you are neither missing records (due to late or out-of-order updates) nor reloading already-ingested data.

Data Integration Architecture

Loading data in advance of integration, as implemented in the Extract, Load, and Transform (ELT) architecture, enables organizations to divide and conquer the complexities of data integration by splitting the process into two distinct steps: loading raw data into Snowflake, and then transforming it afterwards. This approach offers important advantages over traditional ETL models. Data integration inherently requires data to be transformed—often cleansed, standardized, or aggregated—before it can be used effectively for analytics or reporting. The most effective way to manage these requirements is through data transformation pipelines, which automate and orchestrate the sequence of transformation tasks. Additionally, a significant benefit of the ELT approach is that all transformation and integration work is performed directly within the Snowflake environment. This means the process is fast and on-demand, with scalable compute resources that adapt to varying workloads, and avoids the need for separate data processing infrastructure—ultimately reducing both operational complexity and cost.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.