Extracting data from MongoDB to Snowflake
Extracting data from MongoDB is a critical step in enabling organizations to fully leverage their operational data in advanced analytics platforms such as Snowflake. By transferring data from MongoDB to Snowflake, organizations can take advantage of Snowflake’s powerful data processing and querying capabilities while maintaining MongoDB as the source of operational truth. This article will guide you through the essential steps involved in this data extraction and loading process. We will begin by creating an identity in MongoDB that is authorized to access the data you wish to extract. For users working with Matillion, we will cover how to check for the presence of – or acquire – the requisite JDBC driver needed for connectivity between the two systems. Next, we will review the necessary network connectivity considerations to ensure a secure and efficient transfer from source to target. Finally, we will explain how to query data from MongoDB, with a focus on strategies for both initial full data loads and subsequent incremental updates. By following this guide, you will be well-equipped to set up a robust and reliable process for extracting data from MongoDB and loading it into Snowflake.
What is MongoDB?
MongoDB is a popular open-source, document-oriented NoSQL database designed for scalability, flexibility, and high performance. It stores data in BSON (binary JSON) documents, which enables support for rich and dynamic schemas, allowing for seamless modeling of complex and evolving data structures without predefined schemas. MongoDB offers robust indexing, powerful querying capabilities—including support for ad hoc queries and aggregation pipelines—as well as horizontal scaling through sharding and high availability via replica sets. Widely used in modern application development, it is well-suited for handling large volumes of unstructured or semi-structured data, and it integrates effectively with a variety of programming languages and ecosystems.
What is Snowflake?
Snowflake is a cloud-native data warehouse platform that decouples compute and storage, allowing for highly scalable, concurrent data processing. Built on a multi-cluster shared data architecture, it supports semi-structured and structured data using a SQL-based interface, making it suitable for diverse analytics workloads. Snowflake’s fully managed service automates infrastructure management, scaling, tuning, and security, thus enabling organizations to focus on data engineering and analytics tasks. Its role-based access control, support for ACID transactions, and integration with multiple cloud providers (AWS, Azure, GCP) make it a robust solution for enterprise data warehousing, enabling seamless data sharing and collaboration across teams and partners.
Why Move Data from MongoDB into Snowflake
The Benefits of Transferring Data from MongoDB to Snowflake for Advanced Analytics
Data engineers and architects may choose to copy data from MongoDB into Snowflake for several strategic reasons. MongoDB often stores critical and potentially valuable datasets, such as user activity logs, product catalog information, or application-generated events. However, the true value of this data is frequently realized only when it is integrated with other enterprise data, such as sales records, customer profiles, or financial transactions, which may reside in different systems or formats. By transferring MongoDB data into Snowflake—an advanced, cloud-based data warehousing platform—organizations can efficiently consolidate and analyze information from multiple sources. This approach enables richer analytics and business insights without imposing extra operational workloads or risks on the MongoDB environment, as complex queries and processing are offloaded to Snowflake's scalable architecture.
Creating a User in MongoDB
This guide outlines how to create a user in a MongoDB database using the `mongo` shell. MongoDB manages access to databases through user accounts and role-based privileges.
Prerequisites
- Access to the `mongo` shell or a MongoDB client.
- Required administrative privileges (typically an existing user with the `userAdmin` or `userAdminAnyDatabase` role).
Step-by-Step Instructions
- Connect to the MongoDB Instance
```shell
mongo --host <hostname> --port <port> -u <admin_user> -p --authenticationDatabase admin
```
Replace `<hostname>`, `<port>`, and `<admin_user>` with your values, and provide the admin password when prompted.
- Switch to the Target Database
Users are created in the context of a specific database. Replace `<mydatabase>` below with your target database name.

```javascript
use <mydatabase>
```
- Create the User
Use the `db.createUser()` method to add a new user. The following example creates a user with `readWrite` access on the current database.

```javascript
db.createUser({
  user: "newuser",
  pwd: "user_password",  // choose a strong password
  roles: [
    { role: "readWrite", db: "<mydatabase>" }
  ]
})
```
- Replace `newuser` and `user_password` with the desired username and password.
- Adjust roles as necessary. Common roles: `readWrite`, `read`, `dbAdmin`, `userAdmin`.
Example
Here is a complete example for a database named `testdb`:

```javascript
use testdb

db.createUser({
  user: "appuser",
  pwd: "Str0ngPa$$w0rd!",
  roles: [
    { role: "readWrite", db: "testdb" }
  ]
})
```
Reference: Common Roles
| Role | Description |
|---|---|
| read | Allows read-only operations. |
| readWrite | Allows read and write operations. |
| dbAdmin | Administrative tasks (indexes, profiling). |
| userAdmin | Manage users and roles. |
For full details, see the MongoDB documentation on user management.
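Once the user exists, applications and ETL tools typically connect with a standard MongoDB connection URI. The sketch below (Python, standard library only; the hostname is a placeholder, and the credentials are the illustrative ones from the example above) builds such a URI, percent-encoding the credentials so special characters like `$` and `!` survive:

```python
from urllib.parse import quote_plus

def build_mongo_uri(user: str, password: str, host: str, port: int, db: str) -> str:
    """Build a mongodb:// connection URI, percent-encoding the credentials."""
    return (
        f"mongodb://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{db}?authSource={db}"
    )

# Placeholder hostname; credentials from the example above.
uri = build_mongo_uri("appuser", "Str0ngPa$$w0rd!", "mongo.example.com", 27017, "testdb")
print(uri)
# mongodb://appuser:Str0ngPa%24%24w0rd%21@mongo.example.com:27017/testdb?authSource=testdb
```

Forgetting to encode the password is a common cause of authentication failures when it contains URI-reserved characters.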
Installing the JDBC driver
At the time of writing, the JDBC driver for MongoDB is not included by default with Matillion Data Productivity Cloud due to licensing or redistribution restrictions. Therefore, users must manually download and install the JDBC driver before connecting Matillion to MongoDB databases.
To obtain the JDBC driver, visit the MongoDB BI Connector download page: https://www.mongodb.com/products/bi-connector. When selecting the driver, ensure you choose the Type 4 JDBC driver, as this provides native connectivity and is the preferred option for integration with Matillion Data Productivity Cloud.
Once you have downloaded the appropriate JDBC driver for your environment, you’ll need to upload and register the driver within your Matillion agent. Detailed installation instructions are available in the official documentation here: https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/. Carefully follow these steps, ensuring that the JDBC driver’s files are placed and referenced according to the documentation requirements.
After the driver has been installed and the Matillion Data Productivity Cloud agent has recognized it, you will be able to use the driver to connect to your MongoDB database from within Matillion. For further guidance on using database connections via JDBC drivers within the Matillion Data Productivity Cloud Designer, please refer to the usage instructions here: https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/.
By completing these steps, you will enable JDBC connectivity between Matillion Data Productivity Cloud and your MongoDB database, leveraging the preferred and supported driver method.
Checking network connectivity
To ensure successful connectivity between Matillion Data Productivity Cloud and your MongoDB database, you must configure your database to accept incoming connections according to your deployment scenario:
- Full SaaS agent configuration: Allow inbound connections to your MongoDB database from the specific IP addresses used by Matillion. Obtain the latest list of these IP addresses from https://docs.matillion.com/data-productivity-cloud/security/docs/allowing-ip-addresses/.
- Hybrid SaaS deployment: Allow connections from your organization's own Virtual Private Cloud (VPC), as the Matillion Hybrid SaaS agent initiates connections from your infrastructure. For more information on accessing your network and useful tools to test connectivity, refer to https://exchange.matillion.com/data-productivity-cloud/pipeline/check-network-access/.
Additionally, if your MongoDB database connection is configured using a DNS hostname (rather than a direct IP address), ensure that the Matillion agent—whether Full SaaS or Hybrid SaaS—is able to resolve the DNS address correctly to connect to your database. This may require DNS configuration or allowing appropriate outbound DNS traffic from your environment.
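Both requirements, TCP reachability and DNS resolution, can be smoke-tested from the machine where the agent runs. A minimal Python sketch using only the standard library (the hostname below is a placeholder; MongoDB's default port is 27017):

```python
import socket

def can_resolve(host: str) -> bool:
    """Return True if the hostname resolves via the local DNS configuration."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: check DNS and the default MongoDB port for a placeholder hostname.
# print(can_resolve("mongo.example.com"), can_connect("mongo.example.com", 27017))
```

If `can_resolve` fails, fix DNS first; if it succeeds but `can_connect` fails, look at firewalls, security groups, and the MongoDB `bindIp` setting.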
Querying Data from a MongoDB Database
When working with MongoDB, you can retrieve data using query expressions that are conceptually similar to SQL `SELECT` statements. Below are instructions and best practices for querying your MongoDB database, especially when integrating with analytics platforms like Snowflake and using ETL tools such as Matillion.
1. Example Queries
SQL `SELECT` Statements vs. MongoDB

| SQL | MongoDB |
|---|---|
| `SELECT * FROM users;` | `db.users.find({});` |
| `SELECT * FROM users WHERE age > 30;` | `db.users.find({ age: { $gt: 30 } });` |
| `SELECT name, age FROM users WHERE active=1;` | `db.users.find({ active: 1 }, { name: 1, age: 1, _id: 0 });` |
| `SELECT COUNT(*) FROM orders;` | `db.orders.countDocuments({});` |
| `SELECT * FROM sales WHERE date > '2024-01-01';` | `db.sales.find({ date: { $gt: new Date('2024-01-01') } });` |

Note: In MongoDB, queries are written in JSON-like syntax. The `.find()` method is analogous to SQL's `SELECT`.
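To make the filter semantics concrete, here is a toy Python sketch (standard library only; an illustration of how `$gt` and equality filters behave, not MongoDB's actual query engine) that evaluates a flat filter against plain dictionaries the way `find()` conceptually does:

```python
def matches(doc: dict, query: dict) -> bool:
    """Return True if doc satisfies a flat MongoDB-style filter.

    Supports plain equality ({"active": 1}) and the $gt operator
    ({"age": {"$gt": 30}}); real MongoDB supports far more operators.
    """
    for field, condition in query.items():
        value = doc.get(field)
        if isinstance(condition, dict) and "$gt" in condition:
            if value is None or not value > condition["$gt"]:
                return False
        elif value != condition:
            return False
    return True

users = [
    {"name": "Ada", "age": 36, "active": 1},
    {"name": "Bob", "age": 25, "active": 1},
]

# Conceptual equivalent of db.users.find({ age: { $gt: 30 } })
over_30 = [u["name"] for u in users if matches(u, {"age": {"$gt": 30}})]
print(over_30)  # ['Ada']
```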
2. Data Type Conversion Considerations
When integrating MongoDB data with Snowflake, be aware that not all MongoDB data types map directly to Snowflake data types. For example:
- ObjectId: MongoDB’s `_id` field (an ObjectId) is typically converted to a string in Snowflake.
- Dates: MongoDB’s `Date` type converts to Snowflake’s `TIMESTAMP`.
- Arrays/Objects: Arrays and embedded documents in MongoDB may need to be serialized (to JSON strings) or unnested, depending on your schema design and integration requirements.
Always review how your ETL tool or connector is mapping data types during transfer to avoid data truncation or loss.
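As a sketch of what such a mapping involves, the Python snippet below (standard library only; the `_id` value is a stand-in hex string rather than a real ObjectId, since no MongoDB driver is assumed here) serializes a MongoDB-style document into Snowflake-friendly JSON: dates become ISO-8601 strings, non-JSON types fall back to strings, and nested arrays/objects are kept as JSON:

```python
import json
from datetime import datetime, timezone

def to_snowflake_json(doc: dict) -> str:
    """Serialize a MongoDB-style document to a JSON string for Snowflake.

    datetime values become ISO-8601 strings (loadable into TIMESTAMP);
    anything else json cannot handle natively (e.g. an ObjectId from a
    real driver) falls back to str().
    """
    def convert(value):
        if isinstance(value, datetime):
            return value.isoformat()
        return str(value)
    return json.dumps(doc, default=convert)

# Illustrative document; the _id is a placeholder hex string.
doc = {
    "_id": "5fee66000000000000000001",
    "created": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "tags": ["a", "b"],
    "address": {"city": "Berlin"},
}
print(to_snowflake_json(doc))
```

The resulting JSON string can be staged into a Snowflake `VARIANT` column and flattened with SQL afterwards.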
3. Initial vs. Incremental Loads in ETL
The best pattern for data replication from MongoDB to Snowflake (or another analytics destination) is:
- Initial Load: A one-time, full-data load with no filter.
- Incremental Loads: Ongoing, partial loads that fetch only new or changed data, using a filtering mechanism.
Using the Database Query Component
Whether performing an initial or incremental load, the Matillion Database Query component can be used to execute your query and return the data.
a) Initial Load
- Query: No filter/where clause; load all documents.
- Example:

```javascript
db.sales.find({})
```

(In Matillion, the component query would often be empty or simply `{}` in the filter parameter.)
b) Incremental Load
- Query: Includes a filter based on a field that lets you identify new or updated records, such as `dateCreated` or an `_id` field that includes a timestamp.
- Example:

```javascript
db.sales.find({ dateCreated: { $gt: <last_loaded_date> } })
```

Or, using the timestamp embedded in a MongoDB ObjectId:

```javascript
db.sales.find({ _id: { $gt: ObjectId("<last_loaded_id>") } })
```
The filter should be dynamically set to only retrieve records inserted or updated since the last load.
For more on this strategy, refer to Incremental Load Data Replication Strategy on Matillion Exchange.
Tip: Always test your incremental logic to ensure you are not missing records (due to late or out-of-order updates) nor reloading already-ingested data.
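When using the `_id` route, a high-water mark can be derived from a timestamp, because the first four bytes of an ObjectId encode its creation time in seconds since the Unix epoch. A standard-library Python sketch (padding the remaining eight bytes with zeros, so the boundary sorts at or before any real ObjectId created in that second):

```python
import struct
from datetime import datetime, timezone

def objectid_lower_bound(ts: datetime) -> str:
    """Return a 24-char hex ObjectId boundary for the given timestamp:
    4 big-endian timestamp bytes followed by 8 zero bytes."""
    seconds = int(ts.timestamp())
    return struct.pack(">I", seconds).hex() + "00" * 8

# Boundary for 2021-01-01 00:00:00 UTC; usable in
# db.sales.find({ _id: { $gt: ObjectId("<boundary>") } })
boundary = objectid_lower_bound(datetime(2021, 1, 1, tzinfo=timezone.utc))
print(boundary)  # 5fee66000000000000000000
```

Note that ObjectId timestamps reflect insertion time, not update time, so this approach captures new documents but not in-place updates to existing ones.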
Data Integration Architecture
Loading data in advance of integration, as implemented in the Extract, Load, and Transform (ELT) architecture, enables organizations to divide and conquer the complexities of data integration by splitting the process into two distinct steps: loading raw data into Snowflake, then transforming it afterwards. This approach offers important advantages over traditional ETL models.

Data integration inherently requires data to be transformed, often cleansed, standardized, or aggregated, before it can be used effectively for analytics or reporting. The most effective way to manage these requirements is through data transformation pipelines, which automate and orchestrate the sequence of transformation tasks.

A significant benefit of the ELT approach is that all transformation and integration work is performed directly within the Snowflake environment. The process is fast and on-demand, with scalable compute resources that adapt to varying workloads, and it avoids the need for separate data processing infrastructure, ultimately reducing both operational complexity and cost.