Extracting data from Couchbase to Snowflake
Extracting data from Couchbase is a common requirement for organizations seeking to leverage their NoSQL data in advanced analytics platforms such as Snowflake. Successfully moving data between these systems enables deeper insights and more flexible reporting, but the process involves a series of careful preparations and technical considerations. In this article, we will guide you through each step necessary to efficiently extract your data from Couchbase and load it into Snowflake. We begin by outlining how to create the appropriate identity in Couchbase to ensure secure and authorized data access. Next, for those using Matillion as their data integration platform, we address how to check for the required JDBC driver—or how to acquire it if it is not already available. We will also discuss key aspects of network connectivity, ensuring that a reliable and secure connection exists between your Couchbase instance and Snowflake environment. Finally, we will explore strategies for querying data from Couchbase—first through an initial full load, and subsequently by implementing incremental extraction to keep your Snowflake data warehouse up to date efficiently. Whether you are embarking on your first Couchbase extraction project or optimizing an existing pipeline, this article will provide clear guidance on each crucial step of the process.
What is Couchbase?
Couchbase is a high-performance, distributed NoSQL database designed to support mission-critical web, mobile, and IoT applications requiring low latency, high throughput, and scalability. It combines the flexibility of JSON document storage with powerful key-value and query capabilities, including support for SQL-like queries via N1QL. Couchbase offers integrated caching, memory-first architecture, integrated full-text search, analytics, and eventing features, making it suitable for a wide range of modern workloads. Its architecture includes built-in cross datacenter replication (XDCR), tunable consistency, and seamless horizontal scaling, providing robust support for high availability, reliability, and operational efficiency in large-scale, dynamic environments.
What is Snowflake?
Snowflake is a fully managed, cloud-native data platform designed for scalable, high-performance analytics. Unlike traditional data warehouses, it separates compute from storage for elastic resource allocation and automatic scaling. Running on AWS, Azure, and Google Cloud, it enables seamless data sharing and natively supports semi-structured data (e.g., JSON, Avro) with its VARIANT type. Snowflake features robust security, including automatic encryption and granular access controls. Its multi-cluster compute and micro-partitioned storage allow for both ad hoc analysis and long data pipelines without contention, making it ideal for data engineering, business intelligence, and machine learning workloads.
Why Move Data from Couchbase into Snowflake
Unlocking Analytics Power: Reasons to Replicate Data from Couchbase to Snowflake
A data engineer or architect may choose to copy data from Couchbase into Snowflake for several compelling reasons. Firstly, Couchbase often stores data that is potentially valuable for analytics, reporting, and decision-making processes. However, the true value of this data is often realized only when it is integrated with information from other sources, enabling a holistic view of operations or customer behavior. Directly performing complex analytical queries or integrations on Couchbase can lead to increased workload and potentially impact the performance of the operational system. By migrating data into Snowflake, a scalable and high-performance cloud data platform, organizations can efficiently integrate Couchbase data with data from other systems. This enables advanced analytics and business intelligence use cases without burdening the Couchbase cluster, thus preserving its primary role in supporting mission-critical applications.
Similar connectors
Creating a User in Couchbase
This guide explains how to create a user (identity) in a Couchbase Server deployment. You can create users using either the Couchbase Web Console, CLI, or N1QL (SQL-like) queries through the REST API. This example uses N1QL via the Query Workbench or shell.
Prerequisites
- You require Cluster Admin or Security Admin privileges.
- Make sure you have access to the Couchbase Query Workbench or the
cbq
shell.
Example: Creating a User via N1QL
Couchbase user management is handled by the Administrator REST API and not directly by N1QL. However, for database scripting, you can use the REST API through tools such as
curl.
1. Compose the User Creation Command
The following
curlcommand creates a database user with specific roles.
Example Command
bash curl -u Administrator:password \ -X POST http://localhost:8091/settings/rbac/users/local/john_doe \ -d 'name=John Doe' \ -d 'password=s3cretPassw0rd!' \ -d 'roles=cluster_admin,bucket_admin[travel-sample]'
Parameters
Administrator:password
: Replace with your administrator credentials.john_doe
: The new user's username (identity).name
: User’s display name.password
: User’s password.roles
: Assigns roles to the user; for example,cluster_admin
andbucket_admin
for thetravel-sample
bucket. Comma-separate for multiple roles.
Example: Creating a User via Couchbase Web Console
- Log in to the Couchbase Web Console as an administrator.
- Navigate to Security ➔ Users.
- Click Add User.
- Specify the Username, Display Name, and Password.
- Assign the necessary Roles (e.g.,
Data Reader
orQuery Select
on relevant buckets). - Click Save to create the user.
Notes
- Assign only the minimum required roles for the user in line with security best practices.
- For more information about roles, refer to the Couchbase RBAC documentation.
Installing the JDBC driver
At the time of writing, the Couchbase JDBC driver is not bundled by default within Matillion Data Productivity Cloud due to licensing and redistribution restrictions. To enable connectivity, you'll need to obtain and install the JDBC driver manually before using Couchbase as a data source.
Downloading the Couchbase JDBC Driver
- Visit the official Simba download page: https://www.simba.com/drivers/couchbase-jdbc-odbc/.
- On the page, review the available Couchbase drivers and select the Type 4 JDBC driver (also referred to as a "pure Java" implementation) for compatibility and simplicity.
- Download the appropriate driver package to your local machine, ensuring you comply with the Simba license terms.
Installing the JDBC Driver into Matillion Data Productivity Cloud
The installation process requires uploading the Couchbase JDBC driver to your Matillion Agent. Please follow these steps:
- Sign in to the Matillion Data Productivity Cloud UI with adequate administrative privileges.
- Refer to the official instructions for uploading third-party JDBC drivers, located at: https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/.
-
Follow the documented procedure to upload the JDBC JAR file you downloaded from the Simba website.
-
Ensure the filename and upload location adheres to documented requirements.
- After uploading, restart the relevant Matillion Agent to activate the new driver (if indicated in the instructions).
Next Steps: Using the Couchbase JDBC Driver
Once the Couchbase JDBC driver is installed on your Matillion Agent, you can begin configuring and running Couchbase queries.
For detailed steps on connecting to your Couchbase database and issuing queries within Matillion Data Productivity Cloud, consult the database component usage guide at: https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/.
By following these instructions and referencing the linked documentation, your Matillion environment will be equipped to leverage Couchbase as a source for data productivity workflows.
Checking network connectivity
To establish connectivity between Matillion Data Productivity Cloud and your Couchbase database, you must ensure that the Couchbase database allows incoming network connections from the appropriate source, depending on your deployment configuration.
-
Full SaaS agent configuration:
Permit inbound connections to your Couchbase database from the public IP addresses used by Matillion Data Productivity Cloud. The current list of IP addresses can be found at:
https://docs.matillion.com/data-productivity-cloud/security/docs/allowing-ip-addresses/ -
Hybrid SaaS deployment:
Allow incoming connections to your Couchbase database from the network range(s) associated with your own virtual private cloud (VPC). To help verify network connectivity in this setup, you can use the utilities and tools available here:
https://exchange.matillion.com/data-productivity-cloud/pipeline/check-network-access/
Additionally, if your Couchbase database is referenced using a DNS name (not a raw IP address), ensure that the Full SaaS or Hybrid SaaS agent can resolve the DNS name correctly. Failure to resolve the DNS address will prevent Matillion agents from connecting to your Couchbase database.
Querying Data from Couchbase Database
To query data from a Couchbase database, you typically use the N1QL query language, which is similar to SQL for JSON data. The process is very familiar to users of relational databases, though you should be aware of potential datatype conversions (for example, when working with platforms such as Snowflake).
Below are instructions and examples to guide you:
Example: Couchbase SQL-Like Queries
Assume you have a bucket named
customers, with JSON documents representing customer records.
Full Table Select
To fetch all documents (as in an initial full load):
SELECT * FROM `customers`;
Selecting Specific Fields
To fetch only the
nameand
SELECT name, email FROM `customers`;
Filtering Documents
To query records where
countryis
"USA"(useful for incremental loads):
SELECT * FROM `customers` WHERE country = "USA";
Suppose you want only records modified after a certain timestamp (using the field
last_modified):
SELECT * FROM `customers` WHERE last_modified > "2024-05-01T00:00:00Z";
Note on Datatype Conversion
When moving data between Couchbase (with its JSON document model) and another database like Snowflake (tabular SQL model), be mindful of data type conversions, such as:
- JSON
null
vs. SQLNULL
- Couchbase numeric types vs. Snowflake numeric types
- Couchbase ISO date strings mapped to Snowflake
TIMESTAMP
Proper mapping may require transformation at query time, or post-load data wrangling.
Recommended Load Pattern
A robust pattern for querying and ETL from Couchbase consists of:
- Initial Load:
- Use a Database Query (e.g., in a tool like Matillion) with no filter in the WHERE clause to extract all data.
- Example (no filter):
SELECT *
FROM `customers`;
2. Incremental Loads:- Use the same Database Query component, but specify a suitable filter, e.g., only retrieve records modified since last load. - Example (with filter):
SELECT *
FROM `customers`
WHERE last_modified > "2024-06-20T00:00:00Z";
Read more: Incremental Load Data Replication Strategy (Matillion Exchange)
The same Database Query component—changing only the filter—lets you seamlessly switch between full and incremental loads for efficient and manageable data movement.
Data Integration Architecture
Loading data in advance of integration represents a "divide and conquer" strategy, effectively splitting the integration process into two distinct steps. This approach is a key strength of the ELT (Extract, Load, Transform) architecture. Data is first loaded into the target database—in this case, Snowflake—before any integration or transformation occurs. Effective data integration requires robust data transformation, which is best achieved using purpose-built data transformation pipelines that automate and manage these tasks reliably. Furthermore, a major advantage of the ELT architecture is that both transformation and integration take place within the Snowflake environment itself. This ensures that data processing is fast, performed on-demand, and highly scalable, all without the need for external data processing resources or additional infrastructure costs.