Extracting data from Google Firestore to Amazon Redshift
Extracting data from Google Firestore is a critical step for organizations seeking to integrate NoSQL application data with their analytics platforms. Amazon Redshift, as a powerful data warehouse, is a popular target for these integration pipelines. However, successfully moving data between these systems requires careful attention to authentication, connectivity, and efficient extraction strategies. In this article, we will guide you through the key steps needed to transfer data from Google Firestore into Amazon Redshift. We will begin by outlining how to create a suitable identity for accessing Firestore and, for Matillion ETL users, how to ensure you have the appropriate JDBC driver installed. Next, we will address the requirements necessary to establish network connectivity between your source and target systems. Finally, we will demonstrate how to query Firestore data, discussing both initial data loads and strategies for extracting new or updated records incrementally. Whether you are just starting to work with Firestore and Redshift, or are optimizing an existing process, these fundamentals will provide a reliable foundation for your data integration workflow.
What is Google Firestore?
Google Firestore is a fully managed, serverless, NoSQL document database offered as part of the Google Cloud Platform. It provides real-time data synchronization and strong consistency, making it suitable for both web and mobile applications requiring up-to-date, reliable data access. Firestore stores data in collections of documents, supporting flexible, hierarchical data structures. Its architecture enables robust scalability, seamless offline support, and integration with other Firebase and Google Cloud services. Firestore also features powerful querying capabilities, security through fine-grained access controls, and transactional operations, making it a versatile choice for modern distributed applications.
What is Amazon Redshift?
Amazon Redshift is a fully managed, petabyte-scale data warehouse built on PostgreSQL, optimized for fast querying and analysis using columnar storage and advanced compression. It leverages massively parallel processing (MPP) to distribute complex SQL queries across multiple nodes for efficient, scalable analytics. Redshift integrates seamlessly with AWS services like S3 for data loading (using the COPY command) and other analytics tools. Security features include VPC-based network isolation, encryption at rest and in transit, and granular access control with AWS IAM, making Redshift a secure and versatile option for large-scale data analysis in the cloud.
Why Move Data from Google Firestore into Amazon Redshift
Unlocking Analytics Potential: Copying Data from Google Firestore to Amazon Redshift
A data engineer or architect might choose to copy data from Google Firestore into Amazon Redshift because Firestore often contains valuable operational or transactional data that can drive business insights. By integrating this data with other datasets—such as sales, marketing, or logistics data stored in Redshift—the full value of the Firestore data can be realized through comprehensive analytics and reporting. Furthermore, performing data integration and analytical queries within Redshift, instead of directly in Firestore, avoids adding analytical workload to the transactional Firestore database. This approach helps maintain the performance and reliability of Firestore for its primary application workloads, while leveraging Redshift’s scalable analytical capabilities for downstream processing and business intelligence.
Creating a User in Google Firestore
Google Firestore is a NoSQL, document-oriented database. Rather than using SQL scripts for creating records, data manipulation in Firestore is achieved through API calls in your preferred programming language. Below are step-by-step instructions for creating a user document in a Firestore database, using both the Firestore Console and the Node.js SDK as examples.
1. Using the Firestore Console
- Go to the Firebase Console: Navigate to https://console.firebase.google.com/ and choose your project.
- Access Firestore Database: In the left-hand menu, select "Firestore Database".
- Go to the Data Tab: Click the "Data" tab if you aren't already there.
- Create a New Collection:
  - Click “Start collection”.
  - Enter a collection ID (e.g., `users`).
  - Click “Next”.
- Create a Document:
  - Leave the Document ID blank for automatic generation, or enter a custom ID (such as a user’s unique UID).
  - Add fields that define the user, for example: `name` (string), `email` (string), `createdAt` (timestamp).
  - Click “Save”.
2. Using the Node.js SDK
Below is an example of how to create a user document programmatically with the official Firestore Node.js SDK.
a. Prerequisites
- Install the Firebase Admin SDK:

```bash
npm install firebase-admin
```

- Initialize Firebase in your application (load your service account credentials as required):

```javascript
const admin = require('firebase-admin');

admin.initializeApp({
  credential: admin.credential.applicationDefault()
});

const db = admin.firestore();
```
b. Insert a User Document
```javascript
const newUser = {
  name: "Alice Tan",
  email: "[email protected]",
  createdAt: admin.firestore.FieldValue.serverTimestamp()
};

// Automatic document ID:
db.collection('users').add(newUser)
  .then(docRef => {
    console.log("User created with ID: ", docRef.id);
  })
  .catch(error => {
    console.error("Error adding document: ", error);
  });

// --- OR ---
// Specify a custom document ID (e.g., user ID or username):
const userId = "user_12345";
db.collection('users').doc(userId).set(newUser)
  .then(() => {
    console.log("User created with custom ID: ", userId);
  })
  .catch(error => {
    console.error("Error setting document: ", error);
  });
```
3. Equivalent Structure (For SQL Users)
Note: Firestore does not use SQL syntax. A rough SQL equivalent of the user creation above:

```sql
INSERT INTO users (name, email, createdAt)
VALUES ('Alice Tan', '[email protected]', CURRENT_TIMESTAMP);
```

corresponds to the operation shown in the Node.js example.
For other programming environments, consult the Firestore client libraries documentation.
Installing the JDBC Driver
At the time of writing, the JDBC driver for Google Firestore is not bundled by default with Matillion Data Productivity Cloud. This is due to licensing and redistribution restrictions, which require you to obtain and install the driver yourself. The following instructions will guide you through downloading the necessary driver and installing it into your Matillion environment for immediate use.
Downloading the Google Firestore JDBC Driver
- Visit the Simba Drivers for Firestore download page: https://www.simba.com/drivers/firestore-jdbc-odbc/.
- When downloading, ensure you select the Type 4 JDBC driver option if available. A Type 4 JDBC driver is a platform-independent, pure Java driver, which is ideal for cloud and modern ETL platforms like Matillion.
- Follow any required steps on the Simba website to obtain the driver package. This may include registering for an account or agreeing to license terms.
Installing the Driver in Matillion Data Productivity Cloud
To install your downloaded Firestore JDBC driver:
- Access Matillion Data Productivity Cloud and navigate to the Agent that will be used to connect to Firestore.
- Refer to the official driver uploading instructions here: https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/
- Upload the JDBC driver JAR file according to the documented process. This typically involves selecting the appropriate location in the Matillion UI or agent interface and confirming the upload.
- After uploading the driver, confirm its appearance in your list of available JDBC drivers.
Configuring Database Connections and Using the Driver
Once the driver is uploaded:
- You can configure your Firestore connection and use it within Matillion Data Productivity Cloud’s Designer environment.
- Detailed, step-by-step guidance on working with database query components, test connections, and workflow integration is available here: https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/.
Please reference the Matillion documentation for product-specific settings and further assistance with advanced connection parameters.
Checking network connectivity
To ensure that Matillion Data Productivity Cloud can connect to your Google Firestore database, you must configure your Firestore instance to allow incoming connections based on your deployment type:
- Full SaaS agent configuration: Allow incoming connections from the specific IP addresses used by the Matillion service. The required IP addresses are available here: https://docs.matillion.com/data-productivity-cloud/security/docs/allowing-ip-addresses/.
- Hybrid SaaS deployment: Allow incoming connections from the IP addresses associated with your own Virtual Private Cloud (VPC). You can use the following utilities to verify network connectivity and check your VPC configuration: https://exchange.matillion.com/data-productivity-cloud/pipeline/check-network-access/.
Additionally, if you are referencing your Google Firestore database using a DNS hostname, ensure that your Full SaaS or Hybrid SaaS agent has network access to resolve the DNS address correctly.
Querying Data from a Google Firestore Database
This guide describes how to query data from Google Firestore, with examples presented as SQL SELECT statements, considerations for data type conversion when working with Redshift, and patterns for both initial and incremental data loads using the Database Query component.
Example Firestore Queries as SQL SELECT Statements
Though Firestore uses its native querying language (NoSQL/structured queries), you may want to conceptualize your requirements in standard SQL. Below are typical SELECT statements, shown for reference and comparison:
```sql
-- Fetch all documents from the "users" collection
SELECT * FROM users;

-- Fetch documents where the field 'status' is 'active'
SELECT * FROM users WHERE status = 'active';

-- Fetch documents created after a certain date
SELECT * FROM users WHERE created_at > '2024-01-01';

-- Fetch documents with specific fields only
SELECT name, email FROM users WHERE status = 'active';

-- Fetch documents within a range
SELECT * FROM orders WHERE order_total > 50 AND order_total < 1000;
```
Note: When querying from Firestore directly (e.g., using client libraries or connectors), the SQL syntax would be mapped to Firestore's API methods; however, tools like Matillion Database Query component often accept SQL-like syntax for ease of use.
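To illustrate how the SQL SELECTs above translate into chained Firestore query calls, the following sketch builds the equivalent queries with `where()` and `select()` (both available on the Firestore Admin SDK's Query class). A tiny in-memory fake stands in for the SDK's collection reference so the sketch is self-contained; with the real SDK the same chains would be applied to `admin.firestore().collection(...)`.

```javascript
// Minimal fake of a Firestore CollectionReference, used only so this
// sketch runs without the firebase-admin package. It records the
// chained calls the way the real Query builder would accept them.
function fakeCollection(name) {
  const filters = [];
  const ref = {
    name,
    _filters: filters,
    _fields: null,
    // Mirrors Query.where(fieldPath, opStr, value)
    where(field, op, value) { filters.push([field, op, value]); return ref; },
    // Mirrors Query.select(...fieldPaths) — projection, like a SQL column list
    select(...fields) { ref._fields = fields; return ref; },
  };
  return ref;
}

// SELECT * FROM users WHERE status = 'active';
const activeUsers = fakeCollection('users')
  .where('status', '==', 'active');

// SELECT name, email FROM users WHERE status = 'active';
const projection = fakeCollection('users')
  .where('status', '==', 'active')
  .select('name', 'email');

// SELECT * FROM orders WHERE order_total > 50 AND order_total < 1000;
const orderRange = fakeCollection('orders')
  .where('order_total', '>', 50)
  .where('order_total', '<', 1000);
```

Note that multiple `where()` clauses AND together, as in the SQL range example; Firestore imposes its own limits on which combinations of range filters a single query may contain.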
Datatype Conversion: Firestore & Redshift
When moving data from Firestore to Redshift, be aware that datatype conversions may occur. Common mapping examples include:
| Firestore Type | Typical Redshift Equivalent |
|---|---|
| string | VARCHAR |
| number | INTEGER, BIGINT, DOUBLE PRECISION |
| boolean | BOOLEAN |
| timestamp | TIMESTAMP |
| array | VARCHAR (as JSON or delimited) |
| map (object) | VARCHAR (as JSON string) |
| null | NULL |
Review type mapping in your ETL process, especially for arrays and nested objects, which often require conversion/flattening.
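As a concrete illustration of the mapping table above, the hypothetical helper below flattens a Firestore document into a Redshift-friendly flat record: arrays and maps become JSON strings (landing in VARCHAR columns), timestamps become ISO-8601 text that loads into TIMESTAMP, and scalars pass through unchanged. This is a sketch of the conversion step only, not part of any SDK.

```javascript
// Hypothetical helper: flatten one Firestore document's data into a
// flat row suitable for staging into Redshift.
function flattenForRedshift(doc) {
  const row = {};
  for (const [key, value] of Object.entries(doc)) {
    if (value === null || value === undefined) {
      row[key] = null;                    // -> NULL
    } else if (value instanceof Date) {
      row[key] = value.toISOString();     // -> TIMESTAMP (ISO-8601 text)
    } else if (Array.isArray(value) || typeof value === 'object') {
      row[key] = JSON.stringify(value);   // array/map -> VARCHAR (JSON string)
    } else {
      row[key] = value;                   // string/number/boolean pass through
    }
  }
  return row;
}

const row = flattenForRedshift({
  name: 'Alice Tan',
  active: true,
  tags: ['admin', 'beta'],
  address: { city: 'Singapore' },
  createdAt: new Date('2024-01-01T00:00:00Z'),
});
// row.tags      === '["admin","beta"]'
// row.createdAt === '2024-01-01T00:00:00.000Z'
```

In practice nested maps may instead be flattened into separate columns (e.g. one column per field), or loaded into a SUPER column on Redshift; the JSON-string approach is simply the most general fallback.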
Initial and Incremental Data Loads
To move data efficiently from Firestore to Redshift, best practice is to employ a two-stage load strategy:
1. Initial Load
- Usage: Populate Redshift with all existing data from Firestore.
- Database Query component: used without any filter condition, it retrieves the entire dataset.
- SQL Example:

```sql
SELECT * FROM users;
```
2. Incremental Load
- Usage: After the initial load, only new or changed data is fetched, to keep Redshift in sync without reloading everything.
- Database Query component: Add a filter clause based on a date, timestamp, or unique incremental field (commonly `updated_at`, `created_at`, or document ID).
- SQL Example:

```sql
SELECT * FROM users WHERE updated_at > '${last_loaded_timestamp}';
```

- The incremental value (`last_loaded_timestamp`) is stored after each load to ensure only new/updated records are captured.
For details on setting up incremental loads, refer to: Matillion Incremental Load Strategy.
- Tip: Use the same Database Query component/configuration in your ETL flow for both loads; switch only the presence or absence of the filter clause to toggle between initial and incremental behavior.
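The watermark bookkeeping described above can be sketched as follows. The function names and the `updated_at` field are illustrative (matching the example filter); how the watermark is persisted between runs (a control table, a job variable, etc.) is out of scope here.

```javascript
// Advance the high-water mark: keep the greatest updated_at seen so far,
// falling back to the previous mark when the batch is empty.
// ISO-8601 timestamps compare correctly as strings.
function nextWatermark(rows, previous) {
  return rows.reduce(
    (max, r) => (r.updated_at > max ? r.updated_at : max),
    previous
  );
}

// Build the incremental query, mirroring:
// SELECT * FROM users WHERE updated_at > '${last_loaded_timestamp}';
function incrementalQuery(watermark) {
  return `SELECT * FROM users WHERE updated_at > '${watermark}'`;
}

// One incremental cycle: fetch rows newer than the stored mark,
// load them, then store the new mark for the next run.
let mark = '2024-01-01T00:00:00Z';
const batch = [
  { id: 'u1', updated_at: '2024-03-05T10:00:00Z' },
  { id: 'u2', updated_at: '2024-02-11T09:30:00Z' },
];
mark = nextWatermark(batch, mark);
// mark is now '2024-03-05T10:00:00Z'
```

Storing the mark only after the batch has been committed to Redshift keeps the process restartable: a failed run re-fetches the same window rather than silently skipping records.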
Data Integration Architecture
Loading data in advance of integration is a key strategy for dividing and conquering complex data challenges. In the context of ELT (Extract, Load, Transform) architecture, this approach breaks the process into two manageable steps: first, raw data is quickly loaded into the Amazon Redshift target environment; second, transformation and integration tasks are performed within the database itself. Effective data integration requires data transformation, and the most robust and maintainable method to achieve this is by employing data transformation pipelines, which automate and orchestrate the necessary processing workflows. A significant advantage of the ELT paradigm is that both data transformation and integration occur inside the target Redshift database. This enables organizations to leverage the speed, scalability, and on-demand power of Redshift's processing engine while avoiding the cost and complexity of deploying and managing separate data processing infrastructure.