Integrate data from Google Firestore to Databricks using Matillion

Our Google Firestore to Databricks connector enables seamless, up-to-date data transfers to Databricks within minutes, eliminating the need for manual coding or complex ETL scripts.


Extracting data from Google Firestore to Databricks

Extracting data from Google Firestore is an essential task for organizations seeking to consolidate data for analytics, reporting, or advanced processing within platforms such as Databricks. Whether supporting business intelligence or facilitating advanced data science workflows, a streamlined extraction process can significantly accelerate insights. This article will guide you through the key steps involved in securely and efficiently transferring data from Google Firestore to Databricks. You will learn how to:

  • Create an identity within Google Firestore to enable secure and controlled data access.
  • For Matillion users, ensure the appropriate JDBC driver is available and correctly configured for seamless integration.
  • Establish and test network connectivity to permit reliable data transfer from your Firestore instance to your Databricks environment.
  • Query your Firestore data, covering both initial full data extraction and incremental loading strategies to keep Databricks updated with the latest changes.

By following these steps, you can set up an effective data pipeline between Firestore and Databricks tailored to your organization's analytic needs.


What is Google Firestore?

Google Firestore is a fully managed, serverless, NoSQL document database designed for scalable and performant web, mobile, and server applications. It stores data in the form of documents organized into collections and supports hierarchical data structures with subcollections. Firestore provides real-time data synchronization between clients and the backend, robust querying capabilities, and support for offline access. Transactions and batched writes allow for atomic operations, ensuring consistency across multiple documents. As part of the Firebase and Google Cloud platforms, Firestore natively integrates with other Google services, offers granular security through Firebase Security Rules and Identity and Access Management (IAM), and handles automatic scaling to accommodate variable workloads.


What is Databricks?

Databricks, built atop Apache Spark, is a unified analytics platform that streamlines big data processing, machine learning, and data engineering. Its core, Databricks Lakehouse (managed Delta Lake), uses Delta Lake—a storage layer providing ACID transactions, scalable metadata, and unified batch/stream data processing for data lakes. Databricks integrates with cloud object storage for scalable, high-performance handling of structured and unstructured data. Users can work in SQL, Python, R, or Scala, gaining features like schema enforcement, time travel, and transactional consistency. This makes Databricks a robust platform for modern data analytics and AI workloads.

Why Move Data from Google Firestore into Databricks

Unlocking Analytics Potential: Copying Data from Google Firestore to Databricks

Google Firestore is a cloud-native NoSQL database that often serves as a backbone for modern applications, and as such, it houses data that can be highly valuable for analytics and decision-making. However, the true value of this data is often revealed only when it is integrated with information from other sources, such as transactional databases, data lakes, or external APIs. By copying data from Google Firestore into Databricks, data engineers and architects can leverage Databricks’ powerful data processing and analytics capabilities to unify and analyze diverse datasets, generating richer insights. Importantly, performing data integration and processing on Databricks helps to avoid placing additional computational load on the Firestore database itself, preserving the performance and responsiveness of the production environment while enabling scalable downstream analytics.

Creating a User in Google Firestore

Google Firestore is a NoSQL, document-based database and does not use traditional SQL or maintain explicit "users" tables by default. Instead, you create a document to represent a user within a collection (for example, `users`). Below are step-by-step instructions for creating a user document in Firestore, including examples in both the Firebase Console and client libraries.


1. Choose or Create a Collection

In Firestore, data is organized into collections and documents. To represent users, you typically use a collection named `users`.


2. Define the User Document Structure

Typical fields for a user might include:

  • `uid` (string): The user's unique identifier
  • `displayName` (string): Full name of the user
  • `email` (string): User's email address
  • `createdAt` (timestamp): The account creation time

3. Creating a User Document via the Firebase Console

  1. Go to the Firebase Console.
  2. Select your project and navigate to Firestore Database.
  3. Click Start Collection or select the `users` collection.
  4. Click Add Document.
    • Use the user's UID (usually from Firebase Authentication) as the document ID for referential integrity.
    • Add relevant fields such as `displayName`, `email`, and `createdAt`.
  5. Click Save.

4. Creating a User Document Programmatically

Using JavaScript (Node.js)

```js
import { getFirestore } from "firebase-admin/firestore";

const db = getFirestore();

async function createUser(uid, displayName, email) {
  const userDoc = db.collection('users').doc(uid);
  await userDoc.set({
    displayName: displayName,
    email: email,
    createdAt: new Date(),
  });
}
```

Using Python

```python
from google.cloud import firestore

db = firestore.Client()

def create_user(uid, display_name, email):
    user_ref = db.collection('users').document(uid)
    user_ref.set({
        'displayName': display_name,
        'email': email,
        'createdAt': firestore.SERVER_TIMESTAMP,
    })
```


5. Data Structure Example (JSON)

A new user document might resemble:

```json
{
  "displayName": "Jane Doe",
  "email": "[email protected]",
  "createdAt": "2024-06-12T14:23:00Z"
}
```
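The structure above can be built and sanity-checked in plain Python before anything is written to Firestore. The helper below is an illustrative sketch, not part of any Firestore API; the function name and the choice of required fields are assumptions:

```python
from datetime import datetime, timezone

REQUIRED_FIELDS = ("displayName", "email")  # assumed minimal profile fields

def build_user_doc(display_name: str, email: str) -> dict:
    """Build a user document dict matching the JSON structure shown above.

    Raises ValueError on an empty required field, catching bad input
    before it reaches Firestore.
    """
    doc = {
        "displayName": display_name,
        "email": email,
        # ISO-8601 UTC string; real code would prefer firestore.SERVER_TIMESTAMP
        "createdAt": datetime.now(timezone.utc).isoformat(),
    }
    for field in REQUIRED_FIELDS:
        if not doc.get(field):
            raise ValueError(f"missing required field: {field}")
    return doc

doc = build_user_doc("Jane Doe", "jane@example.com")
print(sorted(doc.keys()))  # ['createdAt', 'displayName', 'email']
```

Validating the document shape client-side like this complements, but does not replace, Firestore Security Rules enforced on the server.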


Notes

  • Firestore does not employ SQL CREATE USER scripts. Data creation is achieved by adding documents.
  • User authentication and management are commonly performed using Firebase Authentication, with user profiles supplemented in the Firestore `users` collection.

For enterprise-level systems, also consider Firestore document and collection security rules to protect user data.

Installing the JDBC Driver

To enable connectivity between Matillion Data Productivity Cloud and a Google Firestore database, you must install a compatible JDBC driver. Note that, as of this writing, the Firestore JDBC driver is not bundled with Matillion Data Productivity Cloud by default, due to licensing or redistribution restrictions. You will need to obtain and install the driver manually by following the instructions below.

Downloading the Firestore JDBC Driver

  1. Access the Driver Download:
    Navigate to the Simba Firestore driver download page:
    https://www.simba.com/drivers/firestore-jdbc-odbc/

  2. Select the JDBC Driver:
    Locate and choose the Type 4 JDBC driver option. Type 4 drivers are preferred as they are pure Java implementations and don't require native libraries.

  3. Obtain the Driver File:
    Download the appropriate JDBC driver JAR file as provided by Simba. You may need to register or accept license agreements before downloading.

Uploading the Driver into Matillion Data Productivity Cloud

Once you have the JDBC driver file, you will need to upload it to your Matillion Data Productivity Cloud agent:

  1. Consult Installation Instructions:
    Detailed step-by-step instructions on uploading external JDBC drivers are provided here:
    https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/

Follow the documented process, ensuring that the driver is correctly uploaded and available for use in your environment.

Configuring and Using the Driver

After uploading the driver, you can configure and utilize it for connecting to your Firestore instance:

  1. Consult Usage Instructions:
    Refer to the database query documentation for setup and usage guidelines:
    https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/

These instructions cover how to configure a database query component in Matillion Data Productivity Cloud using the custom JDBC driver you uploaded.

By following these steps, you can successfully add Google Firestore connectivity to your Matillion Data Productivity Cloud workflows.

Checking network connectivity

To successfully connect to your Google Firestore database from Matillion Data Productivity Cloud, you must ensure that your Firestore instance allows incoming connections from the appropriate sources for your deployment configuration.

Note:
If you are connecting to the Firestore database using a DNS address (instead of an IP address), the Full SaaS or Hybrid SaaS agent must be able to resolve the DNS address correctly. Please verify that the necessary DNS resolution is possible from the relevant network environment.
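As a quick sanity check, DNS resolution can be verified with a few lines of Python run from the same network environment as the agent. This is an illustrative sketch; the host shown is the public Firestore endpoint, so substitute your own DNS address if you connect through a private endpoint:

```python
import socket

def can_resolve(host: str) -> bool:
    """Return True if this environment can resolve the given DNS name."""
    try:
        socket.getaddrinfo(host, 443)
        return True
    except socket.gaierror:
        return False

# Run this from the network used by the Full SaaS or Hybrid SaaS agent.
print(can_resolve("firestore.googleapis.com"))
```

If the check returns False, inspect the DNS configuration (resolvers, private zones, firewall rules) of the agent's network before troubleshooting the driver itself.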

Querying Data from Google Firestore

To extract data from a Google Firestore database, you'll need to issue queries against your Firestore collections. Below are technical instructions on how to perform Firestore queries, referenced here as analogous SQL `SELECT` statements for clarity, along with important considerations regarding integration and incremental loading.


Example Queries as SQL SELECT Statements

Firestore uses a document-based NoSQL model, so its queries differ from traditional SQL. However, you can conceptually map Firestore queries to SQL for understanding:

  • Retrieve all documents from a collection

    ```sql
    SELECT * FROM users;
    ```

    Firestore equivalent:

    ```python
    db.collection('users').get()
    ```

  • Filter documents (e.g., where status is 'active')

    ```sql
    SELECT * FROM users WHERE status = 'active';
    ```

    Firestore equivalent:

    ```python
    db.collection('users').where('status', '==', 'active').get()
    ```

  • Order and limit results

    ```sql
    SELECT * FROM users ORDER BY created_at DESC LIMIT 10;
    ```

    Firestore equivalent:

    ```python
    db.collection('users').order_by('created_at', direction='DESCENDING').limit(10).get()
    ```

  • Filter by a range (e.g., for pagination or date windows)

    ```sql
    SELECT * FROM users WHERE created_at > '2024-06-01';
    ```

    Firestore equivalent:

    ```python
    db.collection('users').where('created_at', '>', datetime.datetime(2024, 6, 1)).get()
    ```


Datatype Conversion Considerations

When integrating Firestore with platforms such as Databricks, be aware that datatype conversion may occur. Firestore supports types such as strings, numbers, booleans, timestamps, geopoints, and arrays, while Databricks (and Spark) relies on its own schema. For example:

  • Firestore `timestamp` → Databricks `TIMESTAMP`
  • Firestore `array` → Databricks `ARRAY` or nested structure
  • Firestore `map` (object) → Databricks `STRUCT` or `MAP`
  • Firestore `null` → Databricks `NullType`

Ensure your ingestion process accounts for these conversions to avoid schema mismatches or data loss.
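As an illustration, the mapping above can be captured in a simple lookup table used to pre-validate a Firestore export before defining the Databricks schema. This is a hedged sketch in plain Python: the function is not part of any Firestore or Databricks API, and the target type names (including the assumption that Firestore numbers map to `DOUBLE`) follow common Spark SQL conventions:

```python
# Assumed Firestore-to-Spark/Databricks type mapping, per the list above.
FIRESTORE_TO_SPARK = {
    "string": "STRING",
    "number": "DOUBLE",     # assumption: Firestore numbers are 64-bit doubles
    "boolean": "BOOLEAN",
    "timestamp": "TIMESTAMP",
    "geopoint": "STRUCT<latitude: DOUBLE, longitude: DOUBLE>",
    "array": "ARRAY",
    "map": "STRUCT",        # or MAP, when the keys are dynamic
    "null": "NullType",
}

def spark_type_for(firestore_type: str) -> str:
    """Look up the target Spark SQL type, failing loudly on unknown types."""
    try:
        return FIRESTORE_TO_SPARK[firestore_type]
    except KeyError:
        raise ValueError(f"no mapping for Firestore type: {firestore_type}")

print(spark_type_for("timestamp"))  # TIMESTAMP
```

Failing loudly on an unmapped type is deliberate: a silent fallback to `STRING` is a common source of the schema mismatches mentioned above.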


Initial and Incremental Loads

The best practice pattern for loading data from Firestore is:

  1. Once-off Initial Load:
    Retrieve the entire collection into your destination platform (e.g., Databricks). In this phase, the Database Query component should execute without a filter clause.

    ```sql
    SELECT * FROM users;
    ```

    Firestore:

    ```python
    db.collection('users').get()
    ```

  2. Incremental Loads:
    On a recurring basis, query only the new or updated records using an appropriate filter (e.g., based on a `last_updated` timestamp). Here, the Database Query component includes a filter clause.

    ```sql
    SELECT * FROM users WHERE last_updated > '2024-06-10 00:00:00';
    ```

    Firestore:

    ```python
    db.collection('users').where('last_updated', '>', last_run_timestamp).get()
    ```

See the Matillion docs on Incremental Load Data Replication Strategy for further details.

Note: Use the same Database Query component definition for both the initial and incremental loads, adjusting only the filter settings.
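The two-phase pattern above can be sketched as a small watermark loop. The example below simulates the source with an in-memory list so it is runnable anywhere; in practice the extract would be the Database Query component (or the Firestore client call shown earlier) and the watermark would be persisted between pipeline runs. The function and variable names are illustrative assumptions:

```python
from datetime import datetime

# Simulated source; in practice this is the Firestore 'users' collection.
SOURCE = [
    {"uid": "u1", "last_updated": datetime(2024, 6, 9)},
    {"uid": "u2", "last_updated": datetime(2024, 6, 11)},
]

def extract(watermark=None):
    """Initial load when watermark is None; otherwise only newer records."""
    if watermark is None:
        return list(SOURCE)  # full extract: no filter clause
    return [doc for doc in SOURCE if doc["last_updated"] > watermark]

# Initial load: everything.
batch = extract()
# Persist the high-water mark for the next run.
watermark = max(doc["last_updated"] for doc in batch)

# Later, a new record arrives in the source...
SOURCE.append({"uid": "u3", "last_updated": datetime(2024, 6, 12)})

# Incremental load: only records newer than the stored watermark.
delta = extract(watermark)
print([doc["uid"] for doc in delta])  # ['u3']
```

Note that a `last_updated`-style watermark only captures inserts and updates; deletions in Firestore require a separate strategy (such as soft-delete flags or periodic full reconciliation).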


By following these principles, you can efficiently extract and synchronize data from a Google Firestore database, leveraging robust ETL patterns and ensuring consistency in your data processing pipelines.

Data Integration Architecture

One of the key advantages of the ELT (Extract, Load, Transform) architecture is that it allows you to load raw data into Databricks in advance of integration, effectively dividing the overall data management problem into two manageable steps: loading and then integrating. This approach enables teams to move quickly by first bringing all required data into a single, centralized location before applying integration logic. Data integration itself typically involves extensive data transformation, such as cleaning, aggregating, or joining datasets; the best practice for this is to use robust data transformation pipelines that can automate and orchestrate these processes reliably. Another significant benefit of ELT is that both data transformation and integration tasks are executed directly within the target Databricks database. This means you benefit from high performance, on-demand scalability, and efficient resource utilization since all processing leverages Databricks’ native compute platform, eliminating the need—and the associated cost—of maintaining additional data processing infrastructure.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.