Integrate data from GBase to Databricks using Matillion

Our GBase to Databricks connector enables seamless, code-free data transfer and automatic synchronization with Databricks in just minutes, eliminating the need for manual intervention or intricate ETL processes.


Extracting data from GBase to Databricks

Extracting data from GBase is a critical task for organizations seeking to leverage their operational data within modern analytics platforms like Databricks. The process involves several preliminary steps to ensure a smooth and secure transfer of information from your GBase environment to your Databricks workspace. In this article, we will guide you through each major phase of extraction and loading. First, we will cover how to create an identity in GBase, as having appropriate credentials is essential for accessing and exporting data securely. For users of Matillion, an ETL tool, we will outline how to check for the required JDBC driver—or how to acquire it if needed—ensuring connectivity between GBase and the extraction tool. Next, we will discuss considerations for establishing robust network connectivity between the GBase source and the Databricks target, including security and firewall configuration. Finally, we will explore effective techniques for querying your data, addressing both initial full-data extraction and subsequent incremental updates to keep your Databricks environment synchronized. By following these steps, you will be well-equipped to extract and load your GBase data into Databricks, creating a foundation for advanced analytics and reporting.


What is GBase?

GBase is a high-performance, distributed relational database management system developed by the Chinese company General Data Technology Co., Ltd. (GBase). Designed to handle large-scale, high-concurrency workloads typical of the telecom, finance, and government sectors, GBase supports both row-oriented and column-oriented storage models, catering to OLTP and OLAP scenarios respectively. It features advanced parallel query processing, strong horizontal scalability, and robust high-availability mechanisms. GBase offers compatibility with standard SQL and popular application interfaces, enabling smooth integration with existing systems. Its architecture incorporates data sharding, partitioning, and multi-level caching to achieve low latency and high throughput, making it well-suited for big data warehousing, real-time analytics, and mission-critical enterprise applications.


What is Databricks?

Databricks is a unified analytics platform built on Apache Spark, streamlining big data and machine learning workflows. Central to Databricks is Delta Lake, an open-source storage layer that provides ACID transactions and scalable metadata management for cloud data lakes. This enables efficient querying of large datasets with SQL and supports streaming, batch, and interactive analyses via notebooks. Databricks integrates with diverse data sources and features optimized connectors for major cloud storage services, ensuring data reliability, compatibility, and high performance for data-driven applications.

Why Move Data from GBase into Databricks

Unlocking Advanced Analytics: Transferring Data from GBase to Databricks

A data engineer or architect may wish to copy data from GBase into Databricks for several compelling reasons. GBase often serves as a repository for potentially valuable operational or historical data; however, its full potential is realised when this data is integrated with information from disparate sources. By transferring data into Databricks, professionals are able to leverage powerful data engineering and analytics tools to create a unified, comprehensive view, thereby enhancing insight generation and data-driven decision making. Furthermore, performing data integration and analysis on Databricks rather than directly on GBase ensures that the workload on the GBase system remains minimal, preserving its performance for day-to-day operations and reducing the risk of impacting mission-critical processes.

Creating a User in GBase Database

To create a new user in a GBase database, you must have sufficient administrative privileges (such as the `dba` role). The following instructions outline the standard process for creating a new user and optionally assigning privileges.

1. Connect to the GBase Database

First, log in to your GBase server using a client tool such as `isql` or `gsql`:

```shell
gsql -U dba -d DB_NAME -W
```

Replace `dba` with your administrator username, and `DB_NAME` with your database name.

2. Create the User

The SQL syntax for creating a user in GBase is as follows:

```sql
CREATE USER username IDENTIFIED BY 'password';
```

Example

To create a user named `testuser` with the password `Test@1234`, use:

```sql
CREATE USER testuser IDENTIFIED BY 'Test@1234';
```

3. Grant Privileges (Optional)

By default, a new user may have minimal privileges. To allow the user to connect to or work with particular objects, you need to grant the appropriate permissions.

Example: Grant All Privileges on a Database

```sql
GRANT ALL PRIVILEGES ON DATABASE DB_NAME TO testuser;
```

Example: Grant Specific Privileges on a Table

```sql
GRANT SELECT, INSERT ON tablename TO testuser;
```

4. Verify the User Account (Optional)

You can check the user list with the following query:

```sql
SELECT * FROM sysusers WHERE username = 'testuser';
```

or use the system or information schema views, depending on your GBase version.


Note: The `CREATE USER` and associated privilege management commands may vary slightly based on your GBase version. Refer to your specific GBase documentation for details on supported features and syntax.

Installing the JDBC driver

At the time of writing, the JDBC driver for GBase is not included by default in the Matillion Data Productivity Cloud due to licensing or redistribution restrictions. Therefore, before you can use GBase as a data source or destination within Matillion, you must manually obtain and install the appropriate JDBC driver.

Downloading the GBase JDBC Driver

To get started, download the GBase JDBC driver from the official source. The download link is:

-

When selecting the driver, make sure to choose the Type 4 JDBC driver, as it is a pure Java implementation and is preferred for compatibility and ease of deployment in the Matillion Data Productivity Cloud environment. The downloaded driver will usually be provided as a `.jar` file.

Uploading the Driver to Matillion Data Productivity Cloud

Once you have obtained the `.jar` file for the GBase JDBC driver, you need to upload it into your Matillion Data Productivity Cloud agent.

Matillion provides detailed, vendor-agnostic instructions for uploading external JDBC drivers to your agent at the following link:
https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/

Follow these steps closely:

  • Access the agent to which you want to add the JDBC driver.
  • Follow the documentation’s guidance for navigating to the driver upload interface and uploading your `.jar` file.
  • After uploading, ensure that the driver appears in the list of available JDBC drivers, and restart the agent if prompted.

Configuring and Using the GBase Connection in Matillion

After successfully installing the driver, you will be able to configure database components in your Matillion projects to utilize the GBase JDBC connection. For step-by-step guidance on how to create and use a database query component with your newly installed GBase driver, refer to the official usage instructions:
https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/

This documentation will help you establish connections, set credentials, and design queries leveraging your agent’s new connectivity to GBase.
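Before configuring the component, it can help to assemble the JDBC URL you intend to use. The sketch below builds one in Python; note that the `jdbc:gbase://host:port/database` URL format, the port 5258, and the property names shown are assumptions based on common GBase conventions, so confirm them against the documentation bundled with your driver.

```python
# Sketch: assembling a JDBC URL for a GBase connection. The URL format and
# default port below are assumptions based on common GBase conventions --
# verify both against the documentation shipped with your driver .jar file.

def build_gbase_jdbc_url(host: str, port: int, database: str, **props: str) -> str:
    """Return a JDBC URL for a GBase database, with optional driver properties."""
    url = f"jdbc:gbase://{host}:{port}/{database}"
    if props:
        url += "?" + "&".join(f"{key}={value}" for key, value in props.items())
    return url

# Hypothetical endpoint for illustration only.
print(build_gbase_jdbc_url("gbase.internal", 5258, "sales_db", user="testuser"))
```

Keeping the URL construction in one place makes it easier to switch hosts or databases between environments without retyping connection details in each component.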

Checking network connectivity

To ensure successful connectivity between Matillion Data Productivity Cloud and your GBase database, you must configure the GBase database to allow incoming connections based on your deployment type:

  • Full SaaS Agent Configuration:
    Permit incoming connections from the IP addresses in Matillion's published list of allowed IP addresses. This allows the Full SaaS agent to communicate with your GBase database securely.

  • Hybrid SaaS Deployment:
    Allow incoming connections from the network range of your own Virtual Private Cloud (VPC), as this is where your Hybrid SaaS agents will run. To help verify and troubleshoot network access, you can use Matillion's network check utilities.

Additionally, if your GBase database is referenced by a DNS hostname, ensure that either the Full SaaS or Hybrid SaaS agent—depending on your deployment—can successfully resolve the DNS address to reach the database. Proper DNS resolution is essential for establishing and maintaining the connection.
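The DNS resolution and port reachability checks described above can be sketched with a short Python script run from the agent's network. The hostname and port below are placeholders; substitute your own GBase endpoint and listener port.

```python
# Sketch: verifying that an agent host can resolve and reach a GBase endpoint.
# The hostname and port are placeholders -- substitute your own GBase server
# and the port your deployment actually listens on.
import socket

def can_resolve(hostname: str) -> bool:
    """Return True if DNS resolution of the hostname succeeds."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

def can_connect(hostname: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to hostname:port succeeds."""
    try:
        with socket.create_connection((hostname, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    host, port = "gbase.example.internal", 5258  # placeholders
    print("DNS resolution ok:", can_resolve(host))
    print("TCP connection ok:", can_connect(host, port))
```

If DNS resolution succeeds but the TCP connection fails, the usual suspects are firewall rules or security groups that have not yet been opened to the agent's network range.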

Querying Data from a GBase Database

This guide provides instructions and best practices for querying data from a GBase database, with an emphasis on interoperability with platforms like Databricks and strategies for efficient data ingestion.


Example GBase SQL SELECT Queries

Below are common SQL examples for referencing GBase tables and retrieving data:

```sql
-- Select all columns from a table
SELECT * FROM customers;

-- Select specific columns with filtering
SELECT customer_id, customer_name, email
FROM customers
WHERE registration_date >= '2024-01-01';

-- Aggregation with grouping
SELECT product_category, COUNT(*) AS num_sales
FROM sales
WHERE sale_date BETWEEN '2024-05-01' AND '2024-05-31'
GROUP BY product_category;

-- Joining two tables
SELECT o.order_id, o.order_date, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_status = 'SHIPPED';
```


Data Type Conversion Between GBase and Databricks

When transferring data from GBase to Databricks (for example, via Matillion or similar ETL tools), data type conversion may occur. Some points to consider:

  • Numeric Types: GBase and Databricks both support common types such as `INTEGER`, `FLOAT`, and `DECIMAL`, but precision/scale differences can occur.
  • Strings: GBase’s `VARCHAR` and `CHAR` generally map to Databricks' `STRING` type.
  • Date/Time: GBase types such as `DATE`, `TIME`, and `TIMESTAMP` might require formatting or conversion when loaded into Databricks.
  • Binary/Blob: These types may not be directly transferable or may require base64 encoding.

Be sure to review the destination schema in Databricks and cast or convert columns as needed within your SELECT statements.
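As an illustration, a small mapping table can drive the generation of `CAST` expressions for your SELECT statements. The mappings and the default `DECIMAL` precision below are assumptions for illustration only; align them with your actual GBase schema and Databricks target.

```python
# Sketch: a simple GBase-to-Databricks type mapping used to generate CAST
# expressions. The mappings and the DECIMAL precision/scale are illustrative
# assumptions -- adjust them to match your actual source and target schemas.

GBASE_TO_DATABRICKS = {
    "INTEGER": "INT",
    "FLOAT": "FLOAT",
    "DECIMAL": "DECIMAL(38,18)",  # assumed default precision/scale
    "VARCHAR": "STRING",
    "CHAR": "STRING",
    "DATE": "DATE",
    "TIME": "STRING",       # no dedicated TIME type assumed; kept as STRING
    "TIMESTAMP": "TIMESTAMP",
}

def cast_expression(column: str, gbase_type: str) -> str:
    """Return a CAST expression targeting the mapped type (STRING as fallback)."""
    target = GBASE_TO_DATABRICKS.get(gbase_type.upper(), "STRING")
    return f"CAST({column} AS {target}) AS {column}"

print(cast_expression("registration_date", "TIMESTAMP"))
print(cast_expression("email", "varchar"))
```

Generating casts from a mapping keeps the conversion rules in one place, so a schema change only needs to be reflected once rather than in every query.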


Data Loading Patterns: Initial and Incremental Loads

The recommended approach for ingesting GBase data into analytic platforms like Databricks is to:

  1. Perform a one-off initial full load.
  2. Switch to regular incremental loads capturing only new or updated data.

Always use the Database Query component for both workflows for consistency.

Initial Load (Full Extract)

  • Extracts all rows from the GBase source table.
  • No filter (WHERE clause) is used in this pattern.

```sql
-- Example: Initial load, full table extract
SELECT * FROM sales;
```

Incremental Load (Delta Extract)

  • Extracts only new or modified records since the previous load.
  • Requires a filter clause (often on a timestamp or surrogate key).

```sql
-- Example: Incremental load, filtering on last_modified_time
SELECT *
FROM sales
WHERE last_modified_time >= '2024-06-01 00:00:00';
```

Tip: Store the timestamp or incremental key value after each load, and refer to it in the next incremental filter.
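A minimal sketch of that watermark pattern follows, assuming a simple file-based store for the high-water mark; in Matillion you would more typically hold this value in a pipeline variable or a control table.

```python
# Sketch: tracking a high-water mark between incremental loads. The file-based
# store, table name, and column name are illustrative assumptions; in practice
# the watermark usually lives in a pipeline variable or a control table.
import json
from pathlib import Path

WATERMARK_FILE = Path("watermark.json")  # placeholder location

def load_watermark(default: str = "1970-01-01 00:00:00") -> str:
    """Return the last saved watermark, or a default for the first run."""
    if WATERMARK_FILE.exists():
        return json.loads(WATERMARK_FILE.read_text())["last_modified_time"]
    return default

def save_watermark(value: str) -> None:
    """Persist the watermark to use as the next run's lower bound."""
    WATERMARK_FILE.write_text(json.dumps({"last_modified_time": value}))

def incremental_query(table: str, column: str, watermark: str) -> str:
    """Build the delta-extract SELECT for rows changed since the watermark."""
    return f"SELECT * FROM {table} WHERE {column} >= '{watermark}'"

print(incremental_query("sales", "last_modified_time", load_watermark()))
```

On the first run the default watermark yields a full extract; after each successful load, saving the maximum `last_modified_time` seen ensures the next run picks up only new or changed rows.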

For more on incremental loading best practices for replication strategy, see the Matillion Incremental Load Article.

Data Integration Architecture

Loading data in advance of integration exemplifies the "divide and conquer" principle by splitting the workflow into two manageable stages: extraction and subsequent loading of raw data, followed by transformation and integration. This modular approach is a hallmark advantage of the ELT (Extract, Load, Transform) architecture, as it allows organizations to decouple complex data preparation steps. Effective data integration necessitates transforming raw data into usable formats, which is best achieved through robust data transformation pipelines designed to automate, orchestrate, and validate each transformation step. A further advantage of ELT rests in performing these data transformation and integration tasks natively within the target Databricks database. By leveraging Databricks’ processing capabilities directly, organizations benefit from fast, on-demand, and highly scalable data operations, all without the need for—and costs associated with—additional, external data processing infrastructure.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.