Extracting data from GBase to Databricks
Extracting data from GBase is a critical task for organizations seeking to leverage their operational data within modern analytics platforms like Databricks. The process involves several preliminary steps to ensure a smooth and secure transfer of information from your GBase environment to your Databricks workspace. In this article, we will guide you through each major phase of extraction and loading. First, we will cover how to create an identity in GBase, as having appropriate credentials is essential for accessing and exporting data securely. For users of Matillion, an ETL tool, we will outline how to check for the required JDBC driver—or how to acquire it if needed—ensuring connectivity between GBase and the extraction tool. Next, we will discuss considerations for establishing robust network connectivity between the GBase source and the Databricks target, including security and firewall configuration. Finally, we will explore effective techniques for querying your data, addressing both initial full-data extraction and subsequent incremental updates to keep your Databricks environment synchronized. By following these steps, you will be well-equipped to extract and load your GBase data into Databricks, creating a foundation for advanced analytics and reporting.
What is GBase?
GBase is a high-performance, distributed relational database management system developed by the Chinese company General Data Technology Co., Ltd. (GBase). Designed to handle large-scale, high-concurrency workloads typical of telecom, finance, and government sectors, GBase supports both row-oriented and column-oriented storage models, catering to OLTP and OLAP scenarios respectively. It features advanced parallel query processing, strong horizontal scalability, and robust high-availability mechanisms. GBase offers compatibility with standard SQL and popular application interfaces, enabling smooth integration with existing systems. Its architecture incorporates data sharding, partitioning, and multi-level caching to achieve low latency and high throughput, making it well-suited for big data warehousing, real-time analytics, and mission-critical enterprise applications.
What is Databricks?
Databricks is a unified analytics platform built on Apache Spark, streamlining big data and machine learning workflows. Central to Databricks is Delta Lake, an open-source storage layer that provides ACID transactions and scalable metadata management for cloud data lakes. This enables efficient querying of large datasets with SQL and supports streaming, batch, and interactive analyses via notebooks. Databricks integrates with diverse data sources and features optimized connectors for major cloud storage services, ensuring data reliability, compatibility, and high performance for data-driven applications.
Why Move Data from GBase into Databricks
Unlocking Advanced Analytics: Transferring Data from GBase to Databricks
A data engineer or architect may wish to copy data from GBase into Databricks for several compelling reasons. GBase often serves as a repository for potentially valuable operational or historical data; however, its full potential is realised when this data is integrated with information from disparate sources. By transferring data into Databricks, professionals are able to leverage powerful data engineering and analytics tools to create a unified, comprehensive view, thereby enhancing insight generation and data-driven decision making. Furthermore, performing data integration and analysis on Databricks rather than directly on GBase ensures that the workload on the GBase system remains minimal, preserving its performance for day-to-day operations and reducing the risk of impacting mission-critical processes.
Creating a User in GBase Database
To create a new user in a GBase database, you must have sufficient administrative privileges (such as the dba role). The following instructions outline the standard process for creating a new user and optionally assigning privileges.
1. Connect to the GBase Database
First, log in to your GBase server using a client tool such as isql or gsql:

```
gsql -U dba -d DB_NAME -W
```

Replace dba with your administrator username, and DB_NAME with your database name.
2. Create the User
The SQL syntax for creating a user in GBase is as follows:
```
CREATE USER username IDENTIFIED BY 'password';
```
Example
To create a user named testuser with the password Test@1234, use:

```
CREATE USER testuser IDENTIFIED BY 'Test@1234';
```
3. Grant Privileges (Optional)
By default, a new user may have minimal privileges. To allow the user to connect to or work with particular objects, you need to grant the appropriate permissions.
Example: Grant All Privileges on a Database
```
GRANT ALL PRIVILEGES ON DATABASE DB_NAME TO testuser;
```
Example: Grant Specific Privileges on a Table
```
GRANT SELECT, INSERT ON tablename TO testuser;
```
4. Verify the User Account (Optional)
You can check the user list with the following query:
```
SELECT * FROM sysusers WHERE username = 'testuser';
```
or use the system or information schema views, depending on your GBase version.
Note: The CREATE USER and associated privilege management commands may vary slightly based on your GBase version. Refer to your specific GBase documentation for details on supported features and syntax.
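If you provision users for several environments, it can help to script the statements above rather than type them by hand. The following is a minimal sketch that simply assembles the CREATE USER and GRANT statements shown in this section; the function names are illustrative, and identifier quoting rules may differ by GBase version, so verify the generated SQL against your own documentation before running it.

```python
# Sketch: build the GBase user-provisioning statements from this section.
# The SQL templates mirror the examples above; adjust quoting and syntax
# for your specific GBase version.

def create_user_sql(username: str, password: str) -> str:
    """Build a CREATE USER statement for GBase."""
    return f"CREATE USER {username} IDENTIFIED BY '{password}';"

def grant_db_sql(database: str, username: str) -> str:
    """Build a GRANT ALL PRIVILEGES statement for a database."""
    return f"GRANT ALL PRIVILEGES ON DATABASE {database} TO {username};"

if __name__ == "__main__":
    print(create_user_sql("testuser", "Test@1234"))
    print(grant_db_sql("DB_NAME", "testuser"))
```

Generating the statements as strings keeps the provisioning step reviewable: you can inspect or log the exact SQL before executing it through your client tool.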
Installing the JDBC driver
At the time of writing, the JDBC driver for GBase is not included by default in the Matillion Data Productivity Cloud due to licensing or redistribution restrictions. Therefore, before you can use GBase as a data source or destination within Matillion, you must manually obtain and install the appropriate JDBC driver.
Downloading the GBase JDBC Driver
To get started, download the GBase JDBC driver from the official source. The download link is:

-

When selecting the driver, make sure to choose the Type 4 JDBC driver, as it is a pure Java implementation and is preferred for compatibility and ease of deployment in the Matillion Data Productivity Cloud environment. The downloaded driver will usually be provided as a .jar file.
Uploading the Driver to Matillion Data Productivity Cloud
Once you have obtained the .jar file for the GBase JDBC driver, you need to upload it into your Matillion Data Productivity Cloud agent.
Matillion provides detailed, vendor-agnostic instructions for uploading external JDBC drivers to your agent at the following link:
https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/
Follow these steps closely:

- Access the agent to which you want to add the JDBC driver.
- Follow the documentation’s guidance for navigating to the driver upload interface and uploading your .jar file.
- After uploading, ensure that the driver appears in the list of available JDBC drivers, and restart the agent if prompted.
Configuring and Using the GBase Connection in Matillion
After successfully installing the driver, you will be able to configure database components in your Matillion projects to utilize the GBase JDBC connection. For step-by-step guidance on how to create and use a database query component with your newly installed GBase driver, refer to the official usage instructions:
https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/
This documentation will help you establish connections, set credentials, and design queries leveraging your agent’s new connectivity to GBase.
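When configuring the connection, you will typically need a JDBC URL for your GBase server. The sketch below assembles one; note that the `jdbc:gbase://` scheme, the `com.gbase.jdbc.Driver` class name, and the default port 5258 are typical of the GBase 8a Type 4 driver but are assumptions here — confirm all three against the documentation shipped with the driver you downloaded.

```python
# Sketch: assemble a GBase JDBC URL for use in a Matillion database
# component. Scheme, driver class, and default port are assumptions
# based on the GBase 8a Type 4 driver -- verify against your driver docs.

GBASE_DRIVER_CLASS = "com.gbase.jdbc.Driver"  # assumed driver class name

def gbase_jdbc_url(host: str, port: int = 5258, database: str = "") -> str:
    """Build a JDBC connection string for GBase."""
    url = f"jdbc:gbase://{host}:{port}"
    if database:
        url += f"/{database}"
    return url
```

For example, `gbase_jdbc_url("db.example.com", 5258, "sales")` yields a URL you can paste into the component's connection settings alongside the credentials created earlier.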
Checking network connectivity
To ensure successful connectivity between Matillion Data Productivity Cloud and your GBase database, you must configure the GBase database to allow incoming connections based on your deployment type:
- Full SaaS Agent Configuration: Permit incoming connections from the IP addresses listed at Matillion allowed IP addresses. This allows the Full SaaS agent to communicate with your GBase database securely.
- Hybrid SaaS Deployment: Allow incoming connections from the network range of your own Virtual Private Cloud (VPC), as this is where your Hybrid SaaS agents will run. To help verify and troubleshoot network access, you can use the tools provided at Matillion network check utilities.
Additionally, if your GBase database is referenced by a DNS hostname, ensure that either the Full SaaS or Hybrid SaaS agent—depending on your deployment—can successfully resolve the DNS address to reach the database. Proper DNS resolution is essential for establishing and maintaining the connection.
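Before debugging at the Matillion level, it is often quicker to confirm DNS resolution and TCP reachability from the machine hosting your agent. This is a minimal sketch using Python's standard library; the hostname and port are placeholders for your own GBase endpoint.

```python
# Sketch: verify DNS resolution and TCP reachability of a GBase endpoint
# from the agent's network, using only the Python standard library.
import socket

def resolve_host(hostname: str):
    """Return the first resolved IP address for hostname, or None."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Attempt a TCP connection to host:port within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If `resolve_host` returns None, fix DNS first; if resolution succeeds but `can_connect` is False, look at firewall rules and the allowed IP ranges described above.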
Querying Data from a GBase Database
This guide provides instructions and best practices for querying data from a GBase database, with an emphasis on interoperability with platforms like Databricks and strategies for efficient data ingestion.
Example GBase SQL SELECT Queries
Below are common SQL examples for referencing GBase tables and retrieving data:
```
-- Select all columns from a table
SELECT * FROM customers;

-- Select specific columns with filtering
SELECT customer_id, customer_name, email
FROM customers
WHERE registration_date >= '2024-01-01';

-- Aggregation with grouping
SELECT product_category, COUNT(*) AS num_sales
FROM sales
WHERE sale_date BETWEEN '2024-05-01' AND '2024-05-31'
GROUP BY product_category;

-- Joining two tables
SELECT o.order_id, o.order_date, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_status = 'SHIPPED';
```
Data Type Conversion Between GBase and Databricks
When transferring data from GBase to Databricks (for example, via Matillion or similar ETL tools), data type conversion may occur. Some points to consider:
- Numeric Types: GBase and Databricks both support common types such as INTEGER, FLOAT, and DECIMAL, but precision/scale differences can occur.
- Strings: GBase’s VARCHAR and CHAR generally map to Databricks' STRING type.
- Date/Time: GBase types such as DATE, TIME, and TIMESTAMP might require formatting or conversion when loaded into Databricks.
- Binary/Blob: These types may not be directly transferable, or may require base64 encoding.
Be sure to review the destination schema in Databricks and cast or convert columns as needed within your SELECT statements.
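The mapping guidelines above can be captured in a small lookup table, which is useful when generating a Databricks target schema for many GBase tables at once. The mapping below is illustrative, not exhaustive: it reflects the types discussed in this section and standard Databricks SQL types, so adjust it for your GBase version and schema.

```python
# Sketch: map GBase column types to Databricks SQL types, following
# the guidelines above. Illustrative only -- extend for your schema.
import re

TYPE_MAP = {
    "INTEGER": "INT",
    "FLOAT": "FLOAT",
    "DECIMAL": "DECIMAL",   # precision/scale carried through below
    "VARCHAR": "STRING",
    "CHAR": "STRING",
    "DATE": "DATE",
    "TIME": "STRING",       # Databricks SQL has no TIME type
    "TIMESTAMP": "TIMESTAMP",
}

def map_gbase_type(gbase_type: str) -> str:
    """Map a GBase type such as VARCHAR(255) or DECIMAL(10,2)
    to a Databricks SQL type."""
    match = re.match(r"(\w+)(\(.*\))?", gbase_type.strip().upper())
    base, args = match.group(1), match.group(2) or ""
    target = TYPE_MAP.get(base, "STRING")  # fall back to STRING
    # keep precision/scale only where Databricks accepts it
    return target + args if target == "DECIMAL" else target
```

For example, `map_gbase_type("DECIMAL(10,2)")` preserves precision and scale, while `map_gbase_type("VARCHAR(255)")` collapses to STRING, since Databricks strings are unbounded.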
Data Loading Patterns: Initial and Incremental Loads
The recommended approach for ingesting GBase data into analytic platforms like Databricks is to:
- Perform a once-off initial full load.
- Switch to regular incremental loads capturing only new or updated data.
Always use the Database Query component for both workflows for consistency.
Initial Load (Full Extract)
- Extracts all rows from the GBase source table.
- No filter (WHERE clause) is used in this pattern.
```
-- Example: Initial load, full table extract
SELECT * FROM sales;
```
Incremental Load (Delta Extract)
- Extracts only new or modified records since the previous load.
- Requires a filter clause (often on a timestamp or surrogate key).
```
-- Example: Incremental load, filtering on last_modified_time
SELECT * FROM sales WHERE last_modified_time >= '2024-06-01 00:00:00';
```
Tip: Store the timestamp or incremental key value after each load, and refer to it in the next incremental filter.
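This "store the key, filter on it next time" pattern is commonly called a high-water mark. A minimal sketch of that bookkeeping follows; the file name and function names are hypothetical, and in practice the watermark is often kept in a control table or your orchestration tool's state rather than a local file.

```python
# Sketch: high-water-mark bookkeeping for incremental loads.
# The bookmark file is a hypothetical stand-in for a control table
# or orchestration-tool variable.
import json
from pathlib import Path

BOOKMARK_FILE = Path("sales_bookmark.json")

def load_bookmark(default: str = "1970-01-01 00:00:00") -> str:
    """Read the last-loaded high-water mark, falling back to a default."""
    if BOOKMARK_FILE.exists():
        return json.loads(BOOKMARK_FILE.read_text())["last_modified_time"]
    return default

def save_bookmark(value: str) -> None:
    """Persist the high-water mark for the next incremental run."""
    BOOKMARK_FILE.write_text(json.dumps({"last_modified_time": value}))

def incremental_query(table: str, column: str, watermark: str) -> str:
    """Build the delta-extract SELECT used by the next load."""
    return f"SELECT * FROM {table} WHERE {column} >= '{watermark}';"
```

Each run reads the previous watermark, extracts rows at or after it, then saves the maximum value seen so the next run picks up where this one left off. Using `>=` rather than `>` tolerates rows sharing the boundary timestamp at the cost of occasionally re-reading them.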
For more on incremental loading best practices and replication strategies, see the Matillion Incremental Load article.
Data Integration Architecture
Loading data in advance of integration exemplifies the "divide and conquer" principle by splitting the workflow into two manageable stages: extraction and subsequent loading of raw data, followed by transformation and integration. This modular approach is a hallmark advantage of the ELT (Extract, Load, Transform) architecture, as it allows organizations to decouple complex data preparation steps. Effective data integration necessitates transforming raw data into usable formats, which is best achieved through robust data transformation pipelines designed to automate, orchestrate, and validate each transformation step. A further advantage of ELT rests in performing these data transformation and integration tasks natively within the target Databricks database. By leveraging Databricks’ processing capabilities directly, organizations benefit from fast, on-demand, and highly scalable data operations, all without the need for—and costs associated with—additional, external data processing infrastructure.