Extracting data from Greenplum Database to Databricks
Extracting data from Greenplum Database is a critical step for organizations seeking to leverage the advanced analytics and machine learning capabilities of Databricks. Efficiently moving data between these platforms requires careful consideration of security, connectivity, and ongoing data synchronization. In this article, we will provide a step-by-step guide to help you extract data from Greenplum Database and load it into Databricks. The process begins with creating an appropriate identity in Greenplum to ensure secure and authorized data access. For Matillion ETL users, we will address how to check for the required JDBC driver, or obtain it if necessary, to enable connectivity. We will also discuss network connectivity options, ensuring that your source (Greenplum Database) and target (Databricks) environments can communicate securely and efficiently. Finally, we will outline best practices for querying data, covering both initial data extraction and incremental loads to streamline your data integration workflow. By following these steps, you will be well-equipped to build a robust and scalable pipeline for transferring data from Greenplum to Databricks.
What is Greenplum Database?
Greenplum Database is an advanced, open-source, massively parallel data platform designed for high-performance analytics on large-scale datasets. Built on PostgreSQL, it extends the core relational database features with horizontal scalability, supporting shared-nothing architecture where data is distributed across multiple segment nodes for parallel query processing. This architecture enables Greenplum to handle complex analytical workloads efficiently, facilitating real-time data warehousing, machine learning, and business intelligence use cases. Additional features include robust fault tolerance, advanced query optimization, resource management, and support for ANSI SQL standards, making it suitable for modern enterprise data environments that require both scale and flexibility.
What is Databricks?
Databricks offers a unified analytics platform that integrates data engineering, data science, and business analytics on top of its scalable database layer. Built on Apache Spark, Databricks manages data in a cloud-native environment, supporting multiple file formats and structured, semi-structured, or unstructured datasets. The Databricks database, typically realized as Delta Lake tables, provides ACID transaction guarantees, scalable metadata handling, and unified batch and streaming data processing. Advanced features such as time travel, schema enforcement, and performant indexing make it suitable for large-scale data warehousing and machine learning tasks. This cohesive ecosystem allows technical teams to collaborate efficiently while harnessing the power of distributed processing and robust data governance.
Why Move Data from Greenplum Database into Databricks
Unlocking Advanced Analytics: The Value of Replicating Greenplum Data in Databricks
A data engineer or architect may choose to copy data from Greenplum Database into Databricks for several compelling reasons. Greenplum Database often houses large volumes of potentially valuable data that could drive analytics, reporting, or data science initiatives. However, the true value of this data is often realized when it is integrated with data from additional sources, enabling comprehensive analysis and richer business insights. By leveraging Databricks for this integration, organizations can efficiently combine and process data at scale without imposing additional workload or query latency on the Greenplum Database itself. This approach ensures that Greenplum remains optimized for its primary operational tasks while Databricks provides a flexible, powerful platform for advanced analytical processing and integration efforts.
Creating an Identity in Greenplum Database
Creating a new user (also referred to as a role) in Greenplum Database is typically performed by a superuser or a user with the appropriate privileges. Users are managed centrally and have connection and permission rights across databases in the Greenplum cluster.
Prerequisites
- You must have sufficient privileges, such as the `gpadmin` superuser account, to create new users.
- You should be connected to the Greenplum Database using a client such as `psql`.
Instructions
1. Connect to Greenplum Database
Use the `psql` command-line client to connect to the database. Example:

```
psql -d postgres -U gpadmin
```

Replace `gpadmin` with the superuser name and `postgres` with your preferred database name.
2. Create a New User
Use the `CREATE ROLE` or `CREATE USER` statement in SQL. `CREATE USER` is equivalent to `CREATE ROLE` except that it grants `LOGIN` by default, making it the preferred form for user accounts:

```
CREATE USER analytics_user
WITH PASSWORD 'replace_with_strong_password';
```
Optional: Grant Login Privilege
By default, users created with `CREATE USER` can log in. If you use `CREATE ROLE`, you must explicitly add `LOGIN`:

```
CREATE ROLE analytics_user
WITH LOGIN
PASSWORD 'replace_with_strong_password';
```
3. Assign Additional Privileges (Optional)
You may want to grant the user additional permissions:
- Superuser privilege: `ALTER USER analytics_user WITH SUPERUSER;`
- Database creation privilege: `ALTER USER analytics_user WITH CREATEDB;`
- Role management privilege: `ALTER USER analytics_user WITH CREATEROLE;`

For security reasons, assign extra privileges only when they are genuinely needed.
4. Grant Access to Specific Databases (Optional)
By default, a user can connect if the database allows their connection (controlled by `pg_hba.conf` settings). To explicitly grant or revoke `CONNECT` privileges:

```
GRANT CONNECT ON DATABASE target_db TO analytics_user;
```
5. Verify User Creation
List users in the database to confirm:

```
\du
```

Or, with privilege detail for a specific user:

```
\du+ analytics_user
```
Example: Complete Script
```
-- Create a user with login rights
CREATE USER analytics_user
WITH PASSWORD 'S3cur3Pa$$w0rd';

-- Grant connect privilege on a specific database
GRANT CONNECT ON DATABASE sales_db TO analytics_user;

-- (Optional) Allow the user to create new databases
ALTER USER analytics_user WITH CREATEDB;
```
Note: Remember to update `pg_hba.conf` and reload the configuration if new authentication rules are required for the new user.
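The placeholder passwords above should be replaced with strong, randomly generated values. A minimal Python sketch using only the standard library (the alphabet and length below are illustrative choices, not Greenplum requirements):

```python
import secrets
import string

def generate_password(length: int = 20) -> str:
    """Generate a random password from letters, digits, and a few symbols."""
    alphabet = string.ascii_letters + string.digits + "!@#%^*-_"
    return "".join(secrets.choice(alphabet) for _ in range(length))

# Build the CREATE USER statement with the generated password.
password = generate_password()
statement = f"CREATE USER analytics_user WITH PASSWORD '{password}';"
print(statement)
```

In a real pipeline, prefer retrieving the password from a secrets manager and passing it via parameter binding rather than interpolating it into SQL text.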
Installing the JDBC Driver
At the time of writing, the JDBC driver required for Greenplum Database connectivity is not included by default in the Matillion Data Productivity Cloud. This omission is due to licensing and redistribution restrictions associated with the driver. As a user, you are responsible for obtaining and installing the necessary JDBC driver prior to configuring any Greenplum database integrations within Matillion.
1. Downloading the JDBC Driver
To begin, you need to download the PostgreSQL JDBC driver, which is compatible with Greenplum Database.
- Visit the official PostgreSQL JDBC driver download page: https://jdbc.postgresql.org/
- On the download page, ensure you select a Type 4 JDBC driver. This is the recommended option as it is a pure Java driver and does not require any native libraries or additional software.
- Download the latest stable version of the Type 4 JDBC driver (typically provided as a `.jar` file).
2. Uploading the JDBC Driver to Matillion Data Productivity Cloud
Once you have obtained the JDBC driver `.jar` file, the next step is to upload it to your Matillion Data Productivity Cloud instance so that it can be used by the Agent.
- Review the step-by-step instructions for uploading external drivers in the Matillion documentation: Uploading External Drivers
- Ensure you follow all prerequisite and compatibility guidelines provided in the documentation.
- After uploading, confirm that the driver appears in your Matillion environment and is recognized as available for use.
3. Using the JDBC Driver in Matillion Data Productivity Cloud
With the driver in place, you are ready to configure data pipelines or database integrations that connect to your Greenplum Database.
- Refer to the official Matillion usage instructions for guidance on connecting to databases and executing queries: Database Query Documentation
- Follow best practices for secure credential handling and connection parameter configuration as described in the documentation.
Following the above steps will ensure that your Matillion environment is ready for Greenplum connectivity via the JDBC driver.
Checking network connectivity
To ensure successful network connectivity between Matillion Data Productivity Cloud and your Greenplum Database, you must verify that the database is configured to allow incoming connections for your specific deployment type:
- Full SaaS agent configuration: You need to allow incoming connections from the specific IP addresses used by Matillion Data Productivity Cloud. These IP addresses are subject to change and are documented at the following URL: Matillion Data Productivity Cloud Allowed IP Addresses. Make sure these IP addresses are whitelisted on your Greenplum Database host and any firewalls or security groups in place.
- Hybrid SaaS deployment: In this setup, you must allow incoming connections from your own Virtual Private Cloud (VPC). This typically involves updating your Greenplum Database network rules to accept connections from the appropriate VPC IP ranges. To assist with network verification and troubleshooting, Matillion provides useful utilities at: Check Network Access Tool.
Additionally, if you are referencing your Greenplum Database instance using a DNS address, you must ensure that the Full SaaS or Hybrid SaaS agent can resolve the database hostname. This may involve confirming the DNS configuration and making sure that the required DNS servers or internal DNS zones are accessible to the relevant agent.
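As a first-line check, you can confirm from the agent's environment that the hostname resolves and that the database port accepts TCP connections. A minimal Python sketch using only the standard library (the hostname is a placeholder; 5432 is the default Greenplum/PostgreSQL port):

```python
import socket

def check_database_endpoint(host: str, port: int = 5432, timeout: float = 3.0) -> dict:
    """Check DNS resolution and TCP reachability for a database endpoint."""
    result = {"resolves": False, "reachable": False}
    try:
        socket.getaddrinfo(host, port)          # DNS resolution
        result["resolves"] = True
        with socket.create_connection((host, port), timeout=timeout):
            result["reachable"] = True          # TCP handshake succeeded
    except OSError:
        pass                                    # flags stay False past the failure point
    return result

# Example usage (placeholder hostname):
# print(check_database_endpoint("greenplum.internal.example.com"))
```

A `resolves: False` result points at DNS configuration; `resolves: True, reachable: False` points at firewall or security-group rules.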
Querying Data from Greenplum Database
This guide explains how to query data from a Greenplum Database using SQL SELECT statements, best patterns for initial and incremental loads, and considerations for datatype conversion with Databricks.
Example SQL SELECT Statements
Greenplum Database supports standard ANSI SQL. Here are common examples for querying data:
```
-- Retrieve all columns from a table
SELECT * FROM schema_name.table_name;

-- Select specific columns with a filter
SELECT id, name, created_at
FROM schema_name.table_name
WHERE status = 'ACTIVE';

-- Aggregate data
SELECT status, COUNT(*)
FROM schema_name.table_name
GROUP BY status;

-- Join two tables
SELECT a.id, a.name, b.amount
FROM schema_name.table_name_a a
JOIN schema_name.table_name_b b ON a.id = b.a_id
WHERE b.amount > 1000;
```
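Because these are standard ANSI SQL patterns, they can be prototyped against any SQL engine before pointing them at the real cluster. The sketch below runs the filter and aggregate patterns against an in-memory SQLite database purely as a stand-in; the table and column names are illustrative:

```python
import sqlite3

# Stand-in table mirroring the examples above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER, name TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(1, "alpha", "ACTIVE"), (2, "beta", "INACTIVE"), (3, "gamma", "ACTIVE")],
)

# Filtered select, as in the status = 'ACTIVE' example.
active = conn.execute(
    "SELECT id, name FROM transactions WHERE status = 'ACTIVE'"
).fetchall()

# Aggregate per status, as in the GROUP BY example.
counts = dict(conn.execute(
    "SELECT status, COUNT(*) FROM transactions GROUP BY status"
).fetchall())

print(active)   # [(1, 'alpha'), (3, 'gamma')]
print(counts)   # {'ACTIVE': 2, 'INACTIVE': 1}
```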
Datatype Conversion Considerations (Greenplum → Databricks)
When extracting data from Greenplum for use in Databricks (e.g., via ETL pipelines), be aware of the following type mappings:
| Greenplum Type | Common Databricks Type |
|---|---|
| INTEGER | IntegerType |
| BIGINT | LongType |
| VARCHAR, TEXT | StringType |
| NUMERIC, DECIMAL | DecimalType |
| DATE | DateType |
| TIMESTAMP | TimestampType |
| BOOLEAN | BooleanType |
Datatypes may not always map exactly, so review each field and perform explicit casting if necessary during data transformation.
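A simple lookup can automate the first pass of this mapping when generating pipeline metadata. This sketch encodes the table above as a Python dictionary; the mapping and the string fallback are assumptions to review per field, not an exhaustive or authoritative conversion:

```python
# First-pass Greenplum -> Databricks (Spark SQL) type mapping; review per field.
GREENPLUM_TO_SPARK = {
    "INTEGER": "IntegerType",
    "BIGINT": "LongType",
    "VARCHAR": "StringType",
    "TEXT": "StringType",
    "NUMERIC": "DecimalType",
    "DECIMAL": "DecimalType",
    "DATE": "DateType",
    "TIMESTAMP": "TimestampType",
    "BOOLEAN": "BooleanType",
}

def map_type(greenplum_type: str) -> str:
    """Map a Greenplum type name to a Spark type, ignoring length/precision suffixes."""
    base = greenplum_type.split("(")[0].strip().upper()   # e.g. VARCHAR(255) -> VARCHAR
    return GREENPLUM_TO_SPARK.get(base, "StringType")     # fallback: load as string, cast later

print(map_type("varchar(255)"))   # StringType
print(map_type("NUMERIC(10,2)"))  # DecimalType
```

Loading unrecognized types as strings and casting explicitly in a later transformation step keeps the initial load permissive.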
Pattern: Initial and Incremental Loads
A robust data replication strategy typically involves:
- Initial (Full) Load: Loading the complete dataset from the source.
- Incremental Loads: Loading only data that has changed since the last load.
Both use the same Database Query component. For managed workflows, review the strategy at Matillion Exchange.
Initial Load Example
For the initial load, the query has no filter:
```
SELECT * FROM sales.transactions;
```
Incremental Load Example
For incremental loads, a filter (typically on a timestamp or primary key) is used:
```
-- Assuming 'last_updated' is a tracked timestamp
SELECT *
FROM sales.transactions
WHERE last_updated > '{{last_successful_load_timestamp}}';
```
Update the filter field and value to fit your source schema and replication approach.
Note: Replace `{{last_successful_load_timestamp}}` with the actual value capturing the last processed timestamp or checkpoint.
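In practice the placeholder is usually filled in by the orchestration layer, which also decides between the full and incremental form. A minimal sketch of that logic (the function, table, and column names are illustrative, not part of any Matillion API):

```python
from typing import Optional

def build_extract_query(table: str, watermark: Optional[str]) -> str:
    """Return a full-load query when no watermark exists, else an incremental one."""
    if watermark is None:
        return f"SELECT * FROM {table};"                   # initial (full) load
    return (f"SELECT * FROM {table} "
            f"WHERE last_updated > '{watermark}';")        # incremental load

print(build_extract_query("sales.transactions", None))
print(build_extract_query("sales.transactions", "2024-01-01 00:00:00"))
```

In production, pass the watermark as a bound query parameter rather than interpolating it, and persist the new high-water mark only after the load commits successfully.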
By following this pattern, you ensure efficient and reliable data replication between Greenplum Database and modern analytics environments like Databricks.
Data Integration Architecture
Loading data into the Databricks database in advance of integration provides a strategic "divide and conquer" approach, breaking the problem into two manageable steps: first, the data is loaded, and subsequently, it is integrated and transformed as needed. This approach exemplifies a key benefit of the Extract, Load, and Transform (ELT) architecture, which separates the concerns of data ingestion and processing. Since effective data integration often entails thorough transformation, leveraging robust data transformation pipelines within Databricks is essential. These pipelines enable systematic and repeatable processes for cleansing, joining, and enriching data. Another significant advantage of the ELT architecture is that all transformation and integration operations occur directly inside the target Databricks database. This leads to fast, on-demand, and highly scalable transformations, eliminating the need for dedicated external processing infrastructure and often reducing both complexity and cost.