Extracting data from Greenplum Database to Snowflake
Extracting data from Greenplum is a fundamental requirement for organizations seeking to integrate, migrate, or analyze their data in the cloud—especially when loading that data into modern platforms such as Snowflake. This article provides a practical guide to achieving this objective by outlining each critical step in the process. We will begin with the creation of an appropriate identity in Greenplum, ensuring secure and controlled access to the source data. For those using Matillion, we will review how to check for or acquire the necessary JDBC driver to facilitate the data connection. Next, we will discuss the essential elements of network connectivity between your Greenplum environment and your Snowflake target, addressing potential configuration and security considerations. Finally, we will explore effective strategies for querying your data—both for the initial full extraction and for ongoing incremental loads. By the end of this article, you will be well-equipped to move your data from Greenplum into Snowflake with confidence.
What is Greenplum Database?
Greenplum Database is an open-source, massively parallel processing (MPP) data warehouse built on PostgreSQL, designed to handle large-scale analytics workloads across distributed clusters. By implementing a shared-nothing architecture, Greenplum partitions data across multiple segment nodes, each running its own PostgreSQL instance, enabling high throughput and scalability for complex queries. It supports advanced features such as parallel query execution, data compression, and resource management, allowing organizations to efficiently process petabytes of structured and semi-structured data. Greenplum also integrates with various machine learning libraries and big data tools, making it well-suited for modern analytics ecosystems.
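To make the shared-nothing model concrete, every Greenplum table is created with a distribution policy that decides which segment stores each row. A minimal sketch, using a hypothetical `public.users` table:

```sql
-- Rows are assigned to segments by hashing the id column, so scans,
-- filters, and joins on id can run in parallel on every segment.
CREATE TABLE public.users (
    id            BIGINT,
    user_name     TEXT,
    email_address TEXT,
    created_at    TIMESTAMP
) DISTRIBUTED BY (id);
```

Choosing a low-cardinality distribution key can skew rows onto a few segments, so a unique or near-unique column is usually preferred.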
What is Snowflake?
Snowflake is a cloud-native data platform designed to deliver scalable, high-performance data warehousing and analytics. Built on a multi-cluster, shared-data architecture, Snowflake separates compute and storage layers, enabling concurrent workloads and facilitating dynamic scaling based on demand. It supports diverse data formats, including structured and semi-structured data (e.g., JSON, Parquet), and integrates natively with major cloud providers such as AWS, Azure, and Google Cloud Platform. Key features include automatic performance tuning, zero-maintenance infrastructure, secure data sharing, and support for SQL-based querying. Snowflake’s architecture enables organizations to simplify complex data pipelines and accelerate time-to-insight while maintaining robust security and governance controls.
Why Move Data from Greenplum Database into Snowflake
Benefits of Copying Data from Greenplum to Snowflake for Advanced Analytics
A data engineer or architect might choose to copy data from Greenplum into Snowflake for several strategic reasons. Greenplum often houses large volumes of potentially valuable data accumulated from critical business operations and analytics. However, the true value of this data can be fully realized when it is integrated with datasets from other systems, enabling richer analysis and fostering better-informed decisions. Performing such data integration and advanced analytics within Snowflake presents significant benefits: it leverages Snowflake’s scalability, native support for diverse data sources, and advanced processing capabilities—all while offloading resource-intensive workloads from Greenplum. This approach not only preserves Greenplum’s performance for existing operations but also unlocks new opportunities for innovation without incurring unnecessary strain on legacy infrastructure.
Creating a User in Greenplum Database
To create a user (also known as a role) in Greenplum Database, ensure you have the necessary superuser or CREATEROLE privilege. This guide provides the steps and SQL scripts required.
1. Connect to the Database
Connect to your Greenplum database using the `psql` command-line utility or any SQL client. For example:

```
psql -h <host> -p <port> -U <admin_user> <database_name>
```

Replace `<host>`, `<port>`, `<admin_user>`, and `<database_name>` with your environment's values.
2. Create a New User
Use the `CREATE ROLE` SQL command. By default, roles do not have login privileges; you must specify `LOGIN` to create a user account.

Example:

```sql
CREATE ROLE new_user WITH LOGIN PASSWORD 'secure_password';
```
- Replace `new_user` with the desired username.
- Replace `'secure_password'` with a strong password.
3. Grant Privileges (Optional)
To allow the new user to connect to specific databases or perform operations, grant the appropriate privileges.
Example:
```sql
-- Grant CONNECT privilege on a database:
GRANT CONNECT ON DATABASE example_db TO new_user;

-- Grant usage on schema:
GRANT USAGE ON SCHEMA public TO new_user;

-- Grant permissions on tables:
GRANT SELECT, INSERT ON ALL TABLES IN SCHEMA public TO new_user;
```
4. Verify the New User
Query the catalog view to confirm the user’s creation:
```
\du  -- in psql, lists all roles
```

or

```sql
SELECT rolname FROM pg_roles WHERE rolname = 'new_user';
```
Notes
- The user will not have superuser privileges by default. Only grant elevated permissions if absolutely necessary.
- You may customize the user further with options such as `CREATEDB`, `CREATEROLE`, or resource group assignments:

```sql
CREATE ROLE new_user WITH LOGIN PASSWORD 'secure_password' CREATEDB;
```
Refer to the Greenplum documentation for advanced user management features.
Installing the JDBC driver
Third-party applications such as Matillion Data Productivity Cloud connect to Greenplum Database through a JDBC driver. At the time of writing, this driver is not packaged with Matillion Data Productivity Cloud by default, primarily due to licensing or redistribution restrictions. As such, users must manually obtain and install the driver.
Download the JDBC Driver
- Navigate to the official PostgreSQL JDBC Driver website: https://jdbc.postgresql.org/
- Locate the latest release of the Type 4 JDBC driver (also known as the “pgJDBC” driver), which is preferred for its platform independence and direct communication with the Greenplum database.
- Download the appropriate JAR file to your local workstation.
Upload the JDBC Driver to Matillion Data Productivity Cloud
To make the JDBC driver available for use within your Matillion environment, follow these steps:
- Log into your Matillion Data Productivity Cloud account and access the environment where you wish to install the driver.
- Consult the official instructions for uploading and installing external drivers: Uploading external drivers into Matillion Data Productivity Cloud.
- Using the documented procedure, upload the downloaded JDBC JAR file to your Matillion agent or workspace, ensuring the file is correctly placed according to Matillion's requirements.
Configuring and Using the Greenplum Connection
For guidance on configuring and utilizing a database connection within Matillion Data Productivity Cloud, refer to the official usage documentation:

- Database Query component documentation
This documentation provides detailed instructions on specifying the correct driver and parameters to establish a connection to your Greenplum instance.
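Because pgJDBC is a PostgreSQL driver, the connection is specified with a PostgreSQL-style JDBC URL; the host and database name below are placeholders for your own values:

```
jdbc:postgresql://<greenplum_host>:5432/<database_name>
```

Port 5432 is the conventional default for the Greenplum coordinator, but confirm the port your deployment actually uses.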
Checking network connectivity
To ensure successful communication between Matillion Data Productivity Cloud and your Greenplum database, you must configure your network to allow incoming connections to Greenplum, based on your Matillion deployment type:
- Full SaaS agent configuration: Allow incoming connections from the IP addresses published by Matillion. You can find the current list at: https://docs.matillion.com/data-productivity-cloud/security/docs/allowing-ip-addresses/
- Hybrid SaaS deployment: Configure your Greenplum database to accept incoming connections from the virtual private cloud (VPC) where your Hybrid SaaS agent runs. If you need to check or troubleshoot network accessibility, utilities are available at: https://exchange.matillion.com/data-productivity-cloud/pipeline/check-network-access/
Additionally, if the Greenplum database is referenced using a DNS hostname, ensure that the relevant Matillion Data Productivity Cloud agent (whether Full SaaS or Hybrid SaaS) has DNS resolution enabled and is able to resolve the provided address.
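Once firewall rules and DNS resolution are in place, a trivial query run through the configured connection makes a quick end-to-end smoke test, confirming network reachability and credentials in one step:

```sql
-- Succeeds only if the agent can reach and authenticate to Greenplum.
SELECT version();
```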
Querying Data from a Greenplum Database
The process for querying and extracting data from a Greenplum database typically involves using SQL `SELECT` statements via a Database Query component within your data pipeline. Below are key points and examples relevant to technical users working with Greenplum, along with the recommended patterns for initial and incremental loading of data (see Matillion docs - Incremental Load Data Replication Strategy).
Example Greenplum Queries
A standard query targeting a Greenplum table, for example:

```sql
SELECT id, user_name, email_address, created_at FROM public.users;
```

Filtering data, for example targeting users created after a specific date:

```sql
SELECT id, user_name, email_address, created_at FROM public.users WHERE created_at >= '2024-01-01';
```

Aggregating data:

```sql
SELECT status, COUNT(*) AS order_count FROM public.orders GROUP BY status;
```

Joining tables:

```sql
SELECT u.id, u.user_name, o.order_id, o.amount
FROM public.users u
INNER JOIN public.orders o ON u.id = o.user_id;
```
Note on Datatype Conversion
If you are transferring data between Greenplum and Snowflake, be aware that datatype conversion may occur. For instance, Greenplum's `timestamp without time zone` may map to Snowflake's `TIMESTAMP_NTZ`, and string, numeric, or boolean types may have subtle differences in precision or range. Review your table schemas and data types to ensure compatibility when syncing data.
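One way to keep the mapping predictable is to cast columns explicitly in the extraction query, so values arrive in Snowflake with known precision and time zone handling. The column names and precisions below are illustrative:

```sql
SELECT
    id::BIGINT                               AS id,
    amount::NUMERIC(18,2)                    AS amount,       -- fix precision before transfer
    created_at::TIMESTAMP WITHOUT TIME ZONE  AS created_at    -- typically maps to TIMESTAMP_NTZ
FROM public.orders;
```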
Recommended Load Pattern: Initial Load & Incremental Load
The recommended extraction practice is to perform a one-off initial load followed by incremental loads for ongoing synchronization. Both processes use the same Database Query component, configured differently as follows:
Initial Load
- The Database Query component is configured with a basic `SELECT` statement and no filter clause.
- All available records are extracted.

Example (no filter):

```sql
SELECT id, user_name, email_address, created_at FROM public.users;
```
Incremental Loads
- For subsequent loads, the same database query is updated to include a filter (usually on a monotonically increasing column like
created_at
orlast_updated
). - This filter allows the extraction of only new or changed records since the last load.
Example (with filter):
SELECT
id,
user_name,
email_address,
created_at
FROM
public.users
WHERE
created_at > '${last_loaded_at}';
(Where ${last_loaded_at} is dynamically populated with the timestamp of the last successful extraction.)
For detailed guidance, see the article: Incremental Load Data Replication Strategy.
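A common way to maintain the `${last_loaded_at}` variable is to capture the highest timestamp seen in each successful run and feed it into the next run's filter. A minimal sketch, assuming `created_at` is the watermark column:

```sql
-- After a successful load, record the new high-water mark
-- for use in the next incremental extraction.
SELECT MAX(created_at) AS last_loaded_at FROM public.users;
```

Using `>` rather than `>=` in the incremental filter avoids re-extracting rows that share the boundary timestamp, though some pipelines prefer `>=` plus downstream deduplication to tolerate late-arriving rows.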
Summary Table:

| Load Type | Query Example | Filter Clause |
|---|---|---|
| Initial | `SELECT ... FROM table;` | No |
| Incremental | `SELECT ... FROM table WHERE updated > ?;` | Yes |
Following this pattern ensures efficient and reliable data extraction from Greenplum for downstream processing or loading into Snowflake.
Data Integration Architecture
A key advantage of the ELT (Extract, Load, Transform) architecture is its “divide and conquer” approach to data integration. By first loading data into Snowflake and then performing integration and transformation as a separate, subsequent step, organizations simplify the problem and gain greater flexibility. Data integration itself inherently depends on data transformation—whether for data cleansing, harmonization, or enrichment—and the most effective method for achieving this is through well-designed data transformation pipelines. With ELT, all transformation and integration tasks take place entirely within the target Snowflake database. This strategy delivers exceptional benefits: processing is fast, scalable on demand, and leverages Snowflake’s robust compute resources, so there is no need to invest in or maintain separate data processing infrastructure. This not only reduces operational overhead and cost, but also allows teams to focus on optimizing transformation logic directly in the analytics environment.