Extracting data from Vertica to Amazon Redshift
Extracting data from Vertica is a common requirement for organizations seeking to leverage cloud-based analytics platforms such as Amazon Redshift. This article provides a step-by-step guide to moving your data from Vertica to Redshift, whether for a one-time migration or ongoing synchronization. We will begin by discussing how to create an identity in Vertica to allow secure access for data extraction. For those using Matillion ETL, we will cover how to verify the presence of the necessary JDBC driver—or how to obtain it if it is missing. In addition, the article will address how to establish reliable network connectivity between your Vertica source and Redshift target environments. Finally, we will outline strategies for querying and transferring data, covering both initial full extracts and ongoing incremental loads. By following these steps, you can ensure an efficient and secure data migration process.
What is Vertica?
Vertica is a high-performance, column-oriented relational database management system (RDBMS) designed for large-scale data warehousing and real-time analytics. Engineered to efficiently handle petabyte-scale workloads, Vertica employs advanced compression, aggressive parallelism, and an MPP (Massively Parallel Processing) architecture to optimize query execution and storage utilization. Its columnar storage model accelerates analytical queries by allowing the system to scan only relevant columns, significantly reducing I/O overhead. Vertica supports standard SQL, integrates with popular data visualization tools, and offers robust features such as in-database machine learning, high availability, and seamless scalability across distributed environments, making it well-suited for enterprises seeking rapid insights from massive datasets.
What is Amazon Redshift?
Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse service designed for analytical workloads and large-scale data storage. Based on PostgreSQL and optimized for fast query performance, Redshift uses columnar data storage, parallel processing, and advanced compression to accelerate query execution and minimize storage costs. It integrates seamlessly with the AWS ecosystem, offering scalability through its use of clusters and node-based architecture, automated backups, and security features like VPC support and encryption. Redshift Spectrum extends functionality by enabling SQL queries directly on data stored in Amazon S3, facilitating data lake analytics without loading data into the warehouse.
Why Move Data from Vertica into Amazon Redshift
Leveraging Vertica-to-Redshift Data Copy for Enhanced Analytics
A data engineer or architect might wish to copy data from Vertica into Redshift for several compelling reasons. Vertica often stores rich and potentially valuable datasets that can drive critical analytics and business insights. However, deriving maximum value often requires integrating Vertica’s data with other sources, such as additional databases, cloud storage, or third-party platforms. Amazon Redshift, as a fully managed data warehouse service, excels at consolidating disparate data sources and supporting complex analytics at scale. By copying data into Redshift, organizations can leverage its native integration capabilities and powerful query performance to enable comprehensive data analysis, all while ensuring that the operational workload on the Vertica system remains minimal. This approach helps avoid performance degradation or resource contention on Vertica, preserving its optimal functioning for other mission-critical processes.
Creating a User in Vertica
To create a new user in a Vertica database, you must connect as a superuser or as a user with the CREATE USER privilege. Follow the steps below to create a user and optionally specify authentication methods and resource limits.
1. Connect to Vertica
You can connect using the vsql command-line tool or any SQL client:

```
vsql -h <host> -U <dbadmin_user> -w <password> -d <database>
```
2. Create the User
Use the CREATE USER statement in SQL. Replace <username> and <password> with your desired values:

```
-- Create a basic user with password authentication
CREATE USER <username> IDENTIFIED BY '<password>';
```
Example:
CREATE USER reporting_user IDENTIFIED BY 'r3port#2024!';
3. Optional: Define User Resource Limits
You can assign user-level resource limits at creation, or change them later with ALTER USER:
CREATE USER analyst IDENTIFIED BY 'analyzepw'
MEMORYCAP 128M
USERSESSIONLIMIT 2;
4. Optional: Grant Privileges
After creation, grant the new user permission to access schemas or objects:
```
-- Grant USAGE on a schema
GRANT USAGE ON SCHEMA my_schema TO reporting_user;

-- Grant SELECT on a table
GRANT SELECT ON TABLE my_schema.sales TO reporting_user;
```
5. Verification
To verify user creation:
SELECT user_name FROM users WHERE user_name = '<username>';
Example:
SELECT user_name FROM users WHERE user_name = 'reporting_user';
Note:
Remember to follow your organization’s password policy and security best practices when creating accounts.
Installing the JDBC Driver
Due to licensing and redistribution restrictions, the Vertica JDBC driver is not included by default in Matillion Data Productivity Cloud. In order to enable connectivity to Vertica, you must download and install the driver manually via Matillion's external driver handling tools. The following instructions describe the required steps.
1. Download the JDBC Driver
- Navigate to the official Vertica download page for client drivers: https://www.vertica.com/download/vertica/client-drivers/
- On the download page, locate the Type 4 JDBC driver option. This is typically the preferred format because it requires no native libraries and is platform-independent.
- Download and save the latest .jar file for the Type 4 JDBC driver to your local machine. Ensure you comply with Vertica’s licensing terms when downloading and redistributing this file.
2. Install the JDBC Driver into Matillion Data Productivity Cloud
The required steps for uploading external JDBC drivers are detailed in Matillion’s documentation:
Uploading External Drivers – Data Productivity Cloud Agent
Summary of steps:
- Access the Matillion Agent or appropriate control interface according to your organization's setup.
- Use the agent’s interface to upload the downloaded Vertica JDBC JAR file, following prompts to complete the installation.
- Verify the installation in the agent to ensure the driver is now available for connections.
Note: Administrative permissions may be required to complete this process, depending on your organization's Matillion deployment.
3. Connect to Vertica from Matillion
Once the driver is installed, you can configure database connections and use Vertica as a data source or destination in your pipelines. Detailed guidance on creating and managing database connections is provided here:
Database Query Component – Data Productivity Cloud Designer
Refer to these guidelines to ensure the connection properties and driver settings are correctly configured for your Vertica environment.
Checking network connectivity
To enable successful communication between Matillion Data Productivity Cloud and your Vertica database, it is essential to ensure that the Vertica database accepts incoming network connections from the appropriate sources, based on your deployment configuration:
- Full SaaS Agent Configuration: If you are using Matillion Data Productivity Cloud with a Full SaaS agent, you must update your Vertica database firewall or network access controls to allow incoming connections from the IP addresses managed by Matillion. These IP addresses are published and regularly maintained at: https://docs.matillion.com/data-productivity-cloud/security/docs/allowing-ip-addresses/. Make sure your Vertica instance is accessible from all IP ranges listed in that documentation.
- Hybrid SaaS Deployment: For Hybrid SaaS deployments, you are responsible for managing network access between your environment and Vertica. You must allow incoming connections from your own Virtual Private Cloud (VPC) to the Vertica database. You can verify connectivity and network routes using the utility tools provided by Matillion at: https://exchange.matillion.com/data-productivity-cloud/pipeline/check-network-access/.
DNS Resolution
If you are referencing your Vertica database using a DNS hostname (rather than a raw IP address), ensure that either the Full SaaS agent or your Hybrid SaaS agent (depending on your setup) can resolve the DNS address correctly. Proper DNS resolution is required for Matillion Data Productivity Cloud to initiate and maintain connections to your Vertica instance.
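Before configuring the pipeline, it can be useful to confirm from the agent's environment that the Vertica host both resolves in DNS and accepts TCP connections. The following is a minimal Python sketch (not a Matillion tool): it assumes Vertica's default client port 5433 and distinguishes a DNS failure from a firewall or routing problem.

```python
import socket

VERTICA_DEFAULT_PORT = 5433  # Vertica's standard client port


def check_vertica_reachable(host: str, port: int = VERTICA_DEFAULT_PORT,
                            timeout: float = 5.0) -> str:
    """Return 'ok', 'dns-failure', or 'unreachable' for a host/port pair."""
    try:
        # Step 1: can the hostname be resolved at all?
        socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return "dns-failure"      # the agent cannot resolve the hostname
    try:
        # Step 2: does a TCP connection to the Vertica port succeed?
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError:
        return "unreachable"      # resolved, but firewall/routing blocks it
```

A `dns-failure` result points at the DNS concerns described above, while `unreachable` suggests the firewall or allow-list rules still need updating.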
Querying Data from Vertica Database
This guide explains how to query data from a Vertica database, including typical SQL query patterns and best practices for initial and incremental data loads. Special considerations are included for datatype conversions between Vertica and Redshift.
1. Example Vertica SELECT Statements
Here is a basic example of querying all columns from a Vertica table:
SELECT * FROM sales.transactions;
To select specific columns:
SELECT transaction_id, transaction_date, amount FROM sales.transactions;
To aggregate data (e.g., total sales per month):
SELECT
DATE_TRUNC('month', transaction_date) AS month,
SUM(amount) AS total_sales
FROM sales.transactions
GROUP BY month
ORDER BY month;
2. Datatype Conversion Considerations
When moving data between Vertica and Redshift, be aware that certain datatypes may not directly map; conversion may be needed. For example:
| Vertica Type | Typical Redshift Equivalent |
|---|---|
| BOOLEAN | BOOLEAN |
| INTEGER | INTEGER |
| FLOAT | DOUBLE PRECISION |
| VARCHAR(…) | VARCHAR(…) |
| TIMESTAMP | TIMESTAMP |
| INTERVAL | Not directly supported |
Tip: Always review your schema and use explicit type casts during SELECTs if you expect type mismatches, to avoid data load or compatibility issues.
SELECT
    CAST(transaction_id AS VARCHAR(50)) AS transaction_id,
    CAST(amount AS DOUBLE PRECISION) AS amount
FROM sales.transactions;
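If you are generating the Redshift target DDL programmatically, the mapping table above can be encoded directly. This is a hypothetical helper, not part of any Matillion or Vertica API; the fallback widths (`VARCHAR(64)` for intervals, `VARCHAR(65535)` for unknown types) are illustrative choices.

```python
# Sketch of the Vertica-to-Redshift type mapping table above.
VERTICA_TO_REDSHIFT = {
    "BOOLEAN": "BOOLEAN",
    "INTEGER": "INTEGER",
    "FLOAT": "DOUBLE PRECISION",
    "TIMESTAMP": "TIMESTAMP",
}


def map_type(vertica_type: str) -> str:
    """Translate a Vertica column type into a Redshift-compatible type."""
    t = vertica_type.strip().upper()
    if t.startswith("VARCHAR"):
        return t  # VARCHAR(n) maps directly, keeping the declared length
    if t.startswith("INTERVAL"):
        # No direct Redshift equivalent: carry it across as text and
        # convert downstream if interval arithmetic is needed.
        return "VARCHAR(64)"
    # Fall back to Redshift's widest VARCHAR for anything unrecognized.
    return VERTICA_TO_REDSHIFT.get(t, "VARCHAR(65535)")
```

For example, `map_type("FLOAT")` yields `DOUBLE PRECISION`, while `map_type("VARCHAR(50)")` passes through unchanged.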
3. Initial Load and Incremental Load Patterns
The recommended approach for ETL (Extract, Transform, Load) between Vertica and Redshift is to perform a one-off initial load followed by regular incremental loads. Both use the same Database Query component.
- Initial Load: Configure the Database Query component without a filter clause, so all historical data is extracted.
SELECT *
FROM sales.transactions;
- Incremental Load: Add a filter clause to pick up only new or changed records, such as those created/modified since the last extraction (using a suitable column like a timestamp or incrementing id).
SELECT *
FROM sales.transactions
WHERE last_updated > '2024-06-01 00:00:00';
- Update the filter value (the `last_updated` watermark) on each subsequent run to pull only new/changed data.
See further patterns and strategies for incremental loading at Matillion Article: Incremental Load / Data Replication Strategy.
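The initial/incremental pattern above can be sketched as a small query builder. This is an illustration only, reusing the table and column names from the examples; in practice the watermark would be persisted between pipeline runs, and the inputs are assumed to be trusted (no user-supplied values).

```python
from datetime import datetime
from typing import Optional


def build_extract_query(table: str,
                        watermark_column: str = "last_updated",
                        watermark: Optional[datetime] = None) -> str:
    """Build the Vertica extract statement.

    With no watermark this is the initial full load; with a watermark it
    becomes the incremental load, selecting only rows changed since the
    previous run.
    """
    query = f"SELECT * FROM {table}"
    if watermark is not None:
        query += f" WHERE {watermark_column} > '{watermark:%Y-%m-%d %H:%M:%S}'"
    return query + ";"
```

After each incremental run, record the maximum `last_updated` value observed and feed it into the next run as the new watermark.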
4. Using the Same Database Query Component
- Use the same Database Query component in both initial and incremental loads.
- For the initial load, simply exclude the filter (WHERE) clause.
- For incremental loads, parameterize or append a suitable WHERE clause.
This approach ensures consistent extraction logic, with runtime control over the scope of data.
Data Integration Architecture
Loading data into Amazon Redshift before integrating it is a strategic approach that follows the "divide and conquer" principle by splitting the overall data pipeline into two steps: loading and then transformation. This design, characteristic of the ELT (Extract, Load, Transform) architecture, allows organizations to first move raw data quickly and efficiently into Redshift, and then perform integration and transformation tasks afterward. Data integration inevitably requires data transformation, which is most efficiently handled through robust data transformation pipelines. These pipelines automate complex transformations, ensuring data quality and consistency across integrated sources. A significant advantage of the ELT approach is that all transformation and integration operations take place within the Redshift database itself. This means processing is both fast and scalable, leveraging Redshift’s computing power on-demand without requiring additional external processing infrastructure. Consequently, organizations benefit from both reduced costs and streamlined operations since there is no need to provision and manage separate data processing environments.