Integrate data from Vertica to Snowflake using Matillion

Our Vertica to Snowflake connector seamlessly migrates your data to Snowflake in just minutes, ensuring it remains current—no manual coding or complex ETL scripting required.


Extracting data from Vertica to Snowflake

Extracting data from Vertica is a common requirement for organizations seeking to leverage cloud data platforms such as Snowflake. Whether you are migrating workloads, consolidating analytical environments, or enabling hybrid architectures, the process involves several technical steps to ensure seamless data movement and integrity.

This article provides practical guidance for moving data from Vertica to Snowflake. We will begin by outlining how to create an identity within Vertica, allowing secure and appropriate access to the data. For users of Matillion, we will detail how to check for the necessary JDBC driver and steps to acquire it if needed. The importance of establishing reliable network connectivity between your Vertica instance and Snowflake will also be addressed, covering key considerations for source-to-target data flow. Finally, we will discuss best practices for querying your Vertica data—both for an initial load and for ongoing, incremental extraction. By following these steps, you can streamline your Vertica to Snowflake data transfer process, setting the stage for efficient and reliable analytics in the cloud.


What is Vertica?

Vertica is a high-performance analytical database built for big data and advanced analytics. Using a columnar storage architecture, it delivers fast queries on large data volumes, making it ideal for analysis and reporting with complex aggregations. Vertica features in-database machine learning, data compression, and parallel query execution, and integrates easily with popular data tools. Its distributed design ensures scalability, elasticity, and fault tolerance, supporting deployment on-premises, in the cloud, or hybrid environments. With a familiar SQL interface and broad ecosystem, Vertica enables organizations to perform large-scale analytics efficiently and with minimal infrastructure overhead.


What is Snowflake?

Snowflake is a cloud-native data platform that offers a fully managed, multi-cluster shared data architecture, enabling seamless data warehousing, data lake, and data engineering workloads. Unlike traditional on-premises databases, Snowflake decouples storage and compute, allowing each to scale independently for optimized performance and cost control across a variety of structured and semi-structured data formats, including JSON, Avro, and Parquet. It provides robust support for SQL, secure data sharing, and automated maintenance activities such as clustering, backups, and failover, reducing operational overhead. Snowflake’s multi-cloud availability across AWS, Azure, and Google Cloud ensures flexibility and reliability, while its support for role-based access control and comprehensive auditing meets stringent enterprise security requirements.

Why Move Data from Vertica into Snowflake

The Benefits of Copying Data from Vertica to Snowflake for Advanced Analytics

A data engineer or architect may choose to copy data from Vertica into Snowflake for several strategic reasons. Vertica often contains data that is potentially valuable for business intelligence and analytics, but its true potential is realized when integrated with information from other sources. By consolidating data from Vertica with data stored in different systems, organizations can uncover deeper insights, perform comprehensive analyses, and enable more informed decision-making. Furthermore, executing data integration and transformation workloads in Snowflake, rather than directly on Vertica, ensures the production system’s performance is not impacted by the high resource demands of large-scale processing. This approach allows Vertica to focus on its primary operational role, while Snowflake handles resource-intensive data integration tasks in a scalable, cost-efficient manner.

Creating a User in Vertica

To create a new user in Vertica, you must have administrator privileges, such as being a superuser or having the CREATE USER privilege. The following instructions guide you through the steps and options when creating a user in Vertica.

Basic User Creation

To create a user with a specified password, use the following SQL syntax:

CREATE USER username IDENTIFIED BY 'password';

Example:

CREATE USER analytics_user IDENTIFIED BY 'SecureP@ssw0rd!';

This command creates a user called analytics_user with the specified password.

Creating a User with Additional Options

Vertica allows you to specify additional options, such as resource pool assignments, profile limits, and default roles.

Example:

CREATE USER reporting_app
  IDENTIFIED BY 'changeMe123!'
  RESOURCE POOL general_pool
  MEMORYCAP '2G'
  PROFILE basic_limit;

GRANT analytics_role TO reporting_app;
ALTER USER reporting_app DEFAULT ROLE analytics_role;

  • DEFAULT ROLE: Assigns a role to the user, which governs their privileges.
  • RESOURCE POOL: Assigns a resource pool to manage workload.
  • MEMORYCAP: Limits the memory the user can consume.
  • PROFILE: Assigns a profile that specifies account-level settings, such as password expiration.

Granting Privileges

After creating the user, you typically need to grant privileges. For example, to allow read access to the tables in a schema:

GRANT USAGE ON SCHEMA public TO analytics_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO analytics_user;

Replace public with your schema name as appropriate.

Creating a User Without a Password

You can also create users without a password, for example when authentication is handled by LDAP or another external mechanism:

CREATE USER readonly_user;

Note: It is highly recommended to assign passwords unless you are enforcing external authentication.

Viewing Existing Users

To view the list of users in the system:

SELECT user_name FROM v_catalog.users;

Reference

For more details and advanced options, see the official Vertica documentation on CREATE USER.

Installing the JDBC driver

At the time of writing, the Vertica JDBC driver is not included by default in Matillion Data Productivity Cloud. This omission is due to licensing or redistribution restrictions associated with the Vertica JDBC driver. Consequently, you must manually download and install the driver before you can establish database connections to Vertica from within the Matillion platform.

To complete the installation, follow the steps below:

  1. Download the Vertica JDBC Driver
    Visit the Vertica Client Drivers download page: https://www.vertica.com/download/vertica/client-drivers/
    On this page, locate and download the latest available Type 4 JDBC driver for your operating system. Type 4 JDBC drivers are platform-independent and preferable, as they communicate directly with the Vertica database over the network without the need for native libraries.

  2. Prepare for Installation
    Once downloaded, extract or save the relevant JDBC .jar file to a location that is easily accessible from your machine.
    Review the driver documentation, if provided, to be aware of any supplemental resources or licensing considerations.

  3. Install the JDBC Driver into Matillion Data Productivity Cloud
    To make the driver available for use within Matillion, you must upload it following the product's procedure for external drivers.
    Refer to the official instructions for uploading external JDBC drivers: https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/
    The page will guide you through steps such as accessing the agent settings, selecting the driver .jar file, and completing the upload so the driver is recognized by the system.

  4. Begin Using the Driver
    After successful installation of the JDBC driver, Matillion Data Productivity Cloud will allow you to configure data source connections that utilize Vertica. For guidance on how to create and manage database queries using your new driver, see the detailed usage instructions here: https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/
    This resource covers setting up connections, testing queries, and best practices for use within the Matillion Designer interface.

By following these instructions, you ensure Matillion Data Productivity Cloud is configured to connect to the Vertica database using the appropriate third-party JDBC driver.
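Before troubleshooting connection errors, it can be worth a quick sanity check that the uploaded .jar actually bundles the JDBC driver class. The Python sketch below inspects a jar's contents; com.vertica.jdbc.Driver is the conventional Vertica driver class name, but verify it against your driver's documentation.

```python
import zipfile

def jar_contains_class(jar_path: str, class_name: str) -> bool:
    """Return True if the .jar (a zip archive) bundles the named Java class."""
    # Java class files are stored at a path derived from the package name.
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

For example, `jar_contains_class("vertica-jdbc.jar", "com.vertica.jdbc.Driver")` should return True for a valid driver jar.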

Checking network connectivity

To ensure proper operation between Matillion Data Productivity Cloud and your Vertica database, you must confirm that the Vertica instance is configured to allow incoming connections from the Matillion agent, whether you use a Full SaaS or a Hybrid SaaS deployment model.

Additionally, if you are connecting to Vertica using a DNS hostname (rather than a static IP address), the Full SaaS or Hybrid SaaS agent must be able to resolve the database's DNS address. Make sure that DNS lookups are permitted and properly configured in your network environment.
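A quick way to verify both DNS resolution and basic reachability from the machine where the agent runs is a TCP connection test. The sketch below assumes Vertica's default client port of 5433; adjust the host and port for your environment.

```python
import socket

VERTICA_DEFAULT_PORT = 5433  # Vertica's default client port

def can_reach(host: str, port: int = VERTICA_DEFAULT_PORT, timeout: float = 5.0) -> bool:
    """Attempt a TCP connection; this also exercises DNS resolution of `host`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, timeouts, and refused connections alike.
        return False
```

A False result means either the hostname did not resolve or the port is blocked; check firewall rules and DNS configuration accordingly.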

Querying Data from a Vertica Database

This guide explains how to query data from a Vertica database and highlights important best practices, including data type conversion and effective incremental load strategies.


Example Vertica SQL SELECT Statements

Basic Select All Columns from a Table

SELECT * FROM public.sales_orders;

Select Specific Columns with Filtering

SELECT order_id, customer_id, total_amount
FROM public.sales_orders
WHERE order_status = 'SHIPPED';

Aggregate Query Example

SELECT customer_id, COUNT(*) AS order_count
FROM public.sales_orders
GROUP BY customer_id;


Datatype Conversion Between Vertica and Snowflake

Keep in mind that datatypes may not always map one-to-one from Vertica to Snowflake during data migrations or integrations. For instance:

  • Vertica BOOLEAN ➜ Snowflake BOOLEAN
  • Vertica CHAR / VARCHAR ➜ Snowflake VARCHAR
  • Vertica TIMESTAMP ➜ Snowflake TIMESTAMP_NTZ
  • Vertica FLOAT ➜ Snowflake FLOAT

Carefully review your data definitions and consult the official documentation to ensure appropriate conversions and avoid potential data loss or truncation.
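If you build any tooling around the migration, keeping the mapping explicit in a small lookup table makes the conversions auditable. The sketch below is illustrative, not exhaustive, and the VARCHAR fallback for unmapped types is an assumption you should adapt to your own data.

```python
# Illustrative (not exhaustive) mapping of Vertica types to Snowflake types.
VERTICA_TO_SNOWFLAKE = {
    "BOOLEAN": "BOOLEAN",
    "CHAR": "VARCHAR",
    "VARCHAR": "VARCHAR",
    "TIMESTAMP": "TIMESTAMP_NTZ",
    "FLOAT": "FLOAT",
}

def map_type(vertica_type: str) -> str:
    """Look up the Snowflake equivalent of a Vertica type.

    Falls back to VARCHAR for unmapped types -- a lossy but safe default
    that you should review for your own schemas.
    """
    return VERTICA_TO_SNOWFLAKE.get(vertica_type.upper(), "VARCHAR")
```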


Incremental Load Strategy (Recommended Pattern)

The optimal design is to carry out an initial load (once-off full data extract), followed by regular incremental loads (partial extracts capturing new or changed records).

  • Initial Load: Execute a full data extraction with no filter to pull all records.
  • Incremental Load: Introduce a WHERE clause filter to capture only new or updated data, based on a unique key or timestamp column.

Both stages should use the same Database Query component (or process).

Example: No Filter for Initial Load

SELECT *
FROM public.sales_orders;

Example: Filtered Incremental Load

Suppose last_modified is a timestamp column that captures record edits:

SELECT *
FROM public.sales_orders
WHERE last_modified > '2024-05-01 00:00:00';

This query would return only the records updated since May 1, 2024. The filter value (i.e., the timestamp or high-water mark) is tracked and updated by your ETL/ELT process.

Learn more on Matillion's Exchange: Incremental Load Data Replication Strategy
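The high-water-mark pattern above can be sketched as a small query builder: with no stored mark it produces the full initial-load query, and thereafter it appends the incremental filter. Table and column names are illustrative, and in production the filter value should be bound as a query parameter rather than interpolated into the SQL string.

```python
from typing import Optional

def build_extract_query(table: str, hwm_column: str,
                        last_hwm: Optional[str]) -> str:
    """Build the Vertica extract query for one load cycle.

    last_hwm is the high-water mark recorded after the previous run;
    None means no run has happened yet, so a full load is produced.
    """
    query = f"SELECT * FROM {table}"
    if last_hwm is not None:
        query += f" WHERE {hwm_column} > '{last_hwm}'"
    return query
```

After each successful run, your pipeline would record the maximum value of the tracked column and pass it as `last_hwm` on the next run.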

Data Integration Architecture

Loading data in advance of integration, a key feature of the ELT (Extract, Load, Transform) architecture, allows organizations to divide and conquer the data integration challenge by splitting it into two distinct and manageable steps: loading raw data into Snowflake first, and then transforming and integrating it as needed. This approach streamlines the workflow and improves operational efficiency.

As part of the data integration process, data often requires transformation to fit organizational schemas, meet analytical requirements, or comply with data governance policies. The most effective way to accomplish these transformations is through dedicated data transformation pipelines, which can be developed within Snowflake to automate and manage complex logic.

Because ELT centralizes both transformation and integration within the Snowflake environment, organizations benefit from high performance and scalability, as operations take advantage of Snowflake's built-in processing power. This architecture also eliminates the need for additional data processing infrastructure, ensuring that transformation and integration tasks remain cost-effective, rapid, and available on demand.
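The load-then-transform split can be sketched as an ordered list of Snowflake statements that a pipeline would execute in sequence. Stage, table, and column names here are purely illustrative.

```python
def elt_steps(stage: str, raw_table: str, target_table: str) -> list:
    """Return the ELT statements in execution order: load raw data first,
    then transform it inside Snowflake."""
    return [
        # Step 1 (Load): copy raw files from a stage into a landing table.
        f"COPY INTO {raw_table} FROM @{stage}",
        # Step 2 (Transform): integrate the raw data into the target model.
        f"INSERT INTO {target_table} "
        f"SELECT order_id, customer_id, total_amount "
        f"FROM {raw_table} WHERE order_status = 'SHIPPED'",
    ]
```

The ordering matters: because the raw load completes before any transformation begins, a failed transformation can be re-run against the already-loaded data without re-extracting from Vertica.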

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.