Integrate data from Vectorwise (Actian Vector) to Databricks using Matillion

The Actian Vector to Databricks connector efficiently transfers your data to Databricks within minutes, ensuring it remains up-to-date without requiring manual coding or complex ETL processes.


Extracting data from Vectorwise (Actian Vector) to Databricks

Extracting data from Actian Vector is a crucial step in many modern data pipeline workflows, particularly when integrating with advanced analytics and processing environments such as Databricks. Organizations often need a reliable and efficient way to transfer data from their Actian Vector data warehouse to Databricks, ensuring secure access, smooth network connectivity, and minimal disruption as source data evolves. This article will guide you through the essential steps to accomplish this integration, including:

  • **Creating an identity in Actian Vector** to control and monitor extraction tasks securely.
  • For **Matillion users**, verifying or acquiring the required JDBC driver to ensure compatibility and a secure connection.
  • Establishing the necessary **network connectivity** from the Actian Vector source to the Databricks target environment.
  • Techniques for **querying and transferring data**, both for initial full loads and ongoing incremental updates.

By the end of this article, you will have a clear understanding of how to extract data from Actian Vector and load it efficiently into Databricks, establishing a solid foundation for your analytics workflow.


What is Vectorwise (Actian Vector)?

Actian Vector is a high-performance, columnar analytic database designed to deliver exceptional query speed and efficiency for complex analytical workloads. Leveraging vectorized query execution, memory-optimized processing, and efficient data compression, Vector is well-suited for environments requiring real-time analytics over large data volumes. Its architecture, built from the ground up for modern multi-core CPUs and large memory footprints, supports rapid aggregation and filtering typical of business intelligence and data warehousing scenarios. With full SQL compliance and support for standard interfaces such as ODBC and JDBC, Actian Vector integrates easily into existing data ecosystems, providing advanced workload management, horizontal scalability, and strong concurrency controls for demanding enterprise environments.


What is Databricks?

Databricks offers a unified analytics platform on Apache Spark, supporting data engineering, science, and analytics in a scalable, collaborative cloud environment. Using Delta Lake as its database layer, Databricks enables ACID transactions, reliable data management, and data versioning for cloud data lakes. Its architecture integrates easily with diverse data sources, delivers schema enforcement, time travel, and efficiently manages both batch and streaming data. Native support for SQL, Python, Scala, and R, combined with a collaborative workspace, allows teams to streamline workflows from ETL to advanced analytics and machine learning—all securely managed in the cloud.

Why Move Data from Vectorwise (Actian Vector) into Databricks

The Benefits of Transferring Data from Actian Vector to Databricks for Advanced Analytics

A data engineer or architect may consider copying data from Actian Vector to Databricks for several compelling reasons. Although Actian Vector contains data that is potentially valuable, the true worth of this data emerges when it is integrated with information from other sources within the organization. By leveraging Databricks, which is designed for scalable data engineering and analytics, professionals can easily combine and analyze datasets from disparate origins without imposing additional processing demands on the Actian Vector environment. This approach ensures that the operational performance and responsiveness of Actian Vector are preserved, while allowing users to unlock deeper business insights through unified, collaborative workflows in Databricks.

Creating a User in an Actian Vector Database

To add a new user (identity) in an Actian Vector database, you grant access by creating a user account at the operating system (OS) level and then assigning database permissions as required. Actian Vector leverages OS authentication, so database users typically correspond to OS users.

Step 1: Create an Operating System User

On the server where Actian Vector is installed, create a new OS user. This process differs based on your operating system:

Example for Linux:

```
sudo useradd vectoruser
sudo passwd vectoruser
```

Replace `vectoruser` with the desired username.

Step 2: Enable the User in the Database

You do not need to create users in the database explicitly, as Vector relies on existing OS users for authentication. To enable database access, the user simply logs in with their OS credentials.

Example connection using SQL:

```
sql vectordb -U vectoruser
```

  • `vectordb` is the name of your Vector database.
  • `vectoruser` is the OS user created in Step 1.

Step 3: Grant Database Permissions (Optional)

Once the user is able to connect, you can grant privileges on specific database objects.

Example: Grant SELECT privilege on a table:

```
GRANT SELECT ON employees TO vectoruser;
```

Example: Grant CREATE privilege to allow creating objects:

```
GRANT CREATE TO vectoruser;
```

You can execute these SQL commands as a database administrator (e.g., the installation owner or a superuser).


Note: Actian Vector does not support database-internal user creation (i.e., `CREATE USER`). User authentication is managed at the OS level, with privileges handled within the database. For finer-grained authentication, consider integrating with LDAP or other security mechanisms as per your organization’s policies.

Installing the JDBC Driver

At the time of writing, the JDBC driver for the Actian Vector database is not bundled with Matillion Data Productivity Cloud, due to licensing or redistribution restrictions. As a result, you will need to manually obtain and install the appropriate JDBC driver before you can use Actian Vector as a source or target within Matillion.

Downloading the Actian Vector JDBC Driver

  1. Visit the Actian Vector website:
    Navigate to https://www.actian.com/products/actian-vector/ to access the official download page for the Actian Vector JDBC Driver.

  2. Locate the JDBC Driver:
    Look for the Type 4 JDBC Driver, which is preferred for its platform independence and direct compatibility with Matillion. Type 4 drivers are typically listed under database connectivity sections or download areas.

  3. Download the Driver:
    Download the appropriate JDBC driver package to your local machine. Ensure you're obtaining the version compatible with your Actian Vector database release and your desired deployment environment.

Installing the JDBC Driver in Matillion Data Productivity Cloud

Before you can use the driver within Matillion, you must upload the downloaded JDBC driver file and make it available to your workspace.

  1. Review Installation Documentation:
    Matillion provides a detailed guide for uploading external drivers, including supported file types and steps for installation, available at
    https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/.

  2. Upload the Driver:
    • Log in to your Matillion Data Productivity Cloud account.
    • Follow the documented steps to upload the JDBC driver JAR file. Typically, this involves navigating to an agent or connector configuration section and selecting the option to add or upload a new driver.
    • Complete the upload process and assign the driver if prompted.

  3. Verify the Installation:
    Once uploaded, review your agent connection settings or driver lists to confirm that the Actian Vector JDBC driver appears as an available option.

Next Steps: Using the Driver

After installing the driver, you can proceed to configure database connections and interact with your Actian Vector database from within Matillion.
Matillion provides complete usage instructions for connecting to external databases using custom JDBC drivers at
https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/.

Ensure that you follow all configuration and security best practices when setting up your connections.
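As a point of reference when filling in connection settings, Actian Vector is part of the Ingres database family, and its JDBC connections typically use a URL of the form `jdbc:ingres://host:port/database` with the driver class `com.ingres.jdbc.IngresDriver`. Treat both the URL scheme and the driver class name here as assumptions to verify against the documentation bundled with your downloaded driver. A minimal Python sketch for assembling such a URL:

```python
def build_jdbc_url(host: str, port: str, database: str) -> str:
    """Assemble a JDBC URL for an Actian Vector (Ingres-family) driver.

    NOTE: the jdbc:ingres:// scheme and symbolic instance ports such as
    'VW7' are assumptions -- confirm against your driver's documentation.
    """
    return f"jdbc:ingres://{host}:{port}/{database}"

# Hypothetical example values:
url = build_jdbc_url("vector-host.example.com", "VW7", "vectordb")
print(url)  # jdbc:ingres://vector-host.example.com:VW7/vectordb
```

In Matillion itself you would enter the equivalent URL in the Database Query component's connection settings rather than build it in code.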

Checking network connectivity

To ensure that Matillion Data Productivity Cloud can connect to your Actian Vector database, you must make certain that the database accepts incoming connections from the correct network sources for your deployment.

Additionally, if the Actian Vector database is referenced using a DNS hostname (rather than a direct IP address), the Matillion Data Productivity Cloud agent (either Full SaaS or Hybrid SaaS) must be able to resolve the DNS address successfully to establish a connection. Ensure that any required DNS routing, records, and permissions are correctly configured.
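The two checks described above, DNS resolution and TCP reachability, can be verified from any machine on the agent's network with a small, self-contained Python sketch (the host name and port below are placeholders for your own endpoint):

```python
import socket

def check_connectivity(host: str, port: int, timeout: float = 3.0):
    """Return (dns_ok, tcp_ok) for a database endpoint.

    dns_ok: the hostname resolved to an IP address.
    tcp_ok: a TCP connection to host:port succeeded within the timeout.
    """
    try:
        ip = socket.gethostbyname(host)
    except socket.gaierror:
        return False, False
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True, True
    except OSError:
        return True, False

# Placeholder endpoint -- substitute your Actian Vector host and port.
dns_ok, tcp_ok = check_connectivity("vector-host.example.com", 27839)
print(f"DNS resolves: {dns_ok}, TCP reachable: {tcp_ok}")
```

If `dns_ok` is False, fix DNS records or routing before troubleshooting anything else; if only `tcp_ok` is False, look at firewalls, security groups, or the database's listen address.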

Querying Data from Actian Vector Database

This guide provides instructions and best practices for querying data from an Actian Vector database, including sample SQL statements, notes on datatype interoperability with Databricks, and recommendations for loading data efficiently using an initial and incremental load pattern.


1. Example SQL SELECT Statements

Below are typical SQL queries for retrieving data from Actian Vector:

```
-- Retrieve all columns and rows from a table
SELECT * FROM sales.orders;

-- Select specific columns
SELECT order_id, customer_id, order_date, total_amount
FROM sales.orders;

-- Filtering data with conditions
SELECT * FROM sales.orders
WHERE order_date >= '2024-06-01';

-- Aggregating data
SELECT customer_id, COUNT(*) AS order_count, SUM(total_amount) AS total_sales
FROM sales.orders
GROUP BY customer_id
ORDER BY total_sales DESC;

-- Joining multiple tables
SELECT o.order_id, o.order_date, c.customer_name, o.total_amount
FROM sales.orders o
JOIN sales.customers c ON o.customer_id = c.customer_id
WHERE o.status = 'SHIPPED';
```

2. Datatype Conversion: Actian Vector and Databricks

When integrating Actian Vector with Databricks or similar analytics platforms, it's important to consider datatype compatibility. Common issues include:

  • Integer and numeric types transfer well, but ensure precision is preserved for `DECIMAL`, `NUMERIC`, and floating point values.
  • Date and timestamp types: Actian Vector's `DATE`, `TIME`, and `TIMESTAMP` should be mapped to appropriate Databricks/Spark SQL types, without unintentional timezone conversion.
  • String and character types might have length differences (`CHAR(n)`, `VARCHAR(n)`), so confirm truncation or length matches expectations.
  • If necessary, explicit casting can be performed in the SELECT statement:

```
SELECT CAST(amount AS DOUBLE PRECISION) AS amount
FROM sales.orders;
```
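To make the mapping concerns above concrete, the sketch below pairs common Actian Vector type names with plausible Databricks/Spark SQL targets. The table is an illustrative starting point, not an official mapping; verify each entry against your own schema, especially decimal precision and timestamp time zones.

```python
# Illustrative Actian Vector -> Databricks/Spark SQL type mapping (assumed,
# not authoritative). Adjust to your schema before relying on it.
VECTOR_TO_SPARK = {
    "INTEGER": "INT",
    "BIGINT": "BIGINT",
    "DECIMAL": "DECIMAL",      # carry precision/scale through explicitly
    "NUMERIC": "DECIMAL",
    "FLOAT": "DOUBLE",
    "CHAR": "STRING",          # fixed-length padding may not survive the transfer
    "VARCHAR": "STRING",
    "DATE": "DATE",
    "TIME": "STRING",          # Spark SQL has no standalone TIME type
    "TIMESTAMP": "TIMESTAMP",  # watch for unintended time zone conversion
}

def map_type(vector_type: str) -> str:
    """Look up a Spark SQL type for a Vector type name, ignoring (n) suffixes."""
    base = vector_type.strip().upper().split("(")[0]
    return VECTOR_TO_SPARK.get(base, "STRING")  # unknown types default to STRING

print(map_type("VARCHAR(50)"))  # STRING
print(map_type("TIMESTAMP"))    # TIMESTAMP
```

Defaulting unknown types to `STRING` is a deliberately conservative choice: the load succeeds and the value can be cast downstream once its semantics are confirmed.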

3. Loading Pattern: Initial and Incremental Loads

Best Practice Pattern

The recommended approach for loading data from Actian Vector to downstream systems (such as Databricks) is:

  1. Once-off Initial Load: Retrieve all existing data.
  2. Incremental Loads: Regularly fetch newly inserted or updated records.

In both cases, use the same Database Query component, adjusting only the SQL.

Initial Load: No Filter Clause

During an initial load, issue a broad query without filters. For example:

```
SELECT *
FROM sales.orders;
```

Incremental Load: With Filter Clause

For incremental loads, add a filter to fetch only records since the last successful extraction (e.g., use a timestamp or auto-incrementing key):

```
SELECT *
FROM sales.orders
WHERE order_date > '2024-06-01 00:00:00';
```

Or using an extracted max value from the last load:

```
SELECT *
FROM sales.orders
WHERE last_modified > '<<last_loaded_timestamp>>';
```

Replace `<<last_loaded_timestamp>>` with the appropriate variable or value.

For further details, refer to the Matillion documentation on Incremental Load Data Replication Strategy.
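The initial/incremental pattern above can be sketched as a small query builder: no watermark means a full initial load, while a stored watermark produces the filtered incremental query. The table and column names are the same illustrative placeholders used above; in Matillion you would normally substitute a pipeline variable rather than build SQL in code.

```python
from datetime import datetime
from typing import Optional

def incremental_query(table: str, watermark_column: str,
                      last_loaded: Optional[datetime]) -> str:
    """Build extraction SQL for the initial/incremental load pattern.

    - last_loaded is None -> initial load: no filter clause.
    - last_loaded is set  -> incremental load: filter on the watermark column.
    """
    sql = f"SELECT * FROM {table}"
    if last_loaded is not None:
        sql += f" WHERE {watermark_column} > '{last_loaded:%Y-%m-%d %H:%M:%S}'"
    return sql

print(incremental_query("sales.orders", "last_modified", None))
print(incremental_query("sales.orders", "last_modified",
                        datetime(2024, 6, 1)))
```

In production, prefer the orchestration tool's variables or parameterized queries over string interpolation, and persist the maximum watermark value observed in each load so the next run can resume from it.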


By following these guidelines and query examples, you can ensure efficient and reliable data extraction from Actian Vector for analytical or ETL purposes.

Data Integration Architecture

Loading data in advance of integration exemplifies the "divide and conquer" approach found in ELT (Extract, Load, Transform) architectures, where data is first loaded into the target environment and then integrated and transformed as a separate step. This separation allows organizations to efficiently manage large-scale ingestion and subsequently focus on data integration challenges. Importantly, data integration involves transforming data into consistent, usable formats—a process best handled through dedicated data transformation pipelines. These pipelines streamline complex tasks such as data cleansing and normalization, enabling repeatable and maintainable workflows. Another significant advantage of the ELT architecture is that both transformation and integration are executed inside the target Databricks database. This in-database processing leverages Databricks' fast, scalable compute capabilities, supporting on-demand operations without the need to invest in or maintain additional data processing infrastructure.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.