Extracting data from VoltDB (NewSQL) to Snowflake
Extracting data from VoltDB is a critical step for organizations seeking to leverage high-performance transactional data within analytical platforms such as Snowflake. This process enables efficient movement of operational data into environments optimized for reporting, advanced analytics, and long-term storage. In this article, we will walk through the key steps required to transfer data from VoltDB to Snowflake. We begin by outlining how to create an access identity in VoltDB to ensure secure connections. For users working with Matillion, we will discuss how to check for, or acquire, the appropriate JDBC driver required for connecting to VoltDB. We will also cover essential considerations regarding network connectivity between VoltDB and Snowflake, helping you verify that data can flow smoothly between the systems. Finally, we will demonstrate best practices for querying and extracting initial data sets, as well as strategies for incremental data loads to keep Snowflake up to date with changes in VoltDB. Whether you are setting up this data pipeline for the first time or seeking to refine your current process, the following guidance will help you establish a reliable and efficient data movement solution.
What is VoltDB (NewSQL)?
VoltDB is a high-performance, in-memory, relational NewSQL database designed for applications demanding low-latency transaction processing and real-time analytics. Unlike traditional disk-based databases, VoltDB leverages a shared-nothing architecture and stores data entirely in memory, allowing it to support thousands of transactions per second with sub-millisecond latency. It ensures ACID compliance through synchronous replication and strong consistency, making it suitable for mission-critical workloads such as financial services, telecommunications, and IoT event processing. VoltDB’s native support for horizontal scaling, stored procedures in Java, and SQL access allow developers to build applications that process streaming data in real time while maintaining transactional integrity and durability through continuous snapshots and command logging.
What is Snowflake?
Snowflake is a fully managed, cloud-native data platform designed for scalable analytics, including data warehousing, lakes, engineering, and sharing. Built on a multi-cluster, shared data architecture, it separates compute, storage, and services, allowing elastic resource scaling without infrastructure management. Snowflake supports semi-structured data formats (JSON, Avro, Parquet), offers a native SQL interface, and integrates seamlessly with BI and ETL tools. Key features include automatic encryption, zero-copy cloning, time travel, and secure data sharing, enabling collaborative and rapid analytics development across teams and locations, making it a versatile choice for modern analytics needs.
Why Move Data from VoltDB (NewSQL) into Snowflake
Unlocking Analytical Insights: Copying Data from VoltDB to Snowflake
There are several reasons why a data engineer or architect might choose to copy data from VoltDB into Snowflake. Firstly, VoltDB typically contains transactional or real-time data that can be highly valuable when leveraged for analytics, reporting, or business intelligence. By integrating this data with information from other sources—such as CRM systems, data lakes, or historical archives—organizations can uncover deeper insights, enhance their decision-making, and realize the full potential of their data assets. Furthermore, performing data integration and analysis directly within Snowflake avoids placing additional processing demands on VoltDB, which is often optimized for high-throughput transaction processing rather than complex analytical workloads. Leveraging Snowflake's scalable and flexible platform thus enables efficient analysis without compromising the performance and reliability of mission-critical systems that rely on VoltDB.
Creating a User in VoltDB
VoltDB provides an authentication mechanism based on user credentials. Users are defined in a deployment configuration file (typically `deployment.xml`) rather than by issuing SQL statements during normal operation. This approach ensures centralized management and greater security.
Step 1: Update the Deployment Configuration
To create a new user, edit the `deployment.xml` file for your VoltDB database and add or modify the `<users>` section. For example, the following snippet creates a user named `dbuser` with password authentication:

```xml
<deployment>
  ...
  <users>
    <user name="dbuser" password="securePassword123"/>
  </users>
  ...
</deployment>
```

Notes:
- The `name` attribute specifies the username and the `password` attribute supplies the credential; use a strong value.
- You can define multiple `<user>` elements for additional users.
- Authentication is only enforced when security is enabled in the deployment file (`<security enabled="true"/>`).
Step 2: Assign User Permissions (Optional)
You can restrict permissions by assigning roles with the `roles` attribute. Roles determine which procedures and resources a user may access (consult the VoltDB documentation for advanced permission schemes). For full administrative access, assign the built-in `administrator` role. For more granular control, for example a user limited to a custom `readonly` role defined in your schema:

```xml
<user name="readonly" password="readonlyPass" roles="readonly"/>
```
Step 3: Deploy the Configuration
After updating the deployment file:
- Save your changes to `deployment.xml`.
- If the VoltDB cluster is running, shut down the cluster gracefully.
- Start (or restart) the VoltDB server using the updated deployment file:

```shell
voltadmin shutdown
voltdb start --deployment=deployment.xml
```

or, for a new cluster:

```shell
voltdb init --deployment=deployment.xml
voltdb start
```
Step 4: Connect Using the New User
When connecting to VoltDB (e.g., using `sqlcmd` or a client driver), supply the username and password:

```shell
sqlcmd --user=dbuser --password=securePassword123
```
Or configure the authentication options in your application’s VoltDB client connection logic.
For further detail, always review the latest VoltDB documentation on security configuration and authentication best practices.
Installing the JDBC driver
At the time of writing, the JDBC driver for VoltDB is not included by default in Matillion Data Productivity Cloud due to licensing or redistribution restrictions. To use VoltDB as a data source, you must manually download and install the driver as an external component.
Step 1: Download the JDBC Driver
- Navigate to the VoltDB download page: https://www.voltdb.com/download/
- On the downloads page, look specifically for the Type 4 JDBC driver, as this provides pure Java connectivity and is most suitable for integration in environments like Matillion Data Productivity Cloud.
- Download the appropriate VoltDB JDBC driver archive for your version and platform.
Step 2: Install the Driver in Matillion Data Productivity Cloud
Since the driver is not bundled with Matillion Data Productivity Cloud, you must upload it manually:
- Follow the installation guidance at: https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/
- Prepare the VoltDB JDBC driver `.jar` file you downloaded in the previous step.
- Use your Matillion Data Productivity Cloud Agent interface to upload the JDBC driver in accordance with the documented process, ensuring it is available for use within Matillion components.
Step 3: Configure and Use the Driver
After successful installation, you may configure connection settings and utilize the driver within your Matillion Data Productivity Cloud projects:
- Refer to the usage documentation for how to set up and execute database queries using the installed JDBC driver: https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/
By completing these steps, you'll have enabled Matillion Data Productivity Cloud to connect and interact with VoltDB using the JDBC driver.
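As an illustration of what the connection settings involve, the sketch below assembles a VoltDB JDBC URL from a list of servers. The `jdbc:voltdb://host:port` scheme, the default client port 21212, and comma-separated server lists reflect the VoltDB JDBC documentation, but verify them against the driver version you installed; the hostnames are placeholders.

```python
# Sketch: build a VoltDB JDBC connection URL for use in a Database
# Query style component. The jdbc:voltdb:// scheme and default client
# port 21212 are assumptions based on the VoltDB JDBC docs; confirm
# against your installed driver version.

def voltdb_jdbc_url(servers, port=21212):
    """Build a JDBC URL listing one or more VoltDB servers."""
    hosts = ",".join(f"{s}:{port}" for s in servers)
    return f"jdbc:voltdb://{hosts}"

# Hypothetical hostnames for illustration:
print(voltdb_jdbc_url(["volt1.internal"]))
# jdbc:voltdb://volt1.internal:21212
print(voltdb_jdbc_url(["volt1.internal", "volt2.internal"]))
# jdbc:voltdb://volt1.internal:21212,volt2.internal:21212
```

Listing every node of the cluster in the URL lets the driver fail over if one server is unavailable.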
Checking network connectivity
Before connecting Matillion Data Productivity Cloud to your VoltDB database, you must ensure that VoltDB is configured to accept incoming connections based on your deployment type:
- **Full SaaS Agent Configuration:** If you are using Matillion's Full SaaS agent, you must allow incoming connections to VoltDB from the cloud IP addresses used by Matillion. The required IP addresses are listed here: https://docs.matillion.com/data-productivity-cloud/security/docs/allowing-ip-addresses/. Update your firewall or network access control settings to permit these IP addresses.
- **Hybrid SaaS Deployment:** For Hybrid SaaS deployments, ensure that VoltDB allows connections from your own virtual private cloud (VPC) where the Matillion agent runs. You may need to configure your network security groups or firewall rules accordingly. To check connectivity and find useful network utilities, visit: https://exchange.matillion.com/data-productivity-cloud/pipeline/check-network-access/.
Additionally, if your VoltDB instance is referenced using a DNS name, you must ensure that the Full SaaS or Hybrid SaaS agent can resolve the DNS address correctly. This may require updating your DNS settings or allowing appropriate DNS traffic for name resolution.
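As a quick sanity check from the host where the agent runs, you can verify that the VoltDB client port is reachable over TCP before configuring the connector. This is a generic reachability probe, not a Matillion utility; the hostname below is a placeholder, and 21212 is VoltDB's default client port.

```python
import socket

def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False

# Example (placeholder hostname; 21212 is VoltDB's default client port):
# reachable = can_connect("voltdb.internal.example.com", 21212)
```

A `False` result from inside your VPC usually points at a firewall rule, security group, or DNS resolution problem rather than VoltDB itself.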
Querying Data from VoltDB
This guide provides technical instructions for querying data from a VoltDB database, including example SQL `SELECT` statements, explanations of datatype conversion between VoltDB and Snowflake, and recommended ETL strategies using the Database Query component.
Example VoltDB SQL SELECT Statements
VoltDB supports standard SQL for querying data. Here are examples for common query patterns:
```sql
-- Select all columns from a table
SELECT * FROM customer;

-- Select specific columns
SELECT customer_id, name, balance FROM customer;

-- Use conditions (WHERE clause)
SELECT order_id, product_id, quantity
FROM orders
WHERE order_date >= '2024-01-01';

-- Use aggregation and grouping
SELECT product_id, SUM(amount) AS total_sales
FROM sales
WHERE sale_date BETWEEN '2024-01-01' AND '2024-03-31'
GROUP BY product_id
HAVING SUM(amount) > 1000;
```
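For large tables, pulling everything in a single `SELECT *` can be impractical. One common pattern (illustrative, not VoltDB-specific guidance from this article) is keyset pagination on an indexed, unique key column, where each batch resumes after the last key seen. The sketch below only builds the batched queries; the table and column names are hypothetical.

```python
# Sketch: keyset pagination for a large initial extract. Each batch
# resumes after the highest key returned by the previous batch and
# relies on an indexed, unique key column. Names are illustrative.

def batch_query(table, key_col, last_key=None, batch_size=10000):
    """Build one paginated SELECT; pass the previous batch's max key."""
    where = f"WHERE {key_col} > {last_key} " if last_key is not None else ""
    return (f"SELECT * FROM {table} {where}"
            f"ORDER BY {key_col} LIMIT {batch_size};")

print(batch_query("orders", "order_id"))
# SELECT * FROM orders ORDER BY order_id LIMIT 10000;
print(batch_query("orders", "order_id", last_key=10000))
# SELECT * FROM orders WHERE order_id > 10000 ORDER BY order_id LIMIT 10000;
```

The loop terminates when a batch returns fewer rows than `batch_size`; unlike `OFFSET`-based paging, each query stays cheap regardless of how deep into the table it is.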
Datatype Conversion: VoltDB ↔️ Snowflake
When transferring data from VoltDB to Snowflake, be aware that datatype conversion may occur. For example:
| VoltDB Type | Approximate Snowflake Type |
|---|---|
| INTEGER | NUMBER |
| BIGINT | NUMBER |
| FLOAT | FLOAT |
| DECIMAL(p,s) | NUMBER(p,s) |
| VARCHAR(n) | VARCHAR(n) |
| TIMESTAMP | TIMESTAMP_NTZ |
| BOOLEAN | BOOLEAN |
Ensure you check for compatibility, especially for precise types like `DECIMAL` or for date/time fields.
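To make the mapping table concrete, here is a small lookup that resolves a VoltDB column type to its approximate Snowflake counterpart, carrying over precision/scale for `DECIMAL` and length for `VARCHAR`. It implements only the rows shown above; treat it as a starting point, not a complete converter.

```python
import re

# Approximate VoltDB -> Snowflake mapping from the table above.
_SIMPLE = {
    "INTEGER": "NUMBER",
    "BIGINT": "NUMBER",
    "FLOAT": "FLOAT",
    "TIMESTAMP": "TIMESTAMP_NTZ",
    "BOOLEAN": "BOOLEAN",
}

def to_snowflake_type(voltdb_type: str) -> str:
    """Map a VoltDB type name to an approximate Snowflake type name."""
    t = voltdb_type.strip().upper()
    m = re.fullmatch(r"DECIMAL\((\d+),\s*(\d+)\)", t)
    if m:  # DECIMAL(p,s) -> NUMBER(p,s), preserving precision and scale
        return f"NUMBER({m.group(1)},{m.group(2)})"
    if re.fullmatch(r"VARCHAR\(\d+\)", t):
        return t  # VARCHAR(n) carries over unchanged
    return _SIMPLE.get(t, t)  # unknown types fall back to the original name

print(to_snowflake_type("DECIMAL(12,2)"))  # NUMBER(12,2)
print(to_snowflake_type("TIMESTAMP"))      # TIMESTAMP_NTZ
```

Verify edge cases (e.g., `DECIMAL` precision limits and timestamp time-zone semantics) against the Snowflake documentation before relying on an automated mapping.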
Best Pattern: Initial Load and Incremental Loads
A recommended approach to querying (and replicating) VoltDB data is to perform:
- Initial Load: Extract the full dataset from VoltDB.
- Incremental Loads: Subsequently extract only new or updated records.
This pattern facilitates efficient ETL and synchronization with downstream systems such as Snowflake.
Using the Database Query Component
Both loading phases can use the same Database Query component, typically within tools like Matillion.
- Initial Load:
  - Query: no filter clause; extracts all data.
  - Example: `SELECT * FROM orders;`
- Incremental Load:
  - Query: includes a filter to extract data modified or added since the last run.
  - Example (using a high-water-mark column such as `updated_at`): `SELECT * FROM orders WHERE updated_at > '2024-06-01T00:00:00';`

Adjust the filter value programmatically based on the last load's maximum timestamp or sequence ID.
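The incremental step can be sketched as a small helper that builds the next run's query from the stored high-water mark and then advances the mark from the rows returned. The `orders` table and `updated_at` column follow the example above; in Matillion you would typically hold the mark in a pipeline variable or a control table rather than in code.

```python
# Sketch: high-water-mark incremental extraction. The mark recorded
# after one run bounds the next run's query. Table and column names
# follow the article's example.

def incremental_query(table, hwm_col, last_mark):
    """Build the incremental SELECT bounded by the stored mark."""
    return f"SELECT * FROM {table} WHERE {hwm_col} > '{last_mark}';"

def next_mark(rows, hwm_col):
    """New high-water mark = max value seen in this batch (None if empty)."""
    return max((r[hwm_col] for r in rows), default=None)

print(incremental_query("orders", "updated_at", "2024-06-01T00:00:00"))

batch = [
    {"order_id": 1, "updated_at": "2024-06-02T10:00:00"},
    {"order_id": 2, "updated_at": "2024-06-03T09:30:00"},
]
print(next_mark(batch, "updated_at"))  # 2024-06-03T09:30:00
```

Using a strictly greater-than comparison avoids re-extracting the boundary row, but note it can miss rows that share the exact mark timestamp; some pipelines use `>=` plus downstream deduplication instead.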
For further details on incremental load strategies, see the Incremental Load / Data Replication Strategy.
This pattern ensures a maintainable, efficient, and robust data integration process when extracting from VoltDB and loading into data warehouses like Snowflake.
Data Integration Architecture
One advantage of the ELT (Extract, Load, Transform) architecture is its divide-and-conquer approach: by loading data into Snowflake ahead of integration, the process is split into two manageable steps. This allows teams to separate the challenges of data ingestion from those of data transformation. Effective data integration relies on robust transformation, and leveraging data transformation pipelines is the recommended approach, as these pipelines automate and standardize the necessary modifications to raw data. Furthermore, ELT offers the significant benefit of performing all data transformations and integrations directly within the Snowflake database itself. This in-database processing is inherently fast, scales with demand, and eliminates the need for separate, costly data processing infrastructure—making data integration both efficient and economical.