Extracting data from VoltDB (NewSQL) to Amazon Redshift
Extracting data from VoltDB is an essential step for organizations looking to use a high-performance operational data platform alongside a robust analytical solution such as Amazon Redshift. Whether you are migrating operational data for historical analysis and reporting, or blending VoltDB’s in-memory speed with Redshift’s scalability, understanding the technical workflow is crucial to building a reliable and efficient data pipeline. This article will guide you through the process of moving data from VoltDB to Redshift. We will begin with creating a user identity in VoltDB, which is necessary for secure and managed access. For those who use Matillion, we will clarify how to check for the presence of the required JDBC driver, or how to obtain it if needed. We’ll also discuss the fundamentals of establishing network connectivity between VoltDB (your source) and Redshift (your target), which is often a prerequisite for successful data transfer. Finally, the article will explore strategies for querying data both for an initial bulk extract and for ongoing incremental loads, helping you architect a solution that matches your business requirements. With these foundations, you will be equipped to design a data integration process that is reliable, secure, and scalable.
What is VoltDB (NewSQL)?
VoltDB is a high-performance, in-memory relational database designed for operational workloads requiring low-latency, high-throughput transaction processing. Built on a shared-nothing architecture, VoltDB ensures horizontal scalability and fault tolerance by partitioning data and processing across multiple nodes. It features full ACID compliance, making it suitable for mission-critical applications where consistency and durability are paramount. VoltDB employs a novel command-log replication and snapshotting mechanism to achieve durability without compromising speed, and leverages stored procedures for efficient, elastic workload management. Its native integration with streaming data workflows and real-time analytics capabilities position it as an optimal choice for industries such as telecommunications, finance, and IoT that demand millisecond-scale decisioning and actionable insights.
What is Amazon Redshift?
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the AWS cloud, designed to enable fast and efficient querying and analysis of large datasets. It utilizes columnar storage and parallel processing to accelerate complex analytical queries, making it well-suited for business intelligence workloads. Redshift is compatible with standard SQL, supports integration with a broad ecosystem of ETL and visualization tools, and provides robust security features, including VPC isolation, encryption at rest and in transit, and granular access controls. Its architecture employs a leader node to coordinate query execution across multiple compute nodes, facilitating scalability and high performance for analytical workloads.
Why Move Data from VoltDB (NewSQL) into Amazon Redshift
Unlocking Analytical Power: Transferring Data from VoltDB to Redshift
A data engineer or architect may wish to copy data from VoltDB into Amazon Redshift for several compelling reasons. First, the information stored in VoltDB is often highly valuable, typically generated by high-velocity event streams, transactional systems, or time-sensitive applications. However, the true value of this data often emerges when it is integrated with data from other sources across the organization, enabling richer analytics and more comprehensive business intelligence. Performing such data integration and analytical workloads within VoltDB could consume significant system resources, potentially impacting its primary function of low-latency transaction processing. By moving the data into Redshift, a scalable data warehousing solution optimized for complex queries and analytics, an organization can efficiently analyze integrated datasets without placing additional load on VoltDB, ensuring both systems operate at their best.
Creating a User Identity in VoltDB
VoltDB, by default, uses role-based access control. Users are associated with roles, which determine their database permissions. When deploying VoltDB with authentication enabled, you define user identities and assign roles in the deployment configuration file (typically `deployment.xml`). There are no SQL commands (like `CREATE USER`) for interactive user creation; instead, you manage users via this configuration file.
Below are the steps to add a new user to VoltDB.
1. Open the Deployment Configuration File
Locate your `deployment.xml` file. It usually resides in your VoltDB project’s root directory, or wherever you specified when starting the server.
2. Edit the `<users>` Section
Inside `deployment.xml`, find (or add if missing) the `<users>` section. Each user is defined with a `<user>` entry specifying a username, password, and associated roles.
Example: Adding a New User
```xml
<deployment>
  ...
  <security enabled="true">
    <users>
      <!-- Existing users -->
      <user name="admin" password="admin-password" roles="admin"/>
      <!-- New user entry -->
      <user name="jane" password="jane-StrongPassword123" roles="user,readonly"/>
    </users>
    <roles>
      <role name="admin" permissions="all"/>
      <role name="user" permissions="invoke,select"/>
      <role name="readonly" permissions="select"/>
    </roles>
  </security>
</deployment>
```
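After editing, you can sanity-check the file programmatically. The sketch below uses Python's standard library to list the users defined in a deployment file; it assumes the `<security><users><user/></users></security>` layout shown above, so adjust the element path if your file nests differently.

```python
import xml.etree.ElementTree as ET

def list_users(deployment_xml):
    """Parse a deployment.xml string and return {username: [roles]}.

    Assumes the <security><users><user .../></users></security>
    layout shown above; adjust the findall path if your file
    nests these elements differently.
    """
    root = ET.fromstring(deployment_xml)
    users = {}
    for user in root.findall("./security/users/user"):
        roles = user.get("roles", "")
        users[user.get("name")] = [r for r in roles.split(",") if r]
    return users
```

For example, running `list_users` over the configuration above should report `jane` with the roles `user` and `readonly`, confirming the new entry was saved correctly.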
Notes:
- `name`: The username ("jane" in this example).
- `password`: The user's password (ensure this is strong).
- `roles`: Comma-separated list of roles assigned to this user.
3. Assign Roles and Permissions
Make sure the roles you assign exist in the `<roles>` section and have the necessary permissions for the user.
Example role definition:
```xml
<role name="readonly" permissions="select"/>
```
4. Save Changes and Restart VoltDB
After editing `deployment.xml`:
- Save the file.
- Restart the VoltDB server for the changes to take effect.
5. Connect with the New User
Once VoltDB restarts, the new user can connect using the credentials provided. For example, using `sqlcmd`:

```
sqlcmd --user=jane --password=jane-StrongPassword123
```
Tip: Never store plaintext production passwords in version control. Use deployment best practices according to VoltDB recommendations.
For more details about user and role management in VoltDB, consult VoltDB Security Documentation.
Installing the JDBC Driver
At the time of writing, the JDBC driver for VoltDB is not included by default in Matillion Data Productivity Cloud due to licensing or redistribution restrictions. It is necessary to manually download and install the correct driver to enable database connectivity.
1. Download the VoltDB JDBC Driver
Begin by downloading the JDBC driver from the official VoltDB website: https://www.voltdb.com/download/. When presented with multiple options, look specifically for the Type 4 JDBC driver. The Type 4 driver is a pure Java implementation and is preferred for compatibility and ease of deployment.
2. Upload the JDBC Driver to Matillion Data Productivity Cloud
Once the appropriate driver file (typically a `.jar` file) has been downloaded, you need to upload it into the Matillion environment. Detailed instructions on how to upload and add external JDBC drivers are available at the following URL: https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/. Follow these steps carefully to ensure that the driver is correctly integrated with Matillion.
3. Configure the Driver for Usage
After the driver has been uploaded and made available, consult the product documentation for usage instructions on connecting to your VoltDB database using the new driver. The relevant guide can be found here: https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/.
By following these steps, you will enable Matillion Data Productivity Cloud to communicate with VoltDB through the supported JDBC interface.
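If you script any part of the connection setup, it helps to build the JDBC URL consistently. The helper below is an illustrative sketch: the host names are placeholders, 21212 is VoltDB's default client port, and the `jdbc:voltdb://host:port[,host:port]` URL format should be confirmed against the VoltDB JDBC documentation for your driver version.

```python
def voltdb_jdbc_url(hosts, port=21212):
    """Build a VoltDB JDBC connection URL from one or more hosts.

    VoltDB's client port defaults to 21212; multiple hosts can be
    listed so the driver can fail over between cluster nodes.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    servers = ",".join(f"{h}:{port}" for h in hosts)
    return f"jdbc:voltdb://{servers}"

# Example with hypothetical host names for a two-node cluster:
url = voltdb_jdbc_url(["voltdb-node1", "voltdb-node2"])
```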
Checking network connectivity
To establish a successful connection between Matillion Data Productivity Cloud and your VoltDB database, you must ensure that your VoltDB instance allows incoming connections according to your deployment type:
- Full SaaS agent configuration: Configure your VoltDB instance to accept incoming connections from the IP addresses used by Matillion's SaaS agents. The current list of required IPs is published here: Matillion SaaS Allowed IP Addresses.
- Hybrid SaaS deployment: For hybrid deployments, configure VoltDB to allow connections from your own Virtual Private Cloud (VPC). You may find network-checking utilities and further guidance on the Matillion Exchange: Check Network Access page.
Regardless of configuration, if your VoltDB database is specified by DNS rather than by IP address, ensure that the Full SaaS or Hybrid SaaS agent has proper DNS resolution capabilities to resolve the database address.
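A quick way to verify reachability before configuring the Matillion component is a plain TCP check from a machine in the same network context as the agent. This sketch assumes VoltDB's default client port of 21212, and only confirms that DNS resolution and a TCP handshake succeed, not that authentication works.

```python
import socket

def can_reach(host, port=21212, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds.

    21212 is VoltDB's default client port; adjust it if your
    deployment uses a different setting. This verifies DNS
    resolution and basic TCP reachability only.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A `False` result usually points at a firewall rule, a missing allow-listed IP, or a DNS resolution problem on the agent side.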
Querying Data from VoltDB: Best Practices & Examples
When integrating or querying data from a VoltDB database, several important patterns and details should be observed—especially when data is to be used in other platforms, such as Amazon Redshift. The following instructions and examples provide guidance for technical users working with VoltDB and the Matillion ETL Database Query component for both initial and incremental data loads.
1. SQL SELECT Examples for VoltDB
Below are examples of simple and filtered `SELECT` statements used to query VoltDB tables:
```sql
-- Basic query returning all columns and rows
SELECT * FROM customers;

-- Querying specific columns
SELECT customer_id, first_name, last_name FROM customers;

-- Filtering with a WHERE clause
SELECT * FROM orders WHERE order_status = 'SHIPPED';

-- Range-limited query (e.g., for incremental load based on updated timestamp)
SELECT * FROM transactions WHERE last_modified > '2024-06-01 00:00:00';
```
Note: VoltDB supports standard SQL queries, but you should review VoltDB SQL syntax for specific compliance and reserved functions.
2. Datatype Conversion: VoltDB vs Redshift
When transferring data from VoltDB to Redshift, be aware that datatype conversion may occur due to differences in the types supported by each platform. For example:
| VoltDB Type | Common Equivalent in Redshift |
|---|---|
| INTEGER | INTEGER |
| BIGINT | BIGINT |
| DECIMAL | NUMERIC/DECIMAL |
| VARCHAR/STRING | VARCHAR |
| TIMESTAMP | TIMESTAMP |
| BOOLEAN | BOOLEAN |
Best Practice: Always verify that the selected columns in your queries have types compatible with your Redshift schema, and convert as needed during ETL transformations.
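To keep such conversions consistent across pipelines, the mapping table above can be encoded once and reused. This is only a sketch covering the types listed in the table; verify each mapping against your actual VoltDB schema and Redshift DDL before relying on it.

```python
def redshift_type(voltdb_type):
    """Map a VoltDB column type to a common Redshift equivalent.

    Sketch covering only the types from the table above; extend
    and verify against your own schemas before relying on it.
    """
    mapping = {
        "INTEGER": "INTEGER",
        "BIGINT": "BIGINT",
        "DECIMAL": "DECIMAL",   # NUMERIC is an alias in Redshift
        "VARCHAR": "VARCHAR",
        "TIMESTAMP": "TIMESTAMP",
        "BOOLEAN": "BOOLEAN",
    }
    t = voltdb_type.strip().upper()
    if t not in mapping:
        raise ValueError(f"no known Redshift equivalent for {voltdb_type!r}")
    return mapping[t]
```

Raising on unknown types, rather than silently passing them through, surfaces schema drift early in the ETL job instead of at load time.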
3. Loading Patterns: Initial & Incremental Loads
The recommended design for data synchronization is:
- Perform a one-time initial load.
- Follow up with regular incremental loads only for new or changed data.
Both loads use the same Database Query component in Matillion ETL, but differ in filter usage.
Initial Load Example
For an initial load (no filter clause), the query extracts all records:

```sql
SELECT * FROM customer_events;
```

Configure the Database Query component without any WHERE clause.
Incremental Load Example
For incremental loads (with a filter clause), add a condition based on an incrementing column, such as a `last_modified` timestamp or an auto-incrementing ID:

```sql
SELECT * FROM customer_events
WHERE last_modified > '${LAST_RUN_TIMESTAMP}';
```
- Set the parameter (e.g., `${LAST_RUN_TIMESTAMP}`) to the value from your last successful load as part of the ETL job context.
- Consult Matillion’s Incremental Load Strategy article for detailed guidance on filter setup and automation patterns.
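The high-water-mark logic behind this filter can be sketched as follows. Here the last successful timestamp is kept in a local JSON file purely for illustration; in Matillion you would typically hold it in a job or environment variable instead. Note also that this sketch builds SQL by string interpolation, which is acceptable for trusted, internally generated values but not for untrusted input.

```python
import json
from pathlib import Path

STATE_FILE = Path("last_run.json")  # hypothetical local state store

def build_incremental_query(table, ts_column, default_ts="1970-01-01 00:00:00"):
    """Build an incremental-extract query using a stored high-water mark.

    Falls back to default_ts (i.e., a full extract) when no state
    has been recorded yet for the table.
    """
    if STATE_FILE.exists():
        last_run = json.loads(STATE_FILE.read_text()).get(table, default_ts)
    else:
        last_run = default_ts
    return f"SELECT * FROM {table} WHERE {ts_column} > '{last_run}';"

def save_high_water_mark(table, new_ts):
    """Persist the newest timestamp seen after a successful load."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    state[table] = new_ts
    STATE_FILE.write_text(json.dumps(state))
```

Saving the mark only after the load succeeds is the important design point: a failed run leaves the old mark in place, so the next run safely re-extracts the same window.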
References:
- VoltDB Documentation
- Matillion: Incremental Load Data Replication Strategy
Data Integration Architecture
Loading data into Amazon Redshift in advance of integration is a key strategy enabled by the Extract, Load, and Transform (ELT) architecture. This approach divides the overall data pipeline into two distinct steps: extraction and loading of raw data, followed by transformation and integration once the data is securely within Redshift. By decoupling these stages, organizations can handle larger data volumes and more complex integrations with greater flexibility and fewer bottlenecks. Data integration itself inherently involves data transformation, which is most efficiently accomplished using well-designed data transformation pipelines. These pipelines structure the transformation logic, making it easier to maintain, scale, and orchestrate. One of the primary advantages of the ELT paradigm is that all data transformation and integration tasks are executed within the Redshift environment. This not only enables fast, on-demand, and highly scalable processing by leveraging Redshift’s native computational power, but also removes the need for separate, external data processing infrastructure, resulting in simplified operations and reduced costs.
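The two ELT stages can be illustrated as an ordered pair of statements run against Redshift: a raw load first, then an in-warehouse transformation. All names here (staging table, target table, S3 URI, IAM role) are hypothetical, and a real pipeline would execute these via Matillion components or a Redshift client; the sketch only shows the shape of the pattern.

```python
def elt_statements(raw_table, target_table, s3_uri, iam_role):
    """Return the ordered SQL for a simple ELT load into Redshift (sketch).

    Step 1 (Load): COPY staged files from S3 into a raw/staging table.
    Step 2 (Transform): integrate raw rows into the target table using
    Redshift's own compute, per the ELT pattern described above.
    """
    copy_stmt = (
        f"COPY {raw_table} FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS CSV;"
    )
    transform_stmt = (
        f"INSERT INTO {target_table} "
        f"SELECT customer_id, first_name, last_name FROM {raw_table};"
    )
    return [copy_stmt, transform_stmt]
```

Because the transformation runs as plain SQL inside Redshift, it scales with the cluster and needs no separate processing infrastructure, which is exactly the operational benefit of ELT described above.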