Integrate data from Informix (IBM) to Databricks using Matillion

Our Informix to Databricks connector streamlines your data transfer to Databricks in just minutes, keeping your information current without the need for manual coding or complex ETL scripts.


Extracting data from Informix (IBM) to Databricks

Extracting data from Informix is a critical component of many modern data integration and analytics workflows. As organizations increasingly adopt cloud platforms such as Databricks, the need to efficiently move data from legacy systems like Informix into these new environments becomes more pressing. This article provides a step-by-step guide to transferring data from Informix to Databricks, ensuring a robust and secure process.

We will begin by detailing how to create or configure an appropriate identity within Informix, ensuring that proper authentication and security protocols are in place. Next, for users of Matillion, we will explain how to check for the required JDBC driver, or install it if necessary, to enable seamless connectivity. We will also review networking considerations to guarantee reliable and secure data transfer between Informix and Databricks. Finally, you will learn best practices for querying data, handling both initial full loads and ongoing incremental updates to keep your Databricks environment synchronized with your Informix source.

By following these guidelines, you will be able to establish an efficient and scalable data pipeline from Informix to Databricks, supporting your organization's analytics and reporting needs.


What is Informix (IBM)?

Informix is a robust relational database management system (RDBMS) developed by IBM, renowned for its high performance, minimal administration requirements, and exceptional support for Online Transaction Processing (OLTP) environments. It is particularly well-suited for embedding in enterprise and edge applications due to its lightweight footprint, scalability, and ability to manage both structured and semi-structured data, including support for TimeSeries, JSON, and spatial data types. Informix’s dynamic architecture enables automatic task scheduling, storage optimization, and data replication across geographically distributed systems, making it an attractive solution for organizations seeking reliability, seamless integration, and flexibility in managing diverse workloads.


What is Databricks?

Databricks is a unified analytics platform that integrates a cloud-based data lakehouse with native Apache Spark support. Built on open-source Delta Lake technology, it offers ACID transactions, scalable metadata, and schema enforcement directly on cloud storage (AWS S3, Azure Data Lake, Google Cloud Storage). This allows data engineers and analysts to process, store, and query large datasets efficiently for both batch and streaming workloads. Databricks also integrates with MLflow for end-to-end machine learning, while its SQL analytics workspace enables easy access for data scientists and SQL users, supporting reliable, concurrent analytics and operational tasks from a single source of truth.

Why Move Data from Informix (IBM) into Databricks

Unlocking Analytics Potential: Moving Informix Data into Databricks

A data engineer or architect may wish to copy data from Informix into Databricks for several compelling reasons. Informix often serves as a reliable source of operational data that holds significant potential for generating business insights. By transferring this data into Databricks, it becomes possible to integrate it with information from other systems, thereby unlocking greater analytical value through comprehensive data exploration and advanced analytics. Additionally, conducting data integration and transformation within the Databricks environment helps to minimize any additional processing burdens on the Informix system itself, ensuring that its primary transactional performance remains unaffected while still enabling robust data-driven decision making.

Creating a User in Informix Database

To create a user in an IBM Informix database, you generally need both operating system (OS) and database privileges. Unlike some other databases, Informix relies on the underlying OS user accounts or a central authentication mechanism (such as LDAP) to manage users, and then grants them database-level permissions.

Below are step-by-step instructions for creating a user in an Informix database.


1. Create an OS User Account

Informix recognizes users who have valid accounts on the OS (or through LDAP). You must first ensure an OS account exists. Run the following as root (or via your system administrator):

On UNIX/Linux:

useradd informix_user
passwd informix_user

Replace informix_user with the desired username.

2. Grant Database Access Privileges

You do not explicitly create users within Informix using a SQL CREATE USER statement (as with some other RDBMS products). Instead, you grant privileges to OS-level users after they exist.

Connect to Informix as a user with DBA privileges, for example via dbaccess or any compatible SQL tool.

Example: Grant CONNECT Privilege

GRANT CONNECT TO informix_user;

You can use additional GRANT statements to provide further privileges:

GRANT RESOURCE TO informix_user;
-- or, grant DBA role if appropriate
GRANT DBA TO informix_user;

3. (Optional) Grant Table or Schema Permissions

To let the user access specific tables or schemas, use standard SQL permission grants.

GRANT SELECT, INSERT, UPDATE, DELETE ON tablename TO informix_user;

Replace tablename with the name of your table.


Note: The authentication mechanism depends on your environment and Informix configuration. Informix does not store user definitions within its internal catalogs by default; users are managed by the OS or centralized directories. Ensure the user or group name matches across both systems for seamless access control.

Installing the JDBC driver

At the time of writing, the JDBC driver for Informix is not bundled by default with Matillion Data Productivity Cloud installations. This is due to licensing or redistribution restrictions imposed by the driver's publisher. For users intending to connect to an Informix database using Matillion, manual download and installation of the JDBC driver is required.

Step 1: Downloading the Informix JDBC Driver

  1. Access the official IBM Informix JDBC driver download page: https://www.ibm.com/support/pages/informix-jdbc-drivers
  2. On the download page, locate the most recent release of the Type 4 JDBC driver. The Type 4 driver is entirely written in Java and offers optimal compatibility and performance within cloud-based applications such as Matillion.
  3. Download the relevant driver package (typically distributed as a .jar file).

Step 2: Uploading the Driver to Matillion Data Productivity Cloud

  1. Follow the installation instructions as described in the Matillion documentation: How to upload external JDBC drivers
    • Ensure you are uploading the downloaded .jar file to your Matillion agent or environment, as instructed in the official documentation.
    • If necessary, restart the relevant Matillion service or agent to recognize the new driver.

Step 3: Configuring Database Access

  1. Once the driver has been uploaded and recognized, consult the product documentation for guidance on using new database drivers: Database Query usage instructions
    • These instructions will guide you through configuring a connection to your Informix database using the newly installed JDBC driver, as well as how to interact with Informix data from within the Matillion Data Productivity Cloud platform.

By following these steps and referencing the provided documentation, technical users can enable Informix database connectivity in Matillion, even though the driver is not included out-of-the-box.
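When configuring the connection, the Type 4 driver expects a JDBC URL that names both the database and the Informix server instance. As a quick illustration (the host, port, database, and INFORMIXSERVER values below are placeholders, not values from this article), the URL can be assembled like this:

```python
# Illustrative sketch: build an Informix Type 4 JDBC connection URL of the form
#   jdbc:informix-sqli://<host>:<port>/<database>:INFORMIXSERVER=<server>
# All connection values in the example call are placeholders.

def informix_jdbc_url(host: str, port: int, database: str, server: str) -> str:
    """Assemble a Type 4 Informix JDBC connection URL."""
    return f"jdbc:informix-sqli://{host}:{port}/{database}:INFORMIXSERVER={server}"

print(informix_jdbc_url("db.example.com", 9088, "stores_demo", "ol_informix"))
# jdbc:informix-sqli://db.example.com:9088/stores_demo:INFORMIXSERVER=ol_informix
```

Credentials are typically supplied as separate user and password connection properties rather than embedded in the URL; consult the IBM driver documentation for the properties supported by your driver version.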

Checking network connectivity

To ensure successful connectivity between Matillion Data Productivity Cloud and your Informix database, you must verify that the database allows incoming network connections from the appropriate sources, depending on your deployment configuration:

  • Full SaaS Agent Configuration: You must allow incoming connections from the set of public IP addresses used by the Matillion service. These IP addresses are listed here: Matillion Data Productivity Cloud: Allowing IP Addresses. Make sure to update your firewall or security group rules to permit access from these addresses.

  • Hybrid SaaS Deployment: For hybrid configurations, allow incoming database connections from your own virtual private cloud (VPC) in which your Matillion Hybrid SaaS agent is running. You may use the network access check utilities provided at Matillion Exchange - Check Network Access to validate reachability between the agent and the Informix database.

In both configurations, if your Informix database is referenced using a DNS name rather than an IP address, ensure that the Full SaaS or Hybrid SaaS agent can successfully resolve that DNS address. This may require updating DNS server settings or ensuring network connectivity to your DNS resolver.
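As a quick sanity check before configuring the pipeline, you can test TCP reachability from the machine running your agent. The snippet below is a minimal sketch; the hostname and port in the comment are placeholders, and the actual listener port for your Informix instance depends on its configuration:

```python
import socket

def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False

# Example usage (placeholder host and port):
# print(can_reach("db.example.com", 9088))
```

A successful TCP connection confirms the firewall and DNS path; it does not validate credentials or the Informix listener protocol, so a full test from the Matillion Database Query component is still recommended.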

Querying Data from an Informix Database

This guide demonstrates how to query data from an Informix database using SQL, and provides recommendations for efficient extraction patterns, especially when integrating with other platforms such as Databricks. The examples use standard SQL SELECT statements. Attention is given to datatype conversion considerations and incremental data loading strategies.


Example Informix SQL SELECT Queries

Below are several common examples illustrating how to retrieve data from Informix tables.

Selecting All Columns from a Table

SELECT * FROM customer;

Selecting Specific Columns

SELECT customer_id, first_name, last_name FROM customer;

Filtering Rows

SELECT * FROM orders
WHERE order_date >= '2024-01-01'
  AND status = 'SHIPPED';

Aggregating Data

SELECT status, COUNT(*) AS order_count
FROM orders
GROUP BY status;

Joining Tables

SELECT o.order_id, o.order_date, c.first_name, c.last_name
FROM orders o
JOIN customer c ON o.customer_id = c.customer_id;


Datatype Conversion Considerations

When querying Informix to load data into other systems (e.g., Databricks), you may encounter datatype mappings that require attention. Some types in Informix may not map directly to Databricks types; for example:

  • Informix DATETIME YEAR TO SECOND → Databricks TIMESTAMP
  • Informix BYTE and TEXT → handled by Databricks as binary or string, with possible conversion steps required
  • Informix NUMERIC(p,s) → Databricks DECIMAL(p,s)

Always verify your schema mapping and test queries to ensure that datatypes are interpreted as expected post-transfer.
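To make the verification step concrete, the sketch below encodes the example mappings above in a small lookup function. This is illustrative only: TYPE_MAP and map_type are hypothetical names, the mapping covers only the types listed above, and real migrations should be driven by your actual schema.

```python
import re

# Illustrative Informix-to-Databricks type mapping, mirroring the examples above.
TYPE_MAP = {
    "DATETIME YEAR TO SECOND": "TIMESTAMP",
    "BYTE": "BINARY",
    "TEXT": "STRING",
}

def map_type(informix_type: str) -> str:
    """Best-effort mapping of an Informix column type to a Databricks SQL type."""
    t = informix_type.strip().upper()
    # NUMERIC/DECIMAL keep their precision and scale.
    m = re.fullmatch(r"(?:NUMERIC|DECIMAL)\((\d+),\s*(\d+)\)", t)
    if m:
        return f"DECIMAL({m.group(1)},{m.group(2)})"
    # Pass through types with no explicit mapping unchanged.
    return TYPE_MAP.get(t, t)

print(map_type("numeric(10, 2)"))           # DECIMAL(10,2)
print(map_type("DATETIME YEAR TO SECOND"))  # TIMESTAMP
```

A lookup like this is useful for generating target DDL or for spot-checking a schema before the first full load.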


Pattern for Data Extraction: Initial & Incremental Loads

A reliable approach for querying and transferring data from Informix is to use a two-step load pattern:

  1. Once-Off Initial Load

    • Extract the entire dataset using the Database Query component.
    • No filter clause is used; all available rows are retrieved.

    Example:

    SELECT *
    FROM orders;

  2. Ongoing Incremental Loads

    • Periodically extract only new or updated rows.
    • Apply a filter based on a column such as last_updated, an incrementing primary key, or another suitable change-tracking field.

    Example with an Incremental Filter:

    SELECT *
    FROM orders
    WHERE last_updated > '2024-06-01 00:00:00';

    Adjust the filter value as needed on each run to capture only the data that has changed since the last successful extraction.
    More on this pattern can be found in Matillion's Incremental Load Data Replication Strategy.

Note:
Use the same Database Query component for both initial and incremental loads, modifying the query as appropriate.
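The two-step pattern above can be sketched programmatically. The following is a hypothetical illustration of watermark-driven query construction (build_extract_query is an invented name); in Matillion the watermark would typically be held in a job variable or a control table rather than a local value:

```python
from datetime import datetime
from typing import Optional

def build_extract_query(table: str, watermark_column: str,
                        last_watermark: Optional[datetime]) -> str:
    """Return the initial-load query when no watermark exists yet,
    otherwise an incremental query filtered on the watermark column."""
    if last_watermark is None:
        # Once-off initial load: no filter clause, all rows retrieved.
        return f"SELECT * FROM {table}"
    # Incremental load: only rows changed since the last successful run.
    ts = last_watermark.strftime("%Y-%m-%d %H:%M:%S")
    return f"SELECT * FROM {table} WHERE {watermark_column} > '{ts}'"

print(build_extract_query("orders", "last_updated", None))
print(build_extract_query("orders", "last_updated", datetime(2024, 6, 1)))
```

After each successful run, the stored watermark would be advanced to the maximum value of the change-tracking column observed in that extraction, so the next run picks up exactly where the previous one left off.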


Continue to use and adapt these guidelines and patterns to fit the specific requirements of your Informix-to-Databricks (or other platform) integration workflows.

Data Integration Architecture

Loading data in advance of integration exemplifies a "divide and conquer" approach: the data workflow is split into two distinct steps, first ingesting raw data, then performing integration and transformation. This staged strategy is a key advantage of the ELT (Extract, Load, Transform) architecture, as it enables teams to quickly load large volumes of raw data before investing time and resources into transforming and integrating it.

Data integration fundamentally requires some degree of transformation, such as cleaning, standardizing, or joining datasets, and Databricks provides robust support for implementing these transformation pipelines within the platform itself. Another notable benefit of the ELT architecture is that both transformation and integration take place directly inside the target Databricks environment. Data processing is therefore fast, on-demand, and highly scalable, without the need for additional infrastructure or external processing services, allowing organizations to optimize costs while benefiting from cloud-native scalability.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.