Extracting data from Sybase SQL Anywhere to Databricks
Extracting data from Sybase SQL Anywhere is a key requirement for organizations seeking to modernize their analytics capabilities by making legacy data available in cloud-based platforms such as Databricks. This process, however, involves several important considerations beyond simple data transfer. In this article, we will guide you through the essential steps to ensure a successful, reliable data pipeline from Sybase SQL Anywhere to Databricks. First, we will discuss how to create an identity in Sybase SQL Anywhere, which is necessary to enable secure and controlled access to the data you wish to extract. For users of Matillion—a popular ETL tool—we will detail how to verify whether the required JDBC driver is present, and, if not, how to acquire and install it. We'll also explore the network connectivity setup needed to enable the source-to-target integration, with particular attention to any configuration adjustments required for a smooth connection. Finally, we will outline strategies for querying your data, including both initial bulk extraction and efficient incremental updates. Whether you are modernizing an established architecture or implementing this integration for the first time, this guide will help you address the end-to-end process with confidence and clarity.
What is Sybase SQL Anywhere?
Sybase SQL Anywhere is a feature-rich relational database system designed for embedded, mobile, and small to medium enterprise environments. Initially developed by Watcom and now part of SAP, it offers cross-platform support for Windows, Linux, Mac OS, and various mobile systems. Optimized for distributed, occasionally connected applications, it features advanced synchronization (MobiLink), full transactional integrity, ANSI SQL compliance, and strong concurrency. Its lightweight design, simple deployment, and self-administration suit high-performance transactional and analytical workloads in resource-constrained or remote settings, making it ideal for developers needing reliable data management and synchronization for mobile or embedded applications.
What is Databricks?
Databricks is a unified analytics platform that provides a powerful managed database solution built on top of Apache Spark and Delta Lake. Designed for big data workloads, Databricks integrates data engineering, data science, and business analytics within a collaborative workspace. At its core, it leverages Delta Lake, an open-source storage layer that enables ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Databricks supports high-performance querying, lakehouse architecture, and native interoperability with popular cloud storage backends, making it well-suited for organizations managing petabyte-scale datasets and requiring robust, collaborative, and scalable analytics capabilities.
Why Move Data from Sybase SQL Anywhere into Databricks
Unlocking Analytics: The Value of Copying Data from Sybase SQL Anywhere to Databricks
A data engineer or architect may wish to copy data from Sybase SQL Anywhere into Databricks for several compelling reasons. Sybase SQL Anywhere often holds operational or historical data that can be highly valuable for advanced analytics and decision-making. By integrating this data with information from other sources—such as cloud storage, ERP systems, or clickstream data—organizations can unlock deeper insights and derive greater value through unified analysis and reporting. Utilizing Databricks as the platform for this integration offers clear advantages, as it allows large-scale data transformation, machine learning, and analytics to be performed without imposing additional workload or performance impact on the Sybase SQL Anywhere system. This separation ensures operational stability for the source database while enabling innovation and comprehensive data integration within Databricks.
Creating a User in Sybase SQL Anywhere
To allow a new individual or application to connect to your Sybase SQL Anywhere database, you must create a database user. The following instructions explain how to do this using SQL statements.
1. Connect to the Database as an Administrator
You need sufficient privileges (such as those of a DBA) to create users. Connect via dbisql or your preferred client:
CONNECT TO <database-name> USER DBA IDENTIFIED BY '<dba-password>';
Replace <database-name> and <dba-password> with your database name and DBA credentials.
2. Create the User
The basic syntax to create a user is:
CREATE USER <user-name> IDENTIFIED BY '<password>';
Example:
To create a user named report_user with the password StrongP@ssw0rd:
CREATE USER report_user IDENTIFIED BY 'StrongP@ssw0rd';
3. Assign Permissions
New users have no permissions by default (other than CONNECT). To allow the new user to perform tasks, grant necessary rights. For example, to enable the user to read data from a table:
GRANT SELECT ON Sales TO report_user;
You can grant additional permissions as required (e.g., INSERT, UPDATE, DELETE, or entire roles).
4. Optional: Set User Options
You may configure options such as the user's default schema or roles. For example:
SET OPTION report_user.default_remote_ordering = 'On';
5. Verify the User
Test that the new user can connect:
CONNECT TO <database-name> USER report_user IDENTIFIED BY 'StrongP@ssw0rd';
Notes
- Usernames in SQL Anywhere are not case-sensitive, but passwords are.
- To remove a user, use:
DROP USER <user-name>;
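When many such accounts need to be provisioned (for example, one read-only user per downstream tool), the statements above can be assembled programmatically. The following is a minimal Python sketch that only builds the SQL strings shown in this section; the user, password, and table names are placeholders, and the output would be run through dbisql or your preferred client.

```python
# Illustrative sketch: assemble the CREATE USER / GRANT statements from this
# section. It only builds SQL text; execute the output via dbisql or a client.

def provision_user_sql(user: str, password: str, tables: list) -> list:
    """Return SQL statements creating a user and granting SELECT on tables."""
    stmts = [f"CREATE USER {user} IDENTIFIED BY '{password}';"]
    stmts += [f"GRANT SELECT ON {table} TO {user};" for table in tables]
    return stmts

for stmt in provision_user_sql("report_user", "StrongP@ssw0rd", ["Sales"]):
    print(stmt)
```

Keeping user provisioning as generated, reviewable SQL makes it easy to audit exactly which permissions each extraction account receives.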
Installing the JDBC Driver
When working with Matillion Data Productivity Cloud and Sybase SQL Anywhere, it is often necessary to configure a JDBC driver to enable connectivity. At the time of writing, the JDBC driver for Sybase SQL Anywhere is not included by default in Matillion because of licensing or redistribution restrictions. Therefore, you must download and install it manually.
Follow these steps to download and install the Sybase SQL Anywhere JDBC driver for use with Matillion Data Productivity Cloud:
1. Download the JDBC Driver
Go to the official SAP Sybase SQL Anywhere page to download the JDBC driver:
https://www.sap.com/products/sybase-sql-anywhere.html
- You may be required to create or log in to an SAP account to access the download.
- When downloading, prefer a Type 4 ("pure Java") JDBC driver if one is available, as it requires no native libraries and offers the broadest cross-platform compatibility.
2. Extract and Prepare the Driver
- Once the download is complete, extract the archive (usually a .zip or .tar.gz file) to a local directory.
- Locate the JDBC driver file(s). These are typically .jar files, such as sajdbc4.jar.
- Ensure you have read and comply with the licensing terms of the driver.
3. Install the JDBC Driver in Matillion Data Productivity Cloud
Since the driver is not bundled with Matillion and must be uploaded manually, follow the official Matillion instructions for adding external drivers. The installation guide can be found here:
https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/
- Access your Matillion Data Productivity Cloud Agent interface.
- Follow the steps in the guide to upload and register the JDBC driver .jar files for use within your environment.
- After installation, verify that the driver appears in the available drivers list.
4. Connect and Use the Sybase SQL Anywhere Driver
To utilize the installed JDBC driver, refer to the official usage documentation provided by Matillion:
https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/
- This guide provides detailed steps on creating and configuring database queries using your uploaded driver, ensuring connectivity with your SQL Anywhere database.
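Once the driver is registered, the connection is described by a JDBC URL. The helper below sketches how such a URL might be assembled; note that the "jdbc:sqlanywhere:" prefix and the property names (Host, ServerName, DatabaseName) are assumptions for illustration, so confirm the exact URL format against the documentation shipped with your version of sajdbc4.jar.

```python
# Illustrative helper that assembles a SQL Anywhere JDBC URL from parts.
# ASSUMPTION: the "jdbc:sqlanywhere:" prefix and property names below are
# placeholders for illustration -- verify them against the SAP documentation
# for the driver version you actually installed.

def sqlanywhere_jdbc_url(host: str, port: int, server: str, database: str) -> str:
    """Join connection properties into a semicolon-delimited JDBC URL."""
    props = {
        "Host": f"{host}:{port}",
        "ServerName": server,
        "DatabaseName": database,
    }
    return "jdbc:sqlanywhere:" + ";".join(f"{k}={v}" for k, v in props.items())

print(sqlanywhere_jdbc_url("db.example.com", 2638, "demo_server", "demo"))
```

Building the URL from named parts rather than hand-editing a string makes it easier to swap hosts between development and production environments.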
Checking network connectivity
To enable the Matillion Data Productivity Cloud to connect to your Sybase SQL Anywhere database, you must ensure that incoming connections are permitted based on your deployment configuration:
- Full SaaS agent configuration: Allow incoming connections from the IP addresses used by the Matillion service. The required IP addresses are listed at: https://docs.matillion.com/data-productivity-cloud/security/docs/allowing-ip-addresses/
- Hybrid SaaS deployment: Allow incoming connections from the network of your own virtual private cloud (VPC), where the Matillion agent is running. To help verify access from your environment, you can use utilities available at: https://exchange.matillion.com/data-productivity-cloud/pipeline/check-network-access/
Additionally, if your Sybase SQL Anywhere database is referenced using a domain name (DNS), you must ensure that the configured Full SaaS or Hybrid SaaS agent can successfully resolve the DNS address. Proper DNS resolution is required for Matillion to establish a connection with the database.
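Before debugging driver configuration, it is worth confirming the basics: that the agent's network can resolve the database hostname and open a TCP connection to the database port (SQL Anywhere listens on port 2638 by default). The sketch below shows two small Python checks you could run from the agent's environment; the hostnames are placeholders.

```python
# Minimal connectivity probes: DNS resolution and TCP reachability.
# Hostnames/ports are placeholders; SQL Anywhere's default port is 2638.
import socket

def dns_resolves(hostname: str) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demonstration against a locally opened socket (no external network needed):
listener = socket.socket()
listener.bind(("127.0.0.1", 0))      # OS assigns a free port
listener.listen(1)
print(port_reachable("127.0.0.1", listener.getsockname()[1]))  # True
listener.close()
```

If DNS resolution fails but the IP address is reachable directly, the problem lies in the agent's resolver configuration rather than in firewall rules.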
Querying Data from Sybase SQL Anywhere
This guide provides technical instructions for querying data from a Sybase SQL Anywhere database, including example SQL statements and best practices for large-scale data integration projects. Considerations for datatype conversion and incremental loading are also detailed.
1. Example SQL SELECT Statements
Below are common patterns for querying data in Sybase SQL Anywhere using SQL:
Selecting All Columns from a Table
SELECT * FROM customers;
Selecting Specific Columns
SELECT customer_id, customer_name, email FROM customers;
Adding Filtering Logic
SELECT order_id, order_date, total_amount FROM orders WHERE order_date >= '2024-01-01' AND total_amount > 100;
Aggregating Data
SELECT product_category, COUNT(*) AS num_sales, SUM(total_amount) AS revenue FROM sales GROUP BY product_category;
2. Datatype Conversion Considerations
When moving data between Sybase SQL Anywhere and platforms such as Databricks, it is crucial to recognize that the databases may use different datatypes. You may need to CAST or CONVERT datatypes during extraction or loading. Here is an example:
SELECT
order_id,
CAST(order_date AS CHAR(10)) AS order_date_char,
CAST(total_amount AS DOUBLE) AS total_amount_double
FROM orders;
Always review the datatype mapping documentation for both the source (Sybase SQL Anywhere) and the destination system (e.g., Databricks), as some implicit conversions may occur and could affect data quality.
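One practical way to manage this mapping is to keep it explicit in the pipeline code rather than relying on implicit conversions. The table below is a Python sketch of such a mapping; the specific pairings are illustrative assumptions and should be checked against the official type documentation for both SQL Anywhere and Databricks before use.

```python
# Illustrative source-to-target type mapping. ASSUMPTION: these pairings are
# examples only -- verify each one against the official SQL Anywhere and
# Databricks (Spark SQL) type documentation for your versions.
TYPE_MAP = {
    "INTEGER": "INT",
    "BIGINT": "BIGINT",
    "NUMERIC": "DECIMAL",
    "DOUBLE": "DOUBLE",
    "VARCHAR": "STRING",
    "CHAR": "STRING",
    "DATE": "DATE",
    "TIMESTAMP": "TIMESTAMP",
    "BIT": "BOOLEAN",
}

def to_databricks_type(source_type: str) -> str:
    """Look up the target type, defaulting to STRING for unmapped types."""
    return TYPE_MAP.get(source_type.upper(), "STRING")
```

Defaulting unmapped types to STRING is a conservative choice: it avoids silent precision loss at load time, and the value can still be cast downstream once the correct target type is confirmed.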
3. Initial Versus Incremental Data Loads
The best strategy for large data sets is to:
- Load all data once (the initial load).
- Subsequently, load only changes (incremental loads).
Both processes can make use of the same Database Query component (such as in Matillion or similar ETL tools). The difference is in how records are filtered at extraction time:
Initial Load Pattern
Query: (No filter clause, returns all records)
SELECT * FROM customers;
- Used once to extract the full dataset into the target platform.
- Fastest when used in conjunction with bulk/batched loading.
Incremental Load Pattern
Query: (Filter clause selects only new or changed records)
SELECT * FROM customers WHERE last_update > '2024-06-01 00:00:00';
- Typically uses a last_update or modified_date column.
- Ensures only new/updated records since the last load are transferred, optimizing performance and network usage.
- The exact filter condition will depend on your table schema and change tracking strategy.
For more on this strategy, see: Incremental Load Data Replication Strategy (Matillion)
When configuring the Database Query component, parameterize the filter clause to dynamically update the timestamp or marker used in each incremental batch.
Ensure all queries and filters match your source data structure and intended use case. Always test in a development environment before production use.
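The initial and incremental patterns above differ only in whether a watermark filter is applied, so both can be produced by one parameterized helper. The Python sketch below does exactly that; the table and column names are the article's examples, and persisting the watermark between runs is left to your orchestration tool.

```python
# Sketch of a query builder covering both load patterns from this section:
# no watermark -> full initial load; watermark given -> incremental load.
from datetime import datetime
from typing import Optional

def build_extract_query(table: str,
                        watermark_col: Optional[str] = None,
                        last_watermark: Optional[datetime] = None) -> str:
    """Build a full-load query, or an incremental one when a watermark is set."""
    query = f"SELECT * FROM {table}"
    if watermark_col and last_watermark:
        query += f" WHERE {watermark_col} > '{last_watermark:%Y-%m-%d %H:%M:%S}'"
    return query + ";"

print(build_extract_query("customers"))
# SELECT * FROM customers;
print(build_extract_query("customers", "last_update", datetime(2024, 6, 1)))
# SELECT * FROM customers WHERE last_update > '2024-06-01 00:00:00';
```

In a Matillion-style pipeline, the last successful watermark would typically be stored in a job variable or a control table and passed into this filter on each incremental run.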
Data Integration Architecture
Loading data into the Databricks environment prior to integration exemplifies a "divide and conquer" strategy that separates the process into distinct steps—first loading (Extract and Load), then transforming and integrating the data within the target system. This is a fundamental advantage of the ELT (Extract, Load, Transform) architecture, streamlining workflows and enhancing manageability. Effective data integration invariably involves transformation; leveraging data transformation pipelines in Databricks is considered a best practice here, as these pipelines allow for robust, automated, and repeatable processing of data to align with analytical or operational requirements. Another compelling benefit of the ELT approach lies in its use of the target Databricks database itself for executing all transformation and integration steps. This not only delivers fast, on-demand, and scalable data processing but also negates the need for maintaining or paying for additional external data processing infrastructure, resulting in significant cost and operational efficiencies.