Extracting data from Teradata to Amazon Redshift
Extracting data from Teradata is a common requirement for organizations migrating workloads or integrating with cloud-based data platforms. Amazon Redshift offers a scalable and high-performance destination, but a successful transfer entails several preparatory steps and careful planning. In this article, we will guide you through the key stages of the migration process. We will begin by outlining how to create an identity in Teradata, ensuring that you have the necessary privileges for data access and extraction. For Matillion users, we will explain how to check for the presence of—and, if necessary, acquire—the appropriate JDBC driver, which enables seamless connectivity between Matillion, Teradata, and Redshift. Next, we will discuss how to confirm and configure the required network connectivity between your on-premises or existing cloud environments and your Redshift target. Finally, we will address strategies for querying and transferring data, exploring both initial full loads and techniques for handling ongoing incremental updates. By the end of this article, you will have a clear understanding of the essential steps involved in moving data efficiently and securely from Teradata to Redshift.
What is Teradata?
Teradata is a massively parallel processing (MPP) relational database management system designed to handle large-scale data warehousing applications and real-time analytics. Built on a shared-nothing architecture, it distributes both data and workload evenly across multiple nodes, ensuring high scalability, performance, and fault tolerance. Teradata supports standard SQL and integrates with a variety of big data and business intelligence tools, enabling complex queries, advanced analytics, and robust data integration. Its flexible deployment options—on-premises, cloud, or hybrid—allow organizations to tailor solutions for data-driven decision-making and operational efficiency. With optimizations for partitioning, compression, and workload management, Teradata remains a leading choice for enterprises managing high-volume, mission-critical data workloads.
What is Amazon Redshift?
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud, designed for fast querying and analysis of large datasets. Utilizing columnar storage and massively parallel processing (MPP) architectures, Redshift enables high-performance analytical querying while optimizing storage efficiency and data compression. It is compatible with standard SQL and integrates seamlessly with a wide range of AWS services, including Amazon S3 for data loading, Amazon EMR for processing, and AWS Glue for ETL operations. Redshift's scalability allows users to easily adjust cluster size based on workload requirements, while its security features—such as VPC isolation, encryption at rest and in transit, and role-based access control—ensure robust data protection.
Why Move Data from Teradata into Amazon Redshift
The Case for Copying Data from Teradata to Redshift for Advanced Analytics
There are several important reasons why a data engineer or architect might choose to copy data from Teradata into Amazon Redshift. Teradata often stores rich and potentially valuable datasets accumulated over years of enterprise operations. By integrating this data with information from other sources within Redshift, organizations can unlock deeper insights and realize greater analytic value than with siloed data alone. Furthermore, performing data integration and analysis directly on Redshift, rather than on Teradata, prevents additional workloads from burdening the Teradata environment, thereby preserving its performance for critical operational tasks. This approach enables scalable analytics while maintaining the stability of existing systems.
Creating a User in Teradata
This guide provides instructions on how to create a user in a Teradata database. You will require appropriate privileges (typically as a Database Administrator or with CREATE USER privilege) to perform these actions.
Prerequisites
- Access to Teradata SQL Assistant, BTEQ, or a database client that can execute SQL against your Teradata instance.
- Sufficient space in the immediate owner database for the new user.
- Optional: A secured password policy (if relevant for your environment).
Basic Syntax
The basic syntax for creating a user in Teradata is:
CREATE USER user_name
FROM database_name
AS
PERM = perm_space
[ SPOOL = spool_space ]
[ TEMPORARY = temp_space ]
[ PASSWORD = 'password' ]
[ DEFAULT DATABASE = database_name ]
[ ACCOUNT = 'account_string' ]
[ PROFILE = profile_name ]
[ STARTUP = 'startup_expression' ]
;
- user_name: The name of the new user. Must be unique.
- database_name: The parent database or owner (could be DBC or another database).
- perm_space: Amount of permanent space (in bytes).
- spool_space: Optional. Spool (working) space (in bytes).
- temp_space: Optional. Temporary table space (in bytes).
- password: Initial password for the user.
- Other options as required (see the Teradata documentation for all possibilities).
Example: Create a User
This example creates a user named joe_user within the dbc database, allocating 2 GB of permanent space, 1 GB of spool space, and 500 MB of temporary space, and assigns an initial password.
CREATE USER joe_user
FROM dbc
AS
PERM = 2e9 /* 2 GB */
SPOOL = 1e9 /* 1 GB */
TEMPORARY = 500000000 /* 500 MB */
PASSWORD = 'InitPassw0rd!'
DEFAULT DATABASE = joe_user;
Note: In practice, always enforce strong password policies per your organization’s standards.
Additional Options
- Account: To assign a specific account string, add ACCOUNT = '...' to the clause.
- Profile: To assign a security or resource profile, use PROFILE = profile_name.
- Default Database: Set the user's default database with DEFAULT DATABASE = database_name.
Granting Roles and Privileges
After creation, you will typically grant necessary privileges or roles:
GRANT SELECT, INSERT, UPDATE ON database_name TO joe_user;

-- Or grant a predefined role:
GRANT my_role TO joe_user;
Replace resource names as appropriate.
Viewing Existing Users
To list users in Teradata, query the DBC.UsersV system view (or the older DBC.Users):
SELECT UserName, PermSpace, SpoolSpace, AccountName FROM DBC.UsersV;
Important Notes
- Only users or databases with available permanent space can create users, because a new user's PERM allocation is carved out of its immediate owner.
- User names should conform to your naming standards and should avoid reserved keywords.
- Be sure to follow your organization’s governance for user lifecycle management.
Installing the JDBC driver
Matillion Data Productivity Cloud enables connectivity to a wide variety of databases using JDBC drivers. Due to licensing or redistribution restrictions, certain JDBC drivers—such as the Teradata JDBC driver—are not included by default. If you wish to connect to a Teradata database, you must manually download and add the JDBC driver to your Matillion environment.
Downloading the Teradata JDBC Driver
Follow these steps to download the appropriate driver:
1. Visit the official download page. Access the official Teradata JDBC driver download page at: https://downloads.teradata.com/download/connectivity/jdbc-driver
2. Select the correct version:
   - Look specifically for a Type 4 JDBC driver, as it is 100% Java and best suited for integration with Matillion.
   - Ensure you download the driver version that matches your Teradata database version and your corporate Java requirements.
3. Download the driver. The driver package is typically a compressed (.zip) file. Download and extract the contents to a location accessible from your local machine.
Installing the Driver in Matillion Data Productivity Cloud
Once you have obtained and extracted the Teradata JDBC driver, you need to provide it to your Matillion Data Productivity Cloud environment. Follow the steps below:
1. Prepare the JAR files:
   - Locate the main JDBC .jar files extracted from the download package.
   - Retain any associated library files the main JDBC driver requires.
2. Upload the driver to Matillion:
   - Access your Matillion Data Productivity Cloud environment.
   - Navigate to the agent or workspace where you want to enable Teradata connectivity.
   - Follow the documented process for external driver upload, reference: Matillion Doc – Uploading External Drivers. This page provides a step-by-step guide for uploading .jar files, specifying driver class names, and confirming the driver is available for use.
Using the Teradata JDBC Driver
Once installation is complete, you can begin configuring connections and using the driver within Matillion’s database query components. Refer to the usage instructions for details on setting up connections, specifying connection strings, and using queries:
Be sure to validate your connection and check for any driver-specific configuration steps as outlined in the Teradata and Matillion documentation.
Checking network connectivity
Before connecting Matillion Data Productivity Cloud to your Teradata database, you must ensure that the database allows incoming network connections based on your deployment configuration:
- Full SaaS Agent Configuration: You must configure your Teradata database to allow incoming connections from the Matillion SaaS agents. The required source IP addresses are listed on the following page: Allowing IP addresses for Matillion Data Productivity Cloud.
- Hybrid SaaS Deployment: In this configuration, allow incoming connections from your own virtual private cloud (VPC). You can use the network connectivity utilities available on Matillion Exchange to confirm and diagnose connectivity from your environment.
Additionally, if your Teradata database is referenced by a DNS hostname rather than a direct IP address, you must ensure that the relevant Matillion agent (Full SaaS or Hybrid SaaS) is able to resolve the DNS address to establish a successful connection.
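A simple way to verify both DNS resolution and TCP reachability from a given environment is a short socket probe. This sketch assumes Teradata's default DBS port 1025 (adjust if your site differs) and checks reachability only, not authentication:

```python
import socket

def can_reach(host: str, port: int = 1025, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout.
    Covers DNS resolution and network reachability, not database login."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # includes DNS failures (socket.gaierror) and refusals
        return False
```

Run this from the same network location as your agent; a False result narrows the problem to DNS, firewall rules, or routing before you debug credentials.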
Querying Data from a Teradata Database
This guide provides instructions and best practices for querying data from a Teradata database, specifically for use cases involving extract and replication to environments such as Amazon Redshift. Instructions cover SQL SELECT query examples, handling datatype conversions, and a recommended pattern for initial and incremental data loads. Additional incremental load strategy information is available at Matillion Exchange.
1. Sample Teradata SQL Queries
Basic SELECT Query
SELECT
customer_id,
customer_name,
city,
created_date
FROM
sales.customers;
SELECT with Filtering
SELECT
order_id,
order_date,
amount
FROM
sales.orders
WHERE
order_date >= DATE '2024-01-01';
SELECT with Join and Aggregation
SELECT
c.customer_id,
c.customer_name,
COUNT(o.order_id) AS total_orders,
SUM(o.amount) AS total_spent
FROM
sales.customers c
INNER JOIN
sales.orders o
ON c.customer_id = o.customer_id
WHERE
o.order_date BETWEEN DATE '2024-01-01' AND DATE '2024-06-30'
GROUP BY
c.customer_id, c.customer_name;
2. Datatype Conversion: Teradata to Redshift
When transferring data between Teradata and Amazon Redshift, be aware of potential datatype differences. For example:
- BYTEINT (Teradata) → SMALLINT or INTEGER (Redshift)
- NUMBER, DECIMAL(p,s) (Teradata) → NUMERIC(p,s) (Redshift)
- VARCHAR(n) (Teradata/Redshift) – usually compatible, but verify maximum sizes.
- TIMESTAMP(n) (Teradata) → TIMESTAMP (Redshift)
- DATE (Teradata/Redshift) – typically compatible.
Always review and test your schema mappings to ensure smooth data movement.
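The mappings above can be captured in a small translation helper when generating target DDL. This is a sketch covering only the types listed in this section; extend the table for whatever types actually appear in your schema:

```python
import re

# Common Teradata -> Redshift type equivalents (illustrative, not exhaustive).
_SIMPLE_TYPES = {
    "BYTEINT": "SMALLINT",  # Redshift has no 1-byte integer type
    "DATE": "DATE",
    "TIMESTAMP": "TIMESTAMP",
}

def redshift_type(teradata_type: str) -> str:
    """Translate a Teradata column type string to a Redshift equivalent."""
    t = teradata_type.strip().upper()
    # DECIMAL(p,s) / NUMBER(p,s) -> NUMERIC(p,s)
    m = re.fullmatch(r"(?:DECIMAL|NUMBER)\((\d+),\s*(\d+)\)", t)
    if m:
        return f"NUMERIC({m.group(1)},{m.group(2)})"
    # TIMESTAMP(n) -> TIMESTAMP (fractional-second precision is dropped in the name)
    if re.fullmatch(r"TIMESTAMP\(\d+\)", t):
        return "TIMESTAMP"
    # VARCHAR(n) usually carries over; verify n against Redshift's size limits
    if re.fullmatch(r"VARCHAR\(\d+\)", t):
        return t
    return _SIMPLE_TYPES.get(t, t)
```

For example, `redshift_type("DECIMAL(10,2)")` yields `NUMERIC(10,2)`.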
3. Best Load Patterns: Initial and Incremental Loads
Initial Load
The initial load involves extracting the entire dataset from the Teradata source without any filtering. This is generally a one-time process to populate the target (e.g., Redshift) with a full snapshot of the data.
Example (No Filter)
SELECT
*
FROM
sales.orders;
Configuration Notes:
- Use the Database Query component.
- Do not specify any WHERE clause.
Incremental Load
After the initial load, perform regular incremental loads to transfer only new or changed records. This conserves resources and minimizes processing time.
Example (With Filter)
Add a filter condition, most commonly on a column such as last_updated or created_date:
SELECT
*
FROM
sales.orders
WHERE
last_updated > '2024-06-20 00:00:00';
Replace the filter value with the timestamp of the last successful load (using a variable or parameter, if supported).
Configuration Notes:
- Continue to use the same Database Query component.
- Parameterize the filter value when possible for automation.
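The watermark pattern above can be sketched as a small query builder. The table and column names are placeholders; in Matillion the equivalent would be a pipeline variable substituted into the Database Query component rather than Python string building:

```python
def incremental_query(table: str, watermark_col: str, last_loaded: str) -> str:
    """Build the incremental extract query from the stored watermark
    (the timestamp of the last successful load)."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE {watermark_col} > TIMESTAMP '{last_loaded}'"
    )

# Hypothetical usage: the watermark would come from job state, not a literal.
sql = incremental_query("sales.orders", "last_updated", "2024-06-20 00:00:00")
print(sql)
```

After each successful run, persist the maximum `last_updated` value seen so the next run picks up exactly where this one left off.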
For more about incremental load patterns, see: Matillion - Incremental Load Data Replication Strategy
Data Integration Architecture
Loading data in advance of integration, as practiced in Extract, Load, Transform (ELT) architectures, offers a “divide and conquer” approach by splitting the process into two sequential steps: first, raw data is loaded into the target Redshift database, and then it’s transformed and integrated as needed. This separation not only streamlines operations but also enables businesses to handle workflows more flexibly. Since data integration inherently requires transformation—such as cleaning, harmonizing, and joining data from different sources—the most effective way to manage these tasks is through robust data transformation pipelines. These pipelines can be orchestrated within Redshift itself, leveraging its scalable resources. A further advantage of the ELT model is that all complex processing and integration occur directly inside Redshift, eliminating the need for separate, costly data processing infrastructure. As a result, transformations are executed quickly and can be initiated on-demand, making the overall integration process both efficient and highly scalable.