
Using GenAI to support migrations from Redshift to Databricks

Problem statement

With Databricks' rapid growth, we are seeing an uptick in migration projects from other platforms, including Netezza and Redshift. A platform such as Databricks offers enormous benefits for companies looking for a holistic approach to data engineering, data science, and data warehousing. 

However, migrations can be an expensive and labor-intensive process: years of technical development on the source platform need to be ported over, and differences in design patterns, features/functionality, and overall compatibility can cause serious headaches for those involved. Fortunately, we now have generative AI to help us through these challenges.

In this blog, we will walk through a pattern you can adopt with Matillion if you want to migrate from another platform to Databricks. While this example showcases Redshift to Databricks, you will see how this approach can apply across other platforms with some slight changes.

In short, we are looking to capture the source DDL for every table in one or more Redshift schemas, load those scripts into a Delta table in our Databricks environment, and then run them through an LLM to convert the original DDL syntax into Databricks SparkSQL-compliant code.

Once the code is converted, we will systematically run the DDLs via Matillion and capture their deployment status (pass/fail) so we can identify which DDLs need some human interaction.

In our example, we saw a conversion rate north of 75%, which is significant if you consider this on a larger scale. By using this GenAI approach, you can potentially reduce the manual portion of your migration workload to roughly 25% of the originally anticipated effort. Translated into work hours and employee costs, that could mean weeks of work and thousands of dollars saved, while accelerating the return on your investment in Databricks and Matillion.

Capturing the source DDL

Pulling a list of tables from Redshift

For this demonstration, we are looking to pull the list of table names from a schema called ‘public’ in our Redshift environment. This is done using a Database Query component in Matillion and executing the following query:

SELECT *
FROM information_schema.tables
WHERE table_type = 'BASE TABLE'
  AND table_schema = 'public'

Please note that to pull from multiple schemas, change the table_schema filter from ‘=’ to an IN clause. The component will output this list of tables into a new Delta table called redshift_stg_info_schema_tables.
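For example, a multi-schema variant of the query above might look like the following sketch (the schema names other than public are placeholders for your own):

```sql
-- List base tables across several schemas in one pass
SELECT *
FROM information_schema.tables
WHERE table_type = 'BASE TABLE'
  AND table_schema IN ('public', 'sales', 'finance');
```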

Pulling DDLs from Redshift

Some platforms, like Oracle and Snowflake, have easy-to-use functions for retrieving the DDL script of a database object, but unfortunately this is not true for all platforms. Redshift is one such case: you have to stitch the DDL together yourself rather than call a convenient function that returns it in a single response.

Fortunately, AWS maintains the public amazon-redshift-utils repo, which provides some handy scripts for Redshift users. For this scenario, we used v_generate_tbl_ddl.sql to query the DDLs for our list of tables. A table iterator reads the table names from our newly loaded redshift_stg_info_schema_tables table and dynamically passes each value to the DDL query via a loop.
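Once the v_generate_tbl_ddl view has been created in Redshift, each iteration of the loop can query it for a single table. A minimal sketch, assuming the view lives in the admin schema (the script's default) and using an illustrative ${jv_table_name} iterator variable:

```sql
-- Pull the DDL fragments for one table, one row per line of DDL.
-- seq preserves the original line order so the fragments can be
-- reassembled downstream.
SELECT table_id, schemaname, tablename, seq, ddl
FROM admin.v_generate_tbl_ddl
WHERE schemaname = 'public'
  AND tablename = '${jv_table_name}'
ORDER BY seq;
```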

We now have the list of DDL scripts for each table in our delta table in Databricks, albeit broken up across multiple rows. This is a simple problem to solve using native Databricks functions like ARRAY_JOIN() and COLLECT_LIST().

Collapsing multi-line DDL

With our DDLs being delivered from Redshift across multiple lines, we need to run a quick transformation to group these records into a single row per table. 

Here is the SparkSQL script that achieves that in Matillion:

SELECT
  table_id,
  schemaname,
  tablename,
  -- Sort each table's DDL fragments by seq before joining;
  -- COLLECT_LIST alone does not guarantee row order after a GROUP BY.
  ARRAY_JOIN(
    TRANSFORM(
      ARRAY_SORT(COLLECT_LIST(STRUCT(seq, ddl))),
      x -> x.ddl
    ),
    ' '
  ) AS full_ddl
FROM
  `se-unity-catalog`.`demo_salesengineering`.redshift_stg_table_ddl_scripts
GROUP BY
  table_id, schemaname, tablename
ORDER BY tablename, table_id

Using GenAI to convert DDL 

Now that we have imported the DDLs and grouped them into a single row, the last transformation step is to run them through an LLM using the native ai_query() function from Databricks. 

Matillion provides a native component in our transformation pipelines that allows users to integrate these functions easily. By choosing a model (in this case, DBRX), you can provide it with a prompt and pass it any relevant columns from our source table (full_ddl is the column into which we collapsed the DDLs in our previous steps). 

Here is the sample prompt provided in our User Prompt field:

You are an expert in converting DDL sql statements from Redshift into Databricks sparkSQL compliant DDL. **VERY IMPORTANT**
1. You will only reply with the corresponding sparkSQL DDL script.
2. DO NOT provide any other additional context in your response.
3. Do not add backslash \ characters in table or column names.
4. Ensure create or replace is used for the DDL statement.
5. Ensure that backtick characters ` are placed around the catalog and schema names in the DDL statement.
6. The catalog value should be 'se-unity-catalog' and the schema value should be 'redshift_dbx_migration_dbrx'.
7. Do not include ```sql prefix syntax.
8. Persist all scale and precision values in the generated DDL where possible.
9. DO NOT MAKE ANY PYTHON REFERENCES in the converted ddl.
10. Replace any violating characters ' ,;{}()\n\t=' in the column names of your schema with an underscore _ character.
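Under the hood, the component boils down to a SparkSQL call to ai_query(). As a rough sketch (the endpoint name databricks-dbrx-instruct, the source table name, and the exact prompt assembly are assumptions; Matillion generates the actual SQL from the component's settings):

```sql
-- Send each collapsed DDL through the model serving endpoint,
-- prepending the conversion prompt to the DDL text.
SELECT
  tablename,
  AI_QUERY(
    'databricks-dbrx-instruct',
    CONCAT('You are an expert in converting DDL sql statements from Redshift into Databricks sparkSQL compliant DDL. ... ', full_ddl)
  ) AS converted_ddl
FROM `se-unity-catalog`.`demo_salesengineering`.redshift_stg_table_ddl_collapsed;
```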

Here is a side-by-side comparison of the original Redshift DDL and the output from the GenAI prompt.

Redshift DDL:

-- DROP TABLE public.accounts;
CREATE TABLE IF NOT EXISTS public.accounts (
    id VARCHAR(18) ENCODE RAW,
    isdeleted BOOLEAN ENCODE RAW,
    masterrecordid VARCHAR(18) ENCODE RAW,
    name VARCHAR(255) ENCODE RAW,
    "type" VARCHAR(255) ENCODE RAW,
    parentid VARCHAR(18) ENCODE RAW,
    billingstreet VARCHAR(255) ENCODE RAW,
    billingcity VARCHAR(40) ENCODE RAW,
    billingstate VARCHAR(80) ENCODE RAW,
    billingpostalcode VARCHAR(20) ENCODE RAW,
    billingcountry VARCHAR(80) ENCODE RAW,
    billinglatitude DOUBLE PRECISION ENCODE RAW,
    billinglongitude DOUBLE PRECISION ENCODE RAW,
    billinggeocodeaccuracy VARCHAR(255) ENCODE RAW,
    shippingstreet VARCHAR(255) ENCODE RAW,
    shippingcity VARCHAR(40) ENCODE RAW,
    shippingstate VARCHAR(80) ENCODE RAW,
    shippingpostalcode VARCHAR(20) ENCODE RAW,
    shippingcountry VARCHAR(80) ENCODE RAW,
    shippinglatitude DOUBLE PRECISION ENCODE RAW,
    shippinglongitude DOUBLE PRECISION ENCODE RAW,
    shippinggeocodeaccuracy VARCHAR(255) ENCODE RAW,
    phone VARCHAR(40) ENCODE RAW,
    fax VARCHAR(40) ENCODE RAW,
    accountnumber VARCHAR(40) ENCODE RAW,
    website VARCHAR(255) ENCODE RAW,
    photourl VARCHAR(255) ENCODE RAW,
    sic VARCHAR(20) ENCODE RAW,
    industry VARCHAR(255) ENCODE RAW,
    annualrevenue NUMERIC(18,0) ENCODE RAW,
    numberofemployees BIGINT ENCODE RAW,
    ownership VARCHAR(255) ENCODE RAW,
    tickersymbol VARCHAR(20) ENCODE RAW,
    description VARCHAR(32000) ENCODE RAW,
    rating VARCHAR(255) ENCODE RAW,
    site VARCHAR(80) ENCODE RAW,
    ownerid VARCHAR(18) ENCODE RAW,
    createddate TIMESTAMP WITHOUT TIME ZONE ENCODE RAW,
    createdbyid VARCHAR(18) ENCODE RAW,
    lastmodifieddate TIMESTAMP WITHOUT TIME ZONE ENCODE RAW,
    lastmodifiedbyid VARCHAR(18) ENCODE RAW,
    systemmodstamp TIMESTAMP WITHOUT TIME ZONE ENCODE RAW,
    lastactivitydate TIMESTAMP WITHOUT TIME ZONE ENCODE RAW,
    lastvieweddate TIMESTAMP WITHOUT TIME ZONE ENCODE RAW,
    lastreferenceddate TIMESTAMP WITHOUT TIME ZONE ENCODE RAW,
    jigsaw VARCHAR(20) ENCODE RAW,
    jigsawcompanyid VARCHAR(20) ENCODE RAW,
    cleanstatus VARCHAR(255) ENCODE RAW,
    accountsource VARCHAR(255) ENCODE RAW,
    dunsnumber VARCHAR(9) ENCODE RAW,
    tradestyle VARCHAR(255) ENCODE RAW,
    naicscode VARCHAR(8) ENCODE RAW,
    naicsdesc VARCHAR(120) ENCODE RAW,
    yearstarted VARCHAR(4) ENCODE RAW,
    sicdesc VARCHAR(80) ENCODE RAW,
    dandbcompanyid VARCHAR(18) ENCODE RAW,
    operatinghoursid VARCHAR(18) ENCODE RAW,
    customerpriority__c VARCHAR(255) ENCODE RAW,
    sla__c VARCHAR(255) ENCODE RAW,
    active__c VARCHAR(255) ENCODE RAW,
    numberoflocations__c DOUBLE PRECISION ENCODE RAW,
    upsellopportunity__c VARCHAR(255) ENCODE RAW,
    slaserialnumber__c VARCHAR(10) ENCODE RAW,
    slaexpirationdate__c TIMESTAMP WITHOUT TIME ZONE ENCODE RAW,
    revenue_rank__c DOUBLE PRECISION ENCODE RAW,
    customer_count__c DOUBLE PRECISION ENCODE RAW,
    order_count__c DOUBLE PRECISION ENCODE RAW
) DISTSTYLE AUTO;

 

Databricks DDL (via GenAI):

CREATE OR REPLACE TABLE `se-unity-catalog`.`redshift_dbx_migration_dbrx`.accounts (
  `id` STRING,
  `isdeleted` BOOLEAN,
  `masterrecordid` STRING,
  `name` STRING,
  `type` STRING,
  `parentid` STRING,
  `billingstreet` STRING,
  `billingcity` STRING,
  `billingstate` STRING,
  `billingpostalcode` STRING,
  `billingcountry` STRING,
  `billinglatitude` DOUBLE,
  `billinglongitude` DOUBLE,
  `billinggeocodeaccuracy` STRING,
  `shippingstreet` STRING,
  `shippingcity` STRING,
  `shippingstate` STRING,
  `shippingpostalcode` STRING,
  `shippingcountry` STRING,
  `shippinglatitude` DOUBLE,
  `shippinglongitude` DOUBLE,
  `shippinggeocodeaccuracy` STRING,
  `phone` STRING,
  `fax` STRING,
  `accountnumber` STRING,
  `website` STRING,
  `photourl` STRING,
  `sic` STRING,
  `industry` STRING,
  `annualrevenue` DECIMAL(18,0),
  `numberofemployees` LONG,
  `ownership` STRING,
  `tickersymbol` STRING,
  `description` STRING,
  `rating` STRING,
  `site` STRING,
  `ownerid` STRING,
  `createddate` TIMESTAMP,
  `createdbyid` STRING,
  `lastmodifieddate` TIMESTAMP,
  `lastmodifiedbyid` STRING,
  `systemmodstamp` TIMESTAMP,
  `lastactivitydate` TIMESTAMP,
  `lastvieweddate` TIMESTAMP,
  `lastreferenceddate` TIMESTAMP,
  `jigsaw` STRING,
  `jigsawcompanyid` STRING,
  `cleanstatus` STRING,
  `accountsource` STRING,
  `dunsnumber` STRING,
  `tradestyle` STRING,
  `naicscode` STRING,
  `naicsdesc` STRING,
  `yearstarted` STRING,
  `sicdesc` STRING,
  `dandbcompanyid` STRING,
  `operatinghoursid` STRING,
  `customer_priority__c` STRING,
  `sla__c` STRING,
  `active__c` STRING,
  `numberoflocations__c` DOUBLE,
  `upsellopportunity__c` STRING,
  `slaserialnumber__c` STRING,
  `slaexpirationdate__c` TIMESTAMP,
  `revenue_rank__c` DOUBLE,
  `customer_count__c` DOUBLE,
  `order_count__c` DOUBLE
);

Executing DDL in Databricks

Now that we have the converted DDLs for Databricks stored in our Delta table, the final step is to run them and capture whether each one executed successfully or failed.

Just like we looped over the table names from our Redshift schema to get the individual DDLs over to Databricks, we will loop over all the converted DDLs from our delta table and pass those into a sub job that executes the DDL and updates the execution status.

In our loop, we store the DDL into a variable called ${plv_dbx_ddl} and execute it via a SQL script component in our sub job. 

Using our flow control options (success = green line, fail = red line), we call an update on the script table, passing a Success or Fail value through another variable called plv_ddl_status to our transformation job.

This transformation job ultimately sets our dbx_ddl_execution_status field to Success or Fail so we can see what the results of each DDL execution were.
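The status update itself boils down to a simple SQL statement. A sketch, assuming the converted DDLs live in a table called redshift_stg_table_ddl_converted keyed by tablename (both names, and the ${plv_table_name} variable, are illustrative):

```sql
-- Record whether the converted DDL deployed cleanly, so failed
-- tables can be routed to a human for review.
UPDATE `se-unity-catalog`.`demo_salesengineering`.redshift_stg_table_ddl_converted
SET dbx_ddl_execution_status = '${plv_ddl_status}'
WHERE tablename = '${plv_table_name}';
```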

Summary

After running this process over our list of tables, we were able to convert 16 out of 20 tables, an 80% success rate. As with any pipeline, especially a GenAI-integrated one, there is always room for improvement. But for a job we were able to build in a few hours, it is a great demonstration of the art of the possible when it comes to migrations and the use of GenAI in real-world applications.

If you’re a visual learner or prefer to watch a video walkthrough of this blog, please watch the video below!

Konrad Bafia

Manager, Sales Engineering
