Scale your data team’s output by up to 100x. We'd love to prove it.

Challenge Maia at Snowflake Summit

Surrogate Keys in SQL - a practical guide

In database design, keys are a crucial element that helps ensure data integrity by enforcing uniqueness. In data warehousing, you'll almost certainly have to deal with data from relational databases underpinning source systems. These are usually modeled following guidelines known as the Third Normal Form (3NF). One of the fundamental rules of 3NF is that every table must have a column (or columns) that prevent duplicate records, which is known as a primary key.

Primary keys are also critical for another aspect of data quality: establishing relationships between tables. When a primary key appears in another, related table, it's known as a foreign key.

Meanwhile, back in the real world 😀, you'll almost certainly have to deal with data sources that don't have a primary key, and where foreign keys don't always link correctly. This is especially common with semi-structured and unstructured data, and when integrating data from multiple places.

Furthermore, primary keys can be complex and unwieldy. A primary key can span many columns, with diverse datatypes, sometimes involving long strings. They are difficult to manage in a data warehouse environment and can cause performance issues.

The answer to many of these problems is to use surrogate keys.

Primary Key vs Surrogate Key

primary key is a combination of columns that uniquely identifies records. If a table has a primary key, no record can have the same primary key as any other record in that table.

surrogate key is a single column that has been artificially added - typically during ETL - and which also contains values that are unique across the table. Because surrogate keys are made up, of technical values, they are sometimes also known as synthetic keys.

A really simple way to tell the difference between a Primary Key and a Surrogate Key is to check if that information exists in the original source system.

  • If it does, it's most likely a primary key
  • If it does not exist in the source it's probably a surrogate key

Why use Surrogate Keys?

In data warehousing, surrogate keys are widely used in place of the natural primary key. Surrogate keys are often numbers or hash codes. They are deliberately devoid of meaning, but they provide a consistent identifier for database operations. This is one of the foundations of data quality, especially important in scenarios involving complex data integrations or migrations.

A surrogate key is beneficial when the natural primary key is impractical due to its size or complexity. Cloud data platforms handle filters, joins, and aggregations more quickly using small, numeric surrogate keys. This can often be a first, simple step to addressing query performance issues.

In data architecture design terms, surrogate keys offer several additional important and more subtle benefits:

  • Operational systems only address their own immediate needs. That means a primary key may not remain unique in a data warehouse that is integrating data from multiple source systems. For example, two different CRM systems may identify different customers with the same primary key, or (more difficult to resolve) identify the same customer with different primary keys.
  • You may receive data - for example from a file - which does have a primary key, but which is only unique within that file. Tomorrow's file will contain more records, completely different, but sharing the same "primary key". The real unique identifier needs more information (a timestamp or a file name), and so requires a surrogate key for uniqueness.
  • It's surprisingly common for operational systems to have reusable or mutable primary keys. That is, keys that change over time, or that can be re-used over time to mean different things. When that happens, the original meanings can only be retained using a surrogate key.
  • Operational systems often face challenges in preserving an accurate history of changes over time. After all this isn't their primary function: it's one of the tasks for a data warehouse. When a product's attribute is updated in the source system, the change simply appears as if it has always existed that way, which leads to a problem known as time variance. Data warehousing tackles this by maintaining two separate records: each has a unique surrogate key to distinguish them, but they share the same primary key. This approach, commonly implemented using a dimension table, is referred to as a type 2 change.
  • Surrogate keys are an excellent way to anonymize data that would otherwise contain personally identifiable information (known as PII). A good unique identifier for a person is their social security number. However storing it and using it as such would involve regulatory and legal considerations. Substituting the social security number for a meaningless surrogate key preserves the unique identity of the record while removing the PII.

Many data warehouse designs incorporate both:

  • A surrogate key - which becomes the new primary key in the data warehouse
  • The original primary key - is held alongside the other ordinary columns, and is very useful when it comes to lineage and traceability

Surrogate keys are clearly a vital part of every ETL process. How to create them?

How to create Surrogate Keys in ETL

SEQUENCE objects

Many - although by no means all - cloud data platforms support the concept of a SEQUENCE object for generating auto-incrementing, unique numeric values. These are independent objects that are defined and managed separately from tables

To create a sequence in SQL, you typically use a CREATE SEQUENCE statement, specifying parameters such as the initial value, increment step, maximum value, and whether the sequence should cycle when the maximum value is reached.

Once a sequence is created, use the NEXTVAL SQL function to generate the next value in the sequence when inserting new records. For example:

CREATE SEQUENCE customer_sk_seq;

INSERT INTO customers (surrogate_key, name)
VALUES (customer_sk_seq.NEXTVAL, 'John Doe');
Benefits and drawbacks of SEQUENCE objects

Sequence objects guarantee to create a unique identifier for each record, so they are perfect for surrogate key generation. They are not vulnerable to concurrency issues that can be associated with other methods of generating surrogate keys.

There may be a performance overhead associated with using a SEQUENCE object, and there is no guarantee that the resulting values will be continuous.

Not all cloud data platforms have SEQUENCE objects.

IDENTITY, UNIQUEIDENTIFIER and AUTOINCREMENT

SQL-based databases often provide a declarative way to automatically generate surrogate keys by using special column types.

It is done when the table is created. For example, Snowflake uses AUTOINCREMENT like this:

CREATE OR REPLACE TABLE STG_TRANSACTION (
    surrogate_key NUMBER(8,0) AUTOINCREMENT NOT NULL,
    ...

The Databricks equivalent is:

CREATE OR REPLACE TABLE `STG_TRANSACTION` (
    `surrogate_key` BIGINT GENERATED ALWAYS AS IDENTITY,
    ...

In Amazon Redshift:

CREATE  TABLE IF NOT EXISTS STG_TRANSACTION
    surrogate_key INT IDENTITY NOT NULL,
    ...

There are similar options in Postgres and Microsoft SQL Server (the IDENTITY property), MySQL and SQLite (the AUTOINCREMENT attribute) and in many other databases.

Benefits and drawbacks of IDENTITY columns

They are simple to implement but require special DDL during table creation, and special DML to ignore the column whenever rows are inserted.

It can be tricky to alter the definition once the table has been created, especially when it contains a large amount of data.

IDENTITY columns are not offered in exactly the same way on all cloud data platforms.

GUID as a Surrogate Key

Generating random GUIDs (Globally Unique Identifiers) is a popular method for creating surrogate key unique identifiers. They are usually expressed as long strings, and accessed through functions built into the database.

To access the functionality in Snowflake, it's the UUID_STRING() function, which you would use when inserting new records into a table. The Databricks equivalent is UUID().

On some databases, it's possible to combine this with the previously listed technique and have an IDENTITY automatically generated as a GUID.

Benefits and drawbacks of GUIDs

GUIDs are genuinely meaningless, so there's no possibility of having users accidentally assigning meaning to them. Generation is fast and works in parallel. GUID surrogate keys are especially useful with the most granular, transactional data.

GUIDs make it easy to integrate multiple systems without any risk of clashing primary keys.

Typical GUIDs use 16 bytes of storage, which is larger than typical integer keys. So there can be a marginal performance loss associated with the extra size.

Lastly, it's important to note that GUIDs are really not intended to be human-readable. That makes them unwieldy when debugging and during manual data inspection.

High Water Mark + Row Number

This is a widely used technique to create integer surrogate keys, which is applied immediately before inserting new records into a data warehouse table. The method is:

  • Find the highest existing surrogate key (the High Water Mark)
  • Number all the new records incrementally starting from 1 (the Row Number)
  • Every new surrogate key is the High Water Mark + the Row Number

Calculating the Row Number is done with a "windowing" or "analytic" ROW_NUMBER() SQL function.

Benefits and drawbacks of High Water Mark surrogate keys

This is an excellent way to generate unique integer surrogate keys. It eliminates the possibility of gaps, even in the event of a failure and subsequent re-run.

On a traditional database finding the high water mark is a potentially expensive operation. Similarly, generating the row numbers requires running a windowing function on the (maybe large) set of new records. However, modern, column-major database platforms are specifically designed to perform these kinds of calculations, which makes this method fast and cost-effective.

The High Water Mark surrogate key approach is best suited for batch mode inserts rather than streaming.

Clock Based Surrogate Keys

Being clock-based means surrogate keys are generated based on the current timestamp of the database. The current date and time (maybe down to microseconds) produces a sequence of unique values that ensure every record is uniquely identifiable.

All databases provide access to the current date and time through a CURRENT_TIMESTAMP() or similar function. This is converted to a numeric representation - perhaps by literally using all the numbers in the timestamp - or perhaps using a more sophisticated method such as epoch time.

Benefits and Drawbacks of Clock Based Surrogate Keys

In environments with high transaction volumes, or in batch mode (when all new records would be given an identical timestamp) it is necessary to add extra information to guarantee uniqueness. This is often done with a ROW_NUMBER() in the same way as for High Water Mark surrogate keys.

In relatively low volume, streaming environments, a millisecond-based surrogate key can be very simple and effective.

When generating the system time, be careful to avoid using timezones that observe daylight savings (DST). These can cause duplication while the DST changeover is happening.

Concatenated natural key

This method takes all the columns that uniquely identify a record, converts them to strings, and concatenates them to create one (long) surrogate key held in its own column.

It is usually necessary to use a separator character between the components of the surrogate key, ideally one that can not occur naturally in the data. Without this extra protection, the two-part key 'A', 'BC' would collide with 'AB', 'C', for example.

The SQL to generate a concatenated natural key typically looks like this:

SELECT keypart1 || '#' || keypart2 AS new_sk
FROM mytable;
Benefits and drawbacks of a concatenated natural key

This technique is simple, and offers great traceability: it's easy to convert back and forth to the real primary key.

However, it does not offer any of the other advantages of surrogate keys. Significant disadvantages include:

  • No support for a time variance
  • The long surrogate keys can be difficult to manage
  • Vulnerable to any change to the natural key fields (schema drift)
  • The concatenated key might unintentionally expose sensitive business logic or personally identifiable information.

Remember to be careful with multipart keys where some of the columns are nullable:

SELECT 'x' || '#' || NULL;   -- Result is NULL not x#

Bearing in mind all these disadvantages, this technique is often useful in the early stages of a project - during data exploration - but replaced by a more robust method before going live.

Hash of concatenated natural key

This technique builds upon the previous method of concatenating the natural key, adding a final step of applying a hash function to the long string.

The SQL to generate a surrogate key this way typically looks something like this:

SELECT MD5(keypart1 || '#' || keypart2) AS new_sk
FROM mytable;

Adding a hash function removes many of the disadvantages of concatenated natural keys, making it an excellent way to generate consistent, unique identifiers for business entities.

This is how surrogate keys are created for Hub, Link, and Satellite tables in Data Vault 2.0. Time variance is supported in Satellite tables by having a two-part primary key: the hash plus a timestamp.

Benefits and drawbacks of hashing natural keys

Hash values, unlike concatenated natural keys, are relatively small and have a predictable length, which makes them efficient for storage and processing. They are quick to create, and the method is deterministic, meaning that the same source record always produces an identical surrogate key. Being a one-way operation, hashing provides the benefit of removing personally identifiable information (PII), even when elements like a person's name are part of the natural key.

Despite providing anonymity by concealing the original natural key, joinability is preserved. That is, a hashed natural key will successfully join with and group with other records that share the identical natural key, even though the original natural key remains undisclosed.

For extra security, consider hashing the natural key with a salt, or encrypting it.

In a medallion data architecture, creating deterministic surrogate keys in the bronze layer makes it simple to merge records into the silver layer using the surrogate key.

Meaningful surrogate keys?

Normally surrogate keys are deliberately meaningless: the value of the key has no relationship to the record it identifies.

But there are two circumstances where you might consider using a meaningful surrogate key:

  • In a date or time dimension, where - for example - the value 2345 really does mean 11:45 pm
  • In a junk dimension, where 1 means "yes", "true" or "positive" and 0 means "no"

When the key is meaningful, there's no need to perform a lookup against a reference table (e.g. a dimension) to calculate the value, which makes the ETL faster. Also it's simple to read the date or time directly from the data, again without reference to a lookup table.

Further reading
Ian Funnell
Ian Funnell

Data Alchemist

Ian Funnell, Data Alchemist at Matillion, curates The Data Geek weekly newsletter and manages the Matillion Exchange.
Follow Ian on LinkedIn: https://www.linkedin.com/in/ianfunnell

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.