- Blog
- 10.28.2024
- Product, Data Fundamentals
Medallion Data Architecture for Life Science Data - Databricks Quickstart with Matillion

The Cancer Genome Atlas (TCGA), an initiative led by the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI), has carefully profiled tissue samples from 11,000 patients, encompassing 33 distinct cancer types. Through the Genomic Data Commons, this comprehensive dataset - which includes genomic, clinical, and biospecimen data - undergoes rigorous processing before distribution to the research community.
At the core of TCGA's data model are anonymized Cases, each linked to downloadable genome data files derived from tissue samples. Every file within the dataset contains expression quantification data for over 60,000 individual genes, which offers the possibility of gaining insight into gene function.
This combination of gene expression data with clinical diagnostics and exposure risk factors provides a powerful tool for any cancer genomics research group able to overcome the challenges of data extraction and organization. This article describes the Quickstart implementation on Databricks, using data pipelines managed by the Matillion Data Productivity Cloud.
Core challenges working with TCGA data
Getting into position to harness the immense potential of the TCGA data presents several main challenges:
- Accessing the API: Retrieving data from the Genomic Data Commons Data Portal requires performing sophisticated APIs interactions. Specifically, processing 'Search and Retrieval' requests, and paging through the results.
- Large data volumes: There are over 10,000 gene expression measurements, each containing data points on over 60,000 individual genes. Successfully managing, and accurately processing these files without overwhelming the GDC API is crucial.
- Semi-structured data manipulation: Transforming the semi-structured downloaded data into relational rows for statistical analysis requires the proficient use of Databricks functions for handling complex, multi-level JSON arrays.
- Data architecture implementation: Organization of the data into bronze (raw), silver (integrated), and gold (optimized for consumption) layers to streamline data pipelines and enhance usability. This is known as a Medallion data architecture.
Designing a Medallion Architecture
The TCGA data model is based on anonymized Cases, with associated genome data from tissue samples available for download in Files.
TCGA Data Model
One of the most efficient methods to prepare and structure this data for analysis is with a Medallion data architecture. Here's an outline of the data lineage:
TCGA Medallion Data Architecture
To implement these data flows, you'll need an ETL platform that's capable of negotiating with the Genomic Data Commons Data Portal API, and then transforming and integrating the resulting data using a cloud data platform. I'll demonstrate this using Databricks and the Matillion Data Productivity Cloud.
Medallion data architecture pipelines
Every Medallion data architecture begins with extracting the raw data and storing it in "bronze" layer tables. After that, transforming and integrating the raw data into the "silver" layer. To make the integrated data easy to consume, the last layer of data transformation restructures the tables again to create the "gold" layer.
Raw data extraction
The Matillion Exchange has a connector for the Genomic Data Commons Data Portal that includes the "cases" endpoint. Begin by downloading and installing it into your custom connector library. Use an orchestration pipeline to extract the data.
Extracting Case data from the Genomic Data Commons Data Portal
There is another very similar endpoint for extracting the names of all the gene expression files. After that it's possible to start downloading the actual gene expression files. To avoid overwhelming the public API, this is split into multiple passes, 50 files at a time.
Extracting gene expression files from the Genomic Data Commons Data Portal
All the raw data has now been extracted, so it's time to push it into the bronze layer.
Relationalizing data for the bronze medallion layer
The main challenge at this stage is to convert the newly downloaded, complex, nested JSON data into rows and columns. This is known as relationalization.
Databricks has powerful features built-in for semi-structured data analysis. In the example below more than 11,000 rows are compressed into a single, large JSON object. Matillion's visual interface captures the complex STRUCT of the data, taking advantage of the SCHEMA_OF_JSON SQL function.
Accessing semi-structured data
Matillion's "Explode" and "Relationalize" components flatten and convert the JSON data into records and columns.
Explode and Flatten complex JSON substructure
Now the bronze layer is populated, and the data is ready to be integrated.
Integrating data for the silver medallion layer
The main challenge for the silver layer is to integrate related tables using SQL joins. This is made much simpler by having already dealt with the complex semi-structured data structures in the bronze layer.
In this example, the case table is joined to the file-level "expressions" data on the common case-id column.
Bronze to Silver table joins
This creates three silver-level tables focussing on different areas of analysis: demographics, diagnoses, and exposures.
Restructuring data for the gold medallion layer
The silver layer typically contains data in its most normalized representation, meaning that it acts as a source of truth. But the silver layer data structures can be hard to read. They usually benefit from further processing to make the data most consumable.
The gene expression data is a good example of this. The normalized representation is in "narrow" format, with more than 60,000 records per case: one per gene expression.
Narrow format - silver layer data
But statistical techniques such as dimensionality reduction require the data to be in a "wide" format, also preferably focussing on just the genes that naturally have the most variability in expression.
This is done with two more transformations:
- Identifying the genes with the most variability
- Pivoting this subset of the genes, from narrow to wide format
Pivot most variable genes to wide format
After this step the data in the gold table is expressed in "wide" format, with one column for each of the 100 most variable genes.
Gene expression by case in wide format
The wide format data is ready for statistical analysis and machine learning techniques such as dimensionality reduction and clustering, to help us understand which gene expressions are most associated with which cancers.
Summary and next steps
The article is a Life Sciences demonstration of the design and implementation of data pipelines on Databricks using the Matillion Data Productivity Cloud to efficiently handle The Cancer Genome Atlas (TCGA) data, consisting of genomic, clinical, and biospecimen data from 11,000 patients covering 33 cancer types. It used a Medallion architecture, organizing data into bronze (raw), silver (integrated), and gold (optimized for consumption) layers.
The Matillion Data Productivity Cloud, a versatile data pipeline platform, simplified this process with no-code connectors, pre-built components, and robust features for managing semi-structured data using Databricks. Matillion empowers data teams to build and manage pipelines quickly, integrating seamlessly with hyperscalers, CDPs, LLMs, and AI/ML tools, democratizing access to cancer genomics data for large-scale research.
Find this Solution Accelerator on the Databricks Marketplace.
Ready to transform your approach to data analysis? Start your own Matillion Data Productivity Cloud free trial.
Already a Matillion user? You can download the pipelines from the Matillion Exchange and run them on your own platform.
Technical information on how to access TCGA data through the Genomic Data Commons Data Portal.
More background information on The Cancer Genome Atlas (TCGA).
Ian Funnell
Data Alchemist
Ian Funnell, Data Alchemist at Matillion, curates The Data Geek weekly newsletter and manages the Matillion Exchange.
Follow Ian on LinkedIn: https://www.linkedin.com/in/ianfunnell
Featured Resources
ETL and SQL: How They Work Together in Modern Data Integration
Explore how SQL and ETL power modern data workflows, when to use SQL scripts vs ETL tools, and how Matillion blends automation ...
Learn more WhitepapersUnlocking Data Productivity: A DataOps Guide for High-performance Data Teams
Download the DataOps White Paper today and start building data pipelines that are scalable, reliable, and built for success.
Learn more BlogWebhooks and Pushdown Python: Building Interactive and Efficient Data Applications
Part 5 of our blog series demonstrating the art of the possible, using Matillion products and features to build the MatiHelper ...
Learn more
Share: