SQL Under the Aurora: Predictive Power of Snowflake Cortex and Matillion

Sunspots are regions of intense magnetic fields where solar flares and coronal mass ejections occur. These massive bursts of charged particles sometimes hit the Earth, where they travel along magnetic field lines towards the poles, enter the atmosphere, and result in aurora displays that we call the northern lights. The Sun has an approximately 11-year sunspot cycle, in which magnetic activities fluctuate between maximum and minimum. During a maximum, the sunspot count increases, and this is happening right now! 

Given the large amount of past data on sunspot activity, I wondered whether the machine learning (ML) algorithms available in Snowflake Cortex could have predicted one of the strongest geomagnetic storms in many years, which occurred on 10 and 11 May 2024.

Storm G5 aurora, 10th and 11th May, 53° North

What is Snowflake Cortex ML?

Snowflake's Cortex ML offers low-code SQL functions that let data engineers run ML forecasting and anomaly detection algorithms without needing years of machine learning or data science expertise. Cortex comes pre-equipped with various ML algorithms, focused especially on time series data - such as the sunspot cycle in this article.

With these functions, Snowflake users can take advantage of machine learning functionalities to analyze how specific metrics evolve and deliver predictive insights based on past trends.

Two steps are needed to create a forecast using Cortex ML:

  1. Create a FORECAST object, using training data
  2. Run the FORECAST method

 

Building a forecast model for a single time series

Historical sunspot data from the National Oceanic and Atmospheric Administration is in the form of a single time series. That means there are two key attributes in the data:

  • A timestamp - Snowflake requires a TIMESTAMP_NTZ
  • A "value" column - Snowflake requires a FLOAT

I used the monthly data, so the timestamp is the first of every month, going right back to the year 1749. The "value" column is the mean number of sunspots counted during that month.

To train the model, I created a table named SUNSPOT_NUMBER_TRAINING containing this data, but stopping at 2010.
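
As a sketch of that preparation step (assuming the staging table "stg_sunspot_number" with the column names used later in this article; adjust to your own source data), the training table might be built like this:

```sql
-- Build the training table with the two column types Cortex ML expects:
-- a TIMESTAMP_NTZ timestamp (first of every month) and a FLOAT value,
-- truncated at the end of 2010 to leave a hold-out period
CREATE OR REPLACE TABLE SUNSPOT_NUMBER_TRAINING AS
SELECT TO_TIMESTAMP_NTZ(TO_CHAR(s."yyyy") || LPAD(TO_CHAR(s."mm"), 2, '0') || '01',
                        'YYYYMMDD')                        AS "TIMESTAMP",
       CAST(s."mean_total_sunspot_number" AS FLOAT)        AS "VALUE"
FROM "stg_sunspot_number" s
WHERE s."yyyy" <= 2010;
```

The uppercase, quoted column names line up with the TIMESTAMP_COLNAME and TARGET_COLNAME parameters used when the model is created.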

Historical sunspot counts

Here's the SQL to create a FORECAST object for sunspot activity:

CREATE OR REPLACE SNOWFLAKE.ML.FORECAST sunspot_model(
  INPUT_DATA => SYSTEM$REFERENCE('TABLE', 'SUNSPOT_NUMBER_TRAINING', 'SESSION'),
  TIMESTAMP_COLNAME => 'TIMESTAMP',
  TARGET_COLNAME => 'VALUE');

Some technical notes to bear in mind when you are creating a forecast:

  • If possible, don't use all your actual data. Keep some real values back from training, so you can test the model's accuracy afterwards. That's why I limited the training data to before 2010
  • Input data is provided by a SYSTEM$REFERENCE to a table name. This function returns a reference to an existing table. It can be used to temporarily grant access to objects - a form of transitory RBAC - although in this case, I have assumed the table and the forecast are in the same schema, and owned by the same user
  • You will find it easier to use SYSTEM$REFERENCE if you stick to uppercase table and column names

It may take several minutes to create the forecast, depending on how much training data has been provided. Once it has been created, you should be able to find it with a SHOW SNOWFLAKE.ML.FORECAST command, and it's ready to use!

Forecasting with Snowflake Cortex

For a single time series model like this one, forecasting is very simple. The only parameter is the number of new records to generate moving forward.

My training data ended in December 2010, and I wanted the forecast to extend at least a couple of years beyond the current year (2024), so I chose 16 years, or 192 months.

CALL sunspot_model!FORECAST(FORECASTING_PERIODS => 192)

The resulting forecast data has four columns:

  • TS - the forecasted timestamp
  • FORECAST - the forecast number of sunspots
  • LOWER_BOUND - Lower boundary of prediction interval
  • UPPER_BOUND - Upper boundary of prediction interval. The bounds diverge the further the prediction goes into the future, indicating increased uncertainty

To get hold of the data, run a RESULT_SCAN, passing -1 to mean the previous command. This works provided you run the CTAS as the very next statement in the same session.

CREATE OR REPLACE TABLE SUNSPOT_NUMBER_FORECAST AS
SELECT *
FROM TABLE(RESULT_SCAN(-1));

Here's the first few predictions from my model:

Forecast sunspot counts
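
As a quick sanity check on the widening bounds (a sketch, using the SUNSPOT_NUMBER_FORECAST table created above), you can watch the prediction interval grow month by month:

```sql
-- The interval width (upper minus lower bound) should increase
-- with the forecast horizon, reflecting growing uncertainty
SELECT TS,
       FORECAST,
       UPPER_BOUND - LOWER_BOUND AS PREDICTION_INTERVAL_WIDTH
FROM SUNSPOT_NUMBER_FORECAST
ORDER BY TS;
```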

Having deliberately kept back the years 2011 onwards from training the model, it's easy to compare the prediction against reality.

To do this in SQL you'll need the original, unfiltered data, which I kept in a table named stg_sunspot_number:

SELECT TO_DATE(f.TS) AS TS,
       a."mean_total_sunspot_number" AS ACTUAL,
       f.FORECAST
FROM "stg_sunspot_number" a
RIGHT JOIN SUNSPOT_NUMBER_FORECAST f
  ON TO_DATE(f.TS) = TO_DATE(TO_CHAR(a."yyyy") || LPAD(TO_CHAR(a."mm"), 2, '0'), 'YYYYMM')
ORDER BY f.TS;

Note the LPAD, which zero-pads single-digit months so the concatenated string always matches the YYYYMM format.

Notice there's a right join because the forecast extends to the year 2026. Here's how it looks as a line graph:
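
With actuals and forecasts side by side, it's also easy to quantify accuracy over the hold-out period. This is a sketch (the metric aliases are my own choices, not Cortex output):

```sql
-- Mean absolute error and mean bias over the months where actuals exist
SELECT AVG(ABS(a."mean_total_sunspot_number" - f.FORECAST)) AS MEAN_ABS_ERROR,
       AVG(f.FORECAST - a."mean_total_sunspot_number")      AS MEAN_BIAS
FROM "stg_sunspot_number" a
JOIN SUNSPOT_NUMBER_FORECAST f
  ON TO_DATE(f.TS) = TO_DATE(TO_CHAR(a."yyyy") || LPAD(TO_CHAR(a."mm"), 2, '0'), 'YYYYMM')
WHERE a."yyyy" > 2010;
```

An inner join suffices here, since only the months with actual observations contribute to the metrics.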

Forecast vs actual sunspot numbers, from Cortex ML FORECAST

Impressively, the model's predictions track the actual values closely across the roughly 11-year sunspot cycle. Remember the training data was truncated at 2010. There are some minor oddities between 2018 and 2021, where the prediction goes negative, but those are easy to fix with a CASE expression like this:

CASE WHEN FORECAST < 0 THEN 0 ELSE FORECAST END
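
Applied in full (a sketch; the cleaned table name is my own), that clamp could be materialized like this:

```sql
-- Replace negative forecasts with zero; a sunspot count can't be negative
CREATE OR REPLACE TABLE SUNSPOT_NUMBER_FORECAST_CLEAN AS
SELECT TS,
       CASE WHEN FORECAST < 0 THEN 0 ELSE FORECAST END         AS FORECAST,
       CASE WHEN LOWER_BOUND < 0 THEN 0 ELSE LOWER_BOUND END   AS LOWER_BOUND,
       UPPER_BOUND
FROM SUNSPOT_NUMBER_FORECAST;
```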

It's a bit much to expect a model that spans nearly 300 years to pick out a single date, but interestingly the peak prediction in January 2024 is only a few months away from May's major solar storm. The prediction remains high for another couple of years, which is great news for aurora watchers 😀

Matillion's interface to Cortex ML

The Matillion Data Productivity Cloud's close integration with Snowflake extends to Cortex ML. Here's a Matillion pipeline that creates the training data:

Creating model training data with the Matillion Data Productivity Cloud

Creating and running the model is also simple, for example like this:

Creating and running a time series forecast with the Matillion Data Productivity Cloud

Discover more about our joint capabilities at Snowflake Data Cloud Summit, and get ready to build forecasting, anomaly detection and more into your own data pipelines!

Ian Funnell

Data Alchemist

Ian Funnell, Data Alchemist at Matillion, curates The Data Geek weekly newsletter and manages the Matillion Exchange.