- Blog
- 05.16.2024
- Leveraging AI, Data Fundamentals, Product
SQL Under the Aurora: Predictive power of Snowflake Cortex and Matillion

Sunspots are regions of intense magnetic fields where solar flares and coronal mass ejections occur. These massive bursts of charged particles sometimes hit the Earth, where they travel along magnetic field lines towards the poles, enter the atmosphere, and result in aurora displays that we call the northern lights. The Sun has an approximately 11-year sunspot cycle, in which magnetic activities fluctuate between maximum and minimum. During a maximum, the sunspot count increases, and this is happening right now!
Given the large amount of historical data on sunspot activity, I wondered whether the machine learning (ML) algorithms available in Snowflake Cortex could have predicted one of the strongest geomagnetic storms in many years, which occurred between 10 and 11 May.

Storm G5 aurora, 10th and 11th May, 53° North
What is Snowflake Cortex ML?
Snowflake's Cortex ML offers low-code SQL functions that allow data engineers to run ML forecasting and anomaly detection algorithms, without requiring years of machine learning or data science expertise. Cortex comes pre-equipped with various ML algorithms, especially ones focused on time series data - such as the sunspot cycle in this article.
With these functions, Snowflake users can take advantage of machine learning functionalities to analyze how specific metrics evolve and deliver predictive insights based on past trends.
Two steps are needed to create a forecast using Cortex ML:
- Create a FORECAST object, using training data
- Run the FORECAST method
Building a forecast model for a single time series
Historical sunspot data from the National Oceanic and Atmospheric Administration is in the form of a single time series. That means there are two key attributes in the data:
- A timestamp - Snowflake requires a TIMESTAMP_NTZ
- A "value" column - Snowflake requires a FLOAT
I used the monthly data, so the timestamp is the first of every month, going right back to the year 1749. The "value" column is the mean number of sunspots counted during that month.
To train the model, I created a table named SUNSPOT_NUMBER_TRAINING containing this data, but stopping at 2010.
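The training table can be derived from the staged NOAA data with a CTAS along these lines - a sketch that assumes the staging table and quoted column names shown later in this article:

```sql
-- Build the training set: TIMESTAMP_NTZ + FLOAT, truncated at the end of 2010
CREATE OR REPLACE TABLE SUNSPOT_NUMBER_TRAINING AS
SELECT TO_TIMESTAMP_NTZ(TO_CHAR(s."yyyy") || LPAD(TO_CHAR(s."mm"), 2, '0') || '01',
                        'YYYYMMDD')               AS "TIMESTAMP",
       s."mean_total_sunspot_number"::FLOAT       AS "VALUE"
FROM "stg_sunspot_number" s
WHERE s."yyyy" <= 2010;
```

The LPAD zero-pads single-digit months so the concatenated string always parses cleanly, and the uppercase column names keep things simple for SYSTEM$REFERENCE later.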

Historical sunspot counts
Here's the SQL to create a FORECAST object for sunspot activity:
CREATE OR REPLACE SNOWFLAKE.ML.FORECAST sunspot_model(
    INPUT_DATA => SYSTEM$REFERENCE('TABLE', 'SUNSPOT_NUMBER_TRAINING', 'SESSION'),
    TIMESTAMP_COLNAME => 'TIMESTAMP',
    TARGET_COLNAME => 'VALUE');
Some technical notes to bear in mind when you are creating a forecast:
- If possible, don't use all your actual data for training. Withhold some real values from the model, so you can test its accuracy afterwards. That's why I truncated the training data at the end of 2010
- Input data is provided by a SYSTEM$REFERENCE to a table name. This function returns a reference to an existing table. It can be used to temporarily grant access to objects - a form of transitory RBAC - although in this case, I have assumed the table and the forecast are in the same schema, and owned by the same user
- You will find it easier to use SYSTEM$REFERENCE if you stick to uppercase table and column names
It may take several minutes to create the forecast, depending on how much training data has been provided. Once it has been created, you should be able to find it with a SHOW FORECAST command, and it's ready to use!
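If I recall the syntax correctly, the check looks like this - a sketch based on Snowflake's documented SHOW command for ML objects:

```sql
-- List forecast model instances visible in the current schema
SHOW SNOWFLAKE.ML.FORECAST INSTANCES;
```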
Forecasting with Snowflake Cortex
For a single time series model like this one, forecasting is very simple. The only parameter is the number of new records to generate moving forward.
My training data ended in December 2010, and I wanted the forecast to extend at least a couple of years beyond the current year (2024), so I chose 16 years, or 192 months.
CALL sunspot_model!FORECAST(FORECASTING_PERIODS => 192);
The resulting forecast data has four columns:
- TS - the forecast timestamp
- FORECAST - the forecast number of sunspots
- LOWER_BOUND - the lower boundary of the prediction interval
- UPPER_BOUND - the upper boundary of the prediction interval. The bounds diverge the further the prediction extends into the future, indicating increasing uncertainty
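The width of the prediction interval can optionally be tuned at forecast time with a CONFIG_OBJECT - for example, a 95% interval. This is a sketch based on Snowflake's documented forecasting options:

```sql
-- Widen the bounds from the default to a 95% prediction interval
CALL sunspot_model!FORECAST(
    FORECASTING_PERIODS => 192,
    CONFIG_OBJECT => {'prediction_interval': 0.95});
```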
To get hold of the data, run RESULT_SCAN with -1, which refers to the previous statement. This works provided you run the CTAS as the very next statement.
CREATE OR REPLACE TABLE SUNSPOT_NUMBER_FORECAST AS
SELECT *
FROM TABLE(RESULT_SCAN(-1));
Here's the first few predictions from my model:

Forecast sunspot counts
Having deliberately kept back the years 2011 onwards from training the model, it's easy to compare the prediction against reality.
To do this in SQL you'll need the original, unfiltered data, which I kept in a table named stg_sunspot_number:
SELECT TO_DATE(f.TS) AS TS,
       a."mean_total_sunspot_number" AS ACTUAL,
       f.FORECAST
FROM "stg_sunspot_number" a
RIGHT JOIN SUNSPOT_NUMBER_FORECAST f
  ON TO_DATE(f.TS) = TO_DATE(TO_CHAR(a."yyyy") || LPAD(TO_CHAR(a."mm"), 2, '0'), 'YYYYMM')
ORDER BY f.TS;
Notice there's a right join because the forecast extends to the year 2026. Here's how it looks as a line graph:

Forecast vs actual sunspot numbers, from Cortex ML FORECAST
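To put a number on the visual comparison, the same join can feed a simple error metric such as mean absolute error - a sketch reusing the tables above:

```sql
-- Mean absolute error over the months where forecast and actuals overlap
SELECT AVG(ABS(f.FORECAST - a."mean_total_sunspot_number")) AS MEAN_ABS_ERROR
FROM "stg_sunspot_number" a
JOIN SUNSPOT_NUMBER_FORECAST f
  ON TO_DATE(f.TS) = TO_DATE(TO_CHAR(a."yyyy") || LPAD(TO_CHAR(a."mm"), 2, '0'), 'YYYYMM');
```

An inner join suffices here, since the metric only makes sense for months where a real observation exists.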
Impressively, the model's predictions track the real behavior of the 11-year sunspot cycle very closely. Remember the training data was truncated at the end of 2010. There are some minor oddities between 2018 and 2021, where the prediction dips negative, but those are easy to fix with a CASE expression like this:
CASE WHEN FORECAST < 0 THEN 0 ELSE FORECAST END
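Applied in context, the clamp might look like this - Snowflake's GREATEST function is an equivalent shorthand:

```sql
-- Clamp negative sunspot predictions to zero
SELECT TS,
       GREATEST(FORECAST, 0) AS FORECAST,
       LOWER_BOUND,
       UPPER_BOUND
FROM SUNSPOT_NUMBER_FORECAST;
```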
It's a bit much to expect a model trained on over 260 years of data to pick out a single date, but interestingly the peak prediction, in January 2024, is only a few months away from May's major solar storm. The prediction remains high for another couple of years, which is great news for aurora watchers 😀
Matillion's interface to Cortex ML
The Matillion Data Productivity Cloud's close integration with Snowflake extends to Cortex ML. Here's a Matillion pipeline that creates the training data:

Creating model training data with the Matillion Data Productivity Cloud
Creating and running the model is also simple, for example like this:

Creating and running a time series forecast with the Matillion Data Productivity Cloud
Discover more about our joint capabilities at Snowflake Data Cloud Summit, and get ready to build forecasting, anomaly detection and more into your own data pipelines!
Ian Funnell
Data Alchemist
Ian Funnell, Data Alchemist at Matillion, curates The Data Geek weekly newsletter and manages the Matillion Exchange.
Follow Ian on LinkedIn: https://www.linkedin.com/in/ianfunnell