Blog

Inferring Sessions from Event Data using Matillion ETL for Amazon Redshift

Inferring sessions from event data is the process of combining discrete events using a pre-defined approach. Typically, web-sites generate events (log files) data related to the actions/journey of a user during each visit. Data related to a specific visit allow a web-analytics practice to determine common patterns, such as:

  • How did the user arrive at the web-site?
  • Which pages did the user visit on the web-site? and in which order?
  • On an e-commerce web-site, which products/categories did the user click through? This data may later be used for affinity analysis

A session/visit may be reconstructed using two different approaches:

  • Time-oriented: by pre-defining an Inactivity Threshold, event data is classified as a session/visit if a sequence of actions originating from the same user is never inactive for a period of time which exceeds the inactivity threshold. This approach becomes crucial whenever the event data does not contain a Session Id, or where the same Session Id is reused for a period of time which spans across more than one visit.
  • Navigation-oriented: based on a web-site map (graph), a session is a sequence of events related to pages (vertices) which are connected by hyperlinks (edges). If a user moves from page1 to page2, but page1 does not contain a hyperlink to page2, this should be considered a new session/visit.

This document will demonstrate a simple, repeatable and scalable time-oriented approach to infer sessions/visits from event data using Matillion ETL for Amazon Redshift.

Getting Started with the example

The attached export contains one Matillion ETL for Amazon Redshift Folder comprising:

  • Orchestration Job (Sessionize Setup): This job sets up the schema and generates sample data for our transformation job;
  • Transformation Job (Sessionize Calculation): This job uses the generated sample data as an input to infer Session IDs for each web-site visit.

To import the Jobs, you will need an environment connected to a Redshift schema and use Project -> Import Job and select the file.

Inactivity Threshold

The Inactivity Threshold is the maximum amount of time which may separate two events (page views) within the same visit. If, between two events, the user is inactive for a period of time equal or exceeding the Inactivity Threshold, we will consider the two events to belong to distinct sessions/visits.

Our Transformation Job will use an Environment Variable called timeout to define the Inactivity Threshold to 30 minutes. You can set an Environment Variable by accessing Project -> Manage Environment Variables:

 

inferring-sessions-matillion-etl-for-redshift-TIMEOUT

The timeout environment variable

 

Environment Variables may be used on your jobs and their values may be dynamically changed. For more information, please visit Using Variables.

Schema and Sample Data

The Orchestration Job will:

  • Create (if not exists) a table containing factual data regarding page visits;
  • Truncate the page visits fact table;
  • Generate sample data: two distinct users visit the web-site on two distinct occasions each;
  • Run transformation job.
inferring-sessions-matillion-etl-for-redshift-orchestration

Orchestration Job

Schema

The page views table contains information about each page visited, user and a timestamp indicating when the user visited the page:

 

inferring-sessions-matillion-etl-for-redshift-pageviews

PageViews Table

 

Sample Data

The Orchestration Job will use a “SQL Script” component to generate sample data for two users, each visiting the web-site on two distinct occasions:

 

inferring-sessions-matillion-etl-for-redshift-4

Sample Data

 

Notes:

  • Each blue box contains data for a specific user;
  • Yellow break-lines denote new sessions/visits for each user, i.e. adjacent events are split by at least 30m.

After completing schema and sample data setup, the orchestration job will trigger the “Sessionize Calculation” transformation job.

Transformation Job

Our Transformation Job will consist of 5 steps:

  1. Table Input: Reads the data from the page views fact table
  2. Lead/Lag: For each user and event, calculates the timestamp of the previous event
  3. Calculator: Compares time gap of current and previous events with the Inactivity Threshold to determine a new session flag/integer
  4. Window Calculation: Determines a Session ID per user using the new session flag/integer
  5. Calculator: Determines a Unique Session ID

 

inferring-sessions-matillion-etl-for-redshift-transformation

Transformation Job

 

Table Input: fact_pageview

inferring-sessions-matillion-etl-for-redshift-fact-pageview

The Table Input component reads data of all columns from the page views table:

 

inferring-sessions-matillion-etl-for-redshift-table-input

Table Input Component Properties

 

Lead/Lag: Calculate Previous Event

inferring-sessions-matillion-etl-for-redshift-calculateThe Lead/Lag component will partition the data by user, and for each event will calculate a new field (previous_timestamp) containing the timestamp of the previous event for the same user. If and when an event is the first page view of a specific user, the previous_timestamp value will be NULL.

 

inferring-sessions-matillion-etl-for-redshift-lead-lag

Lead/Lag Properties

 

The lead/lag component “Partitions Data” by user (uid) and orders the data within each partition by the event_timestamp in ascending order:

 

inferring-sessions-matillion-etl-for-redshift-10

Lead/Lag Ordering Within Partitions

 

For each event within the user partition, we will calculate the timestamp of the previous event according to the order criteria previously described. The previous event timestamp will be stored on output column previous_event_timestamp.

 

inferring-sessions-matillion-etl-for-redshift-11

Lead/Lag Function

 

After applying the first transformation, we can use the Lead/Lag component Sample tab and check how the data will look like:

 

inferring-sessions-matillion-etl-for-redshift-12

Lead/Lag Sample Data

 

As illustrated, we now have, for each event, the timestamp for the previous event by the same user. For the first event of each user, the previous_event_timestamp is NULL.

Calculator: Calculate New Session Flag/Integer

inferring-sessions-matillion-etl-for-redshift-new-sessionOur first Calculator component on our Transformation Job will determine an integer value (0/1) to indicate when a page view is the start of a new session.

 

inferring-sessions-matillion-etl-for-redshift-new-session-properties

new_session Calculator Properties

 

The new session flag value is based on one of three possible scenarios:

  1. The previous_event_timestamp is NULL, thus indicating the event refers to first time the user has visited the web-site. In this case the value will be 1 (new session).
  2. The time gap between the previous_event_timestamp and the current event is equal or greater than the Inactivity Threshold. In this case the value will also be 1 (new session).
  3. The time gap between the previous_event_timestamp and the current event is less than the Inactivity Threshold. In this case the value will be 0 (not a new session).

The result/flag will be stored on the new field new_session:

 

inferring-sessions-matillion-etl-for-redshift-15

Calculator Formula for the new_session field

 

The Calculator formula uses the Environment Variable ${timeout} to perform the comparison. We can use the Calculator component Sample tab and check how the data will look like:

 

inferring-sessions-matillion-etl-for-redshift-new-session-sample-data

new_session Calculator Sample Data

 

As illustrated, we now have, for each row, an Integer flag indicating whether the event represents a new session.

Window Calculation: Determine Session ID

inferring-sessions-matillion-etl-for-redshift-determine-session-idThe Window Calculation component will determine a Session ID for each user session based on the

new_session field.

 

inferring-sessions-matillion-etl-for-redshift-window-calculation

Window Calculation Properties

 

The Window Calculation, similar to the Lead/Lag, partitions the data by user and orders the data within each partition by event_timestamp in ascending order:

 

inferring-sessions-matillion-etl-for-redshift-ordering

Window Calculation Ordering

 

For each event within the user partition, we will calculate an aggregated sum of the new_session field, thus determining a unique Session ID per user partition:

 

inferring-sessions-matillion-etl-for-redshift-function

Window Calculation Function

 

The result of our Window Calculation will store the unique Session ID per user on the new field sid. Remaining properties should be set to:

  • Lower Bound: unbounded preceding. The aggregation “starts” at the first row of the partition
  • Upper Bound: current row. The aggregation uses window values from the lower bound up to the current row

We can use the Window Calculation component Sample tab and check how the data will look like:

 

inferring-sessions-matillion-etl-for-redshift-window-calculation

Window Calculation Sample Data

 

As illustrated, we now have, for each row, a Session ID (sid) uniquely identifying a Session per user.

Calculator: Calculate New Session Flag/Integer

inferring-sessions-matillion-etl-for-redshift-determine-uniqueThe last step on the Transformation Job is a Calculator component which will concatenate the user identifier with the session identifier in order to determine a unique Session ID across the whole set of data.

 

inferring-sessions-matillion-etl-for-redshift-unique-session

Unique Session ID Calculator

 

The result of the concatenation will be stored on the new field unique_sid:

 

inferring-sessions-matillion-etl-for-redshift-24

Calculator Formula for the unique_sid field

 

We can use the Calculator component Sample tab and check how the data will look like:

 

inferring-sessions-matillion-etl-for-redshift-25

Unique Session ID Calculator Sample Data

 

As illustrated, we now have, for each row, a Unique Session ID (unique_sid) uniquely identifying a Session.

Further Consideration

On a production environment you should also consider:

  • Storing results on the same or different table, using a Table Output component at the end of the Transformation Job
  • Segment event data by a pre-determined period of time, Sessionizing only data for that period
  • Build error flows whenever one of the components fails to execute

We hope you found this helpful. For more information, visit the Matillion ETL for Amazon Redshift Support Center.

Attachments

Want to try this out for yourself? Download the required JSON file here