Using the Cloud Storage Load Generator Tool in Matillion ETL for BigQuery to Load a CSV file
Matillion uses the Extract-Load-Transform (ELT) approach to delivering quick results for a wide range of data processing purposes: everything from customer behavior analytics, financial analysis, and even reducing the cost of synthesizing DNA.
The Cloud Storage Load component presents an easy-to-use graphical interface, enabling you to connect to a file stored on a Cloud Storage Bucket and pull data from that file into BigQuery. This is very popular with our customers to load data stored in files into BigQuery and combine this data with data from additional external sources.
The connector is completely self-contained: no additional software installation is required. It’s within the scope of an ordinary Matillion license, so there is no additional cost for using the features.
In this example we will look at loading CSV files, containing flight data, stored on a Cloud Storage Bucket. We have one year’s worth of flights data per file.
Cloud Storage Load Component
In order to run the Cloud Storage Load Component, you first need to create a table in BigQuery. This is the table you will load the CSV file into. You then need to configure the Cloud Storage Load component. While this is not difficult, since Matillion prompts you for all the information, it can take some time, especially when dealing with a data set that spans a lot of columns. To save time and reduce human error you can use the Cloud Storage Load Generator in Matillion ETL for BigQuery.
Cloud Storage Load Generator
We recommend using the Cloud Storage Load Generator to quickly configure the necessary components (Cloud Storage Load Component and Create Table Component) to load the contents of the files into BigQuery. Simply select the Cloud Storage Load Generator from the ‘Tools’ folder and drag it onto the canvas. The Load Generator will pop up.
Select the ellipsis next to Cloud Storage URL Location and then pick the file you want to be loaded into BigQuery.
From the list of Cloud Storage Buckets, you need to select the correct bucket and sub-folder. Then you can select the file you want to load:
Back in the Load Generator, change the compression options in the drop-down and select the compression type for the file:
Matillion will automatically unzip Gzip and BZip2 files during the load.
Click the Get Sample button to prompt Matillion to connect into the file and sample the top 50 rows of data. Matillion will guess the file type and also the metadata for the file based on the sample shown in the Load Generator.
Here we can see Matillion has identified the file is a CSV file with a comma field delimiter and newline Record delimiter. Based on the sample of data, Matillion also identifies the top row is the header. Matillion will use this header, in addition to the data sample in order to identify the metadata for the table you are creating in BigQuery.
If required, you can change any of these fields.
Click on the OK button and Matillion will generate the required components to load this data to BigQuery:
Running the Cloud Storage Load
Before you can run the components, you need to connect them to the Start component. This will mean the Create/Replace Table component will turn green. The Cloud Storage Load will remain red as it is relying on the table in the first component to exist. However, you can still run the job because the first component will create the table you need.
You can run the Orchestration job, either manually or by using the Scheduler, to query your data and bring it into BigQuery.
Cloud Storage URL Location
The Cloud Storage Load component supports wildcards. Rather than creating one component per file to load into BigQuery, you can change the Google Storage URL Location property to the first 2 digits followed by * as shown below:
When this is run it will then loop through all of the 20xx files and load them all into the same table in BigQuery. If you wanted to load all files within a bucket you could also use the * as the prefix.
Number of Errors Allowed
Another option our customers frequently use is the ‘Number of Errors Allowed’ property. You can use this to set the maximum number of individual parsing errors that cause the whole load to fail. These rows will either have columns with the incorrect metadata, e.g. a string where a date is expected or numbers which are too long or will not be in the format expected. By changing the ‘Number of Errors Allowed’ setting, values causing parsing errors will be substituted as null values.
Allow Jagged Rows
Finally, you have the option to accept rows that are missing trailing optional columns. The missing values are treated as nulls. If false, records with missing trailing columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. The default value is false. Only applicable to CSV, ignored for other formats.
Transforming the Data
Once you have brought the required data from Cloud Storage into BigQuery, you can use it in a Transformation job, perhaps to combine with existing data:
In this way, you can build out the rest of your downstream transformations and analysis, taking advantage of BigQuery’s power and scalability.
For more tips on loading your data into BigQuery download our comprehensive 60+ page guide on Optimizing Google BigQuery.