Google BigQuery is an industry leading cloud-based data warehouse. While it is still a SQL database, it’s built for cloud and therefore the difference in infrastructure, when compared to on-premise database, may cause your SQL commands to not be as performant or cost effective as you would expect. These 5 SQL clause optimizations for Google BigQuery will help you improve performance and control already low costs of running queries in Google BigQuery.
1. Avoid SELECT*
When you run a query using a SELECT *, BigQuery has to read ALL the storage volumes. Whereas if you query only certain columns using a SELECT col1, col2, col3… then BigQuery only needs to retrieve data for the selected columns. By specifying the columns required to get the results you are looking for, you limit the amount of work and thus the amount of BigQuery has to do to return the needed data. This is the simplest and most important technique, both for performance optimization and cost reduction.
2. LIMIT and WHERE clauses
LIMIT is a final step which restricts the amount of data shown, after the full query has been executed. Therefore, LIMIT clause DOESN’T affect query cost.
It is similar story for a WHERE clause, which is applied AFTER the data has been scanned. The cost of a query depends on the number of bytes read by the query BEFORE applying the WHERE clause.
If you find you are repeatedly querying a large base table, consider the following
- Query it just once, using a superset of all the necessary WHERE clauses, and materialize an intermediate table. Then repeatedly query the intermediate table, using individual WHERE clauses.
- Use partitioning on the base table. Read more about Partitioning in our eBook Optimizing Google BigQuery.
3. WITH clauses
These are sometimes known as Common Table Expressions (CTEs). They can be useful syntactically, but are not materialized at runtime.
It might look like the CTE will only need 1 scan, but it will actually require 2. Your query therefore will double in cost and take twice as long as you might expect. If you’re executing a complex CTE repeatedly, the solution is to use the CTE beforehand to create an intermediate table, and then use the intermediate table in the SQL rather than the complex CTE expression.
4. Materialize Common Joins
This recommendation applies if you find that you are repeatedly performing a join between two tables.
Even if you have WHERE clauses on both tables, it’s likely to be cheaper and faster, in the long run, to materialize this join into an intermediate table.
- Run the join once to create the new intermediate table
- Query the intermediate table repeatedly (no join is required)
5. Avoid OLTP Patterns
Google BigQuery’s unique architecture leads to great batch mode and analytic SQL performance. But it can be easy to accidentally slip into design patterns from the OLTP world, which are usually not optimal.
For example, you should AVOID:
- User-Defined Functions (UDFs)
- High-Compute Queries
- Single-Row Data Manipulation Language