03.20.2018
5 SQL Clause Optimizations for Google BigQuery
Google BigQuery is an industry-leading cloud-based data warehouse. While it is still a SQL database, it is built for the cloud, and the differences in infrastructure compared to an on-premise database can make your SQL commands less performant or cost effective than you would expect. These 5 SQL clause optimizations for Google BigQuery will help you improve performance and keep the already low cost of running queries under control.
1. Avoid SELECT *
When you run a query using SELECT *, BigQuery has to read ALL the storage volumes. If you instead query only certain columns using SELECT col1, col2, col3…, BigQuery only needs to retrieve data for the selected columns. By specifying just the columns required for the results you are looking for, you limit the amount of data BigQuery has to read, and thus the amount of work it has to do. This is the simplest and most important technique, both for performance optimization and cost reduction.
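As a minimal sketch (the project, dataset, table, and column names here are hypothetical), the second query below scans only the three referenced columns rather than the whole table:

```sql
-- Scans every column in the table (a full storage read):
SELECT *
FROM `my_project.my_dataset.sales`;

-- Scans only the three referenced columns, reducing work and cost:
SELECT order_id, customer_id, order_total
FROM `my_project.my_dataset.sales`;
```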
2. LIMIT and WHERE clauses
LIMIT is a final step that restricts the amount of data shown after the full query has been executed. The LIMIT clause therefore DOESN'T affect query cost.
It is a similar story for a WHERE clause, which is applied AFTER the data has been scanned. The cost of a query depends on the number of bytes read BEFORE the WHERE clause is applied.
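For illustration (again with hypothetical names), both of the queries below are billed for a full scan of the referenced columns, even though each returns only a fraction of the rows:

```sql
-- LIMIT trims the result set after execution; the referenced columns are still fully scanned:
SELECT order_id, order_total
FROM `my_project.my_dataset.sales`
LIMIT 10;

-- The WHERE filter is applied after the scan, so bytes read (and cost) are unchanged:
SELECT order_id, order_total
FROM `my_project.my_dataset.sales`
WHERE order_date = '2018-03-01';
```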
If you find you are repeatedly querying a large base table, consider the following alternatives:
- Query it just once, using a superset of all the necessary WHERE clauses, and materialize an intermediate table (see the sketch after this list). Then repeatedly query the intermediate table, using the individual WHERE clauses.
- Use partitioning on the base table. Read more about Partitioning in our eBook Optimizing Google BigQuery.
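A sketch of the first alternative, with hypothetical names: pay for one filtered pass over the large base table, then run the cheaper follow-up queries against the much smaller result:

```sql
-- One scan of the large base table, using a superset of the filters you will need:
CREATE TABLE `my_project.my_dataset.sales_2018` AS
SELECT order_id, customer_id, order_total, order_date
FROM `my_project.my_dataset.sales`
WHERE order_date >= '2018-01-01';

-- Repeated queries now scan only the intermediate table:
SELECT customer_id, SUM(order_total) AS total_spend
FROM `my_project.my_dataset.sales_2018`
WHERE order_date BETWEEN '2018-03-01' AND '2018-03-31'
GROUP BY customer_id;
```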
3. WITH clauses
These are sometimes known as Common Table Expressions (CTEs). They can be syntactically useful, but they are not materialized at runtime.
If a query references the same CTE twice, it might look like only one scan is needed, but the CTE is re-evaluated on each reference, so two scans are actually required. The query can therefore cost twice as much and take twice as long as you might expect. If you are executing a complex CTE repeatedly, the solution is to use the CTE once beforehand to create an intermediate table, and then reference the intermediate table in your SQL rather than the complex CTE expression.
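As a sketch (hypothetical names), the first query below references the CTE twice, so the expensive aggregation may run twice; materializing it once avoids the repeated work:

```sql
-- The CTE is referenced twice, so it may be evaluated twice:
WITH daily_totals AS (
  SELECT order_date, SUM(order_total) AS day_total
  FROM `my_project.my_dataset.sales`
  GROUP BY order_date
)
SELECT a.order_date, a.day_total
FROM daily_totals a
JOIN daily_totals b
  ON b.order_date = DATE_SUB(a.order_date, INTERVAL 1 DAY)
WHERE a.day_total > b.day_total;

-- Alternative: materialize the CTE once, then query the intermediate table instead:
CREATE TABLE `my_project.my_dataset.daily_totals` AS
SELECT order_date, SUM(order_total) AS day_total
FROM `my_project.my_dataset.sales`
GROUP BY order_date;
```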
4. Materialize Common Joins
This recommendation applies if you find that you are repeatedly performing a join between two tables.
Even if you have WHERE clauses on both tables, it’s likely to be cheaper and faster, in the long run, to materialize this join into an intermediate table.
That means:
- Run the join once to create the new intermediate table
- Query the intermediate table repeatedly (no join is required)
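A minimal sketch of those two steps, with hypothetical table names:

```sql
-- Step 1: run the join once and materialize the result:
CREATE TABLE `my_project.my_dataset.orders_with_customers` AS
SELECT o.order_id, o.order_total, c.customer_name, c.region
FROM `my_project.my_dataset.orders` o
JOIN `my_project.my_dataset.customers` c
  ON c.customer_id = o.customer_id;

-- Step 2: subsequent queries read the pre-joined table, with no join cost:
SELECT region, SUM(order_total) AS region_total
FROM `my_project.my_dataset.orders_with_customers`
GROUP BY region;
```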
5. Avoid OLTP Patterns
Google BigQuery's unique architecture delivers excellent batch-mode and analytic SQL performance. But it is easy to accidentally slip into design patterns from the OLTP world, which are usually not optimal.
For example, you should AVOID:
- User-Defined Functions (UDFs), which are slower than native SQL functions
- High-compute queries
- Single-row Data Manipulation Language (DML) statements, such as inserting one row at a time (see the sketch after this list)
- Skew, where data is unevenly distributed across processing nodes
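As a sketch of the single-row DML anti-pattern (hypothetical names): issuing one DML statement per row is an OLTP habit that is slow and quota-limited in BigQuery, whereas a single set-based statement moves all the rows in one job:

```sql
-- Anti-pattern: one INSERT per row, as you might write for an OLTP database:
INSERT INTO `my_project.my_dataset.sales` (order_id, order_total)
VALUES (1001, 25.00);
INSERT INTO `my_project.my_dataset.sales` (order_id, order_total)
VALUES (1002, 13.50);

-- Better: a single set-based statement that inserts all the rows at once:
INSERT INTO `my_project.my_dataset.sales` (order_id, order_total)
SELECT order_id, order_total
FROM `my_project.my_dataset.staging_sales`;
```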
These SQL clause recommendations for Google BigQuery should help ensure the performance and cost efficiency of your queries. For further query optimization and cost-control techniques, download our comprehensive 60+ page guide on Google BigQuery.