
Snowflake Cloud Data Platform Architecture & Basic Concepts

 

As big data continues to grow, more organizations are turning to cloud data warehouses. The cloud provides the flexibility and scalability needed to accommodate today’s massive data volumes. Initially released in 2014, Snowflake is one of the top cloud data platforms on the market.

 

What is Snowflake Cloud Data Platform?

 

Snowflake is a cloud data platform provided as a fully managed service. It can be used for data warehousing, data lakes, data engineering, data analytics, data science, data application development, and for securely sharing and consuming shared data. One notable feature of Snowflake is that it supports a near-unlimited number of concurrent workloads, so your users can do what they need to do, when they need to do it.

 

What is the Snowflake Cloud Data Platform Architecture?

 

Snowflake has a distinctive architecture. The platform includes storage, compute, and cloud services layers that are physically separated but logically integrated. This means that virtually all of your users and data workloads can access a single copy of your data without impacting one another’s performance. With everyone accessing the same version of your data, there are no data silos. Everyone works from a single source of truth.

Snowflake bills usage in credits. A Snowflake credit is the unit of measure used to pay for resource consumption on the platform. Credits are consumed whenever a customer uses resources, such as when a virtual warehouse is running, the cloud services layer is in use, or serverless features are in use.
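To make the credit model concrete, here is a minimal Python sketch of how warehouse running time might translate into credits and cost. The credits-per-hour table and the price per credit are assumptions for illustration only and do not come from this article; check your own Snowflake edition and contract for real rates. The function names are hypothetical.

```python
# Assumed credits-per-hour rates by warehouse size (illustrative, not from the article).
CREDITS_PER_HOUR = {
    "X-Small": 1, "Small": 2, "Medium": 4, "Large": 8,
    "X-Large": 16, "2X-Large": 32, "3X-Large": 64, "4X-Large": 128,
}


def warehouse_credits(size: str, running_seconds: float) -> float:
    """Credits consumed by one warehouse of the given size while it is running."""
    return CREDITS_PER_HOUR[size] * running_seconds / 3600


def cost(credits: float, price_per_credit: float) -> float:
    """Convert credits to currency using an assumed price per credit."""
    return credits * price_per_credit


if __name__ == "__main__":
    used = warehouse_credits("Medium", 30 * 60)  # a Medium warehouse running for 30 minutes
    print(used)
    print(cost(used, price_per_credit=3.0))      # 3.0 is a made-up price
```

Cloud services and serverless usage also consume credits, but they are left out of this sketch.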

 

Here are some other features of Snowflake:

 

Hybrid: Snowflake’s architecture is a hybrid of shared-disk and shared-nothing architectures. Shared-nothing architecture is a distributed architecture in which each node is independent and self-sufficient. In shared-disk architectures, all data is accessible from all cluster nodes. Snowflake combines the two, using a central data repository for persisted data that is accessible from all compute nodes. When processing queries, Snowflake uses massively parallel processing (MPP) compute clusters, and each node in the cluster stores a portion of the data set locally. With this hybrid model, Snowflake has the data management simplicity of a shared-disk architecture plus the performance benefits of a shared-nothing architecture.

Cloud agnostic: Unlike many cloud data warehouses, Snowflake doesn’t run on its own cloud. It is available globally on Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure. Because it has a common and interchangeable code base, you can move your data to any supported cloud and region without having to rewrite your application code. However, Snowflake cannot run on a private cloud infrastructure, either on-premises or hosted.

Separate storage and compute: Snowflake’s architecture separates storage from compute. This means that users aren’t competing for resources. Further, there are no limits on the number of queries or workloads that can be run simultaneously, and no limits on the number of users accessing data. All workloads can simultaneously use the compute power they need, when they need it.

Three-layered: Snowflake’s hybrid architecture has three layers that scale independently of one another: the database storage layer, the cloud services layer, and the query processing layer.

  • Database storage: Snowflake uses scalable cloud blob storage to store structured and semi-structured data, including JSON, Avro, and Parquet. The storage layer holds the databases, schemas, and tables, along with the data they contain. Tables can store multiple petabytes of data. Data in Snowflake tables is automatically divided into micro-partitions, which are contiguous units of storage. Each micro-partition contains between 50 MB and 500 MB of uncompressed data.
  • Cloud services layer: The cloud services layer provides services such as authentication, infrastructure management, and access control. It also provides metadata management.
  • Query processing layer: The query processing layer handles query execution. Snowflake processes queries using “virtual warehouses.” Each virtual warehouse is an independent MPP compute cluster made up of multiple compute nodes. As a result, each virtual warehouse has no impact on the performance of the other virtual warehouses.
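The relationship between the storage and query processing layers can be pictured with a toy Python model. This is only a sketch of the idea (one persisted copy of the data, several independent compute clusters); it is not Snowflake’s API, and the class names `SharedStorage` and `ToyWarehouse` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SharedStorage:
    """Stands in for the database storage layer: one persisted copy of each table."""
    tables: dict = field(default_factory=dict)


@dataclass
class ToyWarehouse:
    """Stands in for a virtual warehouse: its own compute state, no link to other warehouses."""
    name: str
    storage: SharedStorage
    local_cache: dict = field(default_factory=dict)
    queries_run: int = 0

    def total(self, table: str, column: str) -> float:
        # Pull the table from shared storage the first time, then reuse the local copy.
        if table not in self.local_cache:
            self.local_cache[table] = list(self.storage.tables[table])
        self.queries_run += 1
        return sum(row[column] for row in self.local_cache[table])


if __name__ == "__main__":
    storage = SharedStorage({"orders": [{"amount": 10}, {"amount": 25}, {"amount": 5}]})
    reporting = ToyWarehouse("reporting", storage)
    data_science = ToyWarehouse("data_science", storage)
    print(reporting.total("orders", "amount"))
    print(data_science.total("orders", "amount"))
    print(reporting.queries_run, data_science.queries_run)
```

Both warehouses read the same stored table, yet each keeps its own cache and query counter, which mirrors how work on one warehouse leaves the others untouched.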

Supports a range of data types: Snowflake supports a broad range of data types and can store them in their native forms, so you’re not creating new data silos.

Scales elastically: Snowflake provides automatic cloud elasticity, so when you need more capacity, Snowflake automatically adds it. You only pay for what you use.

 

Snowflake Cloud Data Platform Best Practices

 

Here are a few best practices for using Snowflake efficiently and economically:

 

Choose your warehouse size based on query type

 

Snowflake uses per-second billing, so the size of the warehouse you choose doesn’t necessarily determine what you pay. In fact, you can run larger warehouses (sizes run from X-Small to 4X-Large) and simply suspend them when they’re not in use.

What’s more important than the size of your warehouse is the type of queries you’ll be running. Data engineers may want to create separate warehouses for different environments such as development, testing, and production. A smaller warehouse is likely sufficient for development or testing environments. For production environments, larger warehouse sizes are usually necessary.

Snowflake recommends experimenting with different types of queries and different warehouse sizes to determine the combinations that best meet your specific query needs and workload.
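One way to record such an experiment is to time the same query on several warehouse sizes and compare the credits each run would consume. The Python sketch below assumes that an X-Small warehouse uses 1 credit per hour and that each size step doubles the rate; treat that as an assumption and verify it for your account. The timings are made-up numbers and the function names are hypothetical.

```python
SIZES = ["X-Small", "Small", "Medium", "Large", "X-Large", "2X-Large", "3X-Large", "4X-Large"]


def credits_per_hour(size: str) -> int:
    # Assumption: X-Small is 1 credit per hour and every step up doubles the rate.
    return 2 ** SIZES.index(size)


def trial_credits(size: str, elapsed_seconds: float) -> float:
    """Credits one timed run would consume on the given warehouse size."""
    return credits_per_hour(size) * elapsed_seconds / 3600


def cheapest_size(trials: dict) -> str:
    """trials maps warehouse size to the measured run time in seconds."""
    return min(trials, key=lambda size: trial_credits(size, trials[size]))


if __name__ == "__main__":
    measured = {"X-Small": 1200, "Small": 540, "Medium": 400}  # made-up timings
    for size, seconds in measured.items():
        print(size, seconds, round(trial_credits(size, seconds), 3))
    print(cheapest_size(measured))
```

If doubling the size halves the run time, the credits come out the same, which is why size alone doesn’t settle the cost question. Once the speedup falls short of that, as with the Medium timing here, the larger size costs more per run.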

 

Use separate warehouses for load and query operations

 

Loading large datasets can impact your query performance. Snowflake therefore recommends dedicating separate warehouses to loading and querying operations to optimize performance for both. Another point to consider is that loading performance is influenced more by the number of files being loaded and the size of each file than by the size of the warehouse. A smaller warehouse may therefore be sufficient for data loading purposes.
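Because file count and file size drive loading performance, it can help to plan how a large extract will be split before you load it. The Python sketch below divides a dataset into roughly equal files close to a target size. The target size is a hypothetical parameter that this article does not specify, and `plan_chunks` is an illustrative helper, not part of any Snowflake tool.

```python
import math


def plan_chunks(total_mb: int, target_chunk_mb: int) -> list:
    """Return per-file sizes (in MB) that add up to total_mb, none above the target."""
    if total_mb <= 0 or target_chunk_mb <= 0:
        raise ValueError("sizes must be positive")
    count = math.ceil(total_mb / target_chunk_mb)  # fewest files that respect the target
    base, extra = divmod(total_mb, count)          # spread the remainder one MB at a time
    return [base + 1] * extra + [base] * (count - extra)


if __name__ == "__main__":
    print(plan_chunks(1050, 200))
    print(plan_chunks(10, 3))
```

Evenly sized files are used here so that no single file dominates the load time; whether that suits your pipeline is a judgment call.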

 

Minimize small queries

 

As we’ve already learned, the Snowflake architecture separates its platform into three distinct functions: compute resources (implemented as virtual warehouses), data storage, and cloud services. The costs associated with using Snowflake are based on your usage of each of these functions. Snowflake bills warehouses per second, but each time a warehouse starts or resumes it is billed for a minimum of 60 seconds. Resuming a warehouse just to run a query that takes a few seconds is therefore not cost-effective. Using Snowflake for small queries is a bit like using a backhoe when you really need a hand shovel.
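A quick back-of-the-envelope calculation shows why scattered small queries add up. The Python sketch below assumes a 60-second minimum charge each time a warehouse resumes, and it assumes the warehouse suspends again between the separate runs. The function names are hypothetical.

```python
MINIMUM_BILLED_SECONDS = 60  # assumed minimum charge each time a warehouse starts or resumes


def billed_seconds(run_seconds: float) -> float:
    """Seconds billed for one resume-run-suspend cycle."""
    return max(MINIMUM_BILLED_SECONDS, run_seconds)


def billed_for_separate_runs(query_seconds: list) -> float:
    """Each query resumes a suspended warehouse, which suspends again afterwards."""
    return sum(billed_seconds(s) for s in query_seconds)


def billed_for_one_batch(query_seconds: list) -> float:
    """All queries run back to back within a single resume."""
    return billed_seconds(sum(query_seconds))


if __name__ == "__main__":
    small_queries = [5] * 10  # ten queries of five seconds each
    print(billed_for_separate_runs(small_queries))
    print(billed_for_one_batch(small_queries))
```

Under these assumptions, grouping small queries so they share one running period is billed far less than waking the warehouse for each one.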

 

Enable auto-suspension and auto-resumption

 

Snowflake provides features that can help you save credits and therefore reduce costs. Warehouses do not accrue credit usage while they’re suspended, so enabling auto-suspension and auto-resumption can help cut costs. With these features enabled, Snowflake automatically suspends a warehouse after a specified period of inactivity. It also automatically resumes the warehouse when you submit a query to it, provided that warehouse is the current one for the session.
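To get a feel for how much running time auto-suspension can save, here is a small Python simulation. It is a simplified sketch: the warehouse starts suspended, resumes when a query arrives, and suspends after a configurable idle period. It ignores any minimum billing charge, and the query timings are invented. `running_seconds` is a hypothetical helper.

```python
def running_seconds(queries: list, auto_suspend_s: float, horizon_s: float) -> float:
    """Total seconds the warehouse is running within the horizon.

    queries is a list of (start_s, duration_s) tuples sorted by start time.
    """
    total = 0.0
    run_start = None   # when the current running period began
    busy_until = None  # when the last query in this period finishes
    for start, duration in queries:
        end = start + duration
        if run_start is None:
            run_start, busy_until = start, end
        elif start <= busy_until + auto_suspend_s:
            # The warehouse is still up, so this query extends the current period.
            busy_until = max(busy_until, end)
        else:
            # The idle timeout expired before this query arrived: close the period.
            total += (busy_until + auto_suspend_s) - run_start
            run_start, busy_until = start, end
    if run_start is not None:
        total += min(busy_until + auto_suspend_s, horizon_s) - run_start
    return total


if __name__ == "__main__":
    one_hour = 3600
    workload = [(0, 120), (200, 60), (3000, 30)]  # made-up query schedule
    print(running_seconds(workload, auto_suspend_s=300, horizon_s=one_hour))
    print(one_hour)  # running time if the warehouse were never suspended
```

Shortening the idle period reduces running time further, at the price of more frequent resumes.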

Want to Learn More About Snowflake Cloud Data Platform?

 

Modern businesses seeking a competitive advantage must harness their data to gain better business insights. Matillion enables your data journey by extracting and loading your data and transforming it in the cloud, allowing you to be more productive, gain new insights and make better business decisions.

 

Effortlessly load source system data into your cloud data warehouse with Matillion Data Loader, a free SaaS-based data integration tool.