Atlanta Snowflake Study Group

SnowPro Core exam preparation

Snowflake AI Data Cloud Features and Architecture


Snowflake is an overloaded term—it refers to a company, a Software-as-a-Service platform, and the name of a system-defined, read-only database that tracks metadata and usage data about your organization. This topic covers the data platform.

The Snowflake data platform is cloud-native, designed specifically for cloud environments. Its three-layer architecture separates compute, storage, and cloud services for independent scaling. The platform works across AWS, Azure, and GCP, enabling customers to leverage existing cloud infrastructure and share data seamlessly across cloud environments.

Snowflake stores data in a compressed, columnar format that accelerates analytical queries by reading only required columns, making it ideal for business intelligence and data warehousing. Its micro-partitioning storage model automatically organizes data for optimal performance.

Outline key features of the Snowflake Data Cloud

Interoperable stage

A stage is a location for storing data files for loading into or unloading from Snowflake tables. Stages enable interoperability with external cloud storage. Two main types exist:

Two stage types: Internal (Snowflake-managed) vs External (your cloud storage: S3, GCS, Azure). External stages enable direct integration with existing data lakes.
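As a sketch, creating each stage type might look like the following (the stage, bucket, integration, and table names are all hypothetical):

```sql
-- Internal named stage (Snowflake-managed storage)
CREATE STAGE my_int_stage;

-- External stage over an existing S3 bucket
CREATE STAGE my_ext_stage
  URL = 's3://my-data-lake/landing/'
  STORAGE_INTEGRATION = my_s3_int;

-- Either stage type can feed a bulk load
COPY INTO my_table FROM @my_ext_stage;
```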

Elastic compute

Elastic compute is delivered through Virtual Warehouses—clusters of compute resources (CPU, memory, temporary storage) that execute queries, data loading, and DML operations. Key capabilities:

Elastic Compute = Virtual Warehouses

  • Scale Up/Down: Change warehouse size (X-Small → 6X-Large)
  • Scale In/Out: Add/remove clusters for concurrency
  • Pay only for compute when using it (separation of storage and compute)

Storage and compute separation enables pay-per-use pricing and independent scaling.
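Both scaling dimensions are set as warehouse properties. A hedged sketch (warehouse name is illustrative; the multi-cluster settings require Enterprise edition or higher):

```sql
CREATE WAREHOUSE analytics_wh
  WAREHOUSE_SIZE    = 'MEDIUM'  -- scale up/down: resize with ALTER
  MIN_CLUSTER_COUNT = 1         -- scale in/out: multi-cluster range
  MAX_CLUSTER_COUNT = 3
  AUTO_SUSPEND      = 300       -- suspend after 5 idle minutes (pay-per-use)
  AUTO_RESUME       = TRUE;

-- Scale up later by resizing
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';
```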

Snowflake’s layers

Snowflake uses a hybrid shared-disk and shared-nothing architecture with three layers:

  1. Database Storage: Foundation layer storing data in compressed, columnar format within cloud provider storage (AWS, Azure, GCP). Designed for scalability, resilience, and efficiency.
  2. Query Processing (Compute): Independent virtual warehouses executing queries without shared compute resources, enabling elastic compute.
  3. Cloud Services: The “brain” coordinating authentication, infrastructure management, metadata management, query optimization, and access control—making Snowflake fully managed.

Three-Layer Architecture (Hybrid shared-disk + shared-nothing):

  1. Database Storage - Compressed, columnar format in cloud storage (AWS/Azure/GCP)
  2. Query Processing - Independent virtual warehouses (no shared compute)
  3. Cloud Services - The “brain”: authentication, metadata, query optimization, access control

Overview of Snowflake editions

Snowflake offers four editions with progressively advanced features:

Editions progression: Standard → Enterprise (+ multi-cluster, materialized views) → Business Critical (+ HIPAA/PCI, failover) → VPS (complete isolation)

Outline key Snowflake tools and interfaces

Snowsight

Snowsight is Snowflake’s modern web interface for writing and running SQL, building dashboards and visualizations, and monitoring and managing your account.

Snowsight = Modern web UI for SQL, visualizations, monitoring, and management

SnowSQL

SnowSQL is the command-line client for connecting to Snowflake and executing SQL in interactive or batch mode.

Snowflake connectors

Snowflake provides connectors for integrating tools and applications:

Key Connectors: Spark, Python, Kafka (for streaming data)

Snowflake drivers

Drivers (JDBC, ODBC, and drivers for Go, Node.js, .NET, and PHP) enable custom application development in various languages.

Snowpark

Snowpark is a developer framework supporting Python, Java, and Scala for building data pipelines and applications that run natively in Snowflake. Provides a DataFrame-style API similar to Spark or pandas.

Snowpark: Developer framework for Python/Java/Scala with DataFrame API. Runs natively in Snowflake (pushdown compute).

SnowCD

SnowCD (Snowflake Connectivity Diagnostic Tool) is a command-line tool for diagnosing network connectivity to Snowflake, checking latency, DNS resolution, and firewall issues.

Streamlit in Snowflake

Streamlit enables building and deploying Python web apps natively in Snowflake, creating interactive data applications securely connected to your data without additional infrastructure.

Streamlit in Snowflake: Build and deploy Python web apps natively in Snowflake (no additional infrastructure needed)

Cortex (AI/ML services)

Snowflake Cortex provides managed AI/ML services requiring no ML expertise:

Cortex AI/ML Services: LLM Functions (in SQL), Forecasting, Anomaly Detection, Contribution Explorer - no ML expertise required
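Because Cortex LLM functions are exposed as regular SQL functions, they can be called inline from any query. A sketch (model availability varies by region, so the model name here is an assumption):

```sql
-- Score sentiment of free text directly in SQL
SELECT SNOWFLAKE.CORTEX.SENTIMENT('The new dashboard is fantastic!');

-- Prompt an LLM from SQL
SELECT SNOWFLAKE.CORTEX.COMPLETE(
  'mistral-large',
  'Summarize Snowflake''s three-layer architecture in one sentence.'
);
```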

Snowflake SQL API

The Snowflake SQL API is a REST API for submitting SQL statements programmatically, checking execution status, and fetching results.

Outline Snowflake’s catalog and objects

Databases

Databases are logical groupings of schemas, tables, views, and other objects—the top-level organizational unit for data.
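The account → database → schema → table hierarchy can be sketched like this (all names are illustrative):

```sql
CREATE DATABASE sales_db;
CREATE SCHEMA sales_db.raw;
CREATE TABLE sales_db.raw.orders (
  order_id   NUMBER,
  order_date DATE,
  amount     NUMBER(10,2)
);
```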

Stages

Stages are locations for data files, either internal (Snowflake-managed) or external (your cloud storage), used for loading and unloading data.

Schema types

Schemas logically group database objects within a database. Three types:

Three Schema Types: Permanent (standard), Transient (no Fail-safe, lower cost), Temporary (session-based)
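A minimal sketch of the two explicitly created forms (schema names are hypothetical):

```sql
-- Standard (permanent) schema
CREATE SCHEMA analytics;

-- Transient schema: objects in it get no Fail-safe, lowering storage cost
CREATE TRANSIENT SCHEMA staging;
```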

Table types

Five table types with different characteristics:

Table Types & Time Travel:

  • Permanent: Time Travel up to 1 day (Standard) or up to 90 days (Enterprise+), has Fail-safe
  • Transient: 24-hour Time Travel max, NO Fail-safe
  • Temporary: Session-specific
  • External: Query files in external stages as tables
  • Iceberg: Apache Iceberg open table format with schema evolution & time travel on your data lake
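The three internal table types differ only by a keyword at creation time. A sketch with hypothetical names (the 90-day retention assumes Enterprise edition):

```sql
-- Permanent table with extended Time Travel
CREATE TABLE orders (id NUMBER)
  DATA_RETENTION_TIME_IN_DAYS = 90;

-- Transient table: at most 1 day of Time Travel, no Fail-safe
CREATE TRANSIENT TABLE stg_orders (id NUMBER);

-- Temporary table: visible only to this session, dropped at session end
CREATE TEMPORARY TABLE tmp_orders (id NUMBER);
```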

View types

Views are saved queries that behave like tables. Three types:

Three View Types:

  • Standard: Query executed each time
  • Materialized: Stored result, auto-updated (performance boost)
  • Secure: Hides query definition for data privacy
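One illustrative statement per view type (table and view names are hypothetical; materialized views require Enterprise edition):

```sql
-- Standard view: query re-executed on each read
CREATE VIEW v_orders AS
  SELECT order_id, amount FROM orders;

-- Materialized view: results precomputed and auto-maintained
CREATE MATERIALIZED VIEW mv_daily AS
  SELECT order_date, SUM(amount) AS total
  FROM orders
  GROUP BY order_date;

-- Secure view: definition hidden from non-owners
CREATE SECURE VIEW sv_orders AS
  SELECT order_id, amount FROM orders;
```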

Data types

Rich SQL data type support:

Semi-structured data types: VARIANT, ARRAY, OBJECT (key differentiator for JSON/XML)
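VARIANT columns can hold raw JSON and be queried with path notation and FLATTEN. A sketch (table, column, and JSON field names are assumptions):

```sql
-- A table with a VARIANT column holding JSON documents
CREATE TABLE events (payload VARIANT);

-- Path notation plus a cast to a SQL type
SELECT payload:user.name::STRING    AS user_name,
       payload:items[0].sku::STRING AS first_sku
FROM events;

-- Expand a JSON array into one row per element
SELECT value:sku::STRING AS sku
FROM events, LATERAL FLATTEN(input => payload:items);
```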

User Defined Functions (UDFs)

UDFs extend Snowflake functionality with custom logic, supporting:

UDFs support: SQL, JavaScript, Java, Python
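A minimal scalar SQL UDF, as a sketch:

```sql
CREATE FUNCTION area_of_circle(radius FLOAT)
  RETURNS FLOAT
  AS 'PI() * radius * radius';

-- Use it like any built-in function
SELECT area_of_circle(2.0);
```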

User Defined Table Functions (UDTFs)

UDTFs return row sets, behaving like tables. Written in SQL, JavaScript, Java, or Python for complex data transformations.
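Unlike a scalar UDF, a UDTF declares RETURNS TABLE and is invoked inside TABLE(...). A sketch with hypothetical table and column names:

```sql
CREATE FUNCTION orders_for(customer VARCHAR)
  RETURNS TABLE (order_id NUMBER, amount NUMBER)
  AS $$
    SELECT order_id, amount
    FROM orders
    WHERE customer_name = customer
  $$;

SELECT * FROM TABLE(orders_for('Acme'));
```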

Stored procedures

Stored procedures encapsulate SQL statements and procedural logic for complex business logic and task automation. Written in SQL (Snowflake Scripting), JavaScript, Java, or Python.
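A Snowflake Scripting sketch combining DML with procedural logic (procedure, table, and column names are illustrative):

```sql
CREATE PROCEDURE purge_old_orders(days NUMBER)
  RETURNS NUMBER
  LANGUAGE SQL
AS
$$
DECLARE
  cutoff DATE DEFAULT DATEADD(day, -days, CURRENT_DATE());
BEGIN
  DELETE FROM orders WHERE order_date < :cutoff;
  RETURN SQLROWCOUNT;  -- rows affected by the last DML statement
END;
$$;

CALL purge_old_orders(365);
```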

Streams

Streams record change data capture (CDC) information: every insert, update, and delete on a table, so pipelines can process only the changed data.

Streams: Capture CDC (Change Data Capture) - inserts, updates, deletes. Essential for incremental data pipelines.
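A sketch of stream usage (table names are hypothetical):

```sql
-- Track changes on a table
CREATE STREAM orders_stream ON TABLE orders;

-- The stream exposes changed rows plus metadata columns
SELECT order_id, METADATA$ACTION, METADATA$ISUPDATE
FROM orders_stream;

-- Consuming the stream in a DML statement advances its offset
INSERT INTO orders_audit
  SELECT order_id, METADATA$ACTION FROM orders_stream;
```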

Tasks

Tasks schedule SQL statements or stored procedures for pipeline automation. Can be chained into complex workflows (task trees/DAGs).

Tasks: Scheduled SQL/stored procedures. Can be chained for complex workflows (task trees/DAGs).
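A two-task DAG as a sketch (warehouse, task, and procedure names are assumptions):

```sql
-- Root task runs on a schedule
CREATE TASK load_task
  WAREHOUSE = etl_wh
  SCHEDULE = '60 MINUTE'
AS
  CALL load_orders();

-- Child task runs after the root succeeds
CREATE TASK transform_task
  WAREHOUSE = etl_wh
  AFTER load_task
AS
  CALL transform_orders();

-- Tasks are created suspended; resume children before the root
ALTER TASK transform_task RESUME;
ALTER TASK load_task RESUME;
```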

Pipes

Pipes define continuous data ingestion from external stages. Snowpipe automatically loads new files as they arrive—efficient and scalable for streaming data.

Pipes (Snowpipe): Continuous, automatic data ingestion from external stages. Loads files as they arrive.
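A Snowpipe sketch (stage, pipe, and table names are illustrative; AUTO_INGEST also requires event notifications configured on the cloud storage side):

```sql
CREATE PIPE orders_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO orders
  FROM @my_ext_stage
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);
```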

Shares

Shares encapsulate database sharing information, enabling Secure Data Sharing between Snowflake accounts and non-Snowflake users via reader accounts.

Shares: Enable Secure Data Sharing between Snowflake accounts (even to non-Snowflake users via reader accounts).
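On the provider side, a share is built up with grants and then attached to consumer accounts. A sketch (all identifiers are hypothetical):

```sql
CREATE SHARE sales_share;
GRANT USAGE ON DATABASE sales_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share;
GRANT SELECT ON TABLE sales_db.public.orders TO SHARE sales_share;

-- Attach a consumer account (organization.account identifier)
ALTER SHARE sales_share ADD ACCOUNTS = myorg.consumer_account;
```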

Sequences

Sequences generate unique number sequences, commonly used for primary key values.
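A sketch of sequence-based surrogate keys (names are illustrative):

```sql
CREATE SEQUENCE order_seq START = 1 INCREMENT = 1;

INSERT INTO orders (order_id, amount)
  SELECT order_seq.NEXTVAL, 42.00;
```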

Outline Snowflake storage concepts

Micro-partitions

Micro-partitions are Snowflake’s fundamental storage unit. Data is automatically divided into immutable micro-partitions (50-500MB uncompressed). Metadata stored per partition (column value ranges) enables efficient query pruning.

Micro-partitions: Fundamental storage unit

  • Immutable, 50-500MB uncompressed
  • Automatic partitioning
  • Metadata per partition (column ranges) enables query pruning

Data clustering

Data clustering co-locates related data in the same or nearby micro-partitions, improving query performance by reducing micro-partition scans. Define a clustering key for automatic maintenance.

Data Clustering: Co-locate related data in same/nearby micro-partitions. Define clustering key → automatic maintenance → improved query performance.
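A sketch of defining and inspecting a clustering key (table and column names are assumptions):

```sql
-- Define a clustering key; Snowflake maintains it automatically
ALTER TABLE orders CLUSTER BY (order_date, region);

-- Check how well the table is clustered on those columns
SELECT SYSTEM$CLUSTERING_INFORMATION('orders', '(order_date, region)');
```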

Data storage monitoring

Monitor storage usage via ACCOUNT_USAGE schema views such as TABLE_STORAGE_METRICS and STORAGE_USAGE.

Snowsight also provides storage usage visualization for cost understanding and optimization.

Storage Monitoring: Use ACCOUNT_USAGE schema (TABLE_STORAGE_METRICS, STORAGE_USAGE) or Snowsight for visualization.
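Two illustrative monitoring queries against the shared SNOWFLAKE database:

```sql
-- Account-level storage over the last week
SELECT usage_date, storage_bytes, stage_bytes, failsafe_bytes
FROM snowflake.account_usage.storage_usage
ORDER BY usage_date DESC
LIMIT 7;

-- Largest tables, including Time Travel and Fail-safe overhead
SELECT table_name, active_bytes, time_travel_bytes, failsafe_bytes
FROM snowflake.account_usage.table_storage_metrics
ORDER BY active_bytes DESC
LIMIT 10;
```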