Cloud Data Platform Architecture Patterns
A comprehensive guide to designing cloud-native data platforms—covering lakehouse architecture, multi-cloud considerations, and platform comparisons across AWS, GCP, and Azure
Cloud data platforms have evolved from simple storage-compute architectures to sophisticated ecosystems of integrated services. This guide examines the dominant architectural patterns—data lakes, data warehouses, and the emerging lakehouse paradigm—while providing practical guidance on platform selection and multi-cloud strategies.
Prerequisites
- Familiarity with cloud computing concepts (compute, storage, networking)
- Basic understanding of data warehousing and data lakes
- Experience with at least one major cloud provider

The Evolution of Cloud Data Architecture
Cloud data platforms have evolved through distinct generations:
┌──────────────────────────────────────────────────────────────────────────────┐
│ EVOLUTION OF CLOUD DATA PLATFORMS │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ Gen 1: Lift and Shift (2010-2015) │
│ ───────────────────────────────── │
│ • Traditional data warehouses migrated to cloud VMs │
│ • Same architecture, just different hosting │
│ • High operational burden, limited scalability │
│ Examples: Oracle on EC2, Teradata on Azure VMs │
│ │
│ Gen 2: Cloud-Native Warehouses (2015-2019) │
│ ───────────────────────────────────────────── │
│ • Purpose-built for cloud (separation of storage and compute) │
│ • Elastic scaling, pay-per-use pricing │
│ • Still required ETL into proprietary formats │
│ Examples: Snowflake, BigQuery, Redshift │
│ │
│ Gen 3: Data Lakes (2016-2020) │
│ ───────────────────────────── │
│ • Store everything in object storage (S3, GCS, ADLS) │
│ • Schema-on-read flexibility │
│ • Decoupled storage from processing engines │
│ • Data quality and governance challenges │
│ Examples: Hadoop on EMR, Spark on Databricks │
│ │
│ Gen 4: Lakehouse (2020-Present) │
│ ──────────────────────────────── │
│ • Open table formats on object storage (Delta, Iceberg, Hudi) │
│ • ACID transactions, schema enforcement, time travel │
│ • Best of lakes (flexibility) and warehouses (reliability) │
│ • Unified platform for all data workloads │
│ Examples: Databricks Lakehouse, Snowflake with Iceberg, Dremio │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
The Lakehouse Architecture
The lakehouse combines data lake flexibility with data warehouse reliability.
Core Principles
┌──────────────────────────────────────────────────────────────────────────────┐
│ LAKEHOUSE ARCHITECTURE │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ CONSUMPTION LAYER │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ BI │ │ SQL │ │ ML │ │ Streaming │ │ │
│ │ │ (Tableau) │ │ (Redash) │ │ (MLflow) │ │ (Flink) │ │ │
│ │ └────────────┘ └────────────┘ └────────────┘ └────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ QUERY ENGINES │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ Spark SQL │ │ Trino │ │ Photon │ │ Presto │ │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ └────────────┘ └────────────┘ └────────────┘ └────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ OPEN TABLE FORMAT LAYER │ │
│ │ ┌───────────────────────────────────────────────────────────────┐ │ │
│ │ │ Delta Lake / Apache Iceberg / Apache Hudi │ │ │
│ │ │ │ │ │
│ │ │ • ACID Transactions • Time Travel │ │ │
│ │ │ • Schema Evolution • Partition Evolution │ │ │
│ │ │ • Unified Batch/Stream • Z-Ordering / Compaction │ │ │
│ │ └───────────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ FILE FORMAT LAYER │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ Parquet │ │ ORC │ │ Avro │ │ │
│ │ │ (columnar) │ │ (columnar) │ │ (row-wise) │ │ │
│ │ └────────────┘ └────────────┘ └────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ OBJECT STORAGE LAYER │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ Amazon S3 │ │ GCS │ │ ADLS │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ └────────────┘ └────────────┘ └────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
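The table-format layer's headline guarantees (ACID commits, time travel) all derive from one idea: an append-only log of immutable table snapshots. The toy sketch below is not Delta Lake's or Iceberg's actual metadata layout, just an illustration of the mechanism they build on: each commit atomically appends a new log entry listing the table's current data files, and reading "as of" a version simply replays the log to that point.

```python
"""Toy transaction log showing how open table formats get atomic
commits and time travel. Illustrative only -- not the real Delta
Lake or Iceberg metadata format."""
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Snapshot:
    version: int
    files: Tuple[str, ...]  # immutable set of data files in this version


class ToyTable:
    def __init__(self):
        self._log: List[Snapshot] = []

    def commit(self, added=(), removed=()) -> int:
        """Atomically publish a new version (one log append = one commit)."""
        current = set(self._log[-1].files) if self._log else set()
        new_files = tuple(sorted((current - set(removed)) | set(added)))
        self._log.append(Snapshot(version=len(self._log), files=new_files))
        return self._log[-1].version

    def files(self, as_of: Optional[int] = None) -> Tuple[str, ...]:
        """Read the latest version, or 'time travel' to an older one."""
        if not self._log:
            return ()
        version = len(self._log) - 1 if as_of is None else as_of
        return self._log[version].files


t = ToyTable()
t.commit(added=["part-0.parquet"])    # version 0
t.commit(added=["part-1.parquet"])    # version 1
t.commit(removed=["part-0.parquet"])  # version 2: a delete/compaction
print(t.files())                      # ('part-1.parquet',)
print(t.files(as_of=0))               # time travel: ('part-0.parquet',)
```

Because old snapshots are never mutated, readers always see a consistent version even while a writer is committing; the real formats add conflict detection and metadata files on object storage on top of this core.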
Open Table Format Comparison
| Feature | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| Ecosystem | Databricks, Spark | Vendor-neutral, broad support | Uber origin, AWS focus |
| ACID Transactions | Yes | Yes | Yes |
| Time Travel | Yes | Yes | Yes |
| Schema Evolution | Yes | Yes (best support) | Yes |
| Partition Evolution | Limited | Full support | Limited |
| Streaming Support | Excellent | Good | Excellent |
| Small File Handling | Auto-compaction | Compaction needed | Auto-compaction |
| Query Engines | Spark, Trino, Flink | Spark, Trino, Flink, Dremio | Spark, Trino, Flink |
| Cloud Integration | AWS, Azure, GCP | AWS, Azure, GCP | AWS focus |
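The "Small File Handling" row matters because object stores and columnar readers pay a fixed per-file overhead, so table formats periodically rewrite many small files into a few large ones. A minimal sketch of the planning step, as greedy first-fit-decreasing bin-packing into roughly target-sized groups (the general idea behind `OPTIMIZE`-style rewrite jobs, not any engine's actual planner):

```python
"""Greedy compaction planner: group small files into bins of about
target_bytes each. Illustrative sketch, not Delta/Iceberg/Hudi's
actual compaction logic."""
from typing import Dict, List


def plan_compaction(file_sizes: Dict[str, int], target_bytes: int,
                    small_threshold: int) -> List[List[str]]:
    # Only files below the threshold are worth rewriting.
    small = sorted((f for f, s in file_sizes.items() if s < small_threshold),
                   key=file_sizes.get, reverse=True)
    bins: List[List[str]] = []
    bin_sizes: List[int] = []
    for f in small:
        # First-fit-decreasing: place in the first bin with room.
        for i, used in enumerate(bin_sizes):
            if used + file_sizes[f] <= target_bytes:
                bins[i].append(f)
                bin_sizes[i] += file_sizes[f]
                break
        else:
            bins.append([f])
            bin_sizes.append(file_sizes[f])
    # A bin with a single file needs no rewrite.
    return [b for b in bins if len(b) > 1]


files = {"a": 30, "b": 70, "c": 40, "d": 60, "e": 500}
print(plan_compaction(files, target_bytes=100, small_threshold=128))
# [['b', 'a'], ['d', 'c']] -- 'e' is already large enough to leave alone
```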
Platform Comparison
AWS Data Platform
┌──────────────────────────────────────────────────────────────────────────────┐
│ AWS DATA PLATFORM │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ INGESTION │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Kinesis │ │ MSK │ │ DMS │ │ AppFlow │ │
│ │ (Stream) │ │ (Kafka) │ │ (Database) │ │ (SaaS) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ STORAGE │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Amazon S3 │ │
│ │ ┌─────────────────────────────────────────────────────────────────┐ │ │
│ │ │ S3 Standard │ S3 Intelligent │ S3 Glacier │ S3 Glacier Deep │ │ │
│ │ │ (Hot data) │ Tiering │ (Archive) │ Archive │ │ │
│ │ └─────────────────────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │
│ PROCESSING │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ EMR │ │ Glue │ │ Athena │ │ Lambda │ │
│ │ (Spark) │ │ (Serverless)│ │ (Query) │ │ (Functions) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ WAREHOUSE │
│ ┌──────────────┐ ┌──────────────────────────────────────────────────────┐ │
│ │ Redshift │ │ Redshift Serverless / Redshift Spectrum │ │
│ │ (Cluster) │ │ (Query S3 directly) │ │
│ └──────────────┘ └──────────────────────────────────────────────────────┘ │
│ │
│ ML & ANALYTICS │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ SageMaker │ │ QuickSight │ │ Bedrock │ │ OpenSearch │ │
│ │ (ML) │ │ (BI) │ │ (Gen AI) │ │ (Search) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ GOVERNANCE │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Lake Form- │ │ Glue │ │ IAM / │ │ CloudTrail │ │
│ │ ation │ │ Catalog │ │ KMS │ │ (Audit) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
GCP Data Platform
┌──────────────────────────────────────────────────────────────────────────────┐
│ GCP DATA PLATFORM │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ INGESTION │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Pub/Sub │ │ Datastream │ │ Transfer │ │ Dataflow │ │
│ │ (Stream) │ │ (CDC) │ │ Service │ │ (Streaming) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ STORAGE │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Google Cloud Storage │ │
│ │ ┌─────────────────────────────────────────────────────────────────┐ │ │
│ │ │ Standard │ Nearline │ Coldline │ Archive │ │ │
│ │ │ (Hot) │ (Monthly) │ (Quarterly) │ (Rare access) │ │ │
│ │ └─────────────────────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │
│ PROCESSING │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Dataproc │ │ Dataflow │ │ Cloud │ │ Cloud │ │
│ │ (Spark) │ │ (Beam) │ │ Functions │ │ Run │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ WAREHOUSE │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │ BigQuery │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Native │ │ BigLake │ │ Omni │ │ │
│ │ │ Tables │ │ (External) │ │ (Multi-cloud)│ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ML & ANALYTICS │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Vertex AI │ │ Looker │ │ BQML │ │ AutoML │ │
│ │ (ML) │ │ (BI) │ │ (ML in BQ) │ │ (No-code) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ GOVERNANCE │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Dataplex │ │ Data │ │ IAM / │ │ Cloud │ │
│ │ │ │ Catalog │ │ VPC-SC │ │ Logging │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
Azure Data Platform
┌──────────────────────────────────────────────────────────────────────────────┐
│ AZURE DATA PLATFORM │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ INGESTION │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Event Hubs │ │ Kafka │ │ Data │ │ ADF │ │
│ │ (Stream) │ │ (HDInsight) │ │ Factory │ │ (Pipelines) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ STORAGE │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Azure Data Lake Storage Gen2 (ADLS) │ │
│ │ ┌─────────────────────────────────────────────────────────────────┐ │ │
│ │ │ Hot │ Cool │ Cold │ Archive │ │ │
│ │ │ (Frequent) │ (Monthly) │ (Quarterly)│ (Rare access) │ │ │
│ │ └─────────────────────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │
│ PROCESSING │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ HDInsight │ │ Databricks │ │ Azure │ │ Azure │ │
│ │ (Spark) │ │ │ │ Synapse │ │ Functions │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ WAREHOUSE │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │ Azure Synapse Analytics │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Dedicated │ │ Serverless │ │ Spark │ │ │
│ │ │ Pools │ │ SQL │ │ Pools │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ML & ANALYTICS │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Azure ML │ │ Power BI │ │ Cognitive │ │ OpenAI │ │
│ │ │ │ │ │ Services │ │ Service │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ GOVERNANCE │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Purview │ │ Unity │ │ Azure │ │ Monitor │ │
│ │ │ │ Catalog │ │ AD / RBAC │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
Platform Selection Matrix
| Factor | AWS | GCP | Azure |
|---|---|---|---|
| Best For | Broadest ecosystem, enterprise | Analytics/ML workloads, simplicity | Microsoft shops, hybrid |
| Warehouse | Redshift (good) | BigQuery (excellent) | Synapse (good) |
| Lakehouse | EMR + S3 + Lake Formation | BigLake + GCS | Databricks + ADLS |
| Streaming | Kinesis (good), MSK (Kafka) | Pub/Sub + Dataflow (excellent) | Event Hubs (good) |
| Serverless | Excellent (Lambda, Glue) | Excellent (Dataflow, Functions) | Good (Functions) |
| ML Platform | SageMaker (comprehensive) | Vertex AI (excellent) | Azure ML (good) |
| Governance | Lake Formation (maturing) | Dataplex (good) | Purview (good) |
| Cost Model | Complex but flexible | Simpler, slot-based | Complex |
| Enterprise Features | Excellent | Good | Excellent |
Reference Architecture: Enterprise Lakehouse
```hcl
# terraform/main.tf (pseudo-code for architecture reference)
#
# Enterprise Lakehouse reference architecture on AWS.
# Demonstrates production patterns for cloud data platforms.

# Network foundation
module "network" {
  source             = "./modules/network"
  vpc_cidr           = "10.0.0.0/16"
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]

  # Isolated subnets for data workloads
  private_subnets = {
    data_processing = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
    analytics       = ["10.0.11.0/24", "10.0.12.0/24", "10.0.13.0/24"]
  }

  # VPC endpoints for AWS services
  endpoints = ["s3", "glue", "athena", "kinesis", "secretsmanager"]
}

# Data lake storage
module "data_lake" {
  source      = "./modules/data-lake"
  bucket_name = "company-data-lake-prod"

  # Medallion architecture zones
  zones = {
    raw = {
      prefix     = "raw/"
      lifecycle  = "INTELLIGENT_TIERING"
      retention  = "7_YEARS"
      versioning = true
    }
    bronze = {
      prefix     = "bronze/"
      lifecycle  = "INTELLIGENT_TIERING"
      retention  = "3_YEARS"
      versioning = true
    }
    silver = {
      prefix     = "silver/"
      lifecycle  = "STANDARD"
      retention  = "2_YEARS"
      versioning = true
    }
    gold = {
      prefix     = "gold/"
      lifecycle  = "STANDARD"
      retention  = "1_YEAR"
      versioning = true
    }
  }

  # Encryption
  kms_key_id = module.security.data_encryption_key_id

  # Replication for disaster recovery
  replication = {
    enabled     = true
    destination = "us-west-2"
  }
}

# Data catalog and governance
module "governance" {
  source = "./modules/governance"

  # AWS Lake Formation for fine-grained access
  lake_formation = {
    data_lake_locations = [module.data_lake.bucket_arn]

    # Tag-based access control
    lf_tags = {
      pii         = ["true", "false"]
      department  = ["engineering", "finance", "marketing", "sales"]
      sensitivity = ["public", "internal", "confidential", "restricted"]
    }
  }

  # Glue Data Catalog
  glue_catalog = {
    databases = ["raw", "bronze", "silver", "gold"]
    crawlers = {
      bronze_crawler = {
        path     = "s3://company-data-lake-prod/bronze/"
        schedule = "cron(0 */4 * * ? *)" # Every 4 hours
      }
    }
  }
}

# Processing clusters
module "processing" {
  source = "./modules/processing"

  # EMR for batch processing
  emr_clusters = {
    batch_processing = {
      release         = "emr-6.10.0"
      applications    = ["Spark", "Hive", "Delta"]
      instance_type   = "r5.4xlarge"
      core_nodes      = 10
      task_nodes_min  = 0
      task_nodes_max  = 50
      spot_percentage = 80
    }
  }

  # Glue for serverless ETL
  glue_jobs = {
    bronze_ingestion = {
      worker_type  = "G.1X"
      workers      = 10
      glue_version = "4.0"
    }
  }
}

# Query and analytics
module "analytics" {
  source = "./modules/analytics"

  # Athena for ad-hoc queries
  athena = {
    workgroups = {
      analyst = {
        output_location          = "s3://company-athena-results/analyst/"
        bytes_scanned_cutoff     = 10737418240 # 10 GB
        enforce_workgroup_config = true
      }
      data_scientist = {
        output_location      = "s3://company-athena-results/data-science/"
        bytes_scanned_cutoff = 107374182400 # 100 GB
      }
    }
  }

  # Redshift for warehouse workloads
  redshift = {
    cluster_type    = "ra3.xlplus"
    number_of_nodes = 4

    # Serverless for variable workloads
    serverless = {
      enabled       = true
      base_capacity = 32 # RPUs
      max_capacity  = 256
    }

    # Spectrum for querying S3
    spectrum_role = module.governance.redshift_spectrum_role_arn
  }
}

# Streaming layer
module "streaming" {
  source = "./modules/streaming"

  # MSK for Kafka
  msk = {
    kafka_version   = "3.4.0"
    broker_nodes    = 6
    instance_type   = "kafka.m5.2xlarge"
    ebs_volume_size = 1000

    # Configuration
    auto_create_topics = false
    retention_hours    = 168 # 7 days
  }

  # Kinesis for serverless streaming
  kinesis = {
    streams = {
      clickstream = {
        shard_count     = 10
        retention_hours = 24
      }
    }
  }

  # Flink for stream processing
  flink = {
    applications = {
      realtime_aggregation = {
        parallelism = 10
      }
    }
  }
}

# Observability
module "observability" {
  source = "./modules/observability"

  # CloudWatch dashboards
  dashboards = ["data-platform-overview", "processing-jobs", "query-performance"]

  # Alarms
  alarms = {
    processing_failure = {
      metric    = "glue_job_failures"
      threshold = 1
      period    = 300
    }
    query_performance = {
      metric    = "athena_query_duration"
      threshold = 300000 # 5 minutes
      statistic = "p95"
    }
  }

  # Data quality monitoring
  data_quality = {
    enabled = true
    engine  = "great_expectations"
  }
}
```
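The Athena workgroup `bytes_scanned_cutoff` values above are easier to reason about in dollars. Athena bills per data scanned (roughly $5 per TB in us-east-1 at the time of writing; verify against current pricing), so a cutoff is effectively a per-query spend cap. A small helper to sanity-check the limits (the $5/TB figure is an assumption baked into the constant):

```python
"""Translate an Athena bytes-scanned cutoff into an approximate
worst-case per-query cost. Assumes ~$5.00 per TB scanned (us-east-1
list price at time of writing -- check current pricing)."""

PRICE_PER_TB_USD = 5.00
TB = 1024 ** 4


def max_query_cost(bytes_scanned_cutoff: int) -> float:
    """Worst-case cost of one query under this workgroup cutoff."""
    return bytes_scanned_cutoff / TB * PRICE_PER_TB_USD


# The two workgroups from the reference architecture:
print(round(max_query_cost(10_737_418_240), 4))   # analyst (10 GiB): 0.0488
print(round(max_query_cost(107_374_182_400), 4))  # data_scientist (100 GiB): 0.4883
```

In other words, the analyst cutoff caps a runaway query at about five cents; the real protection it buys is against full-table scans of multi-TB datasets.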
Multi-Cloud Strategies
Design for Portability
```python
# multicloud/abstractions.py
"""
Abstraction layer for multi-cloud data operations.
Enables portability across AWS, GCP, and Azure.
"""
from abc import ABC, abstractmethod
from enum import Enum
from typing import Any, Dict, List, Optional

import pandas as pd


class CloudProvider(Enum):
    AWS = "aws"
    GCP = "gcp"
    AZURE = "azure"


class ObjectStorage(ABC):
    """Abstract interface for object storage operations."""

    @abstractmethod
    def read_parquet(self, path: str) -> pd.DataFrame:
        ...

    @abstractmethod
    def write_parquet(self, df: pd.DataFrame, path: str,
                      partition_by: Optional[List[str]] = None) -> None:
        ...

    @abstractmethod
    def list_objects(self, prefix: str) -> List[str]:
        ...


class S3Storage(ObjectStorage):
    """AWS S3 implementation (requires boto3 plus s3fs/pyarrow)."""

    def __init__(self, bucket: str, region: str = "us-east-1"):
        import boto3
        self.bucket = bucket
        self.region = region
        self._client = boto3.client("s3", region_name=region)

    def read_parquet(self, path: str) -> pd.DataFrame:
        return pd.read_parquet(f"s3://{self.bucket}/{path}")

    def write_parquet(self, df, path, partition_by=None) -> None:
        df.to_parquet(f"s3://{self.bucket}/{path}", partition_cols=partition_by)

    def list_objects(self, prefix: str) -> List[str]:
        paginator = self._client.get_paginator("list_objects_v2")
        keys: List[str] = []
        for page in paginator.paginate(Bucket=self.bucket, Prefix=prefix):
            keys.extend(obj["Key"] for obj in page.get("Contents", []))
        return keys


class GCSStorage(ObjectStorage):
    """Google Cloud Storage implementation (requires google-cloud-storage, gcsfs)."""

    def __init__(self, bucket: str, project: str):
        from google.cloud import storage
        self.bucket = bucket
        self.project = project
        self._client = storage.Client(project=project)

    def read_parquet(self, path: str) -> pd.DataFrame:
        return pd.read_parquet(f"gs://{self.bucket}/{path}")

    def write_parquet(self, df, path, partition_by=None) -> None:
        df.to_parquet(f"gs://{self.bucket}/{path}", partition_cols=partition_by)

    def list_objects(self, prefix: str) -> List[str]:
        blobs = self._client.list_blobs(self.bucket, prefix=prefix)
        return [blob.name for blob in blobs]


class ADLSStorage(ObjectStorage):
    """Azure Data Lake Storage Gen2 implementation (requires adlfs)."""

    def __init__(self, account: str, container: str):
        self.account = account
        self.container = container

    def _full_path(self, path: str) -> str:
        return (f"abfss://{self.container}@{self.account}"
                f".dfs.core.windows.net/{path}")

    def read_parquet(self, path: str) -> pd.DataFrame:
        return pd.read_parquet(self._full_path(path))

    def write_parquet(self, df, path, partition_by=None) -> None:
        df.to_parquet(self._full_path(path), partition_cols=partition_by)

    def list_objects(self, prefix: str) -> List[str]:
        from adlfs import AzureBlobFileSystem
        fs = AzureBlobFileSystem(account_name=self.account)
        return fs.ls(f"{self.container}/{prefix}")


def get_storage(provider: CloudProvider, config: Dict[str, Any]) -> ObjectStorage:
    """Factory for cloud storage implementations."""
    if provider == CloudProvider.AWS:
        return S3Storage(config["bucket"], config.get("region", "us-east-1"))
    elif provider == CloudProvider.GCP:
        return GCSStorage(config["bucket"], config["project"])
    elif provider == CloudProvider.AZURE:
        return ADLSStorage(config["account"], config["container"])
    raise ValueError(f"Unknown provider: {provider}")
```
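One practical payoff of the abstraction is testability: pipeline code written only against the storage interface can run in unit tests against an in-memory fake, with a real cloud implementation swapped in via the factory in production. A self-contained sketch (the `InMemoryStorage` class and `promote_to_silver` step are hypothetical test scaffolding, not part of the layer above):

```python
"""In-memory stand-in for an object storage interface, for testing
pipeline logic without cloud credentials. Hypothetical scaffolding
to illustrate the portability pattern; stores objects in a dict."""
from typing import Dict, List


class InMemoryStorage:
    """Duck-typed fake exposing write/read/list_objects."""

    def __init__(self):
        self._objects: Dict[str, bytes] = {}

    def write(self, path: str, data: bytes) -> None:
        self._objects[path] = data

    def read(self, path: str) -> bytes:
        return self._objects[path]

    def list_objects(self, prefix: str) -> List[str]:
        return sorted(k for k in self._objects if k.startswith(prefix))


def promote_to_silver(storage, bronze_prefix: str) -> List[str]:
    """Example pipeline step written only against the storage interface."""
    promoted = []
    for path in storage.list_objects(bronze_prefix):
        data = storage.read(path)
        new_path = path.replace("bronze/", "silver/", 1)
        storage.write(new_path, data)
        promoted.append(new_path)
    return promoted


fake = InMemoryStorage()
fake.write("bronze/orders/part-0", b"...")
fake.write("bronze/orders/part-1", b"...")
print(promote_to_silver(fake, "bronze/orders/"))
# ['silver/orders/part-0', 'silver/orders/part-1']
```

The same `promote_to_silver` runs unchanged against S3, GCS, or ADLS once a real implementation with the same method shape is passed in.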
Multi-Cloud Data Mesh
┌──────────────────────────────────────────────────────────────────────────────┐
│ MULTI-CLOUD DATA MESH │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────┐ │
│ │ FEDERATED CATALOG │ │
│ │ (Unity Catalog / Dataplex) │ │
│ └──────────────┬──────────────────┘ │
│ │ │
│ ┌────────────────────────┼────────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │
│ │ AWS DOMAIN │ │ GCP DOMAIN │ │ AZURE DOMAIN │ │
│ │ │ │ │ │ │ │
│ │ ┌────────────┐ │ │ ┌────────────┐ │ │ ┌────────────┐ │ │
│ │ │ Customer │ │ │ │ Analytics │ │ │ │ Finance │ │ │
│ │ │ Data │ │ │ │ Data │ │ │ │ Data │ │ │
│ │ │ Products │ │ │ │ Products │ │ │ │ Products │ │ │
│ │ └────────────┘ │ │ └────────────┘ │ │ └────────────┘ │ │
│ │ │ │ │ │ │ │
│ │ ┌────────────┐ │ │ ┌────────────┐ │ │ ┌────────────┐ │ │
│ │ │ S3 + │ │ │ │ GCS + │ │ │ │ ADLS + │ │ │
│ │ │ Delta │ │ │ │ Iceberg │ │ │ │ Delta │ │ │
│ │ └────────────┘ │ │ └────────────┘ │ │ └────────────┘ │ │
│ └──────────────────┘ └──────────────────┘ └──────────────────┘ │
│ │ │ │ │
│ └────────────────────────┼────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────┐ │
│ │ CROSS-CLOUD QUERY ENGINE │ │
│ │ (Starburst / Dremio) │ │
│ └─────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
Cost Optimization
Storage Tiering Strategy
```python
# cost_optimization/storage_tiering.py
"""
Automated storage tiering for cost optimization.
Moves data between tiers based on access patterns.
"""
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class TieringPolicy:
    """Storage tiering policy configuration."""
    hot_to_warm_days: int = 30
    warm_to_cold_days: int = 90
    cold_to_archive_days: int = 365
    hot_tier: str = "STANDARD"
    warm_tier: str = "STANDARD_IA"
    cold_tier: str = "GLACIER_IR"
    archive_tier: str = "DEEP_ARCHIVE"


class StorageTieringManager:
    """Manages automatic storage tiering based on access patterns."""

    def __init__(self, s3_client, policy: Optional[TieringPolicy] = None):
        self.s3 = s3_client
        self.policy = policy or TieringPolicy()

    def analyze_access_patterns(self, bucket: str, prefix: str) -> Dict:
        """Analyze object access patterns for tiering decisions.

        In production this would query S3 Storage Lens or access logs;
        the hard-coded result below stands in for that analysis.
        """
        return {
            'total_objects': 1_000_000,
            'total_size_bytes': 10 * 1024**4,  # 10 TiB
            'by_last_access': {
                'hot': {'count': 50_000, 'size': 500 * 1024**3},       # < 30 days
                'warm': {'count': 200_000, 'size': 2 * 1024**4},       # 30-90 days
                'cold': {'count': 400_000, 'size': 4 * 1024**4},       # 90-365 days
                'archive': {'count': 350_000, 'size': 3.5 * 1024**4},  # > 365 days
            }
        }

    def calculate_cost_savings(self, analysis: Dict) -> Dict:
        """Calculate potential cost savings from optimal tiering."""
        # AWS S3 pricing (us-east-1, approximate, per GB-month)
        pricing = {
            'STANDARD': 0.023,
            'STANDARD_IA': 0.0125,
            'GLACIER_IR': 0.004,
            'DEEP_ARCHIVE': 0.00099,
        }
        by_access = analysis['by_last_access']

        # Current cost (everything in STANDARD)
        total_size_gb = analysis['total_size_bytes'] / 1024**3
        current_cost = total_size_gb * pricing['STANDARD']

        # Optimized cost with each access band in its appropriate tier
        optimized_cost = (
            by_access['hot']['size'] / 1024**3 * pricing['STANDARD'] +
            by_access['warm']['size'] / 1024**3 * pricing['STANDARD_IA'] +
            by_access['cold']['size'] / 1024**3 * pricing['GLACIER_IR'] +
            by_access['archive']['size'] / 1024**3 * pricing['DEEP_ARCHIVE']
        )
        return {
            'current_monthly_cost': current_cost,
            'optimized_monthly_cost': optimized_cost,
            'monthly_savings': current_cost - optimized_cost,
            'annual_savings': (current_cost - optimized_cost) * 12,
            'savings_percentage': (1 - optimized_cost / current_cost) * 100,
        }

    def apply_lifecycle_rules(self, bucket: str) -> None:
        """Apply S3 lifecycle rules implementing the tiering policy."""
        lifecycle_config = {
            'Rules': [
                {
                    'ID': 'TierToWarm',
                    'Filter': {'Prefix': ''},
                    'Status': 'Enabled',
                    'Transitions': [{
                        'Days': self.policy.hot_to_warm_days,
                        'StorageClass': self.policy.warm_tier,
                    }],
                },
                {
                    'ID': 'TierToCold',
                    'Filter': {'Prefix': ''},
                    'Status': 'Enabled',
                    'Transitions': [{
                        'Days': self.policy.warm_to_cold_days,
                        'StorageClass': self.policy.cold_tier,
                    }],
                },
                {
                    'ID': 'TierToArchive',
                    'Filter': {'Prefix': ''},
                    'Status': 'Enabled',
                    'Transitions': [{
                        'Days': self.policy.cold_to_archive_days,
                        'StorageClass': self.policy.archive_tier,
                    }],
                },
            ]
        }
        self.s3.put_bucket_lifecycle_configuration(
            Bucket=bucket,
            LifecycleConfiguration=lifecycle_config,
        )
```
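Running the per-band sizes from `analyze_access_patterns` through the pricing table makes the payoff concrete. The arithmetic below uses the same approximate us-east-1 list prices as the class above and, like it, ignores retrieval and transition charges, which reduce real-world savings:

```python
"""Worked example of the tiering math: ~10 TiB all in S3 Standard
versus tiered by last access. Approximate us-east-1 prices; ignores
retrieval/transition fees."""

pricing = {"STANDARD": 0.023, "STANDARD_IA": 0.0125,
           "GLACIER_IR": 0.004, "DEEP_ARCHIVE": 0.00099}

sizes_gib = {"STANDARD": 500,        # hot: accessed < 30 days ago
             "STANDARD_IA": 2048,    # warm: 30-90 days
             "GLACIER_IR": 4096,     # cold: 90-365 days
             "DEEP_ARCHIVE": 3584}   # archive: > 365 days

current = sum(sizes_gib.values()) * pricing["STANDARD"]
optimized = sum(gib * pricing[tier] for tier, gib in sizes_gib.items())

print(round(current, 2))                          # 235.24 USD/month, all-Standard
print(round(optimized, 2))                        # 57.03 USD/month, tiered
print(round((1 - optimized / current) * 100, 1))  # 75.8 (% saved)
```

This roughly three-quarters reduction is where the "50%+" savings claim in the takeaways below comes from; actual numbers depend on access patterns and retrieval fees.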
Key Takeaways
- Lakehouse is the future: the combination of open table formats (Delta, Iceberg) with object storage provides the best of data lakes and warehouses.
- Choose platforms based on fit, not hype: AWS offers the broadest ecosystem, GCP excels at analytics/ML, and Azure integrates best with Microsoft shops.
- Design for portability: use open formats and abstraction layers. Multi-cloud is increasingly the enterprise reality.
- Governance is foundational: don't bolt on security and governance after the fact; build them into the architecture from day one.
- Optimize costs continuously: storage tiering, reserved capacity, and serverless options can reduce costs by 50%+ when properly managed.
- Serverless isn't always cheaper: understand the cost model. Serverless excels at variable workloads but can be expensive at scale.

