Course overview

How to Design Cloud Platforms

41 modules
169 lessons

Part 1

Foundations: What a Cloud Platform Is

  1. Foundations: What a Cloud Platform Is

  2. Why "Service to Serverless"

  3. How to Use This Course

  4. The Incremental Ladder (Step 0 to Step 7)

  5. The Course Lenses

  6. Diagram Legend and Notation Types

  7. From Datacenters to "As-a-Service" Abstractions

  8. IaaS, PaaS, and FaaS as a Responsibility Spectrum

  9. Shared Responsibility as a Boundary Contract

Part 2

Virtualization Fundamentals for a “Mini-Cloud”

  1. Virtualization Fundamentals for a “Mini-Cloud”

  2. Hypervisors as the First Isolation Boundary

  3. VMs, Images, and Snapshots as a Packaging and Lifecycle Model

  4. Overcommit, Live Migration, and Early Storage Boundaries

Part 3

Containers as the Density Layer (on VMs)

  1. Containers as the Density Layer (on VMs)

  2. Containers vs VMs: Isolation and Lifecycle Tradeoffs

  3. Images, Layers, and Registries as Distribution Boundaries

  4. Manual Scheduling and Handcrafted Cluster Configs: What Can Fail Together

Part 4

Baby SDN: Virtual Networks and Security Groups

  1. Baby SDN: Virtual Networks and Security Groups

  2. Virtual Networks, Subnets, Routing, and Internet Egress

  3. Security Groups as Coarse Firewall Rules and Policy Attachment Points

  4. Load Balancers at a Conceptual Level: Service IPs and Failure Masking

Part 5

First Control Plane: API + UI for a Tiny Cloud

  1. First Control Plane: API + UI for a Tiny Cloud

  2. Create and Delete VMs and Networks via API: The First System of Record

  3. Dashboard and CLI as Control-Plane Clients: Shaping Safe Workflows

  4. State Stores as "Source of Truth": Desired State, Drift, and Repair

Part 6

IAM Fundamentals: Identities, Roles, Policies

  1. IAM Fundamentals: Identities, Roles, Policies

  2. Users vs Service Identities: Who Can Act, and How You Attribute Actions

  3. RBAC and Scopes: Mapping Roles and Permissions to Resources

  4. Policy Evaluation at a High Level: Enforcement Points and Failure Modes

Part 7

Resource Hierarchies and Delegated Administration

  1. Resource Hierarchies and Delegated Administration

  2. Orgs to Folders to Projects to Resources: The Governance Tree

  3. Permission Inheritance and Scoping: Minimizing Blast Radius by Default

  4. Delegated Administration: Operating at Scale Without Central Bottlenecks

Part 8

Auth, Federation, SSO, and Credential Strategy

  1. Auth, Federation, SSO, and Credential Strategy

  2. Federation with External IdPs (Conceptual): Identity Boundaries and Trust

  3. SSO to Console and APIs: Consistent Authentication Paths

  4. Short-Lived Credentials and Key Management (High Level): Reducing Standing Access

Part 9

Audit Logging and Governance Hooks

  1. Audit Logging and Governance Hooks

  2. Control-Plane Audit Logs: Who/What/Where/When, and Why It Matters Operationally

  3. SIEM Integration Concepts: Turning Events into Detection and Response

  4. Guardrails vs Hard Blocks: Governance Policies as Platform Contracts

Part 10

Tenancy Models and Shared Responsibility

  1. Tenancy Models and Shared Responsibility

  2. Physical vs Logical Tenancy: Choosing Where Isolation Lives

  3. Account and Project Boundaries: Scoping Identity and Resources

  4. Soft vs Hard Multitenancy: Tradeoffs in Cost, Risk, and Operability

Part 11

Isolation Boundaries: Network, Compute, Storage

  1. Isolation Boundaries: Network, Compute, Storage

  2. Network Isolation: VPC-Like Constructs, Peering, and Private Connectivity

  3. Compute Isolation: Quotas, Noisy Neighbors, and Fairness Controls

  4. Storage Isolation: Encryption and Per-Tenant Key Ideas (Conceptual)

Part 12

Multi-Tenant Data Planes and Blast Radius

  1. Multi-Tenant Data Planes and Blast Radius

  2. Shared Control Plane; Shared vs Per-Tenant Data Planes: Choosing Boundaries

  3. Designing Safe Shared Services: Gateways, Queues, and Databases (Conceptual)

  4. Blast Radius Controls: Compartmentalization and Graceful Degradation

Part 13

Compliance Zones and Regulated Tenants

  1. Compliance Zones and Regulated Tenants

  2. Data Residency and Compliance Zones: Partitioning the Platform

  3. Baselines and Blueprints per Tenant: Repeatable Controls

  4. Evidence and Auditability: Proving What Happened and What Is Enforced

Part 14

Multitenancy in Serverless and Managed Services

  1. Multitenancy in Serverless and Managed Services

  2. Hidden Sharing in Serverless Runtimes: Where Tenants Meet

  3. Fairness and Tenant SLOs in Shared Compute Pools

  4. Isolation Constraints for Managed Services: What Cannot Be Customized

Part 15

Cluster Managers and Scheduling Basics

  1. Cluster Managers and Scheduling Basics

  2. Node Pools and Capacity Pools: Defining Placement Domains

  3. Binpacking, Anti-Affinity, and Guarantees: Scheduling Goals and Failure Tradeoffs

  4. Queueing and Priorities: Who Gets Capacity Under Contention

Part 16

Declarative Control Planes and Reconciliation Loops

  1. Declarative Control Planes and Reconciliation Loops

  2. Desired vs Observed State: What Declarative Means Operationally

  3. Controllers and Reconcilers: Convergence, Retries, and Backoff

  4. Versioned APIs and Compatibility: Evolving a Platform Safely

Part 17

Service Discovery and Load Balancing

  1. Service Discovery and Load Balancing

  2. DNS and Service Registries: Naming as a Platform Boundary

  3. L4 vs L7 Load Balancing (Conceptual): Where Policy and Retries Live

  4. Health Checks and Out-of-Rotation Behavior: Isolating Partial Failure

Part 18

Deployments, Rollouts, and Runtime Management

  1. Deployments, Rollouts, and Runtime Management

  2. Rolling, Blue-Green, and Canary: Change as a Controlled Experiment

  3. Config as Data: Secrets, Config Maps, and Flags as Boundaries

  4. Draining, Pausing, and Rescheduling: Coordinating with Load and State

Part 19

Serverless as Higher-Order Orchestration

  1. Serverless as Higher-Order Orchestration

  2. Functions and Jobs plus Triggers: Execution as a Managed Boundary

  3. Event-Based Autoscaling: Signals, Backpressure, and Thundering Herds

  4. Limits of Serverless Abstractions: What You Still Have to Design

Part 20

Observability at Platform Scale

  1. Observability at Platform Scale

  2. Metrics, Logs, Traces, and Events: Four Signals and What They Answer

  3. Standardizing Telemetry Across Services: Consistency for Operators

  4. Cardinality and Observability Cost: When Visibility Becomes a Budget Risk

Part 21

SLOs, SLIs, and Error Budgets

  1. SLOs, SLIs, and Error Budgets

  2. Platform SLOs vs Customer SLOs: Setting Realistic Contracts

  3. Selecting SLIs: Latency, Availability, Correctness, and What They Hide

  4. Error Budgets: Decision Tools for Balancing Change and Stability

Part 22

Failure Domains and Resilient Topologies

  1. Failure Domains and Resilient Topologies

  2. Zones, Regions, and Global Control Planes: Naming Correlated Failure

  3. Multi-AZ and Multi-Region Patterns: Redundancy as a Boundary Choice

  4. Latency vs Resilience: Understanding What You Pay to Reduce Blast Radius

Part 23

Incident Response and On-Call for Platforms

  1. Incident Response and On-Call for Platforms

  2. On-Call Patterns for Shared Platforms: Routing and Ownership

  3. Runbooks, Playbooks, and Automation: Scaling Response Without Heroics

  4. Post-Incident Reviews and Learning Loops: Turning Incidents Into Design Changes

Part 24

Chaos Engineering and Failure Injection

  1. Chaos Engineering and Failure Injection

  2. Host, Network, and API Failure Modes: Choosing Experiments That Matter

  3. Validating Autoscaling, Failover, and Self-Healing: Measuring the Control Loops

  4. Folding Results Back Into Design: Resilience as a Continuous Refactor

Part 25

Capacity Planning Fundamentals

  1. Capacity Planning Fundamentals

  2. Demand Modeling, Utilization, and Headroom: Translating Uncertainty into Buffers

  3. Compute, Storage, and Network Capacity: Understanding Independent Bottlenecks

  4. Procurement and Buffer Strategy: Lead Time as a Failure Domain

Part 26

Allocation, Reservations, and Overcommit

  1. Allocation, Reservations, and Overcommit

  2. Reserved vs Best-Effort Resources: Setting Tenant Expectations

  3. CPU and Memory Overcommit Risks: Efficiency vs Correlated Failure

  4. Placement and Anti-Noisy-Neighbor Controls: Guardrails for Fairness

Part 27

Metering, Pricing, and Billing Pipelines

  1. Metering, Pricing, and Billing Pipelines

  2. Billing Units: Choosing What You Measure and What Customers Can Predict

  3. Metering Pipelines and Aggregation: Turning Events Into Durable Usage

  4. Invoices and Showback/Chargeback: Trust, Disputes, and Corrections

Part 28

Cost Controls and Optimization

  1. Cost Controls and Optimization

  2. Budgets, Alerts, and Quotas: Cost as a Control Plane Contract

  3. Rightsizing, Autoscaling, and Capacity Optimization: Feedback Loops for Waste

  4. Reserved and Discount Program Patterns: Incentives and Lock-In Trade-offs

Part 29

Sustainability and Efficiency as Platform Requirements

  1. Sustainability and Efficiency as Platform Requirements

  2. Energy and Carbon Considerations (Conceptual): Externalities as Constraints

  3. Hardware Refresh and Density Strategy: Keeping the Fleet Efficient

  4. "Cost Efficiency" as a Platform SLO: Operationalizing Efficiency

Part 30

Cloud API Design Patterns

  1. Cloud API Design Patterns

  2. Resource-Oriented APIs: Modeling the Platform as a Graph of Resources

  3. Idempotency, Pagination, Long-Running Operations: Building Safe Clients

  4. Versioning and Deprecation: Changing Contracts Without Breaking Tenants

Part 31

CLIs, SDKs, and Developer Experience

  1. CLIs, SDKs, and Developer Experience

  2. CLI Command Structure and Auth UX: Avoiding Sharp Edges

  3. SDK Design: Thin Wrappers vs Higher-Level Abstractions

  4. Docs, Samples, and Golden Paths: Teaching the Intended Workflow

Part 32

Service Catalogs and Internal Marketplaces

  1. Service Catalogs and Internal Marketplaces

  2. Service Catalogs and Templates: Productizing Internal Capabilities

  3. Provisioning Flows With Policy Checks: Self-Service With Guardrails

  4. Tagging, Ownership Metadata, and Discovery: Making Operations Tractable

Part 33

External Marketplaces and Partner Ecosystems

  1. External Marketplaces and Partner Ecosystems

  2. Third-Party Listings and Integrations: Expanding the Platform Safely

  3. Billing and Entitlement Integration: Turning Usage into Agreements

  4. Trust and Security Vetting: Controlling Supply Chain Risk

Part 34

Governance and Policy Engines

  1. Governance and Policy Engines

  2. Org-Wide Policies (Locations, Types, Sizes): Guardrails for Sprawl

  3. Constraint Frameworks (Conceptual): Evaluating Policy at Scale

  4. Governance With Good DX: Exceptions, Previews, and Safe Defaults

Part 35

Global Platforms: Regions, Zones, and Control Plane Consistency

  1. Global Platforms: Regions, Zones, and Control Plane Consistency

  2. Region/Zone Design: Partitioning for Availability and Latency

  3. Global APIs vs Per-Region Endpoints: Routing, Failover, and Contract Boundaries

  4. Consistency Models for Control Planes: What "Global State" Can Mean

Part 36

Core Service Families: Compute, Storage, Network

  1. Core Service Families: Compute, Storage, Network

  2. Compute Services: VMs, Containers, and Serverless as Products and Boundaries

  3. Storage Services: Block, Object, File, and Databases from a Platform View

  4. Network Services: Routing, Load Balancing, and Private Connectivity

Part 37

Data and AI Platforms as “Platforms on the Platform”

  1. Data and AI Platforms as “Platforms on the Platform”

  2. Data Warehouses, Lakes, and Streaming (Conceptual): Shared Services With Strong Contracts

  3. Managed ML Services Patterns: Multi-Tenant Training and Inference Boundaries

  4. Governance and Lineage Hooks: Observing Data Movement Across Domains

Part 38

Cloud Provider Security Posture

  1. Cloud Provider Security Posture

  2. Provider Internal Security Responsibilities: Securing the Platform Itself

  3. Tenant Isolation, Key Management, and Identity at Scale: Designing for Least Trust

  4. Threat Modeling and Red Teams: Testing the Boundaries You Claim

Part 39

Operating the Organization: Platform Teams and Rollouts

  1. Operating the Organization: Platform Teams and Rollouts

  2. Platform Org Structures and Product Lines: Aligning Ownership With Boundaries

  3. Change Management, Rollout Waves, and Feature Flags: Reducing Correlated Failure

  4. Customer Comms: Status Pages, SLAs, and Public Postmortems

Part 40

Evolution and Next-Gen Platforms

  1. Evolution and Next-Gen Platforms

  2. Backward Compatibility: Introducing New Abstractions Without Breaking Tenants

  3. Deprecation and Migration Playbooks: Operationalizing Change Over Years

  4. Long-Term Bets: Serverless, Edge, and the Next Boundary Shift

Part 41

Capstone: Design a Cloud Platform Slice

  1. Capstone: Design a Cloud Platform Slice

  2. Capstone Problem Statement and Constraints

  3. Deliverables: Diagrams and a "Platform Spec" Document

  4. Self-Review and Evaluation Rubric: Checking Boundary Claims Against Failures