Course overview

How to design Real-Time & Streaming Data Systems

52 modules
210 lessons
Part 1

Appendices

  1. Appendix A - Diagram Templates by Step

  2. Appendix B - Mapping Concepts to Real-World Streaming Stacks

  3. Appendix C - Readiness Checklists

  4. Appendix D - Glossary

Part 2

Course Setup and the Incremental Ladder

  1. Course Setup and the Incremental Ladder

  2. Why Signals to Streams

  3. How to Use This Course

  4. The Incremental Ladder (Step 0 to Step 7)

  5. The Course Lenses

  6. Diagram Legend and Notation Types

Part 3

Signals, Events, and Streams

  1. Signals, Events, and Streams

  2. Signals vs Events

  3. Streams vs Batch

  4. The Core Loop (Producer to Decision)

Part 4

Time in Data Systems

  1. Time in Data Systems

  2. Event Time vs Processing Time

  3. Clocks and Skew

  4. Time Semantics as a Contract

Part 5

Ordering, Lateness, and Out-of-Order Data

  1. Ordering, Lateness, and Out-of-Order Data

  2. Ordering Realities: Per-Partition Ordering vs the Myth of Global Order

  3. Lateness and Out-of-Order: Late Arrivals, Backfills, and What "Correct" Means in Motion

  4. Watermarks (Conceptually): Deciding "How Late Is Too Late" and Operationalizing That Decision

Part 6

From Logs and Metrics to Streams

  1. From Logs and Metrics to Streams

  2. Telemetry as Streams: Logs, Metrics, Traces as Time-Series and Event Streams

  3. Aggregation as Streaming: Metrics Rollups, Downsampling, and Near-Real-Time Views

  4. Streaming Boundaries: Where Observability Pipelines Resemble Product Event Pipelines (and Where They Differ)

Part 7

Diagramming Real-Time Data Systems

  1. Diagramming Real-Time Data Systems

  2. Timeline Diagrams: Event Sequences, Causality, and Ordering Assumptions

  3. Producer-Broker-Consumer Flows: Queues and Logs as System Topologies

  4. Processing DAGs and Stateful Operators: How to Draw Streaming Logic and Its State

Part 8

Step 0 Modeling: Time-Series and Event Shapes

  1. Step 0 Modeling: Time-Series and Event Shapes

  2. Modeling Time-Series: Measurements, Tags/Labels, Values, and Cardinality Risk

  3. Modeling Events: Keys, Timestamps, Payloads, and Append-Only Thinking

  4. Time-Series vs Events vs Logs: Choosing the Right Mental Model for the Question

Part 9

Step 0 Sampling, Granularity, and Aggregation

  1. Step 0 Sampling, Granularity, and Aggregation

  2. Sampling and Downsampling: Choosing Resolution Without Destroying Meaning

  3. Buckets and Rolling Aggregates: Time-Based Grouping and "Sliding" Interpretations

  4. Storage/Query Trade-offs: Write Patterns, Query Patterns, and Retention Constraints

Part 10

Step 0 Minimal Stream Architecture

  1. Step 0 Minimal Stream Architecture

  2. One Producer, One Consumer: The Smallest Working Stream

  3. Minimal Stores: Simple Append, Simple Reads, Simple Replays

  4. Early Failure Modes: Timestamp Mistakes, Inconsistent Keys, and Accidental Distributed Time

Part 11

Step 1 Messaging Models

  1. Step 1 Messaging Models

  2. Point-to-Point vs Pub/Sub: Distribution, Fan-Out, and Coordination Costs

  3. Competing Consumers: Work Distribution and Concurrency Boundaries

  4. Messaging Anti-Patterns: Request/Response Misuse and Coupling That Breaks Scaling

Part 12

Step 1 Queues vs Append-Only Logs

  1. Step 1 Queues vs Append-Only Logs

  2. Queue Semantics: Destructive Consumption and "Who Owns the Work"

  3. Log Semantics: Immutable Append, Offsets, and Replay as a Feature

  4. Choosing the Model: Latency, Durability, Reprocessing, and Cost Trade Spaces

Part 13

Step 1 Producers, Consumers, and Delivery Semantics

  1. Step 1 Producers, Consumers, and Delivery Semantics

  2. Producer Responsibilities: Batching, Retry, Serialization, and Backoff

  3. Consumer Responsibilities: Checkpoints, Idempotency, and Safe Restarts

  4. At-Most vs At-Least Once: What "Duplicates" Mean to Your Downstream

Part 14

Step 1 Offsets, Replay, and Reprocessing

  1. Step 1 Offsets, Replay, and Reprocessing

  2. Offsets as Positions: Why "Where Am I?" Is the Core Recovery Primitive

  3. Rewind and Rebuild: Reconstructing Views From Logs and Designing for Repair

  4. Reprocessing as Normal: Backfills, Bug Fixes, and Schema Evolution

Part 15

Step 1 Message and Event Schema Design

  1. Step 1 Message and Event Schema Design

  2. Event Envelopes: Metadata, Correlation IDs, Trace IDs, and Routing Fields

  3. Compatibility Over Time: Forward/Backward Evolution and Versioning Discipline

  4. Schema Registries (Conceptual): Governance Mechanisms for Change at Scale

Part 16

Streaming Platforms as Distributed Logs

  1. Streaming Platforms as Distributed Logs

  2. Topics and Partitions: Scaling Throughput via Parallel Streams

  3. Replication and Durability: Why "A Log" Becomes "A System"

  4. Control Plane Roles: Brokers, Controllers, and Coordination (Conceptual)

Part 17

Consumer Groups and Parallelism

  1. Consumer Groups and Parallelism

  2. Consumer Groups: Scaling Reads While Preserving Per-Partition Order

  3. Assignment and Rebalancing: Why Coordination Events Cause Latency Spikes

  4. Designing for Rebalance: Idempotency, Warm Caches, and Safe Pauses

Part 18

Retention, Compaction, and Storage Policies

  1. Retention, Compaction, and Storage Policies

  2. Retention as a Product Choice: Time/Size Policies and How Far Back Can We Repair?

  3. Compaction as a Pattern: Latest-Value-Per-Key vs Full History

  4. Storage Planning: Tiering, Cost Controls, and Long-Lived Streams

Part 19

Acknowledgments and Delivery Guarantees

  1. Acknowledgments and Delivery Guarantees

  2. Producer Acks: Durability vs Latency vs Cost

  3. Committing Offsets: Recovery Behavior and Failure Windows

  4. Guarantee Trade-offs: The Practical Meaning of "Durable Enough"

Part 20

Operating Streaming Clusters

  1. Operating Streaming Clusters

  2. Capacity Planning: Throughput, Partitions, Replication, and Disk

  3. Upgrades and Scaling: Operational Risk, Rolling Changes, and Traffic Safety

  4. Multi-Tenancy: Quotas, Isolation, and Noisy-Neighbor Problems

Part 21

Managed vs Self-Managed Streaming Services

  1. Managed vs Self-Managed Streaming Services

  2. Responsibility Split: What You Keep vs What the Provider Owns

  3. Cost and Lock-In: Portability, Hybrid Patterns, and Migration Planning

  4. Operating Model Fit: Team Skill, Reliability Needs, and Compliance Boundaries

Part 22

Batch vs Streaming vs Micro-Batch

  1. Batch vs Streaming vs Micro-Batch

  2. Streaming as "Always-On Computation": Why Latency Budgets Shape Everything

  3. Micro-Batch Models: Throughput Wins, Latency Costs, and Operational Simplicity

  4. Choosing the Model: Workload Shape, Cost, and Correctness Expectations

Part 23

Stateless Stream Processing

  1. Stateless Stream Processing

  2. Core Operators: Map/Filter/Route/Enrich and Simple Aggregates

  3. Lookups and Enrichment: Joining Against Reference Data Without Building State Machines Accidentally

  4. Idempotent Transforms: Safe Replays and Repeatable Outputs

Part 24

Processing Topologies and DAGs

  1. Processing Topologies and DAGs

  2. DAGs as the Mental Model: Sources -> Operators -> Sinks

  3. Partitioning Through the Graph: Where Keys Change and Costs Spike

  4. Operator Boundaries: Coupling, Isolation, and Debugging Ergonomics

Part 25

What Frameworks Provide

  1. What Frameworks Provide

  2. Runtime Scheduling and Execution: What "the Framework" Does for You

  3. Connectors and Integrations: Sources and Sinks as First-Class Reliability Boundaries

  4. Preview of State: Why State Management Is the Hard Part You Are About to Inherit

Part 26

Designing Stream Processing Jobs

  1. Designing Stream Processing Jobs

  2. Decomposing Problems into Operators: Choosing Responsibilities That Survive Change

  3. Debuggability by Design: Observability Hooks and Introspection Points

  4. Operational Readiness: Failure Modes, Restart Behavior, and Safe Rollouts

Part 27

State in Streaming Systems

  1. State in Streaming Systems

  2. Per-Key State: The Unit of Scalability (and the Unit of Pain)

  3. Aggregates and Summaries: Building "Live Views" over Streams

  4. State Growth and Eviction: TTLs, Compaction, and What Happens When State Never Dies

Part 28

Windowing Concepts

  1. Windowing Concepts

  2. Tumbling, Sliding, and Hopping Windows: Three Ways to Slice Time (and Three Ways to Confuse People)

  3. Session Windows: Gap-Based Grouping and Behavioral Interpretation

  4. Late Events and Completion: Allowed Lateness and When Results Become "Final Enough"

Part 29

Streams–Tables Duality

  1. Streams–Tables Duality

  2. Stream as Changelog: Turning Events into a Table of "Current Truth"

  3. Materialized Views: Serving Real-Time Queryable State

  4. Rebuild and Repair: Correctness via Replay and Deterministic Reconstruction

Part 30

Joins in Streaming

  1. Joins in Streaming

  2. Stream-Stream Joins: Windows, State, and Time Alignment

  3. Stream-Table Joins: Enrichment and Dimension Lookups in Motion

  4. Watermarks and State Costs: How "Correctness" Consumes Memory

Part 31

Stateful Operator Design

  1. Stateful Operator Design

  2. Safe State Handling: Snapshots, Migrations, and Rebalances

  3. Scaling Stateful Operators: Partitioning Strategies and Hotspots

  4. Operational Realities: State Bootstrap Time and Recovery SLOs

Part 32

Data Modeling for Streaming Use Cases

  1. Data Modeling for Streaming Use Cases

  2. Entities vs Events in Real Time: Modeling for Joins, Aggregates, and Replays

  3. Keys and Correlation: Multi-Stream Linkage Without Chaos

  4. Designing for the Future: Schema Choices That Keep Reprocessing Viable

Part 33

Failures in Streaming Systems

  1. Failures in Streaming Systems

  2. Failure Taxonomy: Node Loss, Partitions, Slow Tasks, Poison Messages

  3. Partial vs Full Failure: What Restarts Actually Restart

  4. Retries Revisited: Idempotency as the Survival Trait

Part 34

Checkpointing and Recovery

  1. Checkpointing and Recovery

  2. Consistent Snapshots: What It Means to "Save State" in a DAG

  3. Replay from Offsets + State: Rebuilding Exactly What You Had

  4. Checkpoint Frequency Trade-offs: Overhead vs Recovery Point Objectives

Part 35

Delivery Semantics in Depth

  1. Delivery Semantics in Depth

  2. At-Most Once: When Losing Data Is Acceptable (and How to Be Honest About It)

  3. At-Least Once: Deduplication, Idempotent Sinks, and Practical Correctness

  4. Exactly/Effectively Once: End-to-End Constraints and Where Systems Cheat

Part 36

Transactions and Two-Phase Commit (Conceptual)

  1. Transactions and Two-Phase Commit (Conceptual)

  2. Atomic Write + Offset Commit: The Core Problem Statement

  3. When 2PC-Like Patterns Appear: Coordinating External Side Effects

  4. When Not to Use It: Complexity Cliffs and Safer Alternatives

Part 37

Compensating Actions and Sagas in Streaming

  1. Compensating Actions and Sagas in Streaming

  2. Irreversible Side Effects: Why "Undo" Is a Product Decision

  3. Eventual Correctness: Compensation Patterns and Reconciliation

  4. Designing for Human Repair: Auditability and Replayable History

Part 38

Testing and Validating Guarantees

  1. Testing and Validating Guarantees

  2. Failure Injection: Restarts, Partitions, Slowdowns, and Chaos-Style Tests

  3. Synthetic Load Tests: Burst, Steady-State, and Pathological Keys

  4. Proving It in Practice: Observing Semantics Through Metrics and Logs

Part 39

Observability for Streaming Pipelines

  1. Observability for Streaming Pipelines

  2. Core Metrics: Throughput, Lag, End-to-End Latency, Watermark Progression

  3. Logs and Structured Operator Events: Making Failures Diagnosable

  4. Tracing Across Pipelines: Linking Ingest -> Process -> Serve in One Story

Part 40

Backpressure and Flow Control

  1. Backpressure and Flow Control

  2. What Backpressure Means: When Sinks or Operators Cannot Keep Up

  3. Buffers and Bounded Queues: Where Latency Hides and Outages Begin

  4. Shedding and Degradation: Dropping, Sampling, and Protective Limits

Part 41

Scaling and Resource Management

  1. Scaling and Resource Management

  2. Scaling via Partitions and Parallelism: Changing the Shape of Work

  3. Resource Bottlenecks: CPU vs Memory vs Network vs Disk in Practice

  4. Autoscaling and Elasticity: What Can Scale Automatically (and What Cannot)

Part 42

Hotspots and Skew

  1. Hotspots and Skew

  2. Key Skew: The Enemy of Parallelism

  3. Mitigation Patterns: Salting, Repartitioning, Load-Aware Routing

  4. Detecting Skew: Signals in Lag, Latency, and Per-Partition Metrics

Part 43

SLOs and Capacity Planning

  1. SLOs and Capacity Planning

  2. Streaming SLOs: Freshness, Latency, Availability, and Correctness Posture

  3. Capacity Models: Headroom, Peaks, and Failure Scenarios

  4. Planning for Growth: New Producers, New Consumers, and Retention Expansions

Part 44

Operational Playbooks and Runbooks

  1. Operational Playbooks and Runbooks

  2. Incident Types: Lag Spikes, Stuck Partitions, Failing Jobs, Corrupt State

  3. Triage and Mitigation: Pause, Drain, Reroute, Replay, Roll Back

  4. Postmortems and Continuous Improvement: Turning Incidents into Safer Defaults

Part 45

End-to-End Architecture Patterns

  1. End-to-End Architecture Patterns

  2. Layered Architecture: Ingestion -> Transport -> Processing -> Serving

  3. Unified vs Split Architectures: Speed/Batch Layering and Convergence Patterns

  4. Serving Patterns: Materialized Views, OLAP-ish Stores, and Operational Stores

Part 46

Real-Time Analytics and Monitoring

  1. Real-Time Analytics and Monitoring

  2. Observability Pipelines as Streaming: Metrics and Logs as First-Class Streams

  3. Dashboards and Near-Real-Time Queries: Freshness Guarantees and Cost

  4. Integration Boundaries: Streaming -> Time-Series Stores and Query Engines

Part 47

Personalization, Recommendations, and Fraud Detection

  1. Personalization, Recommendations, and Fraud Detection

  2. Event-Driven Feature Pipelines: Turning Behavior into Features in Motion

  3. Real-Time Scoring (Conceptual): Model Calls as Sinks with Strict Budgets

  4. Feedback Loops: Online Learning Risks and Governance Boundaries

Part 48

IoT and Edge Streaming

  1. IoT and Edge Streaming

  2. Device Streams: Intermittent Networks and Store-and-Forward Realities

  3. Gateways and Aggregation: Edge Filtering and Upstream Compression

  4. Buffering and Reconciliation: Correctness Across Connectivity Gaps

Part 49

Multi-Region and Hybrid Architectures

  1. Multi-Region and Hybrid Architectures

  2. Regional Clusters and Locality: Latency and Sovereignty Constraints

  3. Cross-Region Replication: Durability vs Cost vs Consistency Trade-offs

  4. Hybrid Patterns: On-Prem Sources, Cloud Processing, Shared Topics

Part 50

Governance, Schemas, and Evolution

  1. Governance, Schemas, and Evolution

  2. Schema Governance: Registries, Reviews, and Compatibility Policies

  3. Ownership and Naming: Topic Conventions and Domain Boundaries

  4. Migration Patterns: Resharding, Topic Evolution, and Safe Rewrites

Part 51

Platformizing Streaming

  1. Platformizing Streaming

  2. Streaming as an Internal Platform: Self-Service Ingestion and Guardrails

  3. Templates and Golden Paths: Standard Jobs, Standard Connectors, Standard SLOs

  4. Developer Experience: Testing, Local Runs, Staging, and Safe Deployment Workflows

Part 52

Reference Architectures and Case-Style Patterns

  1. Reference Architectures and Case-Style Patterns

  2. Observability Pipeline: From Telemetry to Alerts and Dashboards

  3. Clickstream Analytics and Experimentation Feeds: Joins, Windows, and Backfills

  4. Payments/Fraud Streams and IoT Telemetry: Exactly-Once Posture vs Best-Effort Posture