Course overview

How to Design Observability & Telemetry Systems

52 modules
210 lessons
Part 1

Appendices

  1. Appendix A - Diagram Templates by Step

  2. Appendix B - Mapping Concepts to Real-World Observability Stacks

  3. Appendix C - Readiness Checklists for Moving Up the Ladder

  4. Appendix D - Glossary (Canonical Definitions)

Part 2

Course Setup and the Incremental Ladder

  1. Course Setup and the Incremental Ladder

  2. Why "Logs to Insights"

  3. How to Use This Course

  4. The Incremental Ladder (Step 0 to Step 7)

  5. The Course Lenses

  6. Diagram Legend and Notation Types

Part 3

From Monitoring to Observability

  1. From Monitoring to Observability

  2. Monitoring vs Observability

  3. Black-Box vs White-Box Views

  4. Observability as New Questions

Part 4

The Three Core Signal Families

  1. The Three Core Signal Families

  2. Logs, Metrics, Traces

  3. Cardinality, Volume, Structure

  4. Why You Need All Three

Part 5

Observability as Part of System Design

  1. Observability as Part of System Design

  2. Instrumentation as Prerequisite

  3. Telemetry at Design Time

  4. Aligning with Boundaries

Part 6

Observability Layers and Planes

  1. Observability Layers and Planes

  2. Application to Business Layers

  3. Control Plane vs Data Plane Signals

  4. Cross-Cutting Concerns

Part 7

Diagramming Observability Systems

  1. Diagramming Observability Systems

  2. Telemetry Flow Diagrams

  3. Observability Overlays on Architecture

  4. Signal Maps and Dependency Graphs

Part 8

Step 0 Logs: The Narrative of a System

  1. Step 0 Logs: The Narrative of a System

  2. Text vs Structured Logs

  3. Log Levels with Discipline

  4. Per-Request vs Periodic Logs

Part 9

Step 0 Metrics: Numbers Over Time

  1. Step 0 Metrics: Numbers Over Time

  2. Metric Types

  3. Labels and Cardinality

  4. Core Service Metrics

Part 10

Step 0 Traces: End-to-End Request Journeys

  1. Step 0 Traces: End-to-End Request Journeys

  2. Spans and Trace Trees

  3. Context and Attributes

  4. Partial Traces and Sampling

Part 11

Step 0 Choosing the Right Signal for the Question

  1. Step 0 Choosing the Right Signal for the Question

  2. Detection vs Measurement vs Diagnosis

  3. Signal Anti-Patterns

  4. First Combined View

Part 12

Step 0 Minimal Observability for a Single Service

  1. Step 0 Minimal Observability for a Single Service

  2. Minimum Viable Log Structure and Key Metrics

  3. Basic Tracing Around Critical Paths

  4. First Service Health Dashboard

Part 13

Instrumentation as Code

  1. Instrumentation as Code

  2. Libraries and APIs

  3. Abstraction vs Control

  4. Close to Business Logic

Part 14

Structured Logging and Context

  1. Structured Logging and Context

  2. Key-Value Logging

  3. Correlation IDs and Trace IDs in Logs

  4. Redaction and PII-Safe Patterns

Part 15

Metrics Instrumentation Patterns

  1. Metrics Instrumentation Patterns

  2. Request and Error Counters

  3. Latency Histograms

  4. Resource and Business Metrics

Part 16

Tracing and Context Propagation

  1. Tracing and Context Propagation

  2. Propagating Context Across Services

  3. Auto-Instrumentation vs Manual Spans

  4. Sampling Strategies

Part 17

Cross-Cutting Instrumentation Concerns

  1. Cross-Cutting Instrumentation Concerns

  2. Incoming Request Instrumentation

  3. Outgoing Calls

  4. Shared Libraries as Instrumentation Points

Part 18

Instrumentation Quality and Hygiene

  1. Instrumentation Quality and Hygiene

  2. Naming Conventions

  3. Avoiding High Cardinality and Noise

  4. Reviews and Guidelines

Part 19

Telemetry Agents and Sidecars

  1. Telemetry Agents and Sidecars

  2. Host Agents and Collectors

  3. Sidecars vs Shared Daemons vs Exporters

  4. Security and Resource Costs

Part 20

Logs Collection Pipelines

  1. Logs Collection Pipelines

  2. Shippers and Aggregators

  3. Log Formats

  4. Buffering, Backpressure, Loss

Part 21

Metrics Collection Mechanics

  1. Metrics Collection Mechanics

  2. Pull vs Push Collection

  3. Exporter Model

  4. Aggregation and Downsampling

Part 22

Tracing Export Pipelines

  1. Tracing Export Pipelines

  2. Exporters, Collectors, Backends

  3. Batching and Compression

  4. High Trace Volumes

Part 23

Multi-Hop Telemetry Flows

  1. Multi-Hop Telemetry Flows

  2. Local to Regional to Central

  3. Gateways, Relays, Edge Buffering

  4. Designing for Partitions

Part 24

Observability Ingestion Architecture

  1. Observability Ingestion Architecture

  2. Central APIs vs Per-Signal Paths

  3. Multi-Tenant Pipelines and Isolation

  4. Ingestion SLOs

Part 25

Storage Models for Telemetry

  1. Storage Models for Telemetry

  2. Time-Series, Logs, Traces Backends

  3. Row vs Column Trade-Offs

  4. Hot/Warm/Cold Tiers

Part 26

Indexing Logs

  1. Indexing Logs

  2. Choosing Fields to Index

  3. Full-Text vs Structured Indexing

  4. Index Size and Performance

Part 27

Metrics Storage and Querying

  1. Metrics Storage and Querying

  2. Time-Series Models

  3. Downsampling and Retention Tiers

  4. Query Patterns

Part 28

Trace Storage and Retrieval

  1. Trace Storage and Retrieval

  2. Trace-ID Lookups

  3. Partial vs Full Trace Storage

  4. Attribute Indexing

Part 29

Retention Policies and Cost Management

  1. Retention Policies and Cost Management

  2. Per-Signal and Per-Environment Retention

  3. Hot-Short vs Long-Term Archival

  4. Cardinality and Volume Controls

Part 30

Multi-Region and Multi-Tenant Observability Storage

  1. Multi-Region and Multi-Tenant Observability Storage

  2. Sharding Strategies

  3. Region-Local vs Global Storage

  4. Residency and Regulation Constraints

Part 31

Designing Health Dashboards

  1. Designing Health Dashboards

  2. Golden Signals

  3. Service Overviews vs Deep Dives

  4. Use-Case Dashboards

Part 32

Querying Telemetry for Visualizations

  1. Querying Telemetry for Visualizations

  2. Query Patterns Across Signals

  3. Chart Construction

  4. Avoiding Misleading Visuals

Part 33

Alerting Fundamentals

  1. Alerting Fundamentals

  2. Threshold Alerts

  3. Multi-Signal Conditions

  4. Alert Fatigue Prevention

Part 34

Alert Routing and Incident Workflows

  1. Alert Routing and Incident Workflows

  2. On-Call, Escalation, Ownership

  3. Integrations with Paging and Ticketing

  4. Runbooks and Documentation Links

Part 35

Visual Narratives and Event Annotation

  1. Visual Narratives and Event Annotation

  2. Overlaying Change Events

  3. Timeline Correlation

  4. Dashboards as Stories

Part 36

UX of Observability Tools

  1. UX of Observability Tools

  2. Query Performance and Responsive UIs

  3. Navigation Paths

  4. Approachability for Non-Experts

Part 37

SLIs, SLOs, and SLAs

  1. SLIs, SLOs, and SLAs

  2. SLIs as Measured Evidence

  3. Choosing User-Centered SLOs

  4. SLAs vs SLOs

Part 38

Error Budgets and Decision-Making

  1. Error Budgets and Decision-Making

  2. Allowed Unreliability

  3. Burn Rate and Consumption

  4. Using Budgets to Guide Change

Part 39

Implementing SLOs in Telemetry

  1. Implementing SLOs in Telemetry

  2. SLI Queries on Metrics

  3. Rolling Windows and Time Slices

  4. SLO Dashboards and Alerts

Part 40

Multi-Region and Multi-Service SLOs

  1. Multi-Region and Multi-Service SLOs

  2. Per-Region vs Global SLOs

  3. Dependency-Chain SLOs

  4. Where to Measure

Part 41

Reliability Reviews and Reporting

  1. Reliability Reviews and Reporting

  2. SLO Review Cadence

  3. Error-Budget Postmortems

  4. Communicating Reliability

Part 42

Types of Incidents and Failures

  1. Types of Incidents and Failures

  2. Gradual Degradations vs Sudden Spikes

  3. Partial Outages vs Region Failures

  4. Unknown-Unknowns vs Known Failures

Part 43

Anomaly Detection Basics

  1. Anomaly Detection Basics

  2. Baselines and Trend Detection

  3. Multi-Dimensional Anomalies

  4. Sensitivity vs False Positives

Part 44

Correlating Signals

  1. Correlating Signals

  2. Linking Logs, Metrics, Traces

  3. Correlation and Request IDs

  4. Visual Correlation

Part 45

RCA Workflows

  1. RCA Workflows

  2. From Alert to Hypothesis

  3. Dependency Maps and Service Graphs

  4. Structured Analysis and Incident Narratives

Part 46

Tools for Investigation

  1. Tools for Investigation

  2. Trace Explorers and Log Pivoting

  3. Service Maps and Topology Views

  4. Saved Queries and Macros

Part 47

Learning from Incidents

  1. Learning from Incidents

  2. Blameless Reviews

  3. Updating Dashboards, Alerts, Instrumentation

  4. Capturing Failure Signatures

Part 48

Observability as a Platform

  1. Observability as a Platform

  2. Central Platform for All Teams

  3. Platform as Product

  4. Self-Service Onboarding

Part 49

Multi-Tenancy, Governance, and Access Control

  1. Multi-Tenancy, Governance, and Access Control

  2. Tenant Isolation - Team/Project/Service Boundaries and What Can Fail Together

  3. RBAC for Telemetry - Controlling Who Can See What Without Blocking Legitimate Debugging

  4. Privacy and Data Minimization - Compliance Posture and Reducing Telemetry to What Is Necessary

Part 50

Cost Management and Telemetry Budgeting

  1. Cost Management and Telemetry Budgeting

  2. Cost Drivers - Volume, Retention, Cardinality, and Query Patterns as the Main Levers

  3. Budgets and Quotas - Allocating Telemetry Capacity and Making Trade-offs Explicit by Team or Service

  4. Feedback Loops to Instrumentation and Sampling - Controlling Cost by Changing What You Emit, Not Only Where You Store It

Part 51

Integrations with the Rest of the Platform

  1. Integrations with the Rest of the Platform

  2. CI/CD Integration - Deploy Markers and Pipeline Metrics as Essential Investigation Context

  3. Incident, Flags, Config Systems - Connecting Observability to Other Control Systems Without Duplicating Truth

  4. Business Analytics Boundaries - Integrating Product Metrics While Preserving Semantics and Governance

Part 52

Reference Architectures and Maturity Models

  1. Reference Architectures and Maturity Models

  2. Small-Team Stack - Minimal Pipelines and Storage That Still Support Reliable Debugging

  3. Mid-Size Org Platform - Shared Platform with SLO Discipline and Standardized Workflows

  4. Large Org Platform - Multi-Region, Multi-Cloud, Governed Observability with Cost Controls and Tenancy Boundaries