Course overview

How to Design High-Performance & Batch Computing Systems

52 modules
210 lessons
—
Part 1

Course Setup and the Incremental Ladder

  1. Course Setup and the Incremental Ladder

  2. Why "Jobs to Clusters"

  3. How to Use This Course

  4. The Incremental Ladder (Step 0 -> Step 7)

  5. The Course Lenses

  6. Diagram Legend and Notation Types

Part 2

What Is a High-Performance & Batch Computing System?

  1. What Is a High-Performance & Batch Computing System?

  2. Batch vs Interactive: Throughput vs Latency

  3. Workload Examples: What Clusters Are Asked to Do

  4. Cluster vs Single-Node Multicore: Boundaries Change When You Scale Out

Part 3

Workload Taxonomy

  1. Workload Taxonomy

  2. Resource-Bound Workloads as Scheduling Inputs

  3. Job Shapes and Queue Dynamics

  4. The Coupling Spectrum and Why It Dominates Design

Part 4

Hardware and Cluster Basics

  1. Hardware and Cluster Basics

  2. Nodes and Their Resources as the Execution Substrate

  3. Racks and Failure Domains: How Correlated Failures Appear

  4. Homogeneous vs Heterogeneous Fleets and the Complexity of Specialization

Part 5

Control Planes vs Data Planes

  1. Control Planes vs Data Planes

  2. Control Planes as the Decision Layer of a Cluster

  3. Data Planes as the Throughput Path for Work and Data

  4. Separating Concerns So Control Can Scale Without Data-Path Coupling

Part 6

Diagramming Cluster & Batch Systems

  1. Diagramming Cluster & Batch Systems

  2. Job Lifecycle Diagrams

  3. Cluster Topology Diagrams

  4. Task Graphs and Checkpoint Flows

Part 7

Step 0 Parallel Computation Models (Conceptual)

  1. Step 0 Parallel Computation Models (Conceptual)

  2. Data Parallel, Task Parallel, Pipeline Parallel

  3. Shared Memory vs Shared Nothing

  4. Throughput vs Latency Parallelism

Part 8

Step 0 Jobs, Tasks, and Resource Specifications

  1. Step 0 Jobs, Tasks, and Resource Specifications

  2. Jobs and Tasks

  3. Resource Requests

  4. Static vs Dynamic Needs

Part 9

Step 0 Basic Job Scheduling Concepts

  1. Step 0 Basic Job Scheduling Concepts

  2. Queues, Priorities, Fairness

  3. FCFS and Backfilling

  4. Preemption and Gang Scheduling

Part 10

Step 0 Cluster Resource Management

  1. Step 0 Cluster Resource Management

  2. Resource Views

  3. Node-Level vs Cluster-Level Decisions

  4. Bin-Packing and Fragmentation

Part 11

Step 0 Minimal Job Queue & Scheduler

  1. Step 0 Minimal Job Queue & Scheduler

  2. Central Queue + Workers Pattern

  3. Retries and Basic Failure Handling

  4. First Observability

Part 12

Batch vs Streaming vs Interactive

  1. Batch vs Streaming vs Interactive

  2. Batch Windows

  3. When Batch Fits and When It Does Not

  4. Workload-Mode Selection

Part 13

DAGs and Job Graphs

  1. DAGs and Job Graphs

  2. Jobs as Stages

  3. DAG Dependencies

  4. Task Parallelism Within Stages

Part 14

Common Batch Framework Concepts

  1. Common Batch Framework Concepts

  2. Map/Reduce-Style Abstractions

  3. Stages and Barriers

  4. Drivers and Executors

Part 15

Workload Characterization

  1. Workload Characterization

  2. ETL, Analytics, ML Training, Simulations

  3. Stress Profiles

  4. Many Small vs Few Large Jobs

Part 16

Job Submission and Configuration

  1. Job Submission and Configuration

  2. Job Specs

  3. Retries and Timeouts

  4. Templates and Common Patterns

Part 17

First Batch Computing Environment

  1. First Batch Computing Environment

  2. Single Cluster + One Framework

  3. Standard Pipelines and Ad Hoc Jobs

  4. Basic Monitoring

Part 18

Data Partitioning and Locality

  1. Data Partitioning and Locality

  2. Partitioning Strategies

  3. Co-Locating Compute with Data

  4. Repartitioning and Shuffles

Part 19

Embarrassingly Parallel Workloads

  1. Embarrassingly Parallel Workloads

  2. Independent Tasks

  3. Minimal Coordination

  4. Large Arrays of Small Tasks

Part 20

Task Graphs and Dependencies

  1. Task Graphs and Dependencies

  2. DAG Representation

  3. Upstream and Downstream Semantics

  4. Scheduling on Dependency Completion

Part 21

Load Balancing and Stragglers

  1. Load Balancing and Stragglers

  2. Skew and Straggler Causes

  3. Dynamic Assignment and Speculation

  4. Hotspot Monitoring

Part 22

Coordinated Parallel Computations (Conceptual)

  1. Coordinated Parallel Computations (Conceptual)

  2. Barriers and Synchronization

  3. Collectives Conceptually

  4. HPC-Style vs Batch-Style Coordination

Part 23

Designing for Parallel Efficiency

  1. Designing for Parallel Efficiency

  2. Task Granularity vs Overhead

  3. Coordination vs Simplicity

  4. Matching Models to Job Types

Part 24

Storage Layers in Batch Systems

  1. Storage Layers in Batch Systems

  2. Local, Shared, Distributed Storage

  3. Throughput, IOPS, Latency

  4. Temporary vs Durable Storage

Part 25

Data Locality and Movement

  1. Data Locality and Movement

  2. Move Compute to Data vs Data to Compute

  3. Caching Strategies

  4. Intermediate Data Handling

Part 26

High-Performance I/O Patterns

  1. High-Performance I/O Patterns

  2. Sequential vs Random Access

  3. Parallel I/O and Striping

  4. Throttling and Backpressure

Part 27

Cluster Networking Basics

  1. Cluster Networking Basics

  2. East-West vs North-South

  3. Bandwidth, Latency, Oversubscription

  4. Network Impact on Shuffle Workloads

Part 28

Networking for HPC and Data-Intensive Workloads

  1. Networking for HPC and Data-Intensive Workloads

  2. High-Speed Interconnect Concepts

  3. Topologies and Failure Domains

  4. Identifying Network Bottlenecks

Part 29

Storage/Network Aware Scheduling

  1. Storage/Network Aware Scheduling

  2. Scheduling by Data Location

  3. Balancing Network Load

  4. Locality vs Queue Time

Part 30

Failure Modes in Batch & HPC Systems

  1. Failure Modes in Batch & HPC Systems

  2. Node and Process Failures

  3. Network Partitions and Slow Nodes

  4. Framework and Controller Failures

Part 31

Retry, Idempotency, and Determinism

  1. Retry, Idempotency, and Determinism

  2. Safe Retries

  3. Idempotent Writes

  4. Deterministic Replays

Part 32

Checkpointing Concepts

  1. Checkpointing Concepts

  2. Periodic State Saving

  3. Job-Level vs Task-Level Checkpoints

  4. Incremental vs Full Checkpoints

Part 33

Recovery Strategies

  1. Recovery Strategies

  2. Rolling Back to Checkpoints

  3. Re-Run Tasks vs Re-Run Stages

  4. Long-Running Job Recovery

Part 34

Consistency and Output Semantics

  1. Consistency and Output Semantics

  2. At-Least-Once, At-Most-Once, and Effectively-Once Outputs

  3. Partial Results Visibility

  4. Finalization and Commit

Part 35

Testing and Validating Fault Tolerance

  1. Testing and Validating Fault Tolerance

  2. Failure Injection in Test Environments

  3. Chaos-Style Experiments for Batch

  4. Recovery Observability

Part 36

Capacity Planning for Batch Clusters

  1. Capacity Planning for Batch Clusters

  2. Estimating Compute and Storage Needs

  3. Peak vs Average

  4. Growth Planning

Part 37

Utilization Metrics and Efficiency

  1. Utilization Metrics and Efficiency

  2. Resource Utilization

  3. Queue Wait and Turnaround Time

  4. Fragmentation and Slack

Part 38

Scheduling Policies for Cost and Fairness

  1. Scheduling Policies for Cost and Fairness

  2. Priorities and Fair Sharing

  3. Preemption, Quotas, Deadlines

  4. Service Classes and Batch Windows

Part 39

Elastic and Cloud-Based Clusters

  1. Elastic and Cloud-Based Clusters

  2. Autoscaling Workers

  3. Spot and Preemptible Capacity

  4. Hybrid Bursting

Part 40

Chargeback, Showback, and Budgeting

  1. Chargeback, Showback, and Budgeting

  2. Cost Attribution

  3. Usage Reports and Budgets

  4. Incentives for Efficient Jobs

Part 41

Optimization Playbooks

  1. Optimization Playbooks

  2. Reducing Waste

  3. Right-Sizing Jobs

  4. Smoothing Peaks

Part 42

Single vs Multi-Cluster Architectures

  1. Single vs Multi-Cluster Architectures

  2. One Big Cluster vs Many

  3. Isolation Strategies

  4. Shared vs Dedicated Resources

Part 43

Multi-Region and Hybrid Architectures

  1. Multi-Region and Hybrid Architectures

  2. Regional Clusters

  3. Cross-Region Data Movement

  4. Hybrid On-Prem + Cloud

Part 44

Heterogeneous Hardware Fleets

  1. Heterogeneous Hardware Fleets

  2. Scheduling Specialized Hardware

  3. Partitioned Fleets vs Unified View

  4. Hardware Lifecycle and Upgrades

Part 45

Multi-Tenant Batch Platforms

  1. Multi-Tenant Batch Platforms

  2. Isolation Between Teams and Workloads

  3. Policies, Quotas, and Enforcement

  4. Fairness at Scale

Part 46

Integration with Data & ML Platforms

  1. Integration with Data & ML Platforms

  2. Batch for ETL, Features, Training

  3. Interfaces to Warehouses, Lakes, Model Platforms

  4. Coordinated Schedules

Part 47

Observability and SLOs for Clusters

  1. Observability and SLOs for Clusters

  2. Cluster-Level Metrics

  3. Job-Level Metrics

  4. SLO Design

Part 48

End-to-End Job Lifecycle

  1. End-to-End Job Lifecycle

  2. Request and Plan

  3. Run and Observe

  4. Close and Record

Part 49

Platform as a Product

  1. Platform as a Product

  2. Platform Positioning

  3. Onboarding and Templates

  4. Developer Experience Surfaces

Part 50

Governance, Compliance, and Safety

  1. Governance, Compliance, and Safety

  2. Access Control for Data and Compute

  3. Resource Safety Limits

  4. Policies and Audit Logs

Part 51

Reliability and Incident Management

  1. Reliability and Incident Management

  2. Incident Types: Capacity Exhaustion, Scheduler Failures, Storage Outages

  3. Response Playbooks: Stabilizing the Platform Under Load While Preserving Fairness and Trust

  4. Post-Incident Improvement: Turning Outages into Durable Fixes in Scheduling, Capacity, and Governance

Part 52

Reference Architectures and Maturity Models

  1. Reference Architectures and Maturity Models

  2. Early Stage Reference: Simple Queue, Scripts, and a Single Cluster

  3. Mid Stage Reference: Batch Framework, Locality-Aware Design, and Basic Autoscaling

  4. Advanced Stage Reference: Multi-Cluster, Heterogeneous Fleets, Integrated Platforms, and Robust SLOs