Project Title: BIFROST: A Modular Simulation Framework for Multi-Objective Scheduling of ML Pipelines on Heterogeneous Cloud Infrastructure 
Student: Marco Mehta
Course: BSc Hons Computer Science with Year in Industry
	
Abstract:
A machine learning (ML) pipeline is a sequence of stages (data preprocessing, model
training, evaluation) that run across a cluster of machines, often with different hardware.
Deciding which stage runs on which machine is the scheduling problem tackled here.
Comparing different scheduling strategies is hard: live cluster experiments are expensive
and difficult to reproduce, while simulation is only useful if its inputs accurately reflect
real hardware behaviour. This dissertation treats the problem as one of measurement
before scheduling and contributes two complementary artefacts.
A reproducible benchmarker profiles nine supervised-learning workloads across six Amazon
Web Services (AWS) cloud server types, producing empirical distributions of execution time,
startup latency, and estimated energy for each pipeline stage, measured over 30 isolated runs
per (workload, hardware) pair. BIFROST (Benchmark-Informed Framework for
Resource-Oriented Scheduling Trade-offs), a modular discrete-event simulation framework, consumes
these profiles to evaluate scheduling strategies against a configurable set of objectives.
The evaluation uses five objectives: total elapsed time (makespan), energy consumption,
aggregate deadline overrun (tardiness), plan churn across successive scheduling decisions
(scheduling instability), and unevenness of work distribution across nodes (load imbalance).
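
As a rough illustration of how four of these objectives could be computed from a finished schedule, the sketch below uses one plausible set of definitions. The record type, the function names, the coefficient-of-variation measure of load imbalance, and the reassignment-fraction measure of instability are all assumptions made for illustration and are not taken from the BIFROST implementation. Energy is omitted because it depends on a per-hardware power model that the abstract does not specify.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class TaskRecord:
    """Hypothetical record of one completed pipeline stage."""
    node: str
    start: float
    finish: float
    deadline: float


def makespan(tasks):
    # Total elapsed time from the earliest start to the latest finish.
    return max(t.finish for t in tasks) - min(t.start for t in tasks)


def tardiness(tasks):
    # Aggregate deadline overrun; stages that finish early contribute zero.
    return sum(max(0.0, t.finish - t.deadline) for t in tasks)


def load_imbalance(tasks, nodes):
    # Assumed measure: coefficient of variation of per-node busy time.
    busy = {n: 0.0 for n in nodes}
    for t in tasks:
        busy[t.node] += t.finish - t.start
    loads = list(busy.values())
    avg = mean(loads)
    return pstdev(loads) / avg if avg > 0 else 0.0


def instability(prev_plan, new_plan):
    # Assumed measure: fraction of tasks present in both plans whose
    # assigned node changed between successive scheduling decisions.
    shared = prev_plan.keys() & new_plan.keys()
    if not shared:
        return 0.0
    return sum(prev_plan[k] != new_plan[k] for k in shared) / len(shared)
```

All four functions return values where lower is better, which matches the minimisation framing used by multi-objective optimisers such as NSGA-II.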
The simulator additionally models three runtime dynamics that a realistic cloud cluster
exhibits and that make scheduling decisions harder: bursty job arrivals, automatic cluster
resizing in response to workload, and intermittent task and node failures.
The framework is exercised through a 1,680-run empirical study using four scheduling
strategies of increasing sophistication: two objective-blind baselines
(First-Come-First-Served and Random), a heuristic that minimises makespan alone (Heterogeneous Earliest
Finish-Time, HEFT), and a multi-objective evolutionary algorithm (Non-dominated Sorting
Genetic Algorithm II, NSGA-II). These are compared across three principal scenarios under
a pre-committed non-parametric statistical protocol. Profile data validity is established
before any scheduling result is interpreted.
The results fall into three qualitatively different regimes depending on how much spare
capacity the cluster has relative to the workload it is being asked to run. When the cluster
has more compute capacity than the workload needs, the four strategies produce clearly
different scores on every objective, and NSGA-II achieves the best score on four of the five. When workload
pressure on the cluster is moderate, the final objective scores measured after workload
completion show no statistically significant difference between strategies, yet a per-decision
audit reveals NSGA-II actively trading off energy against load imbalance at the moment
each task is placed on a machine. Under severe contention, the scheduler is left with only
one viable option on 94.3% of its decisions, because the hardware constraints eliminate all
other trade-off candidates, structurally limiting how much strategies can differ on their
final scores.
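
One way to picture why a single viable option leaves no room for a trade-off is a non-dominated filter over the placements that remain feasible. The sketch below is an assumption about how such candidates might be compared, not a description of the audit used in the study. The function names and the (energy, load imbalance) score tuples are hypothetical.

```python
def dominates(a, b):
    # a dominates b if it is no worse on every objective and strictly
    # better on at least one (all objectives are minimised).
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated(candidates):
    # Keep every candidate that no other candidate dominates.
    return [
        c for c in candidates
        if not any(dominates(o, c) for o in candidates if o is not c)
    ]


# Hypothetical (energy, load imbalance) scores for candidate placements.
relaxed = [(10.0, 0.5), (12.0, 0.2), (13.0, 0.6)]
contended = [(13.0, 0.6)]  # constraints have removed every other placement

front_relaxed = non_dominated(relaxed)      # two candidates: a real trade-off
front_contended = non_dominated(contended)  # one candidate: nothing to trade
```

With two or more non-dominated candidates a strategy must choose between objectives. With one, every strategy is forced into the same placement regardless of how it weighs them.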
The contribution is a validated, reusable harness for measurement and simulation in
multi-objective ML pipeline scheduling research. It also provides empirical evidence that
whether scheduling strategies can be told apart at all depends on how constrained the
cluster is relative to the workload.