Let’s be honest — most teams running Spark notebooks in Microsoft Fabric have no real idea what they’re spending or where that spend is going. The jobs run, the data moves, the dashboards refresh, and everyone assumes the platform is working the way it should.
But here’s what the numbers show: up to 80–90% of Spark execution time in a typical Fabric environment is idle or wasted. Not occasionally. Consistently.
This guide is for data engineers who want to get past the surface level and understand what’s happening inside their notebooks — the cluster behavior, the execution patterns, the shuffle costs — and more importantly, what to do about it.
Understanding the Hidden Cost of Spark in Fabric
Every time a Spark notebook kicks off, Fabric spins up a cluster, assigns executors and cores, and starts burning Compute Units. That part most people understand. What catches teams off guard is that the meter runs regardless of whether the workload is using those resources or not.
And the visibility problem makes it worse. Fabric doesn’t give you clean per-notebook CU consumption data out of the box. So you end up in this awkward situation where you know cost is accumulating, you can see the overall numbers going up, but you can’t easily pinpoint which notebooks are the culprits or why they’re expensive. You’re basically trying to cut a bill you can’t fully read.
That’s the gap this guide is meant to close.
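Even without per-notebook CU reporting, you can build a rough cost model from what you do know: cluster size and session length. The sketch below assumes Fabric’s published rate of 2 Spark VCores per Capacity Unit; the function name and the idea of billing the full session lifetime are illustrative, not an official API.

```python
def estimated_cu_seconds(spark_vcores: int, session_seconds: float) -> float:
    """Back-of-envelope session cost in CU-seconds.

    Assumes Fabric's 2-Spark-VCores-per-CU rate, and that the meter
    runs for the whole session lifetime -- busy or idle.
    """
    return spark_vcores / 2 * session_seconds

# An 8-VCore session held open for an hour bills the full hour,
# even if only ten minutes of it did real work:
print(estimated_cu_seconds(8, 3600))  # 14400.0 CU-seconds
```

It’s crude, but multiplied across every scheduled notebook it gives you a defensible ranking of where the spend is going until better telemetry is in place.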
Spark Cluster Assignment: What’s Really Happening?
Every notebook session runs on a Spark cluster. The cluster size controls how many executors and cores get provisioned — and those resources stay allocated for the life of the session.
The problem most teams run into is that cluster size gets set once (usually at the start of a project, when engineers are running heavy exploratory queries) and never touched again. So a notebook doing a simple filter on a small table ends up running on the same cluster configuration as one joining tens of millions of rows. Same cost, wildly different workload.
For lighter jobs running on oversized clusters, what you get is:
- Executors sitting allocated but doing nothing useful
- Cores running well below capacity for most of the job
- Higher CU consumption with zero performance benefit to show for it
The fix sounds simple — match the cluster to the workload. In practice, most teams just haven’t made the time to go back and do it.
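One way to make “match the cluster to the workload” concrete is a sizing heuristic. This is a hypothetical sketch, not a Fabric API: the 128 MB target partition size and the two-scheduling-waves assumption are common rules of thumb, and every name here is made up for illustration.

```python
import math

def suggest_executor_config(input_gb, partition_mb=128, cores_per_executor=4, waves=2):
    """Rough sizing heuristic: one task per ~128 MB partition,
    with enough cores to finish in a couple of scheduling waves.
    All defaults are illustrative starting points, not fixed rules."""
    partitions = max(1, math.ceil(input_gb * 1024 / partition_mb))
    cores_needed = max(1, math.ceil(partitions / waves))
    executors = max(1, math.ceil(cores_needed / cores_per_executor))
    return {"executors": executors,
            "cores_per_executor": cores_per_executor,
            "partitions": partitions}
```

Run it against the simple-filter notebook and the tens-of-millions-of-rows join and you get very different answers — which is exactly the point: one cluster config should not serve both.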
Executors and Cores: Where Processing Actually Happens
An executor is a JVM process. It’s the thing that does the work — executing tasks, running transformations, and handling shuffle operations. When people talk about Spark performance, they’re really talking about how well executors are being used.
If you’ve provisioned more executors than the job needs, they don’t disappear — they just sit there consuming memory and CPU while contributing nothing. It’s the compute equivalent of hiring ten people for a job that needs two. The work still gets done at the same speed; you just pay eight people who spent the day reading the news.
Right-sizing your executor configuration to match actual job requirements is one of the most direct levers you have over cost. It’s also one of the most skipped.
Spark Execution Deep Dive: Jobs, Stages, and Shuffles
If you want to optimize Spark jobs, you need a working mental model of how execution flows.
Jobs get triggered by actions — a write, a display call, a collect. Each job represents one complete execution flow from start to finish.
Stages are how Spark breaks a job into chunks it can actually schedule. Each stage reads some data, does something to it, and passes it along. Where there are dependencies between operations, stages run sequentially. Where there aren’t, Spark can parallelize them.
Shuffles are where the real cost hides. A shuffle happens when Spark needs to move data between executors — which is required for joins, aggregations, and groupBy operations. Data gets written to disk, sent over the network, and read back on the other side. It’s slow and expensive, and in poorly written pipelines it happens far more than it needs to.
When you’re digging into a job that’s taking longer or costing more than expected, shuffle volume is almost always the first place to look.
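To see why shuffle volume dominates cost, it helps to model what a hash shuffle actually does. This toy model (plain Python, not Spark) shows every input partition scattering its rows to output partitions by key hash; each row that lands on a different “executor” represents a disk write, a network transfer, and a read on the other side.

```python
def shuffle_by_key(partitions, n_out):
    """Toy model of a hash shuffle: each input partition scatters its
    (key, value) rows to output partitions by key hash, so all rows
    for one key land together. Rows that change partition stand in
    for real disk + network traffic."""
    out = [[] for _ in range(n_out)]
    rows_moved = 0
    for pid, rows in enumerate(partitions):
        for key, value in rows:
            target = hash(key) % n_out
            if target != pid:
                rows_moved += 1  # crosses executors: write, transfer, read
            out[target].append((key, value))
    return out, rows_moved
```

Even in this miniature version, roughly half the rows move; at real data volumes, that movement is the bill.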
Common Performance Issue: Idle Execution
The scenario plays out like this: someone sets up a large cluster, runs a notebook that processes a relatively small dataset, and the job finishes in a few minutes. Looks fine. But when you actually look at executor activity during that run, they were idle for most of it.
You paid for an hour of compute. Maybe ten minutes of it involved actual processing. The rest was overhead, startup time, and executors waiting around with nothing to do.
Now multiply across 30 notebooks running on a daily schedule. That idle time stops being a rounding error and starts being a real line item — one that compounds every single day without anyone noticing, because the jobs are technically completing and nobody’s watching the utilization numbers.
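You can put a number on that idle time from Spark’s own monitoring REST API. The helper below is a sketch: the endpoint (`/api/v1/applications/{app-id}/executors`) and the `totalDuration` / `totalCores` fields are part of Spark’s standard REST API, but the function itself and its treatment of the driver row are my assumptions.

```python
def executor_utilization(executors, wall_clock_ms):
    """Share of allocated core-time actually spent running tasks.

    `executors` is the JSON list returned by Spark's REST API
    (GET /api/v1/applications/{app-id}/executors): totalDuration is
    cumulative task time in ms, totalCores the cores each executor
    holds. Anything far below 1.0 is paid-for idle compute.
    """
    workers = [e for e in executors if e["id"] != "driver"]
    busy_ms = sum(e["totalDuration"] for e in workers)
    core_ms = sum(e["totalCores"] for e in workers) * wall_clock_ms
    return busy_ms / core_ms if core_ms else 0.0
```

Feed it a 30-minute run where two 4-core executors logged 15 minutes of combined task time and it returns about 0.06 — the “paid for an hour, used ten minutes” pattern in a single number.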
Adaptive Query Execution (AQE): An Often-Missed Optimization
AQE is one of those features that should be on by default in every Fabric environment. Often it isn’t.
Without AQE, Spark locks in a query plan before execution starts and sticks with it no matter what the data looks like at runtime. That works fine when your data behaves exactly as expected. It falls apart when it doesn’t — which, in real-world pipelines, is more often than you’d like.
AQE lets Spark adjust its plan mid-execution based on actual runtime statistics. It can switch join strategies on the fly, collapse small partitions that would otherwise cause unnecessary overhead, and cut shuffle volume in ways that static planning simply can’t. Enabling it is a small configuration change that regularly delivers a noticeable improvement in both speed and cost. There’s really no good reason to leave it off.
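The settings involved are standard Spark SQL configuration keys. The dict below groups the three highest-impact ones; applying them via `spark.conf.set` on an active session is one option, though setting them at the environment level is stickier.

```python
# Standard Spark SQL config keys for AQE and its two main sub-features.
AQE_CONFS = {
    "spark.sql.adaptive.enabled": "true",                     # re-plan mid-execution
    "spark.sql.adaptive.coalescePartitions.enabled": "true",  # merge tiny shuffle partitions
    "spark.sql.adaptive.skewJoin.enabled": "true",            # split skewed join partitions
}

# In a notebook with an active session:
#   for key, value in AQE_CONFS.items():
#       spark.conf.set(key, value)
```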
Spark History Server: Your Best Friend for Optimization
Before you change any configurations, open the Spark History Server. Seriously — this tool tells you exactly what happened during a notebook run, and it will almost always show you something you weren’t expecting.
The four areas to focus on:
- Jobs — Which ones ran longest, and did they finish or get retried?
- Stages — Where did time actually accumulate in the execution flow?
- Shuffle — How much data was being moved around, and between which stages?
- Executors — Were they busy throughout, or were large chunks of runtime just idle?
Most performance problems leave obvious fingerprints in the History Server. A stage that took 45 minutes when everything else took under a minute. Shuffle reads that are ten times larger than they should be. Executors that show 5% CPU utilization across a 30-minute job. These aren’t subtle — you just have to know where to look.
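The same fingerprints are also queryable. The History Server exposes a REST API, and a few lines can surface the outlier stages without clicking through the UI. The endpoint (`/api/v1/applications/{app-id}/stages`) and field names (`executorRunTime` in ms, `shuffleReadBytes`, `status`) follow Spark’s REST API; the helper itself is a sketch.

```python
def stage_fingerprints(stages, top=3):
    """Rank completed stages by executor run time so outliers (the
    45-minute stage) surface immediately. `stages` is the JSON list
    from GET /api/v1/applications/{app-id}/stages."""
    done = [s for s in stages if s.get("status") == "COMPLETE"]
    done.sort(key=lambda s: s["executorRunTime"], reverse=True)
    return [(s["stageId"], s["executorRunTime"], s["shuffleReadBytes"])
            for s in done[:top]]
```

A scheduled script running this against yesterday’s applications turns “open the History Server when something feels slow” into a daily report.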
Observability Metrics You Should Track
Optimization without measurement is guesswork with extra steps. If you’re serious about keeping Fabric costs under control, these are the metrics worth tracking on an ongoing basis — not just during a one-time audit.
1. Runtime Metrics
Which notebooks consistently take the longest? These are your highest-cost workloads and the obvious starting point for any optimization effort. Track execution time per notebook over time, not just as a point-in-time snapshot.
2. Data Movement Metrics
Shuffle read and write volumes tell you a lot about query quality. Jobs with disproportionately high shuffle relative to the amount of data they’re processing are almost always doing something inefficient at the query level.
3. Resource Utilization
Executor CPU and memory utilization. Low utilization on a large cluster is a clear signal that you’re over-provisioned. High utilization on a small cluster might mean you’ve gone too far in the other direction. You’re looking for a configuration that runs efficiently without leaving resources sitting idle.
4. Configuration Gaps
Keep track of which notebooks have AQE disabled and which are running on clusters that haven’t been reviewed. These are known problems, not unknowns — and known problems have a way of staying unfixed until someone builds a process to surface them regularly.
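Pulling these four metric families together per notebook can be as simple as one row per run. This is a hypothetical shape for that row — the thresholds (shuffle more than 2× input, utilization under 20%) are illustrative starting points, not fixed rules.

```python
def notebook_scorecard(name, runtime_s, input_bytes, shuffle_bytes, cpu_util):
    """One row of an ongoing tracking table, flagging the two patterns
    above: disproportionate shuffle, and paid-for idle compute.
    Thresholds here are illustrative, tune them to your workloads."""
    flags = []
    if input_bytes and shuffle_bytes > 2 * input_bytes:
        flags.append("high-shuffle")
    if cpu_util < 0.2:
        flags.append("idle-compute")
    return {
        "notebook": name,
        "runtime_s": runtime_s,
        "shuffle_ratio": shuffle_bytes / input_bytes if input_bytes else None,
        "cpu_util": cpu_util,
        "flags": flags,
    }
```

Land these rows in a Lakehouse table on every run and the “which notebooks are the culprits” question becomes a filter on `flags` instead of an investigation.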
Optimization Strategy: Right-Sizing Spark Workloads
Step 1: Identify Inefficient Notebooks
Pull runtime, shuffle, and utilization data, and rank your notebooks by cost and inefficiency. You’re looking for the combination of high cost and low utilization — that’s where the easiest wins are. Don’t try to optimize everything at once; start with the handful of notebooks that account for the majority of your compute spend.
Step 2: Downsize Clusters
Once you know which notebooks are over-provisioned, reduce executor count and core allocation iteratively. Run the job, check the History Server, and adjust again if needed. The goal isn’t to make the cluster as small as possible — it’s to find the configuration that gets the job done without carrying unnecessary dead weight.
Step 3: Enable Smart Features
Turn on AQE. For joins that involve a small reference table, use broadcast joins to skip the shuffle entirely. These are low-effort, high-return changes that should happen before you start rewriting any query logic.
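Why does broadcasting skip the shuffle? Because the small table is copied whole to every executor, each partition of the big table joins locally and no fact rows move. The model below is plain Python for illustration; the real PySpark mechanism is the `broadcast` hint noted in the comment.

```python
def broadcast_join(fact_partitions, small_table):
    """Model of a broadcast join: the small table is copied to every
    executor, each partition joins locally, and no fact rows move
    across the network."""
    lookup = dict(small_table)  # the 'broadcast' copy on each executor
    return [[(key, val, lookup[key]) for key, val in rows if key in lookup]
            for rows in fact_partitions]

# In PySpark the same idea is the broadcast hint:
#   from pyspark.sql.functions import broadcast
#   fact_df.join(broadcast(dim_df), "key")
```

Contrast this with the shuffle model earlier: same join result, zero rows moved — which is exactly why it’s a high-return change for small reference tables.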
Step 4: Implement Observability
Build dashboards in Fabric or Power BI that surface the metrics above at a notebook level. Make them visible to the team, not just to whoever is running the optimization project. When engineers can see in real time what their notebooks are costing, it changes how they write and configure them — and that cultural shift is what makes the improvements stick long-term.
Final Thought
Spark optimization in Fabric isn’t a one-time cleanup project. The environments that stay efficient are the ones where cost and performance visibility are built into how the team operates — where regressions get caught early, where cluster configs get reviewed when workloads change, and where engineers have enough observability to make informed decisions rather than guesses.
The good news is that most Fabric environments have a lot of room to improve, and the gains from doing it properly are real. Better performance, lower cost, and a data platform that scales the way it was supposed to when the migration was first sold to the business. That’s worth the effort.

