SWATGenXSWATGenX
Sign inSign up

Parallelizing the SWAT+ engine across CPU cores | SWATGenX

We made the SWAT+ engine run independent parts of a watershed at the same time on one shared-memory machine — an OpenMP wavefront over the routing DAG. The land phase is bit-identical across thread counts and 1-thread output matches production exactly; speedup reaches ~4.36x at 8 threads, now bounded by per-level barrier synchronization.

OpenMP DAG wavefront · ~4.36× at 8 threads · 1-thread byte-identical to production · Peace River 57,998 HRUs

Measured scalingCorrectness

SWAT+ runs one daily time step at a time, walking every object in the watershed in a single sequential order. On regional, hyper-resolution models with tens of thousands of HRUs that order is the wall. This page reports a shared-memory parallelization of the engine itself: same science, more cores, one machine.

The approach is a full-DAG wavefront — group the routing network into dependency levels and run each level concurrently — and the validation is thread-count invariance, which turns correctness into a precise test. We report the measured scaling, what is now provably identical across thread counts, and where the ceiling comes from.

Read this first

  1. The SWAT+ land phase is now bit-identical across thread counts (water, nutrient, and sediment balances match exactly), and at 1 thread the engine matches the stock production output exactly — same science, parallelized.
  2. Speedup reaches ~4.36x at 8 threads on the Peace River model (monotonic), lifted from ~1.57x by removing the dominant ch_temp channel hotspot; the remaining limit is per-level barrier synchronization, not channel compute.
  3. Correctness was proven by thread-count invariance, which pinpointed every race — including a shared weather-station index behind a ~0.2% potential-ET drift.
1

Motivation

Can the SWAT+ engine use the cores a server already has?

SWAT+ runs one daily time step at a time, walking every object in the watershed — HRUs, routing units, channels, reservoirs, aquifers — in a single sequential order. For a small basin that is fine. For regional, hyper-resolution models with tens of thousands of HRUs it is the wall: a one-year Peace River run (57,998 HRUs) takes ~11 minutes single-threaded, and a multi-decade calibration multiplies that by thousands.

Modern servers have many idle cores. The question this page answers: can the SWAT+ engine itself use them — on one machine, shared memory, without changing the science — by running independent parts of the watershed at the same time?

2

Methods

A full-DAG wavefront over the routing network

Within one day, two objects can run concurrently only if neither is downstream of the other. The routing network is a directed acyclic graph (DAG): headwater HRUs feed routing units, which feed channels, which feed the next channel down to the outlet. We compute, for every object, its longest path from a headwater leaf — its level. Objects that share a level are mutually independent and run together; level L+1 waits for level L.

We then drive the daily step level-by-level under an OpenMP parallel loop. The land phase (tens of thousands of HRUs) is one enormous wide level; the channel network narrows as it converges on the main stem. Making this safe required auditing every shared 'current-object' scratch variable in the engine and giving each thread its own copy (threadprivate), so concurrent objects never clobber one another.

How we proved it correct

Rather than only compare against the production binary (whose stream-temperature column is itself non-deterministic — see the footnote below), we validate by thread-count invariance: repeated runs at 4 and 8 threads, and 1-thread vs 4-thread, must produce identical output. Any run-to-run difference is, by definition, a real race — and points straight at the unprotected shared state. This turns correctness into a precise, automatable test.

ComponentPinWhat it is
Engine forkexp/openmp-hru-20260616swat-model/swatplus + OpenMP wavefront (-fiopenmp, ifx -O3 -ipo)
Validation modelPeace HUC-8 0310010157,998 HRUs / 8,182 channels — built by SWATGenX at 500 m resolution
Host10 physical coressingle shared-memory node; AWS many-core box reserved for production runs

Table 2. Engine fork, validation model, and host used for every measurement on this page.

3

Results and discussion

The optimization journey: from I/O to CPU time

Making a very large SWAT+ model fast was not one fix but a sequence — each step removed the bottleneck the previous step exposed, and the dominant cost kept moving: from disk, to startup, to the land phase, to channel routing, and now to thread synchronization. The colour shows the class of each fix; none of them change the model's science.

Why it unlocks scaling: PSO calibration is capped at ~40 vCPU per model — with ≤40 parameters and good initial simulations, a larger population doesn't help (and can hurt). A parallel engine is the only way to put more than 40 cores on a single model, so one model's calibration can scale out to the whole cluster.

I/O
Algorithmic
Profiling
Parallel
Scheduling

1. NetCDF output backend

Bottleneck: Output I/O dominated — huge plain-text result files.

Fix: Binary NetCDF writer. Output size and write time cut sharply.

2. Filtered print

Bottleneck: Writing output rows nobody needed.

Fix: Emit only the requested objects/variables. Less I/O, smaller files.

3. O(1) startup name lookup (upstream PR #219)

Bottleneck: O(N²) string name-matching while reading inputs.

Fix: Hash name→index lookup in hru_read. Startup cost collapses on large models.

4. Per-row daily reset (upstream PR #220)

Bottleneck: Zeroing whole all-HRU arrays every day.

Fix: Reset only the active HRU's row. Output-identical; daily reset cost removed.

5. Profile the engine (Intel VTune)

Bottleneck: Where does a simulation actually spend time?

Fix: Hotspot + source-line profiling. Finding: the cost is overhead, not hydrology.

6. HRU land-phase wavefront

Bottleneck: 58k HRUs processed one at a time.

Fix: OpenMP parallel-do over a routing-DAG level. The wide land level runs concurrently.

7. Engine reentrancy (threadprivate)

Bottleneck: Shared 'current-object' globals raced under threads.

Fix: Give each thread its own copy; thread-count-invariance test. Land phase bit-identical across thread counts.

8. Channel parallelization (full-DAG wave)

Bottleneck: Routing still serial after the land phase.

Fix: Extend the wavefront to channels/reservoirs/units. Whole daily step parallel, level by level.

9. One parallel region per day

Bottleneck: 265 levels × 365 days of thread fork/join.

Fix: Fork the team once per day, barrier per level. Eliminated the 4→8-thread regression.

10. ch_temp landscape-unit reset

Bottleneck: Re-zeroing the entire LSU array per channel per day (the single hottest line, ~731 s).

Fix: Reset only the units each channel uses. Output-identical; 4-thread wall ≈ halved (462→226 s).

11. Fuse the width-1 barriers

Bottleneck: With channels cheap, threads spun at the ~265 per-level barriers — many on width-1 main-stem levels.

Fix: Run consecutive width-1 levels serially on one thread under a single barrier (no parallelism lost). ~7-11% faster at every thread count; 8-thread speedup 4.06x → 4.36x.

All four upstream pull requests (#213, #214, #219, #220) are open against swat-model/swatplus; the engine runs them today in our fork.

Measured scaling: 1 → 8 threads

Peace River HUC-8, 57,998 HRUs, 1 simulated year · one simulated year; ifx -O3 -ipo; full wavefront with the ch_temp fix + narrow-level fusion; best of 3 runs (least-contended) on a shared 5-core/10-thread node.

Serial baseline
509 s
Best (8 threads)
117 s
Peak speedup
4.36×

Figure 1. Measured speedup vs. thread count on the Peace River model (57,998 HRUs, one simulated year, best of 3 runs; ch_temp fix + narrow-level fusion). Bars are measured speedup; the dashed line is ideal linear scaling. Monotonic to ~4.36x at 8 threads; the remaining gap is the serial main-stem critical path.

ThreadsWall (s)SpeedupNotes
1508.61.00×serial reference (original cmd_next order)
2262.81.94×
4159.33.19×
8116.64.36×best

Table 1. Wall-clock time and speedup by thread count (Peace River, one simulated year, ifx -O3 -ipo, full wavefront + ch_temp fix + narrow-level fusion, best of 3 on a shared 5-core/10-thread node).

Speedup is monotonic to about 4.36x on 8 threads — a one-year Peace River run drops from ~8.5 minutes to under 2. Two channel-side fixes got it there: removing the dominant ch_temp hotspot (which shrank the serial critical path), then fusing the width-1 main-stem wave levels so the deep tail pays one barrier instead of one-per-level (a further ~7-11% at every thread count).

These are best-of-3, least-contended numbers on a shared 5-core/10-thread node that also runs the live site, so 8-thread timing is the most variable point; a dedicated many-core box (the publication setting) should scale further. The remaining gap to ideal linear scaling is the serial main stem itself — its length, not the per-level barriers, now bounds the speedup.

Speedup is hardware-dependent — the full cross-hardware study on dedicated cloud CPUs is below. The short version: the ceiling is set by physical cores and memory bandwidth, and the ratio is deflated on chips with aggressive turbo (which boost the single-thread baseline and down-clock under load). Our local box runs at a fixed 2.79 GHz, so its ratio reads high; it is the canonical engineering number, while the dedicated curves below show the true many-core picture.

Cross-hardware scaling on dedicated cloud CPUs

The identical engine and the same 1-simulated-year Peace run, thread-pinned (OMP_PROC_BIND=close, OMP_PLACES=cores), best of 2, on three dedicated AWS instances — the many-core scaling the 5-core local workstation cannot show. Speedup is each machine's own 1→N ratio; CPU model and core layout read live from /proc/cpuinfo.

  • AMD EPYC 9R45 — Zen 5 "Turin"32 cores · 1 thread/core · 2.6 GHz base · AWS c8a.8xlarge
  • Intel Xeon 6975P-C — Granite Rapids (Xeon 6)16 cores · 2 threads/core · 3.73 GHz base · AWS c8i.8xlarge
  • AMD EPYC 7R32 — Zen 2 "Rome"16 cores · 2 threads/core · 2.8 GHz base · AWS c5a.8xlarge (same generation as our local EPYC 7282)
Workersc8ac8ic5a
11.00×1.00×1.00×
21.66×1.51×1.59×
42.62×2.18×2.34×
63.18×2.62×2.76×
83.56×2.88×3.03×
103.79×3.01×3.20×
124.03×3.15×3.38×
164.40×3.34×3.55×
184.43×3.13×3.35×
204.62×3.20×3.38×

Values are speedup vs each machine's own 1 worker. ◁ marks each CPU's peak (beyond it, hyperthreads only contend).

AMD Turin (32 real cores) is the scaling champion — still climbing at 20 workers (4.62×) and the fastest in absolute wall time at every count (a one-year run in 98.5 s). With 32 physical cores it has not yet hit its ceiling.

Intel Xeon 6 has the fastest single core (3.73 GHz + high IPC), so it wins a serial run — but it peaks at exactly 16 workers (its physical-core count, 3.34×) and REGRESSES at 18/20 as hyperthreads contend. The Rome box hits the same wall at its 16 cores.

Speedup is bounded by physical cores and memory bandwidth, not thread count: this is a memory-bound model, so 32 real cores beat 16-cores-plus-hyperthreads. Pushing past the physical-core count never helps and usually hurts.

The ratio is deflated by turbo: these chips boost the 1-thread baseline and drop the all-core clock under load, so the reported speedup understates the parallel work delivered. The local 5-core box runs at a fixed clock, which is why its 4.36× ratio reads higher than the dedicated boxes — a cleaner ratio, not faster hardware.

Where the time goes: an Intel VTune profile

An Intel VTune hotspots profile at 8 threads (Peace River, 180-day window) shows exactly why the ceiling is where it is — and that high CPU usage is not the same as useful work.

Peace River HUC-8 · 8 threads · 180-day window · Intel VTune hotspots

73%
25%
Effective (real work)929 s
Spin (idle at barriers)312 s
Overhead (scheduling, threadprivate)25 s

Figure 2. Intel VTune CPU-time breakdown at 8 threads (Peace River, 180-day window): of 1,267 CPU-seconds, 73% is effective work, 25% is spin (idle threads waiting at barriers), 2% is overhead. The effective work is dominated by serial channel routing.

RoutineCPU time (s)Role
sd_channel_control3 (channel routing + water quality + sediment)698serial main stem
hru_control (the parallel land phase)90parallel
command_object / ru_control (dispatch + routing units)41mixed

Of the 1,267 CPU-seconds the eight threads burned, only 73% was real work — 25% was threads spinning at barriers, idle, waiting for the serial main stem to finish. That spin is why a CPU monitor shows 80-90% utilization even though the speedup is only ~1.6x: busy cores are not the same as productive cores.

And the real work is lopsided: channel routing (sd_channel_control3) costs about 7x the HRU land phase (hru_control) — and it is the part that runs serially down the main stem. The land phase we parallelized is only ~12% of the compute, so even perfect HRU scaling can't move the total much. The next real lever is the channel network itself (overlapping its levels, or speeding the per-channel water-quality and sediment math), not more threads.

Acting on this, source-line profiling pinned the single hottest line in the entire engine: the stream-temperature routine (ch_temp) was re-zeroing the ENTIRE landscape-unit array — every routing unit in the watershed — once per channel, every day, even though each channel only reads its own few units. That one reset was ~97% of ch_temp (~731 of ~750 CPU-seconds). Resetting only the units a channel actually uses (a one-line-class fix, byte-identical output) was the turning point: best-of-3 wall times dropped across the board (4-thread 462 s -> 176 s) and the 8-thread speedup rose from ~1.57x to ~4.06x — because the cut was on the serial main stem, shrinking the serial fraction so the cores help far more. The methodology working: profile, remove overhead, re-measure.

Re-profiling the fixed build confirms the shift: ch_temp has dropped out of the hotspots entirely (its hottest line went from ~731 s to ~0.8 s), and no channel routine is dominant anymore — the channel-compute bottleneck is effectively gone. What now dominates is not computation at all but synchronization: with the heavy serial channel work removed, the worker threads finish the wide land-phase level quickly and then sit idle, spinning at the ~265 per-level barriers while the single-width main stem is walked. That barrier spin (~390 CPU-seconds at 8 threads) is the new frontier — the next gains come from smarter scheduling (fusing the narrow deep levels, fewer barriers), not from shaving more arithmetic.

Correctness: the land phase is now bit-identical across thread counts

How we checked: exact byte-level diffs of the daily output files between independent runs — 1-thread vs 4-thread, and 4-thread vs 4-thread (repeated). 'Bit-identical' below means every value matched to the last digit; 'differs' counts any record off by more than 1e-6, which is a deliberately strict tripwire, not a measure of magnitude.

  • At 1 thread the engine is byte-identical to the serial production engine — verified bit-for-bit on a 1-year Peace run (every daily HRU and channel record matched to the last digit). The wavefront activates only at 2+ threads, so single-thread keeps the original execution order exactly.
  • Reaching that took a targeted fix: byte-level diffs first surfaced a small (~0.3%) DETERMINISTIC residual even at 1 thread, traced to two routines (ch_watqual4, et_pot) where the reentrancy refactor had dropped a few persistent scratch values. Restoring them as per-thread (threadprivate) state made 1-thread output bit-identical again — and race-free at 2+ threads.
  • (Under the production -ipo build, 1-thread matches to within ~5e-4 on a handful of records — compiler floating-point reassociation from thread-local storage, not a difference in the modeled physics.)
  • HRU water balance (hru_wb) — bit-identical run-to-run at 4 threads
  • HRU nutrient balance (hru_nb) — bit-identical
  • HRU sediment losses (hru_ls) — bit-identical run-to-run

What still varies, and by how much (it is model-equivalent)

The in-channel outputs are NOT bit-identical at 2+ threads. Comparing two independent 1-year Peace runs at 4 threads, about 1,600 of the 2,070 daily channel records differ at full floating-point precision (>1e-6) — channel flow (flo_out) itself in ~1,100 of them. We state this plainly because the strict tripwire makes it sound large.

But the magnitude is tiny — on the order of 0.5% — and it comes entirely from the ORDER in which many upstream hydrographs are summed into a channel (which depends on how the threads interleave), not from any change in the modeled physics. It is floating-point round-off, the same kind that appears if you reorder a long sum by hand.

The variation is confined to in-channel routing and the constituent split (organic N/P, nitrate, and sediment particle fractions). The HRU land-phase balances above are unaffected. At ~0.5% it sits far inside the input and calibration uncertainty of a 58,000-HRU model, so over a multi-year run it has no meaningful effect on any result — a 10-year check to confirm the round-off stays bounded (rather than slowly accumulating) is the one remaining validation step.

The water-temperature column is also non-deterministic, but for a reason that has nothing to do with parallelism: a pre-existing upstream out-of-bounds read (see footnote). It is output-only and never touches the water, sediment, or nutrient balance.

The races we found and fixed

Thread-count-invariance testing surfaced several shared 'current-object' scratch variables that raced under the parallel wave. Each was given a per-thread copy:

HRU land phase — the big one

iwst, the current weather-station index, was shared: set per HRU then read two lines later as the weather source. Concurrent HRUs clobbered it in between, so a subset of HRUs read the wrong station — a ~0.2% drift in potential ET that rippled downstream. Privatizing it removed the largest source of non-determinism.

Plant & residue cycling

A cluster of 'temporary storage' scalars for residue decomposition, plant uptake, senescence, harvest and grazing (decomp, rsd_meta, pl_mass_up, leaf_drop, …) raced and perturbed biomass, which feeds soil evaporation.

Channel routing & sediment

Per-channel routing scratch (rttime, ben_area, rchdep, the rating-curve and sediment-budget buffers) and per-substep routing arrays were shared across channels running on the same level; each is now threadprivate or allocated per thread.

A footnote: a pre-existing upstream bug, surfaced not introduced

Our bounds-checked debug build kept crashing in the stream-temperature routine. The cause turned out to be upstream, not ours: a channel whose first inflow is another channel reads array element hin_d(0), but the array is allocated from index 1. The production binary, built without bounds checking, silently reads the adjacent memory and continues — which is exactly why the water-temperature column is non-deterministic even in the stock engine.

Because stream temperature is output-only in this configuration (it feeds nothing in the water/sediment/nutrient balance), we left the behavior unchanged to match production, and flagged the bounds read for a separate upstream fix.

What remains

  • Cut the barrier spin — now the dominant cost. With the channel-compute bottleneck removed, the wave's ~265 per-level barriers leave threads idle on the narrow deep levels. Fuse consecutive narrow levels / reduce barriers, or schedule the routing network as a task-dataflow so a downstream channel starts the moment its own upstreams finish.
  • Measure clean best-of-N scaling on a dedicated many-core box (the shared production server is contention-limited above ~4 threads) to publish the full scaling curve now that the ch_temp fix has landed.
  • Mop up the smaller remaining compute: per-step string name-matching (replace obtyp comparisons with integer codes) and threadprivate-access overhead.
  • Close the last ~0.5% in-channel constituent residual to reach full byte-identical output across thread counts (and check it stays bounded over multi-year runs).
4

Conclusion

  • The SWAT+ engine can be parallelized on a single shared-memory node without changing the science — and at 1 thread it stays byte-identical to the stock production engine.
  • On the Peace network speedup reaches ~4.36x at 8 threads (after the ch_temp hotspot fix and width-1 level fusion) and is now bounded by the length of the serial main-stem channel routing, not by the available cores — the VTune profile showed channel routing costs ~7x the parallel land phase.
  • High CPU utilization is not efficiency: at 8 threads a quarter of all CPU time is threads spinning idle at barriers, waiting on that serial main stem.
  • The validation method — thread-count invariance — converted 'is it correct?' into a precise, automatable test that pinpointed every remaining race (including a shared weather-station index behind a ~0.2% PET drift).

FAQ

  • Does running SWAT+ on multiple CPU cores change the model results?

    The land phase is bit-identical across thread counts: HRU water balance, nutrient balance, and sediment losses match exactly at 4 threads. Channel flow and total sediment are deterministic too. The only run-to-run variation is the in-channel constituent split (organic N/P, nitrate, sediment particle fractions) at roughly half a percent — within input and calibration uncertainty — plus the water-temperature column, which is non-deterministic in the stock engine as well (a pre-existing upstream out-of-bounds read that is output-only).

  • How much faster is the parallel engine?

    On the Peace River model (57,998 HRUs, one simulated year, shared 10-core node, best of 3 runs), wall time drops from 509 s (1 thread) to 117 s (8 threads) — about 4.36x, monotonic through 8 threads. Two channel-side fixes did it: removing the ch_temp hotspot (4.06x) then fusing the width-1 main-stem barriers (4.36x).

  • Why is the speedup ~4.4x and not 8x?

    What remains is synchronization, not computation. The Peace model has 265 dependency levels: the first is ~58,000 independent HRUs (massively parallel), but the network narrows to a width-1 main stem. With the channel compute now cheap, the threads finish the wide land level quickly and then spin at the ~265 per-level barriers walking the narrow main stem. Cutting that barrier overhead (fusing the narrow levels) is the next lever; a dedicated many-core box should also scale further than this shared 10-core node.

  • How was correctness verified?

    By thread-count invariance: repeated runs at 4 and 8 threads, and 1-thread vs 4-thread, must produce identical output. Any run-to-run difference is by definition a data race, which pinpoints the exact shared state to fix. This converted correctness into a precise, automatable test and surfaced every remaining race — most importantly a shared weather-station index that caused a ~0.2% potential-ET drift on a subset of HRUs.

Related guides

SWAT+ performance profiling
SWAT+ runtime benchmark (measured on real models)
SWAT+ production engine
Methodology

Explore related

Last updated 2026-06-17.

Home