Has anyone made SWAT+ run in parallel for both HRUs and streams?

Yes. SWATGenX built a shared-memory OpenMP parallelization of the SWAT+ engine that runs both the HRU land phase and the channel/routing network in parallel. It is published as an open fork of swat-model/swatplus at https://github.com/rafiei-vahid/swatplus (branch main). It parallelizes a single simulation on one machine (not just ensembles of independent runs) with a wavefront over the routing directed-acyclic graph: objects that share a dependency level are mutually independent and run concurrently. On a dedicated 32-core AWS c8a node the routing wavefront scales to 5.33× at 24 threads, and the end-to-end gain is about 7.1× versus our single-core baseline build. The output is byte-identical to the serial engine at every thread count, in both the HRU-parallel mode and the full routing wavefront, for every variable of every output file, including in-stream water quality. The one caveat is the toolchain: production-scale builds use the Intel ifx compiler. Full methodology and benchmarks are in an engine-acceleration study in preparation (2026).

Is the parallel engine byte-identical to the serial engine?

Yes, at every thread count and in both parallel modes. The HRU-parallel mode and the full routing wavefront each reproduce the serial engine bit-for-bit across every variable of every output file, including in-stream water quality, and two independent parallel runs are identical to each other. The evidence we lead with is a fixture demonstrated to be capable of failing, because a test that cannot fail certifies nothing: on a 777-HRU model whose scheduled fertilizer, pesticide, tile-drainage and PFAS operations are all verified to execute, eighteen runs — three trials at each of 2, 4 and 8 threads, in both parallel modes — left all 935,130 compared values unchanged, while the uncorrected binary failed the same check on 1.0–2.4% of those values in 5 of 5 trials. A broader five-basin comparison (6,972 to 28,559 HRUs, two simulated years, water quality active) covering 8,144,362,709 values per mode also shows none differing; it is reported as breadth, not as proof, because it exercises the order dependencies but never reaches the race. Reaching exact agreement required removing five mechanisms that carried state between simulated objects: three order dependencies — where the answer depended on visitation order — and two genuine data races. Two of the order dependencies are defects in SWAT+ itself, present in the serial engine and inherited unchanged from upstream.

How much faster is the parallel engine?

Two independent layers. Serial overhead removal needs no threads and changes no results: 1.67× on the c8a reference node, up to 2.47× on more memory-starved CPUs (c8i Intel) and 2.16× on c5a. Those figures are measured on a serial-only build, and they are not what a single-threaded run of the shipped engine delivers: the production binary is compiled with OpenMP, which costs 1.20–1.24× at one thread against a serial-only build of the same core (measured twice on c8a). That tax is close to fixed, so it is repaid only once a model is large enough — across seven watersheds from 777 to 57,998 HRUs, one-thread runs of the shipped engine came in at 0.70–0.92× of stock on the six smaller basins and 1.24× on the largest. Use more than one thread and every one of them is far ahead; the serial ladder is a statement about the engineering, not a free speedup on a one-core run. On top of that, the routing wavefront scales a single model to 5.33× at 24 threads on a dedicated 32-core AWS c8a node (3.95× at 8). Composed against our single-core baseline build on the same machine, a one-year run of the 57,998-HRU Peace River model drops from 191.9 s to 26.9 s — about 7.1× end-to-end. That baseline already carries the NetCDF backend and the channel_sd print filter, together worth about 1.1×, so against unmodified upstream SWAT+ the figure is larger, near 8×; we quote 7.1× because it is the one we measured directly, and a measurement is a stronger claim than a composition. That fastest configuration is the full routing wavefront, and it is byte-identical to the serial engine — reproducibility is not traded away for it. Enforcing the correct execution order costs 0.05% of the available eight-thread parallelism. All timings are on quiet, dedicated AWS instances, thread-pinned, best of two; we report no shared- or contended-host numbers.

How do I turn parallel routing on or off?

Thread count is set with OMP_NUM_THREADS; at one thread the engine runs the original serial path. Routing parallelism is controlled by SWATPLUS_ROUTING_SERIAL: =1 keeps channel routing in command order (HRUs still parallel), =0 enables the full routing wavefront, which is the faster of the two. Both are byte-identical to the serial engine, so the choice is purely about speed — routing-serial is kept as a fallback and an independent cross-check.
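As a minimal sketch of how a driver script might set these switches, the helper below builds the environment for one run. The two engine variables and the pinning variables (OMP_PROC_BIND, OMP_PLACES) are the ones named on this page; the helper name, the executable name and the model directory in the usage comment are assumptions for illustration only.

```python
import os

def swatplus_env(threads, routing_serial=False, pin=False, base_env=None):
    """Return a copy of the environment with the engine switches set."""
    if threads < 1:
        raise ValueError("threads must be >= 1")
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(threads)
    # "1" keeps channel routing in command order (HRUs still parallel);
    # "0" enables the full routing wavefront.
    env["SWATPLUS_ROUTING_SERIAL"] = "1" if routing_serial else "0"
    if pin:
        # Standard OpenMP pinning settings, as used for the benchmarks here.
        env["OMP_PROC_BIND"] = "close"
        env["OMP_PLACES"] = "cores"
    return env

# Usage (hypothetical executable name and model directory):
#   import subprocess
#   subprocess.run(["./swatplus"], cwd="path/to/TxtInOut",
#                  env=swatplus_env(8, pin=True), check=True)
```

Because both modes are reported as byte-identical to serial, switching `routing_serial` in a script like this should change only the wall time, which makes it a convenient cross-check.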

What compiler does the parallel engine need?

The Intel Fortran compiler (ifx, -O3 -ipo; OpenMP via -fiopenmp). At regional hyper-resolution scale (tens of thousands of HRUs) gfortran is not reliable for these large models, so ifx is used for production-scale builds. The parallelization itself is standard OpenMP, and there is no outstanding correctness item: both parallel modes are byte-identical to the serial engine. An earlier diagnosis blaming ifx for placing derived-type temporaries in shared static storage was investigated and disproved — the real causes were Fortran's implicit-SAVE rule and two order dependencies in SWAT+ itself, none of which is compiler-specific.

How was correctness verified?

By thread-count invariance: repeated runs at different thread counts, and 1-thread versus the stock serial engine, must produce identical output to the last digit. A run-to-run difference at a fixed thread count is a data race, which pinpoints the shared state to fix. A parallel-versus-serial difference that is perfectly reproducible is something else: order dependence, where state leaks between simulated objects so the answer depends on the sequence they were visited in — not a race, and not something a race detector reports. Insisting on bit-for-bit equality surfaced all of them. It found a shared weather-station index that caused a ~0.2% potential-ET drift, and in the end five mechanisms carrying state between simulated objects — three order dependencies and two genuine data races. Two of the order dependencies are defects in SWAT+ itself that a streamflow-only check, and ThreadSanitizer alone, would both have missed.

SWAT+ engine acceleration

Parallelizing the SWAT+ engine across CPU cores

An OpenMP wavefront over the routing DAG runs a watershed’s independent parts at once — byte-identical across every thread count, scaling to 5.33× at 24 threads on a 32-core node.

Measured on the 57,998-HRU Peace River benchmark on dedicated AWS instances. From the SWATGenX engine-acceleration study, in preparation (2026); open-source engine fork on GitHub.

Byte-identical at every thread count
5.33× at 24 threads (AWS c8a)
~7.1× vs single-core baseline build
1.67–2.47× serial, no threads

Thread scaling: one simulated year · AWS c8a · 32 cores

One-year run: 143 s → 27 s. Self-relative speedup of the parallel binary; still climbing at 24 threads.

This page is the web companion to the SWATGenX engine-acceleration study — Accelerating a Regional Hyper-Resolution SWAT+ Model Without Changing the Science (manuscript in preparation, 2026). Every number below is measured on the 57,998-HRU Peace River benchmark on dedicated AWS instances and matches the manuscript; the open engine, with each optimization as an individually validated commit, is on the fork's main branch.

Watch: the routing wavefront explained

A 4-minute explainer of why watershed models are slow, why the channel network limits what can run in parallel, and how the wavefront unlocks it without touching the science.

Motivation

Can the SWAT+ engine use the cores a server already has?

SWAT+ runs one daily time step at a time, walking every object in the watershed — HRUs, routing units, channels, reservoirs, aquifers — in a single sequential order. For a small basin that is fine. For regional, hyper-resolution models with tens of thousands of HRUs it is the wall: a one-year run of the 57,998-HRU Peace River benchmark takes ~3 minutes on a fast modern core (and 5–6 on an older one), and a multi-decade calibration multiplies that by thousands of evaluations.

Modern servers have many idle cores. The question this page answers: can the SWAT+ engine itself use them — on one machine, shared memory, without changing the science — by running independent parts of the watershed at the same time?

Methods

A full-DAG wavefront over the routing network

Within one day, two objects can run concurrently only if neither is downstream of the other. The routing network is a directed acyclic graph (DAG): headwater HRUs feed routing units, which feed channels, which feed the next channel down to the outlet. We compute, for every object, its longest path from a headwater leaf — its level. Objects that share a level are mutually independent and run together; level L+1 waits for level L.
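The level assignment described above can be sketched as a longest-path pass over the DAG in topological order. This is an illustrative Python sketch, not the engine's Fortran code; the object names and the tiny example network are made up.

```python
from collections import defaultdict, deque

def assign_levels(nodes, edges):
    """Longest-path level of every node; objects with no inflow are level 0."""
    downstream = defaultdict(list)
    indegree = {n: 0 for n in nodes}
    for up, down in edges:
        downstream[up].append(down)
        indegree[down] += 1
    level = {n: 0 for n in nodes}
    ready = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while ready:
        n = ready.popleft()
        visited += 1
        for d in downstream[n]:
            # A node sits one level below its deepest upstream neighbour.
            level[d] = max(level[d], level[n] + 1)
            indegree[d] -= 1
            if indegree[d] == 0:
                ready.append(d)
    if visited != len(nodes):
        raise ValueError("routing graph has a cycle")
    return level

def waves(level):
    """Group objects by level; each group can run concurrently."""
    by_level = defaultdict(list)
    for n, l in level.items():
        by_level[l].append(n)
    return [sorted(by_level[l]) for l in sorted(by_level)]

# Hypothetical network: three HRUs, two routing units, three channels.
nodes = ["hru1", "hru2", "hru3", "ru1", "ru2", "cha1", "cha2", "cha3"]
edges = [("hru1", "ru1"), ("hru2", "ru1"), ("hru3", "ru2"),
         ("ru1", "cha1"), ("ru2", "cha2"), ("cha1", "cha3"), ("cha2", "cha3")]
schedule = waves(assign_levels(nodes, edges))
# [['hru1', 'hru2', 'hru3'], ['ru1', 'ru2'], ['cha1', 'cha2'], ['cha3']]
```

Note that the result is only as good as the edge list: any coupling left out of `edges` places an object on too early a level.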

We then drive the daily step level-by-level under an OpenMP parallel loop. The land phase (tens of thousands of HRUs) is one enormous wide level; the channel network narrows as it converges on the main stem. Making this safe required auditing every shared 'current-object' scratch variable in the engine and giving each thread its own copy (threadprivate), so concurrent objects never clobber one another.
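A toy version of the level-by-level daily step, assuming the levels are already known, might look like the following. It uses a Python thread pool in place of OpenMP, and the "routing" arithmetic (local input plus 0.9 times the summed inflow) is invented purely so there is something to compute; none of the names come from the engine.

```python
from concurrent.futures import ThreadPoolExecutor

def run_day(levels, upstream, local_input, workers):
    """Process one day level by level; objects within a level run concurrently."""
    outflow = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for level in levels:
            def step(obj):
                # Sum inflows in a fixed order so the floating-point result
                # cannot depend on thread timing.
                inflow = sum(outflow[u] for u in sorted(upstream.get(obj, ())))
                return obj, local_input[obj] + 0.9 * inflow
            # list(...) waits for every object in the level: this is the
            # barrier that makes level L+1 wait for level L.
            for obj, value in list(pool.map(step, level)):
                outflow[obj] = value
    return outflow

levels = [["hru1", "hru2", "hru3"], ["ru1", "ru2"], ["cha1", "cha2"], ["cha3"]]
upstream = {"ru1": ["hru1", "hru2"], "ru2": ["hru3"],
            "cha1": ["ru1"], "cha2": ["ru2"], "cha3": ["cha1", "cha2"]}
local_input = {"hru1": 1.0, "hru2": 2.0, "hru3": 3.0,
               "ru1": 0.0, "ru2": 0.0, "cha1": 0.0, "cha2": 0.0, "cha3": 0.0}

serial = run_day(levels, upstream, local_input, workers=1)
parallel = run_day(levels, upstream, local_input, workers=4)
assert serial == parallel  # same values regardless of worker count
```

In this sketch each worker returns its result and only the main thread writes to `outflow`, so there is no shared scratch state to privatize. The engine's module-level "current-object" variables are the opposite case, which is why they needed threadprivate copies.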

How we proved it correct

We validate by thread-count invariance: a parallel run and a serial run of the same binary must produce identical output to the last digit, and so must two independent parallel runs. That turns correctness into a precise, automatable test with no tolerance to argue about.

It is also a stronger test than it first appears. A run-to-run difference at a fixed thread count is a data race. But a parallel-versus-serial difference that is perfectly reproducible is something else — order dependence, where state leaks from one simulated object to the next and the answer depends on the sequence they were visited in. That is not a race, no race detector reports it, and it is present in the serial engine too. Insisting on bit-for-bit equality is what surfaced it; a streamflow-only check, or ThreadSanitizer alone, would have missed it entirely.
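One way to automate such a check is to hash every output file of a reference run and a candidate run and list the files that differ. This is a sketch under assumptions: the function names are hypothetical, and comparing raw file bytes is stricter than comparing values, so it only works if the output files carry no run-specific metadata such as timestamps.

```python
import hashlib
from pathlib import Path

def digest_tree(root):
    """Map each file's relative path to the SHA-256 of its bytes."""
    root = Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*")) if p.is_file()
    }

def differing_files(reference_dir, candidate_dir):
    """Names of files that are missing on one side or differ in any byte."""
    ref = digest_tree(reference_dir)
    cand = digest_tree(candidate_dir)
    return sorted(name for name in ref.keys() | cand.keys()
                  if ref.get(name) != cand.get(name))

# Usage (hypothetical directories holding a serial and an 8-thread run):
#   assert differing_files("out_serial", "out_8threads") == []
```

Running the same comparison between two parallel runs at a fixed thread count separates the two failure classes: a difference there points to a race, while a stable parallel-versus-serial difference points to order dependence.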

Component	Pin	What it is
Engine fork	rafiei-vahid/swatplus	swat-model/swatplus + OpenMP wavefront (-fiopenmp, ifx -O3 -ipo); every optimization a separate commit
Benchmark model	Peace HUC-8 03100101	57,998 HRUs / 8,181 channels / 9,341 routing units — built by SWATGenX from NHDPlus HR + gSSURGO + PRISM
Hosts	dedicated AWS EC2	c8a / c8i / c5a .8xlarge — one model per box, thread-pinned (OMP_PROC_BIND=close), best of two

Table 2. Engine fork, benchmark model, and dedicated hosts used for the measurements on this page.

Results and discussion

The optimization journey: from I/O to CPU time

Making a very large SWAT+ model fast was not one fix but a sequence — each step removed the bottleneck the previous step exposed, and the dominant cost kept moving: from disk, to startup, to the land phase, to channel routing, and now to thread synchronization. The colour shows the class of each fix; none of them change the model's science.

Why it unlocks scaling: PSO calibration is capped at ~40 vCPU per model — with ≤40 parameters and good initial simulations, a larger population doesn't help (and can hurt). A parallel engine is the only way to put more than 40 cores on a single model, so one model's calibration can scale out to the whole cluster.

Fix classes: I/O · Algorithmic · Profiling · Parallel · Scheduling

1. NetCDF output backend

Bottleneck: Output I/O dominated — huge plain-text result files.

Fix: Binary NetCDF writer. → Output size and write time cut sharply.

PR #213 · results · commit 60f42ee

2. Filtered print

Bottleneck: Writing output rows nobody needed.

Fix: Emit only the requested objects/variables. → Less I/O, smaller files.

PR #214 · results · commit f4bd54b

3. O(1) startup name lookup (upstream PR #219)

Bottleneck: O(N²) string name-matching while reading inputs.

Fix: Hash name→index lookup in hru_read. → Startup cost collapses on large models.

PR #219 · results

4. Per-row daily reset (upstream PR #220)

Bottleneck: Zeroing whole all-HRU arrays every day.

Fix: Reset only the active HRU's row. → Output-identical; daily reset cost removed.

PR #220 · results

5. Profile the engine (Intel VTune)

Bottleneck: Where does a simulation actually spend time?

Fix: Hotspot + source-line profiling. → Finding: the cost is overhead, not hydrology.


6. HRU land-phase wavefront

Bottleneck: 58k HRUs processed one at a time.

Fix: OpenMP parallel-do over a routing-DAG level. → The wide land level runs concurrently.

commit 051db48

7. Engine reentrancy (threadprivate)

Bottleneck: Shared 'current-object' globals raced under threads.

Fix: Give each thread its own copy; thread-count-invariance test. → Land phase bit-identical across thread counts.

commit 2f3af67

8. Channel parallelization (full-DAG wave)

Bottleneck: Routing still serial after the land phase.

Fix: Extend the wavefront to channels/reservoirs/units. → Whole daily step parallel, level by level.

commit 25ef8cf

9. One parallel region per day

Bottleneck: Hundreds of levels × 365 days of thread fork/join.

Fix: Fork the team once per day, barrier per level. → Eliminated the 4→8-thread regression.

commit bda59be

10. ch_temp landscape-unit reset

Bottleneck: Re-zeroing the entire LSU array per channel per day (the single hottest line, ~731 s).

Fix: Reset only the units each channel uses. → Output-identical; nearly free serially but shrinks the parallel critical path.

commit 6d41300

11. Fuse the width-1 barriers

Bottleneck: With channels cheap, threads spun at the per-level barriers — many on width-1 main-stem levels.

Fix: Run consecutive width-1 levels serially on one thread under a single barrier (no parallelism lost). → Lifts the scaling ceiling a further ~7–11% at every thread count.

commit 22e17a3
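The idea behind step 11 can be illustrated with a small schedule transform: runs of consecutive single-object levels are merged into one serial chain, so they need one barrier instead of one per level. This is a sketch of the idea only; the function name and the ("parallel" / "serial") tuple representation are assumptions, not the fork's data structures.

```python
def fuse_width1(levels):
    """Merge runs of consecutive single-object levels into one serial chain."""
    schedule = []
    for level in levels:
        if len(level) == 1 and schedule and schedule[-1][0] == "serial":
            # Extend the current chain; order within the chain is preserved,
            # so each object still runs after the one it depends on.
            schedule[-1][1].append(level[0])
        elif len(level) == 1:
            schedule.append(("serial", [level[0]]))
        else:
            schedule.append(("parallel", list(level)))
    return schedule

levels = [["h1", "h2", "h3"], ["c1", "c2"], ["c3"], ["c4"], ["c5"]]
fused = fuse_width1(levels)
# Five levels (five barriers) become three schedule entries:
# [('parallel', ['h1', 'h2', 'h3']), ('parallel', ['c1', 'c2']),
#  ('serial', ['c3', 'c4', 'c5'])]
```

A width-1 level has nothing to run concurrently, so chaining such levels on one thread gives up no parallelism while removing the waits between them.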

All four upstream pull requests (#213, #214, #219, #220) are open against swat-model/swatplus; the engine runs them today in our fork.

Two products: a serial gain for everyone, and thread scaling for one model

Peace River HUC-8, 57,998 HRUs, 1 simulated year, dedicated AWS c8a.8xlarge (AMD EPYC "Turin", 32 cores, SMT off) · thread scaling of the final engine, self-relative to its own 1-thread wall; ifx -O3 -ipo; daily channel_sd output; thread-pinned, best of two on a dedicated box.

Serial baseline: 143 s · Best (24 threads): 27 s · Peak speedup: 5.33×

Figure 1. Measured thread scaling of the final engine on the dedicated c8a box (AMD EPYC "Turin", 32 cores, SMT off), 57,998-HRU benchmark, one simulated year, full wavefront, self-relative to its own 1-thread wall. Bars are measured speedup; the dashed line is ideal linear scaling. The full wavefront reaches 5.33× at 24 threads; with the 1.67× serial gain on top, the end-to-end acceleration versus our single-core baseline build (which already carries the NetCDF backend and print filter, together ~1.1×; against unmodified upstream SWAT+ the figure is larger, near 8×, but we quote the one we measured directly) is ~7.1×.

Threads	Wall (s)	Speedup	Notes
1	143.1	1.00×	serial reference (parallel binary, 1 thread)
2	84.1	1.70×	—
4	51.3	2.79×	—
8	36.2	3.95×	—
16	29.2	4.90×	—
24	26.9	5.33×	best (32-core box)

Table 1. Wall-clock time and self-relative speedup by thread count (final engine, dedicated AWS c8a.8xlarge, one simulated year, ifx -O3 -ipo, full wavefront, daily channel_sd output, best of two).

The campaign yields two independent products. Group A — serial overhead removal (the same fixes that drive the I/O→CPU journey above) — needs no threads and changes no model behavior, so it accelerates every existing run on a single core: a cumulative 1.67× on this c8a node (the stock engine's one-year wall drops 191.9 s → 115.2 s), and larger on more memory-starved cores (up to 2.47× on the Intel "Granite Rapids" part). Group B — the routing-DAG wavefront — then scales a single model across cores.

On the dedicated 32-core c8a box the full wavefront scales to 5.33× at 24 threads (table and chart, self-relative to its own 1-thread wall). Multiplying the two layers, the end-to-end acceleration versus our single-core baseline build (which already carries the NetCDF backend and print filter, together ~1.1×; against unmodified upstream SWAT+ the figure is larger, near 8×, but we quote the one we measured directly) is ~7.1× — a one-year run drops from 191.9 s to 26.9 s. Two channel-side fixes got the scaling there: removing the dominant ch_temp serial hotspot (which shrank the Amdahl serial fraction and so lifted the achievable ceiling), then fusing the width-1 main-stem wave levels.
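The arithmetic behind that composition can be checked from the walls quoted on this page. The sketch below assumes that the 115.2 s wall is the serial-only build and that the 143.1 s one-thread wall of the parallel binary differs from it only by the OpenMP cost. Under that reading the self-relative 5.33× is measured from a slower starting point than the 1.67× ends at, so the one-thread cost has to be divided back out.

```python
baseline_1core = 191.9  # s, single-core baseline build
serial_only = 115.2     # s, after serial overhead removal (Group A)
parallel_1t = 143.1     # s, parallel binary at 1 thread (Table 1)
parallel_24t = 26.9     # s, parallel binary at 24 threads (Table 1)

serial_gain = baseline_1core / serial_only    # about 1.67
openmp_cost = parallel_1t / serial_only       # about 1.24 (assumed reading)
thread_scaling = parallel_1t / parallel_24t   # about 5.3 from the rounded walls
end_to_end = baseline_1core / parallel_24t    # about 7.1

# The layers telescope: serial gain x thread scaling / one-thread cost.
composed = serial_gain * thread_scaling / openmp_cost
assert abs(composed - end_to_end) < 1e-9
```

The directly measured ratio (191.9 s to 26.9 s) is the figure quoted; the decomposition is shown only to make explicit how the individual factors relate to it.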

Scaling is bounded by physical cores and memory bandwidth, not thread count — this is a memory-bound model. The 32-core Turin keeps climbing to 24 threads; the 16-core Intel and AMD parts peak at their physical-core count and regress when threads spill onto SMT siblings. The full cross-hardware study on three dedicated cloud CPUs is below.
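As a rough illustration of that ceiling, Amdahl's law can be inverted to give the serial fraction implied by each row of Table 1. This is a back-of-envelope sketch, not an analysis from the study: in a memory-bound code with barrier spin, the "serial fraction" it returns is an effective value that lumps those effects together.

```python
def amdahl_serial_fraction(speedup, threads):
    """Serial fraction s implied by S = 1 / (s + (1 - s) / N)."""
    return (1.0 / speedup - 1.0 / threads) / (1.0 - 1.0 / threads)

# Self-relative speedups from Table 1 (c8a, full wavefront).
table1 = {2: 1.70, 4: 2.79, 8: 3.95, 16: 4.90, 24: 5.33}
implied = {n: amdahl_serial_fraction(s, n) for n, s in table1.items()}

for n, s in implied.items():
    # 1 / s is the speedup limit for that effective fraction as N grows.
    print(f"{n:>2} threads: effective serial fraction {s:.3f}, limit {1 / s:.1f}x")
```

On these numbers the effective fraction comes out at roughly 0.14 to 0.18 across the table, which is consistent with the curve flattening well below the ideal line.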

Cross-hardware scaling on dedicated cloud CPUs

The identical engine and the same one-simulated-year benchmark workload, thread-pinned (OMP_PROC_BIND=close, OMP_PLACES=cores), best of two, on three dedicated AWS instances. Speedup is each machine's own 1→N ratio on the parallel binary; the single-core baseline wall (our first production rung, not unmodified upstream) and the Group-A serial gain are listed per machine so the two layers can be composed.

AMD EPYC — Zen 5 "Turin" — 32 cores · SMT off · stock 191.9 s · serial gain 1.67× · AWS c8a.8xlarge
Intel Xeon — "Granite Rapids" — 16 cores · SMT on · stock 272.2 s · serial gain 2.47× · AWS c8i.8xlarge
AMD EPYC — Zen 2 "Rome" — 16 cores · SMT on · stock 350.6 s · serial gain 2.16× · AWS c5a.8xlarge

Workers	c8a	c8i	c5a
1	1.00×	1.00×	1.00×
2	1.70×	1.59×	1.62×
4	2.79×	2.41×	2.47×
8	3.95×	3.29×	3.21×
12	4.54×	3.75×	3.60×
16	4.90×	4.01× ◁	3.81× ◁
20	5.18×	3.73×	3.51×
24	5.33× ◁	3.74×	3.58×

Values are speedup vs each machine's own 1 worker. ◁ marks each CPU's peak (beyond it, hyperthreads only contend).

AMD Turin (32 real cores, SMT off) is the scaling champion — still climbing at 24 threads (5.33×) and fastest in absolute wall time (a one-year run in 26.9 s). With the 1.67× serial gain on top, that is ~7.1× end-to-end versus our single-core baseline build (which already carries the NetCDF backend and print filter, together ~1.1×; against unmodified upstream SWAT+ the figure is larger, near 8×, but we quote the one we measured directly).

Intel Granite Rapids has the fastest single core, so it wins the Group-A serial round (2.47×). But it peaks at exactly 16 workers (its physical-core count, 4.01×) and regresses at 20 and 24 workers as threads land on SMT siblings. The Rome box hits the same wall at its 16 cores (3.81×).

Speedup is bounded by physical cores and memory bandwidth, not thread count: this is a memory-bound model, so 32 real cores beat 16-cores-plus-SMT. Pushing past the physical-core count never helps and usually hurts.

The two layers are orthogonal and multiply: the serial gain is largest on the most memory-starved cores (2.47× on Intel), while the thread scaling is largest where there are the most real cores (5.33× on the 32-core AMD). The interactive calculator below composes them for any CPU/core/mode choice.

Estimate your speedup

Pick a CPU, a core count, and a scheduling mode to estimate how much faster one engine run (e.g. one calibration evaluation) becomes versus our single-core baseline build. Every curve is directly measured on a dedicated AWS instance (one model per box, threads pinned, best of two runs) with the 57,998-HRU benchmark and daily channel_sd output. Speedup = baseline single-thread wall / final-engine N-thread wall; serial and parallel gains compose multiplicatively. That baseline is our first production rung (fork 768f1d1), which already carries the NetCDF backend and the channel_sd print filter — together worth about 1.1× — so against unmodified upstream SWAT+ the figures are larger, near 8×. We quote the measured one.

CPU options: c8a Turin 32c · c8i Granite Rapids 16c · c5a Rome 16c

Scheduling modes: Full wavefront (HRUs and channel routing both parallel; byte-identical to serial for every variable, including in-stream water quality) · Routing-serial

Example setting (c8a Turin, full wavefront, 8 cores): Serial engine (Group A) 1.67× · Parallel @ 8 cores 3.18× · Total vs stock 1-core 5.30×

Total = serial × parallel = baseline single-core wall (192 s) ÷ final-engine 8-thread wall (36.2 s), both at the same daily channel_sd output scope. All three curves are directly measured on this CPU.

Total speedup (serial × parallel) versus our single-core baseline build, AWS c8a — AMD EPYC "Turin", 32 cores, full wavefront mode. Dashed line = ideal linear.

Wall-clock curves were measured on the engine as it stood in June 2026. The 2026-07-28 correctness work costs 0.05% of the available 8-thread parallelism by schedule census (maximum wave width 12,700 → 11,323), so the curves remain representative, but they have not yet been re-measured on the current binary.

Where the time goes: an Intel VTune profile

An Intel VTune hotspots profile of the final parallel engine at 8 threads (180-day window) shows exactly why the ceiling is where it is — and that high CPU usage is not the same as useful work.

Peace River HUC-8 · final engine · 8 threads · 180-day window · Intel VTune hotspots

Effective (real work): 929 s (73%) · Spin (idle at barriers): 312 s (25%) · Overhead (scheduling, threadprivate): 25 s (2%)

Figure 2. Intel VTune CPU-time breakdown of the final engine at 8 threads (Peace River, 180-day window): of ~1,267 CPU-seconds, 73% is effective work, 25% is spin (idle threads waiting at barriers), 2% is overhead. The effective work is dominated by serial channel routing (sd_channel_control3), ~7.7× the parallel land phase.

Routine	CPU time (s)	Role
sd_channel_control3 (channel routing + water quality + sediment)	698	serial main stem
hru_control (the parallel land phase)	90	parallel
command_object / ru_control (dispatch + routing units)	41	mixed

Of the ~1,267 CPU-seconds the eight threads burned, only ~73% was real work — ~25% was threads spinning at barriers, idle, waiting for the serial main stem to finish. That spin is why a CPU monitor shows 80–90% utilization even though the per-thread payoff is bounded: busy cores are not the same as productive cores.

And the real work is lopsided: channel routing (sd_channel_control3, 698 s) costs about 7.7× the HRU land phase (hru_control, 90 s) — and it is the part that runs serially down the main stem. The land phase we parallelized is only ~12% of the compute, so even perfect HRU scaling can't move the total much. The next real lever is the channel network itself (overlapping its levels, or speeding the per-channel water-quality and sediment math), not more threads.

Acting on this, source-line profiling pinned the single hottest line in the entire engine: the stream-temperature routine (ch_temp) was re-zeroing the entire landscape-unit array, covering every routing unit in the watershed, once per channel, every day, even though each channel only reads its own few units. Resetting only the units a channel actually uses (an output-identical fix) barely changed the serial time but markedly raised the multi-thread ceiling, because the cut was on the serial main stem and shrank the Amdahl serial fraction. This is the methodology at work: profile, remove overhead, re-measure.

Re-profiling the fixed build confirms the shift: ch_temp has dropped out of the hotspots entirely (its hottest line went from ~731 s to ~0.8 s), and no channel routine is dominant anymore. What now dominates is not computation but synchronization — with the heavy serial channel work removed, the worker threads finish the wide land-phase level quickly and then sit idle at the per-level barriers while the single-width main stem is walked. That barrier spin (~25% of CPU time at 8 threads) is the new frontier — the next gains come from smarter scheduling (task-dataflow over the routing network), not from shaving more arithmetic.

Correctness: byte-identical to the serial engine, in both parallel modes

How we check: thread-count invariance — exact byte-level diffs of every variable of every daily output file, between a parallel run and a serial run of the same binary, and between two independent parallel runs. ‘Byte-identical’ means every value matched to the last digit at a 1e-12 relative tolerance, not ‘close enough’.

Both parallel modes are byte-identical to serial. The HRU-parallel mode and the full wavefront — which runs the channel-routing DAG in parallel as well — each reproduce the serial engine bit-for-bit across every variable of every output file, including in-stream water quality. Two independent parallel runs are also identical to each other.
Verified across the fleet: five basins from 6,972 to 28,559 HRUs, two simulated years with in-stream water quality active, eight threads. 8,144,362,709 values compared per mode against a serial reference. Zero differed.
The engine also passes the pre-production ship gate: byte-identity at 1e-12 on both benchmark models, and no regression against the previous engine.
This was not free, and it was not achieved by making the parallel engine approximate. It was achieved by finding and fixing what actually made the two runs disagree — see below.

What was really wrong: order dependence, not a race

For a long time we described the remaining disagreement as a data race in the in-stream particulate constituents, then — after a race detector came back clean on the fixtures we were running — as benign floating-point reassociation. Both diagnoses were wrong, and the second was the more expensive, because it concluded that exactness was unattainable rather than that we had not found the cause yet.

What we found instead were five mechanisms carrying state from one simulated object to the next, in two classes that a single label had hidden. Three are order dependencies: the answer depends on the sequence in which HRUs and channels happen to be visited, with no concurrent conflicting access anywhere, so they act in a single-threaded run too. One of those was ours — the wavefront assigned execution levels from the channel-connectivity file alone, so routing units were scheduled alongside the very HRUs whose daily hydrographs they consume, because that coupling travels through a membership table rather than a flow connection.

The other two are genuine data races, and they are why the earlier 'clean detector' reading did not hold. One let concurrent HRUs overwrite each other's current management operation, so an HRU could apply another HRU's fertilizer amount on the wrong date; the other let two wetland HRUs clobber each other's reservoir-release decision. Three of the five share one cause — a Fortran rule that quietly gives an initialised local variable a single static copy shared by every thread. Whether that surfaces as an order dependency or as a race depends only on whether the value is read inside the same call that wrote it.

A race detector failed against all five, in three different ways, without once malfunctioning. It cannot see ordering, so the three order dependencies were invisible to it by construction. For one of the races our test models never executed the statement involved — the schedules were empty — so there was nothing to observe. And against the last it did report real races, but the locations it named were compiler-generated temporaries, which supported a confident and wrong explanation that stood for five weeks. A clean report is bounded by what the instrumented run actually executes; a detailed report can still point away from the defect.

Two of the order dependencies are defects in SWAT+ itself, present in the serial engine and inherited unchanged from upstream. In the in-stream water-quality routine the algal growth rate is read on every channel but written only below a concentration cap, so at the cap a channel silently reused the previous channel's growth rate. And the daily-reset routine declares the sediment enrichment ratio as a local variable that shadows the real one, so the reset has never taken effect and each HRU inherited its predecessor's value, which is what moved channel CBOD and dissolved oxygen.

Both of those change results in serial. They are invisible there because a serial run is perfectly repeatable — it is simply ordered by accident. That is the general lesson: requiring a parallel run to match a serial one bit-for-bit is a stronger correctness test than race detection, because it catches order dependence, which no race detector reports. For a model built as a sequential loop over objects, the serial run is not automatically a reference.
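The pattern is easy to reproduce in miniature. The toy loop below mimics the algal-growth defect described above: a value is read for every object but written only below a cap, so an object at the cap reports whatever the previously visited object left behind. The function, the rate formula and the numbers are invented for illustration and are not the SWAT+ equations.

```python
def growth_rates(concentrations, cap=10.0):
    """Toy loop with stale state: `rate` is written only below the cap."""
    rate = 0.0  # lives across iterations, like module-level state
    out = {}
    for name, conc in concentrations:
        if conc < cap:
            rate = 1.0 - conc / cap  # written only below the cap
        out[name] = rate             # read unconditionally
    return out

def growth_rates_fixed(concentrations, cap=10.0):
    """Same loop with the value defined on every path."""
    out = {}
    for name, conc in concentrations:
        out[name] = 1.0 - conc / cap if conc < cap else 0.0
    return out

forward = [("ch1", 2.0), ("ch2", 12.0), ("ch3", 5.0)]
reverse = list(reversed(forward))

a, b = growth_rates(forward), growth_rates(reverse)
assert a["ch2"] == a["ch1"] and b["ch2"] == b["ch3"]  # ch2 inherits a neighbour
assert a["ch2"] != b["ch2"]                           # answer depends on order
assert growth_rates_fixed(forward) == growth_rates_fixed(reverse)
```

Every run of either ordering is perfectly repeatable and single-threaded, which is why nothing flags it until a second, differently ordered run is required to match.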

Enforcing the correct ordering costs 0.05% of the available eight-thread parallelism — the widest wave narrows from 12,700 objects to 11,323 and the schedule gains one level. Correctness and speed were never in tension here. The routing-serial mode remains available as a fallback and an independent cross-check, but it is no longer the price of reproducibility.

What we found, and fixed

Thread-count-invariance testing surfaced two distinct classes of defect. The first was genuine shared scratch state, which a per-thread copy fixes. The second — the one that took longest and mattered most — was not a race at all.

HRU land phase — shared scratch

iwst, the current weather-station index, was shared: set per HRU then read two lines later as the weather source. Concurrent HRUs clobbered it in between, so a subset of HRUs read the wrong station — a ~0.2% drift in potential ET that rippled downstream. Privatizing it removed the largest source of non-determinism.

Plant & residue cycling — shared scratch

A cluster of ‘temporary storage’ scalars for residue decomposition, plant uptake, senescence, harvest and grazing (decomp, rsd_meta, pl_mass_up, leaf_drop, …) raced and perturbed biomass, which feeds soil evaporation.

Channel routing & sediment — shared scratch

Per-channel routing scratch (rttime, ben_area, rchdep, the rating-curve and sediment-budget buffers) and per-substep routing arrays were shared across channels running on the same level; each is now threadprivate or allocated per thread.

A missing dependency edge — ours

The wavefront built execution levels from the channel-connectivity files alone. Routing units declare no inflow there, so they landed on the first level beside the HRUs they consume — a coupling that travels through a membership table, not a flow connection — and read hydrographs that were a day stale. Stock SWAT+ enforces this ordering and says why; our rewrite of the level assignment dropped it. The schedule now derives the HRU-to-routing-unit edges directly from the membership tables.

Order dependence in SWAT+ itself — not a race, and not ours

Two defects made results depend on the order objects are visited, in serial as much as in parallel. The in-stream water-quality routine reads the algal growth rate unconditionally but writes it only below a concentration cap, so at the cap a channel reused the previous channel’s value. And the daily-reset routine declares the sediment enrichment ratio as a local that shadows the module variable, so the reset never took effect and each HRU inherited its predecessor’s — which is what moved channel CBOD and dissolved oxygen. Both are present unchanged in upstream SWAT+; both are fixed here, and reported upstream.

A footnote: a pre-existing upstream bug, surfaced not introduced

Our bounds-checked debug build kept crashing in the stream-temperature routine. The cause turned out to be upstream, not ours: a channel whose first inflow is another channel reads array element hin_d(0), but the array is allocated from index 1. The production binary, built without bounds checking, silently reads the adjacent memory and continues — which is exactly why the water-temperature column is non-deterministic even in the stock engine.

Because stream temperature is output-only in this configuration (it feeds nothing in the water/sediment/nutrient balance), we left the behavior unchanged to match production, and flagged the bounds read for a separate upstream fix.

What remains

Cut the barrier spin — now the dominant cost. With the channel-compute bottleneck removed, the wave's per-level barriers leave threads idle on the narrow deep levels. Schedule the routing network as a task-dataflow so a downstream channel starts the moment its own upstreams finish.
Mop up the smaller remaining compute: per-step string name-matching (replace obtyp comparisons with integer codes) and threadprivate-access overhead.
Finish the sweep for order dependence. A structured def-use pass over the engine found further variables that are read on a path where they may never have been written, in both SWAT+ and our fork. None of them breaks byte-identity in the models we run — they are inert here, or they fail identically in serial and parallel — but each is the same class of defect as the two we fixed, and each makes some model's answer depend on object numbering.
Re-measure the scaling curves on a dedicated AWS instance against the current binary. The correctness work costs 0.05% of available parallelism by schedule census, but the published wall-clock numbers predate it.
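The first item in this list can be made concrete with a back-of-envelope model. The sketch below is in Python, with invented per-channel costs and hypothetical helper names, and it assumes unlimited threads and zero scheduling overhead. It compares the elapsed time of a level-by-level wavefront, where every level waits for its slowest member, with a task-dataflow schedule, where a channel starts as soon as its own upstreams finish.

```python
def barrier_makespan(cost, upstream):
    """Elapsed time when each level ends with a barrier: sum of the slowest object per level."""
    levels = {}

    def level(obj):
        if obj not in levels:
            levels[obj] = 1 + max((level(u) for u in upstream.get(obj, [])), default=-1)
        return levels[obj]

    slowest = {}
    for obj in cost:
        lv = level(obj)
        slowest[lv] = max(slowest.get(lv, 0.0), cost[obj])
    return sum(slowest.values())

def dataflow_makespan(cost, upstream):
    """Elapsed time when each object starts once its own upstreams finish: the critical path."""
    finish = {}

    def done(obj):
        if obj not in finish:
            start = max((done(u) for u in upstream.get(obj, [])), default=0.0)
            finish[obj] = start + cost[obj]
        return finish[obj]

    return max(done(obj) for obj in cost)

# Hypothetical network: tributary a1 -> a2 and tributary b1 both drain into "main".
cost = {"a1": 1.0, "a2": 4.0, "b1": 4.0, "main": 1.0}
upstream = {"a2": ["a1"], "main": ["a2", "b1"]}

print(barrier_makespan(cost, upstream))   # 9.0: a2 cannot start until b1's level finishes
print(dataflow_makespan(cost, upstream))  # 6.0: a2 starts as soon as a1 is done
```

The gap between the two numbers is time spent waiting at barriers in this toy network; how large it is in a real basin depends on the measured per-channel costs and on the thread count, neither of which this sketch models.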

Conclusion

The SWAT+ engine can be parallelized on a single shared-memory node without changing the science. Both parallel modes — HRUs alone, and the full wavefront that also parallelizes channel routing — are byte-identical to the serial engine across every output variable, including in-stream water quality.
The campaign yields two orthogonal products: a machine-portable serial gain (1.67× on c8a, up to 2.47× on memory-starved cores) measured on a serial-only build, and thread parallelism that scales a single model to 5.33× at 24 cores. Together they give ~7.1× end-to-end versus our single-core baseline build (which already carries the NetCDF backend and print filter, together ~1.1×; against unmodified upstream SWAT+ the figure is larger, near 8×, but we quote the one we measured directly). The serial gain is a property of the engineering, not a free speedup for a one-core run: the binary we ship is compiled with OpenMP, which costs 1.20–1.24× at one thread, so a single-threaded run of a model below roughly 30k HRUs is slower than the baseline; only larger models repay that cost.
Scaling is bounded by physical cores and memory bandwidth, not thread count; the VTune profile shows serial channel routing costs ~7.7× the parallel land phase, so the main-stem routing — not more threads — is the remaining lever.
Reproducibility is no longer a trade-off. Correct ordering costs 0.05% of the available eight-thread parallelism, so the fastest mode is also the exactly reproducible one; the routing-serial mode survives as a cross-check rather than as the price of determinism.
The strongest result is a method, not a speedup. Requiring a parallel run to match a serial one bit-for-bit found two long-standing defects in SWAT+ that make results depend on the order objects are simulated — defects that are present in the serial engine, that no race detector reports, and that a decade of serial use could not surface.

FAQ

Has anyone made SWAT+ run in parallel for both HRUs and streams?
Yes. SWATGenX built a shared-memory OpenMP parallelization of the SWAT+ engine that runs both the HRU land phase and the channel/routing network in parallel — an open fork of swat-model/swatplus at https://github.com/rafiei-vahid/swatplus (branch main). It parallelizes a single simulation on one machine (not just ensembles of independent runs) with a wavefront over the routing directed-acyclic graph: objects that share a dependency level are mutually independent and run concurrently. On a dedicated 32-core AWS c8a node the routing wavefront scales to 5.33× at 24 threads, and about 7.1× end-to-end versus our single-core baseline build. It is byte-identical to the serial engine at every thread count, in both the HRU-parallel mode and the full routing wavefront — every variable of every output file, including in-stream water quality. The one caveat is the toolchain: production-scale builds use the Intel ifx compiler. Full methodology and benchmarks are in an engine-acceleration study in preparation (2026).
Is the parallel engine byte-identical to the serial engine?
Yes, at every thread count and in both parallel modes. The HRU-parallel mode and the full routing wavefront each reproduce the serial engine bit-for-bit across every variable of every output file, including in-stream water quality, and two independent parallel runs are identical to each other. The evidence we lead with is a fixture demonstrated to be capable of failing, because a test that cannot fail certifies nothing: on a 777-HRU model whose scheduled fertilizer, pesticide, tile-drainage and PFAS operations are all verified to execute, eighteen runs — three trials at each of 2, 4 and 8 threads, in both parallel modes — left all 935,130 compared values unchanged, while the uncorrected binary failed the same check on 1.0–2.4% of those values in 5 of 5 trials. A broader five-basin comparison (6,972 to 28,559 HRUs, two simulated years, water quality active) covering 8,144,362,709 values per mode also shows none differing; it is reported as breadth, not as proof, because it exercises the order dependencies but never reaches the race. Reaching exact agreement required removing five mechanisms that carried state between simulated objects: three order dependencies — where the answer depended on visitation order — and two genuine data races. Two of the order dependencies are defects in SWAT+ itself, present in the serial engine and inherited unchanged from upstream.
How much faster is the parallel engine?
Two independent layers. Serial overhead removal needs no threads and changes no results: 1.67× on the c8a reference node, up to 2.47× on more memory-starved CPUs (c8i Intel) and 2.16× on c5a. Those figures are measured on a serial-only build, and they are not what a single-threaded run of the shipped engine delivers: the production binary is compiled with OpenMP, which costs 1.20–1.24× at one thread against a serial-only build of the same core (measured twice on c8a). That tax is close to fixed, so it is repaid only once a model is large enough — across seven watersheds from 777 to 57,998 HRUs, one-thread runs of the shipped engine came in at 0.70–0.92× of stock on the six smaller basins and 1.24× on the largest. Use more than one thread and every one of them is far ahead; the serial ladder is a statement about the engineering, not a free speedup on a one-core run. On top of that, the routing wavefront scales a single model to 5.33× at 24 threads on a dedicated 32-core AWS c8a node (3.95× at 8). Composed against our single-core baseline build on the same machine, a one-year run of the 57,998-HRU Peace River model drops from 191.9 s to 26.9 s — about 7.1× end-to-end. That baseline already carries the NetCDF backend and the channel_sd print filter, together worth about 1.1×, so against unmodified upstream SWAT+ the figure is larger, near 8×; we quote 7.1× because it is the one we measured directly, and a measurement is a stronger claim than a composition. That fastest configuration is the full routing wavefront, and it is byte-identical to the serial engine — reproducibility is not traded away for it. Enforcing the correct execution order costs 0.05% of the available eight-thread parallelism. All timings are on quiet, dedicated AWS instances, thread-pinned, best of two; we report no shared- or contended-host numbers.
How do I turn parallel routing on or off?
Thread count is set with OMP_NUM_THREADS; at one thread the engine runs the original serial path. Routing parallelism is controlled by SWATPLUS_ROUTING_SERIAL: =1 keeps channel routing in command order (HRUs still parallel), =0 enables the full routing wavefront, which is the faster of the two. Both are byte-identical to the serial engine, so the choice is purely about speed — routing-serial is kept as a fallback and an independent cross-check.
What compiler does the parallel engine need?
The Intel Fortran compiler (ifx, -O3 -ipo; OpenMP via -fiopenmp). At regional hyper-resolution scale (tens of thousands of HRUs) gfortran is not reliable, so ifx is used for production-scale builds. The parallelization itself is standard OpenMP, and there is no outstanding correctness item: both parallel modes are byte-identical to the serial engine. An earlier diagnosis blaming ifx for placing derived-type temporaries in shared static storage was investigated and disproved. The real causes were Fortran's implicit-SAVE rule and two order dependencies in SWAT+ itself, none of which is compiler-specific.
How was correctness verified?
By thread-count invariance: repeated runs at different thread counts, and 1-thread versus the stock serial engine, must produce identical output to the last digit. A run-to-run difference at a fixed thread count is a data race, which pinpoints the shared state to fix. A parallel-versus-serial difference that is perfectly reproducible is something else: order dependence, where state leaks between simulated objects so the answer depends on the sequence they were visited in. That is not a race, and not something a race detector reports. Insisting on bit-for-bit equality surfaced both kinds. It found a shared weather-station index that caused a ~0.2% potential-ET drift, and in the end five mechanisms carrying state between simulated objects: three order dependencies and two genuine data races. Two of the order dependencies are defects in SWAT+ itself that a streamflow-only check, and ThreadSanitizer alone, would both have missed.
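The triage rule in that last answer can be sketched as a small script. This is an illustration in Python, not the project's test harness: the function names and directory layout are hypothetical, and hashing whole files is a stricter stand-in for a value-by-value comparison. It assumes the output files carry no run-specific header such as a timestamp; if they do, that line would have to be stripped before hashing.

```python
import hashlib
from pathlib import Path

def digest_outputs(run_dir):
    """Map each output file's relative path to the SHA-256 of its bytes."""
    root = Path(run_dir)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }

def differing_files(reference_dir, candidate_dir):
    """Files whose bytes differ, or that exist in only one of the two runs."""
    ref = digest_outputs(reference_dir)
    cand = digest_outputs(candidate_dir)
    return sorted(name for name in ref.keys() | cand.keys()
                  if ref.get(name) != cand.get(name))

def classify(serial_dir, parallel_run_a, parallel_run_b):
    """Apply the triage rule to one serial run and two parallel runs at the same thread count."""
    if differing_files(parallel_run_a, parallel_run_b):
        return "run-to-run difference: suspect a data race"
    if differing_files(serial_dir, parallel_run_a):
        return "reproducible difference: suspect order dependence"
    return "identical"
```

Two matching parallel runs do not prove the absence of a race, since a race may simply not have fired in either run; repeating the runs across several thread counts narrows that gap but does not close it.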

Last updated 2026-07-13.
