SWAT+ runtime benchmark — execution time & output optimization | SWATGenX
Technical note — not a product overview.
Methods × model scale (473 → 11k → 94k HRUs): what you can change, what is fixed, and measured CAN/CANNOT conclusions.
This page reports SWAT+ single-run wall time and output configuration—not the particle swarm calibration algorithm. Streamflow calibration is one consumer of the R1 print profile (filtered daily channel_sd NetCDF); parallel PSO particles and init amortization are workflow levers documented as a footnote.
Metrics are exported from a standardized runtime battery (May 2026): R0 compute-only, R1 production filtered NC, R2 fair NC vs TXT format compare, S0 HRU structure analysis, plus archived toolchain, I/O stress, and reverted fork experiments on the large tier.
Executive summary
90-day matrix · Intel ifx -O3 -ipo
Standardized runtime battery: output methods × model scale (473 → 11k → 94k HRUs). Streamflow calibration consumes the R1 print profile; PSO parallelism and init amortization are workflow levers, not Fortran toggles.
You can:
- Use NetCDF instead of formatted TXT when writing all daily streams (~3–4× faster on the I/O-dominated 20 d stress test; no gain on filtered production output).
- Use Intel ifx with -O3 -ipo (~13% faster than -O2 on the medium tier; routed-flow NC is bit-identical).
- Set swift_out=0 for streamflow-focused runs (~8% faster on the medium tier, 1 yr calibration NC).
- Expect wall time to scale roughly with HRU count in the daily land-phase loop (compare R0/R1 across tiers).
- Run independent forward models in parallel when optimizing parameters (workflow lever, not a Fortran toggle).
You cannot:
- Materially speed up large filtered-output runs by choosing NC vs TXT (the difference is ~±2% at 94k HRUs).
- Gain wall time from wq_cha gating or command-loop dispatch patches (measured null; reverted on L tier).
- Remove minimum-area HRUs post-build and expect large savings: on the L tier they are only ~6% of the HRU count and ~0.08% of the area.
- Skip parity-safe physics on a fixed TxtInOut and still trust streamflow outputs.
Fixed costs and caveats:
- Output scope: the R0→R1 delta is the filtered channel_sd staging cost, not calibration algorithm overhead.
- Init cost (~30% of 90 d wall on L): paid on every forward run; short windows amplify the init fraction.
- Discretization: set coarser HRU rules at model build (QSWAT+/generation), not through post-hoc Fortran flags.
- Invalid compares: legacy showcase TXT (yearly channel_sd only) vs filtered daily NC is not a format test.
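The init-cost note can be made concrete with a small sketch. It assumes a simple linear model (wall time = fixed init cost + constant cost per simulated day), which is an illustrative assumption and not a measured decomposition. The function name `init_fraction` and the example windows are hypothetical; only the ~30% at 90 d reference point comes from this page.

```python
def init_fraction(window_days, ref_days=90.0, ref_init_fraction=0.30):
    """Estimated share of wall time spent in init for a given window.

    Assumes wall = init + per_day * days, calibrated so that init is
    `ref_init_fraction` of the wall at `ref_days` (about 30% at 90 d on tier L).
    """
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    # Work in units where the reference wall time equals 1.0.
    init = ref_init_fraction
    per_day = (1.0 - ref_init_fraction) / ref_days
    return init / (init + per_day * window_days)


if __name__ == "__main__":
    for days in (20, 90, 365):
        print(f"{days:>4} d window: init ~{init_fraction(days):.0%} of wall")
```

Under this assumption the init share grows as the window shrinks, which is why short calibration windows make init cost look larger.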
Adopted runtime stack
Production
- SWAT+ fork: a6b1a2a
- Compiler: Intel ifx release_o3_ipo (Release, no -g)
- Output profile: filtered channel_sd daily NetCDF + print_filter
- codes.bsn: swift_out=0, wq_cha=0
Scale ladder (R0 vs R1)
Physics floor (R0) vs production filtered output (R1). Gap is output staging cost; slope shows HRU scaling.
Chart series: R0 compute only; R1 filtered channel_sd NC.
Methods × tiers
| Method (90 d window) | Tier S | Tier M | Tier L |
|---|---|---|---|
| R0 — compute only (no channel_sd write) | 0.010 | 0.21 | 2.75 |
| R1 — filtered channel_sd daily NC (production profile) | 0.010 | 0.27 | 3.21 |
| R2 — filtered channel_sd NC vs TXT (fair format compare) | — | NC 0.28 · TXT 0.28 | NC 3.27 · TXT 3.24 |
Cells show seconds per simulated day. R2 on tier L: NC and TXT differ by about ±2% (within noise). The invalid X0 comparison (legacy showcase print) is excluded from this matrix.
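As a rough sketch of how to read the matrix, the snippet below derives the relative R0→R1 staging overhead per tier and compares HRU growth with R0 time growth between tiers. The values are copied from the tables on this page (HRU counts from the S0 table); the function names are illustrative, and the tier S cells are too coarsely rounded to resolve any overhead.

```python
# Seconds per simulated day, copied from the Methods x tiers table (90 d window).
HRUS = {"S": 473, "M": 11_284, "L": 94_303}
R0 = {"S": 0.010, "M": 0.21, "L": 2.75}  # compute only
R1 = {"S": 0.010, "M": 0.27, "L": 3.21}  # filtered channel_sd daily NC


def staging_overhead(tier):
    """Relative cost of filtered output staging: (R1 - R0) / R0."""
    return (R1[tier] - R0[tier]) / R0[tier]


def scaling_ratio(small, large, table=R0):
    """Return (HRU ratio, time ratio) between two tiers."""
    return HRUS[large] / HRUS[small], table[large] / table[small]


if __name__ == "__main__":
    for tier in ("S", "M", "L"):
        print(f"tier {tier}: R0->R1 overhead {staging_overhead(tier):+.1%}")
    for small, large in (("S", "M"), ("M", "L")):
        hru_ratio, time_ratio = scaling_ratio(small, large)
        print(f"{small}->{large}: HRUs x{hru_ratio:.1f}, R0 time x{time_ratio:.1f}")
```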
HRU structure (S0)
Minimum-area HRU count vs area fraction. "Max linear save" is the upper bound on savings if run time fell linearly with the number of HRUs removed; it is not a promise.
| Tier | Model | HRUs | Min-area HRUs | Count % | Area % | Max linear save |
|---|---|---|---|---|---|---|
| S | 03080102 | 473 | 20 | 4.23% | 0.034% | 4.2% |
| M | 09471300 | 11,284 | 951 | 8.43% | 0.148% | 8.4% |
| L | 03100101 | 94,303 | 5,484 | 5.82% | 0.083% | 5.8% |
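The "Count %" and "Max linear save" columns follow directly from the HRU counts. A minimal sketch, assuming run time is exactly linear in HRU count (the same upper-bound assumption as the table); the function name is illustrative:

```python
# (total HRUs, minimum-area HRUs) per tier, copied from the S0 table.
S0 = {"S": (473, 20), "M": (11_284, 951), "L": (94_303, 5_484)}


def max_linear_save(total_hrus, min_area_hrus):
    """Upper-bound fractional saving if cost were exactly linear in HRU count."""
    if total_hrus <= 0:
        raise ValueError("total_hrus must be positive")
    return min_area_hrus / total_hrus


if __name__ == "__main__":
    for tier, (total, min_area) in S0.items():
        share = max_linear_save(total, min_area)
        print(f"tier {tier}: {min_area}/{total} HRUs = {share:.2%} upper bound")
```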
Toolchain (T1 / T2)
Medium tier, 20-day calibration NC window — different from 90-day matrix above.
gfortran vs Intel ifx (M, 20 d cal NC)
Production uses Intel ifx.
ifx Release flag matrix (M, 20 d)
Production uses release_o3_ipo; its NC output is bit-identical to the -O2 reference.
I/O stress (IO)
Reverted fork experiments (L tier)
| Experiment | Baseline | Patched | Δ | Verdict |
|---|---|---|---|---|
| Gate channel WQ on wq_cha | 428.7 s | 429.8 s | +0.3% | reverted |
| command.f90 integer dispatch | 428.7 s | 439.9 s | +2.6% | reverted |
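The Δ column is the relative change in wall time against the baseline. The sketch below reproduces it and applies a hypothetical keep/revert rule: the 2% noise threshold is an assumption borrowed from the R2 noise remark, not a documented acceptance criterion.

```python
def relative_delta(baseline_s, patched_s):
    """Relative wall-time change; positive means the patch is slower."""
    return (patched_s - baseline_s) / baseline_s


def verdict(delta, noise=0.02):
    """Hypothetical rule: keep a patch only if it is faster by more than noise."""
    return "keep" if delta < -noise else "revert"


if __name__ == "__main__":
    experiments = {
        "Gate channel WQ on wq_cha": (428.7, 429.8),
        "command.f90 integer dispatch": (428.7, 439.9),
    }
    for name, (baseline, patched) in experiments.items():
        delta = relative_delta(baseline, patched)
        print(f"{name}: {delta:+.1%} -> {verdict(delta)}")
```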
VTune hotspots (L tier R1, 90 d)
Daily land-phase and channel routing dominate; I/O is a small fraction on filtered output.
Workflow note (calibration context)
Particle swarm calibration runs many independent forward models. Init cost (~30% of a 90-day wall on large models) is paid per particle; parallel particles amortize study wall time better than chasing single-run micro-optimizations. See calibration methods for the optimizer stack — this page covers execution-time and output configuration only.
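A minimal sketch of running independent forward models side by side, using only the Python standard library. It is not the calibration stack used here: the function names and the job layout (one command plus one working directory per particle) are assumptions, and the demo uses a stand-in command in place of the SWAT+ executable.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_forward_model(cmd, workdir=None):
    """Run one forward model as a subprocess and return its exit code."""
    completed = subprocess.run(cmd, cwd=workdir, capture_output=True, text=True)
    return completed.returncode


def run_particles(jobs, max_workers=4):
    """Run independent (cmd, workdir) jobs concurrently; results keep job order."""
    # Threads are enough here: each worker only waits on an external process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_forward_model, cmd, wd) for cmd, wd in jobs]
        return [future.result() for future in futures]


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere. In practice each job would
    # launch the model executable inside its own copy of TxtInOut (assumption).
    stand_in = [sys.executable, "-c", "print('forward run done')"]
    jobs = [(stand_in, None) for _ in range(4)]
    print(run_particles(jobs))
```

Each job still pays its own init cost; running jobs concurrently shortens the study wall time without changing any single run.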
Last updated 2026-05-30.