SWAT+ runtime benchmark — execution time & output optimization | SWATGenX

Technical note — not a product overview.

Methods × model scale (473 → 11k → 94k HRUs): what you can change, what is fixed, and measured CAN/CANNOT conclusions.

This page reports SWAT+ single-run wall time and output configuration, not the particle swarm calibration algorithm. Streamflow calibration is one consumer of the R1 print profile (filtered daily channel_sd NetCDF); parallel PSO particles and init amortization are workflow levers, covered in the workflow note at the end of this page.

Metrics are exported from a standardized runtime battery (May 2026): R0 compute-only, R1 production filtered NC, R2 fair NC vs TXT format compare, S0 HRU structure analysis, plus archived toolchain, I/O stress, and reverted fork experiments on the large tier.

Executive summary

90-day matrix · Intel ifx -O3 -ipo

Standardized runtime battery: output methods × model scale (473 → 11k → 94k HRUs). Streamflow calibration consumes the R1 print profile; PSO parallelism and init amortization are workflow levers—not Fortran toggles.

You can (CAN)
  • Use NetCDF instead of formatted TXT when writing all daily streams (~3–4× faster on the I/O-dominated 20-day stress test; no gain on filtered production output).
  • Use Intel ifx with -O3 -ipo (~13% faster than -O2 on the medium tier; routed-flow NC is bit-identical).
  • Set swift_out=0 for streamflow-focused runs (~8% faster on the medium tier, 1-year calibration NC).
  • Expect wall time to scale roughly with HRU count in the daily land-phase loop (compare R0/R1 across tiers).
  • Run independent forward models in parallel when optimizing parameters (workflow lever, not a Fortran toggle).
You cannot (CANNOT)
  • Materially speed large filtered-output runs by choosing NC vs TXT (~±2% at 94k HRU).
  • Gain wall time from wq_cha gating or command-loop dispatch patches (measured null; reverted on L tier).
  • Remove minimum-area HRUs post-build and expect large savings when they are ~6% of count but ~0.08% of area (L tier structure).
  • Skip parity-safe physics on a fixed TxtInOut and still trust streamflow outputs.
Depends on (DEPENDS)
  • Output scope: R0→R1 delta is filtered channel_sd staging cost, not calibration algorithm overhead.
  • Init cost (~30% of 90 d wall on L): paid every forward run—short windows amplify init fraction.
  • Discretization: coarser HRU rules at model build (QSWAT+/generation), not post-hoc Fortran flags.
  • Invalid compares: legacy showcase TXT (yearly channel_sd only) vs filtered daily NC are not format tests.
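
The init-cost item above can be made concrete with a small calculation. This is a minimal sketch, assuming that init cost is a fixed constant and that the rest of the wall time grows linearly with simulated days; both are simplifying assumptions, not measured results. The function name `init_fraction` is hypothetical, and the only number taken from this page is the ~30% init share of a 90-day wall on the L tier.

```python
def init_fraction(window_days, ref_fraction=0.30, ref_days=90):
    """Estimated share of wall time spent in init for a given window.

    Assumptions (not from the benchmark): init time is constant, and the
    daily-loop time is proportional to the number of simulated days.
    Times are expressed in units of the reference (90-day) wall time.
    """
    init = ref_fraction                          # fixed init cost
    per_day = (1.0 - ref_fraction) / ref_days    # daily-loop cost per simulated day
    return init / (init + per_day * window_days)


for days in (20, 90, 365):
    print(f"{days:>3} d window: init is about {init_fraction(days):.0%} of wall time")
```

Under these assumptions a 20-day window spends well over half of its wall time in init, while a 1-year window spends under a tenth, which is why short calibration windows amplify the init fraction.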

Adopted runtime stack

Production

  • SWAT+ fork: a6b1a2a
  • Compiler: Intel ifx release_o3_ipo (Release, no -g)
  • Output profile: filtered channel_sd daily NetCDF + print_filter
  • codes.bsn: swift_out=0, wq_cha=0

Scale ladder (R0 vs R1)

Physics floor (R0) vs production filtered output (R1). Gap is output staging cost; slope shows HRU scaling.

  • R0 compute only
  • R1 filtered channel_sd NC
[Figure: seconds per simulated day (axis 0 to 3.4) against HRU count (log scale, 473 to 94,303) for the R0 and R1 series.]

Methods × tiers

| Method (90 d window) | Tier S | Tier M | Tier L |
|---|---|---|---|
| R0: compute only (no channel_sd write) | 0.010 | 0.21 | 2.75 |
| R1: filtered channel_sd daily NC (production profile) | 0.010 | 0.27 | 3.21 |
| R2: filtered channel_sd NC vs TXT (fair format compare) | | NC 0.28 · TXT 0.28 | NC 3.27 · TXT 3.24 |

Cells show seconds per simulated day. R2 on tier L: NC vs TXT differ ~±2% (within noise). Invalid X0 (legacy showcase print) is excluded from this matrix.
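
The R0 and R1 rows can be turned into two derived numbers: the relative output staging overhead per tier, and a rough log-log scaling exponent between tiers. The sketch below uses only the seconds-per-simulated-day values from the matrix and the HRU counts from the S0 table; the helper names `staging_overhead` and `scaling_exponent` are hypothetical. The tier S value (0.010) carries only two significant digits, so anything derived from it is coarse.

```python
import math

# HRU counts per tier (from the HRU structure table).
hrus = {"S": 473, "M": 11_284, "L": 94_303}

# Seconds per simulated day, 90-day window (from the methods x tiers matrix).
r0 = {"S": 0.010, "M": 0.21, "L": 2.75}   # compute only
r1 = {"S": 0.010, "M": 0.27, "L": 3.21}   # filtered channel_sd daily NC


def staging_overhead(tier):
    """Relative cost of filtered output staging: (R1 - R0) / R0."""
    return (r1[tier] - r0[tier]) / r0[tier]


def scaling_exponent(a, b, table=r0):
    """Slope of log(time) against log(HRU count) between tiers a and b."""
    return math.log(table[b] / table[a]) / math.log(hrus[b] / hrus[a])


for tier in ("M", "L"):
    print(f"Tier {tier}: R0 to R1 overhead = {staging_overhead(tier):.1%}")

for a, b in (("S", "M"), ("M", "L")):
    print(f"{a} to {b}: R0 scaling exponent = {scaling_exponent(a, b):.2f}")
```

An exponent near 1 means wall time grows roughly in proportion to HRU count; a value above 1 between M and L would indicate somewhat worse-than-linear growth over that range, subject to measurement noise.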

HRU structure (S0)

Minimum-area HRU count vs area fraction — upper bound if removed linearly by count (not a promise).

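| Tier | Model | HRUs | Min-area HRUs | Count % | Area % | Max linear save |
|---|---|---|---|---|---|---|
| S | 03080102 | 473 | 20 | 4.23% | 0.034% | 4.2% |
| M | 09471300 | 11,284 | 951 | 8.43% | 0.148% | 8.4% |
| L | 03100101 | 94,303 | 5,484 | 5.82% | 0.083% | 5.8% |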
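
The "Max linear save" column is the count share of minimum-area HRUs: if per-HRU cost were uniform, removing them could save at most that fraction of the land-phase loop. A minimal sketch that recomputes the column from the table's counts (the dictionary layout and the function name `count_share_pct` are illustrative only):

```python
# (total HRUs, minimum-area HRUs, area share in percent), from the S0 table.
tiers = {
    "S": (473, 20, 0.034),
    "M": (11_284, 951, 0.148),
    "L": (94_303, 5_484, 0.083),
}


def count_share_pct(tier):
    """Minimum-area HRUs as a percentage of all HRUs in the tier."""
    total, min_area, _area_pct = tiers[tier]
    return 100.0 * min_area / total


for tier, (total, min_area, area_pct) in tiers.items():
    share = count_share_pct(tier)
    print(f"Tier {tier}: {min_area}/{total} HRUs = {share:.2f}% of count, "
          f"{area_pct}% of area, upper-bound saving about {share:.1f}%")
```

The bound assumes every HRU costs the same per simulated day; it says nothing about how the removed area would be redistributed, which is a model-build decision.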

Toolchain (T1 / T2)

Medium tier, 20-day calibration NC window — different from 90-day matrix above.

gfortran vs Intel ifx (M, 20 d cal NC)

Production uses Intel ifx.

[Bar chart: gfortran vs ifx, axis 0 to 1; bar values are not preserved in this text version.]

ifx Release flag matrix (M, 20 d)

Production uses release_o3_ipo; bit-identical NC vs -O2 ref.

I/O stress (IO)

Reverted fork experiments (L tier)

| Experiment | Baseline | Patched | Δ | Verdict |
|---|---|---|---|---|
| Gate channel WQ on wq_cha | 428.7 s | 429.8 s | +0.3% | reverted |
| command.f90 integer dispatch | 428.7 s | 439.9 s | +2.6% | reverted |

VTune hotspots (L tier R1, 90 d)

| Symbol | CPU share | CPU time |
|---|---|---|
| for_cpstr_eq | 18.8% | 81.9 s |
| __intel_avx_rep_memset | 17.4% | 75.8 s |
| hru_control_ | 15.8% | 68.8 s |
| sd_channel_control3_ | 11.5% | 50.0 s |
| proc_hru_ | 5.4% | 23.6 s |

Daily land-phase and channel routing dominate; I/O is a small fraction on filtered output.
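
As a consistency check on the hotspot list, each row implies a total CPU time (CPU seconds divided by CPU share), and the rows should agree with one another. The sketch below uses only the five rows listed above; the variable names are illustrative.

```python
# (symbol, CPU share in percent, CPU seconds) from the VTune hotspot list.
hotspots = [
    ("for_cpstr_eq", 18.8, 81.9),
    ("__intel_avx_rep_memset", 17.4, 75.8),
    ("hru_control_", 15.8, 68.8),
    ("sd_channel_control3_", 11.5, 50.0),
    ("proc_hru_", 5.4, 23.6),
]

# Total CPU time implied by each row: seconds / (share / 100).
implied_totals = [secs / (pct / 100.0) for _name, pct, secs in hotspots]

top5_share = sum(pct for _name, pct, _secs in hotspots)
top5_seconds = sum(secs for _name, _pct, secs in hotspots)

print(f"Implied total CPU time: {min(implied_totals):.0f} to {max(implied_totals):.0f} s")
print(f"Top five symbols: {top5_share:.1f}% of CPU, {top5_seconds:.1f} s")
```

The implied totals cluster in a narrow band, so the percentages and CPU seconds are mutually consistent; the five listed symbols account for roughly two thirds of the profiled CPU time.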

Workflow note (calibration context)

Particle swarm calibration runs many independent forward models. Init cost (~30% of a 90-day wall on large models) is paid per particle; parallel particles amortize study wall time better than chasing single-run micro-optimizations. See calibration methods for the optimizer stack — this page covers execution-time and output configuration only.
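
Running particles concurrently can be sketched as follows. This is an illustration under stated assumptions, not the SWATGenX calibration code: `run_forward_model` is a hypothetical stand-in that returns a toy score, where a real version would write parameters into a private copy of TxtInOut, launch the SWAT+ executable, and score the filtered channel_sd output. Threads are used on the assumption that each real run is an external process, so the Python side only waits.

```python
from concurrent.futures import ThreadPoolExecutor


def run_forward_model(params):
    """Hypothetical stand-in for one SWAT+ forward run.

    Returns a toy objective (squared distance from 0.5 in each parameter)
    so the example runs without a SWAT+ installation.
    """
    return sum((p - 0.5) ** 2 for p in params)


def evaluate_swarm(particles, max_workers=4):
    """Evaluate all particles of one PSO iteration concurrently.

    Scores are returned in the same order as `particles`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_forward_model, particles))


particles = [(0.1, 0.9), (0.5, 0.5), (0.4, 0.7)]
scores = evaluate_swarm(particles)
best = particles[scores.index(min(scores))]
print("best particle:", best)
```

Each particle still pays its own init cost; the gain is that the runs overlap in wall time, so an iteration takes about as long as its slowest batch rather than the sum of all runs.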

Last updated 2026-05-30.