SWAT+ runtime benchmark on real watershed models | SWATGenX
Measured one-year runtime, output I/O, HRU scaling, and calibration-time estimates across six basin sizes — with the settings we use in SWATGenX production calibration.
473 · 11k · 20k · 37k · 52k · 94k HRUs — measured on our server
Calibration runtime and output size are two practical limits in SWAT+ modeling. This page is a benchmark study on six real watershed SWAT+ model packages — from 473 to 94,303 HRUs — with measured one-year runs to show where simulation time is spent, which optimizations matter, and how long full calibrations may take on typical servers.
The benchmark keeps basin inputs fixed and changes one factor at a time: full NetCDF versus formatted-text export, filtered calibration output, compiler build, and HRU-count scaling. The figures and tables report timings from our production server. The closing section extrapolates seven-year calibration wall-time ranges from measured init and daily-loop costs.
Read this first
- For streamflow calibration, reduce printed output before tuning file format or compiler flags.
- Use NetCDF for full daily and monthly HRU, stream, basin, and region exports.
- Treat compiler results as host-specific; confirm timings on your own CPU and filesystem.
Fork contributions
SWATGenX uses two SWAT+ engine extensions from our fork: gauge-limited printing through print_filter.prt and daily NetCDF output through cdfout=y. Both have been proposed upstream but are not yet part of the official swat-model/swatplus release.
- print_filter.prt — writes daily channel_sd output only at calibration gauges, using a narrow print.prt and swift_out=0.
- NetCDF (cdfout=y) — writes daily NetCDF-4 files instead of formatted text when full stream output is enabled.
rafiei-vahid/swatplus (feature/netcdf-cdfout) · swat-model/swatplus
Motivation
The question
Calibration is where SWAT+ runtime actually bites. To fit a model to a gauge, SWATGenX runs it hundreds to thousands of times — one full forward simulation per particle, per iteration of the optimizer — so the wall time of a single run, and the disk I/O it spends writing output, decide whether basin-scale calibration takes an afternoon or a week. The question this page answers is concrete: what actually governs a SWAT+ run’s wall time, and what should you compile and print to make calibration tractable without changing the science.
We answer it empirically, on three real SWATGenX-built models spanning small, medium, and large calibration workloads, changing one factor at a time: output format (NetCDF vs text), print scope (full export vs a gauge-limited calibration profile), the compiler build, and how runtime scales with HRU count.
Methods
A controlled, single-variable benchmark
Every run covers the same calendar year on the same model package; within each tier the inputs stay fixed and exactly one factor changes at a time, so a difference in wall time is attributable to that factor and nothing else. All timings are wall-clock seconds on the same host, and the ifx variants are held to a byte-identical output parity rule so a faster build never means a different answer.
- Simulation period: 365 simulated days (calendar year 2021)
- Production binary: Intel ifx release_o3_ipo, from the production SWAT+ fork rafiei-vahid/swatplus @ commit 768f1d1 (rev 61.0.2.61-351-g768f1d1). This was the production binary at the time these benchmarks were run; compiler-matrix rows are the same fork commit built with the listed flags. The current deployed engine (247e95b, the same build plus two output-identical engine fixes; see /swat-plus-engine) runs faster still, so these tier timings are conservative upper bounds.
- Metrics: wall-clock seconds, seconds per simulated day, bytes on disk
- Parity rule: ifx variants match on channel_sd_day.nc (M tier, 20-day check); one-year runs use identical print settings
- Test CPU: AMD EPYC 7282, 10 vCPU KVM guest (5 cores × 2 threads on a 16-core host)
- OS / architecture: Linux x86_64, Ubuntu 22.04 (KVM); 64-bit SWAT+ builds
- Compilers: Intel oneAPI ifx 2026.0.0 and GNU Fortran 11.4.0. The -xHost variant targets this CPU (AVX2).
- Output storage: local ext4 on a QEMU virtio disk (~650 GB root volume). Each run writes to a temporary folder on that same filesystem, so the clock includes synchronous disk I/O for whatever print profile is enabled.
Each run covers one calendar year. Within each tier, model inputs stay fixed while one factor changes at a time.
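The three metrics listed above (wall-clock seconds, seconds per simulated day, bytes on disk) can be collected with a small harness. The following is a minimal sketch and not the SWATGenX harness: the function name `time_run` is hypothetical, and it assumes the engine is started as a subprocess that writes its output into the working directory. The demo command is a stand-in so the sketch runs without a SWAT+ binary.

```python
import subprocess
import sys
import tempfile
import time
from pathlib import Path


def time_run(cmd, workdir, sim_days=365):
    """Run cmd in workdir; return wall seconds, seconds per simulated day, bytes written."""
    workdir = Path(workdir)
    # Remember which files existed before the run (model inputs).
    before = {p for p in workdir.rglob("*") if p.is_file()}
    start = time.perf_counter()
    subprocess.run(cmd, cwd=workdir, check=True)
    wall = time.perf_counter() - start
    # Count only files that did not exist before the run.
    new_files = [p for p in workdir.rglob("*") if p.is_file() and p not in before]
    out_bytes = sum(p.stat().st_size for p in new_files)
    return {"wall_s": wall, "s_per_sim_day": wall / sim_days, "bytes": out_bytes}


if __name__ == "__main__":
    # Stand-in command (assumption): replace with the real engine invocation.
    with tempfile.TemporaryDirectory() as tmp:
        demo_cmd = [sys.executable, "-c", "open('out.txt', 'w').write('x' * 1000)"]
        print(time_run(demo_cmd, tmp))
```

Because the timer wraps the whole subprocess, the measured wall time includes whatever disk writes the chosen print profile triggers, which matches the way the timings on this page are described.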
Three real models: small, medium, large
We selected three SWATGenX-built SWAT+ models to represent small, medium, and large calibration workloads. Each row is a completed model package built from the same national data stack used in production: NHDPlus HR, PRISM climate, 250 m landuse/soil, and 30 m DEM. Use the Model ID link to open the public package page and download the ZIP.
Table 1. Benchmark basins by tier, USGS model ID, basin name, HRU count, and channel count. Model ID links open the SWATGenX package page where you can download the ZIP.
| Tier | Model ID | Basin | HRUs | Channels |
|---|---|---|---|---|
| S | 03080102 | Oklawaha (FL) | 473 | 45 |
| M | 09471300 | Upper San Pedro (AZ) | 11,284 | 350 |
| L | 03100101 | Peace River HUC-8 | 94,303 | 8,181 |
Results and discussion
Four results follow, each isolating one factor. The first two settle what to print — format matters when you export everything, but print scope matters more once a calibration run is limited to the gauges you fit. The third settles what to compile. The fourth checks a tempting shortcut — cutting HRUs — and finds it does not buy proportional speed once routing dominates.
Measured benchmark results on real watershed models
Full export is I/O-bound — NetCDF wins when you print everything
When all daily and monthly outputs are written, NetCDF is the practical default for the S and M tiers (Figure 1, Table 2).
This scenario uses the full-output print profile: all HRU, channel, basin, and region outputs at daily and monthly time steps (benchmark_1yr/print.prt). The only changed setting is cdfout: NetCDF (y) versus formatted text (n). We ran tiers S and M only; tier L was skipped because formatted-text output would be impractically large.
Figure 1. Wall-clock time for one calendar year with full daily and monthly export (tiers S and M). Blue bars: NetCDF; orange bars: formatted text.
Figure 2. Total output written to disk for the same full-export runs, in megabytes per tier.
Table 2. Full-export wall time, NetCDF speed advantage, and on-disk output size for NetCDF vs formatted text.
| Tier | NC wall (s) | TXT wall (s) | NC faster | NC size | TXT size |
|---|---|---|---|---|---|
| S · 03080102 | 21.1 | 26.1 | +19.2% | 125.7 MB | 607.3 MB |
| M · 09471300 | 269.4 | 597.0 | +54.9% | 3067.1 MB | 14573.0 MB |
| L · 03100101 | not run | not run | n/a | n/a | n/a |
Tier L note: we did not run full daily+monthly NC vs TXT on the 94k-HRU tier because TXT output would be impractically large on disk.
Figure 1 and Table 2 show that NetCDF finished faster than formatted text when the model wrote every daily and monthly output: about 19% faster on tier S and 55% faster on tier M. Figure 2 shows the same pattern in disk use, with NetCDF producing about 4.8× less output. For full-export jobs, NetCDF is the better default.
Print scope dominates — filter to your gauges and format is second-order
For calibration-style runs, limiting output to gauge channel_sd changes runtime more than choosing NetCDF or text (Figures 3–4, Table 3).
This scenario uses a calibration-style print profile: daily channel_sd only, limited to gauge channels through print_filter.prt, with swift_out=0 and wq_cha=0 in codes.bsn. The same filter is used for NetCDF (cdfout=y) and formatted text (cdfout=n). For tiers S and M, the charts also show the full-export runs from Scenario 1 for comparison.
Figure 3. Daily streamflow (channel_sd) file size per tier. Short bars: calibration filter at gauge channels; tall bars: full export from scenario 1 (tiers S and M only).
Figure 4. Wall-clock time for filtered NetCDF, filtered text, and full-export baselines where we ran them. Lower bars mean faster runs.
Table 3. Filtered calibration runs: wall time, channel_sd file size, and seconds per simulated day.
| Tier | Filtered NC wall (s) | Filtered TXT wall (s) | NC channel_sd | TXT channel_sd | sec / sim day |
|---|---|---|---|---|---|
| S · 03080102 | 5.4 | 3.8 | 0.5 MB | 0.3 MB | 0.01 |
| M · 09471300 | 99.8 | 89.8 | 0.6 MB | 0.7 MB | 0.26 |
| L · 03100101 | 1,378 | 1,297 | 4.9 MB | 13.8 MB | 3.37 |
Figures 3–4 and Table 3 show that print scope matters more than file format during calibration. Limiting output to daily streamflow at calibration gauges shortened runs far more than switching between NetCDF and text: tier S dropped from 21 s to 5 s, and tier M dropped from 269 s to 100 s. With this narrow output profile, NetCDF and text remained close enough that the practical choice is your downstream toolchain.
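To make the comparison concrete, the sketch below sets the seconds saved by narrowing print scope against the NetCDF-versus-text gap inside the filtered profile. The timings are copied from Tables 2 and 3; the dictionary keys and function names are illustrative only.

```python
# Wall-clock seconds copied from Table 2 (full export) and Table 3 (filtered).
runs = {
    "S": {"full_nc": 21.1, "filtered_nc": 5.4, "filtered_txt": 3.8},
    "M": {"full_nc": 269.4, "filtered_nc": 99.8, "filtered_txt": 89.8},
}


def scope_saving(r):
    """Seconds saved by narrowing print scope while staying on NetCDF."""
    return r["full_nc"] - r["filtered_nc"]


def format_gap(r):
    """Absolute NetCDF-vs-text difference inside the filtered profile."""
    return abs(r["filtered_nc"] - r["filtered_txt"])


for tier, r in runs.items():
    print(f"{tier}: scope saves {scope_saving(r):.1f} s, format differs by {format_gap(r):.1f} s")
# S: scope saves 15.7 s, format differs by 1.6 s
# M: scope saves 169.6 s, format differs by 10.0 s
```

On tier M, narrowing the print scope saves about 170 s per simulated year, while the format choice moves the result by about 10 s.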
Compiler choice — ifx wins, and we ship the portable build
Table 4 ranks compiler builds on our server; Figures 5–7 show the tier-level results. Repeat this test on your own CPU and filesystem before choosing a production binary.
This scenario uses the same filtered daily channel_sd NetCDF profile as Scenario 2. Each table cell reports one calendar-year run for one compiler build and one model tier. These results are specific to the CPU, compiler versions, and disk used in this test.
Reference: ifx -O2. Production: ifx -O3 -ipo (portable). In Table 4 the production build is the fastest on tiers S and M. The CPU-targeted ifx -O3 -xHost build compiles for this exact CPU's instruction set and is not portable, and it did not beat -ipo on any tier in this test. On the 94k-HRU L tier gfortran -O3 did not complete, so it has no L-tier time; gfortran -O2 and every Intel ifx build completed it.
- Test CPU: AMD EPYC 7282, 10 vCPU KVM guest (5 cores × 2 threads on a 16-core host)
- Architecture: Linux x86_64, Ubuntu 22.04 (KVM); 64-bit SWAT+ builds
- Compilers: Intel oneAPI ifx 2026.0.0 and GNU Fortran 11.4.0. The -xHost variant targets this CPU (AVX2).
- Output storage: local ext4 on a QEMU virtio disk (~650 GB root volume). Each run writes to a temporary folder on that same filesystem, so the clock includes synchronous disk I/O for whatever print profile is enabled.
Compiler rankings and absolute run times depend on your CPU, core count, compiler version, and disk speed. Use these numbers as a reference from our server — confirm on your own hardware before you pick a production build.
Table 4. Compiler build matrix: one-year wall time by model tier. Fastest build per tier is marked; medium tier also shows percent change vs ifx -O2.
| Build | Tier S wall (s) | Tier M wall (s) | Tier L wall (s) | vs ifx -O2 (M) |
|---|---|---|---|---|
| gfortran -O2 | 5.5 | 111.6 | 1,269 · fastest | +15.3% |
| gfortran -O3 | 6.5 | 157.0 | — | +62.2% |
| ifx -O2 | 5.6 | 96.8 | 1,393 | 0.0% |
| ifx -O3 | 5.5 | 92.2 | 1,392 | -4.7% |
| ifx -O3 -xHost | 6.0 | 104.7 | 1,364 | +8.2% |
| ifx -O3 -ipo · production | 5.0 · fastest | 89.5 · fastest | 1,353 | -7.5% |
Figure 5. Tier S (473 HRUs): wall time per compiler build on the filtered calibration print from scenario 2. Green bar: production ifx -O3 -ipo.
Figure 6. Tier M (11,284 HRUs): wall time per compiler build on the same filtered print profile. Green bar: production ifx -O3 -ipo.
Figure 7. Tier L (94,303 HRUs): wall time per compiler build. gfortran -O3 did not complete on this tier (Table 4); gfortran -O2 and every Intel ifx build completed it. Green bar: production ifx -O3 -ipo.
Table 4 and Figures 5–7 show that Intel ifx was the best overall choice on this server. The portable ifx -O3 -ipo build, which we ship in production, was the fastest on tiers S and M, and on tier M every ifx build beat both gfortran builds. The CPU-targeted ifx -O3 -xHost build did not beat -ipo on any tier. The large tier was the exception: gfortran -O2 finished about 6% faster than ifx -O3 -ipo. gfortran -O3, however, did not complete on that tier, and most SWATGenX calibration workloads are closer to the medium tier than to the 94k-HRU test basin. For our current server, ifx -O3 -ipo is the best production balance. Repeat the compiler matrix on your own hardware before treating any binary choice as final.
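The "vs ifx -O2 (M)" column in Table 4 is a relative difference against the reference build. As a minimal sketch of that arithmetic, the snippet below recomputes it from the tier M timings in the table; the function name `pct_vs_reference` is illustrative.

```python
# Tier M wall-clock seconds copied from Table 4.
tier_m = {
    "gfortran -O2": 111.6,
    "gfortran -O3": 157.0,
    "ifx -O2": 96.8,
    "ifx -O3": 92.2,
    "ifx -O3 -xHost": 104.7,
    "ifx -O3 -ipo": 89.5,
}


def pct_vs_reference(times, reference="ifx -O2"):
    """Percent change in wall time relative to the reference build (negative = faster)."""
    ref = times[reference]
    return {build: 100.0 * (t - ref) / ref for build, t in times.items()}


deltas = pct_vs_reference(tier_m)
# Print builds from fastest to slowest.
for build, seconds in sorted(tier_m.items(), key=lambda kv: kv[1]):
    print(f"{build:16s} {seconds:7.1f} s  {deltas[build]:+.1f}%")
```

Because the table's timings are already rounded, a recomputed value can differ from the printed column by 0.1 percentage point. Running the same function on your own timings gives a like-for-like ranking for your hardware.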
Where does the time actually go? See the profiling deep-dive
This page answers which build and settings run fastest. A separate study opens the stock SWAT+ engine with Intel VTune and asks where the CPU time goes. On a 94k-HRU basin the two largest costs are not hydrology but string name-matching and array zeroing, and two small, results-identical fixes sped up a 3-month run by 1.58×.
- VTune before/after on the stock engine vs the same code with two fixes.
- Each fix traced to its source pattern and contributed upstream as an independent pull request.
- Byte-identical results; commit-pinned binaries.
Read the SWAT+ performance-profiling deep-dive →
For how these layers stack against the stock distribution (commit-pinned production vs stock), see the SWAT+ production engine page.
Beyond one run: the calibration-level levers
The timed scenarios above measure one SWAT+ forward run at a time. These workflow choices reduce I/O, parallelize the optimizer, and stop when the fit stops improving — savings at the calibration level rather than inside a single benchmark cell.
Reduce I/O during calibration
On our medium-tier benchmark model (09471300, Upper San Pedro (AZ), 11,284 HRUs), the shared PRISM folder holds 400 meteorological grid files totaling about 56 MB. A typical PSO calibration uses 48 particles for 50–75 iterations — 2,400–3,600 forward evaluations. If each evaluation copied those grids into a separate TxtInOut, that would add about 135–203 GB of avoidable disk traffic before simulation output is counted.
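The avoided-traffic estimate is a simple product of grid size, particle count, and iteration count. A minimal sketch of that arithmetic, assuming decimal units (1 GB = 1000 MB) and using the rounded 56 MB folder size; the function name is illustrative:

```python
def avoided_copy_gb(grid_mb, particles, iterations):
    """Return (forward evaluations, GB copied) if every evaluation copied the climate grids."""
    evaluations = particles * iterations
    return evaluations, grid_mb * evaluations / 1000.0


for iterations in (50, 75):
    evals, gb = avoided_copy_gb(grid_mb=56, particles=48, iterations=iterations)
    print(f"{iterations} iterations: {evals} evaluations, about {gb:.0f} GB copied")
```

With the rounded 56 MB input this gives roughly 134 and 202 GB, close to the 135–203 GB range quoted above, which presumably uses the unrounded folder size.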
SWATGenX points pcp_path, tmp_path, slr_path, hmd_path, and wnd_path in file.cio at a shared directory instead. Index files stay in each particle's TxtInOut; the large grid files are read in place. This is already implemented in production and should be part of any calibration I/O plan. Disable with SWATGENX_DISABLE_PCP_PATH=1.
Coarser landuse and soil grids (optional at model build)
Hydrological delineation — stream network, subbasins, and routing — does not depend on landuse or soil grid resolution. Those layers define HRU land-cover and soil assignments after the watershed is built. Choosing a coarser landuse/soil alignment resamples NLCD and gSSURGO to fewer unique combinations and can reduce HRU count and calibration cost, with some compromise in spatial detail of land cover and soil properties. The benchmark models on this page use 250 m landuse/soil with a 30 m DEM; coarser settings are a build-time tradeoff, not a per-run switch.
Use an optimization algorithm with parallel processing
SWATGenX uses particle swarm optimization (PSO), a population-based swarm intelligence method inspired by how birds or fish move as a group toward food. The optimizer keeps a swarm of particles; each particle holds one candidate parameter set and is evaluated with a full SWAT+ forward run. Each iteration updates every particle through social search rules — tracking its own best score and the best score in the swarm — so the population moves toward better calibrations without exploring every combination in the parameter space. When the calibration problem is well posed, fewer forward runs are needed to converge; that depends on the search method and on model structure, not on PSO alone.
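The update rule described above can be sketched with a textbook global-best PSO. This is not the SWATGenX optimizer: the coefficients `w`, `c1`, and `c2` are common textbook defaults chosen as assumptions, and the toy `sphere` objective stands in for a SWAT+ forward run scored against a gauge.

```python
import random


def pso_minimize(objective, bounds, n_particles=24, n_iterations=50,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Textbook global-best PSO; lower objective values are better."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                  # each particle's best position so far
    pbest_score = [objective(p) for p in pos]    # one evaluation per particle
    g = min(range(n_particles), key=pbest_score.__getitem__)
    gbest, gbest_score = pbest[g][:], pbest_score[g]

    for _ in range(n_iterations):
        # Move every particle toward its own best and the swarm best.
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)  # clamp to bounds
        # Evaluations are independent of each other, so they could run in parallel.
        scores = [objective(p) for p in pos]
        for i, score in enumerate(scores):
            if score < pbest_score[i]:
                pbest[i], pbest_score[i] = pos[i][:], score
            if score < gbest_score:
                gbest, gbest_score = pos[i][:], score
    return gbest, gbest_score


def sphere(x):
    """Toy objective (assumption): sum of squares with its minimum at the origin."""
    return sum(v * v for v in x)


best, score = pso_minimize(sphere, bounds=[(-5.0, 5.0)] * 3)
print(best, score)
```

Each call to `objective` corresponds to one forward run, so the total cost is `n_particles * (n_iterations + 1)` evaluations unless an early-stop rule ends the loop sooner.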
Typical pools contain 24–48 particles. We run as many forward models in parallel as CPU and RAM allow, then batch the rest evenly. Early-stop rules cut total simulation count: calibration ends when the best score stalls for several iterations or when the last ten scores barely change after iteration 25.
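The batching and early-stop rules can be sketched as two small helpers. The text does not give the exact thresholds, so `patience=5` (for "several iterations") and `tol=1e-3` (for "barely change") are assumptions, as are the function names; only the ten-score window and the iteration-25 floor come from the description above.

```python
def should_stop(best_scores, patience=5, window=10, min_iter=25, tol=1e-3):
    """best_scores[k] is the best-so-far score after iteration k + 1 (lower is better)."""
    n = len(best_scores)
    # Rule 1: the best score has not improved over the last `patience` iterations.
    if n > patience and best_scores[-1] >= best_scores[-1 - patience]:
        return True
    # Rule 2: from iteration `min_iter` on, the last `window` scores span less than `tol`.
    if n >= min_iter and n >= window:
        recent = best_scores[-window:]
        if max(recent) - min(recent) < tol:
            return True
    return False


def batches(n_particles, max_parallel):
    """Split particle indices into near-equal batches of at most max_parallel each."""
    n_batches = -(-n_particles // max_parallel)  # ceiling division
    base, extra = divmod(n_particles, n_batches)
    out, start = [], 0
    for b in range(n_batches):
        size = base + (1 if b < extra else 0)
        out.append(list(range(start, start + size)))
        start += size
    return out


print([len(b) for b in batches(48, 10)])  # [10, 10, 10, 9, 9]
```

Even batches avoid a short final batch that would leave most cores idle while a few runs finish.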
Caution. Delineation, inputs, and routing need a solid, robust setup — as we apply in SWATGenX with NHDPlus HR. Initialization of model inputs also matters: agricultural management and urban water use must be represented before calibration starts. In groundwater-dominated regions, SWAT+ alone often performs poorly and may need coupling to a dedicated groundwater model. Once the model is structurally sound, the optimizer adjusts parameters — within what the SWAT+ setup can represent — to improve simulated versus observed agreement.
Background: Calibration methods · How it works
What the numbers mean for your calibration
- Full-export runs are I/O-bound. NetCDF matters most when every daily and monthly output is written (Figures 1–2, Table 2).
- Calibration runs are dominated by print scope once output is limited to gauge streamflow (Figures 3–4, Table 3).
- With a filtered calibration profile, NetCDF and text are both workable; choose the format your post-processing tools already support (Table 3).
- Compiler gains on our EPYC host may not transfer to other CPUs, compiler versions, or filesystems (Table 4, Figures 5–7).
- Shared climate paths, parallel PSO with early stop, and coarser landuse/soil grids where acceptable reduce calibration overhead beyond what a single forward-run benchmark captures.
Conclusion
What to ship, and what to print
The fastest path through calibration is to narrow what you print before you tune anything else, write it in a format your post-processing already reads, and run a portable, well-optimized binary.
Runtime is governed first by print scope, not output format: once a calibration run is limited to the gauge channels you actually fit, NetCDF and text are both workable and the format becomes a second-order choice. Format matters where it should: full daily-and-monthly export is I/O-bound, and there NetCDF is the practical default. The compiler is the last lever: in Table 4 the portable ifx -O3 -ipo, which we ship because it runs on any deployment CPU, was the fastest build on tiers S and M, with gfortran -O2 ahead only on the 94k-HRU tier. And cutting HRUs is not the free speedup it looks like: once routing and model-graph overhead dominate, fewer HRUs do not buy proportional time. Every ranking here is specific to our CPU, compilers, and filesystem. The method transfers but the exact numbers may not, so repeat the matrix on your own hardware before fixing a binary.
- For calibration, print only the stream outputs you fit.
- Use NetCDF for full daily and monthly export workloads.
- For filtered calibration runs, NetCDF and text are both acceptable.
- On x86_64 hosts like ours (AMD EPYC), build production binaries with ifx -O3 -ipo, which is portable across deployment CPUs and was the fastest build on tiers S and M in Table 4, and verify on your own hardware.
- Share climate files across PSO particles and stop calibration when the fit no longer improves.
- Biggest runtime lever: print scope
- Full-export format: NetCDF
- Production binary: ifx -O3 -ipo
- Portability cost vs -xHost: none measured (Table 4)
This page quantifies the cost of running a SWATGenX model. How the model is built — and why that build is accurate and reproducible — is covered on the NHDPlus HR vs TauDEM delineation, drainage-area audit, and USGS station-assignment pages.
FAQ
Why is my SWAT+ simulation so slow?
Runtime is often dominated by how much output you write and how you store it. Full daily and monthly exports can be I/O-bound; limiting calibration output to gauge streamflow, using NetCDF for large exports, and choosing an optimized compiler build usually matter more than small parameter tweaks.
How can I speed up SWAT+ calibration?
Reduce printed output first, then pick NetCDF or text for your toolchain, then tune compiler flags on your CPU. SWATGenX also shares climate grids across PSO particles, runs forward models in parallel, and stops calibration when the fit stops improving.
What does the SWATGenX SWAT+ runtime benchmark measure?
One calendar year of filtered calibration-style simulation on six real watershed SWAT+ models (473 to 94,303 HRUs). Scenarios change one factor at a time: full NetCDF vs text export, calibration print scope, compiler build, HRU-count scaling, and VTune CPU profiles, while basin inputs stay fixed. The page also estimates seven-year calibration wall-time ranges from measured init and daily-loop costs.
How long does SWAT+ calibration take on a large watershed?
It depends on HRU count, channel routing density, simulation window length, particle count (often 24–48), iteration count (often 30–75 before early stop), and available CPU cores. On our benchmark page, a 94k-HRU Peace River model on a 16-core server is estimated at roughly six days to three weeks for a full seven-year calibration search; smaller basins finish much faster.
How do I reduce SWAT+ output size for calibration?
Write daily channel_sd only at the gauges you calibrate against, with a narrow print.prt and swift_out=0 in codes.bsn. That cuts I/O far more than switching between NetCDF and text on the same print profile.
When should I use NetCDF instead of formatted TXT in SWAT+?
When you need full daily and monthly HRU and stream output for mapping or archives. In our benchmark, text export at the same full-export print scope was about 4.8× larger on disk and took roughly 1.2× to 2.2× as long as NetCDF (Table 2).
How does SWATGenX avoid copying climate files during calibration?
We patch file.cio so pcp_path, tmp_path, slr_path, hmd_path, and wnd_path point at one shared meteorological grid folder. Each PSO particle reads from there instead of copying grid files into its TxtInOut.
How does parallel SWAT+ calibration work?
PSO runs 24–48 particles per iteration, as many in parallel as CPU and RAM allow. The run stops when scores stop improving for several iterations in a row, or when the best fit flatlines after iteration 25.
Last updated 2026-05-31.
