How long does a SWAT+ run take for a given HRU count?

On our server, one filtered calibration year of a single serial SWAT+ forward run takes about 5 s for a 473-HRU basin, ~90 s for 11,284 HRUs, and ~1,350 s for the 94,303-HRU Peace River model. Runtime rises with model size but not with HRU count alone: the 51,685-HRU Upper Gila basin runs slower than the larger Peace River because it has roughly twice the channels, and channel/routing density drives wall time. A single forward run is faster still on the multi-core SWATGenX engine (see the parallel-engine page).

How can I speed up a slow SWAT+ simulation or calibration?

Reduce printed output first: writing daily channel_sd only at your calibration gauges cut a mid-size run from 269 s to 100 s in our benchmark — more than switching file format. Then use NetCDF for large full exports (~4.8× smaller than text), and build with a portable optimized compiler (Intel ifx -O3 -ipo, within ~2% of a CPU-pinned -xHost). These are serial-engine levers; the multi-core engine and swarm parallelism reduce calibration wall time on top of them.

Are the calibration-time estimates for the serial or parallel engine?

Serial. Every runtime on this page is a single-core forward run, and the seven-year calibration estimates extrapolate that serial per-run cost across the particle swarm. They are conservative upper bounds: a single forward run is at least as fast on the current production engine, and faster still on the opt-in multi-core engine documented on the parallel-engine page.

SWAT+ performance research

SWAT+ runtime benchmark on real watershed models

Measured one-year runtime, output I/O, HRU scaling, and calibration-time estimates across six real basins — with the exact settings SWATGenX uses in production calibration.

Filtered daily channel_sd, production Intel ifx -O3 -ipo, measured on our server. Companion to the SWATGenX engine-acceleration study (submitted to Geoscientific Model Development, 2026).

473 → 94k HRUs, six basins
NetCDF ~4.8× smaller than text
Print filter cuts tier-M run 269 s → 100 s
Portable ifx -O3 -ipo, within ~2% of -xHost

Runtime vs model size: one simulated year · 6 basins · filtered

Filtered daily channel_sd, production ifx -O3 -ipo. Runtime rises with model size, but channel/routing density — not HRU count alone — sets the wall time.

Calibration runtime and output size are two practical limits in SWAT+ modeling. This page is a benchmark study on six real watershed SWAT+ model packages — from 473 to 94,303 HRUs — with measured one-year runs to show where simulation time is spent, which optimizations matter, and how long full calibrations may take on typical servers.

The benchmark keeps basin inputs fixed and changes one factor at a time: full NetCDF versus formatted-text export, filtered calibration output, compiler build, and HRU-count scaling. The figures and tables report timings from our production server. The closing section extrapolates seven-year calibration wall-time ranges from measured init and daily-loop costs.

One scope note up front: every timing here is for a single-core (serial) SWAT+ run, and the seven-year calibration estimates extrapolate that serial per-run cost across the swarm. A single forward run is faster still on the multi-core SWATGenX engine, where the HRU land phase produces byte-identical results across threads and channel routing has an opt-in wavefront; those gains are measured separately on the SWAT+ parallel-engine page. Read the numbers below as the serial-engine reference these accelerations build on.

Fork contributions

SWATGenX uses two SWAT+ engine extensions from our fork: gauge-limited printing through print_filter.prt and daily NetCDF output through cdfout=y. Both have been proposed upstream but are not yet part of the official swat-model/swatplus release.

print_filter.prt — writes daily channel_sd output only at calibration gauges, using a narrow print.prt and swift_out=0.
NetCDF (cdfout=y) — writes daily NetCDF-4 files instead of formatted text when full stream output is enabled.

rafiei-vahid/swatplus (feature/netcdf-cdfout) · swat-model/swatplus

Motivation

The question

Calibration is where SWAT+ runtime actually bites. To fit a model to a gauge, SWATGenX runs it hundreds to thousands of times — one full forward simulation per particle, per iteration of the optimizer — so the wall time of a single run, and the disk I/O it spends writing output, decide whether basin-scale calibration takes an afternoon or a week. The question this page answers is concrete: what actually governs a SWAT+ run’s wall time, and what should you compile and print to make calibration tractable without changing the science.

We answer it empirically on real SWATGenX-built models, changing one factor at a time: output format (NetCDF vs text), print scope (full export vs a gauge-limited calibration profile), and the compiler build on three models spanning small, medium, and large calibration workloads, then runtime scaling with HRU count on a six-model ladder.

Methods

A controlled, single-variable benchmark

Every run covers the same calendar year on the same model package; within each tier the inputs stay fixed and exactly one factor changes at a time, so a difference in wall time is attributable to that factor and nothing else. All timings are wall-clock seconds on the same host, and the ifx variants are held to a byte-identical output parity rule so a faster build never means a different answer.

Simulation period: 365 simulated days (calendar year 2021)
Production binary: Intel ifx release_o3_ipo, from the production SWAT+ fork rafiei-vahid/swatplus @ commit 768f1d1 (rev 61.0.2.61-351-g768f1d1) — the production binary at the time these benchmarks were run; compiler-matrix rows are the same fork commit built with the listed flags. The deployed engine has since advanced (247e95b added two output-identical engine fixes; the current unified build 6a4b7f1 adds opt-in multi-core, PFAS, and MODFLOW 6 — see /swat-plus-engine) and runs a single serial forward pass at least as fast, so these tier timings are conservative upper bounds. A single forward run drops further on the multi-core engine (see /swat-plus-parallel-engine).
Metrics: Wall-clock seconds, seconds per simulated day, bytes on disk
Parity rule: ifx variants match on channel_sd_day.nc (M tier, 20-day check); one-year runs use identical print settings
Test CPU: AMD EPYC 7282 — 10 vCPU KVM guest (5 cores × 2 threads on a 16-core host)
OS / architecture: Linux x86_64, Ubuntu 22.04 (KVM); 64-bit SWAT+ builds
Compilers: Intel oneAPI ifx 2026.0.0 and GNU Fortran 11.4.0. The -xHost variant targets this CPU (AVX2).
Output storage: Local ext4 on a QEMU virtio disk (~650 GB root volume). Each run writes to a temporary folder on that same filesystem, so the clock includes synchronous disk I/O for whatever print profile is enabled.
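The parity rule can be checked mechanically by hashing the output of two builds. The sketch below is a generic byte-level comparison, not the SWATGenX test harness; the function names are illustrative and the directory names in the usage note are placeholders.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def byte_identical(reference: Path, candidate: Path) -> bool:
    """True when both files have the same size and the same SHA-256 digest."""
    if reference.stat().st_size != candidate.stat().st_size:
        return False
    return sha256_of(reference) == sha256_of(candidate)
```

A call such as byte_identical(Path("ifx_o2/channel_sd_day.nc"), Path("ifx_o3_ipo/channel_sd_day.nc")) would then compare the same output file from two builds, assuming each build wrote into its own directory.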

Three real models: small, medium, large

We selected three SWATGenX-built SWAT+ models to represent small, medium, and large calibration workloads. Each row is a completed model package built from the same national data stack used in production: NHDPlus HR, PRISM climate, 250 m landuse/soil, and 30 m DEM. Use the Model ID link to open the public package page and download the ZIP.

Table 1. Benchmark basins by tier, USGS model ID, basin name, HRU count, and channel count. Model ID links open the SWATGenX package page where you can download the ZIP.

Tier	Model ID	Basin	HRUs	Channels
S	03080102	Oklawaha (FL)	473	45
M	09471300	Upper San Pedro (AZ)	11,284	350
L	03100101	Peace River HUC-8	94,303	8,181

Results and discussion

Four results follow, each isolating one factor. The first two settle what to print — format matters when you export everything, but print scope matters more once a calibration run is limited to the gauges you fit. The third settles what to compile. The fourth checks a tempting shortcut — cutting HRUs — and finds it does not buy proportional speed once routing dominates.

Measured benchmark results on real watershed models

Full export is I/O-bound — NetCDF wins when you print everything

When all daily and monthly outputs are written, NetCDF is the practical default for the S and M tiers (Figure 1, Table 2).

This scenario uses the full-output print profile: all HRU, channel, basin, and region outputs at daily and monthly time steps (benchmark_1yr/print.prt). The only changed setting is cdfout: NetCDF (y) versus formatted text (n). We ran tiers S and M only; tier L was skipped because formatted-text output would be impractically large.

Figure 1. Wall-clock time for one calendar year with full daily and monthly export (tiers S and M). Blue bars: NetCDF; orange bars: formatted text.

Figure 2. Total output written to disk for the same full-export runs, in megabytes per tier.

Table 2. Full-export wall time, NetCDF speed advantage, and on-disk output size for NetCDF vs formatted text.

Tier	NC wall (s)	TXT wall (s)	NC faster	NC size	TXT size
S · 03080102	21.1	26.1	+19.2%	125.7 MB	607.3 MB
M · 09471300	269.4	597.0	+54.9%	3067.1 MB	14573.0 MB
L · 03100101	We did not run full daily+monthly NC vs TXT on the 94k-HRU tier — TXT output would be impractically large on disk.

Figure 1 and Table 2 show that NetCDF finished faster than formatted text when the model wrote every daily and monthly output: about 19% faster on tier S and 55% faster on tier M. Figure 2 shows the same pattern in disk use, with NetCDF producing about 4.8× less output. For full-export jobs, NetCDF is the better default.
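The percentages and the size ratio above can be recomputed from the Table 2 values. In this sketch, "NC faster" is taken to mean the share of the text wall time that NetCDF saves, which is the reading that reproduces the table; the function name is illustrative.

```python
# Full-export results from Table 2:
# (NetCDF wall s, text wall s, NetCDF size MB, text size MB).
FULL_EXPORT = {
    "S": (21.1, 26.1, 125.7, 607.3),
    "M": (269.4, 597.0, 3067.1, 14573.0),
}


def netcdf_advantage(nc_wall, txt_wall, nc_mb, txt_mb):
    """Return (percent of text wall time saved, text-to-NetCDF size ratio)."""
    percent_faster = 100.0 * (txt_wall - nc_wall) / txt_wall
    size_ratio = txt_mb / nc_mb
    return percent_faster, size_ratio


for tier, row in FULL_EXPORT.items():
    faster, ratio = netcdf_advantage(*row)
    print(f"tier {tier}: NetCDF {faster:.1f}% faster, text output {ratio:.2f}x larger")
```

Both tiers land near a 4.8× size ratio, while the time saving grows with model size because more of the run is spent writing output.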

Print scope dominates — filter to your gauges and format is second-order

For calibration-style runs, limiting output to gauge channel_sd changes runtime more than choosing NetCDF or text (Figures 3–4, Table 3).

This scenario uses a calibration-style print profile: daily channel_sd only, limited to gauge channels through print_filter.prt, with swift_out=0 and wq_cha=0 in codes.bsn. The same filter is used for NetCDF (cdfout=y) and formatted text (cdfout=n). For tiers S and M, the charts also show the full-export runs from Scenario 1 for comparison.

Figure 3. Daily streamflow (channel_sd) file size per tier. Short bars: calibration filter at gauge channels; tall bars: full export from Scenario 1 (tiers S and M only).

Figure 4. Wall-clock time for filtered NetCDF, filtered text, and full-export baselines where we ran them. Lower bars mean faster runs.

Table 3. Filtered calibration runs: wall time, channel_sd file size, and seconds per simulated day.

Tier	Filtered NC wall (s)	Filtered TXT wall (s)	NC channel_sd	TXT channel_sd	sec / sim day
S · 03080102	5.4	3.8	0.5 MB	0.3 MB	0.01
M · 09471300	99.8	89.8	0.6 MB	0.7 MB	0.26
L · 03100101	1,378	1,297	4.9 MB	13.8 MB	3.37

Figures 3–4 and Table 3 show that print scope matters more than file format during calibration. Limiting output to daily streamflow at calibration gauges shortened runs far more than switching between NetCDF and text: tier S dropped from 21 s to 5 s, and tier M dropped from 269 s to 100 s. With this narrow output profile, NetCDF and text remained close enough that the practical choice is your downstream toolchain.
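To put the two levers side by side, the sketch below compares the seconds saved by narrowing print scope (full NetCDF to filtered NetCDF) with the gap between NetCDF and text once the filter is on. All values come from Tables 2 and 3; the dictionary keys and function name are illustrative.

```python
# Wall-clock seconds for one simulated year, from Tables 2 and 3.
WALL = {
    "S": {"full_nc": 21.1, "filtered_nc": 5.4, "filtered_txt": 3.8},
    "M": {"full_nc": 269.4, "filtered_nc": 99.8, "filtered_txt": 89.8},
}


def compare_levers(wall):
    """Return (seconds saved by filtering, NetCDF-vs-text gap when filtered)."""
    scope_saving = wall["full_nc"] - wall["filtered_nc"]
    format_gap = abs(wall["filtered_nc"] - wall["filtered_txt"])
    return scope_saving, format_gap


for tier, wall in WALL.items():
    scope_saving, format_gap = compare_levers(wall)
    print(f"tier {tier}: print scope saves {scope_saving:.1f} s, "
          f"format differs by {format_gap:.1f} s")
```

On both tiers the print-scope saving is roughly ten times the remaining format gap or more, which is why the format choice can follow the downstream toolchain.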

Compiler choice — ifx wins, and we ship the portable build

Table 4 ranks compiler builds on our server; Figures 5–7 show the tier-level results. Repeat this test on your own CPU and filesystem before choosing a production binary.

This scenario uses the same filtered daily channel_sd NetCDF profile as Scenario 2. Each table cell reports one calendar-year run for one compiler build and one model tier. These results are specific to the CPU, compiler versions, and disk used in this test.

Reference: ifx -O2. Production: ifx -O3 -ipo (portable). The single fastest build is ifx -O3 -xHost, but it compiles for this exact CPU’s instruction set and is not portable; production ships -ipo, within about 2% of -xHost. On the 94k-HRU L tier both gfortran builds exited during model input reading, so they have no L-tier time; every Intel ifx build completed it.

Compiler rankings and absolute run times depend on your CPU, core count, compiler version, and disk speed. Use these numbers as a reference from our server — confirm on your own hardware before you pick a production build.

Table 4. Compiler build matrix: one-year wall time by model tier. Fastest build per tier is marked; medium tier also shows percent change vs ifx -O2.

Build	Tier S wall (s)	Tier M wall (s)	Tier L wall (s)	vs ifx -O2 (M)
gfortran -O2	5.5	111.6	1,269 · fastest	+15.3%
gfortran -O3	6.5	157.0	—	+62.2%
ifx -O2	5.6	96.8	1,393	0.0%
ifx -O3	5.5	92.2	1,392	-4.7%
ifx -O3 -xHost	6.0	104.7	1,364	+8.2%
ifx -O3 -ipo · production	5.0 · fastest	89.5 · fastest	1,353	-7.5%
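The "vs ifx -O2 (M)" column is a plain percent change against the reference build. The sketch below recomputes it from the displayed tier M wall times; because those times are rounded to 0.1 s, a recomputed value can differ from the table in the last digit.

```python
# Tier M wall-clock seconds per build, from Table 4.
TIER_M_WALL = {
    "gfortran -O2": 111.6,
    "gfortran -O3": 157.0,
    "ifx -O2": 96.8,
    "ifx -O3": 92.2,
    "ifx -O3 -xHost": 104.7,
    "ifx -O3 -ipo": 89.5,
}


def percent_vs_reference(wall_s, reference_s):
    """Percent change in wall time relative to the reference build."""
    return 100.0 * (wall_s - reference_s) / reference_s


reference = TIER_M_WALL["ifx -O2"]
for build, wall_s in TIER_M_WALL.items():
    print(f"{build:<16} {percent_vs_reference(wall_s, reference):+.1f}%")
```

Negative values mean the build finished faster than ifx -O2 on this tier.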

Figure 5. Tier S (473 HRUs): wall time per compiler build on the filtered calibration print from Scenario 2. Green bar: production ifx -O3 -ipo.

Figure 6. Tier M (11,284 HRUs): wall time per compiler build on the same filtered print profile. Green bar: production ifx -O3 -ipo.

Figure 7. Tier L (94,303 HRUs): wall time per compiler build. Both gfortran builds (-O2 and -O3) exited during model input reading on this tier; every Intel ifx build completed it. Green bar: production ifx -O3 -ipo.

Table 4 and Figures 5–7 show that Intel ifx was the best overall choice on this server: every ifx build beat every gfortran build that ran. The single fastest build was the CPU-targeted ifx -O3 -xHost. We ship the portable ifx -O3 -ipo in production. The large tier was the exception: gfortran -O2 finished about 6% faster than ifx -O3 -ipo. gfortran -O3, however, did not complete on that tier, and most SWATGenX calibration workloads are closer to the medium tier than to the 94k-HRU test basin. For our current server, ifx -O3 -ipo is the best production balance. Repeat the compiler matrix on your own hardware before treating any binary choice as final.

Fewer HRUs is not proportionally faster when routing dominates

Reducing HRU count — or even speeding up HRU calculations — does not guarantee proportional speedup when routing and model-graph overhead dominate (Figure 8, Figure 9, Table 8).

Many calibration workflows use HRU count as a rough proxy for CPU cost. That leads to a common assumption: if we coarsen landuse/soil inputs or remove small HRUs, runtime should fall by a similar fraction. This scenario tests that assumption.

If runtime scaled mainly with HRU count, removing about 30% of HRUs would be expected to reduce wall time by roughly 30%. We did not expect that simple relationship to hold on real basins, because SWAT+ also spends time in channel routing, hydrograph handling, startup, and object-loop operations.

We selected six SWATGenX showcase models spanning 473 to 94k HRUs. Tiers S and M are the standard benchmark pair; X20, X40, and X60 fill the middle of the ladder; tier L is the large-basin case. All runs use the same filtered calibration profile as Scenario 2: daily channel_sd NetCDF output, print_filter at calibration gauges, swift_out=0, and wq_cha=0. The production binary is Intel ifx -O3 -ipo, with 365 simulated days in 2021.

Table 7. Scenario 4 model ladder: six SWATGenX showcase basins spanning small to large HRU counts, with channel counts and calibration-gauge print_filter counts.

Tier	Model ID	Basin	HRUs	Channels	print_filter channels
S	03080102	Oklawaha (FL)	473	45	1
M	09471300	Upper San Pedro (AZ)	11,284	350	2
X20	03152000	Little Kanawha (WV)	19,530	1,615	5
X40	07174000	Verdigris River (KS)	36,855	2,370	3
X60	15060105	Upper Gila HUC-8 (AZ)	51,685	17,296	4
L	03100101	Peace River HUC-8	94,303	8,181	40

Figure 8. Filtered calibration wall time vs HRU count using the production ifx -O3 -ipo binary. The dashed line is a simple linear fit across all six models; HRU count explains part of the trend but not the outliers.

Linear fit: wall time ≈ 0.0169 × HRUs − 46 s (R² = 0.76). The fit is useful as a rough trend, not as a runtime predictor for individual basins.
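The fit can be reproduced with ordinary least squares on the six (HRUs, wall time) pairs listed in Table 8. The sketch below uses only the standard library and also prints each basin's residual, which is where the outliers show up; the function name is illustrative.

```python
# (HRUs, wall seconds) for tiers S, M, X20, X40, X60, L, from Table 8.
POINTS = [
    (473, 5.0), (11284, 89.5), (19530, 156.7),
    (36855, 297.4), (51685, 1437.0), (94303, 1353.0),
]


def linear_fit(points):
    """Ordinary least squares y = slope * x + intercept, plus R squared."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy * sxy / (sxx * syy)
    return slope, intercept, r_squared


slope, intercept, r_squared = linear_fit(POINTS)
print(f"wall = {slope:.4f} * HRUs {intercept:+.0f} s, R^2 = {r_squared:.2f}")
for hrus, wall_s in POINTS:
    residual = wall_s - (slope * hrus + intercept)
    print(f"{hrus:>6} HRUs: residual {residual:+.0f} s")
```

The largest positive residual belongs to the 51,685-HRU model, which runs several hundred seconds slower than the trend line predicts.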

Figure 9. Wall time vs channel count for the same six models. Upper Gila (X60) has fewer HRUs than Peace River (L), but more channels and a longer runtime, showing that routing density can dominate HRU count.

Table 8. Filtered calibration runtime across the six-model HRU ladder: wall time, initialization time, daily-loop time, and seconds per simulated day.

Tier	Model	HRUs	Channels	Wall (s)	Init (s)	Daily loop (s)	s/day
S	03080102	473	45	5.0	1	4	0.010
M	09471300	11,284	350	89.5	5	83	0.230
X20	03152000	19,530	1,615	156.7	13	142	0.390
X40	07174000	36,855	2,370	297.4	22	273	0.750
X60	15060105	51,685	17,296	1,437	56	1,374	3.770
L	03100101	94,303	8,181	1,353	149	1,197	3.290

The scaling ladder shows that runtime increases with model size, but not as a simple multiple of HRU count. The clearest example is Upper Gila (X60): it has 51,685 HRUs and took 1,437 s, while Peace River (L) has 94,303 HRUs and took 1,353 s. Upper Gila is slower despite having fewer HRUs because it has a denser routing network: 17,296 channels versus 8,181.

This means HRU count is useful as a rough size indicator, but it is not enough to predict calibration runtime. Channel count, routing density, startup cost, and object-loop overhead also matter. On the large Peace River model, initialization alone took about 149 s before day 1, or roughly 11% of total wall time.
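The initialization share quoted for Peace River is the Init column divided by the Wall column in Table 8. The sketch below applies the same division to all six models; the function name is illustrative.

```python
# (wall s, init s) per tier, from Table 8.
TABLE_8 = {
    "S": (5.0, 1), "M": (89.5, 5), "X20": (156.7, 13),
    "X40": (297.4, 22), "X60": (1437.0, 56), "L": (1353.0, 149),
}


def init_share_percent(wall_s, init_s):
    """Initialization time as a percentage of total wall time."""
    return 100.0 * init_s / wall_s


for tier, (wall_s, init_s) in TABLE_8.items():
    share = init_share_percent(wall_s, init_s)
    print(f"tier {tier}: init is {share:.1f}% of a one-year run")
```

These shares apply to a one-year run. Initialization is paid once per run, so its share shrinks when the simulation window is longer.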

The linear fit in Figure 8 gives a rough trend across the six models (R² ≈ 0.76), but the outliers are the important lesson. A model with fewer HRUs can still run slower if its routing network is denser.

Bottom line: HRU count matters, but it is not the whole runtime model. After output is narrowed, the remaining bottleneck is no longer just HRU calculation. Trimming sub-hectare HRUs alone is unlikely to deliver double-digit calibration speedups, and even ideal HRU-loop parallelism is bounded by hydrograph, routing, startup, and object-loop costs. For faster calibration, print less first and avoid unnecessary full exports. For fewer HRUs, coarsen the model during construction rather than trimming polygons afterward.

What this means for calibration time and cost

Scenario 4 measured one filtered calibration forward run per model (365 simulated days). A full SWATGenX calibration repeats that style of run many times while searching parameter space.

Typical watershed calibrations adjust on the order of 30–40 parameters. In our particle-swarm setup that usually means 24 to 48 particles, most often 48. The maximum iteration count is commonly set to 75, but runs usually stop earlier when performance plateaus or stagnates, often somewhere between 30 and 75 iterations.

The table below extrapolates each model’s measured init and daily-loop split to a 7-year simulation window, then estimates calibration wall time as iterations × ⌈particles / cores⌉ × (one 7-year run). Each cell shows a range from 24 particles × 30 iterations to 48 particles × 75 iterations on three example servers (16, 32, and 64 cores). RAM is ignored here; in practice RAM can cap how many particles run at once even on a large machine.

Estimated calibration wall ≈ iterations × ⌈particles / cores⌉ × (init + 7 × daily loop), using Scenario 4 init/daily split per model.

Table 9. Estimated full-calibration wall-time ranges for a 7-year filtered calibration window. Each cell spans 24–48 particles and 30–75 iterations on 16-, 32-, and 64-core servers, derived from Scenario 4 init/daily-loop timing.

Tier	Model	7-yr run	16 cores	32 cores	64 cores
S	03080102	29 s	29 min–1.8 h	14 min–1.2 h	14 min–36 min
M	09471300	10 min	9.8 h–1.5 days	4.9 h–1.0 days	4.9 h–12.2 h
X20	03152000	17 min	16.8 h–2.6 days	8.4 h–1.7 days	8.4 h–21.0 h
X40	07174000	32 min	1.3 days–5.0 days	16.1 h–3.4 days	16.1 h–1.7 days
X60	15060105	2.7 h	6.7 days–3.6 wk	3.4 days–2.4 wk	3.4 days–8.4 days
L	03100101	2.4 h	5.9 days–3.2 wk	3.0 days–2.1 wk	3.0 days–7.4 days

These are planning estimates, not guarantees. Validation stages, filesystem contention, failed particles retried by the pool, and early-stop iteration counts all shift real wall time.
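The cells in Table 9 follow directly from the formula and the Table 8 init and daily-loop split. The sketch below is one way to reproduce them; the function names are illustrative, and like the table it assumes one core per particle and ignores RAM limits.

```python
import math

# (init s, daily-loop s) for one simulated year, from Table 8.
ONE_YEAR = {
    "S": (1, 4), "M": (5, 83), "X20": (13, 142),
    "X40": (22, 273), "X60": (56, 1374), "L": (149, 1197),
}


def run_seconds(init_s, loop_s, years=7):
    """One forward run: initialization once, plus the daily loop per year."""
    return init_s + years * loop_s


def calibration_seconds(run_s, particles, iterations, cores):
    """iterations x ceil(particles / cores) x one forward run."""
    batches = math.ceil(particles / cores)
    return iterations * batches * run_s


def wall_range(tier, cores, years=7):
    """(low, high) seconds: 24 particles x 30 iters to 48 particles x 75 iters."""
    run_s = run_seconds(*ONE_YEAR[tier], years)
    low = calibration_seconds(run_s, particles=24, iterations=30, cores=cores)
    high = calibration_seconds(run_s, particles=48, iterations=75, cores=cores)
    return low, high


for tier in ONE_YEAR:
    low, high = wall_range(tier, cores=16)
    print(f"tier {tier}, 16 cores: {low / 3600:.1f} h to {high / 3600:.1f} h")
```

For example, wall_range("S", 16) returns 1,740 s and 6,525 s, which is the 29 min to 1.8 h cell in the first row.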

Adding server cores helps only until the core count reaches the particle count, so that every particle in an iteration runs at once. Beyond that, faster calibration requires fewer particles, fewer iterations, a shorter simulation window, or the print-scope optimizations in Scenarios 1–2.
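The ceiling term in the estimate is what makes extra cores stop paying off. The sketch below counts batches per iteration for a 48-particle swarm and the fraction of core-slots that actually do work, again assuming one core per particle; the function name is illustrative.

```python
import math


def batches_and_utilization(particles, cores):
    """Batches needed per iteration and the share of core-slots kept busy."""
    batches = math.ceil(particles / cores)
    utilization = particles / (batches * cores)
    return batches, utilization


for cores in (16, 32, 64, 128):
    batches, utilization = batches_and_utilization(48, cores)
    print(f"{cores:>3} cores: {batches} batch(es) per iteration, "
          f"{utilization:.0%} of core-slots used")
```

Going from 16 to 32 cores cuts three batches to two rather than halving the time, and nothing changes between 64 and 128 cores because a single batch already holds the whole swarm.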

Where does the time actually go? See the profiling deep-dive

This page answers which build and settings run fastest. A separate study profiles the stock SWAT+ engine with Intel VTune and asks where the CPU time goes. On a 94k-HRU basin the two largest costs are not hydrology but string name-matching and array zeroing, and two small, results-identical fixes speed up a 3-month run by 1.58×.

VTune before/after on the stock engine vs the same code with two fixes.
Each fix traced to its source pattern and contributed upstream as an independent pull request.
Byte-identical results; commit-pinned binaries.

Read the SWAT+ performance-profiling deep-dive →

For how these layers stack against the stock distribution (commit-pinned production vs stock), see the SWAT+ production engine page.

Beyond one run: the calibration-level levers

The timed scenarios above measure one SWAT+ forward run at a time. These workflow choices reduce I/O, parallelize the optimizer, and stop when the fit stops improving — savings at the calibration level rather than inside a single benchmark cell.

Reduce I/O during calibration

On our medium-tier benchmark model (09471300, Upper San Pedro (AZ), 11,284 HRUs), the shared PRISM folder holds 400 meteorological grid files totaling about 56 MB. A typical PSO calibration uses 48 particles for 50–75 iterations — 2,400–3,600 forward evaluations. If each evaluation copied those grids into a separate TxtInOut, that would add about 135–203 GB of avoidable disk traffic before simulation output is counted.

SWATGenX points pcp_path, tmp_path, slr_path, hmd_path, and wnd_path in file.cio at a shared directory instead. Index files stay in each particle's TxtInOut; the large grid files are read in place. This is already implemented in production and should be part of any calibration I/O plan. Disable with SWATGENX_DISABLE_PCP_PATH=1.
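The avoided-traffic figure is a simple product of swarm size, iteration count, and folder size. The sketch below uses the rounded 56 MB folder size and decimal gigabytes, so it lands slightly under the 135–203 GB quoted above, presumably because the page's figure uses the unrounded size; the function name is illustrative.

```python
def avoided_copy_gb(particles, iterations, folder_mb):
    """Disk traffic (GB) if every forward evaluation copied the climate folder."""
    evaluations = particles * iterations
    return evaluations * folder_mb / 1000.0


low = avoided_copy_gb(particles=48, iterations=50, folder_mb=56)
high = avoided_copy_gb(particles=48, iterations=75, folder_mb=56)
print(f"{48 * 50:,} to {48 * 75:,} evaluations, about {low:.0f} to {high:.0f} GB copied")
```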

Coarser landuse and soil grids (optional at model build)

Hydrological delineation — stream network, subbasins, and routing — does not depend on landuse or soil grid resolution. Those layers define HRU land-cover and soil assignments after the watershed is built. Choosing a coarser landuse/soil alignment resamples NLCD and gSSURGO to fewer unique combinations and can reduce HRU count and calibration cost, with some compromise in spatial detail of land cover and soil properties. The benchmark models on this page use 250 m landuse/soil with a 30 m DEM; coarser settings are a build-time tradeoff, not a per-run switch.

Use an optimization algorithm with parallel processing

SWATGenX uses particle swarm optimization (PSO), a population-based swarm intelligence method inspired by how birds or fish move as a group toward food. The optimizer keeps a swarm of particles; each particle holds one candidate parameter set and is evaluated with a full SWAT+ forward run. Each iteration updates every particle through social search rules — tracking its own best score and the best score in the swarm — so the population moves toward better calibrations without exploring every combination in the parameter space. When the calibration problem is well posed, fewer forward runs are needed to converge; that depends on the search method and on model structure, not on PSO alone.
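As an illustration of the update rule described above, here is a textbook global-best PSO in a few lines. It is not the SWATGenX optimizer: the inertia and acceleration coefficients are common textbook defaults chosen for this sketch, and toy_objective is a stand-in for a SWAT+ forward run scored against observations (lower is better).

```python
import random


def toy_objective(params):
    """Stand-in for one forward run: squared distance from a known optimum."""
    target = (0.3, -1.2, 2.0)
    return sum((p - t) ** 2 for p, t in zip(params, target))


def pso_minimize(objective, bounds, particles=24, iterations=60, seed=0,
                 inertia=0.7, cognitive=1.5, social=1.5):
    """Textbook global-best PSO. Returns (best_position, best_score)."""
    rng = random.Random(seed)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(particles)]
    vel = [[0.0] * len(bounds) for _ in range(particles)]
    own_pos = [p[:] for p in pos]                # each particle's best position
    own_score = [objective(p) for p in pos]      # and the score it got there
    leader = min(range(particles), key=lambda i: own_score[i])
    swarm_pos, swarm_score = own_pos[leader][:], own_score[leader]

    for _ in range(iterations):
        # Move every particle toward its own best and the swarm best.
        for i in range(particles):
            for d, (lo, hi) in enumerate(bounds):
                pull_own = cognitive * rng.random() * (own_pos[i][d] - pos[i][d])
                pull_swarm = social * rng.random() * (swarm_pos[d] - pos[i][d])
                vel[i][d] = inertia * vel[i][d] + pull_own + pull_swarm
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
        # Evaluate the whole swarm; these calls are independent of each other.
        scores = [objective(p) for p in pos]
        for i, score in enumerate(scores):
            if score < own_score[i]:
                own_pos[i], own_score[i] = pos[i][:], score
            if score < swarm_score:
                swarm_pos, swarm_score = pos[i][:], score
    return swarm_pos, swarm_score


best_params, best_cost = pso_minimize(toy_objective, [(-5.0, 5.0)] * 3)
print(best_params, best_cost)
```

The evaluation step is written as one list comprehension on purpose: every call in it is independent, so that is the step a calibration framework can spread across cores.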

Typical pools contain 24–48 particles. We run as many forward models in parallel as CPU and RAM allow, then batch the rest evenly. Early-stop rules cut total simulation count: calibration ends when the best score stalls for several iterations or when the last ten scores barely change after iteration 25.
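The two early-stop rules can be written as a small predicate over the history of swarm-best scores. This is a sketch under stated assumptions, not the production rule: scores are treated as costs to minimize, and patience and tolerance are hypothetical values standing in for "several iterations" and "barely change".

```python
def should_stop(best_scores, patience=5, window=10, min_iteration=25,
                tolerance=1e-3):
    """Decide whether to stop, given the swarm-best cost after each iteration.

    best_scores[i] is the best cost after iteration i + 1 (lower is better).
    patience and tolerance are illustrative values, not SWATGenX settings.
    """
    n = len(best_scores)
    # Rule 1: the best score has not improved for `patience` iterations in a row.
    if n > patience:
        if min(best_scores[-patience:]) >= min(best_scores[:-patience]):
            return True
    # Rule 2: past iteration 25, the last ten scores span less than `tolerance`.
    if n > min_iteration:
        recent = best_scores[-window:]
        if max(recent) - min(recent) < tolerance:
            return True
    return False
```

Calling should_stop after every iteration with the running list of best scores lets the calibration loop exit before the maximum iteration count is reached.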

Caution. Delineation, inputs, and routing need a solid, robust setup — as we apply in SWATGenX with NHDPlus HR. Initialization of model inputs also matters: agricultural management and urban water use must be represented before calibration starts. In groundwater-dominated regions, SWAT+ alone often performs poorly and may need coupling to a dedicated groundwater model. Once the model is structurally sound, the optimizer adjusts parameters — within what the SWAT+ setup can represent — to improve simulated versus observed agreement.

Background: Calibration methods · How it works

What the numbers mean for your calibration

Full-export runs are I/O-bound. NetCDF matters most when every daily and monthly output is written (Figures 1–2, Table 2).
Calibration runs are dominated by print scope once output is limited to gauge streamflow (Figures 3–4, Table 3).
With a filtered calibration profile, NetCDF and text are both workable; choose the format your post-processing tools already support (Table 3).
Compiler gains on our EPYC host may not transfer to other CPUs, compiler versions, or filesystems (Table 4, Figures 5–7).
Shared climate paths, parallel PSO with early stop, and coarser landuse/soil grids where acceptable reduce calibration overhead beyond what a single forward-run benchmark captures.
HRU count is a rough size indicator, not a runtime predictor: Upper Gila (X60, 51,685 HRUs, 17,296 channels) took 1,437 s while Peace River (L, 94,303 HRUs, 8,181 channels) took 1,353 s, so routing density, startup cost, and object-loop overhead also matter (Figures 8–9, Table 8).
Trimming sub-hectare HRUs alone is unlikely to deliver double-digit calibration speedups; for fewer HRUs, coarsen the model during construction rather than trimming polygons afterward.

Conclusion

What to ship, and what to print

The fastest path through calibration is to narrow what you print before you tune anything else, write it in a format your post-processing already reads, and run a portable, well-optimized binary.

Runtime is governed first by print scope, not output format: once a calibration run is limited to the gauge channels you actually fit, NetCDF and text are both workable and the format becomes a second-order choice. Format matters where it should — full daily-and-monthly export is I/O-bound, and there NetCDF is the practical default. The compiler is the last lever: Intel ifx beat every gfortran build that ran, and we ship the portable ifx -O3 -ipo — within about 2% of the CPU-pinned -xHost, but able to run on any deployment CPU. And cutting HRUs is not the free speedup it looks like: once routing and model-graph overhead dominate, fewer HRUs do not buy proportional time. Every ranking here is specific to our CPU, compilers, and filesystem — the method transfers, the exact numbers may not, so repeat the matrix on your own hardware before fixing a binary.

For calibration, print only the stream outputs you fit.
Use NetCDF for full daily and monthly export workloads.
For filtered calibration runs, NetCDF and text are both acceptable.
On x86_64 hosts like our AMD EPYC server, build production binaries with Intel ifx -O3 -ipo (portable across deployment CPUs and within ~2% of the CPU-pinned -xHost), and verify on your own hardware.
Share climate files across PSO particles and stop calibration when the fit no longer improves.
Treat HRU count as a rough size indicator only; if you need fewer HRUs, coarsen the model during construction rather than trimming polygons afterward.

Biggest runtime lever: Print scope
Full-export format: NetCDF
Production binary: ifx -O3 -ipo
Portability cost vs fastest: ~2%

This page quantifies the cost of running a SWATGenX model. How the model is built — and why that build is accurate and reproducible — is covered on the NHDPlus HR vs TauDEM delineation, drainage-area audit, and USGS station-assignment pages.

FAQ

Why is my SWAT+ simulation so slow?
Runtime is often dominated by how much output you write and how you store it. Full daily and monthly exports can be I/O-bound; limiting calibration output to gauge streamflow, using NetCDF for large exports, and choosing an optimized compiler build usually matter more than small parameter tweaks.
How can I speed up SWAT+ calibration?
Reduce printed output first, then pick NetCDF or text for your toolchain, then tune compiler flags on your CPU. SWATGenX also shares climate grids across PSO particles, runs forward models in parallel, and stops calibration when the fit stops improving.
What does the SWATGenX SWAT+ runtime benchmark measure?
One calendar year of filtered calibration-style simulation on six real watershed SWAT+ models (473 to 94,303 HRUs). Scenarios change one factor at a time: full NetCDF vs text export, calibration print scope, compiler build, HRU-count scaling, and VTune CPU profiles, while basin inputs stay fixed. The page also estimates seven-year calibration wall-time ranges from measured init and daily-loop costs.
How long does SWAT+ calibration take on a large watershed?
It depends on HRU count, channel routing density, simulation window length, particle count (often 24-48), iteration count (often 30-75 before early stop), and available CPU cores. On our benchmark page, a 94k-HRU Peace River model on a 16-core server is estimated at roughly six days to three weeks for a full seven-year calibration search; smaller basins finish much faster.
How do I reduce SWAT+ output size for calibration?
Write daily channel_sd only at the gauges you calibrate against, with a narrow print.prt and swift_out=0 in codes.bsn. That cuts I/O far more than switching between NetCDF and text on the same print profile.
When should I use NetCDF instead of formatted TXT in SWAT+?
When you need full daily and monthly HRU and stream output for mapping or archives. In our benchmark, text export at the same full-export print scope was about 4.8× larger on disk and took 1.2–2.2× as long to run as NetCDF.
How does SWATGenX avoid copying climate files during calibration?
We patch file.cio so pcp_path, tmp_path, slr_path, hmd_path, and wnd_path point at one shared meteorological grid folder. Each PSO particle reads from there instead of copying grid files into its TxtInOut.
How does parallel SWAT+ calibration work?
PSO runs 24–48 particles per iteration, as many in parallel as CPU and RAM allow. The run stops when scores stop improving for several iterations in a row, or when the best fit flatlines after iteration 25.