Commit 517b92f

Parallelize the build scripts (#1998)
Optimized/parallelized build scripts. A few notes:

1) The default number of build jobs is 20, but one could argue for using 40. When researching this, I looked up what the SRW compiles with. That system uses 40 cores, which seems a little excessive: when testing the global workflow with 40 cores, the number of cores actually in use at any given time rarely exceeded 16. This is because the builds tend to use multiple threads at the beginning, when compiling low-level modules, while the higher-level modules are more or less serial, AND because the GDASApp takes several minutes to initialize all of its subrepositories, by which time the smaller builds are complete.

2) I also updated checkout.sh so that all checkouts run simultaneously. The CPU load for `git submodule` is quite low, so running 16 instead of 8 jobs at once is not much more expensive.

3) To make this work, I had to add `-j` options to most of the build scripts. The only exception is build_upp, because the build script within the UPP is hard coded to use 6 cores.

4) I fixed a few small bugs in the build scripts along the way.

5) Lastly, this reduces the total build time from ~2.5 hours for the entire system (including GDAS and GSI in the same build) to ~40 minutes when running with `-j 40`.

Resolves #1978
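The job-count handling described in the notes can be sketched in isolation. The following is a minimal illustration, not the code from this commit: the function name `allocate_jobs` and the `big_builds` string are assumptions made for the example, while the per-build defaults (ufs 8, gdas 16, upp 6) are taken from the diff. It caps each build at the `-j` maximum and then splits any leftover cores evenly among the larger builds.

```shell
#!/usr/bin/env bash
# Illustrative sketch only. allocate_jobs and big_builds are hypothetical
# names; the default job counts mirror those in sorc/build_all.sh.

declare -A build_jobs=( ["ufs"]=8 ["gdas"]=16 ["upp"]=6 )
big_builds="ufs gdas gsi_enkf"

allocate_jobs() {
  local max_jobs=$1
  local requested=0 big_count=0 build extra

  # Step 1: no single build may request more than the global maximum
  for build in "${!build_jobs[@]}"; do
    if (( build_jobs[${build}] > max_jobs )); then
      build_jobs[${build}]=${max_jobs}
    fi
    requested=$(( requested + build_jobs[${build}] ))
    # Count how many of the selected builds are "big" ones
    if [[ " ${big_builds} " == *" ${build} "* ]]; then
      big_count=$(( big_count + 1 ))
    fi
  done

  # Step 2: if cores are left over, share them evenly among the big builds
  if (( requested < max_jobs && big_count > 0 )); then
    extra=$(( (max_jobs - requested) / big_count ))
    for build in "${!build_jobs[@]}"; do
      if [[ " ${big_builds} " == *" ${build} "* ]]; then
        build_jobs[${build}]=$(( build_jobs[${build}] + extra ))
      fi
    done
  fi
}

allocate_jobs 40
for build in "${!build_jobs[@]}"; do
  echo "${build}: ${build_jobs[${build}]}"
done
```

With a maximum of 40, the three builds request 30 cores, so the 10 spare cores are split between the two big builds: ufs ends up with 13, gdas with 21, and upp stays at 6. The order in which the lines are printed is not fixed, because bash does not guarantee an iteration order for associative arrays.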
1 parent 67c050c commit 517b92f

13 files changed: +197 -174 lines changed

sorc/build_all.sh

+143-150
@@ -16,13 +16,15 @@ function _usage() {
1616
Builds all of the global-workflow components by calling the individual build
1717
scripts in sequence.
1818
19-
Usage: ${BASH_SOURCE[0]} [-a UFS_app][-c build_config][-h][-v]
19+
Usage: ${BASH_SOURCE[0]} [-a UFS_app][-c build_config][-h][-j n][-v]
2020
-a UFS_app:
2121
Build a specific UFS app instead of the default
2222
-c build_config:
2323
Selectively build based on the provided config instead of the default config
2424
-h:
2525
print this help message and exit
26+
-j:
27+
Specify maximum number of build jobs (n)
2628
-v:
2729
Execute all build scripts with -v option to turn on verbose where supported
2830
EOF
@@ -33,25 +35,25 @@ script_dir=$(cd "$(dirname "${BASH_SOURCE[0]}")" &> /dev/null && pwd)
3335
cd "${script_dir}" || exit 1
3436

3537
_build_ufs_opt=""
36-
_ops_opt=""
3738
_verbose_opt=""
3839
_partial_opt=""
40+
_build_job_max=20
3941
# Reset option counter in case this script is sourced
4042
OPTIND=1
41-
while getopts ":a:c:hov" option; do
43+
while getopts ":a:c:j:hv" option; do
4244
case "${option}" in
4345
a) _build_ufs_opt+="-a ${OPTARG} ";;
4446
c) _partial_opt+="-c ${OPTARG} ";;
4547
h) _usage;;
46-
o) _ops_opt+="-o";;
48+
j) _build_job_max="${OPTARG} ";;
4749
v) _verbose_opt="-v";;
4850
:)
4951
echo "[${BASH_SOURCE[0]}]: ${option} requires an argument"
50-
usage
52+
_usage
5153
;;
5254
*)
5355
echo "[${BASH_SOURCE[0]}]: Unrecognized option: ${option}"
54-
usage
56+
_usage
5557
;;
5658
esac
5759
done
@@ -105,170 +107,161 @@ ERRSCRIPT=${ERRSCRIPT:-'eval [[ $err = 0 ]]'}
105107
# shellcheck disable=
106108
err=0
107109

108-
#------------------------------------
109-
# build gfs_utils
110-
#------------------------------------
111-
if [[ ${Build_gfs_utils} == 'true' ]]; then
112-
echo " .... Building gfs_utils .... "
113-
# shellcheck disable=SC2086,SC2248
114-
./build_gfs_utils.sh ${_verbose_opt} > "${logs_dir}/build_gfs_utils.log" 2>&1
115-
# shellcheck disable=
116-
rc=$?
117-
if (( rc != 0 )) ; then
118-
echo "Fatal error in building gfs_utils."
119-
echo "The log file is in ${logs_dir}/build_gfs_utils.log"
120-
fi
121-
err=$((err + rc))
122-
fi
110+
declare -A build_jobs
111+
declare -A build_opts
123112

124113
#------------------------------------
125-
# build WW3 pre & post execs
114+
# Check which builds to do and assign # of build jobs
126115
#------------------------------------
127-
if [[ ${Build_ww3_prepost} == "true" ]]; then
128-
echo " .... Building WW3 pre and post execs .... "
129-
# shellcheck disable=SC2086,SC2248
130-
./build_ww3prepost.sh ${_verbose_opt} ${_build_ufs_opt} > "${logs_dir}/build_ww3_prepost.log" 2>&1
131-
# shellcheck disable=
132-
rc=$?
133-
if (( rc != 0 )) ; then
134-
echo "Fatal error in building WW3 pre/post processing."
135-
echo "The log file is in ${logs_dir}/build_ww3_prepost.log"
136-
fi
137-
err=$((err + rc))
138-
fi
139116

140-
#------------------------------------
141-
# build forecast model
142-
#------------------------------------
117+
# Mandatory builds, unless otherwise specified, for the UFS
118+
big_jobs=0
143119
if [[ ${Build_ufs_model} == 'true' ]]; then
144-
echo " .... Building forecast model .... "
145-
# shellcheck disable=SC2086,SC2248
146-
./build_ufs.sh ${_verbose_opt} ${_build_ufs_opt} > "${logs_dir}/build_ufs.log" 2>&1
147-
# shellcheck disable=
148-
rc=$?
149-
if (( rc != 0 )) ; then
150-
echo "Fatal error in building UFS model."
151-
echo "The log file is in ${logs_dir}/build_ufs.log"
152-
fi
153-
err=$((err + rc))
120+
build_jobs["ufs"]=8
121+
big_jobs=$((big_jobs+1))
122+
build_opts["ufs"]="${_verbose_opt} ${_build_ufs_opt}"
154123
fi
155-
156-
#------------------------------------
157-
# build GSI and EnKF - optional checkout
158-
#------------------------------------
159-
if [[ -d gsi_enkf.fd ]]; then
160-
if [[ ${Build_gsi_enkf} == 'true' ]]; then
161-
echo " .... Building gsi and enkf .... "
162-
# shellcheck disable=SC2086,SC2248
163-
./build_gsi_enkf.sh ${_ops_opt} ${_verbose_opt} > "${logs_dir}/build_gsi_enkf.log" 2>&1
164-
# shellcheck disable=
165-
rc=$?
166-
if (( rc != 0 )) ; then
167-
echo "Fatal error in building gsi_enkf."
168-
echo "The log file is in ${logs_dir}/build_gsi_enkf.log"
169-
fi
170-
err=$((err + rc))
171-
fi
172-
else
173-
echo " .... Skip building gsi and enkf .... "
124+
# The UPP is hardcoded to use 6 cores
125+
if [[ ${Build_upp} == 'true' ]]; then
126+
build_jobs["upp"]=6
127+
build_opts["upp"]=""
174128
fi
175-
176-
#------------------------------------
177-
# build gsi utilities
178-
#------------------------------------
179-
if [[ -d gsi_utils.fd ]]; then
180-
if [[ ${Build_gsi_utils} == 'true' ]]; then
181-
echo " .... Building gsi utilities .... "
182-
# shellcheck disable=SC2086,SC2248
183-
./build_gsi_utils.sh ${_ops_opt} ${_verbose_opt} > "${logs_dir}/build_gsi_utils.log" 2>&1
184-
# shellcheck disable=
185-
rc=$?
186-
if (( rc != 0 )) ; then
187-
echo "Fatal error in building gsi utilities."
188-
echo "The log file is in ${logs_dir}/build_gsi_utils.log"
189-
fi
190-
err=$((err + rc))
191-
fi
192-
else
193-
echo " .... Skip building gsi utilities .... "
129+
if [[ ${Build_ufs_utils} == 'true' ]]; then
130+
build_jobs["ufs_utils"]=3
131+
build_opts["ufs_utils"]="${_verbose_opt}"
132+
fi
133+
if [[ ${Build_gfs_utils} == 'true' ]]; then
134+
build_jobs["gfs_utils"]=1
135+
build_opts["gfs_utils"]="${_verbose_opt}"
136+
fi
137+
if [[ ${Build_ww3prepost} == "true" ]]; then
138+
build_jobs["ww3prepost"]=3
139+
build_opts["ww3prepost"]="${_verbose_opt} ${_build_ufs_opt}"
194140
fi
195141

196-
#------------------------------------
197-
# build gdas - optional checkout
198-
#------------------------------------
142+
# Optional DA builds
199143
if [[ -d gdas.cd ]]; then
200-
if [[ ${Build_gdas} == 'true' ]]; then
201-
echo " .... Building GDASApp .... "
202-
# shellcheck disable=SC2086,SC2248
203-
./build_gdas.sh ${_verbose_opt} > "${logs_dir}/build_gdas.log" 2>&1
204-
# shellcheck disable=
205-
rc=$?
206-
if (( rc != 0 )) ; then
207-
echo "Fatal error in building GDASApp."
208-
echo "The log file is in ${logs_dir}/build_gdas.log"
209-
fi
210-
err=$((err + rc))
211-
fi
212-
else
213-
echo " .... Skip building GDASApp .... "
144+
build_jobs["gdas"]=16
145+
big_jobs=$((big_jobs+1))
146+
build_opts["gdas"]="${_verbose_opt}"
147+
fi
148+
if [[ -d gsi_enkf.fd ]]; then
149+
build_jobs["gsi_enkf"]=8
150+
big_jobs=$((big_jobs+1))
151+
build_opts["gsi_enkf"]="${_verbose_opt}"
152+
fi
153+
if [[ -d gsi_utils.fd ]]; then
154+
build_jobs["gsi_utils"]=2
155+
build_opts["gsi_utils"]="${_verbose_opt}"
214156
fi
215-
216-
#------------------------------------
217-
# build gsi monitor
218-
#------------------------------------
219157
if [[ -d gsi_monitor.fd ]]; then
220-
if [[ ${Build_gsi_monitor} == 'true' ]]; then
221-
echo " .... Building gsi monitor .... "
222-
# shellcheck disable=SC2086,SC2248
223-
./build_gsi_monitor.sh ${_ops_opt} ${_verbose_opt} > "${logs_dir}/build_gsi_monitor.log" 2>&1
224-
# shellcheck disable=
225-
rc=$?
226-
if (( rc != 0 )) ; then
227-
echo "Fatal error in building gsi monitor."
228-
echo "The log file is in ${logs_dir}/build_gsi_monitor.log"
229-
fi
230-
err=$((err + rc))
231-
fi
232-
else
233-
echo " .... Skip building gsi monitor .... "
158+
build_jobs["gsi_monitor"]=1
159+
build_opts["gsi_monitor"]="${_verbose_opt}"
234160
fi
235161

236-
#------------------------------------
237-
# build UPP
238-
#------------------------------------
239-
if [[ ${Build_upp} == 'true' ]]; then
240-
echo " .... Building UPP .... "
241-
# shellcheck disable=SC2086,SC2248
242-
./build_upp.sh ${_ops_opt} ${_verbose_opt} > "${logs_dir}/build_upp.log" 2>&1
243-
# shellcheck disable=
244-
rc=$?
245-
if (( rc != 0 )) ; then
246-
echo "Fatal error in building UPP."
247-
echo "The log file is in ${logs_dir}/build_upp.log"
248-
fi
249-
err=$((err + rc))
250-
fi
162+
# Go through all builds and adjust CPU counts down if necessary
163+
requested_cpus=0
164+
build_list=""
165+
for build in "${!build_jobs[@]}"; do
166+
if [[ -z "${build_list}" ]]; then
167+
build_list="${build}"
168+
else
169+
build_list="${build_list}, ${build}"
170+
fi
171+
if [[ ${build_jobs[${build}]} -gt ${_build_job_max} ]]; then
172+
build_jobs[${build}]=${_build_job_max}
173+
fi
174+
requested_cpus=$(( requested_cpus + build_jobs[${build}] ))
175+
done
251176

252-
#------------------------------------
253-
# build ufs_utils
254-
#------------------------------------
255-
if [[ ${Build_ufs_utils} == 'true' ]]; then
256-
echo " .... Building ufs_utils .... "
257-
# shellcheck disable=SC2086,SC2248
258-
./build_ufs_utils.sh ${_verbose_opt} > "${logs_dir}/build_ufs_utils.log" 2>&1
259-
# shellcheck disable=
260-
rc=$?
261-
if (( rc != 0 )) ; then
262-
echo "Fatal error in building ufs_utils."
263-
echo "The log file is in ${logs_dir}/build_ufs_utils.log"
264-
fi
265-
err=$((err + rc))
177+
echo "Building ${build_list}"
178+
179+
# Go through all builds and adjust CPU counts up if possible
180+
if [[ ${requested_cpus} -lt ${_build_job_max} && ${big_jobs} -gt 0 ]]; then
181+
# Add cores to the gdas, ufs, and gsi build jobs
182+
extra_cores=$(( _build_job_max - requested_cpus ))
183+
extra_cores=$(( extra_cores / big_jobs ))
184+
for build in "${!build_jobs[@]}"; do
185+
if [[ "${build}" == "gdas" || "${build}" == "ufs" || "${build}" == "gsi_enkf" ]]; then
186+
build_jobs[${build}]=$(( build_jobs[${build}] + extra_cores ))
187+
fi
188+
done
266189
fi
267190

191+
procs_in_use=0
192+
declare -A build_ids
193+
194+
builds_started=0
195+
# Now start looping through all of the jobs until everything is done
196+
while [[ ${builds_started} -lt ${#build_jobs[@]} ]]; do
197+
for build in "${!build_jobs[@]}"; do
198+
# Has the job started?
199+
if [[ -n "${build_jobs[${build}]+0}" && -z "${build_ids[${build}]+0}" ]]; then
200+
# Do we have enough processors to run it?
201+
if [[ ${_build_job_max} -ge $(( build_jobs[build] + procs_in_use )) ]]; then
202+
if [[ "${build}" != "upp" ]]; then
203+
"./build_${build}.sh" -j "${build_jobs[${build}]}" "${build_opts[${build}]:-}" > \
204+
"${logs_dir}/build_${build}.log" 2>&1 &
205+
else
206+
"./build_${build}.sh" "${build_opts[${build}]}" > \
207+
"${logs_dir}/build_${build}.log" 2>&1 &
208+
fi
209+
build_ids["${build}"]=$!
210+
echo "Starting build_${build}.sh"
211+
procs_in_use=$(( procs_in_use + build_jobs[${build}] ))
212+
fi
213+
fi
214+
done
215+
216+
# Check if all builds have completed
217+
# Also recalculate how many processors are in use to account for completed builds
218+
builds_started=0
219+
procs_in_use=0
220+
for build in "${!build_jobs[@]}"; do
221+
# Has the build started?
222+
if [[ -n "${build_ids[${build}]+0}" ]]; then
223+
builds_started=$(( builds_started + 1))
224+
# Calculate how many processors are in use
225+
# Is the build still running?
226+
if ps -p "${build_ids[${build}]}" > /dev/null; then
227+
procs_in_use=$(( procs_in_use + build_jobs["${build}"] ))
228+
fi
229+
fi
230+
done
231+
232+
sleep 5s
233+
done
234+
235+
# Wait for all jobs to complete and check return statuses
236+
errs=0
237+
while [[ ${#build_jobs[@]} -gt 0 ]]; do
238+
for build in "${!build_jobs[@]}"; do
239+
# Test if each job is complete and if so, notify and remove from the array
240+
if [[ -n "${build_ids[${build}]+0}" ]]; then
241+
if ! ps -p "${build_ids[${build}]}" > /dev/null; then
242+
wait "${build_ids[${build}]}"
243+
build_stat=$?
244+
errs=$((errs+build_stat))
245+
if [[ ${build_stat} == 0 ]]; then
246+
echo "build_${build}.sh completed successfully!"
247+
else
248+
echo "build_${build}.sh failed with status ${build_stat}!"
249+
fi
250+
251+
# Remove the completed build from the list of PIDs
252+
unset 'build_ids[${build}]'
253+
unset 'build_jobs[${build}]'
254+
fi
255+
fi
256+
done
257+
258+
sleep 5s
259+
done
260+
268261
#------------------------------------
269262
# Exception Handling
270263
#------------------------------------
271-
if (( err != 0 )); then
264+
if (( errs != 0 )); then
272265
cat << EOF
273266
BUILD ERROR: One or more components failed to build
274267
Check the associated build log(s) for details.
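The new loop in build_all.sh polls with `ps -p` and `sleep 5s` because it also has to throttle how many cores are in use. When no throttling is needed, the launch-and-collect part of the pattern can be shown more compactly. This is a minimal sketch of an alternative, not the commit's method: `run_build` is a hypothetical stand-in for the `./build_<name>.sh` scripts, and the build names `good` and `bad` are made up for the example.

```shell
#!/usr/bin/env bash
# Illustrative sketch only. run_build is a hypothetical stand-in for
# ./build_<name>.sh; "bad" simulates a component that fails to build.

run_build() {
  sleep 1
  [[ "$1" != "bad" ]]
}

# Launch every build in the background and remember its PID
declare -A build_ids
for build in good bad; do
  run_build "${build}" &
  build_ids["${build}"]=$!
done

# Collect the exit status of each build
errs=0
failed=""
for build in "${!build_ids[@]}"; do
  # wait <pid> returns that job's exit status, even if it already finished
  if wait "${build_ids[${build}]}"; then
    echo "build_${build} completed successfully"
  else
    errs=$(( errs + 1 ))
    failed+="${build} "
    echo "build_${build} failed"
  fi
done

echo "failures: ${errs}"
```

Counting failures, rather than summing raw exit statuses, keeps the final check simple while still recording which components failed.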

sorc/build_gdas.sh

+2-1
@@ -2,9 +2,10 @@
22
set -eux
33

44
OPTIND=1
5-
while getopts ":dov" option; do
5+
while getopts ":j:dv" option; do
66
case "${option}" in
77
d) export BUILD_TYPE="DEBUG";;
8+
j) export BUILD_JOBS=${OPTARG};;
89
v) export BUILD_VERBOSE="YES";;
910
:)
1011
echo "[${BASH_SOURCE[0]}]: ${option} requires an argument"
