Batch Scheduler Support¶
Slurm¶
buildtest can submit jobs to Slurm assuming you have Slurm executors defined
in your configuration file. The SlurmExecutor class is responsible for managing
Slurm jobs and will perform the following actions:

- Check for the Slurm binaries sbatch and sacct.
- Dispatch the job and acquire the job ID using sacct.
- Poll all Slurm jobs until they have finished.
- Gather job results once the job is complete via sacct.
buildtest will dispatch Slurm jobs and poll all jobs until they are complete.
If a job is in PENDING or RUNNING state, buildtest will keep polling at a set
interval defined by the pollinterval setting in buildtest. Once a job is no
longer in PENDING or RUNNING state, buildtest will gather the job results
and wait until all jobs have finished.
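The poll interval is a setting in your buildtest configuration file; shown below is a minimal sketch of how it might be set, assuming the pollinterval setting lives under the executor defaults (the value shown is illustrative).

executors:
  defaults:
    # poll scheduler jobs every 30 seconds (illustrative value)
    pollinterval: 30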
In this example we have a Slurm executor cori.slurm.knl_debug; in addition,
we can specify #SBATCH directives using the sbatch field. The sbatch field
is a list of strings, and buildtest will insert an #SBATCH directive in front
of each value. Shown below is an example buildspec.
version: "1.0"
buildspecs:
  slurm_metadata:
    description: Get metadata from compute node when submitting job
    type: script
    executor: cori.slurm.knl_debug
    tags: [jobs]
    sbatch:
      - "-t 00:05"
      - "-N 1"
    run: |
      export SLURM_JOB_NAME="firstjob"
      echo "jobname:" $SLURM_JOB_NAME
      echo "slurmdb host:" $SLURMD_NODENAME
      echo "pid:" $SLURM_TASK_PID
      echo "submit host:" $SLURM_SUBMIT_HOST
      echo "nodeid:" $SLURM_NODEID
      echo "partition:" $SLURM_JOB_PARTITION
buildtest will add the #SBATCH directives at the top of the script, followed by
the content of the run section. Shown below is the example test content. Every
Slurm test will insert #SBATCH --job-name, #SBATCH --output and #SBATCH --error
lines, which are determined by the name of the test.
#!/bin/bash
#SBATCH -t 00:05
#SBATCH -N 1
#SBATCH --job-name=slurm_metadata
#SBATCH --output=slurm_metadata.out
#SBATCH --error=slurm_metadata.err
export SLURM_JOB_NAME="firstjob"
echo "jobname:" $SLURM_JOB_NAME
echo "slurmdb host:" $SLURMD_NODENAME
echo "pid:" $SLURM_TASK_PID
echo "submit host:" $SLURM_SUBMIT_HOST
echo "nodeid:" $SLURM_NODEID
echo "partition:" $SLURM_JOB_PARTITION
The cori.slurm.knl_debug
executor in our configuration file is defined as follows
system:
  cori:
    executors:
      slurm:
        knl_debug:
          qos: debug
          cluster: cori
          options:
            - -C knl,quad,cache
          description: debug queue on KNL partition
With this setting, any buildspec test that uses the cori.slurm.knl_debug executor
will result in the following launch option: sbatch --qos debug --clusters=cori -C knl,quad,cache </path/to/script.sh>.
Unlike the LocalExecutor, the Run Stage will dispatch the Slurm job and poll
until the job is completed. Once the job is complete, it will gather the results and terminate.
In the Run Stage, buildtest will mark the test status as N/A because the job is submitted
to the scheduler and pending in the queue. In order to get the job result, we need to wait
until the job is complete; then we gather results and determine the test state. buildtest
keeps track of all buildspecs, test scripts to be run, and their results. A test
using the LocalExecutor will run in the Run Stage, where the return code is retrieved
and the status can be calculated immediately. For Slurm jobs, buildtest dispatches
the job and processes the next job. buildtest will show the output of all tests after
the Polling Stage along with their results. A Slurm job with exit code 0 will
be marked with status PASS.
Shown below is an example build for this test
$ buildtest build -b buildspecs/jobs/metadata.yml
User: siddiq90
Hostname: cori02
Platform: Linux
Current Time: 2021/06/11 09:24:44
buildtest path: /global/homes/s/siddiq90/github/buildtest/bin/buildtest
buildtest version: 0.9.5
python path: /global/homes/s/siddiq90/.conda/envs/buildtest/bin/python
python version: 3.8.8
Test Directory: /global/u1/s/siddiq90/github/buildtest/var/tests
Configuration File: /global/u1/s/siddiq90/.buildtest/config.yml
Command: /global/homes/s/siddiq90/github/buildtest/bin/buildtest build -b buildspecs/jobs/metadata.yml
+-------------------------------+
| Stage: Discovering Buildspecs |
+-------------------------------+
+--------------------------------------------------------------------------+
| Discovered Buildspecs |
+==========================================================================+
| /global/u1/s/siddiq90/github/buildtest-cori/buildspecs/jobs/metadata.yml |
+--------------------------------------------------------------------------+
Discovered Buildspecs: 1
Excluded Buildspecs: 0
Detected Buildspecs after exclusion: 1
+---------------------------+
| Stage: Parsing Buildspecs |
+---------------------------+
schemafile | validstate | buildspec
-------------------------+--------------+--------------------------------------------------------------------------
script-v1.0.schema.json | True | /global/u1/s/siddiq90/github/buildtest-cori/buildspecs/jobs/metadata.yml
name description
-------------- --------------------------------------------------
slurm_metadata Get metadata from compute node when submitting job
+----------------------+
| Stage: Building Test |
+----------------------+
name | id | type | executor | tags | testpath
----------------+----------+--------+----------------------+----------+-------------------------------------------------------------------------------------------------------------------------
slurm_metadata | 722b3291 | script | cori.slurm.knl_debug | ['jobs'] | /global/u1/s/siddiq90/github/buildtest/var/tests/cori.slurm.knl_debug/metadata/slurm_metadata/0/slurm_metadata_build.sh
+---------------------+
| Stage: Running Test |
+---------------------+
[slurm_metadata] JobID: 43308838 dispatched to scheduler
name | id | executor | status | returncode
----------------+----------+----------------------+----------+--------------
slurm_metadata | 722b3291 | cori.slurm.knl_debug | N/A | -1
Polling Jobs in 30 seconds
________________________________________
Job Queue: [43308838]
Pending Jobs
________________________________________
+----------------+----------------------+----------+-----------+
| name | executor | jobID | jobstate |
+----------------+----------------------+----------+-----------+
| slurm_metadata | cori.slurm.knl_debug | 43308838 | COMPLETED |
+----------------+----------------------+----------+-----------+
Polling Jobs in 30 seconds
________________________________________
Job Queue: []
Completed Jobs
________________________________________
+----------------+----------------------+----------+-----------+
| name | executor | jobID | jobstate |
+----------------+----------------------+----------+-----------+
| slurm_metadata | cori.slurm.knl_debug | 43308838 | COMPLETED |
+----------------+----------------------+----------+-----------+
+---------------------------------------------+
| Stage: Final Results after Polling all Jobs |
+---------------------------------------------+
name | id | executor | status | returncode
----------------+----------+----------------------+----------+--------------
slurm_metadata | 722b3291 | cori.slurm.knl_debug | PASS | 0
+----------------------+
| Stage: Test Summary |
+----------------------+
Passed Tests: 1/1 Percentage: 100.000%
Failed Tests: 0/1 Percentage: 0.000%
Writing Logfile to: /tmp/buildtest_u8ehladd.log
A copy of logfile can be found at $BUILDTEST_ROOT/buildtest.log - /global/homes/s/siddiq90/github/buildtest/buildtest.log
The SlurmExecutor class is responsible for processing Slurm jobs, which may include:
dispatch, poll, gather, or cancel job. The SlurmExecutor will gather job metrics
via sacct using the following format fields:
Account, AllocNodes, AllocTRES, ConsumedEnergyRaw, CPUTimeRaw, Elapsed,
End, ExitCode, JobID, JobName, NCPUS, NNodes, QOS, ReqGRES,
ReqMem, ReqNodes, ReqTRES, Start, State, Submit, UID, User, WorkDir.
For a complete list of format fields see sacct -e. For now, we support only
these fields of interest for reporting purposes.
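For reference, a representative sacct invocation with these format fields might look like the following; the job ID is a placeholder and the exact flags buildtest uses can be confirmed in the buildtest.log file.

sacct -j <jobID> -X -n -P -o Account,AllocNodes,AllocTRES,ConsumedEnergyRaw,CPUTimeRaw,Elapsed,End,ExitCode,JobID,JobName,NCPUS,NNodes,QOS,ReqGRES,ReqMem,ReqNodes,ReqTRES,Start,State,Submit,UID,User,WorkDir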
buildtest can check status based on the Slurm job state, which is defined by the State
field in sacct. In the next example, we introduce the field slurm_job_state, which
is part of the status field. This field expects one of the following values: [COMPLETED, FAILED, OUT_OF_MEMORY, TIMEOUT].
This example simulates a failed job by requesting a wall time that is too short and expecting a job
state of TIMEOUT.
version: "1.0"
buildspecs:
  wall_timeout:
    type: script
    executor: cori.slurm.debug
    sbatch: [ "-t 2", "-C haswell", "-n 1"]
    run: sleep 300
    status:
      slurm_job_state: "TIMEOUT"
If we run this test, buildtest will mark this test as PASS because the Slurm job
state matches the expected result defined by the field slurm_job_state. This job will
be in TIMEOUT state because we requested 2 min while the job sleeps for 300 sec (5 min).
Completed Jobs
________________________________________
+--------------+--------------------------+----------+----------+
| name | executor | jobID | jobstate |
+--------------+--------------------------+----------+----------+
| wall_timeout | cori.slurm.haswell_debug | 43309265 | TIMEOUT |
+--------------+--------------------------+----------+----------+
+---------------------------------------------+
| Stage: Final Results after Polling all Jobs |
+---------------------------------------------+
name | id | executor | status | returncode
--------------+----------+--------------------------+----------+--------------
wall_timeout | 3b43850c | cori.slurm.haswell_debug | PASS | 0
+----------------------+
| Stage: Test Summary |
+----------------------+
Passed Tests: 1/1 Percentage: 100.000%
Failed Tests: 0/1 Percentage: 0.000%
Writing Logfile to: /tmp/buildtest_k6h246yx.log
A copy of logfile can be found at $BUILDTEST_ROOT/buildtest.log - /global/homes/s/siddiq90/github/buildtest/buildtest.log
If you examine the logfile buildtest.log you will see an entry for the sacct
command run to gather results, followed by the list of fields and their values:
2021-06-11 09:52:27,826 [slurm.py:292 - poll() ] - [DEBUG] Querying JobID: '43309265' Job State by running: 'sacct -j 43309265 -o State -n -X -P --clusters=cori'
2021-06-11 09:52:27,826 [slurm.py:296 - poll() ] - [DEBUG] JobID: '43309265' job state:TIMEOUT
LSF¶
buildtest can support job submission to IBM Spectrum LSF if you have defined LSF executors in your configuration file.
The bsub property can be used to specify #BSUB directives in the job script. This example
will use the ascent.lsf.batch executor that was defined in the buildtest configuration.
version: "1.0"
buildspecs:
  hostname:
    type: script
    executor: ascent.lsf.batch
    bsub: [ "-W 10", "-nnodes 1"]

    run: jsrun hostname
The LSFExecutor polls jobs and retrieves the job state using
bjobs -noheader -o 'stat' <JOBID>. The LSFExecutor will poll
jobs so long as they are in PEND or RUN state. Once a job is no longer in
either of these two states, the LSFExecutor will gather the job results. buildtest will retrieve
the following format fields using bjobs: job_name, stat, user, user_group, queue, proj_name,
pids, exit_code, from_host, exec_host, submit_time, start_time,
finish_time, nthreads, exec_home, exec_cwd, output_file, error_file to
get the job record.
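Putting those fields together, a representative bjobs query might look like the following, where the job ID is a placeholder.

bjobs -noheader -o 'job_name stat user user_group queue proj_name pids exit_code from_host exec_host submit_time start_time finish_time nthreads exec_home exec_cwd output_file error_file' <JOBID>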
PBS¶
buildtest supports job submission to the PBS Pro or OpenPBS
scheduler. Assuming you have configured PBS executors in your configuration file, you can submit jobs
to the PBS scheduler by selecting the appropriate PBS executor via the executor
property in the buildspec. The #PBS directives can be specified using the pbs
field, which is a list of PBS options that get inserted at the top of the script. Shown
below is an example buildspec using the script schema.
version: "1.0"
buildspecs:
pbs_sleep:
type: script
executor: generic.pbs.workq
pbs: ["-l nodes=1", "-l walltime=00:02:00"]
run: sleep 10
buildtest will poll PBS jobs using qstat -x -f -F json <jobID> until the job is finished. Note that
we use the -x option to retrieve finished jobs, which is required in order for buildtest to detect the job
state upon completion. Please see PBS Limitation to ensure your PBS cluster supports job history.
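As an illustration, the finished job record can be queried and the job state and exit status pulled out of the JSON payload; the job ID and the use of python here are only for demonstration.

qstat -x -f -F json 40.pbs | python3 -c "
import json, sys
job = json.load(sys.stdin)['Jobs']['40.pbs']
# job_state 'F' indicates a finished job; Exit_status holds the return code
print(job['job_state'], job.get('Exit_status'))
"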
Shown below is an example build of the buildspec using the PBS scheduler.
[pbsuser@pbs buildtest]$ buildtest build -b general_tests/sched/pbs/hostname.yml
+-------------------------------+
| Stage: Discovering Buildspecs |
+-------------------------------+
Discovered Buildspecs:
/tmp/Documents/buildtest/general_tests/sched/pbs/hostname.yml
+---------------------------+
| Stage: Parsing Buildspecs |
+---------------------------+
schemafile | validstate | buildspec
-------------------------+--------------+---------------------------------------------------------------
script-v1.0.schema.json | True | /tmp/Documents/buildtest/general_tests/sched/pbs/hostname.yml
+----------------------+
| Stage: Building Test |
+----------------------+
name | id | type | executor | tags | testpath
-----------+----------+--------+-------------------+--------+---------------------------------------------------------------------------------------------
pbs_sleep | 2adfc3c1 | script | generic.pbs.workq | | /tmp/Documents/buildtest/var/tests/generic.pbs.workq/hostname/pbs_sleep/3/stage/generate.sh
+----------------------+
| Stage: Running Test |
+----------------------+
[pbs_sleep] JobID: 40.pbs dispatched to scheduler
name | id | executor | status | returncode | testpath
-----------+----------+-------------------+----------+--------------+---------------------------------------------------------------------------------------------
pbs_sleep | 2adfc3c1 | generic.pbs.workq | N/A | -1 | /tmp/Documents/buildtest/var/tests/generic.pbs.workq/hostname/pbs_sleep/3/stage/generate.sh
Polling Jobs in 10 seconds
________________________________________
Job Queue: ['40.pbs']
Completed Jobs
________________________________________
╒════════╤════════════╤═════════╤════════════╕
│ name │ executor │ jobID │ jobstate │
╞════════╪════════════╪═════════╪════════════╡
╘════════╧════════════╧═════════╧════════════╛
Pending Jobs
________________________________________
╒═══════════╤═══════════════════╤═════════╤════════════╕
│ name │ executor │ jobID │ jobstate │
╞═══════════╪═══════════════════╪═════════╪════════════╡
│ pbs_sleep │ generic.pbs.workq │ 40.pbs │ R │
╘═══════════╧═══════════════════╧═════════╧════════════╛
Polling Jobs in 10 seconds
________________________________________
Job Queue: ['40.pbs']
Completed Jobs
________________________________________
╒════════╤════════════╤═════════╤════════════╕
│ name │ executor │ jobID │ jobstate │
╞════════╪════════════╪═════════╪════════════╡
╘════════╧════════════╧═════════╧════════════╛
Pending Jobs
________________________________________
╒═══════════╤═══════════════════╤═════════╤════════════╕
│ name │ executor │ jobID │ jobstate │
╞═══════════╪═══════════════════╪═════════╪════════════╡
│ pbs_sleep │ generic.pbs.workq │ 40.pbs │ F │
╘═══════════╧═══════════════════╧═════════╧════════════╛
Polling Jobs in 10 seconds
________________________________________
Job Queue: []
Completed Jobs
________________________________________
╒═══════════╤═══════════════════╤═════════╤════════════╕
│ name │ executor │ jobID │ jobstate │
╞═══════════╪═══════════════════╪═════════╪════════════╡
│ pbs_sleep │ generic.pbs.workq │ 40.pbs │ F │
╘═══════════╧═══════════════════╧═════════╧════════════╛
Pending Jobs
________________________________________
╒════════╤════════════╤═════════╤════════════╕
│ name │ executor │ jobID │ jobstate │
╞════════╪════════════╪═════════╪════════════╡
╘════════╧════════════╧═════════╧════════════╛
+---------------------------------------------+
| Stage: Final Results after Polling all Jobs |
+---------------------------------------------+
name | id | executor | status | returncode | testpath
-----------+----------+-------------------+----------+--------------+---------------------------------------------------------------------------------------------
pbs_sleep | 2adfc3c1 | generic.pbs.workq | PASS | 0 | /tmp/Documents/buildtest/var/tests/generic.pbs.workq/hostname/pbs_sleep/3/stage/generate.sh
+----------------------+
| Stage: Test Summary |
+----------------------+
Executed 1 tests
Passed Tests: 1/1 Percentage: 100.000%
Failed Tests: 0/1 Percentage: 0.000%
Writing Logfile to: /tmp/buildtest_mu285m58.log
buildtest will preserve the job record from qstat -x -f -F json <jobID> in the test report if the job was complete.
If we take a look at the test result using buildtest inspect, you will see the job section is
prepopulated from the JSON record provided by qstat.
[pbsuser@pbs buildtest]$ buildtest inspect 2adfc3c1
{
  "id": "2adfc3c1",
  "full_id": "2adfc3c1-1c81-43d0-a151-6fa1a9818eb4",
  "testroot": "/tmp/Documents/buildtest/var/tests/generic.pbs.workq/hostname/pbs_sleep/3",
  "testpath": "/tmp/Documents/buildtest/var/tests/generic.pbs.workq/hostname/pbs_sleep/3/stage/generate.sh",
  "stagedir": "/tmp/Documents/buildtest/var/tests/generic.pbs.workq/hostname/pbs_sleep/3/stage",
  "rundir": "/tmp/Documents/buildtest/var/tests/generic.pbs.workq/hostname/pbs_sleep/3/run",
  "command": "qsub -q workq /tmp/Documents/buildtest/var/tests/generic.pbs.workq/hostname/pbs_sleep/3/stage/generate.sh",
  "outfile": "/tmp/Documents/buildtest/var/tests/generic.pbs.workq/hostname/pbs_sleep/3/stage/pbs_sleep.o40",
  "errfile": "/tmp/Documents/buildtest/var/tests/generic.pbs.workq/hostname/pbs_sleep/3/stage/pbs_sleep.e40",
  "schemafile": "script-v1.0.schema.json",
  "executor": "generic.pbs.workq",
  "tags": "",
  "starttime": "Wed Mar 17 20:36:48 2021",
  "endtime": "Wed Mar 17 20:36:48 2021",
  "runtime": "00:00:10",
  "state": "PASS",
  "returncode": 0,
  "output": "",
  "error": "",
  "job": {
    "timestamp": 1616013438,
    "pbs_version": "19.0.0",
    "pbs_server": "pbs",
    "Jobs": {
      "40.pbs": {
        "Job_Name": "pbs_sleep",
        "Job_Owner": "pbsuser@pbs",
        "resources_used": {
          "cpupercent": 0,
          "cput": "00:00:00",
          "mem": "5620kb",
          "ncpus": 1,
          "vmem": "25632kb",
          "walltime": "00:00:10"
        },
        "job_state": "F",
        "queue": "workq",
        "server": "pbs",
        "Checkpoint": "u",
        "ctime": "Wed Mar 17 20:36:48 2021",
        "Error_Path": "pbs:/tmp/Documents/buildtest/var/tests/generic.pbs.workq/hostname/pbs_sleep/3/stage/pbs_sleep.e40",
        "exec_host": "pbs/0",
        "exec_vnode": "(pbs:ncpus=1)",
        "Hold_Types": "n",
        "Join_Path": "n",
        "Keep_Files": "n",
        "Mail_Points": "a",
        "mtime": "Wed Mar 17 20:36:58 2021",
        "Output_Path": "pbs:/tmp/Documents/buildtest/var/tests/generic.pbs.workq/hostname/pbs_sleep/3/stage/pbs_sleep.o40",
        "Priority": 0,
        "qtime": "Wed Mar 17 20:36:48 2021",
        "Rerunable": "True",
        "Resource_List": {
          "ncpus": 1,
          "nodect": 1,
          "nodes": 1,
          "place": "scatter",
          "select": "1:ncpus=1",
          "walltime": "00:02:00"
        },
        "stime": "Wed Mar 17 20:36:48 2021",
        "session_id": 7154,
        "jobdir": "/home/pbsuser",
        "substate": 92,
        "Variable_List": {
          "PBS_O_HOME": "/home/pbsuser",
          "PBS_O_LANG": "en_US.utf8",
          "PBS_O_LOGNAME": "pbsuser",
          "PBS_O_PATH": "/tmp/Documents/buildtest/bin:/tmp/Documents/github/buildtest/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/pbs/bin:/home/pbsuser/.local/bin:/home/pbsuser/bin",
          "PBS_O_MAIL": "/var/spool/mail/pbsuser",
          "PBS_O_SHELL": "/bin/bash",
          "PBS_O_WORKDIR": "/tmp/Documents/buildtest/var/tests/generic.pbs.workq/hostname/pbs_sleep/3/stage",
          "PBS_O_SYSTEM": "Linux",
          "PBS_O_QUEUE": "workq",
          "PBS_O_HOST": "pbs"
        },
        "comment": "Job run at Wed Mar 17 at 20:36 on (pbs:ncpus=1) and finished",
        "etime": "Wed Mar 17 20:36:48 2021",
        "run_count": 1,
        "Stageout_status": 1,
        "Exit_status": 0,
        "Submit_arguments": "-q workq /tmp/Documents/buildtest/var/tests/generic.pbs.workq/hostname/pbs_sleep/3/stage/generate.sh",
        "history_timestamp": 1616013418,
        "project": "_pbs_project_default"
      }
    }
  }
}

Output File
______________________________

Error File
______________________________

Test Content
______________________________
#!/bin/bash
#PBS -l nodes=1
#PBS -l walltime=00:02:00
#PBS -N pbs_sleep
source /tmp/Documents/buildtest/var/executors/generic.pbs.workq/before_script.sh
sleep 10
source /tmp/Documents/buildtest/var/executors/generic.pbs.workq/after_script.sh

buildspec: /tmp/Documents/buildtest/general_tests/sched/pbs/hostname.yml
______________________________
version: "1.0"
buildspecs:
  pbs_sleep:
    type: script
    executor: generic.pbs.workq
    pbs: ["-l nodes=1", "-l walltime=00:02:00"]
    run: sleep 10
You can use the batch property to define scheduler configuration that is translated into #PBS
directives. To learn more about the batch property see Scheduler Agnostic Configuration.
In this example we show how one can use the batch property with the PBS executor instead of the
pbs property. You may specify both the batch and pbs properties to define PBS directives. This
example will allocate 1 node, 1 cpu, 500mb memory with a 2 min time limit and send an email notification.
version: "1.0"
buildspecs:
pbs_sleep:
type: script
executor: generic.pbs.workq
batch:
nodecount: "1"
cpucount: "1"
memory: "500mb"
email-address: "shahzebmsiddiqui@gmail.com"
timelimit: "00:02:00"
run: sleep 15
buildtest will translate the batch property into #PBS directives if there is an
equivalent option. Shown below is the generated test using the batch property.
#!/bin/bash
#PBS -l nodes=1
#PBS -l ncpus=1
#PBS -l mem=500mb
#PBS -WMail_Users=shahzebmsiddiqui@gmail.com
#PBS -l walltime=00:02:00
#PBS -N pbs_sleep
source /tmp/Documents/buildtest/var/executors/generic.pbs.workq/before_script.sh
sleep 15
source /tmp/Documents/buildtest/var/executors/generic.pbs.workq/after_script.sh
Cobalt¶
Cobalt is a job scheduler developed by Argonne National Laboratory that runs on compute
resources and the IBM BlueGene series. Cobalt resembles PBS in terms of command line
interface, such as qsub and qacct, however they differ slightly in their behavior.
Cobalt support has been tested on the JLSE and Theta systems. Cobalt directives are
specified using #COBALT; these can be specified using the cobalt property, which accepts
a list of strings. Shown below is an example using the cobalt property.
version: "1.0"
buildspecs:
  yarrow_hostname:
    executor: jlse.cobalt.yarrow
    type: script
    cobalt: ["-n 1", "--proccount 1", "-t 10"]
    run: hostname
In this example, we allocate 1 node with 1 processor for 10 min. This is translated into the following job script.
#!/usr/bin/bash
#COBALT -n 1
#COBALT --proccount 1
#COBALT -t 10
#COBALT --jobname yarrow_hostname
source /home/shahzebsiddiqui/buildtest/var/executors/cobalt.yarrow/before_script.sh
hostname
source /home/shahzebsiddiqui/buildtest/var/executors/cobalt.yarrow/after_script.sh
Let’s run this test and notice the job states.
$ buildtest build -b yarrow_hostname.yml
+-------------------------------+
| Stage: Discovering Buildspecs |
+-------------------------------+
Discovered Buildspecs:
/home/shahzebsiddiqui/jlse_tests/yarrow_hostname.yml
+---------------------------+
| Stage: Parsing Buildspecs |
+---------------------------+
schemafile | validstate | buildspec
-------------------------+--------------+------------------------------------------------------
script-v1.0.schema.json | True | /home/shahzebsiddiqui/jlse_tests/yarrow_hostname.yml
+----------------------+
| Stage: Building Test |
+----------------------+
name | id | type | executor | tags | testpath
-----------------+----------+--------+---------------+--------+-------------------------------------------------------------------------------------------------------------
yarrow_hostname | f86b93f6 | script | cobalt.yarrow | | /home/shahzebsiddiqui/buildtest/var/tests/cobalt.yarrow/yarrow_hostname/yarrow_hostname/3/stage/generate.sh
+----------------------+
| Stage: Running Test |
+----------------------+
[yarrow_hostname] JobID: 284752 dispatched to scheduler
name | id | executor | status | returncode | testpath
-----------------+----------+---------------+----------+--------------+-------------------------------------------------------------------------------------------------------------
yarrow_hostname | f86b93f6 | cobalt.yarrow | N/A | -1 | /home/shahzebsiddiqui/buildtest/var/tests/cobalt.yarrow/yarrow_hostname/yarrow_hostname/3/stage/generate.sh
Polling Jobs in 10 seconds
________________________________________
builder: yarrow_hostname in None
[yarrow_hostname]: JobID 284752 in starting state
Polling Jobs in 10 seconds
________________________________________
builder: yarrow_hostname in starting
[yarrow_hostname]: JobID 284752 in starting state
Polling Jobs in 10 seconds
________________________________________
builder: yarrow_hostname in starting
[yarrow_hostname]: JobID 284752 in running state
Polling Jobs in 10 seconds
________________________________________
builder: yarrow_hostname in running
[yarrow_hostname]: JobID 284752 in exiting state
Polling Jobs in 10 seconds
________________________________________
builder: yarrow_hostname in done
+---------------------------------------------+
| Stage: Final Results after Polling all Jobs |
+---------------------------------------------+
name | id | executor | status | returncode | testpath
-----------------+----------+---------------+----------+--------------+-------------------------------------------------------------------------------------------------------------
yarrow_hostname | f86b93f6 | cobalt.yarrow | PASS | 0 | /home/shahzebsiddiqui/buildtest/var/tests/cobalt.yarrow/yarrow_hostname/yarrow_hostname/3/stage/generate.sh
+----------------------+
| Stage: Test Summary |
+----------------------+
Executed 1 tests
Passed Tests: 1/1 Percentage: 100.000%
Failed Tests: 0/1 Percentage: 0.000%
When a job starts, Cobalt will write a cobalt log file <JOBID>.cobaltlog, which
is provided by the scheduler for troubleshooting. The output and error files are generated
once the job finishes. A Cobalt job progresses through the job states starting –> pending
–> running –> exiting.
buildtest will capture Cobalt job details using qstat -lf <JOBID> and this
is updated in the report file.
buildtest will poll the job at a set interval, where we run qstat --header State <JobID> to
check the state of the job; if the job is finished then we gather results. Once a job is finished,
qstat will not be able to poll the job; this causes an issue where buildtest can't poll
the job since qstat will not return anything. This is a transient issue depending on when
you poll the job; generally at ALCF, qstat will not report an existing job within 30 sec after
the job is terminated. buildtest will assume the job is complete if it's able to poll the job
and it is in the exiting stage; if it's unable to retrieve this state, we check for the
output and error files. If the files exist, we assume the job is complete and buildtest will
gather the results.
buildtest will determine the exit code by parsing the cobalt log file; the file contains a line such as
Thu Nov 05 17:29:30 2020 +0000 (UTC) Info: task completed normally with an exit code of 0; initiating job cleanup and removal
qstat has no job record for capturing the return code, so buildtest must rely on the cobalt log file.
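For example, the relevant line can be located in the cobalt log with a simple grep, where the job ID is a placeholder:

grep "exit code" <JOBID>.cobaltlog
Thu Nov 05 17:29:30 2020 +0000 (UTC) Info: task completed normally with an exit code of 0; initiating job cleanup and removal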
Scheduler Agnostic Configuration¶
The batch field can be used for specifying scheduler agnostic configuration.
buildtest will translate the input into the appropriate script directive
supported by the scheduler. Shown below is a translation table for the batch
field, where N/A indicates the field is not supported for that scheduler.
| Field | Slurm | LSF | PBS | Cobalt |
|---|---|---|---|---|
| account | --account | -P | -P | --project |
| begin | --begin | -b | N/A | N/A |
| cpucount | --ntasks | -n | -l ncpus | --proccount |
| email-address | --mail-user | -u | -WMail_Users | --notify |
| exclusive | --exclusive=user | -x | N/A | N/A |
| memory | --mem | -M | -l mem | N/A |
| network | --network | -network | N/A | N/A |
| nodecount | --nodes | -nnodes | -l nodes | --nodecount |
| qos | --qos | N/A | N/A | N/A |
| queue | --partition | -q | -q | --queue |
| tasks-per-core | --ntasks-per-core | N/A | N/A | N/A |
| tasks-per-node | --ntasks-per-node | N/A | N/A | N/A |
| tasks-per-socket | --ntasks-per-socket | N/A | N/A | N/A |
| timelimit | --time | -W | -l walltime | --time |
In this example, we rewrite the LSF buildspec to use the batch field instead of the bsub field.
version: "1.0"
buildspecs:
  hostname:
    type: script
    executor: ascent.lsf.batch
    batch:
      timelimit: "10"
      nodecount: "1"
    run: jsrun hostname
buildtest will translate the batch field into #BSUB directives as you can see in
the generated test. buildtest will automatically name the job based on the test name,
therefore you will see that buildtest inserts the #BSUB -J, #BSUB -o and #BSUB -e
directives in the test.
#!/usr/bin/bash
#BSUB -W 10
#BSUB -nnodes 1
#BSUB -J hostname
#BSUB -o hostname.out
#BSUB -e hostname.err
jsrun hostname
In the next example we use the batch field on a Slurm cluster to submit a sleep
job as follows.
version: "1.0"
buildspecs:
  sleep:
    type: script
    executor: cori.slurm.knl_debug
    description: sleep 2 seconds
    tags: [tutorials]
    batch:
      nodecount: "1"
      cpucount: "1"
      timelimit: "5"
      memory: "5MB"
      exclusive: true

    vars:
      SLEEP_TIME: 2
    run: sleep $SLEEP_TIME
The exclusive field is used for getting exclusive node access; this is a boolean
instead of a string. You can instruct buildtest to stop after the build phase by using
--stage=build, which will build the script but not run it (an example invocation is
shown after the generated script). If we inspect the generated script we see the following.
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=5
#SBATCH --mem=5MB
#SBATCH --exclusive=user
SLEEP_TIME=2
sleep $SLEEP_TIME
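For instance, to generate the script above without submitting the job to Slurm, one could run the following; the buildspec path is hypothetical.

buildtest build -b buildspecs/sleep.yml --stage=build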
The batch property can also translate some fields into #COBALT directives. buildtest
will support fields that are applicable to the scheduler. Shown below is an example
with 1 node and a 10 min time limit that runs hostname using the executor jlse.cobalt.iris.
version: "1.0"
buildspecs:
iris_hostname:
executor: jlse.cobalt.iris
type: script
batch:
nodecount: "1"
timelimit: "10"
run: hostname
If we build the buildspec and inspect the test script we see the following.
#!/usr/bin/bash
#COBALT --nodecount 1
#COBALT --time 10
#COBALT --jobname iris_hostname
hostname
The first two lines, #COBALT --nodecount 1 and #COBALT --time 10, are translated
based on input from the batch field. buildtest will automatically add #COBALT --jobname
based on the name of the test.
You may combine the batch property with the sbatch, bsub, or cobalt field to specify
your job directives. If a particular option is not available in the batch property,
then use the sbatch, bsub, or cobalt field to fill in the rest of the arguments, as
shown in the sketch below.
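As an illustrative sketch, a buildspec could use batch for the common settings and fall back to sbatch for options with no batch equivalent, such as a Slurm constraint; the test name shown here is hypothetical.

version: "1.0"
buildspecs:
  combined_directives:
    type: script
    executor: cori.slurm.knl_debug
    batch:
      nodecount: "1"
      timelimit: "10"
    sbatch: ["-C knl"]
    run: hostname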
Jobs exceeds max_pend_time¶
Recall from Configuring buildtest that max_pend_time will cancel jobs that exceed the maximum pending time. buildtest will start a timer for each job right after job submission and keep track of the duration; if a job is in pending state and exceeds max_pend_time, the job will be cancelled.
We can also override the max_pend_time configuration via the command line option --max-pend-time.
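A minimal sketch of where this setting might live in the configuration file is shown below, assuming max_pend_time sits under the executor defaults; the value is illustrative.

executors:
  defaults:
    # cancel jobs that remain pending longer than 90 seconds (illustrative value)
    max_pend_time: 90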
To demonstrate, here is an example where the job was cancelled after it was pending and exceeded max_pend_time.
Note that a cancelled job is not reported in the final output nor updated in the report, hence
it won't be present in the report (buildtest report). In this example, we only
had one test, so upon job cancellation there were no tests to report and
buildtest terminates after the run stage.
$ buildtest build -b buildspecs/queues/shared.yml --max-pend-time 15 --poll-interval 5 -k

User: siddiq90
Hostname: cori08
Platform: Linux
Current Time: 2021/06/11 13:31:46
buildtest path: /global/homes/s/siddiq90/github/buildtest/bin/buildtest
buildtest version: 0.9.5
python path: /global/homes/s/siddiq90/.conda/envs/buildtest/bin/python
python version: 3.8.8
Test Directory: /global/u1/s/siddiq90/github/buildtest/var/tests
Configuration File: /global/u1/s/siddiq90/.buildtest/config.yml
Command: /global/homes/s/siddiq90/github/buildtest/bin/buildtest build -b buildspecs/queues/shared.yml --max-pend-time 15 --poll-interval 5 -k

+-------------------------------+
| Stage: Discovering Buildspecs |
+-------------------------------+

+--------------------------------------------------------------------------+
| Discovered Buildspecs |
+==========================================================================+
| /global/u1/s/siddiq90/github/buildtest-cori/buildspecs/queues/shared.yml |
+--------------------------------------------------------------------------+
Discovered Buildspecs: 1
Excluded Buildspecs: 0
Detected Buildspecs after exclusion: 1

+---------------------------+
| Stage: Parsing Buildspecs |
+---------------------------+

 schemafile | validstate | buildspec
-------------------------+--------------+--------------------------------------------------------------------------
 script-v1.0.schema.json | True | /global/u1/s/siddiq90/github/buildtest-cori/buildspecs/queues/shared.yml

name description
--------------------------- ------------------------------------------
shared_qos_haswell_hostname run hostname through shared qos on Haswell

+----------------------+
| Stage: Building Test |
+----------------------+

 name | id | type | executor | tags | testpath
-----------------------------+----------+--------+---------------------------+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------
 shared_qos_haswell_hostname | 94b2de5d | script | cori.slurm.haswell_shared | ['queues', 'jobs', 'reframe'] | /global/u1/s/siddiq90/github/buildtest/var/tests/cori.slurm.haswell_shared/shared/shared_qos_haswell_hostname/2/shared_qos_haswell_hostname_build.sh

+---------------------+
| Stage: Running Test |
+---------------------+

[shared_qos_haswell_hostname] JobID: 43313766 dispatched to scheduler
 name | id | executor | status | returncode
-----------------------------+----------+---------------------------+----------+--------------
 shared_qos_haswell_hostname | 94b2de5d | cori.slurm.haswell_shared | N/A | -1

Polling Jobs in 5 seconds
________________________________________
Job Queue: [43313766]

Pending Jobs
________________________________________

+-----------------------------+---------------------------+----------+----------+
| name | executor | jobID | jobstate |
+-----------------------------+---------------------------+----------+----------+
| shared_qos_haswell_hostname | cori.slurm.haswell_shared | 43313766 | PENDING |
+-----------------------------+---------------------------+----------+----------+

Polling Jobs in 5 seconds
________________________________________
Job Queue: [43313766]

Pending Jobs
________________________________________

+-----------------------------+---------------------------+----------+----------+
| name | executor | jobID | jobstate |
+-----------------------------+---------------------------+----------+----------+
| shared_qos_haswell_hostname | cori.slurm.haswell_shared | 43313766 | PENDING |
+-----------------------------+---------------------------+----------+----------+

Polling Jobs in 5 seconds
________________________________________
Job Queue: [43313766]

Pending Jobs
________________________________________

+-----------------------------+---------------------------+----------+----------+
| name | executor | jobID | jobstate |
+-----------------------------+---------------------------+----------+----------+
| shared_qos_haswell_hostname | cori.slurm.haswell_shared | 43313766 | PENDING |
+-----------------------------+---------------------------+----------+----------+

Polling Jobs in 5 seconds
________________________________________
Cancelling Job because duration time: 21.177340 sec exceeds max pend time: 15 sec
Job Queue: [43313766]

Pending Jobs
________________________________________

+-----------------------------+---------------------------+----------+-----------+
| name | executor | jobID | jobstate |
+-----------------------------+---------------------------+----------+-----------+
| shared_qos_haswell_hostname | cori.slurm.haswell_shared | 43313766 | CANCELLED |
+-----------------------------+---------------------------+----------+-----------+

Polling Jobs in 5 seconds
________________________________________
Job Queue: []
Cancelled Tests:
shared_qos_haswell_hostname
After polling all jobs we found no valid builders to process
Cray Burst Buffer & Data Warp¶
For Cray systems, you may want to stage-in or stage-out data to and from your burst buffer;
this can be configured using the #DW directive. For a list of Data Warp examples see the
section on DataWarp Job Script Commands.
In buildtest we support the properties BB and DW, which are lists of job directives
that get inserted as #BB and #DW into the test script. To demonstrate, let's start
off with an example where we create a persistent burst buffer named databuffer of size
10GB, striped. We access the burst buffer using the DW directive. Finally, we
cd into the databuffer and write a 5GB random file.
Note
BB and DW directives are generated after the scheduler directives. The #BB directive
comes before #DW. buildtest will automatically add the #BB and #DW directives when
using the properties BB and DW.
version: "1.0"
buildspecs:
  create_burst_buffer:
    type: script
    executor: cori.slurm.debug
    batch:
      nodecount: "1"
      timelimit: "5"
      cpucount: "1"
    sbatch: ["-C knl"]
    description: Create a burst buffer
    tags: [jobs]
    BB:
      - create_persistent name=databuffer capacity=10GB access_mode=striped type=scratch
    DW:
      - persistentdw name=databuffer
    run: |
      cd $DW_PERSISTENT_STRIPED_databuffer
      pwd
      dd if=/dev/urandom of=random.txt bs=1G count=5 iflag=fullblock
      ls -lh $DW_PERSISTENT_STRIPED_databuffer/
If we run this test and inspect the generated test, we will see that the #BB and #DW directives
are inserted after the scheduler directives.
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=5
#SBATCH --ntasks=1
#SBATCH --job-name=create_burst_buffer
#SBATCH --output=create_burst_buffer.out
#SBATCH --error=create_burst_buffer.err
#BB create_persistent name=databuffer capacity=10GB access_mode=striped type=scratch
#DW persistentdw name=databuffer
cd $DW_PERSISTENT_STRIPED_databuffer
pwd
dd if=/dev/urandom of=random.txt bs=1G count=5 iflag=fullblock
ls -lh $DW_PERSISTENT_STRIPED_databuffer
We can confirm there is an active burst buffer by running the following:
$ scontrol show burst | grep databuffer
Name=databuffer CreateTime=2020-10-29T13:06:21 Pool=wlm_pool Size=20624MiB State=allocated UserID=siddiq90(92503)
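As a closing sketch, a stage-in workflow against a job-scoped (scratch) allocation could be expressed with the DW property as shown below; the buildspec name, file path, and sizes are hypothetical.

version: "1.0"
buildspecs:
  datawarp_stage_in:
    type: script
    executor: cori.slurm.debug
    batch:
      nodecount: "1"
      timelimit: "5"
    description: Stage a file into a job burst buffer allocation
    tags: [jobs]
    DW:
      - jobdw capacity=10GB access_mode=striped type=scratch
      - stage_in source=/path/to/input.txt destination=$DW_JOB_STRIPED/input.txt type=file
    run: |
      ls -lh $DW_JOB_STRIPED/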