buildtest.executors.slurm
This module implements the SlurmExecutor class, which is responsible for submitting jobs to the Slurm scheduler. The class is instantiated by the BuildExecutor class when the executors are initialized.
Module Contents
Classes
SlurmExecutor | The SlurmExecutor class is responsible for submitting jobs to the Slurm scheduler.
SlurmJob | This is a base class for holding job-level data and common methods used for batch job submission.
Attributes
logger
- class buildtest.executors.slurm.SlurmExecutor(name, settings, site_configs, max_pend_time=None)
Bases: buildtest.executors.base.BaseExecutor
The SlurmExecutor class is responsible for submitting jobs to the Slurm scheduler. It performs the following steps:
load: load the Slurm configuration from the buildtest configuration file
dispatch: dispatch the job to the scheduler and acquire the job ID
poll: wait for Slurm jobs to finish; if a job is pending and exceeds max_pend_time, cancel it
gather: once the job is complete, gather job data
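The order of these steps can be sketched with a toy driver. This is a minimal illustration, not buildtest's implementation: FakeJob and run_lifecycle are hypothetical names, and the job states are replayed from a scripted list rather than queried from Slurm.

```python
from dataclasses import dataclass


@dataclass
class FakeJob:
    """Hypothetical stand-in for a scheduler job; replays scripted states."""

    states: list
    current: str = "PENDING"

    def poll(self):
        # Advance to the next scripted state, if any remain.
        if self.states:
            self.current = self.states.pop(0)

    def is_finished(self):
        # States the documentation treats as "job is complete".
        return self.current in {"COMPLETED", "FAILED", "OUT_OF_MEMORY", "TIMEOUT"}


def run_lifecycle(job, max_polls=10):
    """Record the steps taken: load, dispatch, repeated poll, then gather."""
    trace = ["load", "dispatch"]
    for _ in range(max_polls):
        job.poll()
        trace.append("poll")
        if job.is_finished():
            trace.append("gather")
            break
    return trace
```

Calling run_lifecycle(FakeJob(["PENDING", "RUNNING", "COMPLETED"])) polls three times before gathering.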
- type = slurm
- dispatch(self, builder)
This method is responsible for dispatching the job to the Slurm scheduler and extracting the job ID. If the job ID is valid, we pass it to the SlurmJob class and store the resulting object in builder.job.
Parameters
builder (BuilderBase, required) – builder object
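As a rough illustration of the "extract and validate job ID" step, the sketch below parses the text printed by sbatch. It assumes the job was submitted with sbatch's --parsable option, which prints the job ID, optionally followed by a semicolon and the cluster name; whether buildtest relies on that option is an assumption here, and parse_sbatch_output is a hypothetical helper.

```python
def parse_sbatch_output(output):
    """Return (job_id, cluster) from `sbatch --parsable` output, or None if invalid."""
    stripped = output.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    # --parsable prints "<jobid>" or "<jobid>;<cluster>".
    job_id, _, cluster = first_line.partition(";")
    if not job_id.isdigit():
        # Anything that is not a plain integer is treated as a failed submission.
        return None
    return int(job_id), (cluster or None)
```

A caller would create the job object only when the function returns a tuple, and treat None as a dispatch failure.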
- gather(self, builder)
Gather Slurm job data after job completion. In this step we call builder.job.gather() and update builder metadata such as the return code and the output and error files.
Parameters
builder (BuilderBase (subclass), required) – instance of BuilderBase
- launcher_command(self)
Return the sbatch launcher command with the options used to submit the job.
- load(self)
Load the Slurm executor configuration from the buildtest settings.
- poll(self, builder)
This method is called during the poll stage, where we invoke builder.job.poll() to get the updated job state. If the job is pending or suspended, we stop the timer and cancel the job if the elapsed time exceeds the max_pend_time value.
Parameters
builder (BuilderBase, required) – builder object
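The cancellation decision described above reduces to a small predicate. The following is a sketch under assumptions: should_cancel and pending_seconds are hypothetical names, and the elapsed time is taken as an input rather than read from buildtest's timer.

```python
def should_cancel(state, pending_seconds, max_pend_time):
    """Return True when a pending or suspended job has waited longer than allowed."""
    if max_pend_time is None:
        # No limit configured (assumed meaning of the None default).
        return False
    return state in {"PENDING", "SUSPENDED"} and pending_seconds > max_pend_time
```

A running job is never cancelled by this check, regardless of how long it has been in the queue.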
- class buildtest.executors.slurm.SlurmJob(jobID, cluster=None)
Bases: buildtest.executors.job.Job
This is a base class for holding job-level data and common methods used for batch job submission.
- cancel(self)
Cancel the job by running scancel <jobid>. If the job is assigned to a Slurm cluster, we cancel it using scancel <jobid> --clusters=<cluster>. This method is called if the job exceeds max_pend_time.
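The two command forms above can be assembled as an argument list. This is a minimal sketch; scancel_command is a hypothetical helper, not the method's actual body.

```python
def scancel_command(job_id, cluster=None):
    """Build the scancel argument list, adding --clusters only when a cluster is set."""
    cmd = ["scancel", str(job_id)]
    if cluster:
        cmd.append(f"--clusters={cluster}")
    return cmd
```

The resulting list can be passed to subprocess.run without a shell, which avoids quoting issues in the job ID or cluster name.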
- complete(self)
This method is used when gathering job results. We assume the job is complete if it is in any of the following states: COMPLETED, FAILED, OUT_OF_MEMORY, TIMEOUT.
- exitcode(self)
Return job exit code
- gather(self)
Gather the job record; this is called after job completion. We use sacct to gather the job record and return it as a dictionary. The command we run is sacct -j <jobid> -X -n -P -o <field1>,<field2>,...,<fieldN>. We retrieve the following format fields from the job record:
“Account”
“AllocNodes”
“AllocTRES”
“ConsumedEnergyRaw”
“CPUTimeRaw”
“Elapsed”
“End”
“ExitCode”
“JobID”
“JobName”
“NCPUS”
“NNodes”
“QOS”
“ReqGRES”
“ReqMem”
“ReqNodes”
“ReqTRES”
“Start”
“State”
“Submit”
“UID”
“User”
“WorkDir”
The sacct output is delimited by the pipe symbol (|) and is parsed into a dict. For example:
$ sacct -j 42909266 -X -n -P -o Account,AllocNodes,AllocTRES,ConsumedEnergyRaw,CPUTimeRaw,Elapsed,End,ExitCode,JobID,JobName,NCPUS,NNodes,QOS,ReqGRES,ReqMem,ReqNodes,ReqTRES,Start,State,Submit,UID,User,WorkDir --clusters=cori
nstaff|1|billing=272,cpu=272,energy=262,mem=87G,node=1|262|2176|00:00:08|2021-05-27T18:47:49|0:0|42909266|slurm_metadata|272|1|debug_knl|PER_NODE:craynetwork:1|87Gn|1|billing=1,cpu=1,node=1|2021-05-27T18:47:41|COMPLETED|2021-05-27T18:44:07|92503|siddiq90|/global/u1/s/siddiq90/.buildtest/tests/cori.slurm.knl_debug/metadata/slurm_metadata/0/stage
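Turning one pipe-delimited sacct line into a dictionary can be sketched as follows. parse_sacct_record is a hypothetical helper written for illustration; it assumes the field list matches the -o argument and that no value contains a pipe character.

```python
def parse_sacct_record(line, fields):
    """Map one `sacct -P` output line onto the requested field names."""
    values = line.rstrip("\n").split("|")
    if len(values) != len(fields):
        # A mismatch usually means a value contained "|" or the field list is wrong.
        raise ValueError(f"expected {len(fields)} values, got {len(values)}")
    return dict(zip(fields, values))
```

All values stay as strings; converting fields such as NCPUS or Elapsed to numeric types is left to the caller.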
- is_cancelled(self)
Return True if the job is cancelled, otherwise return False. Slurm will report CANCELLED for the job state.
- is_complete(self)
Return True if the job is complete, otherwise return False. Slurm will report COMPLETED for the job state.
- is_failed(self)
Return True if the job failed, otherwise return False. Slurm will report FAILED for the job state.
- is_out_of_memory(self)
Return True if the job ran out of memory, otherwise return False. Slurm will report OUT_OF_MEMORY for the job state.
- is_pending(self)
Return True if the job is pending, otherwise return False. Slurm will report PENDING for the job state.
- is_running(self)
Return True if the job is running, otherwise return False. Slurm will report RUNNING for the job state.
- is_suspended(self)
Return True if the job is suspended, otherwise return False. Slurm will report SUSPENDED for the job state.
- is_timeout(self)
Return True if the job timed out, otherwise return False. Slurm will report TIMEOUT for the job state.
- poll(self)
Poll the job to extract its state and exit code. We also retrieve the job's work directory. We run the following commands to retrieve these properties:
Job State: sacct -j <jobid> -o State -n -X -P
ExitCode and Workdir: sacct -j <jobid> -X -n -P -o ExitCode,Workdir
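The output of the second command is a single ExitCode|Workdir line, where Slurm formats ExitCode as <exit code>:<signal>. A minimal sketch of splitting it apart is shown below; parse_poll_output is a hypothetical helper, and the path in the usage note is made up for illustration.

```python
def parse_poll_output(output):
    """Split `ExitCode|Workdir` output into (exit_code, signal, workdir)."""
    # Split only on the first pipe so the work directory is kept whole.
    exitcode_field, workdir = output.strip().split("|", 1)
    code, _, signal = exitcode_field.partition(":")
    return int(code), int(signal or 0), workdir
```

For example, parse_poll_output("0:0|/tmp/stage") yields an exit code of 0, a signal of 0, and the work directory /tmp/stage.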
- state(self)
Return job state
- workdir(self)
Return job work directory
- buildtest.executors.slurm.logger