buildtest.scheduler.slurm
Module Contents
Classes
The SlurmJob class models a Slurm Job ID with helper methods to perform operations against an active Slurm job.
Attributes
- buildtest.scheduler.slurm.logger
- class buildtest.scheduler.slurm.SlurmJob(jobID, slurm_cmds, cluster=None)[source]
Bases: buildtest.scheduler.job.Job
The SlurmJob class models a Slurm Job ID with helper methods to perform operations against an active Slurm job. The SlurmJob class can poll a job to get its updated state, gather job data upon completion of a test, and cancel a job if necessary. We can also retrieve the job state and determine if a job is running, pending, suspended, or cancelled. Jobs are polled via the sacct command, which can retrieve pending, running, and completed jobs.
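Shown below is a hypothetical usage of this class. The job ID and cluster values are made up, and slurm_cmds is assumed here to be a mapping of Slurm command names to their executable paths; consult the buildtest source for its actual shape.

from buildtest.scheduler.slurm import SlurmJob

# assumption: slurm_cmds maps Slurm command names to executable paths
slurm_cmds = {"sacct": "sacct", "scancel": "scancel", "scontrol": "scontrol"}

job = SlurmJob("46641229", slurm_cmds, cluster="cori")
job.poll()  # refresh the job state via sacct

if job.is_pending() or job.is_running():
    print("job is still active")
elif job.is_complete():
    job.retrieve_jobdata()  # gather the job record after completion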
- is_pending()[source]
If job is pending return True otherwise return False. The Slurm job state for pending is PENDING.
- is_running()[source]
If job is running return True otherwise return False. Slurm will report RUNNING for the job state.
- is_suspended()[source]
If job is suspended return True otherwise return False. Slurm will report SUSPENDED for the job state.
- is_cancelled()[source]
If job is cancelled return True otherwise return False. Slurm will report CANCELLED for the job state.
- is_complete()[source]
If job is complete return True otherwise return False. Slurm will report COMPLETED for the job state.
- is_failed()[source]
If job failed return True otherwise return False. Slurm will report FAILED for the job state.
- is_out_of_memory()[source]
If job is out of memory return True otherwise return False. Slurm will report OUT_OF_MEMORY for the job state.
- is_timeout()[source]
If job timed out return True otherwise return False. Slurm will report TIMEOUT for the job state.
- complete()[source]
This method is used for gathering the job result; we assume a job is complete if it is in any of the following states: COMPLETED, FAILED, OUT_OF_MEMORY, TIMEOUT.
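A minimal sketch of how the state checks above and complete() could be implemented; the _state attribute is an internal name invented for this sketch and assumed to hold the most recent state string parsed from sacct.

def is_pending(self):
    # _state is assumed to hold the last job state reported by sacct
    return self._state == "PENDING"

def complete(self):
    # a job is treated as complete once it reaches any terminal state
    return self._state in {"COMPLETED", "FAILED", "OUT_OF_MEMORY", "TIMEOUT"}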
- cancel()[source]
Cancel job by running scancel <jobid>. If the job is assigned to a Slurm cluster we cancel the job using scancel <jobid> --clusters=<cluster>. This method is called if a job exceeds maxpendtime.
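A minimal sketch of this cancellation logic; the jobid and cluster attribute names are assumptions for illustration, not the actual implementation.

import subprocess

def cancel(self):
    # build 'scancel <jobid>' and add --clusters when a cluster was specified
    cmd = ["scancel", str(self.jobid)]
    if self.cluster:
        cmd.append(f"--clusters={self.cluster}")
    subprocess.run(cmd, check=True)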
- poll()[source]
This method will poll the job via the sacct command to get the updated job state by running the following command: sacct -j <jobid> -o State -n -X -P

Slurm will report the job state, which can then be parsed. Shown below is an example for a job in PENDING state:

$ sacct -j 46641229 -o State -n -X -P
PENDING
- get_output_and_error_files()[source]
This method will extract the file paths to StdOut and StdErr using the scontrol show job <jobid> command, which will be used to set the output and error file.

siddiq90@login07> scontrol show job 23608796
JobId=23608796 JobName=perlmutter-gpu.slurm
   UserId=siddiq90(92503) GroupId=siddiq90(92503) MCS_label=N/A
   Priority=69119 Nice=0 Account=nstaff_g QOS=gpu_debug
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:05:00 TimeMin=N/A
   SubmitTime=2024-03-28T12:36:05 EligibleTime=2024-03-28T12:36:05
   AccrueTime=2024-03-28T12:36:05
   StartTime=2024-03-28T12:36:14 EndTime=2024-03-28T12:41:14 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-03-28T12:36:12 Scheduler=Backfill:*
   Partition=gpu_ss11 AllocNode:Sid=login07:1529462
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=
   NumNodes=1-1 NumCPUs=4 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=4,mem=229992M,node=1,billing=4,gres/gpu=1
   AllocTRES=(null)
   Socks/Node=* NtasksPerN:B:S:C=4:0:*:* CoreSpec=*
   MinCPUsNode=4 MinMemoryNode=0 MinTmpDiskNode=0
   Features=gpu&a100 DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=u1:1 Network=(null)
   Command=/global/u1/s/siddiq90/jobs/perlmutter-gpu.slurm
   WorkDir=/global/u1/s/siddiq90/jobs
   StdErr=/global/u1/s/siddiq90/jobs/slurm-23608796.out
   StdIn=/dev/null
   StdOut=/global/u1/s/siddiq90/jobs/slurm-23608796.out
   Power=
   TresPerJob=gres:gpu:1
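One way the StdOut and StdErr fields could be extracted from the scontrol output shown above; a sketch, with attribute names assumed for illustration.

import re
import subprocess

def get_output_and_error_files(self):
    # query scontrol and pull the StdOut/StdErr key=value pairs from its output
    out = subprocess.run(
        ["scontrol", "show", "job", str(self.jobid)],
        capture_output=True, text=True, check=True,
    ).stdout
    stdout = re.search(r"StdOut=(\S+)", out)
    stderr = re.search(r"StdErr=(\S+)", out)
    self._outfile = stdout.group(1) if stdout else None
    self._errfile = stderr.group(1) if stderr else None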
- retrieve_jobdata()[source]
This method will get the job record and is called after job completion. We use sacct to gather the job record and return it as a dictionary. The command we run is sacct -j <jobid> -X -n -P -o <field1>,<field2>,...,<fieldN>. We retrieve the following format fields from the job record:
“Account”
“AllocNodes”
“AllocTRES”
“ConsumedEnergyRaw”
“CPUTimeRaw”
“Elapsed”
“ElapsedRaw”
“End”
“ExitCode”
“JobID”
“JobName”
“NCPUS”
“NNodes”
“QOS”
“ReqMem”
“ReqNodes”
“Start”
“State”
“Submit”
“UID”
“User”
“WorkDir”
The output of sacct is parseable using the pipe symbol (|) and is stored into a dict:

$ sacct -j 42909266 -X -n -P -o Account,AllocNodes,AllocTRES,ConsumedEnergyRaw,CPUTimeRaw,Elapsed,End,ExitCode,JobID,JobName,NCPUS,NNodes,QOS,ReqMem,ReqNodes,Start,State,Submit,UID,User,WorkDir --clusters=cori
nstaff|1|billing=272,cpu=272,energy=262,mem=87G,node=1|262|2176|00:00:08|2021-05-27T18:47:49|0:0|42909266|slurm_metadata|272|1|debug_knl|87Gn|1|2021-05-27T18:47:41|COMPLETED|2021-05-27T18:44:07|92503|siddiq90|/global/u1/s/siddiq90/.buildtest/tests/cori.slurm.knl_debug/metadata/slurm_metadata/0/stage
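Because the field order passed to -o matches the order of the pipe-delimited output, the record can be converted into a dictionary by zipping the two. A minimal sketch under that assumption (parse_record is a hypothetical helper):

fields = ["Account", "AllocNodes", "AllocTRES", "ConsumedEnergyRaw",
          "CPUTimeRaw", "Elapsed", "ElapsedRaw", "End", "ExitCode",
          "JobID", "JobName", "NCPUS", "NNodes", "QOS", "ReqMem",
          "ReqNodes", "Start", "State", "Submit", "UID", "User", "WorkDir"]

def parse_record(record):
    # record is one pipe-delimited line of sacct output, as shown above
    return dict(zip(fields, record.split("|")))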
We retrieve ExitCode and WorkDir via the sacct command to get the returncode. Slurm will write the output and error file in the WorkDir location. We run the command below and parse the output. The ExitCode is of the form <exitcode>:<signal>, a colon-separated pair. For more details on Slurm exit codes see https://slurm.schedmd.com/job_exit_code.html

$ sacct -j 46294283 --clusters=cori -X -n -P -o ExitCode,Workdir
0:0|/global/u1/s/siddiq90/github/buildtest/var/tests/cori.slurm.knl_debug/hostname/hostname_knl/cd39a853/stage