buildtest.scheduler.slurm

Module Contents

Classes

SlurmJob

The SlurmJob class models a Slurm job ID and provides helper methods that operate on an active Slurm job.

Attributes

logger

buildtest.scheduler.slurm.logger

class buildtest.scheduler.slurm.SlurmJob(jobID, cluster=None)[source]

Bases: buildtest.scheduler.job.Job

The SlurmJob class models a Slurm job ID and provides helper methods that operate on an active Slurm job. It can poll a job to get its updated state, gather job data once the test completes, and cancel the job if necessary. It can also retrieve the job state and determine whether the job is running, pending, suspended, or cancelled. Jobs are polled via the sacct command, which can retrieve pending, running, and completed jobs.

is_pending()[source]

Return True if the job is pending, otherwise return False. Slurm reports PENDING as the job state.

is_running()[source]

Return True if the job is running, otherwise return False. Slurm reports RUNNING as the job state.

is_suspended()[source]

Return True if the job is suspended, otherwise return False. Slurm reports SUSPENDED as the job state.

is_cancelled()[source]

Return True if the job is cancelled, otherwise return False. Slurm reports CANCELLED as the job state.

is_complete()[source]

Return True if the job is complete, otherwise return False. Slurm reports COMPLETED as the job state.

is_failed()[source]

Return True if the job failed, otherwise return False. Slurm reports FAILED as the job state.

is_out_of_memory()[source]

Return True if the job ran out of memory, otherwise return False. Slurm reports OUT_OF_MEMORY as the job state.

is_timeout()[source]

Return True if the job timed out, otherwise return False. Slurm reports TIMEOUT as the job state.

complete()[source]

This method is used when gathering the job result. A job is considered complete if it is in any of the following states: COMPLETED, FAILED, OUT_OF_MEMORY, TIMEOUT.
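
The completion check described for complete() can be sketched as a set-membership test. This is only an illustration, not buildtest's actual implementation: the names TERMINAL_STATES and is_terminal are hypothetical, and only the four state names are taken from the description.

```python
# Hypothetical helper; only the four state names come from the docs above.
TERMINAL_STATES = frozenset({"COMPLETED", "FAILED", "OUT_OF_MEMORY", "TIMEOUT"})


def is_terminal(state: str) -> bool:
    """Return True if the job state means the job result can be gathered."""
    # Strip whitespace so a raw line of command output can be passed directly.
    return state.strip() in TERMINAL_STATES
```

For example, is_terminal("TIMEOUT") is True while is_terminal("PENDING") is False, so a polling loop could keep waiting until the check passes.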

state()[source]

Return job state

workdir()[source]

Return job work directory

exitcode()[source]

Return job exit code

cancel()[source]

Cancel the job by running scancel <jobid>. If the job was submitted to a specific Slurm cluster, it is cancelled using scancel <jobid> --clusters=<cluster>. This method is called if the job exceeds maxpendtime.
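
A minimal sketch of how the two forms of the scancel command could be assembled, assuming the command is built as an argument list. The function name build_scancel_command is hypothetical and is not part of buildtest; only the command shapes come from the description of cancel().

```python
from typing import List, Optional


def build_scancel_command(jobid: str, cluster: Optional[str] = None) -> List[str]:
    """Build the scancel argument list (hypothetical helper, not buildtest API)."""
    cmd = ["scancel", str(jobid)]
    # Append the cluster option only when the job was submitted to a named cluster.
    if cluster:
        cmd.append(f"--clusters={cluster}")
    return cmd
```

On a system with Slurm installed, the resulting list could be handed to subprocess.run; building a list rather than a single string avoids shell quoting issues.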

poll()[source]

This method polls the job via the sacct command to get the updated job state. It runs the following command: sacct -j <jobid> -o State -n -X -P

Slurm reports the job state, which can then be parsed. Shown below is an example of a job in the PENDING state:

$ sacct -j 46641229 -o State -n -X -P
PENDING
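
The polling step can be sketched as building the sacct command and reading the first line of its output. This is an illustration under stated assumptions rather than buildtest's implementation: the function names are hypothetical, the optional --clusters argument is assumed to mirror the cluster handling described for the other commands, and treating empty output as "no record yet" is also an assumption.

```python
import subprocess
from typing import List, Optional


def build_poll_command(jobid: str, cluster: Optional[str] = None) -> List[str]:
    """Build the sacct argument list used for polling (hypothetical helper)."""
    cmd = ["sacct", "-j", str(jobid), "-o", "State", "-n", "-X", "-P"]
    if cluster:  # assumption: cluster is passed the same way as for other commands
        cmd.append(f"--clusters={cluster}")
    return cmd


def parse_state(stdout: str) -> str:
    """Return the job state from sacct output, or "" if there is no output."""
    lines = stdout.strip().splitlines()
    return lines[0].strip() if lines else ""


def poll_state(jobid: str, cluster: Optional[str] = None) -> str:
    """Run sacct and return the parsed state; requires Slurm to be installed."""
    result = subprocess.run(
        build_poll_command(jobid, cluster),
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_state(result.stdout)
```

Keeping the parsing separate from the subprocess call lets the parsing be exercised on captured output such as the PENDING example without access to a Slurm cluster.
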
gather()[source]

Gather the job record. This method is called after job completion. We use sacct to gather the job record and return it as a dictionary. The command we run is sacct -j <jobid> -X -n -P -o <field1>,<field2>,...,<fieldN>. We retrieve the following format fields from the job record:

  • “Account”

  • “AllocNodes”

  • “AllocTRES”

  • “ConsumedEnergyRaw”

  • “CPUTimeRaw”

  • “Elapsed”

  • “ElapsedRaw”

  • “End”

  • “ExitCode”

  • “JobID”

  • “JobName”

  • “NCPUS”

  • “NNodes”

  • “QOS”

  • “ReqMem”

  • “ReqNodes”

  • “Start”

  • “State”

  • “Submit”

  • “UID”

  • “User”

  • “WorkDir”

The sacct output is delimited by the pipe symbol (|). It is split on that symbol and stored in a dict:

$ sacct -j 42909266 -X -n -P -o Account,AllocNodes,AllocTRES,ConsumedEnergyRaw,CPUTimeRaw,Elapsed,End,ExitCode,JobID,JobName,NCPUS,NNodes,QOS,ReqMem,ReqNodes,Start,State,Submit,UID,User,WorkDir --clusters=cori
nstaff|1|billing=272,cpu=272,energy=262,mem=87G,node=1|262|2176|00:00:08|2021-05-27T18:47:49|0:0|42909266|slurm_metadata|272|1|debug_knl|87Gn|1|2021-05-27T18:47:41|COMPLETED|2021-05-27T18:44:07|92503|siddiq90|/global/u1/s/siddiq90/.buildtest/tests/cori.slurm.knl_debug/metadata/slurm_metadata/0/stage
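
A minimal sketch of turning one pipe-delimited sacct line into a dictionary, assuming the values arrive in the same order as the requested fields. The names FIELDS and parse_job_record are hypothetical; the field list is copied from the example command above, and the sketch assumes no field value itself contains a pipe character.

```python
from typing import Dict, List

# Field names in the order they appear in the example sacct command above.
FIELDS = (
    "Account,AllocNodes,AllocTRES,ConsumedEnergyRaw,CPUTimeRaw,Elapsed,End,"
    "ExitCode,JobID,JobName,NCPUS,NNodes,QOS,ReqMem,ReqNodes,Start,State,"
    "Submit,UID,User,WorkDir"
).split(",")


def parse_job_record(line: str, fields: List[str] = FIELDS) -> Dict[str, str]:
    """Map each requested field name to its value (hypothetical helper)."""
    values = line.strip().split("|")
    # Fail loudly on a mismatch instead of silently pairing the wrong columns.
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(values)}")
    return dict(zip(fields, values))
```

Applied to the example output above, the resulting dict would map "State" to "COMPLETED" and "ExitCode" to "0:0". All values remain strings; converting numeric fields is left to the caller.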

We retrieve ExitCode and WorkDir via the sacct command to get the return code. Slurm writes the output and error files to the WorkDir location. We run the command below and parse its output. ExitCode has the form <exitcode>:<signal>, a colon-separated pair. For more details on Slurm exit codes, see https://slurm.schedmd.com/job_exit_code.html

$ sacct -j 46294283 --clusters=cori -X -n -P -o ExitCode,Workdir
0:0|/global/u1/s/siddiq90/github/buildtest/var/tests/cori.slurm.knl_debug/hostname/hostname_knl/cd39a853/stage
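
The ExitCode,Workdir output shown above can be split in two steps: first on the pipe, then on the colon. The sketch below illustrates this under the assumption that the line always has the form <exitcode>:<signal>|<workdir>; the function name parse_exitcode_workdir is hypothetical and not part of buildtest.

```python
from typing import Tuple


def parse_exitcode_workdir(line: str) -> Tuple[int, int, str]:
    """Return (exit code, signal, working directory) from one sacct line."""
    # Split only on the first pipe so the directory path is kept intact.
    exit_field, workdir = line.strip().split("|", 1)
    # ExitCode is a colon-separated pair: <exitcode>:<signal>.
    code, signal = exit_field.split(":")
    return int(code), int(signal), workdir
```

For the example line above, the first two elements of the returned tuple are 0 and 0, and the third is the stage directory path.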