[{"content":"Slurm Usage\nSlurm is a workload manager used mainly on HPC and GPU clusters. It schedules jobs so that cluster resources are used as fully as possible. I used to use Meituan’s internal workload manager, which is based on Slurm. Now I use PSC, which provides full Slurm functionality, so I am writing this post to summarize its usage.\nSlurm commands fall into four main categories:\nJob Submission and Control\nJob Management\nStatus Query\nResource Management\nJob Submission and Control Commands\nThe first category of commands deals with submitting and controlling jobs on the cluster.\nsbatch\nThis is your primary tool for submitting batch jobs to the cluster. The command reads a script file containing resource requests and job steps.\n# Basic job submission\nsbatch job_script.sh\n# Submit with specific requirements\nsbatch --time=2:00:00 --mem=4G job_script.sh\n# Job Identification\n-J, --job-name=jobname # Name of the job\n--comment=string # Add comment to job\n--wckey=wckey # Specify wckey for job\n# Resource Allocation\n-N, --nodes=N # Number of nodes\n-n, --ntasks=ntasks # Number of tasks\n-c, --cpus-per-task=ncpus # CPUs per task\n--mem=MB # Total memory per node\n--mem-per-cpu=MB # Memory per CPU\n--gpus=n # Number of GPUs\n--gres=resource_spec # Generic resource requirements\n# Time and Priority\n-t, --time=minutes # Time limit\n--deadline=timestamp # Job deadline\n--priority=value # Job priority (admin only)\n-H, --hold # Submit job in held state\n# Input/Output\n-o, --output=filename # Standard output file\n-e, --error=filename # Standard error file\n-i, --input=filename # Standard input file\n# Partition and Constraints\n-p, --partition=partition # Partition request\n-C, --constraint=list # Node feature constraints\n--reservation=name # Resource reservation name\n# Dependencies and Arrays\n-d, --dependency=dependency_list # Job dependencies\n--array=array_spec # Job array 
indices\n# Environment\n--export=env_vars # Export environment variables\n--chdir=directory # Working directory\nsrun\nUse srun to run parallel jobs or interactive tasks. It’s particularly useful for immediate execution.\n# Run a simple command across nodes\nsrun hostname\n# Launch a parallel program\nsrun -n 4 ./parallel_program\n# Job Configuration\n-N, --nodes=N # Number of nodes\n-n, --ntasks=ntasks # Number of tasks\n-c, --cpus-per-task=ncpus # CPUs per task\n-p, --partition=partition # Partition request\n# Resource Requirements\n--mem=MB # Memory per node\n--mem-per-cpu=MB # Memory per CPU\n--gpus=n # Number of GPUs\n--gres=resource_spec # Generic resource requirements\n# Time Limits\n-t, --time=minutes # Time limit\n--immediate # Exit if resources not available\n# Input/Output\n-o, --output=filename # Standard output file\n-e, --error=filename # Standard error file\n-i, --input=filename # Standard input file\n# Task Distribution\n--ntasks-per-node=n # Tasks per node\n--ntasks-per-socket=n # Tasks per socket\n--distribution=arbitrary # Task distribution method\n# MPI Options\n--mpi=type # MPI implementation type\n--cpu-bind=type # Bind tasks to CPUs\nsalloc\nWhen you need interactive access to compute resources, salloc is your go-to command.\n# Request an interactive session\nsalloc --nodes=1 --time=1:00:00\n# Get a GPU-enabled session\nsalloc --gpus=1 --time=2:00:00\n# Resource Request\n-N, --nodes=N # Number of nodes\n-n, --ntasks=ntasks # Number of tasks\n-c, --cpus-per-task=ncpus # CPUs per task\n--mem=MB # Memory per node\n--mem-per-cpu=MB # Memory per CPU\n# GPU and Special Resources\n--gpus=n # Number of GPUs\n--gres=resource_spec # Generic resource requirements\n-C, --constraint=list # Node feature constraints\n# Time and Priority\n-t, --time=minutes # Time limit\n--immediate # Exit if resources not available\n-H, 
--hold # Submit allocation in held state\n# Partition and Reservation\n-p, --partition=partition # Partition request\n--reservation=name # Resource reservation name\n# Job Identification\n-J, --job-name=jobname # Name of job\n--comment=string # Add comment to allocation\n# Environment\n--export=env_vars # Export environment variables\n--chdir=directory # Working directory\nExample\nMemory Specification\n# Different ways to specify memory\n--mem=4G # 4 GB per node\n--mem-per-cpu=1G # 1 GB per CPU\n--mem=4096M # Same amount, given in megabytes\nTime Specification\n# Different time formats\n--time=2:00:00 # Hours:Minutes:Seconds\n--time=120 # Minutes\n--time=2-00:00:00 # Days-Hours:Minutes:Seconds\nGPU Requests\n# Different ways to request GPUs\n--gpus=1 # Request 1 GPU for the job\n--gres=gpu:1 # Alternative: request 1 GPU per node\n--gres=gpu:tesla:2 # Request 2 Tesla GPUs per node\nJob Dependencies\n# Common dependency types\n--dependency=after:123 # Start after job 123 has started or been cancelled\n--dependency=afterok:123 # Start after job 123 completes successfully\n--dependency=afterany:123 # Start after job 123 ends (any state)\n--dependency=afternotok:123 # Start after job 123 fails\n--dependency=singleton # Start after earlier jobs with the same name and user have ended\nTips:\nUse absolute paths; relative paths are resolved against the job's working directory, which may not be where you expect. Set the time limit carefully so that jobs are not killed for exceeding it.\n
Job Management Commands\nThese commands help you manage existing jobs in the system.\nscancel\nThe command for terminating jobs:\n# Cancel a specific job\nscancel 12345\n# Cancel all jobs for a user\nscancel -u username\n# Cancel jobs in a partition\nscancel -p partition_name\n# Job Identification\n-i, --interactive # Require response before canceling\n--ctld # Send the request to slurmctld instead of directly to the slurmd daemons\n-n, --name=job_name # Cancel jobs with specified name\n--qos=qos_list # Cancel jobs with specified QOS\n--reservation=reservation_name # Cancel jobs with specified reservation\n# User and Account Control\n-u, --user=user_name # Cancel jobs of specified user\n-A, --account=account # Cancel jobs of specified account\n--wckey=wckey # Cancel jobs with specified wckey\n# Job State and Type\n-t, --state=states # Cancel jobs in specified state\n--hurry # Do not stage out burst buffer data\n--signal=signal_number # Send specified signal to jobs\n# Partition and Node\n-p, --partition=partition_names # Cancel jobs in specified partitions\n-w, --nodelist=host_list # Cancel jobs on specified nodes\n# Batch Script\n-b, --batch # Signal only the batch step (the shell script)\n-f, --full # Also signal the batch script and its child processes\nscontrol\nA powerful tool for viewing and modifying job configurations:\n# View job details\nscontrol show job 12345\n# Modify job parameters\nscontrol update JobId=12345 TimeLimit=2:00:00\n# Show Commands\nshow job # Show job information\nshow node # Show node information\nshow partition # Show partition information\nshow reservation # Show reservation information\nshow config # Show system configuration\n# Update Commands\nupdate job JobId=id # Update job attributes\nupdate node NodeName=name # Update node attributes\nupdate partition PartitionName=name # Update partition attributes\n# Job Control\nhold JobId # Place hold on job\n
release JobId # Release hold on job\nrequeue JobId # Requeue a job\nsuspend JobId # Suspend a job\nresume JobId # Resume a suspended job\n# Other Controls\nping # Ping slurmctld daemon\nreconfigure # Reconfigure slurmctld\ntakeover # Instruct the backup controller to take over\nshutdown # Shutdown slurm daemons\nStatus Query Commands\nThese commands provide information about the current state of jobs and the system.\nsqueue\nThe primary command for viewing the job queue:\n# View all jobs\nsqueue\n# View user-specific jobs\nsqueue -u username\n# Custom format output\nsqueue --format=\"%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R\"\n# Output Format\n-o, --format=format # Specify custom output format\n--sort=fields # Sort by specified fields\n-l, --long # Long output format\n-i, --iterate=seconds # Repeatedly display at intervals\n# Job Selection\n-j, --jobs=job_id_list # Show specific jobs\n-u, --user=user_list # Show user's jobs\n-n, --name=name_list # Show jobs with name\n-w, --nodelist=node_list # Show jobs on specific nodes\n# Partition and State\n-p, --partition=partition_names # Show jobs in partition\n-t, --states=state_list # Show jobs in state\n--qos=qos_list # Show jobs with QOS\n# Time and Priority\n--start # Show expected start time\n--priority # List pending multi-partition jobs once per partition\nExample:\n# Common format specifiers\n%.18i # Job ID (18 characters)\n%.9P # Partition (9 characters)\n%.8j # Job name (8 characters)\n%.8u # User name (8 characters)\n%.2t # Job state (2 characters)\n%.10M # Time used (10 characters)\n%.6D # Number of nodes (6 characters)\n%R # Node list for running jobs, or reason for pending jobs\n# Example format string\nsqueue --format=\"%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R\"\nsinfo\nUse this to check partition and node information:\n# View partition information\nsinfo\n# Detailed node status\nsinfo -N\n# Specific partition details\nsinfo -p partition_name\n
# Output Format\n-o, --format=format # Specify custom format\n-l, --long # Display in long format\n-N, --Node # Display node-oriented output\n--summarize # Report summary information\n# Node Selection\n-n, --nodes=nodes # Report on specific nodes\n-p, --partition=partition # Report on specific partition\n-t, --states=states # Report on nodes in specified state\n# Display Options\n-r, --responding # Show only responding nodes\n-d, --dead # Show only non-responding nodes\n--hide # Do not display hidden partitions\nsacct\nFor accessing job history:\n# View today's jobs\nsacct\n# View jobs since a specific date\nsacct --starttime=2024-01-01\n# Time Range\n-S, --starttime=time # Report start time\n-E, --endtime=time # Report end time\n-A, --accounts=accounts # Show jobs from specified accounts\n# Output Control\n-o, --format=format # Specify output format\n--units=unit # Display units (K,M,G,T,P)\n-X, --allocations # Only show allocation records\n-p, --parsable # Output in parsable format\n# Job Selection\n-j, --jobs=job_id_list # Show specific jobs\n-u, --user=user_list # Show specific users\n-s, --state=states # Show jobs in specified states\nExample:\n# Common format specifiers\nJobID # Job ID\nJobName # Job name\nState # Job state\nExitCode # Exit code\nSubmit # Submit time\nStart # Start time\nEnd # End time\nElapsed # Elapsed time\nMaxRSS # Maximum memory used\n# Example format string\nsacct --format=\"JobID,JobName,State,ExitCode,Submit,Start,End,Elapsed,MaxRSS\"\nResource Management Commands\nThese commands help monitor resource allocation and priorities.\nsshare\nView fair-share scheduling information:\n# Basic share information\nsshare\n# Detailed information\nsshare -\n# Basic Options\n-A, --accounts=names # Show shares for specified accounts\n-a, --all # Show all users, even those with no usage\n-l, --long # Long listing format with more details\n-p, --parsable # 
Display in parsable format\n-u, --users=user_names # Show shares for specified users\n# Output Format\n-o, --format=format_string # Format specification\n--noheader # No header on output\n--json # JSON output format\n# Time Range\n-s, --start=time # Start time for statistics\n-e, --end=time # End time for statistics\nExample:\n# Examples\nsshare -A myaccount # Show shares for specific account\nsshare -u username -l # Detailed share info for user\nsshare -o \"Account,User,RawShares\" # Custom format output\nsprio\nCheck job priorities:\n# View job priorities\nsprio\n# Detailed priority information\nsprio -l\n# Job Selection\n-j, --jobs=job_id # Show priority for specific jobs\n-u, --user=user_name # Show priorities for specific user\n-o, --format=format # Specify output format\n# Display Options\n-l, --long # Long display format\n-h, --noheader # No header in output\n--json # JSON output format\n-w, --weights # Show priority weights\nExample:\nsprio -u username # Show priorities for user's jobs\nsprio -j 12345 # Show priority for specific job\nsprio -l # Show detailed priority information","date":"2025-02-24T00:00:00Z","image":"/en/p/slurm-usage/cover.en.png","permalink":"/en/p/slurm-usage/","title":"Slurm-Usage"}]