All of our supercomputers use a system called PBS to make sure that everyone’s programs have the resources they need (mostly CPU cores and memory) and that no one takes more than their fair share.
Anything that runs for more than a few seconds should be run inside a PBS job. This can be done by using a PBS script or an interactive PBS session.
Example PBS jobs for many of our programs can be found in /usr/local/apps/example_jobs. We have a tutorial on running an example job.
PBS scripts are submitted by using the
qsub command. The status of PBS jobs can then be checked by using the
When a job is submitted, PBS checks to see if the resources required by the job are available. If so, the job is allowed to run. If not, the job is queued until the resources are available. If your job requires a lot of resources, it’s possible that jobs submitted after yours will run first because they require fewer resources.
It’s important to give some thought to how you break up your jobs and how many resources each job really needs. In general, several small jobs will start running sooner and finish faster than a single large job. A job might take less time to run with 32 CPU cores, but it could start sooner, and possibly finish sooner, if you request only 16. It’s often worthwhile to see what resources are available and how many other jobs are queued before submitting your job.
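As a rough illustration of breaking work into several small jobs, the sketch below writes one small PBS script per input file and prints the qsub command for each instead of running it. The input file names, the program name myprogram, and the 2-hour limits are placeholders chosen for this example, not values taken from our systems.

```shell
#!/bin/sh
# Dry run: create one small PBS script per input file.
# part1.dat..part3.dat and "myprogram" are hypothetical placeholders.
outdir=$(mktemp -d)

for input in part1.dat part2.dat part3.dat; do
    name=${input%.dat}
    script="$outdir/$name.pbs"

    # Write the job script; \$PBS_O_WORKDIR is escaped so that it is
    # expanded when the job runs, not when this file is generated.
    cat > "$script" <<EOF
#!/bin/bash
#PBS -N $name
#PBS -l walltime=2:00:00
#PBS -l cput=2:00:00
cd "\$PBS_O_WORKDIR"
myprogram $input
EOF

    # Print the submission command rather than running it.
    echo "qsub $script"
done
```

Once the generated scripts look right, the printed qsub commands can be run by hand or the echo can be replaced with a real qsub call.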
All jobs should specify walltime and CPU time (cput). Walltime is simply how long a job is expected to run. CPU time is the total time that all of the job’s CPUs spend working on it. For a single-CPU job, the two should be the same, but for a multi-CPU job, cput will be roughly n x walltime, where n is the number of processors. Jobs that exceed the requested cput will be killed by PBS.
#PBS -l walltime=8:00:00
This tells PBS how long you expect the job to run from the time it starts to the time it finishes. This example specifies 8 hours. If your job runs longer than specified, it will be killed by PBS.
#PBS -l cput=16:00:00
This tells PBS how much “CPU time” is needed by the job. For a single-CPU job, it would be the same as walltime. For a multi-CPU job, it would be roughly n x walltime. This example specifies 16 hours, which corresponds to an 8-hour walltime on 2 CPUs. If a job accumulates more cput than specified, it will be killed by PBS.
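The n x walltime estimate can be worked out with a small shell helper. This is only a sketch: cput_for is a hypothetical function name, not a PBS command, and it assumes the walltime is given as HH:MM:SS.

```shell
#!/bin/sh
# Estimate a cput request as n x walltime.
# cput_for is a hypothetical helper, not part of PBS.
#   $1 = walltime as HH:MM:SS
#   $2 = number of CPU cores (n)
cput_for() {
    echo "$1" | awk -F: -v n="$2" '{
        # Convert the walltime to seconds and multiply by the core count.
        total = ($1 * 3600 + $2 * 60 + $3) * n
        # Convert back to H:MM:SS for use in "#PBS -l cput=...".
        printf "%d:%02d:%02d\n", int(total / 3600), int((total % 3600) / 60), total % 60
    }'
}

cput_for 8:00:00 2    # prints 16:00:00
```

Since jobs that exceed their cput are killed, you may want to round the result up a little rather than request the exact estimate.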
To have PBS email you when your job begins (b), ends (e), or aborts (a), add the following to your PBS script:
#PBS -m bea
By default, these emails are sent to your user account’s local mailbox on the system. Unless you have that mail forwarded, you probably want to specify an alternative email address like so:
#PBS -M email@example.com
Multiple email addresses can be specified by separating them with commas.
Job Output Files
By default, every PBS job results in two output files being created by the PBS system. One captures the standard output of the job, and the other captures standard error. They are named jobname.ojobid and jobname.ejobid by default. These two files can be joined into a single file by adding:
#PBS -j oe
to your PBS script. In general, this is a good idea. The -o and -e options can also be used to rename the files.
Also by default, these files are written on the compute node and are not visible on the head node until after the job finishes. However, by adding:
#PBS -k oed
the files will be written directly to their final location and can be viewed while the job is still running. Note that this option is only available on Maple.
It’s important to note that jobs receive a default environment. Modules will need to be loaded and environment variables will need to be set inside your PBS script. Also, in the case of the clusters, note that jobs are run on the compute nodes, and the software available there may be subtly different from that on the head node.
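Putting the directives above together, a complete PBS script might look something like the sketch below. The job name, the module name myprogram, and the OMP_NUM_THREADS setting are assumptions made for illustration, so substitute whatever your program actually needs. CPU core and memory requests are left out because their syntax depends on the system. The check for the module command is only there so the script also runs outside of PBS for testing.

```shell
#!/bin/bash
#PBS -N example_job
#PBS -l walltime=8:00:00
#PBS -l cput=16:00:00
#PBS -m bea
#PBS -M email@example.com
#PBS -j oe

# Jobs start with a default environment, so set everything up here.
# "myprogram" is a placeholder module name.
if command -v module >/dev/null 2>&1; then
    module load myprogram
fi

# Example environment variable; 2 threads matches cput = 2 x walltime.
export OMP_NUM_THREADS=2

# PBS sets PBS_O_WORKDIR to the directory where qsub was run.
# Fall back to the current directory when run outside of PBS.
cd "${PBS_O_WORKDIR:-$PWD}" || exit 1

# Record which node the job ran on (a compute node, not the head node).
echo "Job running on $(uname -n)"
```

Saved as, say, example_job.pbs, the script would be submitted with qsub example_job.pbs.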