Running Jobs

 

Submitting Jobs via PBS

 

PBS, the Portable Batch System, is a networked subsystem for submitting, monitoring, and controlling a workload of batch jobs on MNC2. With PBS, jobs can be scheduled for execution on MNC2 according to scheduling policies that attempt to fully utilize system resources without overcommitting those resources, while being fair to all users. For more information about PBS, see the online manual page, which can be viewed by executing the command:

man pbs

Long-running jobs on clark must be submitted to run under PBS. Short test runs can be run from the interactive login session, but they should be limited to no more than 10 minutes of CPU time.

Jobs are submitted to be run under PBS via the qsub command.  For complete details on using qsub, see the online manual page, which can be viewed by executing the command:

man qsub

An important consideration when creating your PBS script is file input and output. If your program reads its input files frequently, it is worth adding commands to your PBS script that create a local directory in /tmp (mkdir /tmp/$PBS_JOBID) and copy your files to that directory before starting your program.

PBS collects STDOUT and STDERR from your program, and we highly recommend letting PBS handle your output rather than redirecting it to your home directory. When PBS handles the output, it writes your files to the local disk while the job is running, rather than to the remote disk mounted on /home, and copies them to your home directory at the end of the job. Writing to the local disk improves the performance of your program (/home is accessed over the network, which is almost always slower than the disk in the compute node), but it does mean that you cannot see the output from your program while it is running. Also, if there is a problem with the remote file system while your program is running, your program will be unaffected and will continue to run.

With parallel programs, keep in mind that the local disk is local to each compute node, so if each task needs to read and write files, you need to distribute them to, or gather them from, all of the compute nodes as necessary. Finally, if your program uses scratch files, it is very worthwhile to set the scratch directory to a local disk (with parallel programs, make sure that the scratch data isn’t shared between tasks, or this won’t work).
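The staging steps described above can be sketched as two small shell helpers. This is only an illustration: the function names stage_in and stage_out are ours, and ~/your_stuff and ./your-program are placeholders for your own directory and executable.

```shell
#!/bin/sh
# stage_in <input_dir> <scratch_dir>
# Create the node-local scratch directory and copy the input files into it.
stage_in() {
    mkdir -p "$2" && cp -r "$1"/. "$2"/
}

# stage_out <scratch_dir> <result_dir>
# Copy everything back to shared storage, then remove the scratch directory.
stage_out() {
    mkdir -p "$2" && cp -r "$1"/. "$2"/ && rm -rf "$1"
}

# Inside a job script the helpers might be used like this:
#   SCRATCH=/tmp/$PBS_JOBID
#   stage_in "$HOME/your_stuff" "$SCRATCH"
#   cd "$SCRATCH" && ./your-program
#   cd && stage_out "$SCRATCH" "$HOME/your_stuff"
```

For a parallel job, the same steps would have to be repeated on every compute node, since /tmp is local to each node.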

In general, on the different clusters you should submit to the route queue – this will feed your job into the most appropriate queue so you don’t have to worry about queues changing definition. However, if you want to submit to a specific queue, you should look at the output of qstat -q to see what the queue definitions and limits are.

You can request specific attributes, such as the number of nodes, memory, or job runtime. Memory requests should be set per process using the pmem attribute. Set the walltime to a value close to, but slightly longer than, the time you expect the job to run. To request 2 nodes with 2 processors each, with each process using a maximum of 2gb of memory, for 1 hour:

#PBS -l nodes=2:ppn=2,pmem=2000mb,walltime=1:00:00

Request memory in mb rather than gb, and treat 1gb as 1000mb. For example, our 2gb nodes actually have only about 2025mb available after system usage, so a request for 2gb (2048mb) will never be honored. In a nutshell, round down a little for your memory approximations.
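As a rough sketch of this rule, a small helper can convert a nominal memory size in gb into the mb value to put in your request, counting 1gb as 1000mb. The name pmem_mb is ours and is not part of PBS.

```shell
#!/bin/sh
# pmem_mb <gigabytes>
# Convert a nominal size in gb to an mb request, counting 1gb as 1000mb
# so that the request stays below what the node actually has available.
pmem_mb() {
    echo "$(( $1 * 1000 ))mb"
}

# Print a resource request line for 2 nodes x 2 processors, 2gb per process.
echo "#PBS -l nodes=2:ppn=2,pmem=$(pmem_mb 2),walltime=1:00:00"
```

The printed line uses pmem=2000mb, which fits on a node with about 2025mb available, whereas 2048mb would not.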

You need to estimate how long your job will run. If you do not specify the wall clock time required by your run (e.g. walltime=45:00), PBS will terminate your job after 15 minutes. However, if you specify an excessively long runtime, your job may be delayed in the queue longer than it should be. Therefore, please attempt to estimate your wall clock runtime accurately. A modest amount of overestimation (10-20%) is probably ideal.
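The padding arithmetic can be sketched in shell as follows. The helper name padded_walltime and the default padding of 15% are our own choices for illustration; pick whatever value in the 10-20% range suits your job.

```shell
#!/bin/sh
# padded_walltime <estimated_minutes> [pad_percent]
# Print a walltime value (HH:MM:SS) that is the estimate plus the padding.
padded_walltime() {
    est_min=$1
    pad=${2:-15}
    total_sec=$(( est_min * 60 * (100 + pad) / 100 ))
    printf '%02d:%02d:%02d\n' \
        $(( total_sec / 3600 )) \
        $(( (total_sec % 3600) / 60 )) \
        $(( total_sec % 60 ))
}

padded_walltime 40       # 40 minutes + 15% -> 00:46:00
padded_walltime 50 20    # 50 minutes + 20% -> 01:00:00
```

The result can be pasted into the resource request, for example walltime=00:46:00.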

PBS scripts are rather simple. An MPI example for user your-user-name (using 14 processes):

Example: MPI Code

#!/bin/sh

#PBS -S /bin/sh

#PBS -N your-mpi-job

#PBS -l nodes=7:ppn=2,mem=1gb,walltime=1:00:00

#PBS -q route

#PBS -M YourEmailAddressGoesHere

#PBS -m abe

#PBS -V

#

echo "I ran on:"

cat $PBS_NODEFILE

cd ~/your_stuff

# Use mpirun to run with 7 nodes for 1 hour

mpirun -np 14 ./your-mpi-program

The PBS script parameters are as follows:

#PBS -N your-mpi-job

Name of the job in the queue is “your-mpi-job”. This can be anything as long as it is less than 13 characters long; you should make it descriptive so you know which of your jobs are running and queued.

#PBS -l nodes=7:ppn=2,walltime=1:00:00,pmem=1gb

Reserve 7 machines with 2 processors each (14 processors), each process using 1GB of memory, for 1 hour. Note that pmem, the per-process memory request, is different from the mem used in the example script above.

#PBS -S /path/to/shell

Run the script with the specified shell, /bin/sh in the example (see below).

#PBS -q default

Submit to the queue named default.

#PBS -M YourEmailAddressGoesHere

Email me at this address.

#PBS -m abe

Email me when the job aborts, begins, and ends.

#PBS -j oe

Join your stdout and stderr output into one file, to be placed in your home directory.

The MPI (mpirun) parameters are as follows:

-np    Number of processes.

-stdin <filename>    Use “filename” as standard input.

-t   Test but do not execute.
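Rather than hard-coding the value passed to -np, you can derive it from the node file. The sketch below assumes that $PBS_NODEFILE contains one line per allocated processor (so a node requested with ppn=2 is listed twice); check the node file printed by your own job before relying on this. The helper names count_procs and count_nodes are ours.

```shell
#!/bin/sh
# count_procs <nodefile>
# Number of processor slots: one per line of the node file (assumption).
count_procs() {
    wc -l < "$1" | tr -d ' '
}

# count_nodes <nodefile>
# Number of distinct compute nodes listed in the node file.
count_nodes() {
    sort -u "$1" | wc -l | tr -d ' '
}

# In a job script this might be used as:
#   mpirun -np "$(count_procs "$PBS_NODEFILE")" ./your-mpi-program
```

With nodes=7:ppn=2 and the assumption above, count_procs would report 14 and count_nodes would report 7.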

Example: OpenMP Code

If you’re running OpenMP code (with 1 or 2 threads on these machines):

#!/bin/sh

#PBS -S /bin/sh

#PBS -N your-openmp-job

#PBS -l nodes=1:ppn=2,mem=1gb,walltime=90:00

#PBS -q route

#PBS -M YourEmailAddressGoesHere

#PBS -m abe

#PBS -V

#

echo "I ran on:"

cat $PBS_NODEFILE

#

# Create a local directory to run and copy your files to local.

# Let PBS handle your output

mkdir /tmp/${PBS_JOBID}

cd /tmp/${PBS_JOBID}

cp ~/your_stuff/* .

export OMP_NUM_THREADS=2

./your-openmp-program

#Clean up your files

cd

/bin/rm -rf /tmp/${PBS_JOBID}

You may find it necessary to add the following to OpenMP jobs, should you run low on stack space due to the default stack size of 2 MB:

export MPSTKZ=8M

Example: Serial Code

If you have a serial code (e.g. octave), just set ‘nodes=1’.

For example:

#PBS -N your-serial-job

#PBS -l nodes=1,walltime=24:00,pmem=1gb

#PBS -q route

#PBS -M YourEmailAddressGoesHere

#PBS -m abe

#PBS -V

#

# Create a local directory to run and copy your files to local.

# Let PBS handle your output

mkdir /tmp/${PBS_JOBID}

cd /tmp/${PBS_JOBID}

cp ~/your_stuff/* .

octave < input.m > out.mat

#Clean up your files

cd

# Retrieve your output

cp /tmp/${PBS_JOBID}/* ~/your_stuff

/bin/rm -rf /tmp/${PBS_JOBID}

In this script, stdout and stderr will be directed into the file JobName.o##, where JobName is the name specified by the -N flag in the script file.

To submit a PBS script simply type:

qsub your-scriptname

where your-scriptname is the name of your PBS script. Note that PBS runs your script under your shell unless told otherwise (see the -S option above).

To check the status of your job in the queue, type:

qstat your-job-id

To see all jobs in the queue, type:

qstat -a

To see detailed info on each job, type:

qstat -f

We use the Maui scheduler to implement various scheduling requirements.

If you realize that you made a mistake in your script file, or if you’ve modified your program since you submitted your job, and you want to cancel the job, first get the “Job ID” by typing qstat, then delete the job with qdel. If you encounter an error while using qdel, add the -W force flag.

For example:

qdel [-W force] 203 – if running on mnc2

To see the names of the available queues and their current parameters, type:

qstat -q

The notable parameters in the output are the queue names (in the Queue column) and the wall clock time limits (in the Walltime column).

You can create dependency trees in order to satisfy different requirements for your workflow. For example, you might need job B to become eligible to run only after job A completes successfully. You could set up such a dependency by:

qsub job1

(system returns PBS job id xxx)

qsub  -W depend=afterok:xxx job2

You can do similar operations with job arrays:

qsub jobArrayjob1

(system returns PBS job id xxx[])

qsub  -W depend=afterokarray:xxx[] jobArrayjob2

There are several dependency types (before, after, afterok, afternotok, etc.).  For more examples, see the qsub manpage or

http://www.clusterresources.com/torquedocs21/commands/qsub.shtml
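The two-step afterok submission can also be scripted so that the job ID does not have to be copied by hand. The following is a minimal sketch that relies on qsub printing the new job ID on standard output; the function name submit_after_ok is ours.

```shell
#!/bin/sh
# submit_after_ok <first_script> <second_script>
# Submit the first script, then submit the second so that it becomes
# eligible to run only if the first one completes successfully.
# Prints the job ID of the second job.
submit_after_ok() {
    first_id=$(qsub "$1") || return 1
    qsub -W depend=afterok:"$first_id" "$2"
}

# Usage on the cluster:
#   submit_after_ok job1 job2
```

If the first qsub fails, the function stops without submitting the second job.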

Job arrays in PBS are an easy way to submit multiple similar jobs. The only difference in them is the array index, which you can use in your PBS script to run each task with a different set of parameters, load different data files, or any other operation that requires a unique index.

To submit a PBS job array with 10 elements, use the -t option in your PBS script, like:

#PBS -t 1-10

If your job ID is 5432, the elements in the array will be 5432[1], 5432[2], 5432[3], …, 5432[10]. In each element’s script, the environment variable PBS_ARRAYID is set to that element’s index (1 through 10).

Note that each array element will appear as a separate job in the queue, and the normal scheduling rules apply to each element.

You can delete individual array elements from the queue by specifying the element number, like:

qdel 5432[4]

You can also delete the entire array by using the base job number as the argument to qdel:

qdel 5432[]

which will delete all remaining array elements.

To view the status of the entire job array, run qstat with the -t option:

qstat -t  5432[]

A sample PBS script that uses a job array to run the same executable with 10 different input files, which you have already named file-1 through file-10, would look like:

#!/bin/sh

#PBS -N yourjobname

#PBS -l nodes=1,walltime=00:05:00

#PBS -S /bin/sh

#PBS -M YourEmailAddressGoesHere

#PBS -t 1-10

#PBS -m abe

#PBS -q route

#PBS -j oe

#PBS -V

cd /path/to/my/program

./myprogram -input=file-${PBS_ARRAYID}

This will run myprogram 10 times, once per array element on one node each, with input files named file-1, file-2, file-3, …, file-10.
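The array index can also select a set of parameters instead of a file name. As a sketch, assume a hypothetical file params.txt with one line of arguments per array element; the helper below (nth_line is our name) prints the line that matches the index.

```shell
#!/bin/sh
# nth_line <file> <index>
# Print line number <index> of <file> (lines are counted from 1).
nth_line() {
    sed -n "${2}p" "$1"
}

# In an array job script this might be used as:
#   ARGS=$(nth_line params.txt "$PBS_ARRAYID")
#   ./myprogram $ARGS
```

ARGS is left unquoted on purpose in the usage example so that a line containing several arguments is split into separate words.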

For more information on job arrays, please see the Cluster Resources web page on them: http://www.clusterresources.com/torquedocs21/2.1jobsubmission.shtml

 

Job Limits

At the NNIN/C@UM, we strive to promote equitable access to our resources. Because all jobs run on MNC2 are submitted to a batch queuing system, we enforce this fairness by controlling several parameters to the scheduling algorithm used by the queuing system.

When a job is submitted to the queuing system, the queuing system looks for free nodes on which to run it. If it can’t find any nodes that are suitable for your job, your job stays in the queued state (in PBS this is denoted by the letter “Q”). While your job is queued, its position in the queue is adjusted relative to the other jobs in the queue based on two primary factors: limits and priority.

In the general access partition on each cluster, we limit the number of cpus that any one person can use at a time. On MNC2 this is 32 cpus. However, to get the maximum use out of the MNC2, these limits are soft, so if no one else is waiting, it is possible for one person to use more nodes than the soft limit.

We also limit the number of jobs that are considered for scheduling. We will schedule 2 jobs per person at a time. This means that if user one submits 11 jobs with job IDs 110 through 120, and user two later submits two jobs with job IDs 121 and 122, the scheduler will consider only jobs 110, 111, 121, and 122 for scheduling. When one of user one’s jobs starts, that user’s next job becomes eligible for scheduling; while a job is waiting to become eligible, it does not accumulate priority.
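The effect of this per-person limit can be illustrated with a short filter. This is only a simplified model of the rule described above, not how the scheduler is implemented; the function name eligible_jobs and the "jobid user" input format are assumptions made for the example.

```shell
#!/bin/sh
# eligible_jobs <per_user_limit>
# Read "jobid user" lines in queue order on standard input and print the
# job IDs that fall within each user's limit.
eligible_jobs() {
    awk -v limit="$1" '{ if (++seen[$2] <= limit) print $1 }'
}

# Three queued jobs for each of two users, with a limit of 2 per user.
printf '110 one\n111 one\n112 one\n121 two\n122 two\n123 two\n' | eligible_jobs 2
```

Only the first two jobs of each user are printed; the rest wait until one of that user’s eligible jobs starts.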

To further promote fair use of the NNIN/C@UM resources, jobs in the queued state are ordered by their priority. The priority of a job is computed from several factors:

  • The amount of time the job has been in the queue; the longer the time, the greater the priority. However, only one of your jobs at a time accumulates priority based on how long it has been in the queue.
  • Your usage over the past 30 days. If you have used a large amount of wallclock time on the cluster in the past month, people who have used less will receive a higher priority. This is known as “fairshare” and attempts to ensure that the widest possible range of users will have access to the NNIN/C@UM resources.

There are exceptions to these rules, the largest being that for people who have purchased nodes or dedicated time on the cluster the limits do not apply. Fairshare still applies to promote fair use within the group of people with access to the

 

Visualization and Analysis

VisIt is a free interactive parallel visualization and graphical analysis tool for viewing scientific data on Unix and PC platforms. Users can quickly generate visualizations from their data, animate them through time, manipulate them, and save the resulting images for presentations. VisIt contains a rich set of visualization features so that you can view your data in a variety of ways. It can be used to visualize scalar and vector fields defined on two- and three-dimensional (2D and 3D) structured and unstructured meshes. VisIt was designed to handle very large data set sizes in the terascale range and yet can also handle small data sets in the kilobyte range.

 

Tuning and Optimization

The most important goal of performance tuning and optimization is to reduce a program’s wall-clock execution time. Reducing resource usage in other areas, such as memory or disk requirements, may also be a tuning goal. Performance analysis tools are essential to optimizing an application’s performance.

 

Storage & Backup

  • Storage

NNIN/C@UM provides two levels of data storage. Users are strongly urged to store vital files in archival storage because online files can be lost during a machine crash, none of the directories are backed up, and files on some machines are purged.

Storage on MNC2 is the following:

1. /tmp is a local directory that’s unique to each node, and thus, not shared.

2. /home is shared across the entire cluster. Everyone shares this space and it is limited, so please keep only files you need there for current jobs.

  • Backup

Data is not backed up. None of the data on the NNIN/C systems is backed up. The data that you keep in your home directory, /nobackup, /tmp or any other filesystem on any of the NNIN/C systems is exposed to immediate and permanent loss at all times. You are responsible for mitigating your own risk. We suggest you store copies of hard-to-reproduce data on systems that are backed up or at least systems other than the NNIN/C systems.

Your usage is tracked and may be used for reports. We track a lot of job data and store it for a long time. We use this data to generate usage reports and look at patterns and trends. We may report this data, including your individual data, to your adviser or other administrator or supervisor.

  • Purge

Purge policies are subject to change and, when revised, are announced in news postings and status e-mails. Once files are purged, there is no possibility of recovering them.