Policies

Golden Rules

All computation tasks must be launched through the scheduler.
Use resources efficiently.
SSH access to a compute node is allowed for debugging and monitoring purposes only, and only while the user has a job running on that node.

Scheduling

Jobs are scheduled by job priority; this priority is computed from group/project fairshare usage.
Strict_ordering: runs jobs exactly in the order determined by the scheduling option settings, i.e. runs the "most deserving job" as soon as possible.
Backfill: allows smaller jobs to be scheduled around more deserving jobs.
Sharing: jobs share nodes by default, unless specified otherwise by the queue (see the queue large) or in the job requirements.
Quarterly maintenance window (dedicated time): to be confirmed 1 month prior to the maintenance.

Jobs

Jobs that compromise the optimal overall functioning of the cluster, or that negatively impact other jobs through abnormal resource usage, will be killed.
HPC administrators will wait 12 hours before performing a non-critical intervention that stops jobs.
- Jobs whose walltime exceeds 12 hours must be re-runnable.
- Re-runnable means that the job can be terminated and restarted from the beginning without harmful side effects (PBS Reference guide terminology).
- This must be declared in your PBSpro script with the directive:
  #PBS -r y
- In case of cancellation (node crash, operator intervention, ...), re-runnable jobs will be automatically resubmitted by PBSpro. They then fall into one of the following cases:
  - Worst case:
    - 1. The job is actually not re-runnable (for instance, it is influenced by the output of a previous run in an uncontrolled manner) and will most probably crash, possibly corrupting generated results. The job's owner knows and accepts this;
    - 2. The job is not restartable but is not influenced by previous output files, so it will rerun automatically from the beginning. The job's owner knows and accepts this.
  - Ad hoc case: The job's owner is ready to take manual action to make sure the input and previous output files are adapted appropriately for the restart. In that case, insert the following at the beginning of the PBS script, just after the PBS directives:
    qhold -h u $PBS_JOBID
    In the case of cancellation and rerun, the PBS server will put the job on hold. Once you have adapted your data, you can release the job with the qrls command.
  - Ideal case: The job itself adapts the input files and/or checks the output generated by a previous run appropriately. In the case of cancellation and rerun, the job will restart and continue automatically from the last checkpoint.
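The ideal case above could be sketched as follows. This is only a minimal sketch under stated assumptions, not a site-provided template: the walltime value, the checkpoint naming scheme (checkpoint_NNNN.dat with zero-padded numbers), the helper latest_checkpoint, and the program ./my_solver with its --restart and --input options are all hypothetical and must be replaced by whatever your application actually provides. Only the directive "#PBS -r y" and the variables PBS_JOBID and PBS_O_WORKDIR come from PBS itself.

```shell
#!/bin/bash
#PBS -r y
#PBS -l walltime=24:00:00
# The walltime value above is only an example.

# Print the newest checkpoint file found in directory $1, or an empty line
# if there is none. Assumes hypothetical names checkpoint_NNNN.dat with
# zero-padded numbers, so that glob (lexical) order matches creation order.
latest_checkpoint() {
    local newest="" f
    for f in "$1"/checkpoint_*.dat; do
        [ -e "$f" ] && newest="$f"
    done
    printf '%s\n' "$newest"
}

# Run the work only inside a PBS job, where PBS sets PBS_JOBID.
if [ -n "${PBS_JOBID:-}" ]; then
    cd "$PBS_O_WORKDIR" || exit 1
    ckpt=$(latest_checkpoint .)
    if [ -n "$ckpt" ]; then
        # Rerun after a cancellation: continue from the last checkpoint.
        ./my_solver --restart "$ckpt"      # hypothetical program and option
    else
        # First run: start from the original input.
        ./my_solver --input input.dat      # hypothetical program and option
    fi
fi
```

With this structure the same script serves both the first run and any automatic rerun, so no manual action is needed after a cancellation.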
Running jobs on the front-end node is not allowed.
Using Haswell and Ivy Bridge nodes in the same job is not allowed.

Project

Computing hours and disk credits are allocated by project. A project always has a start date and an end date.
- A project can be extended on request, with justification.
- If the project is not extended:
  - At the end date, the project is closed. New jobs are not allowed.
  - Three months after the end of the project, the remaining data in the project directories will be cleared.
Job submission is only allowed through a project. For accounting purposes, and to work with the resources (walltime, ncpus, disk space, …) allocated to the project <project_name>, add the following directive to your PBSpro script:
#PBS -W group_list=<project_name>
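Because submission is only allowed through a project, it can help to check a script for this directive before submitting it. The following is a minimal sketch, not a tool provided on the cluster: the helper names has_project_directive and submit_with_project_check are hypothetical, and the check only verifies that a group_list attribute is present, not that the project name is valid. qsub is the standard PBS submission command.

```shell
#!/bin/bash

# Succeed if job script $1 contains a "#PBS -W ... group_list=<name>" directive.
# group_list may follow other comma-separated -W attributes on the same line.
has_project_directive() {
    grep -Eq '^#PBS[[:space:]]+-W[[:space:]]+([^[:space:]]+,)?group_list=[^[:space:],]+' "$1"
}

# Submit job script $1 only when the project directive is present.
submit_with_project_check() {
    if has_project_directive "$1"; then
        qsub "$1"
    else
        echo "error: $1 has no '#PBS -W group_list=<project_name>' directive" >&2
        return 1
    fi
}
```

For example, after sourcing these functions, "submit_with_project_check myjob.pbs" would refuse to submit a script (here a hypothetical myjob.pbs) that lacks the directive.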