Zenobe User's Guide

System Overview

 

  • Manufacturer: Bull Atos
  • 15 racks; 19 chassis equipped with 171 BullX B510 blades and 14 chassis equipped with 120 BullX B520 blades (each blade hosting 2 nodes)
  • 407.7 Tflop/s peak cluster
  • 332.5 Tflop/s LINPACK
  • Listed in the TOP500 as "zenobe"
  • Total CPU cores: 13968
  • Total memory: 39168 GB

 

Nodes details

  Blade type                  BullX B520                       BullX B510
  Node type                   Haswell                          Ivy Bridge
                              2 x Intel E5-2680v3              2 x Intel E5-2697v2
  # cores per node            24                               24
  # nodes                     240                              342
  Total # of cores            5760                             8208
  Processor speed             2.5 GHz                          2.7 GHz
  Instruction set extensions  AVX2                             AVX
  Cache                       30 MB shared by 12 cores         30 MB shared by 12 cores
  Memory size per node        64 GB (218 fit)                  64 GB
                              128 GB (18 fat)
                              256 GB (4 xfat)
  Memory size per core        2.67 GB (fit)                    2.67 GB
                              5.34 GB (fat)
                              10.68 GB (xfat)
  Memory type                 DDR4 @ 2133 MHz (fit & fat)      DDR3 @ 1866 MHz
  Memory bandwidth            68.2 GB/s read/write             59.7 GB/s read/write
  Ethernet                    Gigabit Ethernet                 Gigabit Ethernet
  Host Channel Adapter        Mellanox Connect-IB single FDR   Mellanox ConnectX-3 single FDR
  Local disk                  1 SSD 128 GB                     1 SSD 64 GB

 

 

  • Manufacturer: Bull Atos
  • 2 R421-E3 servers (the cluster's 2 graphics nodes)

 

Nodes details

  Blade type                  Bullx R421-E3
  Node type                   Ivy Bridge, 2 x Intel E5-2697v2
  # cores per node            24
  # nodes                     2
  Total # of cores            48
  Processor speed             2.7 GHz
  Cache                       30 MB shared by 12 cores
  Memory size per node        256 GB
  Memory size per core        10.68 GB
  Memory type                 DDR3 @ 1866 MHz
  Memory bandwidth            59.7 GB/s read/write
  Ethernet                    Gigabit Ethernet
  Host Channel Adapter        Mellanox ConnectX-3 single FDR
  PCIe                        2 slots PCIe 3.0 x16 dedicated to graphics cards
                              1 slot PCIe x8
  GPU                         2 x NVIDIA K40
  Local disk                  2 x SATA3 500 GB

NVIDIA K40 details

  Processor                   1 GK110B processor
  # cores                     2880
  Frequency                   Base clock: 745 MHz
                              Boost clocks: 810 MHz and 875 MHz
  Memory size                 12 GB
  Memory I/O                  384-bit GDDR5

 

Gigabit Ethernet Network

The Ethernet network interconnects all the entities of the cluster and is used for administrative tasks (installation, management, supervision, NFS, PBS communications,...).

Infiniband Network

The InfiniBand network comprises 63 switches: one per chassis (33 in total), plus 10 Voltaire QDR ISR-036 and 20 Mellanox QDR IS5030 switches.

The InfiniBand network is composed of 3 levels: the Leaf level (L1), the Core level (L2) and the Top level (L3).

The highest level, L3, connects two non-blocking fat-trees:

  • FT1: the 14 Haswell chassis;
  • FT2: the 19 Ivy Bridge chassis + frontal nodes + GPFS nodes.

Inside each fat-tree, communications between entities are non-blocking; between the two fat-trees, communications are blocking. The link between the L3 switches has been sized to satisfy the GPFS traffic, not to support computations spanning the two blocks.

[Figure: InfiniBand network topology]

Storage

  • Network File System (NFS) on a NetApp FAS2040 equipped with 3 DS4243 drawers, with a total of 84 disks for 116 TB net space split across multiple volumes.
  • IBM's General Parallel File System (GPFS) on 4 Bull R423-E2 IO servers connected to 4 LSI 2600-HD SAN arrays equipped with a total of 240 disks for 350 TB net space on a single volume.

Software

  • Operating System: Red Hat Enterprise Linux and CentOS release 6.5
  • Job Scheduler: Altair PBS Professional
  • Web Portal: BULL eXtreme Computing Studio (XCS)
  • Compilers: Intel and GNU C, C++ and Fortran
  • MPI: Intel MPI and OpenMPI

Access and Accounting

Getting access

Please refer to this page.

Web Portal login

Once you have received your credentials, you should be able to access zenobe through the portal.hpc.cenaero.be web interface with your zenobe login and password.

Console login

  • Academic users

    Once your access has been validated by the CÉCI administrators, the Tier-1 machine's administrators will create your account and send you an e-mail notification; you should then be able to connect to the HPC system using an SSH client. All of the information on how to connect is available on the CÉCI website.

  • Other users

    Once you have received your credentials, you should be able to access zenobe through the hpc.cenaero.be SSH gateway.

    Linux and Mac OS users
    • First, connect to the hpc.cenaero.be SSH gateway using the provided login and password:
      ssh mylogin@hpc.cenaero.be
    • then, connect to zenobe's frontal node with the following command (see also the SSH configuration tip after this list):
      ssh mylogin@zenobe
    Windows users
    • First, use an SSH client (like PuTTY)
      • The host is hpc.cenaero.be
      • Enter the provided login and password
    • Once logged on the hpc.cenaero.be gateway, you can connect to the frontal node by using the following command:
      ssh mylogin@zenobe
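Tip for Linux and Mac OS users: the two-step login can be automated with an SSH client configuration such as the following sketch (this assumes OpenSSH 7.3 or later for the ProxyJump option; replace mylogin with your own login). Add it to ~/.ssh/config:

    # hypothetical host aliases; adapt the names and login to your setup
    Host cenaero-gw
        HostName hpc.cenaero.be
        User mylogin

    Host zenobe
        User mylogin
        ProxyJump cenaero-gw

The frontal node is then reachable in one command: ssh zenobe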

 

Jobs Accounting Metric

Job accounting is based on the R_Wall_Time metric, which accounts for the resources actually mobilized by a job.

The metric definition is: R_Wall_Time = Wall_Time * ncpus_equiv_pjob

where:

  • Wall_Time is the execution time of the job (end_time - start_time - suspend_time);
  • and ncpus_equiv_pjob is defined as follows:
    • if the job runs in an exclusive queue or environment, complete nodes are allocated to the job, and then
      ncpus_equiv_pjob = nodes_pjob * ncpus_pnode
    • else the job runs in shared mode (sharing nodes with other jobs or projects), and then
      ncpus_equiv_pjob = max ( ncpus_pjob , mem_pjob / mem_pcpu )

in which:

  • ncpus_pnode is the number of cores available per node
  • ncpus_pjob and mem_pjob are respectively the total number of cores and total amount of memory requested by the job (respectively resource_list.ncpus and resource_list.mem reported in a qstat -f on the job) and
  • mem_pcpu is the amount of memory available per core depending on the type of node requested, i.e.
    • 2625MB by default (Haswell fit nodes)
    • 5250MB if you requested Haswell fat nodes ( -l model=haswell_fat )
    • 10500MB if you requested Haswell xfat nodes ( -l model=haswell_xfat )

ncpus_equiv_pjob is the resource_used.rncpus value reported in a qstat -f on the job.
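For example, a shared-mode job requesting ncpus=4 and mem=42000MB on fit nodes is accounted as ncpus_equiv_pjob = max ( 4 , 42000 / 2625 ) = 16; if it runs for 2 hours of Wall_Time, its R_Wall_Time is 2 * 16 = 32 hours.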

See also Job Output file to check resource usage of the job.

Reservation or Dedicated Nodes Accounting

The accounted R_Wall_Time is:
(Reservation end_time - Reservation start_time) * Number of Dedicated Nodes * ncpus_pnode
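For example, reserving 2 dedicated 24-core nodes for 48 hours is accounted as 48 * 2 * 24 = 2304 hours, independently of the jobs actually run during the reservation.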

Reporting

Reports are sent weekly and monthly to project managers. They include persistent and scratch storage usage, some job statistics, the R_Wall_Time credit used during the period (per queue and/or globally), and the remaining credit.
Some ratios are also provided:

  • η = Total CPU_Time / ( Total Wall_Time * NCPUS_PJOB )
  • α = Total CPU_Time / ( Total Wall_Time * NCPUS_EQUIV_PJOB )

Example:

======================================================================================
Project created Friday Oct 31 2014
======================================================================================
SCRATCH storage quota    : 400 GiB
SCRATCH storage used     : 196 GiB
Persistent storage quota : 200 GiB
Persistent storage used  : 48 GiB

-------------------------------------------------------------------------------------
All Standard Queues Usage
-------------------------------------------------------------------------------------
R_Wall_Time Credit used this month : 10.6 hours

Job Usage Detail (times in hours)
             # of       Total     Total  Total                  Total         Average
Username     jobs    CPU_Time Wall_Time N_Wall_Time     η R_Wall_Time    α  Wait_Time
----------- ----- ----------- --------- ----------- ----- ----------- ----- ---------
TOTAL           3         8.7       1.0        10.6 82.1%        10.6 82.1%       0.0

user1           3         8.7       1.0        10.6 82.1%        10.6 82.1%       0.0

Job Set Summary (times in hours)
                        Percentil  Percentil  Percentil
               Minimum       P_50       P_75       P_95    Maximum
            ---------- ---------- ---------- ---------- ----------
Ncpus                1          2          2          2         32
CPU_time           0.0        0.0        0.0        0.0        8.7
Wall_time          0.0        0.3        0.3        0.3        0.7
Wait_time          0.0        0.0        0.0        0.0        0.0
-------------------------------------------------------------------------------------

R_Wall_Time Credit allocated            : 31000 hours
R_Wall_Time Credit Valid until          : Thursday Apr 30 2016
R_Wall_Time Credit used previous report : 20128.2 hours
R_Wall_Time Credit used this month      : 10.6 hours
R_Wall_Time Credit used current report  : 20138.8 hours
Remaining R_Wall_Time Credit            : 10861 hours
Percentage R_Wall_Time Credit left      : 35 %

 

Credit management

Project managers are kindly requested to monitor their project's resource usage. Projects with a negative remaining credit will have one month to update their request; after this delay, jobs associated with these projects will be rejected. Academic users can always update their project credit via the following page: https://login.ceci-hpc.be/init-project/. Other users can send an email to it@cenaero.be.

Papers published thanks to results obtained from zenobe must include the following paragraph in the appropriate "Acknowledgement" section:

"The present research benefited from computational resources made available on the Tier-1 supercomputer of the Fédération Wallonie-Bruxelles, infrastructure funded by the Walloon Region under the grant agreement n°1117545."

If the paper is published in French, use the following:

"Les présents travaux ont bénéficié de moyens de calcul mis à disposition sur le supercalculateur Tier-1 de la Fédération Wallonie-Bruxelles, infrastructure financée par la Région wallonne sous la convention n°1117545."

Data Storage organisation

Structure

Home directory

When you log in on the supercomputer, you land in /home/organization/username. This space is only intended for personal data and configuration files.

Projects directory

This data space is reserved for data that persist throughout the lifetime of the project and are potentially common to the project members: typically, specific software or files that need to be kept after a job has completed. To access this space, go to /projects/organization/project_name.

Scratch directory

This space is intended for temporary files only, of two types:

  • User data: data that require more space. Accessible from /SCRATCH/organization/username;
  • Project data: when you run a job, its data should be staged here first. Accessible from /SCRATCH/organization/project_name.

Be careful: data on this storage can be deleted at any time, without warning, by the supercomputer admins.

Nodes local storage

Each compute node has a local SSD disk which is dedicated to administration tasks. The local disk is not available for computation.

Systems details

NetApp

Quotas are set per user on /home and per project (managed by unix group) on /projects. The home directory quotas are set as follows:

Configuration

  Volume name    Default quota (soft/hard)
  /home          30/35 GB
  /home/acad     50/50 GB

The quota for a project must be defined before the project starts: by default, no space is allocated to a project. There is no way to check this quota yourself, so if you have doubts, please contact it@cenaero.be.

Currently, there is no backup on this volume.

Snapshots are taken of /home, /projects and /softs. For /home and /projects, nightly snapshots are taken at midnight and hourly snapshots at 8:00, 12:00, 16:00 and 20:00. The last 6 hourly and the last 2 nightly snapshots are kept. For /softs, only the last nightly snapshot is kept.

GPFS

Configuration

  Volume       Size
  /SCRATCH     350 TB

Quotas are set per project (managed by unix group) and per organizational group. GPFS is reliable but data are not backed up.

On this volume you can check a project's quota yourself. To do so, enter the following command on the front-end:

mmlsquota -g <PROJECTNAME>

The output of this command will look something like this:

Disk quotas for group PROJECTNAME (gid xxxx):

                        Block Limits                           | File Limits
Filesystem type    MB   quota   limit  in_doubt  grace | files  quota   limit in_doubt  grace  Remarks
gpfs       GRP   5082  102400  133120         0   none |   866      0       0        0   none

 

For now, the hard limit (named limit in the output above) is set 30% above the soft limit (named quota in the output above): in the example, 133120 = 1.3 x 102400.

Web Portal

  • The Web Portal is slow: users may have to wait several minutes to obtain the status of their jobs.
  • Reservations are not available.
  • Job arrays are not supported.
  • Job accounting information is not relevant, as it does not take the ncpus_equiv_pjob metric into account (see Accounting).
  • Project disk usage is not relevant.
  • Users can only upload files to the Upload directory located in their home directory.
  • Only TurboVNC or XRV are supported for visualisation sessions.

Software Libraries

User- and project-specific packages

Users can install tools in their home or project directory within their quotas, but it is a good idea to discuss this with the HPC and Infrastructure team first, in order to optimize collaborative work, avoid duplicating applications and save disk space.

Standard Software

Shared tools and applications are installed in the /softs partition.

Note that access to commercial software is restricted (Samcef, Abaqus, Ansys, Fluent, Morfeo, elsA, ...).

New software can be installed with the agreement and collaboration of the HPC and Infrastructure team.

Software Environment Management

The Environment Modules package provides for the dynamic modification of a user's environment via modulefiles.

When you log in, no modules are loaded by default.

To see all available modules, issue the following command: $ module avail

It displays a complete list of modules that are available to be loaded into your environment.

You can load, unload, and even swap/switch modules using the module command:

$ module load <module_name>
$ module unload <module_name>
$ module switch <old_module> <new_module>

 

To see which modules you currently have loaded, issue the following command: $ module list

To display a list of environment variables and other information about an individual module "module_name", invoke the following: $ module show <module_name>

To remove all modules from the environment use: $ module purge

For advanced topics use: $ module help
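For example, a typical session preparing a clean build environment could look like the following sketch (the module names are those used in the sample PBS script later in this guide; check module avail for the versions actually installed):

$ module purge
$ module load compiler/intel/composerxe/2015.5.223
$ module load intelmpi/5.0.3.049/64
$ module list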

 

 

This is a non-exhaustive list of software maintained on Zenobe in /softs.

To check the available supported software, issue the command: $ module avail

Type                Name                           Versions
Compilers           GNU Compilers                  4.1.2, 4.4.4, 4.4.7, 4.6.4, 4.7.4, 4.8.4
                    Intel Compilers                composerxe/2011.7.256, composerxe/2013.2.146,
                                                   composerxe/2013_sp1.1.106, composerxe/2015.1.133

Type                Name                           Versions
Message Passing     OpenMPI - RedHat6              1.4.5, 1.6.5, 1.8.4
Interface           IntelMPI                       4.0.3.008, 4.1.0.024, 4.1.0.030, 4.1.1.036,
                                                   4.1.3.045, 5.0.2.044

Type                Name                           Versions
Parallel Tools      Intel Development Tool Suite   VTune Amplifier XE 2013, 2015.1.1;
                                                   Advisor XE 2013, 2015.1.10;
                                                   Inspector XE 2013, 2015.1.2;
                                                   ITAC 8.1.0.024, 8.1.4.045, 9.0.2.045
                    Allinea                        3.1-21691-Redhat-5.7, 4.1-32296-Redhat-5-7,
                                                   4.2.1-36484-Redhat-6.0
                    Scalasca (+ Cube 4.2.1)        intelmpi-4.1.3-045, openmpi-1.4.5, openmpi-1.6.5
                    TAU                            intelmpi-4.1.3.045, openmpi-1.4.5, openmpi-1.6.5
                    Scorep                         intelmpi-4.1.3.045, openmpi-1.4.5, openmpi-1.6.5
                    PAPI                           5.3 (gcc 4.4.7)
                    Likwid                         3.0.0, 3.1.1

 

 

Compiling and Linking

Software porting:

Intel Composer 2015 update 5 is the recommended compiler (currently the latest Intel compiler installed on zenobe).

  • The instruction set is upward compatible. Therefore:
    • Applications compiled with -xAVX can run on Ivy Bridge or Haswell processors.
    • Applications compiled with -xCORE-AVX2 can run only on Haswell processors.
  • Generating optimized code for Haswell processors:
    • If your goal is to achieve the best performance from the Haswell processors, use the latest Intel compiler with one of the following Haswell-specific optimization flags:
      -axCORE-AVX2
      -xCORE-AVX2
      Running an executable built with either the -xCORE-AVX2 flag or the -axCORE-AVX2 flag on Ivy Bridge processors will result in the following error:
      Fatal Error: This program was not built to run in your system. Please verify that both the operating system and the processor support Intel(R) AVX2
  • Generating code for any processor type:
    • If your goal is to create a portable executable file that can run on any processor type, you can choose one of the following approaches:
      • Use none of the above flags (which defaults to -mSSE2)
      • Use -xAVX
      • Use -O3 -axCORE-AVX2 -xAVX (with the latest Intel compiler)
  • Note that libraries built by ourselves (hdf5, ...) are not yet optimized for Haswell processors.
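As an illustration, these targeting options translate into the following compilation commands (a sketch only; icc and the source file main.c are placeholder assumptions):

icc -O3 -xAVX -o app_avx main.c                 # runs on Ivy Bridge and Haswell
icc -O3 -xCORE-AVX2 -o app_avx2 main.c          # runs on Haswell only
icc -O3 -axCORE-AVX2 -xAVX -o app_multi main.c  # one binary: AVX baseline plus an AVX2 code path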

MPI

Intel MPI and OpenMPI libraries are available on Zenobe.

OpenMPI releases are compiled with gcc 4.1.2, except those tagged "-el6", which are built with gcc 4.4.7. Use the module command to list the different releases.

The recommended library is Intel MPI 4.1.3.045.

Prior to using an MPI library, you will have to load an appropriate module for a supported compiler suite.

Debug

gdb (from the gcc 4.4.7 tool suite), Intel idb (provided with the Intel Composer XE tool suite) and Allinea DDT are available for debugging purposes. Use the module command to list the different releases.

Performance analysis

Allinea MAP and the Intel Cluster Studio XE tool suite are available for tuning purposes. Use the module command to list the different releases.

TAU, PAPI, Scalasca and Likwid installations are not finalized yet.

Queueing and Running Jobs

Golden Rules

  • Any computation task must be launched through the scheduler.
  • Use resources efficiently.
  • ssh to a compute node is allowed for debugging and monitoring purposes, and only when the user has a running job on that node.

Scheduling

  • Jobs are scheduled by priority; this priority is computed from group/project fairshare usage.
  • Strict ordering: jobs run exactly in the order determined by the scheduling option settings, i.e. the "most deserving job" runs as soon as possible.
  • Backfill: allows smaller jobs to be scheduled around more deserving jobs.
  • Sharing: jobs share nodes by default, except if exclusivity is explicitly specified in the queue (see the large queue) or in the job requirements.
  • Quarterly maintenance window (dedicated time): to be confirmed 1 month prior to the maintenance.

Jobs

  • Jobs that jeopardize the optimal global functioning of the cluster or that negatively impact other jobs through abnormal resource usage will be killed.
  • HPC administrators will give 12 hours' notice before performing a non-crucial intervention that stops jobs.
    • We therefore require that jobs whose walltime exceeds 12 hours be re-runnable.
    • Re-runnable means that the job can be terminated and restarted from the beginning without harmful side effects (PBS Reference Guide terminology).
    • This must be materialized in your PBSpro script by the directive:
      #PBS -r y
    • In case of cancellation (node crash, operator intervention, ...), re-runnable jobs will be automatically resubmitted by PBSpro and may fall into one of the following cases:
      • Worst case:
        1. The job is actually not re-runnable (for instance, it is influenced by the output of a previous run in an uncontrolled manner) and will most probably crash (possibly corrupting generated results). The job's owner knows and accepts this;
        2. The job is not restartable but is not influenced by previous output files, so it will rerun automatically from the beginning (and the job's owner knows and accepts this);
      • Ad hoc case: the job's owner is ready to take manual actions to make sure the input and previous output files are adapted adequately for the restart. In that case, insert at the beginning of the PBS script, just after the PBS directives:
        qhold -h u $PBS_JOBID
        In case of cancellation and rerun, the PBS server will put the job on hold. At this step, after adapting your data, you can release the job with the qrls command.
      • Ideal case: the job modifies its input files and/or checks the output generated by a previous run adequately. In case of cancellation and rerun, the job will restart and continue automatically from the last checkpoint.
  • Running jobs on the front-end node is not allowed.
  • Using Haswell and Ivy Bridge nodes in the same job is not allowed.

Project

  • Computing hours and disk credits are allocated per project. A project always has a start date and an end date.
    • A project can be extended on demand, with justification.
    • If not:
      • At the end date, the project is closed. New jobs are not allowed.
      • Three months after the end of the project, the remaining data in the project directories will be cleared.
  • Job submission is only allowed through a project. In order to do accounting and to work with the resources (walltime, ncpus, disk space, …) allocated to the project <project_name>, add the following directive to your PBSpro script:
    #PBS -W group_list=<project_name>

 

The different queues are organized to follow the different hardware and usages:

  • large: a routing queue dispatching jobs to the execution queue large_ivy. It addresses jobs only to Ivy Bridge nodes and is dedicated to large massively parallel jobs (a minimal submission sketch follows this list). The following limits apply to this queue:
    • Job placement on nodes is exclusive.
    • Minimum number of cpus per job = 96
    • Maximum number of cpus per job = 4320
    • Maximum walltime = 24 hours
  • main: the default queue.
    • The following limits apply to the main queue:
      • Maximum walltime = 168 hours
      • Maximum number of cpus per job = 192
    • The main queue is a routing queue dispatching jobs to the execution queues:
      • main_has: the default execution queue. It addresses jobs only to fit or fat Haswell nodes (accounting based on fit nodes). No extra directive is needed.
        • The following limits apply to this execution queue:
          • Maximum number of cpus per user = 480
          • Maximum number of cpus per project = 960
      • main_has_fat: addresses jobs only to fat Haswell nodes (accounting based on fat nodes). To use this queue, add:
        #PBS -l model=haswell_fat
      • main_has_xfat: addresses jobs only to xfat Haswell nodes (accounting based on xfat nodes). To use this queue, add:
        #PBS -l model=haswell_xfat
    • Do not submit jobs directly to the execution queues.
    • Job placement on nodes in the main queue is shared.
  • visu: a routing queue dispatching visualisation jobs to the 2 graphics nodes. Each visualisation session can require at most 12 cores, 126000MB of memory and 1 GPU.
    • Execution queue:
      • visu2: addresses jobs to the available GPU of each virtual node.
    • Do not submit jobs directly to the execution queues.
  • Restricted access queues: zen, diags, or ded_ queues for dedicated time, ...
  • Reservations can be made by the support team on demand.
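As a minimal sketch (my_job.pbs and the chunk shape are placeholders to adapt to your application), a job targeting the large queue could start with:

#PBS -q large
#PBS -l select=4:ncpus=24:mpiprocs=24
#PBS -l walltime=12:00:00

Four 24-core Ivy Bridge chunks give 96 cores, the minimum accepted by this queue.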

Check the queue properties and limits with the command:
qstat -Qf

The cstat command (non-standard) displays the distribution of jobs across nodes and queues.

 

The Zenobe supercomputer uses the Portable Batch System (PBS) from Altair for job submission, monitoring and management. The current release is 13.1.2.

Batch and Interactive jobs are available. Interactive jobs can be particularly useful for developing and debugging applications.

Basic Commands

The most commonly used PBS commands, qsub, qdel, qhold, qalter and qstat, are briefly described below. These commands are run from the login node to manage your batch jobs.
See the PBS Reference Manual for a list of all PBS commands.

qsub

The qsub command is used to submit a job to the batch system.

  • To submit a batch job script:
    qsub [options] my_script
  • To submit a batch job executable:
    qsub [options] my_executable [my_executable_arguments]
  • To submit an interactive job:
    qsub -I [options]
  • To submit a job array:
    qsub -J <num-range> [options] script or executable

Most common options:

Input/output
  -o path                   standard output file
  -e path                   standard error file
  -j oe (eo)                joins standard error to standard output (or standard output to standard error); oe is the default

Queue
  -q <queue_name>           runs the job in queue <queue_name>

Notification
  -M email_address          notifications will be sent to this email address
  -m b|e|a|n                notifications on the following events: begin, end, abort, no mail (default).
                            Do not forget to specify an email address (with -M) if you want these notifications.

Resources
  -l walltime=[hours:minutes:]seconds
                            requests real time; the default is 12 hours
  -l select=N:ncpus=NCPU    requests N chunks of NCPU slots (= CPU cores) each (default for NCPU: 1)
  -l select=N:mem=size      requests N chunks of size bytes of memory each (default: 1GB)
  -l pmem=size              requests a maximum of size bytes of memory for each process of the job
  -l model=<model_type>     requests fit, fat or xfat Haswell nodes, when allowed in the queue
  -l place=...              chooses the sharing, grouping and placement of nodes, when allowed in the queue (default: free)

Dependency
  -W depend=afterok:job-id  starts the job only if the job with job id job-id has finished successfully

Miscellaneous
  -r y|n                    marks the job as rerunnable or not (default: no)
  -v variable_list          specifies the environment variables and shell functions to be exported to the job
  -V                        exports all environment variables and shell functions from the user's login environment to the job
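For example, the following sketch submits a script to the main queue on one full fit node for two hours, then chains a post-processing job that starts only on success (my_job.pbs and post.pbs are placeholder script names):

JOBID=$(qsub -q main -l select=1:ncpus=24:mem=63000mb -l walltime=2:00:00 my_job.pbs)
qsub -W depend=afterok:$JOBID post.pbs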
qdel

To delete a job:
qdel <jobid>

qhold

To hold a job:
qhold <jobid>

Only the job owner or a system administrator can place a hold on a job. The hold can be released using the qrls <jobid> command.

qalter

The qalter command is used to modify the attributes of one or more queued (not running) PBS jobs. The options you can modify are a subset of the directives that can be used when submitting a job. A non-privileged user may only lower resource limits.
qalter [options] <jobid>

qstat

To display queue information:
qstat -Q

Common options to display job information:

  • -a Display all jobs in any state (running, queued, held)
  • -r Display running jobs
  • -u <username> Display jobs of user username
  • -f <jobid> Display detailed information about a specific job
  • -xf <jobid> Display detailed information about a specific finished job (within the past 48 hours)
  • -T Display estimated start times
  • -w Display information in wide format

PBS Environment Variables

Several environment variables are provided to PBS jobs. All PBS-provided environment variable names start with the characters "PBS_". Some start with "PBS_O_", which indicates that the variable is taken from the job's originating environment (that is, the user's environment).

A few useful PBS environment variables are described in the following list:

PBS_O_WORKDIR Contains the name of the directory from which the user submitted the PBS job
PBS_O_QUEUE Contains the queue name
PBS_JOBID Contains the PBS job identifier
PBS_NODEFILE Contains a list of nodes assigned to the job
PBS_JOBNAME Contains the job name

 

 

Writing a submission script is typically the most convenient way to submit your job to the job submission system, but jobs can also be submitted on the command line. A job script consists of PBS directives, comments and executable commands.

A PBSpro script is composed of 3 parts:

  • the shell (PBSpro always executes the shell startup script)
  • the PBSpro directives
  • the instructions of your code

Example:

#!/bin/csh
#PBS -j oe
#PBS -N HPLtest
#PBS -l walltime=1:00:00
#PBS -l select=1:ncpus=24:mem=63000mb:mpiprocs=1:ompthreads=24:cenaero=z_has
#PBS -W group_list=PRACE_T1FWB
#PBS -r y

echo "------------------ Work dir --------------------"
echo $PBS_O_WORKDIR
cd ${PBS_O_WORKDIR}
echo "------------------ Job Info --------------------"
echo "jobid : $PBS_JOBID"
echo "jobname : $PBS_JOBNAME"
echo "submit dir : $PBS_O_WORKDIR"
echo "exec dir : $PBS_JOBDIR"
echo "queue : $PBS_O_QUEUE"
echo "user : $PBS_O_LOGNAME"
echo "threads : $OMP_NUM_THREADS"
echo "------------------ $PBS_NODEFILE ---------------"
cat $PBS_NODEFILE

source /usr/share/Modules/init/csh
module purge
module load compiler/intel/composerxe/2015.5.223
module load intelmpi/5.0.3.049/64
module list

setenv name `eval hostname`
setenv OUT ${name}_xhpl_intel64_${PBS_JOBID}.txt

echo -n "This run was done on: " >> $OUT
date >> $OUT
mpirun -perhost 1 -np 1 $MKLROOT/benchmarks/mp_linpack/bin_intel/intel64/xhpl_offload_intel64 >> $OUT
echo -n "Done: " >> $OUT
date >> $OUT

 

Resource types and chunk

PBS resources represent things such as CPUs (cores), memory, switches, hosts, operating system, chassis, time, ... They can also represent whether or not something is true, for example whether a node is dedicated to a particular project or group.

A chunk is a set of resources that are allocated as a unit to a job. All parts of a chunk come from the same host/node.

A chunk-level resource or a host-level resource is a resource available at host level. The resources of a chunk are to be applied to the portion of the job running in that chunk. Chunk resources are requested inside a select statement.

Chunk level resources that can be requested are: ncpus, mpiprocs, ompthreads, mem, chassis, host,...
Check chunk resources available with the pbsnodes command:
pbsnodes <node_name>

A job-wide resource, also called a queue-level or server-level resource, is a resource available to the entire job, managed at the server or queue level.
Job-wide resources that can be requested are: walltime, pmem, place.

Format for requesting job-wide resources:
qsub … (non-resource portion of the job) -l resource=value

Format for requesting chunk resources:
qsub … (non-resource portion of the job) -l select=[N:][chunk specification][+[N:]...]
where N is the number of identical chunks, + adds a new set of identical chunks.

PBS assigns chunks to job processes in the order in which the chunks appear in the select statement.

Memory resource

The memory per node available for computation is limited to the following values:

  • Ivy Bridge and Haswell fit nodes: 63000MB
  • 18 Haswell fat nodes: 126000MB
  • 4 Haswell xfat nodes: 252000MB

The ideal memory per process is:

  • fit nodes: 2625MB
  • fat nodes: 5250MB
  • xfat nodes: 10500MB
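A chunk sized to a full fit node therefore combines both values (24 cores x 2625MB = 63000MB), as in the sample script above:

-l select=1:ncpus=24:mem=63000mb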

 

The memory lower bound is 256MB. Jobs requesting less than 256MB per chunk will be put on hold (system hold) by the PBS enforcement tools, with the message CgroupLimitError.

Placement

Users can specify how the job should be placed on nodes. Users can choose to place each chunk on a different host, or to share a specific value for some resource.

The place statement must be used to specify how the job's chunks are placed. It has the form:
qsub … -l place=[arrangement][:sharing][:grouping]
where

  • arrangement is one of free | pack | scatter | vscatter
  • sharing is one of excl | shared | exclhost
  • grouping can have only one instance, group=resource

Examples:

Academic users (and others) can find a useful script generation wizard on the CÉCI web site.

Below are some other specific examples:

  1. You want five identical chunks requesting twelve CPUs and 43000MB of memory each, placed on different nodes:
    qsub … -l select=5:ncpus=12:mem=43000mb -l place=scatter
  2. You want four chunks, where the first has two CPUs and 20GB of memory, the second has four CPUs and 4GB of memory, and the last two have one CPU and 40GB of memory each, with free placement:
    qsub … -l select=1:ncpus=2:mem=20GB+1:ncpus=4:mem=4GB+2:ncpus=1:mem=40GB …

Some resources and placements can be requested by users; others are read-only and cannot be modified by users (queue limits).

 

 

Cpus and memory enforcements

Cpus and memory are enforced on zenobe with "cgroups". The term cgroup (pronounced see-group, short for control groups) refers to a Linux kernel feature that was introduced in version 2.6.24. A cgroup may be used to restrict access to system resources and account for resource usage. The root cgroup is the ancestor of all cgroups and provides access to all system resources. When a cgroup is created, it inherits the configuration of its parent. Once created, a cgroup may be configured to restrict access to a subset of its parent’s resources. When processes are assigned to a cgroup, the kernel enforces all configured restrictions. When a process assigned to a cgroup creates a child process, the child is automatically assigned to its parent’s cgroup.

On zenobe, the Linux cgroups integration in the PBSpro scheduler does the following:

  • Prevent job processes from using more resources than specified; e.g. disallow bursting above limits
  • Keep job processes within defined memory and CPU boundaries
  • Track and report resource usage

Cgroups are set not per chunk but per node/host. If PBS puts several chunks of a job on the same node, all of these resources are attached to the same cgroup.

Within the memory cgroup, memory management is based on the Resident Set Size (RSS), i.e. the physical memory used. Use mem or pmem to request the job's memory resources.

When a job is killed due to hitting the memory cgroup limit, you will see something like the following in the job's output:

Cgroup memory limit exceeded: Killed process ...

Job Lifecycle with Cgroups

When PBS runs a single-host job with cgroups, the following happens:

  1. PBS creates a cgroup on the host assigned to the job and assigns resources (CPUs and memory) to the cgroup.
  2. PBS places the job's parent process ID (PPID) in the cgroup. The kernel captures any child processes that the job starts on the primary execution host and places them in the cgroup.
  3. When the job has finished, the cgroups hook reports CPU and memory usage to PBS, and PBS cleans up the cgroup.

When PBS runs a multi-host job, the following happens:

  1. PBS creates a cgroup on each host assigned to the job and assigns resources (CPUs and memory) to each cgroup.
  2. PBS places the job's parent process ID (PPID) in the cgroup. The kernel captures any child processes that the job starts on the primary execution host and places them in the cgroup.
    • MPI jobs:
      • PBS is integrated with IntelMPI and OpenMPI: it places the parent process ID (PPID) in the correct cgroup, communicates the PPID to any sister MoMs, and adds the processes to the correct cgroup on the sister MoMs.
      • For MPI jobs that do not use IntelMPI or OpenMPI, please contact it@cenaero.be to verify the program behavior.
    • Non-MPI jobs: it must be ensured that job processes get attached to the correct cgroup. Please contact it@cenaero.be to verify the program behavior.
  3. When the job has finished, the cgroups hook reports CPU and memory usage to PBS, and PBS cleans up the cgroup.

Node Access Enforcement

  • Users who are not running PBS jobs cannot access the compute nodes.
  • At the end of the PBS job output file:
    • First, information logged by the server during the last 2 days is provided.
    • Second, master and slave node information logged by the cpuacct cgroup subsystem follows. Pay attention to the memory and cpu time used on each node.
    • Third, a summary of the resources requested and used is provided. See Accounting for the metrics definitions.
  • Exit codes:
    • Exit Code = 0: job execution was successful.
    • Exit Code < 0: a PBS special return value indicating that the job could not be executed (see the PBS documentation for more details and contact support).
    • Exit Code between 0 and 128 (or 256): the exit value of the top process, typically the shell. This may be the exit value of the last command executed in the shell.
    • Exit Code >= 128 (or 256): the job was killed by a signal. The signal is given by the exit code modulo 128 (or 256). For example, an exit status of 143 indicates the job was killed with a SIGTERM (143 - 128 = 15). See the kill(1) man page for signal definitions.
  • Do not direct your software's output into the PBS output file. The PBS output file is kept in the /var/spool/PBS/spool directory on the job's master node and is copied back at the end of the job to the directory from which the job was launched.
  • Example:

----------------- PBS server and MOM logs -----------------

------------------frontal2.cenaero.be------------------

Job: 915424.frontal2

03/06/2017 19:55:08  S    enqueuing into main, state 1 hop 1
03/06/2017 19:55:08  S    dequeuing from main, state 1
03/06/2017 19:55:08  S    enqueuing into main_has, state 1 hop 1
03/06/2017 19:55:08  S    Job Queued at request of coulon@frontal3, owner = coulon@frontal3, job name = hpl_bloc.pbs, queue = main_has
03/06/2017 19:55:08  A    queue=main
03/06/2017 19:55:08  A    queue=main_has
03/06/2017 19:55:09  L    Considering job to run
03/06/2017 19:55:09  L    Job run
03/06/2017 19:55:09  A    user=coulon group=PRACE_T1FWB project=generic jobname=hpl_bloc.pbs queue=main_has ctime=1488826508 qtime=1488826508 etime=1488826508 start=1488826509 exec_host=node0851/0*24+node0852/0*24 exec_vnode=(node0851:ncpus=24:mem=64512000kb)+(node0852:ncpus=24:mem=64512000kb) Resource_List.mem=126000mb Resource_List.mem_pnode=63000mb Resource_List.model=haswell_fit Resource_List.mpiprocs=2 Resource_List.ncpus=48 Resource_List.ncpus_pnode=24 Resource_List.nodect=2 Resource_List.place=free Resource_List.rncpus=48 Resource_List.select=2:ncpus=24:mem=63000mb:mpiprocs=1:ompthreads=24 Resource_List.walltime=01:00:00 resource_assigned.mem=129024000kb resource_assigned.ncpus=48
03/06/2017 19:55:35  S    delete job request received
03/06/2017 19:55:35  S    Job sent signal TermJob on delete
03/06/2017 19:55:35  S    Job to be deleted at request of coulon@frontal3
03/06/2017 19:55:35  A    requestor=coulon@frontal3

------------------node0851------------------

Job: 915424.frontal2

03/06/2017 19:55:09  M    running prologue
03/06/2017 19:55:09  M    no active tasks
03/06/2017 19:55:09  M    Started, pid = 18599
03/06/2017 19:55:35  M    signal job request received
03/06/2017 19:55:35  M    signal job with TermJob
03/06/2017 19:55:35  M    task 00000001 terminated
03/06/2017 19:55:53  M    task 00000001 force exited
03/06/2017 19:55:53  M    Terminated
03/06/2017 19:55:53  M    task 00000001 cput= 0:04:16
03/06/2017 19:55:53  M    node0851 cput= 0:04:15 mem=32482772kb
03/06/2017 19:55:53  M    node0852 cput= 0:05:23 mem=32480340kb
03/06/2017 19:55:53  M    update_job_usage: CPU usage: 323.409 secs
03/06/2017 19:55:53  M    update_job_usage: Memory usage: mem=32482772kb
03/06/2017 19:55:53  M    update_job_usage: Memory usage: vmem=32482772kb
03/06/2017 19:55:53  M    no active tasks
03/06/2017 19:55:53  M    copy file request received
03/06/2017 19:55:53  M    staged 1 items out over 0:00:00
03/06/2017 19:55:53  M    no active tasks
03/06/2017 19:55:53  M    delete job request received

------------------node0852------------------

03/06/2017 19:55:09  M    JOIN_JOB as node 1
03/06/2017 19:55:09  M    task 40000001 started, /softs/intel/impi/4.1.3.045/intel64/bin/pmi_proxy
03/06/2017 19:55:37  M    task 40000001 terminated
03/06/2017 19:55:53  M    KILL_JOB received
03/06/2017 19:55:53  M    task 40000001 cput= 0:05:24
03/06/2017 19:55:54  M    no active tasks

------------------------------- Job Information -------------------------------

Job Owner       : coulon@frontal3
Job Project     : PRACE_T1FWB
Job Name        : hpl_bloc.pbs
Job Id          : 915424.frontal2
Job Queue       : main_has
Job Exit Status : 271

Resources Requested

Number of Cores per Job                - NCPUS_PJOB  : 48
Total Memory per Job                   - MEM_PJOB    : 126000mb
Placement                                            : free
Execution Time                         - WALLTIME    : 01:00:00

Resources Used

Total Memory used                      - MEM              : 64963112kb
Total CPU Time                         - CPU_Time         : 00:10:46
Execution Time                         - Wall_Time        : 00:00:44
Ncpus x Execution Time                 - N_Wall_Time      : 0:35:12
CPU_Time / N_Wall_Time (%)             - ETA              : 30%
Number of Mobilized Resources per Job  - NCPUS_EQUIV_PJOB : 48
Mobilized Resources x Execution Time   - R_Wall_Time      : 0:35:12
CPU_Time / R_Wall_Time (%)             - ALPHA            : 30%

For metrics definition, please refer to https://tier1.cenaero.be/en/faq-page

-------------------------------------------------------------------------------

 

 

Job's Nodes File

For each job, PBS creates a job-specific "host file", a text file containing the names of the nodes allocated to that job, one per line. The file is created by PBS on the primary execution host and is only available on that host. The order in which hosts appear in the node file is the order in which chunks are specified.
The full path and name of the node file is given by the job's environment variable $PBS_NODEFILE.

MPI

The number of MPI processes per chunk defaults to 1 unless it is explicitly specified using the mpiprocs resource. Open MPI and IntelMPI automatically obtain both the list of hosts and how many processes to start on each host directly from PBS Pro through the $PBS_NODEFILE. Hence, it is unnecessary to specify the --hostfile, --host, or -np options to mpirun if the MPI software's default interpretation of this file corresponds to what you want. For example:

  • IntelMPI: the default is a host file from which duplicated hostname lines are removed.
  • OpenMPI: the lines are reordered so that entries for the same node are grouped together.
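In practice, inside a PBS script the launch line therefore needs no host options at all (a minimal sketch; my_mpi_app is a placeholder for your executable):

cat $PBS_NODEFILE      # one hostname per MPI rank requested with mpiprocs
mpirun ./my_mpi_app    # host list and rank count are taken from PBS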

Open MPI and IntelMPI versions installed on zenobe use PBS mechanisms to launch and kill processes. PBS can track resource usage, control jobs, clean up job processes and perform accounting for all of the tasks run under the MPI.

OpenMP

PBSpro supports OpenMP applications by setting the OMP_NUM_THREADS variable in the job's environment, based on the request of the job.

If ompthreads is requested, OMP_NUM_THREADS is set to this value; if ompthreads is not requested, OMP_NUM_THREADS is set to 1.

For the MPI process with rank 0, the environment variable OMP_NUM_THREADS is set to the value of ompthreads. For other MPI processes, the behavior depends on the MPI implementation.
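For example, a hybrid MPI/OpenMP job with 2 ranks of 12 threads on each of 2 nodes could be requested as follows (a sketch; each 24-core chunk is fully used since 2 x 12 = 24):

qsub … -l select=2:ncpus=24:mpiprocs=2:ompthreads=12 my_hybrid_job.pbs

PBS then sets OMP_NUM_THREADS=12 in the job's environment accordingly.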

Usage Cases:

FAQ

 
Remote visualisation

Two procedures are described for this purpose.
The first relies on the visualisation portal installed on Zenobe.
The second relies on a manual configuration of a ParaView server and its connection to a ParaView client.

Link to the documentation

Supercomputer under maintenance

Every three months, a maintenance window is planned between 7 AM and 7 PM. During this maintenance, the supercomputer is unreachable.

You are not allowed on our system

Please see the section Login and Access on this page.

Undefined problem

Please contact it@cenaero.be.

Wrong permission on your directory

To avoid spurious "quota exceeded" errors on /projects/organization/project_name and on /SCRATCH/organization/project_name, please check the permissions on your directories. The setgid bit needs to be set for the group, so the listing should look like this:

drwxr-sr-x  2 MyUsername  GroupProjects    4096 Oct 16  2015 myDirectories

If this lowercase "s" is not set in the group part, run: chmod g+s directory

Once you have modified the permissions, make sure the ownership is correctly set: the group must be the project to which the directory belongs.
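To fix the group and the setgid bit on a whole tree at once, a sketch like the following can be used (PROJECTNAME and the path are placeholders to adapt to your project):

chgrp -R PROJECTNAME /SCRATCH/organization/project_name/myDirectories
find /SCRATCH/organization/project_name/myDirectories -type d -exec chmod g+s {} +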

This kind of problem is typically triggered by moving a directory: a directory moved (mv) from your home directory, for example, keeps its original permissions. So when you want to move directories, please copy them instead with the following command:

cp -r fromDirectory toDirectory

The copied directories will then inherit the correct permissions (you can remove the source afterwards).

If you want more information about setgid, see man chmod; the manual has a specific section about the "special" permissions.

General points

If the above solution is not applicable, please check whether you can save some space by cleaning your data.

If needed, the /SCRATCH or /projects quota can be modified (for projects only):

Academic users

The project manager has to send a request to the CÉCI administrators through this link.

Others users

You have to send a request to it@cenaero.be after reevaluating your needs. If you need assistance with this kind of reevaluation, don't hesitate to contact the HPC admins (it@cenaero.be).