March 2017 Issue

Content

 

Zenobe exceptional maintenance on Monday 6th March 2017

  • Upgrade of PBSpro to release 13.1.2 on all nodes and deployment of the cgroups hook.
  • All jobs still in the queue at the beginning of the maintenance will be killed.
  • After the maintenance, all scripts must be adapted to Zenobe's new configuration, failing which they will be rejected.

 

CPU and memory enforcement with "cgroups"

Why Use Cgroups?

Linux cgroups can do the following:

  • Prevent job processes from using more resources than specified; e.g. disallow bursting above limits
  • Keep job processes within defined memory and CPU boundaries
  • Track and report resource usage
  • Enable or disable access to devices

Zenobe configuration and impact

The following subsystems have been enabled in the job scheduler (see the sketch after this list):

  • cpuset: this subsystem assigns individual CPUs (on a multicore system) and memory nodes to tasks in a cgroup.
  • cpuacct: this subsystem generates automatic reports on CPU resources used by tasks in a cgroup.
  • memory: this subsystem sets limits on memory use by tasks in a cgroup and generates automatic reports on memory resources used by those tasks.
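
As a minimal sketch of what these subsystems expose (the "pbspro" directory name is an assumption and depends on the hook configuration on Zenobe), a running job's settings can be read from the cgroup filesystem on its execution node:

    cat /sys/fs/cgroup/cpuset/pbspro/$PBS_JOBID/cpuset.cpus             # cores assigned to the job
    cat /sys/fs/cgroup/cpuacct/pbspro/$PBS_JOBID/cpuacct.usage          # cumulative CPU time used, in nanoseconds
    cat /sys/fs/cgroup/memory/pbspro/$PBS_JOBID/memory.limit_in_bytes   # memory cap enforced for the job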

Cgroups are not set per chunk but per node. If PBS places several chunks of a job on the same node, their resources will all be attached to the same cgroup, as in the example below.
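
For instance, with the hypothetical request below (the values are purely illustrative), if PBS puts both chunks on the same node, that node gets a single cgroup sized for their sum, i.e. 2 CPUs and 6000mb:

    #PBS -l select=1:ncpus=1:mem=3000mb+1:ncpus=1:mem=3000mb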

Within the memory cgroup, memory management is based on the Resident Set Size (RSS), i.e. the physical memory actually used. As a consequence, do not use vmem or pvmem to reserve memory for your job; use mem or pmem instead. All scripts must be adapted. Jobs still using vmem or pvmem will be rejected with the following message:

qsub: Error: with cgroups memory management use mem instead of vmem in resources requests.

Example:

  • Before March 6th 2017:
    #PBS -l select=1:ncpus=1:vmem=10000mb:mpiprocs=1+127:ncpus=1:vmem=3000mb:mpiprocs=1
    #PBS -l pvmem=50gb
  • After March 6th 2017:
    #PBS -l select=1:ncpus=1:mem=10000mb:mpiprocs=1+127:ncpus=1:mem=3000mb:mpiprocs=1

Note that the per-process memory request is now meaningless and can be removed.

Node memory limits remain the same.

Now, when a job is killed due to hitting the memory cgroup limit, you will see something like the following in the job's output:

Cgroup memory limit exceeded: Killed process 5163, UID 523, (xhpl_intel64) total-vm:38752588kB, anon-rss:33789996kB, file-rss:1836kB
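
If this happens, one hedged way to size the new request is to check how much memory the job actually used before resubmitting it with a larger mem value; assuming job history is enabled on the server, a finished job can still be queried:

    qstat -fx <jobid> | grep resources_used.mem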

Job Lifecycle with Cgroups

When PBS runs a single-host job with cgroups, the following happens:

  1. PBS creates a cgroup on the host assigned to the job. PBS assigns resources (CPUs and memory) to the cgroup.
  2. PBS places the job’s parent process ID (PPID) in the cgroup. The kernel captures any child processes that the job starts on the primary execution host and places them in the cgroup.
  3. When the job has finished, the cgroups hook reports CPU and memory usage to PBS and PBS cleans up the cgroup.
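
As a hedged illustration of step 2, any process started inside the job can confirm its cgroup membership by reading /proc/self/cgroup (the directory names shown depend on the hook configuration):

    cat /proc/self/cgroup    # lists, per subsystem, the cgroup this process belongs to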

When PBS runs a multi-host job, the following happens:

  1. PBS creates a cgroup on each host assigned to the job. PBS assigns resources (CPUs and memory) to the cgroup.
  2. PBS places the job’s parent process ID (PPID) in the cgroup. The kernel captures any child processes that the job starts on the primary execution host and places them in the cgroup.
    • MPI jobs:
      • PBS is integrated with IntelMPI and OpenMPI: it places the parent process ID (PPID) in the correct cgroup, communicates the PPID to any sister nodes, and adds the processes to the correct cgroup on each sister MoM (a minimal job sketch follows this list).
      • For MPI jobs that do not use IntelMPI or OpenMPI, please contact us to verify the program behavior.
    • Non-MPI jobs: we must make sure that job processes get attached to the correct job's cgroup. Please contact us to verify the program behavior.
  3. When the job has finished, the cgroups hook reports CPU and memory usage to PBS and PBS cleans up the cgroup.
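
As a rough sketch of such a multi-host MPI job under the new configuration (the module name, binary name and resource values are assumptions for illustration only):

    #PBS -l select=2:ncpus=24:mem=48000mb:mpiprocs=24
    module load intelmpi    # assumed module name; use the one provided on Zenobe
    mpirun ./my_mpi_app     # with the PBS integration, ranks started on the sister
                            # nodes are captured in each node's cgroup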

About cgroups

The term cgroup (pronounced see-group, short for control groups) refers to a Linux kernel feature that was introduced in version 2.6.24. A cgroup may be used to restrict access to system resources and account for resource usage. The root cgroup is the ancestor of all cgroups and provides access to all system resources. When a cgroup is created, it inherits the configuration of its parent. Once created, a cgroup may be configured to restrict access to a subset of its parent’s resources. These different resource classes are grouped into categories referred to as cgroup subsystems. When processes are assigned to a cgroup, the kernel enforces all configured restrictions. When a process assigned to a cgroup creates a child process, the child is automatically assigned to its parent’s cgroup.
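
For illustration only, this is roughly how a cgroup can be created and used by hand with the memory subsystem (requires root and is not how PBS manages job cgroups):

    mkdir /sys/fs/cgroup/memory/demo                                    # child of the root memory cgroup
    echo 1073741824 > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes  # cap the group at 1 GiB
    echo $$ > /sys/fs/cgroup/memory/demo/tasks                          # attach the current shell
    # every process started from this shell now inherits the cgroup and its limit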

More information can be found on the Red Hat website.

Ivy Bridge nodes end of maintenance

Please note that the hardware maintenance of the Ivy Bridge nodes ended at the end of December 2016. These nodes will remain in production as long as they do not encounter a hardware issue. These nodes are all assigned to the large queue.