Job Output

  • At the end of the job, the PBS output file contains:
    • First, the information logged by the server over the last 2 days.
    • Second, the information logged for the master and slave nodes by the cpuacct cgroup subsystem. Pay attention to the memory and CPU time used on each node.
    • Third, a summary of the resources requested and used. See Accounting for the metrics definitions.
  •  Exit codes:
    • Exit Code = 0: Job execution was successful.
    • Exit Code < 0: This is a special PBS return value indicating that the job could not be executed. See the PBS documentation for more details and contact support.
    • Exit Code > 0 and < 128 (or 256): This is the exit value of the top process, typically the shell. It may be the exit value of the last command executed in the shell.
    • Exit Code >= 128 (or 256): The job was killed by a signal. The signal number is the exit code modulo 128 (or 256). For example, an exit status of 143 indicates that the job was killed by SIGTERM (143 - 128 = 15), and the exit status of 271 in the example below likewise corresponds to SIGTERM (271 - 256 = 15). See the kill(1) man page for signal definitions.
  •  Do not send the output of the software to the PBS output file. While the job runs, the PBS output file is kept in the /var/spool/PBS/spool directory on the job's master node; at the end of the job it is copied back to the directory from which the user launched the job.
  • Example:

----------------- PBS server and MOM logs -----------------

------------------frontal2.cenaero.be------------------

Job: 915424.frontal2

03/06/2017 19:55:08  S    enqueuing into main, state 1 hop 1
03/06/2017 19:55:08  S    dequeuing from main, state 1
03/06/2017 19:55:08  S    enqueuing into main_has, state 1 hop 1
03/06/2017 19:55:08  S    Job Queued at request of coulon@frontal3, owner = coulon@frontal3, job name = hpl_bloc.pbs, queue = main_has
03/06/2017 19:55:08  A    queue=main
03/06/2017 19:55:08  A    queue=main_has
03/06/2017 19:55:09  L    Considering job to run
03/06/2017 19:55:09  L    Job run
03/06/2017 19:55:09  A    user=coulon group=PRACE_T1FWB project=generic jobname=hpl_bloc.pbs queue=main_has ctime=1488826508 qtime=1488826508 etime=1488826508 start=1488826509 exec_host=node0851/0*24+node0852/0*24 exec_vnode=(node0851:ncpus=24:mem=64512000kb)+(node0852:ncpus=24:mem=64512000kb) Resource_List.mem=126000mb Resource_List.mem_pnode=63000mb Resource_List.model=haswell_fit Resource_List.mpiprocs=2 Resource_List.ncpus=48 Resource_List.ncpus_pnode=24 Resource_List.nodect=2 Resource_List.place=free Resource_List.rncpus=48 Resource_List.select=2:ncpus=24:mem=63000mb:mpiprocs=1:ompthreads=24 Resource_List.walltime=01:00:00 resource_assigned.mem=129024000kb resource_assigned.ncpus=48
03/06/2017 19:55:35  S    delete job request received
03/06/2017 19:55:35  S    Job sent signal TermJob on delete
03/06/2017 19:55:35  S    Job to be deleted at request of coulon@frontal3
03/06/2017 19:55:35  A    requestor=coulon@frontal3

------------------node0851------------------

Job: 915424.frontal2

03/06/2017 19:55:09  M    running prologue
03/06/2017 19:55:09  M    no active tasks
03/06/2017 19:55:09  M    Started, pid = 18599
03/06/2017 19:55:35  M    signal job request received
03/06/2017 19:55:35  M    signal job with TermJob
03/06/2017 19:55:35  M    task 00000001 terminated
03/06/2017 19:55:53  M    task 00000001 force exited
03/06/2017 19:55:53  M    Terminated
03/06/2017 19:55:53  M    task 00000001 cput= 0:04:16
03/06/2017 19:55:53  M    node0851 cput= 0:04:15 mem=32482772kb
03/06/2017 19:55:53  M    node0852 cput= 0:05:23 mem=32480340kb
03/06/2017 19:55:53  M    update_job_usage: CPU usage: 323.409 secs
03/06/2017 19:55:53  M    update_job_usage: Memory usage: mem=32482772kb
03/06/2017 19:55:53  M    update_job_usage: Memory usage: vmem=32482772kb
03/06/2017 19:55:53  M    no active tasks
03/06/2017 19:55:53  M    copy file request received
03/06/2017 19:55:53  M    staged 1 items out over 0:00:00
03/06/2017 19:55:53  M    no active tasks
03/06/2017 19:55:53  M    delete job request received

------------------node0852------------------

03/06/2017 19:55:09  M    JOIN_JOB as node 1
03/06/2017 19:55:09  M    task 40000001 started, /softs/intel/impi/4.1.3.045/intel64/bin/pmi_proxy
03/06/2017 19:55:37  M    task 40000001 terminated
03/06/2017 19:55:53  M    KILL_JOB received
03/06/2017 19:55:53  M    task 40000001 cput= 0:05:24
03/06/2017 19:55:54  M    no active tasks

------------------------------- Job Information -------------------------------

Job Owner       : coulon@frontal3
Job Project     : PRACE_T1FWB
Job Name        : hpl_bloc.pbs
Job Id          : 915424.frontal2
Job Queue       : main_has
Job Exit Status : 271

Resources Requested

Number of Cores per Job                - NCPUS_PJOB  : 48
Total Memory per Job                   - MEM_PJOB    : 126000mb
Placement                                            : free
Execution Time                         - WALLTIME    : 01:00:00

Resources Used

Total Memory used                      - MEM              : 64963112kb
Total CPU Time                         - CPU_Time         : 00:10:46
Execution Time                         - Wall_Time        : 00:00:44
Ncpus x Execution Time                 - N_Wall_Time      : 0:35:12
CPU_Time / N_Wall_Time (%)             - ETA              : 30%
Number of Mobilized Resources per Job  - NCPUS_EQUIV_PJOB : 48
Mobilized Resources x Execution Time   - R_Wall_Time      : 0:35:12
CPU_Time / R_Wall_Time (%)             - ALPHA            : 30%

For metrics definition, please refer to https://tier1.cenaero.be/en/faq-page

-------------------------------------------------------------------------------
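
The exit code rules and the ETA line of the summary can be checked by hand with a few lines of Python. This is only a minimal sketch, not a tool provided by PBS or by the cluster: the function names decode_exit_status, hms_to_seconds and cpu_efficiency are hypothetical, and the offset parameter (128 or 256) is an assumption that must match the convention used on the system. The value 256 is the default here because it matches the exit status 271 of the example above.

```python
import signal


def decode_exit_status(code: int, offset: int = 256) -> str:
    """Classify a 'Job Exit Status' value following the rules listed above.

    offset is an assumption: 128 or 256, depending on the system.
    """
    if code < 0:
        return "PBS could not execute the job"
    if code == 0:
        return "success"
    if code < offset:
        return f"exit value {code} of the top process"
    # At or above the offset: the job was killed by a signal.
    signum = code % offset
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # Signal number not known on this platform.
        name = "unknown signal"
    return f"killed by signal {signum} ({name})"


def hms_to_seconds(text: str) -> int:
    """Convert an H:MM:SS string such as '0:35:12' to seconds."""
    hours, minutes, seconds = (int(part) for part in text.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(cpu_time: str, wall_time: str, ncpus: int) -> float:
    """Return CPU_Time / (ncpus x Wall_Time) as a percentage (the ETA line)."""
    n_wall_time = ncpus * hms_to_seconds(wall_time)
    return 100.0 * hms_to_seconds(cpu_time) / n_wall_time


if __name__ == "__main__":
    # Values taken from the example summary above.
    print(decode_exit_status(271))
    print(round(cpu_efficiency("00:10:46", "00:00:44", 48), 1))
```

With the values of the example, the exit status 271 decodes to signal 15 (SIGTERM), which agrees with the "Job sent signal TermJob on delete" line of the server log. The efficiency is 646 s / (48 x 44 s), about 30.6%, which the summary displays as 30%.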