Commit 28e91c7

EH: CS-1378 document the error codes which can be reported in the accounting failed attribute
1 parent 5dda5a9 commit 28e91c7

2 files changed: +96 -2 lines changed

doc/markdown/man/man1/CMakeLists.txt

Lines changed: 3 additions & 2 deletions
@@ -1,7 +1,7 @@
 #___INFO__MARK_BEGIN_NEW__
 ###########################################################################
 #
-# Copyright 2023-2024 HPC-Gridware GmbH
+# Copyright 2023-2025 HPC-Gridware GmbH
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -28,7 +28,7 @@ set(HOSTNAME_PAGES gethostbyaddr gethostbyname gethostname getservbyname sge_hos
 build_markdown_man_from_template("1" "gethost.include" HOSTNAME_PAGES "0")
 
 # build all other man pages from section 1
-set(PAGES qacct qconf qdel qhold qhost qmake qmod qping qquota qrdel qrls qrstat qrsub qselect qstat sge_intro sge_jsv sge_types sge_share_mon)
+set(PAGES qacct qconf qdel qhold qhost qmake qmod qping qquota qrdel qrls qrstat qrsub qselect qstat sge_diagnostics sge_intro sge_jsv sge_types sge_share_mon)
 build_markdown_man("1" PAGES "0")
 
 # target for building all troff man pages from section 1
@@ -66,6 +66,7 @@ add_custom_target(troffman1 ALL DEPENDS
 ${CMAKE_CURRENT_BINARY_DIR}/qrsub.1
 ${CMAKE_CURRENT_BINARY_DIR}/qselect.1
 ${CMAKE_CURRENT_BINARY_DIR}/qstat.1
+${CMAKE_CURRENT_BINARY_DIR}/sge_diagnostics.1
 ${CMAKE_CURRENT_BINARY_DIR}/sge_intro.1
 ${CMAKE_CURRENT_BINARY_DIR}/sge_jsv.1
 ${CMAKE_CURRENT_BINARY_DIR}/sge_share_mon.1
Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
+---
+title: sge_diagnostics
+section: 1
+header: Reference Manual
+footer: __RELEASE__
+date: __DATE__
+---
+
+# NAME
+
+xxqs_name_sxx_diagnostics - xxQS_NAMExx diagnostics documentation
+
+# DESCRIPTION
+
+This document describes how to collect diagnostic information for xxQS_NAMExx installations.
+It is intended to be used by system administrators and support personnel to gather relevant information about the xxQS_NAMExx installation and its current state.
+
+## Error Codes reported in the failed state of jobs
+
+The `failed` attribute of both `qstat -j <job_id>` and `qacct -j <job_id>` commands can contain error codes that
+indicate the reason for a job failure.
+
+Depending on the error code the job or the queue instance may be set into error state.
+
+The reason for a queue error state can be queried via `qstat -explain E`.
+The error state can be cleared via `qmod -cq <queue_name>`.
+
+The reason for a job error state can be queried via `qstat -j <job_id>`.
+The error state can be cleared via `qmod -cj <job_id>`.
+
+The following table lists the error codes and their meaning:
+
+| Code | Name / Meaning |
+|------:|----------------|
+| `0` | `STATUS_OK`: Job ran through and exited normally |
+| `1` | `SSTATE_FAILURE_BEFORE_JOB`: `sge_execd` cannot start the job. The job or queue instances may be set into error state and further information will be available via `qstat -j <job_id>` and/or `qstat -explain E`. |
+| `2` | `ESSTATE_NO_SHEPHERD`: `sge_shepherd` cannot be executed, see `sge_execd` messages file for details. |
+| `3` | `SSTATE_NO_CONFIG`: `sge_execd` could not write the `sge_shepherd` config file. |
+| `4` | `SSTATE_NO_PID`: `sge_shepherd` did not write its `pid` file (poss. as it crashed), see the `sge_shepherd` trace file for details. |
+| `5` | `SSTATE_READ_CONFIG`: `sge_shepherd` cannot read its `config` file. |
+| `6` | `SSTATE_PROCSET_NOTSET`: On Solaris: `sge_shepherd` could not create a processor set. |
+| `7` | `SSTATE_BEFORE_PROLOG`: `sge_shepherd` could not start a prolog. |
+| `8` | `SSTATE_PROLOG_FAILED`: A prolog was started by `sge_shepherd` but failed. |
+| `9` | `SSTATE_BEFORE_PESTART`: `sge_shepherd` could not start a PE start procedure. |
+| `10` | `SSTATE_PESTART_FAILED`: A PE start procedure was started by `sge_shepherd` but failed. |
+| `11` | `SSTATE_BEFORE_JOB`: `sge_shepherd` could not start the job. More information can be found in the `sge_shepherd` trace file. |
+| `12` | `SSTATE_BEFORE_PESTOP`: `sge_shepherd` could not start a PE stop procedure. |
+| `13` | `SSTATE_PESTOP_FAILED`: A PE stop procedure was started by `sge_shepherd` but failed. |
+| `14` | `SSTATE_BEFORE_EPILOG`: `sge_shepherd` could not start an epilog. |
+| `15` | `SSTATE_EPILOG_FAILED`: An epilog was started by `sge_shepherd` but failed. |
+| `16` | `SSTATE_PROCSET_NOTFREED`: On Solaris: `sge_shepherd` could not release a previously created processor set. |
+| `17` | `ESSTATE_DIED_THRU_SIGNAL`: The job died through a signal. |
+| `18` | `ESSTATE_SHEPHERD_EXIT`: `sge_shepherd` exited with exit status > 0. |
+| `19` | `ESSTATE_NO_EXITSTATUS`: `sge_shepherd` didn't write its `exit_status` file - possibly crashed before exiting regularly. |
+| `20` | `ESSTATE_UNEXP_ERRORFILE`: The `sge_shepherd` `error` file couldn't be read. |
+| `21` | `ESSTATE_UNKNOWN_JOB`: `sge_execd` got a message from `sge_qmaster` about a job it doesn't know about. |
+| `22` | `ESSTATE_EXECD_LOST_RUNNING`: Job removed manually. |
+| `23` | `ESSTATE_PTF_CANT_GET_PIDS`: PTF can't get information for certain pids. |
+| `24` | `SSTATE_MIGRATE`: The job was checkpointed for migration. |
+| `25` | `SSTATE_AGAIN`: The job shall be re-started. |
+| `26` | `SSTATE_OPEN_OUTPUT`: Error, input, or output file couldn't be opened by `sge_shepherd`. |
+| `27` | `SSTATE_NO_SHELL`: The requested shell could not be found by `sge_shepherd`. |
+| `28` | `SSTATE_NO_CWD`: `sge_shepherd` cannot change directory to the requested job directory. |
+| `29` | `SSTATE_AFS_PROBLEM`: AFS setup failed. |
+| `30` | `SSTATE_APPERROR`: The job exited with exit_status 100 (application error) |
+| `36` | `SSTATE_CHECK_DAEMON_CONFIG`: The daemon for an interactive job could not be found (if `rsh_daemon`, `rlogin_daemon`, `qlogin_daemon` is configured to a daemon path, instead of `builtin`) |
+| `37` | `SSTATE_QMASTER_ENFORCED_LIMIT`: `sge_qmaster` enforced killing the job due to a limit. |
+| `38` | `SSTATE_ADD_GRP_SET_ERROR`: `sge_shepherd` cannot attach the additional group id to the `sge_shepherd` child process becoming the job. |
+| `100` | `SSTATE_FAILURE_AFTER_JOB`: The job ran through, but no `usage` file was written by `sge_shepherd`. |
+
+
+More details about errors reported by `sge_execd` can be found in the `sge_execd` messages file.
+For errors reported by `sge_shepherd` please check the `sge_shepherd` trace file or the error mail (if requested at job submission) or administrator mail (if configured in the global configuration).
+
+
+# ENVIRONMENTAL VARIABLES
+
+For a complete list of common environment variables used by all xxQS_NAMExx commands, see xxqs_name_sxx_intro(1).
+
+# FILES
+
+The `sge_shepherd` trace file is located in `<sge_shepherd_spool_dir>/active_jobs/<job_id>.<array_task_id>/trace` (where `<array_task_id>` is `1` for non-array jobs).
+Set the `execd_params` attribute `KEEP_ACTIVE` to keep the active job directories after job termination. See xxqs_name_sxx_conf(5) for details.
+
+The `sge_execd` messages file is located in the `sge_execd` spool directory.
+
+# SEE ALSO
+
+xxqs_name_sxx_conf(5), xxqs_name_sxx_execd(8), xxqs_name_sxx_shepherd(8)
+
+# COPYRIGHT
+
+See xxqs_name_sxx_intro(1) for a full statement of rights and permissions.
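
The error-code table and the trace file location documented by this commit lend themselves to a small triage helper. The Python sketch below is not part of the commit. It copies a subset of the codes and names from the table above, and it assumes that the `failed` line in `qacct -j <job_id>` output begins with the numeric code, optionally followed by descriptive text. The names `FAILED_CODES`, `parse_failed`, and `trace_path` are hypothetical, as is any spool directory passed in.

```python
import re
from pathlib import PurePosixPath

# Subset of the codes listed in the table above; extend as needed.
FAILED_CODES = {
    0: "STATUS_OK",
    1: "SSTATE_FAILURE_BEFORE_JOB",
    8: "SSTATE_PROLOG_FAILED",
    17: "ESSTATE_DIED_THRU_SIGNAL",
    26: "SSTATE_OPEN_OUTPUT",
    27: "SSTATE_NO_SHELL",
    28: "SSTATE_NO_CWD",
    30: "SSTATE_APPERROR",
    37: "SSTATE_QMASTER_ENFORCED_LIMIT",
    100: "SSTATE_FAILURE_AFTER_JOB",
}

# Assumption: the line looks like "failed   26 : some text" or "failed   0".
_FAILED_RE = re.compile(r"^\s*failed\s+(\d+)\b")


def parse_failed(line):
    """Return (code, name) for a 'failed' line, or None if the line does not match."""
    match = _FAILED_RE.match(line)
    if match is None:
        return None
    code = int(match.group(1))
    # Codes missing from the subset above are reported as "UNKNOWN".
    return code, FAILED_CODES.get(code, "UNKNOWN")


def trace_path(spool_dir, job_id, task_id=1):
    """Build <spool_dir>/active_jobs/<job_id>.<task_id>/trace (task_id 1 for non-array jobs)."""
    return str(PurePosixPath(spool_dir) / "active_jobs" / f"{job_id}.{task_id}" / "trace")
```

A caller might feed each line of captured `qacct -j` output to `parse_failed` and, for codes that point at `sge_shepherd`, pass the execution host's spool directory and the job id to `trace_path` to find the file to inspect. The trace file is only still present after the job has ended if `KEEP_ACTIVE` was set in `execd_params`, as the FILES section notes.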
