|
| 1 | +--- |
| 2 | +title: sge_diagnostics |
| 3 | +section: 1 |
| 4 | +header: Reference Manual |
| 5 | +footer: __RELEASE__ |
| 6 | +date: __DATE__ |
| 7 | +--- |
| 8 | + |
| 9 | +# NAME |
| 10 | + |
| 11 | +xxqs_name_sxx_diagnostics - xxQS_NAMExx diagnostics documentation |
| 12 | + |
| 13 | +# DESCRIPTION |
| 14 | + |
| 15 | +This document describes how to collect diagnostic information for xxQS_NAMExx installations. |
| 16 | +It is intended to be used by system administrators and support personnel to gather relevant information about the xxQS_NAMExx installation and its current state. |
| 17 | + |
| 18 | +## Error Codes reported in the failed state of jobs |
| 19 | + |
| 20 | +The `failed` attribute in the output of both `qstat -j <job_id>` and `qacct -j <job_id>` can contain an error code that
| 21 | +indicates the reason for a job failure.
| 22 | + |
| 23 | +Depending on the error code, the job or the queue instance may be put into an error state.
| 24 | + |
| 25 | +The reason for a queue error state can be queried via `qstat -explain E`. |
| 26 | +The error state can be cleared via `qmod -cq <queue_name>`. |
| 27 | + |
| 28 | +The reason for a job error state can be queried via `qstat -j <job_id>`. |
| 29 | +The error state can be cleared via `qmod -cj <job_id>`. |
| 30 | + |
| 31 | +The following table lists the error codes and their meaning: |
| 32 | + |
| 33 | +| Code | Name / Meaning | |
| 34 | +|------:|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| |
| 35 | +| `0` | `STATUS_OK`: Job ran through and exited normally | |
| 36 | +| `1` | `SSTATE_FAILURE_BEFORE_JOB`: `sge_execd` cannot start the job. The job or queue instances may be set into error state and further information will be available via `qstat -j <job_id>` and/or `qstat -explain E`. | |
| 37 | +| `2` | `ESSTATE_NO_SHEPHERD`: `sge_shepherd` cannot be executed, see `sge_execd` messages file for details. | |
| 38 | +| `3` | `SSTATE_NO_CONFIG`: `sge_execd` could not write the `sge_shepherd` config file. | |
| 39 | +| `4`   | `SSTATE_NO_PID`: `sge_shepherd` did not write its `pid` file (possibly because it crashed); see the `sge_shepherd` trace file for details.                                                                         |
| 40 | +| `5` | `SSTATE_READ_CONFIG`: `sge_shepherd` cannot read its `config` file. | |
| 41 | +| `6` | `SSTATE_PROCSET_NOTSET`: On Solaris: `sge_shepherd` could not create a processor set. | |
| 42 | +| `7` | `SSTATE_BEFORE_PROLOG`: `sge_shepherd` could not start a prolog. | |
| 43 | +| `8` | `SSTATE_PROLOG_FAILED`: A prolog was started by `sge_shepherd` but failed. | |
| 44 | +| `9` | `SSTATE_BEFORE_PESTART`: `sge_shepherd` could not start a PE start procedure. | |
| 45 | +| `10` | `SSTATE_PESTART_FAILED`: A PE start procedure was started by `sge_shepherd` but failed. | |
| 46 | +| `11` | `SSTATE_BEFORE_JOB`: `sge_shepherd` could not start the job. More information can be found in the `sge_shepherd` trace file. | |
| 47 | +| `12` | `SSTATE_BEFORE_PESTOP`: `sge_shepherd` could not start a PE stop procedure. | |
| 48 | +| `13` | `SSTATE_PESTOP_FAILED`: A PE stop procedure was started by `sge_shepherd` but failed. | |
| 49 | +| `14` | `SSTATE_BEFORE_EPILOG`: `sge_shepherd` could not start an epilog. | |
| 50 | +| `15` | `SSTATE_EPILOG_FAILED`: An epilog was started by `sge_shepherd` but failed. | |
| 51 | +| `16` | `SSTATE_PROCSET_NOTFREED`: On Solaris: `sge_shepherd` could not release a previously created processor set. | |
| 52 | +| `17` | `ESSTATE_DIED_THRU_SIGNAL`: The job died through a signal. | |
| 53 | +| `18` | `ESSTATE_SHEPHERD_EXIT`: `sge_shepherd` exited with exit status > 0. | |
| 54 | +| `19` | `ESSTATE_NO_EXITSTATUS`: `sge_shepherd` didn't write its `exit_status` file - possibly crashed before exiting regularly. | |
| 55 | +| `20` | `ESSTATE_UNEXP_ERRORFILE`: The `sge_shepherd` `error` file couldn't be read. | |
| 56 | +| `21` | `ESSTATE_UNKNOWN_JOB`: `sge_execd` got a message from `sge_qmaster` about a job it doesn't know about. | |
| 57 | +| `22` | `ESSTATE_EXECD_LOST_RUNNING`: Job removed manually. | |
| 58 | +| `23` | `ESSTATE_PTF_CANT_GET_PIDS`: PTF can't get information for certain pids. | |
| 59 | +| `24` | `SSTATE_MIGRATE`: The job was checkpointed for migration. | |
| 60 | +| `25` | `SSTATE_AGAIN`: The job shall be re-started. | |
| 61 | +| `26` | `SSTATE_OPEN_OUTPUT`: Error, input, or output file couldn't be opened by `sge_shepherd`. | |
| 62 | +| `27` | `SSTATE_NO_SHELL`: The requested shell could not be found by `sge_shepherd`. | |
| 63 | +| `28` | `SSTATE_NO_CWD`: `sge_shepherd` cannot change directory to the requested job directory. | |
| 64 | +| `29` | `SSTATE_AFS_PROBLEM`: AFS setup failed. | |
| 65 | +| `30` | `SSTATE_APPERROR`: The job exited with exit_status 100 (application error) | |
| 66 | +| `36` | `SSTATE_CHECK_DAEMON_CONFIG`: The daemon for an interactive job could not be found (if `rsh_daemon`, `rlogin_daemon`, `qlogin_daemon` is configured to a daemon path, instead of `builtin`) | |
| 67 | +| `37`  | `SSTATE_QMASTER_ENFORCED_LIMIT`: `sge_qmaster` enforced killing the job due to a limit.                                                                                                                            |
| 68 | +| `38` | `SSTATE_ADD_GRP_SET_ERROR`: `sge_shepherd` cannot attach the additional group id to the `sge_shepherd` child process becoming the job. | |
| 69 | +| `100` | `SSTATE_FAILURE_AFTER_JOB`: The job ran through, but no `usage` file was written by `sge_shepherd`. | |
| 70 | + |
| 71 | + |
| 72 | +More details about errors reported by `sge_execd` can be found in the `sge_execd` messages file. |
| 73 | +For errors reported by `sge_shepherd`, check the `sge_shepherd` trace file, the error mail (if requested at job submission), or the administrator mail (if configured in the global configuration).
| 74 | + |
| 75 | + |
| 76 | +# ENVIRONMENTAL VARIABLES |
| 77 | + |
| 78 | +For a complete list of common environment variables used by all xxQS_NAMExx commands, see xxqs_name_sxx_intro(1). |
| 79 | + |
| 80 | +# FILES |
| 81 | + |
| 82 | +The `sge_shepherd` trace file is located in `<sge_shepherd_spool_dir>/active_jobs/<job_id>.<array_task_id>/trace` (where `<array_task_id>` is `1` for non-array jobs). |
| 83 | +Set the `execd_params` attribute `KEEP_ACTIVE` to keep the active job directories after job termination. See xxqs_name_sxx_conf(5) for details. |
| 84 | + |
| 85 | +The `sge_execd` messages file is located in the `sge_execd` spool directory. |
| 86 | + |
| 87 | +# SEE ALSO |
| 88 | + |
| 89 | +xxqs_name_sxx_conf(5), xxqs_name_sxx_execd(8), xxqs_name_sxx_shepherd(8) |
| 90 | + |
| 91 | +# COPYRIGHT |
| 92 | + |
| 93 | +See xxqs_name_sxx_intro(1) for a full statement of rights and permissions. |
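
The error code table in this page lends itself to a small lookup helper for scripts that post-process the `failed` attribute. The following is only a sketch: the names are copied verbatim from the table above, while the `FAILED_CODES` dictionary and the `failed_code_name` function are hypothetical helpers introduced for illustration and are not part of xxQS_NAMExx. It assumes the numeric code has already been extracted from the `qstat -j` or `qacct -j` output.

```python
# Hypothetical helper: maps the numeric `failed` code to the symbolic
# name listed in the error code table of this man page.
FAILED_CODES = {
    0: "STATUS_OK",
    1: "SSTATE_FAILURE_BEFORE_JOB",
    2: "ESSTATE_NO_SHEPHERD",
    3: "SSTATE_NO_CONFIG",
    4: "SSTATE_NO_PID",
    5: "SSTATE_READ_CONFIG",
    6: "SSTATE_PROCSET_NOTSET",
    7: "SSTATE_BEFORE_PROLOG",
    8: "SSTATE_PROLOG_FAILED",
    9: "SSTATE_BEFORE_PESTART",
    10: "SSTATE_PESTART_FAILED",
    11: "SSTATE_BEFORE_JOB",
    12: "SSTATE_BEFORE_PESTOP",
    13: "SSTATE_PESTOP_FAILED",
    14: "SSTATE_BEFORE_EPILOG",
    15: "SSTATE_EPILOG_FAILED",
    16: "SSTATE_PROCSET_NOTFREED",
    17: "ESSTATE_DIED_THRU_SIGNAL",
    18: "ESSTATE_SHEPHERD_EXIT",
    19: "ESSTATE_NO_EXITSTATUS",
    20: "ESSTATE_UNEXP_ERRORFILE",
    21: "ESSTATE_UNKNOWN_JOB",
    22: "ESSTATE_EXECD_LOST_RUNNING",
    23: "ESSTATE_PTF_CANT_GET_PIDS",
    24: "SSTATE_MIGRATE",
    25: "SSTATE_AGAIN",
    26: "SSTATE_OPEN_OUTPUT",
    27: "SSTATE_NO_SHELL",
    28: "SSTATE_NO_CWD",
    29: "SSTATE_AFS_PROBLEM",
    30: "SSTATE_APPERROR",
    36: "SSTATE_CHECK_DAEMON_CONFIG",
    37: "SSTATE_QMASTER_ENFORCED_LIMIT",
    38: "SSTATE_ADD_GRP_SET_ERROR",
    100: "SSTATE_FAILURE_AFTER_JOB",
}


def failed_code_name(code: int) -> str:
    """Return the symbolic name for a `failed` code.

    Codes that the table does not list (for example 31 to 35) yield a
    placeholder string instead of raising an exception.
    """
    return FAILED_CODES.get(code, f"UNKNOWN({code})")
```

A script could call `failed_code_name(37)` to obtain `SSTATE_QMASTER_ENFORCED_LIMIT` and then decide whether to inspect the `sge_execd` messages file or the `sge_shepherd` trace file.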