diff --git a/doc/device_local_diagnosis/device-local-diagnosis-daemon.md b/doc/device_local_diagnosis/device-local-diagnosis-daemon.md new file mode 100644 index 00000000000..eb77da53948 --- /dev/null +++ b/doc/device_local_diagnosis/device-local-diagnosis-daemon.md @@ -0,0 +1,760 @@ +# Device Local Diagnosis Service HLD + +## Table of Contents + +- [Document History](#document-history) +- [Scope](#scope) +- [Terminology](#terminology) +- [Requirements](#requirements) +- [High-Level Design](#high-level-design) +- [Detailed Design](#detailed-design) +- [Telemetry and Diagnostics](#telemetry-and-diagnostics) +- [Configuration Management](#configuration-management) +- [File Management and Rule Updates](#file-management-and-rule-updates) +- [Integration Points](#integration-points) +- [Testing and Validation](#testing-and-validation) +- [Restrictions and Limitations](#restrictions-and-limitations) +- [References](#references) + +## Introduction + +The Device Local Diagnosis (DLD) Service is a daemon running on SONiC switches that consumes vendor-provided rules, executes platform-specific data collection, correlates events, and posts validated faults to the controller. It is the on-device implementation partner to the `vendor-rules-schema-hld.md` document and provides the runtime that evaluates those rules using data source extensions (DSE), direct data sources, and vendor defined actions. + +The service provides: +- **Configurable Monitoring**: Rule-driven fault detection across multiple data sources +- **Periodic Polling**: Configurable polling intervals with threshold-based fault detection as defined by vendor rules +- **Remote Integration**: Integration with OpenConfig to publish fault information to remote controllers in a standard manner +- **Multi-Signal Rules**: Support for complex fault conditions combining multiple data sources within a single rule evaluation + +## Document History + +| Revision | Date | Author | Description | +|----------|------|--------|-------------| +| 0.1 | 2025-09-24 | Gregory Boudreau | Initial draft of DLDD HLD | + +## Scope + +This document describes the Device Local Diagnosis Daemon (DLDD) implementation on SONiC platforms. It covers how vendor-provided rules are ingested, evaluated, and converted into telemetry for remote controllers. The following items are **not** covered: individual vendor rule authoring, OpenConfig schema specifications, or controller-side workflows. 
+ +## Terminology + +| Term / Abbreviation | Description | +|----------------------|-------------| +| **DLDD** | Device Local Diagnosis Daemon running on the switch | +| **DSE** | Data Source Extension used to resolve abstract rule identifiers | +| **FIFO** | First-In, First-Out queue used for fault buffering | +| **gNMI** | gRPC Network Management Interface used for telemetry publication | +| **gNOI** | gRPC Network Operations Interface used for Healthz artifact exchange | +| **PMON** | Platform Monitor service family already present on SONiC | +| **Signal** | Data input evaluated by DLDD when processing a rule event | +| **Thread-Safe Shared FIFO Fault Buffer** | Central queue where monitor threads enqueue fault events for batch consumption by the primary orchestration thread | +| **Fault Signature** | Complete rule definition including metadata, conditions, and actions | +| **Multi-Event Rule** | Rule that combines multiple signals/events within its condition block for complex fault detection | +| **Primary Orchestration Thread** | DLDD thread responsible for rule ingestion, spawning monitor threads, consuming the FIFO buffer, executing vendor actions, and publishing telemetry | +| **Vendor Defined Actions** | Local remediation actions supplied with the rules package and executed by the primary thread | +| **Vendor Rules Source** | YAML or JSON file conforming to the schema defined in `vendor-rules-schema-hld.md`, containing signatures, conditions, and actions | +| **ACTION_\* Escalations** | Controller-driven remediation steps contained by the rules schema and defined in OpenConfig (for example `ACTION_RESEAT`, `ACTION_COLD_REBOOT`, `ACTION_POWER_CYCLE`, `ACTION_FACTORY_RESET`, `ACTION_REPLACE`) | + +## Requirements + +### Functional Requirements +This section describes the SONiC requirements for the Device Local Diagnosis Daemon (DLDD). +- Monitor multiple data sources: Redis, platform APIs, sysfs, i2c, CLI, files +- Resolve vendor Data Source Extensions (DSE) defined in the rules schema into executable data collection operations +- Support complex fault logic with multi-event rules that evaluate within a single rule definition +- Provide polling-based fault detection with configurable intervals +- Integrate with existing SONiC platform monitoring infrastructure without disrupting existing PMON services +- Support remote rule updates without service restart and keep golden backups for rollback +- Generate telemetry data for remote controller consumption through gNMI or redis directly +- Implement vendor-defined local remediation actions and escalate requests for remotely executed ACTION_* + +## High-Level Design + +![Device Local Diagnosis Architecture](./images/dldd-graphic-design.jpg) + +*Figure 1. DLDD runtime architecture showing vendor rules ingestion, monitor thread dispatch, FIFO buffering, and telemetry publication path.* + +DLDD is a multithreaded SONiC daemon that implements vendor-agnostic, rule-driven hardware fault detection and remediation. The service operates as a polling-based monitoring engine that ingests vendor-provided fault signatures, evaluates hardware health against defined conditions, and publishes actionable telemetry to remote controllers. + +At startup, DLDD loads vendor-provided fault rules from `/usr/share/sonic/device//dld_rules.yaml`, validates schema compatibility, and resolves Data Source Extensions (DSE) into concrete data collection paths. 
It then builds execution plans that map rules to appropriate monitor threads based on transport type (Redis, Platform API, I2C, CLI, file). Rules that fail validation are tracked as broken and excluded from execution, with diagnostics published to the controller. + +During runtime, the **primary orchestration thread** manages the lifecycle of specialized monitor threads and consumes fault events from a shared FIFO buffer. The **monitor threads** (Redis, File, Common) periodically sample their assigned data sources using standardized adapters that abstract transport differences through a uniform interface (`validate()`, `get_value()`, `get_evaluator()`, `run_evaluation()`, `collect()`). When rule conditions are violated (fault detected), monitor threads evaluate the results on-thread and enqueue `FaultEvent` objects to the FIFO. In case of a failure (not a rule violation), per-rule failure counters track consecutive errors and trigger state transitions (`OK` → `DEGRADED` → `BROKEN|FATAL`) based on configurable thresholds stored in CONFIG_DB. + +Confirmed faults are published to the Redis `FAULT_INFO` table for gNMI subscription by the controller. DLDD executes vendor-defined local actions (log collection, component resets) as specified in the rules and escalates `ACTION_*` requests (RESEAT, COLD_BOOT, REPLACE) to the controller when local remediation is insufficient. The service maintains a heartbeat via `DLDD_STATUS|process_state` with a 120-second TTL, publishing both service health and broken rule diagnostics to provide full observability. + +Operators control the service through standard SONiC mechanisms: the `FEATURE` table in CONFIG_DB enables or disables DLDD (`config feature state dldd enabled/disabled`), while the `DLDD_CONFIG` table allows dynamic threshold tuning without service restart. Controllers _can_ push updated rules via gNOI File service, with a systemd timer monitoring for changes and triggering validation and reload. The design prioritizes graceful degradation—when individual rules fail, the service continues operating with the remaining functional rules, and catastrophic errors trigger automatic fallback to a golden backup. + +## Detailed Design + +### Core Components + +#### Primary Orchestration Engine (Primary Thread) +- **Rule Management Pipeline**: Parses signatures from `vendor-rules-schema-hld.md`, validates schema compatibility, resolves DSE references into concrete transport specifications (determining whether an event becomes I2C, Redis, Platform API, CLI, or file-based), and materializes execution plans that map events to monitor thread capabilities. +- **Thread Coordinator**: Owns the lifecycle of all monitor threads, instantiating them from a common `MonitorThread` base class, injecting the resolved event plan, and distributing work items to the appropriate monitor based on transport type. +- **Fault Processing & Actions**: Consumes `FaultEvent` objects already evaluated by monitor threads, tracks per-rule failure counts, executes vendor local actions, and raises ACTION_* escalations to the controller. +- **Telemetry Publisher**: Emits confirmed fault and service-state records into redis STATE_DB, and orchestrates gNOI Healthz artifact creation. 
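
The sketch below illustrates the fault-processing and telemetry responsibilities of the primary orchestration thread described above. It is a minimal illustration only; the class name, helper hooks (`execute_local_actions`, `publish_fault`, `publish_service_state`), and the `FaultEvent` attribute names are assumptions for readability, not the daemon's actual API.

```python
import queue
from collections import defaultdict


class OrchestrationLoop:
    """Minimal sketch of the primary thread's fault-consumption cycle."""

    def __init__(self, fault_fifo: queue.Queue, max_failures: int = 10):
        self.fault_fifo = fault_fifo              # shared FIFO fed by monitor threads
        self.max_failures = max_failures          # individual_max_failure_threshold
        self.failure_counts = defaultdict(int)    # consecutive failures per rule
        self.broken_rules = {}                    # rule_id -> failure reason

    def process_batch(self, batch_size: int = 32) -> None:
        # Drain up to batch_size FaultEvent objects without blocking the loop.
        events = []
        for _ in range(batch_size):
            try:
                events.append(self.fault_fifo.get_nowait())
            except queue.Empty:
                break

        for event in events:
            if getattr(event, "exception", None):
                # Broken-rule signal: count it and exclude the rule once the
                # configured threshold is exceeded.
                self.failure_counts[event.rule_id] += 1
                if self.failure_counts[event.rule_id] >= self.max_failures:
                    self.broken_rules[event.rule_id] = str(event.exception)
            elif event.violated:
                self.execute_local_actions(event)   # vendor-defined local actions
                self.publish_fault(event)           # FAULT_INFO telemetry record
        self.publish_service_state()                # DLDD_STATUS heartbeat refresh

    # Placeholder hooks; the real daemon would implement these against Redis/gNOI.
    def execute_local_actions(self, event): ...
    def publish_fault(self, event): ...
    def publish_service_state(self): ...
```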
+ +#### Internal Data Structures + +DLDD uses three primary data structures for inter-thread communication and execution coordination: + +**MonitorWorkItem** - Work assignment from orchestrator to monitor threads +- **Rule ID**: Unique identifier for the rule being evaluated +- **Event Definition**: Resolved event specification with concrete transport details (Redis key, I2C address, file path, CLI command, etc.) +- **Transport Type**: Classification determining which monitor thread handles the work (Redis, File, Common) +- **DSE Bindings**: Resolved Data Source Extension mappings for abstract identifiers +- **Evaluator Metadata**: Condition type (mask, comparison, string, boolean) and threshold values + +**EvaluationResult** - Output of adapter evaluation within monitor threads +- **Violation Status**: Boolean indicating if the rule condition was violated (fault detected) +- **Collected Value**: Raw value retrieved from the data source (formatted per value_configs) +- **Evaluator**: The evaluation logic that was applied (type and threshold) +- **Timestamp**: When the evaluation occurred +- **Exception Details**: Error information if the evaluation failed (used for failure tracking) +- **State Transition**: Whether this result triggers OK → DEGRADED or DEGRADED → BROKEN|FATAL transition + +**FaultEvent** - Message enqueued to FIFO when rule conditions are violated or evaluation failures occur +- **Rule Metadata**: Rule ID, component info (type, name, serial), error type, severity, symptom +- **Evaluation Context**: The `EvaluationResult` that triggered this event +- **Failure Tracking**: Current consecutive failure count for this rule, current state (OK/DEGRADED/BROKEN) +- **Action Context**: List of local actions taken (if any), repair actions available for controller escalation +- **Temporal Data**: Timestamp of original detection, time window for condition persistence +- **Telemetry Payload**: Pre-formatted data ready for publication to `FAULT_INFO` Redis table + +These structures maintain type consistency across the service. The orchestrator creates `MonitorWorkItem` objects, monitors produce `EvaluationResult` objects via adapters, and package them into `FaultEvent` objects for the FIFO, which the orchestrator consumes to act and publish telemetry. + +#### Monitor Thread Architecture + +- **Shared Interface**: Every monitor inherits the common `MonitorThread` contract (`get_query_path()`, `get_path_value()`, `generate_queue_object()`, `push_queue_object()`), guaranteeing uniform behavior regardless of underlying transport. +- **Typed Adapters**: Each monitor thread composes the appropriate `DataSourceAdapter` (Redis, Platform API, CLI, I2C, File, etc.) which implements `validate()`, `get_value()`, `get_evaluator()`, `run_evaluation()`, and `collect()`. +- **On-Thread Evaluation**: Logic and evaluation are executed inside the monitor threads; each `collect()` call resolves the value, evaluator, and produces an `EvaluationResult` before packaging a `FaultEvent`. +- **Structured Output**: When a rule condition is violated or state changes, monitors emit a normalized `FaultEvent` that encapsulates the rule, event metadata, value, evaluator outcome, and timestamps before enqueuing to the shared FIFO. 
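
The three internal data structures and the monitor output described above could be expressed as Python dataclasses along the following lines. This is a sketch for illustration only; any field not explicitly listed in the descriptions above is an assumption, not a committed layout.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class MonitorWorkItem:
    """Work assignment handed from the orchestrator to a monitor thread."""
    rule_id: str
    event_definition: dict          # resolved transport details (key, address, path, ...)
    transport_type: str             # "redis" | "file" | "common"
    dse_bindings: dict = field(default_factory=dict)
    evaluator_metadata: dict = field(default_factory=dict)


@dataclass
class EvaluationResult:
    """Outcome of one adapter evaluation inside a monitor thread."""
    violated: bool
    collected_value: Any
    evaluator: Any
    timestamp: float
    exception: Optional[Exception] = None
    state_transition: Optional[str] = None   # e.g. "OK->DEGRADED"


@dataclass
class FaultEvent:
    """Message enqueued to the shared FIFO for the primary thread."""
    rule_id: str
    component_info: dict
    severity: str
    symptom: str
    result: EvaluationResult
    failure_count: int = 0
    actions_taken: List[str] = field(default_factory=list)
    repair_actions: List[str] = field(default_factory=list)
    first_seen_timestamp: Optional[float] = None
    telemetry_payload: dict = field(default_factory=dict)
```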
+ +**Data Collection Strategy**: +- **Redis Monitor**: Polls all assigned Redis-based rules on a configurable interval (default: 60 seconds) by querying specific keys defined in rules +- **File Monitor**: Polls all assigned file-based rules on a configurable interval (default: 60 seconds) +- **Common Monitor**: Polls all assigned Platform API/I2C/CLI rules on a configurable interval (default: 60 seconds) +- Each monitor thread has an independent polling interval configured via CONFIG_DB `DLDD_CONFIG` table; per-rule polling intervals are not currently supported + +#### Shared Data Contracts +- **Execution Plan Artifacts**: Orchestrator and monitors exchange `MonitorWorkItem` descriptors (rule ID, resolved event definition with concrete transport details, adapter binding). Orchestrator may modify the execution plan at runtime to add or remove rules from monitor threads as a result of failure states. +- **FaultEvent Queue Objects**: The FIFO carries serialized `FaultEvent` dataclasses between threads with consistent schema including rule metadata, evaluation results, timestamps, and exception details for broken rule tracking. + + +### Process Model + +``` +DLDD Process (PID: main) +└─ Primary Orchestration Thread + ├─ Maintains rule execution plan and DSE bindings + ├─ Manages shared FIFO of `FaultEvent` objects + └─ Invokes ACTION_* escalations and telemetry publishers + + ╰─ Monitor Thread Pool (instances of the shared MonitorThread base class) + ├─ Redis Monitor (uses RedisAdapter → MonitorThread interface) + ├─ File Monitor (uses FileAdapter → MonitorThread interface) + └─ Common Monitor (uses PlatformAPI/CLI/I2C adapters → MonitorThread interface) + + ↳ Each monitor produces `FaultEvent` objects with identical schema + and enqueues them to the FIFO for orchestration consumption. +``` + +- **Monitor Thread Interface Enforcement**: All monitors are instantiated from the same base class, guaranteeing consistent callback signatures for value retrieval, evaluation, and queueing. +- **Inter-Thread Payloads**: Communication between threads relies exclusively on the `MonitorWorkItem` descriptors (from orchestrator → monitor) and `FaultEvent` objects (monitor → orchestrator), keeping the data flow self-describing and serialization-friendly. +- **Deterministic Ordering**: The FIFO buffer preserves chronological ordering of `FaultEvent` payloads. If multiple faults on the same component and symptom are received, the pushed fault is determined first by the severity and then by priority (lower numeric priority takes precedence). If component instance, symptom, severity, and priority are the same, the pushed fault is based on first received fault. + +### Rule Evaluation Workflow + +1. **Rule ingestion**: Primary thread loads the vendor rules source, validates schema versions, resolves DSE references, and stores the resulting execution plans. +2. **Monitor thread provisioning**: Based on the rule metadata, the primary thread spawns or reconfigures monitor threads to cover the necessary data sources and DSE bindings. +3. **Event sampling**: Monitor threads collect data from Redis, platform APIs, sysfs, CLI, I2C, and file sources, applying per-event evaluations defined in the rules schema. +4. **Fault buffering**: Events that violate rule conditions generate fault evidence that is enqueued into the thread-safe shared FIFO buffer, preserving ordering and batching semantics. +5. 
**Fault processing and actions**: The primary thread consumes buffered events in batches, tracks failure counts, executes vendor defined local actions when thresholds are met, and triggers remote ACTION_* sequences through the controller interface. +6. **Telemetry publication**: Confirmed faults are written to Redis `FAULT_INFO`, exported via gNMI, and associated log artifacts are pushed into gNOI Healthz bundles. + +### Vendor Rule Lifecycle Coordination + +- **Schema Compatibility**: DLDD verifies that the `schema_version` provided in the rules source is supported by the on-device schema layout definitions before activating signatures. +- **Signature Distribution**: Each signature's metadata drives monitor thread assignments (for example, events referencing DSE paths are dispatched to the thread that can resolve the DSE binding). +- **Action Interface Enforcement**: Local actions are required to follow the `{type, command}` structure defined in the rules schema. Supported executors include `dse`, `cli`, `i2c`, and `file`. +- **Escalation Handling**: Remote actions are propagated as ACTION_* enums defined by the rules schema and surfaced to the controller through gNMI/gNOI. +- **Log Collection Alignment**: DLDD consumes the `log_collection` queries specified in the rules schema to populate gNOI Healthz artifacts. + +### Data Intake Pathways + +### Priority Order +Data collection should follow a priority hierarchy optimized for performance and a focus on minimizing resource usage and simplicity of use: + +1. **Redis Database** - Primary source when available + - Lowest latency access + - Leverages already captured data + - Structured data format + - Native SONiC integration + - Examples: `STATE_DB`, `COUNTERS_DB`, `APPL_DB` + +2. **Platform APIs** - Platform abstraction layer + - Hardware-agnostic interface + - Vendor-specific implementations through common SONiC APIs + - Examples: PSU status, fan speeds, thermal readings, chassis object, etc. + +3. **Sysfs Paths** - Direct filesystem access + - Kernel-exposed hardware data + - Low-level sensor access + - Requires path knowledge + - Examples: `/sys/class/hwmon/`, `/sys/bus/i2c/` + +4. **CLI Commands** - Linux/SONiC Command Line Access + - Standard SONiC/Linux command execution + - Human-readable output requires parsing + - Examples: `show platform npu ?`(vendor CLIs), `dmesg`, `lspci`, `sensors` + +5. **I2C Commands** - Direct hardware communication + - Last resort for unavailable data + - Requires detailed hardware knowledge + - Examples: Direct sensor register reads via i2c + +### Data Source Interfaces + +#### Shared Interface Contract + +Every rule references a `DataSourceAdapter` that implements a common contract. The adapter receives a resolved event specification (DSE references are already converted to concrete transport details by the primary thread during rule ingestion). 
All adapters expose the same surface area to the rule engine: + +```python +class DataSourceAdapter(Protocol): + def validate(self, event: RuleEvent) -> None: + """Raise on unsupported configuration prior to activation.""" + + def get_value(self, event: RuleEvent) -> CollectedValue: + """Fetch the raw value from the underlying transport.""" + + def get_evaluator(self, event: RuleEvent) -> Evaluator: + """Produce a callable or structure that encapsulates the evaluation logic.""" + + def run_evaluation(self, value: CollectedValue, evaluator: Evaluator) -> EvaluationResult: + """Return the boolean outcome plus any metadata (timestamps, values, etc.).""" + + def collect(self, event: RuleEvent) -> EvaluationResult: + """Convenience wrapper that orchestrates value retrieval and evaluation.""" +``` + +#### Method Responsibilities +The below is provided to help provide a better idea of where functionality takes place in the common data source adapter interface. + +**`validate(event: RuleEvent) -> None`** +- **Purpose**: Pre-flight check executed once during rule ingestion, before any monitor thread starts sampling. +- **Behavior**: Inspects the event configuration (path structure, evaluation type, DSE references) and raises an exception if the adapter cannot support it. +- **Example**: An I2CAdapter would verify that the bus/chip addresses are syntactically valid and that the requested operation (`get`/`set`) is supported. A RedisAdapter would confirm the database name exists in the SONiC schema. +- **Failure Impact**: If validation fails, the rule is marked as broken and excluded from the execution plan; the service continues with remaining valid rules. + +**`get_value(event: RuleEvent) -> CollectedValue`** +- **Purpose**: Fetch the raw data from the underlying transport (I2C register, Redis key, CLI stdout, file content, etc.). +- **Behavior**: Executes the transport operation using the already-resolved event specification and returns the unprocessed value (bytes, string, integer, JSON blob, etc.). DSE resolution has already occurred in the primary thread. +- **Example**: For an I2C event with resolved bus/chip addresses, this reads the chip register and returns the raw byte/word. For Redis with a concrete database/table/key path, it performs `HGET` and returns the field value. For CLI with the final command string, it executes and returns stdout. + +**`get_evaluator(event: RuleEvent) -> Evaluator`** +- **Purpose**: Build the evaluation logic based on the `evaluation` block from the rules schema. +- **Behavior**: Parses the evaluation type (`mask`, `comparison`, `string`, `boolean`) and constructs a callable or data structure that can be applied to the collected value. +- **Example**: For a mask evaluation with `logic: '&'` and `value: '10000000'`, returns an evaluator that performs bitwise AND. For a comparison evaluation with `operator: '>'` and `value: 50.0`, returns a greater-than checker. +- **Reusability**: The evaluator can be cached and reused across multiple `get_value()` calls if the evaluator is static and not dynamically generated (a DSE reference as the value would require a new evaluator each time as the underlying may change/is not hardcoded). + +**`run_evaluation(value: CollectedValue, evaluator: Evaluator) -> EvaluationResult`** +- **Purpose**: Execute the evaluation logic and return the boolean outcome plus any metadata (actual value read, expected threshold, etc.). 
- **Behavior**: Applies the evaluator to the value and packages the result into an `EvaluationResult` object that includes violation status, timestamps, and diagnostic information.
- **Example**: For a temperature threshold check, returns `EvaluationResult(violated=True, value=55.2, threshold=50.0, unit='celsius')` if the sensor reads above the limit.
- **Usage**: This is typically called by `collect()` but can be invoked independently for testing or batch evaluation scenarios.

**`collect(event: RuleEvent) -> EvaluationResult`**
- **Purpose**: Convenience method that chains the full evaluation workflow in a single call.
- **Behavior**: Internally calls `get_value(event)`, `get_evaluator(event)` (if necessary), and `run_evaluation(value, evaluator)`, then returns the final `EvaluationResult`.
- **Usage**: Monitor threads call this method in their main sampling loop. It simplifies the common case where the thread wants a complete evaluation without needing to manage intermediate steps.
- **Example**: `result = adapter.collect(event)` → fetches I2C register, applies mask, returns violation status in one operation.

#### Type-Specific Adapter Expectations
Below are examples of how type-specific adapters function:
- **RedisAdapter**: Resolves database/table/key/path (or DSE aliases) and performs `HGET`/`JSON` extraction using the shared `collect()` entry point.
- **PlatformAPIAdapter**: Uses the platform chassis object obtained from the DSE resolver, executes the requested method on the component, and returns structured results.
- **I2CAdapter**: Converts logical bus/chip identifiers provided by the rule (or DSE) into physical addresses, applies the requested operation (`get`/`set`), and returns a response of the size specified by the rule.
- **CLIAdapter**: Executes vendor CLI commands with standard timeout handling, normalizes stdout to the expected format, and returns parsed content.
- **FileAdapter**: Reads file paths or glob patterns and normalizes the data into a buffer for later comparison as defined in the rules.

Each adapter adheres to the same lifecycle hooks (`validate()`, `collect()`, etc.), which keeps the evaluation pipeline agnostic to the underlying transport while still allowing vendor-specific implementations behind the interface.

### Error Handling and Recovery

#### Exception Handling Strategy

DLDD uses exception-based error handling at both the primary orchestration thread and monitor thread levels. All failures are caught, logged, tracked, and escalated appropriately based on severity and persistence. The system maintains isolation between rules so that failures in one rule do not affect others.

#### Primary Thread Error Handling

**Rule Ingestion Phase**

During rule ingestion, the primary thread processes each signature from the vendor rules source sequentially. For each signature, the thread performs schema validation and DSE rule conversion. If any of these steps raises an exception:

- The failure is logged with the rule name and error details
- The rule is added to a `broken_rules` collection with metadata (rule name, version, failure reason)
- Processing continues with the next rule in the source
- The broken rule is excluded from the execution plan

After all rules are processed, the primary thread publishes the complete list of broken rules to the service state telemetry (Redis `DLDD_STATUS|process_state`). This allows the remote controller to detect which rules failed to load and why.
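
A minimal sketch of the per-signature ingestion behavior described above. The helper names (`validate_schema`, `resolve_dse`) and the signature field names are assumptions used for illustration only.

```python
def ingest_rules(signatures, validate_schema, resolve_dse):
    """Sketch of per-signature ingestion with broken-rule isolation."""
    execution_plan = []
    broken_rules = []

    for sig in signatures:
        try:
            validate_schema(sig)                 # schema/version/field checks
            resolved = resolve_dse(sig)          # DSE references -> concrete transports
            execution_plan.append(resolved)
        except Exception as exc:                 # any failure isolates only this rule
            broken_rules.append({
                "rule": sig.get("name", "unknown"),
                "version": sig.get("version", "unknown"),
                "reason": f"{type(exc).__name__}: {exc}",
            })
            continue                             # keep processing remaining rules

    # broken_rules is later published under DLDD_STATUS|process_state
    return execution_plan, broken_rules
```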

**Primary Thread Fault Consumption**

During fault consumption, the primary thread consumes batches of `FaultEvent` objects from the shared FIFO buffer and evaluates signature logic. If exceptions occur:

- **Action Execution Errors**: Logged, but the fault is still published to telemetry with an action failure annotation. The controller receives notification that the fault was detected but remediation failed.

If a `FaultEvent` pushed by a monitor thread signifies a broken rule, the primary thread adds the rule to the `broken_rules` table and tracks the number of occurrences. Once the rule accumulates failures exceeding the `max_failures` threshold, it is excluded from the execution plan and the primary thread updates the affected monitor thread's work items so that no further attempts are made for that rule.

The primary thread never terminates due to individual rule or action failures; it continues operating with the subset of functional rules.

#### Monitor Thread Error Handling

**Per-Rule Failure Tracking**

Each monitor thread maintains per-rule state tracking:

- **Failure count**: Number of consecutive transport or evaluation exceptions for each rule
- **Last success timestamp**: Most recent successful evaluation for each rule

This per-rule tracking ensures that one failing rule does not block evaluation of other rules assigned to the same monitor thread.

**Sampling Loop Behavior**

During each sampling cycle, the monitor thread iterates through its assigned work items (rules). For each rule:

1. **Collection Attempt**: Invoke `adapter.collect(event)`, which may raise exceptions
2. **Success Path**: If collection succeeds, update the internal tracker to the new state, update the last success timestamp, and enqueue a `FaultEvent` if the rule condition was violated or broken. Also enqueue a `FaultEvent` if the rule condition is OK but the rule was previously broken or violated, to inform the primary thread of the state change.
3. **Exception Path**: If collection raises an exception, handle it based on exception type and enqueue a `FaultEvent` containing the exception details to the shared buffer.

**Evaluation Exception Handling**

Evaluation exceptions (evaluator type mismatch, invalid logic) indicate permanent configuration errors. When caught:

- Log an error to the system logs.
- Notify the primary thread immediately so the rule is removed from the execution plan and added to the `broken_rules` list with a fatal reason.

Evaluation failures are considered fatal for the rule because they indicate a schema or DSE resolution issue that cannot be fixed by retrying.
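
The following sketch shows one way a monitor thread's sampling cycle could apply the success and exception paths described above. The `EvaluationError` class, the `make_fault_event` factory, and the tracker layout are illustrative assumptions rather than the daemon's actual implementation.

```python
import time


class EvaluationError(Exception):
    """Stand-in for evaluator/type-mismatch failures (treated as fatal for the rule)."""


def sample_once(work_items, adapter, fault_fifo, trackers):
    """One sampling cycle over the work items assigned to a monitor thread."""
    for item in work_items:
        tracker = trackers.setdefault(item.rule_id, {"failures": 0, "state": "OK"})
        try:
            result = adapter.collect(item.event_definition)
            tracker["failures"] = 0
            tracker["last_success"] = time.time()
            # Enqueue on violation, or on recovery so the primary thread sees the change.
            if result.violated or tracker["state"] != "OK":
                fault_fifo.put(make_fault_event(item, result, tracker))
            tracker["state"] = "VIOLATED" if result.violated else "OK"
        except EvaluationError as exc:
            # Configuration problem: report immediately, no retry.
            tracker["state"] = "BROKEN"
            fault_fifo.put(make_fault_event(item, None, tracker, exception=exc))
        except Exception as exc:
            # Transport/query problem: count it and let the primary thread decide.
            tracker["failures"] += 1
            tracker["state"] = "DEGRADED"
            fault_fifo.put(make_fault_event(item, None, tracker, exception=exc))


def make_fault_event(item, result, tracker, exception=None):
    # Placeholder factory; the real daemon builds a fully populated FaultEvent.
    return {"rule_id": item.rule_id, "result": result,
            "failures": tracker["failures"], "exception": exception}
```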
+ +#### Failure Classification Summary + +| Failure Type | Detection Point | Recovery Strategy | Impact | Telemetry | +|--------------|----------------|-------------------|--------|--------| +| **Schema Validation Error** | Primary thread during rule ingestion | Skip rule, continue with others | Rule excluded from execution plan | Included in `broken_rules` list | +| **DSE Resolution Error** | Primary thread during rule ingestion | Skip rule, continue with others | Rule excluded from execution plan | Included in `broken_rules` list | +| **Adapter Validation Error** | Primary thread during rule ingestion | Skip rule, continue with others | Rule excluded from execution plan | Included in `broken_rules` list | +| **Query Error** | Monitor thread during `collect()` | Mark broken, attempt further retries until primary thread removes from execution plan | Rule excluded from execution plan upon decision from primary thread | Added to `broken_rules` immediately | +| **Evaluation Error** | Monitor thread during `collect()` | Mark broken immediately, no retry | Rule permanently disabled until re-ingestion | Added to `broken_rules` immediately | +| **Action Execution Error** | Primary thread during local action execution, this includes the post action requery | Log error, publish fault with failure annotation | Fault reported despite action failure | Fault includes action failure metadata | + +#### Broken Rule Reporting + +Broken rules are published to the service state telemetry: + +```json +{ + "DLDD_STATUS|process_state": { + "expireat": 1746122880.1234567, + "ttl": 120, + "type": "hash", + "value": { + "state": "DEGRADED", + "running_schema": "0.0.1", + "individual_max_failure_threshold": 4, + "broken_rules_max_threshold": 5, + "broken_rules": [ + { + "rule": "PSU_OV_FAULT", + "version": "1.0.0", + "reason": "query_error: I2C bus 6 not responding", + "failure_count": 3, + "state": "DEGRADED", + "last_attempt": 1735678901.234 + }, + { + "rule": "TEMP_THRESHOLD_CHECK", + "version": "1.0.2", + "reason": "evaluation_error: evaluator type mismatch", + "failure_count": 1, + "state": "BROKEN", + "last_attempt": 1735678905.678 + } + ], + "reason": "1 rule(s) [PSU_OV_FAULT] degraded, 1 rule(s) [TEMP_THRESHOLD_CHECK] broken" + } + } +} +``` + +**Schema Fields**: + +**Top-Level Fields**: +- **`expireat`**: Unix timestamp when the key expires. DLDD refreshes this every 120 seconds. +- **`ttl`**: Time-to-live in seconds (default: 120). If DLDD fails to update this key within the TTL window, the controller can assume the service is unresponsive. +- **`type`**: Redis data structure type (always `"hash"`). + +**Value Object Fields**: +- **`state`**: Service health status. Values: + - `"OK"`: All rules functional, no broken rules + - `"DEGRADED"`: Some rules broken but service operational (broken rule count < `broken_rules_max_threshold`) + - `"BROKEN|FATAL"`: Critical failure, service non-functional (broken rule count ≥ `broken_rules_max_threshold` or fatal service error) +- **`running_schema`**: Version of the vendor rules schema currently loaded (e.g., `"0.0.1"`). +- **`individual_max_failure_threshold`**: Configurable threshold for how many consecutive failures a single rule can experience before being marked `"BROKEN"` and removed from execution plan. +- **`broken_rules_max_threshold`**: Configurable threshold for how many total broken rules will trigger the service `state` to become `"BROKEN|FATAL"`. +- **`broken_rules`**: Array of rules that failed validation or exceeded runtime failure thresholds. 
Empty array when `state` is `"OK"`. +- **`reason`**: Human-readable explanation of the current state. Empty when `state` is `"OK"`. + +**Broken Rule Object Fields**: +- **`rule`**: Rule identifier from the signature metadata. +- **`version`**: Rule version from the signature metadata. +- **`reason`**: Detailed failure cause with error type prefix (`"query_error:"`, `"evaluation_error:"`, `"schema_error:"`, `"dse_error:"`, `"validation_error:"`). +- **`failure_count`**: Number of consecutive failures observed for this rule. +- **`state`**: Rule-level health status. Values: + - `"DEGRADED"`: Rule experiencing failures but still in execution plan (failure count < `individual_max_failure_threshold`) + - `"BROKEN"`: Rule removed from execution plan due to exceeding failure threshold or fatal error (evaluation errors) +- **`last_attempt`**: Unix timestamp of the most recent evaluation attempt for this rule. + +### Service Configuration + +#### Enable/Disable Service + +DLDD follows the standard SONiC pattern for service management using the `FEATURE` table in CONFIG_DB. This ensures consistent behavior with other SONiC services. + +**CONFIG_DB FEATURE Table**: + +```json +{ + "FEATURE": { + "dldd": { + "state": "enabled", + "auto_restart": "enabled", + "has_timer": "false", + "has_global_scope": "true", + "has_per_asic_scope": "false" + } + } +} +``` + +**CLI Commands**: + +```bash +# Enable DLDD service (persistent across reboots) +sudo config feature state dldd enabled + +# Disable DLDD service +sudo config feature state dldd disabled + +# Check service status +show feature status dldd +``` + +#### Threshold Configuration + +The failure thresholds (`individual_max_failure_threshold` and `broken_rules_max_threshold`) control when rules and the service transition between health states. These are stored in CONFIG_DB under the `DLDD_CONFIG` table. + +**CONFIG_DB DLDD_CONFIG Table**: + +```json +{ + "DLDD_CONFIG": { + "global": { + "individual_max_failure_threshold": "10", + "broken_rules_max_threshold": "5", + "redis_monitor_polling_interval": "60", + "file_monitor_polling_interval": "60", + "common_monitor_polling_interval": "60" + } + } +} +``` + +**Vendor Defaults**: + +Vendors can provide platform-specific defaults in `/usr/share/sonic/device//dldd-config.yaml`. On service start, these are loaded into CONFIG_DB if no `DLDD_CONFIG` entry exists. + +```yaml +dldd_config: + individual_max_failure_threshold: 10 + broken_rules_max_threshold: 5 + redis_monitor_polling_interval: 60 + file_monitor_polling_interval: 60 + common_monitor_polling_interval: 60 +``` + +**Default Values**: + +If `DLDD_CONFIG` is not present in CONFIG_DB or vendor defaults are not provided, DLDD uses hardcoded defaults: +- `individual_max_failure_threshold`: 10 (rule marked `BROKEN|FATAL` after 10 consecutive failures) +- `broken_rules_max_threshold`: 5 (service marked `BROKEN|FATAL` after 5 rules broken) +- `redis_monitor_polling_interval`: 60 seconds +- `file_monitor_polling_interval`: 60 seconds +- `common_monitor_polling_interval`: 60 seconds + +Below the threshold, rules/service will be considered in a `DEGRADED` state and will continue to run. 
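
A sketch of how these defaults could be resolved at service start is shown below. It assumes PyYAML and the SONiC `swsscommon` `ConfigDBConnector` helper are available in the container, and it reuses the vendor defaults path exactly as written above (platform directory elided); none of this is a committed implementation.

```python
import os
import yaml                                            # PyYAML, assumed available
from swsscommon.swsscommon import ConfigDBConnector   # assumed SONiC helper

HARDCODED_DEFAULTS = {
    "individual_max_failure_threshold": 10,
    "broken_rules_max_threshold": 5,
    "redis_monitor_polling_interval": 60,
    "file_monitor_polling_interval": 60,
    "common_monitor_polling_interval": 60,
}

VENDOR_DEFAULTS_PATH = "/usr/share/sonic/device//dldd-config.yaml"  # platform dir elided


def load_dldd_config():
    """Resolve settings: hardcoded defaults, then vendor YAML, then CONFIG_DB."""
    config = dict(HARDCODED_DEFAULTS)

    # Vendor platform defaults override the hardcoded values when present.
    if os.path.exists(VENDOR_DEFAULTS_PATH):
        with open(VENDOR_DEFAULTS_PATH) as f:
            vendor = yaml.safe_load(f) or {}
        config.update(vendor.get("dldd_config", {}))

    # Operator configuration in CONFIG_DB has the highest precedence.
    config_db = ConfigDBConnector()
    config_db.connect()
    operator_cfg = config_db.get_entry("DLDD_CONFIG", "global") or {}
    config.update({k: int(v) for k, v in operator_cfg.items()})

    return config
```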
+ +**Operator Configuration**: + +Operators modify thresholds and polling intervals via SONiC `config` commands, which write to CONFIG_DB: + +```bash +# Set individual rule failure threshold +sudo config dldd threshold individual-max-failure 15 + +# Set service-level broken rules threshold +sudo config dldd threshold broken-rules-max 8 + +# Set monitor polling intervals (in seconds) +sudo config dldd polling-interval redis 30 +sudo config dldd polling-interval file 120 +sudo config dldd polling-interval common 60 + +# View current configuration +show dldd config +``` + +DLDD subscribes to `DLDD_CONFIG` changes via Redis SUBSCRIBE and applies updates dynamically without requiring a service restart. The currently active thresholds are published in the `DLDD_STATUS|process_state` telemetry, allowing the controller to understand the service's failure tolerance. + +**Configuration Precedence**: + +1. CONFIG_DB `DLDD_CONFIG` table (highest priority - operator configuration) +2. Vendor platform defaults (`/usr/share/sonic/device//dldd-config.yaml`) +3. Hardcoded service defaults (fallback) + +#### Controller Actions + +- **Key expired or missing**: DLDD likely crashed; attempt service restart or more aggressive restart +- **`state: "DEGRADED"`**: Review `broken_rules`, consider pushing updated rule definitions or adjusting thresholds via CONFIG_DB. Can be considered a NO-OP +- **`state: "BROKEN|FATAL"`**: For the service state: critical failure (too many broken rules, rules file corrupted, all DSE resolution failed); escalate to vendor. For the rule state: isolated failure; continue to monitor and potentially escalate to vendor. +- **Persistent failures**: Consistent service or rule degradation doesn't particularly point to HW issue. Investigate rules source and thresholds. + +## Integration Points + +### Platform Monitor Integration +- **Non-Interference**: DLDD monitoring should not disrupt existing PMON daemons +- **Data Sharing**: Leverage existing sensor data collection where possible, limits redundant data collection +- **Resource Sharing**: DLDD will execute all operations within a thread serially, any underlying potential resource contention should be handled by layers below DLDD or within the vendor DSE hooks. + +### Additional Vendor Log Collection Location +- **Fault Storage**: Beyond Healthz artifacts, vendor log hooks are allowed to log in any location. 
+- **Example of Vendor Log Location**: A vendor may want to put limited data into OBFL for long term storage, this would happen within the query defined under `log_collection` in the rules schema + +### gNOI Healthz Integration +- **Artifact Generation**: On fault detection, DLDD executes all `log_collection` queries defined in the triggered rule, capturing diagnostic data (logs, register dumps, CLI outputs) and packaging them into timestamped artifacts +- **Artifact Lifecycle**: Healthz maintains these artifacts with configurable retention policies, ensuring recent fault diagnostics are available for controller `get` operations while managing storage limits +- **Structured Bundling**: Artifacts include fault metadata (rule ID, timestamp, component info) alongside the collected diagnostic data, providing full context for post-mortem analysis +- **Remote Access**: Controllers retrieve artifacts via gNOI Healthz `get` operations, enabling centralized log aggregation without requiring direct device access or custom file transfer mechanisms + +## Telemetry and Diagnostics + +### Redis Fault Reporting +Fault information is published to Redis `FAULT_INFO` table in the STATE_DB. Conversion into OpenConfig is handled by UMF. + +```json +{ + "FAULT_INFO|PSU_0|SYMPTOM_OVER_THRESHOLD": { + "expireat": 1745614218.6978462, + "ttl": -0.001, + "type": "hash", + "value": { + "rule": "PSU_OV_FAULT", + "component_info": { + "component": "PSU", + "name": "PSU 0", + "serial_number": "" + }, + "error_type": "POWER", + "events": [ + { + "id": 1, + "value_read": "0b10000000", + "value_configs": { + "type": "binary", + "unit": "N/A" + }, + "condition": { + "type": "mask", + "value": "0b10000000", + "value_configs": { + "type": "binary", + "unit": "N/A" + } + } + } + ], + "time_window": 86400, + "repair_actions": ["ACTION_RESEAT", "ACTION_COLD_REBOOT", "ACTION_POWER_CYCLE", "ACTION_FACTORY_RESET", "ACTION_REPLACE"], + "actions_taken": ["PSU:reset_output_power()"], + "severity": "MINOR_ALARM", + "symptom": "SYMPTOM_OVER_THRESHOLD", + "first_seen_timestamp": 1745614206.0123456, #first instance of failure + "timestamp": 1745614206.0123456, #current timestamp of failure + "current_state": false, #added to track CURRENT state of failure (true being is fault) + "description": "An over voltage fault has occurred on the output feed from the PSU to the chassis." + } + } +} +``` + +**Field Descriptions**: +- **Key Format**: `FAULT_INFO||` where spaces in component names are replaced with underscores (e.g., "PSU 0" becomes "PSU_0", resulting in key `FAULT_INFO|PSU_0|SYMPTOM_OVER_THRESHOLD`) +- **`rule`**: Rule identifier from the vendor rules schema that triggered this fault +- **`component_info`**: Object containing component identification details + - **`component`**: Component type (PSU, FAN, ASIC, TRANSCEIVER, etc.) + - **`name`**: Human-readable component name as reported by platform API or defined in rule instance + - **`serial_number`**: Serial number of the associated component or parent component, lowest available in hierarchy +- **`error_type`**: High-level error category (POWER, THERMAL, TRANSCEIVER, MEMORY, etc.) 
from rule metadata +- **`events`**: Array of event objects representing the data points and conditions evaluated for this fault (only includes events that triggered the fault) + - **`id`**: ID taken from the rule schema associated with the originating event + - **`value_read`**: Raw value read from the data source for this event, formatted according to `value_configs.type` + - **`value_configs`**: Metadata about the value format + - **`type`**: Data type of `value_read` (binary, hex, int, string, float, etc.) + - **`unit`**: Unit of measurement for the value (millivolts, celsius, RPM, N/A, etc.) + - **`condition`**: The evaluation condition that triggered the fault for this event + - **`type`**: Evaluation type (mask, comparison, string, boolean) from rule + - **`value`**: Expected/threshold value that triggered the fault + - **`value_configs`**: Format metadata for the condition value +- **`time_window`**: Time window in seconds from the rule within which the fault condition must persist +- **`repair_actions`**: List of escalation actions from the rule that the controller can execute (ACTION_RESEAT, ACTION_COLD_REBOOT, ACTION_POWER_CYCLE, ACTION_FACTORY_RESET, ACTION_REPLACE) +- **`actions_taken`**: List of local actions already executed by DLDD to protect device health; ["NONE"] if no local actions taken +- **`severity`**: Fault severity level from rule metadata (MINOR_ALARM, MAJOR_ALARM, CRITICAL_ALARM, etc.) +- **`symptom`**: OpenConfig-defined symptom enum that categorizes the fault for standardized controller processing +- **`timestamp`**: Unix timestamp when the fault was originally detected by DLDD monitor thread +- **`description`**: Human-readable fault description from the rule metadata + + +## File Management and Rule Updates + +### Critical Files +- **Rules Source**: `/usr/share/sonic/device//dld_rules.yaml` (vendor-provided, remotely updatable via gNOI File service) +- **Golden Backup**: `/usr/share/sonic/device//dld_rules_golden.yaml` (fallback configuration) +- **DSE Configuration**: `/usr/share/sonic/device//dld_dse.yaml` (platform-specific Data Source Extension mappings for abstract identifiers) +- **State Data**: `/var/lib/sonic/dld_state.json` (persistent issue tracking across service restarts) + +### State Persistence + +DLDD maintains persistent state in `/var/lib/sonic/dld_state.json` to track rule and service failures across service restarts and reboots. Only failing, degraded, or broken rules are persisted — rules in OK state are not tracked. + +**Persisted Data**: +- **Failing rule counters**: Consecutive failure counts for rules currently experiencing issues (DEGRADED or BROKEN|FATAL state only) +- **Degraded/Broken rule list**: Rules currently in DEGRADED or BROKEN|FATAL state, with their failure counts and state +- **Service failure count**: Number of broken rules contributing to service-level DEGRADED or BROKEN|FATAL state +- **Rule source checksum**: Hash of the active rules file to detect rule source changes + +**State Recovery**: On service start, DLDD loads the state file and: +1. Validates the rule source checksum—if changed, clears all state and starts fresh +2. Restores failure counters only for rules still present and still failing +3. Re-applies broken rule exclusions to avoid re-evaluating known bad rules +4. 
Creates a new state file if none exists (fresh install or corruption recovery) + +**State Reset**: The state file is automatically reset when: +- Rules source file (`dld_rules.yaml`) is modified or replaced +- User triggered restart of the service via `systemctl restart dldd` will reset the state file + +**State Updates**: The state file is updated after each batch of fault processing, but only for rules that have entered or remain in DEGRADED/BROKEN states. Once a rule is no longer in DEGRADED/BROKEN state, it will be removed from the state file after 5 minutes of no further failures, this is to ensure that a potentially flaky rule does not get removed from the state file prematurely. + +### Rule Update Process +1. **Remote Delivery**: Controller pushes updated rules via gNOI File service +2. **File Monitoring**: Systemd timer (`dldd-rules-watch.timer`) checks file every 5 minutes +3. **Validation**: DLDD validates schema and DSE references before activation +4. **Reload**: Service reload (`systemctl reload dldd`) re-ingests rules without full restart +5. **Rollback**: Automatic fallback to golden backup on validation errors + +### Runtime File Monitoring +- **Checksum Validation**: Watcher detects file modifications and triggers reload +- **Absent File Handling**: If rules file disappears, watcher stops service, restores golden backup, and restarts +- **Golden Backup Maintenance**: Controller should update golden backup after successful rule validation + + + +## Testing and Validation + +### Schema Validation Utility + +DLDD provides a built-in validation utility for offline testing and schema validation of rules files before deployment. This allows vendors and operators to validate rules without requiring a full service restart or service impact. + +**CLI Interface**: + +```bash +# Validate a rules file against the current schema +sudo dldd validate-rules --file /path/to/dld_rules.yaml + +# Validate with verbose output (show all DSE resolutions and evaluator checks) +sudo dldd validate-rules --file /path/to/dld_rules.yaml --verbose + +# Validate and show JSON output for automation +sudo dldd validate-rules --file /path/to/dld_rules.yaml --json +``` + +**Validation Checks**: +- **Schema Version Compatibility**: Verifies the rules file `schema_version` is supported by the daemon +- **YAML Syntax**: Validates well-formed YAML structure and required fields +- **Schema Structure**: Ensures all mandatory fields are present in rules and instances according to the schema definition +- **DSE Resolution**: Attempts to resolve all Data Source Extension references against the platform DSE configuration file (`dld_dse.yaml`) +- **Field Type Validation**: Validates data types for all fields (strings, integers, enums, etc.) 
+- **Enum Validation**: Checks that repair_actions use only supported ACTION_* values, evaluation types are valid (mask/comparison/string/boolean), severity levels are recognized + +**Output Format Example**: + +``` +Schema version: 1.0 +Rules parsed successfully: 43 +Rules failed validation: 2 + - PSU_TEMP_THRESHOLD: Missing required field 'severity' in rule metadata + - FAN_SPEED_CHECK: DSE reference 'UNKNOWN_DSE' not found in platform config +Result: FAILED +``` + +**Implementation Notes**: +- Validation runs in dry-run mode without starting monitor threads, querying data sources, or testing evaluator logic +- Validation does NOT test whether evaluators will work correctly at runtime or whether data sources are accessible +- Validation results include line numbers and field paths for error localization +- Exit code 0 for success, non-zero for validation failures + +### Unit Testing + +**Adapter Testing with Mocked Libraries**: + +Each adapter type (Redis, Platform API, I2C, CLI, File) requires comprehensive unit tests with mocked underlying libraries to validate the adapter interface implementation without hardware dependencies. + +All adapters must pass conformance tests for the DataSourceAdapter interface using mock data sources to validate the underlying logic without hardware dependencies. Tests must verify that `validate()` correctly identifies invalid configurations, `get_value()` returns properly formatted values matching `value_configs` specifications, `get_evaluator()` constructs correct evaluation logic from rule definitions, `run_evaluation()` produces EvaluationResult objects with correct violation status, and `collect()` chains the full workflow while handling exceptions appropriately. + +### Integration Testing + +Integration tests validate the complete rule execution pipeline from ingestion through telemetry publication using mock hardware and synthetic data sources. Tests must cover both successful fault detection scenarios and various failure modes to validate the service's graceful degradation behavior and telemetry accuracy. + +**Healthy System Testing**: Load a rules file with multiple rules across different data sources (Redis, Platform API, I2C, CLI, File) and inject synthetic data that does not violate any conditions. Verify that `DLDD_STATUS|process_state` shows `state: OK` with empty `broken_rules` array, no fault entries are published to the `FAULT_INFO` table, and the heartbeat TTL refreshes every 120 seconds. + +**Fault Detection Testing**: Deploy rules that detect actual faults, such as temperature threshold violations or power supply anomalies. When synthetic data is injected that violates rule conditions, verify that `FaultEvent` objects are correctly enqueued with violation status, faults are published to `FAULT_INFO` with complete telemetry payloads (including rule, component_info, events array with value_read/condition pairs, severity, symptom), and any local actions defined in the rules are executed. Confirm that `process_state` remains in `state: OK` since the service itself is healthy despite detecting hardware faults. + +**Rule Failure State Transitions**: Test the progression through DEGRADED and BROKEN states by simulating transient and persistent adapter failures. For transient failures, force I2C "timeout" exceptions for several consecutive polling cycles and verify that per-rule failure counters increment correctly, the rule enters DEGRADED state but remains in the execution plan, and `process_state` reflects the degraded rules list. 
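
The transient-failure scenario above could be exercised with a pytest-style sketch like the following. The stub adapter, the `dldd.monitor` module path, and the tracker layout are hypothetical and mirror the earlier sampling-loop sketch; they are not real test fixtures.

```python
import queue

from dldd.monitor import sample_once   # hypothetical module; see the sampling-loop sketch above


class TimeoutAdapter:
    """Stub adapter whose collect() always raises a transport timeout."""

    def collect(self, event_definition):
        raise TimeoutError("I2C timeout (simulated)")


class WorkItem:
    def __init__(self, rule_id):
        self.rule_id = rule_id
        self.event_definition = {}


def test_transient_failures_mark_rule_degraded():
    fifo = queue.Queue()
    trackers = {}
    item = WorkItem("PSU_OV_FAULT")

    # Simulate several consecutive polling cycles with a failing transport.
    for _ in range(3):
        sample_once([item], TimeoutAdapter(), fifo, trackers)

    assert trackers["PSU_OV_FAULT"]["failures"] == 3
    assert trackers["PSU_OV_FAULT"]["state"] == "DEGRADED"
    assert fifo.qsize() == 3          # one FaultEvent per failed cycle
```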
+ +**Service-Level State Testing**: Validate service-level BROKEN|FATAL state by loading a rules file with multiple rules and simulating failures that cause enough rules to break to exceed default `broken_rules_max_threshold` (default: 5). Verify that `process_state` transitions to `state: BROKEN|FATAL`, the service continues running but publishes critical state to the controller, and all broken rules are listed with diagnostics in the `process_state.broken_rules` array. + +**Configuration Error Handling**: Test schema validation failures by loading rules with invalid evaluation logic or missing required fields and verifying that affected rules are marked broken at ingestion time with appropriate diagnostics (e.g., "evaluation_error" or "schema_validation_error"). Test DSE resolution failures by loading rules with invalid DSE references (e.g., `@UNKNOWN_DSE`) and confirming rules are marked broken during execution plan materialization with "dse_resolution_error" diagnostics. In both cases, verify the service starts successfully and continues evaluating a valid rule. + +### Platform Testing + +Platform testing validates DLDD behavior on actual hardware with vendor-provided rules and DSE configurations, leveraging the SONiC management (sonic-mgmt) test framework for automated validation. If the vendor-specific rules file is available on target hardware, verify that all rules execute without adapter failures, confirming that DSE references correctly resolve to hardware paths such as I2C buses, Redis keys, and platform API methods. On a per adapter basis, select a rule from the rules file and hook into the underlying APIs for more targeted validation of the APIs. Test DLDD CLI commands to validate the service is running and to view the current state of the service. + +## Restrictions and Limitations + +**Service Limitations**: +- DLDD does not perform active hardware remediation beyond vendor-defined local actions specified in rules +- Remote `ACTION_*` escalations must be executed by controllers; DLDD only publishes the request +- No support for per-rule polling intervals; all rules within a monitor thread currently share the same polling interval + +**Rule Limitations**: +- Rules must conform to the schema defined in `vendor-rules-schema-hld.md` +- DSE references must be defined in the platform-specific DSE configuration file (`/usr/share/sonic/device//dld_dse.yaml`) + +**Platform Dependencies**: +- Requires vendor-provided rules file at `/usr/share/sonic/device//dld_rules.yaml` +- Any rule that uses DSE to hook into a platform API relies on said API being implemented by the vendor. + +--- + +*This document defines the Device Local Diagnosis Service implementation. For details on the rules schema format, refer to the companion Vendor Rules Schema HLD document.* + +## References + +- `vendor-rules-schema-hld.md` diff --git a/doc/device_local_diagnosis/images/dldd-graphic-design.jpg b/doc/device_local_diagnosis/images/dldd-graphic-design.jpg new file mode 100644 index 00000000000..24c91b53106 Binary files /dev/null and b/doc/device_local_diagnosis/images/dldd-graphic-design.jpg differ diff --git a/doc/device_local_diagnosis/vendor-rules-schema-hld.md b/doc/device_local_diagnosis/vendor-rules-schema-hld.md new file mode 100644 index 00000000000..378a16e3d32 --- /dev/null +++ b/doc/device_local_diagnosis/vendor-rules-schema-hld.md @@ -0,0 +1,564 @@ +# Device Local Diagnosis Rules Schema HLD + +## Table of Contents + +1. [Introduction](#introduction) +2. [Definitions](#definitions) +3. 
[Requirements](#requirements)
4. [Schema Architecture](#schema-architecture)
5. [Schema Versioning](#schema-versioning)
6. [Rule Structure](#rule-structure)
7. [Abstract Rule Data Source Extensions](#abstract-rule-data-source-extensions)
8. [Schema Layout Definitions](#schema-layout-definitions)
9. [Rule Examples](#rule-examples)
10. [Schema Validation](#schema-validation)
11. [Backward Compatibility](#backward-compatibility)

## Introduction

This document defines the schema and structure for vendor rules consumed by the Device Local Diagnosis (DLD) daemon running on SONiC switches. The rules schema provides a standardized, extensible format for defining fault detection signatures.

The schema is designed to be:
- **Flexible**: Support multiple data sources (i2c, Redis, platform APIs, CLI, files, etc.)
- **Versioned**: Enable and track schema modifications
- **Extensible**: Allow for new fault types and detection methods
- **Standardized**: Provide a common format for rule definitions regardless of underlying SW
- **Hardware-agnostic**: Allow for hardware abstraction through data source extension (DSE) layers

## Definitions

| Term | Definition |
|------|------------|
| **Schema Version** | Version identifier for the rules structure format |
| **Signature** | A complete fault detection rule with metadata, conditions, and actions |
| **Event** | A specific data collection and evaluation point within a signature |
| **Data Source Extension (DSE)** | Translation layer between abstract rule definitions and hardware/software specific implementation |
| **Abstract Rule** | Rule using DSE identifiers that are resolved through data source extension files |
| **Direct Rule** | Rule with explicit hardware-specific paths, bypassing the DSE layer |

## Requirements

### Functional Requirements
- Support multiple SW versions and hardware revisions within a single schema
- Support both abstract (DSE) and direct rule definitions
- Enable fault correlation across multiple events and conditions
- Schema must be human-readable and maintainable
- Schema evolution must maintain backward compatibility with existing implementations wherever possible; changes that break compatibility must increment the schema major version as defined below

## Schema Versioning

### Version Format
The schema version follows semantic versioning: `MAJOR.MINOR.PATCH`

- **MAJOR**: Non-backward compatible changes requiring modification of the on-device component
- **MINOR**: Backward compatible additions such as new optional fields or evaluation types
- **PATCH**: Minor corrections and clarifications

### Version Header
Every rules source file must begin with a schema version declaration:

```yaml
schema_version: "0.0.1"
```

**CRITICAL**: This header format is immutable and serves as the entry point for schema interpretation.

### Versioning and Compatibility with SONiC NOS

The schema version does not have an explicit associated DLD daemon or SW version requirement. Schema versioning is independent of the software release cycle, allowing for:

- Multiple schema versions supported by a single DLD daemon version
- Backward compatibility across software releases
- Independent evolution of schema structure and daemon implementation

The DLD daemon is responsible for handling schema version compatibility through its schema layout definitions.
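
A minimal sketch of how the daemon side might gate rule activation on the declared schema version is shown below; the supported-version set and function name are illustrative assumptions.

```python
SUPPORTED_MAJOR_VERSIONS = {0, 1}   # illustrative; real values come from the layout definitions


def is_schema_supported(schema_version: str) -> bool:
    """Gate rule activation on the MAJOR component of `schema_version`.

    MINOR and PATCH changes are backward compatible per the versioning rules above,
    so only a MAJOR mismatch (or a malformed version string) rejects the rules file.
    """
    try:
        major, _minor, _patch = (int(part) for part in schema_version.split("."))
    except ValueError:
        return False                 # malformed version string
    return major in SUPPORTED_MAJOR_VERSIONS
```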
+
+## Rule Structure
+At the highest level, each rule contains 3 primary sections: `metadata`, `conditions`, and `actions`. A breakdown of the content of each of these can be found below:
+
+### Signature Metadata
+Each signature contains comprehensive metadata for identification and applicability. Every field serves a specific purpose in rule processing and system integration:
+
+- **Severity Ordering**: The `severity` field encodes the OpenConfig alarm severity (`CRITICAL`, `MAJOR`, `WARNING`, `MINOR`, `UNKNOWN`). Higher severity signatures always take precedence when multiple rules target the same component/symptom pair.
+- **Priority Tiebreaker**: The optional `priority` field provides deterministic ordering for rules that share the same severity and symptom. Lower numeric values indicate higher priority; when omitted, adapters treat the priority as `5`.
+
+```yaml
+signature:
+  metadata:
+    name: "PSU_OV_FAULT"               # Required: Unique string identifier for the rule
+    id: 1000001                        # Required: Unique numeric ID for cross-referencing
+    version: "1.0.0"                   # Required: Semantic version for rule tracking
+    description: |                     # Required: Human-readable fault explanation
+      An over voltage fault has occurred on the output feed from the PSU to the chassis.
+      This condition indicates potential hardware failure requiring immediate attention.
+    product_ids:                       # Required: List of compatible hardware products
+      - "8122-64EHF-O P1"              # Product ID with hardware revision
+      - "8122-64EHF-O P2"              # Multiple products can share the same rule
+    sw_versions:                       # Required: List of compatible software versions
+      - "202311.3.0.1"                 # Specific software version where rule is validated
+      - "202311.3.0.2"                 # Additional compatible versions
+    component: "PSU"                   # Required: Component type affected by fault
+    symptom: "SYMPTOM_OVER_THRESHOLD"  # Required: OpenConfig symptom enumeration
+    severity: "CRITICAL"               # Required: OpenConfig severity enumeration
+    priority: 1                        # Optional: Numeric priority to account for rule ordering (default is 5 when omitted)
+    tags:                              # Optional: Classification tags for filtering, below is just an example
+      - "power"                        # Functional category tag
+      - "voltage"                      # Specific fault type tag
+```
+
+#### Metadata Field Details
+
+| Field | Type | Required | Description | Valid Values | Example |
+|-------|------|----------|-------------|--------------|----------|
+| `name` | String | Yes | Unique human-readable identifier for the rule | Alphanumeric with underscores | `"PSU_OV_FAULT"` |
+| `id` | Integer | Yes | Unique numeric identifier for programmatic reference | 1000000-9999999 | `1000001` |
+| `version` | String | Yes | Semantic version following MAJOR.MINOR.PATCH format | Semantic versioning | `"1.0.0"` |
+| `description` | String | Yes | Multi-line human-readable explanation of the fault condition | Plain text, can use YAML literal block | See example above |
+| `product_ids` | List | Yes | Hardware products where this rule applies | Product ID (and HW version) are defined in vendor EEPROM, format dependent on that | `["8122-64EHF-O P1"]` |
+| `sw_versions` | List | Yes | SW versions where rule is validated | SW version formatting dependent on NOS, given example is for Cisco release identifier for SONiC NOS | `["202311.3.0.1"]` |
+| `component` | String | Yes | Primary component category affected | `PSU`, `FAN`, `CHASSIS`, `SSD`, `CPU`, `MEMORY` | `"PSU"` |
+| `symptom` | String | Yes | OpenConfig alarm symptom enumeration for telemetry | OpenConfig defined symptoms | `"SYMPTOM_OVER_THRESHOLD"` |
+| `severity` | String | Yes | OpenConfig alarm
severity used for precedence | OpenConfig defined alarms | `"CRITICAL"` | +| `priority` | Integer | No | Numeric priority for rules with matching severity and symptom (lower value = higher priority, default = 5 when omitted) | Non-negative integer | `5` | +| `tags` | List | No | Categorization tags for filtering and organization | Arbitrary strings | `["power", "voltage"]` | + +### Condition Logic +Conditions define the logical evaluation framework for determining when a fault has occurred. This section controls how multiple events are correlated and evaluated: + +```yaml +conditions: + logic: "1 AND 2" # Required: Boolean expression referencing event IDs + logic_lookback_time: 60 # Required: Time window for event correlation (seconds) + events: # Required: List of individual detection events + - event: + id: 1 # Required: Unique identifier within this signature + # ... event definition + - event: + id: 2 # Required: Must be unique within signature + # ... event definition +``` + +#### Condition Field Details + +| Field | Type | Required | Description | Valid Values | Example | +|-------|------|----------|-------------|--------------|----------| +| `logic` | String | Yes | Boolean expression defining how events are combined | Boolean operators: `AND`, `OR`, `NOT` with event IDs | `"1 AND 2"`, `"1 OR (2 AND 3)"`, `"NOT 1"` | +| `logic_lookback_time` | Integer | Yes | Time window in seconds for correlating events | 0-86400 (0=instant, 86400=24 hours) | `60` (1 minute window) | +| `events` | List | Yes | Array of event definitions that can trigger the fault | Must contain at least 1 event | See Event Definition below | + +#### Logic Expression Rules +- **Event References**: Use numeric IDs that match event `id` fields +- **Operators**: `AND`, `OR`, `NOT` (case sensitive) +- **Precedence**: Use parentheses for complex expressions: `"(1 OR 2) AND 3"` +- **Simple Cases**: Single event: `"1"`, Multiple events: `"1 AND 2"` +- **Time Correlation**: All events must occur within `logic_lookback_time` seconds + +### Event Definition +Events specify individual data collection points and their evaluation criteria. 
Each event represents a specific check that can contribute to fault detection: + +```yaml +event: + id: 1 # Required: Unique identifier within signature + type: "i2c" # Required: Data source type + instances: ['PSU0:IO-MUX-6', 'PSU1:IO-MUX-7'] # Optional: Device instance DSE + path: # Required: Data source specification (varies by type) + bus: ['IO-MUX-6', 'IO-MUX-7'] # I2C bus names (resolver notation) + chip_addr: '0x58' # I2C chip address (hex format) + i2c_type: 'get' # I2C operation type + command: '0x7A' # I2C register/command (hex format) + size: 'b' # Data size (b=byte, w=word, l=long) + scaling: 'N/A' # Optional: Value scaling factor + evaluation: # Required: Fault detection criteria + type: 'mask' # Evaluation method + logic: '&' # Logical operation for mask + value: '10000000' # Comparison value (binary string) + match_count: 1 # Required: Number of matches needed + match_period: 0 # Required: Time window for matches (seconds) +``` + +#### Event Field Details + +| Field | Type | Required | Description | Valid Values | Example | +|-------|------|----------|-------------|--------------|----------| +| `id` | Integer | Yes | Unique identifier within the signature | 1-999 (unique per signature) | `1` | +| `type` | String | Yes | Data source type determining path structure | `i2c`, `redis`, `dse`, `cli`, `sysfs`, etc | `"i2c"` | +| `instances` | List | No | Device instance, can be leveraged to identify device name or specify multiple devices to query | Format: `"DeviceName:PathIdentifier"` (if PathIdentifier is an empty string, it will be assumed to apply to the entirety of the event) | `["PSU0:IO-MUX-6"]` | +| `path` | Object | Yes | Data source specification (structure varies by type) | See Path Specifications below | See examples below | +| `evaluation` | Object | Yes | Criteria for determining if fault condition is met | See Evaluation Specifications | See examples below | +| `match_count` | Integer | Yes | Number of positive evaluations needed to trigger event | 1-1000 | `1` | +| `match_period` | Integer | Yes | Time window in seconds for accumulating matches | 0-3600 (0=instant) | `0` | + +#### Path Specifications by Type + +Any change to the schema that results in the structure of the path content changing must update this section accordingly. A running history of older schemas and their layouts can be maintained elsewhere. 
Currently the below examples are for schema version: "0.0.1": + +**I2C Path Structure:** +```yaml +path: + bus: ['IO-MUX-6', 'IO-MUX-7'] # List of bus names (notation defined by vendor with association to instance, in this case the example is "ACPI_nickname-mux-number") + chip_addr: '0x58' # Hex address of target chip + i2c_type: 'get' # Operation: 'get', 'set' + command: '0x7A' # Register/command in hex + size: 'b' # Data size: 'b'(byte), 'w'(word), 'l'(long) + scaling: 'N/A' # Scaling factor or 'N/A' +``` + +**Redis Path Structure:** +```yaml +path: + database: 'STATE_DB' # Redis database name + table: 'PSU_INFO' # Table name within database + key: 'PSU_INFO|PSU 0' # Full key or template + path: 'value/output_voltage' # JSON path within value +``` + +**DSE Path Structure (Abstract):** +```yaml +path: "PSU:get_output_voltage_fault_register()" # Abstract DSE reference +``` + +**CLI Path Structure:** +```yaml +path: + command: 'lspci -vvnnt | grep a008' # Shell command to execute + timeout: 30 # Optional: Command timeout in seconds +``` + +#### Evaluation Specifications + +**Mask Evaluation (Bitwise Operations):** +```yaml +evaluation: + type: 'mask' # Evaluation method + logic: '&' # Bitwise operator: '&', '|', '^' + value: 0b10000000 # Mask value (binary string) +``` + +**Comparison Evaluation:** +```yaml +evaluation: + type: 'comparison' # Evaluation method + operator: '>' # Comparison: '>', '<', '>=', '<=', '==', '!=' + value: 50.0 # Comparison value + unit: 'celsius' # Optional: Value unit +``` + +**String Match Evaluation:** +```yaml +evaluation: + type: 'string' # Evaluation method + operator: 'contains' # String operation: 'contains', 'equals', 'regex' + value: 'error' # Search string or regex pattern + case_sensitive: false # Optional: Case sensitivity +``` + +**Boolean Evaluation:** +```yaml +evaluation: + type: 'boolean' # Evaluation method + value: true # Expected boolean value +``` + +### Action Specification +Actions define the response procedures when a fault is detected. 
This section specifies both immediate local remediation and escalating remote actions: + +```yaml +actions: + repair_actions: + local_actions: # Optional: Actions performed by DLD daemon + wait_period: 60 # Required if local_actions: Wait time after actions before secondary check and further escalation (seconds) + action_list: # Required if local_actions: Vendor defined method calls to execute + - action: + type: 'dse' + command: 'PSU:reset_output_power()' + - action: + type: 'dse' + command: 'PSU:clear_faults()' + remote_actions: # Required: Actions for remote controller + action_list: # Required: Escalating sequence of actions + - ACTION_RESEAT # First action: Remove and reinsert the component (if possible) + - ACTION_COLD_REBOOT # Second action: System reboot + - ACTION_POWER_CYCLE # Third action: Power cycle + - ACTION_FACTORY_RESET # Fourth action: Full software reimage + - ACTION_REPLACE # Final action: Return material authorization + time_window: 86400 # Required: Duration controller tracks fault history for escalation (seconds) + log_collection: # Required: Diagnostic data to collect on fault + logs: # Optional: Static log files to capture + - log: "/var/log/platform.log" # Log file path + - log: "/var/log/syslog" # System log path + queries: # Optional: Dynamic data collection commands + - query: + type: "dse" # Query type + command: "PSU:get_status()" # Platform API method to call + - query: + type: "cli" # CLI command type + command: "show platform psu" # CLI command to execute +``` + +#### Action Field Details + +| Field | Type | Required | Description | Valid Values | Example | +|-------|------|----------|-------------|--------------|----------| +| `repair_actions` | Object | Yes | Container for all local and remote remedial actions | See subfields below | See example above | +| `log_collection` | Object | Yes | Diagnostic data collection specification | See subfields below | See example above | + +#### Repair Actions Structure + +**Local Actions (Optional):** +```yaml +local_actions: + wait_period: 60 # Required: Seconds to wait after executing actions + action_list: # Required: List of vendor defined method calls + - action: + type: 'dse' # Action type + command: 'PSU:reset_output_power()' # Command to execute + - action: + type: 'dse' # Action type + command: 'PSU:clear_fault_register()' # Multiple actions executed in sequence +``` + +| Field | Type | Required | Description | Valid Values | Example | +|-------|------|----------|-------------|--------------|----------| +| `wait_period` | Integer | Yes | Time in seconds to wait after executing all local actions before allowing remote action escalation. This cooling-off period prevents rapid escalation and allows local actions to take effect. | 30-3600 (30 seconds to 1 hour) | `60` | +| `action_list` | List | Yes | Ordered sequence of vendor-defined remediation actions executed locally by the DLD daemon. Each action contains a type field specifying the execution method and a command field with the actual operation (note that the structure after the type is variable in the same way as the path block in the condition section). Actions are executed sequentially in the order specified. | List of action objects with `type` and `command` fields. Supported types: `dse`, `cli`, `i2c`, etc. | See example above | + +**Remote Actions (Required):** +Below example is placeholder of OpenConfig defined enums, actual actions will be defined by associated OpenConfig model. 
`time_window` defines how long (in seconds) the controller should retain the fault history for escalation decisions; if the fault remains active throughout this window, the next action in `action_list` should be triggered. +```yaml +remote_actions: + action_list: # Required: Escalating sequence of controller actions + - ACTION_RESEAT # Level 1: Remove and reinsert the component (if possible) + - ACTION_COLD_REBOOT # Level 2: System reboot + - ACTION_POWER_CYCLE # Level 3: Power cycle + - ACTION_FACTORY_RESET # Level 4: Full software reimage + - ACTION_REPLACE # Final action: Replace the component + time_window: 86400 # Required: Fault history window for escalation evaluation (seconds) +``` + +For comprehensive list of actions, please refer to the OpenConfig fault model. Link to model: TBD + + +#### Log Collection Structure + +**Static Log Files:** +```yaml +logs: + - log: "/var/log/syslog" # System log file + - log: "/var/log/platform.log" # Platform-specific logs + - log: "/mnt/obfl/*" # Onboard failure logging, capturing all files in the wildcard +``` + +**Dynamic Queries:** +```yaml +queries: + - query: + type: "dse" # Platform abstraction layer + command: "PSU:get_blackbox()" # Component-specific method + - query: + type: "cli" # Command line interface + command: "show platform temperature" # Standard CLI command + - query: + type: "dse" # SDK CLI (disruptive) + command: "CHASSIS:get_sdk_debug_dump()" # Detailed hardware dump +``` + +| Field | Type | Required | Description | Valid Values | Example | +|-------|------|----------|-------------|--------------|----------| +| `queries` | List | Yes | Ordered sequence of diagnostic data collection commands executed when a fault is detected. Each query contains a type field specifying the execution method and a command field with the actual operation (note that the structure after the type is variable in the same way as the path block in the condition section). Queries are executed sequentially in the order specified. Outputs/content from these queries are collected and stored within the Healthz artifact.| List of query objects with `type` and `command` fields. Supported types: `dse`, `cli`, `i2c`, etc. | See example above | + +## Abstract Rule Data Source Extensions - Vendor Extensible + +### What are Data Source Extensions + +Abstract data source extensions (DSE) provide a way for vendors to extend the schema with granularity at the NOS level. This allows vendors to define their own detailed hardware abstractions that can be used to match against specific events and conditions, while keeping the actual rules source file standardized and uniform. Vendors are not required to implement or use DSE, but they provide a way to better simplify the rules source file and make it more maintainable. Complexity and potential variations in hardware implementations can be abstracted away from the rules source file. Actual integration and usage of the DSE will be done through a vendor implemented hook which the on-device service will operate on. If this is not defined, DSE rules will be skipped. + +Data source extensions also allow for the ability to hook into NOS specific APIs and methods. A good example of this would be defining a DSE that resolves to a method to call on the SONiC platform chassis object to retrieve the PSU object, and then using that object to retrieve the PSU output voltage fault register. This allows for the reuse of existing infrastructure the NOS provides wherever possible. 
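+
+To make the hook concept concrete, the minimal sketch below shows one way a vendor hook could resolve a `PSU:get_output_voltage_fault_register()` style reference on top of the standard `sonic_platform` chassis API. It is illustrative only: `resolve_dse()` is a hypothetical hook name, `get_output_voltage_fault_register()` is a hypothetical vendor method on the PSU object, and only `Platform().get_chassis()` and `get_all_psus()` come from the standard platform API.
+
+```python
+"""Illustrative sketch of a vendor-implemented DSE hook (assumed names, not a defined interface)."""
+from sonic_platform.platform import Platform  # standard SONiC platform entry point
+
+
+def resolve_dse(reference):
+    """Resolve a 'COMPONENT:method()' DSE reference into bound callables, one per device instance."""
+    component, _, call = reference.partition(":")
+    method_name = call.rstrip("()")
+    chassis = Platform().get_chassis()
+    if component.upper().startswith("PSU"):
+        targets = chassis.get_all_psus()
+    else:
+        raise KeyError("no DSE mapping defined for component %r" % component)
+    # getattr() raises AttributeError if the vendor method is not implemented,
+    # which the daemon could surface as a DSE resolution error for the rule.
+    # The returned callables let the daemon read the register during rule
+    # evaluation without knowing bus numbers, chip addresses, or other details.
+    return [getattr(device, method_name) for device in targets]
+
+
+if __name__ == "__main__":
+    for read_register in resolve_dse("PSU:get_output_voltage_fault_register()"):
+        print(read_register())  # hypothetical vendor method returning the fault register contents
+```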
+
+### Data Source Extension Architecture
+Abstract rules use symbolic references that are resolved through device-specific DSE files:
+
+In the rules source, the event path would be defined like so:
+```yaml
+# Abstract rule definition
+path: "{psu*}:{get_output_voltage_fault_register}"
+```
+
+A separate data file on the NOS includes the information needed to convert this into a queryable source of information. This translation layer is vendor-specific, and consumption is handled by the on-device service through a vendor-implemented hook. For example, the above abstract rule could be resolved with the following:
+```json
+// DSE file content
+{
+  "8122_64ehf-o": {                                            <-- Product ID (no hw rev)
+    "p1": [{                                                   <-- Hardware revision
+      "component": "psu",                                      <-- Component
+      "functions": [{
+        "name": "get_output_voltage_fault_register",           <-- Function Name
+        "operation": [{
+          "sw-version": ["202311.3.0.1", "202311.3.0.2"],      <-- SW Version as Defined by Vendor
+          "platform_object": "{chassis:psu}",                  <-- SONiC Platform Object
+          "type": "i2c",
+          "bus": "{platform_object}:get_bus_addr()[0]",        <-- Method to retrieve bus address
+          "chip_addr": "{platform_object}:get_bus_addr()[1]",  <-- Method to retrieve chip address
+          "i2c_type": "get",                                   <-- I2C Operation Command Type
+          "command": "0x7A",                                   <-- I2C Command
+          "scaling": "N/A"                                     <-- Scaling Factor to Apply to Output
+        }]
+      }]
+    }]
+  }
+}
+```
+This defines that, for this DSE, the process runs an I2C read operation, deriving the target bus and chip address from the PSU object retrieved from the platform chassis, and using command 0x7A to read the output voltage fault register.
+
+### DSE Benefits
+- **Hardware Abstraction**: Rules remain independent of hardware implementation details
+- **SW Version Support**: Single rule supports multiple SW versions
+- **Maintainability**: Hardware changes require only DSE updates
+- **Reusability**: Common patterns can be shared across rules
+
+## Schema Layout Definitions
+
+Schema layout definitions provide the NOS with instructions on how to parse different schema versions. At its simplest, it is a consistently formatted JSON object that defines the underlying YAML paths used to locate the signature, event, and action objects. This is defined to simplify the consumption process and allow the rules source to be parsed in a consistent manner. The example below is for SONiC's usage, but the same concept can be applied to any NOS. It is the responsibility of the NOS to remove unsupported schema versions from this list as specific SW version identifiers can differ from vendor to vendor on the same NOS.
Example below: + +```json +{ + "schemas": [{ + "major_0": { <-- Major version of the schema + "schema_data": [{ <-- List of full schema versions that contains the necessary info to parse the rules source file + "0.0.1": { + "base_paths": { <-- Reused paths for the signature, event, and action objects + "higher_rule_object": "signatures.signature", + "event_rule_object": "{higher_rule_object}.conditions.events.event", + "action_rule_object": "{higher_rule_object}.actions" + }, + "signature_name": "{higher_rule_object}.metadata.name", + "signature_id": "{higher_rule_object}.metadata.id", + "fault_description": "{higher_rule_object}.metadata.description", + "fault_severity": "{higher_rule_object}.metadata.severity", + "rule_priority": "{higher_rule_object}.metadata.priority", + "supported_product_ids": "{higher_rule_object}.metadata.product_ids", + "supported_sw_versions": "{higher_rule_object}.metadata.sw_versions", + "affected_component": "{higher_rule_object}.metadata.component", + "fault_logic": "{higher_rule_object}.conditions.logic", + "logic_lookback_time": "{higher_rule_object}.conditions.logic_lookback_time", + "event_*_id": "{event_rule_object}.id", + "event_*_type": "{event_rule_object}.type", + "event_*_path": "{event_rule_object}.path", + "event_*_evaluation": "{event_rule_object}.evaluation", + "event_*_match_count": "{event_rule_object}.match_count", + "event_*_match_period": "{event_rule_object}.match_period" + } + }] + } + }] +} +``` + +## Rule Examples + +### Complete PSU Over-Voltage Rule + +```yaml +schema_version: "0.0.1" + +signatures: + - signature: + metadata: + name: PSU_OV_FAULT + id: 1000001 + version: "1.0.0" + description: | + An over voltage fault has occurred on the output feed from the PSU to the chassis. + product_ids: + - "8122-64EHF-O P1" + - "8122-64EHF-O P2" + sw_versions: + - "202311.3.0.1" + component: PSU + tags: + - power + - voltage + + conditions: + logic: "1 OR 2" + logic_lookback_time: 60 + events: + - event: + id: 1 + type: i2c + instances: ['PSU0:IO-MUX-6', 'PSU1:IO-MUX-7'] + path: + bus: ['IO-MUX-6', 'IO-MUX-7'] + chip_addr: '0x58' + i2c_type: 'get' + command: '0x7A' + size: 'b' + scaling: 'N/A' + evaluation: + type: 'mask' + logic: '&' + value: '10000000' + match_count: 1 + match_period: 0 + + - event: + id: 2 + type: dse + path: "{psu*}:{get_output_voltage_fault_register}" + evaluation: + type: 'dse' + value: "{psu*}:{get_output_voltage_failure_value}" + match_count: 1 + match_period: 0 + + actions: + repair_actions: + local_actions: + wait_period: 60 + action_list: + - action: + type: 'dse' + command: 'PSU:reset_output_power()' + remote_actions: + action_list: + - ACTION_RESEAT + - ACTION_COLD_REBOOT + - ACTION_POWER_CYCLE + - ACTION_FACTORY_RESET + - ACTION_REPLACE + time_window: 86400 + log_collection: + logs: + - log: "/var/log/platform.log" + queries: + - query: + type: "dse" + command: "PSU:get_blackbox()" + - query: + type: "CLI" + command: "show platform voltage" +``` + +## Schema Validation + +### Validation Requirements +- Schema version must be present and valid +- All required fields must be populated +- Event IDs must be unique within a signature +- Logic expressions must reference valid event IDs +- Product IDs and SW versions must follow defined formats + +### Validation Process +1. **Syntax Validation**: YAML/JSON structure verification +2. **Schema Validation**: Conformance to version-specific schema +4. **Hardware Validation**: Compatibility with target hardware +5. 
**End-to-End Validation**: Execute the rule end to end to confirm that it evaluates successfully.
+
+It is the responsibility of the consumer to validate the content of the rules source and ensure that it is compatible with the expected schema version. This validation does not need to run every time the consumer reads the rules, only when the rules source changes. Depending on the underlying NOS implementation, it can be performed as a standalone check or integrated into the final consumer of this content. Any validation failure should prevent the affected rule from loading. Validation can be performed on the file as a whole or on a rule-by-rule basis; the latter approach allows correctly formatted rules to load even when other rules in the same file are invalid.
+
+## Backward Compatibility
+- **Schema Layout**: Maintain parsing instructions for all supported schema versions
+- **Consumer Ignore**: Ensure that the consumer is able to ignore unknown fields (such as optional fields added in a new minor version)
+
+---
+
+*This document defines the vendor-defined rules format for hardware health monitoring. For implementation details of the SONiC-focused DLD daemon itself, refer to the companion Device Local Diagnosis Service HLD document.*