|
| 1 | +# Common Metrics Provider Library |
| 2 | +<!-- limited-template.md --> |
| 3 | + |
| 4 | +**Status**: Draft <!-- Draft | Under Review | Approved | Replaced | Deferred | Rejected --> |
| 5 | + |
| 6 | +**Authors**: J.Wyman (@whoisj) |
| 7 | + |
| 8 | +**Category**: Architecture <!-- Architecture | Process | Guidelines --> |
| 9 | + |
| 10 | +<!-- |
| 11 | +**Replaces**: [Link of previous proposal if applicable] |
| 12 | +
|
| 13 | +**Replaced By**: [Link of previous proposal if applicable] |
| 14 | +
|
| 15 | +**Sponsor**: [Name of code owner or maintainer to shepherd process] |
| 16 | +--> |
| 17 | + |
| 18 | +**Required Reviewers**: N.Shah (@nnshah1), K.Chang (@keivenchang) |
| 19 | + |
| 20 | +**Review Date**: [Date for review] |
| 21 | + |
| 22 | +<!-- |
| 23 | +**Pull Request**: [Link to Pull Request of the Proposal itself] |
| 24 | +
|
| 25 | +**Implementation PR / Tracking Issue**: [Link to Pull Request or Tracking Issue for Implementation] |
| 26 | +--> |
| 27 | + |
| 28 | + |
| 29 | +# Summary |
| 30 | + |
| 31 | +<!-- |
| 32 | +**\[Required\]** |
| 33 | +--> |
| 34 | + |
| 35 | +A common metrics collection library with bindings for Python, Rust, and C/C++ that enables cross-language sharing of registries and counters within the same process. |
| 36 | +It is delivered as a reusable library with a stable ABI and support for reporting metrics in the Prometheus format. |
| 37 | + |
| 38 | + |
| 39 | +# Motivation |
| 40 | + |
| 41 | +<!-- |
| 42 | +**\[Required\]** |
| 43 | +
|
| 44 | +Describe the problem that needs to be addressed with enough detail for |
| 45 | +someone familiar with the project to understand. Generally one to two |
| 46 | +short paragraphs. Additional details can be placed in the background |
| 47 | +section as needed. Cover **what** the issue is and **why** it needs to |
| 48 | +be addressed. Link to github issues if relevant. |
| 49 | +--> |
| 50 | + |
| 51 | +Existing metrics solutions are generally language-specific. |
| 52 | +This means that an application composed of C++, Rust, and Python components would require at least three separate metrics solutions, each providing its own registry and counter objects. |
| 53 | + |
| 54 | +In the case of Dynamo Runtime, a Rust-based solution is used, and API entry points have been added to the Dynamo Runtime that can be accessed from Python to enable a pseudo-cross-language solution. |
| 55 | +The downside of this approach is that external projects need to take a dependency on Dynamo Runtime to enable centralized metrics collection and reporting. |
| 56 | +Third-party projects are unlikely to depend on Dynamo Runtime for metrics collection, and even NVIDIA projects like TRTLLM require the ability to be independent of Dynamo and are therefore unable to take a direct dependency on it. |
| 57 | + |
| 58 | +A simple, focused library that any customer could depend on without taking a direct dependency on Dynamo, and that provides shared cross-language objects and multi-language support via bindings, would make unified metrics collection and reporting possible. |
| 59 | + |
| 60 | +## Goals |
| 61 | + |
| 62 | +<!-- |
| 63 | +**\[Optional \- if not applicable omit\]** |
| 64 | +
|
| 65 | +List out any additional goals in bullet points. Goals may be aspirational / difficult to measure but guide the proposal. |
| 66 | +--> |
| 67 | + |
| 68 | +- Multi-language support via language-specific bindings (Python, Rust, C/C++, _others?_) |
| 69 | + |
| 70 | + - Bindings that feel "natural" to developers experienced with the language the bindings are provided in. |
| 71 | + |
| 72 | +- Single, common root registry regardless of which language interacts with the library first. |
| 73 | + |
| 74 | +- Nested registry support. |
| 75 | + |
| 76 | +- Full support for serializing metrics as Prometheus metrics reports. |
| 77 | + |
| 78 | +- High performance, low overhead implementation. |
| 79 | + |
| 80 | +- Stable, forward- and backward-compatible ABI. |
| 81 | + |
| 82 | + - Consistent interface. |
| 83 | + |
| 84 | + - Actionable errors. |
| 85 | + |
| 86 | +### Non-Goals |
| 87 | + |
| 88 | +<!-- |
| 89 | +**\[Optional \- if not applicable omit\]** |
| 90 | +
|
| 91 | +List out any items which are out of scope / specifically not required in bullet points. Indicates the scope of the proposal and issue being resolved. |
| 92 | +--> |
| 93 | + |
| 94 | +- Redesign how metric collection is done. |
| 95 | +- Redesign how metrics reporting is done. |
| 96 | + |
| 97 | +## Requirements |
| 98 | + |
| 99 | +<!-- |
| 100 | +Describe the requirement in as much detail as necessary for others to understand it and how it applies to the DEP. Keep in mind that requirements should be measurable and will be used to determine if a DEP has been successfully implemented or not. |
| 101 | +
|
| 102 | +Requirement names should be prefixed using a monotonically increasing number such as “REQ 1 \<Title\>” followed by “REQ 2 \<Title\>” and so on. Use title casing when naming requirements. Requirement names should be as descriptive as possible while remaining as terse as possible. |
| 103 | +
|
| 104 | +Use all-caps, bolded terms like **MUST** and **SHOULD** when describing each requirement. See [RFC-2119](https://datatracker.ietf.org/doc/html/rfc2119) for additional information. |
| 105 | +--> |
| 106 | + |
| 107 | +### REQ 1: Language "Native" Bindings for Supported Languages |
| 108 | + |
| 109 | +### REQ 2: Cross-Language Shared Registries and Counters |
| 110 | + |
| 111 | +### REQ 3: High-Performance, Low-Overhead Implementation |
| 112 | + |
| 113 | +### REQ 4: Serialization of Metric Data to Prometheus Formatted Output |
| 114 | + |
| 115 | +### REQ 5: Designed for Testability |
| 116 | + |
| 117 | +### REQ 6: 85%+ Code Coverage from Unit Tests |
| 118 | + |
| 119 | + |
| 120 | +# Proposal |
| 121 | + |
| 122 | +<!-- |
| 123 | +**\[Required\]** |
| 124 | +
|
| 125 | +Describe the high level design / proposal. Use sub sections as needed, but start with an overview and then dig into the details. Try to provide images and diagrams to facilitate understanding. |
| 126 | +--> |
| 127 | + |
| 128 | +- Provide a common library via .so files (.dll for Windows) for amd64- and arm64-based machines. |
| 129 | +- Provide a wheel file for Python consumers. |
| 130 | +- Provide a crate (consumable via Cargo) for Rust consumers. |
| 131 | +- Provide header files for C++ consumers. |
| 132 | +- Provide a stable Application Binary Interface (ABI) such that any language wrapper is viable. |
| 133 | + |
| 134 | +## Registries |
| 135 | + |
| 136 | +- The library provides a singleton "root" metrics registry. |
| 137 | +- Any registry can create counters and subordinate registries. |
| 138 | +- Functionally, there is no difference between subordinate registries and the root registry, except that the root registry is a singleton. |
| 139 | +- Supports prefixing all counter names, including those of counters in subordinate registries. |
| 140 | +- Supports a default set of labels for all counters, including counters in subordinate registries. |
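
The registry behavior listed above can be illustrated with a small pure-Python model. This is only a sketch under stated assumptions, not the proposed library's API: the names `Registry`, `root`, `child`, `increment`, and `render` are hypothetical, as are the choices to join prefixes with `_` and to let a subordinate registry's labels override its parent's defaults. The output is a simplified subset of the Prometheus text format (no `# HELP` or `# TYPE` lines).

```python
from __future__ import annotations

from typing import Dict, List, Optional


class Registry:
    """Hypothetical model of a registry tree with a singleton root."""

    _root: Optional[Registry] = None  # process-wide singleton slot

    def __init__(self, prefix: str = "",
                 labels: Optional[Dict[str, str]] = None,
                 parent: Optional[Registry] = None) -> None:
        self._prefix = prefix
        self._labels = dict(labels or {})
        self._parent = parent
        self._children: List[Registry] = []
        self._counters: Dict[str, int] = {}

    @classmethod
    def root(cls) -> Registry:
        # Every caller receives the same root, regardless of who asks first.
        if cls._root is None:
            cls._root = cls()
        return cls._root

    def child(self, prefix: str,
              labels: Optional[Dict[str, str]] = None) -> Registry:
        # Subordinate registries are ordinary registries with a parent link.
        reg = Registry(prefix, labels, parent=self)
        self._children.append(reg)
        return reg

    def full_prefix(self) -> str:
        # Assumption: prefixes from root to leaf are joined with "_".
        parts: List[str] = []
        node: Optional[Registry] = self
        while node is not None:
            if node._prefix:
                parts.append(node._prefix)
            node = node._parent
        return "_".join(reversed(parts))

    def effective_labels(self) -> Dict[str, str]:
        # Assumption: inherited defaults are overridden by closer registries.
        merged = {} if self._parent is None else self._parent.effective_labels()
        merged.update(self._labels)
        return merged

    def increment(self, name: str, amount: int = 1) -> None:
        self._counters[name] = self._counters.get(name, 0) + amount

    def render(self) -> List[str]:
        # Emit simplified Prometheus text lines for this registry and below.
        prefix = self.full_prefix()
        labels = ",".join(
            f'{k}="{v}"' for k, v in sorted(self.effective_labels().items()))
        lines: List[str] = []
        for name, value in sorted(self._counters.items()):
            full = f"{prefix}_{name}" if prefix else name
            suffix = f"{{{labels}}}" if labels else ""
            lines.append(f"{full}{suffix} {value}")
        for child in self._children:
            lines.extend(child.render())
        return lines
```

In this model, a counter named `requests_total` created under `root.child("app", {"host": "node1"}).child("http", {"component": "frontend"})` is reported with the combined prefix `app_http_` and both inherited labels.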
| 141 | + |
| 142 | +## Counters |
| 143 | + |
| 144 | +- Monotonically increasing counters. |
| 145 | +- Increment-by-value counters that track the number of times they have been incremented (i.e., both total and count values). |
| 146 | +- Gauges whose value is set directly. |
| 147 | +- Native support for integer, floating-point, and nanosecond counters and gauges. |
| 148 | +- Assigned set of labels at creation. |
| 149 | + - Variants by label set are separate counters by design (performance optimization). |
| 150 | +- Future support for histogram metrics (if necessary). |
| 151 | +- Support priority levels to allow widespread metric collection with variable-verbosity reporting. |
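
The counter kinds listed above can be sketched as follows. All class and method names (`MonotonicCounter`, `ValueCounter`, `Gauge`, `CounterTable`) are hypothetical and chosen for illustration; the proposal does not define them. The sketch also shows one way that "variants by label set are separate counters" could work: each distinct label set resolves to its own object once, at creation, so later updates avoid repeated label lookups.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# A label set is stored as a sorted tuple so it can be used as a dict key.
LabelSet = Tuple[Tuple[str, str], ...]


@dataclass
class MonotonicCounter:
    """Only ever increases, one step at a time."""
    value: int = 0

    def increment(self) -> None:
        self.value += 1


@dataclass
class ValueCounter:
    """Increment-by-value counter tracking both the total and the count."""
    total: float = 0.0
    count: int = 0

    def add(self, amount: float) -> None:
        self.total += amount
        self.count += 1


@dataclass
class Gauge:
    """Holds whatever value was most recently set."""
    value: float = 0.0

    def set(self, value: float) -> None:
        self.value = value


class CounterTable:
    """Hypothetical holder that keeps one counter per (name, label set)."""

    def __init__(self) -> None:
        self._counters: Dict[Tuple[str, LabelSet], ValueCounter] = {}

    def value_counter(self, name: str, **labels: str) -> ValueCounter:
        key = (name, tuple(sorted(labels.items())))
        if key not in self._counters:
            self._counters[key] = ValueCounter()
        return self._counters[key]
```

A caller would typically resolve `table.value_counter("latency_ns", status="ok")` once and keep the returned object, then call `add` on it in the hot path.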
| 152 | + |
| 153 | +## Stable ABI |
| 154 | + |
| 155 | +- Versioned. |
| 156 | +- Consistent. |
| 157 | +- Provides actionable error messages. |
| 158 | +- Provides sufficient error information to enable language specific exception or error handling. |
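
As an illustration of the last point, a language wrapper could translate an ABI-level status into an idiomatic error. The sketch below assumes, purely for illustration, that each ABI call yields a numeric status code plus a message. The `Status` values, the `AbiResult` type, and `raise_for_status` are hypothetical and not part of any defined interface.

```python
from dataclasses import dataclass
from enum import IntEnum


class Status(IntEnum):
    """Hypothetical status codes returned across the ABI boundary."""
    OK = 0
    INVALID_ARGUMENT = 1
    NOT_FOUND = 2
    INTERNAL = 3


@dataclass(frozen=True)
class AbiResult:
    """Hypothetical result of an ABI call: a code plus a readable message."""
    status: Status
    message: str = ""


class MetricsError(Exception):
    """Fallback for statuses with no closer built-in exception."""


# Map ABI statuses to the exceptions a Python developer would expect.
_EXCEPTIONS = {
    Status.INVALID_ARGUMENT: ValueError,
    Status.NOT_FOUND: KeyError,
}


def raise_for_status(result: AbiResult) -> None:
    """Raise a Python exception for any non-OK result; return otherwise."""
    if result.status == Status.OK:
        return
    exc_type = _EXCEPTIONS.get(result.status, MetricsError)
    raise exc_type(result.message)
```

A Rust wrapper could apply the same idea by mapping the status code to a `Result` error variant instead of an exception.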
| 159 | + |
| 160 | +## Python Wrapper |
| 161 | + |
| 162 | +- "Pythonic" by design. |
| 163 | +- Intended to be "naturally" consumed by Python developers. |
| 164 | +- Leans heavily on Python's memory management hooks to properly integrate with the language. |
| 165 | + |
| 166 | +## Rust Wrapper |
| 167 | + |
| 168 | +- Intended to be "naturally" consumed by Rust developers. |
| 169 | +- Leans heavily on Rust's trait hooks to properly integrate with the language. |
| 170 | + |
| 171 | + |
| 172 | +# Alternate Solutions |
| 173 | + |
| 174 | +<!-- |
| 175 | +**\[Required, if not applicable write N/A\]** |
| 176 | +
|
| 177 | +List out solutions that were considered but ultimately rejected. Consider free form \- but a possible format shown below. |
| 178 | +--> |
| 179 | + |
| 180 | +## Alt 1: Prometheus Client Library |
| 181 | + |
| 182 | +**Pros:** |
| 183 | + |
| 184 | +- Already exists. |
| 185 | +- Supports Python and Rust. |
| 186 | + |
| 187 | +**Cons:** |
| 188 | + |
| 189 | +- Reimplemented for each language; no common components. |
| 190 | +- No mechanism to share registries or counters across language boundaries. |
| 191 | +- Hashmap-based implementation can cause performance bottlenecks. |
| 192 | +- Limited ability to influence design direction or make changes to implementation. |
| 193 | + |
| 194 | +**Reason Rejected:** |
| 195 | + |
| 196 | +- Poor cross-language support (not "poor multi-language support"). |