Flink Agents Metrics Design #73
Closed · GreatEugenius started this discussion in Ideas · 0 replies
Introduction
In Flink Agents, we aim to introduce a metrics mechanism that allows users not only to understand their jobs through output results, but also to easily collect and analyze the usage of the different Events, Actions, Models, and Tools within an Agent job by integrating with external systems. Where necessary, users can also set up monitoring and alerting for key metrics, such as abnormal spikes in token usage over a short period. This helps users better understand the runtime status of their jobs, enabling them to refine and optimize their workflows to achieve the desired outcomes.
We plan to integrate Flink Agents Metrics into Flink’s built-in metric system. Through the Flink Web UI, users can conveniently view relevant parameters. Additionally, leveraging the rich set of Metric Reporters provided by the Flink metric system, users will be able to export Flink runtime metrics to external systems, enabling visualization and setting up monitoring alerts.
Flink Agents Metric Design
In Flink Agent jobs, the `ActionExecutionOperator` plays a central role in the entire Agent lifecycle, from event parsing to action execution. To align with Flink's native metric architecture, we design the metric system around a `ProxyMetricGroup` associated with the `ActionExecutionOperator`. This ensures that all metrics defined under the `FlinkAgentMetricGroup` are automatically registered at the operator level (`OperatorMetricGroup`), enabling seamless integration with Flink's built-in monitoring capabilities.

The Flink Agents metrics system is logically divided into two main categories:
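A minimal sketch of this proxying idea, using hypothetical Python stand-ins for the metric classes (the real classes live inside Flink and Flink Agents):

```python
class Counter:
    """Stand-in for a Flink Counter."""
    def __init__(self):
        self.count = 0

    def inc(self, n=1):
        self.count += n


class OperatorMetricGroup:
    """Stand-in for Flink's operator-level metric group."""
    def __init__(self):
        self.metrics = {}

    def counter(self, name):
        return self.metrics.setdefault(name, Counter())


class ProxyMetricGroup:
    """Forwards every registration to the operator-level group, so agent
    metrics automatically appear under the operator's metric scope."""
    def __init__(self, delegate):
        self._delegate = delegate

    def counter(self, name):
        return self._delegate.counter(name)


operator_group = OperatorMetricGroup()
agent_group = ProxyMetricGroup(operator_group)
agent_group.counter("NumOfEvent").inc()
```

Because registration is delegated rather than duplicated, anything registered through the agent-facing group is visible to Flink's reporters without extra wiring.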
Flink Agents Builtin Metrics
To provide a comprehensive and extensible metric system, the Flink Agents project defines a set of Builtin Metrics, which are automatically collected during job execution. These metrics are implemented directly within core components such as `ActionExecutionOperator` and `BuiltInAction`, ensuring minimal user configuration while offering rich visibility into agent behavior.

Survey of Metric Support in Agent-Based Frameworks
When developing metrics for Flink Agents, we analyzed the types of metrics supported by mainstream frameworks such as LlamaIndex and LangChain. Based on this analysis, metrics can be broadly categorized into three groups:
1. Performance Monitoring Metrics
These metrics track system performance characteristics such as response time and throughput. Commonly collected using tools like Prometheus or Grafana, they provide insights into runtime behavior.
2. Resource and Cost Metrics
These metrics monitor resource usage, particularly in relation to model token consumption. They are essential for cost tracking and optimization.
Common metrics include:
- `total_tokens`: total number of tokens
- `prompt_tokens`: number of prompt tokens
- `completion_tokens`: number of generated tokens

3. Model Evaluation Metrics
Model Evaluation Metrics are used to assess the quality of model input and output, such as correctness, semantic similarity, and preference alignment. These metrics are typically applied in offline evaluation scenarios rather than real-time monitoring.
Common types include accuracy checks, text similarity measurements (e.g., based on string or embedding distances), and pairwise comparisons between model outputs for preference scoring or ranking.
Proposed Scope of Builtin Metrics in Flink Agents
The MVP version of Flink Agents focuses on implementing Performance Monitoring Metrics and Resource & Cost Metrics, which are most relevant for real-time monitoring and observability. Support for Model Evaluation Metrics is planned for future versions.
We have designed the following metrics at different levels of granularity, using a two-dimensional structure based on component type (Agent, Event, Action, Model, Tool) and metric type (Count, Meter, Histogram):
(Operator builtin metrics have been implemented)
- NumOfOutput
- NumOfOutputPerSec
- NumOfActionsExecuted
- NumOfToken
- NumOfPromptToken
- NumCompletionToken
- NumOfTokenPerSec
- NumOfPromptTokenPerSec
- NumCompletionTokenPerSec
Note: Token statistics depend on the output returned by the model, as different models use different tokenizers.
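Since the token figures come from the model response itself, one plausible way to feed the token counters is to read a provider-style `usage` payload; the field names below are illustrative, as different providers and tokenizers report different shapes:

```python
class TokenCounters:
    """Sketch: accumulate builtin token metrics from model responses."""
    def __init__(self):
        self.num_of_token = 0
        self.num_of_prompt_token = 0
        self.num_of_completion_token = 0

    def record(self, usage):
        """`usage` is assumed to be a dict like the one many chat APIs
        return alongside each response."""
        prompt = usage.get("prompt_tokens", 0)
        completion = usage.get("completion_tokens", 0)
        self.num_of_prompt_token += prompt
        self.num_of_completion_token += completion
        # Fall back to prompt + completion when a total is absent.
        self.num_of_token += usage.get("total_tokens", prompt + completion)


counters = TokenCounters()
counters.record({"prompt_tokens": 12, "completion_tokens": 8, "total_tokens": 20})
counters.record({"prompt_tokens": 5, "completion_tokens": 3})
```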
For Builtin Metrics, we provide metrics for Event, ModelChat, and ToolCall at a general level (without type distinction), as well as per-action metrics. For example, the `NumOfEvent` metric counts the total number of events without differentiating by event type, while the `NumOfAction` metric tracks both the total number of actions executed and the count per action type.

Implementation Example: NumOfEvent Builtin Metric
To illustrate how Builtin Metrics are implemented, consider the `NumOfEvent` metric. In the `ActionExecutionOperator`, the counter is registered once and incremented for each event the operator processes. This ensures that every event processed by the operator is counted and reported via the Flink metric system, enabling visibility through the Web UI or external monitoring tools.
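A self-contained sketch of that counting pattern, with stub classes standing in for the real operator and Flink's metric types:

```python
class Counter:
    def __init__(self):
        self.count = 0

    def inc(self, n=1):
        self.count += n


class MetricGroup:
    """Stub for the operator's metric group."""
    def __init__(self):
        self._counters = {}

    def counter(self, name):
        return self._counters.setdefault(name, Counter())


class ActionExecutionOperator:
    """Sketch: register NumOfEvent once at setup, bump it per event."""
    def __init__(self, metric_group):
        self.num_of_event = metric_group.counter("NumOfEvent")

    def process_event(self, event):
        # Every event passing through the operator is counted before
        # being dispatched to the actions that listen for it.
        self.num_of_event.inc()
        # ... dispatch event to matching actions ...


group = MetricGroup()
op = ActionExecutionOperator(group)
for e in ("input", "chat", "output"):
    op.process_event(e)
```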
Flink Agents User-Defined Metrics
In Flink Agents, users implement their logic by defining custom Actions that respond to various Events throughout the Agent lifecycle. To support user-defined metrics, we introduce two new methods in the RunnerContext: `get_metric_group()` and `get_action_metric_group()`. These methods allow users to create or update global metrics as well as independent per-action metrics.

`get_metric_group()` allows users to access the operator-level OperatorMetricGroup from any Action. With this capability, users can register and update metrics directly at the operator level while defining their Flink Agent jobs. The resulting metric identifier is `<host>.taskmanager.<tm_id>.<job_name>.<operator_name>.<subtask_index>.<metric_name>`. We also support per-Action metrics via `get_action_metric_group()`, with the identifier `<host>.taskmanager.<tm_id>.<job_name>.<operator_name>.<subtask_index>.<action_name>.<metric_name>`.

Additionally, the system supports the creation of sub-metric groups to distinguish metrics generated by different Actions. These sub-groups enable more granular tracking and use the identifier `<host>.taskmanager.<tm_id>.<job_name>.<operator_name>.<subtask_index>.<user_defined_group_name>.<metric_name>`.

APIs
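To make the naming scheme above concrete before turning to the individual interfaces, here is a hypothetical helper that composes the dotted identifiers (all parameter names are illustrative, not part of the actual API):

```python
def metric_identifier(host, tm_id, job_name, operator_name, subtask_index,
                      metric_name, group_path=()):
    """Compose a dotted metric identifier; `group_path` carries optional
    middle segments such as an action name or a user-defined sub-group."""
    parts = (host, "taskmanager", tm_id, job_name, operator_name,
             str(subtask_index), *group_path, metric_name)
    return ".".join(parts)


operator_level = metric_identifier(
    "host1", "tm-1", "myjob", "action-executor", 0, "NumOfEvent")
per_action = metric_identifier(
    "host1", "tm-1", "myjob", "action-executor", 0, "NumOfEvent",
    group_path=("review_action",))
```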
MetricGroup
The `MetricGroup` interface enables hierarchical metric management through `get_sub_group(name: str)` and provides dynamic creation/update of four core metric types: counters (e.g., action counts), meters (e.g., event rates), histograms (e.g., latency distributions), and gauges (e.g., system state). These methods allow users to organize and track metrics within custom Actions while integrating seamlessly with Flink's operator-level monitoring system.

Note: In `FlinkAgentsMetricGroup`, we maintain four internal maps keyed by metric name, one per type (`Counter`, `Meter`, `Histogram`, `Gauge`). If a metric with a given name does not exist, it is automatically created; otherwise, the existing metric is returned.

Core Metric Types
Counter
A counter is used to measure the number of occurrences of an event.
Meter
A meter is used to track the rate of events over time (throughput).
Histogram
A histogram is used to collect and analyze distributions of values (e.g., latencies).
Gauge
A gauge is used to record a single value at a point in time (e.g., current system load).
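The four core types, together with the get-or-create maps noted earlier, can be sketched as follows. This is a minimal illustration, not the actual implementation: a real meter uses a moving time window, whereas here the rate is simply events since creation, and all class and method names are stand-ins.

```python
import statistics
import time


class Counter:
    """Measures the number of occurrences of an event."""
    def __init__(self):
        self.count = 0

    def inc(self, n=1):
        self.count += n


class Meter:
    """Tracks event rate; real implementations use a moving window."""
    def __init__(self):
        self.count = 0
        self._start = time.monotonic()

    def mark(self, n=1):
        self.count += n

    def rate(self):
        return self.count / max(time.monotonic() - self._start, 1e-9)


class Histogram:
    """Collects a distribution of values, e.g. latencies."""
    def __init__(self):
        self.values = []

    def update(self, value):
        self.values.append(value)

    def mean(self):
        return statistics.mean(self.values)


class Gauge:
    """Reports a single current value supplied by a callback."""
    def __init__(self, supplier):
        self._supplier = supplier

    def value(self):
        return self._supplier()


class FlinkAgentsMetricGroup:
    """Sketch of the get-or-create behavior: one name-keyed map per type,
    plus nested sub-groups; repeated lookups return the same instance."""
    def __init__(self):
        self._counters = {}
        self._meters = {}
        self._histograms = {}
        self._gauges = {}
        self._sub_groups = {}

    def get_counter(self, name):
        return self._counters.setdefault(name, Counter())

    def get_meter(self, name):
        return self._meters.setdefault(name, Meter())

    def get_histogram(self, name):
        return self._histograms.setdefault(name, Histogram())

    def get_gauge(self, name, supplier):
        return self._gauges.setdefault(name, Gauge(supplier))

    def get_sub_group(self, name):
        return self._sub_groups.setdefault(name, FlinkAgentsMetricGroup())


group = FlinkAgentsMetricGroup()
group.get_counter("NumOfAction").inc()
group.get_counter("NumOfAction").inc()  # same Counter instance, not a new one
group.get_histogram("ActionLatencyMs").update(10)
group.get_histogram("ActionLatencyMs").update(30)
```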
RunnerContext
Examples
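As a usage illustration, a user-defined action might update both a global and a per-action metric through the context. Everything below is a self-contained sketch with stub classes; `review_action` and the metric names are hypothetical, not part of the Flink Agents API.

```python
class Counter:
    def __init__(self):
        self.count = 0

    def inc(self, n=1):
        self.count += n


class MetricGroup:
    """Stub with just enough surface for the example."""
    def __init__(self):
        self.counters = {}

    def get_counter(self, name):
        return self.counters.setdefault(name, Counter())


class RunnerContext:
    """Stub context exposing the two accessors described above."""
    def __init__(self):
        self.global_group = MetricGroup()   # shared across all actions
        self.action_group = MetricGroup()   # scoped to the current action

    def get_metric_group(self):
        return self.global_group

    def get_action_metric_group(self):
        return self.action_group


def review_action(event, ctx):
    """Hypothetical user action: count events globally and per action."""
    ctx.get_metric_group().get_counter("NumOfReviewedEvent").inc()
    ctx.get_action_metric_group().get_counter("NumOfInvocation").inc()


ctx = RunnerContext()
for event in ("e1", "e2"):
    review_action(event, ctx)
```

The global counter would surface under the operator-level identifier, while the per-action counter would carry the extra `<action_name>` segment described earlier.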