
[JMX Insight] Hadoop JMX metrics semconv alignment #14411


Open · wants to merge 14 commits into main

Conversation

@robsunday (Contributor) commented Aug 12, 2025:

Fixes #14274
Includes:

  • Hadoop JMX metrics semconv alignment
  • Moved yaml with JMX metrics mapping to the library
  • Integration tests for Hadoop 2.x and Hadoop 3.x


| Metric Name | Type | Attributes | Description |
|---------------------------------|---------------|-------------------------------------|--------------------------------------------------------|
| hadoop.dfs.capacity | UpDownCounter | hadoop.node.name | Current raw capacity of data nodes. |
@robsunday (author) commented:

[for reviewer] The naming prefix of the metrics has been changed to utilize a metric context (dfs).
Context is described in official docs: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Metrics.html

@otelbot-java-instrumentation commented:

🔧 The result from spotlessApply was committed to the PR branch.

@robsunday changed the title from "Hadoop JMX metrics semconv alignment" to "[JMX Insight] Hadoop JMX metrics semconv alignment" on Aug 13, 2025
@robsunday robsunday marked this pull request as ready for review August 14, 2025 13:39
@robsunday robsunday requested a review from a team as a code owner August 14, 2025 13:39
@@ -63,7 +63,7 @@ public class TargetSystemTest {
   private static OtlpGrpcServer otlpServer;
   private static Path agentPath;
   private static Path testAppPath;
-  private static String otlpEndpoint;
+  protected static String otlpEndpoint;
Contributor:

[minor] I would suggest using a protected getter to expose this as read-only to subclasses, since the field is not final.
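
A minimal sketch of that suggestion (the surrounding class is abbreviated; only the field and getter shown here are taken from the diff above):

public class TargetSystemTest {
  // keep the field private; subclasses read it through the getter
  private static String otlpEndpoint;

  // read-only access for subclasses, since the field is not final
  protected static String getOtlpEndpoint() {
    return otlpEndpoint;
  }
}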

# Java agent opts needed for
export JAVA_AGENT_OPTS="-javaagent:/opentelemetry-instrumentation-javaagent.jar"
export JAVA_AGENT_OPTS="$JAVA_AGENT_OPTS -Dotel.logs.exporter=none -Dotel.traces.exporter=none -Dotel.metrics.exporter=otlp"
export JAVA_AGENT_OPTS="$JAVA_AGENT_OPTS -Dotel.exporter.otlp.endpoint=<<ENDPOINT_PLACEHOLDER>> -Dotel.exporter.otlp.protocol=grpc"
Contributor:

If I understand this correctly, the reason we have to use this placeholder and replace it at runtime is that we can't use environment variables to provide it (and this applies not only to the otel env variables).
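
A minimal sketch of what that runtime substitution might look like (the helper name readAndPreprocessEnvFile appears in the test below; its body here is an assumption):

import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Objects;

// Hypothetical implementation: read the env file from test resources and
// substitute the endpoint placeholder before the file is copied into the container.
static String readAndPreprocessEnvFile(String resourceName) {
  try (InputStream stream =
      Objects.requireNonNull(
          TargetSystemTest.class.getClassLoader().getResourceAsStream(resourceName))) {
    String content = new String(stream.readAllBytes(), StandardCharsets.UTF_8);
    return content.replace("<<ENDPOINT_PLACEHOLDER>>", otlpEndpoint);
  } catch (IOException e) {
    throw new UncheckedIOException(e);
  }
}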

@@ -150,7 +150,7 @@ protected static Map<String, String> otelConfigProperties(List<String> yamlFiles
   // disable runtime telemetry metrics
   config.put("otel.instrumentation.runtime-telemetry.enabled", "false");
   // set yaml config files to test
-  config.put("otel.jmx.target", "tomcat");
+  config.put("otel.jmx.target", "hadoop");
Contributor:

I think this line should probably be removed, as we test with only the explicit list of yaml files here.

Comment on lines +77 to +87
// Hadoop startup script does not propagate env vars to launched hadoop daemons,
// so all the env vars need to be embedded inside the hadoop-env.sh file
GenericContainer<?> target =
new GenericContainer<>("loum/hadoop-pseudo:3.3.6")
.withExposedPorts(9870, 9000)
.withCopyToContainer(
Transferable.of(readAndPreprocessEnvFile("hadoop3-env.sh")),
"/opt/hadoop/etc/hadoop/hadoop-env.sh")
.withCreateContainerCmdModifier(cmd -> cmd.withHostName("test-host"))
.waitingFor(
Wait.forListeningPorts(9870, 9000).withStartupTimeout(Duration.ofMinutes(3)));
Contributor:

[minor] This is the only part that differs between Hadoop 2.x and 3.x; it could be worth refactoring it with a common method for the test body that delegates container creation to a Supplier<GenericContainer<?>> lambda, with a dedicated implementation for each version.
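
A minimal sketch of that refactoring (the method names are hypothetical; the Hadoop 3.x container setup is copied from the diff above):

import java.time.Duration;
import java.util.function.Supplier;
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.containers.wait.strategy.Wait;
import org.testcontainers.images.builder.Transferable;

// Hypothetical shared test body: each Hadoop version supplies its own container.
void runHadoopMetricsTest(Supplier<GenericContainer<?>> containerSupplier) {
  try (GenericContainer<?> target = containerSupplier.get()) {
    target.start();
    // ... run the test app and verify the exported metrics ...
  }
}

// Dedicated container factory for Hadoop 3.x; a 2.x variant would differ only
// in the image tag, exposed ports, and env file name.
GenericContainer<?> hadoop3Container() {
  return new GenericContainer<>("loum/hadoop-pseudo:3.3.6")
      .withExposedPorts(9870, 9000)
      .withCopyToContainer(
          Transferable.of(readAndPreprocessEnvFile("hadoop3-env.sh")),
          "/opt/hadoop/etc/hadoop/hadoop-env.sh")
      .withCreateContainerCmdModifier(cmd -> cmd.withHostName("test-host"))
      .waitingFor(Wait.forListeningPorts(9870, 9000).withStartupTimeout(Duration.ofMinutes(3)));
}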


| Metric Name | Type | Attributes | Description |
|---------------------------------|---------------|-------------------------------------|--------------------------------------------------------|
| hadoop.dfs.capacity | UpDownCounter | hadoop.node.name | Current raw capacity of data nodes. |
Contributor:

The capacity is in bytes, so maybe we should start adding a dedicated column for the units (then we can do the same for other metrics as a follow-up).
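
An example row with a dedicated unit column (the UCUM code By for bytes is an assumption based on this comment):

| Metric Name         | Type          | Unit | Attributes       | Description                         |
|---------------------|---------------|------|------------------|-------------------------------------|
| hadoop.dfs.capacity | UpDownCounter | By   | hadoop.node.name | Current raw capacity of data nodes. |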

Contributor:

Also, I am not sure we should have a metric name that is a prefix of another; maybe capacity.raw could work here, to replicate the wording in the docs.

Comment on lines +58 to +71
NumLiveDataNodes:
metric: &metric data_node.count
type: &type updowncounter
unit: &unit "{node}"
desc: &desc The number of DataNodes.
metricAttribute:
hadoop.node.state: const(live)
NumDeadDataNodes:
metric: *metric
type: *type
unit: *unit
desc: *desc
metricAttribute:
hadoop.node.state: const(dead)
Contributor:

I am not very familiar with Hadoop, but there are more than 2 states for the data nodes, as we can infer from the Num.*DataNodes attributes in the docs, so I would suggest mapping these as individual metrics instead of using a constant metric attribute, since we can't guarantee that it's a partition.

For example, if a node went into a state that is not mapped here, the total number of nodes in the cluster would go down by one, even though the node is still part of the cluster. With individual metrics per state, the consumer can expect that the metric does not represent the whole cluster node count.
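
A sketch of what individual per-state metrics could look like in the mapping yaml (the metric names below are hypothetical, following this suggestion rather than the current PR):

NumLiveDataNodes:
  metric: data_node.live.count
  type: updowncounter
  unit: "{node}"
  desc: The number of live DataNodes.
NumDeadDataNodes:
  metric: data_node.dead.count
  type: updowncounter
  unit: "{node}"
  desc: The number of dead DataNodes.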

---
rules:
- bean: Hadoop:service=NameNode,name=FSNamesystem
prefix: hadoop.dfs.
@SylvainJuge (Contributor) commented Aug 19, 2025:

Do we need to always have the dfs. infix here? It's an acronym and probably implicit for Hadoop, unless we have other use cases, for example hadoop.rpc to capture the in/out bytes.

Contributor:

As discussed today, while dfs is an acronym, it maps to the "context" in the documentation, which also provides some structure and helps in mapping/linking to the MBean attributes. Adding hadoop.rpc.io with the ReceivedBytes and SentBytes MBean attributes would be a worthwhile addition that we could include directly in this PR, hence creating at least one metric that is not related to dfs.
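
A sketch of what that could look like in the mapping yaml (the bean name, the hadoop.io.direction attribute, and the metric details are assumptions, mirroring the anchor style used elsewhere in this file):

- bean: Hadoop:service=NameNode,name=RpcActivityForPort9000
  prefix: hadoop.rpc.
  mapping:
    ReceivedBytes:
      metric: &rpc_io io
      type: &rpc_io_type counter
      unit: &rpc_io_unit By
      desc: &rpc_io_desc The number of bytes exchanged by the RPC server.
      metricAttribute:
        hadoop.io.direction: const(received)
    SentBytes:
      metric: *rpc_io
      type: *rpc_io_type
      unit: *rpc_io_unit
      desc: *rpc_io_desc
      metricAttribute:
        hadoop.io.direction: const(sent)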

desc: Current number of files and directories.
# hadoop.dfs.connection.count
TotalLoad:
metric: connection.count
@SylvainJuge (Contributor) commented Aug 19, 2025:

Here the .count suffix could be removed, as there is currently no other connection-related metric besides the count in the dfs context. I really don't know what "connection count" means in this context, and whether it overlaps or relates to the number of open network connections.

With the rpc context, we have NumOpenConnections and numDroppedConnections, which could (maybe in a future improvement) be mapped respectively to:

  • hadoop.rpc.connection.count
  • hadoop.rpc.connection.dropped

So if those metrics were added in the future, then having hadoop.dfs.connection.count and hadoop.rpc.connection.count would be consistent, and keeping connection.count as a suffix would be a good option.

However, if the dfs.connection.count metric overlaps with rpc.connection.count, then maybe we could only capture rpc.connection.{count,dropped} to prevent any confusion. Given my lack of knowledge of Hadoop, when in doubt I would suggest capturing both.
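
A sketch of those two possible future entries in the mapping yaml (types, units, and descriptions are assumptions; the attribute names come from the Hadoop metrics docs):

NumOpenConnections:
  metric: connection.count
  type: updowncounter
  unit: "{connection}"
  desc: Current number of open RPC connections.
numDroppedConnections:
  metric: connection.dropped
  type: counter
  unit: "{connection}"
  desc: Total number of dropped RPC connections.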

desc: Current number of blocks with corrupt replicas.
# hadoop.dfs.volume.failure.count
VolumeFailuresTotal:
metric: volume.failure.count
Contributor:

We could map the EstimatedCapacityLostTotal attribute in the future to volume.failure.capacity, to provide the estimated capacity lost due to volume failures, so the .count suffix is relevant to keep here (see the sketch after this comment).

Also, I think it is safe to keep the volume wording here for clarity, as there are lots of different possible failure types.
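
A sketch of that hypothetical follow-up entry in the mapping yaml (type, unit, and description are assumptions):

EstimatedCapacityLostTotal:
  metric: volume.failure.capacity
  type: updowncounter
  unit: By
  desc: Estimated capacity lost due to volume failures.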

Successfully merging this pull request may close these issues: jmx hadoop metrics update and align with semconv (#14274)