[JMX Insight] Hadoop jmx metrics semconv alignment #14411
Conversation
| Metric Name         | Type          | Attributes       | Description                          |
|---------------------|---------------|------------------|--------------------------------------|
| hadoop.dfs.capacity | UpDownCounter | hadoop.node.name | Current raw capacity of data nodes.  |
[for reviewer] The metric name prefix has been changed to use a metric context (`dfs`). Contexts are described in the official docs: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Metrics.html
🔧 The result from spotlessApply was committed to the PR branch.
...trics/library/src/test/java/io/opentelemetry/instrumentation/jmx/rules/TargetSystemTest.java
```diff
@@ -63,7 +63,7 @@ public class TargetSystemTest {
   private static OtlpGrpcServer otlpServer;
   private static Path agentPath;
   private static Path testAppPath;
-  private static String otlpEndpoint;
+  protected static String otlpEndpoint;
```
[minor] I would suggest using a protected getter to expose this as read-only to subclasses only, as the field is not final.
```sh
# Java agent opts needed for
export JAVA_AGENT_OPTS="-javaagent:/opentelemetry-instrumentation-javaagent.jar"
export JAVA_AGENT_OPTS="$JAVA_AGENT_OPTS -Dotel.logs.exporter=none -Dotel.traces.exporter=none -Dotel.metrics.exporter=otlp"
export JAVA_AGENT_OPTS="$JAVA_AGENT_OPTS -Dotel.exporter.otlp.endpoint=<<ENDPOINT_PLACEHOLDER>> -Dotel.exporter.otlp.protocol=grpc"
```
If I understand this correctly, the reason we have to use this placeholder and replace it at runtime is that we can't use environment variables to provide it (this does not apply only to otel env variables).
```diff
@@ -150,7 +150,7 @@ protected static Map<String, String> otelConfigProperties(List<String> yamlFiles
     // disable runtime telemetry metrics
     config.put("otel.instrumentation.runtime-telemetry.enabled", "false");
     // set yaml config files to test
-    config.put("otel.jmx.target", "tomcat");
+    config.put("otel.jmx.target", "hadoop");
```
I think this line should probably be removed as we test with only the explicit list of yaml files here.
```java
// Hadoop startup script does not propagate env vars to launched hadoop daemons,
// so all the env vars need to be embedded inside the hadoop-env.sh file
GenericContainer<?> target =
    new GenericContainer<>("loum/hadoop-pseudo:3.3.6")
        .withExposedPorts(9870, 9000)
        .withCopyToContainer(
            Transferable.of(readAndPreprocessEnvFile("hadoop3-env.sh")),
            "/opt/hadoop/etc/hadoop/hadoop-env.sh")
        .withCreateContainerCmdModifier(cmd -> cmd.withHostName("test-host"))
        .waitingFor(
            Wait.forListeningPorts(9870, 9000).withStartupTimeout(Duration.ofMinutes(3)));
```
[minor] This is the only part that differs between hadoop 2.x and 3.x; it could be worth refactoring this with a common method for the test body that delegates the container creation to a `Producer<GenericContainer<?>>` lambda, with a dedicated implementation for each.
| Metric Name         | Type          | Attributes       | Description                          |
|---------------------|---------------|------------------|--------------------------------------|
| hadoop.dfs.capacity | UpDownCounter | hadoop.node.name | Current raw capacity of data nodes.  |
The capacity is in bytes, so maybe we should start adding a dedicated column for the units (then we can do the same for other metrics as a follow-up).
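For illustration, a hypothetical Unit column (using UCUM units, e.g. `By` for bytes) could look like:

| Metric Name         | Type          | Unit | Attributes       | Description                          |
|---------------------|---------------|------|------------------|--------------------------------------|
| hadoop.dfs.capacity | UpDownCounter | By   | hadoop.node.name | Current raw capacity of data nodes.  |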
Also, I'm not sure we should have a metric name that is a prefix of another; maybe `capacity.raw` could work here to replicate the wording in the docs.
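A minimal sketch of that rename in the rule file; the `mapping:` key and the `CapacityTotal` MBean attribute backing this metric are assumptions based on the FSNamesystem metrics docs:

```yaml
---
rules:
  - bean: Hadoop:service=NameNode,name=FSNamesystem
    prefix: hadoop.dfs.
    mapping:
      # hadoop.dfs.capacity.raw, so no other metric name is a prefix of it
      CapacityTotal: # assumed MBean attribute backing the capacity metric
        metric: capacity.raw
        type: updowncounter
        unit: By
        desc: Current raw capacity of data nodes.
```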
```yaml
NumLiveDataNodes:
  metric: &metric data_node.count
  type: &type updowncounter
  unit: &unit "{node}"
  desc: &desc The number of DataNodes.
  metricAttribute:
    hadoop.node.state: const(live)
NumDeadDataNodes:
  metric: *metric
  type: *type
  unit: *unit
  desc: *desc
  metricAttribute:
    hadoop.node.state: const(dead)
```
I am not very familiar with hadoop, but there are more than 2 states for the data nodes, as we can infer from the `Num.*Datanodes` attributes in the doc, so I would suggest mapping this as individual metrics instead of using a constant metric attribute, as we can't guarantee that it's a partition.

For example, if a node went into a state that is not mapped here, the total number of nodes in the cluster would go down by one, even though the node is still part of the cluster. With individual metrics per state, the consumer can expect that the metric does not represent the whole cluster node count.
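A sketch of that per-state alternative (the metric names are illustrative, not final):

```yaml
# one metric per state: each metric only claims to count nodes in that state,
# so an unmapped state cannot silently shrink a cluster-wide total
NumLiveDataNodes:
  metric: data_node.live.count
  type: updowncounter
  unit: "{node}"
  desc: The number of live DataNodes.
NumDeadDataNodes:
  metric: data_node.dead.count
  type: updowncounter
  unit: "{node}"
  desc: The number of dead DataNodes.
```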
```yaml
---
rules:
  - bean: Hadoop:service=NameNode,name=FSNamesystem
    prefix: hadoop.dfs.
```
Do we need to always have the `dfs.` infix here? It's an acronym and probably implicit for hadoop, unless we have other use-cases, for example `hadoop.rpc` to capture the in/out bytes.
As discussed today, while `dfs` is an acronym, it maps to the "context" in the documentation, also providing some structure and helping with the mapping/linking to the MBean attributes. Adding `hadoop.rpc.io` with the `ReceivedBytes` and `SentBytes` MBean attributes would be a worthwhile addition that we could include directly in this PR, hence creating at least one metric that is not related to `dfs`.
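A rough sketch of that addition, reusing the `const()` attribute pattern from this PR; the bean name pattern (Hadoop exposes an RpcActivity bean whose name embeds the RPC port) and the direction attribute name are assumptions:

```yaml
---
rules:
  # bean name pattern is an assumption; the RpcActivity bean name includes the port
  - bean: Hadoop:service=NameNode,name=RpcActivityForPort*
    prefix: hadoop.rpc.
    mapping:
      # hadoop.rpc.io, split by direction
      ReceivedBytes:
        metric: &metric io
        type: &type counter
        unit: &unit By
        desc: &desc The number of bytes transferred over RPC.
        metricAttribute:
          hadoop.rpc.direction: const(received)
      SentBytes:
        metric: *metric
        type: *type
        unit: *unit
        desc: *desc
        metricAttribute:
          hadoop.rpc.direction: const(sent)
```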
```yaml
  desc: Current number of files and directories.
# hadoop.dfs.connection.count
TotalLoad:
  metric: connection.count
```
Here the `.count` suffix could be removed, as there is currently no other connection-related metric besides the count in the `dfs` context. I really don't know what "connection count" means in this context and whether or not it overlaps or relates with the number of open network connections.

With the `rpc` context, we have `NumOpenConnections` and `numDroppedConnections`, which could (maybe in a future improvement) be mapped respectively to:

- `hadoop.rpc.connection.count`
- `hadoop.rpc.connection.dropped`

So if those metrics were added in the future, having `hadoop.dfs.connection.count` and `hadoop.rpc.connection.count` would be consistent, and keeping `connection.count` as the suffix would be a good option.

However, if the `dfs.connection.count` metric overlaps with `rpc.connection.count`, then maybe we could capture only `rpc.connection.{count,dropped}` to prevent any confusion. Due to my lack of knowledge of hadoop, in doubt I would suggest capturing both.
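For reference, a sketch of that possible future mapping; the bean name pattern, metric types, and units are assumptions, while the attribute casing is taken from the comment above:

```yaml
---
rules:
  - bean: Hadoop:service=NameNode,name=RpcActivityForPort*
    prefix: hadoop.rpc.
    mapping:
      # hadoop.rpc.connection.count
      NumOpenConnections:
        metric: connection.count
        type: updowncounter
        unit: "{connection}"
        desc: Current number of open RPC connections.
      # hadoop.rpc.connection.dropped
      numDroppedConnections:
        metric: connection.dropped
        type: counter
        unit: "{connection}"
        desc: Total number of dropped RPC connections.
```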
```yaml
  desc: Current number of blocks with corrupt replicas.
# hadoop.dfs.volume.failure.count
VolumeFailuresTotal:
  metric: volume.failure.count
```
We could map the `EstimatedCapacityLostTotal` attribute in the future to `volume.failure.capacity` to provide an estimate of the capacity lost due to volume failures, so the `.count` suffix is relevant to keep here.

Also, I think it is safe to keep the `volume` wording here for clarity, as there are lots of different possible failure types.
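A sketch of how that follow-up could sit next to the existing count (the unit and description for the new metric are assumptions):

```yaml
# hadoop.dfs.volume.failure.count (existing)
VolumeFailuresTotal:
  metric: volume.failure.count
  type: counter
  unit: "{failure}"
  desc: Total number of volume failures.
# hadoop.dfs.volume.failure.capacity (possible follow-up)
EstimatedCapacityLostTotal:
  metric: volume.failure.capacity
  type: updowncounter
  unit: By
  desc: Estimated capacity lost due to volume failures.
```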
Fixes #14274
Includes: