
Commit e360278

maltesander, xeniape, and adwk67 authored
chore: ensure metrics are correctly exposed (#721)
* WIP: adds metrics and native-metrics service, adds metrics port to container, adds prometheus annotations
* move services to own module
* remove second metrics service, consolidate ports
* start fixing smoke tests
* fix smoke tests
* fix smoke tests scripts with regex
* fix missing tests
* fix scopes and visibility
* adapted docs
* run precommit
* add metrics service to tls cert
* clippy
* clippy 2
* remove listener class from zk in test
* Apply suggestions from code review

  Co-authored-by: Andrew Kenworthy <[email protected]>

* rename test script, fix multi disk smoke test
* run precommit
* adapted changelog

---------

Co-authored-by: xeniape <[email protected]>
Co-authored-by: Andrew Kenworthy <[email protected]>
1 parent fa97b86 commit e360278

File tree: 17 files changed, +533 −199 lines

CHANGELOG.md

Lines changed: 5 additions & 0 deletions
```diff
@@ -24,6 +24,7 @@ All notable changes to this project will be documented in this file.
 - The built-in Prometheus servlet is now enabled and metrics are exposed under the `/prom` path of all UI services ([#695]).
 - Add several properties to `hdfs-site.xml` and `core-site.xml` that improve general performance and reliability ([#696]).
 - Add RBAC rule to helm template for automatic cluster domain detection ([#699]).
+- Add `prometheus.io/path|port|scheme` annotations to metrics service ([#721]).

 ### Changed
```
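For illustration, the annotations this entry adds could look roughly like the sketch below, which attaches them to a Service using k8s-openapi types. The concrete path, port, and scheme values are assumptions taken from the NameNode defaults mentioned elsewhere in this commit, not the operator's exact code:

```rust
use std::collections::BTreeMap;

use k8s_openapi::api::core::v1::Service;

/// Sketch: attach the `prometheus.io/*` annotations from the changelog
/// entry to a metrics Service. Values are illustrative; the operator
/// derives them from the cluster configuration (e.g. HTTPS when TLS is on).
fn annotate_metrics_service(mut service: Service) -> Service {
    let annotations: BTreeMap<String, String> = [
        ("prometheus.io/path", "/prom"),
        ("prometheus.io/port", "9870"),
        ("prometheus.io/scheme", "http"),
    ]
    .into_iter()
    .map(|(key, value)| (key.to_owned(), value.to_owned()))
    .collect();

    service.metadata.annotations = Some(annotations);
    service
}
```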

```diff
@@ -48,6 +49,9 @@ All notable changes to this project will be documented in this file.
 - The CLI argument `--kubernetes-node-name` or env variable `KUBERNETES_NODE_NAME` needs to be set. The helm-chart takes care of this.
 - The operator helm-chart now grants RBAC `patch` permissions on `events.k8s.io/events`,
   so events can be aggregated (e.g. "error happened 10 times over the last 5 minutes") ([#700]).
+- BREAKING: Renamed headless rolegroup service from `<stacklet>-<role>-<rolegroup>` to `<stacklet>-<role>-<rolegroup>-metrics` ([#721]).
+  - The `prometheus.io/scrape` label was moved to the metrics service
+  - The headless service now only exposes product / data ports, the metrics service only metrics ports

 ### Fixed
```
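To make the renaming concrete, here is a minimal sketch of the naming scheme described in the BREAKING entry above. `RoleGroupNames` is a hypothetical stand-in for the operator's `RoleGroupRef`; only the format of the names comes from the entry itself:

```rust
/// Hypothetical stand-in for the operator's `RoleGroupRef`, used only to
/// illustrate the service naming scheme from the changelog entry above.
struct RoleGroupNames<'a> {
    stacklet: &'a str,
    role: &'a str,
    rolegroup: &'a str,
}

impl RoleGroupNames<'_> {
    /// Headless service: now exposes only the product / data ports.
    fn headless_service_name(&self) -> String {
        format!("{}-{}-{}", self.stacklet, self.role, self.rolegroup)
    }

    /// Metrics service: exposes only the metrics ports and carries the
    /// `prometheus.io/scrape` label.
    fn metrics_service_name(&self) -> String {
        format!("{}-metrics", self.headless_service_name())
    }
}

fn main() {
    let names = RoleGroupNames {
        stacklet: "simple-hdfs",
        role: "namenode",
        rolegroup: "default",
    };
    // Prints "simple-hdfs-namenode-default-metrics"
    println!("{}", names.metrics_service_name());
}
```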

```diff
@@ -76,6 +80,7 @@ All notable changes to this project will be documented in this file.
 [#697]: https://github.com/stackabletech/hdfs-operator/pull/697
 [#699]: https://github.com/stackabletech/hdfs-operator/pull/699
 [#700]: https://github.com/stackabletech/hdfs-operator/pull/700
+[#721]: https://github.com/stackabletech/hdfs-operator/pull/721

 ## [25.3.0] - 2025-03-21
```

Lines changed: 14 additions & 6 deletions
```diff
@@ -1,17 +1,25 @@
 = Monitoring
-:description: The HDFS cluster can be monitored with Prometheus from inside or outside the K8S cluster.
+:description: The HDFS cluster is automatically configured to export Prometheus metrics.

 The cluster can be monitored with Prometheus from inside or outside the K8S cluster.

-All services (with the exception of the Zookeeper daemon on the node names) run with the JMX exporter agent enabled and expose metrics on the `metrics` port.
-This port is available from the container level up to the NodePort services.
+The managed HDFS stacklets are automatically configured to export Prometheus metrics.
+See xref:operators:monitoring.adoc[] for more details.

 [IMPORTANT]
 ====
-Starting with Stackable Data Platform 25.7, the built-in Prometheus metrics are also available at the `/prom` endpoint of all the UI services.
+Starting with Stackable Data Platform 25.7, the built-in Prometheus metrics are available at the `/prom` endpoint of all the UI services.
 The JMX exporter metrics are now deprecated and will be removed in a future release.
 ====

-The metrics endpoints are also used as liveliness probes by Kubernetes.
+This endpoint, in the case of the Namenode service, is reachable via the `metrics` service:
+[source,shell]
+----
+http://<hdfs-stacklet>-namenode-<rolegroup-name>-metrics:9870/prom
+----

-See xref:operators:monitoring.adoc[] for more details.
+== Authentication when using TLS
+
+HDFS exposes metrics through the same port as its web UI. Hence, when configuring HDFS with TLS the metrics are also secured by TLS,
+and the clients scraping the metrics endpoint need to authenticate against it. This could, for example, be accomplished by utilizing mTLS
+between Kubernetes Pods with the xref:home:secret-operator:index.adoc[Secret Operator].
```
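As a usage sketch, an in-cluster client could scrape that endpoint like this (assuming a stacklet named `simple-hdfs`, the `default` rolegroup, plain HTTP without TLS, and the `reqwest` crate with its `blocking` feature; these specifics are assumptions, not taken from the docs page):

```rust
// Minimal in-cluster scrape of the NameNode's native Prometheus endpoint.
// The service name follows the `<stacklet>-<role>-<rolegroup>-metrics`
// scheme introduced by this commit; stacklet/rolegroup names are assumed.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let url = "http://simple-hdfs-namenode-default-metrics:9870/prom";
    let body = reqwest::blocking::get(url)?.text()?;

    // Prometheus exposition format: one sample or comment per line.
    for line in body.lines().take(5) {
        println!("{line}");
    }
    Ok(())
}
```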

rust/operator-binary/src/container.rs

Lines changed: 30 additions & 24 deletions
```diff
@@ -48,6 +48,7 @@ use stackable_operator::{
             CustomContainerLogConfig,
         },
     },
+    role_utils::RoleGroupRef,
     utils::{COMMON_BASH_TRAP_FUNCTIONS, cluster_info::KubernetesClusterInfo},
 };
 use strum::{Display, EnumDiscriminants, IntoStaticStr};
```
```diff
@@ -216,24 +217,25 @@ impl ContainerConfig {
         hdfs: &v1alpha1::HdfsCluster,
         cluster_info: &KubernetesClusterInfo,
         role: &HdfsNodeRole,
-        role_group: &str,
+        rolegroup_ref: &RoleGroupRef<v1alpha1::HdfsCluster>,
         resolved_product_image: &ResolvedProductImage,
         merged_config: &AnyNodeConfig,
         env_overrides: Option<&BTreeMap<String, String>>,
         zk_config_map_name: &str,
-        object_name: &str,
         namenode_podrefs: &[HdfsPodRef],
         labels: &Labels,
     ) -> Result<(), Error> {
         // HDFS main container
         let main_container_config = Self::from(*role);
-        pb.add_volumes(main_container_config.volumes(merged_config, object_name, labels)?)
+        let object_name = rolegroup_ref.object_name();
+
+        pb.add_volumes(main_container_config.volumes(merged_config, &object_name, labels)?)
             .context(AddVolumeSnafu)?;
         pb.add_container(main_container_config.main_container(
             hdfs,
             cluster_info,
             role,
-            role_group,
+            rolegroup_ref,
             resolved_product_image,
             zk_config_map_name,
             env_overrides,
```
```diff
@@ -277,6 +279,8 @@ impl ContainerConfig {
                 )
                 .with_pod_scope()
                 .with_node_scope()
+                // To scrape metrics behind TLS endpoint (without FQDN)
+                .with_service_scope(rolegroup_ref.rolegroup_metrics_service_name())
                 .with_format(SecretFormat::TlsPkcs12)
                 .with_tls_pkcs12_password(TLS_STORE_PASSWORD)
                 .with_auto_tls_cert_lifetime(
```
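The added service scope matters because scraping happens over HTTPS via the service's DNS name: the serving certificate needs that name among its subject alternative names, or hostname verification fails. A rough sketch of the DNS names one would expect in a certificate issued for a service scope (the exact list is up to the secret-operator, so treat this as an assumption):

```rust
/// Sketch: SAN entries one would expect for a scoped service. The short
/// name is what lets scrapers reach the TLS endpoint "without FQDN", as
/// the comment in the diff above puts it. Illustrative only.
fn expected_service_sans(service: &str, namespace: &str, cluster_domain: &str) -> Vec<String> {
    vec![
        service.to_owned(),
        format!("{service}.{namespace}.svc"),
        format!("{service}.{namespace}.svc.{cluster_domain}"),
    ]
}

fn main() {
    // Service name, namespace, and cluster domain here are assumed values.
    for san in expected_service_sans("simple-hdfs-namenode-default-metrics", "default", "cluster.local") {
        println!("{san}");
    }
}
```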
```diff
@@ -319,15 +323,15 @@ impl ContainerConfig {
         let zkfc_container_config = Self::try_from(NameNodeContainer::Zkfc.to_string())?;
         pb.add_volumes(zkfc_container_config.volumes(
             merged_config,
-            object_name,
+            &object_name,
             labels,
         )?)
         .context(AddVolumeSnafu)?;
         pb.add_container(zkfc_container_config.main_container(
             hdfs,
             cluster_info,
             role,
-            role_group,
+            rolegroup_ref,
             resolved_product_image,
             zk_config_map_name,
             env_overrides,
```
```diff
@@ -340,15 +344,15 @@ impl ContainerConfig {
             Self::try_from(NameNodeContainer::FormatNameNodes.to_string())?;
         pb.add_volumes(format_namenodes_container_config.volumes(
             merged_config,
-            object_name,
+            &object_name,
             labels,
         )?)
         .context(AddVolumeSnafu)?;
         pb.add_init_container(format_namenodes_container_config.init_container(
             hdfs,
             cluster_info,
             role,
-            role_group,
+            &rolegroup_ref.role_group,
             resolved_product_image,
             zk_config_map_name,
             env_overrides,
```
```diff
@@ -362,15 +366,15 @@ impl ContainerConfig {
             Self::try_from(NameNodeContainer::FormatZooKeeper.to_string())?;
         pb.add_volumes(format_zookeeper_container_config.volumes(
             merged_config,
-            object_name,
+            &object_name,
             labels,
         )?)
         .context(AddVolumeSnafu)?;
         pb.add_init_container(format_zookeeper_container_config.init_container(
             hdfs,
             cluster_info,
             role,
-            role_group,
+            &rolegroup_ref.role_group,
             resolved_product_image,
             zk_config_map_name,
             env_overrides,
```
```diff
@@ -385,15 +389,15 @@ impl ContainerConfig {
             Self::try_from(DataNodeContainer::WaitForNameNodes.to_string())?;
         pb.add_volumes(wait_for_namenodes_container_config.volumes(
             merged_config,
-            object_name,
+            &object_name,
             labels,
         )?)
         .context(AddVolumeSnafu)?;
         pb.add_init_container(wait_for_namenodes_container_config.init_container(
             hdfs,
             cluster_info,
             role,
-            role_group,
+            &rolegroup_ref.role_group,
             resolved_product_image,
             zk_config_map_name,
             env_overrides,
```
```diff
@@ -462,7 +466,7 @@ impl ContainerConfig {
         hdfs: &v1alpha1::HdfsCluster,
         cluster_info: &KubernetesClusterInfo,
         role: &HdfsNodeRole,
-        role_group: &str,
+        rolegroup_ref: &RoleGroupRef<v1alpha1::HdfsCluster>,
         resolved_product_image: &ResolvedProductImage,
         zookeeper_config_map_name: &str,
         env_overrides: Option<&BTreeMap<String, String>>,
```
```diff
@@ -481,7 +485,7 @@ impl ContainerConfig {
             .args(self.args(hdfs, cluster_info, role, merged_config, &[])?)
             .add_env_vars(self.env(
                 hdfs,
-                role_group,
+                &rolegroup_ref.role_group,
                 zookeeper_config_map_name,
                 env_overrides,
                 resources.as_ref(),
```
```diff
@@ -1249,16 +1253,18 @@ wait_for_termination $!
     /// Container ports for the main containers namenode, datanode and journalnode.
     fn container_ports(&self, hdfs: &v1alpha1::HdfsCluster) -> Vec<ContainerPort> {
         match self {
-            ContainerConfig::Hdfs { role, .. } => hdfs
-                .ports(role)
-                .into_iter()
-                .map(|(name, value)| ContainerPort {
-                    name: Some(name),
-                    container_port: i32::from(value),
-                    protocol: Some("TCP".to_string()),
-                    ..ContainerPort::default()
-                })
-                .collect(),
+            ContainerConfig::Hdfs { role, .. } => {
+                // data ports
+                hdfs.hdfs_main_container_ports(role)
+                    .into_iter()
+                    .map(|(name, value)| ContainerPort {
+                        name: Some(name),
+                        container_port: i32::from(value),
+                        protocol: Some("TCP".to_string()),
+                        ..ContainerPort::default()
+                    })
+                    .collect()
+            }
             _ => {
                 vec![]
             }
```
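Since the main container list above now carries only the data ports, the metrics ports presumably end up in a separate list. Here is a hedged sketch of what that complementary list could look like for a NameNode, reusing the `ContainerPort` pattern from the diff; the name-to-port mapping is an assumption based on the constants added in `rust/operator-binary/src/crd/constants.rs` below:

```rust
use k8s_openapi::api::core::v1::ContainerPort;

/// Sketch: NameNode metrics ports, complementing the data ports built in
/// the match arm above. 8183 is the JMX exporter port, 9870 the native
/// HTTP port; which service-port name maps to which value is assumed.
fn namenode_metrics_container_ports() -> Vec<ContainerPort> {
    [("jmx-metrics", 8183u16), ("metrics", 9870u16)]
        .into_iter()
        .map(|(name, value)| ContainerPort {
            name: Some(name.to_owned()),
            container_port: i32::from(value),
            protocol: Some("TCP".to_string()),
            ..ContainerPort::default()
        })
        .collect()
}
```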

rust/operator-binary/src/crd/constants.rs

Lines changed: 7 additions & 0 deletions
```diff
@@ -20,21 +20,28 @@ pub const SERVICE_PORT_NAME_HTTP: &str = "http";
 pub const SERVICE_PORT_NAME_HTTPS: &str = "https";
 pub const SERVICE_PORT_NAME_DATA: &str = "data";
 pub const SERVICE_PORT_NAME_METRICS: &str = "metrics";
+pub const SERVICE_PORT_NAME_JMX_METRICS: &str = "jmx-metrics";

 pub const DEFAULT_LISTENER_CLASS: &str = "cluster-internal";

 pub const DEFAULT_NAME_NODE_METRICS_PORT: u16 = 8183;
+pub const DEFAULT_NAME_NODE_NATIVE_METRICS_HTTP_PORT: u16 = 9870;
+pub const DEFAULT_NAME_NODE_NATIVE_METRICS_HTTPS_PORT: u16 = 9871;
 pub const DEFAULT_NAME_NODE_HTTP_PORT: u16 = 9870;
 pub const DEFAULT_NAME_NODE_HTTPS_PORT: u16 = 9871;
 pub const DEFAULT_NAME_NODE_RPC_PORT: u16 = 8020;

 pub const DEFAULT_DATA_NODE_METRICS_PORT: u16 = 8082;
+pub const DEFAULT_DATA_NODE_NATIVE_METRICS_HTTP_PORT: u16 = 9864;
+pub const DEFAULT_DATA_NODE_NATIVE_METRICS_HTTPS_PORT: u16 = 9865;
 pub const DEFAULT_DATA_NODE_HTTP_PORT: u16 = 9864;
 pub const DEFAULT_DATA_NODE_HTTPS_PORT: u16 = 9865;
 pub const DEFAULT_DATA_NODE_DATA_PORT: u16 = 9866;
 pub const DEFAULT_DATA_NODE_IPC_PORT: u16 = 9867;

 pub const DEFAULT_JOURNAL_NODE_METRICS_PORT: u16 = 8081;
+pub const DEFAULT_JOURNAL_NODE_NATIVE_METRICS_HTTP_PORT: u16 = 8480;
+pub const DEFAULT_JOURNAL_NODE_NATIVE_METRICS_HTTPS_PORT: u16 = 8481;
 pub const DEFAULT_JOURNAL_NODE_HTTP_PORT: u16 = 8480;
 pub const DEFAULT_JOURNAL_NODE_HTTPS_PORT: u16 = 8481;
 pub const DEFAULT_JOURNAL_NODE_RPC_PORT: u16 = 8485;
```
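For context, the new `NATIVE_METRICS` constants mirror the existing HTTP/HTTPS ports because HDFS serves its native Prometheus metrics on the same web server as the UI (see the monitoring docs above). A plausible, not verbatim, sketch of how one of them might surface as a port on the NameNode metrics Service:

```rust
use k8s_openapi::api::core::v1::ServicePort;

// Constants duplicated here from the diff above so the sketch is self-contained.
const SERVICE_PORT_NAME_METRICS: &str = "metrics";
const DEFAULT_NAME_NODE_NATIVE_METRICS_HTTP_PORT: u16 = 9870;

/// Sketch: expose the NameNode's native metrics port on the metrics
/// Service. The operator would presumably pick the HTTPS variant when
/// TLS is enabled; that logic is assumed, not shown in this commit.
fn namenode_metrics_service_port() -> ServicePort {
    ServicePort {
        name: Some(SERVICE_PORT_NAME_METRICS.to_owned()),
        port: i32::from(DEFAULT_NAME_NODE_NATIVE_METRICS_HTTP_PORT),
        protocol: Some("TCP".to_string()),
        ..ServicePort::default()
    }
}
```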
