SangJunBak
Contributor

@SangJunBak SangJunBak commented Sep 16, 2025

Inspired by https://www.cockroachlabs.com/docs/stable/monitor-cockroachdb-with-prometheus

This PR:

  • Adds scrape configs that expose all the Prometheus HTTP endpoints Paul implemented, to keep parity with the promsql exporter queries. Tested, and it works with password auth too.
  • Introduces two dashboards:

Environment overview

(Four screenshots of the Environment overview dashboard, taken 2025-09-16.)

Freshness

(Three screenshots of the Freshness dashboard, taken 2025-09-16.)

TODO next:

  • Replace Datadog + Promsql exporter with just Datadog + Prometheus. Also create a Datadog dashboard.
  • Document and filter Prometheus metrics.
  • Create a sample alerting config and some guidelines on what to look out for.
  • Incorporate the swap metric into the env overview dashboard.
  • Document what these dashboards actually represent.

Motivation

Fixes https://github.com/MaterializeInc/cloud/issues/10992

Tips for reviewer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@SangJunBak SangJunBak changed the base branch from main to self-managed-docs/v25.2 September 16, 2025 11:23
Comment on lines +174 to +188
```yaml
- job_name: 'kubelet-cadvisor'
  scheme: https
  kubernetes_sd_configs:
    - role: node
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
    - target_label: __address__
      replacement: kubernetes.default.svc:443
    - source_labels: [__meta_kubernetes_node_name]
      regex: (.+)
      target_label: __metrics_path__
```
Contributor Author

Needed this for the CPU metrics. Wondering if this is OK?
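One way to confirm the CPU metrics are actually reachable is to query a node's cAdvisor endpoint through the API server proxy. This is a sketch under assumptions: the path shape `/api/v1/nodes/<node>/proxy/metrics/cadvisor` follows the upstream prometheus-community chart (the quoted lines stop before the `__metrics_path__` replacement), the caller has `kubectl` access, and the helper name is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: builds the API-server proxy path that serves a node's
# cAdvisor metrics. The path shape is an assumption based on the upstream
# prometheus-community chart, not taken from this PR.
cadvisor_proxy_path() {
  local node="$1"
  printf '/api/v1/nodes/%s/proxy/metrics/cadvisor\n' "$node"
}

# Example (requires cluster access; not run here):
#   node="$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')"
#   kubectl get --raw "$(cadvisor_proxy_path "$node")" \
#     | grep '^container_cpu_usage_seconds_total' | head -n 5
```

If the `grep` returns samples, the same path should work as the scrape target's metrics path.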

@SangJunBak
Contributor Author

@kay-kim I'm unsure whether the best way to document these panels in Grafana is the built-in panel descriptions or our docs... 🤔

@SangJunBak SangJunBak marked this pull request as ready for review September 16, 2025 11:25
@SangJunBak SangJunBak requested a review from a team as a code owner September 16, 2025 11:25
```


## 2. Install Prometheus to your Kubernetes cluster using [`prometheus-community/prometheus`](https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus) (Optional)
Contributor

As currently written, it's:

  1. curl the prometheus.yml
  2. -Optional. So, can skip this step-
  3. -Optional. So, can also skip this step- 😄

So this page just tells people that the only requirement is to curl prometheus.yml.

Contributor

If the point is more that they don't have to use prometheus-community, I'd just make the heading "Install Prometheus to your Kubernetes cluster" and add a note that the following uses the prometheus-community chart values.

Contributor Author

Done

kubectl port-forward pod/$MZ_POD_GRAFANA 3000:3000 -n monitoring
```
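The port-forward command above expects `$MZ_POD_GRAFANA` to already hold the Grafana pod name. One way to populate it is a label-selector lookup. This sketch assumes the Grafana chart's standard labels and a release named `my-grafana` in the `monitoring` namespace, matching the chart output quoted later in this thread; the helper name is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: label selector for pods created by the Grafana Helm
# chart for a given release (labels as printed in the chart's NOTES output).
grafana_pod_selector() {
  local release="$1"
  printf 'app.kubernetes.io/name=grafana,app.kubernetes.io/instance=%s\n' "$release"
}

# Example (requires cluster access; not run here):
#   MZ_POD_GRAFANA="$(kubectl get pods -n monitoring \
#     -l "$(grafana_pod_selector my-grafana)" \
#     -o jsonpath='{.items[0].metadata.name}')"
#   kubectl port-forward "pod/$MZ_POD_GRAFANA" 3000:3000 -n monitoring
```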

{{< note >}}
Contributor

I'd make it a warning instead of a note.

Contributor Author

Done


## 1. Download our Prometheus scrape configurations (`prometheus.yml`)
```bash
curl -O https://raw.githubusercontent.com/MaterializeInc/materialize/refs/heads/self-managed-docs/v25.2/doc/user/data/monitoring/prometheus.yml
Contributor

I wonder ... should we make it

curl -o prometheus_scrape_configs.yml https://raw.githubusercontent.com/prometheus-community/helm-charts/refs/heads/main/charts/prometheus/values.yaml

so that later on, it reinforces the notion that these are the scrape configs we'll be replacing below?

Contributor

Maybe also mention that they'll use these scrape configs when running Prometheus.

Contributor Author

Done

--values values.yaml
```
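Before passing the values file to Helm, it may help to confirm that the downloaded scrape configuration actually contains the jobs you expect (for example the `kubelet-cadvisor` job discussed above). A minimal sketch; the helper name, the release name `prometheus`, and the `prometheus` namespace are assumptions, chosen so the resulting service would match the `prometheus-server.prometheus.svc.cluster.local:80` example used elsewhere in this thread.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if FILE defines a scrape job named JOB.
# Matches `job_name: JOB`, `job_name: 'JOB'`, or `job_name: "JOB"`.
# JOB is interpreted as an extended regular expression.
has_scrape_job() {
  local file="$1" job="$2"
  grep -Eq "job_name:[[:space:]]*['\"]?${job}['\"]?[[:space:]]*\$" "$file"
}

# Example (requires network and cluster access; not run here):
#   has_scrape_job values.yaml kubelet-cadvisor || echo "kubelet-cadvisor job missing" >&2
#   helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
#   helm upgrade --install prometheus prometheus-community/prometheus \
#     --namespace prometheus --create-namespace \
#     --values values.yaml
```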

## 3. Visualize through Grafana (optional)
Contributor

Put "Optional" first. It makes it easier for people to skim over optional parts.

3. Optional. Visualize ...

Contributor Author

Done


## 3. Visualize through Grafana (optional)

1. Install the Grafana helm chart following [this guide](https://grafana.com/docs/grafana/latest/setup-grafana/installation/helm/)
Contributor

Missing a period.

Contributor Author

Done


{{< note >}}
The port forwarding method is for testing purposes only. For production environments, configure an ingress controller to securely expose the Grafana UI.
{{< /note >}}
Contributor

Probably should tell them to open a browser and access Grafana.

Contributor Author

Done

3. Within the UI, add a Prometheus data source where the URL is `http://<prometheus server name>.<namespace>.svc.cluster.local:<port>`(i.e. `http://prometheus-server.prometheus.svc.cluster.local:80`)
Contributor

I'd break this up:
In the Grafana UI, add a Prometheus data source. In the Connection section, set the Prometheus server URL to `http://<prometheus server name>.<namespace>.svc.cluster.local:<port>` (e.g., `http://prometheus-server.prometheus.svc.cluster.local:80`).
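The in-cluster URL pattern above is the Service's cluster DNS name plus its port. A hypothetical helper that assembles it, assuming the default `cluster.local` cluster domain:

```bash
#!/usr/bin/env bash
# Hypothetical helper: in-cluster URL for a Kubernetes Service, assuming the
# default cluster domain `cluster.local`.
prometheus_datasource_url() {
  local service="$1" namespace="$2" port="$3"
  printf 'http://%s.%s.svc.cluster.local:%s\n' "$service" "$namespace" "$port"
}

# Example: look up the Service port instead of guessing it (requires cluster access):
#   port="$(kubectl get svc prometheus-server -n prometheus -o jsonpath='{.spec.ports[0].port}')"
#   prometheus_datasource_url prometheus-server prometheus "$port"
```

Looking the port up with `kubectl` avoids hard-coding `80`, which depends on the chart's service settings.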

Contributor Author

Done


![Image of the Grafana Prometheus data source setup](/images/grafana-prometheus-datasource-setup.png)

4. Download the following dashboards:
Contributor

I'd mention the file names, since the next step asks to import them (so they don't have to go from terminal to browser, back to terminal, and then to Grafana):

  • environment_overview_dashboard.json
  • freshness_overview_dashboard.json
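As an alternative to the import UI, dashboards can be pushed through Grafana's dashboard HTTP API (`POST /api/dashboards/db`), which expects the dashboard model wrapped in a `dashboard` field. A sketch under assumptions: the files are plain dashboard JSON (not an "export for sharing externally" file with `__inputs`, which the import UI handles), Grafana is port-forwarded to `localhost:3000`, and `GRAFANA_PASSWORD` holds the admin password. The helper name is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: wraps a dashboard JSON file in the request body shape
# used by Grafana's POST /api/dashboards/db endpoint.
# Assumes FILE holds a plain dashboard model; if it carries a non-null "id",
# Grafana may treat the request as an update of an existing dashboard.
dashboard_import_payload() {
  local file="$1"
  printf '{"dashboard": %s, "overwrite": true}\n' "$(cat "$file")"
}

# Example (requires a port-forwarded Grafana; not run here):
#   for f in environment_overview_dashboard.json freshness_overview_dashboard.json; do
#     dashboard_import_payload "$f" | curl -fsS -u "admin:${GRAFANA_PASSWORD}" \
#       -H 'Content-Type: application/json' \
#       -X POST http://localhost:3000/api/dashboards/db --data-binary @-
#   done
```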

Contributor Author

Done

These will be useful for future dashboards
…tion and new dashboards

- Made section less specific to prometheus-community
- Added steps to install Grafana
- Introduced two new Grafana dashboard JSON files: `environment_overview_dashboard.json` and `freshness_overview_dashboard.json`.
For Self-Managed, this is redundant given we have parity using just Prometheus
- Make prometheus-community usage a footnote
- Address stylistic nits Kay recommended
- Add an "open browser" step
- Improve punctuation
@SangJunBak SangJunBak force-pushed the self-managed-docs/add-grafana-dashboard branch from 638b18e to af2f5a1 September 24, 2025 02:58
@SangJunBak SangJunBak requested a review from kay-kim September 24, 2025 02:59
@SangJunBak
Contributor Author

Addressed your feedback @kay-kim. TFTR!

Contributor

@kay-kim kay-kim left a comment

Nice! Tried it out and got the dashboards all doing their thing 🎉🎉🎉
Just a minor thing w.r.t. port forwarding + admin password.

## 2. Install Prometheus to your Kubernetes cluster

{{< note >}}
In this guide, we use the [prometheus-community](https://github.com/prometheus-community/helm-charts) Helm chart to install Prometheus.
Contributor

Small nit (just to make it shorter to read):
In this guide, we use -> This guide uses

1. Install the Grafana helm chart following [this guide](https://grafana.com/docs/grafana/latest/setup-grafana/installation/helm/).


2. Set up port forwarding to access the Grafana UI:
Contributor

Since the Grafana install guide above already covers port forwarding and the admin password in its https://grafana.com/docs/grafana/latest/setup-grafana/installation/helm/#access-grafana section, and the install also prints them to stdout:

1. Get your 'admin' user password by running:

   kubectl get secret --namespace monitoring my-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo


2. The Grafana server can be accessed via port 80 on the following DNS name from within your cluster:

   my-grafana.monitoring.svc.cluster.local

   Get the Grafana URL to visit by running these commands in the same shell:
     export POD_NAME=$(kubectl get pods --namespace monitoring -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=my-grafana" -o jsonpath="{.items[0].metadata.name}")
     kubectl --namespace monitoring port-forward $POD_NAME 3000

3. Login with the password from step 1 and the username: admin
#################################################################################
######   WARNING: Persistence is disabled!!! You will lose your data when   #####
######            the Grafana pod is terminated.                            #####
#################################################################################

Should we defer this to their instructions, since we don't mention the admin password in the next step?
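The password command in the chart output above reads the `admin-password` key from the release's Secret and base64-decodes it, since Kubernetes stores Secret `data` values base64-encoded. A sketch of the decoding step, assuming GNU coreutils `base64` and the `my-grafana` release in `monitoring` from that output; the helper name is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: decodes a base64-encoded Secret value (Kubernetes
# stores Secret `data` fields base64-encoded). Uses GNU coreutils `base64`.
decode_secret_value() {
  printf '%s' "$1" | base64 --decode
}

# Example (requires cluster access; not run here):
#   encoded="$(kubectl get secret my-grafana -n monitoring -o jsonpath='{.data.admin-password}')"
#   GRAFANA_PASSWORD="$(decode_secret_value "$encoded")"
```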

Contributor Author

I guess the main concern is that they might change their docs and our copied instructions would no longer be correct. I kind of feel like we should defer to their instructions!

Contributor

Oh... heh... I meant that if we defer to their instructions, we can get rid of our explicit port-forwarding code block.


3. Open the Grafana UI on [http://localhost:3000](http://localhost:3000) in a browser.

4. In the Grafana UI, add a Prometheus data source. In the Connection section, set the Prometheus server URL to `http://<prometheus server name>.<namespace>.svc.cluster.local:<port>`(e.g. `http://prometheus-server.prometheus.svc.cluster.local:80`).
Contributor

Add a Prometheus data source. In the Grafana UI, under Connections > Data sources:

  • Click Add data source and select Prometheus.
  • In the Connection section, set the Prometheus server URL to ...
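The same data source can also be created without the UI through Grafana's data source HTTP API (`POST /api/datasources`). A sketch under assumptions: basic auth as `admin` against a port-forwarded Grafana on `localhost:3000`, `GRAFANA_PASSWORD` holding the admin password, and the example in-cluster Prometheus URL from this thread. The helper name and the data source name `Prometheus` are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: request body for Grafana's POST /api/datasources,
# registering a Prometheus data source that Grafana queries server-side
# ("access": "proxy"). Does not JSON-escape its arguments, so it is only
# suitable for plain names and URLs.
prometheus_datasource_payload() {
  local name="$1" url="$2"
  printf '{"name": "%s", "type": "prometheus", "url": "%s", "access": "proxy"}\n' "$name" "$url"
}

# Example (requires a port-forwarded Grafana; not run here):
#   prometheus_datasource_payload Prometheus \
#     http://prometheus-server.prometheus.svc.cluster.local:80 \
#     | curl -fsS -u "admin:${GRAFANA_PASSWORD}" -H 'Content-Type: application/json' \
#         -X POST http://localhost:3000/api/datasources --data-binary @-
```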

@SangJunBak SangJunBak merged commit cbf155a into MaterializeInc:self-managed-docs/v25.2 Sep 24, 2025
8 checks passed