Skip to content

Conversation

@omerap12
Copy link
Contributor

@omerap12 omerap12 commented Aug 2, 2025

Pull Request Description

Added liveness and readiness checks to the metadata service.
Tested it on my local machine (without using the chart), and it works.
I think we should merge PR #1370 first because it adds unit tests for the chart ( and by that testing the liveness and readiness probes to all components ).

Related Issues

Resolves: #[Insert issue number(s)]

Important: Before submitting, please complete the description above and review the checklist below.


Contribution Guidelines (Expand for Details)

We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:

Pull Request Title Format

Your PR title should start with one of these prefixes to indicate the nature of the change:

  • [Bug]: Corrections to existing functionality
  • [CI]: Changes to build process or CI pipeline
  • [Docs]: Updates or additions to documentation
  • [API]: Modifications to aibrix's API or interface
  • [CLI]: Changes or additions to the Command Line Interface
  • [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.

Submission Checklist

  • PR title includes appropriate prefix(es)
  • Changes are clearly explained in the PR description
  • New and existing tests pass successfully
  • Code adheres to project style and best practices
  • Documentation updated to reflect changes (if applicable)
  • Thorough testing completed, no regressions introduced

By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of Changes

Hello @omerap12, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the operational resilience and reliability of the metadata-service within a Kubernetes environment. By implementing standard liveness and readiness probes, the service can now signal its health and readiness to the orchestrator, ensuring proper lifecycle management and improved availability.

Highlights

  • Kubernetes Probes Integration: I've added liveness and readiness probe configurations to the metadata-service Kubernetes deployment manifest (dist/chart/templates/metadata-service/deployment.yaml). These probes are now templated to allow for flexible configuration via Helm values.
  • Health Check Endpoints: New HTTP GET endpoints, /healthz and /readyz, have been introduced in the pkg/metadata/handlers.go file. The /healthz endpoint performs a health check against the Redis client, while /readyz currently reuses the healthz logic, with a TODO for future differentiation.
  • Redis Health Utility: A new utility function, CheckRedisHealth, has been added to pkg/utils/users.go. This function pings the Redis client with a 2-second context timeout to verify its connectivity, providing a robust way to check the service's dependency.
  • Helm Chart Defaults: Default liveness and readiness probe configurations have been added to dist/chart/values.yaml under metadata.service.container.probes. These specify HTTP GET checks on the new /healthz and /readyz paths, along with initial delays and periods.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces liveness and readiness probes for the metadata service. The changes to the Helm chart and Go code are well-implemented. I suggest refactoring the readiness probe to have a different check than the healthz endpoint. Also, consider adding logging within the CheckRedisHealth function to provide better visibility into Redis connection issues.

Comment on lines 197 to 198
// TODO: we need to have a different check
s.healthz(w, r)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The readiness probe currently calls the healthz endpoint. It's better to have a distinct check for readiness that confirms the service is ready to accept traffic, potentially including checks beyond basic Redis health. Consider adding other checks to ensure the service is fully initialized and ready to handle requests.

func (s *httpServer) readyz(w http.ResponseWriter, r *http.Request) {
	// TODO: we need to have a different check
	// s.healthz(w, r)
	w.WriteHeader(http.StatusOK)
}

Copy link
Collaborator

@Jeffwan Jeffwan Aug 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to do reverse way in current situation?

  1. healthz can return static response => it should focus on metadata service pod itself
  2. readyz check redis readiness. => functionally perspective, it relies on redis.

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. will change.

Comment on lines +78 to +81
func CheckRedisHealth(ctx context.Context, redisClient *redis.Client) error {
ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
defer cancel()
return redisClient.Ping(ctx).Err()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Consider adding logging within the CheckRedisHealth function to provide better visibility into Redis connection issues. Log the error when the ping fails to aid in debugging.

Suggested change
func CheckRedisHealth(ctx context.Context, redisClient *redis.Client) error {
ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
defer cancel()
return redisClient.Ping(ctx).Err()
func CheckRedisHealth(ctx context.Context, redisClient *redis.Client) error {
ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
defer cancel()
err := redisClient.Ping(ctx).Err()
if err != nil {
klog.Errorf("Redis health check failed: %v", err)
return err
}
return nil
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We return this err in the service itself, so I think this isn't necessary.

@googs1025
Copy link
Collaborator

some info: #1249

Do we need to maintain the metadata service in the future? 🤔

@omerap12
Copy link
Contributor Author

omerap12 commented Aug 4, 2025

Oh I wasn't aware of this. thanks @googs1025!
I think we should keep maintaining this service for now since we don't have the new AIBrix solution ready yet (AFAIK).
We won't add new features, but we should keep it working well for our current customers

@googs1025
Copy link
Collaborator

Oh I wasn't aware of this. thanks @googs1025!

I think we should keep maintaining this service for now since we don't have the new AIBrix solution ready yet (AFAIK).

We won't add new features, but we should keep it working well for our current customers

sgtm I'm ok with this

@Jeffwan
Copy link
Collaborator

Jeffwan commented Aug 5, 2025

let's hold this change and review it once v0.4.0 is released (no late than tomorrow).

@Jeffwan
Copy link
Collaborator

Jeffwan commented Aug 6, 2025

some info: #1249
Do we need to maintain the metadata service in the future? 🤔

@googs1025 It might be changed once the batch api #182 work is done. so let's keep moving on existing solutions. If we finally plan to move to python based metadata server. Exact same healthz and readyz check should be migrated to python code base as well

@Jeffwan Jeffwan force-pushed the metadata-service-health branch from 6beeed0 to ec1ce54 Compare August 7, 2025 17:33
@Jeffwan
Copy link
Collaborator

Jeffwan commented Aug 7, 2025

The Helm chart CI has started working and caught some errors here. They're not directly related to this PR. I'm fine with including the fix either in a separate PR or as part of this one—whichever works better.

image

@omerap12
Copy link
Contributor Author

omerap12 commented Aug 7, 2025

The Helm chart CI has started working and caught some errors here. They're not directly related to this PR. I'm fine with including the fix either in a separate PR or as part of this one—whichever works better.

image

I'll submit a fix, np.

@googs1025
Copy link
Collaborator

/lgtm

thanks for this

@Jeffwan Jeffwan force-pushed the metadata-service-health branch from ec1ce54 to f090166 Compare August 8, 2025 20:31
@omerap12
Copy link
Contributor Author

omerap12 commented Aug 9, 2025

@Jeffwan, the chart-ci is failing which is makes sense since we are adding liveness and readiness in the chart but the code is not yet available in the nightly version.
I can separate this into two different PRs if you'd like

@Jeffwan
Copy link
Collaborator

Jeffwan commented Aug 9, 2025

@omerap12 technically, we could build the images internally and override the tag used in the testing.

ct install --chart-dirs . --charts . --namespace aibrix-system

but that introduce lots of work and kind of redudant since we already have a few image builds jobs in integration test. I agree that splitting into two PR sounds reasonable at this moment.

@omerap12 omerap12 requested review from Jeffwan and googs1025 August 10, 2025 06:46
@omerap12
Copy link
Contributor Author

@omerap12 technically, we could build the images internally and override the tag used in the testing.

ct install --chart-dirs . --charts . --namespace aibrix-system

but that introduce lots of work and kind of redudant since we already have a few image builds jobs in integration test. I agree that splitting into two PR sounds reasonable at this moment.

Yeah totally agree. so in this PR we will introduce the liveness & readiness probes only.
Next PR will be chart update

@Jeffwan Jeffwan force-pushed the metadata-service-health branch from c7f4267 to 8a031cf Compare August 10, 2025 23:42
@Jeffwan Jeffwan merged commit 666ca17 into vllm-project:main Aug 11, 2025
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants