Conversation

@akram akram commented Oct 10, 2025

What does this PR do?

Fixes #3769

Test Plan

Create a config-registered-models.yaml with the following contents (the BOGUS API key is deliberate, to simulate a misconfigured provider):

```yaml
version: 2
image_name: config-registered-models
apis:
- inference
providers:
  inference:
  - provider_id: openai
    provider_type: remote::openai
    config:
      api_key: BOGUS
metadata_store:
  type: sqlite
  db_path: /tmp/config-registered-model.db
models:
- model_id: test-model
  provider_id: openai
  provider_model_id: custom-model
  model_type: llm
```

and run:

```
llama stack run config-registered-models.yaml
```

The server no longer crashes; the provider failure is logged at ERROR level and startup completes:

```
INFO     2025-10-10 23:54:35,214 uvicorn.error:84 uncategorized: Started server process [51980]
INFO     2025-10-10 23:54:35,215 uvicorn.error:48 uncategorized: Waiting for application startup.
INFO     2025-10-10 23:54:35,215 llama_stack.core.server.server:177 core::server: Starting up
INFO     2025-10-10 23:54:35,216 llama_stack.core.stack:421 core: starting registry refresh task
INFO     2025-10-10 23:54:35,221 uvicorn.error:62 uncategorized: Application startup complete.
INFO     2025-10-10 23:54:35,222 uvicorn.error:216 uncategorized: Uvicorn running on http://['::', '0.0.0.0']:8321
         (Press CTRL+C to quit)
ERROR    2025-10-10 23:54:35,379 llama_stack.providers.utils.inference.openai_mixin:434 providers::utils:
         OpenAIInferenceAdapter.list_provider_model_ids() failed with: Error code: 401 - {'error': {'message':
         'Incorrect API key provided: BOGUS. You can find your API key at
         https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code':
         'invalid_api_key'}}
```
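
To double-check that the server survived startup, one can poll it after the logs above settle. A minimal sketch, assuming the stack exposes an OpenAI-compatible `/v1/models` route on the port shown in the Uvicorn log line; the route is an assumption, not something this PR adds:

```python
import requests  # third-party: pip install requests

# Port 8321 is taken from the Uvicorn log line above; /v1/models is an
# assumed OpenAI-compatible listing endpoint, not part of this PR.
resp = requests.get("http://localhost:8321/v1/models")
print(resp.status_code)  # any HTTP response proves the server is still up
```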

cc @mattf

akram added 2 commits October 10, 2025 22:19
When a provider fails during model registration or listing, the stack
should continue initializing rather than crashing. This allows the
stack to start even if some providers are misconfigured.

- Added error handling in register_resources() (see the sketch after these commit notes)
- Added unit tests to verify error handling behavior
- Improved error logging with provider context
- Removed @pytest.mark.asyncio decorators (pytest already configured with async-mode=auto)

Fixes llamastack#3769

Added tests to verify that the stack:
1. Continues initialization when providers fail to register models
2. Skips invalid models instead of crashing
3. Handles provider listing failures gracefully
4. Maintains partial functionality with mixed success/failure

Example:
- OpenAI provider fails to list models
- Stack logs error and continues with registered models
- Other providers remain functional

This prevents the entire stack from crashing when:
- Provider API keys are invalid
- Models are misconfigured
- Provider API is temporarily unavailable
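
A minimal sketch of the approach the commit notes describe. The function name `register_resources()` comes from the commit message; the loop body, the `impls` lookup, and the `logger` setup are illustrative assumptions, not the actual patch:

```python
import logging

logger = logging.getLogger(__name__)

async def register_resources(run_config, impls):
    """Register models declared in run.yaml, surviving provider failures."""
    for model in run_config.models:
        try:
            await impls["models"].register_model(
                model_id=model.model_id,
                provider_id=model.provider_id,
                provider_model_id=model.provider_model_id,
                model_type=model.model_type,
            )
        except Exception as e:
            # Log with provider context and keep going, so one misconfigured
            # provider does not take down the whole stack at startup.
            logger.error(
                "Failed to register model %s via provider %s: %s",
                model.model_id, model.provider_id, e,
            )
```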
@meta-cla meta-cla bot added the CLA Signed label Oct 10, 2025
@akram akram changed the title Fix/3769 graceful provider registration failure Fix: 3769 graceful provider registration failure Oct 10, 2025
@akram akram changed the title Fix: 3769 graceful provider registration failure fix: 3769 graceful provider registration failure Oct 10, 2025
@akram akram commented Oct 11, 2025

/assign @mattf
/assign @ashwinb

@ashwinb ashwinb left a comment


We should absolutely not do this. This means bogus things in run.yaml will continue to not get fixed. It is not an acceptable method to hide errors like this.

@akram akram commented Oct 13, 2025

Hi @ashwinb, thanks for the review.
The error is still logged at ERROR level, as shown in the PR description, so it is not hidden per se. Or am I getting it wrong?

@akram akram commented Oct 13, 2025

/hold

@ashwinb ashwinb commented Oct 13, 2025

> Hi @ashwinb, thanks for the review. The error is still logged at ERROR level, as shown in the PR description, so it is not hidden per se. Or am I getting it wrong?

Yes. We cannot catch the exception; the server should die if you misconfigure it like that.
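
In other words, the objection is not to the logging but to swallowing the exception. Relative to the earlier sketch, the behavior being defended re-raises instead of continuing (again an illustrative sketch with the same assumed setup, not the actual code):

```python
# Same illustrative setup as the earlier sketch; the only change is `raise`.
async def register_resources(run_config, impls):
    for model in run_config.models:
        try:
            await impls["models"].register_model(
                model_id=model.model_id,
                provider_id=model.provider_id,
                provider_model_id=model.provider_model_id,
                model_type=model.model_type,
            )
        except Exception:
            logger.exception("Failed to register model %s", model.model_id)
            raise  # fail fast: a misconfigured run.yaml stops the server at startup
```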

@ashwinb ashwinb closed this Oct 14, 2025
Labels

CLA Signed (managed by the Meta Open Source bot)

Development

Successfully merging this pull request may close these issues.

startup crash when config registers a model and provider returns an exception (#3769)
