Describe the bug
The LiteLLM wrapper does not propagate finish_reason from LiteLLM responses to LlmResponse.finish_reason. This makes it impossible for after_model_callback functions to detect when the max_tokens limit is hit or when other completion conditions occur.
When using LiteLLM models, llm_response.finish_reason is always None, even when LiteLLM returns a valid finish_reason value (e.g., "length" for max_tokens truncation).
To Reproduce
Minimal reproduction code:
import asyncio

from google.adk import Agent, Runner
from google.adk.agents.callback_context import CallbackContext
from google.adk.models.lite_llm import LiteLlm
from google.adk.models.llm_response import LlmResponse
from google.adk.sessions import InMemorySessionService
from google.genai import types


def create_inspector():
    """Callback to capture finish_reason."""
    captured = {"finish_reason": None}

    def inspector(ctx: CallbackContext, resp: LlmResponse) -> LlmResponse:
        captured["finish_reason"] = resp.finish_reason
        return resp

    inspector.captured = captured
    return inspector


async def test():
    # Create model with low max_tokens to trigger truncation
    model = LiteLlm(
        model="gpt-3.5-turbo",
        api_key="your-key",
        max_tokens=50,  # Intentionally low
    )

    inspector = create_inspector()
    agent = Agent(
        model=model,
        name="test",
        instruction="Provide detailed explanations.",
        after_model_callback=inspector,
    )

    session_service = InMemorySessionService()
    runner = Runner(
        app_name="test",
        agent=agent,
        session_service=session_service,
    )
    await session_service.create_session(
        app_name="test",
        user_id="user",
        session_id="session",
        state={},
    )

    message = types.Content(
        role="user",
        parts=[types.Part(text="Explain quantum computing in detail.")],
    )
    async for _ in runner.run_async(
        user_id="user",
        session_id="session",
        new_message=message,
    ):
        pass

    print(f"finish_reason: {inspector.captured['finish_reason']}")
    # Output: finish_reason: None (BUG - should be "length")


asyncio.run(test())
Steps to reproduce:
- Install: pip install google-adk litellm openai
- Run the code above with a valid API key
- Observe that finish_reason is None even though LiteLLM returned "length"
Expected behavior
llm_response.finish_reason should contain the finish_reason value from LiteLLM:
- "stop" for natural completion
- "length" for max_tokens limit reached
- "tool_calls" for tool invocations
- "content_filter" for filtered content
This matches the behavior of native Gemini models, where finish_reason is properly populated.
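As an illustration of what this unblocks, here is a minimal after_model_callback sketch (hypothetical, not part of the reproduction) that reacts to truncation once finish_reason is propagated; the comparison assumes LiteLLM's raw string values rather than the Gemini enum:
from typing import Optional

from google.adk.agents.callback_context import CallbackContext
from google.adk.models.llm_response import LlmResponse


def warn_on_truncation(ctx: CallbackContext, resp: LlmResponse) -> Optional[LlmResponse]:
    # Only meaningful once finish_reason is actually populated for LiteLLM models.
    if resp.finish_reason == "length":
        # The model stopped because max_tokens was reached, not because it was
        # done answering; surface that instead of failing silently.
        print("warning: LLM response truncated by max_tokens")
    return resp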
Root Cause
In google/adk/models/lite_llm.py, the _model_response_to_generate_content_response() function (lines 473-499) extracts usage_metadata from the LiteLLM response but does not extract finish_reason:
def _model_response_to_generate_content_response(response):
  message = None
  if response.get("choices", None):
    message = response["choices"][0].get("message", None)
    # Missing: finish_reason = response["choices"][0].get("finish_reason", None)

  llm_response = _message_to_generate_content_response(message)
  # Missing: llm_response.finish_reason = finish_reason

  if response.get("usage", None):
    llm_response.usage_metadata = types.GenerateContentResponseUsageMetadata(...)
  return llm_response
The LiteLLM response contains response["choices"][0]["finish_reason"], but this value is never extracted or set on the LlmResponse object.
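For reference, the relevant part of a LiteLLM response follows the OpenAI chat-completion shape; the snippet below only illustrates that structure (the values are made up):
# Illustrative response shape; the wrapper reads "message" and "usage" today
# but ignores "finish_reason".
response = {
    "choices": [
        {
            "message": {"role": "assistant", "content": "Quantum computing uses..."},
            "finish_reason": "length",  # set by LiteLLM, never propagated
        }
    ],
    "usage": {"prompt_tokens": 12, "completion_tokens": 50, "total_tokens": 62},
}

print(response["choices"][0].get("finish_reason"))  # "length"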
Proposed Fix
Add four lines (marked # ADD below) to extract and set finish_reason:
def _model_response_to_generate_content_response(response):
  message = None
  finish_reason = None  # ADD
  if response.get("choices", None):
    message = response["choices"][0].get("message", None)
    finish_reason = response["choices"][0].get("finish_reason", None)  # ADD
  if not message:
    raise ValueError("No message in response")

  llm_response = _message_to_generate_content_response(message)
  if finish_reason:  # ADD
    llm_response.finish_reason = finish_reason  # ADD

  if response.get("usage", None):
    llm_response.usage_metadata = types.GenerateContentResponseUsageMetadata(...)
  return llm_response
A complete patch file is available here: litellm_finish_reason.patch
Desktop (please complete the following information):
- OS: macOS (also reproduced on Linux)
- Python version: 3.12.0
- ADK version: 1.11.0
Model Information:
- Are you using LiteLLM: Yes
- Which model is being used: gpt-3.5-turbo, gpt-4o (any LiteLLM-supported model)
Additional context
Impact: This bug prevents callbacks from:
- Detecting when responses are truncated due to max_tokens limits
- Implementing retry logic for incomplete responses
- Logging completion statistics
- Handling different completion conditions appropriately
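For example, retry logic could be layered on top of the inspector pattern from the reproduction above once finish_reason is available; the sketch below is hypothetical, and the run_once callable and token budgets are illustrative only:
async def run_with_retry(run_once, budgets=(50, 200, 800)):
    """Re-run a query with a larger max_tokens budget while it gets truncated.

    run_once(max_tokens) is expected to execute the agent and return the
    finish_reason captured by an after_model_callback.
    """
    finish_reason = None
    for max_tokens in budgets:
        finish_reason = await run_once(max_tokens)
        if finish_reason != "length":  # not truncated; accept this response
            break
    return finish_reason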
Validation: A standalone reproduction script with automated validation is available at litellm_finish_reason_bug.py. This script demonstrates both the bug and validates the proposed fix.
Note on Tracing: After this fix is applied, google/adk/telemetry/tracing.py:222 will need updating to handle both enum (Gemini) and string (LiteLLM) finish_reason values:
if llm_response.finish_reason:
  if hasattr(llm_response.finish_reason, 'value'):
    finish_reason_str = llm_response.finish_reason.value.lower()
  else:
    finish_reason_str = str(llm_response.finish_reason)
  span.set_attribute('gen_ai.response.finish_reasons', [finish_reason_str])
Without this change, tracing will raise AttributeError: 'str' object has no attribute 'value'.
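For illustration, the same normalization as a standalone helper (assuming, as the snippet above does, that the Gemini enum exposes its name via .value):
from google.genai import types


def normalize_finish_reason(finish_reason) -> str:
    # Accept both the Gemini FinishReason enum and the plain string LiteLLM returns.
    if hasattr(finish_reason, "value"):
        return finish_reason.value.lower()
    return str(finish_reason)


print(normalize_finish_reason(types.FinishReason.STOP))  # "stop"
print(normalize_finish_reason("length"))  # "length"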