Skip to content

Conversation

slekkala1
Copy link
Contributor

@slekkala1 slekkala1 commented Oct 10, 2025

What does this PR do?

Have closed the previous PR due to merge conflicts with multiple PRs
Addressed all comments from #3768 (sorry for carrying over to this one)

Test Plan

Added UTs and integration tests

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 10, 2025
@slekkala1 slekkala1 marked this pull request as ready for review October 10, 2025 22:25
# Shields parameter received via extra_body - not yet implemented
if shields is not None:
raise NotImplementedError("Shields parameter is not yet implemented in the meta-reference provider")
shield_ids = extract_shield_ids(shields) if shields else []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given that we would like to reuse the moderations API -- could we reconsider naming this parameter perhaps? I think people use guardrails much more (even in OpenAI's agents-sdk). I wonder if we should use that?

@slekkala1 slekkala1 force-pushed the new-responses-and-safety branch 2 times, most recently from 76f0478 to 76b991c Compare October 13, 2025 19:13
:param type: The type/identifier of the guardrail.
"""

type: str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just for my learning: what types are available?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure about this part @ashwinb, I usually only know identifier for a shield, as such only supporting that for now. May be this is to allow more fields in future.

include: list[str] | None = None,
max_infer_iters: int | None = 10,
shields: list | None = None,
guardrails: list | None = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you type the list more exactly?

if isinstance(guardrail, str):
guardrail_ids.append(guardrail)
elif isinstance(guardrail, ResponseGuardrailSpec):
guardrail_ids.append(guardrail.type)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems confusing: type being used as id. Is there a better way to name this?


violation_message = await run_multiple_guardrails(self.safety_api, text, self.guardrail_ids)
if violation_message:
logger.info(f"{context.capitalize()} guardrail violation: {violation_message}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: just add this log to run_multiple_guardrails and we don't need this extra wrapper _apply_guardrails function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just when I think I fixed all the redundancies by claude, something still slips...


# Input safety validation - check messages before processing
if self.guardrail_ids:
combined_text = interleaved_content_as_str([msg.content for msg in self.ctx.messages])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we document somewhere that guardrails only apply to text input?

Copy link
Contributor Author

@slekkala1 slekkala1 Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes the shield + moderation apis dont support the image, this is known tech debt, I filed an issue for that before.

) + tool_call.function.arguments

# Output Safety Validation for a chunk
if chat_response_content:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should just check for self.guardrail_ids first as it makes it clear what this block is for and prevents unnecessary work

accumulated_text = "".join(chat_response_content)
violation_message = await self._apply_guardrails(accumulated_text, "output")
if violation_message:
yield await self._create_refusal_response(violation_message)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the output chunks already yielded by this point?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ResponseTextDeltaEvent are streamed, the output chunk is not yet streamed by this point. I initially had this check within the delta in the above loop, but apparently that is too expensive, so moved to here for a chunk

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated as discussed to not emit delta events when guardrails are configured

@slekkala1 slekkala1 force-pushed the new-responses-and-safety branch 2 times, most recently from e72c6e4 to d462e96 Compare October 14, 2025 21:53
@slekkala1 slekkala1 force-pushed the new-responses-and-safety branch from d462e96 to a9ebdfe Compare October 15, 2025 13:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants