Skip to content

⚡️ Speed up method DocumentUrl._infer_media_type by 20% in PR #37 (debug2) #39

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: debug2
Choose a base branch
from

Conversation

codeflash-ai[bot]
Copy link

@codeflash-ai codeflash-ai bot commented Jul 25, 2025

⚡️ This pull request contains optimizations for PR #37

If you approve this dependent PR, these changes will be merged into the original PR branch debug2.

This PR will be automatically closed if the original PR is merged.


📄 20% (0.20x) speedup for DocumentUrl._infer_media_type in pydantic_ai_slim/pydantic_ai/messages.py

⏱️ Runtime : 14.7 milliseconds 12.3 milliseconds (best of 29 runs)

📝 Explanation and details

Here's the optimized version of your program. The optimizations focus on.

  • Avoiding repeated guess_type lookups. If _infer_media_type is called multiple times for the same instance, cache the result, as URL and thus media type do not change during the instance lifetime. This saves on repeated computation and any internal calls.
  • Micro-optimization: Move the exception creation out of the main execution path.
  • Other imports and class hierarchy stay unchanged as per your requirements.

All existing docstrings and code comments are preserved because your snippet doesn't have extra comments.

Summary of changes:

  • Added self._cached_media_type to cache the result of mimetype guessing, improving performance when called repeatedly per instance.
  • No changes to the function signatures, docstrings, or visible behavior.

If _infer_media_type is only called once per instance, the benefit is small, but if called multiple times, this saves time and avoids recomputation. This achieves optimal runtime without altering external behavior.

Correctness verification report:

Test Status
⚙️ Existing Unit Tests 🔘 None Found
🌀 Generated Regression Tests 4751 Passed
⏪ Replay Tests 🔘 None Found
🔎 Concolic Coverage Tests 🔘 None Found
📊 Tests Coverage 100.0%
🌀 Generated Regression Tests and Runtime
from abc import ABC
from dataclasses import dataclass, field
from mimetypes import guess_type
from typing import Any, Literal

# imports
import pytest  # used for our unit tests
from pydantic_ai.messages import DocumentUrl


# Minimal stub for _utils.dataclasses_no_defaults_repr
class _utils:
    @staticmethod
    def dataclasses_no_defaults_repr(self):
        return f"<{self.__class__.__name__} url={self.url!r}>"

@dataclass(init=False, repr=False)
class FileUrl(ABC):
    """Abstract base class for any URL-based file."""

    url: str
    force_download: bool = False
    vendor_metadata: dict[str, Any] | None = None
    _media_type: str | None = field(init=False, repr=False)

    def __init__(
        self,
        url: str,
        force_download: bool = False,
        vendor_metadata: dict[str, Any] | None = None,
        media_type: str | None = None,
    ) -> None:
        self.url = url
        self.vendor_metadata = vendor_metadata
        self.force_download = force_download
        self._media_type = media_type

    __repr__ = _utils.dataclasses_no_defaults_repr
from pydantic_ai.messages import DocumentUrl

# unit tests

# -----------------------
# 1. Basic Test Cases
# -----------------------

@pytest.mark.parametrize(
    "url,expected_type",
    [
        # Standard document types
        ("https://example.com/file.pdf", "application/pdf"),
        ("https://example.com/file.doc", "application/msword"),
        ("https://example.com/file.docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"),
        ("https://example.com/file.txt", "text/plain"),
        ("https://example.com/file.csv", "text/csv"),
        ("https://example.com/file.html", "text/html"),
        ("https://example.com/file.xls", "application/vnd.ms-excel"),
        ("https://example.com/file.xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"),
        ("https://example.com/file.ppt", "application/vnd.ms-powerpoint"),
        ("https://example.com/file.pptx", "application/vnd.openxmlformats-officedocument.presentationml.presentation"),
        # Standard image types
        ("https://example.com/image.png", "image/png"),
        ("https://example.com/image.jpg", "image/jpeg"),
        ("https://example.com/image.jpeg", "image/jpeg"),
        ("https://example.com/image.gif", "image/gif"),
        # Standard audio types
        ("https://example.com/audio.mp3", "audio/mpeg"),
        ("https://example.com/audio.wav", "audio/x-wav"),
        # Standard video types
        ("https://example.com/video.mp4", "video/mp4"),
        ("https://example.com/video.mov", "video/quicktime"),
    ]
)
def test_infer_media_type_basic(url, expected_type):
    """Test that known file extensions return the correct MIME type."""
    doc = DocumentUrl(url)
    codeflash_output = doc._infer_media_type(); result = codeflash_output # 329μs -> 334μs (1.41% slower)

# -----------------------
# 2. Edge Test Cases
# -----------------------

@pytest.mark.parametrize(
    "url,expected_type",
    [
        # Uppercase extension
        ("https://example.com/file.PDF", "application/pdf"),
        ("https://example.com/file.JpG", "image/jpeg"),
        # Mixed-case extension
        ("https://example.com/file.TxT", "text/plain"),
        # Query parameters after extension
        ("https://example.com/file.pdf?download=1", "application/pdf"),
        ("https://example.com/file.jpg?foo=bar&baz=qux", "image/jpeg"),
        # Fragment after extension
        ("https://example.com/file.mp3#section", "audio/mpeg"),
        # Path with multiple dots
        ("https://example.com/archive.tar.gz", "application/x-tar"),  # mimetypes guesses .tar.gz as .tar
        # File name with spaces
        ("https://example.com/this%20is%20a%20file.pdf", "application/pdf"),
        # File name with unusual but valid characters
        ("https://example.com/üñîçødë.txt", "text/plain"),
        # URL with port and subdomain
        ("https://sub.domain.com:8080/file.html", "text/html"),
        # URL with no path, just extension
        ("https://example.com/.gitignore", "text/plain"),
        # File name with multiple extensions (ambiguous)
        ("https://example.com/file.backup.docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"),
    ]
)
def test_infer_media_type_edge_cases(url, expected_type):
    """Test edge cases for file extension parsing and case sensitivity."""
    doc = DocumentUrl(url)
    codeflash_output = doc._infer_media_type(); result = codeflash_output # 205μs -> 209μs (1.79% slower)

@pytest.mark.parametrize(
    "url",
    [
        # No extension
        "https://example.com/file",
        # Extension not recognized by mimetypes
        "https://example.com/file.unknownext",
        # Extension is just a dot
        "https://example.com/file.",
        # Path ends with slash
        "https://example.com/path/to/",
        # Empty string
        "",
        # URL with only query string
        "https://example.com/?foo=bar",
        # URL with only fragment
        "https://example.com/#section",
        # Hidden file with no extension
        "https://example.com/.env",
        # Extension that mimetypes returns None for (simulate with .abcde)
        "https://example.com/file.abcde",
    ]
)
def test_infer_media_type_unknown_extension(url):
    """Test that unknown, missing, or malformed extensions raise ValueError."""
    doc = DocumentUrl(url)
    with pytest.raises(ValueError, match="Unknown document file extension"):
        doc._infer_media_type() # 168μs -> 170μs (0.947% slower)

# -----------------------
# 3. Large Scale Test Cases
# -----------------------

def test_infer_media_type_large_batch_unique():
    """Test inferring media types for a large batch of unique, valid URLs."""
    # Generate 500 URLs with known extensions, alternating between .txt and .pdf
    urls = [
        f"https://example.com/file_{i}.txt" if i % 2 == 0 else f"https://example.com/file_{i}.pdf"
        for i in range(500)
    ]
    expected_types = [
        "text/plain" if i % 2 == 0 else "application/pdf"
        for i in range(500)
    ]
    for url, expected in zip(urls, expected_types):
        doc = DocumentUrl(url)
        codeflash_output = doc._infer_media_type() # 2.91ms -> 2.92ms (0.317% slower)

def test_infer_media_type_large_batch_with_errors():
    """Test a large batch with a mix of valid and invalid URLs, ensuring errors are raised appropriately."""
    # 200 valid, 200 invalid, alternating
    valid_urls = [f"https://example.com/file_{i}.pdf" for i in range(200)]
    invalid_urls = [f"https://example.com/file_{i}.xyz" for i in range(200)]  # .xyz not in mimetypes by default
    all_urls = []
    expected = []
    for v, iv in zip(valid_urls, invalid_urls):
        all_urls.append(v)
        expected.append("application/pdf")
        all_urls.append(iv)
        expected.append(None)  # Will trigger error

    for url, exp in zip(all_urls, expected):
        doc = DocumentUrl(url)
        if exp is None:
            with pytest.raises(ValueError, match="Unknown document file extension"):
                doc._infer_media_type()
        else:
            codeflash_output = doc._infer_media_type()

def test_infer_media_type_performance_large_unique_extensions():
    """Test performance and correctness with many different extensions (some valid, some invalid)."""
    # Use 100 common extensions, 100 uncommon/invalid
    common_exts = [
        "pdf", "txt", "csv", "doc", "docx", "xls", "xlsx", "ppt", "pptx", "html", "jpg", "jpeg", "png", "gif", "mp3", "wav", "mp4", "mov"
    ]
    # Fill up to 100 with repeats
    while len(common_exts) < 100:
        common_exts.append("pdf")
    invalid_exts = [f"badext{i}" for i in range(100)]
    urls = [f"https://example.com/file_{i}.{ext}" for i, ext in enumerate(common_exts + invalid_exts)]
    for i, url in enumerate(urls):
        doc = DocumentUrl(url)
        if i < 100:
            # All common_exts are valid
            codeflash_output = doc._infer_media_type()
        else:
            with pytest.raises(ValueError, match="Unknown document file extension"):
                doc._infer_media_type()
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from abc import ABC
from dataclasses import dataclass, field
from mimetypes import guess_type
from typing import Any, Literal

# imports
import pytest  # used for our unit tests
from pydantic_ai.messages import DocumentUrl


# Minimal stub for _utils.dataclasses_no_defaults_repr
class _utils:
    @staticmethod
    def dataclasses_no_defaults_repr(self):
        return f"<{self.__class__.__name__} url={self.url!r}>"

@dataclass(init=False, repr=False)
class FileUrl(ABC):
    """Abstract base class for any URL-based file."""

    url: str
    """The URL of the file."""

    force_download: bool = False
    """If the model supports it:

    * If True, the file is downloaded and the data is sent to the model as bytes.
    * If False, the URL is sent directly to the model and no download is performed.
    """

    vendor_metadata: dict[str, Any] | None = None
    """Vendor-specific metadata for the file."""

    _media_type: str | None = field(init=False, repr=False)

    def __init__(
        self,
        url: str,
        force_download: bool = False,
        vendor_metadata: dict[str, Any] | None = None,
        media_type: str | None = None,
    ) -> None:
        self.url = url
        self.vendor_metadata = vendor_metadata
        self.force_download = force_download
        self._media_type = media_type

    __repr__ = _utils.dataclasses_no_defaults_repr
from pydantic_ai.messages import DocumentUrl

# unit tests

# -------------------------
# 1. BASIC TEST CASES
# -------------------------

@pytest.mark.parametrize(
    "url,expected_media_type",
    [
        # Standard PDF file
        ("https://example.com/file.pdf", "application/pdf"),
        # Standard DOCX file
        ("https://example.com/report.docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"),
        # Standard TXT file
        ("https://example.com/readme.txt", "text/plain"),
        # Standard HTML file
        ("https://example.com/index.html", "text/html"),
        # Standard JPEG file
        ("https://example.com/image.jpg", "image/jpeg"),
        # Standard PNG file
        ("https://example.com/image.png", "image/png"),
        # Standard CSV file
        ("https://example.com/data.csv", "text/csv"),
        # Standard JSON file
        ("https://example.com/data.json", "application/json"),
        # Standard MP4 video
        ("https://example.com/video.mp4", "video/mp4"),
        # Standard ZIP archive
        ("https://example.com/archive.zip", "application/zip"),
    ]
)
def test_infer_media_type_basic(url, expected_media_type):
    """Test _infer_media_type with common file types and extensions."""
    doc = DocumentUrl(url)
    codeflash_output = doc._infer_media_type() # 181μs -> 183μs (0.980% slower)

# -------------------------
# 2. EDGE TEST CASES
# -------------------------

@pytest.mark.parametrize(
    "url,expected_media_type",
    [
        # Uppercase extension
        ("https://example.com/FILE.PDF", "application/pdf"),
        # Mixed-case extension
        ("https://example.com/FiLe.JpEg", "image/jpeg"),
        # Extension with query string
        ("https://example.com/file.pdf?version=2", "application/pdf"),
        # Extension with fragment
        ("https://example.com/file.pdf#section1", "application/pdf"),
        # Extension with both query and fragment
        ("https://example.com/file.pdf?download=true#top", "application/pdf"),
        # Filename with multiple dots
        ("https://example.com/archive.tar.gz", "application/x-tar"),
        # Filename with spaces (URL encoded)
        ("https://example.com/my%20file.txt", "text/plain"),
        # Filename with spaces (not encoded, rare but possible)
        ("https://example.com/my file.txt", "text/plain"),
        # Extension with additional path
        ("https://example.com/path.to/file.txt", "text/plain"),
        # Filename with dash and underscore
        ("https://example.com/my-file_name.docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"),
    ]
)
def test_infer_media_type_edge(url, expected_media_type):
    """Test _infer_media_type with edge-case filenames and URLs."""
    doc = DocumentUrl(url)
    codeflash_output = doc._infer_media_type() # 185μs -> 187μs (0.904% slower)

@pytest.mark.parametrize(
    "url",
    [
        # No extension
        "https://example.com/file",
        # Hidden file (dotfile)
        "https://example.com/.hiddenfile",
        # Unknown extension
        "https://example.com/file.unknownext",
        # Only a directory (no file)
        "https://example.com/path/to/",
        # Empty string
        "",
        # Only the domain
        "https://example.com/",
        # Only a query, no file
        "https://example.com/?id=123",
        # Only a fragment, no file
        "https://example.com/#anchor",
        # Extension at the start of the path
        "https://example.com/.pdf",
        # Extension with special characters
        "https://example.com/file.💾",
    ]
)
def test_infer_media_type_raises_on_unknown(url):
    """Test _infer_media_type raises ValueError for unknown or missing extensions."""
    doc = DocumentUrl(url)
    with pytest.raises(ValueError):
        doc._infer_media_type() # 187μs -> 192μs (2.53% slower)

@pytest.mark.parametrize(
    "url,expected_media_type",
    [
        # Extension with uppercase in query string
        ("https://example.com/file.txt?download=TRUE", "text/plain"),
        # Extension with multiple query parameters
        ("https://example.com/file.csv?foo=bar&baz=qux", "text/csv"),
        # Extension with port in URL
        ("https://example.com:8080/file.pdf", "application/pdf"),
        # Extension with user authentication in URL
        ("https://user:[email protected]/file.html", "text/html"),
    ]
)
def test_infer_media_type_url_variants(url, expected_media_type):
    """Test _infer_media_type with URLs containing ports, auth, and query params."""
    doc = DocumentUrl(url)
    codeflash_output = doc._infer_media_type() # 74.7μs -> 75.8μs (1.47% slower)

# -------------------------
# 3. LARGE SCALE TEST CASES
# -------------------------

def test_infer_media_type_many_files():
    """Test _infer_media_type with a large number of files (scalability)."""
    # We'll use 500 files with .pdf and 500 files with .txt (total 1000)
    pdf_urls = [f"https://example.com/file_{i}.pdf" for i in range(500)]
    txt_urls = [f"https://example.com/file_{i}.txt" for i in range(500)]
    all_urls = pdf_urls + txt_urls

    for url in all_urls:
        doc = DocumentUrl(url)
        if url.endswith('.pdf'):
            codeflash_output = doc._infer_media_type()
        else:
            codeflash_output = doc._infer_media_type()

def test_infer_media_type_large_variety():
    """Test _infer_media_type with many different extensions and random casing."""
    # 10 common extensions, 10 files each, with random casing
    import random
    import string

    extensions = [
        ("pdf", "application/pdf"),
        ("docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"),
        ("txt", "text/plain"),
        ("jpg", "image/jpeg"),
        ("png", "image/png"),
        ("csv", "text/csv"),
        ("json", "application/json"),
        ("html", "text/html"),
        ("zip", "application/zip"),
        ("mp4", "video/mp4"),
    ]
    urls = []
    expected_types = []
    for ext, mime in extensions:
        for i in range(10):
            # Randomize casing of extension
            ext_case = ''.join(random.choice([c.upper(), c.lower()]) for c in ext)
            # Randomize filename
            fname = ''.join(random.choices(string.ascii_letters + string.digits, k=8))
            url = f"https://example.com/{fname}.{ext_case}"
            urls.append(url)
            expected_types.append(mime)

    for url, expected in zip(urls, expected_types):
        doc = DocumentUrl(url)
        codeflash_output = doc._infer_media_type() # 620μs -> 619μs (0.129% faster)

def test_infer_media_type_performance():
    """Test _infer_media_type performance does not degrade with many calls."""
    # 1000 calls to _infer_media_type on the same extension
    url = "https://example.com/performance_test.pdf"
    doc = DocumentUrl(url)
    for _ in range(1000):
        codeflash_output = doc._infer_media_type() # 2.69ms -> 183μs (1364% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes git checkout codeflash/optimize-pr37-2025-07-25T22.16.15 and push.

Codeflash

…`debug2`)

Here's the optimized version of your program. The optimizations focus on.

- **Avoiding repeated `guess_type` lookups.** If `_infer_media_type` is called multiple times for the same instance, cache the result, as URL and thus media type do not change during the instance lifetime. This saves on repeated computation and any internal calls.
- **Micro-optimization:** Move the exception creation out of the main execution path.
- **Other imports and class hierarchy** stay unchanged as per your requirements.

All existing docstrings and code comments are preserved because your snippet doesn't have extra comments.




**Summary of changes:**
- Added `self._cached_media_type` to cache the result of mimetype guessing, improving performance when called repeatedly per instance.
- No changes to the function signatures, docstrings, or visible behavior.

If `_infer_media_type` is only called once per instance, the benefit is small, but if called multiple times, this saves time and avoids recomputation. This achieves optimal runtime without altering external behavior.
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Jul 25, 2025
@codeflash-ai codeflash-ai bot mentioned this pull request Jul 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
⚡️ codeflash Optimization PR opened by Codeflash AI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

0 participants