@codeflash-ai codeflash-ai bot commented Nov 7, 2025

📄 44% (0.44x) speedup for _simple_json_normalize in pandas/io/json/_normalize.py

⏱️ Runtime : 3.18 milliseconds → 2.21 milliseconds (best of 218 runs)
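
For reference, the headline percentage follows from the two runtimes above. The snippet below is a minimal sketch that assumes speedup is defined as old runtime divided by new runtime, minus one. The helper name `speedup_percent` is hypothetical and is not part of Codeflash or pandas.

```python
def speedup_percent(old_runtime: float, new_runtime: float) -> float:
    """Relative speedup in percent (assumed definition: old / new - 1)."""
    return (old_runtime / new_runtime - 1.0) * 100.0


# Runtimes reported above, in milliseconds.
headline = speedup_percent(3.18, 2.21)
print(f"{headline:.1f}% faster")  # prints "43.9% faster"
```

Rounded to a whole number this gives the 44% quoted in the title. The per-test percentages in the generated tests may differ slightly from this formula because the displayed timings are themselves rounded.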

📝 Explanation and details

The optimization achieves a 44% speedup (3.18 ms to 2.21 ms) by eliminating redundant dictionary operations and improving memory allocation patterns in _normalise_json_ordered.

Key optimizations applied:

  1. Single-pass data partitioning: Instead of iterating through data.items() twice with dict comprehensions to separate flat vs nested values, the optimized version uses a single for loop to partition data into top_dict_ and nested_dict_input. This reduces the number of isinstance() calls and dictionary iterations.

  2. In-place dictionary updates: Rather than creating a new dictionary with {**top_dict_, **nested_dict_} (which allocates a new dict and copies all key-value pairs), the optimization uses top_dict_.update(nested_dict_) to merge results in-place, avoiding the allocation overhead.

  3. Conditional processing: The optimization only calls _normalise_json when nested_dict_input is non-empty, avoiding unnecessary function calls for dictionaries with no nested structure.

  4. Simplified return logic: In _simple_json_normalize, the intermediate normalised_json_object variable is removed and the result is returned directly, reducing variable assignments.
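
The first three points can be sketched as follows. This is an illustration reconstructed from the description above, not the actual diff: `_flatten` is a hypothetical stand-in for the recursive `_normalise_json` helper, and `normalise_two_pass` / `normalise_single_pass` are assumed names for the before and after shapes of `_normalise_json_ordered`.

```python
from typing import Any


def _flatten(data: Any, key_string: str, out: dict, sep: str) -> dict:
    """Hypothetical stand-in for the recursive helper: flattens nested dicts."""
    if isinstance(data, dict):
        for key, value in data.items():
            new_key = f"{key_string}{sep}{key}" if key_string else str(key)
            _flatten(value, new_key, out, sep)
    else:
        out[key_string] = data
    return out


def normalise_two_pass(data: dict, sep: str = ".") -> dict:
    """Before (assumed shape): two comprehensions, then a merged copy."""
    top = {k: v for k, v in data.items() if not isinstance(v, dict)}
    nested_input = {k: v for k, v in data.items() if isinstance(v, dict)}
    nested = _flatten(nested_input, "", {}, sep)
    return {**top, **nested}  # allocates a third dict and copies every pair


def normalise_single_pass(data: dict, sep: str = ".") -> dict:
    """After (assumed shape): one loop, conditional recursion, in-place merge."""
    top: dict = {}
    nested_input: dict = {}
    for k, v in data.items():  # one walk, one isinstance() call per item
        if isinstance(v, dict):
            nested_input[k] = v
        else:
            top[k] = v
    if nested_input:  # flat records skip the recursive helper entirely
        top.update(_flatten(nested_input, "", {}, sep))  # merge in place
    return top


record = {"flat1": 1, "dict1": {"c": 1, "d": 2}}
print(normalise_single_pass(record))  # {'flat1': 1, 'dict1.c': 1, 'dict1.d': 2}
```

Both variants return flat values first and flattened nested values second, so the key order is unchanged. The difference lies only in how many passes and allocations are needed to get there.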

Performance impact by test case type:

  • Flat dictionaries see the largest gains (52-94% faster) because they skip nested processing entirely
  • Large lists of dictionaries benefit significantly (57-128% faster) from reduced per-item overhead
  • Complex nested structures show smaller improvements (mostly 6-24% faster, and only about 1% for the large nested dictionaries) because the recursive _normalise_json calls remain the bottleneck

The optimizations are particularly effective for the common JSON normalization use case of processing many flat or lightly nested records, which aligns with typical data processing workflows where this function would be called repeatedly in hot paths.

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 56 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime
from typing import Any

# imports
import pytest  # used for our unit tests
from pandas.io.json._normalize import _simple_json_normalize

# unit tests

# ------------------ Basic Test Cases ------------------

def test_flat_dict():
    # A simple flat dict should remain unchanged
    data = {"a": 1, "b": 2}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 3.15μs -> 1.71μs (83.8% faster)

def test_single_level_nesting():
    # Dict with one nested dict
    data = {"a": 1, "b": {"c": 2, "d": 3}}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 4.85μs -> 4.24μs (14.4% faster)

def test_multi_level_nesting():
    # Dict with multiple levels of nesting
    data = {"a": {"b": {"c": 1}}}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 4.25μs -> 3.79μs (12.3% faster)

def test_list_of_dicts_basic():
    # List of flat dicts
    data = [{"a": 1}, {"b": 2}]
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 4.55μs -> 2.87μs (58.4% faster)

def test_list_of_nested_dicts():
    # List of nested dicts
    data = [{"a": {"b": 1}}, {"c": {"d": 2}}]
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 6.59μs -> 5.92μs (11.4% faster)

def test_separator_argument():
    # Custom separator
    data = {"a": {"b": {"c": 1}}}
    codeflash_output = _simple_json_normalize(data, sep="_"); result = codeflash_output # 4.34μs -> 3.96μs (9.81% faster)

def test_example_from_docstring():
    # The example in the docstring
    data = {
        "flat1": 1,
        "dict1": {"c": 1, "d": 2},
        "nested": {"e": {"c": 1, "d": 2}, "d": 2},
    }
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 6.63μs -> 6.00μs (10.5% faster)
    expected = {
        "flat1": 1,
        "dict1.c": 1,
        "dict1.d": 2,
        "nested.e.c": 1,
        "nested.e.d": 2,
        "nested.d": 2,
    }
    assert result == expected

# ------------------ Edge Test Cases ------------------

def test_empty_dict():
    # Should return empty dict
    data = {}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 2.31μs -> 1.21μs (91.6% faster)

def test_empty_list():
    # Should return empty list
    data = []
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 1.15μs -> 1.05μs (9.56% faster)

def test_dict_with_empty_dict_value():
    # Should not flatten empty dict values
    data = {"a": {}}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 3.55μs -> 3.08μs (15.0% faster)

def test_dict_with_none_value():
    # None values should be preserved
    data = {"a": None, "b": {"c": None}}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 4.33μs -> 3.66μs (18.4% faster)

def test_dict_with_list_value():
    # Lists as values are not recursively normalized
    data = {"a": [1, 2, 3], "b": {"c": [4, 5]}}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 4.05μs -> 3.47μs (16.7% faster)

def test_list_of_empty_dicts():
    # List containing empty dicts
    data = [{} for _ in range(3)]
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 4.81μs -> 2.99μs (61.0% faster)

def test_non_dict_non_list_input():
    # Should return empty dict for non-dict, non-list input
    data = 42
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 669ns -> 606ns (10.4% faster)

def test_dict_with_mixed_types():
    # Dict with mixed value types
    data = {"a": 1, "b": "str", "c": {"d": True, "e": None}}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 4.99μs -> 4.23μs (18.1% faster)

def test_dict_with_bool_and_float_keys():
    # Dict with bool and float values
    data = {"a": True, "b": 1.234, "c": {"d": False, "e": 0.0}}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 4.72μs -> 4.01μs (17.8% faster)

def test_dict_with_key_containing_separator():
    # Keys containing the separator
    data = {"a.b": {"c.d": 1}}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 3.88μs -> 3.38μs (14.7% faster)

def test_dict_with_unicode_keys_and_values():
    # Unicode keys and values
    data = {"ключ": {"значение": "тест"}}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 4.81μs -> 4.41μs (9.21% faster)

def test_dict_with_empty_string_keys():
    # Empty string as key
    data = {"": {"": 1}}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 3.59μs -> 3.14μs (14.3% faster)

def test_list_of_mixed_dicts():
    # List of dicts with different structures
    data = [{"a": 1}, {"b": {"c": 2}}, {"d": 3, "e": {"f": 4}}]
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 8.19μs -> 6.75μs (21.2% faster)
    expected = [{"a": 1}, {"b.c": 2}, {"d": 3, "e.f": 4}]
    assert result == expected

def test_dict_with_deeply_nested_empty_dict():
    # Deeply nested empty dict
    data = {"a": {"b": {"c": {}}}}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 4.05μs -> 3.60μs (12.6% faster)

def test_dict_with_list_of_dicts_value():
    # Should not flatten inside a list value
    data = {"a": [{"b": 1}, {"c": 2}]}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 2.57μs -> 1.32μs (94.3% faster)

def test_list_with_non_dict_elements():
    # List containing non-dict elements
    data = [{"a": 1}, 2, "three"]
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 4.21μs -> 3.05μs (38.4% faster)

# ------------------ Large Scale Test Cases ------------------

def test_large_flat_dict():
    # Large flat dict, 1000 keys
    data = {f"key_{i}": i for i in range(1000)}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 99.5μs -> 65.0μs (52.9% faster)

def test_large_nested_dict():
    # Large nested dict, nesting depth = 3, 10 keys per level
    data = {f"a{i}": {f"b{j}": {f"c{k}": k for k in range(10)} for j in range(10)} for i in range(10)}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 197μs -> 196μs (0.769% faster)

def test_large_list_of_dicts():
    # List of 500 dicts, each with 2 keys
    data = [{"a": i, "b": i * 2} for i in range(500)]
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 350μs -> 157μs (123% faster)
    for i in range(500):
        # Flat records should pass through unchanged
        assert result[i] == {"a": i, "b": i * 2}

def test_large_list_of_nested_dicts():
    # List of 100 dicts, each with nested structure
    data = [{"a": {"b": i}} for i in range(100)]
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 103μs -> 86.4μs (20.3% faster)
    for i in range(100):
        # Each nested record flattens to a single dotted key
        assert result[i] == {"a.b": i}

def test_large_mixed_structure():
    # Large dict with mixed values (flat, nested, list)
    data = {f"flat_{i}": i for i in range(100)}
    data.update({f"nest_{i}": {"x": i, "y": {"z": i}} for i in range(100)})
    data["list"] = [{"a": 1}, {"b": 2}]
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 104μs -> 97.6μs (6.96% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from typing import Any

# imports
import pytest
from pandas.io.json._normalize import _simple_json_normalize

# unit tests

# --- BASIC TEST CASES ---

def test_flat_dict():
    # Test a completely flat dictionary (no nesting)
    data = {'a': 1, 'b': 2, 'c': 3}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 2.86μs -> 1.61μs (77.0% faster)

def test_single_level_nesting():
    # Test a dictionary with a single nested dictionary
    data = {'a': 1, 'b': {'c': 2, 'd': 3}}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 4.19μs -> 3.78μs (10.8% faster)

def test_multiple_level_nesting():
    # Test a dictionary with multiple levels of nesting
    data = {'a': 1, 'b': {'c': {'d': 4}}}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 4.27μs -> 3.71μs (15.0% faster)

def test_list_of_dicts_flat():
    # Test a list of flat dictionaries
    data = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 4.93μs -> 3.13μs (57.7% faster)

def test_list_of_dicts_nested():
    # Test a list of nested dictionaries
    data = [{'a': 1, 'b': {'c': 2}}, {'a': 3, 'b': {'c': 4}}]
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 7.09μs -> 6.14μs (15.4% faster)

def test_custom_separator():
    # Test using a custom separator
    data = {'a': {'b': {'c': 1}}}
    codeflash_output = _simple_json_normalize(data, sep='_'); result = codeflash_output # 4.24μs -> 3.98μs (6.66% faster)

def test_mixed_types():
    # Test dictionary with mixed value types
    data = {'a': 1, 'b': 'str', 'c': {'d': 2.5, 'e': None}}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 4.47μs -> 3.91μs (14.4% faster)

# --- EDGE TEST CASES ---

def test_empty_dict():
    # Test an empty dictionary
    data = {}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 2.26μs -> 1.19μs (90.4% faster)

def test_empty_list():
    # Test an empty list
    data = []
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 1.07μs -> 1.04μs (3.18% faster)

def test_dict_with_empty_dict():
    # Test dictionary with an empty nested dictionary
    data = {'a': {}, 'b': 1}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 3.85μs -> 3.15μs (22.0% faster)

def test_dict_with_empty_list():
    # Test dictionary with an empty list as value
    data = {'a': [], 'b': 1}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 2.83μs -> 1.54μs (83.8% faster)

def test_dict_with_none():
    # Test dictionary with None as value
    data = {'a': None, 'b': {'c': None}}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 4.20μs -> 3.63μs (15.5% faster)

def test_list_of_empty_dicts():
    # Test a list containing empty dictionaries
    data = [{}, {}]
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 4.07μs -> 2.58μs (58.0% faster)

def test_deeply_nested_dict():
    # Test a deeply nested dictionary (depth 10)
    data = current = {}
    for i in range(10, 0, -1):
        current = {f'k{i}': current}
    # Walk down to the innermost dict (k1 .. k10) and add a leaf value
    leaf = current
    for i in range(1, 11):
        leaf = leaf[f'k{i}']
    leaf['leaf'] = 'value'
    codeflash_output = _simple_json_normalize(current); result = codeflash_output # 6.10μs -> 5.90μs (3.39% faster)
    # Build expected key
    expected_key = '.'.join([f'k{i}' for i in range(1, 11)]) + '.leaf'
    assert result == {expected_key: 'value'}

def test_dict_with_list_value():
    # Test dictionary with a list as a value
    data = {'a': [1, 2, 3], 'b': {'c': [4, 5]}}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 4.11μs -> 3.53μs (16.5% faster)

def test_dict_with_tuple_value():
    # Test dictionary with a tuple as a value
    data = {'a': (1, 2), 'b': {'c': (3, 4)}}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 4.06μs -> 3.60μs (12.6% faster)

def test_dict_with_bool_and_float():
    # Test dictionary with bool and float types
    data = {'a': True, 'b': {'c': False, 'd': 3.14}}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 4.53μs -> 3.93μs (15.3% faster)

def test_dict_with_duplicate_keys():
    # Test dictionary with duplicate keys at different nesting levels
    data = {'a': 1, 'b': {'a': 2}}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 3.96μs -> 3.41μs (16.0% faster)

def test_list_of_mixed_dicts():
    # Test a list of dicts with different keys
    data = [{'a': 1}, {'b': 2}, {'a': 3, 'b': 4}]
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 6.01μs -> 3.43μs (75.3% faster)

def test_separator_edge_case():
    # Test separator that is a special character
    data = {'a': {'b': 1}}
    codeflash_output = _simple_json_normalize(data, sep='|'); result = codeflash_output # 4.04μs -> 3.71μs (8.93% faster)

def test_dict_with_unicode_keys_and_values():
    # Test dictionary with unicode keys and values
    data = {'ключ': {'значение': 'тест'}, 'emoji': {'😊': '👍'}}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 6.09μs -> 5.75μs (6.01% faster)

def test_dict_with_numeric_keys():
    # Test dictionary with integer keys
    data = {1: {'2': 3}}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 3.84μs -> 3.45μs (11.3% faster)

# --- LARGE SCALE TEST CASES ---

def test_large_flat_dict():
    # Test a large flat dictionary (1000 items)
    data = {f'key{i}': i for i in range(1000)}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 99.9μs -> 65.7μs (52.1% faster)

def test_large_nested_dict():
    # Test a large nested dictionary (10 top-level keys, each with 10 nested keys)
    data = {f'outer{i}': {f'inner{j}': i * 10 + j for j in range(10)} for i in range(10)}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 26.9μs -> 26.6μs (1.02% faster)
    # All keys should be flattened as 'outer{i}.inner{j}'
    for i in range(10):
        for j in range(10):
            assert result[f'outer{i}.inner{j}'] == i * 10 + j

def test_large_list_of_dicts():
    # Test a large list of flat dictionaries (1000 items)
    data = [{'a': i, 'b': i * 2} for i in range(1000)]
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 707μs -> 309μs (128% faster)
    for i in range(1000):
        # Flat records should pass through unchanged
        assert result[i] == {'a': i, 'b': i * 2}

def test_large_list_of_nested_dicts():
    # Test a large list of nested dictionaries (1000 items)
    data = [{'a': i, 'b': {'c': i * 2}} for i in range(1000)]
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 1.25ms -> 998μs (24.7% faster)
    for i in range(1000):
        # Only the nested 'b' dict is flattened
        assert result[i] == {'a': i, 'b.c': i * 2}

def test_large_deeply_nested_dict():
    # Test a dictionary with 100 nested levels (depth 100)
    data = current = {}
    for i in range(100, 0, -1):
        current = {f'k{i}': current}
    # Walk down to the innermost dict (k1 .. k100) and add a leaf value
    leaf = current
    for i in range(1, 101):
        leaf = leaf[f'k{i}']
    leaf['leaf'] = 'value'
    codeflash_output = _simple_json_normalize(current); result = codeflash_output # 27.3μs -> 26.6μs (2.35% faster)
    expected_key = '.'.join([f'k{i}' for i in range(1, 101)]) + '.leaf'
    assert result == {expected_key: 'value'}

def test_large_dict_with_various_types():
    # Test a large dict with mixed types
    data = {f'key{i}': i if i % 3 == 0 else [i, i+1] if i % 3 == 1 else {'nested': i} for i in range(100)}
    codeflash_output = _simple_json_normalize(data); result = codeflash_output # 29.4μs -> 25.8μs (14.0% faster)
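    for i in range(100):
        if i % 3 == 0:
            # Scalar values are kept under their original key
            assert result[f'key{i}'] == i
        elif i % 3 == 1:
            # List values are kept as-is, not flattened
            assert result[f'key{i}'] == [i, i + 1]
        else:
            # Nested dicts are flattened with the default '.' separator
            assert result[f'key{i}.nested'] == i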
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, git checkout codeflash/optimize-_simple_json_normalize-mhopi3bl and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 7, 2025 10:22
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 7, 2025