Why is PyTensor with 'NUMBA' mode compilation much slower than a numba-jitted calculation? #1545
-
I see a factor-of-10 slowdown between PyTensor compiled in NUMBA mode and a directly numba-jitted calculation (numba-stats). I would like to ask if I'm somehow not properly compiling the pytensor graph or doing something wildly suboptimal. I've read through the documentation and I think what I'm doing is correct, but I'd welcome input from people with more experience. Here are the results of the timing, using the following script:

```python
#!/usr/bin/env python3
"""
Timing comparison script for pytensor/mode Gaussian implementations vs numba-stats.
Compares PyTensor Gaussian PDF evaluation performance across different modes:
- FAST_RUN (default optimized mode)
- JAX (if available)
- NUMBA (if available)
Also compares against numba-stats norm.pdf as baseline.
"""
import time
import timeit
import numpy as np
from contextlib import contextmanager
from pytensor.compile.function import function
import pytensor.tensor as pt
import math
# Import numba-stats for comparison
try:
    from numba_stats import norm as numba_norm

    HAS_NUMBA_STATS = True
except ImportError:
    HAS_NUMBA_STATS = False
    print("Warning: numba-stats not available, skipping numba-stats comparison")
# Test parameters
x = np.linspace(-10, 10, 1000)
mu = 2.0
sigma = 3.0
N_EVALUATIONS = 100000
@contextmanager
def time_block(label):
    """Context manager for timing code blocks."""
    start = time.perf_counter()
    yield
    end = time.perf_counter()
    print(f"{label}: {end - start:.4f} seconds")
def setup_pytensor_graph():
    """Create pytensor Gaussian distribution."""
    x = pt.vector("x")
    mean = pt.vector("mean")
    sigma = pt.vector("sigma")
    norm_const = 1.0 / (pt.sqrt(2 * math.pi) * sigma)
    exponent = pt.exp(-0.5 * ((x - mean) / sigma) ** 2)
    return norm_const * exponent, [x, mean, sigma]
def time_pytensor_mode(dist, inputs, mode, x_vals, mu_val, sigma_val, n_evals):
    """Time one PyTensor mode; return (None, None) if it is unavailable."""
    try:
        func = function(
            inputs=inputs,
            outputs=dist,
            mode=mode,
            on_unused_input="ignore",
        )
    except Exception:
        # Backend not installed (e.g. jax/numba missing) or compilation failed
        return None, None
    params = {"mean": mu_val, "sigma": sigma_val, "x": x_vals}

    def evaluate_pdf():
        return func(**params)

    # Running evaluate_pdf once as setup warms up the compiled function
    total_time = timeit.timeit(evaluate_pdf, setup=evaluate_pdf, number=n_evals)
    time_per_eval = total_time / n_evals
    return total_time, time_per_eval
def time_numba_stats(x_vals, mu_val, sigma_val, n_evals):
    """Time numba-stats evaluation."""
    if not HAS_NUMBA_STATS:
        return None, None

    # Setup function for timeit (includes JIT warm-up)
    def setup():
        print("  numba-stats warming up JIT compilation...")
        numba_norm.pdf(x_vals, mu_val, sigma_val)

    # Timing function for timeit
    def evaluate_pdf():
        return numba_norm.pdf(x_vals, mu_val, sigma_val)

    # Time multiple evaluations with proper setup
    total_time = timeit.timeit(evaluate_pdf, setup=setup, number=n_evals)
    time_per_eval = total_time / n_evals
    return total_time, time_per_eval
def main():
    """Main timing comparison."""
    print("=" * 60)
    print("Gaussian PDF Timing Comparison")
    print("=" * 60)
    print("Test parameters:")
    print(f"  x: {len(x)} points from {x[0]:.1f} to {x[-1]:.1f}")
    print(f"  mu: {mu}")
    print(f"  sigma: {sigma}")
    print(f"  Evaluations: {N_EVALUATIONS:,}")
    print()
    print("Setting up pytensor graph...")
    graph, inputs = setup_pytensor_graph()
    print("✅ pytensor graph created")
    print()
    # Results storage
    results = {}
    # Test pytensor modes
    pytensor_modes = ["FAST_RUN", "JAX", "NUMBA"]
    for mode in pytensor_modes:
        print(f"Testing PyTensor {mode} mode:")
        total_time, time_per_eval = time_pytensor_mode(
            graph, inputs, mode, x, [mu] * len(x), [sigma] * len(x), N_EVALUATIONS
        )
        if total_time is not None:
            results[("pytensor", mode)] = {
                "total_time": total_time,
                "time_per_eval": time_per_eval,
                "evaluations_per_sec": N_EVALUATIONS / total_time,
            }
            print(f"  ✅ {N_EVALUATIONS:,} evaluations: {total_time:.4f}s total, {time_per_eval * 1000:.4f}ms per eval")
            print(f"  📈 {results[('pytensor', mode)]['evaluations_per_sec']:.0f} evaluations/second")
        else:
            print(f"  ❌ {mode} mode failed or unavailable")
        print()
    # Test numba-stats
    if HAS_NUMBA_STATS:
        print("Testing numba-stats baseline:")
        total_time, time_per_eval = time_numba_stats(x, mu, sigma, N_EVALUATIONS)
        if total_time is not None:
            results[("numba-stats", "NUMBA")] = {
                "total_time": total_time,
                "time_per_eval": time_per_eval,
                "evaluations_per_sec": N_EVALUATIONS / total_time,
            }
            print(f"  ✅ {N_EVALUATIONS:,} evaluations: {total_time:.4f}s total, {time_per_eval * 1000:.4f}ms per eval")
            print(f"  📈 {results[('numba-stats', 'NUMBA')]['evaluations_per_sec']:.0f} evaluations/second")
        else:
            print("  ❌ numba-stats failed")
        print()
    # Summary comparison
    print("=" * 60)
    print("PERFORMANCE SUMMARY")
    print("=" * 60)
    if results:
        # Sort by time per evaluation (fastest first)
        sorted_results = sorted(results.items(), key=lambda item: item[1]["time_per_eval"])
        print("Ranked by performance (fastest first):")
        print()
        fastest_time = sorted_results[0][1]["time_per_eval"]
        for i, ((package, mode), result) in enumerate(sorted_results, 1):
            # Slowdown relative to the fastest entry (1.0x for the baseline)
            slowdown = result["time_per_eval"] / fastest_time
            print(
                f"{i}. {package:15} | {mode:15} | {result['time_per_eval'] * 1000:8.4f}ms/eval | "
                f"{result['evaluations_per_sec']:8.0f} eval/s | "
                f"{slowdown:5.1f}x {'(baseline)' if i == 1 else 'slower'}"
            )
        print()
        print("Key findings:")
        best_mode = sorted_results[0][0]
        worst_mode = sorted_results[-1][0]
        performance_ratio = sorted_results[-1][1]["time_per_eval"] / fastest_time
        print(f"✅ Fastest: {best_mode} ({sorted_results[0][1]['time_per_eval'] * 1000:.4f}ms per evaluation)")
        print(f"🐌 Slowest: {worst_mode} ({sorted_results[-1][1]['time_per_eval'] * 1000:.4f}ms per evaluation)")
        print(f"📊 Performance ratio: {performance_ratio:.1f}x difference between fastest and slowest")
    else:
        print("❌ No successful timing results obtained")


if __name__ == "__main__":
    main()
```
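For reference, the script evaluates the Gaussian PDF 1/(sqrt(2π)·σ) · exp(-((x-μ)/σ)²/2), and the hand-written numba-jitted calculation the title alludes to might look roughly like the sketch below (`gauss_pdf` is a made-up name, not something the script defines):

```python
import math

import numba
import numpy as np


@numba.njit
def gauss_pdf(x, mu, sigma):
    # Same Gaussian PDF as the PyTensor graph, written as an explicit loop
    out = np.empty_like(x)
    inv_norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    for i in range(x.shape[0]):
        z = (x[i] - mu) / sigma
        out[i] = inv_norm * math.exp(-0.5 * z * z)
    return out


gauss_pdf(np.linspace(-10.0, 10.0, 1000), 2.0, 3.0)  # first call compiles
```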
-
PyTensor has some overhead that you may notice if your function is cheap enough. You can test with larger inputs and the difference should go away (or be reduced substantially). To reduce overhead, try passing `trust_input=True` to the compiled function.
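A minimal sketch of both suggestions, assuming the advice refers to PyTensor's `trust_input` option, which skips per-call input validation (the graph and names below are illustrative, not from the thread):

```python
import numpy as np
import pytensor
import pytensor.tensor as pt

# Illustrative graph: the same Gaussian PDF as in the benchmark script
x = pt.vector("x")
mu = pt.vector("mu")
sigma = pt.vector("sigma")
pdf = pt.exp(-0.5 * ((x - mu) / sigma) ** 2) / (pt.sqrt(2 * np.pi) * sigma)

f = pytensor.function([x, mu, sigma], pdf, mode="NUMBA")
# Skip per-call input validation; inputs must then already have the
# exact dtype and shape the function expects.
f.trust_input = True

# Larger inputs amortize the fixed per-call overhead.
n = 1_000_000
xv = np.linspace(-10.0, 10.0, n)
f(xv, np.full(n, 2.0), np.full(n, 3.0))  # first call triggers numba compilation
```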
-
What are your respective operating systems? @kratsg, could you provide a dump of your Conda environment or venv?
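The relevant version info can also be grabbed from inside Python (a minimal sketch, not a substitute for the full conda/pip dump):

```python
# Print the platform and package versions relevant to this benchmark
import platform
import sys

print(platform.platform())
print(sys.version)
for name in ("pytensor", "numba", "numba_stats", "numpy"):
    try:
        module = __import__(name)
        print(name, getattr(module, "__version__", "unknown"))
    except ImportError:
        print(name, "not installed")
```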
I edited your file: