
Conversation

@tbensonatl
Collaborator

@tbensonatl tbensonatl commented Nov 11, 2025

MatX caches (e.g., cuFFT plans or cuBLAS handles) and allocations (user or internal allocations via matxAlloc) are stored in static data structures. These data structures are destroyed during program exit by the corresponding destructors. However, due to the ordering of static destructors and atexit handlers, resources may be freed after the CUDA context or some other dependent resource has been destroyed, which can result in a segmentation fault during program exit.

The helper function ClearCachesAndAllocations() can be called prior to program exit to free resources allocated by MatX. This prevents conflicts with other static destructors and atexit handlers and thus allows a clean shutdown. The function may also be useful at any other time the user wishes to free resources allocated via MatX.

There are two other helpers, ClearCaches() and FreeAllocations(), which deallocate data associated with the caches (plans, handles, workspaces, etc.) and allocations made via matxAlloc(), respectively.

Signed-off-by: Thomas Benson <[email protected]>
@copy-pr-bot

copy-pr-bot bot commented Nov 11, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@tbensonatl
Collaborator Author

/build

@greptile-apps

greptile-apps bot commented Nov 11, 2025

Greptile Overview

Greptile Summary

Added three new helper functions (ClearCaches(), FreeAllocations(), and ClearCachesAndAllocations()) to manually free MatX resources before program exit, addressing static destructor ordering issues that can cause segmentation faults.

Key Changes:

  • cache.h: Implemented ClearAll() method that frees all cache types (transform caches, stream allocations, LTOIR cache) with proper mutex protection
  • allocator.h: Refactored locking by moving memory_mtx acquisition from deallocate_internal() to public deallocate() methods; added free_all() helper
  • Test coverage validates memory is properly freed by checking GPU memory before/after cleanup
  • Examples and test runner demonstrate proper usage pattern

Implementation Notes:

  • Lock ordering is consistent: stream_alloc_mutex → memory_mtx (no reverse paths found)
  • Uses recursive mutexes to prevent same-thread deadlocks
  • ClearAll() calls matxFree() on stream allocations; these are removed from the allocation map, so FreeAllocations() does not double-free them
  • Registry insertion in GetCacheIdFromType() on every call is inefficient but safe (as noted in existing comment)

Confidence Score: 4/5

  • This PR is safe to merge with low risk - the implementation correctly handles resource cleanup with proper locking and prevents double-free issues
  • Score reflects solid implementation with comprehensive test coverage and correct thread safety. Minor concern about lock ordering consistency across the codebase, though no actual deadlock paths were identified in this review
  • Pay close attention to include/matx/core/cache.h - verify that the lock ordering (stream_alloc_mutex before memory_mtx) is consistent throughout the entire codebase to prevent potential deadlocks

Important Files Changed

File Analysis

Filename Score Overview
include/matx/core/cache.h 4/5 Added ClearAll() method, CacheRegistry(), and helper functions ClearCaches()/ClearCachesAndAllocations(). Implementation correctly handles cache cleanup with proper locking, though registry insertion on every call is inefficient but safe.
include/matx/core/allocator.h 5/5 Refactored deallocate_internal() to require caller hold mutex, moved lock acquisition to public deallocate() methods. Added free_all() helper and public FreeAllocations() function. Changes maintain thread safety correctly.
test/00_misc/ClearCacheTests.cu 4/5 New test validates that ClearCachesAndAllocations() frees memory by checking GPU memory before/after. Test correctly exercises both cache (cuBLAS) and manual allocations, verifying at least 8 MiB freed.

Sequence Diagram

sequenceDiagram
    participant User
    participant ClearCachesAndAllocations
    participant ClearCaches
    participant matxCache_t
    participant FreeAllocations
    participant MemTracker
    participant CUDA

    User->>ClearCachesAndAllocations: Call cleanup before exit
    ClearCachesAndAllocations->>ClearCaches: Clear all caches
    ClearCaches->>matxCache_t: ClearAll()
    
    Note over matxCache_t: Lock cache_mtx
    loop For each cache ID
        matxCache_t->>matxCache_t: Call CacheFreeHelper.free()
        matxCache_t->>matxCache_t: Clear cache entries (destructors)
    end
    
    Note over matxCache_t: Lock stream_alloc_mutex
    loop For each stream allocation
        matxCache_t->>MemTracker: matxFree(ptr)
        Note over MemTracker: Lock memory_mtx
        MemTracker->>CUDA: cudaFreeAsync()
        MemTracker->>MemTracker: Erase from allocationMap
    end
    
    Note over matxCache_t: Lock ltoir_mutex
    matxCache_t->>matxCache_t: Clear LTOIR cache
    
    ClearCaches-->>ClearCachesAndAllocations: Return
    
    ClearCachesAndAllocations->>FreeAllocations: Free remaining allocations
    FreeAllocations->>MemTracker: free_all()
    Note over MemTracker: Lock memory_mtx
    loop While allocationMap not empty
        MemTracker->>MemTracker: deallocate_internal()
        MemTracker->>CUDA: cudaFree()/cudaFreeHost()
        MemTracker->>MemTracker: Erase from allocationMap
    end
    
    FreeAllocations-->>ClearCachesAndAllocations: Return
    ClearCachesAndAllocations-->>User: Cleanup complete


@greptile-apps greptile-apps bot left a comment


4 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

@cliffburdick
Collaborator

Do we need any tests for this?

@tbensonatl
Collaborator Author

> Do we need any tests for this?

I added a clear call to the post-test code in test/main.cu, so currently each ctest executable will clear the caches after running all of the GTests. I considered adding the same to either each GTest individually or the test exit handler. If we do that, then we will not be able to run tests within the same executable concurrently. One option would be to create a new CTest with a single GTest case that implicitly creates several plans/allocations and then clears everything with a free memory check before and after. The risk there is that the free memory stats will be device-wide, so other concurrent CTests will impact the stats. I'll check if CTest supports flagging a test as one that needs to run when no other tests are running as that would work around these issues.
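For reference, CTest does support serializing a test: the RUN_SERIAL test property prevents a test from running concurrently with any other test, so device-wide free-memory statistics would not be perturbed by other CTests. A sketch (the test name here is hypothetical):

```cmake
# RUN_SERIAL ensures no other ctest jobs run concurrently with this one,
# so device-wide free-memory checks are not perturbed by other tests.
add_test(NAME clear_cache_test COMMAND clear_cache_test)
set_tests_properties(clear_cache_test PROPERTIES RUN_SERIAL ON)
```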

@cliffburdick
Collaborator

> Do we need any tests for this?
>
> I added a clear call to the post-test code in test/main.cu, so currently each ctest executable will clear the caches after running all of the GTests. I considered adding the same to either each GTest individually or the test exit handler. If we do that, then we will not be able to run tests within the same executable concurrently. One option would be to create a new CTest with a single GTest case that implicitly creates several plans/allocations and then clears everything with a free memory check before and after. The risk there is that the free memory stats will be device-wide, so other concurrent CTests will impact the stats. I'll check if CTest supports flagging a test as one that needs to run when no other tests are running as that would work around these issues.

I think in some ways not clearing the cache on tests is a feature since it has exposed problems with caching in the past that clearing would hide.

@tbensonatl
Collaborator Author

/build


@greptile-apps greptile-apps bot left a comment


6 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

@cliffburdick
Collaborator

/build

3 similar comments
@cliffburdick
Collaborator

/build

@cliffburdick
Collaborator

/build

@cliffburdick
Collaborator

/build

@cliffburdick cliffburdick merged commit 4b43915 into main Nov 14, 2025
1 check passed
@cliffburdick cliffburdick deleted the add-helper-functions-to-clear-caches-allocs branch November 14, 2025 17:00