49 commits
b844ac0
add rocm support
tjtanaa Oct 17, 2025
73b2674
add rocm documentation
tjtanaa Oct 19, 2025
663e8b6
add amd perf benchmark data
tjtanaa Oct 19, 2025
93e247e
fix image
tjtanaa Oct 20, 2025
962010c
address reviewer feedback
tjtanaa Oct 24, 2025
f7e617d
remove torch dependencies
tjtanaa Nov 6, 2025
72eb785
handle when cuda and rocm not found
tjtanaa Nov 6, 2025
6ddba67
fix paddle paddle
tjtanaa Nov 6, 2025
8714185
setup rocm wheel ci build
tjtanaa Nov 6, 2025
5fb3a14
manual trigger rocm workflow
tjtanaa Nov 6, 2025
5e2e4af
remove system package steps
tjtanaa Nov 6, 2025
75530b5
install system dependencies
tjtanaa Nov 6, 2025
6ad6bb2
upgrade ubuntu version, skip tests
tjtanaa Nov 6, 2025
ab00395
upgrade ubuntu version, skip tests
tjtanaa Nov 6, 2025
9a11ce3
install from python from offical source
tjtanaa Nov 6, 2025
d6c1f4a
use venv instead
tjtanaa Nov 6, 2025
5d24383
fix other python ci build
tjtanaa Nov 6, 2025
5eba4b5
build many linux
tjtanaa Nov 6, 2025
c543e16
remove rocm_ tag from platform tag
tjtanaa Nov 6, 2025
8ea1956
Add automated PyPI index with separate CUDA/ROCm backends
tjtanaa Nov 6, 2025
f6101fc
only manual trigger when deploying pypi index
tjtanaa Nov 6, 2025
643d12d
add publish to index GA workflow
tjtanaa Nov 7, 2025
0935d8c
fix the publish to index
tjtanaa Nov 7, 2025
378b262
fix the publish to index write permission
tjtanaa Nov 7, 2025
5efe4a9
fix the publish to index for both mode
tjtanaa Nov 7, 2025
8871043
fix the manylinux rocmwheel build
tjtanaa Nov 7, 2025
619c531
update publish to index
tjtanaa Nov 7, 2025
6a3f302
fix nested dumb-pypi
tjtanaa Nov 7, 2025
c3a580b
fix publish to index
tjtanaa Nov 7, 2025
4eca2b7
fix the dumb-pypi
tjtanaa Nov 7, 2025
2f15898
fix the package path
tjtanaa Nov 7, 2025
8ffcefa
lint
tjtanaa Nov 7, 2025
f09cb54
remove deploy-pypi-index
tjtanaa Nov 7, 2025
821fb41
remove unused code
tjtanaa Nov 7, 2025
e9ff27f
update publish to index to handle version isolation
tjtanaa Nov 7, 2025
eff48c7
fix publish to index yaml syntax
tjtanaa Nov 7, 2025
4dcbb76
fix publish to index syntax error
tjtanaa Nov 7, 2025
86faab4
add workflow python script
tjtanaa Nov 7, 2025
2b12a97
only bundle the dependencies specified in the pyproject.toml
tjtanaa Nov 7, 2025
f0ec845
bugfix the workflow
tjtanaa Nov 7, 2025
5291031
fixing the publsih to index workflow
tjtanaa Nov 7, 2025
1f2e60c
fixing the publsih to index workflow
tjtanaa Nov 7, 2025
bf274a1
update workflow instruction
tjtanaa Nov 7, 2025
c5b4886
only allow publish to index be triggered manually
tjtanaa Nov 7, 2025
d631e44
remove github workflow
tjtanaa Nov 10, 2025
8832411
sync with upstream
tjtanaa Nov 10, 2025
e353538
update installation procedure on ROCm
tjtanaa Nov 10, 2025
8fd8b99
fix installation command
tjtanaa Nov 10, 2025
82fba39
fix enum for HIP
tjtanaa Nov 11, 2025
11 changes: 10 additions & 1 deletion .gitignore
@@ -10,4 +10,13 @@ htmlcov/
.idea
*.log
*.pyc
examples/paddle_case/log
*.so
examples/paddle_case/log

# Auto-generated hipified files and directories (created during ROCm build)
fastsafetensors/cpp/hip/
fastsafetensors/cpp/*.hip.*
fastsafetensors/cpp/hip_compat.h

# Auto-generated PyPI index (generated by GitHub Actions)
pypi-index/
22 changes: 20 additions & 2 deletions README.md
@@ -48,16 +48,34 @@ Please refer to [Foundation Model Stack Community Code of Conduct](https://githu

Takeshi Yoshimura, Tatsuhiro Chiba, Manish Sethi, Daniel Waddington, Swaminathan Sundararaman. (2025) Speeding up Model Loading with fastsafetensors [arXiv:2505.23072](https://arxiv.org/abs/2505.23072) and IEEE CLOUD 2025.

## For NVIDIA

## Install from PyPI
### Install from PyPI

See https://pypi.org/project/fastsafetensors/

```bash
pip install fastsafetensors
```

## Install from source
### Install from source

```bash
pip install .
```

## For ROCm

On ROCm there is no GDS equivalent, so fastsafetensors only supports the `nogds=True` mode.
A performance gain example can be found in [amd-perf.md](./docs/amd-perf.md).
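For reference, a minimal loading sketch with `nogds=True`, assuming the current `SafeTensorsFileLoader`/`SingleGroup` API (argument names and the tensor key below are illustrative and may differ between releases):

```python
# Sketch only: load a safetensors file on ROCm with GDS disabled (nogds=True).
from fastsafetensors import SafeTensorsFileLoader, SingleGroup

loader = SafeTensorsFileLoader(SingleGroup(), "cuda:0", nogds=True)  # "cuda:0" maps to the HIP device
loader.add_filenames({0: ["model-00001-of-00002.safetensors"]})      # rank -> list of files
bufs = loader.copy_files_to_device()
tensor = bufs.get_tensor("model.embed_tokens.weight")                # hypothetical tensor name
bufs.close()
loader.close()
```

If `nogds` is left as `False`, the loader warns that `libcufile.so` is missing and falls back to `nogds=True`.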

### Install from GitHub

```bash
pip install git+https://github.com/foundation-model-stack/fastsafetensors.git
```

### Install from source

```bash
pip install .
88 changes: 88 additions & 0 deletions docs/amd-perf.md
@@ -0,0 +1,88 @@
# Performance of FastSafeTensors on AMD GPUs

## DeepSeek-R1 vLLM Model Weight Loading Speed

This benchmark compares the performance of `safetensors` vs `fastsafetensors` when loading model weights on AMD GPUs.

NOTE: `fastsafetensors` does not support the GDS feature on ROCm, since there is no GDS equivalent on ROCm.

### Benchmark Methodology

**Platform:** AMD ROCm 7.0.1
**GPUs:** 8x AMD Instinct MI300X
**Library:** fastsafetensors 0.1.15

1. **Clear system cache** to ensure consistent starting conditions:
```bash
sudo sh -c 'sync && echo 3 > /proc/sys/vm/drop_caches'
```

2. **Launch vLLM** with either `--load-format safetensors` or `--load-format fastsafetensors`:

```bash
MODEL=EmbeddedLLM/deepseek-r1-FP8-Dynamic

VLLM_USE_V1=1 \
VLLM_ROCM_USE_AITER=1 \
vllm serve $MODEL \
--tensor-parallel-size 8 \
--disable-log-requests \
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
--trust-remote-code \
--load-format fastsafetensors \
--block-size 1
```

### Results

The experiments were carried out on MI300X.

**Cache Scenarios:**
- **No cache**: Model weights are loaded after clearing the system cache (cold start).
- **Cached**: Model weights are loaded immediately after a previous load. The weights are cached in the filesystem and RAM (warm start).

<img src="./images/fastsafetensors-rocm.png" alt="FastSafeTensors on ROCm" width="70%">




## GPT-2 perf tests based on the script [perf/fastsafetensors_perf/perf.py](../perf/fastsafetensors_perf/perf.py)

### Test Configuration

All tests were performed on single-GPU loading scenarios with two different model sizes:
- **GPT-2 (small):** 523MB safetensors file
- **GPT-2 Medium:** ~1.4GB safetensors file

#### Key Parameters Tested:
- **nogds mode:** ROCm fallback (GDS not available on AMD GPUs)
- **Thread counts:** 8, 16, 32
- **Buffer sizes:** 8MB, 16MB, 32MB
- **Loading methods:** nogds (async I/O), mmap (memory-mapped)
- **Data types:** AUTO (no conversion), F16 (half precision conversion)

---
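The numbers below boil down to bandwidth = file size / elapsed copy time. A rough sketch of such a measurement, assuming the `SafeTensorsFileLoader` API used elsewhere in this repo (this is not the actual `perf.py` harness, and the `max_threads`/`bbuf_size_kb` parameter names are assumptions):

```python
# Sketch: time a nogds load of one file and report effective bandwidth.
import os
import time

from fastsafetensors import SafeTensorsFileLoader, SingleGroup

path = "gpt2/model.safetensors"  # hypothetical local path

# nogds=True is the only mode available on ROCm; thread/buffer knobs use assumed names.
loader = SafeTensorsFileLoader(SingleGroup(), "cuda:0", nogds=True,
                               max_threads=16, bbuf_size_kb=16 * 1024)
loader.add_filenames({0: [path]})

start = time.perf_counter()
bufs = loader.copy_files_to_device()
elapsed = time.perf_counter() - start

size_gb = os.path.getsize(path) / (1024 ** 3)
print(f"{size_gb / elapsed:.2f} GB/s in {elapsed:.3f}s")

bufs.close()
loader.close()
```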

#### Performance Results

##### GPT-2 (523MB) - Single GPU Tests

| Test # | Method | Threads | Buffer | Config | Bandwidth | Elapsed Time | Notes |
|--------|--------|---------|--------|--------|-----------|--------------|-------|
| 1 | nogds | 16 | 16MB | default | **1.91 GB/s** | 0.268s | Baseline test |
| 2 | nogds | 32 | 32MB | default | **2.07 GB/s** | 0.246s | Higher threads/buffer |
| 3 | nogds | 8 | 8MB | default | **2.10 GB/s** | 0.243s | Lower threads/buffer |
| 4 | mmap | N/A | N/A | default | **1.01 GB/s** | 0.505s | Memory-mapped |
| 5 | nogds | 32 | 32MB | cache-drop | **1.24 GB/s** | 0.410s | Cold cache test |
| 6 | nogds | 32 | 32MB | F16 dtype | **0.77 GB/s** | 0.332s | With type conversion |
| 8 | nogds | 16 | 16MB | **optimal** | **2.62 GB/s** | 0.195s | Best config |

##### GPT-2 Medium (1.4GB) - Single GPU Tests

| Test # | Method | Threads | Buffer | Block Size | Bandwidth | Elapsed Time | Notes |
|--------|--------|---------|--------|------------|-----------|--------------|-------|
| 9 | nogds | 16 | 16MB | 160MB | **6.02 GB/s** | 0.235s | Optimal config |
| 10 | mmap | N/A | N/A | N/A | **1.28 GB/s** | 1.104s | Memory-mapped |
| 11 | nogds | 32 | 32MB | 160MB | **5.34 GB/s** | 0.265s | Higher threads |

---
Binary file added docs/images/fastsafetensors-rocm.png
9 changes: 9 additions & 0 deletions fastsafetensors/common.py
@@ -14,6 +14,15 @@
from .st_types import Device, DType


def is_gpu_found():
"""Check if any GPU (CUDA or HIP) is available.

Returns True if either CUDA or ROCm/HIP GPUs are detected.
This allows code to work transparently across both platforms.
"""
return fstcpp.is_cuda_found() or fstcpp.is_hip_found()


def get_device_numa_node(device: Optional[int]) -> Optional[int]:
if device is None or not sys.platform.startswith("linux"):
return None
37 changes: 28 additions & 9 deletions fastsafetensors/copier/gds.py
@@ -5,7 +5,7 @@
from typing import Dict, Optional

from .. import cpp as fstcpp
from ..common import SafeTensorsMetadata
from ..common import SafeTensorsMetadata, is_gpu_found
from ..frameworks import FrameworkOpBase, TensorBase
from ..st_types import Device, DeviceType, DType
from .base import CopierInterface
@@ -30,12 +30,29 @@ def __init__(
self.fh: Optional[fstcpp.gds_file_handle] = None
self.copy_reqs: Dict[int, int] = {}
self.aligned_length = 0
cudavers = list(map(int, framework.get_cuda_ver().split(".")))
# CUDA 12.2 (GDS version 1.7) introduces support for non O_DIRECT file descriptors
# Compatible with CUDA 11.x
self.o_direct = not (
cudavers[0] > 12 or (cudavers[0] == 12 and cudavers[1] >= 2)
)
cuda_ver = framework.get_cuda_ver()
if cuda_ver and cuda_ver != "0.0":
# Parse version string (e.g., "cuda-12.1" or "hip-5.7.0")
# Extract the numeric part after the platform prefix
ver_parts = cuda_ver.split("-", 1)
if len(ver_parts) == 2:
cudavers = list(map(int, ver_parts[1].split(".")))
# CUDA 12.2 (GDS version 1.7) introduces support for non O_DIRECT file descriptors
# Compatible with CUDA 11.x
# Only applies to CUDA platform (not ROCm/HIP)
if ver_parts[0] == "cuda":
self.o_direct = not (
cudavers[0] > 12 or (cudavers[0] == 12 and cudavers[1] >= 2)
)
else:
# ROCm/HIP platform, use O_DIRECT
self.o_direct = True
else:
# Fallback if format is unexpected
self.o_direct = True
else:
# No GPU platform detected, use O_DIRECT
self.o_direct = True

def set_o_direct(self, enable: bool):
self.o_direct = enable
@@ -151,8 +168,10 @@ def new_gds_file_copier(
nogds: bool = False,
):
device_is_not_cpu = device.type != DeviceType.CPU
if device_is_not_cpu and not fstcpp.is_cuda_found():
raise Exception("[FAIL] libcudart.so does not exist")
if device_is_not_cpu and not is_gpu_found():
raise Exception(
"[FAIL] GPU runtime library (libcudart.so or libamdhip64.so) does not exist"
)
if not fstcpp.is_cufile_found() and not nogds:
warnings.warn(
"libcufile.so does not exist but nogds is False. use nogds=True",
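Restated as a standalone helper, the O_DIRECT decision in the hunk above reduces to: parse the `<platform>-<major>.<minor>` version string, and only CUDA 12.2+ may skip O_DIRECT. A minimal sketch mirroring the diff (not library code):

```python
def needs_o_direct(ver: str) -> bool:
    """Mirror of the o_direct logic above: True unless the platform is CUDA >= 12.2."""
    if not ver or ver == "0.0":
        return True                      # no GPU runtime detected
    platform, _, nums = ver.partition("-")
    if platform != "cuda" or not nums:
        return True                      # ROCm/HIP or unexpected format
    major, minor, *_ = (int(x) for x in nums.split("."))
    return not (major > 12 or (major == 12 and minor >= 2))

assert needs_o_direct("hip-5.7.0") is True
assert needs_o_direct("cuda-12.1") is True
assert needs_o_direct("cuda-12.2") is False
```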
37 changes: 37 additions & 0 deletions fastsafetensors/cpp/cuda_compat.h
@@ -0,0 +1,37 @@
// SPDX-License-Identifier: Apache-2.0
/*
* CUDA/HIP compatibility layer for fastsafetensors
* Minimal compatibility header - only defines what hipify-perl doesn't handle
*/

#ifndef __CUDA_COMPAT_H__
#define __CUDA_COMPAT_H__

// Platform detection - this gets hipified to check __HIP_PLATFORM_AMD__
#ifdef __HIP_PLATFORM_AMD__
#ifndef USE_ROCM
#define USE_ROCM
#endif
// Note: We do NOT include <hip/hip_runtime.h> here to avoid compile-time dependencies.
// Instead, we dynamically load the ROCm runtime library (libamdhip64.so) at runtime
// using dlopen(), just like we do for CUDA (libcudart.so).
// Minimal types are defined in ext.hpp.
#else
// For CUDA platform, we also avoid including headers and define minimal types in ext.hpp
#endif

// Runtime library name - hipify-perl doesn't change string literals
#ifdef USE_ROCM
#define GPU_RUNTIME_LIB "libamdhip64.so"
#else
#define GPU_RUNTIME_LIB "libcudart.so"
#endif

// Custom function pointer names that hipify-perl doesn't recognize
// These are our own naming in ext_funcs struct, not standard CUDA API
#ifdef USE_ROCM
#define cudaDeviceMalloc hipDeviceMalloc
#define cudaDeviceFree hipDeviceFree
#endif

#endif // __CUDA_COMPAT_H__
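The same runtime-probing idea, sketched from the Python side with ctypes (illustrative only; the extension does this in C++ via `dlopen()`):

```python
# Sketch: probe for a GPU runtime without any compile-time CUDA/ROCm dependency,
# mirroring the dlopen() strategy described in cuda_compat.h.
import ctypes

def count_gpus() -> tuple[str, int]:
    # (library, device-count symbol) pairs; library names match the header's GPU_RUNTIME_LIB values.
    for lib_name, symbol in (("libcudart.so", "cudaGetDeviceCount"),
                             ("libamdhip64.so", "hipGetDeviceCount")):
        try:
            lib = ctypes.CDLL(lib_name)
        except OSError:
            continue                       # runtime not installed, try the next one
        count = ctypes.c_int(0)
        if getattr(lib, symbol)(ctypes.byref(count)) == 0:
            return lib_name, count.value
    return "none", 0

print(count_gpus())
```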
43 changes: 40 additions & 3 deletions fastsafetensors/cpp/ext.cpp
@@ -10,6 +10,7 @@
#include <chrono>
#include <dlfcn.h>

#include "cuda_compat.h"
#include "ext.hpp"

#define ALIGN 4096
@@ -78,6 +79,7 @@ ext_funcs_t cpu_fns = ext_funcs_t {
ext_funcs_t cuda_fns;

static bool cuda_found = false;
static bool is_hip_runtime = false; // Track if we loaded HIP (not auto-hipified)
static bool cufile_found = false;

static int cufile_ver = 0;
@@ -89,7 +91,7 @@ template <typename T> void mydlsym(T** h, void* lib, std::string const& name) {
static void load_nvidia_functions() {
cudaError_t (*cudaGetDeviceCount)(int*);
const char* cufileLib = "libcufile.so.0";
const char* cudartLib = "libcudart.so";
const char* cudartLib = GPU_RUNTIME_LIB;
const char* numaLib = "libnuma.so.1";
bool init_log = getenv(ENV_ENABLE_INIT_LOG);
int mode = RTLD_LAZY | RTLD_GLOBAL | RTLD_NODELETE;
@@ -122,8 +124,12 @@ static void load_nvidia_functions() {
count = 0; // why cudaGetDeviceCount returns non-zero for errors?
}
cuda_found = count > 0;
// Detect if we loaded HIP runtime (ROCm) vs CUDA runtime
if (cuda_found && std::string(cudartLib).find("hip") != std::string::npos) {
is_hip_runtime = true;
}
if (init_log) {
fprintf(stderr, "[DEBUG] device count=%d, cuda_found=%d\n", count, cuda_found);
fprintf(stderr, "[DEBUG] device count=%d, cuda_found=%d, is_hip_runtime=%d\n", count, cuda_found, is_hip_runtime);
}
} else {
cuda_found = false;
@@ -217,11 +223,28 @@ static void load_nvidia_functions() {
}
}

// Note: is_cuda_found gets auto-hipified to is_hip_found on ROCm builds
// So this function will be is_hip_found() after hipification on ROCm
bool is_cuda_found()
{
return cuda_found;
}

// Separate function that always returns false on ROCm (CUDA not available on ROCm)
// This will be used for the "is_cuda_found" Python export on ROCm builds
bool cuda_not_available()
{
return false; // On ROCm, CUDA is never available
}

// Separate function for checking HIP runtime detection (not hipified)
// On CUDA: checks if HIP runtime was detected
// On ROCm: not used (is_cuda_found gets hipified to is_hip_found)
bool check_hip_runtime()
{
return is_hip_runtime;
}

bool is_cufile_found()
{
return cufile_found;
@@ -718,7 +741,21 @@ cpp_metrics_t get_cpp_metrics() {

PYBIND11_MODULE(__MOD_NAME__, m)
{
m.def("is_cuda_found", &is_cuda_found);
// Export both is_cuda_found and is_hip_found on all platforms
// Use string concatenation to prevent hipify from converting the export names
#ifdef USE_ROCM
// On ROCm after hipify:
// - is_cuda_found() becomes is_hip_found(), so export it as "is_hip_found"
// - Export cuda_not_available() as "is_cuda_found" (CUDA not available on ROCm)
m.def(("is_" "cuda" "_found"), &cuda_not_available); // Returns false on ROCm
m.def(("is_" "hip" "_found"), &is_cuda_found); // hipified to is_hip_found, returns hip status
#else
// On CUDA:
// - is_cuda_found() checks for CUDA
// - check_hip_runtime() checks if HIP runtime was loaded
m.def(("is_" "cuda" "_found"), &is_cuda_found);
m.def(("is_" "hip" "_found"), &check_hip_runtime);
#endif
m.def("is_cufile_found", &is_cufile_found);
m.def("cufile_version", &cufile_version);
m.def("set_debug_log", &set_debug_log);
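Seen from Python, the net effect of this export shuffle is that both names are always present, and `common.is_gpu_found()` simply ORs them. A hedged sketch of the expected behaviour (not a test from this PR):

```python
# Expected behaviour of the bindings after this change (illustrative):
from fastsafetensors import cpp as fstcpp

print(fstcpp.is_cuda_found())    # CUDA wheel: True if libcudart.so found a device; ROCm wheel: always False
print(fstcpp.is_hip_found())     # ROCm wheel: True if libamdhip64.so found a device; CUDA wheel: HIP-runtime check
print(fstcpp.is_cufile_found())  # False on ROCm, since there is no GDS/cuFile equivalent

# common.is_gpu_found() combines the two checks:
from fastsafetensors.common import is_gpu_found
assert is_gpu_found() == (fstcpp.is_cuda_found() or fstcpp.is_hip_found())
```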
10 changes: 10 additions & 0 deletions fastsafetensors/cpp/ext.hpp
@@ -15,6 +15,8 @@
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>

#include "cuda_compat.h"

#define ENV_ENABLE_INIT_LOG "FASTSAFETENSORS_ENABLE_INIT_LOG"

#ifndef __MOD_NAME__
@@ -33,8 +35,16 @@ typedef struct CUfileDescr_t {
const void *fs_ops; /* CUfileFSOps_t */
} CUfileDescr_t;
typedef struct CUfileError { CUfileOpError err; } CUfileError_t;

// Define minimal CUDA/HIP types for both platforms to avoid compile-time dependencies
// We load all GPU functions dynamically at runtime via dlopen()
typedef enum cudaError { cudaSuccess = 0, cudaErrorMemoryAllocation = 2 } cudaError_t;
// Platform-specific enum values - CUDA and HIP have different values for HostToDevice
#ifdef USE_ROCM
enum cudaMemcpyKind { cudaMemcpyHostToDevice=1, cudaMemcpyDefault = 4 };
#else
enum cudaMemcpyKind { cudaMemcpyHostToDevice=2, cudaMemcpyDefault = 4 };
#endif


typedef enum CUfileFeatureFlags {