49 commits
b844ac0
add rocm support
tjtanaa Oct 17, 2025
73b2674
add rocm documentation
tjtanaa Oct 19, 2025
663e8b6
add amd perf benchmark data
tjtanaa Oct 19, 2025
93e247e
fix image
tjtanaa Oct 20, 2025
962010c
address reviewer feedback
tjtanaa Oct 24, 2025
f7e617d
remove torch dependencies
tjtanaa Nov 6, 2025
72eb785
handle when cuda and rocm not found
tjtanaa Nov 6, 2025
6ddba67
fix paddle paddle
tjtanaa Nov 6, 2025
8714185
setup rocm wheel ci build
tjtanaa Nov 6, 2025
5fb3a14
manual trigger rocm workflow
tjtanaa Nov 6, 2025
5e2e4af
remove system package steps
tjtanaa Nov 6, 2025
75530b5
install system dependencies
tjtanaa Nov 6, 2025
6ad6bb2
upgrade ubuntu version, skip tests
tjtanaa Nov 6, 2025
ab00395
upgrade ubuntu version, skip tests
tjtanaa Nov 6, 2025
9a11ce3
install from python from offical source
tjtanaa Nov 6, 2025
d6c1f4a
use venv instead
tjtanaa Nov 6, 2025
5d24383
fix other python ci build
tjtanaa Nov 6, 2025
5eba4b5
build many linux
tjtanaa Nov 6, 2025
c543e16
remove rocm_ tag from platform tag
tjtanaa Nov 6, 2025
8ea1956
Add automated PyPI index with separate CUDA/ROCm backends
tjtanaa Nov 6, 2025
f6101fc
only manual trigger when deploying pypi index
tjtanaa Nov 6, 2025
643d12d
add publish to index GA workflow
tjtanaa Nov 7, 2025
0935d8c
fix the publish to index
tjtanaa Nov 7, 2025
378b262
fix the publish to index write permission
tjtanaa Nov 7, 2025
5efe4a9
fix the publish to index for both mode
tjtanaa Nov 7, 2025
8871043
fix the manylinux rocmwheel build
tjtanaa Nov 7, 2025
619c531
update publish to index
tjtanaa Nov 7, 2025
6a3f302
fix nested dumb-pypi
tjtanaa Nov 7, 2025
c3a580b
fix publish to index
tjtanaa Nov 7, 2025
4eca2b7
fix the dumb-pypi
tjtanaa Nov 7, 2025
2f15898
fix the package path
tjtanaa Nov 7, 2025
8ffcefa
lint
tjtanaa Nov 7, 2025
f09cb54
remove deploy-pypi-index
tjtanaa Nov 7, 2025
821fb41
remove unused code
tjtanaa Nov 7, 2025
e9ff27f
update publish to index to handle version isolation
tjtanaa Nov 7, 2025
eff48c7
fix publish to index yaml syntax
tjtanaa Nov 7, 2025
4dcbb76
fix publish to index syntax error
tjtanaa Nov 7, 2025
86faab4
add workflow python script
tjtanaa Nov 7, 2025
2b12a97
only bundle the dependencies specified in the pyproject.toml
tjtanaa Nov 7, 2025
f0ec845
bugfix the workflow
tjtanaa Nov 7, 2025
5291031
fixing the publsih to index workflow
tjtanaa Nov 7, 2025
1f2e60c
fixing the publsih to index workflow
tjtanaa Nov 7, 2025
bf274a1
update workflow instruction
tjtanaa Nov 7, 2025
c5b4886
only allow publish to index be triggered manually
tjtanaa Nov 7, 2025
d631e44
remove github workflow
tjtanaa Nov 10, 2025
8832411
sync with upstream
tjtanaa Nov 10, 2025
e353538
update installation procedure on ROCm
tjtanaa Nov 10, 2025
8fd8b99
fix installation command
tjtanaa Nov 10, 2025
82fba39
fix enum for HIP
tjtanaa Nov 11, 2025
11 changes: 10 additions & 1 deletion .gitignore
@@ -10,4 +10,13 @@ htmlcov/
.idea
*.log
*.pyc
examples/paddle_case/log
*.so
examples/paddle_case/log

# Auto-generated hipified files and directories (created during ROCm build)
fastsafetensors/cpp/hip/
fastsafetensors/cpp/*.hip.*
fastsafetensors/cpp/hip_compat.h

# Auto-generated PyPI index (generated by GitHub Actions)
pypi-index/
22 changes: 20 additions & 2 deletions README.md
@@ -48,16 +48,34 @@ Please refer to [Foundation Model Stack Community Code of Conduct](https://githu

Takeshi Yoshimura, Tatsuhiro Chiba, Manish Sethi, Daniel Waddington, Swaminathan Sundararaman. (2025) Speeding up Model Loading with fastsafetensors [arXiv:2505.23072](https://arxiv.org/abs/2505.23072) and IEEE CLOUD 2025.

## For NVIDIA

## Install from PyPI
### Install from PyPI

See https://pypi.org/project/fastsafetensors/

```bash
pip install fastsafetensors
```

## Install from source
### Install from source

```bash
pip install .
```

## For ROCm

On ROCm there is no GDS equivalent, so fastsafetensors only supports the `nogds=True` mode.
A performance gain example can be found in [amd-perf.md](./docs/amd-perf.md).
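For reference, a minimal loading sketch with `nogds=True`, assuming the current `SafeTensorsFileLoader`/`SingleGroup` API (argument names and the tensor key below are illustrative and may differ between releases):

```python
# Sketch only: load a safetensors file on ROCm with GDS disabled (nogds=True).
from fastsafetensors import SafeTensorsFileLoader, SingleGroup

loader = SafeTensorsFileLoader(SingleGroup(), "cuda:0", nogds=True)  # "cuda:0" maps to the HIP device
loader.add_filenames({0: ["model-00001-of-00002.safetensors"]})      # rank -> list of files
bufs = loader.copy_files_to_device()
tensor = bufs.get_tensor("model.embed_tokens.weight")                # hypothetical tensor name
bufs.close()
loader.close()
```

If `nogds` is left as `False`, the loader warns that `libcufile.so` is missing and falls back to `nogds=True`.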

### Install from GitHub

```bash
pip install git+https://github.com/foundation-model-stack/fastsafetensors.git
```

### Install from source

```bash
pip install .
88 changes: 88 additions & 0 deletions docs/amd-perf.md
@@ -0,0 +1,88 @@
# Performance of FastSafeTensors on AMD GPUs

## DeepSeek-R1 vLLM Model Weight Loading Speed

This benchmark compares the performance of `safetensors` vs `fastsafetensors` when loading model weights on AMD GPUs.

NOTE: `fastsafetensors` does not support the GDS feature on ROCm, since there is no GDS equivalent on ROCm.

### Benchmark Methodology

**Platform:** AMD ROCm 7.0.1
**GPUs:** 8x AMD Instinct MI300X
**Library:** fastsafetensors 0.1.15

1. **Clear system cache** to ensure consistent starting conditions:
```bash
sudo sh -c 'sync && echo 3 > /proc/sys/vm/drop_caches'
```

2. **Launch vLLM** with either `--load-format safetensors` or `--load-format fastsafetensors`:

```bash
MODEL=EmbeddedLLM/deepseek-r1-FP8-Dynamic

VLLM_USE_V1=1 \
VLLM_ROCM_USE_AITER=1 \
vllm serve $MODEL \
--tensor-parallel-size 8 \
--disable-log-requests \
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
--trust-remote-code \
--load-format fastsafetensors \
--block-size 1
```

### Results

The experiments were carried out on MI300X.

**Cache Scenarios:**
- **No cache**: Model weights are loaded after clearing the system cache (cold start).
- **Cached**: Model weights are loaded immediately after a previous load. The weights are cached in the filesystem and RAM (warm start).

<img src="./images/fastsafetensors-rocm.png" alt="FastSafeTensors on ROCm" width="70%">




## GPT-2 perf tests based on the script [perf/fastsafetensors_perf/perf.py](../perf/fastsafetensors_perf/perf.py)

### Test Configuration

All tests were performed on single-GPU loading scenarios with two different model sizes:
- **GPT-2 (small):** 523MB safetensors file
- **GPT-2 Medium:** ~1.4GB safetensors file

#### Key Parameters Tested:
- **nogds mode:** ROCm fallback (GDS not available on AMD GPUs)
- **Thread counts:** 8, 16, 32
- **Buffer sizes:** 8MB, 16MB, 32MB
- **Loading methods:** nogds (async I/O), mmap (memory-mapped)
- **Data types:** AUTO (no conversion), F16 (half precision conversion)

---
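The numbers below boil down to bandwidth = file size / elapsed copy time. A rough sketch of such a measurement, assuming the `SafeTensorsFileLoader` API used elsewhere in this repo (this is not the actual `perf.py` harness, and the `max_threads`/`bbuf_size_kb` parameter names are assumptions):

```python
# Sketch: time a nogds load of one file and report effective bandwidth.
import os
import time

from fastsafetensors import SafeTensorsFileLoader, SingleGroup

path = "gpt2/model.safetensors"  # hypothetical local path

# nogds=True is the only mode available on ROCm; thread/buffer knobs use assumed names.
loader = SafeTensorsFileLoader(SingleGroup(), "cuda:0", nogds=True,
                               max_threads=16, bbuf_size_kb=16 * 1024)
loader.add_filenames({0: [path]})

start = time.perf_counter()
bufs = loader.copy_files_to_device()
elapsed = time.perf_counter() - start

size_gb = os.path.getsize(path) / (1024 ** 3)
print(f"{size_gb / elapsed:.2f} GB/s in {elapsed:.3f}s")

bufs.close()
loader.close()
```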

#### Performance Results

##### GPT-2 (523MB) - Single GPU Tests

| Test # | Method | Threads | Buffer | Config | Bandwidth | Elapsed Time | Notes |
|--------|--------|---------|--------|--------|-----------|--------------|-------|
| 1 | nogds | 16 | 16MB | default | **1.91 GB/s** | 0.268s | Baseline test |
| 2 | nogds | 32 | 32MB | default | **2.07 GB/s** | 0.246s | Higher threads/buffer |
| 3 | nogds | 8 | 8MB | default | **2.10 GB/s** | 0.243s | Lower threads/buffer |
| 4 | mmap | N/A | N/A | default | **1.01 GB/s** | 0.505s | Memory-mapped |
| 5 | nogds | 32 | 32MB | cache-drop | **1.24 GB/s** | 0.410s | Cold cache test |
| 6 | nogds | 32 | 32MB | F16 dtype | **0.77 GB/s** | 0.332s | With type conversion |
| 8 | nogds | 16 | 16MB | **optimal** | **2.62 GB/s** | 0.195s | Best config |

##### GPT-2 Medium (1.4GB) - Single GPU Tests

| Test # | Method | Threads | Buffer | Block Size | Bandwidth | Elapsed Time | Notes |
|--------|--------|---------|--------|------------|-----------|--------------|-------|
| 9 | nogds | 16 | 16MB | 160MB | **6.02 GB/s** | 0.235s | Optimal config |
| 10 | mmap | N/A | N/A | N/A | **1.28 GB/s** | 1.104s | Memory-mapped |
| 11 | nogds | 32 | 32MB | 160MB | **5.34 GB/s** | 0.265s | Higher threads |

---
Binary file added docs/images/fastsafetensors-rocm.png
9 changes: 9 additions & 0 deletions fastsafetensors/common.py
@@ -14,6 +14,15 @@
from .st_types import Device, DType


def is_gpu_found():
"""Check if any GPU (CUDA or HIP) is available.

Returns True if either CUDA or ROCm/HIP GPUs are detected.
This allows code to work transparently across both platforms.
"""
return fstcpp.is_cuda_found() or fstcpp.is_hip_found()


def get_device_numa_node(device: Optional[int]) -> Optional[int]:
if device is None or not sys.platform.startswith("linux"):
return None
37 changes: 28 additions & 9 deletions fastsafetensors/copier/gds.py
@@ -5,7 +5,7 @@
from typing import Dict, Optional

from .. import cpp as fstcpp
from ..common import SafeTensorsMetadata
from ..common import SafeTensorsMetadata, is_gpu_found
from ..frameworks import FrameworkOpBase, TensorBase
from ..st_types import Device, DeviceType, DType
from .base import CopierInterface
@@ -30,12 +30,29 @@ def __init__(
self.fh: Optional[fstcpp.gds_file_handle] = None
self.copy_reqs: Dict[int, int] = {}
self.aligned_length = 0
cudavers = list(map(int, framework.get_cuda_ver().split(".")))
# CUDA 12.2 (GDS version 1.7) introduces support for non O_DIRECT file descriptors
# Compatible with CUDA 11.x
self.o_direct = not (
cudavers[0] > 12 or (cudavers[0] == 12 and cudavers[1] >= 2)
)
cuda_ver = framework.get_cuda_ver()
if cuda_ver and cuda_ver != "0.0":
# Parse version string (e.g., "cuda-12.1" or "hip-5.7.0")
# Extract the numeric part after the platform prefix
ver_parts = cuda_ver.split("-", 1)
if len(ver_parts) == 2:
cudavers = list(map(int, ver_parts[1].split(".")))
# CUDA 12.2 (GDS version 1.7) introduces support for non O_DIRECT file descriptors
# Compatible with CUDA 11.x
# Only applies to CUDA platform (not ROCm/HIP)
if ver_parts[0] == "cuda":
self.o_direct = not (
cudavers[0] > 12 or (cudavers[0] == 12 and cudavers[1] >= 2)
)
else:
# ROCm/HIP platform, use O_DIRECT
self.o_direct = True
else:
# Fallback if format is unexpected
self.o_direct = True
else:
# No GPU platform detected, use O_DIRECT
self.o_direct = True

def set_o_direct(self, enable: bool):
self.o_direct = enable
@@ -151,8 +168,10 @@ def new_gds_file_copier(
nogds: bool = False,
):
device_is_not_cpu = device.type != DeviceType.CPU
if device_is_not_cpu and not fstcpp.is_cuda_found():
raise Exception("[FAIL] libcudart.so does not exist")
if device_is_not_cpu and not is_gpu_found():
raise Exception(
"[FAIL] GPU runtime library (libcudart.so or libamdhip64.so) does not exist"
)
if not fstcpp.is_cufile_found() and not nogds:
warnings.warn(
"libcufile.so does not exist but nogds is False. use nogds=True",
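Restated as a standalone helper, the O_DIRECT decision in the hunk above reduces to: parse the `<platform>-<major>.<minor>` version string, and only CUDA 12.2+ may skip O_DIRECT. A minimal sketch mirroring the diff (not library code):

```python
def needs_o_direct(ver: str) -> bool:
    """Mirror of the o_direct logic above: True unless the platform is CUDA >= 12.2."""
    if not ver or ver == "0.0":
        return True                      # no GPU runtime detected
    platform, _, nums = ver.partition("-")
    if platform != "cuda" or not nums:
        return True                      # ROCm/HIP or unexpected format
    major, minor, *_ = (int(x) for x in nums.split("."))
    return not (major > 12 or (major == 12 and minor >= 2))

assert needs_o_direct("hip-5.7.0") is True
assert needs_o_direct("cuda-12.1") is True
assert needs_o_direct("cuda-12.2") is False
```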
37 changes: 37 additions & 0 deletions fastsafetensors/cpp/cuda_compat.h
@@ -0,0 +1,37 @@
// SPDX-License-Identifier: Apache-2.0
/*
* CUDA/HIP compatibility layer for fastsafetensors
* Minimal compatibility header - only defines what hipify-perl doesn't handle
*/

#ifndef __CUDA_COMPAT_H__
#define __CUDA_COMPAT_H__

// Platform detection - this gets hipified to check __HIP_PLATFORM_AMD__
#ifdef __HIP_PLATFORM_AMD__
#ifndef USE_ROCM
#define USE_ROCM
#endif
// Note: We do NOT include <hip/hip_runtime.h> here to avoid compile-time dependencies.
// Instead, we dynamically load the ROCm runtime library (libamdhip64.so) at runtime
// using dlopen(), just like we do for CUDA (libcudart.so).
// Minimal types are defined in ext.hpp.
#else
// For CUDA platform, we also avoid including headers and define minimal types in ext.hpp
#endif

// Runtime library name - hipify-perl doesn't change string literals
#ifdef USE_ROCM
#define GPU_RUNTIME_LIB "libamdhip64.so"
#else
#define GPU_RUNTIME_LIB "libcudart.so"
#endif

// Custom function pointer names that hipify-perl doesn't recognize
// These are our own naming in ext_funcs struct, not standard CUDA API
#ifdef USE_ROCM
#define cudaDeviceMalloc hipDeviceMalloc
#define cudaDeviceFree hipDeviceFree
#endif

#endif // __CUDA_COMPAT_H__
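The same runtime-probing idea, sketched from the Python side with ctypes (illustrative only; the extension does this in C++ via `dlopen()`):

```python
# Sketch: probe for a GPU runtime without any compile-time CUDA/ROCm dependency,
# mirroring the dlopen() strategy described in cuda_compat.h.
import ctypes

def count_gpus() -> tuple[str, int]:
    # (library, device-count symbol) pairs; library names match the header's GPU_RUNTIME_LIB values.
    for lib_name, symbol in (("libcudart.so", "cudaGetDeviceCount"),
                             ("libamdhip64.so", "hipGetDeviceCount")):
        try:
            lib = ctypes.CDLL(lib_name)
        except OSError:
            continue                       # runtime not installed, try the next one
        count = ctypes.c_int(0)
        if getattr(lib, symbol)(ctypes.byref(count)) == 0:
            return lib_name, count.value
    return "none", 0

print(count_gpus())
```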
43 changes: 40 additions & 3 deletions fastsafetensors/cpp/ext.cpp
@@ -10,6 +10,7 @@
#include <chrono>
#include <dlfcn.h>

#include "cuda_compat.h"
#include "ext.hpp"

#define ALIGN 4096
@@ -78,6 +79,7 @@ ext_funcs_t cpu_fns = ext_funcs_t {
ext_funcs_t cuda_fns;

static bool cuda_found = false;
static bool is_hip_runtime = false; // Track if we loaded HIP (not auto-hipified)
static bool cufile_found = false;

static int cufile_ver = 0;
@@ -89,7 +91,7 @@ template <typename T> void mydlsym(T** h, void* lib, std::string const& name) {
static void load_nvidia_functions() {
cudaError_t (*cudaGetDeviceCount)(int*);
const char* cufileLib = "libcufile.so.0";
const char* cudartLib = "libcudart.so";
const char* cudartLib = GPU_RUNTIME_LIB;
const char* numaLib = "libnuma.so.1";
bool init_log = getenv(ENV_ENABLE_INIT_LOG);
int mode = RTLD_LAZY | RTLD_GLOBAL | RTLD_NODELETE;
@@ -122,8 +124,12 @@ static void load_nvidia_functions() {
count = 0; // why cudaGetDeviceCount returns non-zero for errors?
}
cuda_found = count > 0;
// Detect if we loaded HIP runtime (ROCm) vs CUDA runtime
if (cuda_found && std::string(cudartLib).find("hip") != std::string::npos) {
is_hip_runtime = true;
}
if (init_log) {
fprintf(stderr, "[DEBUG] device count=%d, cuda_found=%d\n", count, cuda_found);
fprintf(stderr, "[DEBUG] device count=%d, cuda_found=%d, is_hip_runtime=%d\n", count, cuda_found, is_hip_runtime);
}
} else {
cuda_found = false;
@@ -217,11 +223,28 @@ static void load_nvidia_functions() {
}
}

// Note: is_cuda_found gets auto-hipified to is_hip_found on ROCm builds
// So this function will be is_hip_found() after hipification on ROCm
bool is_cuda_found()
{
return cuda_found;
}

// Separate function that always returns false on ROCm (CUDA not available on ROCm)
// This will be used for the "is_cuda_found" Python export on ROCm builds
bool cuda_not_available()
{
return false; // On ROCm, CUDA is never available
}

// Separate function for checking HIP runtime detection (not hipified)
// On CUDA: checks if HIP runtime was detected
// On ROCm: not used (is_cuda_found gets hipified to is_hip_found)
bool check_hip_runtime()
{
return is_hip_runtime;
}

bool is_cufile_found()
{
return cufile_found;
@@ -718,7 +741,21 @@ cpp_metrics_t get_cpp_metrics() {

PYBIND11_MODULE(__MOD_NAME__, m)
{
m.def("is_cuda_found", &is_cuda_found);
// Export both is_cuda_found and is_hip_found on all platforms
// Use string concatenation to prevent hipify from converting the export names
#ifdef USE_ROCM
// On ROCm after hipify:
// - is_cuda_found() becomes is_hip_found(), so export it as "is_hip_found"
// - Export cuda_not_available() as "is_cuda_found" (CUDA not available on ROCm)
m.def(("is_" "cuda" "_found"), &cuda_not_available); // Returns false on ROCm
m.def(("is_" "hip" "_found"), &is_cuda_found); // hipified to is_hip_found, returns hip status
#else
// On CUDA:
// - is_cuda_found() checks for CUDA
// - check_hip_runtime() checks if HIP runtime was loaded
m.def(("is_" "cuda" "_found"), &is_cuda_found);
m.def(("is_" "hip" "_found"), &check_hip_runtime);
#endif
m.def("is_cufile_found", &is_cufile_found);
m.def("cufile_version", &cufile_version);
m.def("set_debug_log", &set_debug_log);
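Seen from Python, the net effect of this export shuffle is that both names are always present, and `common.is_gpu_found()` simply ORs them. A hedged sketch of the expected behaviour (not a test from this PR):

```python
# Expected behaviour of the bindings after this change (illustrative):
from fastsafetensors import cpp as fstcpp

print(fstcpp.is_cuda_found())    # CUDA wheel: True if libcudart.so found a device; ROCm wheel: always False
print(fstcpp.is_hip_found())     # ROCm wheel: True if libamdhip64.so found a device; CUDA wheel: HIP-runtime check
print(fstcpp.is_cufile_found())  # False on ROCm, since there is no GDS/cuFile equivalent

# common.is_gpu_found() combines the two checks:
from fastsafetensors.common import is_gpu_found
assert is_gpu_found() == (fstcpp.is_cuda_found() or fstcpp.is_hip_found())
```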
10 changes: 10 additions & 0 deletions fastsafetensors/cpp/ext.hpp
@@ -15,6 +15,8 @@
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>

#include "cuda_compat.h"

#define ENV_ENABLE_INIT_LOG "FASTSAFETENSORS_ENABLE_INIT_LOG"

#ifndef __MOD_NAME__
@@ -33,8 +35,16 @@ typedef struct CUfileDescr_t {
const void *fs_ops; /* CUfileFSOps_t */
} CUfileDescr_t;
typedef struct CUfileError { CUfileOpError err; } CUfileError_t;

// Define minimal CUDA/HIP types for both platforms to avoid compile-time dependencies
// We load all GPU functions dynamically at runtime via dlopen()
typedef enum cudaError { cudaSuccess = 0, cudaErrorMemoryAllocation = 2 } cudaError_t;
// Platform-specific enum values - CUDA and HIP have different values for HostToDevice
#ifdef USE_ROCM
enum cudaMemcpyKind { cudaMemcpyHostToDevice=1, cudaMemcpyDefault = 4 };
#else
enum cudaMemcpyKind { cudaMemcpyHostToDevice=2, cudaMemcpyDefault = 4 };
#endif


typedef enum CUfileFeatureFlags {