14 changes: 14 additions & 0 deletions .claude/commands/0-fix-issue.md
@@ -0,0 +1,14 @@
Please analyze and fix the GitHub issue: $ARGUMENTS.

Follow these steps:

0. Create a new branch for the issue
1. Use `gh issue view` to get the issue details
2. Understand the problem described in the issue
3. Search the codebase for relevant files
4. Implement the necessary changes to fix the issue
5. Write and run tests to verify the fix
6. Ensure code passes linting and type checking
7. Create a descriptive commit message

Remember to use the GitHub CLI (`gh`) for all GitHub-related tasks.
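
A rough sketch of how these steps might look in practice for a hypothetical issue #123 (the issue number, branch name, and commit message below are placeholders):

```bash
# Placeholder issue number and branch name
git checkout -b fix/issue-123
gh issue view 123

# ...implement the fix, then verify...
cmake -B build -DLLAMA_BUILD_TESTS=ON
cmake --build build --config Release
cd build && ctest && cd ..

git commit -am "fix: resolve crash described in issue #123"
```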
5 changes: 5 additions & 0 deletions .claude/commands/1-create-pr.md
@@ -0,0 +1,5 @@
# Create Pull Request Command

Ensure the current branch is pushed (committing and pushing any outstanding changes if it is not), then submit a pull request using `gh pr create`.

Do NOT add a Claude co-authorship footer to commits or "🤖 Generated with Claude Code" to the body of pull requests.
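
A minimal sketch, assuming the work is already committed (the PR title and issue number are placeholders):

```bash
git push -u origin HEAD
gh pr create --title "Fix tokenizer crash on empty input" --body "Closes #123"
```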
20 changes: 20 additions & 0 deletions .claude/commands/2-review-failing-pipeline.md
@@ -0,0 +1,20 @@
This branch is currently failing the CI pipeline.

Please review the PR and the associated pipeline run, and fix the issues.

Use the following commands to review the pipeline:

### How to get the PR number for the current branch
```
gh pr status
```

### How to get the run ID of the failed job (filter by the branch name)
```
gh run list --branch <branch-name>
```

### How to get the logs of the failed job in the pipeline
```
gh run view <run-id> --log-failed
```
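
Put together, a review pass might look like this (the branch name and run ID are placeholders):

```bash
gh pr status
gh run list --branch fix/issue-123
gh run view 1234567890 --log-failed
```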
6 changes: 6 additions & 0 deletions .gitignore
@@ -147,3 +147,9 @@ poetry.toml
# Local scripts
/run-vim.sh
/run-chat.sh

.specstory

# Model files
models/
*.gguf
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
[submodule "ggml"]
path = ggml
url = https://github.com/skyne98/ggml-gfx906
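
For reference, a sketch of how this submodule gets pulled in (`<fork-url>` is a placeholder for this fork's repository URL):

```bash
# Fresh clone including the ggml submodule
git clone --recurse-submodules <fork-url>

# Or, in an existing checkout:
git submodule update --init --recursive
```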
130 changes: 130 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,130 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Overview
llama.cpp-gfx906 is a high-performance C/C++ implementation for LLM inference with AMD GFX906 GPU support. It is a specialized fork of llama.cpp focused on the AMD GFX906 architecture (e.g. the Instinct MI50).

## Build Commands

### Standard CPU Build
```bash
# Initialize submodules (required for ggml)
git submodule update --init --recursive

cmake -B build
cmake --build build --config Release
```

### AMD GPU Build (GFX906)
```bash
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx906
cmake --build build --config Release

# GFX906-optimized build (when available)
cmake -B build -DGGML_HIP=ON -DGGML_HIP_GFX906_OPTIMIZED=ON -DAMDGPU_TARGETS=gfx906
cmake --build build --config Release
```

### Debug Build
```bash
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build
```

## Testing

### Build and Run All Tests
```bash
cmake -B build -DLLAMA_BUILD_TESTS=ON
cmake --build build --config Release
cd build && ctest
```

### Run Specific Test Categories
```bash
ctest -L main # Main functionality
ctest -L model # Model loading
```

### Run Individual Tests
```bash
./build/bin/test-backend-ops
./build/bin/test-quantize-fns
./build/bin/test-tokenizer-0 ./models/ggml-vocab-llama-bpe.gguf
```

### Running Benchmarks
```bash
# Performance benchmark
./build/bin/llama-bench -m model.gguf

# Perplexity testing
./build/bin/llama-perplexity -m model.gguf -f file.txt

# Profile with rocprof (AMD GPU)
rocprof --stats --hip-trace ./build/bin/llama-cli -m model.gguf -p "prompt" -n 100
```

## Architecture

### Layer Structure
1. **GGML Layer** (`ggml/`): Low-level tensor operations and backend implementations
- `ggml/src/ggml.c`: Core tensor library
- `ggml/src/ggml-cuda/`: NVIDIA GPU kernels
- `ggml/src/ggml-hip/`: AMD GPU kernels (GFX906 optimizations)
- `ggml/src/ggml-backend.c`: Backend abstraction layer

2. **LLaMA Layer** (`src/`): Model implementation and inference engine
- `src/llama.cpp`: Main inference engine - coordinates model loading, context management, and inference
- `src/llama-model.*`: Model format handling and weight loading
- `src/llama-vocab.*`: Tokenization across different vocab types (BPE, SPM, etc.)
- `src/llama-sampling.*`: Sampling strategies (greedy, top-k, top-p, etc.)

3. **Tools Layer** (`tools/`): User-facing applications
- `tools/main/`: CLI tool for model inference (`llama-cli`)
- `tools/server/`: HTTP server with OpenAI API compatibility (`llama-server`)
- `tools/quantize/`: Model quantization utilities (`llama-quantize`)
- `tools/perplexity/`: Model quality metrics (`llama-perplexity`)
- `tools/llama-bench/`: Performance benchmarking (`llama-bench`)

### Key Design Patterns
- **Backend Abstraction**: All compute operations go through the ggml-backend interface, allowing seamless switching between the CPU, CUDA, HIP, and Vulkan backends
- **Model Format**: Uses GGUF (GGML Universal Format) for model storage with metadata and tensor data
- **Memory Management**: Custom allocators with mmap support for efficient large model loading
- **Quantization**: Supports multiple quantization levels (Q4_0, Q5_K_M, etc.) defined in `ggml/include/ggml.h`

## Development Guidelines

### Adding New Features
- Model architecture additions go in `src/llama.cpp` (search for `llm_load_arch`)
- New sampling methods belong in `src/llama-sampling.cpp`
- Backend kernels should be added to respective backend directories under `ggml/src/`

### GFX906 Specific Development
- GFX906 optimizations are documented in `docs/gfx906/`
- Key hardware features: V_DOT4_I32_I8, V_DOT2_F32_F16, 64KB LDS
- Refer to `docs/gfx906/optimization_plan.md` for optimization strategy
- Check `docs/gfx906/implementation_guide.md` for kernel implementations
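
As a quick sanity check before kernel work, the following (assuming ROCm is installed on the host) confirms that the GPU is detected as gfx906:

```bash
rocminfo | grep -i gfx906
```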

### Before Committing
1. Run clang-format on modified files
2. Build with tests enabled and run ctest
3. Test with both CPU and GPU builds if modifying backend code
4. Check performance impact with llama-bench and perplexity tools
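
A minimal sketch of that checklist (the file and model names are placeholders):

```bash
clang-format -i src/llama-sampling.cpp        # format modified files
cmake -B build -DLLAMA_BUILD_TESTS=ON
cmake --build build --config Release
cd build && ctest && cd ..                    # build and run tests
./build/bin/llama-bench -m model.gguf         # check performance impact
```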

### Common Development Tasks
- **Add new model architecture**: Modify `llm_load_arch()` and `llm_build_*()` functions in `src/llama.cpp`
- **Implement new operator**: Add to `ggml/src/ggml.c` and implement in relevant backends
- **Add sampling method**: Extend `src/llama-sampling.cpp` with new sampling strategy
- **Debug tokenization**: Use `tools/test-tokenizer-*.cpp` utilities
- **Optimize for GFX906**: Follow patterns in `ggml/src/ggml-hip/` and reference `docs/gfx906/`

## Important Configuration
- C++17 required
- CMake 3.14+ required
- For AMD GPU: ROCm toolkit and HIP compiler required
- Environment variables:
- `HIP_VISIBLE_DEVICES`: Control AMD GPU visibility
- `CUDA_VISIBLE_DEVICES`: Control NVIDIA GPU visibility
- `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1`: Enable unified memory for CUDA
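
For example, to pin inference to the first AMD GPU (the model path is a placeholder):

```bash
HIP_VISIBLE_DEVICES=0 ./build/bin/llama-cli -m model.gguf -p "Hello" -n 64
```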
94 changes: 94 additions & 0 deletions Dockerfile.gfx906
@@ -0,0 +1,94 @@
# Optimized Docker image for GFX906 (AMD Instinct MI50) development
ARG ROCM_VERSION=6.2
ARG UBUNTU_VERSION=22.04

# Development base with all ROCm tools
FROM rocm/dev-ubuntu-${UBUNTU_VERSION}:${ROCM_VERSION} AS dev-base

# Set GFX906-specific environment
ENV AMDGPU_TARGETS=gfx906 \
HSA_OVERRIDE_GFX_VERSION=9.0.6 \
ROCM_PATH=/opt/rocm \
HIP_PLATFORM=amd \
PATH=${ROCM_PATH}/bin:${ROCM_PATH}/llvm/bin:$PATH \
LD_LIBRARY_PATH=${ROCM_PATH}/lib:${ROCM_PATH}/lib64:$LD_LIBRARY_PATH \
HIPCC_COMPILE_FLAGS="-O3 -ffast-math -march=native" \
HIPCC_LINK_FLAGS="-O3" \
HSA_ENABLE_SDMA=0 \
GPU_MAX_HW_QUEUES=8 \
GPU_NUM_COMPUTE_RINGS=8 \
AMD_LOG_LEVEL=3 \
HSA_ENABLE_LARGE_BAR=1

# Install development dependencies
RUN apt-get update && apt-get install -y \
build-essential \
cmake \
ninja-build \
git \
vim \
gdb \
ccache \
python3-pip \
python3-dev \
rocm-dev \
rocm-libs \
rocm-utils \
roctracer-dev \
rocprofiler-dev \
&& pip3 install --upgrade pip numpy scipy \
&& rm -rf /var/lib/apt/lists/*

# Set up ccache
ENV CCACHE_DIR=/workspace/.ccache \
CCACHE_MAXSIZE=10G \
CMAKE_CXX_COMPILER_LAUNCHER=ccache \
CMAKE_C_COMPILER_LAUNCHER=ccache

# Create workspace
WORKDIR /workspace
RUN mkdir -p /workspace/llama.cpp-gfx906 /workspace/models /workspace/benchmarks

# Development stage with extra tools
FROM dev-base AS development

RUN apt-get update && apt-get install -y \
clang-format \
clang-tidy \
tmux \
htop \
&& rm -rf /var/lib/apt/lists/*

VOLUME ["/workspace"]
CMD ["/bin/bash"]

# Builder stage
FROM dev-base AS builder

COPY . /workspace/llama.cpp-gfx906/
WORKDIR /workspace/llama.cpp-gfx906

# Initialize ggml submodule (required for build)
RUN git submodule update --init --recursive || \
(echo "Note: Submodule initialization failed (expected in Docker build)" && \
echo "Ensure submodules are initialized before building Docker image")

RUN cmake -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_HIP=ON \
-DAMDGPU_TARGETS=gfx906 \
-G Ninja \
&& cmake --build build --config Release -j$(nproc)

# Runtime stage
FROM rocm/runtime-ubuntu-${UBUNTU_VERSION}:${ROCM_VERSION} AS runtime

ENV HSA_OVERRIDE_GFX_VERSION=9.0.6 \
LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH

COPY --from=builder /workspace/llama.cpp-gfx906/build/bin/* /usr/local/bin/
COPY --from=builder /workspace/llama.cpp-gfx906/build/lib/*.so /usr/local/lib/

WORKDIR /models
VOLUME ["/models"]
ENTRYPOINT ["/usr/local/bin/llama-cli"]
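
Assumed usage of the stages above (the image tags are placeholders; ROCm containers need `/dev/kfd` and `/dev/dri` passed through):

```bash
docker build -f Dockerfile.gfx906 --target runtime -t llama-gfx906:runtime .

docker run -it --device=/dev/kfd --device=/dev/dri \
    -v "$(pwd)/models:/models" \
    llama-gfx906:runtime -m /models/model.gguf -p "Hello" -n 64
```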
29 changes: 29 additions & 0 deletions Dockerfile.gfx906-test
@@ -0,0 +1,29 @@
# Quick test Docker image for GFX906
FROM rocm/dev-ubuntu-22.04:6.2

# Set GFX906 environment
ENV AMDGPU_TARGETS=gfx906 \
HSA_OVERRIDE_GFX_VERSION=9.0.6 \
ROCM_PATH=/opt/rocm \
PATH=${ROCM_PATH}/bin:$PATH \
LD_LIBRARY_PATH=${ROCM_PATH}/lib:${ROCM_PATH}/lib64:$LD_LIBRARY_PATH

# Install minimal dependencies
RUN apt-get update && apt-get install -y \
build-essential \
cmake \
git \
&& rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /workspace

# Copy the project
COPY . /workspace/llama.cpp-gfx906/

# Build the project
WORKDIR /workspace/llama.cpp-gfx906
RUN cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx906 && \
cmake --build build --config Release -j$(nproc)

CMD ["/bin/bash"]
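
A possible smoke test of this image (the tag is a placeholder):

```bash
docker build -f Dockerfile.gfx906-test -t llama-gfx906:test .
docker run -it --device=/dev/kfd --device=/dev/dri llama-gfx906:test ./build/bin/test-backend-ops
```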
1 change: 1 addition & 0 deletions README.md
@@ -37,6 +37,7 @@ Getting started with llama.cpp is straightforward. Here are several ways to inst
- Run with Docker - see our [Docker documentation](docs/docker.md)
- Download pre-built binaries from the [releases page](https://github.com/ggml-org/llama.cpp/releases)
- Build from source by cloning this repository - check out [our build guide](docs/build.md)
- **Note:** When building from source, remember to initialize submodules with `git submodule update --init --recursive`

Once installed, you'll need a model to work with. Head to the [Obtaining and quantizing models](#obtaining-and-quantizing-models) section to learn more.
