14 changes: 14 additions & 0 deletions .claude/commands/0-fix-issue.md
@@ -0,0 +1,14 @@
Please analyze and fix the GitHub issue: $ARGUMENTS.

Follow these steps:

0. Create a new branch for the issue
1. Use `gh issue view` to get the issue details
2. Understand the problem described in the issue
3. Search the codebase for relevant files
4. Implement the necessary changes to fix the issue
5. Write and run tests to verify the fix
6. Ensure code passes linting and type checking
7. Create a descriptive commit message

Remember to use the GitHub CLI (`gh`) for all GitHub-related tasks.
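
A rough sketch of how these steps might look in practice for a hypothetical issue #123 (the issue number, branch name, and commit message below are placeholders):

```bash
# Placeholder issue number and branch name
git checkout -b fix/issue-123
gh issue view 123

# ...implement the fix, then verify...
cmake -B build -DLLAMA_BUILD_TESTS=ON
cmake --build build --config Release
cd build && ctest && cd ..

git commit -am "fix: resolve crash described in issue #123"
```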
5 changes: 5 additions & 0 deletions .claude/commands/1-create-pr.md
@@ -0,0 +1,5 @@
# Create Pull Request Command

Ensure the current branch is pushed (committing and pushing any outstanding changes if it is not), then submit a pull request using `gh pr create`.

Do NOT add a Claude co-authorship footer to commits or "🤖 Generated with Claude Code" to the body of pull requests.
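
A minimal sketch, assuming the work is already committed (the PR title and issue number are placeholders):

```bash
git push -u origin HEAD
gh pr create --title "Fix tokenizer crash on empty input" --body "Closes #123"
```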
20 changes: 20 additions & 0 deletions .claude/commands/2-review-failing-pipeline.md
@@ -0,0 +1,20 @@
This branch is currently failing the CI pipeline.

Please review the PR and the associated pipeline run, and fix the issues.

Use the following commands to review the pipeline:

### How to get the PR number for the current branch
```
gh pr status
```

### How to get the run ID of the failed job (filter by the branch name)
```
gh run list --branch <branch-name>
```

### How to get the logs of the failed job in the pipeline
```
gh run view <run-id> --log-failed
```
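
Put together, a review pass might look like this (the branch name and run ID are placeholders):

```bash
gh pr status
gh run list --branch fix/issue-123
gh run view 1234567890 --log-failed
```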
6 changes: 6 additions & 0 deletions .gitignore
@@ -147,3 +147,9 @@ poetry.toml
# Local scripts
/run-vim.sh
/run-chat.sh

.specstory

# Model files
models/
*.gguf
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
[submodule "ggml"]
path = ggml
url = https://github.com/skyne98/ggml-gfx906
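
For reference, a sketch of how this submodule gets pulled in (`<fork-url>` is a placeholder for this fork's repository URL):

```bash
# Fresh clone including the ggml submodule
git clone --recurse-submodules <fork-url>

# Or, in an existing checkout:
git submodule update --init --recursive
```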
130 changes: 130 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,130 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Overview
llama.cpp-gfx906 is a high-performance C/C++ implementation for LLM inference with AMD GFX906 GPU support. It is a specialized fork of llama.cpp focused on the AMD GFX906 architecture (e.g. the Instinct MI50).

## Build Commands

### Standard CPU Build
```bash
# Initialize submodules (required for ggml)
git submodule update --init --recursive

cmake -B build
cmake --build build --config Release
```

### AMD GPU Build (GFX906)
```bash
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx906
cmake --build build --config Release

# GFX906-optimized build (when available)
cmake -B build -DGGML_HIP=ON -DGGML_HIP_GFX906_OPTIMIZED=ON -DAMDGPU_TARGETS=gfx906
cmake --build build --config Release
```

### Debug Build
```bash
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build
```

## Testing

### Build and Run All Tests
```bash
cmake -B build -DLLAMA_BUILD_TESTS=ON
cmake --build build --config Release
cd build && ctest
```

### Run Specific Test Categories
```bash
ctest -L main # Main functionality
ctest -L model # Model loading
```

### Run Individual Tests
```bash
./build/bin/test-backend-ops
./build/bin/test-quantize-fns
./build/bin/test-tokenizer-0 ./models/ggml-vocab-llama-bpe.gguf
```

### Running Benchmarks
```bash
# Performance benchmark
./build/bin/llama-bench -m model.gguf

# Perplexity testing
./build/bin/llama-perplexity -m model.gguf -f file.txt

# Profile with rocprof (AMD GPU)
rocprof --stats --hip-trace ./build/bin/llama-cli -m model.gguf -p "prompt" -n 100
```

## Architecture

### Layer Structure
1. **GGML Layer** (`ggml/`): Low-level tensor operations and backend implementations
- `ggml/src/ggml.c`: Core tensor library
- `ggml/src/ggml-cuda/`: NVIDIA GPU kernels
- `ggml/src/ggml-hip/`: AMD GPU kernels (GFX906 optimizations)
- `ggml/src/ggml-backend.c`: Backend abstraction layer

2. **LLaMA Layer** (`src/`): Model implementation and inference engine
- `src/llama.cpp`: Main inference engine - coordinates model loading, context management, and inference
- `src/llama-model.*`: Model format handling and weight loading
- `src/llama-vocab.*`: Tokenization across different vocab types (BPE, SPM, etc.)
- `src/llama-sampling.*`: Sampling strategies (greedy, top-k, top-p, etc.)

3. **Tools Layer** (`tools/`): User-facing applications
- `tools/main/`: CLI tool for model inference (`llama-cli`)
- `tools/server/`: HTTP server with OpenAI API compatibility (`llama-server`)
- `tools/quantize/`: Model quantization utilities (`llama-quantize`)
- `tools/perplexity/`: Model quality metrics (`llama-perplexity`)
- `tools/llama-bench/`: Performance benchmarking (`llama-bench`)

### Key Design Patterns
- **Backend Abstraction**: All compute operations go through the ggml-backend interface, allowing seamless switching between the CPU, CUDA, HIP, and Vulkan backends
- **Model Format**: Uses GGUF (GGML Universal Format) for model storage with metadata and tensor data
- **Memory Management**: Custom allocators with mmap support for efficient large model loading
- **Quantization**: Supports multiple quantization levels (Q4_0, Q5_K_M, etc.) defined in `ggml/include/ggml.h`

## Development Guidelines

### Adding New Features
- Model architecture additions go in `src/llama.cpp` (search for `llm_load_arch`)
- New sampling methods belong in `src/llama-sampling.cpp`
- Backend kernels should be added to respective backend directories under `ggml/src/`

### GFX906 Specific Development
- GFX906 optimizations are documented in `docs/gfx906/`
- Key hardware features: V_DOT4_I32_I8, V_DOT2_F32_F16, 64KB LDS
- Refer to `docs/gfx906/optimization_plan.md` for optimization strategy
- Check `docs/gfx906/implementation_guide.md` for kernel implementations
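
As a quick sanity check before kernel work, the following (assuming ROCm is installed on the host) confirms that the GPU is detected as gfx906:

```bash
rocminfo | grep -i gfx906
```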

### Before Committing
1. Run clang-format on modified files
2. Build with tests enabled and run ctest
3. Test with both CPU and GPU builds if modifying backend code
4. Check performance impact with llama-bench and perplexity tools
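
A minimal sketch of that checklist (the file and model names are placeholders):

```bash
clang-format -i src/llama-sampling.cpp        # format modified files
cmake -B build -DLLAMA_BUILD_TESTS=ON
cmake --build build --config Release
cd build && ctest && cd ..                    # build and run tests
./build/bin/llama-bench -m model.gguf         # check performance impact
```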

### Common Development Tasks
- **Add new model architecture**: Modify `llm_load_arch()` and `llm_build_*()` functions in `src/llama.cpp`
- **Implement new operator**: Add to `ggml/src/ggml.c` and implement in relevant backends
- **Add sampling method**: Extend `src/llama-sampling.cpp` with new sampling strategy
- **Debug tokenization**: Use `tools/test-tokenizer-*.cpp` utilities
- **Optimize for GFX906**: Follow patterns in `ggml/src/ggml-hip/` and reference `docs/gfx906/`

## Important Configuration
- C++17 required
- CMake 3.14+ required
- For AMD GPU: ROCm toolkit and HIP compiler required
- Environment variables:
- `HIP_VISIBLE_DEVICES`: Control AMD GPU visibility
- `CUDA_VISIBLE_DEVICES`: Control NVIDIA GPU visibility
- `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1`: Enable unified memory for CUDA
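
For example, to pin inference to the first AMD GPU (the model path is a placeholder):

```bash
HIP_VISIBLE_DEVICES=0 ./build/bin/llama-cli -m model.gguf -p "Hello" -n 64
```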
94 changes: 94 additions & 0 deletions Dockerfile.gfx906
@@ -0,0 +1,94 @@
# Optimized Docker image for GFX906 (AMD Instinct MI50) development
ARG ROCM_VERSION=6.2
ARG UBUNTU_VERSION=22.04

# Development base with all ROCm tools
FROM rocm/dev-ubuntu-${UBUNTU_VERSION}:${ROCM_VERSION} AS dev-base

# Set GFX906-specific environment
ENV AMDGPU_TARGETS=gfx906 \
HSA_OVERRIDE_GFX_VERSION=9.0.6 \
ROCM_PATH=/opt/rocm \
HIP_PLATFORM=amd \
PATH=${ROCM_PATH}/bin:${ROCM_PATH}/llvm/bin:$PATH \
LD_LIBRARY_PATH=${ROCM_PATH}/lib:${ROCM_PATH}/lib64:$LD_LIBRARY_PATH \
HIPCC_COMPILE_FLAGS="-O3 -ffast-math -march=native" \
HIPCC_LINK_FLAGS="-O3" \
HSA_ENABLE_SDMA=0 \
GPU_MAX_HW_QUEUES=8 \
GPU_NUM_COMPUTE_RINGS=8 \
AMD_LOG_LEVEL=3 \
HSA_ENABLE_LARGE_BAR=1

# Install development dependencies
RUN apt-get update && apt-get install -y \
build-essential \
cmake \
ninja-build \
git \
vim \
gdb \
ccache \
python3-pip \
python3-dev \
rocm-dev \
rocm-libs \
rocm-utils \
roctracer-dev \
rocprofiler-dev \
&& pip3 install --upgrade pip numpy scipy \
&& rm -rf /var/lib/apt/lists/*

# Set up ccache
ENV CCACHE_DIR=/workspace/.ccache \
CCACHE_MAXSIZE=10G \
CMAKE_CXX_COMPILER_LAUNCHER=ccache \
CMAKE_C_COMPILER_LAUNCHER=ccache

# Create workspace
WORKDIR /workspace
RUN mkdir -p /workspace/llama.cpp-gfx906 /workspace/models /workspace/benchmarks

# Development stage with extra tools
FROM dev-base AS development

RUN apt-get update && apt-get install -y \
clang-format \
clang-tidy \
tmux \
htop \
&& rm -rf /var/lib/apt/lists/*

VOLUME ["/workspace"]
CMD ["/bin/bash"]

# Builder stage
FROM dev-base AS builder

COPY . /workspace/llama.cpp-gfx906/
WORKDIR /workspace/llama.cpp-gfx906

# Initialize ggml submodule (required for build)
RUN git submodule update --init --recursive || \
(echo "Note: Submodule initialization failed (expected in Docker build)" && \
echo "Ensure submodules are initialized before building Docker image")

RUN cmake -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_HIP=ON \
-DAMDGPU_TARGETS=gfx906 \
-G Ninja \
&& cmake --build build --config Release -j$(nproc)

# Runtime stage
FROM rocm/runtime-ubuntu-${UBUNTU_VERSION}:${ROCM_VERSION} AS runtime

ENV HSA_OVERRIDE_GFX_VERSION=9.0.6 \
LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH

COPY --from=builder /workspace/llama.cpp-gfx906/build/bin/* /usr/local/bin/
COPY --from=builder /workspace/llama.cpp-gfx906/build/lib/*.so /usr/local/lib/

WORKDIR /models
VOLUME ["/models"]
ENTRYPOINT ["/usr/local/bin/llama-cli"]
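
Assumed usage of the stages above (the image tags are placeholders; ROCm containers need `/dev/kfd` and `/dev/dri` passed through):

```bash
docker build -f Dockerfile.gfx906 --target runtime -t llama-gfx906:runtime .

docker run -it --device=/dev/kfd --device=/dev/dri \
    -v "$(pwd)/models:/models" \
    llama-gfx906:runtime -m /models/model.gguf -p "Hello" -n 64
```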
29 changes: 29 additions & 0 deletions Dockerfile.gfx906-test
@@ -0,0 +1,29 @@
# Quick test Docker image for GFX906
FROM rocm/dev-ubuntu-22.04:6.2

# Set GFX906 environment
ENV AMDGPU_TARGETS=gfx906 \
HSA_OVERRIDE_GFX_VERSION=9.0.6 \
ROCM_PATH=/opt/rocm \
PATH=${ROCM_PATH}/bin:$PATH \
LD_LIBRARY_PATH=${ROCM_PATH}/lib:${ROCM_PATH}/lib64:$LD_LIBRARY_PATH

# Install minimal dependencies
RUN apt-get update && apt-get install -y \
build-essential \
cmake \
git \
&& rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /workspace

# Copy the project
COPY . /workspace/llama.cpp-gfx906/

# Build the project
WORKDIR /workspace/llama.cpp-gfx906
RUN cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx906 && \
cmake --build build --config Release -j$(nproc)

CMD ["/bin/bash"]
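
A possible smoke test of this image (the tag is a placeholder):

```bash
docker build -f Dockerfile.gfx906-test -t llama-gfx906:test .
docker run -it --device=/dev/kfd --device=/dev/dri llama-gfx906:test ./build/bin/test-backend-ops
```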
1 change: 1 addition & 0 deletions README.md
@@ -37,6 +37,7 @@ Getting started with llama.cpp is straightforward. Here are several ways to inst
- Run with Docker - see our [Docker documentation](docs/docker.md)
- Download pre-built binaries from the [releases page](https://github.com/ggml-org/llama.cpp/releases)
- Build from source by cloning this repository - check out [our build guide](docs/build.md)
- **Note:** When building from source, remember to initialize submodules with `git submodule update --init --recursive`

Once installed, you'll need a model to work with. Head to the [Obtaining and quantizing models](#obtaining-and-quantizing-models) section to learn more.
