Conversation

@grinco grinco commented Mar 11, 2025

Tested on v0.5.13 on Linux. The image was built using the supplied Dockerfile, with the caveat that the release image was bumped to 24.04 (from 20.04).

Build command:

docker buildx build --platform linux/amd64 ${OLLAMA_COMMON_BUILD_ARGS} -t grinco/ollama-amd-apu:vulkan .

Tested on AMD Ryzen 7 8845HS w/ Radeon 780M Graphics with ROCm disabled

[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST   /v1/completions           --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST   /v1/embeddings            --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models                --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models/:model         --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (6 handlers)
time=2025-03-11T13:00:40.793Z level=INFO source=gpu.go:199 msg="vulkan: load libvulkan and libcap ok"
time=2025-03-11T13:00:40.877Z level=INFO source=gpu.go:421 msg="error looking up vulkan GPU memory" error="device is a CPU"
time=2025-03-11T13:00:40.878Z level=WARN source=amd_linux.go:443 msg="amdgpu detected, but no compatible rocm library found.  Either install rocm v6, or follow manual install instructions at https://github.com/ollama/ollama/blob/main/docs/linux.md#manual-install"
time=2025-03-11T13:00:40.878Z level=WARN source=amd_linux.go:348 msg="unable to verify rocm library: no suitable rocm found, falling back to CPU"
time=2025-03-11T13:00:40.879Z level=INFO source=types.go:137 msg="inference compute" id=0 library=vulkan variant="" compute=1.3 driver=1.3 name="AMD Radeon Graphics (RADV GFX1103_R1)" total="15.6 GiB" available="15.6 GiB"
 # ollama run phi4:14b
>>> /set verbose
Set 'verbose' mode.
>>> how's it going?
Hello! I'm here to help you with any questions or tasks you have. How can I assist you today? 😊

total duration:       3.341959745s
load duration:        18.165612ms
prompt eval count:    15 token(s)
prompt eval duration: 475ms
prompt eval rate:     31.58 tokens/s
eval count:           26 token(s)
eval duration:        2.846s
eval rate:            9.14 tokens/s
>>>

qusaismael and others added 30 commits February 8, 2025 12:28
Ollama requires vcruntime140_1.dll, which isn't found on the 2019 runner. Previously
the job used the windows runner (2019) but it explicitly installs
2022 to build the app. Since the sign job doesn't actually build
anything, it can use the windows-2022 runner instead.
* wrap ggml_backend_load_best in try/catch
* ignore non-ollama paths
Removing the channel tag from the URL so it will always go to the current stable channel.
feat: add new Ollama engine using ggml through cgo

This change introduces a new way to run pretrained models. It introduces three high-level interfaces and a number of smaller helper interfaces to facilitate this.

- `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`
- `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`
- `ml.Tensor` defines the interface for a tensor and tensor operations
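
As a rough sketch, using names assumed from the descriptions above (the real definitions in `model/model.go` and `ml/backend.go` carry more methods and options), the three interfaces fit together roughly like this:

package sketch

// ml.Backend: loads a pretrained model onto hardware and exposes its tensors.
type Backend interface {
    Get(name string) Tensor // look up a loaded tensor by name
    NewContext() Context
}

// ml.Context: builds and runs a compute graph.
type Context interface {
    Compute(Tensor) Tensor
}

// ml.Tensor: a tensor plus the operations used to build a graph.
type Tensor interface {
    Shape() []int
    Add(ctx Context, t2 Tensor) Tensor
    Mulmat(ctx Context, t2 Tensor) Tensor
}

// model.Model: one implementation per architecture (llama, mllama, ...);
// Forward runs the model's forward propagation to produce output logits.
type Model interface {
    Forward(ctx Context, inputs []int32) (Tensor, error)
}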

This is the first implementation of the new engine. Follow up PRs will implement more features:

- non-greedy sampling (ollama#8410)
- integration with Ollama and KV caching (ollama#8301)
- more model support (ollama#9080) with more coming soon

Co-authored-by: Bruce MacDonald <[email protected]>
It is not common to return errors with close/free operations - most
people won't check them and even if they did there's probably not much
they could do. It's better not to give implementations false expectations.
Currently there is a mixture of int and int64 used when dealing with
tensor dimensions and shapes, which causes unnecessary conversions -
they all should be the same type.

In general, most interfaces (such as Pytorch) use int64 for
generality but most implementations (such as CUDA) use int32 for
performance. There isn't much benefit to us in being more flexible
than the implementations we are likely to run on.

In addition, as a practical matter, a model with a tensor with a single
dimension larger than 32 bits is unlikely to run on a 32-bit machine.
There are two cases where we may not have an output after computing:
 - Prompt processing where the length of the input exceeds the batch
   size
 - Internal memory management operations such as cache defrag and shift
Most tensor backends try to optimize performance by using a lower
precision for matmuls. However, some operations (such as kq) on
some models are sensitive to this and require full precision.
Passing in a Go buffer is not safe because the garbage collector could
free or move the memory while the context is still open. However, if
we pass in the size and a nil pointer then GGML will allocate it from
the C side.
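As a minimal cgo sketch of that idea (assuming the ggml headers are visible to cgo; not the actual Ollama backend code):

package ggml

/*
#include "ggml.h"
*/
import "C"

// newContext asks GGML to allocate its own working buffer by passing a size
// and a nil mem_buffer, so the memory lives on the C side and the Go garbage
// collector can never move or free it while the context is open.
func newContext(size int) *C.struct_ggml_context {
    return C.ggml_init(C.struct_ggml_init_params{
        mem_size:   C.size_t(size),
        mem_buffer: nil, // nil: GGML allocates from the C side
        no_alloc:   false,
    })
}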
We need to sync before retrieving data after async computation.
It is also important to ensure that the Go buffer is not moved by
the GC across function calls so we do a synchronous copy.
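Continuing the cgo sketch above (add "unsafe" to the imports; the Tensor wrapper and its sched/t/elems fields are hypothetical, while ggml_backend_sched_synchronize and ggml_backend_tensor_get are the GGML calls being wrapped):

// Floats waits for pending async work, then copies the tensor data into a
// fresh Go slice in a single synchronous cgo call, so the GC cannot move the
// destination buffer while C code is writing into it.
func (t *Tensor) Floats() []float32 {
    C.ggml_backend_sched_synchronize(t.sched) // sync before reading results
    out := make([]float32, t.elems)
    C.ggml_backend_tensor_get(t.t, unsafe.Pointer(&out[0]), 0, C.size_t(len(out)*4))
    return out
}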
Currently, if a model uses an interface for its data structures (as mllama
does) then the tensor data in the structs implementing that interface will
not get loaded.
Special tokens are currently read as uint32 from the model metadata.
However, all other parts of the system (including the tokenizer) use
int32 to represent tokens so it is impossible to represent the high
portion of the unsigned range. For consistency and to avoid casts,
we should just use int32 everywhere.
This allows the list of models to live in a file of its own rather
than being mixed into the runner code.
This provides integration with the new Ollama engine
(5824541 next ollama runner (ollama#7913)) and the rest of the Ollama
infrastructure such as the runner and Ollama server.

In addition, it also builds out the KV cache infrastructure to
support requirements of how Ollama runs models such as:
 - Parallel processing
 - Memory management for defragmentation and shifting
 - Multi-modal models

Both old and new engines continue to be supported. By default, only
the old engine is used. To enable the new engine:

Start the server with the OLLAMA_NEW_ENGINE environment variable set:
OLLAMA_NEW_ENGINE=1 ./ollama serve

Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M:
./ollama run jessegross/llama3.1
In some cases, the directories in the executable path read by
filepath.EvalSymlinks are not accessible, resulting in permission
errors that in turn break running models. It also doesn't work well
with long paths on Windows, which likewise produces errors. This
change removes the filepath.EvalSymlinks call when accessing
os.Executable() altogether.
Provides a better approach than ollama#9088: attempt to evaluate
symlinks (important for macOS, where 'ollama' is often a symlink),
but use the result of os.Executable() as a fallback in scenarios
where filepath.EvalSymlinks fails due to permission errors or other
issues.
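A minimal sketch of that fallback logic (not the exact Ollama code):

package main

import (
    "log"
    "os"
    "path/filepath"
)

// executablePath resolves symlinks when possible (important on macOS, where
// 'ollama' is often a symlink) but falls back to the raw os.Executable()
// result if EvalSymlinks fails, e.g. due to permission errors or long paths.
func executablePath() string {
    exe, err := os.Executable()
    if err != nil {
        log.Printf("unable to determine executable path: %v", err)
        return ""
    }
    if resolved, err := filepath.EvalSymlinks(exe); err == nil {
        return resolved
    }
    return exe
}

func main() {
    log.Println("running from", executablePath())
}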
We currently print system info before the GGML backends are loaded.
This results in only getting information about the default lowest
common denominator runner. If we move the GGML init earlier then we can
see what we are actually running.

Before:
time=2025-02-14T11:15:07.606-08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | cgo(gcc)" threads=24

After:
time=2025-02-14T11:16:02.936-08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | cgo(gcc)" threads=24
mxyng and others added 29 commits March 13, 2025 11:40
Co-authored-by: Jeffrey Morgan <[email protected]>
fix: error on models that don't support embeddings
Add metadata and tensor information to the show command to be able to
see more information about a model. This outputs the same data as
shown on the model details page on ollama.com
fix: error if image requested without vision model
The largest operation by far is (q @ k), so just count that for
simplicity.
ollama#9746)

Replace large-chunk blob downloads with parallel small-chunk
verification to solve timeout and performance issues. Registry users
experienced progressively slowing download speeds as large-chunk
transfers aged, often timing out completely.

The previous approach downloaded blobs in a few large chunks but
required a separate, single-threaded pass to read the entire blob back
from disk for verification after download completion.

This change uses the new chunksums API to fetch many smaller
chunk+digest pairs, allowing concurrent downloads and immediate
verification as each chunk arrives. Chunks are written directly to their
final positions, eliminating the entire separate verification pass.

The result is more reliable downloads that maintain speed throughout the
transfer process and significantly faster overall completion, especially
over unstable connections or with large blobs.
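
As a rough sketch of the chunk-based flow described above (the chunk type and its fields are hypothetical; the real chunksums API and download code differ):

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "io"
    "net/http"
    "os"
    "sync"
)

type chunk struct {
    URL    string // URL for this chunk, as returned by the registry
    Offset int64  // where the chunk lands in the final blob
    Digest string // expected sha256 of the chunk, hex-encoded
}

// downloadChunks fetches chunks concurrently, verifies each digest as it
// arrives, and writes verified data directly to its final offset, so no
// separate verification pass over the whole blob is needed.
func downloadChunks(dst *os.File, chunks []chunk) error {
    var wg sync.WaitGroup
    errs := make(chan error, len(chunks))
    for _, c := range chunks {
        wg.Add(1)
        go func(c chunk) {
            defer wg.Done()
            resp, err := http.Get(c.URL)
            if err != nil {
                errs <- err
                return
            }
            defer resp.Body.Close()
            data, err := io.ReadAll(resp.Body)
            if err != nil {
                errs <- err
                return
            }
            if sum := sha256.Sum256(data); hex.EncodeToString(sum[:]) != c.Digest {
                errs <- fmt.Errorf("digest mismatch at offset %d", c.Offset)
                return
            }
            if _, err := dst.WriteAt(data, c.Offset); err != nil {
                errs <- err
            }
        }(c)
    }
    wg.Wait()
    close(errs)
    for err := range errs {
        return err
    }
    return nil
}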
This commit refactors the LLM subsystem by removing internal subprocess
request and response types. It consolidates duplicate type definitions
across the codebase, moving them to centralized locations. The change also
standardizes interfaces between components, simplifies the ServerStatusResp
struct, and moves the ParseDurationMs function to a common package. This
cleanup reduces code duplication between different runner implementations
(llamarunner and ollamarunner).
Models may require that a set of inputs all be processed as part
of the same batch. For example, if an image has multiple patches
with fully connected attention between them, we should not split
the batch in the middle of an image.

Fixes ollama#9697
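A minimal sketch of that batching rule (the input type and field names are hypothetical; the real runner tracks more state):

package main

import "fmt"

type input struct {
    token     int32
    sameBatch int // size of the group starting at this input that must stay in one batch; 0 for plain tokens
}

// splitBatches groups inputs into batches of at most batchSize, starting a new
// batch early rather than splitting a sameBatch group (e.g. an image) in two.
func splitBatches(inputs []input, batchSize int) [][]input {
    var batches [][]input
    var cur []input
    for i := 0; i < len(inputs); {
        group := 1
        if inputs[i].sameBatch > 1 {
            group = inputs[i].sameBatch
        }
        if len(cur) > 0 && len(cur)+group > batchSize {
            batches = append(batches, cur) // flush before the group, never inside it
            cur = nil
        }
        cur = append(cur, inputs[i:i+group]...)
        i += group
    }
    if len(cur) > 0 {
        batches = append(batches, cur)
    }
    return batches
}

func main() {
    inputs := []input{{token: 1}, {token: 2}, {sameBatch: 3}, {}, {}, {token: 3}}
    fmt.Println(len(splitBatches(inputs, 4))) // 2: the 3-input group stays together
}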
Currently there is a single context per sequence, shared by
all multimodal inputs. Since we build a vision encoder graph per
image, with a large number of inputs we can eventually hit the
maximum number of graph nodes per context.

This change uses a separate context for each image, ensuring
that available resource limits are consistent.
Previously, processing multiple images in a batch would trigger
segfaults, so sending images together was disabled as a way to
mitigate this. The trigger was processing one image on the CPU
and one on the GPU.

This can no longer happen:
 - The vision encoder is now on the GPU so both images would be
   processed on the GPU.
 - We require images to be fully contained in a batch and each
   image including its special tokens is over half the batch size.
   As a result, we will never get two images in the same batch.

Fixes ollama#9731
Darwin was using a different pattern for the version string
than Linux or Windows.
…lama#9775)

This sets the agent header in DefaultRegistry to include the version of
the client, OS, and architecture in the previous format, with a minor
twist.

Note: The version is obtained from the build info instead of the
version in version.Version, which should no longer be necessary and
can be removed in a future commit. Using the build info is more accurate and
also provides extra build information if the build is not tagged, and if
it is "dirty". Previously, the version was just "0.0.0" with no other
helpful information. The ollama.com registry and others handle this
swimmingly.
This fixes the case where a FROM line in a previous Modelfile points to a
file which may or may not be present in a different ollama instance. We
shouldn't rely on the filename, though, and should instead check whether
the FROM line is a valid model name and point to that.
… 0.6.0 whyvl#21

Patch provided by McBane87 on whyvl#21

Signed-off-by: Vadim Grinco <[email protected]>
* readme: add Ellama to list of community integrations (ollama#9800)

* readme: add screenpipe to community integrations (ollama#9786)

* Add support for ROCm gfx1151 (ollama#9773)

* conditionally enable parallel pipelines

* sample: make mutations in transforms explicit (ollama#9743)

* updated minP to use early exit making use of sorted tokens

* ml/backend/ggml: allocate memory with malloc when loading model (ollama#9822)

* runner: remove cache prompt flag from ollama runner (ollama#9826)

We do not need to bypass prompt caching in the ollama runner yet, as
only embedding models need to bypass it. When embedding models are
implemented they can skip initializing this cache completely.

* ollamarunner: Check for minBatch of context space when shifting

Models can specify that a group of inputs needs to be handled in a single
batch. However, context shifting didn't respect this and could split
such a group anyway. In this case, we should instead trigger a context
shift earlier so that it occurs before the grouped batch.

Note that there are still some corner cases:
 - A long prompt that exceeds the context window can get truncated
   in the middle of an image. With the current models, this will
   result in the model not recognizing the image at all, which is
   pretty much the expected result with truncation.
 - The context window is set to less than the minimum batch size. The
   only solution to this is to refuse to load the model with these
   settings. However, this can never occur with current models and
   default settings.

Since users are unlikely to run into these scenarios, fixing them is
left as a follow up.

* Applied latest patches from McBane87

See this for details: whyvl#7 (comment)

Signed-off-by: Vadim Grinco <[email protected]>

* Add ability to enable flash attention on vulkan (#4)

* discover: add flash attention handling for vulkan
* envconfig: fix typo in config.go

As part of the process, some code was refactored and I added a new field,
FlashAttention, to GpuInfo, since the previous solution didn't allow for a
granular check via Vulkan extensions. As a side effect, this now allows
for granular per-device FA support checking in other places.
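
A rough sketch of what that per-device flag enables (field and function names are assumed, not the actual discover package types):

package sketch

// GpuInfo describes a detected device; FlashAttention is the new per-device flag.
type GpuInfo struct {
    Library        string // e.g. "vulkan", "cuda", "rocm"
    ID             string
    TotalMemory    uint64
    FreeMemory     uint64
    FlashAttention bool // set if the device/driver reports the required Vulkan extensions
}

// flashAttentionSupported reports whether every selected device supports flash
// attention, enabling a granular per-device check instead of a per-library one.
func flashAttentionSupported(gpus []GpuInfo) bool {
    for _, g := range gpus {
        if !g.FlashAttention {
            return false
        }
    }
    return true
}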

---------

Signed-off-by: Vadim Grinco <[email protected]>
Co-authored-by: zeo <[email protected]>
Co-authored-by: Louis Beaumont <[email protected]>
Co-authored-by: Daniel Hiltgen <[email protected]>
Co-authored-by: Michael Yang <[email protected]>
Co-authored-by: Parth Sareen <[email protected]>
Co-authored-by: Jeffrey Morgan <[email protected]>
Co-authored-by: Bruce MacDonald <[email protected]>
Co-authored-by: Jesse Gross <[email protected]>
Co-authored-by: Nikita <[email protected]>

jim3692 commented May 6, 2025

I tested this on my PC with qwen2.5:14b.

My specs are:
CPU: AMD Ryzen 5 3600 @4.3GHz
RAM: 32GB Dual Channel
GPU0: AMD Radeon RX 6600 8GB (amdgpu)
GPU1: NVIDIA GeForce RTX 2060 6GB (nouveau)


Since Docker couldn't build the current code, I relied on the ahmedsaed26/ollama-vulkan image from Docker Hub. The results were:

RX6600 + CPU
total duration:       1m0.061823615s
load duration:        15.372518ms
prompt eval count:    88 token(s)
prompt eval duration: 1.246s
prompt eval rate:     70.63 tokens/s
eval count:           559 token(s)
eval duration:        58.552s
eval rate:            9.55 tokens/s
RX6600 + RTX2060
total duration:       1m47.884894817s
load duration:        15.633929ms
prompt eval count:    88 token(s)
prompt eval duration: 8.814s
prompt eval rate:     9.98 tokens/s
eval count:           657 token(s)
eval duration:        1m38.806s
eval rate:            6.65 tokens/s

Building the image in this PR yielded similar token rates:

RX6600 + CPU
total duration:       1m30.72221318s
load duration:        16.250572ms
prompt eval count:    88 token(s)
prompt eval duration: 1.363173487s
prompt eval rate:     64.56 tokens/s
eval count:           854 token(s)
eval duration:        1m29.33952338s
eval rate:            9.56 tokens/s
RX6600 + RTX2060
total duration:       1m31.944748907s
load duration:        15.525984ms
prompt eval count:    88 token(s)
prompt eval duration: 12.231542017s
prompt eval rate:     7.19 tokens/s
eval count:           500 token(s)
eval duration:        1m19.694382766s
eval rate:            6.27 tokens/s

So, I did one final test. I switched the resulting image to Arch Linux:

- FROM ubuntu:24.04
- RUN apt-get update \
-     && apt-get install -y ca-certificates libcap2 libvulkan1 \
-     && apt-get clean \
-     && rm -rf /var/lib/apt/lists/*
+ FROM archlinux
+ RUN pacman-key --init \
+     && pacman -Syu --noconfirm \
+     && pacman -S --noconfirm ca-certificates libcap vulkan-nouveau vulkan-radeon

The results were much more promising. By using a newer Mesa driver (25.0.5-1), I got almost double the previous GPU performance and surpassed the RX6600 + CPU tests.

RX6600 + RTX2060
total duration:       51.471138982s
load duration:        15.677564ms
prompt eval count:    88 token(s)
prompt eval duration: 8.438230008s
prompt eval rate:     10.43 tokens/s
eval count:           487 token(s)
eval duration:        43.014531816s
eval rate:            11.32 tokens/s
