Merged latest stable and adjusted Dockerfile to replicate upstream build process #20
base: vulkan
Conversation
Ollama requires vcruntime140_1.dll, which isn't found on the 2019 runner. Previously the job used the windows (2019) runner, but it explicitly installs 2022 to build the app. Since the sign job doesn't actually build anything, it can use the windows-2022 runner instead.
* wrap ggml_backend_load_best in try/catch
* ignore non-ollama paths
Removing the channel tag from the URL so it will always go to the current stable channel.
Co-authored-by: Richard Lyons <[email protected]>
feat: add new Ollama engine using ggml through cgo

This change introduces a new way to run pretrained models. It introduces 3 high level interfaces and a bunch of smaller helper interfaces to facilitate this.

- `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`.
- `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc.) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`.
- `ml.Tensor` defines the interface for a tensor and tensor operations.

This is the first implementation of the new engine. Follow-up PRs will implement more features:

- non-greedy sampling (ollama#8410)
- integration with Ollama and KV caching (ollama#8301)
- more model support (ollama#9080), with more coming soon

Co-authored-by: Bruce MacDonald <[email protected]>
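A rough sketch of how these three interfaces fit together; everything beyond the names called out above (`Forward`, `model/model.go`, `ml/backend.go`) is an assumption for illustration, not the real package API:

```go
// Sketch only: the real definitions live in model/model.go and ml/backend.go.
package sketch

// Tensor is a handle to tensor data plus the operations models need.
type Tensor interface {
	Shape() []int
	Mulmat(other Tensor) Tensor // illustrative op; the real method set differs
}

// Backend is a tensor library (ggml here). It loads a pretrained model
// into hardware (GPU, CPU, ...) and hands its tensors to the model.
type Backend interface {
	Get(name string) Tensor // assumed lookup-by-name accessor
}

// Model is one architecture (llama, mllama, ...). Forward runs the
// forward propagation that generates completions.
type Model interface {
	Forward(b Backend, inputs []int32) (Tensor, error) // signature assumed
}
```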
It is not common to return errors from close/free operations - most people won't check the result, and even if they did there's probably not much they could do about it. It's better to not give implementations false expectations.
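For instance, an interface in that style might be shaped like this (illustrative only, not the actual code):

```go
package sketch

// Context owns native resources for one computation. Close frees them and
// deliberately returns no error: callers rarely check it and could not
// meaningfully recover from a failure anyway.
type Context interface {
	Compute() error
	Close()
}
```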
Currently there is a mixture of int and int64 used when dealing with tensor dimensions and shapes, which causes unnecessary conversions - they all should be the same type. In general, most interfaces (such as PyTorch) use int64 for generality, but most implementations (such as CUDA) use int32 for performance. There isn't much benefit to us in being more flexible than the implementations we are likely to run on. In addition, as a practical matter, a model with a tensor with a single dimension larger than 32 bits is unlikely to run on a 32-bit machine.
There are two cases where we may not have an output after computing:
- Prompt processing, where the length of the input exceeds the batch size
- Internal memory management operations such as cache defrag and shift
Most tensor backends try to optimize performance by using a lower precision for matmuls. However, some operations (such as kq) on some models are sensitive to this and require full precision.
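One way this can surface in a tensor interface is an explicit full-precision variant of the operation; the names below are illustrative assumptions, not necessarily the real ml API:

```go
package sketch

// Tensor exposes both a default matmul, which backends may run at reduced
// precision for speed, and a full-precision variant for sensitive ops
// such as kq.
type Tensor interface {
	Mulmat(other Tensor) Tensor         // may use lower-precision accumulation
	MulmatFullPrec(other Tensor) Tensor // forces full precision
}
```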
Passing in a Go buffer is not safe because the garbage collector could free or move the memory while the context is still open. However, if we pass in the size and a nil pointer then GGML will allocate it from the C side.
We need to sync before retrieving data after async computation. It is also important to ensure that the Go buffer is not moved by the GC across function calls so we do a synchronous copy.
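A minimal cgo sketch of both points, with hypothetical C helpers standing in for GGML: the buffer is allocated on the C side so the Go GC can never move or free it while C still holds the pointer, and the data is copied back into Go memory synchronously once the computation is done.

```go
package main

/*
#include <stdlib.h>
#include <string.h>

// Hypothetical stand-ins for GGML's buffer management.
static void *alloc_buffer(size_t n)        { return malloc(n); }
static void fill_buffer(void *p, size_t n) { memset(p, 42, n); } // "compute"
static void free_buffer(void *p)           { free(p); }
*/
import "C"

import (
	"fmt"
	"unsafe"
)

func main() {
	const n = 16

	// Allocate on the C side: passing a Go slice here would risk the GC
	// freeing or moving it while the context is still open.
	p := C.alloc_buffer(C.size_t(n))
	defer C.free_buffer(p)

	// A real backend would launch async work here and synchronize before
	// reading the results back.
	C.fill_buffer(p, C.size_t(n))

	// Synchronous copy into Go-managed memory after the sync point.
	out := make([]byte, n)
	copy(out, unsafe.Slice((*byte)(p), n))
	fmt.Println(out)
}
```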
Currently, if a model uses an interface for its data structures (as mllama does) then the tensor data in the structs implementing that interface will not get loaded.
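A minimal illustration of the problem, assuming a reflection-based loader (the real loader is different): unless the walk descends into interface values, tensors behind a field like `Vision` are silently skipped.

```go
package main

import (
	"fmt"
	"reflect"
)

// Tensor is a hypothetical stand-in for a field the loader must populate.
type Tensor struct{ Name string }

type VisionModel interface{ Encode() }

type VisionEncoder struct{ Patch Tensor }

func (VisionEncoder) Encode() {}

type Model struct {
	Embedding Tensor
	Vision    VisionModel // tensor data behind this interface is easy to miss
}

// countTensors walks a value reflectively, descending into interfaces and
// pointers so fields like Model.Vision are not skipped.
func countTensors(v reflect.Value) int {
	switch v.Kind() {
	case reflect.Interface, reflect.Pointer:
		if v.IsNil() {
			return 0
		}
		return countTensors(v.Elem())
	case reflect.Struct:
		if v.Type() == reflect.TypeOf(Tensor{}) {
			return 1
		}
		n := 0
		for i := 0; i < v.NumField(); i++ {
			n += countTensors(v.Field(i))
		}
		return n
	}
	return 0
}

func main() {
	m := Model{Vision: VisionEncoder{}}
	fmt.Println(countTensors(reflect.ValueOf(m))) // 2: Embedding and Vision.Patch
}
```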
Special tokens are currently read as uint32 from the model metadata. However, all other parts of the system (including the tokenizer) use int32 to represent tokens so it is impossible to represent the high portion of the unsigned range. For consistency and to avoid casts, we should just use int32 everywhere.
This adds a standalone file listing models, so that the list is not mixed into the runner code.
This provides integration with the new Ollama engine (5824541 next ollama runner (ollama#7913)) and the rest of the Ollama infrastructure, such as the runner and Ollama server. In addition, it also builds out the KV cache infrastructure to support requirements of how Ollama runs models, such as:

- Parallel processing
- Memory management for defragmentation and shifting
- Multi-modal models

Both old and new engines continue to be supported. By default, only the old engine is used. To enable the new engine:

1. Start the server with the OLLAMA_NEW_ENGINE environment variable set:
   OLLAMA_NEW_ENGINE=1 ./ollama serve
2. Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M:
   ./ollama run jessegross/llama3.1
In some cases, the directories in the executable path read by filepath.EvalSymlinks are not accessible, resulting in permission errors when running models. It also doesn't work well with long paths on Windows, likewise resulting in errors. This change removes filepath.EvalSymlinks when accessing os.Executable() altogether.
Provides a better approach to ollama#9088 that will attempt to evaluate symlinks (important for macOS, where 'ollama' is often a symlink), but use the result of os.Executable() as a fallback in scenarios where filepath.EvalSymlinks fails due to permission errors or other issues.
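A sketch of that fallback pattern (not the actual ollama code):

```go
package main

import (
	"log"
	"os"
	"path/filepath"
)

// executablePath resolves symlinks in the running binary's path (important on
// macOS, where 'ollama' is often a symlink), but falls back to the raw
// os.Executable() result when EvalSymlinks fails, e.g. because of permission
// errors or long Windows paths.
func executablePath() (string, error) {
	exe, err := os.Executable()
	if err != nil {
		return "", err
	}
	if resolved, err := filepath.EvalSymlinks(exe); err == nil {
		return resolved, nil
	}
	return exe, nil
}

func main() {
	p, err := executablePath()
	if err != nil {
		log.Fatal(err)
	}
	log.Println("running from", p)
}
```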
We currently print system info before the GGML backends are loaded. This results in only getting information about the default lowest common denominator runner. If we move up the GGML init then we can see what we are actually running.

Before:
time=2025-02-14T11:15:07.606-08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | cgo(gcc)" threads=24

After:
time=2025-02-14T11:16:02.936-08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | cgo(gcc)" threads=24
Co-authored-by: Jeffrey Morgan <[email protected]>
fix: error on models that don't support embeddings
Add metadata and tensor information to the show command so that more information about a model can be seen. This outputs the same data as shown on the model details page on ollama.com.
fix: error if image requested without vision model
The largest operation by far is (q @ k), so just count that for simplicity.
count gemma3 vision tensors
(ollama#9746) Replace large-chunk blob downloads with parallel small-chunk verification to solve timeout and performance issues. Registry users experienced progressively slowing download speeds as large-chunk transfers aged, often timing out completely.

The previous approach downloaded blobs in a few large chunks but required a separate, single-threaded pass to read the entire blob back from disk for verification after download completion. This change uses the new chunksums API to fetch many smaller chunk+digest pairs, allowing concurrent downloads and immediate verification as each chunk arrives. Chunks are written directly to their final positions, eliminating the entire separate verification pass.

The result is more reliable downloads that maintain speed throughout the transfer process and significantly faster overall completion, especially over unstable connections or with large blobs.
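A rough sketch of the idea (not the actual registry client; the chunk list stands in for what the chunksums API returns): each chunk is fetched with a Range request, hashed as it streams in, and written straight to its final offset, so no separate verification pass is needed.

```go
package blobdl

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
	"os"

	"golang.org/x/sync/errgroup"
)

// chunk is one range of the blob plus its expected digest, as returned by a
// chunksums-style endpoint (assumed shape, for illustration).
type chunk struct {
	offset, size int64
	sha256hex    string
}

func downloadChunk(url string, f *os.File, c chunk) error {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", c.offset, c.offset+c.size-1))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	// Hash the chunk while writing it directly to its final position.
	h := sha256.New()
	w := io.MultiWriter(io.NewOffsetWriter(f, c.offset), h)
	if _, err := io.Copy(w, resp.Body); err != nil {
		return err
	}
	if hex.EncodeToString(h.Sum(nil)) != c.sha256hex {
		return fmt.Errorf("chunk at offset %d: digest mismatch", c.offset)
	}
	return nil
}

// DownloadBlob fetches all chunks concurrently; any digest mismatch or
// transfer error fails the whole download.
func DownloadBlob(url, dest string, chunks []chunk) error {
	f, err := os.Create(dest)
	if err != nil {
		return err
	}
	defer f.Close()

	var g errgroup.Group
	g.SetLimit(8) // bounded concurrency
	for _, c := range chunks {
		c := c // copy for the closure (pre-Go 1.22 loop semantics)
		g.Go(func() error { return downloadChunk(url, f, c) })
	}
	return g.Wait()
}
```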
This commit refactors the LLM subsystem by removing internal subprocess request and response types. It consolidates duplicate type definitions across the codebase, moving them to centralized locations. The change also standardizes interfaces between components, simplifies the ServerStatusResp struct, and moves the ParseDurationMs function to a common package. This cleanup reduces code duplication between different runner implementations (llamarunner and ollamarunner).
Models may require that a set of inputs all be processed as part of the same batch. For example, if an image has multiple patches with fully connected attention between them, we should not split the batch in the middle of an image. Fixes ollama#9697
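A minimal sketch of such grouping (not the runner's actual batching code): inputs carrying the same non-zero group ID, e.g. the patches of one image, are never split across batch boundaries.

```go
package batching

// Input is one token or image patch; inputs with the same non-zero GroupID
// must land in the same batch.
type Input struct {
	Token   int32
	GroupID int // 0 means ungrouped
}

// Split partitions inputs into batches of at most batchSize, starting a new
// batch early rather than cutting a group in half. A group larger than
// batchSize still goes out as one oversized batch.
func Split(inputs []Input, batchSize int) [][]Input {
	var batches [][]Input
	var cur []Input
	for i := 0; i < len(inputs); {
		// Find the extent of the group starting at i.
		j := i + 1
		if inputs[i].GroupID != 0 {
			for j < len(inputs) && inputs[j].GroupID == inputs[i].GroupID {
				j++
			}
		}
		group := inputs[i:j]
		if len(cur) > 0 && len(cur)+len(group) > batchSize {
			batches = append(batches, cur)
			cur = nil
		}
		cur = append(cur, group...)
		i = j
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}
```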
Currently there is a single context per sequence, shared by all multimodal inputs. Since we build a vision encoder graph per image, with a large number of inputs we can eventually hit the maximum number of graph nodes per context. This changes it to use a separate context for each image, ensuring that available resource limits are consistent.
Previously processing multiple images in a batch would trigger segfaults, so sending images together was disabled as a way to mitigate this. The trigger was processing one image on the CPU and one on the GPU. This can no longer happen:
- The vision encoder is now on the GPU, so both images would be processed on the GPU.
- We require images to be fully contained in a batch, and each image including its special tokens is over half the batch size. As a result, we will never get two images in the same batch.

Fixes ollama#9731
Darwin was using a different pattern for the version string than Linux or Windows.
(ollama#9775) This sets the agent header in DefaultRegistry to include the version of the client, OS, and architecture in the previous format, with a minor twist. Note: the version is obtained from the build info, instead of the version in version.Version, which should no longer be necessary, but we can remove it in a future commit. Using the build info is more accurate and also provides extra build information if the build is not tagged, and if it is "dirty". Previously, the version was just "0.0.0" with no other helpful information. The ollama.com registry and others handle this swimmingly.
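A sketch of building such a User-Agent from the build info (the exact format string and fallback value are assumptions, not necessarily what the client emits):

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// userAgent derives the client version from the module build info rather than
// a hard-coded version constant, then appends OS, architecture and Go version.
func userAgent() string {
	version := "0.0.0"
	if info, ok := debug.ReadBuildInfo(); ok && info.Main.Version != "" {
		version = info.Main.Version
	}
	return fmt.Sprintf("ollama/%s (%s %s) Go/%s",
		version, runtime.GOARCH, runtime.GOOS, runtime.Version())
}

func main() { fmt.Println(userAgent()) }
```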
This fixes the case where a FROM line in a previous Modelfile points to a file which may or may not be present in a different ollama instance. We shouldn't be relying on the filename; instead, check whether the FROM line is a valid model name and point to that.
… 0.6.0 whyvl#21
Patch provided by McBane87 on whyvl#21
Signed-off-by: Vadim Grinco <[email protected]>
From: whyvl#7 (comment) Signed-off-by: Vadim Grinco <[email protected]>
Signed-off-by: Vadim Grinco <[email protected]>
* readme: add Ellama to list of community integrations (ollama#9800)
* readme: add screenpipe to community integrations (ollama#9786)
* Add support for ROCm gfx1151 (ollama#9773)
* conditionally enable parallel pipelines
* sample: make mutations in transforms explicit (ollama#9743)
* updated minP to use early exit making use of sorted tokens
* ml/backend/ggml: allocate memory with malloc when loading model (ollama#9822)
* runner: remove cache prompt flag from ollama runner (ollama#9826)

  We do not need to bypass the prompt caching in the ollama runner yet, as only embedding models needed to bypass the prompt caching. When embedding models are implemented they can skip initializing this cache completely.

* ollamarunner: Check for minBatch of context space when shifting

  Models can specify that a group of inputs need to be handled as a single batch. However, context shifting didn't respect this and could trigger a break anyway. In this case, we should instead trigger a context shift earlier so that it occurs before the grouped batch.

  Note that there are still some corner cases:
  - A long prompt that exceeds the context window can get truncated in the middle of an image. With the current models, this will result in the model not recognizing the image at all, which is pretty much the expected result with truncation.
  - The context window is set less than the minimum batch size. The only solution to this is to refuse to load the model with these settings. However, this can never occur with current models and default settings.

  Since users are unlikely to run into these scenarios, fixing them is left as a follow-up.

* Applied latest patches from McBane87

  See this for details: whyvl#7 (comment)

  Signed-off-by: Vadim Grinco <[email protected]>

* Add ability to enable flash attention on vulkan (#4)
  * discover: add flash attention handling for vulkan
  * envconfig: fix typo in config.go

  As part of the process some code was refactored and I added a new field FlashAttention to GpuInfo, since the previous solution didn't allow for a granular check via vulkan extensions. As a side effect, this now allows for granular per-device FA support checking in other places.

---------

Signed-off-by: Vadim Grinco <[email protected]>
Co-authored-by: zeo <[email protected]>
Co-authored-by: Louis Beaumont <[email protected]>
Co-authored-by: Daniel Hiltgen <[email protected]>
Co-authored-by: Michael Yang <[email protected]>
Co-authored-by: Parth Sareen <[email protected]>
Co-authored-by: Jeffrey Morgan <[email protected]>
Co-authored-by: Bruce MacDonald <[email protected]>
Co-authored-by: Jesse Gross <[email protected]>
Co-authored-by: Nikita <[email protected]>
I tested this on my PC [specs table not preserved]. Since Docker couldn't build the current code, I relied on the ahmedsaed26/ollama-vulkan image from Docker Hub. The results were: [token-rate tables for RX6600 + CPU and RX6600 + RTX2060 not preserved]. Building the image in this PR yielded similar token rates [tables not preserved]. So, I did one final test. I switched the resulting image to Arch Linux:

- FROM ubuntu:24.04
- RUN apt-get update \
-     && apt-get install -y ca-certificates libcap2 libvulkan1 \
-     && apt-get clean \
-     && rm -rf /var/lib/apt/lists/*
+ FROM archlinux
+ RUN pacman-key --init \
+     && pacman -Syu --noconfirm \
+     && pacman -S --noconfirm ca-certificates libcap vulkan-nouveau vulkan-radeon

And the results got much more promising. By using a newer Mesa driver (25.0.5-1), I got almost double the previous GPU performance, and surpassed the RX6600 + RTX2060 results.
Tested on v0.5.13 on Linux. The image was built using the supplied Dockerfile, with the caveat that the release image was bumped to 24.04 (from 20.04).
Build command: [not preserved]
Tested on AMD Ryzen 7 8845HS w/ Radeon 780M Graphics with ROCm disabled