Merged latest stable and adjusted Dockerfile to replicate upstream build process #20
base: vulkan
Conversation
Ollama requires vcruntime140_1.dll, which isn't found on the 2019 runner. Previously the job used the windows (2019) runner, but it explicitly installs 2022 to build the app. Since the sign job doesn't actually build anything, it can use the windows-2022 runner instead.
* wrap ggml_backend_load_best in try/catch
* ignore non-ollama paths
Removing the channel tag from the URL so it will always go to the current stable channel.
Co-authored-by: Richard Lyons <[email protected]>
feat: add new Ollama engine using ggml through cgo

This change introduces a new way to run pretrained models. It introduces 3 high level interfaces and a bunch of smaller helper interfaces to facilitate this.

- `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`.
- `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc.) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`.
- `ml.Tensor` defines the interface for a tensor and tensor operations.

This is the first implementation of the new engine. Follow-up PRs will implement more features:

- non-greedy sampling (ollama#8410)
- integration with Ollama and KV caching (ollama#8301)
- more model support (ollama#9080), with more coming soon

Co-authored-by: Bruce MacDonald <[email protected]>
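A rough sketch of how these three interfaces fit together; everything beyond the names called out above (`Forward`, `model/model.go`, `ml/backend.go`) is an assumption for illustration, not the real package API:

```go
// Sketch only: the real definitions live in model/model.go and ml/backend.go.
package sketch

// Tensor is a handle to tensor data plus the operations models need.
type Tensor interface {
	Shape() []int
	Mulmat(other Tensor) Tensor // illustrative op; the real method set differs
}

// Backend is a tensor library (ggml here). It loads a pretrained model
// into hardware (GPU, CPU, ...) and hands its tensors to the model.
type Backend interface {
	Get(name string) Tensor // assumed lookup-by-name accessor
}

// Model is one architecture (llama, mllama, ...). Forward runs the
// forward propagation that generates completions.
type Model interface {
	Forward(b Backend, inputs []int32) (Tensor, error) // signature assumed
}
```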
It is not common to return errors from close/free operations - most people won't check the result, and even if they did there's probably not much they could do about it. It's better to not give implementations false expectations.
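For instance, an interface in that style might be shaped like this (illustrative only, not the actual code):

```go
package sketch

// Context owns native resources for one computation. Close frees them and
// deliberately returns no error: callers rarely check it and could not
// meaningfully recover from a failure anyway.
type Context interface {
	Compute() error
	Close()
}
```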
Currently there is a mixture of int and int64 used when dealing with tensor dimensions and shapes, which causes unnecessary conversions - they all should be the same type. In general, most interfaces (such as PyTorch) use int64 for generality, but most implementations (such as CUDA) use int32 for performance. There isn't much benefit to us in being more flexible than the implementations we are likely to run on. In addition, as a practical matter, a model with a tensor with a single dimension larger than 32 bits is unlikely to run on a 32-bit machine.
There are two cases where we may not have an output after computing:
- Prompt processing, where the length of the input exceeds the batch size
- Internal memory management operations such as cache defrag and shift
Most tensor backends try to optimize performance by using a lower precision for matmuls. However, some operations (such as kq) on some models are sensitive to this and require full precision.
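One way this can surface in a tensor interface is an explicit full-precision variant of the operation; the names below are illustrative assumptions, not necessarily the real ml API:

```go
package sketch

// Tensor exposes both a default matmul, which backends may run at reduced
// precision for speed, and a full-precision variant for sensitive ops
// such as kq.
type Tensor interface {
	Mulmat(other Tensor) Tensor         // may use lower-precision accumulation
	MulmatFullPrec(other Tensor) Tensor // forces full precision
}
```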
Passing in a Go buffer is not safe because the garbage collector could free or move the memory while the context is still open. However, if we pass in the size and a nil pointer then GGML will allocate it from the C side.
We need to sync before retrieving data after async computation. It is also important to ensure that the Go buffer is not moved by the GC across function calls so we do a synchronous copy.
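A minimal cgo sketch of both points, with hypothetical C helpers standing in for GGML: the buffer is allocated on the C side so the Go GC can never move or free it while C still holds the pointer, and the data is copied back into Go memory synchronously once the computation is done.

```go
package main

/*
#include <stdlib.h>
#include <string.h>

// Hypothetical stand-ins for GGML's buffer management.
static void *alloc_buffer(size_t n)        { return malloc(n); }
static void fill_buffer(void *p, size_t n) { memset(p, 42, n); } // "compute"
static void free_buffer(void *p)           { free(p); }
*/
import "C"

import (
	"fmt"
	"unsafe"
)

func main() {
	const n = 16

	// Allocate on the C side: passing a Go slice here would risk the GC
	// freeing or moving it while the context is still open.
	p := C.alloc_buffer(C.size_t(n))
	defer C.free_buffer(p)

	// A real backend would launch async work here and synchronize before
	// reading the results back.
	C.fill_buffer(p, C.size_t(n))

	// Synchronous copy into Go-managed memory after the sync point.
	out := make([]byte, n)
	copy(out, unsafe.Slice((*byte)(p), n))
	fmt.Println(out)
}
```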
Currently, if a model uses an interface for its data structures (as mllama does) then the tensor data in the structs implementing that interface will not get loaded.
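A minimal illustration of the problem, assuming a reflection-based loader (the real loader is different): unless the walk descends into interface values, tensors behind a field like `Vision` are silently skipped.

```go
package main

import (
	"fmt"
	"reflect"
)

// Tensor is a hypothetical stand-in for a field the loader must populate.
type Tensor struct{ Name string }

type VisionModel interface{ Encode() }

type VisionEncoder struct{ Patch Tensor }

func (VisionEncoder) Encode() {}

type Model struct {
	Embedding Tensor
	Vision    VisionModel // tensor data behind this interface is easy to miss
}

// countTensors walks a value reflectively, descending into interfaces and
// pointers so fields like Model.Vision are not skipped.
func countTensors(v reflect.Value) int {
	switch v.Kind() {
	case reflect.Interface, reflect.Pointer:
		if v.IsNil() {
			return 0
		}
		return countTensors(v.Elem())
	case reflect.Struct:
		if v.Type() == reflect.TypeOf(Tensor{}) {
			return 1
		}
		n := 0
		for i := 0; i < v.NumField(); i++ {
			n += countTensors(v.Field(i))
		}
		return n
	}
	return 0
}

func main() {
	m := Model{Vision: VisionEncoder{}}
	fmt.Println(countTensors(reflect.ValueOf(m))) // 2: Embedding and Vision.Patch
}
```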
Special tokens are currently read as uint32 from the model metadata. However, all other parts of the system (including the tokenizer) use int32 to represent tokens so it is impossible to represent the high portion of the unsigned range. For consistency and to avoid casts, we should just use int32 everywhere.
This adds a standalone file listing models, so that the list is not mixed into the runner code.
This provides integration with the new Ollama engine (5824541 next ollama runner (ollama#7913)) and the rest of the Ollama infrastructure, such as the runner and Ollama server. In addition, it also builds out the KV cache infrastructure to support requirements of how Ollama runs models, such as:

- Parallel processing
- Memory management for defragmentation and shifting
- Multi-modal models

Both old and new engines continue to be supported. By default, only the old engine is used. To enable the new engine:

1. Start the server with the OLLAMA_NEW_ENGINE environment variable set:
   OLLAMA_NEW_ENGINE=1 ./ollama serve
2. Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M:
   ./ollama run jessegross/llama3.1
In some cases, the directories in the executable path read by filepath.EvalSymlinks are not accessible, resulting in permission errors when running models. It also doesn't work well with long paths on Windows, likewise resulting in errors. This change removes filepath.EvalSymlinks when accessing os.Executable() altogether.
Provides a better approach to ollama#9088 that will attempt to evaluate symlinks (important for macOS, where 'ollama' is often a symlink), but use the result of os.Executable() as a fallback in scenarios where filepath.EvalSymlinks fails due to permission errors or other issues.
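A sketch of that fallback pattern (not the actual ollama code):

```go
package main

import (
	"log"
	"os"
	"path/filepath"
)

// executablePath resolves symlinks in the running binary's path (important on
// macOS, where 'ollama' is often a symlink), but falls back to the raw
// os.Executable() result when EvalSymlinks fails, e.g. because of permission
// errors or long Windows paths.
func executablePath() (string, error) {
	exe, err := os.Executable()
	if err != nil {
		return "", err
	}
	if resolved, err := filepath.EvalSymlinks(exe); err == nil {
		return resolved, nil
	}
	return exe, nil
}

func main() {
	p, err := executablePath()
	if err != nil {
		log.Fatal(err)
	}
	log.Println("running from", p)
}
```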
We currently print system info before the GGML backends are loaded. This results in only getting information about the default lowest common denominator runner. If we move up the GGML init then we can see what we are actually running.

Before:
time=2025-02-14T11:15:07.606-08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | cgo(gcc)" threads=24

After:
time=2025-02-14T11:16:02.936-08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | cgo(gcc)" threads=24
Co-authored-by: Jeffrey Morgan <[email protected]>
fix: error on models that don't support embeddings
Add metadata and tensor information to the show command so that more information about a model can be seen. This outputs the same data as shown on the model details page on ollama.com.
fix: error if image requested without vision model
The largest operation by far is (q @ k), so just count that for simplicity.
count gemma3 vision tensors
(ollama#9746) Replace large-chunk blob downloads with parallel small-chunk verification to solve timeout and performance issues. Registry users experienced progressively slowing download speeds as large-chunk transfers aged, often timing out completely.

The previous approach downloaded blobs in a few large chunks but required a separate, single-threaded pass to read the entire blob back from disk for verification after download completion. This change uses the new chunksums API to fetch many smaller chunk+digest pairs, allowing concurrent downloads and immediate verification as each chunk arrives. Chunks are written directly to their final positions, eliminating the entire separate verification pass.

The result is more reliable downloads that maintain speed throughout the transfer process and significantly faster overall completion, especially over unstable connections or with large blobs.
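A rough sketch of the idea (not the actual registry client; the chunk list stands in for what the chunksums API returns): each chunk is fetched with a Range request, hashed as it streams in, and written straight to its final offset, so no separate verification pass is needed.

```go
package blobdl

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
	"os"

	"golang.org/x/sync/errgroup"
)

// chunk is one range of the blob plus its expected digest, as returned by a
// chunksums-style endpoint (assumed shape, for illustration).
type chunk struct {
	offset, size int64
	sha256hex    string
}

func downloadChunk(url string, f *os.File, c chunk) error {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", c.offset, c.offset+c.size-1))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	// Hash the chunk while writing it directly to its final position.
	h := sha256.New()
	w := io.MultiWriter(io.NewOffsetWriter(f, c.offset), h)
	if _, err := io.Copy(w, resp.Body); err != nil {
		return err
	}
	if hex.EncodeToString(h.Sum(nil)) != c.sha256hex {
		return fmt.Errorf("chunk at offset %d: digest mismatch", c.offset)
	}
	return nil
}

// DownloadBlob fetches all chunks concurrently; any digest mismatch or
// transfer error fails the whole download.
func DownloadBlob(url, dest string, chunks []chunk) error {
	f, err := os.Create(dest)
	if err != nil {
		return err
	}
	defer f.Close()

	var g errgroup.Group
	g.SetLimit(8) // bounded concurrency
	for _, c := range chunks {
		c := c // copy for the closure (pre-Go 1.22 loop semantics)
		g.Go(func() error { return downloadChunk(url, f, c) })
	}
	return g.Wait()
}
```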
This commit refactors the LLM subsystem by removing internal subprocess request and response types. It consolidates duplicate type definitions across the codebase, moving them to centralized locations. The change also standardizes interfaces between components, simplifies the ServerStatusResp struct, and moves the ParseDurationMs function to a common package. This cleanup reduces code duplication between different runner implementations (llamarunner and ollamarunner).
Models may require that a set of inputs all be processed as part of the same batch. For example, if an image has multiple patches with fully connected attention between them, we should not split the batch in the middle of an image. Fixes ollama#9697
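A minimal sketch of such grouping (not the runner's actual batching code): inputs carrying the same non-zero group ID, e.g. the patches of one image, are never split across batch boundaries.

```go
package batching

// Input is one token or image patch; inputs with the same non-zero GroupID
// must land in the same batch.
type Input struct {
	Token   int32
	GroupID int // 0 means ungrouped
}

// Split partitions inputs into batches of at most batchSize, starting a new
// batch early rather than cutting a group in half. A group larger than
// batchSize still goes out as one oversized batch.
func Split(inputs []Input, batchSize int) [][]Input {
	var batches [][]Input
	var cur []Input
	for i := 0; i < len(inputs); {
		// Find the extent of the group starting at i.
		j := i + 1
		if inputs[i].GroupID != 0 {
			for j < len(inputs) && inputs[j].GroupID == inputs[i].GroupID {
				j++
			}
		}
		group := inputs[i:j]
		if len(cur) > 0 && len(cur)+len(group) > batchSize {
			batches = append(batches, cur)
			cur = nil
		}
		cur = append(cur, group...)
		i = j
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}
```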
Currently there is a single context per sequence, shared by all multimodal inputs. Since we build a vision encoder graph per image, with a large number of inputs we can eventually hit the maximum number of graph nodes per context. This changes it to use a separate context for each image, ensuring that available resource limits are consistent.
Previously processing multiple images in a batch would trigger segfaults, so sending images together was disabled as a way to mitigate this. The trigger was processing one image on the CPU and one on the GPU. This can no longer happen:
- The vision encoder is now on the GPU, so both images would be processed on the GPU.
- We require images to be fully contained in a batch, and each image including its special tokens is over half the batch size. As a result, we will never get two images in the same batch.

Fixes ollama#9731
Darwin was using a different pattern for the version string than Linux or Windows.
(ollama#9775) This sets the agent header in DefaultRegistry to include the version of the client, OS, and architecture in the previous format, with a minor twist. Note: the version is obtained from the build info, instead of the version in version.Version, which should no longer be necessary, but we can remove it in a future commit. Using the build info is more accurate and also provides extra build information if the build is not tagged, and if it is "dirty". Previously, the version was just "0.0.0" with no other helpful information. The ollama.com registry and others handle this swimmingly.
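A sketch of building such a User-Agent from the build info (the exact format string and fallback value are assumptions, not necessarily what the client emits):

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// userAgent derives the client version from the module build info rather than
// a hard-coded version constant, then appends OS, architecture and Go version.
func userAgent() string {
	version := "0.0.0"
	if info, ok := debug.ReadBuildInfo(); ok && info.Main.Version != "" {
		version = info.Main.Version
	}
	return fmt.Sprintf("ollama/%s (%s %s) Go/%s",
		version, runtime.GOARCH, runtime.GOOS, runtime.Version())
}

func main() { fmt.Println(userAgent()) }
```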
This fixes the case where a FROM line in a previous Modelfile points to a file which may or may not be present in a different ollama instance. We shouldn't be relying on the filename; instead, check whether the FROM line is a valid model name and point to that.
… 0.6.0 whyvl#21
Patch provided by McBane87 on whyvl#21
Signed-off-by: Vadim Grinco <[email protected]>
From: whyvl#7 (comment) Signed-off-by: Vadim Grinco <[email protected]>
Signed-off-by: Vadim Grinco <[email protected]>
* readme: add Ellama to list of community integrations (ollama#9800)
* readme: add screenpipe to community integrations (ollama#9786)
* Add support for ROCm gfx1151 (ollama#9773)
* conditionally enable parallel pipelines
* sample: make mutations in transforms explicit (ollama#9743)
* updated minP to use early exit making use of sorted tokens
* ml/backend/ggml: allocate memory with malloc when loading model (ollama#9822)
* runner: remove cache prompt flag from ollama runner (ollama#9826)

  We do not need to bypass the prompt caching in the ollama runner yet, as only embedding models needed to bypass the prompt caching. When embedding models are implemented they can skip initializing this cache completely.

* ollamarunner: Check for minBatch of context space when shifting

  Models can specify that a group of inputs need to be handled as a single batch. However, context shifting didn't respect this and could trigger a break anyway. In this case, we should instead trigger a context shift earlier so that it occurs before the grouped batch.

  Note that there are still some corner cases:
  - A long prompt that exceeds the context window can get truncated in the middle of an image. With the current models, this will result in the model not recognizing the image at all, which is pretty much the expected result with truncation.
  - The context window is set less than the minimum batch size. The only solution to this is to refuse to load the model with these settings. However, this can never occur with current models and default settings.

  Since users are unlikely to run into these scenarios, fixing them is left as a follow-up.

* Applied latest patches from McBane87

  See this for details: whyvl#7 (comment)

  Signed-off-by: Vadim Grinco <[email protected]>

* Add ability to enable flash attention on vulkan (#4)
  * discover: add flash attention handling for vulkan
  * envconfig: fix typo in config.go

  As part of the process some code was refactored and I added a new field FlashAttention to GpuInfo, since the previous solution didn't allow for a granular check via vulkan extensions. As a side effect, this now allows for granular per-device FA support checking in other places.

---------

Signed-off-by: Vadim Grinco <[email protected]>
Co-authored-by: zeo <[email protected]>
Co-authored-by: Louis Beaumont <[email protected]>
Co-authored-by: Daniel Hiltgen <[email protected]>
Co-authored-by: Michael Yang <[email protected]>
Co-authored-by: Parth Sareen <[email protected]>
Co-authored-by: Jeffrey Morgan <[email protected]>
Co-authored-by: Bruce MacDonald <[email protected]>
Co-authored-by: Jesse Gross <[email protected]>
Co-authored-by: Nikita <[email protected]>
I tested this on my PC [specs table not preserved]. Since Docker couldn't build the current code, I relied on the ahmedsaed26/ollama-vulkan image from Docker Hub. The results were: [token-rate tables for RX6600 + CPU and RX6600 + RTX2060 not preserved]. Building the image in this PR yielded similar token rates [tables not preserved]. So, I did one final test. I switched the resulting image to Arch Linux:

- FROM ubuntu:24.04
- RUN apt-get update \
-     && apt-get install -y ca-certificates libcap2 libvulkan1 \
-     && apt-get clean \
-     && rm -rf /var/lib/apt/lists/*
+ FROM archlinux
+ RUN pacman-key --init \
+     && pacman -Syu --noconfirm \
+     && pacman -S --noconfirm ca-certificates libcap vulkan-nouveau vulkan-radeon

And the results got much more promising. By using a newer Mesa driver (25.0.5-1), I got almost double the previous GPU performance, and surpassed the RX6600 + RTX2060 results.
Tested on v0.5.13 on Linux. The image was built using the supplied Dockerfile, with the caveat that the release image was bumped to 24.04 (from 20.04).
Build command: [not preserved]
Tested on AMD Ryzen 7 8845HS w/ Radeon 780M Graphics with ROCm disabled