Releases: huggingface/text-generation-inference

v3.2.0

12 Mar 10:17
411a282

Important changes

  • BREAKING CHANGE: Many modifications around tool calling. Tool calling now fully follows the OpenAI return format: the arguments field of a tool call is returned as a JSON-encoded string instead of a parsed JSON object. Numerous improvements and side-effect fixes around tool calling.

  • Added Gemma 3 support.
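
A consequence of the tool-calling change above is that clients must now decode the `arguments` field themselves, as in the OpenAI API. A minimal sketch (the payload below is a hypothetical example, not actual TGI output):

```python
import json

# Hypothetical tool call in the new OpenAI-compatible format:
# `arguments` is a JSON-encoded string, not a nested object.
tool_call = {
    "id": "call_0",
    "type": "function",
    "function": {
        "name": "get_weather",
        "arguments": '{"city": "Paris", "unit": "celsius"}',
    },
}

raw = tool_call["function"]["arguments"]
assert isinstance(raw, str)   # previously this was a dict
args = json.loads(raw)        # clients must decode it themselves
print(args["city"])           # -> Paris
```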

What's Changed

New Contributors

Full Changelog: v3.1.1...v3.2.0

v3.1.1

04 Mar 17:15
c34bd9d

What's Changed

New Contributors

Full Changelog: v3.1.0...v3.1.1

v3.1.0

31 Jan 13:26
463228e

Important changes

DeepSeek R1 is fully supported on both AMD and NVIDIA GPUs!

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.1.0 --model-id deepseek-ai/DeepSeek-R1

What's Changed

Full Changelog: v3.0.2...v3.1.0

v3.0.2

24 Jan 11:16
b70f29d

Tl;dr

New transformers backend supporting FlashAttention, at roughly the same performance as native TGI, for all models that are not officially supported directly in TGI. Congrats @Cyrilvallez!

New models unlocked: Cohere2, OLMo, OLMo2, Helium.

What's Changed

New Contributors

Full Changelog: v3.0.1...v3.0.2

v3.0.1

11 Dec 20:13
bb9095a

Summary

Patch release to handle a few older models and corner cases.

What's Changed

New Contributors

Full Changelog: v3.0.0...v3.0.1

v3.0.0

09 Dec 20:22
8f326c9

TL;DR

Big new release

(Benchmark figure: benchmarks_v3)

Details: https://huggingface.co/docs/text-generation-inference/conceptual/chunking

What's Changed

New Contributors

Full Changelog: v2.4.1...v3.0.0

v2.4.1

22 Nov 17:35
d2ed52f

Notable changes

  • Choose input/total tokens automatically based on available VRAM
  • Support Qwen2 VL
  • Decrease latency of very large batches (> 128)

What's Changed

New Contributors

Full Changelog: v2.3.0...v2.4.1

v2.4.0

25 Oct 21:14
0a655a0

Notable changes

  • Experimental prefill chunking (PREFILL_CHUNKING=1)
  • Experimental FP8 KV cache support
  • Greatly decrease latency for large batches (> 128 requests)
  • Faster MoE kernels and support for GPTQ-quantized MoE
  • Faster implementation of MLLama
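
Prefill chunking (enabled above via PREFILL_CHUNKING=1) splits a long prompt's prefill pass into fixed-size pieces so it can be interleaved with other work. A toy sketch of the splitting idea only; the chunk size and token values are made up, and TGI's actual scheduler is far more involved:

```python
def chunk_prefill(prompt_tokens, chunk_size):
    """Split a prompt's token list into fixed-size prefill chunks."""
    return [prompt_tokens[i:i + chunk_size]
            for i in range(0, len(prompt_tokens), chunk_size)]

tokens = list(range(10))          # a made-up 10-token prompt
chunks = chunk_prefill(tokens, 4)
print([len(c) for c in chunks])   # -> [4, 4, 2]
```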

What's Changed

New Contributors

v2.3.1

03 Oct 13:01
a094729

Important changes

  • Added support for Mllama (Llama 3.2 vision models), flash-attention based and unpadded.
  • FP8 performance improvements
  • MoE performance improvements
  • BREAKING CHANGE: when using tools, models could previously answer with a notify_error tool call whose content was the error; they will now fall back to regular text generation instead.

What's Changed

New Contributors

Full Changelog: v2.3.0...v2.3.1

v2.3.0

20 Sep 16:20
169178b

Important changes

  • Renamed HUGGINGFACE_HUB_CACHE to HF_HOME, to harmonize environment variables across the HF ecosystem.
    As a result, data locations moved from /data/models-.... to /data/hub/models-.... in the Docker image.

  • Prefix caching by default! To help with long-running queries, TGI will use prefix caching and reuse pre-existing entries in the KV cache to reduce TTFT. This should be totally transparent for most users; however, it required an intense rewrite of the internals, so bugs can potentially exist. We also changed kernels from paged_attention to flashinfer (with flashdecoding as a fallback for some specific models that flashinfer does not support).

  • Lots of performance improvements with Marlin and quantization.
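
The idea behind prefix caching, in toy form: keep KV-cache entries keyed by token prefixes, and when a new request shares a prefix with a cached one, skip recomputing that part. All names below are illustrative, not TGI internals:

```python
def longest_cached_prefix(cache, tokens):
    """Return the length of the longest cached prefix of `tokens`.

    `cache` is a set of token tuples standing in for prefixes whose
    KV-cache entries already exist (an illustrative sketch only).
    """
    best = 0
    for cached in cache:
        n = len(cached)
        if n > best and tuple(tokens[:n]) == cached:
            best = n
    return best

cache = {(1, 2), (1, 2, 3)}     # prefixes already in the KV cache
tokens = [1, 2, 3, 4, 5]
hit = longest_cached_prefix(cache, tokens)
print(hit)                      # -> 3: only tokens[3:] still need prefill
```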

What's Changed
