Releases: huggingface/text-generation-inference

v3.2.0

12 Mar 10:17
411a282

Important changes

  • BREAKING CHANGE: Many modifications around tool calling. Tool calling now fully follows the OpenAI return format: the arguments field of a tool call is returned as a JSON-encoded string instead of a parsed JSON object. Numerous improvements and side-effect fixes around tool calling.

  • Added Gemma 3 support.
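
A consequence of the tool-calling change above is that clients must now decode the `arguments` field themselves, as in the OpenAI API. A minimal sketch (the payload below is a hypothetical example, not actual TGI output):

```python
import json

# Hypothetical tool call in the new OpenAI-compatible format:
# `arguments` is a JSON-encoded string, not a nested object.
tool_call = {
    "id": "call_0",
    "type": "function",
    "function": {
        "name": "get_weather",
        "arguments": '{"city": "Paris", "unit": "celsius"}',
    },
}

raw = tool_call["function"]["arguments"]
assert isinstance(raw, str)   # previously this was a dict
args = json.loads(raw)        # clients must decode it themselves
print(args["city"])           # -> Paris
```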

What's Changed

New Contributors

Full Changelog: v3.1.1...v3.2.0

v3.1.1

04 Mar 17:15
c34bd9d

What's Changed

New Contributors

Full Changelog: v3.1.0...v3.1.1

v3.1.0

31 Jan 13:26
463228e

Important changes

DeepSeek R1 is fully supported on both AMD and NVIDIA GPUs!

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.1.0 --model-id deepseek-ai/DeepSeek-R1

What's Changed

Full Changelog: v3.0.2...v3.1.0

v3.0.2

24 Jan 11:16
b70f29d

Tl;dr

New transformers backend supporting FlashAttention, at roughly the same performance as native TGI, for all models that are not officially supported directly in TGI. Congrats @Cyrilvallez!

New models unlocked: Cohere2, OLMo, OLMo2, Helium.

What's Changed

New Contributors

Full Changelog: v3.0.1...v3.0.2

v3.0.1

11 Dec 20:13
bb9095a

Summary

Patch release to handle a few older models and corner cases.

What's Changed

New Contributors

Full Changelog: v3.0.0...v3.0.1

v3.0.0

09 Dec 20:22
8f326c9

TL;DR

Big new release

(Benchmark figure: benchmarks_v3)

Details: https://huggingface.co/docs/text-generation-inference/conceptual/chunking

What's Changed

New Contributors

Full Changelog: v2.4.1...v3.0.0

v2.4.1

22 Nov 17:35
d2ed52f

Notable changes

  • Choose input/total tokens automatically based on available VRAM
  • Support Qwen2 VL
  • Decrease latency of very large batches (> 128)

What's Changed

New Contributors

Full Changelog: v2.3.0...v2.4.1

v2.4.0

25 Oct 21:14
0a655a0

Notable changes

  • Experimental prefill chunking (PREFILL_CHUNKING=1)
  • Experimental FP8 KV cache support
  • Greatly decrease latency for large batches (> 128 requests)
  • Faster MoE kernels and support for GPTQ-quantized MoE
  • Faster implementation of MLLama
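
Prefill chunking (enabled above via PREFILL_CHUNKING=1) splits a long prompt's prefill pass into fixed-size pieces so it can be interleaved with other work. A toy sketch of the splitting idea only; the chunk size and token values are made up, and TGI's actual scheduler is far more involved:

```python
def chunk_prefill(prompt_tokens, chunk_size):
    """Split a prompt's token list into fixed-size prefill chunks."""
    return [prompt_tokens[i:i + chunk_size]
            for i in range(0, len(prompt_tokens), chunk_size)]

tokens = list(range(10))          # a made-up 10-token prompt
chunks = chunk_prefill(tokens, 4)
print([len(c) for c in chunks])   # -> [4, 4, 2]
```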

What's Changed

New Contributors

v2.3.1

03 Oct 13:01
a094729

Important changes

  • Added support for Mllama (Llama 3.2 vision models), flash-attention based and unpadded.
  • FP8 performance improvements
  • MoE performance improvements
  • BREAKING CHANGE: when using tools, models could previously answer with a notify_error tool call whose content was the error; they will now fall back to regular text generation instead.

What's Changed

New Contributors

Full Changelog: v2.3.0...v2.3.1

v2.3.0

20 Sep 16:20
169178b

Important changes

  • Renamed HUGGINGFACE_HUB_CACHE to HF_HOME, to harmonize environment variables across the HF ecosystem.
    As a result, data locations moved from /data/models-.... to /data/hub/models-.... in the Docker image.

  • Prefix caching by default! To help with long-running queries, TGI will use prefix caching and reuse pre-existing entries in the KV cache to reduce TTFT. This should be totally transparent for most users; however, it required an intense rewrite of the internals, so bugs can potentially exist. We also changed kernels from paged_attention to flashinfer (with flashdecoding as a fallback for some specific models that flashinfer does not support).

  • Lots of performance improvements with Marlin and quantization.
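
The idea behind prefix caching, in toy form: keep KV-cache entries keyed by token prefixes, and when a new request shares a prefix with a cached one, skip recomputing that part. All names below are illustrative, not TGI internals:

```python
def longest_cached_prefix(cache, tokens):
    """Return the length of the longest cached prefix of `tokens`.

    `cache` is a set of token tuples standing in for prefixes whose
    KV-cache entries already exist (an illustrative sketch only).
    """
    best = 0
    for cached in cache:
        n = len(cached)
        if n > best and tuple(tokens[:n]) == cached:
            best = n
    return best

cache = {(1, 2), (1, 2, 3)}     # prefixes already in the KV cache
tokens = [1, 2, 3, 4, 5]
hit = longest_cached_prefix(cache, tokens)
print(hit)                      # -> 3: only tokens[3:] still need prefill
```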

What's Changed
