129 changes: 82 additions & 47 deletions docs/speech-to-text/real-time/latency.mdx
@@ -1,89 +1,124 @@
---
description: 'Learn about latency in the Speechmatics Real-Time server'
description: 'Control response time in Realtime transcription'
keywords: [speechmatics, real-time, rt, latency, low latency, fast, transcription, speech recognition, asr]
sidebar_label: Real-time Latency
sidebar_label: Latency
---

# Real-Time Latency
# Latency settings

When transcribing in real-time, you can control the maximum time to wait for the final transcript. This could be as fast as 0.7 seconds, though allowing a longer time will give a slight accuracy improvement.
Balance speed and accuracy in your Realtime transcription by adjusting latency settings.

For even faster output, use [Partial transcripts](#partial-transcripts) to receive transcription output before higher-accuracy final transcripts are returned.
## Configuration options

## Configuration Example
The following example shows a typical configuration for low latency applications. Include this in the [StartRecognition](/api-ref/realtime-transcription-websocket#startrecognition) message.
Configure real-time latency with the following parameters:

- `max_delay`: Maximum time in seconds (0.7-4.0, default: 4.0) between what's said and final transcript delivery
- `max_delay_mode`: Mode setting (`fixed` or `flexible`, default: `flexible`) for handling [numeral formatting](#numeral-formatting)
- `enable_partials`: Boolean (default: false) to enable [partial transcripts](#partial-transcripts) for faster feedback

Add these parameters to your [StartRecognition](/api-ref/realtime-transcription-websocket#startrecognition) message:

```json
{
"type": "transcription",
"transcription_config": {
// highlight-start
"max_delay": 0.7,
"max_delay_mode": "flexible",
// highlight-end
"enable_partials": true,
// highlight-end
"language": "en",
"operating_point": "enhanced",
"operating_point": "enhanced"
}
}
```
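As a rough illustration of the parameter constraints listed above, the following Python sketch builds and validates the `transcription_config` object before it is sent. The helper name `build_transcription_config` is hypothetical and not part of any Speechmatics SDK; the ranges and defaults are the ones stated in this document.

```python
import json
from typing import Any, Dict

# Allowed values as documented above; the helper itself is hypothetical.
ALLOWED_MODES = {"fixed", "flexible"}
MIN_DELAY, MAX_DELAY = 0.7, 4.0


def build_transcription_config(
    max_delay: float = 4.0,
    max_delay_mode: str = "flexible",
    enable_partials: bool = False,
    language: str = "en",
    operating_point: str = "enhanced",
) -> Dict[str, Any]:
    """Return a transcription_config dict, rejecting out-of-range latency settings."""
    if not MIN_DELAY <= max_delay <= MAX_DELAY:
        raise ValueError(f"max_delay must be between {MIN_DELAY} and {MAX_DELAY} seconds")
    if max_delay_mode not in ALLOWED_MODES:
        raise ValueError(f"max_delay_mode must be one of {sorted(ALLOWED_MODES)}")
    return {
        "max_delay": max_delay,
        "max_delay_mode": max_delay_mode,
        "enable_partials": enable_partials,
        "language": language,
        "operating_point": operating_point,
    }


if __name__ == "__main__":
    # Low-latency settings matching the JSON example above.
    config = build_transcription_config(max_delay=0.7, enable_partials=True)
    print(json.dumps({"transcription_config": config}, indent=2))
```

Validating on the client side is a design choice for this sketch only; the server may apply its own validation.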

- `max_delay` (Number): Optional. Allowed between 0.7 and 4 seconds. Default is 4 seconds. This is the delay in seconds between the end of a spoken word and returning the Final transcript results. Note that there is a very small amount of additional latency while the server is sending the transcript to the client.
- `max_delay_mode` (String): Optional. Allowed values are `fixed` and `flexible`. Default is `flexible`. This allows some additional time for [Numeral Formatting](#numeral-formatting).
- `enable_partials` (Boolean): Default is false. Whether or not to receive [Partial transcripts](#partial-transcripts) before the Final transcripts are received.

## Accuracy/Latency trade-offs
## Speed vs. accuracy trade-offs

Choose the right `max_delay` setting for your use case:

We recommend experimenting with different settings for the `max_delay` to find the right trade-off between accuracy and latency for your application. Based on our own testing and experience, we can offer a few guidelines to get you started.
| `max_delay` | Accuracy impact (relative, vs. Batch) | Recommended use cases |
|---------|----------------|----------------------|
| 0.7-1.5s | < 5% degradation | Conversational AI, voice assistants |
| 2.0s | ~1% degradation | Live captioning, broadcast media |
| 4.0s | No degradation | Highest accuracy; combine with partial transcripts for early feedback |

Setting `max_delay` to between 0.7 and 1.5 gives an accuracy degradation of less than 5% relative when compared to the Batch transcription service. This tradeoff is worthwhile for use cases that need ultra-fast responses such as real-time conversational AI.
:::warning
Lower latency settings trade some accuracy for speed. Test thoroughly with your specific audio.
:::
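One way to encode the guidance from the table is a small lookup of starting values per use case. This is only a sketch: the use-case keys and the function name `starting_max_delay` are made up for illustration, and the conversational value is simply the low end of the 0.7-1.5s range.

```python
from typing import Dict

# Starting points derived from the table above. The keys are hypothetical
# labels; 0.7 is the low end of the documented 0.7-1.5s range.
MAX_DELAY_PRESETS: Dict[str, float] = {
    "conversational_ai": 0.7,
    "live_captioning": 2.0,
    "highest_accuracy": 4.0,
}


def starting_max_delay(use_case: str) -> float:
    """Return a starting max_delay (seconds) for a known use case."""
    try:
        return MAX_DELAY_PRESETS[use_case]
    except KeyError:
        known = ", ".join(sorted(MAX_DELAY_PRESETS))
        raise ValueError(f"unknown use case {use_case!r}; expected one of: {known}") from None


if __name__ == "__main__":
    for name in MAX_DELAY_PRESETS:
        print(name, starting_max_delay(name))
```

Treat these values as a first guess and adjust after testing with your own audio.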

At 2 seconds `max_delay`, there is around 1% relative accuracy degradation when compared to the Batch transcription service. This is the recommended setting for most use cases, such as broadcast captioning.
## Partial transcripts

For the best accuracy, we recommend using a `max_delay` of 4 seconds which is equivalent to our Batch transcription service. This can be combined with Partial transcripts, to give users early feedback of the recognised text.
Get preliminary results faster while waiting for final, more accurate transcripts.

## Partial Transcripts
### How partial transcripts work

Partial transcripts allow you to receive preliminary transcription and update as more context is available until the higher-accuracy [Finals](/api-ref/realtime-transcription-websocket#addtranscript) are returned. Typically Partials are returned in less than 500 milliseconds. [Partial transcripts](/api-ref/realtime-transcription-websocket.mdx#addpartialtranscript) are enabled using the `enable_partials` config option.
- Typically delivered in under 500ms (vs. final transcripts at your configured `max_delay`)
- Updated continuously as more speech context becomes available
- Enabled with `enable_partials: true` in your configuration

On each Final transcript you will immediately receive a Partial transcript with any remaining words which have not been finalised.
### Limitations

Note that Partial transcripts have some limitations:
- Accuracy is usually 10-25% lower than the Final transcript. This includes punctuation and capitalisation of words.
- The `confidence` field for Partial transcripts has no meaning and should not be relied on.
- Accuracy is typically 10-25% lower than final transcripts
- Punctuation and capitalization may be incorrect
- Confidence scores are not meaningful and should be ignored

## Numeral Formatting
## Numeral formatting

[Numeral Formatting](/speech-to-text/output-enhancements/numeral-formatting) ensures readability of your transcripts by formatting numbers, dates, currencies and other important _entities_ into their written form.
Improve transcript readability with properly formatted numbers, dates, and currencies.

When the `max_delay_mode` is set to `flexible`, and an entity is being spoken, the Final transcript would be delayed until the entity is fully spoken to enable proper formatting. This option should be used in most use-cases for improved accuracy and readability for numbers, currencies, and dates.
### Flexible mode

If you have strict latency requirements, and prefer not to wait for entity formatting to complete, set `max_delay_mode` to `fixed`. Note that in this mode, there will be some reduction in accuracy and readability for numbers, currencies, and dates.
When using `max_delay_mode: "flexible"` (default):
- System waits until an entity (number, date, currency) is fully spoken
- Ensures proper formatting of complex numerical expressions
- Slightly increases latency only when entities are detected

## Example Outputs (Partials and Finals)
With only `Finals` and default `max_delay_mode`, messages received could look like the following:
### Fixed mode

- **(Final)**: I am 35.
For applications with strict latency requirements:
- Set `max_delay_mode: "fixed"` to enforce consistent timing
- System won't wait for entities to complete before returning results

**Final output**: I am 35.
:::warning
Fixed mode reduces accuracy and readability of numbers, currencies, and dates.
:::

With `Partials` enabled and default `max_delay_mode`, messages received could look like the following:
## Example output comparison

- **(Partial)**: I
- **(Partial)**: I am
- **(Partial)**: I am third
- **(Partial)**: I am 30
- **(Final)**: I am 35.
### Finals only (default)

With only final transcripts (default configuration):

```
(Final): I am 35.
```

**Final output**: I am 35.
### Partials with flexible mode

With `Partials` enabled and `max_delay_mode` as `fixed`, messages received could look like the following:
With `enable_partials: true` and `max_delay_mode: "flexible"`:

- **(Partial)**: I
- **(Final)**: I am
- **(Partial)**: third
- **(Final)**: 30
- **(Partial)**: five
- **(Final)**: five.
```
(Partial): I
(Partial): I am
(Partial): I am third
(Partial): I am 30
(Final): I am 35.
```

Note how the system corrects "30" to "35" in the final transcript.

### Partials with fixed mode

With `enable_partials: true` and `max_delay_mode: "fixed"`:

```
(Partial): I
(Final): I am
(Partial): third
(Final): 30
(Partial): five
(Final): five.
```

**Final output**: I am 30 five.
Final output: "I am 30 five." Note how the number isn't properly formatted.
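
To make the difference between the two modes concrete, here is a minimal Python sketch that replays the example message sequences above and computes what a client would display after each message. It assumes a simple client-side rule: Final text is appended permanently, while each Partial only temporarily extends the finalised text until the next message arrives. The `(kind, text)` tuples and the `replay` function are illustrative stand-ins, not the actual WebSocket message format.

```python
from typing import Iterable, List, Tuple

# (kind, text) pairs standing in for received messages; kind is "Partial" or "Final".
Message = Tuple[str, str]


def replay(messages: Iterable[Message]) -> List[str]:
    """Return the text a client would display after each message."""
    finals: List[str] = []   # finalised segments, never revised
    screens: List[str] = []  # what is shown after each message
    for kind, text in messages:
        if kind == "Final":
            finals.append(text)
            screens.append(" ".join(finals))
        elif kind == "Partial":
            # A partial is shown after the finalised text but not stored.
            screens.append(" ".join(finals + [text]))
        else:
            raise ValueError(f"unexpected message kind: {kind!r}")
    return screens


FLEXIBLE: List[Message] = [
    ("Partial", "I"), ("Partial", "I am"), ("Partial", "I am third"),
    ("Partial", "I am 30"), ("Final", "I am 35."),
]
FIXED: List[Message] = [
    ("Partial", "I"), ("Final", "I am"), ("Partial", "third"),
    ("Final", "30"), ("Partial", "five"), ("Final", "five."),
]

if __name__ == "__main__":
    print(replay(FLEXIBLE)[-1])  # I am 35.
    print(replay(FIXED)[-1])     # I am 30 five.
```

In fixed mode, "30" is finalised before "five" is spoken, so the client cannot merge them into "35" afterwards.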