STT Analysis – ElevenLabs added, Amazon Transcribe keeps rising and Chirp3 remains strong [May 2026]
In previous STT analyses, we already saw that the performance of Speech-to-Text engines varies significantly by language, input type and use case. An engine that performs well on general transcription is not automatically the best choice for postal codes, dates, times or alphanumeric input. In telephone-based customer interactions, these are often exactly the moments where a voice solution needs to prove that its output is reliable and immediately usable.
That is why we continue to benchmark different STT engines on a recurring basis in realistic telephony scenarios. In this new analysis, we once again look at performance for Dutch and English. We compare the results with the previous analysis from February 2026 and show which shifts are relevant for voice implementations in practice.
In this measurement, ElevenLabs has also been added as a new STT engine. This broadens the benchmark and allows us to compare newer players in the market more effectively with engines that have been part of our analyses for longer, such as Azure, Google, Amazon and OpenAI.
A look back at the February 2026 analysis
In the previous analysis, the most notable development was a clear improvement for Amazon Transcribe in Dutch. Where this engine previously scored around 33%, its correctness increased to nearly 60%. The other results remained largely stable.
That was an important insight in itself: improvement is not the only factor that matters. Stability also plays a major role in production environments where voice solutions are used by end users every day.
The conclusion from February therefore still holds: there is no single best STT engine. The right choice depends on language, use case, input type and the extent to which the output still needs to be normalized or processed after transcription.
How do we test different STT engines?
The test setup has remained the same as in previous analyses. We test with a fixed set of standard sentences, spoken by native speakers, using telephone-quality audio. Each recording is checked to exclude as much noise from the dataset as possible. This helps ensure that differences in the results can mainly be attributed to the STT engine itself.
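As a rough illustration of this setup, the loop below runs every fixed recording through every engine and collects the raw transcripts for later scoring. Everything in it (the engine names, file layout and the transcribe stub) is hypothetical; it only sketches the shape of such a benchmark, not our actual tooling.

```python
from pathlib import Path

# Hypothetical benchmark loop: every engine transcribes the same fixed,
# telephone-quality recordings, so score differences mainly reflect the engine.
ENGINES = ["azure", "gcloud-chirp3", "amazon-transcribe", "openai", "elevenlabs"]

def transcribe(engine: str, audio: bytes) -> str:
    """Stand-in for the real per-engine STT call."""
    raise NotImplementedError

def run_benchmark(audio_dir: str) -> list[dict]:
    rows = []
    for clip in sorted(Path(audio_dir).glob("*.wav")):
        audio = clip.read_bytes()
        for engine in ENGINES:
            rows.append({"clip": clip.name,
                         "engine": engine,
                         "hypothesis": transcribe(engine, audio)})
    return rows
```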
We analyze the engines along two axes:
- Language: Dutch and English.
- Input type: including numeric input, postal codes, dates, times, textual input and alphanumeric input.
We then look at two key metrics:
- Correctness: in how many cases does the output exactly match what the caller said?
- Word error rate (WER): how many words need to be inserted, deleted or substituted to make the output match the caller’s input exactly? (A minimal calculation sketch follows below.)
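To make both metrics concrete, here is a minimal sketch of how they are typically computed. This is an illustration under the common definitions, not Seamly’s actual scoring code, and the example sentences are hypothetical.

```python
# Minimal sketch of the two metrics; not Seamly's actual scoring pipeline.
# WER = (substitutions + deletions + insertions) / number of reference words.

def word_edit_distance(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance between two transcripts."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)]

def wer(reference: str, hypothesis: str) -> float:
    return word_edit_distance(reference, hypothesis) / max(len(reference.split()), 1)

def correctness(pairs: list[tuple[str, str]]) -> float:
    """Share of utterances whose output matches the reference exactly."""
    return sum(ref == hyp for ref, hyp in pairs) / len(pairs)

# Hypothetical example: one wrong word out of five gives a WER of 0.2,
# and the utterance no longer counts as correct.
print(wer("my postal code is 1234ab", "my postal code is 1234av"))  # 0.2
```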
As in previous analyses, we look at the output after Seamly has applied normalization and post-processing.
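Normalization is what turns raw engine output into the canonical form that scoring and the dialog actually need. As a small illustration, the rule below collapses a spelled-out Dutch postal code into its canonical form. The rule and examples are hypothetical stand-ins; Seamly’s real post-processing covers far more patterns per language and input type.

```python
import re

# Hypothetical normalization rule, for illustration only: collapse a
# spelled-out Dutch postal code ("1234 a b") into the canonical "1234AB"
# before the transcript is scored or handed to the dialog.
def normalize_postcode(transcript: str) -> str:
    match = re.search(r"\b(\d{4})\s*([a-z])\s*([a-z])\b", transcript, re.IGNORECASE)
    if not match:
        return transcript
    digits, first, second = match.groups()
    return transcript.replace(match.group(0), f"{digits}{first.upper()}{second.upper()}")

print(normalize_postcode("mijn postcode is 1234 a b"))  # "mijn postcode is 1234AB"
```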
What is new in this analysis?
First, ElevenLabs has been added as a new STT engine in this analysis. This means we can now track this engine’s performance on an ongoing basis and compare it with the other engines in the benchmark.
ElevenLabs is not yet a clear frontrunner in this measurement, but it does provide a relevant additional point of reference. Precisely because the STT market is developing quickly, it is important to include new engines in the benchmark at an early stage.
Second, the naming of some engines has changed in the analysis platform. Labels such as OpenAI - transcribe - NL and OpenAI - transcribe - nl-NL refer to the same engine. In this analysis, we therefore treat those series as one and the same engine. This is not a different configuration; it is only a change in naming within the dashboard.
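In practice, this kind of relabeling comes down to mapping dashboard labels onto one canonical engine name before aggregating any statistics. A minimal sketch, using the labels above (the canonical name itself is a hypothetical choice):

```python
# Both dashboard labels point to the same engine, so their measurements
# are merged under one canonical name before any per-engine statistics.
ENGINE_ALIASES = {
    "OpenAI - transcribe - NL": "openai-transcribe-nl",
    "OpenAI - transcribe - nl-NL": "openai-transcribe-nl",
}

def canonical_engine(label: str) -> str:
    return ENGINE_ALIASES.get(label, label)
```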
Dutch results
Amazon Transcribe continues to improve
For Dutch, Amazon Transcribe once again stands out positively. In the previous analysis, we already saw a clear increase in correctness. In this new measurement, that improvement appears to continue: towards the end of the measurement period, correctness rises markedly once more, while the word error rate falls.
This makes Amazon Transcribe more interesting for Dutch than in earlier measurements. The development clearly shows why continuous benchmarking matters. An engine that previously seemed less suitable for a specific language can become a better option relatively quickly through model updates or improvements.

Image 1: Correctness per STT engine - Dutch

Image 2: WER per STT engine - Dutch
Azure remains consistently strong for Dutch
Azure remains one of the most stable engines in the benchmark for Dutch. While Amazon Transcribe mainly stands out because of a clear positive movement, Azure stands out because of its consistency. Correctness remains high and WER remains low.
That is relevant for production environments. An engine does not always need to show the biggest increase to be valuable. Especially for voice solutions that are used every day, predictable quality is at least as important as a sudden improvement.
Differences by input type remain significant
At input-type level, we again see clear differences between engines for Dutch. Performance varies particularly for dates and times, numeric input and postal codes.
These are exactly the categories that are often critical in telephone-based customer interactions. Think of dates of birth, customer numbers, appointment times, order numbers or amounts. The results once again confirm that average scores do not say enough. For a voicebot that needs to process a lot of structured data, performance needs to be assessed specifically by input type.

Image 3: Correctness per input type - Dutch
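One way to make that per-input-type assessment concrete is to break correctness down per input type rather than relying on a single average, as in the sketch below. The records and field names are hypothetical; they only show the shape of the breakdown.

```python
from collections import defaultdict

# Hypothetical result records: (engine, input_type, exact_match).
results = [
    ("engine-a", "postal_code", True),
    ("engine-a", "date", False),
    ("engine-a", "date", True),
    ("engine-b", "postal_code", True),
]

# Group per (engine, input type) instead of averaging everything together.
buckets: dict[tuple[str, str], list[bool]] = defaultdict(list)
for engine, input_type, exact in results:
    buckets[(engine, input_type)].append(exact)

for (engine, input_type), hits in sorted(buckets.items()):
    print(f"{engine} / {input_type}: {sum(hits) / len(hits):.0%} correct")
```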
English results
GCloud Chirp3 remains the strongest and most stable engine for English
For English, GCloud Chirp3 remains the strongest and most stable engine in this analysis. Correctness remains relatively high and WER remains low. This measurement therefore confirms the picture from the previous analysis, in which Chirp3 also clearly emerged as a strong choice for English input.
That is relevant for English-language voicebots, especially when the application requires consistent transcription across multiple input types. Chirp3 not only shows strong general performance, but also remains relatively stable over time.

Image 4: Correctness per STT engine - English

Image 5: WER per STT engine - English
English postal codes remain strong, while alphanumeric input remains unpredictable
At input-type level, postal codes in English generally continue to perform well across several engines. We saw the same pattern in earlier analyses. For this type of input, recognition appears to be relatively mature across many engines.
Alphanumeric input, on the other hand, remains more unpredictable. When letters and numbers are combined, the differences between engines become clearly visible. This is relevant for use cases in which customers speak reference numbers, license plates, order codes or customer codes, for example.
Here too, an engine that scores well on average is not automatically the best choice for every voicebot. The right choice depends on the type of information the caller needs to provide.

Image 6: Correctness per input type - English
Comparison with February
Compared with the previous analysis, we see two important developments.
First, Amazon Transcribe continues its positive trend for Dutch. Where the previous analysis already showed a clear improvement, the engine appears to strengthen further in this new measurement.
Second, GCloud Chirp3 remains consistently strong for English. This confirms the earlier picture that Chirp3 is a reliable choice for English input.
In addition, ElevenLabs has been added to our benchmark. This engine is not yet a clear frontrunner, but it does provide an additional reference point in the comparison.
Why are these insights important?
This analysis once again confirms that STT choices need to be validated continuously. A one-off benchmark is not enough, because performance can change. General performance is also not sufficient to make a good choice. The relevant question is not only which engine scores best on average, but above all which engine best fits a specific language, use case and input type.
A voicebot that mainly processes free text places different demands on STT than a voicebot that needs to recognize many postal codes, customer numbers, dates or alphanumeric codes.
How does Seamly use these insights?
We use the results directly in our voice implementations. They help us determine which STT engine is the best fit for a specific language, customer question or use case. The analyses also show where Seamly’s own normalization and post-processing can be further refined.
For partners, this means they do not have to advise their customers based on assumptions or general benchmarks. They can use up-to-date insights from realistic telephony scenarios. Crucially, this includes the types of input that often matter most in real customer conversations, such as postal codes, numbers, dates, times and alphanumeric combinations.
Want to know more?
Through continuous benchmark research, we ensure that voice solutions via Seamly are connected to the STT engine that best fits the customer’s language, use case and input.
We would be happy to show you in a demo how Seamly helps conversational platforms expand their solution to the telephony channel, with STT choices based on current performance in realistic call scenarios.