STT Analysis – Shifts in Amazon Transcribe and Speechmatics [February 2026]

In our first STT analysis, we showed how strongly Speech-to-Text engines can differ depending on the language and input type. Since then, we have continued to measure and analyze the various engines. In this February 2026 update, we share the most recent measurements. The results show why continuous testing is essential for reliable voice solutions.

Brief review: what did we see earlier?

In the previous analysis, we saw clear differences between STT engines for both Dutch and English:

  • For Dutch, Azure performed best in terms of correctness and word error rate.
  • For English, Google Chirp3 stood out, particularly for input types such as dates and times.
  • Performance varied greatly depending on the input type, such as numeric input, postal codes, and alphanumeric input.

The initial analysis immediately made one thing clear: there is no such thing as the “best STT engine.” The right choice depends on the language, use case, and input type. Because STT models are updated regularly, we will continue to conduct this research. After all, what works well today may change in a few months' time.

How do we test different engines?

Our test setup is unchanged from the previous analysis. We test with a fixed set of standard sentences, spoken by native speakers, in telephone-quality audio. Each recording is checked to prevent noise in the dataset. This ensures that differences in results are actually attributable to the STT engine itself—and not to the input.
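To make this concrete, the sketch below shows one way such a test set could be organised. The field names and example values are purely illustrative; they are not Seamly's internal schema.

```python
from dataclasses import dataclass

@dataclass
class TestUtterance:
    language: str    # "nl" or "en"
    input_type: str  # e.g. "numeric", "postal_code", "date", "alphanumeric"
    reference: str   # what the native speaker actually said
    audio_path: str  # telephone-quality recording of that sentence

# Illustrative entries; the real test set is a fixed collection of standard
# sentences spoken by native speakers.
utterances = [
    TestUtterance("nl", "postal_code", "1234 ab", "audio/nl/postal_0001.wav"),
    TestUtterance("en", "date", "march third twenty twenty-six", "audio/en/date_0001.wav"),
]

# Every engine transcribes the same recordings, so differences in the results
# can be attributed to the engine rather than to the input.
```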

What are we testing?

As in the previous analysis, we test along two axes:

  • Two different languages: Dutch and English
  • Different input types, such as numbers, addresses, postal codes, dates, and times

We again look at two key metrics:

  • Correctness: in how many cases does the textual output exactly match what the caller says?
  • Word error rate (WER): how many words do we need to adjust to make the output match the caller's input exactly?

We analyze the output after Seamly has applied normalization and post-processing, so that results can be compared fairly and consistently within realistic call scenarios.
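To illustrate what these two metrics measure, here is a minimal sketch in Python. It is not Seamly's implementation, and the normalization and post-processing that precede the comparison are omitted; it only shows how correctness (exact match) and word error rate (word-level edit distance) behave.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (Levenshtein), divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

def correctness(pairs: list[tuple[str, str]]) -> float:
    """Share of utterances whose transcription matches the reference exactly."""
    return sum(ref == hyp for ref, hyp in pairs) / len(pairs)

# Two hypothetical utterances: one transcribed exactly, one not.
pairs = [
    ("1234 ab", "1234 ab"),
    ("march third 2026", "march 3rd 2026"),
]
print(correctness(pairs))                                     # 0.5
print(word_error_rate("march third 2026", "march 3rd 2026"))  # 0.33: 1 of 3 words needs a correction
```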

What do the new analyses show us?

Amazon Transcribe – Dutch

For Dutch, we see a striking increase in correctness with Amazon Transcribe. Whereas this engine previously scored around 33%, correctness has been almost 60% over the past two months.

This means that in significantly more cases, the transcription directly matches the spoken input. We see this improvement across multiple input types, including numeric, postal code, date, textual, and alphanumeric input.

This indicates that Amazon has actively improved the Dutch model. It's a good example of why continuous measurement is important: engines that previously seemed less suitable for a language can develop relatively quickly.

Image 1: Correctness per STT engine - Dutch

Image 2: Correctness per input type - Dutch

Speechmatics – English

For English, we see a less positive development for Speechmatics. The word error rate has risen from approximately 69% to 77% over the past two months.

A higher WER means that more corrections are needed to ensure that the transcription matches the spoken input exactly. Although this does not necessarily mean that the engine is “unusable,” it does make it clear that the output has become less consistent within our test setup.

These kinds of shifts underscore that performance can not only improve, but also temporarily decline due to model updates.

Image 3: Word error rate per STT engine - English

Other results largely stable

For all other engines, languages, and input types, we see no major changes in this measurement. Correctness and word error rate remain in line with the previous analysis. This is valuable information in itself: stability is at least as important as progress, especially in production environments where voice solutions are used daily by end users.

Why we continue to do this

This update shows why we keep analyzing STT engines on an ongoing basis: their performance can improve, but it can also deteriorate. The differences often only become apparent in a telephony context and for specific input types or languages. Amazon Transcribe, for example, shows no improvement in correctness for English, but a clear improvement for Dutch.

By measuring structurally, we know:

  • Which engine currently best suits a language and use case
  • Where additional normalization and post-processing remain necessary
  • When it is wise to reconsider engine choices

By measuring and comparing STT engines, we gain insight into how speech recognition develops in practice. We do this based on measurable performance in real telephony scenarios. We use these insights to improve our voice solution: by recommending the right engine for each language and use case, and by continuously refining our own normalization and post-processing.
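As a small illustration of what such post-processing can look like, the snippet below normalizes Dutch postal-code output before it is compared with the reference. This is a simplified, hypothetical rule; the normalization Seamly actually applies is more extensive and depends on the input type.

```python
import re

def normalize_postal_code(text: str) -> str:
    """Collapse output like '12 34 a b' into the canonical '1234 AB' form."""
    compact = re.sub(r"\s+", "", text).upper()
    match = re.fullmatch(r"(\d{4})([A-Z]{2})", compact)
    return f"{match.group(1)} {match.group(2)}" if match else text

print(normalize_postal_code("12 34 a b"))  # "1234 AB"
print(normalize_postal_code("1234 ab"))    # "1234 AB"
```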

Want to know more? We would be happy to show you in a demo how Seamly helps conversational platforms unlock their solution for the telephony channel, with STT choices that match your customers' use case, language, and input.