Analyzing the benchmark data, it's clear that Deepgram's transcription API contributes the least latency of the three components. For brief prompts that elicit short responses, the latencies of the agent (GPT-4) and the synthesizer (Azure) are relatively comparable, which can be attributed to the small amount of text being synthesized. As the text volume grows, however, the two diverge noticeably, with the synthesizer's latency significantly exceeding the agent's. This suggests that the synthesizer's performance is more sensitive to increases in text length, which hurts its efficiency in time-sensitive applications. Synthesis can be complex and time-consuming, particularly for synthesizers that produce highly natural-sounding speech: it involves several stages, including text processing, linguistic analysis, and waveform generation.
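To make the breakdown concrete, here is a minimal sketch of how per-stage latency shares could be measured for a pipeline like this one. The `transcribe`, `generate`, and `synthesize` functions below are placeholders with simulated delays, not the actual Deepgram, GPT-4, or Azure client calls used in this benchmark:

```python
import time

# Placeholder stage functions with simulated delays; in the real benchmark
# these would be the Deepgram, GPT-4, and Azure client calls.
def transcribe(audio: bytes) -> str:
    time.sleep(0.05)   # transcription is the cheapest stage
    return "transcribed text"

def generate(text: str) -> str:
    time.sleep(0.30)   # agent response generation
    return "agent response"

def synthesize(text: str) -> bytes:
    time.sleep(1.60)   # waveform generation; cost grows with text length
    return b"audio"

def timed(fn, arg):
    """Run fn(arg) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(arg)
    return result, time.perf_counter() - start

def latency_shares(audio: bytes) -> dict:
    text, t_stt = timed(transcribe, audio)
    reply, t_llm = timed(generate, text)
    _, t_tts = timed(synthesize, reply)
    total = t_stt + t_llm + t_tts
    return {
        "Deepgram (Transcriber)": 100 * t_stt / total,
        "GPT4 (Agent)": 100 * t_llm / total,
        "Azure (Synthesizer)": 100 * t_tts / total,
    }

if __name__ == "__main__":
    for stage, share in latency_shares(b"\x00" * 16000).items():
        print(f"{stage}: {share:.2f}%")
```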
NOTES:
- If the prompt is too long, the demo cuts it off partway through and starts responding. It still picks up a few parts of the prompt if the user continues to speak, but the interruption clips the user's prompt and will have to be fixed.
- As a result, asking 2-3 questions in a single prompt was the most this test could accommodate.
Actual End-to-End Latency:
- Deepgram (Transcriber): 2.35%
- GPT4 (Agent): 15.81%
- Azure (Synthesizer): 81.84%
User Perceived Latency:
- Deepgram (Transcriber): 15.66% (from end of user prompt to transcription completion)
- GPT4 (Agent): 45.72% (from transcription completion to start of synthesis)
- Azure (Synthesizer): 38.61% (from start of synthesis to the first voice chunk sent)
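For reference, the user-perceived shares reduce to simple differences between four boundary-event timestamps. A sketch with hypothetical numbers (the timestamps below are illustrative, picked only to roughly reproduce the shares above, and the variable names are not from the benchmark harness):

```python
# Illustrative boundary timestamps in seconds; values are made up to roughly
# reproduce the reported user-perceived shares.
t_prompt_end = 0.00        # end of user prompt
t_transcript_done = 0.41   # transcription completion
t_synthesis_start = 1.60   # start of synthesis
t_first_chunk = 2.61       # first voice chunk sent

total = t_first_chunk - t_prompt_end
intervals = {
    "Deepgram (Transcriber)": t_transcript_done - t_prompt_end,
    "GPT4 (Agent)": t_synthesis_start - t_transcript_done,
    "Azure (Synthesizer)": t_first_chunk - t_synthesis_start,
}
for stage, dt in intervals.items():
    print(f"{stage}: {100 * dt / total:.2f}%")
```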