The good news: there are many potential providers and options for cobbling something together.
The bad news: this problem is not easy, and the products from the top research and product teams are not very robust.
You can find a list of all self-serve machine translation API providers at modelfront.com/compare. Most of those providers also offer speech recognition APIs, and speech recognition is also available on-device on many platforms.
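If you do go the glue-it-together route, the basic cascade is just two API calls: recognize, then translate. Below is a minimal Python sketch, assuming Google Cloud's Speech-to-Text and Translation v2 REST endpoints and an API key in a GOOGLE_API_KEY environment variable (the key name and the example file name are placeholders); any of the providers on the comparison list would slot in the same way.

```python
import base64
import os

import requests

# Assumption: a Google Cloud API key with both APIs enabled.
API_KEY = os.environ["GOOGLE_API_KEY"]


def recognize(wav_path, language="en-US"):
    """Send a short WAV file to the Speech-to-Text REST API and return the transcript."""
    with open(wav_path, "rb") as f:
        content = base64.b64encode(f.read()).decode("utf-8")
    resp = requests.post(
        "https://speech.googleapis.com/v1/speech:recognize",
        params={"key": API_KEY},
        json={
            "config": {"languageCode": language},
            "audio": {"content": content},
        },
    )
    resp.raise_for_status()
    results = resp.json().get("results", [])
    return " ".join(r["alternatives"][0]["transcript"] for r in results)


def translate(text, target="es"):
    """Translate text with the Cloud Translation v2 REST API."""
    resp = requests.post(
        "https://translation.googleapis.com/language/translate/v2",
        params={"key": API_KEY},
        json={"q": text, "target": target},
    )
    resp.raise_for_status()
    return resp.json()["data"]["translations"][0]["translatedText"]


if __name__ == "__main__":
    transcript = recognize("utterance.wav")
    print(transcript)
    print(translate(transcript, target="es"))
```

This batch-mode cascade is the easy part; the hard part is doing it on streaming audio without the latency and flickering problems below.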
But, depending on your scenario, you may be better off with a speech-to-speech approach (vs. gluing together multiple systems), and even with a local model (vs. an external API), for three reasons: quality, latency, and the interaction of the two. Users don't want to wait for the end of the sentence, but they also don't want the translated text to flicker as new words come in.
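To see the flickering problem concretely, here's a toy sketch of the re-translation approach: every time a new partial transcript arrives, the whole prefix is re-translated, so text the user has already read can change under them. One simple mitigation, roughly in the spirit of the Google re-translation paper listed below, is to only show the part of the output that has stayed stable and hold back the last few tokens. The translate_partial function and its canned outputs are hypothetical stand-ins for any real MT call.

```python
def translate_partial(source_prefix):
    """Hypothetical stand-in for an MT call on a partial source sentence."""
    # In a real system this would hit an MT API or a local model.
    fake_lookup = {
        "I would": "Ich würde",
        "I would like": "Ich möchte",
        "I would like a coffee": "Ich hätte gern einen Kaffee",
    }
    return fake_lookup.get(source_prefix, source_prefix)


def stable_prefix(previous, current):
    """Return the longest common token prefix of two consecutive translations."""
    common = []
    for a, b in zip(previous.split(), current.split()):
        if a != b:
            break
        common.append(a)
    return " ".join(common)


def simulate(partials, mask_k=1):
    """Re-translate each partial transcript; show only the stable part plus a held-back tail."""
    previous = ""
    for partial in partials:
        current = translate_partial(partial)
        committed = stable_prefix(previous, current).split()
        # Hold back the last mask_k tokens of the unstable tail to reduce visible flicker.
        tail = current.split()[len(committed):]
        shown = " ".join(committed + tail[: max(0, len(tail) - mask_k)])
        print(f"src: {partial!r:28} raw: {current!r:32} shown: {shown!r}")
        previous = current


if __name__ == "__main__":
    simulate(["I would", "I would like", "I would like a coffee"])
```

The "raw" column flickers ("Ich würde" becomes "Ich möchte" becomes "Ich hätte gern..."), while the "shown" column stays more stable at the cost of lagging a word or two behind, which is exactly the quality-latency-stability trade-off.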
If you search r/machinetranslation for speech OR simultaneous OR interpreting, you'll find:
a launch announcement for "interpreter mode" from Google Assistant
a Baidu announcement on a quality improvement
two articles from Mattia Di Gangi at FBK
the flickering paper from Google (Re-translation versus Streaming for Simultaneous Translation)
the Translatotron article and paper from Google
a landscape survey from Apple
the NeurST toolkit GitHub repo from ByteDance (TikTok)
There was a keynote from Baidu Research on this at WMT 2019, and more recently a bit more from Google on flickering, but both focused on their own products, not on offerings for external developers.
Whoever fights dragons for too long becomes a dragon himself; gaze too long into the abyss, and the abyss gazes back into you…