New voice AI offers customizable and emotional voices

[14:03 Fri, 28 February 2025 by Thomas Richter]

The startup Hume has just released a new voice AI with interesting features. With "Octave," you can not only create your own new voices with very specific characteristics via prompt, but also give them an emotional expression, making them sound angry or sad, for example. Octave also possesses emotional text understanding, meaning the system should be able to interpret the content of a text in such a way that it is spoken with the emotion appropriate to the respective passage.

According to Hume, Octave is the first text-to-speech system based on its own large language model (LLM). Because the model understands words in context, it should be able to generate speech with the appropriate emotions, rhythm, cadence, emphasis, pauses, and even dialects. Voices should be able to implicitly convey subtle nuances of meaning, such as whispering intimacy, biting sarcasm, ironic sharpness, or underlying aggression. These capabilities are interesting for filmmakers who want to quickly test different voiceover styles or create natural-sounding narrator or character voices for trailers, documentaries, or character dialogues without elaborate voice recordings.

Users can also explicitly modify voices and change them specifically via text prompt, making their expression, for example, happier, sadder, more frustrated, angrier, or more sarcastic. It is also possible to specifically change the expression of only one sentence or part of a sentence.

Voice Design and Acting Instructions

Two essential functions are available to users to define the speech output according to their own wishes:

- Voice Design: Users can create their desired voice via a simple text description ("country girl," "serious documentary speaker," or "grumpy medieval farmer"). The system then follows these specifications and generates corresponding vocal ranges and character traits.

- Acting Instructions: Emotions or speaking styles can be fine-tuned at the sentence or even word level. A short director's instruction such as "speak the next word whispered and slightly annoyed" is sufficient to generate different variants of the same sentence in just a few seconds.

Octave TTS is primarily designed for offline use, for example for voiceovers in documentaries, commercials, audiobooks, pre-visualizations, or character dialogues in games. Real-time interactions – such as live streaming formats – are not currently a primary focus, although similar streaming options are available in Hume's older EVI-TTS models.

Fortunately, you can try out the quality of Octave yourself for free and generate 10 minutes of speech per month with a self-defined voice. Hume also demonstrates the capabilities of its model (with many examples) at www.hume.ai/blog/octave-the-first-text-to-speech-model-that-understands-what-its-saying. In addition, you can pit Octave against other models in a blind test at arena.hume.ai/ with a freely defined voice and text and decide for yourself which is better.

Planned: Voice Cloning

For the future, Hume plans to introduce a voice cloning function, in which a real voice is replicated based on short reference recordings. This would allow production companies to virtually extend the range of their regular voice actors, for example for smaller dubbing jobs or additional voice variations.

Hume Octave vs ElevenLabs

Comparison with competing products

Currently, there are many established providers such as ElevenLabs (and also open-source solutions) in the text-to-speech market against which Octave, as a newcomer, must compete. Hume wants to score points not only with the unique new function of freely definable voices but also with its speech quality and its pricing model, which should amount to only about half the cost of ElevenLabs. In internal tests with 180 subjects, Octave achieved better scores for audio quality (71.6% approval), naturalness (51.7%), and accuracy to the requested voice design (57.7%), according to Hume.

Prices

- Free ($0/month): 10,000 characters text-to-speech (~10 minutes)
- Starter (/month): 30,000 characters (~30 minutes)
- Creator (/month): 100,000 characters (~100 minutes), additional characters from $0.20/1,000
- Pro (/month): 500,000 characters (~500 minutes), additional characters from $0.15/1,000
- Scale (/month): 2 million characters (~2,000 minutes), additional characters from $0.13/1,000
- Business (/month): 10 million characters (~10,000 minutes), additional characters from $0.10/1,000
- Enterprise: On request, unlimited usage options with adapted contract conditions

The cost per 1,000 characters decreases with increasing price class, so Octave can be worthwhile for larger productions. All variants allow the use of the Voice Design function, while the generated audio file can be exported in common formats such as MP3, WAV, or PCM.
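To make the tier arithmetic concrete, here is a minimal Python sketch that estimates speech duration and overage charges from the figures listed above. It assumes roughly 1,000 characters per minute (the ratio implied by the list) and reads the per-1,000-character rates as US dollar amounts (0.20, 0.15, 0.13, 0.10). The names `Tier`, `TIERS`, `estimated_minutes`, and `overage_cost` are purely illustrative and not part of any Hume API, and monthly base prices are left out of the calculation.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: ~1,000 characters per minute, as implied by "10,000 characters (~10 minutes)".
CHARS_PER_MINUTE = 1_000


@dataclass(frozen=True)
class Tier:
    name: str
    included_chars: int
    # Rate in USD per 1,000 additional characters; None if the list gives no rate.
    overage_per_1000: Optional[float]


# Figures taken from the price list; the dollar reading of the rates is an assumption.
TIERS = {
    "Free": Tier("Free", 10_000, None),
    "Starter": Tier("Starter", 30_000, None),
    "Creator": Tier("Creator", 100_000, 0.20),
    "Pro": Tier("Pro", 500_000, 0.15),
    "Scale": Tier("Scale", 2_000_000, 0.13),
    "Business": Tier("Business", 10_000_000, 0.10),
}


def estimated_minutes(chars: int) -> float:
    """Rough speech duration for a given number of characters."""
    return chars / CHARS_PER_MINUTE


def overage_cost(tier: Tier, chars_used: int) -> float:
    """Charge for characters beyond the tier's monthly allowance."""
    extra = max(0, chars_used - tier.included_chars)
    if extra == 0:
        return 0.0
    if tier.overage_per_1000 is None:
        raise ValueError(f"No overage rate listed for the {tier.name} tier")
    return extra / 1_000 * tier.overage_per_1000


if __name__ == "__main__":
    # Example: a 150,000-character audiobook chapter set on the Creator tier.
    chars = 150_000
    print(f"~{estimated_minutes(chars):.0f} minutes of speech")
    print(f"overage on Creator: ${overage_cost(TIERS['Creator'], chars):.2f}")
```

Comparing the overage for the same character count across tiers shows where a higher tier starts to pay off, provided the monthly base prices are added to the comparison.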

More information at www.hume.ai
