Prime Voice AI: AI speech software (elevenlabs.io)
61 points by danboarder on Feb 1, 2023 | 53 comments


What? Most of the voices I tried sound really intense, even angry. Very strange emotional flavors for what should be quite neutral text inputs. The laughing was literally ha, ha, ha, ha. Not even remotely a genuine human laugh.

The actual sound quality of the output is impressive (clear treble, no weird artifacts between syllables, etc.), but I just don't understand the weird "edginess" of the speech.


I think their model poorly weights the spatial context in sentences. Humans speak with a little rhythm: cadence. The lack of cadence and emotion puts it deep in the uncanny valley.


They have a dial for stability and one for clarity. The stability one in particular when turned up reduces the expressiveness.


Yeah, when I tried their demo I was a bit confused too - it's not that impressive.

But their model is really good when it comes to cloning voices from small audio samples. Unfortunately, it was discovered by 4chan [1]. I have only seen a few clips, but all of them were racist, sexist or worse. Not appropriate to link them here, I guess. You can see an official sample on their YouTube channel [2]. However, the voices and conversations I heard yesterday, other than being disgusting, were high quality, believable and full of emotion. The voices of Obi-Wan Kenobi and Joe Biden sounded so genuine that it was creepy. I know there have been tools to deepfake voices for years now, but this is the first time I'm seeing one that sounds so authentic.

[1]: https://www.theverge.com/2023/1/31/23579289/ai-voice-clone-d...

[2]: https://www.youtube.com/watch?v=17_xLsqny9E


That's a pretty high bar. Have you tried inputting "ha ha ha" into literally any other TTS engine? It never produces great results.


Best I've ever heard. Steep pricing, though, to get only 2 hours a month and only 2,500 characters at a time. I was about to sign up to use this to read articles to me, but that amounts to about 4 articles per month, each fed into the generator in parts.
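For anyone feeding long articles in by hand, here is a rough sketch of how the splitting could be automated. It only uses the 2,500-character figure from the comment above; the function name `chunk_text` and the sentence-splitting regex are my own assumptions, and nothing here calls the actual service.

```python
import re

MAX_CHARS = 2500  # per-request limit quoted in the comment above


def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars, breaking at sentence ends where possible."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be pasted (or sent) separately. The regex will mis-split abbreviations like "e.g. this", so treat it as a starting point.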


The reason why ElevenLabs is so good is not because of the default voices, it's because it's so easy to train new voices. You only need a minute or two of someone speaking and it can mimic the voice pretty well, good enough to fool most people.

However, their pricing is completely wrong; it should be cheaper and offer more.


I found the pricing surprisingly reasonable (not affiliated)


I just noticed they improved the pricing a lot since last time I checked.

Before, you had free (10k a month) and $22 a month for 60k; now they've added $5 a month for 30k and $22 a month for 100k.

Attribution for free is pointless, as with custom voices in theory it would be hard to detect and enforce anyways.
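To put those tiers side by side, here is a quick back-of-the-envelope calculation. The prices and quotas are the ones quoted above; I am assuming the "k" figures are characters per month, and the plan labels are just mine.

```python
# (monthly price in USD, monthly quota) as quoted in the comment above.
# Assumption: quotas are characters per month.
plans = {
    "old $22 plan": (22.00, 60_000),
    "new $5 plan": (5.00, 30_000),
    "new $22 plan": (22.00, 100_000),
}


def cost_per_1k(price: float, quota: int) -> float:
    """Return the price in USD per 1,000 units of quota."""
    return price / quota * 1000


for name, (price, quota) in plans.items():
    print(f"{name}: ${cost_per_1k(price, quota):.3f} per 1k")
# old $22 plan: $0.367 per 1k
# new $5 plan: $0.167 per 1k
# new $22 plan: $0.220 per 1k
```

So, under that assumption, the $22 tier dropped from roughly 37 cents to 22 cents per 1k, and the new $5 tier is the cheapest per unit.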


few-shot learning for neural voices is over 3 years old now?


For hobbyist use, is this really any better than macOS' "say" command?

Once you've downloaded the Premium voices (e.g. Zoe) it's just a CLI, no API or hidden bells and whistles.

    $ say -v 'Zoe (Premium)' "This is an example of the Zoe voice for my comment on Hacker News."

You'll have to download the voice ahead of time, but Zoe (public) and Maeve (internal) are both excellent voices.


I don't have a mac, do these voices rely on a cloud component to function or do they run locally? (without internet access is the real test I guess)


Neat! Took me a while to work out how to download extra voices so in case I can save anyone else a bit of googling here's a guide: https://support.apple.com/en-gb/guide/mac-help/mchlp2290/mac


I really dig some of the voices here, they are really well done.


Zoe is good, but (to my ear) this is far superior. Zoe is still in the uncanny valley. This feels completely natural.


excellent voice yes but the voices here are superior


The voice sounds good. However, I would like to see if it's able to parse and read, e.g., a PDF file with good flow. I use (and pay for) Speechify on a daily basis to read through PDF books for my studies. I see that they still have a lot to improve, but I still couldn't find a better solution. Any suggestions?


I had the same suboptimal experience with Speechify. Did not renew.

There are text extraction utilities for PDFs which reconstitute paragraphs and whatnot. Seems like an obvious thing to do. I suggested it, but didn't hear back.

I imagine that PDF munging skills aren't common and a solo developer doesn't have the bandwidth to be smart about so many different techs.
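As a rough illustration of what "reconstituting paragraphs" can mean, here is a minimal sketch that takes already-extracted, hard-wrapped text and rebuilds paragraphs. It assumes blank lines separate paragraphs and that a trailing hyphen followed by a lowercase letter is a word broken across lines; `reflow_paragraphs` is a made-up name, and this is not what Speechify or any particular utility does.

```python
import re


def reflow_paragraphs(raw: str) -> list[str]:
    """Rebuild paragraphs from hard-wrapped text extracted from a PDF.

    Assumes blank lines mark paragraph boundaries.
    """
    paragraphs: list[str] = []
    for block in re.split(r"\n\s*\n", raw.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        text = lines[0]
        for line in lines[1:]:
            if text.endswith("-") and line[:1].islower():
                # Rejoin a word that was hyphenated across a line break.
                text = text[:-1] + line
            else:
                text += " " + line
        paragraphs.append(text)
    return paragraphs
```

The hyphen heuristic will wrongly merge genuine compounds that happen to wrap (for example "well-" / "known"), and real PDFs add headers, footers and multi-column layouts on top of this, which is presumably where the hard work is.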


I made the suggestion too, and they showed me a kind of roadmap in this direction. (I'm just a customer, no relationship with them or anything.) The roadmap, however, was just a bunch of tickets with no deadlines, so it's on their radar, probably based on your opinion and many others'. Still, I'm not really willing to renew either (unless they give me a good discount) and would like to explore other solutions.


It's pretty good. I've been using Amazon's Polly, which so far has been the most realistic to me (https://aws.amazon.com/polly/). I feel like Polly still has an edge in variety of voices.


Azure is so far ahead on neural voices it's not even funny.

https://azure.microsoft.com/en-us/products/cognitive-service...


That one is pretty average; ElevenLabs' stuff is much better.


Still sounds clearly like TTS to me.


[flagged]


Could you clarify then? I also thought it sounded like TTS and not as natural speech.


Related - I've found BeyondWords to be really nice. Its generated speech is not quite this good, but it's close, and it has a library of fairly different voices. Plus, its UI allows you to create audio with a mix of voices, which is not offered by most other such services.

Plug warning - I've been using it to create narration for short stories for a while, and the output is better than I would have expected. Here's a recent example involving two characters talking - https://storiesby.ai/p/melancholy-musings-over-drinks


I would disagree, they're almost too articulate so you get that very artificial clipped and stilted speech from those voices that you used in your story. It's especially apparent in the female voice.


Have you heard their demo reading The Great Gatsby? Best TTS I've ever heard by a margin ...

https://www.youtube.com/watch?v=qRPTwPuZLjk


Slow your roll Eleven. Sounds like you have beef with everything I feed you.


The default "Adam" voice sounds lifelike, but I wouldn't call him "conversational/clear". He sounds too forceful and dramatic, like he belongs in a cartoon.


With real voice actors, we can direct them to say their lines with more sadness. Or guarded desperation and struggle, on the verge of crying but clinging to hope... etc. This kind of subtle direction is not possible with artificial speech.

For narration it can work. But for dramatic character acting in animated films, the results make the characters sound like terrible actors. More granular control is needed over specific words, syllables, tone, emphasis and timing.


Gave it a test, and wow it's very impressive. A lot you can do with the free version also, hopefully this takes off.


Is there an open source or perpetual license way of “cloning” one's voice?

This would be a boon to those who have lost or will lose the ability to speak or speak well. Especially if it can be integrated into communication apps and one's cell phone.

The number of people who could use this is going up as the HPV+ head and neck cancer wave ramps up.


In case anyone knows, what's the defensible moat here?

I can get almost the same quality using open source models. Plus, I can fine-tune them to get custom voices. That means any company that needs TTS would find it cheaper to pay me once to build them a customized open source solution than to pay this company per minute forever.


Hmm, I don't know of any open source project that can get similar quality. Can you name one? This one also allows fine-tuning for custom voices on a minute of audio, and it works great.


TorToiSe[0] is pretty good but I agree 11 is currently state of the art. Won't be long until GP is correct though. 1.5 years at best is my guess. The next moat will be multiple languages and maybe something like more control over the tone which is something perhaps more suited to a product.

[0]: https://github.com/neonbjb/tortoise-tts


Recent and related:

This Voice Doesn't Exist – Generative Voice AI - https://news.ycombinator.com/item?id=34361651 - Jan 2023 (260 comments)


I think Google's Neural2 sounds better - https://cloud.google.com/text-to-speech/docs/voices


I want to hear this voice applied to ChatGPT output.


I'm using it for aidev.codes (which uses OpenAI's new models similar to ChatGPT) in the new dialog stuff I am developing, such as interviewing clients for requirements. The issue right now is that even though they have a streaming endpoint, the latency is all over the place and often not really adequate for something that is supposed to be conversational. But when it's working well, it's just about fast enough. I probably should ask them if there is a trick. Right now I am sending multiple sentences at a time and then playing them one after another when the audio element emits the ended event.
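For readers wondering what that "play the next one on ended" pattern looks like, here is a language-neutral sketch in Python rather than browser code. `PlaybackQueue`, `enqueue` and `on_ended` are hypothetical names, `play` stands in for whatever starts audio playback, and no ElevenLabs or browser API is being modeled here.

```python
from collections import deque
from typing import Callable, Deque


class PlaybackQueue:
    """Play clips strictly one after another, advancing on an 'ended' signal."""

    def __init__(self, play: Callable[[str], None]) -> None:
        self._play = play                    # hypothetical callback that starts playback
        self._pending: Deque[str] = deque()  # clips waiting for their turn
        self._playing = False

    def enqueue(self, clip: str) -> None:
        """Add a clip; start it immediately if nothing is playing."""
        self._pending.append(clip)
        if not self._playing:
            self._start_next()

    def on_ended(self) -> None:
        """Call when the current clip finishes (the 'ended' event)."""
        self._playing = False
        self._start_next()

    def _start_next(self) -> None:
        if self._pending:
            self._playing = True
            self._play(self._pending.popleft())
```

Clips that arrive while one is playing simply wait, so a slow response for a later batch only causes a gap if the queue runs dry first.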


Coming soon in this repo https://github.com/dasdata/gortanagtp


When this is an app I can use like Siri or Okay Google with an ElevenLab or equivalent voice, I will subscribe. Looking forward to this!


Why is it called gtp and not gpt?


Misspelling, thanks for pointing it out.


I did this last week. I asked ChatGPT to write a short article, and then used Eleven’s “Josh” voice to read it.

I had to adjust the punctuation to get it to sound more natural, but it was surprisingly good. Way better than the AI voices I hear on YouTube videos.


with another bot describing each scene and inputting that into stable diffusion


Package it as a visual novel game and sell it.


Is there an open source version we can use?


Found this project the other day that is probably good enough quality for most uses

i.e. sounds very natural and not like a robot from the 1980s and doesn't require a cloud service and can run on modest hardware

https://github.com/rhasspy/larynx

Demo: https://m.youtube.com/watch?v=hBmhDf8cl0k


There is a newer, better one: https://github.com/rhasspy/larynx2


I've tried that before, it's okay but not to Eleven Labs' level.


Didn’t work from my iPhone


I got it to work with Safari on iOS 15.7.2, iPhone 7+. Firefox was a no go, I tried it first.


Wow!



