What? Most of the voices I tried sound really intense, even angry. Very strange emotional flavors for what should be quite neutral text inputs. The laughing was literally ha, ha, ha, ha. Not even remotely a genuine human laugh.
The actual sound quality of the output is impressive (clear treble, no weird artifacts between syllables, etc.), but I just don't understand the weird "edginess" of the speech.
I think their model poorly weights the spatial context in sentences. Humans speak with a little rhythm: cadence. The lack of cadence and emotion puts it deep in the uncanny valley.
Yeah, when I tried their demo I was a bit confused too - it's not that impressive.
But their model is really good when it comes to cloning voices from small audio samples. It was discovered by 4chan, unfortunately [1]. I have only seen a few clips, but all of them were racist, sexist, or worse. Not appropriate to link them here, I guess. You can see an official sample on their YouTube channel [2]. However, the voices and conversations I heard yesterday, other than being disgusting, were high quality, believable, and full of emotion. The voices of Obi Wan Kenobi or Joe Biden sounded so genuine that it was creepy. I know there have been tools to deepfake voices for years now, but this is the first time I'm seeing one that sounds so authentic.
Best I've ever heard. Steep pricing to get only 2hrs a month and only 2,500 characters at a time, though. I was about to sign up to use this to read articles to me, but that amounts to about 4 articles per month, fed into the generator in chunks.
The reason why ElevenLabs is so good is not because of the default voices, it's because it's so easy to train new voices. You only need a minute or two of someone speaking and it can mimic the voice pretty well, good enough to fool most people.
However, their pricing is completely wrong; it should be cheaper and offer more.
The voice sounds good. However, I would like to see if it's able to parse and read, e.g., a PDF file with good flow. I use (and pay for) Speechify on a daily basis to read through PDF books for my studies. I can see that they still have a lot to improve, but I still couldn't find a better solution. Any suggestions?
I had the same suboptimal experience with Speechify. Did not renew.
There are text extraction utilities for PDFs which reconstitute paragraphs and whatnot. Seems like an obvious thing to do. I suggested it, but didn't hear back.
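The "reconstitute paragraphs" step those utilities perform can be sketched in a few lines. This is a minimal, generic illustration (not any particular tool's implementation), assuming the raw text has already been extracted from the PDF: hard-wrapped lines are rejoined, end-of-line hyphenation is undone, and blank lines are kept as paragraph breaks.

```python
import re

def reconstitute_paragraphs(raw: str) -> list[str]:
    """Rejoin hard-wrapped lines from PDF text extraction into paragraphs.

    Blank lines are treated as paragraph breaks; everything else is
    joined with single spaces. Hyphenated line breaks ("exam-\\nple")
    are merged back into whole words.
    """
    paragraphs = []
    for block in re.split(r"\n\s*\n", raw):
        # Undo end-of-line hyphenation, then collapse remaining breaks.
        joined = re.sub(r"-\n\s*", "", block)
        joined = re.sub(r"\s+", " ", joined).strip()
        if joined:
            paragraphs.append(joined)
    return paragraphs
```

Feeding paragraphs (rather than raw wrapped lines) to a TTS engine avoids the mid-sentence pauses that line breaks otherwise introduce. Real PDFs also need column detection and header/footer stripping, which this sketch ignores.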
I imagine that PDF munging skills aren't common and a solo developer doesn't have the bandwidth to be smart about so many different techs.
I made the same suggestion, and they provided me with a kind of roadmap in this direction - I'm just a customer, no relationship with them or anything. The roadmap, however, was just a bunch of tickets, no deadlines. So it's on their radar, probably based on your feedback and that of many others. I'm not really willing to renew either (unless they give me a good discount) and would like to explore other solutions.
It's pretty good. I've been using Amazon's Polly, which so far has been the most realistic to me (https://aws.amazon.com/polly/). I feel like Polly still has an edge in variety of voices.
Related - I've found BeyondWords to be really nice. Its generated speech is not quite this good, but it's close, and it has a library of fairly different voices. Plus, its UI allows you to create audio with a mix of voices, which most other such services don't offer.
Plug warning - I've been using it to create narration for short stories for a while, and the output is better than I would have expected. Here's a recent example involving two characters talking - https://storiesby.ai/p/melancholy-musings-over-drinks
I would disagree; they're almost too articulate, so you get that very artificial, clipped, and stilted speech from the voices you used in your story. It's especially apparent in the female voice.
The default "Adam" voice sounds lifelike, but I wouldn't call him "conversational/clear". He sounds too forceful and dramatic, like he belongs in a cartoon.
With real voice actors, we can direct them to say their lines with more sadness. Or guarded desperation and struggle, on the verge of crying but clinging to hope... etc. This kind of subtle direction is not possible with artificial speech.
For narration it can work. But for dramatic character acting in animated films, the results make the characters sound like terrible actors. More granular control is needed over specific words, syllables, tone, emphasis and timing.
Is there an open source or perpetual-license way of “cloning” one's voice?
This would be a boon to those who have lost or will lose the ability to speak or speak well, especially if it can be integrated into communication apps and one's cell phone.
The number of people who could use this is going up as the HPV-positive head and neck cancer wave ramps up.
In case anyone knows, what's the defensible moat here?
I can get almost the same quality using open source models. Plus, I can fine-tune them to get custom voices. That means any company that needs TTS is better off paying me once to build a customized open source solution instead of forever paying this company per minute.
Hmm, I don't know of any open source project that can get similar quality - can you name one? This one also allows fine-tuning custom voices on a minute of audio, and it works great.
TorToiSe[0] is pretty good, but I agree 11 is currently state of the art. Won't be long until GP is correct, though; 1.5 years at best is my guess. The next moat will be multiple languages, and maybe something like more control over tone, which is perhaps more suited to a product.
I'm using it for aidev.codes (which uses OpenAI's new models, similar to ChatGPT) in the new dialog features I'm developing, such as interviewing clients for requirements. The issue right now is that even though they have a streaming endpoint, the latency is all over the place and often not really adequate for something that is supposed to be conversational. But when it's working well, it's just about fast enough. I should probably ask them if there is a trick. Right now I am sending multiple sentences at a time and then playing them one after another when the audio element emits the ended event.
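The "send ahead, play in order" pattern described above can be sketched as a pipeline: a background thread requests audio for upcoming sentences while earlier clips play, so variable synthesis latency is hidden unless the buffer runs dry. This is a minimal sketch; `synthesize` is a hypothetical stub standing in for the actual TTS API call, and `play` stands in for handing bytes to an audio element or player.

```python
import queue
import threading

def synthesize(sentence: str) -> bytes:
    """Hypothetical stand-in for a TTS API call returning audio bytes."""
    return sentence.encode()

def stream_speech(sentences, play):
    """Pipeline synthesis and playback: fetch audio for upcoming
    sentences in a background thread while earlier clips play."""
    clips: queue.Queue = queue.Queue(maxsize=3)  # small lookahead buffer

    def producer():
        for s in sentences:
            clips.put(synthesize(s))  # network latency is absorbed here
        clips.put(None)  # sentinel: no more audio coming

    threading.Thread(target=producer, daemon=True).start()
    while (clip := clips.get()) is not None:
        play(clip)  # in a browser, trigger the next play on "ended"
```

The bounded queue is the key design choice: it caps how far ahead you synthesize (and pay), while still smoothing over latency spikes as long as playback time exceeds average synthesis time.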