In one of our most popular blog posts, Best AI Voice Generators: Top 3 options (2023), we talked about ElevenLabs being the best and most versatile AI voice generator of its time. Why? ElevenLabs preserves the vocal identity and delivery style of the original speaker, you.
This article will help you take your AI voice to the next level: what if you could create an audiobook with your own AI voice (cloned using AI technologies) in less than one hour, and for under $100?
The high production cost of audiobooks (before AI voice)
Time and money are among the biggest barriers for authors and creators to record and produce their own audiobooks. Hence most authors today do not have professionally produced audiobooks.
- Narration: If you hire a professional voice actor, their fee can range anywhere from $100 to $500 per finished hour (PFH) or more, depending on the narrator’s experience and demand. If you instead narrate the book with a high-quality clone of your own voice, the cost will be a fraction of what you would pay a narrator.
- Editing and Post-production: After the narration, your audio will need to be edited and mastered. Professional audio editing services charge from $50 to $100 per hour.
We won’t go into additional costs such as cover design and ongoing marketing to promote your audiobook. On top of that, when your audiobook is finally published, distribution platforms such as ACX take 40-60% of each sale.
Needless to say, the timeline for completing an audiobook can easily take months! Most of the authors I know would have to give up other projects just so they could travel to the recording studios and work with editors consistently to get it done. If one person is sick, whether it’s the author, the narrator, or the editor, the project is inevitably delayed.
As much as creators and authors are excited about the release of their audiobooks, the long road of getting there can really drain you emotionally and financially.
The good news is that there’s a better way!
What is ElevenLabs?
ElevenLabs is generative voice AI software built around what it calls “the most advanced text-to-speech and voice cloning” technology. Anyone who can communicate through speaking can use ElevenLabs to create lifelike voiceovers using their own voice. In 2023, ElevenLabs further improved its voice cloning features and launched Professional Voice Cloning (PVC), which creates a digital replica of your voice and is available through the Creator subscription tier. We believe it’s the best setup for creating your audiobook. More on this in just a moment.
But first, we must question why we should create audiobooks with ElevenLabs, instead of using a premade AI voice.
Why create audiobooks with ElevenLabs
There are many options for creating audiobooks with AI voices these days. In fact, most generative AI voice tools offer premade AI voices, and so does ElevenLabs! As you can see in their Speech Synthesis dropdown, cloning your own voice is optional.
However, it’s infinitely better and more interesting to narrate your audiobook using your own voice! We all know this, but finding the right generative AI voice software is the first and most important step.
People don’t and won’t listen to your audiobook if the AI voice is bad.
We need to confront the fact that poor-quality AI voices will not only make the project difficult but will also turn away listeners who are excited about your book, and who may never return for another audiobook from you.
By poor-quality AI voices, we mean:
- Lack of emotional expression
- Not enough vocal variety
- No contextual adaptation
And more! Poor or bad AI voices will ruin the experience for you and your audiobook listeners. The audio will sound flat, unemotional, disconnecting, and some even say just plain irritating.
This is precisely what we want to avoid by using quality software such as ElevenLabs. If you are reading this because you’re about to give up on AI voices, make sure to give ElevenLabs a try for your next project.
Not convinced? Check out this example from Seth Godin
Our favorite marketing teacher Seth Godin recorded this episode (“The dance with AI, reality and identity”) of his Akimbo podcast using ElevenLabs. He did tell us at the end of the episode, but it took us listeners a while to figure out when he started and stopped using his AI voice for the episode. We were stunned by the results.
Steps to create audiobooks with ElevenLabs
Step 1. Sign up for an ElevenLabs account
You need to sign up for an ElevenLabs account, and you can try ElevenLabs for free first. For recording your audiobook, we recommend the Creator subscription ($22/month) as it’s designed for “content creators seeking compelling narration for their content and access to Professional Voice Cloning (PVC).”
Step 2. Use “Add voice” for cloning
Click on “Speech Synthesis”, then click on “+Add voice” to begin cloning your voice!
You will be taken to the VoiceLab where you can access your existing cloned voice profiles, or add a new one by clicking on “Add Generative or Cloned Voice”.
You will be given the choice of a “Type of voice to create”. Again, we recommend “Professional Voice Cloning”, which requires a Creator subscription.
Once subscribed to the Creator plan, you will notice that the character limit is 110,000. The average book has about 50,000 words, and each word has about 5 characters, so the average book is about 250,000 characters. With the Creator plan, you have the option to “Enable usage-based billing (surpass 110,000 characters)”. Turn this toggle on.
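To make the estimate above concrete, here is a minimal Python sketch. The function names are hypothetical, and counting spaces and punctuation toward the quota is an assumption on our part, so treat the second function as an upper-bound check on your own manuscript instead of an exact match for how ElevenLabs counts.

```python
def estimate_characters(word_count: int, chars_per_word: int = 5) -> int:
    """Rough estimate using the rule of thumb of about 5 characters per word."""
    return word_count * chars_per_word


def count_characters(text: str) -> int:
    """Count every character in the text, including spaces and punctuation.

    Assumption: spaces and punctuation count toward the quota.
    """
    return len(text)


# The average book from the paragraph above: 50,000 words.
print(estimate_characters(50_000))  # 250000

# Counting a real passage usually gives more than 5 characters per word,
# because the rule of thumb ignores spaces and punctuation.
sample = "Hello, world!"
print(count_characters(sample))  # 13
```

If you have your manuscript as a text file, reading it into a string and passing it to `count_characters` gives a safer number to budget with than the word-count shortcut.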
Step 3. Understand usage-based billing for completing your book
You have 110,000 characters included in your current subscription. For every 1,000 characters above that, you will be charged $0.30 (30 cents). ElevenLabs will charge your payment method every time your account reaches $44.
Hence for an average 250,000-character book, you are looking at a total cost of around $64: $22 (subscription cost) + $42 (140,000 additional characters at $0.30 per 1,000).
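If your book is longer or shorter than average, the same arithmetic can be sketched in a few lines of Python. The function name and parameters below are our own, the prices are the Fall 2023 figures quoted in this article, and rounding up to the next started block of 1,000 characters is an assumption. Amounts are kept in cents to avoid floating-point rounding.

```python
import math


def estimate_cost(
    total_chars: int,
    included_chars: int = 110_000,     # characters included in the Creator plan
    subscription_cents: int = 2200,    # $22 per month
    overage_cents_per_1000: int = 30,  # $0.30 per extra 1,000 characters
) -> dict:
    """Estimate one month's cost for generating a book of `total_chars` characters."""
    extra_chars = max(0, total_chars - included_chars)
    # Assumption: overage is billed per started block of 1,000 characters.
    extra_blocks = math.ceil(extra_chars / 1000)
    overage_cents = extra_blocks * overage_cents_per_1000
    return {
        "extra_chars": extra_chars,
        "overage_usd": overage_cents / 100,
        "total_usd": (subscription_cents + overage_cents) / 100,
    }


# An average 250,000-character book: 140,000 extra characters.
print(estimate_cost(250_000))
# {'extra_chars': 140000, 'overage_usd': 42.0, 'total_usd': 64.0}
```

For the average book this works out to $22 + $42 = $64. Regenerating passages you are not happy with also consumes characters, so it is worth leaving some headroom.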
Step 4. Generate your book using the project feature
Once your voice is cloned and ready to use, you can begin generating your book! Your voice won’t be ready right away if you are using Professional Voice Cloning (which takes about 4 weeks as of Fall 2023). If you need to record your book right away, you can choose to use Instant Voice Cloning instead.
1. To access the project feature, click on “Projects” at the top, and then click on “+ Create new project”.
2. I prefer using “Create an empty project” because this allows me to build out the chapters myself.
3. Start building out your chapters on the right-hand side! You can also include sections such as “Introduction” or anything else that comes before Chapter 1, Chapter 2, etc.
4. When you are done, click “Convert”.
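If your manuscript lives in a single text file, a small script can split it into sections before you paste them into the project one by one. This is only a sketch: it assumes every section starts on its own line with “Introduction” or “Chapter” followed by a number, and the function name is hypothetical. It does not talk to ElevenLabs at all.

```python
import re

# Assumption: section headings sit on their own line and look like
# "Introduction", "Chapter 1" or "Chapter 1: Some Title".
SECTION_HEADING = re.compile(r"^(Introduction|Chapter \d+)\b.*$", re.MULTILINE)


def split_into_sections(manuscript: str) -> list[tuple[str, str]]:
    """Return (heading, body) pairs in the order they appear in the manuscript."""
    matches = list(SECTION_HEADING.finditer(manuscript))
    sections = []
    for i, match in enumerate(matches):
        body_start = match.end()
        # The body runs until the next heading, or to the end of the text.
        body_end = matches[i + 1].start() if i + 1 < len(matches) else len(manuscript)
        sections.append((match.group(0).strip(), manuscript[body_start:body_end].strip()))
    return sections


manuscript = """Introduction
Why this book exists.

Chapter 1: Beginnings
It started small.

Chapter 2
Then it grew.
"""

for heading, body in split_into_sections(manuscript):
    print(f"{heading}: {len(body)} characters")
```

The per-section character counts are also a handy way to see which chapters will use up most of your quota. A body line that happens to begin with “Chapter 3” would be mistaken for a heading, so skim the result before relying on it.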
What about Voice Settings?
Voice Settings appears as one of the dropdowns. By default, Stability is set at 65%, Clarity + Similarity Enhancement is set at 72%, and Style Exaggeration is set at 0%. These defaults generally work well, but to make sure you like the result, we recommend testing them on just a few paragraphs of your book first. If you change the defaults and find a setting that suits your voice best, PLEASE remember to note down each setting’s percentage so you can replicate it in the future.
Here’s what each setting means:
Stability
- More variable: Increasing variability can make speech more expressive with output varying between re-generations. It can also lead to instabilities.
- More stable: Increasing stability will make the voice more consistent between re-generations, but it can also make it sound a bit monotone. On longer text fragments we recommend lowering this value.
Clarity + Similarity Enhancement
- Low: Low values are recommended if background artifacts are present in generated speech.
- High: High enhancement boosts overall voice clarity and target speaker similarity. Very high values can cause artifacts, so adjusting this setting to find the optimal value is encouraged.
Style Exaggeration
- None: No style exaggeration.
- High: High values are recommended if the style of the speech should be exaggerated compared to the uploaded audio. Higher values can lead to more instability in the generated speech. Setting this to 0.0 will greatly increase generation speed and is the default setting.
Speaker Boost
- Boosts the similarity of the synthesized speech and the voice at the cost of some generation speed.
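One simple way to note down your settings so you can reproduce them later is to keep them in a small file next to your manuscript. The sketch below is an illustration only: the class and field names are our own (they are not ElevenLabs API names), and the defaults are the percentages mentioned above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class VoiceSettings:
    """Slider positions as percentages (0-100), as shown in the Voice Settings dropdown."""

    stability: int = 65
    clarity_similarity: int = 72
    style_exaggeration: int = 0

    def __post_init__(self):
        # Reject values that cannot be a slider position.
        for name, value in asdict(self).items():
            if not 0 <= value <= 100:
                raise ValueError(f"{name} must be between 0 and 100, got {value}")

    def to_json(self) -> str:
        """Serialize the settings so they can be saved to a notes file."""
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> "VoiceSettings":
        """Rebuild the settings from a previously saved JSON string."""
        return cls(**json.loads(text))


# Record the combination that worked best in your test paragraphs.
best = VoiceSettings(stability=55, clarity_similarity=75)
saved = best.to_json()
print(saved)

# Later, load it back and set the sliders to the same values.
assert VoiceSettings.from_json(saved) == best
```

Writing `saved` to a file such as `voice-settings.json` (a made-up name) keeps the numbers with your project, which matters if you come back months later to record a second book.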
Step 5. Stitching audios together
While there are multiple ways to stitch together audio, I do recommend that you consider working with an audio editor on this final step. It won’t take much time, and the cost won’t be significant. This will ensure the quality and transitions between sections are smooth and professional. If you want to consult us on this step, you can contact us here.
Alternatively, you could also complete this step on your own. There are two main ways to stitch audio together: using a digital audio workstation (DAW) or using an online audio joiner.
Using a DAW:
- Open your DAW and import the audio files that you want to stitch together.
- Arrange the audio files in the order that you want them to play.
- Use the DAW’s editing tools to trim and fade the audio files so that they transition smoothly from one to the next.
- Export the final stitched audio file.
DAWs we love include:
Using an online audio joiner:
- Go to an online audio joiner website, such as Clideo or Audio Joiner.
- Upload the audio files that you want to stitch together.
- Arrange the audio files in the order that you want them to play.
- Click the “Stitch” button to merge the audio files together.
- Download the final stitched audio file.
Which method you choose will depend on your personal preferences and needs. If you are comfortable using a DAW, then that will give you the most control over the stitching process. However, if you are not familiar with DAWs, then using an online audio joiner is a quick and easy way to stitch audio together.
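If you are comfortable running a short script, there is also a third, do-it-yourself option that is not part of the two methods above. The sketch below joins chapter files back to back using only Python’s standard `wave` module. It assumes your chapters are uncompressed WAV files with identical channel count, sample width, and sample rate (MP3 downloads would need converting first), and it adds no fades or pauses between chapters. The function name is hypothetical.

```python
import wave


def stitch_wav_files(input_paths, output_path) -> int:
    """Concatenate WAV files in order and return the total number of frames written."""
    if not input_paths:
        raise ValueError("need at least one input file")

    # Use the first file's format as the reference for all the others.
    with wave.open(str(input_paths[0]), "rb") as first:
        expected = (first.getnchannels(), first.getsampwidth(), first.getframerate())

    total_frames = 0
    with wave.open(str(output_path), "wb") as out:
        out.setnchannels(expected[0])
        out.setsampwidth(expected[1])
        out.setframerate(expected[2])
        for path in input_paths:
            with wave.open(str(path), "rb") as src:
                found = (src.getnchannels(), src.getsampwidth(), src.getframerate())
                if found != expected:
                    raise ValueError(f"{path} has format {found}, expected {expected}")
                frames = src.getnframes()
                out.writeframes(src.readframes(frames))
                total_frames += frames
    return total_frames
```

For example, `stitch_wav_files(["intro.wav", "chapter1.wav"], "book.wav")` (file names made up here) would write both files into `book.wav` in that order. Because there is no trimming or crossfading, a DAW or an audio editor is still the better choice when the transitions need polish.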
How does Professional Voice Cloning (PVC) work?
Professional Voice Cloning (PVC), unlike Instant Voice Cloning (IVC) which lets you clone voices with very short samples nearly instantaneously, allows you to train a hyper-realistic model of a voice. This is achieved by training a dedicated model on a large set of voice data to produce a model that’s indistinguishable from the original voice.
Here’s what you should know in terms of ElevenLabs’ process, timeline, and best practices for achieving optimal results.
Since the custom models require fine-tuning and training, it will take some time before you can use your voice clone. Giving an estimate is challenging as it depends on the number of people in the queue before you and a few other factors. However, we recommend allowing around 4 weeks until you receive your voice clone. We hope it may be done quicker, but this remains a rough estimate.
🎙️ Professional Recording Equipment: Use high-quality recording equipment for optimal results as the AI will clone everything about the audio. High-quality input = high-quality output. Any microphone will work, but an XLR mic going into a dedicated audio interface would be our recommendation. A few general recommendations on low-end would be something like an Audio Technica AT2020 or a Rode NT1 going into a Focusrite interface or similar.
🗣️ Use a Pop Filter: Use a pop-filter when recording. This will minimize plosives when recording.
📏 Microphone Distance: Position yourself at the right distance from the microphone – approximately two fists away from the mic is recommended, but it also depends on what type of recording you want.
💥 Noise-Free Recording: Ensure that the audio input doesn’t have any interference, like background music or noise. The AI cloning works best with clean, uncluttered audio.
🎧 Room Acoustics: Preferably, record in an acoustically-treated room. This reduces unwanted echoes and background noises, leading to clearer audio input for the AI. You can make something temporary using a thick duvet or quilt to dampen the recording space.
⚙️ Audio Pre-processing: Consider editing your audio beforehand if you’re aiming for a specific sound output. For instance, if you want a polished podcast-like output, pre-process your audio to match that quality. Likewise, edit out long pauses and any “uhm”s and “ahm”s between words, as the AI will mimic those as well.
🎚️ Volume Control: Maintain a consistent volume that’s loud enough to be clear but not so loud that it causes distortion. The goal is to achieve a balanced and steady audio level. The ideal would be between -23 dB and -18 dB RMS with a true peak of -3 dB.
🔊 Sufficient Audio Length: Provide at least 30 minutes of high-quality audio that follows the above guidelines for best results, preferably closer to 3 hours of audio. The more quality data you can feed into the AI, the better the voice clone will be. The number of samples is irrelevant; the total runtime is what matters. However, if you plan to upload multiple hours of audio, it is better to split it into multiple ~30-minute samples. This makes it easier to upload.
📁 Uploading: After pressing upload, you will not be able to make any changes to the clone and it will be locked in. Ensure that you have uploaded the correct samples that you want to use.
✅ Verify Your Voice: Once everything is recorded and uploaded, you will be asked to verify your voice. To ensure a smooth experience, please try to verify your voice using the same or similar equipment used to record the samples and in a tone and delivery that is similar to what was present in the samples. If you do not have access to the same equipment, try verifying the best you can. If it fails, you will have to reach out to support.
Keep in mind that all of this depends on the output you want. The AI will try to clone everything in the audio, but for the AI to work optimally and predictably, we suggest following the guidelines mentioned above.
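Before uploading, you can sanity-check a recording against the volume and length guidelines above with a short script. The following is a rough sketch under several assumptions: the file is 16-bit PCM WAV, levels are measured relative to digital full scale, and the peak is a plain sample peak (a real true-peak measurement needs oversampling, which a DAW or loudness meter does for you). The function names are our own.

```python
import array
import math
import sys
import wave

FULL_SCALE = 32768  # largest magnitude of a 16-bit sample


def _to_db(value: float) -> float:
    """Convert a linear amplitude to decibels relative to full scale."""
    return 20 * math.log10(value / FULL_SCALE) if value > 0 else -math.inf


def analyze_wav(path) -> dict:
    """Return duration (seconds), RMS level and sample peak (dB) of a 16-bit WAV file."""
    with wave.open(str(path), "rb") as src:
        if src.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM audio")
        frame_rate = src.getframerate()
        n_frames = src.getnframes()
        samples = array.array("h", src.readframes(n_frames))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    if not samples:
        raise ValueError("file contains no audio")
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return {
        "duration_s": n_frames / frame_rate,
        "rms_db": _to_db(rms),
        "peak_db": _to_db(peak),
    }


def check_against_guidelines(stats: dict) -> list[str]:
    """List which of the level guidelines above the recording misses."""
    notes = []
    if not -23.0 <= stats["rms_db"] <= -18.0:
        notes.append(f"RMS level {stats['rms_db']:.1f} dB is outside -23 to -18 dB")
    if stats["peak_db"] > -3.0:
        notes.append(f"peak {stats['peak_db']:.1f} dB is above -3 dB")
    return notes
```

Running `analyze_wav` on each sample and adding up the `duration_s` values tells you whether you have reached the 30-minute minimum in total. An empty list from `check_against_guidelines` means the levels sit inside the suggested range; it says nothing about background noise or room echo, which you still need to judge by ear.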
Please note: if PVC feels overwhelming or takes too long to train, you can still use ElevenLabs’ Instant Voice Cloning to clone your voice and then produce your AI audiobook.
ACX and AI Voices
There is one barrier to publishing audiobooks narrated by AI voices: under its current policy, ACX/Audible does not accept them. Auto-narrated audiobooks using AI voices are, however, accepted on other platforms.
Where to Publish Your AI Voice Audiobooks
As of the writing of this article, ACX/Audible does not allow AI audiobooks, but other platforms do, including major players such as:
- Google Play Books
- A.I. Book Publisher
As well as publishing platforms such as:
- Findaway Voices
- Kobo Writing Life
- Author’s Republic
Pricing for ElevenLabs
You can get started with ElevenLabs for free. They have additional packages including:
- Starter ($5/month)
- Creator ($22/month)
- Independent publisher ($99/month)
- Growing businesses ($330/month)
Legal, compliance, and not-so-fun stuff you should know
There are a number of legal and compliance issues to consider when producing and distributing AI audiobooks. These include:
- Copyright: The book behind an AI audiobook is protected by copyright, just as with a traditional audiobook. This means that you will need to obtain permission from the copyright holder before producing or distributing an AI audiobook of it. In short, focus on AI audiobooks for books written by you, not someone else.
- Accuracy: AI audiobooks can be very accurate, but it is important to make sure that the content is accurate before distributing it. This is especially important for compliance-related audiobooks, which may contain complex legal and regulatory information.
- Attribution: If you are using AI to generate the content of the audiobook, you should attribute the content to the AI system. This is important for transparency and to avoid claims of plagiarism.
Best Practices for AI Voice in Audiobooks
- Have your audiobooks reviewed by a human. Before distributing your AI audiobooks, have them reviewed by a human to ensure that the content is accurate and appropriate. This is especially important for compliance-related audiobooks.
- Use clear and concise language. When writing the content for your AI audiobooks, use clear and concise language that is easy to understand. This will help to avoid any confusion or misunderstandings. If the book is written and reviewed by humans, the content is likely to be higher quality and more relevant for your readers. However, if you are creating the book entirely using generative AI, it’s essential to have it reviewed by you or a human editor.
- Keep up with the latest laws and regulations. The laws and regulations surrounding AI are constantly evolving, so it is important to keep up with the latest developments. This will help you to ensure that your AI audiobooks are always compliant.
Conclusion: Record an AI voice trained by you in ElevenLabs
So is it worth spending the time and money (under $100) to create your audiobook with your voice?
The answer is YES if the high cost and complicated logistics of recording an audiobook prevent you from creating one for your book. ElevenLabs is a wonderful alternative that helps make audiobooks accessible not only to you as a creator, but to your listeners who would prefer audiobooks over other formats of your book.
If you are a self-published author and own the rights to your book, cloning your voice with AI and distributing your book on AI audiobook platforms is often easier.
However, if you are working with a publisher who owns the rights to your book, you will need to consult them first before recording an audiobook with or without AI.
Generative AI is changing constantly, and so is the publishing industry, including audiobook production and distribution. I hope this article sheds some light for creators and authors who wish to tell their stories and grow their businesses by reaching a bigger audience.
At Feisworld, we believe in full-stack content marketing to help creators and small businesses thrive on multiple platforms and media without boundaries. If that sounds interesting, check out How to Grow Your Business with Full-Stack Content Marketing and AI (Masterclass)