Text-to-speech (TTS) synthesis is an Artificial Intelligence (AI) technique that renders natural-sounding speech from arbitrary text. It is a key technological component in many important applications, including virtual assistants, AI-generated audiobooks, speech-to-speech translation, AI news reporters, audible driving guidance, and digital humans. In the past decade, significant progress has been made in TTS. These advances are mainly attributed to Deep Learning techniques, and the resulting systems are usually referred to as neural TTS. Many neural TTS systems have achieved human-level quality on the tasks they are designed for.
This book first reviews the history of TTS technologies, gives an overview of neural TTS, and provides preliminary knowledge of language and speech processing, neural networks and Deep Learning, and deep generative models. It then introduces neural TTS from the perspective of its key components (text analysis, acoustic models, vocoders, and fully end-to-end models) and advanced topics (expressive and controllable TTS, robust TTS, model-efficient TTS, and data-efficient TTS). It also points out future research directions and collects resources related to TTS.
Although many TTS books have been published, this book is the first of its kind to provide a comprehensive introduction to neural TTS, covering the key components (such as text analysis, acoustic models, and vocoders), the milestone models (such as Tacotron, DeepVoice, and FastSpeech), and more advanced techniques (such as expressive and controllable TTS, robust TTS, and efficient TTS). Xu Tan, the author of this book, has contributed significantly to the recent advances in TTS. He has developed several impactful neural TTS systems, including FastSpeech 1/2, DelightfulTTS, and NaturalSpeech, the last of which has achieved human parity on a TTS benchmark dataset. His knowledge of the domain and his first-hand experience with the topic allow him to organize the contents effectively, make them accessible to readers, and describe the key concepts, basic methods, and state-of-the-art techniques and their relationships clearly and in detail. I am very glad that he introduces and clarifies many key concepts and much background knowledge at the beginning of the book, so that readers with little or no knowledge of TTS can read and understand it effectively.
This is a very well-written book and certainly one that provides useful and thoughtful information to readers at various levels. I believe it is a great reference for all researchers, practitioners, and students who want to quickly grasp the history, the state of the art, and the future directions of speech synthesis, or who seek insight into the development of TTS.
Author(s): Xu Tan
Series: Artificial Intelligence: Foundations, Theory, and Algorithms
Publisher: Springer
Year: 2023
Language: English
Pages: 214
Foreword by Dong Yu
Foreword by Heiga Zen
Foreword by Haizhou Li
Preface
Acknowledgements
Contents
Acronyms
About the Author
1 Introduction
1.1 Motivation
1.2 History of TTS Technology
1.2.1 Articulatory Synthesis
1.2.2 Formant Synthesis
1.2.3 Concatenative Synthesis
1.2.4 Statistical Parametric Synthesis
1.3 Overview of Neural TTS
1.3.1 TTS in the Era of Deep Learning
1.3.2 Key Components of TTS
1.3.3 Advanced Topics in TTS
1.3.4 Other Taxonomies of TTS
1.3.5 Evolution of Neural TTS
1.4 Organization of This Book
References
Part I Preliminary
2 Basics of Spoken Language Processing
2.1 Overview of Linguistics
2.1.1 Phonetics and Phonology
2.1.2 Morphology and Syntax
2.1.3 Semantics and Pragmatics
2.2 Speech Chain
2.2.1 Speech Production and Articulatory Phonetics
Voiced vs Unvoiced and Vowels vs Consonants
Source-Filter Model
2.2.2 Speech Transmission and Acoustic Phonetics
2.2.3 Speech Perception and Auditory Phonetics
How Humans Perceive Sound
Differences Between Auditory Perception and Physical Properties of Sound
Evaluation Metrics for Speech Perception
2.3 Speech Signal Processing
2.3.1 Analog-to-Digital Conversion
Sampling
Quantization
2.3.2 Time-to-Frequency Domain Transformation
Discrete-Time Fourier Transform (DTFT)
Discrete Fourier Transform (DFT)
Fast Fourier Transform (FFT)
Short-Time Fourier Transform (STFT)
2.3.3 Cepstral Analysis
2.3.4 Linear Predictive Coding/Analysis
2.3.5 Speech Parameter Estimation
Voiced/Unvoiced/Silent Speech Detection
F0 Detection
Formant Estimation
2.3.6 Overview of Speech Processing Tasks
References
3 Basics of Deep Learning
3.1 Machine Learning Basics
3.1.1 Learning Paradigms
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Semi-supervised Learning
Self-supervised Learning
Pre-training/Fine-Tuning
Transfer Learning
3.1.2 Key Components of Machine Learning
3.2 Deep Learning Basics
3.2.1 Model Structures: DNN/CNN/RNN/Self-attention
DNN
CNN
RNN
Self-attention
Comparison Between Different Structures
3.2.2 Model Frameworks: Encoder/Decoder/Encoder-Decoder
Encoder
Decoder
Encoder-Decoder
3.3 Deep Generative Models
3.3.1 Autoregressive Models
3.3.2 Normalizing Flows
3.3.3 Variational Auto-encoders
3.3.4 Denoising Diffusion Probabilistic Models
3.3.5 Score Matching with Langevin Dynamics, SDEs, and ODEs
3.3.6 Generative Adversarial Networks
3.3.7 Comparisons of Deep Generative Models
References
Part II Key Components in TTS
4 Text Analyses
4.1 Text Processing
4.1.1 Document Structure Detection
4.1.2 Text Normalization
4.1.3 Linguistic Analysis
Sentence Breaking and Type Detection
Word/Phrase Segmentation
Part-of-Speech Tagging
Homograph and Word Sense Disambiguation
4.2 Phonetic Analysis
4.2.1 Polyphone Disambiguation
4.2.2 Grapheme-to-Phoneme Conversion
4.3 Prosodic Analysis
4.3.1 Pause, Stress, and Intonation
4.3.2 Pitch, Duration, and Loudness
4.4 Text Analysis from a Historic Perspective
4.4.1 Text Analysis in SPSS
4.4.2 Text Analysis in Neural TTS
References
5 Acoustic Models
5.1 Acoustic Models from a Historic Perspective
5.1.1 Acoustic Models in SPSS
5.1.2 Acoustic Models in Neural TTS
5.2 Acoustic Models with Different Structures
5.2.1 RNN-Based Models (e.g., Tacotron Series)
Tacotron
Tacotron 2
Other Tacotron Related Acoustic Models
5.2.2 CNN-Based Models (e.g., DeepVoice Series)
5.2.3 Transformer-Based Models (e.g., FastSpeech Series)
TransformerTTS
FastSpeech
FastSpeech 2
5.2.4 Advanced Generative Models (GAN/Flow/VAE/Diffusion)
GAN-Based Models
Flow-Based Models
VAE-Based Models
Diffusion-Based Models
References
6 Vocoders
6.1 Vocoders from a Historic Perspective
6.1.1 Vocoders in Signal Processing
6.1.2 Vocoders in Neural TTS
6.2 Vocoders with Different Generative Models
6.2.1 Autoregressive Vocoders (e.g., WaveNet)
6.2.2 Flow-Based Vocoders (e.g., Parallel WaveNet, WaveGlow)
6.2.3 GAN-Based Vocoders (e.g., MelGAN, HiFiGAN)
6.2.4 Diffusion-Based Vocoders (e.g., WaveGrad, DiffWave)
6.2.5 Other Vocoders
References
7 Fully End-to-End TTS
7.1 Prerequisite Knowledge for Reading This Chapter
7.2 End-to-End TTS from a Historic Perspective
7.2.1 Stage 0: Character→Linguistic→Acoustic→Waveform
7.2.2 Stage 1: Character/Phoneme→Acoustic→Waveform
7.2.3 Stage 2: Character→Linguistic→Waveform
7.2.4 Stage 3: Character/Phoneme→Spectrogram→Waveform
7.2.5 Stage 4: Character/Phoneme→Waveform
7.3 Fully End-to-End Models
7.3.1 Two-Stage Training (e.g., Char2Wav, ClariNet)
7.3.2 One-Stage Training (e.g., FastSpeech 2s, EATS, VITS)
7.3.3 Human-Level Quality (e.g., NaturalSpeech)
References
Part III Advanced Topics in TTS
8 Expressive and Controllable TTS
8.1 Categorization of Variation Information in Speech
8.1.1 Text/Content Information
8.1.2 Speaker/Timbre Information
8.1.3 Style/Emotion Information
8.1.4 Recording Devices or Noise Environments
8.2 Modeling Variation Information for Expressive Synthesis
8.2.1 Explicit or Implicit Modeling
8.2.2 Modeling in Different Granularities
8.3 Modeling Variation Information for Controllable Synthesis
8.3.1 Disentangling for Control
8.3.2 Improving Controllability
8.3.3 Transferring with Control
References
9 Robust TTS
9.1 Improving Generalization Ability
9.2 Improving Text-Speech Alignment
9.2.1 Enhancing Attention
9.2.2 Replacing Attention with Duration Prediction
9.3 Improving Autoregressive Generation
9.3.1 Enhancing AR Generation
9.3.2 Replacing AR Generation with NAR Generation
References
10 Model-Efficient TTS
10.1 Parallel Generation
10.1.1 Non-Autoregressive Generation with CNN or Transformer
10.1.2 Non-Autoregressive Generation with GAN, VAE, or Flow
10.1.3 Iterative Generation with Diffusion
10.2 Lightweight Modeling
10.2.1 Model Compression
10.2.2 Neural Architecture Search
10.2.3 Other Technologies
10.3 Efficient Modeling with Domain Knowledge
10.3.1 Linear Prediction
10.3.2 Multiband Modeling
10.3.3 Subscale Prediction
10.3.4 Multi-Frame Prediction
10.3.5 Streaming or Chunk-Wise Synthesis
10.3.6 Other Technologies
References
11 Data-Efficient TTS
11.1 Language-Level Data-Efficient TTS
11.1.1 Self-Supervised Training
11.1.2 Cross-Lingual Transfer
11.1.3 Semi-Supervised Training
11.1.4 Mining Datasets in the Wild
11.1.5 Purely Unsupervised Learning
11.2 Speaker-Level Data-Efficient TTS
11.2.1 Improving Generalization
11.2.2 Cross-Domain Adaptation
11.2.3 Few-Data Adaptation
11.2.4 Few-Parameter Adaptation
11.2.5 Zero-Shot Adaptation
References
12 Beyond Text-to-Speech Synthesis
12.1 Singing Voice Synthesis
12.1.1 Challenges in Singing Voice Synthesis
12.1.2 Representative Models for Singing Voice Synthesis
12.2 Voice Conversion
12.2.1 Brief Overview of Voice Conversion
12.2.2 Representative Methods for Voice Conversion
12.3 Speech Enhancement/Separation
References
Part IV Summary and Outlook
13 Summary and Outlook
13.1 Summary
13.2 Future Directions
13.2.1 High-Quality Speech Synthesis
13.2.2 Efficient Speech Synthesis
References
A Resources of TTS
B TTS Model List
References