Voice Personalization. For a given text in any language or… | by Sangramsing Kayte | Jul, 2021
The linguistic characteristics derived from a given text are mapped to acoustic features via a GAN-based TTS system.
The features of the linguistic input include binary answers to linguistic context questions and numerical values, such as the number of words in the current sentence, the position of the current syllable in the word, and the length of the current phoneme.
The acoustic features consist of mel-cepstral coefficients, excitation parameters, and the fundamental frequency. The neural network parameters can be trained using the input and output pairs derived from a training data set. Finally, a WORLD vocoder is used to synthesize speech from the acoustic features. The proposed model consists of two components, as follows:
Emotion Analysis Module:-
The goal of this module is to differentiate levels of emotional strength and to produce a control vector that can manipulate emotional strength continuously. The emotion analysis module takes emotional speech as input and produces a vector that is fed into the speech synthesis module as additional information for expressiveness.
The following steps are followed by this module:
1. Input the emotional speech corpus and extract the emotional characteristics of speech using the openSMILE tool.
2. Apply k-means clustering to cluster the emotions in the dataset.
3. Obtain the embedding vector, which represents the emotion, using the t-SNE algorithm.
For the emotion analysis module, the inputs are 384-dimensional emotion features extracted from the speech corpus. We then perform cluster analysis to categorize the dataset as happy, sad, angry, or neutral, applying the k-means clustering algorithm to partition the unannotated speech corpus into k clusters. After cluster analysis, the t-SNE algorithm converts the features into a 2-dimensional embedding vector.
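The clustering and embedding steps above can be sketched with scikit-learn (listed in the dependencies below). Random vectors stand in for the 384-dimensional openSMILE features here, so this is a shape-level illustration of the pipeline, not the actual feature extraction:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Stand-in for the 384-dimensional openSMILE emotion features
# (random data here; in practice, extract them from the speech corpus).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 384))

# Partition the unlabeled corpus into k=4 emotion clusters
# (happy, sad, angry, neutral).
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(features)

# Reduce each utterance's features to a 2-dimensional embedding
# vector that can condition the synthesis module.
embeddings = TSNE(n_components=2, random_state=0).fit_transform(features)

print(cluster_ids.shape, embeddings.shape)  # (200,) (200, 2)
```

On real openSMILE features, the four clusters would correspond to the perceived emotion categories, and the 2-D embedding would serve as the continuous control vector.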
The speech synthesis module takes the text and the control vector as input. The text front-end module extracts the linguistic features from the text, which are concatenated with the control embedding vector to form the input features for the GAN. The model is then trained to learn a function that maps the input features to acoustic features.
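A minimal NumPy sketch of that mapping, with a randomly initialized toy network standing in for the trained GAN generator. The 297-dimensional linguistic input and 2-dimensional control embedding follow the feature sizes described in this article; the 199-dimensional output is (60 + 5 + 1) × 3 + 1 per the acoustic feature description:

```python
import numpy as np

rng = np.random.default_rng(0)

LINGUISTIC_DIM = 297   # linguistic features from the text front-end
CONTROL_DIM = 2        # t-SNE control embedding from the emotion module
ACOUSTIC_DIM = 199     # acoustic features predicted by the generator

def generator_forward(linguistic, control, w1, b1, w2, b2):
    """One toy forward pass of the generator G: concatenated
    input features -> hidden layer -> acoustic features."""
    x = np.concatenate([linguistic, control], axis=-1)
    h = np.tanh(x @ w1 + b1)
    return h @ w2 + b2

# Randomly initialized weights stand in for trained GAN parameters.
hidden = 256
w1 = rng.normal(scale=0.01, size=(LINGUISTIC_DIM + CONTROL_DIM, hidden))
b1 = np.zeros(hidden)
w2 = rng.normal(scale=0.01, size=(hidden, ACOUSTIC_DIM))
b2 = np.zeros(ACOUSTIC_DIM)

frame = generator_forward(rng.normal(size=LINGUISTIC_DIM),
                          rng.normal(size=CONTROL_DIM), w1, b1, w2, b2)
print(frame.shape)  # (199,)
```

In the real system this forward pass runs per frame over a deeper network, with weights learned adversarially rather than sampled at random.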
Finally, the emotion-conditioned acoustic features are fed directly into a vocoder to synthesize the final speech waveform, yielding a variety of emotions in the output speech.
For the speech synthesis module, the input consists of 297-dimensional linguistic features, together with the embedding vector generated by the emotion analysis module. The output consists of the acoustic features extracted using the WORLD vocoder: a 60-dimensional mel-generalized cepstral coefficient (MGC) vector, a 5-dimensional band-aperiodicity (BAP) vector, and the fundamental frequency, each extracted every 5 ms. Thus, the output features of the neural network consist of the MGCs, BAPs, and log F0 with their deltas and delta-deltas, as well as a voiced/unvoiced binary feature.
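Assembling that output vector can be sketched in NumPy. The delta computation here uses a simple [-0.5, 0, 0.5] difference window as an approximation, and random values stand in for the WORLD-extracted features:

```python
import numpy as np

def deltas(static, window=(-0.5, 0.0, 0.5)):
    """Approximate delta features by convolving each coefficient
    track with a simple difference window (edge-padded)."""
    padded = np.pad(static, ((1, 1), (0, 0)), mode="edge")
    w = np.asarray(window)
    return sum(w[i] * padded[i:i + len(static)] for i in range(3))

rng = np.random.default_rng(0)
T = 100                                # frames, one every 5 ms
mgc = rng.normal(size=(T, 60))         # mel-generalized cepstral coeffs
bap = rng.normal(size=(T, 5))          # band aperiodicity
lf0 = rng.normal(size=(T, 1))          # log F0
vuv = rng.integers(0, 2, size=(T, 1))  # voiced/unvoiced binary flag

static = np.concatenate([mgc, bap, lf0], axis=1)  # 66 static dims
out = np.concatenate(
    [static, deltas(static), deltas(deltas(static)), vuv], axis=1)
print(out.shape)  # (100, 199): (60 + 5 + 1) * 3 + 1
```

This yields the (60 + 5 + 1) × 3 + 1 = 199 output dimensions per frame that the network is trained to predict.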
Database:- We need 8–10 hours of speech corpus from American English speakers covering the emotions angry, happy, sad, and neutral. It is important to note that we need high-quality emotional recordings, meaning the emotions must be clearly differentiated in the data in order to generate emotion-based synthesis.
The open-source frameworks used are free for commercial use:
Python 3.6 or 3.7
TensorFlow, Keras — (under the Apache license)
PyTorch, Matplotlib, SciPy, NumPy, tqdm, Numba, SoundFile, multiprocess, Unidecode, scikit-learn — (under the BSD license)
webrtcvad, inflect, sounddevice (sounddevice is an alternative to PyAudio) — (under the MIT license)
librosa — (under the ISC license)
We need a high-quality emotional dataset. Another challenging aspect is using a GAN-based model to produce emotions in the synthesized speech.
Figure: The detailed framework of the proposed system.
Frontend Text Processing:- The task of the frontend text processing block is to extract the linguistic features from a given input text. It consists of the following steps.
Text Normalization:- This step defines a set of rules for expanding commonly used abbreviations, acronyms, numbers based on context, etc.
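As an illustration, a toy normalization pass might look like the following. The abbreviation and digit tables are small stand-ins for a real, context-aware rule set:

```python
# Illustrative rules only; a production front-end would use a much
# larger, context-sensitive rule set. These tables are assumptions.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "etc.": "et cetera"}
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def normalize(text):
    """Expand abbreviations and spell out digits one by one so that
    later letter-to-sound rules only see ordinary words."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif token.isdigit():
            words.extend(DIGITS[d] for d in token)
        else:
            words.append(token)
    return " ".join(words)

print(normalize("Dr. Smith lives at 42 Main St."))
# doctor smith lives at four two main street
```

A real normalizer would also handle context (e.g. reading "42" as "forty-two") and punctuation, which this sketch deliberately omits.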
Letter-To-Sound (LTS) rules:- LTS rules indicate how the written text is to be spoken.
Labelling:- Appropriate sound units, such as syllables, phones, diphones, and triphones, will be applied. Based on these units, the labelled speech corpus will be created. We will explore automatic segmentation algorithms for this task.
Acoustic Analysis:- In this step, from the speech corpus or a given speech signal, features related to the system (vocal tract) as well as the excitation source will be extracted. In particular, spectral features, pitch, fundamental frequency (F0), energy, and aperiodicity-related parameters will be extracted.
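A toy version of the F0 part of this step, using a simple autocorrelation peak picker as a stand-in for the WORLD extractor actually used:

```python
import numpy as np

def estimate_f0(frame, sr, fmin=80.0, fmax=400.0):
    """Toy fundamental-frequency estimate: find the autocorrelation
    peak within the plausible pitch-period range [sr/fmax, sr/fmin]."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

sr = 16000
t = np.arange(sr) / sr                  # one second of audio
signal = np.sin(2 * np.pi * 220.0 * t)  # 220 Hz test tone

print(round(estimate_f0(signal[:2048], sr), 1))  # close to the true 220 Hz
```

Real extractors like WORLD's DIO refine this idea with multi-band filtering and interpolation; the sketch only shows the underlying periodicity search.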
Mapping the system:- We can map the linguistic features to acoustic features using GANs. A GAN is a deep generative model that simultaneously trains two networks: a generator G that estimates the mapping function between the representative pairs, and a discriminator D that acts as a binary classifier. The D network accepts the real samples coming from the natural speech distribution $Y$ and the fake samples generated by G. The output of the discriminator, 1/(1 + exp(-D(Y))), represents the posterior probability that the input Y is natural data. The D network is adversarially trained to maximize the likelihood of samples coming from $Y$ being classified as real and to minimize the likelihood of samples coming from the model distribution $\hat{Y}$ (the output of the generator) being classified as real. In other words, the D network is trained to make the posterior probability 1 (i.e., natural speech) for natural data and 0 (i.e., emotion-synthesized speech) for generated data, while the generator is trained to deceive the D network. Both the D and G networks are trained using the Stochastic Gradient Descent (SGD) algorithm.

Further, the synthesized speech will be determined by the probabilistic ratio of the emotions, i.e., happy, neutral, sad, and angry; depending on the acoustic features, these emotions will be imparted to the speech. First, using natural data $Y$ and generated data $\hat{Y}$, we calculate the discriminator loss. The objective function can be formulated as:
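A standard formulation of this discriminator objective, consistent with the sigmoid posterior $1/(1+e^{-D(Y)})$ described above (written out here since the formula is not reproduced in the text), is:

```latex
\mathcal{L}_D(Y, \hat{Y}) =
  -\,\mathbb{E}_{y \sim Y}\big[\log \sigma(D(y))\big]
  -\,\mathbb{E}_{\hat{y} \sim \hat{Y}}\big[\log\big(1 - \sigma(D(\hat{y}))\big)\big],
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}.
```

Minimizing $\mathcal{L}_D$ pushes $\sigma(D(y))$ toward 1 for natural speech and $\sigma(D(\hat{y}))$ toward 0 for generated speech.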
After updating the discriminator, we calculate the adversarial loss of the generator, which rewards deceiving the discriminator. The model parameters of the generator G are updated using the stochastic gradient. This adversarial framework minimizes the approximated divergence between the distributions of natural speech and synthesized speech data.
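A standard form of this adversarial generator loss, under the same sigmoid discriminator output, is:

```latex
\mathcal{L}_G^{\mathrm{adv}}(\hat{Y}) =
  -\,\mathbb{E}_{\hat{y} \sim \hat{Y}}\big[\log \sigma(D(\hat{y}))\big].
```

Minimizing this loss moves the discriminator's posterior for generated samples toward 1, i.e., the generator learns to make $\hat{Y}$ indistinguishable from natural speech.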