Msdtron: a high-capability multi-speaker speech synthesis system for diverse data using characteristic information

Paper: arXiv

Authors: Qinghua Wu, Quanbo Shen, Jian Luan, Yujun Wang

Abstract: In multi-speaker speech synthesis, data from different speakers tends to be highly diverse, because speakers can differ widely in age, speaking style, emotion, and so on. Improving the modeling capability of multi-speaker speech synthesis is therefore important but challenging. To address this, the paper proposes a high-capability speech synthesis system called Msdtron, in which 1) a representation of the harmonic structure of speech, called the excitation spectrogram, is designed to directly guide the learning of harmonics in the mel-spectrogram, and 2) a conditional gated LSTM (CGLSTM) is proposed to control the flow of text content information through the network by re-weighting the LSTM gates using speaker information. Experiments show a significant reduction in the mel-spectrogram reconstruction error when training the multi-speaker model, and a large improvement in the subjective evaluation of the speaker-adapted model.
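
The abstract describes CGLSTM only at a high level, so the following is a minimal sketch of one possible reading of "re-weighting the gates of LSTM using speaker information", not the formulation used in the paper. It assumes a standard LSTM cell whose input, forget, and output gates are each multiplied element-wise by a scale in (0, 1) computed from a speaker embedding. The class name `ConditionalGatedLSTMCell`, the projection `V`, and all dimensions are hypothetical and chosen only for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class ConditionalGatedLSTMCell:
    """Hypothetical sketch: an LSTM cell whose gates are rescaled by speaker-dependent weights."""

    def __init__(self, input_dim, hidden_dim, speaker_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Standard LSTM parameters, stacked as [input, forget, output, candidate].
        self.W = rng.normal(0.0, 0.1, (4 * hidden_dim, input_dim + hidden_dim))
        self.b = np.zeros(4 * hidden_dim)
        # Assumption: a linear projection of the speaker embedding yields one
        # scale per hidden unit for each of the three gates.
        self.V = rng.normal(0.0, 0.1, (3 * hidden_dim, speaker_dim))
        self.d = np.zeros(3 * hidden_dim)

    def step(self, x, h, c, speaker):
        # Ordinary LSTM gate pre-activations from the text-side input and previous state.
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, o, g = np.split(z, 4)
        i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
        # Speaker-dependent scales in (0, 1) re-weight each gate (assumed form).
        s_i, s_f, s_o = np.split(sigmoid(self.V @ speaker + self.d), 3)
        i, f, o = i * s_i, f * s_f, o * s_o
        # Usual LSTM state update, now using the re-weighted gates.
        c_new = f * c + i * g
        h_new = o * np.tanh(c_new)
        return h_new, c_new


# Usage: run five frames of dummy text features for one speaker embedding.
cell = ConditionalGatedLSTMCell(input_dim=8, hidden_dim=16, speaker_dim=4)
h, c = np.zeros(16), np.zeros(16)
speaker = np.array([0.5, -0.2, 0.1, 0.9])
for frame in np.ones((5, 8)):
    h, c = cell.step(frame, h, c, speaker)
```

In this sketch the same text input produces different hidden states for different speaker embeddings, because the speaker scales decide how much of each gate's signal passes through. Whether the paper applies the re-weighting to all gates, or in this multiplicative form, is not stated in the abstract.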

Random Samples from Compared Systems

Synthesized Speech

Record	Baseline	System-1	System-2 (Proposed)
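s0: There is so much delicious food in Tibet, for example butter tsampa, dried cheese, Tibetan noodles, yogurt, and highland barley wine. (西藏好吃的可多了，比如，酥油糌粑，干酪，藏面，酸奶，青稞酒)

s1: The stars are in the sky and you are in my heart; the stars belong to the night sky, and you belong to me. (星星在天上，你在我心里，星星是夜空的，而你是我的)

s2: My family has a great name: technology workers, also known as programmers. (我的家人有个伟大的名字，叫科技工作者，又名程序员)

s3: Impressive! I need some time to think about this question, so let's talk about something else first. (厉害了，这个问题我要花时间想想，先说点其他的吧)

s4: Tap "Custom Scenes" below to customize personalized content. (点击下方的自定义场景，可以定制个性化的内容哦)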