Journal Article Details
EURASIP Journal on Audio, Speech, and Music Processing
W2VC: WavLM representation based one-shot voice conversion with gradient reversal distillation and CTC supervision
Empirical Research
Lin Wang1  Ying Hu1  Liang He2  Hao Huang3  Jichen Yang4 
[1] School of Computer Science and Technology, Xinjiang University, Urumqi, China; [2] Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing, China; [3] Xinjiang Key Laboratory of Multi-lingual Information Technology, Urumqi, China; [4] School of Cyber Security, Guangdong Polytechnic Normal University, Guangzhou, China
Keywords: Voice conversion; Self-supervised pre-trained representation; Gradient reversal layer (GRL); CTC
DOI: 10.1186/s13636-023-00312-8
Received: 2023-03-03; Accepted: 2023-10-12; Published: 2023
Source: Springer
【 Abstract 】

Non-parallel voice conversion (VC) has achieved considerable breakthroughs in recent years owing to the use of self-supervised pre-trained representations (SSPR). Features extracted by the pre-trained model are expected to contain more content information. However, common SSPR-based VC includes no explicit mechanism for removing speaker information when the content representation is extracted from the SSPR, so the content representation cannot be further purified of speaker information. Moreover, conventional VC usually reconstructs the Mel-spectrogram as the acoustic feature, which is inconsistent with the input of the content encoder and leads to information loss. Motivated by the above, we propose W2VC to address these issues. W2VC consists of three parts: (1) we reconstruct features from the WavLM representation (WLMR), which is more consistent with the input of the content encoder; (2) connectionist temporal classification (CTC) is used to align the content representation with the text context at the phoneme level, and a speaker classifier with a gradient reversal layer (GRL) attached to the content encoder is used to remove speaker information during content representation extraction; (3) a WLMR-based HiFi-GAN is trained to convert WLMR to waveform speech. VC experiments show that GRL purifies the content information of the self-supervised model well, and that GRL purification and CTC supervision of the content encoder are complementary in improving VC performance. Moreover, speech synthesized with the WLMR-retrained vocoder achieves better results in both subjective and objective evaluations. The proposed method is evaluated on the VCTK and CMU databases. It achieves an objective MCD of 8.901 and subjective MOS scores of 4.45 for speech naturalness and 3.62 for speaker similarity, outperforming the baseline.
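The gradient reversal and CTC components in part (2) can be illustrated with a minimal PyTorch sketch. This is a generic, hypothetical implementation of the two techniques (an adversarial speaker classifier behind a gradient reversal layer, and phoneme-level CTC supervision on the content encoder output), not the authors' code; the feature dimension, classifier width, number of speakers, and phoneme vocabulary size are placeholder assumptions.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; scales the gradient by -lambd in the backward pass.
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class GradientReversalLayer(nn.Module):
    def __init__(self, lambd=1.0):
        super().__init__()
        self.lambd = lambd

    def forward(self, x):
        return GradReverse.apply(x, self.lambd)

class SpeakerClassifier(nn.Module):
    # Adversarial speaker classifier on the content representation: minimizing its
    # cross-entropy loss pushes the content encoder (through the reversed gradient)
    # to discard speaker information. Sizes below are illustrative, not from the paper.
    def __init__(self, content_dim=256, n_speakers=100, lambd=1.0):
        super().__init__()
        self.grl = GradientReversalLayer(lambd)
        self.net = nn.Sequential(
            nn.Linear(content_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_speakers),
        )

    def forward(self, content):                  # content: (batch, time, content_dim)
        pooled = self.grl(content).mean(dim=1)   # temporal average pooling
        return self.net(pooled)                  # speaker logits: (batch, n_speakers)

# Phoneme-level CTC supervision on the content encoder output.
# Shapes and the 70-symbol phoneme inventory (index 0 = blank) are assumptions.
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
log_probs = torch.randn(200, 4, 70).log_softmax(-1)        # (time, batch, n_symbols)
targets = torch.randint(1, 70, (4, 50))                    # phoneme index sequences
loss_ctc = ctc(log_probs, targets,
               torch.full((4,), 200, dtype=torch.long),    # input lengths
               torch.full((4,), 50, dtype=torch.long))     # target lengths

In such a setup, the speaker classifier's cross-entropy loss and the CTC loss would be added to the reconstruction objective, so that the reversed gradient removes speaker cues while CTC keeps the representation aligned with the phonetic content.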

【 License 】

CC BY   
© The Author(s) 2023

【 Preview 】
Attachment list
Files Size Format View
RO202311105488910ZK.pdf 2539KB PDF download
Document metrics
Downloads: 5   Views: 2