Everyone-Can-Sing: Zero-Shot Singing Voice Synthesis and Conversion with Speech Reference
Shuqi Dai, Yunyun Wang, Roger B. Dannenberg, Zeyu Jin
More demos can be found at: https://www.shuqid.net/zero-shot-singing-synthesis
Demo 1: Overall Ability of Different Models
| Source | Unseen Target Reference | Zero-Shot SVC (b) Given Lyrics Alignment | GR0 |
| | Speech | | |
| | Speech | | |
| | Speech | | |
| | Singing | | |
| | Singing | | |
| | Speech | | |
| | Speech | | |
| | Speech | | |
| | Singing | | |
| | Speech | | |
| | Singing | | |
| | Speech | | |
| Source | Unseen Target Reference | Zero-Shot SVS (a) With Score and Lyrics | GR0 |
| | Singing | | |
| | Speech | | |
| | Singing | | |
| | Singing | | |
| | Singing | | |
| | Singing | | |
Demo 2: Ablation Study on Different Model Conditions
1. Unseen Speech Target Reference vs. Singing Target Reference
| Source | Unseen Speech Target | Zero-Shot SVS (a) | Unseen Singing Target | Zero-Shot SVS (a) |
| Source | Unseen Speech Target | Zero-Shot SVC (b) | Unseen Singing Target | Zero-Shot SVC (b) |
| Source | Unseen Speech Target | Zero-Shot SVC (c) | Unseen Singing Target | Zero-Shot SVC (c) |
2. Mixed Training Strategy vs. Singing Training Only
| Source | Unseen Target | Zero-Shot SVS (a) - Mixed Training | Zero-Shot SVS (a) - Singing Training Only |
| Source | Unseen Target | Zero-Shot SVC (b) - Mixed Training | Zero-Shot SVC (b) - Singing Training Only |
3. Adjust Pitch (applies pitch adjustment during inference) vs. Original Pitch
| Source | Unseen Target | Zero-Shot SVS (a) - Adjust Pitch | Zero-Shot SVS (a) - Original Pitch |
| Source | Unseen Target | Zero-Shot SVC (b) - Adjust Pitch | Zero-Shot SVC (b) - Original Pitch |
| Source | Unseen Target | Zero-Shot SVC (c) - Adjust Pitch | Zero-Shot SVC (c) - Original Pitch |
4. Cross Language (the target reference and the source/lyrics are in different languages) vs. Same Language
| Source | Cross Language Target | Zero-Shot SVS (a) | Same Language Target | Zero-Shot SVS (a) |
| Source | Cross Language Target | Zero-Shot SVC (b) | Same Language Target | Zero-Shot SVC (b) |
| Source | Cross Language Target | Zero-Shot SVC (c) | Same Language Target | Zero-Shot SVC (c) |
5. Different Gender (the target reference and the source, or the score pitch range, are of different genders) vs. Same Gender
| Source | Different Gender Target | Zero-Shot SVS (a) | Same Gender Target | Zero-Shot SVS (a) |
| Source | Different Gender Target | Zero-Shot SVC (b) | Same Gender Target | Zero-Shot SVC (b) |
| Source | Different Gender Target | Zero-Shot SVC (c) | Same Gender Target | Zero-Shot SVC (c) |
6. More Opera Style Samples