NAI-SVC Experiment Log

Reference

https://www.bilibili.com/video/BV1Hr4y197Cy

Environment

Linux (millennium-A24)

Python 3.10.13

Initial Setup

Usage

File Structure

./dataset_raw/

Stores the raw wav files, laid out as dataset_raw/<speaker>/*.wav

./filelists/

Stores lists of wav file names, with paths given relative to the project root

./raw/

Source audio for inference

./results/

Generated inference results

Pretrain Model Downloads

Dataset Preparation

Requirements for Dataset

Minimum: 100 audio clips of 5 to 15 seconds each

Normal: 1.5 hours of audio

Sampling rate: 48000 Hz
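
The requirements above can be checked mechanically before training. The following is a minimal sketch, not part of the NAI-SVC tooling: the function names `clip_info` and `check_dataset` are hypothetical, and it assumes the clips are uncompressed PCM wav files, since Python's standard `wave` module cannot read other encodings.

```python
import wave
from pathlib import Path

MIN_CLIPS = 100       # minimum number of clips
MIN_SECONDS = 5.0     # lower bound of the 5 to 15 second range
MAX_SECONDS = 15.0    # upper bound of the 5 to 15 second range
TARGET_RATE = 48000   # required sampling rate in Hz


def clip_info(path):
    """Return (duration_seconds, sample_rate) for a PCM wav file."""
    with wave.open(str(path), "rb") as wf:
        rate = wf.getframerate()
        return wf.getnframes() / rate, rate


def check_dataset(root):
    """Compare every wav file under root with the dataset requirements."""
    problems = []
    total_seconds = 0.0
    count = 0
    for path in sorted(Path(root).rglob("*.wav")):
        duration, rate = clip_info(path)
        count += 1
        total_seconds += duration
        if rate != TARGET_RATE:
            problems.append(f"{path}: sample rate {rate}, expected {TARGET_RATE}")
        if not MIN_SECONDS <= duration <= MAX_SECONDS:
            problems.append(f"{path}: duration {duration:.1f}s is outside 5-15s")
    if count < MIN_CLIPS:
        problems.append(f"only {count} clips, expected at least {MIN_CLIPS}")
    return {"clips": count, "hours": total_seconds / 3600, "problems": problems}
```

Calling `check_dataset("dataset_raw")` returns the clip count, the total duration in hours (to compare against the 1.5 hour figure), and a list of human-readable problems.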

BGM Removal

Noise Removal

Slice

Filter

Remove audio clips that are shorter than 4 seconds, and cut the clips that are longer than 10 seconds.
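
The filter rule can be sketched in a few lines. This sketch assumes that "cut" means keeping only the first 10 seconds of a long clip; splitting long clips into several pieces is another reasonable reading and would need a different implementation. The function name `filter_clip` is hypothetical, and only uncompressed PCM wav files are handled.

```python
import wave

MIN_SECONDS = 4.0    # clips shorter than this are dropped
MAX_SECONDS = 10.0   # clips longer than this are trimmed (assumed reading of "cut")


def filter_clip(src, dst):
    """Apply the duration filter to src and write the result to dst.

    Returns "dropped" (nothing written), "kept", or "trimmed".
    """
    with wave.open(str(src), "rb") as reader:
        params = reader.getparams()
        rate = reader.getframerate()
        n_frames = reader.getnframes()
        if n_frames / rate < MIN_SECONDS:
            return "dropped"
        # Keep at most MAX_SECONDS worth of frames from the start of the clip.
        keep = min(n_frames, int(MAX_SECONDS * rate))
        frames = reader.readframes(keep)
    with wave.open(str(dst), "wb") as writer:
        writer.setparams(params)
        # The frame count in the header is corrected when the file is closed.
        writer.writeframes(frames)
    return "trimmed" if keep < n_frames else "kept"
```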

Normalize

In Adobe Audition, select "Window" -> "Match Loudness" to open the loudness-matching panel.

Use the following settings:

Match To: ITU-R BS.1770-3 Loudness

Target Loudness: -11 LUFS

Tolerance: 0.5 LU

Max True Peak Level: -1 dBTP

Finally, click "Run" to run loudness matching.
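
After loudness matching, a quick sanity check on the exported files can catch clips that still exceed the peak limit. The sketch below measures the sample peak, which is not the same as the true peak (dBTP) that Audition reports: true peak is estimated from an oversampled signal and is never lower than the sample peak. A sample peak above -1 dBFS therefore means the limit was exceeded, while a value below it is only a necessary condition. The function name `sample_peak_dbfs` is hypothetical, and the sketch assumes 16-bit PCM wav files.

```python
import array
import math
import sys
import wave


def sample_peak_dbfs(path):
    """Return the sample peak of a 16-bit PCM wav file in dBFS."""
    with wave.open(str(path), "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        samples = array.array("h")
        samples.frombytes(wf.readframes(wf.getnframes()))
    if sys.byteorder == "big":
        # wav data is little-endian; swap on big-endian machines
        samples.byteswap()
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return float("-inf")  # digital silence has no finite peak level
    # Full scale for 16-bit audio is 32768.
    return 20 * math.log10(peak / 32768)
```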

Put wav files in dataset_raw/<speaker>/*.wav

Preprocess

Training

Shallow Diffusion Model

Main Model

Model Fusion

Loss

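loss/g/f0, loss/g/mel, and loss/g/total should decrease with oscillation and eventually converge to some value.

loss/g/kl should oscillate at a low level.

loss/g/fm should rise steadily during the middle of training, then slow its rise or even start to decline in the later stages.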
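
Because these curves oscillate, their trend is easier to judge after smoothing. The sketch below applies an exponential moving average, similar in spirit to the smoothing slider in TensorBoard, and then classifies the last part of the curve. The function names `ema` and `trend` and the default values for `weight`, `tail`, and `tol` are illustrative assumptions, not values taken from the training code.

```python
def ema(values, weight=0.9):
    """Exponentially smooth a sequence; a higher weight smooths more."""
    smoothed = []
    last = None
    for value in values:
        last = value if last is None else weight * last + (1 - weight) * value
        smoothed.append(last)
    return smoothed


def trend(values, weight=0.9, tail=0.25, tol=0.01):
    """Classify the final `tail` fraction of the smoothed curve.

    Returns "falling", "rising", or "flat", based on the relative change
    between the start and the end of that tail.
    """
    if len(values) < 2:
        raise ValueError("need at least two values")
    smoothed = ema(values, weight)
    n = max(2, int(len(smoothed) * tail))
    start, end = smoothed[-n], smoothed[-1]
    change = (end - start) / max(abs(start), 1e-12)
    if change < -tol:
        return "falling"
    if change > tol:
        return "rising"
    return "flat"
```

With logged scalar values exported from training, `trend(mel_losses)` returning "flat" late in training would be consistent with loss/g/mel having converged, while "rising" for loss/g/fm in mid-training matches the expectation above.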

Model Architecture
