AP-BWE

Speech bandwidth extension (BWE) increases the bandwidth of speech signals, making the speech sound brighter and fuller. This paper proposes a generative adversarial network (GAN) based BWE model with parallel prediction of Amplitude and Phase spectra, named AP-BWE, which achieves both efficient and high-quality wideband waveform generation. Notably, to our knowledge, AP-BWE is the first to achieve direct extension of the high-frequency phase spectrum, which is beneficial for improving the effectiveness of existing BWE methods. The proposed AP-BWE generator is entirely based on convolutional neural networks (CNNs). It features a dual-stream architecture with mutual interaction, in which the amplitude stream and the phase stream communicate with each other and extend the high-frequency components from the narrowband amplitude and phase spectra, respectively. To improve the naturalness of the extended speech signals, we employ a multi-period discriminator at the waveform level and design a pair of multi-resolution amplitude and phase discriminators at the spectral level. Experimental results demonstrate that the proposed AP-BWE achieves state-of-the-art speech quality for BWE tasks targeting sampling rates of both 16 kHz and 48 kHz. In terms of generation efficiency, owing to its all-convolutional architecture and all-frame-level operations, AP-BWE can generate 48 kHz waveform samples 292.3 times faster than real-time on a single RTX 4090 GPU and 18.1 times faster than real-time on a CPU.
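To make the amplitude and phase representation concrete, the following minimal sketch splits one short signal frame into an amplitude spectrum and a phase spectrum and then rebuilds the frame from the two. It is not the AP-BWE implementation: the plain DFT, the 16-sample toy frame, and all function names (dft, idft, split_amplitude_phase, merge_amplitude_phase) are assumptions chosen for illustration. It only shows that a waveform is fully determined by the two spectra together, which is why a model that predicts both can produce a waveform directly.

```python
import cmath
import math


def dft(frame):
    """Discrete Fourier transform of a real-valued frame (naive O(n^2) version)."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT; returns the real part, which is exact for a conjugate-symmetric spectrum."""
    n = len(spectrum)
    return [
        (sum(c * cmath.exp(2j * math.pi * k * t / n) for k, c in enumerate(spectrum)) / n).real
        for t in range(n)
    ]


def split_amplitude_phase(spectrum):
    """Separate a complex spectrum into amplitude and phase (radians) per frequency bin."""
    amplitude = [abs(c) for c in spectrum]
    phase = [cmath.phase(c) for c in spectrum]
    return amplitude, phase


def merge_amplitude_phase(amplitude, phase):
    """Recombine amplitude and phase into a complex spectrum."""
    return [cmath.rect(a, p) for a, p in zip(amplitude, phase)]


# Toy frame (hypothetical): two sinusoids at bins 3 and 5 of a 16-point frame.
frame = [
    math.sin(2 * math.pi * 3 * t / 16) + 0.5 * math.cos(2 * math.pi * 5 * t / 16)
    for t in range(16)
]

amplitude, phase = split_amplitude_phase(dft(frame))
rebuilt = idft(merge_amplitude_phase(amplitude, phase))

# Largest sample-wise difference between the original and rebuilt frame.
max_error = max(abs(a - b) for a, b in zip(frame, rebuilt))
```

In this toy setting, discarding or zeroing the phase list before merging would no longer reproduce the frame, which illustrates why extending only the amplitude spectrum leaves the high-frequency phase to be recovered some other way.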

Input sr	Wideband	Sinc	TFiLM	AFiLM	NVSR	AP-BWE (Ours)

2 kHz

4 kHz

8 kHz

Input sr	Wideband	Sinc	NU-Wave 2	UDM+	mdctGAN	AP-BWE (Ours)

8 kHz

12 kHz

16 kHz

24 kHz

Wideband	AP-BWE (Ours)	w/o MPD	w/o MRAD	w/o MRPD	MPD Only


MRAD Only	MRPD Only	w/o Disc.	w/o A to P	w/o P to A	w/o Connections

Wideband	NU-Wave 2	UDM+	mdctGAN	AP-BWE (Ours)

Wideband	NU-Wave 2	UDM+	mdctGAN	AP-BWE (Ours)

Towards Efficient and High-Quality Bandwidth Extension with Parallel Amplitude-Phase Prediction

Abstract

I. Audio Samples with Target Sampling Rate of 16 kHz

II. Audio Samples with Target Sampling Rate of 48 kHz

III. Ablation Study (8 kHz to 48 kHz)

IV. Cross-Dataset Evaluation

Libri-TTS (8 kHz to 24 kHz)

HiFi-TTS (8 kHz to 44.1 kHz)