Towards Efficient and High-Quality Bandwidth Extension with Parallel Amplitude-Phase Prediction

National Engineering Research Center of Speech and Language Information Processing
University of Science and Technology of China

Abstract

Speech bandwidth extension (BWE) refers to widening the frequency bandwidth of speech signals, making the speech sound brighter and fuller. This paper proposes a generative adversarial network (GAN) based BWE model with parallel prediction of Amplitude and Phase spectra, named AP-BWE, which achieves both efficient and high-quality wideband waveform generation. Notably, to our knowledge, AP-BWE is the first to achieve the direct extension of the high-frequency phase spectrum, which is beneficial for improving the effectiveness of existing BWE methods. The proposed AP-BWE generator is entirely based on convolutional neural networks (CNNs). It features a dual-stream architecture with mutual interaction, where the amplitude stream and the phase stream communicate with each other and respectively extend the high-frequency components from the narrowband amplitude and phase spectra. To improve the naturalness of the extended speech signals, we employ a multi-period discriminator at the waveform level and design a pair of multi-resolution amplitude and phase discriminators at the spectral level. Experimental results demonstrate that the proposed AP-BWE achieves state-of-the-art speech quality for BWE tasks targeting sampling rates of both 16 kHz and 48 kHz. In terms of generation efficiency, owing to the all-convolutional architecture and all-frame-level operations, the proposed AP-BWE can generate 48 kHz waveform samples 292.3 times faster than real-time on a single RTX 4090 GPU and 18.1 times faster than real-time on a CPU.
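To make the amplitude-phase idea concrete, the following is a minimal sketch of why a pair of amplitude and phase spectra is enough to recover a waveform frame. It is not the AP-BWE generator: it uses a naive DFT on a single synthetic 16-sample frame, and every function name here (`dft`, `idft`, `split_amplitude_phase`, `combine_amplitude_phase`) is a hypothetical helper introduced only for illustration.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of a real-valued frame."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT; keeps the real part of each reconstructed sample."""
    n = len(spectrum)
    return [
        (sum(c * cmath.exp(2j * math.pi * k * t / n)
             for k, c in enumerate(spectrum)) / n).real
        for t in range(n)
    ]


def split_amplitude_phase(spectrum):
    """Split complex bins into an amplitude list and a phase list (radians)."""
    amplitude = [abs(c) for c in spectrum]
    phase = [cmath.phase(c) for c in spectrum]
    return amplitude, phase


def combine_amplitude_phase(amplitude, phase):
    """Rebuild complex bins from amplitude and phase: A * exp(j * phi)."""
    return [cmath.rect(a, p) for a, p in zip(amplitude, phase)]


# Synthetic frame (an assumption for illustration, not data from the paper).
frame = [
    math.sin(2 * math.pi * 3 * t / 16) + 0.5 * math.cos(2 * math.pi * 5 * t / 16)
    for t in range(16)
]

amplitude, phase = split_amplitude_phase(dft(frame))
reconstructed = idft(combine_amplitude_phase(amplitude, phase))

# Largest sample-wise deviation between the original and rebuilt frame.
max_error = max(abs(a - b) for a, b in zip(frame, reconstructed))
```

In a BWE setting, the narrowband input would carry energy only in the low-frequency bins, so a model has to supply plausible values for both lists in the upper bins. Predicting the phase list directly, rather than recovering it afterwards, is the part the abstract highlights.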


I. Audio Samples with Target Sampling Rate of 16 kHz


Input sr Wideband Sinc TFiLM AFiLM NVSR AP-BWE (Ours)
2 kHz
4 kHz
8 kHz

II. Audio Samples with Target Sampling Rate of 48 kHz


Input sr Wideband Sinc NU-Wave 2 UDM+ mdctGAN AP-BWE (Ours)
8 kHz
12 kHz
16 kHz
24 kHz

III. Ablation Study (8 kHz to 48 kHz)


Wideband AP-BWE (Ours) w/o MPD w/o MRAD w/o MRPD MPD Only
MRAD Only MRPD Only w/o Disc. w/o A to P w/o P to A w/o Connections

IV. Cross-Dataset Evaluation


Libri-TTS (8 kHz to 24 kHz)


Wideband NU-Wave 2 UDM+ mdctGAN AP-BWE (Ours)

HiFi-TTS (8 kHz to 44.1 kHz)


Wideband NU-Wave 2 UDM+ mdctGAN AP-BWE (Ours)