Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement

National Engineering Research Center of Speech and Language Information Processing
University of Science and Technology of China

Abstract

Phase information has a significant impact on speech perceptual quality and intelligibility. However, existing speech enhancement methods encounter limitations in explicit phase estimation due to the non-structural nature and wrapping characteristics of the phase, leading to a bottleneck in enhanced speech quality. To overcome this issue, we propose MP-SENet, a novel Speech Enhancement Network that explicitly enhances Magnitude and Phase spectra in parallel. The proposed MP-SENet comprises a Transformer-embedded encoder-decoder architecture. The encoder encodes the input distorted magnitude and phase spectra into time-frequency representations, which are then fed into time-frequency Transformers that alternately capture time and frequency dependencies. The decoder comprises a magnitude mask decoder and a phase decoder, which directly enhance the magnitude and wrapped phase spectra by incorporating a magnitude masking architecture and a phase parallel estimation architecture, respectively. Multi-level loss functions explicitly defined on the magnitude spectra, wrapped phase spectra, and short-time complex spectra are adopted to jointly train the MP-SENet model. A metric discriminator is further employed to compensate for the incomplete correlation between these losses and human auditory perception. Experimental results demonstrate that the proposed MP-SENet achieves state-of-the-art performance across multiple speech enhancement tasks, including speech denoising, dereverberation, and bandwidth extension. Compared to existing phase-aware speech enhancement methods, it further mitigates the compensation effect between the magnitude and phase through explicit phase estimation, elevating the perceptual quality of enhanced speech. Remarkably, for the speech denoising task, the proposed MP-SENet yields a PESQ of 3.60 on the VoiceBank+DEMAND dataset and 3.62 on the DNS challenge dataset.
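To make the terms in the abstract concrete, the following is a minimal sketch of what "magnitude", "wrapped phase", "magnitude masking", and "parallel phase estimation" can look like on a handful of complex spectrum bins. It is not the MP-SENet implementation. The abstract does not spell out these operations, so every function name here is hypothetical, and the specific forms (a multiplicative mask, a two-output atan2 phase estimate, and a principal-value phase distance) are assumptions chosen for illustration.

```python
import cmath
import math


def magnitude_and_phase(spectrum):
    """Split complex bins into magnitudes and wrapped phases in [-pi, pi]."""
    mags = [abs(c) for c in spectrum]
    phases = [cmath.phase(c) for c in spectrum]
    return mags, phases


def apply_magnitude_mask(noisy_mags, mask):
    """Assumed masking form: scale each noisy magnitude by a predicted gain."""
    return [m * g for m, g in zip(noisy_mags, mask)]


def wrapped_phase_from_parallel(pseudo_real, pseudo_imag):
    """Assumed parallel estimation: two outputs combined by atan2.

    atan2 always returns a value in [-pi, pi], so the estimate is a
    valid wrapped phase regardless of the scale of the two inputs.
    """
    return [math.atan2(i, r) for r, i in zip(pseudo_real, pseudo_imag)]


def anti_wrap(diff):
    """Distance between two phases measured on the circle, in [0, pi]."""
    return abs(diff - 2.0 * math.pi * round(diff / (2.0 * math.pi)))


def to_complex(mags, phases):
    """Recombine magnitude and phase into complex bins."""
    return [cmath.rect(m, p) for m, p in zip(mags, phases)]


# Toy spectrum: three bins standing in for one STFT frame.
bins = [1 + 1j, -2 + 0j, 0 - 3j]
mags, phases = magnitude_and_phase(bins)
rebuilt = to_complex(mags, phases)

# Phases 3.0 and -3.0 are 6.0 rad apart numerically but close on the circle.
wrapped_distance = anti_wrap(3.0 - (-3.0))
```

The last line illustrates why wrapping matters for a phase loss: a plain difference of 6.0 rad overstates the error, while the circular distance is 2π − 6, roughly 0.28 rad.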

BibTeX

@inproceedings{lu2023mp,
  title={{MP-SENet}: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra},
  author={Lu, Ye-Xin and Ai, Yang and Ling, Zhen-Hua},
  booktitle={Proc. Interspeech},
  pages={3834--3838},
  year={2023}
}

@article{lu2023explicit,
  title={Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement},
  author={Lu, Ye-Xin and Ai, Yang and Ling, Zhen-Hua},
  journal={arXiv preprint arXiv:2308.08926},
  year={2023}
}

[Audio samples, speech denoising, VoiceBank+DEMAND dataset. Columns: Scene, Noisy, Clean, DB-AIAT, CMGAN, MP-SENet (Ours). Rows: Sample 1, Sample 2, Sample 3.]

[Audio samples, speech denoising, DNS Challenge dataset. Columns: Noisy, Clean, FRCRN, MFNet, MP-SENet (Ours). Rows: Sample 1, Sample 2, Sample 3.]

[Audio samples, speech dereverberation. Columns: Reverberant, Clean, UNet, CMGAN, MP-SENet (Ours). Rows: Sample 1, Sample 2, Sample 3.]

[Audio samples, speech bandwidth extension (8 kHz to 16 kHz and 4 kHz to 16 kHz). Two tables, each with columns: Narrowband, Wideband, NVSR, CMGAN, MP-SENet (Ours). Rows in each: Sample 1, Sample 2.]

Abstract

I. Audio Samples of Speech Denoising

VoiceBank+DEMAND Dataset

DNS Challenge Dataset

II. Audio Samples of Speech Dereverberation

III. Audio Samples of Speech Bandwidth Extension

8 kHz to 16 kHz

4 kHz to 16 kHz

IV. SNR-wise Evaluation on the VoiceBank+DEMAND Dataset

V. Ablation Study on the VoiceBank+DEMAND Dataset

BibTeX

[Audio samples, SNR-wise evaluation on the VoiceBank+DEMAND dataset. Columns: SNR, Noisy, Clean, DB-AIAT, CMGAN, MP-SENet (Ours). Rows: -5 dB, 0 dB, 5 dB, 10 dB, 15 dB.]

[Audio samples, ablation study on the VoiceBank+DEMAND dataset. Columns: Noisy, Clean, MP-SENet, w/ Conformer, Magnitude Only, Complex Only, w/o Phase Loss, w/o Complex Loss, w/o Consistency Loss, w/o Metric Discriminator. Rows: Sample 1, Sample 2.]