RESEARCH
DeFTAN-II: Efficient Multichannel Speech Enhancement with Subgroup Processing
With the recent rise of live streaming services such as YouTube, outdoor recording has become common in public spaces. In particular, outdoor multichannel recording is widely used to provide immersive audio to an audience beyond the screen. However, outdoor recording inevitably captures ambient noise, which must be suppressed for the desired speech to be clearly heard.
We present DeFTAN-II, an efficient multichannel speech enhancement model based on the transformer architecture and subgroup processing. Despite their success in speech enhancement, transformers struggle to capture local relations and incur high computational complexity and memory usage. To address these limitations, we introduce subgroup processing, which combines subgroups of locally emphasized features with other subgroups that retain the original features.
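As a rough illustration of the subgroup idea only, the sketch below splits a set of feature channels into two subgroups, applies a local operator to one, and concatenates the result with the untouched remainder. It is not the DeFTAN-II implementation: the function names `local_emphasis` and `subgroup_process`, the moving-average operator, and the split point `n_local` are all assumptions made for this example.

```python
from typing import List


def local_emphasis(channel: List[float], kernel: int = 3) -> List[float]:
    """Hypothetical local operator: moving average over neighbouring frames.

    Stands in for whatever locally emphasizing layer a real model would use.
    Windows are truncated at the sequence edges.
    """
    half = kernel // 2
    n = len(channel)
    out = []
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        window = channel[lo:hi]
        out.append(sum(window) / len(window))
    return out


def subgroup_process(features: List[List[float]], n_local: int) -> List[List[float]]:
    """Split channels into two subgroups and recombine them.

    The first `n_local` channels are locally emphasized; the remaining
    channels are passed through unchanged (assumed split, for illustration).
    """
    if not 0 <= n_local <= len(features):
        raise ValueError("n_local must be between 0 and the number of channels")
    local = [local_emphasis(ch) for ch in features[:n_local]]
    original = [list(ch) for ch in features[n_local:]]
    # Concatenate along the channel axis: emphasized subgroup, then original subgroup.
    return local + original


# Two channels of four time frames each; only the first is locally emphasized.
features = [[0.0, 3.0, 0.0, 3.0], [1.0, 2.0, 3.0, 4.0]]
mixed = subgroup_process(features, n_local=1)
print(mixed)
```

In this sketch only the first subgroup pays the cost of the local operator, while the second carries the original features forward unchanged, so the output keeps the same number of channels as the input.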
Through extensive comparisons with state-of-the-art multichannel speech enhancement models, we demonstrate that DeFTAN-II with subgroup processing outperforms existing methods at significantly lower computational complexity. Moreover, we evaluate the model’s generalization capability on real-world data without fine-tuning, which further demonstrates its effectiveness in practical scenarios.
This work was supported by the BK21 Four program through the National Research Foundation (NRF) funded by the Ministry of Education of Korea, the National Research Council of Science and Technology (NST) granted by the Korean government (MSIT) (No. CRC21011), and the Center for Applied Research in Artificial Intelligence (CARAI) funded by DAPA and ADD (UD230017TD).