Dataset

Public datasets and resources released by Smart Sound Lab.

Code & Resources

Official Smart Sound Lab GitHub

Explore dataset release code, baseline implementations, usage guides, and reproducible experiments maintained by our lab team.

SSEU

Large Audio Language Models (LALMs) have recently progressed rapidly, demonstrating strong performance in universal audio understanding through cross-modal integration. Various benchmarks have been proposed to evaluate their audio understanding ability. However, key aspects of real-world interaction remain underexplored: audio signals typically contain both speech and non-speech components, and the energy levels of these components can vary significantly across scenarios. Moreover, most benchmarks do not consider the joint understanding of speech, scene, and events within the same audio clip. In this work, we introduce SSEU-Bench, the first versatile audio understanding benchmark that explicitly accounts for energy differences between speech and non-speech audio, and that provides both independent and joint understanding settings for speech, scene, and events. We further show that some LALMs underperform on certain tasks in the joint understanding setting. To address this, we introduce Chain-of-Thought prompting, which improves LALMs' joint audio understanding by decomposing complex tasks into simpler reasoning steps.
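To make the idea concrete, below is a minimal sketch of what a Chain-of-Thought style prompt for the joint setting might look like. The prompt wording, step order, and message schema are illustrative assumptions for a generic LALM chat interface, not SSEU-Bench's actual prompts or API.

```python
# Illustrative only: a Chain-of-Thought style prompt that decomposes the joint
# speech/scene/event task into sequential reasoning steps. Wording, step order,
# and the message schema are assumptions, not SSEU-Bench's actual prompts.

JOINT_COT_PROMPT = """You are given a single audio clip. Answer step by step.
Step 1 (Scene): Describe the acoustic scene (e.g., street, office, park).
Step 2 (Events): List the non-speech sound events you hear.
Step 3 (Speech): Transcribe any speech, even if it is quieter than the events.
Step 4 (Answer): Combine Steps 1-3 into a final structured answer:
  {"scene": ..., "events": [...], "transcript": ...}
"""

def build_messages(audio_path: str) -> list[dict]:
    """Package the prompt and audio clip for a hypothetical LALM chat API."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "audio", "path": audio_path},
                {"type": "text", "text": JOINT_COT_PROMPT},
            ],
        }
    ]
```

The point of the decomposition is that the model commits to scene and event descriptions before attempting the transcript, rather than answering all three tasks in one shot.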

Auditory Scene Analysis 2 (ASA2) Dataset

We constructed a new dataset for multichannel USS and polyphonic audio classification tasks. The dataset is designed to reflect diverse conditions, including moving sources with temporal onsets and offsets. Foreground sound sources were drawn from 13 audio classes in open-source databases (Pixabay¹, FSD50K, Librispeech, MUSDB18, Vocalsound), resampled to 16 kHz, and zero-padded or cropped to 4 seconds. Each source has a 75% probability of being a moving source, with speeds between 0 and 3 m/s.

Each mixture contains between 2 and 5 foreground sources plus one background noise from the diffuse TAU-SNoise dataset², at a signal-to-noise ratio (SNR) between 6 and 30 dB. The simulations were conducted using gpuRIR. Room width and length were set between 5 and 8 meters, height between 3 and 4 meters, and reverberation time between 0.2 and 0.6 seconds, with all parameters sampled from uniform distributions. Spatialized sound sources were simulated for a 4-channel tetrahedral microphone array with a radius of 4.2 cm.

The dataset generation procedure is illustrated in the figure below, and details about the class configuration and audio clip durations are provided in the table below. The dataset poses a significant challenge for separation tasks due to the inclusion of moving sources, onset and offset conditions, overlapping in-class sources, and noisy reverberant environments.
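For concreteness, the parameter sampling described above could look roughly like the following sketch. The variable names, helper structure, and the assumption that every range is drawn uniformly and independently are our own; the actual room impulse response simulation and mixing (only noted in a comment here) are performed with gpuRIR.

```python
import numpy as np

rng = np.random.default_rng()
FS = 16000        # all sources are resampled to 16 kHz
CLIP_SEC = 4.0    # sources are zero-padded or cropped to 4 s

def sample_mixture_config() -> dict:
    """Sample one mixture configuration following the ranges described above.
    All ranges are treated as uniform distributions, as stated in the text."""
    room = {
        "width_m":  rng.uniform(5.0, 8.0),
        "length_m": rng.uniform(5.0, 8.0),
        "height_m": rng.uniform(3.0, 4.0),
        "t60_s":    rng.uniform(0.2, 0.6),   # reverberation time
    }
    n_sources = int(rng.integers(2, 6))      # 2 to 5 foreground sources
    sources = []
    for _ in range(n_sources):
        moving = rng.random() < 0.75         # 75% chance of a moving source
        sources.append({
            "moving": moving,
            "speed_mps": float(rng.uniform(0.0, 3.0)) if moving else 0.0,
        })
    snr_db = float(rng.uniform(6.0, 30.0))   # SNR against the diffuse background noise
    return {"room": room, "sources": sources, "snr_db": snr_db}

# The sampled configuration would then be passed to gpuRIR to simulate RIRs for
# the 4-channel tetrahedral array (radius 4.2 cm) and to convolve and mix the sources.
if __name__ == "__main__":
    print(sample_mixture_config())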