Dataset

Public datasets and resources released by Smart Sound Lab.

Code & Resources

Official Smart Sound Lab GitHub

Explore dataset release code, baseline implementations, usage guides, and reproducible experiments maintained by our lab team.

Drone sound dataset

Wonjun Yi; Jung-Woo Choi; Jae-Woo Lee
arXiv: https://arxiv.org/abs/2304.11708
Accepted at the 29th International Congress on Sound and Vibration (ICSV29).

To improve the safety of drone operations, mechanical faults of drones should be detected in real time. The drone sound dataset was constructed by recording the operating sounds of three different drones, using microphones mounted on the drones, in an anechoic chamber. The dataset covers various operating conditions, including flight directions (front, back, right, left, clockwise, counterclockwise) and faults in propellers and motors. The drone sounds were then mixed with noise recorded at five different spots on the university campus, with a signal-to-noise ratio (SNR) varying from 10 dB to 15 dB.
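The noise-mixing step described above can be sketched as follows. This is a minimal illustration and not the authors' released code: the function names `mix_at_snr` and `measured_snr_db` are hypothetical, the SNR is assumed to be the ratio of mean signal power to mean noise power over the whole clip, and noise recordings shorter than the drone clip are assumed to be looped.

```python
import numpy as np


def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` to reach `snr_db` relative to `signal`, then add it.

    Returns (mixture, scaled_noise). Assumption: SNR is defined on mean
    power over the full clip; the original mixing procedure may differ.
    """
    signal = np.asarray(signal, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Loop the noise if it is shorter than the signal, then trim to length.
    if len(noise) < len(signal):
        reps = int(np.ceil(len(signal) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(signal)]

    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    if signal_power == 0.0 or noise_power == 0.0:
        raise ValueError("signal and noise must both have non-zero power")

    # Noise power required for the requested SNR: P_s / 10^(SNR/10).
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    scaled_noise = noise * np.sqrt(target_noise_power / noise_power)
    return signal + scaled_noise, scaled_noise


def measured_snr_db(signal, scaled_noise):
    """Return the SNR in dB between a clean signal and the added noise."""
    signal = np.asarray(signal, dtype=np.float64)
    scaled_noise = np.asarray(scaled_noise, dtype=np.float64)
    return 10.0 * np.log10(np.mean(signal ** 2) / np.mean(scaled_noise ** 2))
```

Drawing `snr_db` uniformly between 10 and 15 for each clip would match the range quoted above, although the distribution actually used for the dataset is not stated here.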

6DoF SRIR dataset

The 6DoF RIR dataset (also known as 6DRIRset) contains room impulse responses (RIRs) measured with nine spherical microphone arrays (SMAs; Zylia ZM-1S) distributed in a semi-cuboid room. 6DRIRset is distinguished by its large number of loudspeaker positions (392 locations), which were included to support the 6DoF source localization task.

6 DOF real SRIR dataset

The 6DoF RIR dataset (also known as 6DRIRset) contains room impulse responses (RIRs) measured with nine spherical microphone arrays (SMAs; Zylia ZM-1S) distributed in a semi-cuboid room. 6DRIRset is distinguished by its large number of loudspeaker positions (392 locations), which were included to support the 6DoF source localization task.

SSEU

Large Audio Language Models (LALMs) have progressed rapidly in recent years, demonstrating strong efficacy in universal audio understanding through cross-modal integration. To evaluate the audio understanding performance of LALMs, researchers have proposed a variety of benchmarks. However, existing benchmarks underexplore two key aspects of real-world interaction: audio signals typically contain both speech and non-speech components, and the energy levels of these components can vary significantly across scenarios. Moreover, most benchmarks do not consider the joint understanding of speech, scene, and events within the same audio clip. In this work, we introduce SSEU-Bench, the first versatile audio understanding benchmark that explicitly accounts for energy differences between speech and non-speech audio, with both independent and joint understanding settings for speech, scene, and events. Furthermore, we demonstrate that some LALMs tend to underperform on certain tasks in the joint understanding setting. To address this issue, we introduce Chain-of-Thought reasoning, which effectively improves the joint audio understanding performance of LALMs by decomposing complex tasks into simpler reasoning steps.

Auditory Scene Analysis 2 (ASA2) Dataset

We constructed a new dataset for multichannel universal sound separation (USS) and polyphonic audio classification tasks. The dataset is designed to reflect various conditions, including moving sources with temporal onsets and offsets. For foreground sound sources, signals from 13 audio classes were selected from open-source databases (Pixabay, FSD50K, Librispeech, MUSDB18, Vocalsound). These signals were resampled to 16 kHz and pre-processed by either zero-padding or cropping to 4 seconds. Each sound source has a 75% probability of being a moving source, with speeds ranging from 0 to 3 m/s. Each mixture contains between 2 and 5 foreground sound sources, along with one background noise from the diffused TAU-SNoise dataset at a signal-to-noise ratio (SNR) ranging from 6 to 30 dB. The simulations were conducted using gpuRIR. Room width and length were set between 5 and 8 meters and room height between 3 and 4 meters, with reverberation times ranging from 0.2 to 0.6 seconds. These parameters were sampled from uniform distributions. We simulated spatialized sound sources using a 4-channel tetrahedral microphone array with a radius of 4.2 cm. The dataset generation procedure is illustrated in the figure below, and details about the class configuration and the durations of audio clips are provided in the table below. This dataset poses a significant challenge for separation tasks due to the inclusion of moving sources, onset and offset conditions, overlapping in-class sources, and noisy reverberant environments.
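The sampling ranges quoted above can be gathered into a small scene-configuration sampler. The sketch below is illustrative only and is not the dataset's generation code: the names `SourceConfig`, `SceneConfig`, `pad_or_crop`, and `sample_scene` are hypothetical, the number of sources and the SNR are assumed to be drawn uniformly, cropping is assumed to keep the start of the clip, and the gpuRIR simulation itself is not invoked.

```python
import random
from dataclasses import dataclass

SAMPLE_RATE = 16_000                        # Hz, as stated in the description
CLIP_SECONDS = 4
CLIP_SAMPLES = SAMPLE_RATE * CLIP_SECONDS   # 64,000 samples per clip


@dataclass
class SourceConfig:
    is_moving: bool
    speed_mps: float        # 0.0 for static sources


@dataclass
class SceneConfig:
    room_dims_m: tuple      # (width, length, height)
    rt60_s: float
    snr_db: float
    sources: list


def pad_or_crop(samples, target_len=CLIP_SAMPLES):
    """Zero-pad at the end or crop from the start to reach `target_len`.

    Assumption: the description does not say where padding or cropping
    is applied, so the simplest choice is used here.
    """
    samples = list(samples)
    if len(samples) >= target_len:
        return samples[:target_len]
    return samples + [0.0] * (target_len - len(samples))


def sample_scene(rng):
    """Draw one scene configuration from the ranges in the description."""
    n_sources = rng.randint(2, 5)           # inclusive; uniform is an assumption
    sources = []
    for _ in range(n_sources):
        is_moving = rng.random() < 0.75     # 75% chance of a moving source
        speed = rng.uniform(0.0, 3.0) if is_moving else 0.0
        sources.append(SourceConfig(is_moving, speed))

    room = (rng.uniform(5.0, 8.0),          # width in meters
            rng.uniform(5.0, 8.0),          # length in meters
            rng.uniform(3.0, 4.0))          # height in meters
    return SceneConfig(
        room_dims_m=room,
        rt60_s=rng.uniform(0.2, 0.6),
        snr_db=rng.uniform(6.0, 30.0),      # uniform is an assumption
        sources=sources,
    )
```

Passing a seeded generator, for example `sample_scene(random.Random(0))`, keeps the drawn configurations reproducible. Each configuration would then be handed to a room simulator to render the 4-channel mixture.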