
RESEARCH

Speech Enhancement/Separation

In non-studio recording environments, recorded speech can be degraded by microphone imperfections or environmental noise. The goal of this research is to enhance the speech signal by suppressing the noise in a degraded signal using deep neural network (DNN) architectures.
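A common training target in DNN-based enhancement is a time-frequency mask that keeps speech-dominated bins and suppresses noise-dominated ones. The sketch below computes the ideal ratio mask (IRM) from oracle clean and noise magnitudes; in a deployed system, the DNN would predict this mask from the noisy mixture alone. The shapes and the additive-mixture approximation are illustrative assumptions, not this group's specific model.

```python
import numpy as np

def ideal_ratio_mask(clean_mag, noise_mag, eps=1e-8):
    """Ideal ratio mask (IRM), a common DNN training target for enhancement:
    values near 1 keep a time-frequency bin, values near 0 suppress it."""
    return clean_mag**2 / (clean_mag**2 + noise_mag**2 + eps)

rng = np.random.default_rng(0)
clean = np.abs(rng.normal(size=(257, 100)))  # |STFT| of clean speech (freq x frames)
noise = np.abs(rng.normal(size=(257, 100)))  # |STFT| of noise
mix = clean + noise                          # additive-mixture approximation

mask = ideal_ratio_mask(clean, noise)        # at inference, a DNN predicts this from mix
enhanced = mask * mix                        # attenuate noise-dominated bins
```

Applying the mask moves the mixture spectrogram closer to the clean one, which is why masking remains a strong baseline objective for enhancement networks.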


Binaural Rendering

To realize a truly intelligent and adaptive audio rendering system, we aim to build a DNN architecture that can extract environmental information from audio signals and noise. Using the extracted information together with the original 3D audio data, we can regenerate the 3D audio scene that best suits the given playback environment.
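Binaural rendering places a source in 3D space by reproducing the interaural cues a listener would hear. A minimal sketch of those cues, assuming a Woodworth-style spherical-head ITD model and an illustrative 6 dB maximum ILD (real systems convolve with measured HRTFs instead):

```python
import numpy as np

def render_binaural(mono, azimuth_deg, fs=16000, head_radius=0.0875, c=343.0):
    """Toy binaural renderer using interaural time/level differences (ITD/ILD).
    The 6 dB maximum ILD is an illustrative assumption, not a measured value."""
    az = np.deg2rad(azimuth_deg)
    itd = (head_radius / c) * (np.abs(az) + np.sin(np.abs(az)))  # Woodworth ITD model
    delay = int(round(itd * fs))                                 # far-ear delay (samples)
    ild_gain = 10 ** (-6.0 * np.abs(np.sin(az)) / 20.0)          # far-ear attenuation
    near = mono
    far = np.concatenate([np.zeros(delay), mono])[: len(mono)] * ild_gain
    left, right = (far, near) if azimuth_deg > 0 else (near, far)
    return np.stack([left, right])  # shape (2, n_samples)

signal = np.sin(2 * np.pi * 440 * np.arange(1600) / 16000)
stereo = render_binaural(signal, 90.0)  # source fully to the listener's right
```

An adaptive renderer would additionally modify these cues (and the reverberation) based on the environmental information extracted by the DNN.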


Room Geometry Inference

Estimating indoor geometry is a crucial step in creating realistic digital twins of indoor spaces. Traditionally, methods for determining indoor geometry have relied on vision-based techniques. However, creating an accurate digital twin becomes challenging when the camera-captured indoor image contains occluded areas. To overcome this challenge, we use acoustic echoes to infer room geometry.
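The core geometric relation is that a first-order echo travels to a wall and back, so its delay maps to a wall distance of c·t/2. The sketch below recovers a wall distance from a synthetic impulse response; a real pipeline would estimate echo times from measured room impulse responses and combine several of them to reconstruct the full geometry.

```python
import numpy as np

fs, c = 16000, 343.0          # sample rate (Hz), speed of sound (m/s)
wall_dist = 3.0               # ground-truth wall distance (m), for this synthetic example

rir = np.zeros(1000)          # toy room impulse response
rir[0] = 1.0                  # direct path
echo_sample = int(round(2 * wall_dist / c * fs))
rir[echo_sample] = 0.4        # first-order wall reflection

# detect the strongest reflection after the direct sound
echo_idx = 1 + np.argmax(np.abs(rir[1:]))
delay_s = echo_idx / fs
estimated = c * delay_s / 2.0  # sound travels to the wall and back
```

With echo delays measured at several microphone or source positions, each wall can be located even when it is hidden from the camera.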


Anomaly Detection

Anomalous sounds are atypical sound patterns that deviate from expected or normal acoustic behavior; they can arise from various factors such as equipment malfunctions, environmental disturbances, or unexpected events. In anomalous sound detection, the main goal is to detect anomalous sounds using machine learning models trained only on normal sound data.
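A standard normal-only recipe is to learn a compact model of normal data and score test samples by how poorly the model reconstructs them. As a lightweight stand-in for a DNN autoencoder, the sketch below fits a linear subspace (PCA) to normal features and uses reconstruction error as the anomaly score; the feature dimensions and data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
# "normal" training features lie near a low-dimensional subspace (e.g. log-mel stats)
basis = rng.normal(size=(2, 16))
normal = rng.normal(size=(500, 2)) @ basis + 0.05 * rng.normal(size=(500, 16))

mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
W = vt[:2]  # learned "normal" subspace (analogous to an autoencoder bottleneck)

def anomaly_score(x):
    """Reconstruction error after projecting onto the normal subspace:
    normal samples reconstruct well, anomalies do not."""
    z = (x - mean) @ W.T
    recon = z @ W + mean
    return np.linalg.norm(x - recon, axis=-1)

normal_test = rng.normal(size=(10, 2)) @ basis + 0.05 * rng.normal(size=(10, 16))
anomalies = rng.normal(size=(10, 16))  # samples off the normal subspace
```

Thresholding this score separates normal from anomalous sounds without ever seeing an anomaly during training, which mirrors how autoencoder-based detectors are used in practice.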


Target Sound Extraction

When many different types of sounds are mixed together with noise and reverberation, the goal is to isolate the desired signal specified by an input clue (e.g., the class label shown in the figure). We focus on a Transformer-based model capable of extracting target sounds using class labels and direction information as clues.
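The key mechanism is conditioning: a clue (class label or direction) is embedded and used to modulate the mixture representation so the network extracts only the matching source. The sketch below uses a FiLM-style sigmoid gate from the clue embedding; the gating scheme, dimensions, and random embeddings are illustrative assumptions, not this group's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(2)
n_classes, feat_dim = 4, 8
# learned clue embeddings (randomly initialized here; trained jointly in practice)
embed = rng.normal(size=(n_classes, feat_dim))

def condition(mixture_feats, class_id):
    """FiLM-style conditioning: the clue embedding gates the mixture features,
    steering the extractor toward the target class."""
    gate = 1.0 / (1.0 + np.exp(-embed[class_id]))  # sigmoid gate in [0, 1]
    return mixture_feats * gate

feats = rng.normal(size=(20, feat_dim))  # frames x features from a mixture encoder
out_class0 = condition(feats, 0)
out_class1 = condition(feats, 1)
```

Because each clue yields a different gate, the same mixture representation is steered toward different target sounds, which is how label- and direction-conditioned extraction models operate.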


Sound Event Localization and Detection

The type and location of sound events are important information for sound-scene analysis. Sound event localization and detection (SELD) is a task that classifies sound events and localizes their direction of arrival (DoA) over time using multichannel acoustic signals. Spatial, spectral, and temporal information are all crucial for detecting and localizing sound events. However, prior studies employ spectral and channel information only as the embedding for a sequence of temporal attention, which limits the deep neural network's ability to extract meaningful features from the spectral or spatial domains. Therefore, we propose a novel framework that improves SELD performance by applying attention mechanisms independently to distinct domains.
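The idea of domain-wise attention can be sketched with plain self-attention applied separately along the time and frequency axes of a multichannel feature tensor. The identity Q/K/V projections and tensor shapes below are simplifying assumptions chosen to make the mechanism concrete, not the proposed network itself.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Single-head self-attention over the first axis of x
    (identity Q/K/V projections for brevity)."""
    weights = softmax(x @ x.T / np.sqrt(x.shape[-1]))
    return weights @ x

rng = np.random.default_rng(3)
C, F, T, D = 4, 6, 10, 8                 # channels, freq bins, frames, feature dim
feats = rng.normal(size=(C, F, T, D))

# attend over each domain independently, rather than folding spectral/spatial
# information into the embedding of a single temporal attention
temporal = np.stack([[self_attention(feats[c, f]) for f in range(F)]
                     for c in range(C)])              # (C, F, T, D)
spectral = np.stack([[self_attention(feats[c, :, t]) for t in range(T)]
                     for c in range(C)])              # (C, T, F, D): freq-axis attention
```

Giving each domain its own attention lets the network weigh time frames, frequency bands, and channels separately, instead of forcing one temporal attention to account for all three.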
