
Listen2Scene: Interactive material-aware binaural sound propagation for reconstructed 3D scenes (IEEE VR 2024)

University of Maryland, College Park, USA

Abstract

We present an end-to-end binaural audio rendering approach (Listen2Scene) for virtual reality (VR) and augmented reality (AR) applications. We propose a novel neural-network-based binaural sound propagation method to generate acoustic effects for 3D models of real environments. Any clean or dry audio can be convolved with the generated acoustic effects to render audio corresponding to the real environment. We propose a graph neural network that uses both the material and the topology information of the 3D scenes and generates a scene latent vector. Moreover, we use a conditional generative adversarial network (CGAN) to generate acoustic effects from the scene latent vector. Our network is able to handle holes or other artifacts in the reconstructed 3D mesh model. We present an efficient cost function for the generator network to incorporate spatial audio effects. Given the source and the listener position, our learning-based binaural sound propagation approach can generate an acoustic effect in 0.1 milliseconds on an NVIDIA GeForce RTX 2080 Ti GPU and can easily handle multiple sources. We have evaluated the accuracy of our approach against binaural acoustic effects generated using an interactive geometric sound propagation algorithm and against captured real acoustic effects. We also performed a perceptual evaluation and observed that the audio rendered by our approach is more plausible than audio rendered using prior learning-based sound propagation algorithms.
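The abstract notes that any dry audio can be convolved with the generated acoustic effects (a two-channel binaural room impulse response) to produce the final rendering. As a minimal illustrative sketch, not the paper's implementation, the convolution step can look like this, assuming the RIR is given as an `(N, 2)` left/right array:

```python
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(dry, rir_lr):
    """Convolve a mono dry signal with a two-channel binaural RIR.

    dry:    1-D float array, the anechoic (dry) source signal.
    rir_lr: (N, 2) float array, left/right impulse responses, e.g. as
            produced by a learned method such as Listen2Scene.
    Returns an (M, 2) array of binaural audio, M = len(dry) + N - 1.
    """
    left = fftconvolve(dry, rir_lr[:, 0])   # left-ear channel
    right = fftconvolve(dry, rir_lr[:, 1])  # right-ear channel
    return np.stack([left, right], axis=-1)
```

Since convolution is linear, the same routine can be applied per source and the results summed to render a scene with multiple sources.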

Supplementary Video




Geometric sound propagation algorithm vs Listen2Scene

We compared sound-rendered 3D scenes with two sound sources. In this experiment, we evaluate whether our approach creates overall sound effects very similar to those of an interactive geometric sound propagation algorithm. In practice, Listen2Scene is two orders of magnitude faster than the interactive geometric sound propagation algorithm.
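Several of these comparisons use two simultaneous sources. Because room acoustics is linear, a multi-source scene can be rendered by convolving each dry signal with its own binaural RIR and summing the contributions; a hypothetical sketch (the function names and array shapes are assumptions, not the paper's code):

```python
import numpy as np
from scipy.signal import fftconvolve

def mix_sources(dry_signals, rirs):
    """Render multiple sources into one binaural mix.

    dry_signals: list of 1-D float arrays (one dry signal per source).
    rirs:        list of (N_i, 2) float arrays (one binaural RIR per
                 source, e.g. generated per source position).
    Returns an (M, 2) array: the sum of the per-source renderings,
    zero-padded to the length of the longest one.
    """
    rendered = []
    for dry, rir in zip(dry_signals, rirs):
        out = np.stack([fftconvolve(dry, rir[:, 0]),
                        fftconvolve(dry, rir[:, 1])], axis=-1)
        rendered.append(out)
    length = max(out.shape[0] for out in rendered)
    mix = np.zeros((length, 2))
    for out in rendered:
        mix[:out.shape[0]] += out  # acoustics is linear: contributions add
    return mix
```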

Geometric sound propagation

Listen2Scene




Clean (dry sound) vs Listen2Scene

We compared sound-rendered 3D scenes with a single sound source and with two sound sources. In this experiment, we evaluate whether our approach creates continuous and smooth sound effects as the listener moves around the scene and whether the user can perceive the indirect sound effects.

Single Source

Clean (Dry Sound)

Listen2Scene


Two Sources

Clean (Dry Sound)

Listen2Scene




MESH2IR (2022) vs Listen2Scene

We generated one auralized video each for a single source and for two sources. Our goal is to investigate whether participants feel that the sound effects in the left and right ears change smoothly and synchronously as the user walks through the scene. In addition to distance, we investigate whether our sound effects change smoothly with the direction of the source.

Single Source

MESH2IR

Listen2Scene


Two Sources

MESH2IR

Listen2Scene




Listen2Scene-No-Mat vs Listen2Scene

We auralized two scenes with a single source, one medium-sized and one large, and another scene with two sources. In this experiment, we evaluate whether the reverberation effects from Listen2Scene match the environment more closely than those from Listen2Scene-No-Mat. Our goal is to evaluate the perceptual benefits of adding material characteristics to our learning method.

Single Source (Medium-Sized Scene)

Listen2Scene-No-Mat

Listen2Scene


Single Source (Large Scene)

Listen2Scene-No-Mat

Listen2Scene


Two Sources

Listen2Scene-No-Mat

Listen2Scene




GWA vs Listen2Scene

We auralized three scenes, each with one or two sources, from the 3D-Front dataset using high-quality RIRs generated by GWA and by our Listen2Scene. GWA computes high-quality impulse responses that capture accurate low-frequency and high-frequency wave effects by automatically calibrating geometric acoustic ray tracing against a finite-difference time-domain wave solver. In this experiment, we evaluate the robustness of Listen2Scene on completely new 3D scenes not used during training.

GWA

Listen2Scene




GWA

Listen2Scene




GWA

Listen2Scene