MESH2IR: Neural Acoustic Impulse Response Generator for Complex 3D Scenes (ACM Multimedia 2022)

Abstract

We propose a mesh-based neural network (MESH2IR) to generate acoustic impulse responses (IRs) for indoor 3D scenes represented using a mesh. The IRs are used to create a high-quality sound experience in interactive applications and audio processing. Our method can handle input triangular meshes with arbitrary topologies (2K to 3M triangles). We present a novel training technique to train MESH2IR using energy decay relief and highlight its benefits. We also show that training MESH2IR on IRs preprocessed using our proposed technique significantly improves the accuracy of IR generation. We reduce the non-linearity in the mesh space by transforming 3D scene meshes to latent space using a graph convolution network. Our MESH2IR is more than 200 times faster than a geometric acoustic algorithm on a CPU and can generate more than 10,000 IRs per second on an NVIDIA GeForce RTX 2080 Ti GPU for a given furnished indoor 3D scene. The acoustic metrics are used to characterize the acoustic environment. We show that the acoustic metrics of the IRs predicted from our MESH2IR match the ground truth with less than 10% error. We also highlight the benefits of MESH2IR on audio and speech processing applications such as speech dereverberation and speech separation. To the best of our knowledge, ours is the first neural-network-based approach to predict IRs from a given 3D scene mesh in real-time.
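The energy decay relief (EDR) mentioned in the abstract is commonly defined as the energy remaining in each frequency bin of an IR from a given time frame onward, i.e. a backward cumulative sum of short-time spectral energy. The sketch below assumes that textbook definition. The function name, frame length, hop size, and Hann window are illustrative choices, not the settings used to train MESH2IR.

```python
import numpy as np

def energy_decay_relief(ir, frame_len=256, hop=128):
    """Return an EDR array of shape (n_frames, n_bins) for a 1-D impulse response.

    frame_len and hop are hypothetical defaults chosen for illustration.
    """
    ir = np.asarray(ir, dtype=float)
    if len(ir) < frame_len:
        # Zero-pad very short IRs so that at least one frame exists.
        ir = np.pad(ir, (0, frame_len - len(ir)))
    n_frames = 1 + (len(ir) - frame_len) // hop
    window = np.hanning(frame_len)
    # Slice the IR into overlapping, windowed frames.
    frames = np.stack(
        [ir[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # Short-time spectral energy per frame and frequency bin.
    energy = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Backward integration over time: energy remaining from each frame onward.
    return np.cumsum(energy[::-1], axis=0)[::-1]
```

By construction the EDR never increases over time within a frequency bin, so plotting it in dB shows how quickly each band decays.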

Neural Sound Rendering

Scenario 1

A person sings inside a bedroom and the listener moves around the house. The sound source (i.e., song) is represented using a red sphere.

Scenario 2

In this example, the phone rings in the bedroom and we play the piano in the living room. We move the listener around the house. The sound sources (i.e., ring and piano) are represented using red spheres.

Scenario 3

In this example, a person speaks in the hall and a machine works inside a room. We move the listener around the house. The sound sources (i.e., speech and machine) are represented using red spheres.
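Scenarios like these are typically rendered by convolving each dry source signal with the IR for that source and the current listener position, then summing the results. The following is a minimal sketch of that standard step for a single listener position. The function name is hypothetical, and the demo pipeline used for the videos is not described on this page.

```python
import numpy as np

def render_at_listener(dry_signals, irs):
    """Mix several sources as heard at one listener position.

    dry_signals[i] is the anechoic audio of source i and irs[i] is the IR
    between source i and the listener (for example, one generated IR per source).
    """
    length = max(len(s) + len(h) - 1 for s, h in zip(dry_signals, irs))
    mix = np.zeros(length)
    for s, h in zip(dry_signals, irs):
        wet = np.convolve(s, h)   # reverberant signal for this source
        mix[: len(wet)] += wet    # sources add linearly at the listener
    return mix
```

For a moving listener, the same step would be repeated with updated IRs as the position changes; how the transitions are smoothed is left open here.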

Overall Architecture

The architecture of our MESH2IR. Our mesh encoder network encodes an indoor 3D scene mesh to the latent space. The mesh latent vector and the source and listener locations are combined to produce a scene vector embedding. The generator network generates an IR corresponding to the input scene vector embedding. For the given scene vector embedding, the discriminator network discriminates between the generated IR and the ground truth IR during training.
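One simple way to combine the mesh latent vector with the source and listener locations is concatenation. The sketch below assumes concatenation and an 8-dimensional latent vector purely for illustration; the page does not state the combination rule or the latent size, and the function name is hypothetical.

```python
import numpy as np

def scene_embedding(mesh_latent, source_xyz, listener_xyz):
    """Concatenate the mesh latent vector with two 3D positions.

    Assumption: "combined" is read here as plain concatenation.
    """
    mesh_latent = np.asarray(mesh_latent, dtype=float)
    source_xyz = np.asarray(source_xyz, dtype=float)
    listener_xyz = np.asarray(listener_xyz, dtype=float)
    return np.concatenate([mesh_latent, source_xyz, listener_xyz])

# Hypothetical 8-dimensional latent vector plus two positions in metres.
embedding = scene_embedding(np.zeros(8), [1.0, 2.0, 1.5], [3.0, 0.5, 1.2])
```

Under this assumption the generator and the discriminator would both receive the same conditioning vector, whose length is the latent size plus six.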

Expansion of Mesh Encoder Network

The expansion of our mesh encoder in the Overall Architecture. Our encoder network transforms the indoor 3D scene mesh into a latent vector. The topology information (edge connectivity) and the node features (vertex coordinates) are extracted from the mesh and passed to our graph neural network.
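The extraction step described above can be sketched as follows: the vertex coordinates become the node features, and the three sides of every triangle become undirected edges. The function name and the array layout are assumptions made for illustration, not the released preprocessing code.

```python
import numpy as np

def mesh_to_graph(vertices, faces):
    """Turn a triangle mesh into graph inputs.

    vertices: (V, 3) coordinates, used directly as node features.
    faces:    (F, 3) vertex indices, one row per triangle.
    Returns (node_features, edges), where edges is an (E, 2) array of
    unique undirected edges with the smaller index first.
    """
    node_features = np.asarray(vertices, dtype=float)
    faces = np.asarray(faces, dtype=int)
    # Each triangle (a, b, c) contributes the sides (a, b), (b, c), (c, a).
    pairs = np.concatenate([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]])
    # Sort each pair so (a, b) and (b, a) collapse to one undirected edge.
    pairs = np.sort(pairs, axis=1)
    edges = np.unique(pairs, axis=0)
    return node_features, edges
```

Two triangles that share a side yield five edges rather than six, because the shared side is counted once.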

BibTeX


@article{ratnarajah2022mesh2ir,
  title={MESH2IR: Neural Acoustic Impulse Response Generator for Complex 3D Scenes},
  author={Ratnarajah, Anton and Tang, Zhenyu and Aralikatti, Rohith Chandrashekar and Manocha, Dinesh},
  journal={arXiv preprint arXiv:2205.09248},
  year={2022}
}