Abstract: We propose to characterize and improve the performance of blind room impulse response (RIR) estimation systems in the context of a downstream application scenario, far-field automatic speech recognition (ASR). We first establish the connection between improved RIR estimation and improved ASR performance, which gives us a means of evaluating neural RIR estimators. We then propose a GAN-based architecture that encodes RIR features from reverberant speech and constructs an RIR from the encoded features. The model is trained with a novel energy decay relief loss that encourages the estimated RIR to capture energy-based properties of the input reverberant speech. We show that our model outperforms the state-of-the-art baselines on acoustic benchmarks (by 72% on the energy decay relief metric and 22% on the early reflection energy metric), as well as in an ASR evaluation task (by 6.9% in word error rate).
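For readers unfamiliar with the energy decay relief (EDR), the sketch below computes it in the standard way: the RIR is split into short-time spectra, and for each frequency bin the energy remaining from each frame onward is accumulated by backward integration. This is only an illustration under stated assumptions. The function names, the frame length and hop size, the Hann window, and the mean-absolute-difference comparison in `edr_l1_distance` are all hypothetical choices made here; the abstract does not specify the exact form of the loss used in the paper.

```python
import numpy as np

def energy_decay_relief(rir, frame_len=256, hop=128):
    """Return the EDR in dB with shape (num_frames, num_bins).

    frame_len and hop are illustrative defaults, not values from the paper.
    """
    rir = np.asarray(rir, dtype=np.float64)
    # Pad very short inputs so that at least one full frame exists.
    if len(rir) < frame_len:
        rir = np.pad(rir, (0, frame_len - len(rir)))
    num_frames = 1 + (len(rir) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack(
        [rir[i * hop : i * hop + frame_len] * window for i in range(num_frames)]
    )
    # Short-time energy spectrum, shape (num_frames, frame_len // 2 + 1).
    energy = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Backward integration: energy remaining from each frame to the end.
    edr = np.cumsum(energy[::-1], axis=0)[::-1]
    # Small constant avoids log(0) in silent bins.
    return 10.0 * np.log10(edr + 1e-12)

def edr_l1_distance(rir_est, rir_ref, **kwargs):
    """Hypothetical comparison: mean absolute EDR difference in dB.

    Assumes both RIRs have the same length.
    """
    diff = energy_decay_relief(rir_est, **kwargs) - energy_decay_relief(rir_ref, **kwargs)
    return float(np.mean(np.abs(diff)))
```

Because the EDR is a running sum of non-negative energies taken from the end of the response, it never increases over time within a frequency bin, which is what makes it a convenient summary of how quickly a room's reverberation dies away at each frequency.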
Ground truth RIR
Input reverberant speech
Estimated RIR using S2IR-GAN (Ours)
Reconstructed reverberant speech using estimated RIR
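The page does not say how the reconstructed example is produced. A common way to apply an RIR to a signal is linear convolution with a dry (clean) recording, and the minimal sketch below assumes that approach. The function name `reverberate`, the availability of a clean recording, and the truncation to the input length are assumptions made for illustration.

```python
import numpy as np

def reverberate(clean_speech, rir):
    """Convolve a dry signal with an RIR and trim to the input length.

    Illustrative only: assumes both signals share the same sample rate.
    """
    clean = np.asarray(clean_speech, dtype=np.float64)
    rir = np.asarray(rir, dtype=np.float64)
    # Full linear convolution has length len(clean) + len(rir) - 1.
    wet = np.convolve(clean, rir, mode="full")
    # Keep the first len(clean) samples so input and output stay aligned.
    return wet[: len(clean)]
```

With a unit impulse as the RIR the output equals the input, and an RIR consisting of a single delayed impulse simply shifts the signal, which is a quick sanity check before using a measured or estimated response.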