Denoising Diffusion Probabilistic Models (DDPMs) have been used extensively and successfully in computer vision, with many models producing particularly high-quality results in image inpainting. We propose applying similar diffusion methods to the speech domain to perform super-resolution on speech samples. We believe that a method analogous to image inpainting can be applied to low-resolution speech samples to recover a target high-resolution sample. In this study, we compare super-resolution results from multiple baseline models with those of an unconditional diffusion-based approach.
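To make the inpainting analogy concrete, the following is a minimal sketch of inpainting-style sampling with an unconditional DDPM, not the exact procedure used in this study. It assumes the observed part of the signal can be marked with a binary mask (for example, the lower frequency bins of a spectrogram-like array, which is an assumption made here for illustration). The names `make_schedule`, `placeholder_denoiser`, and `repaint_style_sample` are hypothetical, the linear noise schedule is an arbitrary choice, and the denoiser is a stand-in for a trained noise-prediction network. The resampling (jump) steps of the original RePaint algorithm are omitted for brevity.

```python
import numpy as np

def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (an assumed choice): betas, alphas, cumulative alphas."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def placeholder_denoiser(x_t, t):
    """Hypothetical stand-in for a trained unconditional noise predictor eps(x_t, t)."""
    return np.zeros_like(x_t)

def repaint_style_sample(known, mask, denoiser, num_steps=50, seed=0):
    """Inpainting-style sampling: mask == 1 marks observed entries, 0 marks entries to generate."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(known.shape)  # start from pure Gaussian noise
    for t in range(num_steps - 1, -1, -1):
        # Unknown region: one reverse (denoising) step of the unconditional model.
        eps = denoiser(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x_unknown = mean + np.sqrt(betas[t]) * noise
        # Known region: noise the observed input to the level of step t - 1.
        if t > 0:
            ab_prev = alpha_bars[t - 1]
            x_known = (np.sqrt(ab_prev) * known
                       + np.sqrt(1.0 - ab_prev) * rng.standard_normal(x.shape))
        else:
            x_known = known
        # Combine: keep observed content where mask == 1, generated content elsewhere.
        x = mask * x_known + (1.0 - mask) * x_unknown
    return x

# Toy "spectrogram" with 8 frequency bins and 16 frames (illustrative values only).
known = np.ones((8, 16))
mask = np.zeros((8, 16))
mask[:4, :] = 1.0  # assumption: the lower half of the bins is observed in the low-resolution input
out = repaint_style_sample(known, mask, placeholder_denoiser)
```

In this sketch the observed region of the output matches the input exactly, because the final step copies the known entries unchanged, while the masked-out region is filled in by the reverse diffusion process. With a trained denoiser in place of the placeholder, that region would hold the generated content.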
Listening samples for evaluation
We recommend using headphones for this section.
| | 196-122150-0000 | 196-122150-0001 |
|---|---|---|
| Input | (audio) | (audio) |
| Target | (audio) | (audio) |
| LSTM | (audio) | (audio) |
| U-Net | (audio) | (audio) |
| NU-wave2 | (audio) | (audio) |
| Repaint (our model) | (audio) | (audio) |
Unconditional diffusion produced plausible sounds from random noise

| | | |
|---|---|---|
| Unconditional diffusion | (audio) | (audio) |