A new AI system can create realistic street-view images based solely on audio recordings, with human judges correctly matching its images to their source audio roughly 80% of the time.
Developed by Assistant Professor Yuhao Kang and his team at the University of Texas at Austin, the “Soundscape-to-Image Diffusion Model” was trained using audio-visual clips from YouTube, spanning urban and rural streets across North America, Asia, and Europe.
The AI system analyzed 10-second clips to learn how specific sounds correlated with visual elements, such as buildings, greenery, and sky.
Once trained, it generated images for 100 audio tracks it had not encountered before. Human judges then matched these AI-generated images to their corresponding audio clips at the roughly 80% rate noted above, highlighting the system's effectiveness.
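To make the idea concrete, the sketch below shows one plausible way an audio clip could condition an image diffusion model: an audio encoder compresses a 10-second clip into an embedding, and a denoiser uses that embedding while learning to remove noise from the paired street-view image. This is a minimal illustration under assumed module names, shapes, and hyperparameters, not the researchers' released code.

```python
# Hypothetical sketch of audio-conditioned diffusion training.
# All names, shapes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class AudioEncoder(nn.Module):
    """Maps a mel-spectrogram of a 10-second clip to a conditioning vector."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # pool over time and frequency
            nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, 1, n_mels, time_frames)
        return self.net(mel)


class ConditionedDenoiser(nn.Module):
    """Toy denoiser that predicts image noise given the audio embedding."""

    def __init__(self, embed_dim: int = 256, channels: int = 3):
        super().__init__()
        self.film = nn.Linear(embed_dim, 2 * channels)  # FiLM-style conditioning
        self.conv = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, noisy_img: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
        scale, shift = self.film(audio_emb).chunk(2, dim=1)
        x = noisy_img * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
        return self.conv(x)


# One simplified training step: the denoiser learns to predict the noise added
# to the street-view image paired with the audio clip (standard diffusion
# objective; the noise schedule is omitted for brevity).
encoder, denoiser = AudioEncoder(), ConditionedDenoiser()
mel = torch.randn(4, 1, 64, 1000)    # four 10-second clips (placeholder data)
images = torch.randn(4, 3, 64, 64)   # the matching street-view frames
noise = torch.randn_like(images)
noisy = images + noise
pred = denoiser(noisy, encoder(mel))
loss = nn.functional.mse_loss(pred, noise)
loss.backward()
```

The key design point the sketch illustrates is that the model never maps sound to pixels directly; it learns which visual statistics (buildings, greenery, sky, lighting) tend to co-occur with which acoustic patterns, then samples images consistent with a new clip's embedding.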

Interestingly, the generated images also captured lighting conditions, including sunny, cloudy, and nighttime settings, based on ambient sound variations.
The study, published in Nature, suggests applications in forensics and urban planning, with researchers noting its potential to enhance understanding of how sound influences mental health and community design.