AI Heard That! ICML 2025 Workshop on Machine Learning for Audio

For questions, email mlforaudioworkshop@gmail.com

Workshop Description

Machine learning research for audio applications has experienced a surge of innovation in recent years, with prominent and widely relevant advances emerging rapidly and momentum continuing to build. Numerous key problems in the audio domain continue to attract widespread attention. This ongoing relevance, alongside the success of the Machine Learning for Audio workshop at NeurIPS 2023, has inspired us to bring the workshop to ICML 2025. Presenting it to a wider audience will provide a valuable opportunity to bring together practitioners of audio tools and machine learning researchers interested in audio, fostering community, discussion, and future collaboration. In addition, with the field moving so rapidly, the workshop will provide a dedicated space for the crucial ethical discussions that researchers must have around applications of generative machine learning for audio.

The Machine Learning for Audio workshop at ICML 2025 will cover a broad range of tasks and challenges involving audio data. These include, but are not limited to: speech modeling, generation of environmental and other ambient sounds, novel generative models, music generation in raw audio form, text-to-speech methods, denoising of speech and music, data augmentation, acoustic event classification, transcription, source separation, and multimodal problems.

We plan to solicit original extended abstracts (up to 4 pages) in these areas, which will be reviewed by the organizers and an additional set of reviewers. We anticipate approximately 30 accepted submissions. To avoid potential conflicts of interest, no organizer or reviewer will review a submission from their own organization; this will be enforced via CMT. We also plan to run a demo session alongside the poster session, where contributors can present live demos of their work.

Our team of organizers was involved with two separate audio-related workshops at ICML 2022: the Workshop on Machine Learning for Audio Synthesis and the ICML Expressive Vocalizations Workshop and Competition. We then combined our organizing committees and offered the Workshop on Machine Learning for Audio at NeurIPS 2023. This year, we have added new organizers to the team and plan to improve upon previous iterations of the workshop with a lineup of prominent in-person invited speakers, more accessible data distribution (as outlined below), and more.

Data Release

Recognizing the scarcity of free, publicly available audio data, Modulate and Hume AI will contribute several speech datasets alongside the workshop, each large in scale for its respective domain. These datasets, accessible via Google Drive, will include acted speech (professionally acted scripts), spontaneous speech (streamer content), mimicked speech (short-form emotive recordings), and mimicked non-verbal speech. The organizers hope this allows researchers from smaller research groups and academia to work with and validate findings on larger, more generalizable datasets. In previous iterations, multiple submissions used versions of the provided data in their work, and a corresponding white paper was subsequently posted on arXiv.

Further details on the available data are described here.
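As a rough illustration (not an official loader), the sketch below shows how a participant might fetch one of the shared Google Drive folders and inspect a few recordings. It assumes the data is distributed as WAV files in a shared folder and uses the third-party gdown and librosa packages; the folder URL is a placeholder, not the actual link.

```python
# Minimal sketch: download a shared Drive folder of speech recordings
# and print basic statistics for a few files. The URL is a placeholder.
import glob

import gdown    # pip install gdown
import librosa  # pip install librosa

FOLDER_URL = "https://drive.google.com/drive/folders/<placeholder-id>"
LOCAL_DIR = "mlforaudio_data"

# Download the shared folder (existing files are skipped on re-runs).
gdown.download_folder(FOLDER_URL, output=LOCAL_DIR, quiet=False)

# Load each recording at its native sampling rate and report duration.
for path in sorted(glob.glob(f"{LOCAL_DIR}/**/*.wav", recursive=True))[:5]:
    audio, sr = librosa.load(path, sr=None, mono=True)
    print(f"{path}: {len(audio) / sr:.1f} s at {sr} Hz")
```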

Call for Papers

We are calling for extended abstracts of up to 4 pages, excluding references. Accepted submissions will be posted on the workshop website but not published or archived. Several submissions will be chosen for 15-minute contributed talks, and the remaining accepted submissions will participate in the poster & demo session. Please make sure submissions adhere to the NeurIPS format. The review process will be double-blind, so please do not include any author information in your submission. Authors may also submit supplementary materials along with their papers if they wish (e.g., a preview of a potential demo); reviewers will not be required to read, view, or listen to such material.

Submission Portal

Timeline

Proposed Schedule

We plan for the workshop to be an 8-hour event. Below is an approximate timetable of the workshop schedule, subject to change. We have allowed ample time for informal discussion during the coffee break, poster & demo session, and open conversation session, as well as for audience participation during the panel discussion and the Q&A following invited talks.

Time | Activity | Description
9:00 AM | Invited Speakers 1 & 2 | Two 25-minute talks by invited speakers and Q&A.
10:00 AM | Contributed Talks 1-3 | Three 15-minute contributed talks by selected submissions and Q&A.
11:00 AM | Coffee Break |
11:30 AM | Invited Speakers 3 & 4 | Two 25-minute talks by invited speakers and Q&A.
12:30 PM | Lunch |
1:30 PM | Poster & Demo Session | Poster session alongside live demos from selected submissions.
2:30 PM | Invited Speakers 5 & 6 | Two 25-minute talks by invited speakers and Q&A.
3:30 PM | Contributed Talks 4-6 | Three 15-minute contributed talks by selected submissions and Q&A.
4:30 PM | Panel Discussion | Panel of invited speakers, where a moderator will facilitate discussion, including questions from the audience.
5:00 PM | Wrap-up and Open Conversation | A few minutes of closing remarks followed by informal conversation among workshop attendees.

Invited Speakers

We have curated a list of invited speakers from a wide variety of fields within the audio domain, listed below along with brief biographies. All confirmed invited speakers will attend in person.

James Betker is a research scientist at OpenAI, where he is one of the audio leads for GPT-4o. He is also the lead author of DALL-E 3. Previously, he created TorToiSe, a popular open source text-to-speech system. He also had a long tenure as a senior software engineer at Garmin, where he developed vehicular navigation systems. His research interests include generative models for audio and images.

Daniel PW Ellis is a research scientist at Google. From 2000 to 2015, he was a professor in the Electrical Engineering Department at Columbia University, and in 2015 he joined Google in New York. He also runs the AUDITORY email list, which connects over 2,000 researchers worldwide in the perception and cognition of sound. His research interests include speech recognition, music description, and environmental sound processing.

Albert Gu is an assistant professor at Carnegie Mellon University. Previously, he received his PhD from Stanford University. He is broadly interested in theoretical and empirical aspects of deep learning. His research involves understanding and developing approaches that can be practically useful for modern large-scale machine learning models, such as his current focus on deep sequence models. His work on state-space models, and in particular S4 and its variants, has been hugely influential in the audio community.

Laura Laurenti is a postdoctoral scholar at ETH Zurich, where she studies the application of deep learning audio models to seismic data. She received her PhD from La Sapienza University of Rome. Her research includes applying deep learning to laboratory earthquakes, as well as foundation models and diffusion models for seismic data.

Pratyusha Sharma is a PhD student in EECS at MIT, advised by Prof. Antonio Torralba and Prof. Jacob Andreas. She enjoys thinking about the interplay between language, reasoning, and sequential decision making. Her research goal is to understand systems that exhibit broadly intelligent behaviors (AI systems and biological organisms) and to build better AI systems. She has a broad range of speaking experience, including invited talks in the past year alone at TED AI, the National Oceanic and Atmospheric Administration, and the Biennial Conference on the Biology of Marine Mammals. Her research was also recently featured in National Geographic Magazine.

Organizers

Alice Baird is a senior AI research scientist at Hume AI, NY, USA, where she works on modeling expressive human behaviors from audio and other modalities. She earned her Ph.D. at the University of Augsburg in 2022. Her work on emotion understanding from auditory, physiological, and multimodal data has been widely published in leading journals and conferences. She has co-organized several machine learning competitions, including the 2022 ICML Expressive Vocalizations Workshop and the 2023 NeurIPS Workshop on Machine Learning for Audio.

Sander Dieleman is a research scientist at DeepMind in London, UK, where he contributed to the development of AlphaGo and WaveNet. His research focuses on generative modeling of perceptual signals at scale, including audio (speech & music) and visual data. He has co-organized multiple workshops, including the NeurIPS workshop on machine learning for creativity and design (2017-2020), the RecSys workshop on deep learning for recommender systems (2016-2018), the Machine Learning for Audio Synthesis workshop at ICML 2022, and the Workshop on Machine Learning for Audio at NeurIPS 2023.

Chris Donahue is an assistant professor at Carnegie Mellon University and a research scientist at Google DeepMind. His research focuses on developing and responsibly deploying generative AI for music and creativity to unlock and augment human creative potential. His work includes improving machine learning methods for controllable generative modeling for music, audio, and sequential data, as well as deploying interactive systems that allow a broad audience—including non-musicians—to harness generative music AI through intuitive controls.

Brian Kulis is an associate professor at Boston University and a former Amazon Scholar who worked on Alexa. His research focuses on machine learning, particularly applications in audio problems such as detection and generation. He has won best paper awards at ICML and CVPR and has organized multiple workshops at ICCV, NeurIPS, and ICML. He has also served as an area or senior area chair at major AI conferences and has organized tutorials at ICML and ECCV.

David Liu is a Ph.D. student in the Department of Computer Science at Boston University. His research focuses on deep learning for audio, with a particular emphasis on state-space models. He earned his bachelor’s degree in computer science, data science, and mathematics from the University of Wisconsin - Madison in 2023.

Rachel Manzelli is the Machine Learning Team Lead at Modulate, where she leads the development of audio generation and classification models supporting moderation teams in detecting harms in voice conversations (ToxMod) and real-time voice conversion (VoiceWear). Previously, she worked at Macro as a machine learning engineer, focusing on source separation models. She has co-organized the Machine Learning for Audio Synthesis workshop at ICML 2022 and the Workshop on Machine Learning for Audio at NeurIPS 2023. She earned her bachelor’s degree in computer engineering from Boston University in 2019, where she conducted research in structured music generation and MIR.

Shrikanth Narayanan is a University Professor and holder of the Niki and Max Nikias Chair in Engineering at the University of Southern California (USC). Shri is a Fellow of the National Academy of Inventors (NAI), the Acoustical Society of America (ASA), the Institute of Electrical and Electronics Engineers (IEEE), the International Speech Communication Association (ISCA), the Association for Psychological Science (APS), the American Association for the Advancement of Science (AAAS), the American Institute for Medical and Biological Engineering (AIMBE), and the Association for the Advancement of Affective Computing (AAAC). Shri is a member of the European Academy of Sciences and Arts and a 2022 Guggenheim Fellow.