For questions, email mlforaudioworkshop@gmail.com
Machine learning research for audio applications has experienced a surge of innovation in recent years, with prominent and widely relevant advancements rapidly emerging and momentum continuing to build. Numerous key problems within the audio research domain continue to attract widespread attention. This ongoing relevance, alongside the success of the Machine Learning for Audio workshop at NeurIPS 2023 and ICML 2025, has inspired us to bring this workshop to ICML 2026. We believe that bringing this workshop to a wider audience will provide a good opportunity to bring together practitioners of audio tools and machine learning researchers interested in audio, in order to foster community, discussion, and future collaboration. In addition, with the field moving so rapidly, we believe this workshop will provide a dedicated space for the crucial ethical discussions that must be facilitated among researchers around applications of generative machine learning for audio.
The Machine Learning for Audio workshop at ICML 2026 will cover a broad range of tasks and challenges involving audio data. These include, but are not limited to: methods of speech modeling, generation of environmental or other ambient sounds, novel generative models, music generation in the form of raw audio, text-to-speech methods, denoising of speech and music, data augmentation, classification of acoustic events, transcription, source separation, and multimodal problems.
We plan to solicit original extended abstracts (up to 4 pages) in these areas, which will be reviewed by the organizers and an additional set of reviewers. We anticipate approximately 30 accepted submissions. To avoid potential conflicts of interest, no organizer or reviewer will review a submission from their own organization; this will be enforced by CMT. We also plan to run a demo session alongside the poster session, where contributors will be able to present live demos of their work.
Members of our organizing team were involved with two separate audio-related workshops at ICML 2022: the Workshop on Machine Learning for Audio Synthesis and the ICML Expressive Vocalizations Workshop and Competition. We then combined our organizing committees and offered a workshop at NeurIPS 2023 entitled the Workshop on Machine Learning for Audio. Last year, we added new organizers to the team and hosted a workshop at ICML 2025. This year, we plan to improve upon previous iterations of the workshop with a lineup of prominent in-person invited speakers, more accessible data distribution (as outlined below), and more.
Recognizing the scarcity of free, publicly available audio data, Modulate and Hume AI will contribute several datasets in the speech domain alongside the workshop, all of large scale for their respective domains. These datasets, accessible via Google Drive, will include acted speech (professionally acted scripts), spontaneous speech (streamer content), mimicked speech (short-form emotive recordings), and mimicked non-verbal speech. The organizers hope this allows researchers from smaller research groups and academia to work with and validate findings on larger, more generalizable datasets. In previous iterations, multiple submissions utilized versions of provided data in their work, and a corresponding white paper was subsequently posted on arXiv.
Further details on the available data are described here.
We are calling for extended abstracts of up to 4 pages, excluding references. Accepted submissions will be posted on the workshop website but not published/archived. Several submissions will be chosen for 15-minute contributed talks, and the remaining selected submissions will participate in the poster & demo session. Please make sure submissions adhere to the ICML format. The review process will be double-blind, so please do not include any author information in your submission. Authors may also submit supplementary materials along with their papers if they wish (e.g., a preview of a potential demo). Reviewers will not be required to read/view/listen to said supplementary material.
Timeline
Submission deadline (main paper & all supplementary material): May 25 23:59:59 AOE
Accept/Reject notification date: June 1 AOE
We plan for the workshop to be an 8-hour event. Below is an approximate timetable of the workshop schedule, subject to change. We have been careful to set aside ample time for informal discussion during the coffee break, poster & demo session, and open conversation session, as well as time for audience participation during the panel discussion and Q&A sections following invited talks.
Minje Kim is an Associate Professor in the Siebel School of Computing and Data Science at the University of Illinois Urbana-Champaign and an Amazon Scholar. His research focuses on efficient machine learning for audio, including efficient data representations (e.g., neural audio coding), intelligent signal processing (e.g., speech enhancement and source separation), and generative modeling of audio.
Marius Miron is a Senior AI Research Scientist at the Earth Species Project, where he builds machine-learning and signal-processing methods for bioacoustics to help decode animal communication. Previously, he worked in music AI and audio signal processing (including a PhD on orchestral music source separation) at the Music Technology Group at Pompeu Fabra University.
Tara Sainath is a Distinguished Research Scientist at Google DeepMind and co-lead of the Gemini Audio pillar, known for applying deep learning to advance automatic speech recognition. She earned her S.B., M.Eng., and PhD in EECS from MIT and previously worked at IBM’s T.J. Watson Research Center.
Juhan Nam is a professor at KAIST's Graduate School of Culture Technology and leads the Music and Audio Computing Lab, where he researches music information retrieval and audio/music signal processing. He also serves as an affiliate professor at the Kim Jaechul Graduate School of Artificial Intelligence and the Graduate School of Metaverse. He is a co-founder of Neutune and AudAi.
Heiga Zen is a Principal Scientist at Google DeepMind in Japan, where he researches speech technology and machine learning. He is one of the original authors and first maintainer of the HMM-based speech synthesis system (HTS), and is a Fellow of ISCA and IEEE.
Alice Baird is a senior AI research scientist at Hume AI, NY, USA, where she works on modeling expressive human behaviors from audio and other modalities. She earned her Ph.D. at the University of Augsburg in 2022. Her work on emotion understanding from auditory, physiological, and multimodal data has been widely published in leading journals and conferences. She has co-organized several machine learning competitions, including the 2022 ICML Expressive Vocalizations Workshop and the 2023 NeurIPS Workshop on Machine Learning for Audio.
Sander Dieleman is a research scientist at DeepMind in London, UK, where he contributed to the development of AlphaGo and WaveNet. His research focuses on generative modeling of perceptual signals at scale, including audio (speech & music) and visual data. He has co-organized multiple workshops, including the NeurIPS workshop on machine learning for creativity and design (2017-2020), the RecSys workshop on deep learning for recommender systems (2016-2018), the Machine Learning for Audio Synthesis workshop at ICML 2022, and the Workshop on Machine Learning for Audio at NeurIPS 2023.
Chris Donahue is an assistant professor at Carnegie Mellon University and a research scientist at Google DeepMind. His research focuses on developing and responsibly deploying generative AI for music and creativity to unlock and augment human creative potential. His work includes improving machine learning methods for controllable generative modeling for music, audio, and sequential data, as well as deploying interactive systems that allow a broad audience—including non-musicians—to harness generative music AI through intuitive controls.
Brian Kulis is an associate professor at Boston University and a former Amazon Scholar who worked on Alexa. His research focuses on machine learning, particularly applications in audio problems such as detection and generation. He has won best paper awards at ICML and CVPR and has organized multiple workshops at ICCV, NeurIPS, and ICML. He has also served as an area or senior area chair at major AI conferences and has organized tutorials at ICML and ECCV.
David Liu is a Ph.D. student in the Department of Computer Science at Boston University. His research focuses on deep learning for audio, with a particular emphasis on state-space models. He earned his bachelor’s degree in computer science, data science, and mathematics from the University of Wisconsin-Madison in 2023.
Rachel Manzelli is the Machine Learning Team Lead at Modulate, where she leads the development of audio generation and classification models supporting moderation teams in detecting harms in voice conversations (ToxMod) and real-time voice conversion (VoiceWear). Previously, she worked at Macro as a machine learning engineer, focusing on source separation models. She has co-organized the Machine Learning for Audio Synthesis workshop at ICML 2022 and the Workshop on Machine Learning for Audio at NeurIPS 2023. She earned her bachelor’s degree in computer engineering from Boston University in 2019, where she conducted research in structured music generation and MIR.
Shrikanth Narayanan is a University Professor and holder of the Niki and Max Nikias Chair in Engineering at the University of Southern California (USC). Shri is a Fellow of the National Academy of Inventors (NAI), the Acoustical Society of America (ASA), the Institute of Electrical and Electronics Engineers (IEEE), the International Speech Communication Association (ISCA), the Association for Psychological Science (APS), the American Association for the Advancement of Science (AAAS), the American Institute for Medical and Biological Engineering (AIMBE), and the Association for the Advancement of Affective Computing (AAAC). Shri is a member of the European Academy of Sciences and Arts and a 2022 Guggenheim Fellow.
| # | Paper | Authors |
|---|---|---|
| 1 | Flow Fake: Parametric Efficient Alternative for Transformers | Divyansh Sharma, Shivaay Dhondiyal, Dinesh Kumar Vishwakarma |
| 2 | MondegreensEval: A Phonetic Benchmark for Measuring Language-model Bias in Automatic Speech Recognition | Wan Ju Kang |
| 3 | PianoKontext: Expressive Performance Rendering from Deadpan Context | Dmitrii Gavrilev |
| 4 | RCbench: Benchmarking Retrospective Clarification in ASR | Wei-Ting Huang, Chin-Yuan Yeh, De-Nian Yang, Ming-Syan Chen |
| 5 | Closing the Loop: PID Feedback Control for Interpretable Activation Steering in Symbolic Music Generation | Ioannis Prokopiou, Pantelis Vikatos, Maximos Kaliakatsos-Papakostas, Theodoros Giannakopoulos, Themos Stafylakis |
| 6 | StethoLM: Audio Language Model for Cardiopulmonary Analysis Across Clinical Tasks | Yishan Wang, Tsai-Ning Wang, Mathias Funk, Aaqib Saeed |
| 7 | Multi-task Learning is Not Enough: Representational Entanglement in Dual-output Second Language Speech Recognition | Seung Hwan Cho, Young-Min Kim |
| 8 | ERM-MinMaxGAP: Benchmarking and Mitigating Gender Bias in Multilingual Multimodal Speech-LLM Emotion Recognition | Zi Haur Pang, Xiaoxue Gao, Tatsuya Kawahara, Nancy Chen |
| 9 | RIME: Enabling Large-Scale Agentic Music Post-Production | Noah Schaffer, Nikhil Singh |
| 10 | Drift-Augmented Scoring: Text-Derived Noise Robustness for Zero-Shot Audio–Language Classification | Tu Vo, Sheir Zaheer, Chan Youn Park |
| 11 | The Perceived Fragility of Explanations in Audio Models: Manipulation of Attribution with Unchanged Predictions | Dominik Wiącek, Mateusz Modrzejewski |
| 12 | Probing Token Spaces under Generator Shift in AI-Generated Music Detection | Joonyong Park, Jungwoo Kim, Junyoung Koh, Yuki Saito |
| 13 | SpeakStream: Streaming TTS with Interleaved Data | He Bai, Tatiana Likhomanenko, Zijin Gu, Navdeep Jaitly |
| 14 | AV-JEPA: Extending LeJEPA to Audio-Visual Self-Supervised Learning | Benjamin Robson, Santeri Mentu, Wenshuai Zhao, Arno Solin |
| 15 | SpokenUS: A Spoken User Simulator for Task-Oriented Dialogue | Jonggeun Lee, Junseong Pyo, Yohan Jo |
| 16 | Focus Then Listen: An Empirical Study of Plug-and-Play Audio Enhancer for Noise-Robust Large Audio Language Models | Han Yin, Yang Xiao, Younghoo Kwon, Ting Dang, Jung-Woo Choi |
| 17 | Stacking Complementary CLAP Embeddings for Improving Text-Audio Alignment Correspondence Scoring | Sheng Li, Jiyi Li, Takahiro Shinozaki |
| 18 | Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs | Gio Paik, Hyunseo Shin, Soungmin Lee |
| 19 | DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues | JoonHyeok Shin, Jaehoon Kang, Yujun Lee, Hanna Lee, Yejin Lee, Yoonji Park, Kyuhong Shim |
| 20 | Probing Warmth-Mediated Harm in Speech-Enabled LLMs for Mental-Health Conversations | Eugenia Kim, Bolor-Erdene Jagdagdorj, Dina Pekelis, Leah Zulas, Amanda Minnich |
| 21 | How Small Can a Tandem Speech Front-End Be? Diagnosing Front-End Capacity with Layer Removal | Manato Yaguchi, So Kuroki |
| 22 | Representation Matters in Randomized Smoothing for Audio Classification | Jong-Ik Park, Shreyas Chaudhari, Jose Moura, Carlee Joe-Wong |
| 23 | Prior Dominance in Audio-Visual LLMs: When Generative Models Memorize Over Reasoning Under Cross-modal Conflict | Adarsh Sudheer, David Li, Omar El-Banna, Ishaan Kodarapu, Arjun Bahuguna |
| 24 | Beyond Binary Instrument QA: Probing Instrument Grounding in Music Audio-Language Models | Yujun Lee, JoonHyeok Shin, Hyoeun Kim, Kyuhong Shim |
| 25 | A Geometric Perspective on Composable Emotion Steering in Text-to-Speech Models | Siyi Wang, James Bailey, Ting Dang |
| 26 | Titans-as-a-Layer: Test-Time Memory for Conversational Speech Emotion Recognition | Daniel Chen, Qicong Hu, Yang Xiao, Ting Dang, Hong Jia |
| 27 | Evaluating Open-Weight Audio Models for Privacy Verification in Clinical Speech Redaction | Joseph Colonel, Adam Davidson, Guillermo Cecchi, Baihan Lin |
| 28 | Probing-Based Test-Time Steering of Music Diffusion Transformers | Junyoung Koh |
| 29 | What Matters for Music-Centered Recognition in Audio-Language Models? | Wenye Ma, Ichiro Fujinaga |
| 30 | AVENUE: Audio-Video EditiNg Understanding and Evaluation | Hayeon Kim, Yoojin Jang, Jaejun Yoo |
| 31 | Learning to Hear Motion Before Naming It | Katerina Vinciguerra |
| 32 | Benchmarking Diarization Models | Luca Lanzendörfer, Florian Grötschla, Cesare Blaser, Roger Wattenhofer |
| 33 | Cinematic Source Separation with Dialogue-Driven Sidechain Ducking | Atoof Shakir, Florian Grötschla, Luca Lanzendörfer, Roger Wattenhofer |
| 34 | Mechanistic Insights into Audio-Language Models for Impaired Speech | Pehuén Moure, Bilal Bounajma, Niclas Pokel, Yingqiang Gao, Roman Boehringer, Longbiao Cheng, Gonçalo Guiomar, Shih-Chii Liu |
| 35 | Speaker Separation via Audio Language Modeling | Luca Lanzendörfer, Constantin Pinkl, Florian Grötschla, Roger Wattenhofer |
| 36 | Residual Stream Contrast: A Training-Free Counterfactual Listening Test for Whisper Hallucinations | Arnesh Batra |
| 37 | Flow Matching-Based Speech Source Separation with Best-of-N Biometric Sampling | Anastasia Zorkina, Alexandr Anikin, Nikita Khmelev, Anastasiya Korenevskaya, Sergey Novoselov, Vladimir Volokhov, Maxim Korenevsky, Yuriy Matveev |
| 38 | Blind Audio Restoration using Contrastive Diffusion Guidance | Sattwik Basu, Chaitanya Amballa, Zhongweiyang Xu, Jorge V Sampedro, Srihari Nelakuditi, Romit Roy Choudhury |
| 39 | Physically Grounded Video-to-Audio Generation | Hyun-Bin Oh, Yuhta Takida, Toshimitsu Uesaka, Tae-Hyun Oh, Yuki Mitsufuji |
| 40 | Evaluating Pretrained Music Embeddings for Cross-Performance Jazz Standard Recognition | Çağrı Eser |
| 41 | Alethia: A Foundational Encoder for Voice Deepfakes | Yi Zhu, Brahmi Dwivedi, Jayaram Raghuram, Surya Koppisetti |
| 42 | Post-Training Speech Enhancement Language Models with Perceptual Rewards | Frédéric Berdoz, Luca Lanzendörfer, Antonis Asonitis, Roger Wattenhofer |
| 43 | Behavioral World Models as Missing Infrastructure for Responsible Generative Audio | Brownsatfford Abraham |
| 44 | Multimodal Video-to-Music Recommendation via Semantic Retrieval and Temporal Reranking | Seungheon Doh, Minhee Lee, Sangmoon Lee, Ben Sangbae Chon, Juhan Nam |
| 45 | ListenCare: Encounter-Grounded Audio Question Answering for Long-Form Clinical Conversation Speech | Seongsu Bae, Chaeeun Shim, Sungbae Park, Edward Choi |
| 46 | Where to Read a Frozen Audio Encoder: Objective-Induced Geometry and Zero-Label Layer Selection | Arnesh Batra, Aniket Khandelwal, Arush Gumber, Krish Thukral |
| 47 | Faithful Is Not Interpretable: Sparse Features, Circuits, and Robustness in Frozen Audio Encoders | Arnesh Batra, Aniket Khandelwal, Arush Gumber, Krish Thukral |
| 48 | Best-of-N TTS Evaluation is Confounded by ASR Family Alignment | Taehyung Yu, Seongjae Kang |
| 49 | Expressive Hindi Audiobook Generation with CLAP-Based Retrieval | William Xing, Kiran Raja, Pranav Anuraag, Arjun Bahuguna, Vasu Sharma |
| 50 | Testing Audio Captioning Metrics with Controlled Semantic Perturbations | Assel Yermekova, Vadim Popov, Tasnima Sadekova, Georgii Aparin |
| 51 | PCL: Partitioned Continual Learning via Unsupervised Latent Experts for Audio Classification | Gautham Krishna Gudur, Mohit Malu, Tanmay Khandait, Reza Rahimi Azghan, Anirudh Rayas, Pavan Turaga, Joydeep Ghosh, Hassan Ghasemzadeh, Edison Thomaz, Giulia Pedrielli |
| 52 | Few-Shot Synthetic Accented Speech for ASR Fine-Tuning: What Helps and When? | Yurii Halychanskyi, Nimet Beyza Bozdag, Mark Hasegawa-Johnson, Dilek Hakkani-Tür, Volodymyr Kindratenko |
| 53 | Prosodic Differences Between Child-Directed and Adult-Directed Speech in Text-to-Speech Generation | Jinyoung Jo, Katherine Nguyen, Sean Choi |
| 54 | Decoding Insect Song: A Multitask Semisupervised Orthoptera Bioacoustic Classifier | Olga Isupova, Danil Kuzin, Ella Browning, Tom Mills, Steven Reece |
| 55 | Autoregressive Zero-Shot Voice Conversion | Luca Lanzendörfer, Frédéric Berdoz, Antonis Asonitis, Roger Wattenhofer |
| 56 | Multilingual Speech Editing | Antonis Asonitis, Luca Lanzendörfer, Frédéric Berdoz, Roger Wattenhofer |
The Microsoft CMT service was used for managing the peer-reviewing process for this conference. This service was provided for free by Microsoft and they bore all expenses, including costs for Azure cloud services as well as for software development and support.