Learning to Listen: ICML 2026 Workshop on Machine Learning for Audio

For questions, email mlforaudioworkshop@gmail.com

This workshop will be held on Friday, July 10.

Workshop Description

Machine learning research for audio applications has experienced a surge of innovation in recent years, with prominent and widely relevant advancements rapidly emerging and momentum continuing to build. There are numerous key problems within the audio research domain that continue to attract widespread attention. This ongoing relevance, alongside the success of the Machine Learning for Audio workshops at NeurIPS 2023 and ICML 2025, has inspired us to bring this workshop to ICML 2026. We believe that bringing this workshop to a wider audience will provide a good opportunity to bring together practitioners of audio tools and machine learning researchers interested in audio, in order to foster community, discussion, and future collaboration. In addition, with the field moving so rapidly, we believe this workshop will provide a dedicated space for the crucial ethical discussions that must be facilitated among researchers around applications of generative machine learning for audio.

The Machine Learning for Audio workshop at ICML 2026 will cover a broad range of tasks and challenges involving audio data. These include, but are not limited to: methods of speech modeling, environmental sound generation or other forms of ambient sound, novel generative models, music generation in the form of raw audio, text-to-speech methods, denoising of speech and music, data augmentation, classification of acoustic events, transcription, source separation, and multimodal problems.

We plan to solicit original extended abstracts (up to 4 pages) in these areas, which will be reviewed by the organizers and an additional set of reviewers. We anticipate approximately 30 accepted submissions. To avoid potential conflicts of interest, no organizer or reviewer will review a submission from their own organization; this is enforced by CMT. We also plan to run a demo session alongside the poster session, where contributors will be able to present live demos of their work.

Our team of organizers was involved with two separate audio-related workshops at ICML 2022: the Workshop on Machine Learning for Audio Synthesis and the ICML Expressive Vocalizations Workshop and Competition. We then combined our organizing committees and offered a workshop at NeurIPS 2023 entitled the Workshop on Machine Learning for Audio. Last year, we added new organizers to the team and hosted a workshop at ICML 2025. This year, we plan to improve upon previous iterations of the workshop with a lineup of prominent in-person invited speakers, more accessible data distribution (as outlined below), and more.

Data Release

Recognizing the scarcity of free, publicly available audio data, Modulate and Hume AI will contribute several datasets in the speech domain alongside the workshop, all of large scale for their respective domains. These datasets, accessible via Google Drive, will include acted speech (professionally acted scripts), spontaneous speech (streamer content), mimicked speech (short-form emotive recordings), and mimicked non-verbal speech. The organizers hope this allows researchers from smaller research groups and academia to work with and validate findings on larger, more generalizable datasets. In previous iterations, multiple submissions utilized versions of provided data in their work, and a corresponding white paper was subsequently posted on arXiv.

Further details on the available data are described here.

Call for Papers

We are calling for extended abstracts up to 4 pages excluding references. Accepted submissions will be posted on the workshop website but not published/archived. Several submissions will be chosen for 15-minute contributed talks and the remaining selected submissions will participate in the poster & demo session. Please make sure submissions adhere to the ICML format. The review process will be double-blind so please make sure not to put any author information in your submission. Authors may also submit supplementary materials along with their papers if they wish (e.g., a preview of a potential demo). Reviewers will not be required to read/view/listen to said supplementary material.

Submission Portal

Timeline

Submission deadline (main paper & all supplementary material): May 25 23:59:59 AOE
Accept/Reject notification date: June 1 AOE

Proposed Schedule

We plan for the workshop to be an 8-hour event. Below is an approximate timetable of the workshop schedule, subject to change. We have been careful to allow ample time for informal discussion during the coffee break, poster & demo session, and open conversation session, as well as time for audience participation during the panel discussion and Q&A sections following invited talks.

Time (KST)	Topic
8:25 AM	Opening remarks — Brian Kulis
8:30 AM	Heiga Zen: From Statistical Models to Foundation Capabilities: The Historical Progression of Speech Generation
9:00 AM	Marius Miron: Animal Language Processing: AI to decode communication beyond humans
9:30 AM	Flow Fake: Parametric Efficient Alternative for Transformers
9:50 AM	MondegreensEval: A Phonetic Benchmark for Measuring Language-model Bias in Automatic Speech Recognition
10:10 AM	PianoKontext: Expressive Performance Rendering from Deadpan Context
10:30 AM	Coffee break
11:00 AM	Juhan Nam: LLM4FM: Empowering LLMs to Generate Yamaha DX7 Patches from Text and Audio
11:30 AM	Minje Kim: Neural Speech and Audio Coding: Efficient Representations, Emerging Capabilities, and Open Challenges
12:00 PM	Lunch
1:00 PM	Poster Session
2:00 PM	Tara Sainath: Audio Processing for Large Language Models
2:30 PM	Prosodic Differences Between Child-Directed and Adult-Directed Speech in Text-to-Speech Generation
2:50 PM	Few-Shot Synthetic Accented Speech for ASR Fine-Tuning: What Helps and When?
3:10 PM	Coffee break
3:30 PM	Autoregressive Zero-Shot Voice Conversion
3:50 PM	Decoding Insect Song: A Multitask Semisupervised Orthoptera Bioacoustic Classifier
4:10 PM	Multilingual Speech Editing
4:30 PM	Closing remarks — Brian Kulis

Invited Speakers

We have curated a list of invited speakers from a wide variety of fields within the audio domain, listed below along with brief biographies. All confirmed invited speakers will be attending in person.

Minje Kim is an Associate Professor in the Siebel School of Computing and Data Science at the University of Illinois Urbana-Champaign and an Amazon Scholar. His research focuses on efficient machine learning for audio, including efficient data representations (e.g., neural audio coding), intelligent signal processing (e.g., speech enhancement and source separation), and generative modeling of audio.

Marius Miron is a Senior AI Research Scientist at the Earth Species Project, where he builds machine-learning and signal-processing methods for bioacoustics to help decode animal communication. Previously, he worked in music AI and audio signal processing (including a PhD on orchestral music source separation) at the Music Technology Group at Pompeu Fabra University.

Tara Sainath is a Distinguished Research Scientist at Google DeepMind and co-lead of the Gemini Audio pillar, known for applying deep learning to advance automatic speech recognition. She earned her S.B., M.Eng., and PhD in EECS from MIT and previously worked at IBM’s T.J. Watson Research Center.

Juhan Nam is a professor at KAIST's Graduate School of Culture Technology and leads the Music and Audio Computing Lab, where he researches music information retrieval and audio/music signal processing. He also serves as an affiliate professor at the Kim Jaechul Graduate School of Artificial Intelligence and the Graduate School of Metaverse. He is a co-founder of Neutune and AudAi.

Heiga Zen is a Principal Scientist at Google DeepMind in Japan, where he researches speech technology and machine learning. He is one of the original authors and first maintainer of the HMM-based speech synthesis system (HTS), and is a Fellow of ISCA and IEEE.

Talks

Speaker	Title	Abstract
Marius Miron	Animal Language Processing: AI to decode communication beyond humans	AI already supports conservation at a remarkable scale. Audio and video monitoring now generate petabytes of data for biodiversity monitoring and ecological research. These systems primarily detect and classify: they can tell us whether a species is present or absent. But what if we could move beyond detecting species into understanding their communication? Animal Language Processing is an emerging interdisciplinary field that combines biology, bioacoustics, and AI to understand the meaning and function of animal communication: bird dialects, spider courtship dances, calls to label conspecifics (the animal equivalent of names), matriarchal clans that pass culture across generations. For machine learning, this is a vast test bed with profound implications for our relationship with the rest of nature. However, it requires defining new evaluation metrics and methods and reaching beyond anthropocentric assumptions to deal with the biggest challenge in the field: lack of ground truth or predefined labels.
Juhan Nam	LLM4FM: Empowering LLMs to Generate Yamaha DX7 Patches from Text and Audio	Programming FM synthesizers is notoriously difficult due to the highly non-linear relationship between parameters and sound. In this talk, I present LLM4FM, a framework that enables Large Language Models (LLMs) to generate Yamaha DX7 patches from text descriptions or audio examples. To support this task, we introduce DX7Caps, the first dataset pairing DX7 patches with natural-language captions, and propose Operator-Isolated Audio Grounding for CoT Distillation. We further extend the framework to sound matching by generating DX7 patches directly from audio. Results from objective evaluations, listening tests, and LLM-based assessments demonstrate the potential of LLMs as practical assistants for FM sound design.
Minje Kim	Neural Speech and Audio Coding: Efficient Representations, Emerging Capabilities, and Open Challenges	This talk will provide an overview of the rapidly evolving landscape of neural speech and audio coding (NSAC). Recent progress has shown that data-driven coding systems, when paired with appropriate learning objectives and model architectures, can achieve substantial gains in coding efficiency over conventional approaches. Beyond improved representational efficiency, NSAC also opens the door to new capabilities that are difficult to realize with traditional codecs, including personalized speech coding, task-specific representations for audio coding for machines (ACoM), and cascaded residual learning frameworks that make neural codecs more flexible and expressive. The talk will also discuss key challenges that are specific to NSAC, including the computational cost of neural encoders and decoders, the risk of hallucination and other generative artifacts, and the difficulty of balancing perceptual quality with faithful signal reconstruction. Finally, I will highlight recent efforts to address these issues through more efficient model architectures, improved training strategies, and semantic loss functions that better align codec behavior with human perception and downstream machine-listening tasks.
Heiga Zen	From Statistical Models to Foundation Capabilities: The Historical Progression of Speech Generation	The field of speech generation has undergone massive transformations, evolving from physical model simulations and concatenative systems to advanced foundational models. This talk will trace the historical progression of generative approaches for speech generation.

Organizers

Alice Baird is a senior AI research scientist at Hume AI, NY, USA, where she works on modeling expressive human behaviors from audio and other modalities. She earned her Ph.D. at the University of Augsburg in 2022. Her work on emotion understanding from auditory, physiological, and multimodal data has been widely published in leading journals and conferences. She has co-organized several machine learning competitions, including the 2022 ICML Expressive Vocalizations Workshop and the 2023 NeurIPS Workshop on Machine Learning for Audio.

Sander Dieleman is a research scientist at DeepMind in London, UK, where he contributed to the development of AlphaGo and WaveNet. His research focuses on generative modeling of perceptual signals at scale, including audio (speech & music) and visual data. He has co-organized multiple workshops, including the NeurIPS workshop on machine learning for creativity and design (2017-2020), the RecSys workshop on deep learning for recommender systems (2016-2018), the Machine Learning for Audio Synthesis workshop at ICML 2022, and the Workshop on Machine Learning for Audio at NeurIPS 2023.

Chris Donahue is an assistant professor at Carnegie Mellon University and a research scientist at Google DeepMind. His research focuses on developing and responsibly deploying generative AI for music and creativity to unlock and augment human creative potential. His work includes improving machine learning methods for controllable generative modeling for music, audio, and sequential data, as well as deploying interactive systems that allow a broad audience—including non-musicians—to harness generative music AI through intuitive controls.

Brian Kulis is an associate professor at Boston University and a former Amazon Scholar who worked on Alexa. His research focuses on machine learning, particularly applications in audio problems such as detection and generation. He has won best paper awards at ICML and CVPR and has organized multiple workshops at ICCV, NeurIPS, and ICML. He has also served as an area or senior area chair at major AI conferences and has organized tutorials at ICML and ECCV.

David Liu is a Ph.D. student in the Department of Computer Science at Boston University. His research focuses on deep learning for audio, with a particular emphasis on state-space models. He earned his bachelor’s degree in computer science, data science, and mathematics from the University of Wisconsin–Madison in 2023.

Rachel Manzelli is the Machine Learning Team Lead at Modulate, where she leads the development of audio generation and classification models supporting moderation teams in detecting harms in voice conversations (ToxMod) and real-time voice conversion (VoiceWear). Previously, she worked at Macro as a machine learning engineer, focusing on source separation models. She has co-organized the Machine Learning for Audio Synthesis workshop at ICML 2022 and the Workshop on Machine Learning for Audio at NeurIPS 2023. She earned her bachelor’s degree in computer engineering from Boston University in 2019, where she conducted research in structured music generation and MIR.

Shrikanth Narayanan is a University Professor and holder of the Niki and Max Nikias Chair in Engineering at the University of Southern California (USC). Shri is a Fellow of the National Academy of Inventors (NAI), the Acoustical Society of America (ASA), the Institute of Electrical and Electronics Engineers (IEEE), the International Speech Communication Association (ISCA), the Association for Psychological Science (APS), the American Association for the Advancement of Science (AAAS), the American Institute for Medical and Biological Engineering (AIMBE), and the Association for the Advancement of Affective Computing (AAAC). Shri is a member of the European Academy of Sciences and Arts and a 2022 Guggenheim Fellow.

Accepted Papers

#	Paper	Authors
1	Flow Fake: Parametric Efficient Alternative for Transformers	Divyansh Sharma, Shivaay Dhondiyal, Dinesh Kumar Vishwakarma
2	MondegreensEval: A Phonetic Benchmark for Measuring Language-model Bias in Automatic Speech Recognition	Wan Ju Kang
3	PianoKontext: Expressive Performance Rendering from Deadpan Context	Dmitrii Gavrilev
4	RCbench: Benchmarking Retrospective Clarification in ASR	Wei-Ting Huang, Chin-Yuan Yeh, De-Nian Yang, Ming-Syan Chen
5	Closing the Loop: PID Feedback Control for Interpretable Activation Steering in Symbolic Music Generation	Ioannis Prokopiou, Pantelis Vikatos, Maximos Kaliakatsos-Papakostas, Theodoros Giannakopoulos, Themos Stafylakis
6	StethoLM: Audio Language Model for Cardiopulmonary Analysis Across Clinical Tasks	Yishan Wang, Tsai-Ning Wang, Mathias Funk, Aaqib Saeed
7	Multi-task Learning is Not Enough: Representational Entanglement in Dual-output Second Language Speech Recognition	Seung Hwan Cho, Young-Min Kim
8	ERM-MinMaxGAP: Benchmarking and Mitigating Gender Bias in Multilingual Multimodal Speech-LLM Emotion Recognition	Zi Haur Pang, Xiaoxue Gao, Tatsuya Kawahara, Nancy Chen
9	RIME: Enabling Large-Scale Agentic Music Post-Production	Noah Schaffer, Nikhil Singh
10	Drift-Augmented Scoring: Text-Derived Noise Robustness for Zero-Shot Audio–Language Classification	Tu Vo, Sheir Zaheer, Chan Youn Park
11	The Perceived Fragility of Explanations in Audio Models: Manipulation of Attribution with Unchanged Predictions	Piotr Kitłowski, Dominik Wiącek, Mateusz Modrzejewski
12	Probing Token Spaces under Generator Shift in AI-Generated Music Detection	Joonyong Park, Jungwoo Kim, Junyoung Koh, Yuki Saito
13	SpeakStream: Streaming TTS with Interleaved Data	He Bai, Tatiana Likhomanenko, Zijin Gu, Navdeep Jaitly
14	AV-JEPA: Extending LeJEPA to Audio-Visual Self-Supervised Learning	Benjamin Robson, Santeri Mentu, Wenshuai Zhao, Arno Solin
15	SpokenUS: A Spoken User Simulator for Task-Oriented Dialogue	Jonggeun Lee, Junseong Pyo, Yohan Jo
16	Focus Then Listen: An Empirical Study of Plug-and-Play Audio Enhancer for Noise-Robust Large Audio Language Models	Han Yin, Yang Xiao, Younghoo Kwon, Ting Dang, Jung-Woo Choi
17	Stacking Complementary CLAP Embeddings for Improving Text-Audio Alignment Correspondence Scoring	Sheng Li, Jiyi Li, Takahiro Shinozaki
18	Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs	Gio Paik, Hyunseo Shin, Soungmin Lee
19	DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues	JoonHyeok Shin, Jaehoon Kang, Yujun Lee, Hanna Lee, Yejin Lee, Yoonji Park, Kyuhong Shim
20	Probing Warmth-Mediated Harm in Speech-Enabled LLMs for Mental-Health Conversations	Eugenia Kim, Bolor-Erdene Jagdagdorj, Dina Pekelis, Leah Zulas, Amanda Minnich
21	How Small Can a Tandem Speech Front-End Be? Diagnosing Front-End Capacity with Layer Removal	Manato Yaguchi, So Kuroki
22	Representation Matters in Randomized Smoothing for Audio Classification	Jong-Ik Park, Shreyas Chaudhari, Jose Moura, Carlee Joe-Wong
23	Prior Dominance in Audio-Visual LLMs: When Generative Models Memorize Over Reasoning Under Cross-modal Conflict	Adarsh Sudheer, David Li, Omar El-Banna, Ishaan Kodarapu, Arjun Bahuguna
24	Beyond Binary Instrument QA: Probing Instrument Grounding in Music Audio-Language Models	Yujun Lee, JoonHyeok Shin, Hyoeun Kim, Kyuhong Shim
25	A Geometric Perspective on Composable Emotion Steering in Text-to-Speech Models	Siyi Wang, James Bailey, Ting Dang
26	Titans-as-a-Layer: Test-Time Memory for Conversational Speech Emotion Recognition	Daniel Chen, Qicong Hu, Yang Xiao, Ting Dang, Hong Jia
27	Evaluating Open-Weight Audio Models for Privacy Verification in Clinical Speech Redaction	Joseph Colonel, Adam Davidson, Guillermo Cecchi, Baihan Lin
28	Probing-Based Test-Time Steering of Music Diffusion Transformers	Junyoung Koh
29	What Matters for Music-Centered Recognition in Audio-Language Models?	Wenye Ma, Ichiro Fujinaga
30	AVENUE: Audio-Video EditiNg Understanding and Evaluation	Hayeon Kim, Yoojin Jang, Jaejun Yoo
31	Learning to Hear Motion Before Naming It	Katerina Vinciguerra
32	Benchmarking Diarization Models	Luca Lanzendörfer, Florian Grötschla, Cesare Blaser, Roger Wattenhofer
33	Cinematic Source Separation with Dialogue-Driven Sidechain Ducking	Atoof Shakir, Florian Grötschla, Luca Lanzendörfer, Roger Wattenhofer
34	Mechanistic Insights into Audio-Language Models for Impaired Speech	Pehuén Moure, Bilal Bounajma, Niclas Pokel, Yingqiang Gao, Roman Boehringer, Longbiao Cheng, Gonçalo Guiomar, Shih-Chii Liu
35	Speaker Separation via Audio Language Modeling	Luca Lanzendörfer, Constantin Pinkl, Florian Grötschla, Roger Wattenhofer
36	Residual Stream Contrast: A Training-Free Counterfactual Listening Test for Whisper Hallucinations	Arnesh Batra
37	Flow Matching-Based Speech Source Separation with Best-of-N Biometric Sampling	Anastasia Zorkina, Alexandr Anikin, Nikita Khmelev, Anastasiya Korenevskaya, Sergey Novoselov, Vladimir Volokhov, Maxim Korenevsky, Yuriy Matveev
38	Blind Audio Restoration using Contrastive Diffusion Guidance	Sattwik Basu, Chaitanya Amballa, Zhongweiyang Xu, Jorge V Sampedro, Srihari Nelakuditi, Romit Roy Choudhury
39	Physically Grounded Video-to-Audio Generation	Hyun-Bin Oh, Yuhta Takida, Toshimitsu Uesaka, Tae-Hyun Oh, Yuki Mitsufuji
40	Evaluating Pretrained Music Embeddings for Cross-Performance Jazz Standard Recognition	Çağrı Eser
41	Alethia: A Foundational Encoder for Voice Deepfakes	Yi Zhu, Brahmi Dwivedi, Jayaram Raghuram, Surya Koppisetti
42	Post-Training Speech Enhancement Language Models with Perceptual Rewards	Frédéric Berdoz, Luca Lanzendörfer, Antonis Asonitis, Roger Wattenhofer
43	Behavioral World Models as Missing Infrastructure for Responsible Generative Audio	Brownsatfford Abraham
44	Multimodal Video-to-Music Recommendation via Semantic Retrieval and Temporal Reranking	Seungheon Doh, Minhee Lee, Sangmoon Lee, Ben Sangbae Chon, Juhan Nam
45	ListenCare: Encounter-Grounded Audio Question Answering for Long-Form Clinical Conversation Speech	Seongsu Bae, Chaeeun Shim, Sungbae Park, Edward Choi
46	Where to Read a Frozen Audio Encoder: Objective-Induced Geometry and Zero-Label Layer Selection	Arnesh Batra, Aniket Khandelwal, Arush Gumber, Krish Thukral
47	Faithful Is Not Interpretable: Sparse Features, Circuits, and Robustness in Frozen Audio Encoders	Arnesh Batra, Aniket Khandelwal, Arush Gumber, Krish Thukral
48	Best-of-N TTS Evaluation is Confounded by ASR Family Alignment	Taehyung Yu, Seongjae Kang
49	Expressive Hindi Audiobook Generation with CLAP-Based Retrieval	William Xing, Kiran Raja, Pranav Anuraag, Arjun Bahuguna, Vasu Sharma
50	Testing Audio Captioning Metrics with Controlled Semantic Perturbations	Assel Yermekova, Vadim Popov, Tasnima Sadekova, Georgii Aparin
51	PCL: Partitioned Continual Learning via Unsupervised Latent Experts for Audio Classification	Gautham Krishna Gudur, Mohit Malu, Tanmay Khandait, Reza Rahimi Azghan, Anirudh Rayas, Pavan Turaga, Joydeep Ghosh, Hassan Ghasemzadeh, Edison Thomaz, Giulia Pedrielli
52	Few-Shot Synthetic Accented Speech for ASR Fine-Tuning: What Helps and When?	Yurii Halychanskyi, Nimet Beyza Bozdag, Mark Hasegawa-Johnson, Dilek Hakkani-Tür, Volodymyr Kindratenko
53	Prosodic Differences Between Child-Directed and Adult-Directed Speech in Text-to-Speech Generation	Jinyoung Jo, Katherine Nguyen, Sean Choi
54	Decoding Insect Song: A Multitask Semisupervised Orthoptera Bioacoustic Classifier	Olga Isupova, Danil Kuzin, Ella Browning, Tom Mills, Steven Reece
55	Autoregressive Zero-Shot Voice Conversion	Luca Lanzendörfer, Frédéric Berdoz, Antonis Asonitis, Roger Wattenhofer
56	Multilingual Speech Editing	Antonis Asonitis, Luca Lanzendörfer, Frédéric Berdoz, Roger Wattenhofer

The Microsoft CMT service was used for managing the peer-reviewing process for this conference. This service was provided for free by Microsoft and they bore all expenses, including costs for Azure cloud services as well as for software development and support.