Vision augmented hearing
Theo Murphy meeting organised by Professor Jennifer Bizley, Professor Michael Akeroyd and Professor Adrian KC Lee.
Acoustic information is not the sole determinant of how the everyday world sounds: our brains rely on vision to rescue hearing in situations when audition is hazy or worse. Perception continuously, seamlessly binds information across senses, but how remains mysterious. We will gather diverse experts to unify the latest research and chart a path towards better virtual and augmented-reality technology.
Programme
The programme, including speaker biographies and abstracts, is available below, but please note that the programme may be subject to change.
Poster session
There will be a poster session from 5pm on Tuesday 3 March 2026. Registered attendees will be invited to submit a proposed poster title and abstract (up to 200 words). Acceptances may be made on a rolling basis, so we recommend submitting as soon as possible in case the session becomes full. Submissions made within one month of the meeting may not be included in the programme booklet.
Attending the event
This event is intended for researchers in relevant fields.
- Free to attend and in-person only
- When requesting an invitation, please briefly state your expertise and reasons for attending
- Requests are reviewed by the meeting organisers on a rolling basis. You will receive a link to register if your request has been successful
- Catering options will be available to purchase upon registering. Participants are responsible for booking their own accommodation. Please do not book accommodation until you have been invited to attend the meeting by the meeting organisers
Please note that scientific meetings hosted by the Royal Society do not necessarily represent a Royal Society position or signify an endorsement of the speakers or content presented.
Enquiries: Contact the Scientific Programmes team.
Organisers
Professor Jennifer Bizley, Professor Michael Akeroyd and Professor Adrian KC Lee
Schedule
Chair
Dr Addison Billing, University of Cambridge, UK
08:30-09:00 Registration
09:00-09:30 Welcome by the lead organisers
Professor Jennifer Bizley, University College London, UK
Jennifer Bizley is Professor of Auditory Neuroscience and a Wellcome Career Development Award holder at the UCL Ear Institute. Her work explores the brain basis of listening and, in particular, how auditory and non-auditory factors influence the processing of sound. Her research combines behavioural methods with techniques to measure and manipulate neural activity, as well as anatomical and computational approaches.
Professor Adrian KC Lee, University of Washington, US
Adrian KC Lee is a Professor in the Department of Speech & Hearing Sciences and at the Institute for Learning and Brain Sciences at the University of Washington, Seattle, USA. He obtained his bachelor's degree in electrical engineering at the University of New South Wales and his doctorate at the Harvard-MIT Division of Health Sciences and Technology. His research focuses on developing multimodal imaging techniques to investigate the cortical network involved in auditory scene analysis and attention, especially through designing novel behavioral paradigms that bridge the gap between psychoacoustics, multisensory and neuroimaging research.
09:30-09:45 Discussion
09:45-10:15 The sounds your ears make when your eyes move: eye movement-related eardrum oscillations (EMREOs) and their role in linking visual and auditory space
Before vision can augment hearing, the visual and auditory aspects that relate to the same underlying event must be linked correctly. For example, the sight of lip movements is only helpful if they come from the person who is actually talking. Linking-by-location is a sensible method for forming such associations correctly. However, sounds and sights are localized differently – visual stimulus location is ascertained based on the retinal location of the image, and sounds are localized based on sound delay and loudness differences across the two ears. In humans and many other species, the eyes can move with respect to the head/ears. The correspondence between retinal image location and sound delay/loudness differences has to be flexibly governed by the position of the eyes with respect to the head. We recently discovered that the brain sends signals regarding eye movements to the ears, causing oscillations of the eardrum and producing self-generated sounds that can be detected via earbud microphones (Gruters et al., PNAS, 2018). These eye movement-related eardrum oscillations (EMREOs) likely constitute the first step of a coordinate transformation of auditory signals into common coordinates with the visual system (Lovich et al., PNAS, 2023).
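To make the geometry concrete, here is a toy sketch of the coordinate transformation described above; the one-dimensional additive form and the ITD conversion factor are illustrative assumptions, not the authors' model.

```python
# Toy 1-D sketch: a visual target's head-centred direction is its retinal
# location shifted by the current eye-in-head position, whereas a sound's
# is read out from binaural cues. The names, the additive geometry and
# the ~9 us/deg ITD conversion are illustrative assumptions only.

def visual_head_centred(retinal_deg: float, eye_in_head_deg: float) -> float:
    return retinal_deg + eye_in_head_deg

def auditory_head_centred(itd_us: float, us_per_deg: float = 9.0) -> float:
    return itd_us / us_per_deg

# A sound 20 deg right of the head (ITD ~180 us) and a retinal image 5 deg
# right of the fovea refer to the same event only if the eyes sit 15 deg
# right, so linking-by-location needs an eye-position signal:
assert visual_head_centred(5.0, 15.0) == auditory_head_centred(180.0)
```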
Professor Jennifer Groh, Duke University, US
Jennifer M Groh is Professor of Psychology & Neuroscience, Neurobiology, Computer Science, and Biomedical Engineering at Duke University, where she is a member of the Center for Cognitive Neuroscience and the Duke Institute for Brain Sciences. Her research concerns how the brain represents spatial information and performs computations on those representations. Her discoveries have shed light on how the brain transforms auditory signals to permit communication with visual signals, despite major differences in the neural “language” used by each sense. She is the recipient of numerous awards including a John Simon Guggenheim fellowship. She has authored many scientific publications as well as a well-regarded book for a general audience (Making Space: How the Brain Knows Where Things Are, Harvard University Press, 2014) and a related popular Coursera course, The Brain and Space.
10:15-10:30 Discussion
10:30-11:00 Break
11:00-11:30 Tracing the effect of visual stimuli on speech encoding along the human auditory pathway
In noisy settings, seeing a talker makes their speech much easier to understand. Several studies have demonstrated cortical effects of audio-visual integration in humans and animal models. Subcortically, some work in animals has shown effects of visual stimuli in auditory areas, but there is very little human work to back that up. In this study, we presented 23 listeners with audio-visual speech under two conditions: coherent, in which the acoustic and visual speech matched, and incoherent, in which the visual speech was replaced with a different recording of the same talker. The target speech was presented alongside two acoustic masker talkers. Listeners were asked to report keywords. We recorded EEG and computed the brainstem temporal response function, from which we derived a waveform for each condition resembling the auditory brainstem response (ABR). Behavioral results confirmed the perceptual benefit of the coherent condition over the incoherent: all subjects showed better performance, with a mean improvement of 10% correct. ABR waveforms to target speech did not differ between the two audio-visual conditions. Responses to masker speech were similarly unaffected by the visual stimulus. It is clear from our behavioral results and countless prior studies that congruent visual speech improves understanding in the presence of background noise. Audio-visual integration of speech signals has been shown in humans in later cortical waves, but was not seen subcortically in our present study. This is consistent with recent work from our lab showing that selective attention impacts cortical but not subcortical EEG responses in human listeners.
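For readers unfamiliar with the method, a temporal response function of this kind is commonly estimated by regularised deconvolution of the EEG against a stimulus regressor. The sketch below shows the generic technique on synthetic data; the names, sizes and toy kernel are assumptions, and the study's actual pipeline is not reproduced here.

```python
# Sketch of TRF estimation by ridge-regularised deconvolution on
# synthetic data (generic technique, not the study's pipeline).
import numpy as np

def estimate_trf(stimulus: np.ndarray, response: np.ndarray,
                 n_lags: int, ridge: float = 1e-2) -> np.ndarray:
    """Solve response ~ lagged_stimulus @ trf with ridge regularisation."""
    n = len(response)
    X = np.zeros((n, n_lags))
    for k in range(n_lags):          # column k = stimulus delayed k samples
        X[k:, k] = stimulus[:n - k]
    XtX = X.T @ X + ridge * np.eye(n_lags)
    return np.linalg.solve(XtX, X.T @ response)

rng = np.random.default_rng(0)
stim = rng.standard_normal(5000)
true_trf = np.exp(-np.arange(40) / 8.0) * np.sin(np.arange(40) / 3.0)
eeg = np.convolve(stim, true_trf)[:5000] + 0.5 * rng.standard_normal(5000)
trf_hat = estimate_trf(stim, eeg, n_lags=40)   # recovers true_trf closely
```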
Dr Ross Maddox, University of Michigan, US
Ross Maddox earned his PhD and MS in Biomedical Engineering from Boston University, and his BS in Sound Engineering from the University of Michigan. Following his PhD, he completed a postdoctoral appointment at the University of Washington Institute for Learning & Brain Sciences (I-LABS). He was on the faculty of the Departments of Biomedical Engineering and Neuroscience at the University of Rochester before moving to the Kresge Hearing Research Institute at the University of Michigan in 2024.
11:30-11:45 Discussion
11:45-12:15 Visual signals in ferret auditory cortex
In this talk, I will summarise prior work demonstrating that visual signals are present throughout ferret auditory cortex (AC), and that the temporal coherence of visual input with sound amplitude can shape the representation of sound mixtures in AC. I will then discuss recent work that aims to identify the source(s) of visual inputs to auditory cortex, through a combination of cortical cooling and viral tract tracing. We find that silencing a sub-region of visual cortex (the posterior bank of the suprasylvian sulcus, referred to as PSS, and adjacent area 21) impacted around half of visually responsive units in AC, most commonly resulting in decreased firing rates. These data are consistent with the idea that some visual information in AC enters via an excitatory input from PSS. However, preserved visual responses in many other units suggest the involvement of additional pathways, and a small number of units in which responses emerged or increased during cooling suggest an additional indirect role for PSS. Anatomical tracing has revealed that both lemniscal and non-lemniscal thalamic regions send projections to secondary AC in the ferret (in addition to the known connections with primary regions), which we suggest are likely additional pathways via which visual information enters AC. Subsequently, we have probed the visual features that most effectively drive responses in AC, and revealed differences in sensitivity to visual features between primary and non-primary AC. Our current interest is in the interaction of selective attention with audio-visual integration, for which we have developed a ferret version of the auditory selective attention task presented in Maddox et al., 2015. We find an impact of audio-visual temporal coherence on performance similar to that previously reported in humans, although the effect appears to change over time as animals repeatedly perform the task over months, suggesting possible impacts of expertise and/or overtraining. Future work involves the analysis of neural data collected while animals perform this task.
Dr Rebecca Norris, University College London, UK
Rebecca Norris is a postdoctoral research associate in the lab of Professor Jennifer Bizley. Rebecca completed her PhD at the University of Melbourne in Australia, studying the impact of post-synaptic gene mutations on dissociable aspects of cognition in rodent models. Following completion of her doctoral work, Rebecca moved to London to work on auditory cortical processing and multisensory integration in the Bizley lab, where she has developed expertise in chronic electrophysiology, behavioural analysis and anatomical tracing in the ferret model.
12:15-12:30 Discussion
Chair
Dr Axelle Calcus, Université libre de Bruxelles, Belgium
Axelle Calcus is an Associate Professor at Université libre de Bruxelles (ULB), working in auditory cognitive neuroscience with a strong focus on development, especially adolescence. After completing her PhD (2015, ULB), she spent time in the US (Boston University), the UK (University College London) and France (École normale supérieure), working in auditory neurophysiology in adults, children and adolescents. Currently, her research aims to understand the mechanisms underlying the protracted maturation of complex auditory processing (including auditory scene analysis and speech perception in noise) in typical and atypical children/adolescents, using a combination of psychoacoustic and neuroimaging methods.
13:30-14:00 The function of top-down processes in segmenting and selecting objects in the visual scene
Accurate segmentation of the visual scene allows us to select and manipulate objects in our environment. Top-down connections in sensory systems are thought to modulate activity in primary sensory areas to enhance object-related activity while suppressing background activity. Here I will discuss recent work in mice, monkeys and humans showing how connectivity between cells in higher visual areas tuned for border-ownership and cells in V1 leads to precise scene segmentation. I will show how interaction with local circuitry in V1 allows top-down connections to drive activity, even in the absence of bottom-up input from the retina. Finally, I will discuss how segmentation processes evolve over time, from an early phase where local contextual effects determine activity to a later phase where the global scene organisation is represented in primary visual cortex.
Dr Matthew Self, University of Glasgow, UK
Dr Matthew Self is a Senior Lecturer at the School of Psychology and Neuroscience at the University of Glasgow. His work follows two broad research themes. In the visual system he studies how feedforward and feedback circuits are used to segment the visual scene into objects and backgrounds, the neural circuits that mediate contextual and predictive effects, and the role of top-down processes in visual perception. He also collaborates with neurosurgeons, in the Netherlands and internationally, to record single-cell activity in the human hippocampus during cognitive behaviours. He studies how hippocampal activity can be controlled and used to learn spatiotemporal information.
14:00-14:15 Discussion
14:15-14:45 Natural audiovisual speech encoding in the early stages of the human cortical hierarchy
Seeing a speaker’s face in a noisy environment can greatly improve one’s ability to understand what they are saying, a process that is attributed to the multisensory integration of audio and visual speech. In this talk, I will present a model of such multisensory integration that is based on the notion that visual speech can influence auditory speech processing at multiple stages of processing – including an early stage based on the correlated dynamics of visual and auditory speech and later stages where the form of visual articulators helps with linguistic categorization. This model relies on the hypothesis that visual cortex represents both low-level visual features and higher-level linguistic cues and that these representations can differentially and flexibly influence the processing of audio speech. I will present evidence for this model across a series of studies that involved modeling EEG responses obtained from adult participants while they were presented with naturalistic audio-visual speech stimuli.
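As a concrete, hedged illustration of the "correlated dynamics" idea, the sketch below generates an acoustic envelope and a lip-aperture trace from a shared slow drive and measures their correlation; the synthetic signals are assumptions, not the speaker's stimuli or model.

```python
# Synthetic illustration of correlated audio-visual speech dynamics:
# a shared slow "articulatory" drive yields an acoustic envelope and a
# lip-aperture trace whose correlation a listener (or model) could
# exploit for early binding. All signals are assumptions, not real speech.
import numpy as np

rng = np.random.default_rng(1)
drive = np.convolve(rng.standard_normal(2000),
                    np.ones(50) / np.sqrt(50), mode="same")   # slow shared dynamics
envelope = drive + 0.3 * rng.standard_normal(2000)            # acoustic envelope
lips = np.roll(drive, -10) + 0.3 * rng.standard_normal(2000)  # mouth leads slightly

r = np.corrcoef(envelope, lips)[0, 1]
print(f"audio-visual correlation: r = {r:.2f}")               # well above zero
```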
Professor Aaron Nidiffer, University of Rochester, US
Aaron Nidiffer is a Research Assistant Professor at the University of Rochester, where he was previously a postdoctoral researcher with Ed Lalor. He came to Rochester by way of a Bachelor's degree from King College (now King University) and a PhD from Vanderbilt University under the supervision of Mark Wallace. His scientific interests are rooted in auditory and audiovisual binding and object perception and have expanded to include visual and audiovisual speech perception and, more recently, perception of sign language.
14:45-15:00 Discussion
15:00-15:30 Break
15:30-16:00 See what you hear: Making sense of the senses
Adaptive behaviour in a complex, dynamic, and multisensory world raises some of the most fundamental questions for neural processing, notably perceptual inference, decision making, learning, binding, attention and probabilistic computations. In this talk, I will present our recent behavioural, computational and neural research that investigates how the brain tackles these challenges. First, I will focus on how the brain solves the causal inference or binding problem, deciding whether signals come from common causes and should hence be integrated or else be processed independently. Combining psychophysics, Bayesian modelling and neuroimaging (fMRI, EEG) we show that the brain arbitrates between sensory integration and segregation consistent with the principles of Bayesian Causal Inference by dynamically encoding multiple perceptual estimates across the cortical hierarchy. Next, I will explore how prior expectations and attentional mechanisms can modulate sensory integration. Finally, I will present research on how the brain solves the causal inference problem in more complex environments with multiple signals and sources.
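The Bayesian Causal Inference framework referred to here has a standard published form (Körding et al., 2007); the minimal sketch below implements its model-averaging readout for a single audio-visual spatial trial. Parameter values are arbitrary assumptions, and the speaker's own implementations may differ.

```python
# Minimal Bayesian Causal Inference for one audio-visual spatial trial,
# following the standard Koerding et al. (2007) formulation with a
# zero-mean Gaussian spatial prior; parameter values are assumptions.
import numpy as np

def bci_auditory_estimate(x_a, x_v, sigma_a=4.0, sigma_v=1.0,
                          sigma_p=15.0, p_common=0.5):
    va, vv, vp = sigma_a**2, sigma_v**2, sigma_p**2
    # Likelihood of the measurement pair under one cause vs two causes.
    d1 = va * vv + va * vp + vv * vp
    like_c1 = np.exp(-0.5 * ((x_a - x_v)**2 * vp + x_a**2 * vv
                             + x_v**2 * va) / d1) / (2 * np.pi * np.sqrt(d1))
    like_c2 = np.exp(-0.5 * (x_a**2 / (va + vp) + x_v**2 / (vv + vp))) \
        / (2 * np.pi * np.sqrt((va + vp) * (vv + vp)))
    post_c1 = like_c1 * p_common / (like_c1 * p_common
                                    + like_c2 * (1 - p_common))
    # Optimal estimates under each causal structure, then model averaging.
    s_fused = (x_a / va + x_v / vv) / (1 / va + 1 / vv + 1 / vp)
    s_alone = (x_a / va) / (1 / va + 1 / vp)
    return post_c1 * s_fused + (1 - post_c1) * s_alone

print(bci_auditory_estimate(10.0, 12.0))  # nearby cues: pulled toward vision
print(bci_auditory_estimate(10.0, 40.0))  # discrepant cues: mostly segregated
```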
Professor Uta Noppeney, Donders Institute, Radboud University, Netherlands
Uta Noppeney’s research investigates the neural mechanisms of perceptual inference, learning, decision making, attention and probabilistic computations through a multisensory lens combining psychophysics, computational modelling (Bayesian, neural network) and neuroimaging (fMRI, M/EEG) in humans. She is a Professor at the Neurophysics department and a Principal Investigator at the Donders Institute for Brain, Cognition and Behaviour, Radboud University (Netherlands). Previously, she was the director of the Computational Neuroscience and Cognitive Robotics Centre at the University of Birmingham (UK) and an independent research group leader at the Max Planck Institute for Biological Cybernetics, Tübingen (Germany). She is the recipient of a Young Investigator Award of the Cognitive Neuroscience Society (2013), a Turing Fellowship (2018) and two ERC grants (2013, 2023). She is a member of the Academia Europaea and an academic editor of PLOS Biology and Multisensory Research.
16:00-16:15 Discussion
16:15-16:45 Specialized – but flexible – prefrontal networks for auditory and visual cognition
In daily life, we encounter rich, complex sensory input, with multiple visual, auditory, and multisensory sources present. As our lived experience makes clear, these stimuli are not processed in isolation, but are combined into multisensory percepts that are shaped by attention and other task goals. To better understand the neural mechanisms of attention in multisensory settings, we first used fMRI to map visual-biased and auditory-biased networks that extend into prefrontal cortex. Within these networks, we observe cross-sensory activation that is dependent on task demands: when the brain needs to maintain information about spatial locations, it leverages the visual system’s spatiotopic maps, regardless of the originating sensory modality. Representational analyses of cognitive control, including task goals and attentional set, confirm that this cross-modal activation encodes information about the features a listener is targeting. These neural results are supported by behavioural findings confirming that auditory attention and working memory tasks with high spatial demands load both the visual and auditory cognitive networks. This body of work characterizes prefrontal cortex networks with specialized computational abilities linked to specific sensory modalities, and demonstrates that they can be flexibly recruited depending on a listener’s goals.
Professor Abigail Noyce, Carnegie Mellon University, US
Abigail Noyce is an Assistant Research Professor at Carnegie Mellon University, where she co-leads the Lab in Multisensory Neuroscience. She studies the mechanisms by which sensory and perceptual processing constrain attention, memory, and other cognitive capabilities. Her non-research projects include cultivating communities, creating textiles, and raising two children.
16:45-17:00 Discussion
17:00-18:30 Drinks reception and poster session
18:30 Close
Chair
Dr Pip Coen, University College London, UK
Dr Coen received his degree in Natural Sciences from the University of Cambridge in 2009, and his PhD in Neuroscience from Princeton University in 2015, working with Professor Mala Murthy. After a postdoctoral fellowship with Professors Matteo Carandini and Kenneth Harris at UCL, he established his laboratory in the Cell and Developmental Biology department at UCL in 2023. His lab uses chronic electrophysiology, optogenetics, and behavioural paradigms in mice to investigate the neural mechanisms underlying audio-visual integration and sensory-guided decisions.
09:00-09:30 Multisensory Speech Perception: Models and Mechanisms
The most natural form of human interaction is face-to-face, integrating auditory information from the voice of the talker with visual information from the face of the talker. A dramatic illustration of the influence of visual information on speech perception is provided by the McGurk effect. In this illusion, pairing an auditory "ba" with an incongruent visual "ga" sometimes results in the percept of a different syllable. We show that an artificial neural network known as AVHuBERT also perceives the McGurk effect. Both humans and AVHuBERT report a mixture of percepts to McGurk stimuli, including the auditory component of the stimulus, the fusion percept of "da" reported in the original description of the illusion, and other syllables, including "fa" and "ah". The similar responses of humans and AVHuBERT to McGurk stimuli suggest that artificial neural networks may provide a useful model for human audiovisual speech perception. The neural basis of audiovisual speech perception was examined using stereoelectroencephalographic (sEEG) recordings from neurosurgical patients. These recordings demonstrate an anatomical boundary between the superior temporal gyrus, which responds similarly to auditory-only and audiovisual speech, and the superior temporal sulcus, which shows larger and faster responses to audiovisual compared with auditory-only speech. Consistent with behavioural studies, audiovisual enhancement was more pronounced in the presence of auditory noise.
Dr Michael Beauchamp, University of Pennsylvania, US
Michael Beauchamp grew up in Waterloo, Ontario, Canada before attending Harvard University, where he received an undergraduate degree in Biology in 1992. He received his PhD from the Neuroscience Graduate Program at the University of California, San Diego in 1997. As a postdoctoral fellow in the NIH Intramural Research Program in Bethesda, he worked with Jim Haxby (1997-2000) and Alex Martin (2000-2005). Michael's first faculty position was at the University of Texas at Houston Medical School, followed by a transition to Baylor College of Medicine in 2015. In 2021, Michael moved to Philadelphia to become a Professor and the Vice-Chair for Research of the Neurosurgery Department in the Perelman School of Medicine at the University of Pennsylvania.
09:30-09:45 Discussion
09:45-10:15 Audiovisual scene dynamics and their influence on loudness perception
Understanding speech in noisy, multisensory environments is a fundamental challenge for both listeners and researchers. While most studies focus on speech intelligibility, less is known about how the perceptual construct of loudness is influenced by the structure of the audiovisual (AV) scene. This work examines how systematically manipulating the temporal synchrony between auditory and visual signals, as well as the linguistic content of the speech, modulates perceived loudness. Our results show that increased AV asynchrony leads to a significant drop in perceived loudness, with effects emerging beyond natural synchrony ranges. Surprisingly, linguistic complexity also alters loudness ratings, even when physical intensity is held constant. By transforming subjective ratings into an objective measure of perceived target-to-masker ratio (TMR), we demonstrate that extreme AV asynchrony results in a 2–3 dB drop in perceived TMR, independent of linguistic content. These findings highlight the value of loudness perception as a sensitive index of AV scene analysis. This approach offers new avenues for studying multisensory processing in both typical and neurodiverse populations, and for developing accessible protocols that decouple scene analysis from linguistic ability.
Dr Liesbeth Gijbels, Meta Reality Labs Research, US
Liesbeth Gijbels is a research scientist with a rich academic background in Speech and Hearing Sciences (PhD, Master's, Bachelor's), Psychology, and Education. The focus of her work is to bridge clinical practice and academic research in Speech and Hearing Sciences. Liesbeth has 10+ years of hands-on clinical experience supporting individuals with communication, learning, and hearing challenges. After moving to the US in 2018, she completed her PhD at the University of Washington, focusing on audiovisual speech perception and the cognitive processes underlying human communication. Building on her clinical and academic foundations, Liesbeth’s current work as a research scientist at Meta Reality Labs centers on developing AI-driven tools and technologies to enhance hearing and communication.
10:15-10:30 Discussion
10:30-11:00 Break
11:00-11:30 A model incorporating the influence of gaze on apparent sound source direction
The mistakes that listeners make in determining the direction of a sound source are not solely the consequence of sampling error from a distribution centered around the source’s true position. Burgeoning evidence suggests that the distributions themselves have characteristic shifts or biases that depend on adaptive processes of different time courses, the spatial statistics and recent history of the auditory scene itself, the relative angle of the head and of the torso, and on eye gaze angle, to name a few contributing factors. These effects indicate that perceived acoustic space is not fixed but undergoes continual bias remapping in response to behavioural and sensory context. It is the intent of this talk to review the history of what we know about such systematic changes in spatial acoustic perception and in particular the interaction with gaze. We will then present a neurophysiologically inspired model of how head angle and eye gaze influence sound localization, as well as related phenomena such as sound source segregation and spatial release from masking. We conclude by comparing model predictions with key results from the literature, demonstrating that the model captures a common structure underpinning the diverse ways in which gaze alters spatial auditory perception.
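To make the modelling target concrete, a deliberately minimal sketch of a gaze-dependent localisation bias follows; the linear form, the gain value, and even the sign of the effect are placeholders rather than the model presented in the talk.

```python
# Deliberately minimal gaze-bias sketch: perceived azimuth = true azimuth
# plus a gain times gaze eccentricity. Reported gaze effects differ in
# size and direction across studies, so both gain and sign are
# placeholders here, not the presented model.

def perceived_azimuth(source_deg: float, gaze_deg: float,
                      gaze_gain: float = 0.1) -> float:
    return source_deg + gaze_gain * gaze_deg

# Gaze held 20 deg right shifts a straight-ahead source's reported
# position by 2 deg under these assumed parameters.
print(perceived_azimuth(0.0, 20.0))
```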
Dr Owen Brimijoin, Reality Labs Research (Meta), US
Dr Brimijoin earned his bachelor's degree in Neuroscience and Behavior at Wesleyan University before completing his PhD in Brain and Cognitive Sciences at the University of Rochester. His doctoral research focused on nonlinearity in spectrotemporal auditory receptive fields, exploring how neurons develop preferences for the way sounds change in spectral content over time. During his postdoc, he used multi-unit physiology in mouse models to examine how aging and hearing loss affect this nonlinearity, contributing to the difficulty many people experience understanding speech in noisy environments. He then transitioned to human psychophysics, spending nearly a decade as a scientist at the MRC Institute of Hearing Research in Glasgow. There, he studied speech intelligibility, listener behavior, sound localization, and auditory motion perception, with a particular focus on how both normal-hearing and hearing-impaired listeners use their own movement – intentional and unintentional – to understand speech and build a spatial understanding of the acoustic world. Now at Reality Labs Research, he continues to work at the intersection of speech perception, listening intent, and spatial hearing with the goal of building augmented reality devices with the potential to significantly improve hearing and understanding in challenging environments.
11:30-11:45 Discussion
11:45-12:15 Seeing with your ears: the logic of crossmodal connections in the mouse brain
The organisation of neuronal networks in the brain is highly structured; however, the principles underlying how structure relates to function are only beginning to be understood. What are the rules governing communication between sensory areas involved in processing different streams of information? I will present how the structure of connections in cortical and subcortical areas supports cross-modal spatiotemporal representations. I will also share our developments in biologically inspired AI models for the analysis of large-scale neuronal recordings, and how these models capture cortical cross-modal plasticity in the representation of natural scenes across days. Ultimately, by combining experimental and theoretical approaches, we aim to establish how circuit motifs support multisensory integration.
Dr Florencia Iacaruso, The Francis Crick Institute, UK
Florencia Iacaruso is a group leader at the Francis Crick Institute. At the interface between systems neuroscience, computational modelling, and sensory physiology, her lab investigates how the brain integrates information from different sensory modalities to guide behavior.
12:15-12:30 Discussion
Chair
Dr Cesare Parise, University of Liverpool, UK
Dr Cesare Parise is a Senior Lecturer in the Department of Psychology at the University of Liverpool. He received his DPhil from the University of Oxford and held postdoctoral positions at the Max Planck Institute for Biological Cybernetics and the Center of Excellence for Interaction Technologies at the University of Bielefeld. Prior to joining Liverpool, he worked for four years as a Perception Scientist at Meta Reality Labs. His research aims to develop stimulus-computable models that account for psychophysical, behavioural, and neurophysiological responses to multisensory stimulation, with particular emphasis on vision, audition, and touch. This programme combines fundamental, theory-driven work with applied perception science, with current interests spanning clinical applications (e.g., sensory substitution) and emerging interactive technologies (e.g., virtual and augmented reality).
13:30-14:00 Learnings from the audio-visual speech enhancement challenge: from stimuli design to evaluation
Speech enhancement technologies have rapidly developed in the last decade. Multi-modal speech perception has inspired the next generation of speech enhancement technologies to explore multi-modal approaches that leverage visual information to overcome scenarios that can be challenging for audio-only speech enhancement models (e.g., overlapping speakers). The Audio-Visual Speech Enhancement Challenge (AVSEC) provided the first benchmark to assess algorithms that use lip-reading information to augment their speech enhancement capabilities. Throughout its four editions, we provided carefully designed datasets that enabled us to explore AV-SE performance in a range of listening conditions different from those commonly used in laboratory settings. In this talk, I will present our proposed approach to scalable stimuli design from in-the-wild data and introduce the AVSEC protocol for human listening evaluation of AV-SE systems. I will present an overview of the listening test results throughout different editions of the challenge and discuss them in relation to characteristics of the designed stimuli.
Dr Lorena Aldana, University of Edinburgh, UK
Lorena Aldana is a Research Associate at the University of Edinburgh. She has a background in sound engineering and computer science. She was a DAAD scholar at the University of Bielefeld and finished her PhD in 2021. Her research interests lie at the intersection of multi-modal speech and hearing technologies, audio signal processing and machine learning. Lorena has been a technical lead of the four successful editions of the International Audio-Visual Speech Enhancement Challenge (AVSEC). Her current research focuses on advancing evaluation methods for speech and hearing technologies, addressing ecological validity and integrating individual differences in hearing beyond current standard clinical methods.
14:00-14:15 Discussion
14:15-14:45 Vision in spatial hearing assessment and training
Spatial hearing is most commonly assessed using sound localisation tasks that are widely regarded as a gold standard for evaluating auditory spatial function. These paradigms typically rely on highly controlled experimental conditions, in which listeners are visually deprived, required to keep their head stationary, and exposed to brief noise bursts presented at different locations within anechoic environments. While such methods provide precise and reproducible measurements, they represent listening situations that are largely detached from everyday auditory experience. In natural environments, sound sources are multiple and spectrally complex, listeners actively move their head and body, and spatial judgments are frequently resolved through dynamic interactions between proprioception, audition and vision. This mismatch raises fundamental questions about the ecological validity of traditional spatial hearing assessments and about the role of vision in both assessing and training spatial auditory functions. Even alternative approaches, such as speech-based measures of spatial release from masking, implicitly depend on visual information, which supports spatial expectation, source identification, and audio-visual speech integration. Consequently, vision is not merely an auxiliary modality but an integral component of spatial hearing in real-world contexts. I will present recent work exploring assessment paradigms that explicitly incorporate vision while preserving the functional contribution of auditory spatial cues. A first study, focusing on a visual search task in which spatialised sound guides attention toward a visually defined target, will be introduced. Although the task can ultimately be completed using vision alone, auditory information significantly affects reaction times, movement trajectories, and search strategies, providing sensitive performance metrics beyond localisation accuracy. For example, this paradigm has revealed subtle but systematic differences between individualised and non-individualised head-related transfer functions. In a second study, a game-based localisation task employing audio-visual anchors and auditory distractors shows that visible and head-tracked sound sources can facilitate faster learning than non-tracked and visually absent ones. Finally, I will discuss the implications of these findings for spatial hearing training. Across laboratory and clinical applications, including training for bilateral cochlear implant users, our results highlight the importance of multimodal feedback for perceptual remapping. Coupling auditory and visual information supports adaptation to altered spatial cues, while progressively reducing visual guidance enables the transfer of learning to audition alone. Together, these findings argue for a systematic integration of vision into spatial hearing assessment and rehabilitation, moving beyond purely auditory paradigms toward more ecologically valid and functionally informative approaches.
Professor Lorenzo Picinali, Imperial College London, UK
I am a Professor in Spatial Acoustics and Immersive Audio, and I lead the Audio Experience Design team (www.axdesign.co.uk). Our research explores both the perceptual and computational aspects of acoustics, alongside their practical applications. We aim to understand how humans perceive spatial sound features, such as source locations and reverberation, and use this knowledge to design algorithms and tools for creating virtual acoustic simulations. Our goal is to seamlessly blend real and virtual acoustic environments, ultimately enhancing hearing. I also work on eco-acoustic monitoring, designing autonomous recorders, and using audio to study the impact of human activity on remote ecosystems.
14:45-15:00 Discussion
15:00-15:30 Break
15:30-16:00 What’s in a model? Quantifying audiovisual integration of speech
Speech comprehension is often boosted by watching a talker’s face and has been reported to vary between individuals and across the materials used for testing. However, the benefit is dependent on exact stimulus conditions and highly non-linear, making it challenging to compare across individuals and groups, experimental manipulations and treatments. I will describe a model which is simple enough to use for quantitative analysis of data and appears to capture the relationship between speech perception performance in audio-only, visual-only and audio-visual situations. The Multi-Stage Noise model considers speech perception as a process limited first by modality-specific low-level processing followed by later supra-modal limitations. In tests so far (~400 people), audio-visual speech perception is well predicted from individual differences in unisensory processing across a wide range of people, implying that the integration process itself is relatively stable. Here, I will use the model as a source of simulated data, which presumably captures the non-linear properties of real data, and compare various other methods of quantifying the relationships in performance across modalities. This reveals that many conventional ways of quantifying integration and the “benefit of seeing the talker” result in high false-positive rates, which could be misinterpreted as changes in audio-visual integration. The Multi-Stage Noise model, which can be used within a Bayesian statistical framework, may offer a more robust “null” model to quantify audio-visual integration.
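The point about false positives can be illustrated with a standard signal-detection toy (not the Multi-Stage Noise model itself, whose details are the speaker's): hold the integration rule fixed, vary only unisensory sensitivity, and the raw audio-visual benefit still changes.

```python
# Signal-detection toy (not the Multi-Stage Noise model): with a fixed
# optimal combination rule, the percent-correct "AV benefit" still varies
# with unisensory d', so raw difference scores can masquerade as
# integration differences.
import numpy as np
from scipy.stats import norm

def pc_2afc(d):                     # 2AFC percent correct from d'
    return norm.cdf(d / np.sqrt(2))

def d_av(d_a, d_v):                 # fixed optimal (independent-noise) rule
    return np.sqrt(d_a**2 + d_v**2)

for d_a in (0.5, 1.0, 2.0):         # same rule, different auditory d'
    benefit = pc_2afc(d_av(d_a, 1.0)) - pc_2afc(d_a)
    print(f"d_a = {d_a}: AV - A benefit = {benefit:.3f}")
# Prints shrinking benefits (~0.15, ~0.08, ~0.02) purely because the
# d'-to-accuracy mapping is non-linear.
```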
Professor Chris Sumner, Nottingham Trent University, UK
Chris’ research focuses on understanding the neural computation underlying how we hear. He does this with a variety of methods, but often by building computer models (simulations) of neurons, neural systems, and behaviour, with the aim of relating processing by single neurons to our perception of sound. He believes that this understanding is critical in order to tackle the problems associated with hearing loss. Recent research interests include: how sensory processing influences the coding and recognition of complex acoustic signals such as (but not limited to) speech and how this is affected by hearing loss; the mechanisms underlying resolution of sound frequency in the auditory system; audio-visual integration of speech.
16:00-16:15 Discussion
16:15-17:00 Panel discussion/overview
17:00 Close