Vision augmented hearing

03-04 March 2026, 09:00-17:00, Holiday Inn Manchester - City Centre. Free to attend.

Theo Murphy meeting organised by Professor Jennifer Bizley, Professor Michael Akeroyd and Professor Adrian KC Lee.

Acoustic information is not the sole determinant of how the everyday world sounds: our brains rely on vision to rescue hearing when audition is hazy or worse. Perception continuously and seamlessly binds information across the senses, but how it does so remains mysterious. We will gather diverse experts to unify the latest research and chart a path towards better virtual- and augmented-reality technology.

Programme

The programme, including speaker biographies and abstracts, is available below, but please note that it may be subject to change.

Poster session

There will be a poster session from 5pm on Tuesday 3 March 2026. Registered attendees will be invited to submit a proposed poster title and abstract (up to 200 words). Acceptances may be made on a rolling basis so we recommend submitting as soon as possible in case the session becomes full. Submissions made within one month of the meeting may not be included in the programme booklet.

Attending the event

This event is intended for researchers in relevant fields.

  • Free to attend and in-person only
  • When requesting an invitation, please briefly state your expertise and reasons for attending
  • Requests are reviewed by the meeting organisers on a rolling basis. You will receive a link to register if your request has been successful
  • Catering options will be available to purchase upon registering. Participants are responsible for booking their own accommodation. Please do not book accommodation until you have been invited to attend the meeting by the meeting organisers

Please note that scientific meetings hosted by the Royal Society do not necessarily represent a Royal Society position or signify an endorsement of the speakers or content presented.

Enquiries: Contact the Scientific Programmes team.

Organisers

  • Professor Jennifer Bizley

    Jennifer Bizley is Professor of Auditory Neuroscience and a Wellcome Career Development Award holder at the UCL Ear Institute. Her work explores the brain basis of listening and, in particular, how auditory and non-auditory factors influence the processing of sound. Her research combines behavioural methods with techniques to measure and manipulate neural activity as well as anatomical and computational approaches.

  • Professor Michael Akeroyd

    Michael A Akeroyd is Professor of Hearing Sciences at the University of Nottingham. He received a PhD degree from the University of Cambridge in 1995 and then was a researcher at successively the MRC Applied Psychology Unit, MRC Institute of Hearing Research (MRC IHR), University of Connecticut Health Centre, the University of Sussex, and the Scottish Section of MRC IHR, before becoming the final director of MRC IHR from 2015 to its closure. His current research interests are auditory impairment and disability, hearing aids, quality of life, spatial hearing, and psychophysics. He is a Fellow of the Acoustical Society of America and of the International Collegium of Rehabilitative Audiology.

  • Professor Adrian KC Lee

    Adrian KC Lee is a Professor in the Department of Speech & Hearing Sciences and at the Institute for Learning and Brain Sciences at the University of Washington, Seattle, USA. He obtained his bachelor’s degree in electrical engineering at the University of New South Wales and his doctorate at the Harvard-MIT Division of Health Sciences and Technology. His research focuses on developing multimodal imaging techniques to investigate the cortical network involved in auditory scene analysis and attention, especially through designing novel behavioral paradigms that bridge the gap between psychoacoustics, multisensory and neuroimaging research.

Schedule

Chair

Dr Addison Billing

University of Cambridge, UK

08:30-09:00 Registration
09:00-09:30 Welcome by the lead organisers
Professor Jennifer Bizley

University College London, UK

Professor Adrian KC Lee

University of Washington, US

09:30-09:45 Discussion
09:45-10:15 The sounds your ears make when your eyes move: eye movement-related eardrum oscillations (EMREOs) and their role in linking visual and auditory space

Before vision can augment hearing, the visual and auditory aspects that relate to the same underlying event must be linked correctly. For example, the sight of lip movements is only helpful if they come from the person who is actually talking. Linking-by-location is a sensible method for forming such associations correctly. However, sounds and sights are localized differently – visual stimulus location is ascertained based on the retinal location of the image, and sounds are localized based on sound delay and loudness differences across the two ears. In humans and many other species, the eyes can move with respect to the head/ears. The correspondence between retinal image location and sound delay/loudness differences has to be flexibly governed by the position of the eyes with respect to the head.

We recently discovered that the brain sends signals regarding eye movements to the ears, causing oscillations of the eardrum and producing self-generated sounds that can be detected via earbud microphones (Gruters et al., PNAS 2018). These eye movement-related eardrum oscillations (EMREOs) likely constitute the first step of a coordinate transformation of auditory signals into common coordinates with the visual system (Lovich et al., PNAS 2023).
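
As a rough illustration of how such eardrum signals can be measured, the sketch below averages an ear-canal microphone recording around saccade onsets so that any eye movement-locked oscillation emerges from the background noise. It is a generic Python illustration, not the authors' pipeline; the window lengths, sampling convention and variable names are assumptions.

    import numpy as np

    def emreo_average(mic_signal, saccade_onsets, fs, pre_s=0.01, post_s=0.06):
        """Average ear-canal microphone epochs around saccade onsets.

        mic_signal     : 1-D array of microphone samples
        saccade_onsets : saccade onset times in seconds (e.g. from an eye tracker)
        fs             : sampling rate in Hz
        Returns the mean epoch; an EMREO-like oscillation, if present, should
        emerge as uncorrelated ear-canal noise averages out.
        """
        pre, post = int(pre_s * fs), int(post_s * fs)
        epochs = []
        for t in saccade_onsets:
            i = int(round(t * fs))
            if i - pre < 0 or i + post > len(mic_signal):
                continue  # skip saccades too close to the recording edges
            seg = mic_signal[i - pre:i + post]
            epochs.append(seg - seg[:pre].mean())  # baseline-correct on the pre-saccade window
        return np.mean(epochs, axis=0)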

Professor Jennifer Groh

Duke University, US

10:15-10:30 Discussion
10:30-11:00 Break
11:00-11:30 Tracing the effect of visual stimuli on speech encoding along the human auditory pathway

In noisy settings, seeing a talker makes their speech much easier to understand. Several studies have demonstrated cortical effects of audio-visual integration in humans and animal models. Subcortically, some work in animals has shown effects of visual stimuli in auditory areas, but corresponding evidence in humans is scarce.

In this study, we presented 23 listeners with audio-visual speech under two conditions: coherent, in which the acoustic and visual speech matched, and incoherent, in which the visual speech was replaced with a different recording of the same talker. The target speech was presented alongside two acoustic masker talkers. Listeners were asked to report keywords. We recorded EEG and computed the brainstem temporal response function, from which we derived a waveform for each condition resembling the auditory brainstem response (ABR).

Behavioral results confirmed the perceptual benefit of the coherent condition over the incoherent: all subjects showed better performance, with a mean improvement of 10% correct. ABR waveforms to target speech did not differ between the two audio-visual conditions. Responses to masker speech were similarly unaffected by the visual stimulus.

It is clear from our behavioral results and countless prior studies that congruent visual speech improves understanding in the presence of background noise. Audio-visual integration of speech signals has been shown in humans in later cortical waves, but was not seen subcortically in our present study. This is consistent with recent work from our lab showing that selective attention impacts cortical but not subcortical EEG responses in human listeners.
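
For readers unfamiliar with the method, the sketch below illustrates the general temporal response function approach referred to above: the EEG is regressed on time-lagged copies of a stimulus regressor with ridge regularisation, yielding a response waveform. The regressor choice, lag range and regularisation shown here are illustrative assumptions, not the study's actual analysis.

    import numpy as np

    def temporal_response_function(stimulus, eeg, fs, t_min=-0.01, t_max=0.3, lam=1e2):
        """Estimate a TRF by ridge regression of EEG on time-lagged stimulus samples.

        stimulus : 1-D regressor (e.g. a rectified speech waveform or its envelope)
        eeg      : 1-D EEG recording, same length and sampling rate as `stimulus`
        fs       : sampling rate in Hz
        Returns (lags_in_seconds, trf_weights); the weights form a response
        waveform analogous to an evoked response such as the ABR.
        """
        lags = np.arange(int(t_min * fs), int(t_max * fs))
        n = len(stimulus)
        # Build the lagged design matrix: column j holds the stimulus delayed by lags[j] samples.
        X = np.zeros((n, len(lags)))
        for j, lag in enumerate(lags):
            if lag >= 0:
                X[lag:, j] = stimulus[:n - lag]
            else:
                X[:lag, j] = stimulus[-lag:]
        # Ridge solution: w = (X'X + lam*I)^-1 X'y
        w = np.linalg.solve(X.T @ X + lam * np.eye(len(lags)), X.T @ eeg)
        return lags / fs, w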

Dr Ross Maddox

University of Michigan, US

11:30-11:45 Discussion
11:45-12:15 Visual signals in ferret auditory cortex

In this talk, I will summarise prior work demonstrating that visual signals are present throughout ferret auditory cortex (AC), and that the temporal coherence of visual input with sound amplitude can shape the representation of sound mixtures in AC. I will then discuss recent work that aims to identify the source(s) of visual inputs to auditory cortex, through a combination of cortical cooling and viral tract tracing. We find that silencing a sub-region of visual cortex (the posterior bank of the suprasylvian sulcus, referred to as PSS, and adjacent area 21) impacted around half of visually responsive units in AC, most commonly resulting in decreased firing rates. These data are consistent with the idea that some visual information in AC enters via an excitatory input from PSS. However, preserved visual responses in many other units suggest the involvement of additional pathways, and a small number of units in which responses emerged or increased during cooling suggest an additional indirect role for PSS.

Anatomical tracing has revealed that both lemniscal and non-lemniscal thalamic regions send projections to secondary AC in the ferret (in addition to the known connections with primary regions), which we suggest are likely additional pathways via which visual information enters AC.

Subsequently, we have probed the visual features that most effectively drive responses in AC, and revealed differences in sensitivity to visual features between primary and non-primary AC.

Our current interest is in the interaction of selective attention with audio-visual integration, for which we have developed a ferret version of the auditory selective attention task presented in Maddox et al., 2015. We find a similar impact of audio-visual temporal coherence on performance as previously reported in humans, although the effect appears to change over time as animals repeatedly perform the task over months, suggesting possible impacts of expertise and/or overtraining. Future work involves the analysis of neural data collected while animals perform this task.
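
As a rough sketch of what audio-visual temporal coherence means in stimuli of this kind, the code below generates amplitude-modulated noise together with a visual luminance trace that either follows the sound's own envelope or an independently drawn one. It illustrates the general principle only; the parameters are assumptions, not those of the ferret task.

    import numpy as np

    def coherent_av_stimuli(duration_s=2.0, fs=16000, frame_rate=60, coherent=True, seed=0):
        """Amplitude-modulated noise plus a visual luminance trace.

        When `coherent` is True the luminance follows the sound's own envelope;
        otherwise it follows an independent envelope with the same statistics,
        removing the shared temporal structure between the two streams.
        """
        rng = np.random.default_rng(seed)
        n = int(duration_s * fs)

        def slow_envelope():
            # Moving-average-smoothed noise as a slowly varying envelope, scaled to 0..1.
            win = int(fs / 7)  # roughly limits fluctuations to below ~7 Hz
            e = np.convolve(rng.standard_normal(n), np.ones(win) / win, mode='same')
            return (e - e.min()) / (e.max() - e.min())

        audio_env = slow_envelope()
        audio = rng.standard_normal(n) * audio_env              # amplitude-modulated noise
        vis_env = audio_env if coherent else slow_envelope()    # shared or independent envelope
        luminance = vis_env[::int(fs / frame_rate)]             # sample at the display frame rate
        return audio, luminance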

Dr Rebecca Norris

University College London, UK

12:15-12:30 Discussion

Chair

Dr Axelle Calcus

Université libre de Bruxelles, Belgium

13:30-14:00 The function of top-down processes in segmenting and selecting objects in the visual scene

Accurate segmentation of the visual scene allows us to select and manipulate objects in our environment. Top-down connections in sensory systems are thought to modulate activity in primary sensory areas to enhance object-related activity while suppressing background activity. Here I will discuss recent work in mice, monkeys and humans showing how connectivity between cells in higher visual areas tuned for border-ownership and cells in V1 leads to precise scene segmentation. I will show how interaction with local circuitry in V1 allows top-down connections to drive activity, even in the absence of bottom-up input from the retina. Finally, I will discuss how segmentation processes evolve over time, from an early phase where local contextual effects determine activity to a later phase where the global scene organisation is represented in primary visual cortex.

Dr Matthew Self

University of Glasgow, UK

14:00-14:15 Discussion
14:15-14:45 Natural audiovisual speech encoding in the early stages of the human cortical hierarchy

Seeing a speaker’s face in a noisy environment can greatly improve one’s ability to understand what they are saying, a process that is attributed to the multisensory integration of audio and visual speech. In this talk, I will present a model of such multisensory integration that is based on the notion that visual speech can influence auditory speech processing at multiple stages of processing – including an early stage based on the correlated dynamics of visual and auditory speech and later stages where the form of visual articulators helps with linguistic categorization. This model relies on the hypothesis that visual cortex represents both low-level visual features and higher-level linguistic cues and that these representations can differentially and flexibly influence the processing of audio speech. I will present evidence for this model across a series of studies that involved modeling EEG responses obtained from adult participants while they were presented with naturalistic audio-visual speech stimuli.

Professor Aaron Nidiffer

University of Rochester, US

14:45-15:00 Discussion
15:00-15:30 Break
15:30-16:00 See what you hear: Making sense of the senses

Adaptive behaviour in a complex, dynamic, and multisensory world raises some of the most fundamental questions for neural processing, notably perceptual inference, decision making, learning, binding, attention and probabilistic computations. In this talk, I will present our recent behavioural, computational and neural research that investigates how the brain tackles these challenges. First, I will focus on how the brain solves the causal inference or binding problem, deciding whether signals come from common causes and should hence be integrated or else be processed independently. Combining psychophysics, Bayesian modelling and neuroimaging (fMRI, EEG) we show that the brain arbitrates between sensory integration and segregation consistent with the principles of Bayesian Causal Inference by dynamically encoding multiple perceptual estimates across the cortical hierarchy. Next, I will explore how prior expectations and attentional mechanisms can modulate sensory integration. Finally, I will show research into how the brain solves the causal inference problem in more complex environments with multiple signals and sources.
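
For context, the sketch below implements the standard model-averaging form of Bayesian causal inference for a single audio-visual localisation trial: the posterior probability of a common cause weights the fused and unisensory estimates. It illustrates the general principle only and is not necessarily the speaker's exact model; the parameter names and the zero-centred spatial prior are assumptions.

    import numpy as np

    def bci_auditory_estimate(x_a, x_v, sigma_a, sigma_v, sigma_p, p_common):
        """Model-averaging Bayesian causal inference for one audio-visual trial.

        x_a, x_v         : noisy auditory and visual location measurements
        sigma_a, sigma_v : sensory noise standard deviations
        sigma_p          : width of a zero-centred spatial prior
        p_common         : prior probability that both signals share one cause
        Returns (posterior probability of a common cause, auditory location estimate).
        """
        va, vv, vp = sigma_a ** 2, sigma_v ** 2, sigma_p ** 2
        # Likelihood of the pair of measurements under a single shared source ...
        var1 = va * vv + va * vp + vv * vp
        q1 = ((x_a - x_v) ** 2 * vp + x_a ** 2 * vv + x_v ** 2 * va) / var1
        like_c1 = np.exp(-0.5 * q1) / (2 * np.pi * np.sqrt(var1))
        # ... and under two independent sources.
        q2 = x_a ** 2 / (va + vp) + x_v ** 2 / (vv + vp)
        like_c2 = np.exp(-0.5 * q2) / (2 * np.pi * np.sqrt((va + vp) * (vv + vp)))
        post_c1 = like_c1 * p_common / (like_c1 * p_common + like_c2 * (1 - p_common))
        # Precision-weighted location estimates under each causal structure.
        s_c1 = (x_a / va + x_v / vv) / (1 / va + 1 / vv + 1 / vp)   # integrate
        s_c2 = (x_a / va) / (1 / va + 1 / vp)                       # segregate
        # Model averaging: weight the two estimates by their posterior probabilities.
        return post_c1, post_c1 * s_c1 + (1 - post_c1) * s_c2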

Professor Uta Noppeney

Donders Institute, Radboud University, Netherlands

16:00-16:15 Discussion
16:15-16:45 Specialized - but flexible - prefrontal networks for auditory and visual cognition

In daily life, we encounter rich, complex sensory input, with multiple visual, auditory, and multisensory sources present. As our lived experience makes clear, these stimuli are not processed in isolation, but are combined into multisensory percepts that are shaped by attention and other task goals. To better understand the neural mechanisms of attention in multisensory settings, we first used fMRI to map visual-biased and auditory-biased networks that extend into prefrontal cortex. Within these networks, we observe cross-sensory activation that is dependent on task demands: when the brain needs to maintain information about spatial locations, it leverages the visual system’s spatiotopic maps, regardless of the originating sensory modality. Representational analyses of cognitive control, including task goals and attention set, confirm that this cross-modal activation encodes information about the features a listener is targeting. These neural results are supported by behavioural findings confirming that auditory attention and working memory tasks with high spatial demands load both the visual and auditory cognitive networks. This body of work characterizes prefrontal cortex networks with specialized computational abilities linked to specific sensory modalities, and demonstrates that they can be flexibly recruited depending on a listener’s goals.

Professor Abigail Noyce

Carnegie Mellon University, US

16:45-17:00 Discussion
17:00-18:30 Drinks reception and poster session
18:30 Close

Chair

Dr Pip Coen

University College London, UK

09:00-09:30 Multisensory Speech Perception: Models and Mechanisms

The most natural form of human interaction is face-to-face, integrating auditory information from the voice of the talker with visual information from the face of the talker. A dramatic illustration of the influence of visual information on speech perception is provided by the McGurk effect. In this illusion, pairing an auditory "ba" with an incongruent visual "ga" sometimes results in the percept of a different syllable. We show that an artificial neural network known as AVHuBERT also perceives the McGurk effect. Both humans and AVHuBERT report a mixture of percepts to McGurk stimuli, including the auditory component of the stimulus, the fusion percept of "da" reported in the original description of the illusion, and other syllables, including "fa" and "ah". The similar responses of humans and AVHuBERT to McGurk stimuli suggest that artificial neural networks may provide a useful model for human audiovisual speech perception. The neural basis of audiovisual speech perception was examined using stereoelectroencephalographic (SEEG) recordings from neurosurgical patients. These recordings demonstrate an anatomical boundary between the superior temporal gyrus, which responds similarly to auditory-only and audiovisual speech, and the superior temporal sulcus, which shows larger and faster responses to audiovisual compared with auditory-only speech. Consistent with behavioural studies, audiovisual enhancement was more pronounced in the presence of auditory noise.

Dr Michael Beauchamp

University of Pennsylvania, US

09:30-09:45 Discussion
09:45-10:15 Audiovisual scene dynamics and their influence on loudness perception

Understanding speech in noisy, multisensory environments is a fundamental challenge for both listeners and researchers. While most studies focus on speech intelligibility, less is known about how the perceptual construct of loudness is influenced by the structure of the audiovisual (AV) scene. This work examines how systematically manipulating the temporal synchrony between auditory and visual signals, as well as the linguistic content of the speech, modulates perceived loudness. Our results show that increased AV asynchrony leads to a significant drop in perceived loudness, with effects emerging beyond natural synchrony ranges. Surprisingly, linguistic complexity also alters loudness ratings, even when physical intensity is held constant. By transforming subjective ratings into an objective measure of perceived target-to-masker ratio (TMR), we demonstrate that extreme AV asynchrony results in a 2–3 dB drop in perceived TMR, independent of linguistic content.

These findings highlight the value of loudness perception as a sensitive index of AV scene analysis. This approach offers new avenues for studying multisensory processing in both typical and neurodiverse populations, and for developing accessible protocols that decouple scene analysis from linguistic ability.
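
As a point of reference for the dB figures above: target-to-masker ratio is simply the level of the target relative to the masker, so a 3 dB drop corresponds to the target being heard as if its amplitude relative to the masker were reduced by a factor of roughly 0.71 (10^(-3/20)). A minimal, illustrative computation (assuming equal-length signals and RMS levels; not the study's procedure) is sketched below.

    import numpy as np

    def tmr_db(target, masker):
        """Target-to-masker ratio in dB, from the RMS levels of two signals."""
        rms = lambda x: np.sqrt(np.mean(np.asarray(x, dtype=float) ** 2))
        return 20 * np.log10(rms(target) / rms(masker))

    # Example: attenuating the target by 10 ** (-3 / 20) ~= 0.71 lowers the TMR by 3 dB,
    # the size of the perceived drop reported for extreme audio-visual asynchrony.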

Dr Liesbeth Gijbels

Meta Reality Labs Research, US

10:15-10:30 Discussion
10:30-11:00 Break
11:00-11:30 A model incorporating the influence of gaze on apparent sound source direction

The mistakes that listeners make in determining the direction of a sound source are not solely the consequence of sampling error from a distribution centered around the source’s true position. Burgeoning evidence suggests that the distributions themselves have characteristic shifts or biases that depend on adaptive processes of different time courses, the spatial statistics and recent history of the auditory scene itself, the relative angle of the head and of the torso, and on eye gaze angle, to name a few contributing factors. These effects indicate that perceived acoustic space is not fixed but undergoes continual bias remapping in response to behavioural and sensory context. It is the intent of this talk to review the history of what we know about such systematic changes in spatial acoustic perception and in particular the interaction with gaze. We will then present a neurophysiologically inspired model of how head angle and eye gaze influence sound localization, as well as related phenomena such as sound source segregation and spatial release from masking. We conclude by comparing model predictions with key results from the literature, demonstrating that the model captures a common structure underpinning the diverse ways in which gaze alters spatial auditory perception.

Dr Owen Brimijoin

Reality Labs Research (Meta), US

11:30-11:45 Discussion
11:45-12:15 Seeing with your ears: the logic of crossmodal connections in the mouse brain

The organisation of neuronal networks in the brain is highly structured; however, the principles underlying how structure relates to function are only beginning to be understood. What are the rules governing communication between sensory areas involved in processing different streams of information? I will present how the structure of connections in cortical and subcortical areas supports cross-modal spatiotemporal representations. I will also share our developments in biologically inspired AI models for the analysis of large-scale neuronal recordings, and how these models capture cortical cross-modal plasticity in the representation of natural scenes across days. Ultimately, by combining experimental and theoretical approaches, we aim to establish how circuit motifs support multisensory integration.

Dr Florencia Iacaruso

The Francis Crick Institute, UK

12:15-12:30 Discussion

Chair

Dr Cesare Parise

University of Liverpool, UK

13:30-14:00 Learnings from the audio-visual speech enhancement challenge: from stimuli design to evaluation

Speech enhancement technologies have developed rapidly in the last decade. Multi-modal speech perception has inspired the next generation of speech enhancement technologies to explore multi-modal approaches that leverage visual information to handle scenarios that are challenging for audio-only speech enhancement models (e.g., overlapping speakers).

The Audio-Visual Speech Enhancement Challenge (AVSEC) provided the first benchmark to assess algorithms that use lip-reading information to augment their speech enhancement capabilities. Throughout its four editions, we provided carefully designed datasets that enabled us to explore AV-SE performance in a range of listening conditions different from those commonly used in laboratory settings. In this talk, I will present our proposed approach to scalable stimuli design from in-the-wild data and introduce the AVSEC protocol for human listening evaluation of AV-SE systems. I will present an overview of the listening test results throughout different editions of the challenge and discuss them in relation to characteristics of the designed stimuli.

Dr Lorena Aldana

University of Edinburgh, UK

14:00-14:15 Discussion
14:15-14:45 Vision in spatial hearing assessment and training

Spatial hearing is most commonly assessed using sound localisation tasks that are widely regarded as a gold standard for evaluating auditory spatial function. These paradigms typically rely on highly controlled experimental conditions, in which listeners are visually deprived, required to keep their head stationary, and exposed to brief noise bursts presented in different locations within anechoic environments. While such methods provide precise and reproducible measurements, they represent listening situations that are largely detached from everyday auditory experience. In natural environments, sound sources are multiple and rather complex from a spectral envelope point of view, listeners actively move their head and body, and spatial judgments are frequently resolved through dynamic interactions between proprioception, audition and vision.

This mismatch raises fundamental questions about the ecological validity of traditional spatial hearing assessments and about the role of vision in both assessing and training spatial auditory functions. Even alternative approaches, such as speech-based measures of spatial release from masking, implicitly depend on visual information, which supports spatial expectation, source identification, and audio-visual speech integration. Consequently, vision is not merely an auxiliary modality but an integral component of spatial hearing in real-world contexts.

I will present recent work exploring assessment paradigms that explicitly incorporate vision while preserving the functional contribution of auditory spatial cues. The first study focuses on a visual search task in which spatialised sound guides attention toward a visually defined target. Although the task can ultimately be completed using vision alone, auditory information significantly affects reaction times, movement trajectories, and search strategies, providing sensitive performance metrics beyond localisation accuracy. For example, this paradigm has revealed subtle but systematic differences between individualised and non-individualised head-related transfer functions. In the second study, a game-based localisation task employing audio-visual anchors and auditory distractors shows that visible and head-tracked sound sources can facilitate faster learning than non-tracked and visually absent ones.

Finally, I will discuss the implications of these findings for spatial hearing training. Across laboratory and clinical applications, including training for bilateral cochlear implant users, our results highlight the importance of multimodal feedback for perceptual remapping. Coupling auditory and visual information supports adaptation to altered spatial cues, while progressively reducing visual guidance enables the transfer of learning to audition alone. Together, these findings argue for a systematic integration of vision into spatial hearing assessment and rehabilitation, moving beyond purely auditory paradigms toward more ecologically valid and functionally informative approaches.

Professor Lorenzo Picinali

Imperial College London, UK

14:45-15:00 Discussion
15:00-15:30 Break
15:30-16:00 What’s in a model? Quantifying audiovisual integration of speech

Speech comprehension is often boosted by watching a talker’s face and has been reported to vary between individuals and across the materials used for testing. However, the benefit is dependent on exact stimulus conditions and highly non-linear, making it challenging to compare across individuals and groups, experimental manipulations and treatments. I will describe a model which is simple enough to use for quantitative analysis of data and appears to capture the relationship between speech perception performance in audio-only, visual-only and audio-visual situations. The Multi-Stage Noise model considers speech perception as a process limited first by modality-specific low-level processing followed by later supra-modal limitations. In tests so far (~400 people), audio-visual speech perception is well predicted from individual differences in unisensory processing across a wide range of people, implying that the integration process itself is relatively stable. Here, I will use the model as a source of simulated data, which presumably captures the non-linear properties of real data, and compare various other methods of quantifying the relationships in performance across modalities. This reveals that many conventional ways of quantifying integration and the “benefit of seeing the talker” result in high false-positive rates – which could be interpreted as changes in audio-visual integration. The Multi-Stage Noise model, which can be used within a Bayesian statistical framework, may offer a more robust “null” model for quantifying audio-visual integration.

Professor Chris Sumner

Nottingham Trent University, UK

16:00-16:15 Discussion
16:15-17:00 Panel discussion/overview
17:00 Close