
Overview

Scientific discussion meeting organised by Professor Michael Morgan FRS, Dr Paul Linton, Professor Jenny Read, Dr Dhanraj Vishwanath, Professor Sarah Creem-Regehr and Professor Fulvio Domini.

Leading approaches to computer vision (SLAM: simultaneous localisation and mapping), animal navigation (cognitive maps), and human vision (optimal cue integration) start from the assumption that the aim of 3D vision is to produce a metric reconstruction of the environment. Recent advances in machine learning, single-cell recording in animals, virtual reality, and visuomotor control all challenge this assumption. The purpose of this meeting was to bring these different disciplines together to formulate an alternative approach to 3D vision.

The schedule of talks, together with speaker biographies and abstracts, is available below. An accompanying journal issue has been published in Philosophical Transactions of the Royal Society B.

Attending the event

This meeting has taken place. Watch the recordings on our YouTube channel.

Enquiries: please contact the Scientific Programmes team

Organisers

Schedule


Chair

15:00-15:05
Introduction
15:05-15:30
Neural priors, neural encoders and neural renderers

Abstract

Scene representation – the process of converting visual sensory data into useful descriptions – is a requirement for intelligent behaviour. Scene representation can be achieved with three components: a prior (which scenes are likely?), an encoder (which scenes correspond to this image?), and a renderer (which images correspond to this scene?). This talk will describe how neural priors, encoders and renderers can be trained without any human-provided labels, and show how this unlocks new capabilities, for example in protein structure understanding.
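As a rough illustration of the three-component factoring described above (not the speaker's implementation), the sketch below uses a toy linear 'renderer' and a least-squares 'encoder'; all names, dimensions and the Gaussian prior are hypothetical choices made purely for illustration.

```python
# Minimal sketch of the prior / encoder / renderer factoring of scene representation.
# The toy linear "graphics" model and all dimensions are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
SCENE_DIM, IMAGE_DIM = 4, 8
W = rng.normal(size=(IMAGE_DIM, SCENE_DIM))   # toy rendering matrix

def prior_sample():
    """Prior: which scenes are likely? Here, a standard Gaussian over scene codes."""
    return rng.normal(size=SCENE_DIM)

def render(scene):
    """Renderer: which images correspond to this scene? A toy linear projection."""
    return W @ scene

def encode(image):
    """Encoder: which scenes correspond to this image? Least-squares inversion."""
    scene, *_ = np.linalg.lstsq(W, image, rcond=None)
    return scene

# Label-free consistency check: only images are needed, no human annotation.
image = render(prior_sample())
print("reconstruction error:", np.linalg.norm(image - render(encode(image))))
```

Because the encoder and renderer only need to be mutually consistent with observed images, such a system can in principle be trained without human-provided labels, which is the point emphasised in the abstract.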

Speakers

15:30-16:00
Multi-scale predictive representations and human-like RL

Abstract

Many artificial agents pass benchmarks for solving specific tasks, but what would make their behaviour seem more human-like to humans? In previous research, Dr Momennejad used reinforcement learning (RL) to study how humans learn multiscale predictive representations for memory, planning, and navigation. Based on this work, she will first present behavioural, fMRI, and electrophysiology evidence that hippocampal and prefrontal hierarchies learn multi-scale predictive representations, and update them via offline replay. She will then present recent work in which the group assesses the human-likeness of algorithms that all solve a navigation task. To this end, the group designed a Turing test to assess and compare the human-likeness of agents navigating an Xbox game (“Bleeding Edge”). Together, representation and replay models enhance our understanding of how the brain's algorithms underlie behaviour in health and pathology. In turn, designing AI algorithms inspired by models that are neurally and behaviourally plausible can advance the state of the art of human-like RL.
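One common formalisation of such multi-scale predictive representations is the successor representation evaluated at several predictive horizons (discount factors). The sketch below is a generic illustration of that idea, not the speaker's code; the four-state ring environment and the chosen discount factors are made up for the example.

```python
# Successor representations at two predictive horizons for a toy 4-state ring world.
import numpy as np

# Random-walk transition matrix on a ring of 4 states (step left or right, p = 0.5).
T = np.array([[0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.5, 0.0]])

def successor_representation(T, gamma):
    """M = (I - gamma*T)^-1: expected discounted future occupancy of every state."""
    return np.linalg.inv(np.eye(T.shape[0]) - gamma * T)

# A short-horizon (fine) scale and a long-horizon (coarse) scale.
for gamma in (0.5, 0.95):
    print(f"gamma = {gamma}:")
    print(successor_representation(T, gamma).round(2))
```

Larger discount factors yield coarser, more far-sighted predictive maps, which is one way hierarchies of brain regions could hold predictions at multiple scales.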

 

Speakers

16:00-16:20
Discussion

Chair

16:40-17:10
Generalization in data-driven control

Abstract

Current machine learning methods are primarily deployed for tackling prediction problems, which are almost always cast as supervised learning tasks. Despite decades of advances in reinforcement learning and learning-based control, applying these methods to domains that require open-world generalization – autonomous driving, robotics, aerospace, and other applications – remains challenging. Realistic environments require effective generalization, and effective generalization requires training on large and diverse datasets that are representative of the likely test-time scenarios. Dr Levine will discuss why this poses a particular challenge for learning-based control, and present some recent research directions that aim to address this challenge. He will discuss how offline reinforcement learning algorithms can make it possible for learning-based control systems to utilize large and diverse real-world datasets, how the use of diverse data can enable robotic systems to navigate real-world environments, and how multi-task and contextual policies can enable broad generalization to a range of user-specified goals.

Speakers

17:10-17:40
Understanding 3D vision as a policy network

Abstract

A 'policy network' is a term used in reinforcement learning to describe the set of actions that are generated in different states, where the ‘state’ reflects both the current sensory stimulus and the goal. This is not the typical foundation for describing 3D vision, which in computer vision is based on reconstruction (eg Simultaneous Localisation And Mapping), while in neuroscience the predominant hypothesis has assumed that there are 3D transformations between retino-centric, ego-centric and world-based reference frames. Theoretical and experimental evidence in support of this neural hypothesis is lacking. Professor Glennerster will describe instead an approach that avoids 3D coordinate frames. A policy network for saccades (pure rotations of the camera/eye around the optic array) is a starting point for understanding (i) an ego-centric representation of visual direction, distance, slant and depth relief (what Marr hoped to achieve with his 2½-D sketch) and (ii) a hierarchical, compositional representation for navigation. We have known for a long time how the brain can implement a policy network, so if we could describe 3D vision in terms of a policy network (where the actions are either saccades or head translations), we would have moved closer to a neurally plausible model of 3D vision.
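As a concrete (and purely illustrative) rendering of the term, a policy in this sense is just a function from a state – here, retinocentric image features concatenated with a goal code – to a distribution over actions such as candidate saccades. The network sizes, random weights and action set below are hypothetical.

```python
# Toy policy network: state (retinal features + goal) -> distribution over saccades.
import numpy as np

rng = np.random.default_rng(1)
N_FEATURES, N_GOALS, N_ACTIONS = 16, 4, 9   # e.g. 9 candidate saccade directions

W1 = rng.normal(scale=0.1, size=(32, N_FEATURES + N_GOALS))   # untrained weights
W2 = rng.normal(scale=0.1, size=(N_ACTIONS, 32))

def policy(retinal_features, goal):
    """pi(action | state), with state = retinocentric features plus a goal code."""
    state = np.concatenate([retinal_features, goal])
    hidden = np.tanh(W1 @ state)
    logits = W2 @ hidden
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

# Choose a saccade for the current retinal input and a one-hot goal ("target 2").
action_probs = policy(rng.normal(size=N_FEATURES), np.eye(N_GOALS)[2])
print("chosen saccade index:", int(np.argmax(action_probs)))
```

Nothing in this mapping requires an intermediate 3D coordinate frame, which is the property the talk exploits.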

Speakers

17:40-18:00
Discussion

Chair

15:00-15:30
Stupid stereoscopic algorithms that still work

Abstract

Stereopsis has traditionally been considered a complex visual ability, restricted to large-brained animals. The discovery in the 1980s that insects, too, have stereopsis therefore challenged theories of stereopsis. How can such simple brains see in 3D? One answer is simply that insect stereopsis is much lower-resolution, and probably does not produce even a coarse depth map across the visual field. Rather, it may aim to produce simple behaviour, such as orienting towards the closer of two objects or triggering a strike when prey comes within range. Scientific thinking about stereopsis has been unduly anthropomorphic, for example assuming that stereopsis must require binocular fusion or a solution of the stereo correspondence problem. In fact, useful behaviour can be produced with very basic stereoscopic algorithms which make no attempt to achieve fusion or correspondence. This may explain why some aspects of insect stereopsis seem poorly designed from an engineering point of view: for example, paying no attention to whether interocular contrast or velocities match. Such 'stupid' algorithms demonstrably work well enough in practice for their species, and may prove useful in particular autonomous applications.
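To make the flavour of such an algorithm concrete, here is a deliberately 'stupid' toy example, not a model presented in the talk: each eye independently reports the position of its single strongest dark target, and a strike is triggered when the positional offset between the eyes (the disparity) exceeds a threshold. No fusion or full correspondence is ever attempted; all thresholds and image sizes are arbitrary.

```python
# Toy correspondence-free stereo trigger: strike when the target is near enough.
import numpy as np

def target_position(eye_image):
    """Position of the strongest (darkest) pixel in one eye's 1D image."""
    return int(np.argmin(eye_image))

def strike_decision(left_image, right_image, disparity_threshold=5):
    """A large left/right positional offset means the target is close: strike."""
    disparity = target_position(left_image) - target_position(right_image)
    return abs(disparity) > disparity_threshold

# A dark target (0.0) on a bright background (1.0), seen at different positions.
left = np.ones(64);  left[40] = 0.0
right = np.ones(64); right[30] = 0.0
print("strike:", strike_decision(left, right))   # disparity = 10, so strike
```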

Speakers

15:30-16:00
Visual processing in the brain during navigation

Abstract

Much of our everyday visual experience is based on our movements through the world, when we navigate between different places – from within a room, to between cities. Is visual function the same during navigation? Dr Saleem will present work in which, using a virtual reality environment and presenting identical visual stimuli in different locations, the group asked whether spatial position modulates activity in the visual system. Activity in the primary visual cortex (V1) was found to be strongly modulated by spatial position, and this modulation persisted across higher visual areas in the cortex. This modulation was not present in inputs to visual cortex from the lateral geniculate nucleus. Furthermore, the spatial modulation of visual responses was stronger when animals actively navigated, rather than passively viewed, the environment. These results suggest that the spatial modulation of visual information arises in V1 with active navigation. The Saleem Lab has also been investigating feedback inputs to V1, and visual responses to optic flow stimuli. They have also developed an open-source software paradigm, BonVision, that can present both 2D and 3D stimuli in a common framework, while maintaining the precision and replicability of standardised visual experiments.

Speakers

16:00-16:20
Discussion

Chair

16:40-17:10
The cognitive map of 3D space: not as metric as we thought?

Abstract

The mammalian representation of navigable space (space that an animal moves itself through) is supported by a network of brain regions, centred on the hippocampus, that transform raw sensory signals into an internal map-like representation that can be used in navigation. It has long been thought that this map is metric, because its central units, the place cells, respond parametrically to metric changes in the environment such as stretching. This view was consolidated by the discovery of grid cells, which have evenly spaced firing fields that reveal metric computations of speed, direction and distance. However, how these neurons behave in complex 3D space suggests that the map is not absolutely metric but is rather only loosely so, being tailored to the environment structure and/or shaped by its movement affordances. This accords with studies showing that humans seem to use a less metric and more topological internal map when performing spatial judgements. The emerging picture is one of a hierarchical processing system with highly metric processing of near space but progressively more topological maps at larger scales. This may be a way of saving processing resources, and could reflect a more general organisational principle of complex cognition.

Speakers

17:10-17:40
Locally ordered representation of 3D space in the entorhinal cortex

Abstract

As animals navigate on a two-dimensional surface, neurons in the medial entorhinal cortex (MEC) known as grid cells are activated when the animal passes through multiple locations (firing fields) arranged in a hexagonal lattice tiling the locomotion surface. However, although our world is three-dimensional (3D), it is unclear how the MEC represents 3D space. The group recorded from MEC cells in freely flying bats and identified several classes of spatial neurons, including 3D border cells, 3D head-direction cells, and neurons with multiple 3D firing fields. Many of these multifield neurons were 3D grid cells, whose neighbouring fields were separated by a characteristic distance – forming a local order – but lacked a global lattice arrangement of the fields. Thus, whereas 2D grid cells form a global lattice – characterized by both local and global order – 3D grid cells exhibited only local order, creating a locally ordered metric for space. The group modelled grid cells as emerging from pairwise interactions between fields, which yielded a hexagonal lattice in 2D and local order in 3D, describing both 2D and 3D grid cells using one unifying model. Together, these data and model illuminate fundamental differences and similarities between neural codes for 3D and 2D space in the mammalian brain.
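A rough, hedged sketch of the pairwise-interaction idea is given below: firing fields are treated as points whose pairwise interaction pushes neighbours towards a preferred separation, and the configuration is relaxed by gradient steps. The specific force law, parameters and two-dimensional setting are illustrative choices, not the published model.

```python
# Toy pairwise-interaction model: fields settle at a characteristic separation.
import numpy as np

rng = np.random.default_rng(2)
n_fields, preferred, steps, lr = 20, 1.0, 2000, 0.01
pos = rng.uniform(0, 4, size=(n_fields, 2))      # use size=(n_fields, 3) for 3D

def pairwise_forces(pos):
    """Each pair pushes or pulls its members towards the preferred separation."""
    diff = pos[:, None, :] - pos[None, :, :]
    dist = np.linalg.norm(diff, axis=-1) + np.eye(len(pos))  # avoid divide-by-zero
    strength = (preferred - dist) / dist                     # >0 repel, <0 attract
    np.fill_diagonal(strength, 0.0)
    return (strength[..., None] * diff).sum(axis=1)

for _ in range(steps):
    pos += lr * pairwise_forces(pos)

d = np.linalg.norm(pos[:, None] - pos[None, :], axis=-1)
np.fill_diagonal(d, np.inf)
print("mean nearest-neighbour distance:", round(float(d.min(axis=1).mean()), 2))
```

Pairwise interactions of this kind enforce a characteristic nearest-neighbour distance (local order) without requiring a global lattice, which is the distinction the abstract draws between 2D and 3D grid cells.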

Speakers

17:40-18:00
Discussion

Chair

15:00-15:30
Tripartite encoding of visual 3D space

Abstract

A major challenge for prevailing models of human 3D vision is their inability to provide a satisfactory account of important aspects of our subjective awareness of 3D visual space. Reviewing phenomenological observations, empirical data, evolutionary logic and neurophysiological evidence, this presentation argues that human conscious awareness of visual space is underwritten by three separate spatial encodings that are optimized for specific regions of visual space: (1) encoding of unscaled 3D object shape and layout (relative depth); (2) encoding of scaled intra- and inter-object distances (scaled depth) for near space; and (3) egocentric encoding of distances for ambulatory space. This account of separate and neurophysiologically distinct spatial encodings can account for a number of important observations in the subjective awareness of 3D space, such as the paradoxical human capacity to perceive 3-dimensionality in 2-dimensional pictorial images; the unique subjective impression of object tangibility, negative space and object realness associated with binocular stereopsis; and the capacity to be subjectively aware of distances beyond peri-personal space even in the absence of binocular vision. This account provides a basis to better understand the conditions that underlie the subjective feeling of visual spatial immersion and presence.

Speakers

15:30-16:00
New approaches to visual scale and visual shape

Abstract

Human 3D vision is thought to triangulate the size, distance, direction, and 3D shape of objects using vision from the two eyes. But all four of these capacities rely on the visual system knowing where the eyes are pointing. Dr Linton's experimental work on size and distance challenges this account, suggesting a purely retinal account of visual size and distance, and likely of direction and 3D shape as well. This requires new accounts of visual scale and visual shape. For visual scale, he argues that observers rely on natural scene statistics to associate accentuated stereo depth (largely from horizontal disparities) with closer distances. This implies that depth / shape is resolved before size and distance. For visual shape, he argues that depth / shape from the two eyes is a solution to a different problem (rivalry eradication between two retinal images treated as if they are from the same viewpoint), rather than the visual system attempting to infer scene geometry (by treating the two retinal images as two different views of the same scene from different viewpoints). Dr Linton also draws upon his book, which questions whether other depth cues (perspective, shading, motion) really have any influence on this process.

Speakers

16:00-16:20
Discussion

Chair

16:40-17:10
Perception and action in Virtual and Augmented Reality

Abstract

Virtual and Augmented Reality (VR and AR) methods provide both opportunities and challenges for research and applications involving space perception. The opportunities result from the ability to immerse a user in a realistic environment in which they can interact, while at the same time having the ability to control and manipulate environmental and body-based cues in ways that are difficult or impossible to do in the real world. The challenge comes from the notion that virtual environments will be most useful if they achieve high perceptual fidelity – that observers will perceive and act in the mediated environment as they would in the real world. A pervasive finding across early research on space perception in virtual environments is that absolute distance is underestimated as compared to the real world. Using the challenge of underestimation of scale as a starting point, this talk presents new measures (perceived affordances) and methods of feedback (body-based cues), as well as advances in technologies (mixed reality) and cues (shadows), that contribute to a broader understanding of perceptual fidelity across the continuum of mediated environments.

Speakers

17:10-17:40
Engineering challenges for realistic displays

Abstract

How can a display appear indistinguishable from reality? Dr Lanman describes how to pass this 'visual Turing test' using AR/VR headsets, emphasizing the joint design of optics, display components, rendering algorithms, and sensing elements. Specifically, this presentation will focus on the engineering challenges for advancing along four axes: resolution, accommodation, distortion correction, and dynamic range.

Speakers

17:40-18:00
Discussion

Chair

15:00-15:30
A novel non-probabilistic model of 3D cue integration explains both perception and action

Abstract

It will be argued that perceptual and action tasks that require the encoding of 3D information are both based on the same set of computations. These are described by a computational theory of 3D cue integration, which constitutes a novel theoretical framework for studying 3D vision in humans. The proposed computational theory differs from the current mainstream approaches to the problem in two fundamental ways. First, it assumes that 3D mechanisms are deterministic processes that map a given visual stimulus to a unique 3D representation. In contrast, the currently held view of perception as Bayesian inference postulates that 3D representations are probabilistic in nature. Second, the proposed theory posits that 3D processing is heuristic, finding correct solutions only under ideal viewing conditions rather than as a general goal of visual computations. The deterministic and heuristic nature of these computations is therefore inconsistent with Bayesian approaches that model brain mechanisms as processes that derive the most accurate and precise representation of 3D structures. Instead, this theory predicts systematic biases in depth estimates that identically affect perceptual judgements and goal-directed actions.
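For context, the probabilistic account that the proposed theory argues against is usually formalised as reliability-weighted cue averaging. The snippet below states that standard rule (it is not the speaker's model); the depth estimates and variances are made-up numbers.

```python
# Standard reliability-weighted (maximum-likelihood) cue combination, i.e. the
# Bayesian account that the proposed deterministic theory rejects. Toy numbers only.
def combine(estimate_a, var_a, estimate_b, var_b):
    w_a = (1 / var_a) / (1 / var_a + 1 / var_b)
    return w_a * estimate_a + (1 - w_a) * estimate_b

# e.g. stereo suggests 10 cm of depth (reliable), motion suggests 14 cm (noisier):
print(combine(10.0, 1.0, 14.0, 4.0))   # 10.8 cm, pulled towards the reliable cue
```

The deterministic theory instead posits a fixed mapping from the stimulus to a single 3D estimate, so the same biased output is predicted for perception and for action.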

Speakers

15:30-16:00
Dissociations between perception and action in size-distance scaling

Abstract

One of the most puzzling abilities of the human brain is size constancy: an object is perceived as having the same size even though its image on the retina varies continuously with viewing distance. An accurate representation of size is critical not only for perceptual recognition, but also for goal-directed actions, such as grasping. In fact, to successfully grasp an object, our grip aperture needs to be scaled to the true size of the object irrespective of viewing distance, a scaling operation that can be referred to as grip constancy. In this talk, Dr Sperandio will present findings from studies on both healthy volunteers and a neurological patient with large bilateral lesions that include V1 and most of the occipital cortex. By measuring perceptual judgments and grasp kinematics under conditions in which the retinal image size was either different (for example, an object of a given physical size placed near to and far from the observer) or constant (for example, a small object placed near and a big object placed far), she will provide evidence that the neural mechanisms underlying size constancy for perception and action are dissociable and rely upon distinct representations of size.
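For readers unfamiliar with the underlying geometry, the scaling operation involved in size (and grip) constancy can be written in a couple of lines; the numbers below are illustrative and are not taken from the studies described.

```python
# Size constancy as geometry: recover physical size from retinal angle and distance.
import math

def physical_size(retinal_angle_deg, viewing_distance_cm):
    """Scale the retinal angle by viewing distance to recover physical size (cm)."""
    return 2 * viewing_distance_cm * math.tan(math.radians(retinal_angle_deg) / 2)

# The same 5 cm object subtends different retinal angles at 30 cm and at 60 cm,
# yet scaling by distance returns (roughly) the same physical size in both cases.
print(round(physical_size(9.53, 30), 2))   # ~5.0 cm
print(round(physical_size(4.77, 60), 2))   # ~5.0 cm
```

The experiments described above ask whether perception and grasping perform this distance scaling through shared or separate mechanisms.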

Speakers

16:00-16:10
Discussion

Chair


16:20-16:50
Do you hear what I see? How do early blind individuals experience object motion?

Abstract

Perceiving object motion is fundamentally multisensory, yet little is known about similarities and differences in motion computations across different senses. Insight can be provided by examining auditory motion processing in early blind individuals. Early blindness leads to ‘recruitment’ of the ‘visual’ motion area hMT+ for auditory motion processing. Meanwhile, the planum temporale, associated with auditory motion in sighted individuals, shows reduced selectivity for auditory motion, suggesting competition between cortical areas for functional roles.

According to the metamodal hypothesis of cross-modal plasticity developed by Pascual-Leone, the recruitment of hMT+ is driven by it being a metamodal structure containing “operators that execute a given function or computation regardless of sensory input modality”. On this hypothesis, the computations underlying auditory motion processing in early blind individuals should be analogous to visual motion processing in sighted individuals – relying on non-separable spatiotemporal filters.
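The separable versus non-separable distinction is concrete: a separable spatiotemporal filter factorises into a spatial profile multiplied by a temporal profile, whereas a motion-selective (non-separable) filter is oriented in space-time, for example a profile that translates at some velocity. The sketch below illustrates only that distinction; the Gaussian shapes and the velocity are arbitrary choices.

```python
# Separable vs non-separable spatiotemporal filters (illustrative shapes only).
import numpy as np

x = np.linspace(-1, 1, 65)
t = np.linspace(0, 1, 33)
X, T = np.meshgrid(x, t)

g = lambda u: np.exp(-u**2 / 0.02)           # spatial profile
h = lambda u: np.exp(-(u - 0.5)**2 / 0.05)   # temporal profile

separable = g(X) * h(T)          # f(x, t) = g(x) h(t): no preferred velocity
nonseparable = g(X - 0.8 * T)    # f(x, t) = g(x - v t): oriented in space-time

# A separable filter is rank 1 as a space-time matrix; the oriented one is not.
print(np.linalg.matrix_rank(separable), np.linalg.matrix_rank(nonseparable))
```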

Inconsistent with the metamodal hypothesis, auditory motion filters, in both blind and sighted subjects, are separable in space and time. The computations underlying auditory motion processing in early blind individuals are not qualitatively altered; instead, the recruitment of hMT+ to extract motion information from auditory input includes significant modification of its normal computational operations.

Speakers

16:50-17:20
The role of binocular vision in the development of visuomotor control and performance of fine motor skills

Abstract

The ability to perform accurate, precise and temporally coordinated goal-directed actions is fundamentally important to activities of daily life, as well as skilled occupational and recreational performance. Vision provides a key sensory input for the normal development of visuomotor skills. Normal visual development is disrupted by amblyopia, a neurodevelopmental disorder characterized by impaired visual acuity in one eye and reduced binocularity, which affects 2–4% of children and adults. This presentation will discuss a growing body of research which demonstrates that binocular vision provides an important input for optimal development of the visuomotor system, specifically visually guided upper limb movements such as reaching and grasping. Research shows that decorrelated binocular experience is associated with both deficits and compensatory adaptations in visuomotor control. Parallel studies with typically developing children and visually normal adults provide converging evidence supporting the contribution of stereopsis to the control of grasping. Overall, this research advances our understanding of the role of binocular vision in the development and performance of visuomotor skills, which is the first step towards developing assessment tools and targeted rehabilitation for children with neurodevelopmental disorders at risk of poor visuomotor outcomes.

Speakers

17:20-17:30
Discussion

Chair

17:40-18:30
Panel discussion

Abstract

The Chairs (Dr Andrew Fitzgibbon, Professor Matteo Carandini, Dr Mar Gonzalez-Franco, and Professor Jody Culham) discuss future directions for 3D vision in an interactive question and answer session with the audience.

Speakers