This page is archived

Links to external sources may no longer work as intended. The content may not represent the latest thinking in this area or the Society’s current position on the topic.

New approaches to 3D vision

01 - 04 November 2021 15:00 - 18:30

Scientific discussion meeting organised by Professor Michael Morgan FRS, Dr Paul Linton, Professor Jenny Read, Dr Dhanraj Vishwanath, Professor Sarah Creem-Regehr and Professor Fulvio Domini.

Leading approaches to computer vision (SLAM: simultaneous localization and mapping), animal navigation (cognitive maps), and human vision (optimal cue integration) start from the assumption that the aim of 3D vision is to produce a metric reconstruction of the environment. Recent advances in machine learning, single-cell recording in animals, virtual reality, and visuomotor control all challenge this assumption. The purpose of this meeting was to bring these different disciplines together to formulate an alternative approach to 3D vision.

The schedule of talks, speaker biographies and abstracts are available below. An accompanying journal issue has been published in Philosophical Transactions of the Royal Society B.

Attending the event

This meeting has taken place. Watch the recordings on our YouTube channel.

Enquiries: please contact the Scientific Programmes team

Organisers

  • Professor Michael Morgan FRS, City, University of London, UK

    Professor Michael Morgan is an Experimental Psychologist whose main interest is in Visual Perception. He graduated in Natural Sciences from the University of Cambridge in 1964 and has held teaching and research positions at the Universities of Cambridge, Durham, UCL, Edinburgh (Darwin Professorial Fellow) and, most recently, City, University of London. His main publications have been in the areas of Spatial Vision, Motion Perception and Eye Movements. His contributions to 3D vision include investigations of the role of interocular spatiotemporal phase differences. He is the author of two books on Vision: Molyneux’s Question (1977) and The Space Between Our Ears (2003).

  • Dr Paul Linton, City, University of London, UK

    Paul Linton is a Research Fellow in 3D Vision at the Centre for Applied Vision Research, City, University of London. He was previously a Stipendiary Lecturer at the University of Oxford, and a Teaching Fellow at University College London. He was also a member of the DeepFocus team at Facebook Reality Labs, where he used vision science to inform the development of virtual and augmented reality technology. He is the author of the book The Perception and Cognition of Visual Space (Palgrave, 2017), which challenges contemporary accounts of depth cue integration. His experimental research shows that humans are unable to triangulate the size and distance of objects, with implications for visual scale, binocular disparity processing, multisensory integration, and object interaction. His recent theoretical work considers the extent of cognitive processing in V1. For further details, please visit: https://linton.vision.

  • Professor Jenny Read, Newcastle University, UK

    Jenny Read is Professor of Vision Science at Newcastle University’s Biosciences Institute. She took an undergraduate degree in physics (1994), a doctorate in theoretical physics (1997) and a Masters in neuroscience (1999) at Oxford University, UK. From 1997–2001 she was a Wellcome Training Fellow in Mathematical Biology at Oxford University, then from 2001–2005 a postdoctoral fellow at the US National Eye Institute in Bethesda, Maryland. She returned to the UK in 2005 with a Royal Society University Research Fellowship. Her lab works on many aspects of visual perception, especially binocular and stereoscopic vision. Current projects include modelling how visual cortex encodes binocular information, developing a new stereo vision test for children, and uncovering how insects see in stereoscopic 3D. More information and all publications are available at http://www.jennyreadresearch.com.

  • Dr Dhanraj Vishwanath, University of St Andrews, UK

    Dr Dhanraj Vishwanath is Senior Lecturer in Perception at the School of Psychology and Neuroscience at the University of St Andrews. His research interests are in 3D vision, visual aesthetics, eye movements and attention, with a special focus on phenomenological and philosophical issues. With his collaborators, he has made empirical and theoretical contributions in pictorial space perception, the role of blur in depth perception, the phenomenology of stereopsis, as well as spatial localization in eye movements and attention. In addition to his current work on 3D perception, he is working on a theoretical account of the psychology of visual art and aesthetics. He received his PhD from Rutgers University, New Brunswick, followed by postdoctoral work at UC Berkeley.

  • Professor Sarah Creem-Regehr, University of Utah, USA

    Sarah Creem-Regehr is a Professor in the Psychology Department at the University of Utah. She also holds faculty appointments in the School of Computing and the Neuroscience program at the University of Utah. She received her PhD in Psychology from the University of Virginia. Her research examines how humans perceive, learn, and navigate spaces in natural, virtual, and visually impoverished environments. Her research takes an interdisciplinary approach, combining the study of space perception and spatial cognition with applications in visualization and virtual environments. She co-authored the book Visual Perception from a Computer Graphics Perspective and was previously Associate Editor for Psychonomic Bulletin & Review and Journal of Experimental Psychology: Human Perception and Performance. She is currently Associate Editor for Quarterly Journal of Experimental Psychology and she will become Editor-in-Chief of Cognitive Research: Principles and Implications in January 2022.

  • Professor Fulvio Domini, Brown University, USA

    Fulvio Domini completed his Masters in Electrical Engineering and PhD in Experimental Psychology at the University of Trieste, Italy. He joined the faculty at Brown University, USA, in 1999, where he is currently Professor in the Department of Cognitive, Linguistic and Psychological Sciences. His research team investigates how the human visual system processes 3D visual information to allow successful interactions with the environment. His approach combines computational methods and behavioral studies to understand which visual features establish the mapping between vision and action. Contrary to the commonly held assumption that perception and action stem from separate visual mechanisms, he takes a fundamentally different view, proposing that perception and action form a coordinated system: perception informs action about the state of the world and, in turn, action shapes perception by signalling when it is faulty.

Schedule

Chair

Dr Andrew Fitzgibbon FREng, Microsoft, UK

15:00 - 15:05 Introduction
15:05 - 15:30 Neural priors, neural encoders and neural renderers

Scene representation – the process of converting visual sensory data into useful descriptions – is a requirement for intelligent behaviour. Scene representation can be achieved with three components: a prior (which scenes are likely?), an encoder (which scenes correspond to this image?), and a renderer (which images correspond to this scene?). This talk will describe how neural priors, encoders and renderers can be trained without any human-provided labels, and show how this unlocks new capabilities eg in protein structure understanding.
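As a rough, label-free illustration of this three-part factorisation, a minimal VAE-style sketch might look like the following (toy sizes, hypothetical names such as IMAGE_DIM and SCENE_DIM; this is not the speaker's actual system):

```python
# Hypothetical toy: a prior over latent scenes (standard Gaussian), an encoder
# (image -> scene code) and a renderer (scene code -> image), trained with no
# human-provided labels by asking the renderer to reproduce the input image.
import torch
import torch.nn as nn

IMAGE_DIM, SCENE_DIM = 64 * 64, 32    # illustrative sizes only

encoder = nn.Sequential(nn.Linear(IMAGE_DIM, 256), nn.ReLU(),
                        nn.Linear(256, 2 * SCENE_DIM))    # outputs mean and log-variance
renderer = nn.Sequential(nn.Linear(SCENE_DIM, 256), nn.ReLU(),
                         nn.Linear(256, IMAGE_DIM))

opt = torch.optim.Adam(list(encoder.parameters()) + list(renderer.parameters()), lr=1e-3)
images = torch.rand(16, IMAGE_DIM)    # stand-in for a batch of unlabelled images

for _ in range(100):
    mu, logvar = encoder(images).chunk(2, dim=-1)              # which scenes fit this image?
    scene = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # sample a scene code
    recon = renderer(scene)                                    # which image fits this scene?
    recon_loss = ((recon - images) ** 2).mean()
    prior_loss = 0.5 * (mu ** 2 + logvar.exp() - 1 - logvar).mean()   # which scenes are likely?
    loss = recon_loss + prior_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```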

Dr SM Ali Eslami, DeepMind, UK

15:30 - 16:00 Multi-scale predictive representations and human-like RL

Many artificial agents pass benchmarks for solving specific tasks, but what would make their behaviour seem more human-like to humans? In previous research Dr Momennejad used reinforcement learning (RL) to study how humans learn multiscale predictive representations for memory, planning, and navigation. Based on this work, she will first present behavioural, fMRI, and electrophysiology evidence that hippocampal and prefrontal hierarchies learn multi-scale predictive representations, and update them via offline replay. She will then present recent work in which the group assesses the human-likeness of algorithms that all solve a navigation task. To this end, the group designed a Turing test to assess and compare human-likeness of agents navigating an XBOX game ("Bleeding Edge"). Together, representation and replay models enhance our understanding of how the brain's algorithms underlie behaviour in health and pathology. In turn, designing AI algorithms inspired by models that are neurally and behaviourally plausible can advance the state of the art of human-like RL.

Dr Ida Momennejad, Microsoft Research NYC, USA

16:00 - 16:20 Discussion

Chair

Dr Andrew Fitzgibbon FREng, Microsoft, UK

16:40 - 17:10 Generalization in data-driven control

Current machine learning methods are primarily deployed for tackling prediction problems, which are almost always cast as supervised learning tasks. Despite decades of advances in reinforcement learning and learning-based control, applying these methods to domains that require open-world generalization – autonomous driving, robotics, aerospace, and other applications – remains challenging. Realistic environments require effective generalization, and effective generalization requires training on large and diverse datasets that are representative of the likely test-time scenarios. Dr Levine will discuss why this poses a particular challenge for learning-based control, and present some recent research directions that aim to address this challenge. He will discuss how offline reinforcement learning algorithms can make it possible for learning-based control systems to utilize large and diverse real-world datasets, how the use of diverse data can enable robotic systems to navigate real-world environments, and how multi-task and contextual policies can enable broad generalization to a range of user-specified goals.

Dr Sergey Levine, UC Berkeley and Google, USA

17:10 - 17:40 Understanding 3D vision as a policy network

A 'policy network' is a term used in reinforcement learning to describe the mapping from states to actions, where the ‘state’ reflects both the current sensory stimulus and the goal. This is not the typical foundation for describing 3D vision, which in computer vision is based on reconstruction (eg Simultaneous Localisation And Mapping), while in neuroscience the predominant hypothesis has assumed that there are 3D transformations between retino-centric, ego-centric and world-based reference frames. Theoretical and experimental evidence in support of this neural hypothesis is lacking. Professor Glennerster will describe instead an approach that avoids 3D coordinate frames. A policy network for saccades (pure rotations of the camera/eye around the optic array) is a starting point for understanding (i) an ego-centric representation of visual direction, distance, slant and depth relief (what Marr hoped to achieve with his 2½-D sketch) and (ii) a hierarchical, compositional representation for navigation. We have known for a long time how the brain can implement a policy network, so if we could describe 3D vision in terms of a policy network (where the actions are either saccades or head translations), we would have moved closer to a neurally plausible model of 3D vision.
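As a minimal sketch of this framing (not Professor Glennerster's model; dimensions and names are illustrative), a policy maps a state built from the current retinal input and a goal code directly to a discrete action such as a saccade or head translation, with no intermediate 3D world-coordinate reconstruction:

```python
import torch
import torch.nn as nn

RETINA_DIM, GOAL_DIM, N_ACTIONS = 32 * 32, 8, 6   # e.g. 4 saccade directions + 2 head translations

policy = nn.Sequential(
    nn.Linear(RETINA_DIM + GOAL_DIM, 128), nn.ReLU(),
    nn.Linear(128, N_ACTIONS),
)

retinal_image = torch.rand(1, RETINA_DIM)   # stand-in for the current sensory stimulus
goal = torch.zeros(1, GOAL_DIM)             # stand-in for the current goal/task
goal[0, 2] = 1.0

state = torch.cat([retinal_image, goal], dim=-1)   # state = stimulus + goal
action = policy(state).argmax(dim=-1)              # choose a saccade or head translation
```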

Professor Andrew Glennerster, University of Reading, UK

17:40 - 18:00 Discussion

Chair

Professor Matteo Carandini, University College London, UK

15:00 - 15:30 Stupid stereoscopic algorithms that still work

Stereopsis has traditionally been considered a complex visual ability, restricted to large-brained animals. The discovery in the 1980s that insects, too, have stereopsis therefore challenged theories of stereopsis. How can such simple brains see in 3D? One answer is simply that insect stereopsis is much lower-resolution, and probably does not produce even a coarse depth map across the visual field. Rather, it may aim to produce simple behaviour, such as orienting towards the closer of two objects or triggering a strike when prey comes within range. Scientific thinking about stereopsis has been unduly anthropomorphic, for example assuming that stereopsis must require binocular fusion or a solution of the stereo correspondence problem. In fact, useful behaviour can be produced with very basic stereoscopic algorithms which make no attempt to achieve fusion or correspondence. This may explain why some aspects of insect stereopsis seem poorly designed from an engineering point of view: for example, paying no attention to whether interocular contrast or velocities match. Such 'stupid' algorithms demonstrably work well enough in practice for their species, and may prove useful in particular autonomous applications.
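A toy example in this spirit (illustrative numbers, not a published insect model) tests a single fixed disparity and fires when both eyes respond there, with no binocular fusion and no correspondence search:

```python
import numpy as np

def near_target_trigger(left, right, strike_disparity=4, threshold=0.5):
    """Fire only if the two eyes' signals coincide at one fixed, near disparity."""
    binocular = left * np.roll(right, strike_disparity)   # no correspondence search
    return bool(binocular.max() > threshold)

# A target whose left- and right-eye positions differ by the tested ~4 pixels:
left = np.zeros(64)
right = np.zeros(64)
left[30:32] = 1.0
right[26:28] = 1.0
print(near_target_trigger(left, right))        # True: "prey" within strike range

# An object lying at a different disparity (outside the tested range):
other_right = np.zeros(64)
other_right[20:22] = 1.0
print(near_target_trigger(left, other_right))  # False: no strike
```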

Professor Jenny Read, Newcastle University, UK

15:30 - 16:00 Visual processing in the brain during navigation

Much of our everyday visual experience is based on our movements through the world, when we navigate between different places – from within a room, to between cities. Is visual function the same during navigation? Dr Saleem will present work in which, using a virtual reality environment and presenting identical visual stimuli in different locations, they asked whether spatial position modulates activity in the visual system. Activity in the primary visual cortex (V1) was found to be strongly modulated by spatial position, and this modulation persisted across higher visual areas in the cortex. This modulation was not present in inputs to visual cortex from the lateral geniculate nucleus. Furthermore, the spatial modulation of visual responses was stronger when animals actively navigated, rather than passively viewed, the environment. These results suggest that the spatial modulation of visual information arises in V1 with active navigation. The Saleem Lab has also been investigating feedback inputs to V1, and visual responses to optic flow stimuli. They have also developed an open-source software paradigm, BonVision, that can present both 2D and 3D stimuli in a common framework, while maintaining the precision and replicability of standardised visual experiments.

Dr Aman Saleem, University College London, UK

16:00 - 16:20 Discussion

Chair

Professor Matteo Carandini, University College London, UK

16:40 - 17:10 The cognitive map of 3D space: not as metric as we thought?

The mammalian representation of navigable space (space that an animal moves itself through) is supported by a network of brain regions, centred on the hippocampus, that transform raw sensory signals into an internal map-like representation that can be used in navigation. It has long been thought that this map is metric, because its central units, the place cells, respond parametrically to metric changes in the environment such as stretching. This view was consolidated by the discovery of grid cells, which have evenly spaced firing fields that reveal metric computations such as speed, direction and distance. However, how these neurons behave in complex 3D space suggests that the map is not absolutely metric but is rather only loosely so, being tailored to the environment structure and/or shaped by its movement affordances. This accords with studies showing that humans seem to use a less metric and more topological internal map when performing spatial judgements. The emerging picture is one of a hierarchical processing system with highly metric processing of near space but progressively more topological maps at larger scales. This may be a way of saving processing resources, and could reflect a more general organisational principle of complex cognition.

Professor Kate Jeffery, University College London, UK

17:10 - 17:40 Locally ordered representation of 3D space in the entorhinal cortex

As animals navigate on a two-dimensional surface, neurons in the medial entorhinal cortex (MEC) known as grid cells are activated when the animal passes through multiple locations (firing fields) arranged in a hexagonal lattice tiling the locomotion surface. However, although our world is three-dimensional (3D), it is unclear how the MEC represents 3D space. The group recorded from MEC cells in freely flying bats and identified several classes of spatial neurons, including 3D border cells, 3D head-direction cells, and neurons with multiple 3D firing fields. Many of these multifield neurons were 3D grid cells, whose neighbouring fields were separated by a characteristic distance – forming a local order – but lacked a global lattice arrangement of the fields. Thus, whereas 2D grid cells form a global lattice – characterized by both local and global order – 3D grid cells exhibited only local order, creating a locally ordered metric for space. The group modelled grid cells as emerging from pairwise interactions between fields, which yielded a hexagonal lattice in 2D and local order in 3D, describing both 2D and 3D grid cells using one unifying model. Together, these data and model illuminate fundamental differences and similarities between neural codes for 3D and 2D space in the mammalian brain.
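A generic pairwise-interaction toy in this spirit (not the group's published model; all parameters are illustrative) relaxes field centres toward a characteristic neighbour distance. Per the abstract, interactions of this kind can yield a hexagonal lattice in 2D but only local order in 3D:

```python
import numpy as np

rng = np.random.default_rng(0)

def relax_fields(dim, n_fields=40, d0=1.0, steps=2000, lr=0.01):
    """Move field centres downhill on a pairwise energy with preferred spacing d0."""
    pts = rng.uniform(0, 4, size=(n_fields, dim))                 # random field centres
    for _ in range(steps):
        diff = pts[:, None, :] - pts[None, :, :]                  # pairwise displacements
        dist = np.linalg.norm(diff, axis=-1) + np.eye(n_fields)   # avoid divide-by-zero on diagonal
        # spring-like term pulling each pair toward separation d0, fading for distant pairs
        weight = np.exp(-dist / (2 * d0)) * (dist - d0) / dist
        np.fill_diagonal(weight, 0.0)
        pts -= lr * (weight[..., None] * diff).sum(axis=1)
    return pts

fields_2d = relax_fields(dim=2)   # 2D arrangement of field centres
fields_3d = relax_fields(dim=3)   # 3D arrangement: characteristic spacing, no global lattice expected
```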

Gily Ginosar, Weizmann Institute of Science, Israel

17:40 - 18:00 Discussion

Chair

Dr Mar Gonzalez-Franco, Microsoft Research, USA

15:00 - 15:30 Tripartite encoding of visual 3D space

A major challenge for prevailing models of human 3D vision is their inability to provide a satisfactory account of important aspects of our subjective awareness of 3D visual space. Reviewing phenomenological observations, empirical data, evolutionary logic and neurophysiological evidence, this presentation argues that human conscious awareness of visual space is underwritten by three separate spatial encodings that are optimized for specific regions of visual space: (1) encoding of unscaled 3D object shape and layout (relative depth); (2) encoding of scaled intra- and inter-object distances (scaled depth) for near space; and (3) egocentric encoding of distances for ambulatory space. This account of separate and neurophysiologically distinct spatial encodings can explain a number of important observations in the subjective awareness of 3D space, such as the paradoxical human capacity to perceive 3-dimensionality in 2-dimensional pictorial images; the unique subjective impressions of object tangibility, negative space and object realness associated with binocular stereopsis; and the capacity to be subjectively aware of distances beyond peri-personal space even in the absence of binocular vision. This account provides a basis to better understand the conditions that underlie the subjective feeling of visual spatial immersion and presence.

Dr Dhanraj Vishwanath, University of St Andrews, UK

15:30 - 16:00 New approaches to visual scale and visual shape

Human 3D vision is thought to triangulate the size, distance, direction, and 3D shape of objects using vision from the two eyes. But all four of these capacities rely on the visual system knowing where the eyes are pointing. Dr Linton's experimental work on size and distance challenges this account, suggesting a purely retinal account of visual size and distance, and likely of direction and 3D shape as well. This requires new accounts of visual scale and visual shape. For visual scale, he argues that observers rely on natural scene statistics to associate accentuated stereo depth (largely from horizontal disparities) with closer distances. This implies that depth / shape is resolved before size and distance. For visual shape, he argues that depth / shape from the two eyes is a solution to a different problem (rivalry eradication between two retinal images treated as if they are from the same viewpoint), rather than the visual system attempting to infer scene geometry (by treating the two retinal images as two different views of the same scene from different viewpoints). Dr Linton also draws upon his book, which questions whether other depth cues (perspective, shading, motion) really have any influence on this process.
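The geometric basis for that association can be made concrete: for a fixed physical depth interval, binocular disparity falls off roughly with the square of viewing distance (δ ≈ I·d/D² for interocular separation I), while retinal size falls off only linearly, so stereo-specified depth is accentuated at near distances. A small worked example (arbitrary numbers):

```python
ipd = 0.065   # interocular separation in metres (illustrative)
d = 0.05      # fixed physical depth interval in metres

for D in (0.5, 1.0, 2.0, 4.0):                  # viewing distance in metres
    disparity_arcmin = (ipd * d / D**2) * 3438  # small-angle approximation; 1 rad ≈ 3438 arcmin
    print(f"D = {D} m: disparity ≈ {disparity_arcmin:.1f} arcmin")
# The same depth interval produces a much larger disparity signal when the object is close.
```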

Dr Paul Linton, City, University of London, UK

16:00 - 16:20 Discussion

Chair

Dr Mar Gonzalez-Franco, Microsoft Research, USA

16:40 - 17:10 Perception and action in Virtual and Augmented Reality

Virtual and Augmented Reality (VR and AR) methods provide both opportunities and challenges for research and applications involving space perception. The opportunities result from the ability to immerse a user in a realistic environment in which they can interact, while at the same time having the ability to control and manipulate environmental and body-based cues in ways that are difficult or impossible to do in the real world. The challenge comes from the notion that virtual environments will be most useful if they achieve high perceptual fidelity – that observers will perceive and act in the mediated environment as they would in the real world. A pervasive finding across early research on space perception in virtual environments is that absolute distance is underestimated as compared to the real world. Using the challenge of underestimation of scale as a starting point, this talk presents new measures (perceived affordances) and methods of feedback (body-based cues), as well as advances in technologies (mixed reality) and cues (shadows), that contribute to a broader understanding of perceptual fidelity across the continuum of mediated environments.

Professor Sarah Creem-Regehr, University of Utah, USA

17:10 - 17:40 Engineering challenges for realistic displays

How can a display appear indistinguishable from reality? Dr Lanman describes how to pass this 'visual Turing test' using AR/VR headsets, emphasizing the joint design of optics, display components, rendering algorithms, and sensing elements. Specifically, this presentation will focus on the engineering challenges for advancing along four axes: resolution, accommodation, distortion correction, and dynamic range.

Dr Douglas Lanman, Facebook Reality Labs, USA

17:40 - 18:00 Discussion

Chair

Professor Jody Culham, Western University, Canada

15:00 - 15:30 A novel non-probabilistic model of 3D cue integration explains both perception and action

It will be argued that perceptual and action tasks that require the encoding of 3D information are both based on the same set of computations. These are described by a computational theory of 3D cue integration, which constitutes a novel theoretical framework to study 3D vision in humans. The proposed computational theory differs from the current mainstream approaches to the problem in two fundamental ways. First, it assumes that 3D mechanisms are deterministic processes that map a given visual stimulus to a unique 3D representation. In contrast, the currently held view of perception as Bayesian inference postulates a probabilistic nature of 3D representation. Second, the proposed theory posits that 3D processing is heuristic, finding correct solutions to the problem only in ideal viewing conditions and not as a general goal of visual computations. The deterministic and heuristic nature of these computations is therefore inconsistent with Bayesian approaches that model brain mechanisms as processes that derive the most accurate and precise representation of 3D structures. Instead, this theory predicts systematic biases in depth estimates that identically affect perceptual judgements and goal-directed actions.
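As a toy contrast (not Professor Domini's published model; numbers are arbitrary), compare the standard reliability-weighted 'optimal' combination of two depth cues, mentioned in the meeting synopsis, with a deterministic, fixed-weight heuristic that ignores cue reliability:

```python
def bayesian_combine(depth_a, var_a, depth_b, var_b):
    """Reliability-weighted average: the standard 'optimal cue integration' rule."""
    w_a = (1 / var_a) / (1 / var_a + 1 / var_b)
    return w_a * depth_a + (1 - w_a) * depth_b

def heuristic_combine(depth_a, depth_b, w_a=0.7):
    """Deterministic mapping from the same cues to a single estimate, ignoring reliability."""
    return w_a * depth_a + (1 - w_a) * depth_b

# Two cue readings for the same surface, with cue A four times more reliable than cue B:
print(bayesian_combine(10.0, 1.0, 14.0, 4.0))   # 10.8: shifts with cue reliability
print(heuristic_combine(10.0, 14.0))            # 11.2: fixed weights regardless of viewing conditions
```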

Professor Fulvio Domini, Brown University, USA

15:30 - 16:00 Dissociations between perception and action in size-distance scaling

One of the most puzzling abilities of the human brain is size constancy: an object is perceived as having the same size even though its image on the retina varies continuously with viewing distance. An accurate representation of size is critical not only for perceptual recognition, but also for goal-directed actions, such as grasping. In fact, to successfully grasp an object, our grip aperture needs to be scaled to the true size of the object irrespective of viewing distance, a scaling operation that can be referred to as grip constancy. In this talk, Dr Sperandio will present findings from studies on both healthy volunteers and a neurological patient with large bilateral lesions that include V1 and most of the occipital cortex. By measuring perceptual judgments and grasp kinematics in response to conditions in which the image on the retina was either different (for example, by placing an object of a given physical size near and far from the observers) or constant (for example, by placing a small object near and a big object far) in size, she will provide evidence that the neural mechanisms underlying size constancy for perception and action are dissociable and rely upon distinct representations of size.
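The size-distance geometry underlying size constancy can be stated simply: an object's retinal (angular) size shrinks with viewing distance, so recovering a constant physical size requires scaling retinal size by an estimate of distance. A small worked example (arbitrary numbers):

```python
import math

def retinal_angle_deg(size_m, distance_m):
    """Angular size subtended at the eye by an object of a given physical size."""
    return math.degrees(2 * math.atan(size_m / (2 * distance_m)))

def size_from_angle(angle_deg, distance_m):
    """Invert the relation: physical size recovered from angular size and distance."""
    return 2 * distance_m * math.tan(math.radians(angle_deg) / 2)

for D in (0.5, 1.0, 2.0):   # the same 10 cm object viewed at three distances
    angle = retinal_angle_deg(0.10, D)
    print(f"D = {D} m: angle = {angle:.2f} deg, recovered size = {size_from_angle(angle, D):.3f} m")
# The angular size changes with distance, but scaling by distance returns 0.100 m each time.
```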

Dr Irene Sperandio, University of Trento, Italy

16:00 - 16:10 Discussion

Chair

Professor Jody Culham, Western University, Canada

16:20 - 16:50 Do you hear what I see? How do early blind individuals experience object motion?

Perceiving object motion is fundamentally multisensory, yet little is known about similarities and differences in motion computations across different senses. Insight can be provided by examining auditory motion processing in early blind individuals. Early blindness leads to ‘recruitment’ of the ‘visual’ motion area hMT+ for auditory motion processing. Meanwhile, the planum temporale, associated with auditory motion in sighted individuals, shows reduced selectivity for auditory motion, suggesting competition between cortical areas for functional roles.

According to the metamodal hypothesis of cross-modal plasticity developed by Pascual-Leone, the recruitment of hMT+ is driven by it being a metamodal structure containing “operators that execute a given function or computation regardless of sensory input modality”. According to the metamodal hypothesis, the computations underlying auditory motion processing in early blind individuals should be analogous to visual motion processing in sighted individuals – relying on non-separable spatiotemporal filters.

Inconsistent with the metamodal hypothesis, auditory motion filters, in both blind and sighted subjects, are separable in space and time. The computations underlying auditory motion processing in early blind individuals are not qualitatively altered; instead, the recruitment of hMT+ to extract motion information from auditory input includes significant modification of its normal computational operations.
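The separable / non-separable distinction at the heart of this argument can be illustrated numerically (arbitrary toy parameters): a separable spatiotemporal filter factorises as a spatial profile times a temporal profile, giving a rank-1 matrix when sampled on a grid, whereas a space-time-oriented, motion-selective filter does not:

```python
import numpy as np

x = np.linspace(-1, 1, 64)[:, None]   # space
t = np.linspace(-1, 1, 64)[None, :]   # time

envelope = np.exp(-(x**2 + t**2) / 0.1)
separable = envelope * np.cos(6 * np.pi * x) * np.cos(6 * np.pi * t)   # f(x,t) = g(x) * h(t)
oriented = envelope * np.cos(6 * np.pi * (x + t))                      # tilted in space-time

def effective_rank(f, tol=1e-6):
    """Number of singular values above a small fraction of the largest."""
    s = np.linalg.svd(f, compute_uv=False)
    return int((s > tol * s[0]).sum())

print(effective_rank(separable))   # 1: separable in space and time
print(effective_rank(oriented))    # 2: non-separable (motion-selective)
```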

Professor Ione Fine, University of Washington, USA

16:50 - 17:20 The role of binocular vision in the development of visuomotor control and performance of fine motor skills

The ability to perform accurate, precise and temporally coordinated goal-directed actions is fundamentally important to activities of daily life, as well as skilled occupational and recreational performance. Vision provides a key sensory input for the normal development of visuomotor skills. Normal visual development is disrupted by amblyopia, a neurodevelopmental disorder characterized by impaired visual acuity in one eye and reduced binocularity, which affects 2–4% of children and adults. This presentation will discuss a growing body of research which demonstrates that binocular vision provides an important input for optimal development of the visuomotor system, specifically visually guided upper limb movements such as reaching and grasping. Research shows that decorrelated binocular experience is associated with both deficits and compensatory adaptations in visuomotor control. Parallel studies with typically developing children and visually normal adults provide converging evidence supporting the contribution of stereopsis to the control of grasping. Overall, this research advances our understanding about the role of binocular vision in the development and performance of visuomotor skills, which is the first step towards developing assessment tools and targeted rehabilitations for children with neurodevelopmental disorders at risk of poor visuomotor outcomes.

Dr Ewa Niechwiej-Szwedo, University of Waterloo, Canada

17:20 - 17:30 Discussion

Chair

Professor Michael Morgan FRS, City, University of London, UK

17:40 - 18:30 Panel discussion

The Chairs (Dr Andrew Fitzgibbon, Professor Matteo Carandini, Dr Mar Gonzalez-Franco, and Professor Jody Culham) discuss future directions for 3D vision in an interactive question and answer session with the audience.

Dr Andrew Fitzgibbon FREng, Microsoft, UK

Professor Matteo Carandini, University College London, UK

Dr Mar Gonzalez-Franco, Microsoft Research, USA

Professor Jody Culham, Western University, Canada