
Overview

Scientific discussion meeting organised by Professor Nicholas Higham FRS, Laura Grigori and Professor Jack Dongarra.

As computer architectures evolve, numerical algorithms for high-performance computing struggle to cope with the high-resolution and data-intensive methods that are now key to many research fields. This meeting brought together computer and computational scientists who are developing innovative scalable algorithms and software with application scientists who need to explore and adopt the new algorithms in order to achieve peta/exascale performance.

Speaker abstracts and biographies can be found below. Recorded audio of the presentations is also available below. An accompanying journal issue for this meeting was published in Philosophical Transactions of the Royal Society A.

Attending this event

This meeting has taken place.

Enquiries: contact the Scientific Programmes team

Organisers

Schedule


Chair

09:00-09:15
Introduction
09:15-09:45
Hierarchical algorithms on hierarchical architectures

Speakers


Listen to the audio (mp3)

09:45-09:55
Discussion
09:55-10:25
Antisocial parallelism: avoiding, hiding and managing communication

Speakers


Listen to the audio (mp3)

10:25-10:35
Discussion
10:35-11:05
Coffee
11:05-11:35
Big telescope, big data: towards exa-scale with the SKA

Abstract

The Square Kilometre Array (SKA) will be the world’s largest radio telescope. The pre-construction design effort for the SKA started in 2012 and involves approximately 100 organisations in 20 countries. The scale of the project makes it a huge endeavour, not only in terms of the science that it will ultimately undertake, but also in terms of the engineering and development that is required to design and build such a unique instrument. This design effort is now drawing to a close and early science operations are anticipated to begin in ~2025.

With raw data rates capable of producing zetta-byte volumes per year, the online compute and data compression for the telescope is a key component of observatory operations. Expected to run in soft real-time, this processing will require a significant fraction of an exa-flop when at full capacity and will reduce the data rate from the observatory to the outside world to a few hundred peta-bytes of data products per year.

Astronomers will receive their data through a network of worldwide SKA Regional Centres (SRCs). Unlike the observatory science data processing, it is not possible to define a finite set of compute models for an SRC. An SKA Regional Centre requires the flexibility to enable a broad range of single user development and processing, as well as providing an infrastructure that can support efficient large-scale compute for the standardised data processing of reserved access key science projects.

Here Professor Scaife will discuss the processing model for the SKA telescope, from the antennas to the regional centres, and highlight the components of this processing that dominate the compute load. She will discuss the details of specific algorithms being used and approaches that have been adopted to improve efficiency, both in terms of optimisation and re-factoring.

Speakers


Listen to the audio (mp3)

11:35-11:45
Discussion
11:45-12:15
Algorithms for in situ data analytics in next generation molecular dynamics workflows

Abstract

Molecular dynamics (MD) simulations, which study the classical time evolution of a molecular system at atomic resolution, are widely used in chemistry, materials science, molecular biology, and drug design; they are among the most common simulations run on supercomputers. Next-generation supercomputers will have dramatically higher performance than current systems, generating more data that needs to be analyzed (i.e. more and longer MD trajectories). The coordination of data generation and analysis cannot rely on the manual, centralized approaches that predominate today.

In this talk Dr Taufer will discuss how the combination of machine learning and data analytics algorithms, workflow management methods, and high performance computing systems can transition the runtime analysis of larger and larger MD trajectories towards the exascale era. Dr Taufer will demonstrate her group's approach on three case studies: protein-ligand docking simulations, protein folding simulations, and analytics of protein functions depending on proteins’ three-dimensional structures. She will show how, by mapping individual substructures to metadata, frame by frame at runtime, it is possible to study the conformational dynamics of proteins in situ. The ensemble of metadata can be used for automatic, strategic analysis and steering of MD simulations within a trajectory or across trajectories, without manually identifying those portions of trajectories in which rare events take place or critical conformational features are embedded.

Speakers


Listen to the audio (mp3)

12:15-12:25
Discussion

Chair

13:25-13:55
Numerical computing challenges for large-scale deep learning

Abstract

In this talk Professor Stevens will discuss the convergence of traditional high-performance computing, data analytics and deep learning, and some of the architectural, algorithmic and software challenges this convergence creates as we push the envelope on the scale and volume of training and inference runs on today's largest machines. Deep learning is beginning to have significant impact in science, engineering and medicine. The use of HPC platforms in deep learning ranges from training single models at high speed, to training large numbers of models in sweeps for model development and discovery, hyper-parameter optimisation and uncertainty quantification, to large-scale ensembles for data preparation, inference on large-scale data and data post-processing. It has already been demonstrated that some reinforcement learning problems need multiple exaop/s-days of computing to reach state-of-the-art performance. This need for more performance is driving the development of architectures aimed at accelerating deep learning training and inference beyond the already high performance of GPUs. These new 'AI' architectures are often optimised for common cases in deep learning, typically deep convolutional networks and variations of half-precision floating point. Professor Stevens will review some of these new accelerator design points and approaches to acceleration and scalability, and discuss some of the driver science problems in deep learning.

Speakers


Listen to the audio (mp3)

13:55-14:05
Discussion
14:05-14:35
High-performance sampling of determinantal point processes

Abstract

Although Determinantal Point Processes (DPPs) were introduced for sampling eigenvalue distributions of ensembles of random matrices, they have recently been popularised by their use in encouraging diversity in recommender systems. Traditional sampling schemes have used dense Hermitian eigensolvers to reduce sampling to an equivalent of a low-rank diagonally-pivoted Cholesky factorization, but researchers are starting to understand deeper connections to Cholesky that avoid the need for spectral decompositions. This talk begins with a proof that one can sample a DPP via a trivial tweak of an LDL factorization that flips a Bernoulli coin weighted by each nominal pivot: simply keep an item if the coin lands on heads, or decrement the diagonal entry by one otherwise. The fundamental mechanism is that Schur complement elimination of variables in a DPP kernel matrix generates the kernel matrix of the conditional distribution if said variables are known to be in the sample.
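
The pivot-flipping scheme described above is compact enough to sketch in code. The following Python snippet is a minimal illustration only (it is not the catamari implementation) and assumes a real symmetric marginal kernel K with eigenvalues in [0, 1]; the function name and the example kernel are made up for this sketch.

```python
import numpy as np

def sample_dpp_ldl(K, rng=None):
    """Sample from a DPP with marginal kernel K (real symmetric, eigenvalues
    in [0, 1]) via a tweaked LDL factorization: the nominal pivot is the
    probability of keeping the current item, and on a 'drop' the diagonal
    entry is decremented by one before elimination."""
    rng = np.random.default_rng() if rng is None else rng
    A = np.array(K, dtype=float)           # work on a copy
    n = A.shape[0]
    sample = []
    for j in range(n):
        if rng.random() < A[j, j]:         # Bernoulli coin weighted by the pivot
            sample.append(j)               # heads: keep item j
        else:
            A[j, j] -= 1.0                 # tails: condition on excluding item j
        # Schur complement elimination of variable j: the trailing submatrix
        # becomes the kernel of the conditional distribution.
        A[j+1:, j] /= A[j, j]
        A[j+1:, j+1:] -= np.outer(A[j+1:, j], A[j, j+1:])
    return sample

# Tiny example kernel; repeated calls return different subsets of {0, 1, 2}.
K = np.array([[0.7, 0.2, 0.0],
              [0.2, 0.6, 0.1],
              [0.0, 0.1, 0.5]])
print(sample_dpp_ldl(K))
```

Because each elimination step leaves behind the kernel of the distribution conditioned on the decisions already made, the running diagonal entries remain valid inclusion probabilities throughout the sweep.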

While researchers have begun connecting DPP sampling and Cholesky factorization to avoid expensive dense spectral decompositions, high-performance implementations have yet to be explored, even in the dense regime. The primary contributions of this talk (other than the aforementioned theorem) are side-by-side implementations and performance results of high-performance dense and sparse-direct DAG-scheduled DPP sampling and LDL factorizations. The software is permissively open sourced as part of the catamari project at gitlab.com/hodge_star/catamari.

Speakers


Listen to the audio (mp3)

14:35-14:45
Discussion
14:45-15:10
Coffee
15:10-15:40
Computing beyond the end of Moore’s Law

Abstract

Moore’s Law is a techno-economic model that has enabled the Information Technology (IT) industry to nearly double the performance and functionality of digital electronics roughly every two years within a fixed cost, power and area. Within a decade, the technological underpinnings for the process Gordon Moore described will come to an end as lithography reaches the atomic scale. This talk provides an updated view of what a 2021-2023 system might look like and the challenges ahead, based on our most recent understanding of technology roadmaps. It will also discuss the tapering of historical improvements in lithography, and how this affects the options available for continued scaling of successors to the first exascale machine.

Speakers


Listen to the audio (mp3)

15:40-15:50
Discussion
15:50-16:20
Exascale applications: skin in the game

Abstract

As noted in Wikipedia, skin in the game refers to having “incurred risk by being involved in achieving a goal”, where “skin is a synecdoche for the person involved, and game is the metaphor for actions on the field of play under discussion”. For exascale applications under development in the U.S. Department of Energy (DOE) Exascale Computing Project (ECP), nothing could be more apt: the skin is the exascale applications, and the game is delivering comprehensive science-based computational applications that effectively exploit exascale HPC technologies to provide breakthrough modelling and simulation and data science solutions. These solutions must yield high-confidence insights and answers to our nation’s most critical problems and challenges in scientific discovery, energy assurance, economic competitiveness, health enhancement, and national security.

Exascale applications (and their companion co-designed computational motifs) are a foundational element of the ECP and are the vehicle for delivery of consequential solutions and insight from exascale systems. The breadth of these applications runs the gamut: chemistry and materials; energy production and transmission; earth and space science; data analytics and optimisation; and national security. Each ECP application is focused on targeted development to address a unique mission challenge problem, i.e. one whose solution is amenable to simulation insight, that represents a strategic problem important to a DOE mission program, and that is currently intractable without the computational power of exascale. Tangible progress requires close coordination of exascale application, algorithm, and software development to adequately address six key application development challenges: porting to accelerator-based architectures; exposing additional parallelism; coupling codes to create new multi-physics capability; adopting new mathematical approaches; algorithmic or model improvements; and leveraging optimised libraries.

Each ECP application has a unique development plan based on its requirements-driven combination of physical model enhancements and additions, algorithm innovations and improvements, and software architecture design and implementation. Illustrative examples of these development activities will be given, along with results achieved to date on existing DOE supercomputers such as the Summit system at Oak Ridge National Laboratory.

Speakers


Listen to the audio (mp3)

16:20-16:30
Discussion
16:30-17:00
Poster flash talks
17:00-18:00
Poster session

Chair

09:00-09:30
Machine learning and big scientific data

Abstract

There is now broad recognition within the scientific community that the ongoing deluge of scientific data is fundamentally transforming academic research. Turing Award winner Jim Gray referred to this revolution as 'The Fourth Paradigm: Data Intensive Scientific Discovery'. Researchers now need new tools and technologies to manipulate, analyse, visualise, and manage the vast amounts of research data being generated at the national large-scale experimental facilities. In particular, machine learning technologies are fast becoming a pivotal and indispensable component in modern science, from powering the discovery of modern materials to helping us handle large-scale imagery data from microscopes and satellites. Despite these advances, the science community lacks a methodical way of assessing and quantifying different machine learning ecosystems applied to data-intensive scientific applications. There has so far been little effort to construct a coherent, inclusive and easy-to-use benchmark suite targeted at the ‘scientific machine learning’ (SciML) community. Such a suite would enable an assessment of the performance of different machine learning models, applied to a range of scientific applications running on different hardware architectures – such as GPUs and TPUs – and using different machine learning frameworks such as PyTorch and TensorFlow. In this talk, Professor Hey will outline his approach for constructing such a 'SciML benchmark suite' covering multiple scientific domains and different machine learning challenges. The output of the benchmarks will cover a number of metrics: not only runtime performance, but also metrics such as energy usage and training and inference performance. Professor Hey will present some initial results for some of these SciML benchmarks.
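
As a rough illustration of the kind of record such a suite might produce, the sketch below (hypothetical names and fields, not the actual SciML benchmark suite) times a training routine and an inference routine and bundles the timings with whatever further metrics the harness can capture, such as accuracy or measured energy.

```python
import time
from dataclasses import dataclass, field

@dataclass
class BenchmarkResult:
    """One record produced by a hypothetical SciML-style benchmark run."""
    name: str
    framework: str          # e.g. 'PyTorch' or 'TensorFlow'
    device: str             # e.g. 'GPU' or 'TPU'
    train_seconds: float
    inference_seconds: float
    metrics: dict = field(default_factory=dict)   # accuracy, energy, ...

def run_benchmark(name, framework, device, train_fn, inference_fn, **metrics):
    """Time a training run and an inference run and bundle the results."""
    t0 = time.perf_counter()
    train_fn()
    t1 = time.perf_counter()
    inference_fn()
    t2 = time.perf_counter()
    return BenchmarkResult(name, framework, device,
                           train_seconds=t1 - t0,
                           inference_seconds=t2 - t1,
                           metrics=dict(metrics))
```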

Speakers


Listen to the audio (mp3)

09:30-09:40
Discussion
09:40-10:10
Iterative linear algebra in the exascale era

Abstract

Iterative methods for solving linear algebra problems are ubiquitous throughout scientific and data analysis applications and are often the most expensive computations in large-scale codes. Approaches to improving performance often involve algorithm modification to reduce data movement or the selective use of lower precision in computationally expensive parts. Such modifications can, however, result in drastically different numerical behaviour in terms of convergence rate and accuracy, due to finite precision errors. A clear, thorough understanding of how inexact computations affect numerical behaviour is thus imperative in balancing tradeoffs in practical settings.

In this talk, Dr Carson focuses on two general classes of iterative methods: Krylov subspace methods and iterative refinement. She presents bounds on attainable accuracy and convergence rate in finite precision variants of Krylov subspace methods designed for high performance, and shows how these bounds lead to adaptive approaches that are both efficient and accurate. Then, motivated by recent trends in multiprecision hardware, she presents new forward and backward error bounds for general iterative refinement using three precisions. The analysis suggests that if half precision is implemented efficiently, it is possible to solve certain linear systems up to twice as fast and to greater accuracy.
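
As a concrete, deliberately simplified illustration of the three-precision idea (a sketch only, not Dr Carson's code), the snippet below simulates a half-precision factorization by rounding the matrix to float16 before computing an LU factorization, keeps the solution in single precision, and accumulates residuals in double precision.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def ir_three_precisions(A, b, n_iter=5):
    """Iterative refinement in three precisions: half for the factorization
    (simulated by rounding to float16), single for the working precision,
    and double for the residuals."""
    # 'Half-precision' factorization: NumPy/SciPy have no fp16 LU, so we
    # round the data to fp16 and factorize the rounded matrix in fp32.
    A_lo = A.astype(np.float16).astype(np.float32)
    lu, piv = lu_factor(A_lo)

    x = lu_solve((lu, piv), b.astype(np.float32))      # initial solve (working precision)
    for _ in range(n_iter):
        # Residual in double precision.
        r = b.astype(np.float64) - A.astype(np.float64) @ x.astype(np.float64)
        # Correction solve reuses the low-precision factors.
        d = lu_solve((lu, piv), r.astype(np.float32))
        x = x + d                                      # update in working precision
    return x

# Example: a well-conditioned system whose exact solution is all ones.
A = np.random.default_rng(0).standard_normal((200, 200)) + 200 * np.eye(200)
b = A @ np.ones(200)
print(np.linalg.norm(ir_three_precisions(A, b) - 1.0))  # small error after refinement
```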

As we push toward exascale level computing and beyond, designing efficient, accurate algorithms for emerging architectures and applications is of utmost importance. Dr Carson finishes by discussing extensions in new applications areas and the broader challenge of understanding what increasingly large problem sizes will mean for finite precision computation both in theory and practice.

Speakers


Listen to the audio (mp3)

10:10-10:20
Discussion
10:20-10:50
Coffee
10:50-11:20
Stochastic rounding and reduced-precision fixed-point arithmetic for solving neural ODEs

Abstract

There is increasing interest from users of large-scale computation in smaller and/or simpler arithmetic types. The main reasons for this are energy efficiency and memory footprint/bandwidth. However, the outcome of a simple replacement with lower precision types is increased numerical error in the results, with practical effects ranging from unimportant to intolerable. Professor Furber will describe some approaches for reducing the errors of lower-precision fixed-point types and arithmetic relative to IEEE double-precision floating-point. He will give detailed examples in an important domain for numerical computation in neuroscience simulations: the solution of Ordinary Differential Equations (ODEs). He will look at two common model types and demonstrate that rounding has an important role in producing improved precision of spike timing from explicit ODE solution algorithms. In particular, the group finds that stochastic rounding consistently provides a smaller error magnitude - in some cases by a large margin - compared to single-precision floating-point and fixed-point with round-to-nearest across a range of Izhikevich neuron types and ODE solver algorithms. The group also considers simpler and computationally much cheaper alternatives to full stochastic rounding of all operations, inspired by the concept of 'dither', which is a widely understood mechanism for providing resolution below the LSB in digital audio, image and video processing. Professor Furber will discuss where these alternatives are likely to work well and suggest a hypothesis for the cases where they will not be as effective. These results will have implications for the solution of ODEs in all subject areas, and should also be directly relevant to the huge range of practical problems that are modelled as Partial Differential Equations (PDEs).
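
The core stochastic-rounding operation is simple enough to sketch. The snippet below is a generic illustration of rounding onto a fixed-point grid (not the group's SpiNNaker implementation; the function name and bit width are chosen for the example), and the accumulation loop shows the dither-like behaviour that lets increments below the LSB survive on average.

```python
import numpy as np

def stochastic_round_fixed(x, frac_bits=15, rng=None):
    """Round x onto a fixed-point grid with `frac_bits` fractional bits using
    stochastic rounding: round up with probability equal to the fractional
    residue, so the rounding error is zero in expectation."""
    rng = np.random.default_rng() if rng is None else rng
    scale = float(1 << frac_bits)
    scaled = np.asarray(x, dtype=np.float64) * scale
    floor = np.floor(scaled)
    residue = scaled - floor                              # in [0, 1)
    return (floor + (rng.random(scaled.shape) < residue)) / scale

# Repeatedly adding an increment smaller than the LSB: round-to-nearest would
# stay stuck at zero, whereas stochastic rounding accumulates the right mean.
rng = np.random.default_rng(1)
acc = 0.0
for _ in range(10_000):
    acc = stochastic_round_fixed(acc + 1e-5, frac_bits=15, rng=rng)
print(acc)   # close to 0.1 on average
```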

Speakers


Listen to the audio (mp3)

11:20-11:30
Discussion
11:30-12:00
Rethinking deep learning: architectures and algorithms

Abstract

Professor Constantinides will consider the problem of efficient inference using deep neural networks. Deep neural networks are currently a key driver for innovation in both numerical algorithms and architecture, and are likely to form an important workload for computational science in the future. While algorithms and architectures for these computations are often developed independently, this talk will argue for a holistic approach. In particular, the notion of application efficiency needs careful analysis in the context of new architectures. Given the importance of specialised architectures in the future, Professor Constantinides will focus on custom neural network accelerator design. He will define a notion of computation general enough to encompass both the typical design specification of a computation performed by a deep neural network and its hardware implementation. This will allow us to explore - and make precise - some of the links between neural network design methods and hardware design methods. Bridging the gap between specification and implementation requires us to grapple with questions of approximation, which he will formalise and explore opportunities to exploit. This will raise questions about the appropriate network topologies, finite precision data-types, and design/compilation processes for such architectures and algorithms, and how these concepts combine to produce efficient inference engines. Some new results will be presented on this topic, providing partial answers to these questions, and we will explore fruitful avenues for future research.

Speakers


Listen to the audio (mp3)

12:00-12:10
Discussion
12:10-12:40
Reduced numerical precision and imprecise computing for ultra-accurate next-generation weather and climate models

Abstract

Developing reliable weather and climate models must surely rank amongst the most important tasks to help society become resilient to the changing nature of weather and climate extremes. With such models we can determine the type of infrastructure investment needed to protect against the worst weather extremes, and can prepare for specific extremes well in advance of their occurrence.

Weather and climate models are based on numerical representations of the underlying nonlinear partial differential equations of fluid flow. However, they do not conform to the usual paradigm in numerical analysis: you give me a problem you want solved to a certain accuracy and I will give you an algorithm to achieve this. In particular, with current supercomputers, we are unable to solve the underpinning equations to the accuracy we would like. As a result, models exhibit pervasive biases against observations. These biases can be as large as the signals we are trying to simulate or predict. For climate modelling the numerical paradigm instead becomes this: for a given computational resource, what algorithms can produce the most accurate weather and climate simulations? Professor Palmer argues for an oxymoron: that to maximise accuracy we need to abandon the use of 64-bit precision where it is not warranted. Examples are given of the use of 16-bit numerics for large parts of the model code. The computational savings so made can be reinvested to increase model resolution and thereby increase model accuracy. In assessing the relative degradation of model performance at fixed resolution with reduced-precision numerics, the group utilises the notion of stochastic parametrisation as a representation of model uncertainty. Such stochasticity can itself reduce model systematic error. The benefit of stochasticity suggests a role for low-energy non-deterministic chips in future HPC.
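
As a toy illustration of running model arithmetic in reduced precision (a stand-in only; real weather and climate models are nothing like this three-variable example), the sketch below steps the Lorenz '63 system with every operation carried out in a chosen floating-point type, so the same code can be run in 64-bit and 16-bit arithmetic and the resulting trajectories compared.

```python
import numpy as np

def lorenz_step(state, dt, dtype):
    """One forward-Euler step of the Lorenz '63 system with all arithmetic
    carried out in the given dtype (a toy stand-in for a model time step)."""
    x, y, z = state.astype(dtype)
    sigma, rho, beta = dtype(10.0), dtype(28.0), dtype(8.0 / 3.0)
    tendency = np.array([sigma * (y - x),
                         x * (rho - z) - y,
                         x * y - beta * z], dtype=dtype)
    return state.astype(dtype) + dtype(dt) * tendency

def run(dtype, n_steps=2000, dt=1e-3):
    state = np.array([1.0, 1.0, 1.0])
    for _ in range(n_steps):
        state = lorenz_step(state, dt, dtype)
    return state

print(run(np.float64))   # reference trajectory endpoint
print(run(np.float16))   # the same model stepped in 16-bit arithmetic
```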

Speakers


Listen to the audio (mp3)

12:40-12:50
Discussion

Chair

14:00-14:30
Memory-aware algorithms for automatic differentiation and backpropagation

Abstract

In this talk Dr Pallez will discuss the impact of memory in the computation of automatic differentiation or for the back-propagation step of machine learning algorithms. He will show different strategies based on the amount of memory available. In particular he will discuss optimal strategies when one can reuse memory slots, and when considering a hierarchical memory platform.
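
A minimal sketch of the memory/recomputation trade-off at stake (a hypothetical interface, not Dr Pallez's algorithms): for a chain of elementwise maps x_{i+1} = f_i(x_i), only every k-th intermediate state is stored during the forward sweep, and the states inside each segment are recomputed when the backward sweep reaches it.

```python
import numpy as np

def chain_backprop_checkpointed(fs, dfs, x0, every=4):
    """Backpropagate through a chain x_{i+1} = f_i(x_i) of elementwise maps,
    storing only every `every`-th state in the forward sweep and recomputing
    the rest segment by segment during the backward sweep."""
    n = len(fs)
    checkpoints, x = {}, np.asarray(x0, dtype=float)
    for i in range(n):                      # forward sweep with sparse storage
        if i % every == 0:
            checkpoints[i] = x
        x = fs[i](x)
    grad = np.ones_like(x)                  # seed: d(output)/d(output)
    for seg_start in sorted(checkpoints, reverse=True):
        seg_end = min(seg_start + every, n)
        xs = [checkpoints[seg_start]]       # recompute the states inside the segment
        for i in range(seg_start, seg_end - 1):
            xs.append(fs[i](xs[-1]))
        for i in range(seg_end - 1, seg_start - 1, -1):   # reverse sweep over the segment
            grad = dfs[i](xs[i - seg_start]) * grad       # chain rule for elementwise maps
    return grad

# Example: gradient of sin applied ten times, checkpointing every 4 steps.
fs, dfs = [np.sin] * 10, [np.cos] * 10
print(chain_backprop_checkpointed(fs, dfs, np.array([0.5]), every=4))
```

Holding roughly n/k checkpoints plus k recomputed states peaks at about n/k + k values in memory, minimised when k is near the square root of the chain length, at the price of recomputing each unstored step once; the optimal slot-reuse and hierarchical-memory strategies discussed in the talk address more refined versions of this trade-off.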

Speakers


Listen to the audio (mp3)

14:30-14:40
Discussion
14:40-15:10
Post-K: the first ‘exascale’ supercomputer for convergence of HPC and big data/AI

Abstract

With the rapid rise of Big Data and AI as a new breed of high-performance workloads on supercomputers, we need to accommodate them at scale; hence the need for R&D on hardware and software infrastructures in which traditional simulation-based HPC and Big Data/AI converge. Post-K is the flagship next-generation national supercomputer being developed by Riken R-CCS and Fujitsu in collaboration. Post-K will have hyperscale, datacenter-class resources in a single exascale machine, with well over 150,000 nodes of server-class A64FX many-core Arm CPUs, the first in the world to implement the new SVE (Scalable Vector Extension) and the first to be paired with HBM2 memory, delivering nearly a terabyte/s of memory bandwidth for rapid data movement in both HPC and Big Data, along with AI/deep learning. Post-K's performance target is a 100-times speedup on some key applications compared with its predecessor, the K-Computer, realised through an extensive co-design process involving the entire Japanese HPC community. It is also likely to be the premier big data and AI/ML infrastructure for Japan; currently, the group is conducting research to scale deep learning to more than 100,000 nodes on Post-K, where they expect to obtain near top GPU-class performance on each node.

Speakers


Listen to the audio (mp3)

15:10-15:20
Discussion
15:20-15:50
Coffee
15:50-16:20
Accelerated sparse linear algebra: emerging challenges and capabilities for numerical algorithms and software

Abstract

For the foreseeable future, the performance potential for most high-performance scientific software will require effective use of hosted accelerator processors, characterised by massive concurrency, high bandwidth memory and stagnant latency. These accelerators will also typically have a backbone of traditional networked multicore processors that serve as the starting point for porting applications and the default execution environment for non-accelerated computations.

In this presentation Dr Heroux will briefly characterise experiences with these platforms for sparse linear algebra computations. He will discuss requirements for sparse computations that are distinct from dense computations and how these requirements impact algorithm and software design. He will also discuss programming environment trends that can impact design and implementation choices. In particular, Dr Heroux will discuss strategies and practical challenges of increasing performance portability across the diverse node architectures that continue to emerge, as computer system architects innovate to keep improving performance in the presence of stagnant latencies, and simultaneously add capabilities that address the needs of data science.
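
To make the distinction concrete, the sketch below shows a compressed sparse row (CSR) matrix-vector product in Python (a generic illustration, not code from the speaker's libraries): every nonzero triggers an indirect, data-dependent gather from the input vector, which is why sparse kernels tend to be bound by memory latency and bandwidth rather than floating-point throughput.

```python
import numpy as np

def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for A stored in CSR format (values, column indices, row
    pointers).  The gather x[col_idx[...]] is the irregular, latency-bound
    access pattern that distinguishes sparse from dense kernels."""
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows, dtype=values.dtype)
    for i in range(n_rows):
        start, end = row_ptr[i], row_ptr[i + 1]
        y[i] = np.dot(values[start:end], x[col_idx[start:end]])  # indirect gather
    return y

# Tiny example: a 3x3 matrix with 5 nonzeros.
values  = np.array([4.0, 1.0, 3.0, 2.0, 5.0])
col_idx = np.array([0, 2, 1, 0, 2])
row_ptr = np.array([0, 2, 3, 5])
print(spmv_csr(values, col_idx, row_ptr, np.array([1.0, 1.0, 1.0])))  # [5. 3. 7.]
```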

Finally, Dr Heroux will discuss the role and potential of resilient computations, which may eventually be required as we continue to push the limits of system reliability. While system designers have managed to preserve the illusion of a 'reliable digital machine' for scientific software developers, the cost of doing so continues to increase, as does the risk that a future leadership platform may never reach acceptable reliability levels. If we can develop resilient algorithms and software that survive failure events, we can reduce the cost, schedule delays and complexity challenges of future systems.

Speakers


Listen to the audio (mp3)

16:20-16:30
Discussion
16:30-17:00
Closing remarks

Listen to the audio (mp3)