Storage and indexing of massive data

Event

Starts: 7 February 2013, 09:00

Ends: 8 February 2013, 17:00

Location

Kavli Royal Society Centre, Chicheley Hall, Newport Pagnell, Buckinghamshire, MK16 9JJ

Overview

Theo Murphy international scientific meeting organised by Professor Costas Iliopoulos, Dr Simon Puglisi and Professor Maxime Crochemore.

Event details

The world is drowning in data! The current and ever-increasing rate at which data is produced by scientific and social instruments is fast outstripping our capacity to process and make use of it. This meeting will examine the impressive computational methods developed to date for coping with massive data, and will look forward, sketching a roadmap for the challenges of the decades ahead.

Biographies of the organisers and speakers are available below, and you can download a programme. Recorded audio of the presentations is available by clicking on the speakers' names below.

Enquiries: Contact the events team

Schedule of talks

Session 1

3 talks

Chair

Professor Maxime Crochemore, King’s College London, UK

Histogram indexing and block pattern indexing

Professor Amihood Amir, Bar-Ilan University, Israel

Abstract

Histogram indexing, also known as jumbled pattern indexing or permutation indexing, is one of the important current open problems in pattern matching. It was introduced about six years ago and has seen active research since. Yet, to date there is no algorithm that can preprocess a text T in time o(|T|^2 / polylog |T|) and achieve histogram indexing, even over a binary alphabet, in time independent of the text length.

We show a strong connection between the histogram indexing problem and the block-mass pattern matching problem. The block-mass pattern matching problem is a recently introduced problem, motivated by issues in mass spectrometry. The reduction we show between the two problems is amazingly simple. Its value lies in recognizing the connection between these two apparently disparate problems, rather than in the complexity of the reduction. This reduction trivially achieves progress that has been sought in vain for over half a decade.
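
For intuition on the binary case mentioned above, note that as a length-m window slides one position, its count of 1s changes by at most one, so the achievable counts for each window length form an interval. The sketch below is an illustrative Python toy exploiting exactly this, with the straightforward O(|T|^2) preprocessing; it is not the reduction discussed in the talk.

```python
def build_jumbled_index(text):
    """Naive jumbled-pattern (histogram) index for a binary string.

    For a binary alphabet the achievable 1-counts over all windows of a
    fixed length m form an interval [lo[m], hi[m]], since sliding a
    window changes the count by at most one.  Storing the two extremes
    per length answers any (m, k) query in O(1); preprocessing here is
    a straightforward O(n^2) scan.
    """
    n = len(text)
    ones = [0] * (n + 1)                  # prefix sums of 1s
    for i, c in enumerate(text):
        ones[i + 1] = ones[i] + (c == '1')
    lo = [0] * (n + 1)                    # lo[m]: fewest 1s in a length-m window
    hi = [0] * (n + 1)                    # hi[m]: most 1s in a length-m window
    for m in range(1, n + 1):
        counts = [ones[i + m] - ones[i] for i in range(n - m + 1)]
        lo[m], hi[m] = min(counts), max(counts)
    return lo, hi

def jumbled_query(lo, hi, m, k):
    """Does some window of length m contain exactly k ones?"""
    return 1 <= m < len(lo) and lo[m] <= k <= hi[m]
```

For example, in the text `1011` every window of length 2 holds one or two 1s, so the query (2, 0) fails while (2, 1) and (2, 2) succeed.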

Finding subtree repeats to slash the time for phylogenetic analyses

Dr Solon Pissis, University of Florida, USA and Heidelberg Institute for Theoretical Studies, Germany

Abstract

Given an unordered labeled tree T, our goal is to group repeating subtrees of T into equivalence classes. By unordered, we mean that the order of the descendant nodes (children) of any node of T is irrelevant. We present a simple and time-optimal algorithm for solving this problem, and show that the running time of our method is linear with respect to the size of T.

We show how this algorithm can be applied to minimize the number of operations required to compute the likelihood of a phylogenetic tree comprising thousands of species and massive amounts of whole-transcriptome or even whole-genome molecular data. Moreover, we show how the presented algorithm can easily be modified to operate on trees that do not satisfy some or any of the aforementioned assumptions on the tree structure. For instance, it can also be applied to ordered or unlabeled trees.
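
As a toy illustration of the equivalence-class idea, one can canonise each subtree bottom-up; this is a hypothetical sketch, not the authors' time-optimal method (sorting the child signatures already costs more than linear time).

```python
def group_subtree_repeats(tree):
    """Group repeating unordered subtrees into equivalence classes.

    `tree` maps node id -> (label, [child ids]).  Two subtrees are
    equivalent iff their roots carry the same label and the same
    multiset of child classes, so each distinct (label, sorted child
    classes) signature gets a fresh class id, assigned bottom-up.
    """
    ids = {}         # canonical signature -> class id
    classes = {}     # class id -> list of subtree roots in that class

    def visit(v):
        label, children = tree[v]
        sig = (label, tuple(sorted(visit(c) for c in children)))
        cid = ids.setdefault(sig, len(ids))
        classes.setdefault(cid, []).append(v)
        return cid

    all_children = {c for _, cs in tree.values() for c in cs}
    for root in set(tree) - all_children:
        visit(root)
    return classes
```

On the tree `{1: ('a', [2, 3]), 2: ('b', []), 3: ('b', [])}` the two leaf subtrees labelled `b` fall into one class, regardless of child order at the root.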

Session 2

3 talks

Chair

Professor Amihood Amir, Bar-Ilan University, Israel

Algorithmic challenges in regulatory networks

Professor Esko Ukkonen, University of Helsinki, Finland

Abstract

Transcription factors (TFs) are proteins that regulate gene activity. By binding to a specific DNA region of other genes they turn them on (or off), initiating the production of another protein. This induces network-structured regulatory cascades in the DNA-driven machinery of a cell. The TF binding sites (motifs) in DNA are often modeled using so-called position weight matrices. This model specifies a multinomial distribution of sequences and assumes that different positions within a motif are independent of each other. To avoid false positives in binding site prediction, more accurate models are needed. We are developing models that utilize the fact that TFs bind not only to DNA but also to each other. While the individual interactions between TFs are often weak, the complexes formed by both TF-DNA and TF-TF interactions are much more stable than those based on TF-DNA interactions alone. Thus, the dimerization mediated by DNA gives a much richer regulatory machinery that has a potentially stronger capability to explain the regulation of gene expression.

The talk will describe recent developments in modeling and predicting such regulatory complexes, using alignment-based and alignment-free techniques and data from high-throughput SELEX experiments, and applying this to explain observed binding patterns in DNA. (Joint work with J. Taipale, T. Kivioja, P. Rastas, A. Jolma and J. Toivonen.)
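
The position-weight-matrix model can be made concrete in a few lines. The matrix, background frequencies and motif below are hypothetical stand-ins, and the score treats each motif position independently, which is precisely the assumption the abstract argues is too weak.

```python
import math

# Hypothetical position weight matrix for a length-4 motif:
# pwm[i][base] is the probability of observing `base` at position i.
pwm = [
    {'A': 0.7, 'C': 0.1, 'G': 0.1, 'T': 0.1},
    {'A': 0.1, 'C': 0.7, 'G': 0.1, 'T': 0.1},
    {'A': 0.1, 'C': 0.1, 'G': 0.7, 'T': 0.1},
    {'A': 0.1, 'C': 0.1, 'G': 0.1, 'T': 0.7},
]
BACKGROUND = 0.25  # uniform background nucleotide model

def pwm_score(window):
    """Log-odds score of a candidate site: positions scored independently."""
    return sum(math.log2(pwm[i][b] / BACKGROUND) for i, b in enumerate(window))

def best_site(dna, m=4):
    """Naive binding-site prediction: the highest-scoring window and its start."""
    return max((pwm_score(dna[i:i + m]), i) for i in range(len(dna) - m + 1))
```

Scanning `TTACGTAA` with this matrix picks the window `ACGT` at position 2, the only window matching the preferred base at every position.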

Order preserving matching on massive data

Professor Kunsoo Park, Seoul National University, South Korea

Abstract

We introduce a new string matching problem called order-preserving matching on numeric strings, where a pattern matches a text if the text contains a substring whose relative orders coincide with those of the pattern. Order-preserving matching is applicable to many scenarios, such as stock price analysis and musical melody matching, in which the order relations should be matched instead of the strings themselves.

Solving order-preserving matching has to do with representations of order relations of a numeric string. We define prefix representation and nearest neighbor representation, which lead to efficient algorithms for order-preserving matching. We present efficient algorithms for single and multiple pattern cases. For single pattern matching, we give an O(n log m) time algorithm and optimize it further to obtain O(n + m log m) time. For multiple pattern matching, we give an O(n log m) time algorithm.
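
A naive baseline makes the problem statement concrete. This is an illustrative O(nm log m) sketch that re-sorts every window; the algorithms described in the talk avoid that repeated work to reach the stated bounds.

```python
def order_signature(s):
    """Relative-order signature of a numeric sequence: the rank of each
    element within the window, ties broken by position (one common
    convention; the talk's representations differ)."""
    order = sorted(range(len(s)), key=lambda i: (s[i], i))
    rank = [0] * len(s)
    for r, i in enumerate(order):
        rank[i] = r
    return tuple(rank)

def op_match(text, pattern):
    """Start positions where `text` has a substring order-isomorphic to
    `pattern`.  Naive: recompute the signature of each window."""
    m, sig = len(pattern), order_signature(pattern)
    return [i for i in range(len(text) - m + 1)
            if order_signature(text[i:i + m]) == sig]
```

For instance, the pattern [1, 3, 2] (low, high, middle) matches the prices [10, 20, 15] at the start of the series below, and nowhere else.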

Session 3

3 talks

Chair

Professor Kunsoo Park, Seoul National University, South Korea

Compression and indexing of DNA sequence data

Dr Anthony Cox, Illumina Cambridge Ltd, UK

Abstract

The Burrows–Wheeler transform (BWT) is the foundation of many algorithms for compressing and indexing text data. With the advent of “next-generation” DNA sequencing technologies, collections of billions of DNA “reads” (sequences) are now commonplace in bioinformatics. Widely used tools such as Bowtie and BWA accelerate the alignment of such datasets by employing BWT-based indexes of the genomes they are aligned to, but the relatively high computational cost of constructing these indexes for large datasets has meant that the indexing of the reads themselves has remained comparatively unexplored.

In earlier studies we presented algorithms that allow the BWT to be built on moderate hardware for datasets of the scales typically encountered in whole human genome sequencing. Here I will discuss the implications of this work for data compression and DNA sequence analysis, describing how we have applied our methods to gene finding and to the identification of bacteria in metagenomic samples.
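
For readers unfamiliar with the transform itself, here is a minimal definition-by-code via rotation sorting. This quadratic toy only shows what the BWT is; the construction algorithms the talk concerns are built to scale far beyond it.

```python
def bwt(s, sentinel='$'):
    """Burrows-Wheeler transform by naive rotation sorting.

    Append a unique smallest sentinel, sort all rotations of the
    string, and read off the last column.  O(n^2 log n): fine for a
    demonstration, hopeless for billions of reads.
    """
    s += sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(rot[-1] for rot in rotations)
```

The classic example: `bwt("banana")` yields `annb$aa`, grouping the repeated characters that make the transform so compressible.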

Unfolding protein folding

Dr Sophia Kossida, Biomedical Research Foundation of the Academy of Athens, Greece

Abstract

Secondary, tertiary and quaternary structure is far more conserved in proteins than the corresponding primary amino acid sequence. Therefore evolutionary relationships of proteins, protein structure–function predictions, molecular docking and comparative modeling should all be based on analyses, searches and databases containing structural information. Herein, a new tool has been developed that performs protein similarity searches based on protein secondary structural elements rather than just primary sequence information. The query input sequence can be of either known or unknown structure. In both cases the primary amino acid sequence is converted to amino acid Structural Features Sequence (SFS) format. The SFS format is a ‘per-residue’ annotation method based on the structural conformation of each amino acid in the query sequence (α-helix, β-sheet, coil, etc). If the query input sequence is of unknown structure, it is subjected to secondary structure prediction algorithms and the SFS format is deduced from the result. Our algorithm is broken down into three parts: in part 1, the query sequence is converted to SFS format; in part 2, the query SFS is structurally aligned against the reference SFS databases; and finally, in part 3, the structural similarity results are combined with classic primary-sequence-based results and output to the user.
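
The first two parts of such a pipeline can be caricatured in a few lines. The predictor, the H/E/C alphabet and the similarity score below are hypothetical stand-ins for illustration only, not the tool described in the talk.

```python
def to_sfs(sequence, predict):
    """Part 1 (sketch): convert a primary sequence to a Structural
    Features Sequence, one secondary-structure letter per residue
    (here H = alpha-helix, E = beta-sheet, C = coil).  `predict` is a
    stand-in for a secondary structure predictor or known structure."""
    return ''.join(predict(sequence, i) for i in range(len(sequence)))

def sfs_similarity(query_sfs, ref_sfs):
    """Part 2 (toy scoring): fraction of aligned positions sharing the
    same structural state, over the shorter length.  A real tool would
    use a proper structural alignment here."""
    n = min(len(query_sfs), len(ref_sfs))
    return sum(query_sfs[i] == ref_sfs[i] for i in range(n)) / n
```

With a dummy predictor that marks A/E/L residues helical, `to_sfs("ALG", ...)` gives `HHC`, which then scores 2/3 against a reference `HEC`.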

Session 4

3 talks

Chair

Professor Maxime Crochemore, King’s College London, UK

Toward efficient compression of NGS data

Dr Laurent Mouchard, University of Rouen, France

Abstract

Automatic sequencers were used intensively in the 1990s and contributed to the sequencing and assembly of the human genome in 2000.

Nowadays, new technologies such as Next-Generation Sequencing (NGS), also named High-Throughput Sequencing (HTS), can decode huge amounts of DNA sequence (95 GB per run) within a few weeks from a minimal biological sample, for approximately €2,000.

These massive data (tens of millions of paired-end, 75 bp-long reads) have to be processed to obtain a de novo assembly, or mapped onto a reference genome to distinguish pathological differences from regular polymorphism between individuals.

Efficiently storing the data produced by NGS technologies, as well as the data processed by assemblers and mappers, is mandatory: the original raw data may have to be reused whenever a new tool or application appears that offers new possibilities or provides more sensitive results.

Here we present recent strategies that have been developed for the efficient storage of NGS data, in particular FASTQ files; we discuss their intrinsic limits and draw perspectives for future work.
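
A FASTQ record interleaves three very different streams, which is one reason NGS compressors typically separate them before modelling. A minimal parser, illustrative only, showing the four-line record layout:

```python
def read_fastq(lines):
    """Yield (read id, bases, qualities) from FASTQ text.

    Each record is four lines: an '@'-prefixed identifier, the base
    sequence, a '+' separator, and per-base quality scores.  Dedicated
    compressors usually split the identifier, base and quality streams
    and model each one separately.
    """
    it = iter(lines)
    for header in it:
        seq, _plus, qual = next(it), next(it), next(it)
        yield header.lstrip('@').strip(), seq.strip(), qual.strip()
```

For example, the four lines `@read1`, `ACGT`, `+`, `IIII` parse to a single `("read1", "ACGT", "IIII")` record.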

Encodings for range selection and top-k queries

Professor Rajeev Raman, University of Leicester, UK

Abstract

An array A[1..n] of values can be preprocessed in linear time so that later, given i and j, one can compute in constant time the position of a maximum in A[i..j]. The structure does not access A and uses 2n+o(n) bits, which is asymptotically optimal. In this paper we generalize this result in various ways to the problem of finding the k-th maximum in a range, or the top-k elements in a range, without accessing A. We give upper and lower bounds, which match in some cases.
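
What makes such an encoding possible is that range-maximum answers depend only on the shape of the array's Cartesian tree, never on the values themselves. A plain-Python illustration of that independence follows; it is pointer-based and takes O(depth) per query, not the succinct constant-time structure of the abstract.

```python
def cartesian_tree(A):
    """Parent array of the max-Cartesian tree of A (stack construction).

    The tree shape alone determines every range-maximum position,
    which is why queries below never touch A again.
    """
    parent, stack = [-1] * len(A), []
    for i, x in enumerate(A):
        last = -1
        while stack and A[stack[-1]] < x:
            last = stack.pop()      # popped nodes become the left subtree
        if stack:
            parent[i] = stack[-1]
        if last != -1:
            parent[last] = i
        stack.append(i)
    return parent

def range_argmax(parent, i, j):
    """Position of a maximum in A[i..j]: the lowest common ancestor of
    i and j in the Cartesian tree, found by walking up parent pointers."""
    ancestors, v = set(), i
    while v != -1:
        ancestors.add(v)
        v = parent[v]
    v = j
    while v not in ancestors:
        v = parent[v]
    return v
```

On A = [1, 3, 2, 4], queries over [0, 2] and [2, 3] return positions 1 and 3 respectively, reading only the parent array.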

Session 5

3 talks

Chair

Professor Gad Landau, University of Haifa, Israel and NYU-Poly, USA

External suffix arrays for large-scale string search

Professor Alistair Moffat, University of Melbourne, Australia

Abstract

We consider the problem of string search (identifying the locations in a fixed text T of length n at which patterns P of length m appear), focussing on situations in which T is static but large, and it is not possible to hold an index to it (even when compressed) in main memory. In such cases, limiting the number of random-access disk transfers required during index processing is the paramount consideration. Here we give a comprehensive evaluation and implementation description of a recent approach to this problem that never requires more than two disk accesses per pattern P for count queries, and yet consumes around half the storage of a conventional in-memory suffix array data structure. The new mechanism makes use of succinct data structures and a searching approach based on the Burrows–Wheeler transform. Experiments with this and a range of alternatives have been carried out against a variety of data types and files of up to 64 GB, using a laptop computer with just 4 GB of main memory, and demonstrate the speed and versatility of the new approach.

Large scale detection of repetitions

Professor Bill Smyth, McMaster University, Canada

Abstract

Combinatorics on words began more than 100 years ago with a demonstration that an infinitely long string with NO repetitions can be constructed on an alphabet of only three letters. Computing ALL the repetitions in a string is one of the oldest and most important problems of computational stringology. About a dozen years ago it was discovered that repetitions can be computed as a byproduct of the linear-time computation of all the maximal periodicities (or runs) in a string. However, even though the computation is linear, it is also brute force: global data structures that make heavy use of space need to be computed in a preprocessing phase, despite the fact that the expected number of runs in a string is known to be sparse in the string length. In this talk I explore the possibility that repetitions (and perhaps other regularities in strings) can be computed in a manner commensurate with the size of the output. This improvement in software, derived from enhanced combinatorial insight, could enable us to process terabytes of data tomorrow as we process gigabytes today, with no speedup in hardware required.
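
A brute-force enumeration makes the object of study concrete. The sketch below finds all runs in cubic-ish time, very far from the linear or output-sensitive computations the talk concerns:

```python
def runs(s):
    """All runs in s by brute force: maximal intervals [i, j] such that
    s[i..j] has period p and length at least 2p.  Periods are tried
    smallest first, so each interval is reported with its smallest period.
    """
    n, found = len(s), []
    for p in range(1, n // 2 + 1):           # candidate periods
        k = p
        while k < n:
            if s[k] != s[k - p]:
                k += 1
                continue
            start = k
            while k < n and s[k] == s[k - p]:
                k += 1                       # extend the periodic stretch
            i, j = start - p, k - 1          # maximal stretch with period p
            if j - i + 1 >= 2 * p and all((i, j) != (ri, rj)
                                          for ri, rj, _ in found):
                found.append((i, j, p))
    return sorted(found)
```

The textbook example "aabaabaa" has exactly four runs: three squares "aa" and the whole string with period 3.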

Session 6

3 talks

Chair

Dr Anthony Cox, Illumina Cambridge Ltd, UK

Scalable index construction

Dr Juha Kärkkäinen, University of Helsinki, Finland

Abstract

Full-text indexes are powerful tools for indexing and analysing sequential data. Their construction is the computationally most demanding task in many applications. In this talk I will give an overview of construction techniques that can be scaled to handle massive sequence collections using compressed representations, external memory computation, parallelisation and distributed computing.  The main focus is on using suffix sorting algorithms to compute the suffix array or the Burrows-Wheeler transform from which most full-text indexes can be constructed. If time allows, I will briefly touch some other topics too including longest-common-prefix array construction and Lempel-Ziv parsing.
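
The baseline that scalable construction improves upon fits in two lines: direct comparison sorting of the suffixes, shown here only for clarity, is exactly what serious suffix sorting algorithms avoid.

```python
def suffix_array(s):
    """Suffix array by direct suffix sorting: O(n^2 log n) worst case,
    and a memory hog, hence the need for the scalable constructions
    the talk surveys."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def bwt_from_sa(s, sa):
    """The Burrows-Wheeler transform read off the suffix array; s must
    end with a unique smallest sentinel such as '$'."""
    return ''.join(s[i - 1] for i in sa)
```

For "banana$" the sorted suffix order is [6, 5, 3, 1, 0, 4, 2], from which the BWT "annb$aa" follows immediately.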

Understanding the data environment: the case of big social data

Dr Mark Elliot, University of Manchester, UK

Abstract

The term “Data Environment” denotes the context in which data reside, a context consisting of other data, users, infrastructures, and governance protocols and practices. Data Environment Analysis (DEA) is employed within the field of Statistical Disclosure Control (see e.g. Duncan et al., 2011) to reconcile legal and ethical privacy constraints with the exigencies of the Open Data agenda. In that context, DEA enables us to identify whether data is truly personal or not. In the era of big and linked data, DEA has the potential to serve another purpose: to provide a map of the information society/ies. This paper describes the process of data environment analysis, the initial work that has been done on automating DEA at Manchester, and the further work that we propose.

Session 7

3 talks

Chair

Professor Bill Smyth, McMaster University, Canada

Succinct data structures for strings and sequences

Professor Roberto Grossi, University of Pisa, Italy

Abstract

The talk describes some problems that arise in industrial applications and how succinct and compressed data structures for strings and sequences can help using little space. The first part of the talk is devoted to semi-indexing semi-structured data, e.g. query log data in JSON format. The second part involves prefix searching on a set of strings, which is helpful e.g. in search engines, and describes some succinct data structures for storing a set of strings. The third part of the talk considers a sequence of strings (not to be confused with a set of strings) and introduces the wavelet trie for storing textual data (e.g. URLs) in time order and supporting extra functionalities that exploit this order. The discussed results are joint work with Giuseppe Ottaviano, presented at ACM CIKM 2011, ACM-SIAM ALENEX 2012, and ACM PODS 2012.

Algorithms on grammar-compressed strings

Professor Gad Landau, University of Haifa, Israel and NYU-Poly, USA

Abstract

Grammar based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures many of the popular compression schemes, including the Lempel-Ziv family, Run-Length Encoding, Byte-Pair Encoding, Sequitur and Re-Pair.

Let S be a string of length N given as a grammar G(S) of size n. The random access problem is to compactly represent G(S) while supporting fast random access queries. That is, given an index i, report S[i] without decompressing S. We will first present a linear-space representation of G(S) that supports O(log N) random access time. This representation extends to efficiently support substring decompression. Namely, we can decompress any substring S[i..j] in the same complexity as a random access query plus additional O(j - i) time.
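
The core trick, augmenting every grammar rule with the length of its expansion so a query can descend without decompressing, can be sketched as follows. This is a plain pointer-based toy whose query time is the grammar depth, not the O(log N) of the representation described above.

```python
class SLP:
    """Straight-line program: each rule is a terminal character or a
    pair of earlier rules.  Stored expansion lengths steer random access."""

    def __init__(self):
        self.rules = []     # 'c' for terminals, (left, right) rule ids otherwise
        self.length = []    # length of the string each rule expands to

    def terminal(self, c):
        self.rules.append(c)
        self.length.append(1)
        return len(self.rules) - 1

    def pair(self, left, right):
        self.rules.append((left, right))
        self.length.append(self.length[left] + self.length[right])
        return len(self.rules) - 1

    def access(self, v, i):
        """Character i of the string generated by rule v, found by
        descending the grammar; no substring is ever decompressed."""
        while not isinstance(self.rules[v], str):
            left, right = self.rules[v]
            if i < self.length[left]:
                v = left
            else:
                i -= self.length[left]
                v = right
        return self.rules[v]
```

A grammar for "abab" built from rules a, b, ab, (ab)(ab) answers `access` for every position without ever materialising the string.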

Once we obtain an efficient substring decompression method, it can then serve as a basis for a compressed version of classical pattern matching. Namely, we can take any black-box (uncompressed) approximate pattern matching algorithm and turn it into a corresponding algorithm over grammar compressed strings. We will then focus on a specific algorithm for computing the edit distance of two grammar-compressed strings. This algorithm requires O(nN) time and uses the compression in a more complicated way (i.e., not through random access queries).

Session 8

2 talks

Chair

Professor Esko Ukkonen, University of Helsinki, Finland

Fast random access and approximate pattern matching for compressed highly repetitive text collections

Dr Travis Gagie, University of Helsinki, Finland

Abstract

Motivated by the imminent growth of massive, highly redundant databases in genomics and the World Wide Web, we study the problem of compressing a string database while simultaneously supporting fast random access, substring extraction and pattern matching on the underlying string(s). We describe several data structures and space-time tradeoffs for these problems, all of which exploit the Lempel-Ziv (LZ77) factorisation of the underlying collection to achieve both small space usage and fast runtimes.
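
The LZ77 factorisation these structures exploit can be computed naively in quadratic time; the sketch below is illustrative only (real constructions are far more efficient), but it shows why highly repetitive inputs yield very few phrases relative to their length.

```python
def lz77_factorise(s):
    """Greedy LZ77 factorisation: each phrase is either a fresh single
    character or (start, length) of the longest prefix of the remainder
    that also occurs starting at an earlier position (overlaps allowed).
    Naive O(n^2) search over all earlier start positions."""
    factors, i = [], 0
    while i < len(s):
        best_len, best_pos = 0, -1
        for j in range(i):                    # candidate earlier occurrences
            l = 0
            while i + l < len(s) and s[j + l] == s[i + l]:
                l += 1
            if l > best_len:
                best_len, best_pos = l, j
        if best_len == 0:
            factors.append(s[i])              # literal: a new character
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors
```

The repetitive string "abababa" factorises into just three phrases, the last a single self-overlapping copy covering five characters.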
