Theo Murphy international scientific meeting organised by Professor Costas Iliopoulos, Dr Simon Puglisi and Professor Maxime Crochemore.
The world is drowning in data! The current and ever-increasing rate at which data is being produced by scientific and social instruments is fast outstripping our capacity to process and make use of it. This meeting will examine the impressive computational methods developed to date for coping with massive data, and look forward, sketching a roadmap for the challenges of the decades ahead.
Biographies of the organisers and speakers are available below and you can download a programme. Recorded audio of the presentations is available by clicking on the names of the speakers below.
Enquiries: Contact the events team
Professor Costas Iliopoulos, King’s College London, UK
Costas S Iliopoulos received a BSc in Mathematics from the University of Athens in 1980, an MSc by research in Computer Science in 1981 and a PhD in Computer Science in 1983, both from Warwick University. He is currently Professor of Algorithm Design at King's College London.
Prior to that he was Assistant Professor at the New Jersey Institute of Technology, Assistant Professor at Purdue University and Lecturer/Reader at Royal Holloway College of the University of London. He has served as editor of the Journal of Combinatorial Mathematics and Combinatorial Computing (since 2000) and as an editor of the International Journal of Systems Biology and Biomedical Technologies. He is also editor of the Theoretical Computer Science book series, World Scientific (since 2000), and has been Editor-in-Chief and Managing Editor of the Journal of Discrete Algorithms since 1998. He is co-chair of the IFIP Technical Committee on Bioinformatics and a member of the Steering Committee of the International Workshop on Combinatorial Algorithms. He has served on numerous conference program committees, including SPIRE, CPM, CATS, AWOCA, ISAAC, LATA, PSW, LAW, PART, MFCS and Combionets. He has supervised over 25 PhD students. He has held numerous Royal Society grants with partners in Korea, France, Taiwan, Australia, Prague, Warsaw, South Africa and elsewhere, as well as several EPSRC grants, and he is a member of EPSRC and Royal Society panels.
Dr Simon Puglisi, King’s College London, UK
Simon J Puglisi is a Newton Fellow in the Department of Informatics, King's College London. He completed his PhD at Curtin University in December 2007, and subsequently was awarded an Australian Postdoctoral Fellowship at Royal Melbourne Institute of Technology, where he became a member of the Search Engine Group. His current research focuses on efficient algorithms and data structures for searching and manipulating large strings, trees, and graphs, and applications in bioinformatics, information retrieval, and data mining. Frequent themes are the interplay between search and compression, and the boundary of theory and practice.
Professor Maxime Crochemore, King’s College London, UK
Professor Maxime Crochemore received his PhD in 1978 and his Doctorat d'état (DSc) in 1983 from the University of Rouen (France). He obtained his first professorship at the University of Paris-Nord in 1985, where he served as President of the Department of Mathematics and Computer Science for two years. He became professor at the University Paris 7 in 1989 and was involved in the creation of the University of Marne-la-Vallée, where he has been Professor Emeritus since 2007. He also created the Computer Science research laboratory of this university in 1991 and directed it until 2005. He was Deputy Scientific Director of the Information and Communication Department of CNRS from 2004 to 2006. He was a Senior Research Fellow from 2002 to 2007 and is presently Professor at King's College London.
Professor Crochemore's research interests are in the design and analysis of algorithms. His major achievements are on string algorithms, including pattern matching, text indexing, coding and text compression. He also works on the combinatorial background of these subjects and on their applications to bio-informatics. He has co-authored several textbooks on algorithms and published more than 200 articles, of which 144 are listed on the DBLP Bibliography Server as of February 2011.
He has been the recipient of several French grants on string algorithms and bio-informatics, has participated in numerous international projects on algorithms, and has supervised to completion more than twenty PhD students.
Professor Maxime Crochemore, King's College London, UK (Chair)
Professor Amihood Amir, Bar-Ilan University, Israel: Histogram indexing and block pattern indexing
Amihood Amir received his PhD in 1983 at Bar-Ilan University. He did his post-doc at MIT, and was on the faculty at Tufts, University of Maryland at College Park, and Georgia Tech, before joining Bar-Ilan University in 1994. Today he is a professor of Computer Science at Bar-Ilan University and a Research Professor at Johns Hopkins University. Amir had been head of the Bar-Ilan Responsa Project, chairman of the Department of Computer Science and dean of the College of Exact Sciences at Bar-Ilan University. He has chaired the Computer Science Advisory Committee of the Israel Ministry of Education, and was a panel member on various panels of the NSF, Israel Ministry of Science, and the Israel Institute for Higher Education.
Amihood Amir is the author of over a hundred research papers, and recipient of grants from the NSF, ISF, BSF, the Israel Ministry of Science, and the Israel Ministry of Industry and Commerce. In addition, he was a co-PI on bi-national grants with the UK, Italy, France and Finland. Amir serves on the editorial board of Information and Computation, and served on the program committee of dozens of major Computer Science conferences.
Histogram indexing, also known as jumbled pattern indexing or permutation indexing, is one of the important current open problems in pattern matching. It was introduced about six years ago and has seen active research since. Yet, to date there is no algorithm that can preprocess a text T in o(|T|^2 / polylog |T|) time and achieve histogram indexing, even over a binary alphabet, in time independent of the text length.
We show a strong connection between the histogram indexing problem and the block-mass pattern matching problem. The block-mass pattern matching problem is a recently introduced problem, motivated by issues in mass spectrometry. The reduction we show between the two problems is amazingly simple. Its value lies in recognizing the connection between these two apparently disparate problems, rather than the complexity of the reduction. This reduction trivially achieves progress that has been sought in vain for over half a decade.
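To fix ideas for readers new to the problem, here is a minimal sketch of the naive approach the open problem seeks to beat: a pattern P has a jumbled (histogram) occurrence at position i of T exactly when the character counts of T[i..i+|P|-1] equal those of P, which a sliding window can verify in time linear in |T| per query. The function name and interface below are illustrative only; this is not the reduction presented in the talk.

```python
from collections import Counter

def jumbled_occurrences(text, pattern):
    """Naive jumbled (histogram) matching: report every position i at which
    text[i:i+len(pattern)] is a permutation of pattern.  A sliding character
    histogram answers one query in O(|T|) time; the open problem asks for an
    index that avoids this linear scan per query."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    target = Counter(pattern)
    window = Counter(text[:m])
    hits = [0] if window == target else []
    for i in range(1, n - m + 1):
        out_ch, in_ch = text[i - 1], text[i + m - 1]
        window[out_ch] -= 1
        if window[out_ch] == 0:
            del window[out_ch]
        window[in_ch] += 1
        if window == target:
            hits.append(i)
    return hits

# e.g. jumbled_occurrences("abaab", "ab") == [0, 1, 3]
```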
Dr Solon Pissis, University of Florida, USA and Heidelberg Institute for Theoretical Studies, Germany: Finding subtree repeats to slash the time for phylogenetic analyses
Dr Solon P Pissis received his 4-year Diploma (2007) in Computer Science from the University of Athens, Greece, his MSc (2008) in High-performance Computing from the University of Edinburgh, UK, and his PhD (2012) in Computer Science from King’s College London, UK. He is currently a Research Associate at the University of Florida, USA and a visiting scientist at Heidelberg Institute for Theoretical Studies, Germany. His main areas of research include algorithms for molecular sequence analysis, approximate string-matching algorithms, combinatorics on words, parallel algorithms and high-performance computing, and computational molecular evolution.
Given an unordered labeled tree T, our goal is to group repeating subtrees of T into equivalence classes. By unordered, we mean that the order of the descendant nodes (children) of any node of T is irrelevant. We present a simple and time-optimal algorithm for solving this problem, and show that the running time of our method is linear with respect to the size of T.
We show how this algorithm can be applied to minimize the number of operations required to compute the likelihood of a phylogenetic tree comprising thousands of species and massive amounts of whole-transcriptome or even whole-genome molecular data. Moreover, we show how the presented algorithm can easily be modified to operate on trees that do not satisfy some or any of the aforementioned assumptions on the tree structure. For instance, it can also be applied to ordered or unlabeled trees.
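As a point of reference, the sketch below shows the standard canonical-labelling idea for grouping repeating unordered subtrees: two subtrees fall in the same class exactly when their root labels agree and their children's classes agree as multisets. The tree encoding (a dict mapping node ids to a label and child list) is a hypothetical choice, and this is not the speakers' time-optimal algorithm, only an illustration of the problem.

```python
def group_repeating_subtrees(tree):
    """Group repeating unordered labeled subtrees into equivalence classes.
    `tree` maps a node id to (label, [child ids]) -- a hypothetical encoding.
    Two subtrees receive the same class id exactly when their root labels
    match and their children's class ids match as multisets, so child order
    is irrelevant.  A plain canonical-labelling sketch, not the time-optimal
    algorithm presented in the talk."""
    classes = {}      # canonical key -> class id
    members = {}      # class id -> list of node ids whose subtrees repeat

    def visit(node):
        label, children = tree[node]
        key = (label, tuple(sorted(visit(c) for c in children)))
        cid = classes.setdefault(key, len(classes))
        members.setdefault(cid, []).append(node)
        return cid

    roots = set(tree) - {c for _, kids in tree.values() for c in kids}
    for root in roots:
        visit(root)
    return members
```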
Professor Amihood Amir, Bar-Ilan University, Israel (Chair)
Professor Esko Ukkonen, University of Helsinki, Finland: Algorithmic challenges in regulatory networks
Esko Ukkonen has been Professor of Computer Science at the University of Helsinki since 1985 and Head of the Department of Computer Science since 2010. He has held visiting positions at the University of California, Berkeley, the University of Freiburg, and the University of Bielefeld. Dr Ukkonen has published about 160 original scientific articles. His research interests include algorithms and data structures, combinatorial pattern matching, machine learning, and computational biology (biological sequence analysis in particular). In 1999-2001 he was chairman of the Finnish Society for Computer Science, in 1999-2004 an academy professor of the Academy of Finland, and since 2002 he has directed a centre of excellence (From Data to Knowledge & Algorithmic Data Analysis) of the Academy of Finland. In 2004-2008 he was Research Director of the Helsinki Institute for Information Technology (HIIT), a joint institute of Aalto University and the University of Helsinki. Dr Ukkonen has received the Biotechnology Prize of the Alko Group (1996) and the Science Prize of the City of Helsinki (2007), and he is a member of the Finnish Academy of Science and Letters.
Transcription factors (TFs) are proteins that regulate gene activity. By binding to a specific DNA region of other genes they turn them on (or off), initiating the production of another protein. This induces network-structured regulatory cascades of the DNA-driven machinery of a cell. The TF binding sites (motifs) in DNA are often modeled using so-called position weight matrices. This model specifies a multinomial distribution of sequences and assumes that different positions within a motif are independent of each other. To avoid false positives in binding site prediction, more accurate models are needed. We are developing models that exploit the fact that TFs bind not only to DNA but also to each other. While the individual interactions between TFs are often weak, the complexes formed by both TF-DNA and TF-TF interactions are much more stable than those based on TF-DNA interactions alone. Thus, the dimerization mediated by DNA gives a much richer regulatory machinery that has a potentially stronger capability to explain the regulation of gene expression. The talk will describe recent developments in modeling and predicting such regulatory complexes, using alignment-based and alignment-free techniques and data from high-throughput SELEX experiments, and applying this to explain observed binding patterns in DNA. (Joint work with J Taipale, T Kivioja, P Rastas, A Jolma and J Toivonen.)
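For orientation, the sketch below scores DNA windows against a position weight matrix under the position-independence assumption the talk aims to go beyond; the pwm and background inputs and the threshold are hypothetical placeholders, and the richer TF-TF dimerisation models are not represented here.

```python
import math

def pwm_score(pwm, background, window):
    """Log-odds score of one DNA window under a position weight matrix.
    pwm[i][base] is the motif's probability of `base` at position i and
    background[base] the genomic background frequency; both are hypothetical
    inputs.  Positions are scored independently, which is exactly the
    simplifying assumption the talk argues is too weak on its own."""
    return sum(math.log2(pwm[i][b] / background[b]) for i, b in enumerate(window))

def predicted_sites(pwm, background, sequence, threshold):
    """Report every window whose score reaches the (hypothetical) threshold."""
    m = len(pwm)
    return [i for i in range(len(sequence) - m + 1)
            if pwm_score(pwm, background, sequence[i:i + m]) >= threshold]
```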
Professor Kunsoo Park, Seoul National University, South Korea: Order-preserving matching on massive data
Kunsoo Park received BS and MS in computer engineering from Seoul National University in 1983 and 1985, respectively, and PhD in computer science from Columbia University in 1991. From 1991 to 1993 he was a lecturer at King's College, University of London, and then he joined Seoul National University, where he is currently a professor in the Department of Computer Science and Engineering. He was also a visiting research fellow at Curtin University, Australia in 1995, and a visiting professor at University of Marne-la-Vallee, France in 2005. His research interests include design and analysis of algorithms, bioinformatics, and cryptography.
We introduce a new string matching problem called order-preserving matching on numeric strings where a pattern matches a text if the text contains a substring whose relative orders coincide with those of the pattern. Order-preserving matching is applicable to many scenarios such as stock price analysis and musical melody matching in which the order relations should be matched instead of the strings themselves.
Solving order-preserving matching has to do with representations of order relations of a numeric string. We define prefix representation and nearest neighbor representation, which lead to efficient algorithms for order-preserving matching. We present efficient algorithms for single and multiple pattern cases. For single pattern matching, we give an O(n log m) time algorithm and optimize it further to obtain O(n + m log m) time. For multiple pattern matching, we give an O(n log m) time algorithm.
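A minimal illustration of the definition (not of the algorithms above): two numeric strings match in the order-preserving sense when their rank sequences coincide, and the naive matcher below simply tests every text window this way, at O(n m log m) cost rather than the O(n log m) achieved in the talk.

```python
def order_isomorphic(a, b):
    """True if two equal-length numeric sequences have the same relative
    order, i.e. the same sequence of ranks (distinct values assumed)."""
    if len(a) != len(b):
        return False
    ranks = lambda s: [sorted(s).index(x) for x in s]
    return ranks(a) == ranks(b)

def op_match_naive(text, pattern):
    """Naive order-preserving matching: test every window of the text.
    Costs O(n * m log m) overall, in contrast with the O(n log m) and
    O(n + m log m) algorithms described in the talk."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if order_isomorphic(text[i:i + m], pattern)]

# e.g. op_match_naive([10, 20, 15, 40, 50, 45], [1, 3, 2]) == [0, 3]
```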
Professor Kunsoo Park, Seoul National University, South Korea (Chair)
Dr Anthony Cox, Illumina Cambridge Ltd, UK: Compression and indexing of DNA sequence data
After working in software engineering in the space industry and at the Wellcome Trust Sanger Institute, Dr Cox joined a Cambridge University spinout called Solexa in 2002 to work on the technology that now underlies Illumina's market-leading range of DNA sequencing products.
He was responsible for much of the sequence analysis software that accompanied early versions of the instrument including ELAND, the first aligner designed for high-throughput short read sequences. The current research interests of his group include cancer genome sequencing, metagenomics, the detection of complex structural variants and the applications of compressed indexing to DNA sequence analysis.
The Burrows–Wheeler transform (BWT) is the foundation of many algorithms for compressing and indexing text data. With the advent of “next-generation” DNA sequencing technologies, collections of billions of DNA "reads" (sequences) are now commonplace in bioinformatics. Widely-used tools such as Bowtie and BWA accelerate the alignment of such datasets by employing BWT-based indexes of the genomes they are aligned to, but the relatively high computational cost of constructing these indexes for large datasets has meant that the indexing of the reads themselves has remained comparatively unexplored.
In earlier studies we presented algorithms that allow the BWT to be built on moderate hardware for datasets of the scales typically encountered in whole human genome sequencing. Here I will discuss the implications of this work for data compression and DNA sequence analysis, describing how we have applied our methods to gene finding and to the identification of bacteria in metagenomic samples.
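For context, the sketch below is the textbook rotation-sorting construction of the BWT; it is hopelessly slow for read collections of the scale discussed here, which is precisely why the lightweight construction algorithms mentioned above matter.

```python
def bwt(s, terminator="$"):
    """Textbook Burrows-Wheeler transform obtained by sorting all rotations.
    This O(n^2 log n) construction is purely illustrative; the algorithms in
    the talk build the BWT of billions of reads on moderate hardware."""
    s = s + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

# e.g. bwt("ACAACG") == "GC$AAAC" -- identical symbols cluster together,
# aiding compression and enabling FM-index style backward search.
```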
Dr Sophia Kossida, Biomedical Research Foundation of the Academy of Athens, Greece: Unfolding protein folding
Sophia Kossida received her BSc degree (1995) in Biology from the University of Crete, Greece, and was awarded her DPhil in Bioinformatics from Oxford University, UK (1998). She worked at Harvard University, USA, within the FlyBase group, and was a Senior Scientist at Lion Bioscience Research Inc. in MA, USA, where she worked on the human genome mining project. She was appointed Director of Bioinformatics at Endocube in Toulouse, France, and later joined Novartis in Switzerland as a lab head within the Functional Genomics Group. She joined the Biomedical Research Foundation of the Academy of Athens (BRFAA) in 2004 as a tenure-track bioinformatician. Her scientific interests lie in comparative genomics, molecular evolution and drug design. She has approximately 65 publications in peer-reviewed international journals, 15 peer-reviewed articles in conference proceedings and 15 book chapters, and she is the inventor of 24 international patents. She serves on the editorial boards of several scientific journals, has extensive teaching experience, and has organised several workshops and conferences. Her team at BRFAA, www.bioacademy.gr/bioinformatics, has been the national contact point of Bioinformatics for EMBnet since 2005. She has secured approximately 2 million Euros in funding from national and European competitive research grants since her appointment at BRFAA.
Secondary, tertiary and quaternary structure is far more conserved in proteins than the corresponding primary amino acid sequence. Therefore, evolutionary relationships of proteins, protein structure–function predictions, molecular docking and comparative modeling should all be based on analyses, searches and databases containing structural information. Herein, a new tool has been developed that performs protein similarity searches based on protein secondary structural elements rather than just primary sequence information. The query input sequence can be of either known or unknown structure. In both cases the primary amino acid sequence is converted to the amino acid Structural Features Sequence (SFS) format. The SFS format is a per-residue annotation method based on the structural conformation of each amino acid in the query sequence (α-helix, β-sheet, coil, etc). If the query input sequence is of unknown structure, it is subjected to secondary structure prediction algorithms and the SFS format is deduced from the result. Our algorithm is broken down into three parts: in part 1, the query sequence is converted to SFS format; in part 2, the query SFS is structurally aligned against the reference SFS databases; and in part 3, the structural similarity results are combined with classic primary-sequence-based results and output to the user.
Dr Laurent Mouchard, University of Rouen, France: Toward efficient compression of NGS data
Dr Laurent Mouchard received a Bachelor of Science in Mathematics, a Master of Science in Computer Science and a PhD in Computational Biology from the University of Rouen. He participated in the assembly of the human genome at Celera Genomics, as a member of Gene Myers's Informatics Research Team.
He is one of the co-founders of the "Societe Francaise de Bioinformatique" and an elected member of its board. Dr Mouchard has supervised several PhD students working on biological sequence analysis and, more recently, on medical image analysis. Forthcoming activities include merging sequence data, medical images and metadata.
Automatic sequencers were used intensively in the 1990s and contributed to the sequencing and assembly of the human genome in 2000.
Nowadays, new technologies such as Next-Generation Sequencing (NGS), also called High-Throughput Sequencing (HTS), can decode huge amounts of DNA sequence (95 GB per run) from a minimal biological sample within a few weeks, for approximately €2,000.
These massive data (tens of millions of paired-end, 75 bp-long reads) have to be processed to obtain a de novo assembly, or mapped onto a reference genome to distinguish pathological differences from regular polymorphism between individuals.
Storing the data produced by NGS technologies, or the data processed by assemblers and mappers, efficiently is essential: the original raw data may have to be reused whenever a new tool or application appears that offers new possibilities or provides more sensitive results.
We present recent strategies that have been developed for the efficient storage of NGS data, in particular FASTQ files, discuss their intrinsic limits, and draw perspectives.
Professor Rajeev Raman, University of Leicester, UK: Encodings for range selection and top-k queries
After defending his PhD thesis at the University of Leicester in October 1991, Rajeev Raman was a Postdoctoral Fellow at the Max-Planck-Institut für Informatik in Saarbruecken with Kurt Mehlhorn, and then at UMIACS, University of Maryland with Uzi Vishkin. He joined the Algorithm Design Group at King's College London in 1994 and has been at Leicester since January 2001, where he was Head of Department from 2003 to 2006.
An array A[1..n] of values can be preprocessed in linear time so that later, given i and j, one can compute in constant time the position of a maximum in A[i..j]. The structure does not access A and uses 2n+o(n) bits, which is asymptotically optimal. In this paper we generalize this result in various ways to the problem of finding the k-th maximum in a range, or the top-k elements in a range, without accessing A. We give upper and lower bounds, which match in some cases.
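For comparison, a classic (non-succinct) solution is the sparse table sketched below, which uses O(n log n) words and answers range-maximum position queries in constant time; the encodings discussed in the talk answer the same query within 2n + o(n) bits and without ever accessing A.

```python
class SparseTableRMQ:
    """Classic sparse-table range-maximum structure: O(n log n) words of
    space, O(1) query time.  Shown only to fix ideas; the encodings in the
    talk answer the same query within 2n + o(n) bits and never access A."""

    def __init__(self, a):
        self.a = a
        n = len(a)
        # table[j][i] = position of a maximum in a[i : i + 2**j]
        self.table = [list(range(n))]
        j = 1
        while (1 << j) <= n:
            prev, half = self.table[j - 1], 1 << (j - 1)
            row = [prev[i] if a[prev[i]] >= a[prev[i + half]] else prev[i + half]
                   for i in range(n - (1 << j) + 1)]
            self.table.append(row)
            j += 1

    def argmax(self, i, j):
        """Position of a maximum in a[i..j] (inclusive), in constant time."""
        k = (j - i + 1).bit_length() - 1
        left, right = self.table[k][i], self.table[k][j - (1 << k) + 1]
        return left if self.a[left] >= self.a[right] else right

# rmq = SparseTableRMQ([3, 1, 4, 1, 5, 9, 2]); rmq.argmax(1, 4) == 4
```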
Professor Gad Landau, University of Haifa, Israel and NYU-Poly, USA (Chair)
Gad M Landau received his PhD in Computer Science from the University of Tel Aviv, Israel, in 1987. He is a Professor of Computer Science in the Department of Computer Science at the University of Haifa, Israel, and a Research Professor in the Department of Computer Science and Engineering at NYU-Poly, NY, USA; he has served as head of both departments. Landau has co-authored over 120 refereed papers in the areas of string algorithms, computational biology and data structures. He was co-chair of the program committees of CPM 2001 and 2008, and is an editor of the Journal of Discrete Algorithms.
Professor Alistair Moffat, University of Melbourne, Australia: External suffix arrays for large-scale string search
Professor Alistair Moffat has been a member of the academic staff at The University of Melbourne since 1986. He has published widely in the fields of text and index compression, information retrieval systems, and experimental measurement and evaluation, with more than 170 papers across these areas during his career. He is also an author of two technical books ("Managing Gigabytes", 1994 and 1999; and "Compression and Coding Algorithms", 2002) and an undergraduate textbook ("Programming, Problem Solving and Abstraction with C", 2003 and 2013). Alistair has served as Chair and Program Chair of a range of conferences, and as an Associate Editor of research journals, including the Journal of Information Retrieval and ACM Transactions on Information Systems.
Alistair was Chair of the Department of Computer Science and Software Engineering at The University of Melbourne from 2007 to 2011, and Associate Dean (Curriculum) in the Melbourne School of Engineering from 2007 to 2009; he has made strong contributions to the undergraduate and graduate teaching programs of the University.
We consider the problem of string search, that is, identifying the locations in a fixed text T of length n at which patterns P of length m appear, focussing on situations in which T is static but large, and it is not possible to hold an index to it (even when compressed) in main memory. In such cases, limiting the number of random-access disk transfers required during index processing is the paramount consideration. Here we give a comprehensive evaluation and implementation description of a recent approach to this problem that never requires more than two disk accesses per pattern P for count queries, yet consumes around half the storage of a conventional in-memory suffix array data structure. The new mechanism makes use of succinct data structures and a searching approach based on the Burrows-Wheeler transform. Experiments with this and a range of alternatives have been carried out against a variety of data types and files of up to 64 GB, using a laptop computer with just 4 GB of main memory, and demonstrate the speed and versatility of the new approach.
Professor Bill Smyth, McMaster University, Canada: Large-scale detection of repetitions
Bill Smyth has been involved with computing, IT, and computer science for 55 years. After completing his Bachelor's degree in Mathematics (University of Toronto, 1957), he spent 10 years as one of the pioneers of computer consultancy, dealing with applications to science, aeronautical and civil engineering, and business. From 1967 to 1982 he was employed with various agencies of the United Nations, providing consultancy and hands-on assistance to developing countries in areas such as national computer policy, training in computer systems development, and management consultancy. For the last 30 years he has been in academia; currently he is Professor Emeritus at McMaster University, with appointments also at King's College London and the University of Western Australia. His main interests are in combinatorial algorithms, especially on strings, particularly very long ones, where efficiencies in the use of time and space are essential.
Combinatorics on words began more than 100 years ago with a demonstration that an infinitely-long string with NO repetitions could be constructed on an alphabet of only three letters. Computing ALL the repetitions in a string is one of the oldest and most important problems of computational stringology. About a dozen years ago it was discovered that repetitions can be computed as a byproduct of the linear-time computation of all the maximal periodicities (or runs) in a string. However, even though the computation is linear, it is also brute force: global data structures that make heavy use of space need to be computed in a preprocessing phase, despite the fact that the expected number of runs in a string is known to be sparse in the string length. In this talk I explore the possibility that repetitions (perhaps also other regularities in strings) can be computed in a manner commensurate with the size of the output. This improvement in software, derived from enhanced combinatorial insight, could enable us to process terabytes of data tomorrow as we process gigabytes today - with no speedup in hardware required.
Dr Anthony Cox, Illumina Cambridge Ltd, UK (Chair)
Dr Juha Kärkkäinen, University of Helsinki, Finland: Scalable index construction
Juha Kärkkäinen received his PhD in Computer Science at the University of Helsinki in 1999, where he is currently working as a researcher. His research interests are in the area of string algorithms and data structures, particularly text indexing and compression.
Dr Kärkkäinen has been a Postdoc and a Research Associate at the Max-Planck-Institute for Informatics in Saarbruecken, Germany, a Visiting Scientist at Google Zurich, and a Lecturer and a Researcher at the University of Helsinki. He chaired and organized the CPM 2012 conference.
Full-text indexes are powerful tools for indexing and analysing sequential data. Their construction is the computationally most demanding task in many applications. In this talk I will give an overview of construction techniques that can be scaled to handle massive sequence collections using compressed representations, external-memory computation, parallelisation and distributed computing. The main focus is on using suffix sorting algorithms to compute the suffix array or the Burrows-Wheeler transform, from which most full-text indexes can be constructed. If time allows, I will briefly touch on some other topics too, including longest-common-prefix array construction and Lempel-Ziv parsing.
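As a baseline for the scalable constructions surveyed in the talk, the sketch below builds a suffix array by plain sorting and reads the Burrows-Wheeler transform off it; it assumes the text ends with a unique sentinel and is far removed from the external-memory and parallel algorithms actually discussed.

```python
def suffix_array(text):
    """Naive suffix array: sort suffix start positions lexicographically.
    O(n^2 log n) in the worst case; the talk surveys constructions that scale
    to massive collections via compression, external memory, parallel and
    distributed computation."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def bwt_from_sa(text, sa):
    """Burrows-Wheeler transform read off the suffix array; text is assumed
    to end with a unique, lexicographically smallest sentinel such as '$'."""
    return "".join(text[i - 1] for i in sa)

# sa = suffix_array("banana$"); bwt_from_sa("banana$", sa) == "annb$aa"
```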
Dr Mark Elliot, University of Manchester, UK: Understanding the data environment: the case of big social data
Mark Elliot has worked at CCSR since 1996, with a spell as director from 2005 to 2008, mainly in the field of statistical confidentiality. He founded the internationally recognised Confidentiality and Privacy Research Group (CAPRI) in 2002 and has managed numerous research projects within its remit. He is one of the key international researchers in the field of statistical disclosure and has an extensive portfolio of research grants and publications in the field. He leads the UK Anonymisation Network. He recently led an ESRC-funded consultation project under the digital research theme, entitled "Data Horizons", which is to report on the future of data within the social sciences.
Dr Elliot collaborates widely with non-academic partners, particularly with national statistical agencies (e.g. the Office for National Statistics, the US Bureau of the Census and the Australian Bureau of Statistics), where he has been a key influence on the disclosure control methodology used in censuses and surveys, and where the SUDA software, developed in collaboration with colleagues in Computer Science in Manchester, is currently employed.
The term "Data Environment" denotes the context in which data reside: a context consisting of other data, users, infrastructures, and governance protocols and practices. Data Environment Analysis (DEA) is employed within the field of Statistical Disclosure Control (see e.g. Duncan et al 2011) to reconcile legal and ethical privacy constraints with the exigencies of the Open Data agenda. In that context, DEA enables us to identify whether data are truly personal or not. In the era of big and linked data, DEA has the potential to serve another purpose: to provide a map of the information society/ies. This paper describes the processes of data environment analysis, the initial work that has been done on automating DEA at Manchester, and the further work that we propose.
Professor Bill Smyth, McMaster University, Canada (Chair)
Professor Roberto Grossi, University of Pisa, Italy: Succinct data structures for strings and sequences
Roberto Grossi is Professor of Computer Science at the University of Pisa. His research interests span both theoretical problems in core research and applications and experimental work. Specifically, his interests are focused on algorithms for combinatorial pattern matching and mining on strings, sequences, trees and matrices; design of algorithms and data structures for external and hierarchical memory; implicit, succinct and compressed data structures for (compressed) data sets; space- and time-efficient compressed indexing and fast searching in compressed text; text indexing and editing; multi-dimensional data structures; algorithm engineering for quick-access tables and dictionaries; routing algorithms for networks and robots; and graph algorithms and pattern discovery in graphs and sequences.
The talk describes some problems that arise in industrial applications and how succinct and compressed data structures for strings and sequences can help by using little space. The first part of the talk is devoted to semi-indexing semi-structured data, e.g. query log data in JSON format. The second part involves prefix searching on a set of strings, which is helpful e.g. in search engines, and describes some succinct data structures for storing a set of strings. The third part of the talk considers a sequence of strings (not to be confused with a set of strings) and introduces the wavelet trie for storing textual data (e.g. URLs) in time order and supporting extra functionalities that exploit this order. The results discussed are joint work with Giuseppe Ottaviano, presented at ACM CIKM 2011, ACM-SIAM ALENEX 2012 and ACM PODS 2012.
Professor Gad Landau, University of Haifa, Israel and NYU-Poly, USA: Algorithms on grammar-compressed strings
Grammar based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures many of the popular compression schemes, including the Lempel-Ziv family, Run-Length Encoding, Byte-Pair Encoding, Sequitur and Re-Pair.
Let S be a string of length N given as a grammar G(S) of size n. The random access problem is to compactly represent G(S) while supporting fast random access queries: given an index i, report S[i] without decompressing S. We will first present a linear-space representation of G(S) that supports O(log N) random access time. This representation extends to efficiently support substring decompression; namely, we can decompress any substring S[i..j] in the same complexity as a random access query plus additional O(j - i) time.
Once we obtain an efficient substring decompression method, it can serve as a basis for a compressed version of classical pattern matching. Namely, we can take any black-box (uncompressed) approximate pattern matching algorithm and turn it into a corresponding algorithm over grammar-compressed strings. We will then focus on a specific algorithm for computing the edit distance of two grammar-compressed strings. This algorithm requires O(nN) time and uses the compression in a more complicated way (i.e., not through random access queries).
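The following sketch illustrates only the basic idea behind grammar random access: walk down a straight-line grammar, steering left or right by comparing the index against stored expansion lengths. The Rule class is a hypothetical representation, and the walk takes time proportional to the grammar's depth rather than the O(log N) bound of the structure described in the talk.

```python
class Rule:
    """One rule of a straight-line grammar: either a terminal character or
    the concatenation of two previously defined rules.  `length` stores the
    size of the string the rule expands to."""
    def __init__(self, char=None, left=None, right=None):
        self.char, self.left, self.right = char, left, right
        self.length = 1 if char is not None else left.length + right.length

def access(rule, i):
    """Return S[i] without decompressing S: descend the grammar, going left
    or right according to stored expansion lengths.  This naive walk costs
    time proportional to the grammar's depth, whereas the representation
    described in the talk guarantees O(log N) with linear space."""
    while rule.char is None:
        if i < rule.left.length:
            rule = rule.left
        else:
            i -= rule.left.length
            rule = rule.right
    return rule.char

# Grammar for "abab": X1 = a, X2 = b, X3 = X1 X2, X4 = X3 X3
# X1, X2 = Rule(char="a"), Rule(char="b")
# X3 = Rule(left=X1, right=X2); X4 = Rule(left=X3, right=X3)
# access(X4, 2) == "a"
```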
Professor Esko Ukkonen, University of Helsinki, Finland (Chair)
Dr Travis Gagie, University of Helsinki, Finland: Fast random access and approximate pattern matching for compressed highly repetitive text collections
Travis Gagie earned a BSc in Cognitive Science from Queen's University (Canada) and an MSc in Computer Science from the University of Toronto, studied in Italy for three years and received a Dr rer nat in Bioinformatics from Bielefeld University (Germany). He then worked for three years as a post-doc at the University of Chile and at Aalto University (Finland). He now studies compressed data structures as a post-doc at the University of Helsinki and the Helsinki Institute for Information Technology.
Motivated by the imminent growth of massive, highly redundant databases in genomics and on the World Wide Web, we study the problem of compressing a string database while simultaneously supporting fast random access, substring extraction and pattern matching on the underlying string(s). We describe several data structures and space-time tradeoffs for these problems, all of which exploit the Lempel-Ziv (LZ77) factorisation of the underlying collection to achieve both small space usage and fast runtimes.
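As a toy illustration of why LZ77 suits highly repetitive collections, the quadratic-time factorisation sketched below encodes each phrase as a reference to an earlier occurrence; on near-identical genomes almost the entire text collapses into a handful of phrases. It is a naive sketch, not one of the data structures from the talk.

```python
def lz77_factorise(s):
    """Greedy (self-referential) LZ77 factorisation: each phrase is the
    longest prefix of the remaining text that already starts at an earlier
    position, or a single fresh character.  Quadratic and purely
    illustrative; it shows why highly repetitive collections collapse into
    very few phrases."""
    phrases, i = [], 0
    while i < len(s):
        best_len, best_src = 0, -1
        j = s.find(s[i], 0, i)               # earlier occurrence of s[i]
        while j != -1:
            length = 0
            while i + length < len(s) and s[j + length] == s[i + length]:
                length += 1
            if length > best_len:
                best_len, best_src = length, j
            j = s.find(s[i], j + 1, i)
        if best_len == 0:
            phrases.append((s[i], 0))        # literal: a character not seen before
            i += 1
        else:
            phrases.append((best_src, best_len))   # (earlier position, length)
            i += best_len
    return phrases

# lz77_factorise("abababab") == [('a', 0), ('b', 0), (0, 6)]
```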