It is happening on a large scale in many libraries, collections and archives. It is developing fast and it devours paper. No, I am not going to tell you the tale of the silverfish, but of a larger all-consuming beast: digitisation. But have no fear, we are taming it.
The Royal Society has embarked on a large project to create a digital collection of all its pre-digital-era publications, which contain almost 350 years of content. This project aims to make the Royal Society’s collection available to a wider audience on digital platforms, and preserve the analogue collections of the Society.
First, let me run you through the digitisation process itself. Take the Royal Society library room — look at the volumes on its shelves and then imagine trying to recreate this on the internet. Where to start?
The process of digitisation can be divided into 3 different steps: capture, index and upload.
- Capture corresponds to the process of leafing through every single page to take a picture of it. Images are digitally processed to reach a good level of consistency, but even before problems of quality arise, many questions need to be resolved: what impact will the process have on the original collection? Should the images be as similar to the originals as possible or focused on getting a clear text? Should images be in black and white so that the files are as light and easy to load as possible, or in full colour? Should pages be captured separately or two facing pages captured as one?
To answer the first question, our initial step was to assess the condition of the collection and determine a method of capture which would have as little impact on the bound volumes as possible. The first periodical of the Royal Society, the Philosophical Transactions, was printed in 1665; the last issue we are digitising dates from December 1996. The condition and treatment of such a vast and important collection therefore varies tremendously. We decided that scanning put too much pressure on the spines, so instead all volumes would be captured using mounted cameras, with the books resting on cradles. We also decided that volumes from the 17th and 18th century would be captured on the premises, to ensure the safe-handling of those unique volumes.
As far as images are concerned, our approach was to offer the best possible reproduction of the original volumes, capturing each page individually, in full-colour and with no border. In addition to the pdfs that will be made available on our journals website, we are keeping preservation master-copies taken at the highest resolution available on the market.
- Indexing is a fundamental aspect of digitisation which is too often considered – wrongly, in my opinion – as a secondary process. The core of the project is to unlock the content from the printed pages of the journals. Each volume, issue and article published within the collection should be easy to find once on the web, and indexing is what makes this possible. In terms of production, the choice we faced was between relying on software which mines the data by recognising layout elements to extract titles, authors, dates, etc., or to employ indexers to catalogue the journals. After testing, it became clear that while software is indispensable when extracting a large quantity of text, it is not reliable enough to automatically produce metadata which describes the articles. Indexers are therefore capturing, in a systematic way, a set amount of information on each article, such as title, author, affiliation, role, dates of submission, presentation and publication, type of article, abstract, illustrations…This metadata will be stored in a database, which will give access to detailed information on the history and evolution of publishing within the Royal Society, and could become a wonderful tool for historians of science. In addition to this refined metadata, each page passes through Optical Character Recognition (OCR) software to extract the text and maths. The recognition for modern text is truly excellent, however for older content, the software is baffled by different characters (ƒ, for example) and the less well-defined ink.
- Digital collections sometimes result from the very urgent need to preserve documents that might be at risk of disappearing. For instance, the British Library Endangered Archive Programme has been spear-heading digitisation as a means of preservation for endangered documents. Our project does not stem from this type of concern for our collections, which thankfully have been impeccably taken care of over time. It is the strategic aim of the Royal Society to promote science and provide scientists with resources to support their research, and this drive has motivated our project. The last step of the project is therefore to make it available through a digital platform, uploading PDF versions of the articles with the newly collected metadata. This will give users access to some of the most important scientific texts, such as Newton’s theory of light and colour, Franklin’s electrical experiments, Lonsdale’s crystallography, Turing’s paper on morphogenesis and many more.
The Royal Society Journal Collection project, by aiming for the best possible quality of images, indexing and digital text, is aiming to deliver an online collection of the highest standards to support the needs of scientists and historians worldwide. In size, the project also plays in the Monster truck category: over 740,000 pages will be digitised covering 9 separate journals and over 45,000 articles dating from 1665 to 1996. The project has been a wonderful opportunity to delve into the history of scientific publishing at the Royal Society, so watch this space to discover more about our tribulations in the journal collections and contact us if you have use cases for the new content you’d like to explore, or if you would like to recommend the collection to your library.
Mounted cameras used to capture images of the bound volumes. Courtesy of Stephanie Routhier-Perry.