As we near the launch of our new journal archive, we asked linguistics researchers at Universität des Saarlandes a few questions about how they are using Royal Society journal content in their work.
Could you introduce us to your project team?
We are a group of researchers at the Department of Language Science and Technology at Universität des Saarlandes. We are committed to linguistics as an empirical science and to exploring language with corpus-based approaches. Our group is part of the Collaborative Research Centre Information Density and Linguistic Encoding (IDeAL for short; funded by the German Research Foundation), where around 50 researchers work together in a multidisciplinary research programme spread across various subprojects.
What is your project about?
In IDeAL, we investigate the hypothesis that language use, including change in language use, may be driven by the optimal and most effective use of the communication channel. Going back to Shannon’s notion of information, we use concepts from information theory, such as (relative) entropy or surprisal, which measure the amount of information transmitted by a given item in terms of the number of bits needed for encoding. In our group, we study linguistic densification in the evolution of scientific writing in English from the 17th century to the present. We have built large corpora of scientific texts and one of the corpora we use is the Royal Society Corpus consisting of all digitised texts of the Philosophical Transactions and the Proceedings of the Royal Society published between 1665 and 1869. Using this data, we can analyse the diachronic development of a range of linguistic structures with a certain level of complexity. While some structures allow the authors of scientific texts to distribute information over a larger number of clauses, others are particularly useful for packing information very densely into clauses and for conveying information in a very compact manner, for example, long noun phrases, such as “the inclination of the Earth’s axis” or complex words with several components, such as “physico-mathematical”.
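The information-theoretic measures mentioned above can be illustrated with a minimal Python sketch. It is not the group's actual pipeline: the two token lists are invented toy data, the unigram model with add-one smoothing is an assumption made for brevity, and the function names are hypothetical. Surprisal is computed as -log2 p, entropy as the expected surprisal of a distribution, and relative entropy (Kullback-Leibler divergence) as the extra bits needed when one period's language is encoded with another period's model.

```python
import math
from collections import Counter


def unigram_probs(tokens, vocab, alpha=1.0):
    """Smoothed unigram probabilities over a shared vocabulary (add-alpha)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def surprisal(p):
    """Information carried by an item with probability p, in bits."""
    return -math.log2(p)


def entropy(dist):
    """Expected surprisal of a probability distribution, in bits."""
    return sum(p * surprisal(p) for p in dist.values() if p > 0)


def relative_entropy(p, q):
    """D(p || q): extra bits needed to encode p with a model built for q."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p if p[w] > 0)


# Invented toy samples standing in for an early and a late period.
early = "the experiment was made and the result was observed".split()
late = "the measurement of the temperature of the solution".split()

vocab = set(early) | set(late)
p_early = unigram_probs(early, vocab)
p_late = unigram_probs(late, vocab)

print("surprisal of 'of' in early model:", surprisal(p_early["of"]))
print("surprisal of 'of' in late model: ", surprisal(p_late["of"]))
print("entropy of late model:           ", entropy(p_late))
print("D(late || early):                ", relative_entropy(p_late, p_early))
```

In this toy setting, a word that is frequent in the later sample but absent from the earlier one has lower surprisal under the later model, and the relative entropy between the two models is greater than zero. Real analyses would use far larger samples and richer language models than unigrams.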
Why did you choose the Philosophical Transactions and the Proceedings for your corpus?
We are interested in the development of the scientific article from the early stages of the first scholarly journals that were published in English to contemporary scientific publications. The scientific journals of the Royal Society with their long and continuous history provide an excellent basis for the diachronic analysis of the characteristics of academic writing in research articles that we can then compare to other text types and registers.
What challenges did you face when preparing the corpus?
Our texts needed a lot of pre-processing, including the correction of OCR (optical character recognition) errors, particularly in the older texts. We also had to deal with spelling variation in the early texts, as the language was not standardised in the 17th century. In our corpus, it is now possible to switch between historical spelling variants and their modern equivalents, so that natural language processing and corpus linguistic methods can be applied successfully. Another challenge was automatic part-of-speech tagging, i.e., the identification of nouns, verbs, adjectives etc., which is always more difficult for historical than for modern texts. We intertwine corpus building, corpus annotation and analysis, and produce a new version of the corpus whenever we encounter problems in the quality of the data. This is a method we have developed specifically from working on the text data of the Royal Society.
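The idea of switching between historical spellings and modern equivalents can be sketched as follows. This is only an illustration under stated assumptions, not the corpus's actual annotation format: the Token class, the annotate and render functions, and the small variant table are all hypothetical, and the lookup ignores capitalisation for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    orig: str  # spelling as printed in the historical text
    norm: str  # modernised spelling used by downstream tools


# Hypothetical variant table; a real one would be far larger and
# would be built and corrected as part of corpus preparation.
VARIANTS = {
    "bloud": "blood",
    "onely": "only",
    "philosophicall": "philosophical",
}


def annotate(words, variants=VARIANTS):
    """Pair each word with a normalised form, keeping the original."""
    return [Token(w, variants.get(w.lower(), w)) for w in words]


def render(tokens, layer="norm"):
    """Return the text in either the 'orig' or the 'norm' layer."""
    if layer not in ("orig", "norm"):
        raise ValueError("layer must be 'orig' or 'norm'")
    return " ".join(getattr(t, layer) for t in tokens)


tokens = annotate("the bloud is onely moved".split())
print(render(tokens, "orig"))  # historical spelling
print(render(tokens, "norm"))  # modernised spelling
```

Keeping both layers on every token means that a tagger or query tool can work on the normalised layer while the original spelling remains available for study.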
What are your preliminary results?
Full coverage from the middle of the 17th to the end of the 19th century is now available as a linguistically annotated corpus and a high-quality resource for studying diachronic variation in scientific writing. Many students at our university, for instance, use the corpus to study aspects of language use, variation and change. The corpus is freely available: it can be downloaded and queried on the command line or through a web-based interface.
Our analyses have shown that as scientific activity becomes more diversified and specialised, particular linguistic patterns become more predictable. We can see in the data how certain linguistic patterns became more and more conventionalised while others became less frequent. We can easily show, for instance, that the length of noun phrases increases over time.
What are your next research questions?
Our current version of the Royal Society Corpus comprises about 32.5 million tokens and covers journals from the first 200 years of the Royal Society. For comparative purposes, we also use other corpora of different genres covering the same time period, as well as various corpora of more contemporary scientific journal articles from several disciplines. We would like to extend our Royal Society Corpus with more digitised texts from the Philosophical Transactions and the Proceedings that were published at the end of the 19th century and the beginning of the 20th century, when the journals split into separate and more specialised publications. This will allow us to study various linguistic developments in more detail. Moreover, we want to investigate the role information density plays in language change when compared to other factors such as time, discipline and author, as well as linguistic constraints on particular changes.
What do you hope the newly indexed and digitised Royal Society Journal Collection will help you achieve?
The newly indexed and digitised Journal Collection provides high-quality metadata for the individual journal articles. You can see, for instance, what type of article it is: a biography, an editorial, an astronomical observation etc. The categories are more fine-grained than what we currently have in our metadata, where so far we only distinguish between articles, book reviews and a few other categories. Additionally, we hope to use the information on author background that is available in the new Journal Collection, e.g., the affiliation of the contributors, or their role, such as author, biographee or communicator.
IDeAL (Information Density and Linguistic Encoding) research group
Professor Dr Elke Teich, Dr Hannah Kermes, Dr Stefania Degaetano-Ortlieb, Dr Katrin Menzel, Jörg Knappen, Stefan Fischer