We need to stop thinking of the internet as a two-dimensional accumulation of information on a screen and start seeing it as a three-dimensional entity that can be explored back in time, too.

Iterations of a Generative Adversarial Network AI learning to create abstract art

In the ephemeral world of the internet, nothing seems to stay the same for long. Websites are constantly updated while a picture or video clip can be edited with just a few clicks.

How can you tell in this shifting landscape of information if material published online contains the same facts today as it did a week ago, a month ago or even a year ago? In an era when more and more people are struggling to trust the information they find online, being able to check how it has been edited, deleted or replaced over time is an important tool for assessing its veracity.

Of course, some information evolves as we find out more. It is something that scientists are accustomed to – as more data is gathered it can lead to new ideas, theories and knowledge. But what if an organisation or individual makes a claim that is deliberately misleading? How can they be held to account if they can simply edit or erase their claims from history?

Fortunately, there is a little-known resource that can help – the UK Web Archive. This enormous digital repository collects millions of UK websites every year. It preserves old or obsolete websites so they can be accessed by future generations.

Although organisations such as the British Library began collecting and archiving websites around 15 years ago, in 2013 the UK government introduced new regulations that required digital publications to be systematically preserved as part of something known as legal deposit. Legal deposit has existed in English law since 1662, and obliges publishers to place at least one copy of everything they publish in the UK and Ireland – from books to music and maps – at a designated library.

Since it was extended to include digital media, the six designated legal deposit libraries in the UK and Ireland have accumulated around 700 terabytes of archived web data, a figure that is growing by around 70TB every year. The libraries automatically collect – or crawl – UK websites at least once a year to gather a snapshot of what they contain, while some important websites such as news sites are collected daily. They also collect eBooks, electronic journals, videos, PDFs and social media posts – almost everything that is available in a digital format. The Web Archive’s curators also focus on specific topics or events that are likely to be of interest – such as climate change or Brexit – and place them in specific collections.

It is a fantastic resource. Anyone wanting to learn about the history of Britain since the early 1990s will almost certainly need to access information that was placed online. For academics hoping to understand how public discourse and official policy have evolved, accessing archived web material is vital. It can help journalists hoping to hold public bodies and politicians to account for what they have said and promised in the past. Web archives have been used as evidence in court and as educational tools, and they allow the public to access information on websites that have been deleted.

But the Web Archive could offer so much more. Libraries have always had a role as trusted institutions that hold reliable and accurate information. In the past, people who wanted to find something out would often head to their local library to do so. At a time when many large online platforms make it easy to access and share bad information, it seems libraries could play an important role in helping people to identify what they can trust.

The Web Archive – along with other internet archives around the world – could be a key tool for helping people to double check claims or dig out lost information. Just two years after the UK Web Archive began in 2013, those leading the project found 40% of the URLs they had archived were gone or missing from the live Web, while 60% had either vanished or had changed so much they were unrecognisable.

But there is currently a major barrier standing in the way of people hoping to search out information from the internet in days past – and, bizarrely, it is a physical one. Historic pages for only around 19,000 websites can be accessed through the Web Archive’s online portal. These are sites whose creators have given explicit permission to allow open access to their content, but contacting every UK website in this way is almost impossible.

To see the rest of the archive, the legislative framework requires users to follow the same rules as are applied to print items placed in legal deposit – they must physically visit one of nine locations around the UK and Ireland that are classified as legal deposit libraries. These are the British Library in London or Boston Spa; the Bodleian Libraries at Oxford University; Cambridge University Libraries; the National Library of Scotland in Glasgow or Edinburgh; the National Library of Wales in Cardiff or Aberystwyth; and the Library of Trinity College, Dublin.

Once inside one of these legal deposit libraries, users can browse as they like through an onsite computer terminal. Stranger still, only one person can access a specific piece of material at any one time. While libraries are trying their best to offer access to this material, the legislation restricts how widely they can share it.

But in a world where almost everything can be done online, these requirements make little sense and significantly restrict who can access the archive. That has only been exacerbated during the Covid-19 pandemic as libraries have had to periodically shut their doors due to lockdowns while the virtual doors to the archive have also remained shut.

Shrugging off these restrictions so that the entire Web Archive is available on the open web will require changes to the legal framework that governs digital legal deposit. Website owners currently have to give their explicit permission for their content to be available on the open archive, even if it was freely published on the web.

Many of the reasons for this are tied up in copyright rules. Some publishers – such as newspapers – are also not keen on their archived material becoming freely available online as it has commercial value or they use it to attract subscribers to their websites.

Of course, similar limitations affect other web archives around the world. The US Library of Congress web archive, for example, holds around two petabytes of material and is growing at a rate of 20-25 terabytes every month. Yet just a fraction of it is accessible over the internet.

Organisations like the non-profit Internet Archive – set up in the US in 1996 – have been pushing at the boundaries of what they can and cannot make available, in its case via the Wayback Machine. Its archive now holds more than 20 years of web history across 525 billion web pages, along with millions of books, videos, audio recordings and images, all of which users can explore.

If the UK Web Archive is to play a role as a trusted source in the fight against misinformation, it is clear the rules governing it will need to keep pace with the way people use the internet. Few people are willing to travel to use a computer terminal if they want to interrogate a claim or identify a bad piece of information before they share it or believe it. It needs to be available from anywhere, on any device.

Opening up the web archive would allow it to be mined at scale for high quality information using modern text analysis methods or artificial intelligence. It would enable researchers, businesses, journalists and anyone else with an interest to uncover trends or information hidden in web pages from the past.

This is only going to be possible if legislators can introduce the regulatory changes that librarians need to open up the archive. Perhaps systems of micropayments – much like those to authors of books borrowed from libraries – could be applied to material held in commercial archives.

Librarians could also have a part to play. Already among the most trusted professionals in the UK, they are trained in information literacy so they can query who made a piece of information, where it comes from and whether it can be believed. Their knowledge and skills could be usefully passed on to the public to help them navigate the complex environment of historical websites. We need to stop thinking of the internet as a two-dimensional accumulation of information on a screen and start seeing it as a three-dimensional entity that can be explored back in time, too.

This will all require additional resources, particularly as digital repositories continue to grow. The legal deposit libraries, and their librarians, have done a magnificent job preserving the UK's webspace for future generations.

Busting open the virtual doors of the Web Archive won’t solve all of the problems we have with misinformation. But in the meantime, it is a vital resource just waiting to be used.

Further reading

This blog is one of a series of perspective pieces published to support the Royal Society's Online information environment report, which provides an overview of how the internet has changed, and continues to change, the way society engages with scientific information, and how it may be affecting people’s decision-making behaviour.

 

Authors

  • Professor Melissa Terras

    Melissa studies the use of computational techniques to enable research in the arts, humanities, and wider cultural heritage and information environment that would otherwise be impossible. She is a Fellow of the Chartered Institute of Library and Information Professionals, a Chartered IT Professional, a Fellow of the British Computer Society and a Turing Institute Fellow.