Post-graduate science students break Large Language Model guardrails at Royal Society AI safety event

07 November 2023

Forty post-graduate research students in health and climate science were able to generate harmful scientific misinformation (false scientific information produced accidentally), disinformation (false scientific information produced with the intent to harm) and malinformation (verifiable scientific facts placed out of context, leading to misleading and inaccurate conclusions) in less than two hours using Meta’s open-source Large Language Model (LLM), Llama 2, according to the Royal Society.

The students, all from the UK, were taking part in a ‘red teaming’ event hosted by the Royal Society and non-profit AI-safety organisation, Humane Intelligence, held ahead of the UK’s Global AI Safety Summit this week. The event aimed to raise awareness of the potential vulnerabilities of LLMs in scientific misinformation content generation about climate change and COVID-19. 

As well as producing misleading information, the experiment found the model was unable to communicate context, complexity, or uncertainty accurately. There were several instances where it produced outdated information or simulated uncertainty in debates where a scientific consensus is already established. It also drew on sources that included pseudoscience and lobbying material, and some interactions showed the model exhibiting human-like qualities such as empathy, persuasion, and authority.

Professor Alison Noble, a vice president at the Royal Society, said: “AI-based technologies have huge potential to revolutionise scientific discovery by helping to solve some of society’s greatest challenges, from disease prevention to climate change, but we must also be fully aware of potential risks associated with these new technologies.

“This event has demonstrated the importance of including scientists in AI quality and safety assessments to test the capabilities of models to address cutting-edge topics, emerging technologies, and complex themes, which may not always be within the purview of the data used to train a language model.”

Jutta Williams, co-founder of Humane Intelligence, said: "Our findings really validate the contribution that red teaming events can offer AI model developers. By combining deep scientific expertise and structured feedback with the guardrail capabilities that companies developing models have established, the user who interacts with models is made safer.”

Participants were allocated one of four personas as part of the exercise: Good Samaritan – unknowingly produces and shares misinformation content; Profiteer – shares misinformation content, or is ambivalent about its veracity, if its dissemination generates profit; Attention hacker – knowingly produces and shares divisive misinformation content; and Coordinated influence operator – knowingly produces and shares disinformation to sway public opinion to benefit the agenda of their organisation, industry or government.

While embodying these roles, each student completed a set of challenges, using prompts designed to break the guardrails across use cases involving the generation and dissemination of false scientific content. For example, Good Samaritans explored whether it was possible to produce scientific misinformation while seeking advice, Profiteers sought ideas for new products, Attention hackers generated mistrust campaigns, and Coordinated influence operators generated agenda-driven conspiracy theories.

Students were more successful in producing disinformation when assuming the role of proactive and/or malicious disinformation actors. Participants were unsuccessful in breaking guardrails that prevent common disinformation narratives about COVID-19, climate denialism, and discrimination against historically excluded communities, suggesting these guardrails could prevent the general public from using language models to consume and generate misinformation at scale.

The Royal Society will draw on these findings in a policy report, Science in the Age of AI, due to be published next year. The report will explore how AI is changing the nature and methods of scientific research, covering novel developments in large language models, case studies of AI in science, and environmental, ethical, and research integrity challenges.

The Royal Society has been investigating AI-generated misinformation since 2022, including publishing a report on The online information environment and a joint workshop with the BBC on Generative AI, content provenance, and a public service internet. One of the takeaways of this work was the importance of ‘building resilience within platforms and the people who use them’. Recommendation 7 of the report called for ‘collaboration to develop examples of best practice for countering misinformation’ and Recommendation 9 called for ‘lifelong information literacy initiatives’.

Humane Intelligence, under the leadership of Dr Rumman Chowdhury and Jutta Williams, has designed numerous events to test the ‘guardrails’ of LLMs, including the largest ever at DEF CON 31.