The Science as an Open Enterprise report includes this illustration of the diverse attributes of data production in a range of scientific studies. The type of open data is tailored to the nature of the data, the curation and storage effort, and the requirements for data access.
A full description of each example is given below the diagram. To illustrate the attributes of each project, the relative percentages below roughly equate to the work done on each attribute.
Astronomy and the Virtual Observatory
In astronomy, scientists have for some time recognised the importance of greater openness in science. Astronomers from around the globe have initiated the Virtual Observatory (VO) project to allow scientists to discover, access, analyse and combine astronomical data archives and make use of novel software tools. The International Virtual Observatory Alliance (IVOA) coordinates various national VO organisations and establishes technical and astronomical standards. Such standards are vital so that datasets and analysis tools from around the world are interoperable. Metadata are also standardised using the Flexible Image Transport System (FITS) standard and the more recent XML-based IVOA format, VOTable. It is also an IVOA standard to register datasets in a registry, a sort of web-based Yellow Pages for astronomy databases. Registries are important because they document the existence and location of datasets so that they can be easily found and accessed. IVOA itself collates a registry of registries.
In Europe, the main VO organisations have come together to form Euro-VO. Euro-VO is responsible for maintaining an operational VO in Europe by supporting the use of its tools and services by the scientific community, ensuring technology take-up and compliance with international standards, and assisting in building the technical infrastructure. Deposition of data in data centres is common practice in astronomy, especially since it is a condition of access to large facilities. Access to data may be embargoed for up to a year to give the scientists who carried out the research a first chance to analyse their data; the data are, however, made publicly available at the end of this period.
Laser Interferometer Gravitational-wave Observatory project
The Laser Interferometer Gravitational-wave Observatory (LIGO) project is an 800-person international open collaboration, involving approximately 50 institutions. It aims to detect gravitational waves, tiny ripples in the structure of spacetime caused by astrophysical events involving objects such as supernovae, neutron stars or black holes. Gravitational waves were first predicted by Albert Einstein in 1916 as part of his theory of general relativity but remain to be directly observed. The UK is involved in this collaboration via the UK-German GEO600 project, a 600 m laser interferometer infrastructure built near Hannover.
The collaboration has generated in the order of 1 petabyte of data so far, a volume which is expected to grow at a rate of around 1 petabyte per year by 2015. These data are stored at the US LIGO sites, and some or all of them are also maintained at various European sites. Although the core dataset is relatively straightforward, the data also include important but complex auxiliary channels, such as seismic activity and environmental factors, and several layers of highly reduced data products, mostly specific to custom software suites. Such data require careful curation. The management of the data and the processing software has so far been designed to support an ongoing research project. A long-term data preservation plan has also recently been agreed, including an algorithm for data release. Data collected remain proprietary to the collaboration until their release is triggered by a significant event such as an announced detection of a gravitational wave, or a certain volume of spacetime being explored by the detector.
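The release rule described above can be read as a simple either/or condition. The following is only a minimal sketch of that logic: the class, field names, units and threshold value are hypothetical and are not taken from the collaboration's data management plan.

```python
from dataclasses import dataclass


@dataclass
class ObservingStatus:
    """Hypothetical summary of the collaboration's state at a point in time."""
    detection_announced: bool  # has a gravitational-wave detection been announced?
    explored_volume: float     # spacetime volume explored so far (arbitrary units)


def release_triggered(status: ObservingStatus, volume_threshold: float) -> bool:
    """Return True if either release condition described in the text is met."""
    return (status.detection_announced
            or status.explored_volume >= volume_threshold)


if __name__ == "__main__":
    # Assumed example values: no detection yet, threshold not yet reached.
    status = ObservingStatus(detection_announced=False, explored_volume=0.4)
    print(release_triggered(status, volume_threshold=1.0))
```

Until one of the two conditions holds, the data stay proprietary to the collaboration.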
Scientific Visualisation Service for the International Space Innovation Centre
The Science Visualisation Service for Earth Observation (SVSeo), developed by CEDA as part of the development of the International Space Innovation Centre (ISIC), is a web-based application that allows users to visualise and reuse Earth Observation data and climate model simulations.
Users can visually explore large and complex environmental datasets from observations and models; view, step through and zoom in to gridded datasets on a map view; overlay different parameters; export images as figures; and create animations for viewing and manipulation on the ISIC videowall, on Google Earth or in other similar software. Datasets from the National Centre for Earth Observation (NCEO) in the CEDA archives have been included in the visualisation service and provide satellite-derived products relating to clouds, plankton, air-sea gas exchange and fire, as well as model output.
The visualisation service will be updated as additional datasets are produced and provided to CEDA for long-term archiving. The service is also capable of including any remote data which are exposed via a Web Map Service (WMS) interface. CEDA data are made available for visualisation through the CEDA Open Geospatial Consortium (OGC) Web Services framework (COWS).
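To give a sense of what exposing data through a WMS interface means in practice, the sketch below builds a GetMap request URL using the standard WMS 1.3.0 query parameters. The endpoint and layer name are placeholders, not real CEDA or COWS addresses, and the helper function is hypothetical.

```python
from urllib.parse import urlencode


def build_getmap_url(endpoint, layer, bbox, width, height,
                     crs="CRS:84", image_format="image/png"):
    """Build a WMS 1.3.0 GetMap request URL for a single layer.

    bbox is (min_lon, min_lat, max_lon, max_lat), matching the
    longitude/latitude axis order of CRS:84.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",  # empty value requests the server's default style
        "CRS": crs,
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": image_format,
    }
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    # Placeholder endpoint and layer name, for illustration only.
    print(build_getmap_url("https://example.org/wms",
                           "sea_surface_temperature",
                           (-180, -90, 180, 90), 1024, 512))
```

Any client that can issue a request of this form, including a visualisation service such as the one described here, can overlay the returned image on a map view.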
(Interactive visualisation software developed by partners in STFC e-Science and the University of Reading can also be used at the ISIC facility to create animations on a virtual globe or multiple, synchronised virtual globes displayed on a large videowall.)
The UK Land Cover Map at the Centre for Ecology & Hydrology
The UK Land Cover Map (LCM2007) has classified approximately 10 million land parcels into the UK Biodiversity Action Plan Broad Habitats by combining satellite imagery and national cartography. It is the first land cover map to provide continuous vector coverage of 23 of the UK Broad Habitats derived from satellite data.
To process and classify the 2 terabytes of data involved, researchers have developed novel techniques and automated production tools. The data are curated by the Natural Environment Research Council (NERC) Centre for Ecology and Hydrology (CEH) so that they can be reused for further research. Metadata, technical descriptions, visualisation services and downloads of summary datasets are available through the CEH Information Gateway. The national product is available in a range of formats, from a 1 km summary to 25 m resolution, for the UK for all 23 habitat types.
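To illustrate the relationship between the 25 m product and the 1 km summary, the sketch below coarsens a grid of habitat class codes by keeping the most common class in each block. This is an assumption made for illustration; the text does not state how CEH derives its summary products, and the function name and class codes are hypothetical.

```python
from collections import Counter

# 1 km / 25 m = 40, so one summary cell would cover 40 x 40 = 1600 fine cells.
CELLS_PER_KM = 1000 // 25


def dominant_class_summary(grid, factor):
    """Coarsen a 2-D grid of class codes, keeping the commonest code per block.

    Ties are resolved in favour of the code encountered first in the block.
    """
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be multiples of factor")
    summary = []
    for r0 in range(0, rows, factor):
        out_row = []
        for c0 in range(0, cols, factor):
            block = [grid[r][c]
                     for r in range(r0, r0 + factor)
                     for c in range(c0, c0 + factor)]
            out_row.append(Counter(block).most_common(1)[0][0])
        summary.append(out_row)
    return summary


if __name__ == "__main__":
    # Tiny made-up grid, coarsened by a factor of 2 instead of 40.
    fine = [[1, 1, 2, 2],
            [1, 3, 2, 2],
            [4, 4, 5, 6],
            [4, 4, 5, 5]]
    print(dominant_class_summary(fine, 2))
```

With real data the factor would be `CELLS_PER_KM`, and a summary could equally record the percentage cover of each habitat per cell rather than a single dominant class.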
Global Ocean Models at the UK National Oceanography Centre
Researchers at the National Oceanography Centre in Southampton run high resolution global ocean models to study the physics of ocean circulation and the bio-geochemical consequences of changes in this circulation over timescales spanning multiple decades.
Data on ocean properties, sea-ice cover, ocean currents and biological tracers are recorded, and a typical 50-year run produces between 10 and 50 terabytes of data. To analyse the data, researchers cycle through the time series of output using software specifically developed in-house. Standard packages can be used to visualise the data, although in-house packages are also developed for specific needs. The data are stored locally and at data centres for up to 10 years or until superseded, and are made freely available to the academic community.
The Avon Longitudinal Study of Parents and Children (ALSPAC)
This study aims to investigate genetic and environmental factors that affect health and development. Researchers have been collecting large amounts of data from mothers and their children at 55 time points since 1991 in the form of biological samples, questionnaires, information from medical notes, and in some cases genome analysis.
The nearly 40,000 variables, 55 time points and 94 data collection events of this study can be explored through a prototype online gateway developed by the MRC, the MRC Research Gateway. Researchers from approved collaborations can view ‘deep metadata’ of variables of studies and export these to support data sharing requests. Researchers then liaise with a ‘data buddy’ who releases the required data according to the degree of risk of breaching participant anonymity.
If there is a risk that study participants may be identified, data are made available via a two-stage process: first, potentially identifying but unmatched data are provided to collaborators; the study team then matches these with the dataset. Data with a low risk of disclosure are more readily accessible and subject to a less stringent release process. Genotype data are only made available via data transfer agreements. The MRC Research Gateway is striving to enhance data sharing within defined limits to protect participant anonymity.
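The tiered release process described above can be summarised as a small decision function. This is a minimal sketch of the logic as stated in the text; the enum, function name and step labels are hypothetical and do not reflect any actual MRC Research Gateway software.

```python
from enum import Enum


class Risk(Enum):
    """Hypothetical disclosure-risk tiers for a data request."""
    LOW = "low"
    IDENTIFYING = "identifying"


def release_route(risk, is_genotype=False):
    """Return the ordered release steps for a request, per the tiers in the text."""
    if is_genotype:
        # Genotype data are only released under a data transfer agreement.
        return ["data transfer agreement"]
    if risk is Risk.IDENTIFYING:
        # Two-stage process: unmatched data first, matching by the study team later.
        return ["provide unmatched data to collaborators",
                "study team matches data with the dataset"]
    # Low-risk data follow a less stringent process.
    return ["standard release"]


if __name__ == "__main__":
    print(release_route(Risk.IDENTIFYING))
```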