Science as a Public Enterprise

Science as an open enterprise

Costs of digital repositories

The Science as an Open Enterprise report distinguishes between four tiers of digital repositories.

  • Tier 1 comprises the major international data initiatives that have well defined protocols for the selection and incorporation of new data and access to them.
  • Tier 2 includes the data centres and resources managed by national bodies such as UK Research Councils or prominent research funders such as the Wellcome Trust.
  • Tier 3 refers to curation at the level of individual universities and research institutes, or groupings of them.
  • Tier 4 is that of the individual researcher or research group that collates and stores its own data, often making it available via a website to collaborators or for public access.

This page presents costings and capabilities for a representative sample of Tier 1, Tier 2 and Tier 3 repositories, gathered using a standardised survey instrument. The data presented below were gathered in January-February 2012, and are accurate as of that time.

International and large national repositories (Tier 1 and 2)

Worldwide Protein Data Bank (wwPDB)

The Worldwide Protein Data Bank (wwPDB) archive is the single worldwide repository of information about the 3D structures of large biological molecules, including proteins and nucleic acids. It was founded in 1971, and is managed by the Worldwide PDB organisation (wwpdb.org). As of January 2012, it held 78477 structures. 8120 were added in 2011, at a rate of 677 per month. In 2011, an average of 31.6 million data files were downloaded per month. The archive required 135GB of storage in total.

The total cost for the project is approximately $11-12 million per year (total costs, including overhead), spread out over the four member sites. It employs 69 FTE staff. wwPDB estimates that $6-7 million is for “data in” expenses relating to the deposition and curation of data.
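As a rough cross-check of the figures above, the following sketch recomputes the monthly deposition rate and derives an indicative “data in” cost per newly deposited structure. The per-structure figure is not reported by wwPDB: it rests on the simplifying assumption that the whole $6-7 million “data in” estimate is attributable to the 8120 structures added in 2011. All variable names are hypothetical.

```python
# Figures quoted in the text for wwPDB (2011).
structures_added_2011 = 8120
data_in_cost_usd = (6_000_000, 7_000_000)  # "data in" estimate, low and high

# Monthly deposition rate: 8120 / 12, rounded to the nearest whole structure.
monthly_rate = round(structures_added_2011 / 12)

# Indicative cost per deposited structure (assumption: all "data in"
# spending is attributed to the structures added in 2011).
cost_per_structure = tuple(
    round(cost / structures_added_2011) for cost in data_in_cost_usd
)

print(monthly_rate)        # 677
print(cost_per_structure)  # (739, 862)
```

On these assumptions, deposition and curation would cost somewhere in the region of $740-860 per new structure; the true figure depends on how much curation effort goes into existing entries.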

Platform provision, maintenance and development?

Yes

Multiple formats or versions (eg PDF, html, postscript, latex; multiple revisions of datasets)?

Yes

‘Front end’ - web-access to pages?

Yes

Registration and access control?

No

Input quality control: consistency with format standards, adequate technical standards?

Yes

Input quantity control: ensure community coverage?

Yes

Add metadata and references to describe authorship, provenance and experimental or simulation context?

Yes

Provide accession number to log deposition?

Yes

Alert registrants to new additions?

Yes

Provide means by which the data can be cited and credited to originators?

Yes

Host or link to relevant analysis tools (eg visualisation, statistics)?

Yes

Measure and document impact: downloads, data citations?

Yes

 

UK Data Archive

The UK Data Archive, founded in 1967, is curator of the largest collection of digital data in the social sciences in the United Kingdom. It contains several thousand datasets relating to society, both historical and contemporary. The UK Data Archive provides services to the ESRC and JISC, including the Economic and Social Data Service, the Secure Data Service, the Census Registration Service and the Census Portal. The Archive maintains the History Data Service (unfunded) and undertakes a variety of research and development projects in all areas of the digital life cycle. UKDA is funded mainly by the Economic and Social Research Council, the University of Essex and JISC, and is hosted at the University of Essex.

The main storage ‘repository’ holds multiple versions of approx 1.26 million files (ie digital objects); other ‘repositories’ hold a little under 1 files (in a primary version). UKDA tends to work on the basis of core data collections, of which there are currently 6,400. These 6,400 data collections accounted for 53,432 direct downloads in 2011 (approx 4,500 per month). This does not include downloads of freely-available material, which are estimated to be over 1 million. Nor does it include online tabulations through Nesstar, or images browsed through websites hosted at the UK Data Archive (eg www.histpop.org).

On average around 2,600 (new or revised) files are uploaded to the repository monthly. (This includes file packages, so the absolute number of files is higher.) The baseline size of the main storage repository is <1TB, though with multiple versions and files outside this system, a total capacity of c.10TB is required.

The UKDA currently (26/1/2012) employs 64.5 people. The physical storage systems and related security infrastructure are staffed by 2.5 FTE. The total expenditure of the UK Data Archive (2010-11) was approx £3.43 million. This includes additional infrastructural costs, eg lighting, heat, estates etc. Total staff costs (2010-11) across the whole organisation were £2.43 million. Total non-staff costs (2010-11) across the whole organisation were £360,000, but these can fluctuate by more than 100% from year to year. Non-staff costs in 2009-10 were approx £580,000, but will be much higher in 2011-12, at almost £3 million, due to additional investment.

Platform provision, maintenance and development?

Yes

Multiple formats or versions (eg PDF, html, postscript, latex; multiple revisions of datasets)?

Yes

‘Front end’ - web-access to pages?

Yes

Registration and access control?

Yes

Input quality control: consistency with format standards, adequate technical standards?

Yes

Input quantity control: ensure community coverage?

Yes

Add metadata and references to describe authorship, provenance and experimental or simulation context?

Yes, including metadata creation

Provide accession number to log deposition?

Yes

Alert registrants to new additions?

Optional

Provide means by which the data can be cited and credited to originators?

Yes

Host or link to relevant analysis tools (eg visualisation, statistics)?

Yes – not all data

Measure and document impact: downloads, data citations?

Yes – not all data

Other notes: UKDA also provides a range of other services including Content creation, Content hosting, Content hosting (secure), Content licensing, Content selection, Data curation, Data preservation, Data curation (secure), Licence negotiation, Documentation creation, Resource discovery, Content Development, Ingest (QA/Validation), Access Control (liaison with data owners), Consultancy, Creating & maintaining expertise, Developing advice & guidance (eg on data management), Requirements expertise, Solutions expertise, Training, Thesaurus/controlled vocabulary development, Horizon scanning, Trend analysis, General helpdesk support, Online help (FAQ, help manuals), Specialist helpdesk support, Event organisation & management, Funding engagement, Funding application, Market research, Promotion and PR, Impact promotion, Vendor engagement, Project & programme management.
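Because every repository answered the same survey questions, the responses can be compared mechanically. The sketch below records a handful of the wwPDB and UK Data Archive answers given above and lists the questions on which they differ. The shortened question keys and the function name are hypothetical labels chosen for illustration, not part of the survey instrument.

```python
# Survey answers transcribed from the text above; keys are shortened
# (hypothetical) labels for the survey questions.
answers = {
    "wwPDB": {
        "platform_provision": "Yes",
        "registration_and_access_control": "No",
        "accession_number": "Yes",
        "alert_registrants": "Yes",
        "analysis_tools": "Yes",
        "measure_impact": "Yes",
    },
    "UK Data Archive": {
        "platform_provision": "Yes",
        "registration_and_access_control": "Yes",
        "accession_number": "Yes",
        "alert_registrants": "Optional",
        "analysis_tools": "Yes – not all data",
        "measure_impact": "Yes – not all data",
    },
}


def differing_questions(repo_a, repo_b):
    """Return the questions on which two repositories gave different answers."""
    a, b = answers[repo_a], answers[repo_b]
    return sorted(question for question in a if a[question] != b[question])


print(differing_questions("wwPDB", "UK Data Archive"))
```

A plain string comparison treats qualified answers such as “Yes – not all data” as different from “Yes”, which is the conservative choice for a first pass.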

arXiv.org

arXiv.org is internationally acknowledged as a pioneering and successful digital archive and open-access distribution service for research articles. The e-print repository has transformed the scholarly communication infrastructure of multiple fields of physics and plays an increasingly prominent role in a unified set of global resources for physics, mathematics, computer science, and related disciplines. It is very firmly embedded in the research workflows of these subject domains and has changed the way in which material is shared, making science more democratic and allowing for the rapid dissemination of scientific findings. It has been running since 1991, is hosted by Cornell University Library, and is funded by Cornell University Library and contributing institutions.

As of January 2012, it held over 750,000 articles. Around 7,300 are added per month. The size of the repository is currently 263GB. arXiv.org employs just over six people. Its projected running costs for 2012 (including indirect costs) are in the region of $810,000 per year, of which roughly $670,000 are staff costs. Storage and computing infrastructure accounts for around $45,000 per year. 
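The cost figures above can be put in proportion with a few lines of arithmetic. The sketch below computes the staff and infrastructure shares of the projected 2012 budget and an indicative cost per newly added article. The per-article figure is not stated by arXiv.org: it assumes, for illustration only, that the whole budget is attributed to new submissions and ignores the cost of serving the existing 750,000 articles. Variable names are hypothetical.

```python
# Figures quoted in the text for arXiv.org (projected 2012).
total_cost_usd = 810_000
staff_cost_usd = 670_000
infrastructure_cost_usd = 45_000
new_articles_per_month = 7_300

new_articles_per_year = new_articles_per_month * 12  # 87,600

staff_share = staff_cost_usd / total_cost_usd
infrastructure_share = infrastructure_cost_usd / total_cost_usd

# Assumption: the entire budget is attributed to newly added articles.
cost_per_new_article = total_cost_usd / new_articles_per_year

print(f"staff share: {staff_share:.1%}")                    # 82.7%
print(f"infrastructure share: {infrastructure_share:.1%}")  # 5.6%
print(f"cost per new article: ${cost_per_new_article:.2f}") # $9.25
```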

Platform provision, maintenance and development?

Yes

Multiple formats or versions (eg PDF, html, postscript, latex; multiple revisions of datasets)?

Yes

‘Front end’ - web-access to pages?

Yes

Registration and access control?

Allows user registration, but all papers are open access.

Input quality control: consistency with format standards, adequate technical standards?

Yes, please see the policies at http://arxiv.org/help

Input quantity control: ensure community coverage?

See http://arxiv.org/help/moderation

Add metadata and references to describe authorship, provenance and experimental or simulation context?

We rely on metadata provided during submission and are in the process of considering ORCID or other similar initiatives for author name disambiguation

Provide accession number to log deposition?

Yes

Alert registrants to new additions?

Yes

Provide means by which the data can be cited and credited to originators?

Yes (for arXiv documents – see http://arxiv.org/help/faq/references)

Host or link to relevant analysis tools (eg visualisation, statistics)?

We have some R&D tools, such as http://arxiv.culturomics.org/

Measure and document impact: downloads, data citations?

None

Other notes: Provides support for ancillary files: http://arxiv.org/help/ancillary_files. Support for datasets is an R&D project, not a streamlined operation: http://arxiv.org/help/datasets.

Dryad

Dryad is a repository of data underlying peer reviewed articles in the basic and applied biosciences. Dryad closely coordinates with journals to integrate article and data submission. The repository is community driven, governed and sustained by a consortium of scientific societies, publishers, and other stakeholder organisations. Dryad currently hosts data from over 100 journals, from many different publishers, institutions, and countries of origin. It was founded in 2008.

As of 24 January 2012, Dryad contained 1280 data packages and 3095 data files, associated with articles in 108 journals. It received 7518 downloads in December 2011, and 79 new data packages in the same month, with approximately 2.3 files per data package. Its current size is 0.05TB.

Dryad has 4-6 FTE, with 50% devoted to operational core and 50% to R&D. Its total budget is around $350,000 per year, with staff costs of approximately $300,000 and infrastructure costs of $5,000-$10,000, including subscription services (eg DataCite, LOCKSS, etc.). It has received R&D funding from NSF and IMLS in the US, and JISC in the UK. Dryad's sustainability plan and business model ensure that, in the long term, revenues from payments for the submission of new data deposits cover the repository’s operating costs (including curation, storage, and software maintenance). The primary production server is maintained by the North Carolina State University Digital Library Program. Dryad is currently applying to the State of North Carolina and the US IRS to be recognised as an independent not-for-profit organisation.

 

Platform provision, maintenance and development?

Usage is primarily through the centrally managed web platform at NCSU and its mirrors. Dryad is responsible for provision, maintenance and development of this service. Since Dryad is built using open-source software, in large part DSpace, it can also be locally deployed for institutional use.

Multiple formats or versions (eg PDF, html, postscript, latex; multiple revisions of datasets)?

Both multiple data machine and content formats and multiple versions are supported. Dryad does not generally host the articles themselves, but rather the datafiles associated with them.

‘Front end’ - web-access to pages?

Yes, see http://datadryad.org

Registration and access control?

Yes, but not required for viewing/download, only for submission. Data and metadata can be embargoed from public access until article acceptance, publication, or beyond, depending on journal policy.

Input quality control: consistency with format standards, adequate technical standards?

Dryad’s curators are responsible for the following:

  • Quality control of bibliographic and subject metadata, including author name control
  • Validation of the integrity of uploaded files, including screening for copyrighted and sensitive content
  • Migrating files to new or more preservation-robust formats
  • Providing user help.

The formatting of file contents varies with discipline and is controlled by journal policy, not by Dryad. In fields with mature data standards, journals frequently specify that users use a specialised repository. Dryad is designed to provide a home for the “long tail” of data, where such formats and repositories do not (yet) exist. At the same time, Dryad is developing means to coordinate the submission process with specialised repositories in order to ensure each data file is appropriately managed.

Input quantity control: ensure community coverage?

Dryad is interdisciplinary and spans multiple scientific communities; annotation functions are under discussion.

Add metadata and references to describe authorship, provenance and experimental or simulation context?

The repository controls the quality and completeness of bibliographic metadata (title, authors, DOI, etc), including subject keywords to enable search. Provenance and other context are always provided, at least partially, by the associated article. Authors may supplement this upon deposit (eg with a ReadMe file) or include such information within a metadata-rich data file (eg XML).

Provide accession number to log deposition?

Yes, DataCite DOIs.

Alert registrants to new additions?

Yes, eg through an RSS feed

Provide means by which the data can be cited and credited to originators?

Yes. Dryad is frequently noted as an exemplar of data citation policy best practice: http://datadryad.org/using#howCite

Host or link to relevant analysis tools (eg visualisation, statistics)?

No.

Measure and document impact: downloads, data citations?

Views and downloads are reported on a per file and per data package basis (eg see http://datadryad.org/depositing#viewStats). Tracking data citations is a long-range objective, but not currently feasible technically.

Other notes: Dryad is governed by a diverse set of stakeholder organisations. Dryad is itself a service to its membership in providing a forum for the discussion of data policies and the promotion of best practice in data archiving.

Dryad has an open access business model in which curation and preservation costs are paid upfront to ensure that the data can be provided at no cost to those who wish to use it. Nearly all content in the repository is made available for reuse through a Creative Commons Zero waiver, and so can be built upon both by academic researchers and third party value-added services (eg more specialised data repositories that provide additional curation). Dryad also enables partner journals to integrate manuscript and data submission through automated exchange of metadata emails. This ensures that data records are prepopulated with bibliographic information in order to reduce the submission burden on authors, and partner journals are notified of all completed submissions, including DOIs. Partner journals may allow or disallow authors to set a one year embargo on access to a datafile, and editors may specify custom embargo lengths. Partner journals may offer editors and peer-reviewers anonymous and secure access to data from manuscripts prior to their acceptance.

Institutional repositories (Tier 3)

Most university repositories in the UK have small amounts of staff time. The Repositories Support Project survey in 2011 received responses from 75 UK universities. It found that the average university repository employed a total of 1.36 FTE, combined across Managerial, Administrative and Technical roles. 40% of these repositories accept research data. In the vast majority of cases (86%), the library has lead responsibility for the repository.

ePrints Soton

ePrints Soton, founded in 2003, is the institutional repository for the University of Southampton. It holds publications including journal articles, books and chapters, reports and working papers, higher theses, and some art and design items. It is looking to expand its holdings of datasets.

It currently has metadata on 65653 items. The majority of these lead to an access request facility or point to open access material held elsewhere. It holds 8830 open access items. There are 46,758 downloads per month, and an average of 826 new uploads every month. The total size of the repository is 0.25TB. It has a staff of 3.2 FTE (1 FTE technical, 0.9 senior editor, 1.2 editors, 0.1 senior manager). Total costs of the repository are £116,318, comprising staff costs of £111,318 and infrastructure costs of £5,000. (These figures do not include a separate repository for electronics and computer science, which will be merged into the main repository later in 2012.) It is funded and hosted by the University of Southampton, and uses the ePrints server, which was developed by the University of Southampton School of Electronics and Computer Science.

Platform provision, maintenance and development?

Yes

Multiple formats or versions (eg PDF, html, postscript, latex; multiple revisions of datasets)?

Yes

‘Front end’ - web-access to pages?

Yes

Registration and access control?

Yes

Input quality control: consistency with format standards, adequate technical standards?

Yes, although stronger on metadata. Starting to do more on recommended storage formats for objects for preservation purposes but more to do in this complex area

Input quantity control: ensure community coverage?

Yes

Add metadata and references to describe authorship, provenance and experimental or simulation context?

Yes, a new data project means the repository will be working more on provenance and contextual information for data. Up to now mostly publications rather than data.

Provide accession number to log deposition?

Yes

Alert registrants to new additions?

Yes – users can set up alerts

Provide means by which the data can be cited and credited to originators?

Yes

Host or link to relevant analysis tools (eg visualisation, statistics)?

Stats visualisation

Measure and document impact: downloads, data citations?

Yes, downloads, harvest ISI citation counts

Other notes: Integration with other systems – eg user/project profile pages, reporting for internal and external stakeholders, import/export in various formats including open data RDF format.

DSpace at MIT

DSpace@MIT is MIT's institutional repository built to save, share, and search MIT's digital research materials including an increasing number of conference papers, images, peer reviewed scholarly articles, preprints, technical reports, theses, working papers, and more. It was founded in 2002.

As of December 2011, DSpace@MIT held 53,365 total items, comprising 661,530 bitstreams. The scope of its holdings of research data is unknown: whilst submitters can designate new items as being of a ‘research dataset’ content type, this information is not required. It receives around one million browser-based file downloads per month, and an additional 1.3 million crawler-based file downloads per month. It receives around 700 uploads of new items per month. The total size of the repository is currently 1.1TB. Growth is anticipated at ~250GB/yr with current service scope.

The repository has 1.25 FTE dedicated to overall program administration and technical support. Additional capacity of 1.5 FTE supports the identification, acquisition, ingest, and curation of MIT’s database of Open Access Faculty Articles. There are additional staff costs associated with identifying and managing the collections which are curated by the MIT Libraries and disseminated via the DSpace platform (eg theses, technical report series, working papers, etc.), but these costs are independent of DSpace@MIT and are borne by other units of the Libraries. The total cost of the repository itself is approximately $260,000 per year, of which around $76,500 are infrastructure costs, and around $183,500 direct or indirect staff costs.
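The cost breakdown and the anticipated growth rate lend themselves to a quick check. The sketch below confirms that the two reported cost components sum to the reported total, and projects the repository size forward on the assumption that growth stays linear at the quoted ~250GB per year. The conversion of 1.1TB to 1100GB (decimal units) and the function name are assumptions made for illustration.

```python
# Figures quoted in the text for DSpace@MIT.
total_cost_usd = 260_000
infrastructure_cost_usd = 76_500
staff_cost_usd = 183_500
current_size_gb = 1_100    # 1.1TB, taking 1TB = 1000GB (assumption)
growth_gb_per_year = 250   # "~250GB/yr with current service scope"

# The two cost components should sum to the reported total.
assert infrastructure_cost_usd + staff_cost_usd == total_cost_usd


def projected_size_gb(years):
    """Linear projection of repository size after a number of years."""
    return current_size_gb + growth_gb_per_year * years


for years in (1, 3, 5):
    print(years, projected_size_gb(years))
```

Under this linear assumption the repository would roughly double in size in a little over four years; any change in service scope, such as accepting more research datasets, would invalidate the projection.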

 

Platform provision, maintenance and development?

DSpace@MIT is run as a single repository instance for all contributing communities at MIT.  Provision, maintenance and development are done in house for this library-run service. 

Multiple formats or versions (eg PDF, html, postscript, latex; multiple revisions of datasets)?

Multiple formats are supported but are not automatically generated by the system upon ingest.  Versioning is supported through creation of multiple items with cross-reference links and descriptive text.

‘Front end’ - web-access to pages?

Yes

Registration and access control?

Yes

Input quality control: consistency with format standards, adequate technical standards?

Support from within the Libraries varies here depending upon the source community and target collection.  The MIT Open Access Articles collection is heavily curated, as are other collections mediated by the Libraries.  However, the DSpace@MIT service is open to the faculty and research community at large and aside from specific collections is largely unmediated – i.e., there is no specific review of pre-ingested content to determine the quality and completeness of entry. 

Add metadata and references to describe authorship, provenance and experimental or simulation context?

The ability to input this metadata is supported within the system.  They provide a best practices guide that aids submitters with respect to describing research datasets.  The guide includes recommendations for describing the hardware, software and conditions that created the dataset, file format descriptions, and requirements for reuse of the data.

Provide accession number to log deposition?

Internally, the system creates an identifier for the submitted items that are directly referenceable to back-end database queries.  Additionally, each item receives a handle URI (similar to a DOI) that is a permanent, persistent and citable URI.  It does not yet support DataCite or other file-level identifiers (DSpace items can contain multiple files).

Alert registrants to new additions?

Yes.  Users can set up e-mail notification of new content or via RSS.

Provide means by which the data can be cited and credited to originators?

Yes.  Permanent, persistent handle URI for citation at the item level.

Host or link to relevant analysis tools (eg visualization, statistics)?

Supported if these links to relevant tools are added by the submitter.  It does not have embedded ‘dissemination’ services that would produce such visualizations, analytics or derivatives on the fly from submitted bitstreams.

Measure and document impact: downloads, data citations?

The repository captures internal usage statistics but these are not publicly displayed or redistributed to authors/creators/submitters. They do not attempt to track subsequent citation of their content.

Other notes: Most of the comments above have not directly referenced research data specifically.  As an institutional repository, DSpace@MIT serves as the single repository for the breadth of research and teaching output of the Institute.  As such, DSpace was designed to support submission of all formats, but without description, dissemination and search facilities that were specialized for various format types.  Moreover, DSpace@MIT has historically been modelled as an unmediated service open to the faculty and research community at MIT. 

The data model and metadata schema allows for the notation of related items, either held within the repository or externally.  This allows for linking a locally-held dataset to an externally published article or to denote relationships among items. Also, DSpace@MIT supports the application of a Creative Commons license for submitted research data.

Oxford University Research Archive and DataBank

The Oxford University Research Archive (ORA) and DataBank services are being developed as part of the Bodleian Libraries’ digital collections provision for Oxford. ORA is a publications repository, which holds a mixture of ‘text-like’ items. The data repository, DataBank, is well developed and is being developed further as part of the JISC-funded Damaro project. It will form one service within a broader research data infrastructure for Oxford. The Bodleian Libraries are also developing a research data catalogue (DataFinder) to record metadata about Oxford research data for discovery of datasets. ORA was founded in 2007, DataBank in 2008: both are still in development.

ORA currently holds 14,500 items, and DataBank 12 datasets. There are 1100 downloads from ORA per month; figures for DataBank are not available. ORA has around 100 uploads per month, excluding bulk ingests. DataBank currently has no deposit interface (one is in development), and requires assisted deposit. The service is not yet broadly advertised.

ORA has a staff of 2.5 FTE (0.5 manager; 1.0 developer; 1.0 assistant). Staffing that will be required for DataBank is not yet clear, but these staff will overlap with ORA. Total running costs are not available. The service is hosted by the Bodleian Libraries. Funding for DataBank is under discussion within the University. ORA uses Fedora, whilst DataBank uses Oxford DAMS (Digital Asset Management System).

 

Platform provision, maintenance and development?

Yes.

Multiple formats or versions (eg PDF, html, postscript, latex; multiple revisions of datasets)?

Format agnostic. Advise open formats if possible. All datasets in DataBank should be ‘publishable’ and can therefore be assigned a DOI. Updated versions can be accepted (DOI can indicate version number).

‘Front end’ - web-access to pages?

Yes

Registration and access control?

Open access if permitted. Embargo facility if not.

Input quality control: consistency with format standards, adequate technical standards?

On the repository side, yes.

Add metadata and references to describe authorship, provenance and experimental or simulation context?

Working towards mandating DataCite kernel for data but may mandate additional fields (eg rights) [To be discussed].

Provide accession number to log deposition?

Every item assigned a UUID as well as internal system PID

Alert registrants to new additions?

ORA: [Feature on home page]; RSS feed; Twitter

Provide means by which the data can be cited and credited to originators?

DOI for datasets; UUID for every item in both repositories; PURL resolver currently being deployed; DataFinder will provide a record including location for Oxford data even if not stored at Oxford.

Host or link to relevant analysis tools (eg visualisation, statistics)?

ORA: Statistics analytics (Piwik)

Measure and document impact: downloads, data citations?

ORA: Record accesses and downloads

Other services

DOI assignment for datasets (DataCite)

Other notes: DataBank is not yet fully functioning (deposit and search features are under development, as is the user interface design). The handful of datasets in the repository can be freely accessed by using the exact URL. The Damaro project will see development of DataFinder. Policies, sustainability and training will also be developed as part of Damaro. A colleague is working on the Oxford DMPOnline project (data management planning tool), which runs parallel to Damaro. We are expecting the basic service to be launched during 2013. ORA is small as yet and still in development. We see increasing numbers of doctoral theses (institutional mandate). We are currently starting promotion of easy deposit into ORA using Symplectic. We are aiming to run more bulk uploads where possible.

Science as an open enterprise

Final report, case studies of data use and data repositories and the launch event
published June 2012

Public meeting and seminar
held in November 2011

Call for evidence
closed August 2011

Public town hall meeting
held in June 2011

Project details, Terms of Reference and Working Group
announced in May 2011
