1. Solving the data problem for research and beyond
Matthew Dovey, Head of e-infrastructure strategy, Jisc
John Kaye, Senior co-design manager - research data, Jisc
28/04/2017
3. Research is changing
»The 4th Paradigm of data-intensive research and
data-driven innovation
»Open by default
»Dependency on digital infrastructures and digital transformation
»Globally competitive environment – digital transformation is open
to everyone
28/04/2017 Solving the data problem 3
4. The vision
»Jisc’s vision is to make the UK the most digitally advanced
research nation in the world by fully exploiting the possibilities of
modern digital empowerment, content and connectivity
»Jisc will provide the underlying infrastructure which can scale and
flex to enable researchers to deliver the outcomes that funders,
government, industry and society want from the sector
»Our vision is of a seamless, interoperable digital infrastructure
which gives researchers and research organisations the freedom
to apply their strategic resources to maximise their research
impact and minimise the cost and burden of the supporting
operations
5. The vision
Underpinning infrastructure
Information model
Dynamic research platform
»Cyber-Security Support
»Data Assurance
»Network Performance Optimisation
»Procurement Frameworks
»Research Analytics
»Research Outputs - Publication, Curation, Archiving and Preservation
»Content Licensing, Discovery and Management
»Standards and Identifiers
»Vocabularies
»Data Model
»Janet Backbone
»Federated Access and Identity Management
»Data Centres
Research enabling services
»Advanced Networking Technologies
»Data Warehouse
»Flexible Storage
»Metadata Profiles
»Application Profiles
»Data Brokerage
6. Top three priorities
Jisc will provide three elements of the vision:
»Comprehensive connectivity across the infrastructure at a range of scales (local, regional, national, international)
»A coherent suite of research services which reduces the burden on institutions, increases efficiency, delivers solutions to common problems and improves the UK’s research performance
»Representation of the UK’s digital needs in our engagements and advocacy in the national and international arena
7. Research strategy outcomes
1. The UK’s research environment is underpinned by flexible, scalable infrastructure where
standards based approaches ensure that data can be generated, moved, stored, found
and used with the minimum of cost or burden to the institution and the researcher
2. The transition from Open Access to Open Science where research objects are findable,
accessible, interoperable and reusable by academia, industry and society for wider
economic and social benefit
3. UK interests are represented in both international policy and operational environments
enabling UK researchers to collaborate, compete and comply with the global research
community
4. The UK maintains its position as a digital thought leader and shaper of both research
infrastructures and the wider scholarly communications environment
5. The investment in the mission-critical UK E-Infrastructure required by the research base
is safeguarded for the long-term enabling UK Research to continue to punch above its
weight in the global research environment
9. Motivation and engagement
»Initial interest explored with SDC-North tenants
»Informal vendor discussions to determine technical feasibility
»Requirements workshop – November 2016
»Active working group to develop full business case for phased
implementation in 2017
»Progress and input from wider community via
https://community.jisc.ac.uk/groups/tiered-storage
10. Opportunities
» Provide a national storage provision filling a current gap
› Universities looking at ever-increasing storage requirements and needs
› Confused by different approaches (in house, cloud, hybrid), technologies, solutions,
pricing structures
› Different requirements and policies (internal, and externally imposed)
» Remove headache of procurement and management across multiple providers and
technologies
» Maximise Janet network value
» De-risk universities in an area of exponential growth
› Low-risk PAYG infrastructure avoids over-investment
11. Benefits
» Savings on costs of power, cooling and carbon arising from a modern consolidated
infrastructure in a high-specification datacentre with modern cooling
» Procurement cost savings not just from quantity of procurements, but also from
timeliness of procurements: you will get cheaper overall storage costs by procuring 100TB
a year in each of five years than procuring 500TB once (simply because you get more
storage for your money as time goes on)
» Operational savings on time for installing and managing storage hardware
» Clear compliance with research council expectations for appropriate data management
across the research lifecycle
» Benefits across the University sector of providing a standard for research data
management and a standard costing
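The procurement timing argument can be checked with a little arithmetic. The sketch below is illustrative only: the starting price of 100 pounds per TB and the 20% annual fall in unit price are assumptions chosen for the example, not figures from Jisc or any supplier. It also ignores volume discounts, discounting of future spend, and the fact that phased capacity only becomes available as it is bought.

```python
def upfront_cost(total_tb: float, price_per_tb: float) -> float:
    """Cost of buying all capacity in year 0 at today's unit price."""
    return total_tb * price_per_tb


def phased_cost(tb_per_year: float, years: int, price_per_tb: float,
                annual_decline: float) -> float:
    """Cost of buying the same capacity in equal yearly tranches.

    The unit price is assumed to fall by a fixed fraction each year.
    """
    return sum(
        tb_per_year * price_per_tb * (1 - annual_decline) ** year
        for year in range(years)
    )


# Assumed figures, for illustration only.
PRICE_PER_TB = 100.0    # hypothetical price in pounds per TB in year 0
ANNUAL_DECLINE = 0.20   # hypothetical 20% fall in unit price per year

once = upfront_cost(500, PRICE_PER_TB)
phased = phased_cost(100, 5, PRICE_PER_TB, ANNUAL_DECLINE)

print(f"500TB bought once:       {once:,.0f}")
print(f"100TB a year for 5 years: {phased:,.0f}")
```

Under these assumed figures the single purchase costs 50,000 and the phased purchases about 33,600. The size of the saving depends entirely on how quickly unit prices actually fall.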
12. Multi-vendor tiered storage proposal
[Diagram: Jisc tiered storage service]
» Applications shown: customer applications, RDM shared services, Cloud9
» Customer infrastructure (eg VMware vSphere)
» Access protocols: iSCSI, SMB, CIFS, NFS, S3, https, Swift, Ceph, …
» HSM appliance, governed by the HSM data policy
› Pool prioritisation
› Replication
› Snapshots
› SLAs (eg retention, availability, security)
» Storage pools: distributed storage pool, cloud storage pool, archival storage pool
» Providers shown: AWS, Google, Amazon Glacier, Arkivum
13. Tiered storage proposal - pools
Columns: pool, overview, class, copies, Recovery Time Objective (RTO), Recovery Point Objective (RPO)
» Distributed storage pool
› Overview: data stored near sites (possibly based on SDC1, SDC2 and other locations, eg national research e-infrastructure centres, other NRENs) to give onsite/near-site recovery times; erasure coding gives the equivalence of 2 copies with ~1.6 times storage capacity
› Class: leverages the Janet backbone to deliver onsite equivalence
› Copies: equivalent to 2 copies, including offsite
› RTO: onsite/near-site equivalent
› RPO: < 1 hour
» Cloud storage pool
› Overview: managing data copies across multiple cloud providers
› Class: archive
› Copies: equivalent to 2 copies, including offsite
› RTO: < 1 hour
› RPO: 1-24 hours
» Archival storage pool
› Overview: managing data copies across multiple cloud “vault” providers (ie 99% or 100% guaranteed data recovery)
› Class: vault
› Copies: guaranteed recovery
› RTO: N/A
› RPO: N/A
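The ~1.6 times capacity figure quoted for the distributed storage pool is what a k-of-n erasure code would give. As a sketch, assume a scheme with 10 data fragments and 6 parity fragments, and assume a maximum distance separable code such as Reed-Solomon. Both are assumptions for illustration: the slide does not name the scheme or its parameters. The raw capacity needed then compares with plain two-copy replication as follows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErasureScheme:
    data_fragments: int    # k: fragments needed to rebuild an object
    parity_fragments: int  # m: extra fragments added for redundancy

    @property
    def overhead(self) -> float:
        """Raw storage consumed per unit of user data: (k + m) / k."""
        return (self.data_fragments + self.parity_fragments) / self.data_fragments

    @property
    def losses_tolerated(self) -> int:
        """For an MDS code (assumed here), any m fragments can be lost."""
        return self.parity_fragments


# Hypothetical 10+6 scheme, chosen only because it yields 1.6x overhead.
scheme = ErasureScheme(data_fragments=10, parity_fragments=6)

# Two full copies, modelled as 1 data fragment plus 1 redundant fragment.
two_copies = ErasureScheme(data_fragments=1, parity_fragments=1)

for name, s in [("10+6 erasure code", scheme), ("2 full copies", two_copies)]:
    print(f"{name}: {s.overhead:.1f}x raw capacity, "
          f"survives loss of {s.losses_tolerated} fragment(s)")
```

The trade-off is that rebuilding an object needs k fragments to be reachable, so recovery time depends on the network between sites. This may be why the slide ties the distributed pool to the Janet backbone.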
14. Requirements and demand working group
Current members
»University of Oxford
»University of Leeds
»University of Manchester
»University College London
»London School of Economics
»Natural History Museum
»Additions welcome
Key outputs
»Phased technical specification
»Use scenarios
› (eg data movement)
»Business and financial case
› (including TCO analysis)
»Market review and supplier engagement
15. Tiered storage positioning
[Diagram: positioning of Jisc Tiered Storage]
» Storage providers
» Jisc Tiered Storage
» Storage policy (label shown four times)
» Systems shown: Jisc RDSS, other Jisc services, local research data systems, other local systems (financial, T&L, etc)
16. Jisc research data shared service
17. The futures portfolio consists of three big areas
[Diagram labels: Store services; Playlists; Diagnostic tool builder; Curation and remix; Learner; Analytics services; Digital capability; Learning analytics; Digital launchpad; Apprentice workforce development; Digital leadership; Summer of student innovation; Analytics academy; Analytics labs; Qualification verification; App and content store; Research data discovery; Research data usage metrics; Equipment data; Repository and preservation platform; Research data shared service; ?]
22. …..but a challenging problem
Implementing Archivematica for research data preservation at York and Hull
Jenny Mitcham (Digital Archivist), University of York
27. Pilot MVP components
RDSS component offer: number of pilots requiring (total = 17)
» RDSS Repository: 14
» RDSS Preservation: 17
» RDSS Reporting: 14 (TBC)*
» RDSS Storage: 16
* Under review as additional reporting options may be available, also differing offers from
full dashboard/analytics to API only. Further discovery work is underway.
28. Pilot Alpha MVP integrations
Integration: number of pilots requiring (total = 17)
» EPrints (repository): 12
» DSpace (repository): 4
» Hydra (repository): 2
» Symplectic (CRIS)*: 4
» Pure (CRIS): 3
» Converis (CRIS): 1
» Authentication: 17
* RDSS Framework Supplier
29. Middlesex Figshare implementation
»Accelerated deployment in 10 weeks
(Installation by 10th November)
»Stakeholder engagement
»Development of institutional requirements
»Sign up to DataCite membership
»Implementation team (informal)
»Integration with Jisc Storage
»Implementation of pilot data repository
30. The University of Jisc Sandbox
» Scratch environment for testing of
configuration and integration of service
platform components
» A mock HEI to integrate with
» Infrastructure as code, learning from
building and managing the mixture of
SaaS and custom applications. This will
allow easy push-button installation of
products
» Working with test data and metadata
taken from real HEI repositories
» Consistent and standardised UX
» Bespoke development environment
[Diagram: sandbox components]
» Apps
» CRIS
» Test data: Zenodo, RDSS pilot HEI repositories, publisher data
» AWS storage + tools
» Data repositories: Figshare, Hydra, Islandora, Haplo
» Publication repositories: EPrints, DSpace
» Preservation systems: Preservica, Archivematica
» Additional software and services
32. Preservation of research data
“I currently spend about £1,200 pa on data
storage from my own salary. I have the highest
data needs in my School, and there is no plan in
place for storing my data.”
33. Sensitive research data
“It would be helpful to clarify the rules for storing
anonymised data on cloud services. My
departmental rules say this is never OK, however
this seems to contradict University rules.”
34. University services to support RDM
“Support is woeful in the university currently, in
particular long-term data archiving is critically
required. Most of my non-current data is rotting
on CD's and hard-drives.”
35. University services to support RDM
“Please, individualise the support. Workshop are
useless, emails with information are useless,
brochures are useless, posters are useless.”
38. What we’d like to know…..
» What are your current priorities and pain points with managing data?
» Do you have or are you expecting a data deluge?
» What would you like Jisc to provide for managing data?
» What would you like the Jisc offer to look like?
» Have we missed anything in our pilots? Are there gaps?
» Are there any aspects of data management you’d like to keep ‘in-house’?
» Do you have issues around research systems user experience for researchers and staff?
» Do you have issues around systems interoperability?
» Do you have preservation needs beyond research data (eg records management, archives)?
» Can you share any hooks or incentives to engage researchers in data management services?
» Any tips for success and lessons learned that we can utilise in implementing systems?
» Anything else…..
39. Solving the data problem
Matthew Dovey
Head of e-infrastructure strategy
matthew.dovey@jisc.ac.uk
John Kaye
Senior co-design manager – Research Data
john.kaye@jisc.ac.uk
jisc.ac.uk/rd/projects/research-data-shared-service
https://community.jisc.ac.uk/groups/tiered-storage
Editor's Notes
What we have now – fragmentation, lack of interoperability, some good practice within subject areas but not the efficiencies possible when we deliver at scale.
Vision
Researchers shouldn’t need to think (too much!) about Research Data Management
"Visible data, invisible infrastructure"
Provide researchers intuitive, easy functionality to publish, archive and preserve their research outputs.
Provide interoperable systems to allow researchers and institutions to fulfil and go beyond policy requirements and adhere to best practice throughout the RDM lifecycle.
Goals
RDM Policy compliance
Increased sector efficiencies: procurement, data re-use, interoperability opportunities
Improving the integrity of research
Addressing Market Gaps: Integrated RDM system, Preservation Gap, Usability
Accelerating Research Data Management in institutions
Supporting institutions to meet Open Access/REF requirements
Preservation
This is the big gap: many institutions are only now starting to address this need, in particular the question of what to keep (and what not to keep) and how long to keep things for.
While there are solutions like Arkivum there is a gap in terms of curating for preservation – tools that allow file format identification, metadata and the creation of archival information packages – data integrity and even emulation.
There is also a lack of true integration from data creation through to long term preservation.
The long tail
The long tail of unidentifiable files that we will have to deal with
Mention Jenny Mitcham's stats - around 60% of unidentifiable items in the RDM collection using existing workflows
PDFs - easy to deal with, as the problem is solved by global initiatives, e.g. JHOVE, veraPDF
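To make the "unidentifiable files" problem concrete, here is a minimal sketch of signature-based format identification, the general technique behind identification tools. The signature table is a tiny hand-picked sample and the function names are hypothetical. Production tools such as DROID match against the much larger PRONOM registry, and this sketch is not a description of the York and Hull workflow.

```python
from collections import Counter
from pathlib import Path
from typing import Optional

# A small illustrative sample of leading-byte signatures ("magic numbers").
SIGNATURES = {
    b"%PDF-": "PDF",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"GIF87a": "GIF",
    b"GIF89a": "GIF",
    b"PK\x03\x04": "ZIP container (also used by DOCX, XLSX, ODT)",
}


def identify(header: bytes) -> Optional[str]:
    """Return a format name if the header starts with a known signature."""
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return None


def survey(root: Path) -> Counter:
    """Count files under root by identified format."""
    counts: Counter = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            with path.open("rb") as fh:
                # 16 bytes is enough for every signature in the sample table.
                counts[identify(fh.read(16)) or "unidentified"] += 1
    return counts
```

Calling `survey(Path(...))` on a collection returns a count per format. The share that falls under "unidentified" is the long tail, typically research-specific formats with no registered signature.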
Interoperability
In many ways the integration with other existing systems is the key USP for many potential stakeholders.
No two institutional set-ups are the same, and the shared service has to integrate with each one. Integrating across all of the lots shown here, and plugging them into reporting services, aggregators and funder systems, is therefore a major challenge.
We do it because it is hard.
Worktribe -
Note that data is the top line, but our solution will also meet text requirements, hence the OA/REF goal here.
Some of the important issues and requirements that will be addressed in beta are the service approach to managing large datasets and storage, and access management for sensitive datasets.
The beta phase also covers significant development and improvements to the user experience and integration with additional institutional systems such as HR, finance and ethics.
Powerfolder