Moving Gigantic Files In and Out of the Repository
Jeff Potts
Metaversant Group, Inc.
Learn. Connect. Collaborate.
What’s the Deal with Large Files?
• Alfresco can manage files of any size, but getting large files into and out of the repo is often problematic
• They take way too long to transfer
– Sessions time out
– Machines go to sleep
– Incomplete files get transferred
– Users think, “Is this thing hung?” and then cancel
• End-users must actively monitor transfers in most cases
This talk is a technical case study about an approach to significantly improving large file transfers
About Noble Research Institute
• Research organization focused on improving agriculture for all mankind
– Research
– Producer Relations
– Applied agricultural systems and stewardship
– Education
• About 400 employees from all over the world
• Headquartered in Ardmore, Oklahoma
• https://www.noble.org
About Metaversant
• Consulting firm focused on solving business problems with open source Content Management, Workflow, & Search technology
• Founded in 2010
• Clients all over the world in a variety of industries, including:
– Airlines
– Manufacturing
– Construction
– Financial Services
– Higher Education
– Life Sciences
– Professional Services
https://www.metaversant.com
The Problem
• Researchers work with very large files
• Typical size ranges from a few GB to hundreds of GB
• Source of the files is mixed
– Generated internally (e.g., by gene sequencing machines)
– Acquired as data sets from other research institutions
• Data governance team wants everything in Alfresco
• Large size makes moving files in and out of Alfresco difficult
What We Tried
• Desktop Sync
• CMIS update content stream
– Every update creates a version, and disabling auto-versioning is somewhat painful
• Increasing timeouts
– Losing battle when files are multiple gigabytes
• Using Alfresco FTP
– Usually requires a thick client to be installed
– Not preferred by end-users
• Resumable upload Share customization
– Actually worked pretty well
– Only handles uploads, not downloads
Sidebar: Resumable Upload Details
• Share customization (closed source)
• Leverages resumable.js, see http://www.resumablejs.com/
• Utilizes the HTML5 File API
• If an upload stalls or ends prematurely, the end-user can restart where it left off
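As a rough illustration of how "restart where it left off" can work, the sketch below splits a file into fixed-size chunks and works out which ones still need to be sent. It is a minimal sketch, not the closed-source Share customization and not resumable.js's actual chunking rules; the function names, the fixed chunk size, and the idea that the server reports a set of already-received chunk numbers are all assumptions made for the example.

```python
import math

def plan_chunks(file_size, chunk_size):
    """Return (chunk_number, start_byte, end_byte_exclusive) for every chunk.

    Chunk numbers start at 1. An empty file still yields one empty chunk.
    """
    total = max(1, math.ceil(file_size / chunk_size))
    return [
        (n, (n - 1) * chunk_size, min(n * chunk_size, file_size))
        for n in range(1, total + 1)
    ]

def chunks_to_resume(file_size, chunk_size, already_uploaded):
    """Return only the chunks the server has not confirmed yet.

    `already_uploaded` is a hypothetical set of chunk numbers that the
    server reports as received before the upload stalled.
    """
    return [
        chunk for chunk in plan_chunks(file_size, chunk_size)
        if chunk[0] not in already_uploaded
    ]
```

With a 10-byte file and 4-byte chunks, an upload that stalled after chunks 1 and 2 only needs to resend the final chunk covering bytes 8 to 10.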
Inescapable math related to moving large files
• How long does it take to move 25 GB of data?
– Ethernet (10 Mbit/s): 333.33 minutes
– Fast Ethernet (100 Mbit/s): 33.33 minutes
– Gigabit Ethernet (1 Gbit/s): 3.33 minutes
– 10 Gigabit Ethernet (10 Gbit/s): 0.33 minutes
– 100 Gigabit Ethernet (100 Gbit/s): 0.03 minutes
• Assumes full bandwidth is available
• Network only; does not account for disk or other non-network latencies on either end
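The figures in this list can be reproduced with a few lines of arithmetic. The sketch below assumes decimal units (1 GB = 10^9 bytes, 1 Mbit = 10^6 bits), which is the convention that matches the numbers on the slide; the function name is made up for the example.

```python
def transfer_minutes(size_gb, link_bits_per_second):
    """Minutes needed to move size_gb gigabytes over a fully available link."""
    bits = size_gb * 1e9 * 8          # decimal gigabytes to bits
    seconds = bits / link_bits_per_second
    return seconds / 60

# Link speeds from the slide, in bits per second.
LINKS = {
    "Ethernet": 10e6,
    "Fast Ethernet": 100e6,
    "Gigabit Ethernet": 1e9,
    "10 Gigabit Ethernet": 10e9,
    "100 Gigabit Ethernet": 100e9,
}

if __name__ == "__main__":
    for name, speed in LINKS.items():
        print(f"{name}: {transfer_minutes(25, speed):.2f} minutes")
```

For 25 GB the total is 200 billion bits, so a 10 Mbit/s link needs 20,000 seconds, or about 333 minutes.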
It’s not the actual import/export that’s killing us, it’s the movement of so many bytes over the network
Technologies That Move Large Files
• BitTorrent
– Looked at BitTorrent Sync, which became Resilio Sync
– Performance increases when multiple people have the same file
– Primarily peer-to-peer with an emphasis on desktop-to-desktop or between devices
• GridFTP
– Extends FTP to add parallelism
– Multiple implementations, including at least one that is commercially supported
– Works between servers, desktop-to-server, and between devices
GridFTP was created to move large files to clusters
• Extension of FTP
• Defined by the Open Grid Forum (http://www.ogf.org)
• Designed specifically to facilitate transfers of large files and large sets of files
• Uses multiple parallel streams to move data over TCP
• One of several mechanisms that a product called Globus uses to move data between endpoints
• More information at http://toolkit.globus.org/toolkit/docs/6.0/gridftp/
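To make the "multiple parallel streams" idea concrete, the sketch below copies one file using several workers, each responsible for its own byte range. This is only an analogy: it uses local files and threads in place of TCP streams, it is not the GridFTP protocol, and every function name in it is invented for the example.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def split_ranges(size, streams):
    """Divide [0, size) into up to `streams` contiguous (offset, length) ranges."""
    base, extra = divmod(size, streams)
    ranges, offset = [], 0
    for i in range(streams):
        length = base + (1 if i < extra else 0)
        if length:
            ranges.append((offset, length))
        offset += length
    return ranges

def copy_range(src, dst, offset, length, block=1 << 20):
    """Copy `length` bytes starting at `offset`; stands in for one stream."""
    with open(src, "rb") as fin, open(dst, "r+b") as fout:
        fin.seek(offset)
        fout.seek(offset)
        remaining = length
        while remaining > 0:
            data = fin.read(min(block, remaining))
            if not data:
                break
            fout.write(data)
            remaining -= len(data)

def parallel_copy(src, dst, streams=4):
    """Copy src to dst using several concurrent range copies."""
    size = os.path.getsize(src)
    with open(dst, "wb") as f:
        f.truncate(size)              # pre-size the target file
    with ThreadPoolExecutor(max_workers=streams) as pool:
        futures = [
            pool.submit(copy_range, src, dst, off, length)
            for off, length in split_ranges(size, streams)
        ]
        for future in futures:
            future.result()           # re-raise any worker error
```

Because each worker writes to a disjoint region of the target, the ranges can complete in any order, which is the property that lets parallel transfers keep a link busy.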
Globus provides data migration tools to researchers
• Non-profit business within the University of Chicago
• Focused on providing low-cost tools to researchers doing data-intensive research
• Globus is SaaS that acts as a middleman to coordinate transfers of data between endpoints
• Publishes a list of public endpoints
• Provides an API and services such as authentication
• Sync between two endpoints typically uses the GridFTP protocol
• It is possible to use GridFTP without leveraging Globus
– See http://toolkit.globus.org/toolkit/docs/latest-stable/admin/install/
Globus/GridFTP helps move bytes over the network. The Alfresco Bulk File System Import Tool (BFSIT) does fast imports once the files are on the server
High-Level Approach: Two Step Import
• First Step: Globus Connect Personal to Globus Endpoint (files land on the shared mount)
• Second Step: Alfresco Bulk File System Import (from the shared mount)
High-Level Approach: Two Step Export
• First Step: Write file(s) to File System (the shared mount)
• Second Step: Globus Endpoint to Globus Connect Personal
With the high-level approach determined, it was time to work on the details
Where to Put the UI?
• Considered Share
– But researchers were already looking for a more streamlined interface
• Considered ADF
– But it was too new at the time
– Wasn’t the right fit for this particular application
• Decided on a custom Spring Boot application
– Needed an app anyway
– Could bring ADF in later if desired
Custom Globus Alfresco Transfers application
Simple Scope
• Start transfer jobs
• See the status of transfer jobs
• Publish and subscribe to queues used to coordinate multi-step transfers
• Authentication
– Authenticates against Alfresco
– Accounts linked to Globus via OAuth
Built With
• Spring Boot
• Angular 4
• Bootstrap 3
• Apache ActiveMQ
• Apache Maven
Solution Components
• Alfresco Enterprise Edition, Clustered
• Globus Server Endpoint
• Both point to the same shared mount
• Globus SaaS communicates with
– Globus Server Endpoint
– Each individual’s Globus Connect Personal
• Globus SaaS provides a REST API
• Spring Boot application used to create transfer jobs
• Coordinates the transfers
• Persists transfer job and user objects to PostgreSQL
• Everything is asynchronous
• Apache ActiveMQ acts as the message broker, persists queues
Queues and Listeners
• Alfresco Import Listener: Given a file path, imports it into a specified node ref using BFSIT
• Alfresco Export Listener: Given a node ref, exports it to a specified file path
• Globus Inbound Transfer Listener: Given an endpoint ID and a path, transfers it to the Noble endpoint
• Globus Outbound Transfer Listener: Given a path on the Noble endpoint, transfers it to a specified path on an endpoint
• Transfer Status Listener: Persists status changes; kicks off next step
(Diagram labels: AMP; Globus Alfresco Transfers Spring Boot App)
Importing into Alfresco
Transfer to Alfresco (1)
1. Save Transfer Job
2. Put message on a queue: “Do Globus transfer”
Transfer to Alfresco (2)
1. See message: “Do Globus transfer”
2. Start transfer
3. Perform the transfer
4. Put message on the queue: “Globus transfer done”
Transfer to Alfresco (3)
1. See message: “Globus transfer done”
2. Update status
3. Queue message: “Do Alfresco transfer”
Transfer to Alfresco (4)
1. See message: “Do Alfresco import”
2. BFSIT import
3. Queue message: “Alfresco import done”
4. See message: “Alfresco import done”
5. Update status
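The import hand-offs on these slides can be sketched as a small in-memory simulation. This is an illustration only: the real system uses Apache ActiveMQ queues, a Globus transfer, and BFSIT, none of which are called here, and the function and handler names are hypothetical. The sketch also uses “Do Alfresco import” as the single name for the message that slide (3) labels “Do Alfresco transfer”, on the assumption that they are the same message.

```python
from collections import deque

def run_import(job_id):
    """Walk one import job through the queue hand-offs and return its status history."""
    queue = deque()
    statuses = []  # stands in for status rows persisted by the status listener

    def globus_inbound_listener(job):
        # A real listener would start a Globus transfer and wait for it to finish.
        queue.append(("Globus transfer done", job))

    def alfresco_import_listener(job):
        # A real listener would run BFSIT against the file on the shared mount.
        queue.append(("Alfresco import done", job))

    def transfer_status_listener(message, job):
        statuses.append(message)                 # update status
        if message == "Globus transfer done":
            queue.append(("Do Alfresco import", job))  # kick off the next step

    work_handlers = {
        "Do Globus transfer": globus_inbound_listener,
        "Do Alfresco import": alfresco_import_listener,
    }

    statuses.append("Job saved")                 # slide (1), step 1
    queue.append(("Do Globus transfer", job_id))  # slide (1), step 2

    while queue:
        message, job = queue.popleft()
        if message in work_handlers:
            work_handlers[message](job)
        else:
            transfer_status_listener(message, job)
    return statuses
```

Because each listener only reacts to a message and emits the next one, no single component has to stay connected for the whole transfer, which is what lets the job run unattended.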
Downloading from Alfresco
Transfer from Alfresco (1)
1. Save Transfer Job
2. Put message on a queue: “Do Alfresco export”
Transfer from Alfresco (2)
1. See message: “Do Alfresco export”
2. Custom export
3. Queue message: “Alfresco export done”
Transfer from Alfresco (3)
1. See message: “Alfresco export done”
2. Update status
3. Queue message: “Do Globus transfer”
Transfer from Alfresco (4)
1. See message: “Do Globus transfer”
2. Initiate transfer
3. Do transfer
4. Queue message: “Globus transfer done”
5. See message: “Globus transfer done”
6. Set status
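One way to picture the Transfer Status Listener's "kick off next step" role across both directions is as a lookup table keyed by direction and completed step. The table below is a sketch inferred from the message names on the slides; the dictionary, the function name, and the "import"/"export" direction labels are assumptions, not the application's actual code.

```python
# Maps (direction, "done" message) to the next "do" message, or None when
# the job is finished. Message names are taken from the slides; the import
# side assumes "Do Alfresco import" is the message that follows the Globus step.
NEXT_STEP = {
    ("import", "Globus transfer done"): "Do Alfresco import",
    ("import", "Alfresco import done"): None,
    ("export", "Alfresco export done"): "Do Globus transfer",
    ("export", "Globus transfer done"): None,
}

def next_message(direction, done_message):
    """Return the message to queue next, or None if the job is complete."""
    key = (direction, done_message)
    if key not in NEXT_STEP:
        raise ValueError(f"unexpected message {done_message!r} for {direction} job")
    return NEXT_STEP[key]
```

Note that “Globus transfer done” means different things depending on direction: it is the midpoint of an import but the end of an export, so the direction has to be part of the key.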
How did we do?
Metrics: Multi-file* Upload/Download
Method                      Upload time   Upload rate            Download time   Download rate
Out-of-the-box              5 minutes     612 MB/min             6.4 minutes     476.6 MB/min
Globus Alfresco Transfers   2 minutes     1530 MB/min            3.6 minutes     1020 MB/min
Improvement                 60% faster    150% more throughput   53% faster      114% more throughput
*Four files totaling 3,060 MB
Metrics: Single-file* Upload/Download
Method                      Upload time   Upload rate           Download time       Download rate
Out-of-the-box              7.2 minutes   616.2 MB/min          DNF**               DNF**
Globus Alfresco Transfers   3.6 minutes   1220.4 MB/min         5.1 minutes         862.9 MB/min
Improvement                 50% faster    98% more throughput   Infinitely faster   Infinitely greater throughput
*Single file of size 4,418 MB
**Alfresco throws an exception at around 1 GB
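The "faster" and "more throughput" figures for uploads appear to follow the usual percentage-change formulas, which the sketch below spells out using the upload numbers from both tables. The function names are made up for the example, and the formulas are an inference from the published numbers rather than something the slides state.

```python
def percent_faster(old_minutes, new_minutes):
    """Reduction in elapsed time, as a percentage of the original time."""
    return (old_minutes - new_minutes) / old_minutes * 100

def percent_more_throughput(old_rate, new_rate):
    """Increase in transfer rate, as a percentage of the original rate."""
    return (new_rate - old_rate) / old_rate * 100

if __name__ == "__main__":
    # Multi-file upload: 5 minutes at 612 MB/min vs. 2 minutes at 1530 MB/min.
    print(round(percent_faster(5, 2)), round(percent_more_throughput(612, 1530)))
    # Single-file upload: 7.2 minutes at 616.2 MB/min vs. 3.6 minutes at 1220.4 MB/min.
    print(round(percent_faster(7.2, 3.6)), round(percent_more_throughput(616.2, 1220.4)))
```

Halving the time doubles the rate, which is why a 50% reduction in time corresponds to roughly a 100% increase in throughput.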
Results
• Transfers can now be done as “fire-and-forget” jobs
• Any number of files, any size
• Streamlined, purpose-built UI keeps researchers focused
• Integrates with existing sync technology that researchers like
• Reduced transfer time by 50 to 60%
• Increased transfer rate by 100 to 150%
Futures
• Improve download by doing a move from content store rather than a write
• Send files to/from any Globus endpoint, including external
– Currently the transfer source/target is Globus Connect Personal on Noble workstations
• Security hardening
• Set metadata on multiple files during import
• Auditing/usage reports
• Possible new requirements
– Scheduled/recurring transfers
– Share integration
– ADF integration
Thank You!
https://www.metaversant.com
https://ecmarchitect.com
@jeffpotts01

More Related Content

What's hot

Alfresco search services: Now and Then
Alfresco search services: Now and ThenAlfresco search services: Now and Then
Alfresco search services: Now and Then
Angel Borroy López
 
Alfresco Share - Recycle Bin Ideas
Alfresco Share - Recycle Bin IdeasAlfresco Share - Recycle Bin Ideas
Alfresco Share - Recycle Bin Ideas
AlfrescoUE
 

What's hot (20)

Alfresco 5.2 REST API
Alfresco 5.2 REST APIAlfresco 5.2 REST API
Alfresco 5.2 REST API
 
Alfresco Backup and Disaster Recovery White Paper
Alfresco Backup and Disaster Recovery White PaperAlfresco Backup and Disaster Recovery White Paper
Alfresco Backup and Disaster Recovery White Paper
 
Alfresco Workshop: Introduction to Records Management Using Alfresco Governan...
Alfresco Workshop: Introduction to Records Management Using Alfresco Governan...Alfresco Workshop: Introduction to Records Management Using Alfresco Governan...
Alfresco Workshop: Introduction to Records Management Using Alfresco Governan...
 
Alfresco search services: Now and Then
Alfresco search services: Now and ThenAlfresco search services: Now and Then
Alfresco search services: Now and Then
 
Scale your Alfresco Solutions
Scale your Alfresco Solutions Scale your Alfresco Solutions
Scale your Alfresco Solutions
 
Upgrading to Alfresco 6
Upgrading to Alfresco 6Upgrading to Alfresco 6
Upgrading to Alfresco 6
 
Collaborative Editing Tools for Alfresco
Collaborative Editing Tools for AlfrescoCollaborative Editing Tools for Alfresco
Collaborative Editing Tools for Alfresco
 
Alfresco tuning part2
Alfresco tuning part2Alfresco tuning part2
Alfresco tuning part2
 
Moving From Actions & Behaviors to Microservices
Moving From Actions & Behaviors to MicroservicesMoving From Actions & Behaviors to Microservices
Moving From Actions & Behaviors to Microservices
 
Alfresco CMIS
Alfresco CMISAlfresco CMIS
Alfresco CMIS
 
Alfresco Security Best Practices Guide
Alfresco Security Best Practices GuideAlfresco Security Best Practices Guide
Alfresco Security Best Practices Guide
 
Alfresco DevCon 2019 - Alfresco Identity Services in Action
Alfresco DevCon 2019 - Alfresco Identity Services in ActionAlfresco DevCon 2019 - Alfresco Identity Services in Action
Alfresco DevCon 2019 - Alfresco Identity Services in Action
 
Jose portillo dev con presentation 1138
Jose portillo   dev con presentation 1138Jose portillo   dev con presentation 1138
Jose portillo dev con presentation 1138
 
No Docker? No Problem: Automating installation and config with Ansible
No Docker? No Problem: Automating installation and config with AnsibleNo Docker? No Problem: Automating installation and config with Ansible
No Docker? No Problem: Automating installation and config with Ansible
 
Alfresco tuning part1
Alfresco tuning part1Alfresco tuning part1
Alfresco tuning part1
 
How to migrate from Alfresco Search Services to Alfresco SearchEnterprise
How to migrate from Alfresco Search Services to Alfresco SearchEnterpriseHow to migrate from Alfresco Search Services to Alfresco SearchEnterprise
How to migrate from Alfresco Search Services to Alfresco SearchEnterprise
 
Alfresco Share - Recycle Bin Ideas
Alfresco Share - Recycle Bin IdeasAlfresco Share - Recycle Bin Ideas
Alfresco Share - Recycle Bin Ideas
 
Replacing Your Shared Drive with Alfresco - Open Source ECM
Replacing Your Shared Drive with Alfresco - Open Source ECMReplacing Your Shared Drive with Alfresco - Open Source ECM
Replacing Your Shared Drive with Alfresco - Open Source ECM
 
Guide to alfresco monitoring
Guide to alfresco monitoringGuide to alfresco monitoring
Guide to alfresco monitoring
 
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 20190-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
 

Similar to Moving Gigantic Files Into and Out of the Alfresco Repository

Nov 2014 webinar Making The Transition From Ftp
Nov 2014 webinar Making The Transition From FtpNov 2014 webinar Making The Transition From Ftp
Nov 2014 webinar Making The Transition From Ftp
FileCatalyst
 
Open stack summit-2015-dp
Open stack summit-2015-dpOpen stack summit-2015-dp
Open stack summit-2015-dp
Dirk Petersen
 

Similar to Moving Gigantic Files Into and Out of the Alfresco Repository (20)

Kubernetes - Hosted OSG Services
Kubernetes - Hosted OSG ServicesKubernetes - Hosted OSG Services
Kubernetes - Hosted OSG Services
 
Partner spotlight: Telestream
Partner spotlight: TelestreamPartner spotlight: Telestream
Partner spotlight: Telestream
 
Serverless design with Fn project
Serverless design with Fn projectServerless design with Fn project
Serverless design with Fn project
 
Nov 2014 webinar Making The Transition From Ftp
Nov 2014 webinar Making The Transition From FtpNov 2014 webinar Making The Transition From Ftp
Nov 2014 webinar Making The Transition From Ftp
 
Sochi games wrap-up
Sochi games wrap-upSochi games wrap-up
Sochi games wrap-up
 
OSGi for real in the enterprise: Apache Karaf - NLJUG J-FALL 2010
OSGi for real in the enterprise: Apache Karaf - NLJUG J-FALL 2010OSGi for real in the enterprise: Apache Karaf - NLJUG J-FALL 2010
OSGi for real in the enterprise: Apache Karaf - NLJUG J-FALL 2010
 
Partner webinar featuring CatDV
Partner webinar featuring CatDVPartner webinar featuring CatDV
Partner webinar featuring CatDV
 
Spotlight on the petroleum and energy vertical
Spotlight on the petroleum and energy vertical Spotlight on the petroleum and energy vertical
Spotlight on the petroleum and energy vertical
 
Building Cloud Native Software
Building Cloud Native SoftwareBuilding Cloud Native Software
Building Cloud Native Software
 
Swift Buildpack for Cloud Foundry
Swift Buildpack for Cloud FoundrySwift Buildpack for Cloud Foundry
Swift Buildpack for Cloud Foundry
 
Galera webinar migration to galera cluster from my sql async replication
Galera webinar migration to galera cluster from my sql async replicationGalera webinar migration to galera cluster from my sql async replication
Galera webinar migration to galera cluster from my sql async replication
 
Three years of OFELIA - taking stock
Three years of OFELIA - taking stockThree years of OFELIA - taking stock
Three years of OFELIA - taking stock
 
Alfresco Coding mit dem Alfresco SDK (auf Englisch) - Julien Bruinaud, Techni...
Alfresco Coding mit dem Alfresco SDK (auf Englisch) - Julien Bruinaud, Techni...Alfresco Coding mit dem Alfresco SDK (auf Englisch) - Julien Bruinaud, Techni...
Alfresco Coding mit dem Alfresco SDK (auf Englisch) - Julien Bruinaud, Techni...
 
Partner spotlight: Empress
Partner spotlight: EmpressPartner spotlight: Empress
Partner spotlight: Empress
 
Tackling Terraform at Ticketmaster
Tackling Terraform at TicketmasterTackling Terraform at Ticketmaster
Tackling Terraform at Ticketmaster
 
Introduction to Globus: Research Data Management Software at the ALCF
Introduction to Globus: Research Data Management Software at the ALCFIntroduction to Globus: Research Data Management Software at the ALCF
Introduction to Globus: Research Data Management Software at the ALCF
 
Questions and answers
Questions and answersQuestions and answers
Questions and answers
 
An Introduction to FileCatalyst
An Introduction to FileCatalystAn Introduction to FileCatalyst
An Introduction to FileCatalyst
 
Open stack summit-2015-dp
Open stack summit-2015-dpOpen stack summit-2015-dp
Open stack summit-2015-dp
 
Jive, dropbox and other integrations
Jive, dropbox and other integrationsJive, dropbox and other integrations
Jive, dropbox and other integrations
 

More from Jeff Potts

Alfresco Community Survey 2012 Results
Alfresco Community Survey 2012 ResultsAlfresco Community Survey 2012 Results
Alfresco Community Survey 2012 Results
Jeff Potts
 
Alfresco SAUG: CMIS & Integrations
Alfresco SAUG: CMIS & IntegrationsAlfresco SAUG: CMIS & Integrations
Alfresco SAUG: CMIS & Integrations
Jeff Potts
 

More from Jeff Potts (20)

Flexible Permissions Management with ACL Templates
Flexible Permissions Management with ACL TemplatesFlexible Permissions Management with ACL Templates
Flexible Permissions Management with ACL Templates
 
Could Alfresco Survive a Zombie Attack?
Could Alfresco Survive a Zombie Attack?Could Alfresco Survive a Zombie Attack?
Could Alfresco Survive a Zombie Attack?
 
Connecting Content Management Apps with CMIS
Connecting Content Management Apps with CMISConnecting Content Management Apps with CMIS
Connecting Content Management Apps with CMIS
 
The Challenges of Keeping Bees
The Challenges of Keeping BeesThe Challenges of Keeping Bees
The Challenges of Keeping Bees
 
Getting Started With CMIS
Getting Started With CMISGetting Started With CMIS
Getting Started With CMIS
 
Alfresco: What every developer should know
Alfresco: What every developer should knowAlfresco: What every developer should know
Alfresco: What every developer should know
 
CMIS: An Open API for Managing Content
CMIS: An Open API for Managing ContentCMIS: An Open API for Managing Content
CMIS: An Open API for Managing Content
 
Apache Chemistry in Action: Using CMIS and your favorite language to unlock c...
Apache Chemistry in Action: Using CMIS and your favorite language to unlock c...Apache Chemistry in Action: Using CMIS and your favorite language to unlock c...
Apache Chemistry in Action: Using CMIS and your favorite language to unlock c...
 
Alfresco: The Story of How Open Source Disrupted the ECM Market
Alfresco: The Story of How Open Source Disrupted the ECM MarketAlfresco: The Story of How Open Source Disrupted the ECM Market
Alfresco: The Story of How Open Source Disrupted the ECM Market
 
Join the Alfresco community
Join the Alfresco communityJoin the Alfresco community
Join the Alfresco community
 
Intro to the Alfresco Public API
Intro to the Alfresco Public APIIntro to the Alfresco Public API
Intro to the Alfresco Public API
 
Apache Chemistry in Action
Apache Chemistry in ActionApache Chemistry in Action
Apache Chemistry in Action
 
Building Content-Rich Java Apps in the Cloud with the Alfresco API
Building Content-Rich Java Apps in the Cloud with the Alfresco APIBuilding Content-Rich Java Apps in the Cloud with the Alfresco API
Building Content-Rich Java Apps in the Cloud with the Alfresco API
 
Alfresco Community Survey 2012 Results
Alfresco Community Survey 2012 ResultsAlfresco Community Survey 2012 Results
Alfresco Community Survey 2012 Results
 
Getting Started with CMIS
Getting Started with CMISGetting Started with CMIS
Getting Started with CMIS
 
Relational Won't Cut It: Architecting Content Centric Apps
Relational Won't Cut It: Architecting Content Centric AppsRelational Won't Cut It: Architecting Content Centric Apps
Relational Won't Cut It: Architecting Content Centric Apps
 
Alfresco SAUG: State of ECM
Alfresco SAUG: State of ECMAlfresco SAUG: State of ECM
Alfresco SAUG: State of ECM
 
Alfresco SAUG: CMIS & Integrations
Alfresco SAUG: CMIS & IntegrationsAlfresco SAUG: CMIS & Integrations
Alfresco SAUG: CMIS & Integrations
 
Should You Attend Alfresco Devcon 2011
Should You Attend Alfresco Devcon 2011Should You Attend Alfresco Devcon 2011
Should You Attend Alfresco Devcon 2011
 
2011 Alfresco Community Survey Results
2011 Alfresco Community Survey Results2011 Alfresco Community Survey Results
2011 Alfresco Community Survey Results
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 

Moving Gigantic Files Into and Out of the Alfresco Repository

  • 1. Moving Gigantic Files In and Out of the Repository Jeff Potts Metaversant Group, Inc.
  • 2. Learn. Connect. Collaborate. What’s the Deal with Large Files? • Alfresco can manage files of any size, but getting large files into and out of the repo is often problematic • They take way too long to transfer – Sessions timeout – Machines go to sleep – Incomplete files get transferred – Users think, “Is this thing hung?” and then cancel • End-users must actively monitor transfers in most cases
  • 3. This talk is a technical case study about an approach to significantly improving large file transfers
  • 4. Learn. Connect. Collaborate. About Noble Research Institute • Research organization focused on improving agriculture for all mankind – Research – Producer Relations – Applied agricultural systems and stewardship – Education • About 400 employees from all over the world • Headquartered in Ardmore, Oklahoma • https://www.noble.org
  • 5. Learn. Connect. Collaborate. • Consulting firm focused on solving business problems with open source Content Management, Workflow, & Search technology • Founded in 2010 • Clients all over the world in a variety of industries, including: – Airlines – Manufacturing – Construction – Financial Services – Higher Education – Life Sciences – Professional Services https://www.metaversant.com
  • 6. Learn. Connect. Collaborate. The Problem • Researchers work with very large files • Typical size ranges from a few GB to hundreds of GB • Source of the files is mixed – Generate internally (e.g., gene sequencing machines) – Acquire data sets from other research institutions • Data governance team wants everything in Alfresco • Large size makes moving files in and out of Alfresco difficult
  • 7. Learn. Connect. Collaborate. What We Tried • Desktop Sync • CMIS update content stream – Versions are created, somewhat painful to disable auto-versioning • Increasing timeouts – Losing battle when files are multiple gigabytes • Using Alfresco FTP – Usually requires thick client installed – Not preferred by end-users • Resumable upload Share customization – Actually worked pretty well – Only handles uploads, not downloads
  • 8. Learn. Connect. Collaborate. Sidebar: Resumable Upload Details • Share customization (closed source) • Leverages resumable.js, see http://www.resumablejs.com/ • Utilizes the HTML5 File API • If an upload stalls or ends prematurely, the end-user can restart where it left off
  • 9. Learn. Connect. Collaborate. Inescapable math related to moving large files • How long does it take to move 25 GB of data? – Ethernet = 10 Mbit/s = 333.33 minutes – Fast Ethernet = 100 Mbit/s = 33.33 minutes – Gigabit Ethernet = 1 Gbit/s = 3.33 minutes – 10 Gigabit Ethernet = 10 Gbit/s = 0.33 minutes – 100 Gigabit Ethernet = 100 Gbit/s = 0.03 minutes • Assumes full bandwidth is available • Network only, does not account for disk or other non-network latencies on either end
  • 10. It’s not the actual import/export that’s killing us, it’s the movement of so many bytes over the network
  • 11. Learn. Connect. Collaborate. Technologies That Move Large Files • BitTorrent – Looked at BitTorrent Sync which became Resilio Sync – Performance increases when multiple people have the same file – Primarily peer-to-peer with an emphasis on desktop-to-desktop or between devices • GridFTP – Extends FTP to add parallelism – Multiple implementations, including at least one that is commercially supported – Works between servers, desktop-to-server, and between devices
  • 12. Learn. Connect. Collaborate. GridFTP was created to move large files to clusters • Extension of FTP • Defined by the Open Grid Foundation (http://www.ogf.org) • Designed specifically to facilitate transfers of large files and large sets of files • Uses multiple parallel streams to move data over TCP • One of several ways that a product called Globus uses to move data between end points • More information at http://toolkit.globus.org/toolkit/docs/6.0/gridftp/
  • 13. Learn. Connect. Collaborate. Globus provides data migration tools to researchers • Non-profit business within the University of Chicago • Focused on providing low-cost tools to researchers doing data-intensive research • Globus is SaaS that acts as a middleman to coordinate transfers of data between endpoints • Publishes a list of public endpoints • Provides API and services such as authentication • Sync between two endpoints typically uses GridFTP protocol • It is possible to use GridFTP without leveraging Globus – See http://toolkit.globus.org/toolkit/docs/latest-stable/admin/install/
  • 14. Globus/GridFTP helps move bytes over the network. Alfresco BFSIT does fast imports once the files are on the server
  • 15. High-Level Approach: Two-Step Import. First step: Globus Connect Personal to Globus endpoint (diagram: shared mount)
  • 16. High-Level Approach: Two-Step Import. Second step: Alfresco Bulk File System Import (diagram: shared mount)
  • 17. High-Level Approach: Two-Step Export. First step: Write file(s) to file system (diagram: shared mount)
  • 18. High-Level Approach: Two-Step Export. Second step: Globus endpoint to Globus Connect Personal (diagram: shared mount)
  • 19. With the high-level approach determined, it was time to work on the details
  • 20. Where to Put the UI?
    • Considered Share
      – But researchers were already looking for a more streamlined interface
    • Considered ADF
      – But it was too new at the time
      – Wasn’t the right fit for this particular application
    • Decided on a custom Spring Boot application
      – Needed an app anyway
      – Could bring in ADF later if desired
  • 21. Custom Globus Alfresco Transfers application
    Simple Scope
    • Start transfer jobs
    • See the status of transfer jobs
    • Publishes and subscribes to queues used to coordinate multi-step transfers
    • Authentication
      – Authenticates against Alfresco
      – Accounts linked to Globus via OAuth
    Built With
    • Spring Boot
    • Angular 4
    • Bootstrap 3
    • Apache ActiveMQ
    • Apache Maven
  • 22. Solution Components
    • Alfresco Enterprise Edition, clustered
    • Globus Server Endpoint
    • Both point to the same shared mount
  • 23. Solution Components
    • Globus SaaS communicates with
      – Globus Server Endpoint
      – Each individual’s Globus Connect Personal
    • Globus SaaS provides a REST API
  • 24. Solution Components
    • Spring Boot application used to create transfer jobs
    • Coordinates the transfers
    • Persists transfer job and user objects to PostgreSQL
  • 25. Solution Components
    • Everything is asynchronous
    • Apache ActiveMQ acts as the message broker and persists queues
  • 26. Queues and Listeners
    • Alfresco Import Listener: Given a file path, imports it into a specified node ref using BFSIT
    • Alfresco Export Listener: Given a node ref, exports it to a specified file path
    • Globus Inbound Transfer Listener: Given an endpoint ID and a path, transfers it to the Noble endpoint
    • Globus Outbound Transfer Listener: Given a path on the Noble endpoint, transfers it to a specified path on an endpoint
    • Transfer Status Listener: Persists status changes; kicks off the next step
    • Diagram labels for where the listeners run: AMP; Globus Alfresco Transfers Spring Boot App
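The division of labor among these listeners can be sketched with an in-memory queue. This is a minimal illustration, not the application's code: the real system uses Apache ActiveMQ and Spring Boot, and every name below (`STEPS`, `run_job`, the step and status strings) is hypothetical. Worker listeners perform one step and report that it is done; the status listener records the change and queues the next step.

```python
from collections import deque

# Hypothetical step names modeled on the listener names on the slide.
STEPS = {
    "import": ["globus_inbound_transfer", "alfresco_import"],
    "export": ["alfresco_export", "globus_outbound_transfer"],
}


def run_job(direction, handlers):
    """Drive one transfer job through its steps and return its status history."""
    steps = STEPS[direction]
    history = ["SAVED"]  # the app saves the transfer job first
    queue = deque([("do", steps[0])])  # then queues the first step
    while queue:
        kind, step = queue.popleft()
        if kind == "do":
            # A worker listener sees the message and performs the step.
            handlers[step]()
            queue.append(("done", step))
        else:
            # The status listener persists the change and kicks off the next step.
            history.append(step + ":DONE")
            position = steps.index(step)
            if position + 1 < len(steps):
                queue.append(("do", steps[position + 1]))
            else:
                history.append("COMPLETE")
    return history


if __name__ == "__main__":
    # Stand-in handlers that do nothing; real ones would call Globus or BFSIT.
    noop = {name: (lambda: None) for names in STEPS.values() for name in names}
    print(run_job("import", noop))
    print(run_job("export", noop))
```

Because each step communicates only through queue messages, a long-running transfer never holds a user session open, which is what makes the jobs asynchronous.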
  • 34. Transfer to Alfresco (1)
    1. Save transfer job
    2. Put message on a queue: “Do Globus transfer”
  • 35. Transfer to Alfresco (2)
    1. See message: “Do Globus transfer”
    2. Start transfer
    3. Perform the transfer
    4. Put message on the queue: “Globus transfer done”
  • 36. Transfer to Alfresco (3)
    1. See message: “Globus transfer done”
    2. Update status
    3. Queue message: “Do Alfresco transfer”
  • 37. Transfer to Alfresco (4)
    1. See message: “Do Alfresco import”
    2. BFSIT import
    3. Queue message: “Alfresco import done”
    4. See message: “Alfresco import done”
    5. Update status
  • 45. Transfer from Alfresco (1)
    1. Save transfer job
    2. Put message on a queue: “Do Alfresco export”
  • 46. Transfer from Alfresco (2)
    1. See message: “Do Alfresco export”
    2. Custom export
    3. Queue message: “Alfresco export done”
  • 47. Transfer from Alfresco (3)
    1. See message: “Alfresco export done”
    2. Update status
    3. Queue message: “Do Globus transfer”
  • 48. Transfer from Alfresco (4)
    1. See message: “Do Globus transfer”
    2. Initiate transfer
    3. Do transfer
    4. Queue message: “Globus transfer done”
    5. See message: “Globus transfer done”
    6. Set status
  • 49. How did we do?
  • 50. Metrics: Multi-file* Upload/Download

    | Method | Upload time | Upload rate | Download time | Download rate |
    |---|---|---|---|---|
    | Out-of-the-box | 5 minutes | 612 MB/min | 6.4 minutes | 476.6 MB/min |
    | Globus Alfresco Transfers | 2 minutes | 1,530 MB/min | 3 minutes | 1,020 MB/min |
    | Improvement | 60% faster | 150% more throughput | 53% faster | 114% more throughput |

    *Four files totaling 3,060 MB
  • 51. Metrics: Single-file* Upload/Download

    | Method | Upload time | Upload rate | Download time | Download rate |
    |---|---|---|---|---|
    | Out-of-the-box | 7.2 minutes | 616.2 MB/min | DNF** | DNF** |
    | Globus Alfresco Transfers | 3.6 minutes | 1,220.4 MB/min | 5.1 minutes | 862.9 MB/min |
    | Improvement | 50% faster | 98% more throughput | Infinitely faster | Infinitely greater throughput |

    *Single file of size 4,418 MB
    **Did not finish: Alfresco throws an exception at around 1 GB
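The rates and improvement percentages in the two metrics slides follow directly from file size and elapsed time. The sketch below recomputes them using the unrounded times given in the editor's notes (for example, 6.42 and 3 minutes for the multi-file download); the function names are illustrative.

```python
def rate_mb_per_min(size_mb: float, minutes: float) -> float:
    """Throughput in MB per minute."""
    return size_mb / minutes


def percent_faster(baseline_min: float, new_min: float) -> float:
    """Reduction in elapsed time relative to the baseline, in percent."""
    return (baseline_min - new_min) / baseline_min * 100


def percent_more_throughput(baseline_min: float, new_min: float) -> float:
    """Increase in throughput for the same payload, in percent."""
    return (baseline_min / new_min - 1) * 100


if __name__ == "__main__":
    # (label, size in MB, out-of-the-box minutes, Globus Alfresco Transfers minutes)
    tests = [
        ("multi-file upload", 3060, 5.0, 2.0),
        ("multi-file download", 3060, 6.42, 3.0),
        ("single-file upload", 4418, 7.17, 3.62),
    ]
    for label, size_mb, baseline, new in tests:
        print(
            f"{label}: {rate_mb_per_min(size_mb, new):.1f} MB/min, "
            f"{percent_faster(baseline, new):.0f}% faster, "
            f"{percent_more_throughput(baseline, new):.0f}% more throughput"
        )
```

The single-file download has no baseline to compare against, because the out-of-the-box attempt did not finish.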
  • 52. Results
    • Transfers can now be done as “fire-and-forget” jobs
    • Any number of files, any size
    • Streamlined, purpose-built UI keeps researchers focused
    • Integrates with existing sync technology that researchers like
    • Reduced transfer time by 50–60%
    • Increased transfer rate by roughly 100–150%
  • 53. Futures
    • Improve download by doing a move from the content store rather than a write
    • Send files to/from any Globus endpoint, including external ones
      – Currently the transfer source/target is Globus Connect Personal on Noble workstations
    • Security hardening
    • Set metadata on multiple files during import
    • Auditing/usage reports
    • Possible new requirements
      – Scheduled/recurring transfers
      – Share integration
      – ADF integration

Editor's Notes

  1. Learn more about Noble Research Institute at https://www.noble.org
  2. Learn more at https://www.metaversant.com
  3. App saves transfer job. Places a message on the queue.
  4. App saves transfer job. Places a message on the queue.
  5. Update status. Put message on Alfresco Import queue.
  6. Alfresco sees message. Initiates a Bulk File System Import. Places a message on the queue to update status. App sees message. Updates status to “Complete”.
  7. App saves transfer job. Places a message on the queue.
  8. Multi-file upload test (4 files, totaling 3,060 MB): Globus Alfresco Transfers (GAT) uploaded the files in 2 minutes versus 5 minutes out-of-the-box (60% improvement). Multi-file download test (4 files, totaling 3,060 MB): GAT downloaded the files in 3 minutes versus 6.42 minutes out-of-the-box (53% improvement).
  9. Single-file upload test (1 file, 4,418 MB): GAT uploaded the file in 3.62 minutes versus 7.17 minutes out-of-the-box (50% improvement). Single-file download test (1 file, 4,418 MB): GAT downloaded the file in 5.12 minutes versus multiple unsuccessful attempts out-of-the-box.