SlideShare a Scribd company logo
1 of 30
1 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
What does rename() do?
Steve Loughran
stevel@hortonworks.com
@steveloughran
June 2017
2 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
How do we safely persist
& recover state?
3 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Why?
⬢ Save state for when application is restarted
⬢ Publish data for other applications
⬢ Process data published by other applications
⬢ Work with more data than fits into RAM
⬢ Share data with other instances of same application
⬢ Save things people care about & want to get back
4 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Define "Storage"?
FAT8
dBASE II & Lotus 1-2-3
int 21h
Linux: ext3, reiserfs, ext4
sqlite, mysql, leveldb
open(path, O_CREAT|O_EXCL)
rename(src, dest)
Windows NT, XP
NTFS
Access, Excel
CreateFile(path, CREATE_NEW,...)
MoveFileEx(src, dest, MOVEFILE_WRITE_THROUGH)
Facebook Prineville Datacentre
1+ Exabyte on HDFS + cold store
Hive, Spark, ...
FileSystem.rename()
8 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Model and APIs
9 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Structured data: algebra
10 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
11 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
12 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
13 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
File-System
Directories and files
Posix with stream metaphor
14 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
org.apache.hadoop.fs.FileSystem
hdfs s3awasb adlswift gcs
Hadoop offers Posix API to remote cluster filesystems & storage
val work = new Path("s3a://stevel-frankfurt/work")
val fs = work.getFileSystem(new Configuration())
val task00 = new Path(work, "task00")
fs.mkdirs(task00)
val out = fs.create(new Path(task00, "part-00"), false)
out.writeChars("hello")
out.close();
fs.listStatus(task00).foreach(stat =>
fs.rename(stat.getPath, work)
)
val statuses = fs.listStatus(work).filter(_.isFile)
require("part-00" == statuses(0).getPath.getName)
16 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
rename() gives us O(1) atomic task commits
/
work
_temp
part-00 part-01
00
00
00
01
01
01
part-01
rename("/work/_temp/task00/*", "/work")
task-00 task-01
HDFS Namenode Datanode-01
Datanode-03
Datanode-02
Datanode-04
17 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Amazon S3 doesn't have a rename()
/
work
_temp
part-00 part-01
00
00
00
01
01
part-01
LIST /work/_temp/task-01/*
task-00 task-01
01
01
01
COPY /work/_temp/task-01/part-01 /work/part-01
DELETE /work/_temp/task-01/part-01
01
S3 Shards
18 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
part-01
01
01
01
Fix: fundamentally rethink how we commit
/
work
00
00
00
POST /work/part-01?uploads => UploadID
POST /work/part01?uploadId=UploadID&partNumber=01
POST /work/part01?uploadId=UploadID&partNumber=02
POST /work/part01?uploadId=UploadID&partNumber=03
S3 Shards
19 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
job manager selectively completes tasks' multipart uploads
/
work
part-00
00
00
00
part-01
(somehow list pending uploads of task 01)
01
01
01
POST /work/part-01?uploadId=UploadID
<CompleteMultipartUpload>
<Part>
<PartNumber>01</PartNumber><ETag>44a3</ETag>
<PartNumber>02</PartNumber><ETag>29cb</ETag>
<PartNumber>03</PartNumber><ETag>1aac</ETag>
</Part>
</CompleteMultipartUpload>
part-01
01
01
01
S3 Shards
20 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
S3A O(1) zero-rename commit demo!
21 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
What else to rethink?
⬢ Hierarchical directories to tree-walk
==> list & work with all files under a prefix;
⬢ seek() read() sequences
==> HTTP-2 friendly scatter/gather IO
read((buffer1, 10 KB, 200 KB), (buffer2, 16 MB, 4 MB))
⬢ How to work with Eventually Consistent data?
⬢ or: is everything just a K-V store with some search mechanisms?
22 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Model #3: Storage as Memory
typedef struct record_struct {
int field1, field2;
long next;
} record;
int fd = open("/shared/dbase", O_CREAT | O_EXCL);
record* data = (record*) mmap(NULL, 8192,
PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
(*data).field1 += 5;
data->field2 = data->field1;
msync(record, sizeof(record), MS_SYNC | MS_INVALIDATE);
SSD via SATA
SSD via NVMe/M.2
Future NVM technologies
25 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Non Volatile Memory
⬢ SSD-backed RAM
⬢ near-RAM-speed SSD
⬢ Future memory stores
⬢ RDMA access to NVM on other servers
What would a datacentre of NVM & RDMA access do?
typedef struct record_struct {
int field1, field2;
record_struct* next;
} record;
int fd = open("/shared/dbase");
record* data = (record*) pmem_map(fd);
// lock ?
(*data).field1 += 5;
data->field2 = data->field1;
// commit ?
27 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
NVM moves the commit problem into memory I/O
⬢ How to split internal state into persistent and transient?
⬢ When is data saved to NVM ($L1-$L3 cache flushed, sync in memory
buffers, ...)
⬢ How to co-ordinate shared R/W access over RDMA?
⬢ How do we write apps for a world where rebooting doesn't reset our state?
Catch up: read "The Morning Paper" summaries of research
28 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Storage is moving up in scale and/or closer to RAM
⬢ Storage is moving up in scale and/or closer to RAM
⬢ Blobstore APIs address some scale issues, but don't match app expectations
for file/dir behaviour; inefficient read/write model
⬢ Non volatile memory is the other radical change
⬢ Posix metaphor/API isn't suited to either —what next?
⬢ SQL makes all this someone else's problem
(leaving only O/R mapping, transaction isolation...)
29 © Hortonworks Inc. 2011 – 2017 All Rights Reserved29 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Questions?
30 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Backup Slides

More Related Content

What's hot

What's hot (20)

Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High Performance
 
ORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, SmallerORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, Smaller
 
ORC: 2015 Faster, Better, Smaller
ORC: 2015 Faster, Better, SmallerORC: 2015 Faster, Better, Smaller
ORC: 2015 Faster, Better, Smaller
 
Apache Hive ACID Project
Apache Hive ACID ProjectApache Hive ACID Project
Apache Hive ACID Project
 
Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
 
HDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFSHDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFS
 
Accelerate Spark Workloads on S3
Accelerate Spark Workloads on S3Accelerate Spark Workloads on S3
Accelerate Spark Workloads on S3
 
ORC 2015
ORC 2015ORC 2015
ORC 2015
 
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast AnalyticsGetting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
 
Major advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL complianceMajor advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL compliance
 
Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5
 
Burst Presto & Spark workloads to AWS EMR with no data copies
Burst Presto & Spark workloads to AWS EMR with no data copiesBurst Presto & Spark workloads to AWS EMR with no data copies
Burst Presto & Spark workloads to AWS EMR with no data copies
 
HDP-1 introduction for HUG France
HDP-1 introduction for HUG FranceHDP-1 introduction for HUG France
HDP-1 introduction for HUG France
 
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
 
Speeding Up Spark Performance using Alluxio at China Unicom
Speeding Up Spark Performance using Alluxio at China UnicomSpeeding Up Spark Performance using Alluxio at China Unicom
Speeding Up Spark Performance using Alluxio at China Unicom
 

Similar to What does rename() do?

7. emc isilon hdfs enterprise storage for hadoop
7. emc isilon hdfs   enterprise storage for hadoop7. emc isilon hdfs   enterprise storage for hadoop
7. emc isilon hdfs enterprise storage for hadoop
Taldor Group
 
Datastage parallell jobs vs datastage server jobs
Datastage parallell jobs vs datastage server jobsDatastage parallell jobs vs datastage server jobs
Datastage parallell jobs vs datastage server jobs
shanker_uma
 

Similar to What does rename() do? (20)

What does Rename Do: (detailed version)
What does Rename Do: (detailed version)What does Rename Do: (detailed version)
What does Rename Do: (detailed version)
 
The age of rename() is over
The age of rename() is overThe age of rename() is over
The age of rename() is over
 
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
 
Improving Hadoop Resiliency and Operational Efficiency with EMC Isilon
Improving Hadoop Resiliency and Operational Efficiency with EMC IsilonImproving Hadoop Resiliency and Operational Efficiency with EMC Isilon
Improving Hadoop Resiliency and Operational Efficiency with EMC Isilon
 
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive WarehouseDisaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse
 
IRJET- Generate Distributed Metadata using Blockchain Technology within HDFS ...
IRJET- Generate Distributed Metadata using Blockchain Technology within HDFS ...IRJET- Generate Distributed Metadata using Blockchain Technology within HDFS ...
IRJET- Generate Distributed Metadata using Blockchain Technology within HDFS ...
 
7. emc isilon hdfs enterprise storage for hadoop
7. emc isilon hdfs   enterprise storage for hadoop7. emc isilon hdfs   enterprise storage for hadoop
7. emc isilon hdfs enterprise storage for hadoop
 
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive WarehouseDisaster Recovery and Cloud Migration for your Apache Hive Warehouse
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse
 
Introduction to hadoop ecosystem
Introduction to hadoop ecosystem Introduction to hadoop ecosystem
Introduction to hadoop ecosystem
 
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
 
FOD Paris Meetup - Global Data Management with DataPlane Services (DPS)
FOD Paris Meetup -  Global Data Management with DataPlane Services (DPS)FOD Paris Meetup -  Global Data Management with DataPlane Services (DPS)
FOD Paris Meetup - Global Data Management with DataPlane Services (DPS)
 
Taming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementTaming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop Management
 
Nagios Conference 2013 - James Clark - Nagios On-Call Rotation
Nagios Conference 2013 - James Clark - Nagios On-Call RotationNagios Conference 2013 - James Clark - Nagios On-Call Rotation
Nagios Conference 2013 - James Clark - Nagios On-Call Rotation
 
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
 
Sf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBaseSf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBase
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop Summit
 
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
 
Datastage parallell jobs vs datastage server jobs
Datastage parallell jobs vs datastage server jobsDatastage parallell jobs vs datastage server jobs
Datastage parallell jobs vs datastage server jobs
 
Migration from 8.1 to 11.3
Migration from 8.1 to 11.3Migration from 8.1 to 11.3
Migration from 8.1 to 11.3
 
Fosdem17 honeypot your database server
Fosdem17 honeypot your database serverFosdem17 honeypot your database server
Fosdem17 honeypot your database server
 

More from Steve Loughran

2013 11-19-hoya-status
2013 11-19-hoya-status2013 11-19-hoya-status
2013 11-19-hoya-status
Steve Loughran
 

More from Steve Loughran (20)

Hadoop Vectored IO
Hadoop Vectored IOHadoop Vectored IO
Hadoop Vectored IO
 
Put is the new rename: San Jose Summit Edition
Put is the new rename: San Jose Summit EditionPut is the new rename: San Jose Summit Edition
Put is the new rename: San Jose Summit Edition
 
@Dissidentbot: dissent will be automated!
@Dissidentbot: dissent will be automated!@Dissidentbot: dissent will be automated!
@Dissidentbot: dissent will be automated!
 
Extreme Programming Deployed
Extreme Programming DeployedExtreme Programming Deployed
Extreme Programming Deployed
 
Testing
TestingTesting
Testing
 
I hate mocking
I hate mockingI hate mocking
I hate mocking
 
Hadoop, Hive, Spark and Object Stores
Hadoop, Hive, Spark and Object StoresHadoop, Hive, Spark and Object Stores
Hadoop, Hive, Spark and Object Stores
 
Household INFOSEC in a Post-Sony Era
Household INFOSEC in a Post-Sony EraHousehold INFOSEC in a Post-Sony Era
Household INFOSEC in a Post-Sony Era
 
Hadoop and Kerberos: the Madness Beyond the Gate: January 2016 edition
Hadoop and Kerberos: the Madness Beyond the Gate: January 2016 editionHadoop and Kerberos: the Madness Beyond the Gate: January 2016 edition
Hadoop and Kerberos: the Madness Beyond the Gate: January 2016 edition
 
Hadoop and Kerberos: the Madness Beyond the Gate
Hadoop and Kerberos: the Madness Beyond the GateHadoop and Kerberos: the Madness Beyond the Gate
Hadoop and Kerberos: the Madness Beyond the Gate
 
Slider: Applications on YARN
Slider: Applications on YARNSlider: Applications on YARN
Slider: Applications on YARN
 
YARN Services
YARN ServicesYARN Services
YARN Services
 
Datacentre stack
Datacentre stackDatacentre stack
Datacentre stack
 
Overview of slider project
Overview of slider projectOverview of slider project
Overview of slider project
 
Help! My Hadoop doesn't work!
Help! My Hadoop doesn't work!Help! My Hadoop doesn't work!
Help! My Hadoop doesn't work!
 
2014 01-02-patching-workflow
2014 01-02-patching-workflow2014 01-02-patching-workflow
2014 01-02-patching-workflow
 
2013 11-19-hoya-status
2013 11-19-hoya-status2013 11-19-hoya-status
2013 11-19-hoya-status
 
Hoya for Code Review
Hoya for Code ReviewHoya for Code Review
Hoya for Code Review
 
Hadoop: Beyond MapReduce
Hadoop: Beyond MapReduceHadoop: Beyond MapReduce
Hadoop: Beyond MapReduce
 
HDFS: Hadoop Distributed Filesystem
HDFS: Hadoop Distributed FilesystemHDFS: Hadoop Distributed Filesystem
HDFS: Hadoop Distributed Filesystem
 

Recently uploaded

CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Recently uploaded (20)

W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptxBUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
ManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide DeckManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide Deck
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 

What does rename() do?

  • 1. 1 © Hortonworks Inc. 2011 – 2017 All Rights Reserved What does rename() do? Steve Loughran stevel@hortonworks.com @steveloughran June 2017
  • 2. 2 © Hortonworks Inc. 2011 – 2017 All Rights Reserved How do we safely persist & recover state?
  • 3. 3 © Hortonworks Inc. 2011 – 2017 All Rights Reserved Why? ⬢ Save state for when application is restarted ⬢ Publish data for other applications ⬢ Process data published by other applications ⬢ Work with more data than fits into RAM ⬢ Share data with other instances of same application ⬢ Save things people care about & want to get back
  • 4. 4 © Hortonworks Inc. 2011 – 2017 All Rights Reserved Define "Storage"?
  • 5. FAT8 dBASE II & Lotus 1-2-3 int 21h
  • 6. Linux: ext3, reiserfs, ext4 sqlite, mysql, leveldb open(path, O_CREAT|O_EXCL) rename(src, dest) Windows NT, XP NTFS Access, Excel CreateFile(path, CREATE_NEW,...) MoveFileEx(src, dest, MOVEFILE_WRITE_THROUGH)
  • 7. Facebook Prineville Datacentre 1+ Exabyte on HDFS + cold store Hive, Spark, ... FileSystem.rename()
  • 8. 8 © Hortonworks Inc. 2011 – 2017 All Rights Reserved Model and APIs
  • 9. 9 © Hortonworks Inc. 2011 – 2017 All Rights Reserved Structured data: algebra
  • 10. 10 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
  • 11. 11 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
  • 12. 12 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
  • 13. 13 © Hortonworks Inc. 2011 – 2017 All Rights Reserved File-System Directories and files Posix with stream metaphor
  • 14. 14 © Hortonworks Inc. 2011 – 2017 All Rights Reserved org.apache.hadoop.fs.FileSystem hdfs s3awasb adlswift gcs Hadoop offers Posix API to remote cluster filesystems & storage
  • 15. val work = new Path("s3a://stevel-frankfurt/work") val fs = work.getFileSystem(new Configuration()) val task00 = new Path(work, "task00") fs.mkdirs(task00) val out = fs.create(new Path(task00, "part-00"), false) out.writeChars("hello") out.close(); fs.listStatus(task00).foreach(stat => fs.rename(stat.getPath, work) ) val statuses = fs.listStatus(work).filter(_.isFile) require("part-00" == statuses(0).getPath.getName)
  • 16. 16 © Hortonworks Inc. 2011 – 2017 All Rights Reserved rename() gives us O(1) atomic task commits / work _temp part-00 part-01 00 00 00 01 01 01 part-01 rename("/work/_temp/task00/*", "/work") task-00 task-01 HDFS Namenode Datanode-01 Datanode-03 Datanode-02 Datanode-04
  • 17. 17 © Hortonworks Inc. 2011 – 2017 All Rights Reserved Amazon S3 doesn't have a rename() / work _temp part-00 part-01 00 00 00 01 01 part-01 LIST /work/_temp/task-01/* task-00 task-01 01 01 01 COPY /work/_temp/task-01/part-01 /work/part-01 DELETE /work/_temp/task-01/part-01 01 S3 Shards
  • 18. 18 © Hortonworks Inc. 2011 – 2017 All Rights Reserved part-01 01 01 01 Fix: fundamentally rethink how we commit / work 00 00 00 POST /work/part-01?uploads => UploadID POST /work/part01?uploadId=UploadID&partNumber=01 POST /work/part01?uploadId=UploadID&partNumber=02 POST /work/part01?uploadId=UploadID&partNumber=03 S3 Shards
  • 19. 19 © Hortonworks Inc. 2011 – 2017 All Rights Reserved job manager selectively completes tasks' multipart uploads / work part-00 00 00 00 part-01 (somehow list pending uploads of task 01) 01 01 01 POST /work/part-01?uploadId=UploadID <CompleteMultipartUpload> <Part> <PartNumber>01</PartNumber><ETag>44a3</ETag> <PartNumber>02</PartNumber><ETag>29cb</ETag> <PartNumber>03</PartNumber><ETag>1aac</ETag> </Part> </CompleteMultipartUpload> part-01 01 01 01 S3 Shards
  • 20. 20 © Hortonworks Inc. 2011 – 2017 All Rights Reserved S3A O(1) zero-rename commit demo!
  • 21. 21 © Hortonworks Inc. 2011 – 2017 All Rights Reserved What else to rethink? ⬢ Hierarchical directories to tree-walk ==> list & work with all files under a prefix; ⬢ seek() read() sequences ==> HTTP-2 friendly scatter/gather IO read((buffer1, 10 KB, 200 KB), (buffer2, 16 MB, 4 MB)) ⬢ How to work with Eventually Consistent data? ⬢ or: is everything just a K-V store with some search mechanisms?
  • 22. 22 © Hortonworks Inc. 2011 – 2017 All Rights Reserved Model #3: Storage as Memory
  • 23. typedef struct record_struct { int field1, field2; long next; } record; int fd = open("/shared/dbase", O_CREAT | O_EXCL); record* data = (record*) mmap(NULL, 8192, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); (*data).field1 += 5; data->field2 = data->field1; msync(record, sizeof(record), MS_SYNC | MS_INVALIDATE);
  • 24. SSD via SATA SSD via NVMe/M.2 Future NVM technologies
  • 25. 25 © Hortonworks Inc. 2011 – 2017 All Rights Reserved Non Volatile Memory ⬢ SSD-backed RAM ⬢ near-RAM-speed SSD ⬢ Future memory stores ⬢ RDMA access to NVM on other servers What would a datacentre of NVM & RDMA access do?
  • 26. typedef struct record_struct { int field1, field2; record_struct* next; } record; int fd = open("/shared/dbase"); record* data = (record*) pmem_map(fd); // lock ? (*data).field1 += 5; data->field2 = data->field1; // commit ?
  • 27. 27 © Hortonworks Inc. 2011 – 2017 All Rights Reserved NVM moves the commit problem into memory I/O ⬢ How to split internal state into persistent and transient? ⬢ When is data saved to NVM ($L1-$L3 cache flushed, sync in memory buffers, ...) ⬢ How to co-ordinate shared R/W access over RDMA? ⬢ How do we write apps for a world where rebooting doesn't reset our state? Catch up: read "The Morning Paper" summaries of research
  • 28. 28 © Hortonworks Inc. 2011 – 2017 All Rights Reserved Storage is moving up in scale and/or closer to RAM ⬢ Storage is moving up in scale and/or closer to RAM ⬢ Blobstore APIs address some scale issues, but don't match app expectations for file/dir behaviour; inefficient read/write model ⬢ Non volatile memory is the other radical change ⬢ Posix metaphor/API isn't suited to either —what next? ⬢ SQL makes all this someone else's problem (leaving only O/R mapping, transaction isolation...)
  • 29. 29 © Hortonworks Inc. 2011 – 2017 All Rights Reserved29 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Questions?
  • 30. 30 © Hortonworks Inc. 2011 – 2017 All Rights Reserved Backup Slides

Editor's Notes

  1. Trivia: dir rename only came with DOS 3.0
  2. Facebook prineville photo store is probably at the far end of the spectrum
  3.  Zermelo–Fraenkel set theory (with axiom of choice) and a relational algebra. Or, as it is known: SQL
  4. Relational Set theory
  5. Everything usies the Hadoop APIs to talk to both HDFS, Hadoop Compatible Filesystems and object stores; the Hadoop FS API. There's actually two: the one with a clean split between client side and "driver side", and the older one which is a direct connect. Most use the latter and actually, in terms of opportunities for object store integration tweaking, this is actually the one where can innovate with the most easily. That is: there's nothing in the way. Under the FS API go filesystems and object stores. HDFS is "real" filesystem; WASB/Azure close enough. What is "real?". Best test: can support HBase.
  6. This is how we commit work in Hadoop FileOutputFormat, and so, transitively, how Spark does it too (Hive does some other things, which I'm ignoring, but are more manifest file driven)
  7. This is my rough guess at a C-level operation against mmaped data today. Ignoringthe open/sync stuff, then the writes in the middle are the operations we need to worry about, as they update the datastructures in-situ. Nonatomically.
  8. This is my rough guess at a C-level operation against mmaped data today. Ignoringthe open/sync stuff, then the writes in the middle are the operations we need to worry about, as they update the datastructures in-situ. Nonatomically.
  9. Work like RAMCloud has led the way here, go look at the papers. But that was at RAM, not NVM, where we have the persistence problem Some form of LSF model tends to be used, which ties in well with raw SSD (don't know about other techs, do know that non-raw SSD doesn't really suit LFS) If you look at code, we generally mix persisted (and read back) data with transient; load/save isn't so much seriaiizing out data structures as marshalling the persistent parts to a form we think they will be robust over time, unmarshalling them later (exceptions: Java Serialization, which is notoriously brittle and insecure) What apps work best here? Could you extend Spark to make RDDs persistent & shared, so you can persist them simply by copying to part of cluster memory?, e.g a "NVMOutputFormat"