SlideShare a Scribd company logo
1 of 36
Pre-University College
Masterclass Big Data
Prof.dr.ir. Arjen P. de Vries
arjen@acm.org
Nijmegen, February 20th
, 2017
Overview
 Big Data
- Defining properties?
- The data center as the computer!
 Very brief: map-reduce
 Streaming data!
 Whatever pops up meanwhile
“Big Data”
If your organization stores multiple petabytes of
data, if the information most critical to your
business resides in forms other than rows and
columns of numbers, or if answering your biggest
question would involve a “mashup” of several
analytical efforts, you’ve got a big data
opportunity
http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
Process
 Challenges in Big Data Analytics include
- capturing data,
- aligning data from different sources (e.g., resolving when two
objects are the same),
- transforming the data into a form suitable for analysis,
- modeling it, whether mathematically, or through some form of
simulation,
- understanding the output — visualizing and sharing the results
Attributed to IBM Research’s Laura Haas in
http://www.odbms.org/download/Zicari.pdf
The “Data Scientist”
 Suggested reading:
- Harvard Business Review:
http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
- A 2001 (!) Bell Labs technical report
Data Science: An Action Plan for Expanding the
Technical Areas of the Field of Statistics
http://www.stat.purdue.edu/~wsc/papers/datascience.pdf
- Quora
http://www.quora.com/What-is-it-like-to-be-a-data-scientist
Big Data?
 Big Data refers to datasets whose size is beyond the
ability of typical database software tools to capture, store,
manage and analyze
- McKinsey Global Institute, “Big data: The next frontier for
innovation, competition and productivity.” May 2011.
Big Data?
 Big data is the data that you aren’t able to process and
use quickly enough with the technology you have now
- Buck Woody
http://www.simple-talk.com/sql/database-administration/big-data-is-just-a-fad/
We need to think about data comprehensively – all types
of data.
Big Data
 The 3 Vs (sometimes others are added):
- Volume
We measure more and more; the resulting data is very large
already, and it grows faster and faster
- Velocity
The analysis may take too long for appropriate reaction to
measurement
- Variety
The data comes in many variants, structured and unstructured
Why Big Data?
 We can analyse (and differentiate) to the level of the
individual
 We are less likely to miss rare events, e.g., those that
occur one out of ten million times
 We can better account for the real-time nature of the data
No data like more data!
(Banko and Brill, ACL 2001)
(Brants et al., EMNLP 2007)
s/knowledge/data/g;
How do we get here if we’re not Google?
Exercise
 What examples of big data to analyze can we imagine?
 How much data could that be?
Big?
 20 Terabyte?
- Clueweb 2009
 80 – 120 – 150 Terabyte?
- Recent “web” crawls (IA,
CommonCrawl 2009-2016,
Clueweb 2012)
 10 Petabyte?
- Complete Internet Archive
How much data?
9 PB of user data +
>50 TB/day (11/2011)
processes 20 PB a day (2008)
36 PB of user data +
80-90 TB/day (6/2010)
Wayback Machine: 3 PB +
100 TB/month (3/2009)
LHC: ~15 PB a year
(at full capacity)
LSST: 6-10 PB a year (~2015)
150 PB on 50k+ servers
running 15k apps
S3: 449B objects, peak 290k request/second
(7/2011)
How big is big?
 Facebook (Aug 2012):
- 2.5 billion content items shared per day (status updates + wall
posts + photos + videos + comments)
- 2.7 billion Likes per day
- 300 million photos uploaded per day
Big is very big!
 100+ petabytes of disk space in one of
FB’s largest Hadoop (HDFS) clusters
 105 terabytes of data scanned via Hive, Facebook’s
Hadoop query language, every 30 minutes
 70,000 queries executed on these databases per day
 500+ terabytes of new data ingested into the databases
every day
http://gigaom.com/data/facebook-is-collecting-your-data-500-terabytes-a-day/
Back of the Envelope
 Note:
“105 terabytes of data scanned every 30 minutes”
 A very very fast disk can do 300 MB/s – so, on one disk,
this would take
(105 TB = 110100480 MB) / 300 (MB/s) =
367Ks =~ 6000m
 So at least 200 disks are used in parallel!
 PS: the June 2010 estimate was that facebook ran on 60K servers
Shared-nothing
 A collection of independent, possibly virtual, machines,
each with local disk and local main memory, connected
together on a high-speed network
- Possible trade-off: large number of low-end servers instead of
small number of high-end ones
@UT~1990
@CWI – 2011
Source: Google
Data Center (is the Computer)
Source: NY Times (6/14/2006), http://www.nytimes.com/2006/06/14/technology/14search.html
FB’s Data Centers
 Suggested further reading:
- http://www.datacenterknowledge.com/the-facebook-data-center-faq/
- http://opencompute.org/
- “Open hardware”: server, storage, and data center
- Claim 38% more efficient and 24% less expensive to build and
run than other state-of-the-art data centers
Building Blocks
Source: Barroso, Clidaras and Hölzle (2013): DOI 10.2200/S00516ED2V01Y201306CAC024
Storage Hierarchy
Source: Barroso, Clidaras and Hölzle (2013): DOI 10.2200/S00516ED2V01Y201306CAC024
Numbers Everyone Should Know
L1 cache reference 0.5 ns
Branch mispredict 5 ns
L2 cache reference 7 ns
Mutex lock/unlock 100 ns
Main memory reference 100 ns
Compress 1K bytes with Zippy 10,000 ns
Send 2K bytes over 1 Gbps network 20,000 ns
Read 1 MB sequentially from memory 250,000 ns
Round trip within same datacenter 500,000 ns
Disk seek 10,000,000 ns
Read 1 MB sequentially from network 10,000,000 ns
Read 1 MB sequentially from disk 30,000,000 ns
Send packet CA->Netherlands->CA 150,000,000 ns
According to Jeff Dean
Storage Hierarchy
Source: Barroso, Clidaras and Hölzle (2013): DOI 10.2200/S00516ED2V01Y201306CAC024
Storage Hierarchy
Source: Barroso, Clidaras and Hölzle (2013): DOI 10.2200/S00516ED2V01Y201306CAC024
Quiz Time!!
 Consider a 1 TB database with 100 byte records
- We want to update 1 percent of the records
Plan A:
Seek to the records and make the updates
Plan B:
Write out a new database that includes the updates
Source: Ted Dunning, on Hadoop mailing list
Seeks vs. Scans
 Consider a 1 TB database with 100 byte records
- We want to update 1 percent of the records
 Scenario 1: random access
- Each update takes ~30 ms (seek, read, write)
- 108
updates = ~35 days
 Scenario 2: rewrite all records
- Assume 100 MB/s throughput
- Time = 5.6 hours(!)
 Lesson: avoid random seeks!
In words of Prof. Peter Boncz (CWI & VU):
“Latency is the enemy”
Source: Ted Dunning, on Hadoop mailing list
Parallel Programming is Difficult
 Concurrency is difficult to reason about
- At the scale of datacenter(s)
- In the presence of failures
- In terms of multiple interacting services
 In the dark ages of data center computing…
- Lots of one-off solutions, custom code
- Programmers using their own dedicated libraries
- Burden on the programmer to explicitly manage everything
Observation
 Remember:
0.5ns (L1) vs.
500,000ns (round trip in datacenter)
Δ is 6 orders in magnitude!
 With huge amounts of data (and resources necessary to
process it), we simply cannot expect to ship the data to
the application – the application logic needs to ship to the
data!
Gray’s Laws
How to approach data engineering challenges for large-scale
scientific datasets:
1. Scientific computing is becoming increasingly data intensive
2. The solution is in a “scale-out” architecture
3. Bring computations to the data, rather than data to the
computations
4. Start the design with the “20 queries”
5. Go from “working to working”
See:
http://research.microsoft.com/en-us/collaboration/fourthparadigm/4th_paradigm_book_part1_szalay.pdf
Emerging (Emerged ) Big Data Systems
 Distributed
 Shared-nothing
- None of the resources are logically shared between processes
 Data parallel
- Exactly the same task is performed on different pieces of the
data
A Prototype “Big Data Analysis” Task
 Iterate over a large number of records
 Extract something of interest from each
 Aggregate intermediate results
- Usually, aggregation requires to shuffle and sort the
intermediate results
 Generate final output
Key idea: provide a functional abstraction for these two operations
Map
Reduce
(Dean and Ghemawat, OSDI 2004)
Streaming Big Data
 What if you cannot store all the data coming in?
- E.g., small devices in Internet of Things
 Can you carry out the analysis without making a copy?
 Hands-on session!
- Tutorial: https://rubigdata.github.io/course/puc/
- Code: https://github.com/rubigdata/puc/
 But first things first: Big Food 

More Related Content

What's hot

Microsoft on Big Data
Microsoft on Big DataMicrosoft on Big Data
Microsoft on Big DataYvette Teiken
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodDuncan Hull
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridEvert Lammerts
 
Data Tribology: Overcoming Data Friction with Cloud Automation
Data Tribology: Overcoming Data Friction with Cloud AutomationData Tribology: Overcoming Data Friction with Cloud Automation
Data Tribology: Overcoming Data Friction with Cloud AutomationIan Foster
 
Scaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and JupyterScaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and JupyterIan Foster
 
Büyük Veriyle Büyük Resmi Görmek
Büyük Veriyle Büyük Resmi GörmekBüyük Veriyle Büyük Resmi Görmek
Büyük Veriyle Büyük Resmi Görmekideaport
 
Big Tools for Big Data
Big Tools for Big DataBig Tools for Big Data
Big Tools for Big DataLewis Crawford
 
Introduction of Big data and Hadoop
Introduction of Big data and Hadoop Introduction of Big data and Hadoop
Introduction of Big data and Hadoop Arohi Khandelwal
 
Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data AnalyticsEdureka!
 
Introduction NL-HUG (April)
Introduction NL-HUG (April)Introduction NL-HUG (April)
Introduction NL-HUG (April)Evert Lammerts
 
Big Data - A brief introduction
Big Data - A brief introductionBig Data - A brief introduction
Big Data - A brief introductionFrans van Noort
 
Cloud Experiences
Cloud ExperiencesCloud Experiences
Cloud ExperiencesGuy Coates
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache HadoopChristopher Pezza
 
Big data-denis-rothman
Big data-denis-rothmanBig data-denis-rothman
Big data-denis-rothmanDenis Rothman
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP vinoth kumar
 

What's hot (20)

Big data PPT
Big data PPT Big data PPT
Big data PPT
 
Microsoft on Big Data
Microsoft on Big DataMicrosoft on Big Data
Microsoft on Big Data
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG Grid
 
Data Tribology: Overcoming Data Friction with Cloud Automation
Data Tribology: Overcoming Data Friction with Cloud AutomationData Tribology: Overcoming Data Friction with Cloud Automation
Data Tribology: Overcoming Data Friction with Cloud Automation
 
1. what is hadoop part 1
1. what is hadoop   part 11. what is hadoop   part 1
1. what is hadoop part 1
 
Scaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and JupyterScaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and Jupyter
 
Büyük Veriyle Büyük Resmi Görmek
Büyük Veriyle Büyük Resmi GörmekBüyük Veriyle Büyük Resmi Görmek
Büyük Veriyle Büyük Resmi Görmek
 
Big Tools for Big Data
Big Tools for Big DataBig Tools for Big Data
Big Tools for Big Data
 
Introduction of Big data and Hadoop
Introduction of Big data and Hadoop Introduction of Big data and Hadoop
Introduction of Big data and Hadoop
 
Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data Analytics
 
Introduction NL-HUG (April)
Introduction NL-HUG (April)Introduction NL-HUG (April)
Introduction NL-HUG (April)
 
Big Data Hadoop Tutorial by Easylearning Guru
Big Data Hadoop Tutorial by Easylearning GuruBig Data Hadoop Tutorial by Easylearning Guru
Big Data Hadoop Tutorial by Easylearning Guru
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big Data - A brief introduction
Big Data - A brief introductionBig Data - A brief introduction
Big Data - A brief introduction
 
Cloud Experiences
Cloud ExperiencesCloud Experiences
Cloud Experiences
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
 
Big data-denis-rothman
Big data-denis-rothmanBig data-denis-rothman
Big data-denis-rothman
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP
 
Big Data, Baby Steps
Big Data, Baby StepsBig Data, Baby Steps
Big Data, Baby Steps
 

Viewers also liked

Viewers also liked (13)

Bigdata processing with Spark
Bigdata processing with SparkBigdata processing with Spark
Bigdata processing with Spark
 
TREC 2016: Looking Forward Panel
TREC 2016: Looking Forward PanelTREC 2016: Looking Forward Panel
TREC 2016: Looking Forward Panel
 
Bigdata processing with Spark - part II
Bigdata processing with Spark - part IIBigdata processing with Spark - part II
Bigdata processing with Spark - part II
 
The personal search engine
The personal search engineThe personal search engine
The personal search engine
 
tarig certificates (2)
tarig certificates (2)tarig certificates (2)
tarig certificates (2)
 
Aquí
AquíAquí
Aquí
 
La nube para medicos
La nube para medicosLa nube para medicos
La nube para medicos
 
HRPA Hamilton: Managing Off-duty Conduct
HRPA Hamilton: Managing Off-duty ConductHRPA Hamilton: Managing Off-duty Conduct
HRPA Hamilton: Managing Off-duty Conduct
 
Company case of abou shakra.
Company case of abou shakra.Company case of abou shakra.
Company case of abou shakra.
 
EL TRAFICO DE SERES HUMANOS
EL TRAFICO DE SERES HUMANOSEL TRAFICO DE SERES HUMANOS
EL TRAFICO DE SERES HUMANOS
 
Cancer de mama
Cancer de mamaCancer de mama
Cancer de mama
 
Закріплення таблиць додавання і віднімання у межах 10.
Закріплення таблиць додавання і віднімання у межах 10.Закріплення таблиць додавання і віднімання у межах 10.
Закріплення таблиць додавання і віднімання у межах 10.
 
Легенди про первоцвіти
Легенди про первоцвітиЛегенди про первоцвіти
Легенди про первоцвіти
 

Similar to PUC Masterclass Big Data

Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big DataOmnia Safaan
 
Rpi talk foster september 2011
Rpi talk foster september 2011Rpi talk foster september 2011
Rpi talk foster september 2011Ian Foster
 
INF2190_W1_2016_public
INF2190_W1_2016_publicINF2190_W1_2016_public
INF2190_W1_2016_publicAttila Barta
 
Storage for next-generation sequencing
Storage for next-generation sequencingStorage for next-generation sequencing
Storage for next-generation sequencingGuy Coates
 
Research issues in the big data and its Challenges
Research issues in the big data and its ChallengesResearch issues in the big data and its Challenges
Research issues in the big data and its ChallengesKathirvel Ayyaswamy
 
So Long Computer Overlords
So Long Computer OverlordsSo Long Computer Overlords
So Long Computer OverlordsIan Foster
 
Roots tech 2013 Big Data at Ancestry (3-22-2013) - no animations
Roots tech 2013 Big Data at Ancestry (3-22-2013) - no animationsRoots tech 2013 Big Data at Ancestry (3-22-2013) - no animations
Roots tech 2013 Big Data at Ancestry (3-22-2013) - no animationsWilliam Yetman
 
Big Data By Vijay Bhaskar Semwal
Big Data By Vijay Bhaskar SemwalBig Data By Vijay Bhaskar Semwal
Big Data By Vijay Bhaskar SemwalIIIT Allahabad
 
Big Data & Hadoop Introduction
Big Data & Hadoop IntroductionBig Data & Hadoop Introduction
Big Data & Hadoop IntroductionJayant Mukherjee
 
Seminar presentation
Seminar presentationSeminar presentation
Seminar presentationKlawal13
 
Big Data Analytics: Finding diamonds in the rough with Azure
Big Data Analytics: Finding diamonds in the rough with AzureBig Data Analytics: Finding diamonds in the rough with Azure
Big Data Analytics: Finding diamonds in the rough with AzureChristos Charmatzis
 
Overview of Big Data by Sunny
Overview of Big Data by SunnyOverview of Big Data by Sunny
Overview of Big Data by SunnyDignitasDigital1
 

Similar to PUC Masterclass Big Data (20)

Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big Data
 
Hadoop HDFS.ppt
Hadoop HDFS.pptHadoop HDFS.ppt
Hadoop HDFS.ppt
 
Big Data
Big DataBig Data
Big Data
 
Rpi talk foster september 2011
Rpi talk foster september 2011Rpi talk foster september 2011
Rpi talk foster september 2011
 
Big data
Big dataBig data
Big data
 
INF2190_W1_2016_public
INF2190_W1_2016_publicINF2190_W1_2016_public
INF2190_W1_2016_public
 
Data analytics & its Trends
Data analytics & its TrendsData analytics & its Trends
Data analytics & its Trends
 
Storage for next-generation sequencing
Storage for next-generation sequencingStorage for next-generation sequencing
Storage for next-generation sequencing
 
Big Data
Big Data Big Data
Big Data
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
 
Research issues in the big data and its Challenges
Research issues in the big data and its ChallengesResearch issues in the big data and its Challenges
Research issues in the big data and its Challenges
 
2014 pycon-talk
2014 pycon-talk2014 pycon-talk
2014 pycon-talk
 
So Long Computer Overlords
So Long Computer OverlordsSo Long Computer Overlords
So Long Computer Overlords
 
Roots tech 2013 Big Data at Ancestry (3-22-2013) - no animations
Roots tech 2013 Big Data at Ancestry (3-22-2013) - no animationsRoots tech 2013 Big Data at Ancestry (3-22-2013) - no animations
Roots tech 2013 Big Data at Ancestry (3-22-2013) - no animations
 
Big Data By Vijay Bhaskar Semwal
Big Data By Vijay Bhaskar SemwalBig Data By Vijay Bhaskar Semwal
Big Data By Vijay Bhaskar Semwal
 
Big Data & Hadoop Introduction
Big Data & Hadoop IntroductionBig Data & Hadoop Introduction
Big Data & Hadoop Introduction
 
Seminar presentation
Seminar presentationSeminar presentation
Seminar presentation
 
Big Data Analytics: Finding diamonds in the rough with Azure
Big Data Analytics: Finding diamonds in the rough with AzureBig Data Analytics: Finding diamonds in the rough with Azure
Big Data Analytics: Finding diamonds in the rough with Azure
 
Introduction Big data
Introduction Big data  Introduction Big data
Introduction Big data
 
Overview of Big Data by Sunny
Overview of Big Data by SunnyOverview of Big Data by Sunny
Overview of Big Data by Sunny
 

More from Arjen de Vries

Masterclass Big Data (leerlingen)
Masterclass Big Data (leerlingen) Masterclass Big Data (leerlingen)
Masterclass Big Data (leerlingen) Arjen de Vries
 
Beverwedstrijd Big Data (klas 3/4/5/6)
Beverwedstrijd Big Data (klas 3/4/5/6) Beverwedstrijd Big Data (klas 3/4/5/6)
Beverwedstrijd Big Data (klas 3/4/5/6) Arjen de Vries
 
Beverwedstrijd Big Data (groep 5/6 en klas 1/2)
Beverwedstrijd Big Data (groep 5/6 en klas 1/2)Beverwedstrijd Big Data (groep 5/6 en klas 1/2)
Beverwedstrijd Big Data (groep 5/6 en klas 1/2)Arjen de Vries
 
Web Archives and the dream of the Personal Search Engine
Web Archives and the dream of the Personal Search EngineWeb Archives and the dream of the Personal Search Engine
Web Archives and the dream of the Personal Search EngineArjen de Vries
 
Information Retrieval and Social Media
Information Retrieval and Social MediaInformation Retrieval and Social Media
Information Retrieval and Social MediaArjen de Vries
 
Information Retrieval intro TMM
Information Retrieval intro TMMInformation Retrieval intro TMM
Information Retrieval intro TMMArjen de Vries
 
ACM SIGIR 2017 - Opening - PC Chairs
ACM SIGIR 2017 - Opening - PC ChairsACM SIGIR 2017 - Opening - PC Chairs
ACM SIGIR 2017 - Opening - PC ChairsArjen de Vries
 
Data Science Master Specialisation
Data Science Master SpecialisationData Science Master Specialisation
Data Science Master SpecialisationArjen de Vries
 
Models for Information Retrieval and Recommendation
Models for Information Retrieval and RecommendationModels for Information Retrieval and Recommendation
Models for Information Retrieval and RecommendationArjen de Vries
 
Better Contextual Suggestions by Applying Domain Knowledge
Better Contextual Suggestions by Applying Domain KnowledgeBetter Contextual Suggestions by Applying Domain Knowledge
Better Contextual Suggestions by Applying Domain KnowledgeArjen de Vries
 
Similarity & Recommendation - CWI Scientific Meeting - Sep 27th, 2013
Similarity & Recommendation - CWI Scientific Meeting - Sep 27th, 2013Similarity & Recommendation - CWI Scientific Meeting - Sep 27th, 2013
Similarity & Recommendation - CWI Scientific Meeting - Sep 27th, 2013Arjen de Vries
 
ESSIR 2013 - IR and Social Media
ESSIR 2013 - IR and Social MediaESSIR 2013 - IR and Social Media
ESSIR 2013 - IR and Social MediaArjen de Vries
 
Looking beyond plain text for document representation in the enterprise
Looking beyond plain text for document representation in the enterpriseLooking beyond plain text for document representation in the enterprise
Looking beyond plain text for document representation in the enterpriseArjen de Vries
 
Recommendation and Information Retrieval: Two Sides of the Same Coin?
Recommendation and Information Retrieval: Two Sides of the Same Coin?Recommendation and Information Retrieval: Two Sides of the Same Coin?
Recommendation and Information Retrieval: Two Sides of the Same Coin?Arjen de Vries
 
Searching Political Data by Strategy
Searching Political Data by StrategySearching Political Data by Strategy
Searching Political Data by StrategyArjen de Vries
 
How to Search Annotated Text by Strategy?
How to Search Annotated Text by Strategy?How to Search Annotated Text by Strategy?
How to Search Annotated Text by Strategy?Arjen de Vries
 
How to build the next 1000 search engines?!
How to build the next 1000 search engines?! How to build the next 1000 search engines?!
How to build the next 1000 search engines?! Arjen de Vries
 
What to do when one size does not fit all?!
What to do when one size does not fit all?!What to do when one size does not fit all?!
What to do when one size does not fit all?!Arjen de Vries
 

More from Arjen de Vries (20)

Doing a PhD @ DOSSIER
Doing a PhD @ DOSSIERDoing a PhD @ DOSSIER
Doing a PhD @ DOSSIER
 
Masterclass Big Data (leerlingen)
Masterclass Big Data (leerlingen) Masterclass Big Data (leerlingen)
Masterclass Big Data (leerlingen)
 
Beverwedstrijd Big Data (klas 3/4/5/6)
Beverwedstrijd Big Data (klas 3/4/5/6) Beverwedstrijd Big Data (klas 3/4/5/6)
Beverwedstrijd Big Data (klas 3/4/5/6)
 
Beverwedstrijd Big Data (groep 5/6 en klas 1/2)
Beverwedstrijd Big Data (groep 5/6 en klas 1/2)Beverwedstrijd Big Data (groep 5/6 en klas 1/2)
Beverwedstrijd Big Data (groep 5/6 en klas 1/2)
 
Web Archives and the dream of the Personal Search Engine
Web Archives and the dream of the Personal Search EngineWeb Archives and the dream of the Personal Search Engine
Web Archives and the dream of the Personal Search Engine
 
Information Retrieval and Social Media
Information Retrieval and Social MediaInformation Retrieval and Social Media
Information Retrieval and Social Media
 
Information Retrieval intro TMM
Information Retrieval intro TMMInformation Retrieval intro TMM
Information Retrieval intro TMM
 
ACM SIGIR 2017 - Opening - PC Chairs
ACM SIGIR 2017 - Opening - PC ChairsACM SIGIR 2017 - Opening - PC Chairs
ACM SIGIR 2017 - Opening - PC Chairs
 
Data Science Master Specialisation
Data Science Master SpecialisationData Science Master Specialisation
Data Science Master Specialisation
 
Models for Information Retrieval and Recommendation
Models for Information Retrieval and RecommendationModels for Information Retrieval and Recommendation
Models for Information Retrieval and Recommendation
 
Better Contextual Suggestions by Applying Domain Knowledge
Better Contextual Suggestions by Applying Domain KnowledgeBetter Contextual Suggestions by Applying Domain Knowledge
Better Contextual Suggestions by Applying Domain Knowledge
 
Similarity & Recommendation - CWI Scientific Meeting - Sep 27th, 2013
Similarity & Recommendation - CWI Scientific Meeting - Sep 27th, 2013Similarity & Recommendation - CWI Scientific Meeting - Sep 27th, 2013
Similarity & Recommendation - CWI Scientific Meeting - Sep 27th, 2013
 
ESSIR 2013 - IR and Social Media
ESSIR 2013 - IR and Social MediaESSIR 2013 - IR and Social Media
ESSIR 2013 - IR and Social Media
 
Looking beyond plain text for document representation in the enterprise
Looking beyond plain text for document representation in the enterpriseLooking beyond plain text for document representation in the enterprise
Looking beyond plain text for document representation in the enterprise
 
Recommendation and Information Retrieval: Two Sides of the Same Coin?
Recommendation and Information Retrieval: Two Sides of the Same Coin?Recommendation and Information Retrieval: Two Sides of the Same Coin?
Recommendation and Information Retrieval: Two Sides of the Same Coin?
 
Searching Political Data by Strategy
Searching Political Data by StrategySearching Political Data by Strategy
Searching Political Data by Strategy
 
How to Search Annotated Text by Strategy?
How to Search Annotated Text by Strategy?How to Search Annotated Text by Strategy?
How to Search Annotated Text by Strategy?
 
How to build the next 1000 search engines?!
How to build the next 1000 search engines?! How to build the next 1000 search engines?!
How to build the next 1000 search engines?!
 
Big data hadoop rdbms
Big data hadoop rdbmsBig data hadoop rdbms
Big data hadoop rdbms
 
What to do when one size does not fit all?!
What to do when one size does not fit all?!What to do when one size does not fit all?!
What to do when one size does not fit all?!
 

Recently uploaded

Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learninglevieagacer
 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Silpa
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPirithiRaju
 
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedDelhi Call girls
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformationAreesha Ahmad
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSérgio Sacani
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...chandars293
 
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)AkefAfaneh2
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
Introduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptxIntroduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptxBhagirath Gogikar
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Silpa
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxRizalinePalanog2
 
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...Mohammad Khajehpour
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Servicenishacall1
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptxAlMamun560346
 

Recently uploaded (20)

Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Introduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptxIntroduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptx
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
 
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
 

PUC Masterclass Big Data

  • 1. Pre-University College Masterclass Big Data Prof.dr.ir. Arjen P. de Vries arjen@acm.org Nijmegen, February 20th , 2017
  • 2. Overview  Big Data - Defining properties? - The data center as the computer!  Very brief: map-reduce  Streaming data!  Whatever pops up meanwhile
  • 3. “Big Data” If your organization stores multiple petabytes of data, if the information most critical to your business resides in forms other than rows and columns of numbers, or if answering your biggest question would involve a “mashup” of several analytical efforts, you’ve got a big data opportunity http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
  • 4. Process  Challenges in Big Data Analytics include - capturing data, - aligning data from different sources (e.g., resolving when two objects are the same), - transforming the data into a form suitable for analysis, - modeling it, whether mathematically, or through some form of simulation, - understanding the output — visualizing and sharing the results Attributed to IBM Research’s Laura Haas in http://www.odbms.org/download/Zicari.pdf
  • 5. The “Data Scientist”  Suggested reading: - Harvard Business Review: http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ - A 2001 (!) Bell Labs technical report Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics http://www.stat.purdue.edu/~wsc/papers/datascience.pdf - Quora http://www.quora.com/What-is-it-like-to-be-a-data-scientist
  • 6.
  • 7. Big Data?  Big Data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze - McKinsey Global Institute, “Big data: The next frontier for innovation, competition and productivity.” May 2011.
  • 8. Big Data?  Big data is the data that you aren’t able to process and use quickly enough with the technology you have now - Buck Woody http://www.simple-talk.com/sql/database-administration/big-data-is-just-a-fad/ We need to think about data comprehensively – all types of data.
  • 9. Big Data  The 3 Vs (sometimes others are added): - Volume We measure more and more; the resulting data is very large already, and it grows faster and faster - Velocity The analysis may take too long for appropriate reaction to measurement - Variety The data comes in many variants, structured and unstructured
  • 10. Why Big Data?  We can analyse (and differentiate) to the level of the individual  We are less likely to miss rare events, e.g., those that occur one out of ten million times  We can better account for the real-time nature of the data
  • 11. No data like more data! (Banko and Brill, ACL 2001) (Brants et al., EMNLP 2007) s/knowledge/data/g; How do we get here if we’re not Google?
  • 12. Exercise  What examples of big data to analyze can we imagine?  How much data could that be?
  • 13. Big?  20 Terabyte? - Clueweb 2009  80 – 120 – 150 Terabyte? - Recent “web” crawls (IA, CommonCrawl 2009-2016, Clueweb 2012)  10 Petabyte? - Complete Internet Archive
  • 14. How much data? 9 PB of user data + >50 TB/day (11/2011) processes 20 PB a day (2008) 36 PB of user data + 80-90 TB/day (6/2010) Wayback Machine: 3 PB + 100 TB/month (3/2009) LHC: ~15 PB a year (at full capacity) LSST: 6-10 PB a year (~2015) 150 PB on 50k+ servers running 15k apps S3: 449B objects, peak 290k request/second (7/2011)
  • 15. How big is big?  Facebook (Aug 2012): - 2.5 billion content items shared per day (status updates + wall posts + photos + videos + comments) - 2.7 billion Likes per day - 300 million photos uploaded per day
  • 16. Big is very big!  100+ petabytes of disk space in one of FB’s largest Hadoop (HDFS) clusters  105 terabytes of data scanned via Hive, Facebook’s Hadoop query language, every 30 minutes  70,000 queries executed on these databases per day  500+ terabytes of new data ingested into the databases every day http://gigaom.com/data/facebook-is-collecting-your-data-500-terabytes-a-day/
  • 17. Back of the Envelope  Note: “105 terabytes of data scanned every 30 minutes”  A very very fast disk can do 300 MB/s – so, on one disk, this would take (105 TB = 110100480 MB) / 300 (MB/s) = 367Ks =~ 6000m  So at least 200 disks are used in parallel!  PS: the June 2010 estimate was that facebook ran on 60K servers
  • 18. Shared-nothing  A collection of independent, possibly virtual, machines, each with local disk and local main memory, connected together on a high-speed network - Possible trade-off: large number of low-end servers instead of small number of high-end ones
  • 21. Source: Google Data Center (is the Computer)
  • 22. Source: NY Times (6/14/2006), http://www.nytimes.com/2006/06/14/technology/14search.html
  • 23. FB’s Data Centers  Suggested further reading: - http://www.datacenterknowledge.com/the-facebook-data-center-faq/ - http://opencompute.org/ - “Open hardware”: server, storage, and data center - Claim 38% more efficient and 24% less expensive to build and run than other state-of-the-art data centers
  • 24. Building Blocks Source: Barroso, Clidaras and Hölzle (2013): DOI 10.2200/S00516ED2V01Y201306CAC024
  • 25. Storage Hierarchy Source: Barroso, Clidaras and Hölzle (2013): DOI 10.2200/S00516ED2V01Y201306CAC024
  • 26. Numbers Everyone Should Know L1 cache reference 0.5 ns Branch mispredict 5 ns L2 cache reference 7 ns Mutex lock/unlock 100 ns Main memory reference 100 ns Compress 1K bytes with Zippy 10,000 ns Send 2K bytes over 1 Gbps network 20,000 ns Read 1 MB sequentially from memory 250,000 ns Round trip within same datacenter 500,000 ns Disk seek 10,000,000 ns Read 1 MB sequentially from network 10,000,000 ns Read 1 MB sequentially from disk 30,000,000 ns Send packet CA->Netherlands->CA 150,000,000 ns According to Jeff Dean
  • 27. Storage Hierarchy Source: Barroso, Clidaras and Hölzle (2013): DOI 10.2200/S00516ED2V01Y201306CAC024
  • 28. Storage Hierarchy Source: Barroso, Clidaras and Hölzle (2013): DOI 10.2200/S00516ED2V01Y201306CAC024
  • 29. Quiz Time!!  Consider a 1 TB database with 100 byte records - We want to update 1 percent of the records Plan A: Seek to the records and make the updates Plan B: Write out a new database that includes the updates Source: Ted Dunning, on Hadoop mailing list
  • 30. Seeks vs. Scans  Consider a 1 TB database with 100 byte records - We want to update 1 percent of the records  Scenario 1: random access - Each update takes ~30 ms (seek, read, write) - 108 updates = ~35 days  Scenario 2: rewrite all records - Assume 100 MB/s throughput - Time = 5.6 hours(!)  Lesson: avoid random seeks! In words of Prof. Peter Boncz (CWI & VU): “Latency is the enemy” Source: Ted Dunning, on Hadoop mailing list
  • 31. Parallel Programming is Difficult  Concurrency is difficult to reason about - At the scale of datacenter(s) - In the presence of failures - In terms of multiple interacting services  In the dark ages of data center computing… - Lots of one-off solutions, custom code - Programmers using their own dedicated libraries - Burden on the programmer to explicitly manage everything
  • 32. Observation  Remember: 0.5ns (L1) vs. 500,000ns (round trip in datacenter) Δ is 6 orders in magnitude!  With huge amounts of data (and resources necessary to process it), we simply cannot expect to ship the data to the application – the application logic needs to ship to the data!
  • 33. Gray’s Laws How to approach data engineering challenges for large-scale scientific datasets: 1. Scientific computing is becoming increasingly data intensive 2. The solution is in a “scale-out” architecture 3. Bring computations to the data, rather than data to the computations 4. Start the design with the “20 queries” 5. Go from “working to working” See: http://research.microsoft.com/en-us/collaboration/fourthparadigm/4th_paradigm_book_part1_szalay.pdf
  • 34. Emerging (Emerged ) Big Data Systems  Distributed  Shared-nothing - None of the resources are logically shared between processes  Data parallel - Exactly the same task is performed on different pieces of the data
  • 35. A Prototype “Big Data Analysis” Task  Iterate over a large number of records  Extract something of interest from each  Aggregate intermediate results - Usually, aggregation requires to shuffle and sort the intermediate results  Generate final output Key idea: provide a functional abstraction for these two operations Map Reduce (Dean and Ghemawat, OSDI 2004)
  • 36. Streaming Big Data  What if you cannot store all the data coming in? - E.g., small devices in Internet of Things  Can you carry out the analysis without making a copy?  Hands-on session! - Tutorial: https://rubigdata.github.io/course/puc/ - Code: https://github.com/rubigdata/puc/  But first things first: Big Food 