Big Data is often shrouded in mystery and jargon. This talk will attempt to demystify the topic through a series of short vignettes on how to deal pragmatically with Big Data, including: how to avoid Big Data problems in the first place, hardware optimizations, and scaling code through functional programming.
Bio: Dr. Brian Spiering is a professor of Data Science at Galvanize University, an industry-driven, outcomes-focused education institution offering a Master's in Data Science. He teaches Natural Language Processing (NLP), Data Engineering, and Deep Learning.
3. Roadmap
1. Defining “Big Data” (aka, you probably don’t have Big Data)
2. How to avoid Big Data (and associated problems)
3. Okay, I really have Big Data. What should I do?
5. What is Big Data?
“Data sets with sizes beyond the ability of
commonly used software tools to capture,
curate, manage, and process data within a
tolerable amount of time.”
6. What is Big Data?
Data that doesn’t fit on a single machine.
31. What to do:
1. Learn some math tricks (linear algebra)
2. Learn how to optimize your code
3. Learn how to use cloud compute
4. Learn a Big Data Framework
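A minimal sketch of item 1 (math tricks via linear algebra): the same per-row computation written as explicit Python loops and as a single vectorized NumPy call. The function names are my own illustration, not from the talk.

```python
import numpy as np

def row_means_loop(x):
    # Naive per-row mean with explicit Python loops
    result = []
    for row in x:
        total = 0.0
        for value in row:
            total += value
        result.append(total / len(row))
    return result

def row_means_vectorized(x):
    # Same computation expressed as one vectorized NumPy call,
    # which runs in optimized C instead of the Python interpreter
    return x.mean(axis=1)

x = np.arange(12, dtype=float).reshape(3, 4)
print(row_means_loop(x))        # [1.5, 5.5, 9.5]
print(row_means_vectorized(x))  # [1.5 5.5 9.5]
```

On data of any real size, the vectorized form is typically orders of magnitude faster, which is often enough to keep a "Big Data" problem on one machine.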
32. Where have we been?
1. Defining “Big Data” (aka, you probably don’t have Big Data)
2. How to avoid Big Data (and associated problems)
3. Okay, I really have Big Data. What should I do?
Good Evening!
Tonight, I’m going to share a couple of practical tips on handling Big Data.
I’m …
I have been working in Big Data for the last couple of years.
About a year ago I joined Galvanize.
Galvanize is an education company that builds learning communities.
GalvanizeU offers the Master of Science in Data Science (MSDS).
I teach NLP, Big Data, and Deep Learning.
Many people think they have Big Data, but most don’t.
Here is a popular quote.
Does this sound reasonable?
Let me define more precisely what I mean by a single machine:
Compute (RAM)
Storage (Disk)
You can load hundreds of megabytes into memory in an efficient vectorized format.
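A hypothetical sketch of that vectorized loading, using pandas (the column names and dtypes are illustrative, not from the talk). Explicit narrow dtypes are what keep hundreds of megabytes of CSV compact once in RAM.

```python
import io
import pandas as pd

# Stand-in for a real file on disk
csv_data = io.StringIO("user_id,clicks,segment\n1,10,a\n2,3,b\n3,7,a\n")

# Explicit dtypes: 32-bit ints and a categorical column use far less
# memory than the default 64-bit ints and Python-object strings
df = pd.read_csv(
    csv_data,
    dtype={"user_id": "int32", "clicks": "int32", "segment": "category"},
)
print(df.memory_usage(deep=True).sum())  # bytes actually held in RAM
```

With the right dtypes, a multi-gigabyte raw file often fits comfortably in the memory of an ordinary laptop.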
Tell story: I was working at a SaaS company; my intern was fitting a random forest for churn on 1 million rows with 1,000 attributes.
Approximate fitting times: R (8 hours), Python (1 hour), Spark (10 minutes).
I was working for a company doing competitive intelligence…
The data fit in a 5 GB data frame on my laptop, with real-time queries (<100 ms).
Wes McKinney has projects to scale out pandas: Ibis and Arrow.
Single “computer”
redefine “machine”
2TB of RAM
2,000 GB
In-memory DBs
Limited rollout and use so far, but it’s the future.
bigger, cheaper, faster, easier
[Walk through slowly]
http://www.theregister.co.uk/2016/04/04/memory_and_storage_boundary_changes/
Remember the competitive intelligence project? It took 5 minutes to load the data into RAM.
“The difference between RAM and cache is its performance, cost, and proximity to the CPU. Cache is faster, more costly, and closest to the CPU. Due to the cost there is much less cache than RAM. The most basic computer is a CPU and storage for data. The structure we have these days is to give us the most bang for the buck. Generally faster is more expensive. For best performance the faster more expensive storage is closer to the CPU. The relation is like this: CPU-L1 cache-L2 cache-RAM-Hard drive-backup media(tape). The CPU itself has its registers for storing data. The cost per bit of storage goes down from the CPU out.”
Stay local or stay in the cloud
I was storing the data
Moore’s Law: the number of transistors in a dense integrated circuit doubles approximately every two years (a ~60% annual growth rate). Printer analogy: a smaller font puts more information on each sheet.
Kryder’s Law: a 2005 Scientific American article observed that magnetic disk areal storage density was then increasing very quickly, at a pace much faster than the two-year doubling time of semiconductor chip density posited by Moore’s Law.
Nielsen’s Law of Internet bandwidth states that a high-end user’s connection speed grows by 50% per year.
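A back-of-the-envelope compounding of the growth rates above (the ~60% and 50% annual figures are the approximate rates quoted, not guarantees): over a decade, compute pulls roughly 2x ahead of bandwidth, which is why moving data is increasingly the bottleneck.

```python
# Compound the quoted annual growth rates over ten years
years = 10
compute = 1.60 ** years    # Moore-style ~60%/year growth
bandwidth = 1.50 ** years  # Nielsen's ~50%/year growth
print(round(compute / bandwidth, 2))  # 1.91: compute outpaces bandwidth
```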
These numbers are going to change
- Both in value
- Relative tipping point
What is your preference?
Alex Smola: formerly Carnegie Mellon, now leading AWS machine learning offerings.
Functional programming is an API call: what, not how.
Less code.
Functional style hides optimizations:
we can swap out the underlying code.
Optimization: distribute/parallelize by row.
Send each row to a worker (a core or a cluster member).
Power Law - The internet 101
Chris Anderson
Movies: a few blockbusters, many in the middle of the pack, and YouTube/Vimeo have enabled MANY amateur cinematographers.
Power Law - The internet 101
Head, Torso, Tail for recommenders
Keep:
Head in cache
Torso in RAM
Tail on disk
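The head/torso/tail tiering above can be sketched as a fall-through lookup (the item names, scores, and plain dicts standing in for cache, RAM, and disk are all my own illustration):

```python
# Each dict stands in for one storage tier, fastest first
cache = {"blockbuster": 0.99}    # head: the few hottest items
ram = {"mid_list_movie": 0.75}   # torso: warm items
disk = {"obscure_short": 0.40}   # tail: the long tail of everything else

def lookup(item):
    # Check the fastest tier first, falling through to slower ones
    for tier_name, tier in (("cache", cache), ("ram", ram), ("disk", disk)):
        if item in tier:
            return tier_name, tier[item]
    return None, None

print(lookup("blockbuster"))    # ('cache', 0.99)
print(lookup("obscure_short"))  # ('disk', 0.4)
```

Because a power-law catalog concentrates most requests on the head, the tiny fast tier serves most traffic while the cheap slow tier holds most of the data.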
- Learn Spark first, then go back to Hadoop.
Spark just works better and is easier to understand.
Beyond the scope of this talk: Databricks Cloud.
Get out of the data center as quickly as possible.
Simple ETL into aggregates.
Competitive intelligence project: I would ETL in the cloud and fit models on aggregated data locally.
Inputs and outputs.
Hadoop / MapReduce / Spark extend this, but are still functional.
Practice on simple problems, then extend to your data.
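A simple problem to practice on: word count expressed with only `map` and `reduce`, the same functional shape that Hadoop/MapReduce and Spark jobs take (the toy input lines are my own example).

```python
from functools import reduce
from collections import Counter

lines = ["big data big", "data small"]

# Map: each line independently becomes a Counter of its words
mapped = map(lambda line: Counter(line.split()), lines)

# Reduce: merge the per-line counts into a single total
total = reduce(lambda a, b: a + b, mapped)
print(total)  # Counter({'big': 2, 'data': 2, 'small': 1})
```

Because each line is mapped independently and the reduce is associative, the same program parallelizes across cores or cluster nodes without changing its logic.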
Keep The Goal, The Goal.
I love to delight people, especially customers.
What are you trying to do with your data?
Properly spec’d, it’s often not Big Data.
Data Density