Big Data is often shrouded in mystery and jargon. This talk will attempt to demystify the topic through a series of short vignettes on how to deal pragmatically with Big Data, including: how to avoid Big Data problems in the first place, hardware optimizations, and scaling code through functional programming.
Bio: Dr. Brian Spiering is a professor of Data Science at Galvanize University, an industry-driven, outcomes-focused education institution offering a Master's in Data Science. He teaches Natural Language Processing (NLP), Data Engineering, and Deep Learning.
3. Roadmap
1. Defining “Big Data” (aka, you probably don’t have Big Data)
2. How to avoid Big Data (and associated problems)
3. Okay, I really have Big Data. What should I do?
5. What is Big Data?
“Data sets with sizes beyond the ability of
commonly used software tools to capture,
curate, manage, and process data within a
tolerable amount of time.”
6. What is Big Data?
Data that doesn’t fit on a single machine.
31. What to do:
1. Learn some math tricks (linear algebra)
2. Learn how to optimize your code
3. Learn how to use cloud compute
4. Learn a Big Data Framework
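A minimal sketch of item 1 (math tricks via linear algebra): the same per-row computation written as explicit Python loops and as a single vectorized NumPy call. The function names are my own illustration, not from the talk.

```python
import numpy as np

def row_means_loop(x):
    # Naive per-row mean with explicit Python loops
    result = []
    for row in x:
        total = 0.0
        for value in row:
            total += value
        result.append(total / len(row))
    return result

def row_means_vectorized(x):
    # Same computation expressed as one vectorized NumPy call,
    # which runs in optimized C instead of the Python interpreter
    return x.mean(axis=1)

x = np.arange(12, dtype=float).reshape(3, 4)
print(row_means_loop(x))        # [1.5, 5.5, 9.5]
print(row_means_vectorized(x))  # [1.5 5.5 9.5]
```

On data of any real size, the vectorized form is typically orders of magnitude faster, which is often enough to keep a "Big Data" problem on one machine.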
32. Where have we been?
1. Defining “Big Data” (aka, you probably don’t have Big Data)
2. How to avoid Big Data (and associated problems)
3. Okay, I really have Big Data. What should I do?
Good Evening!
Tonight, I’m going to share a couple of practical tips on handling Big Data.
I’m …
I have been working in Big Data for the last couple of years.
About a year ago I joined Galvanize.
Galvanize is an education company that builds learning communities.
GalvanizeU offers the Master of Science in Data Science (MSDS).
I teach NLP, Big Data, and Deep Learning.
Many people think they have Big Data, but most don’t.
Here is a popular quote.
Does this sound reasonable?
Let me define more precisely what I mean by a single machine:
Compute (RAM)
Storage (Disk)
You can load hundreds of megabytes into memory in an efficient vectorized format.
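A hypothetical sketch of that vectorized loading, using pandas (the column names and dtypes are illustrative, not from the talk). Explicit narrow dtypes are what keep hundreds of megabytes of CSV compact once in RAM.

```python
import io
import pandas as pd

# Stand-in for a real file on disk
csv_data = io.StringIO("user_id,clicks,segment\n1,10,a\n2,3,b\n3,7,a\n")

# Explicit dtypes: 32-bit ints and a categorical column use far less
# memory than the default 64-bit ints and Python-object strings
df = pd.read_csv(
    csv_data,
    dtype={"user_id": "int32", "clicks": "int32", "segment": "category"},
)
print(df.memory_usage(deep=True).sum())  # bytes actually held in RAM
```

With the right dtypes, a multi-gigabyte raw file often fits comfortably in the memory of an ordinary laptop.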
Tell story: I was working at a SaaS company; my intern was fitting a random forest for churn on 1 million rows with 1,000 attributes.
Approximate fitting times: R (8 hours), Python (1 hour), Spark (10 minutes).
I was working for a company doing competitive intelligence…
The data fit in a 5 GB data frame on my laptop, with real-time queries (<100 ms).
Wes McKinney has projects to scale out pandas: Ibis and Arrow.
Single “computer”
redefine “machine”
2TB of RAM
2,000 GB
In-memory DBs
Limited rollout and use so far, but it’s the future.
bigger, cheaper, faster, easier
[Walk through slowly]
http://www.theregister.co.uk/2016/04/04/memory_and_storage_boundary_changes/
Remember the competitive intelligence project? It took 5 minutes to load the data into RAM.
“The difference between RAM and cache is its performance, cost, and proximity to the CPU. Cache is faster, more costly, and closest to the CPU. Due to the cost there is much less cache than RAM. The most basic computer is a CPU and storage for data. The structure we have these days is to give us the most bang for the buck. Generally faster is more expensive. For best performance the faster more expensive storage is closer to the CPU. The relation is like this: CPU-L1 cache-L2 cache-RAM-Hard drive-backup media(tape). The CPU itself has its registers for storing data. The cost per bit of storage goes down from the CPU out.”
Stay local or stay in the cloud
I was storing the data
Moore’s Law: the number of transistors in a dense integrated circuit doubles approximately every two years (a ~60% annual growth rate). Printer analogy: a smaller font puts more information on each sheet.
Kryder’s Law: a 2005 Scientific American article observed that magnetic disk areal storage density was then increasing very quickly, at a pace much faster than the two-year doubling time of semiconductor chip density posited by Moore’s Law.
Nielsen’s Law of Internet bandwidth states that a high-end user’s connection speed grows by 50% per year.
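A back-of-the-envelope compounding of the growth rates above (the ~60% and 50% annual figures are the approximate rates quoted, not guarantees): over a decade, compute pulls roughly 2x ahead of bandwidth, which is why moving data is increasingly the bottleneck.

```python
# Compound the quoted annual growth rates over ten years
years = 10
compute = 1.60 ** years    # Moore-style ~60%/year growth
bandwidth = 1.50 ** years  # Nielsen's ~50%/year growth
print(round(compute / bandwidth, 2))  # 1.91: compute outpaces bandwidth
```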
These numbers are going to change
- Both in value
- Relative tipping point
What is your preference?
Alex Smola: formerly Carnegie Mellon, now leading AWS machine learning offerings.
Functional programming is an API call: what, not how.
Less code.
Functional style hides optimizations:
we can swap out the underlying code.
Optimization: distribute/parallelize by row.
Send each row to a worker (a core or a cluster member).
Power Law - The internet 101
Chris Anderson
Movies: a few blockbusters, many in the middle of the pack, and YouTube/Vimeo have enabled MANY amateur cinematographers.
Power Law - The internet 101
Head, Torso, Tail for recommenders
Keep:
Head in cache
Torso in RAM
Tail on disk
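The head/torso/tail tiering above can be sketched as a fall-through lookup (the item names, scores, and plain dicts standing in for cache, RAM, and disk are all my own illustration):

```python
# Each dict stands in for one storage tier, fastest first
cache = {"blockbuster": 0.99}    # head: the few hottest items
ram = {"mid_list_movie": 0.75}   # torso: warm items
disk = {"obscure_short": 0.40}   # tail: the long tail of everything else

def lookup(item):
    # Check the fastest tier first, falling through to slower ones
    for tier_name, tier in (("cache", cache), ("ram", ram), ("disk", disk)):
        if item in tier:
            return tier_name, tier[item]
    return None, None

print(lookup("blockbuster"))    # ('cache', 0.99)
print(lookup("obscure_short"))  # ('disk', 0.4)
```

Because a power-law catalog concentrates most requests on the head, the tiny fast tier serves most traffic while the cheap slow tier holds most of the data.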
- Learn Spark first, then go back to Hadoop.
Spark just works better and is easier to understand.
Beyond the scope of this talk: Databricks Cloud.
Get out of the data center as quickly as possible.
Simple ETL into aggregates.
Competitive intelligence project: I would ETL in the cloud and fit models on aggregated data locally.
Inputs and outputs.
Hadoop / MapReduce / Spark extend this, but are still functional.
Practice on simple problems, then extend to your data.
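A simple problem to practice on: word count expressed with only `map` and `reduce`, the same functional shape that Hadoop/MapReduce and Spark jobs take (the toy input lines are my own example).

```python
from functools import reduce
from collections import Counter

lines = ["big data big", "data small"]

# Map: each line independently becomes a Counter of its words
mapped = map(lambda line: Counter(line.split()), lines)

# Reduce: merge the per-line counts into a single total
total = reduce(lambda a, b: a + b, mapped)
print(total)  # Counter({'big': 2, 'data': 2, 'small': 1})
```

Because each line is mapped independently and the reduce is associative, the same program parallelizes across cores or cluster nodes without changing its logic.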
Keep The Goal, The Goal.
I love to delight people, especially customers.
What are you trying to do with your data?
Properly spec’d, it’s often not Big Data.
Data Density