This is a talk about Big Data, focusing on its impact on all of us. It also encourages institutions to take a close look at offering courses in this area.
2. Report Highlights and Units
The McKinsey Global Institute Report
Capturing Big Data Value
What is Big Data?
Various Dimensions of Big Data
Tools and Technologies
Additional Tools and Technologies
How can we benefit from big data?
Transforming the organization
Talent Specifications
Educational Courses / Training
References
Cloudera Distribution including Hadoop
Conclusion
3. The Units
Multiples of bytes (SI decimal prefixes)
Name (Symbol)      Value
kilobyte (kB)      10^3
megabyte (MB)      10^6
gigabyte (GB)      10^9
terabyte (TB)      10^12
petabyte (PB)      10^15
exabyte (EB)       10^18
zettabyte (ZB)     10^21
yottabyte (YB)     10^24
Talk Highlights
CERN generates 40 TB/sec of data.
In 2009, nearly all sectors of the US economy had an average of 200 TB of stored data per company.
One exabyte is roughly 4,000 times the information stored in the US Library of Congress.
The US Library of Congress had collected 235 terabytes of data as of April 2011.
In 15 of 17 US sectors, companies store more data per company than the US Library of Congress holds.
By 2018, the US will face a shortage of 140,000 to 190,000 deep data analysts and 1.5 million data-savvy managers.
Training on big data is still in its infancy.
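The size comparisons in these highlights can be checked with a little arithmetic on the SI prefixes from the units slide. The sketch below is only an illustration: the names `SI_BYTES` and `to_bytes` are made up for this example, and the inputs (235 TB for the Library of Congress, 40 TB/sec for CERN) are simply the figures quoted above.

```python
# SI decimal prefixes from the units slide, expressed in bytes.
SI_BYTES = {
    "kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12,
    "PB": 10**15, "EB": 10**18, "ZB": 10**21, "YB": 10**24,
}


def to_bytes(value, unit):
    """Convert a quantity such as (235, 'TB') to a number of bytes."""
    return value * SI_BYTES[unit]


# Library of Congress holdings as of April 2011, as quoted in the highlights.
loc_bytes = to_bytes(235, "TB")

# How many Library of Congress collections fit in one exabyte?
loc_per_exabyte = to_bytes(1, "EB") / loc_bytes

# Data produced in one day (86,400 seconds) at the quoted rate of 40 TB/sec.
cern_per_day_eb = to_bytes(40, "TB") * 86_400 / SI_BYTES["EB"]

print(f"1 EB is about {loc_per_exabyte:,.0f} x the Library of Congress")
print(f"40 TB/sec sustained for one day is about {cern_per_day_eb:.2f} EB")
```

With these inputs the ratio comes out a little above 4,000, which agrees with the "approximately 4000 X" figure on the slide.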
4. The McKinsey Global Institute (MGI) Report
In May 2011, MGI issued a report on organizations being deluged with data.
This large volume of data is generally referred to as “Big Data.”
The use and analysis of Big Data will be the basis of innovation, competition and productivity.
Corporations will use “Data Science” to properly manage and utilize “Big Data.”
Data scientists are an elite, specialized class of highly compensated experts in data cleaning, analysis and visualization.
5. Capturing its Value
$300 billion/year – potential annual value to US health care
€250 billion/year – potential annual value to European government administration
$600 billion – potential annual consumer surplus from using personal location data globally
60% – potential increase in retailers’ operating margins possible with big data
However, the US alone needs 140,000 to 190,000 more people in deep analytical talent positions, and 1.5 million more data-savvy managers, to take full advantage of big data.
6. What is Big Data?
Large data sets that are impossible to manage with conventional database tools.
Size is relative: what is big today will be small tomorrow.
In 2011, our global output of data was estimated at 1.8 zettabytes.
Big Data consists of:
Structured, machine-friendly information
Unstructured, human-friendly information (email, social media, video, audio, click streams and images)
7. Various Dimensions
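Volume – terabytes to petabytes of information
Variety – extends well beyond structured data: text, audio, video, click streams, log files, etc.
Velocity – big data is frequently time-sensitive and must be used as it streams into the enterprise in order to maximize its value.
Example, static average (recomputed from all stored values each time):
7:00 AM: values 1, 3 -> 4/2 -> average = 2
10:00 AM: values 1, 3, 5 -> 9/3 -> average = 3
Example, dynamic average (updated as each value arrives; state is (new value, sum, count, average)):
After 1, 3: (_, 4, 2, 2)
After 5 arrives: (5, 9, 3, 3)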
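The static and dynamic averages in the velocity example can be sketched in a few lines of Python. This is a minimal illustration, not a production streaming design; the class name `RunningAverage` and the function `static_average` are invented for this example. The point is that the dynamic version keeps only a sum and a count, so a new value updates the average without re-reading the earlier data.

```python
def static_average(values):
    """Batch style: re-read every stored value each time an average is needed."""
    return sum(values) / len(values)


class RunningAverage:
    """Streaming style: keep only (sum, count) and update as values arrive."""

    def __init__(self):
        self.total = 0
        self.count = 0

    def update(self, value):
        """Fold one new value into the state and return the current average."""
        self.total += value
        self.count += 1
        return self.total / self.count


# Static: the 10:00 AM average is recomputed from the full data set.
print(static_average([1, 3]))       # 7:00 AM batch
print(static_average([1, 3, 5]))    # 10:00 AM batch

# Dynamic: state after 1 and 3 is (sum=4, count=2); then 5 streams in.
stream = RunningAverage()
for reading in (1, 3):
    stream.update(reading)
print(stream.total, stream.count)   # state before the new value
print(stream.update(5))             # average after the new value arrives
```

Both approaches give the same averages (2 and then 3); they differ only in how much data has to be touched when a new value arrives.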
8. Tools and Technologies
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.
Facebook, LinkedIn, Twitter and eBay use Hadoop.
Hadoop is at the center of this decade’s Big Data revolution.
In 2011, five major companies embraced Hadoop: EMC, IBM, Informatica, Microsoft and Oracle.
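To give a feel for how a large data set can be processed in pieces, here is a rough sketch of the map/shuffle/reduce pattern that Hadoop is best known for (the slides do not name MapReduce, so treat that link as added context). It is plain single-process Python, not the Hadoop API, and all function names are hypothetical; on a real cluster each chunk would be mapped on a different machine.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values for each key; here, sum the counts."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    """Run map over each chunk independently, then shuffle and reduce."""
    pairs = []
    for chunk in chunks:  # each chunk could be handled by a separate node
        pairs.extend(map_phase(chunk))
    return reduce_phase(shuffle(pairs))


chunks = ["big data is big", "data science uses big data"]
counts = word_count(chunks)
print(counts["big"], counts["data"], counts["science"])
```

Because each chunk is mapped independently, the work can be spread across many machines and only the grouped results need to be brought together.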
9. Additional Technologies
Cassandra – a scalable multi-master database with no single points of failure
Chukwa – a data collection system for managing large distributed systems
HBase – a scalable, distributed database that supports structured data storage for large tables
Hive – a data warehouse infrastructure that provides data summarization and ad hoc querying
Mahout – a scalable machine learning and data mining library
Pig – a high-level data-flow language and execution framework for parallel computation
ZooKeeper – a high-performance coordination service for distributed applications
10. Commercial Technology (CDH)
CDH (Cloudera Distribution including Hadoop)
File System Mount – Fuse-DFS
UI Framework/SDK – Hue
Data Mining – Apache Mahout
Workflow – Apache Oozie
Scheduling – Apache Oozie
Metadata – Apache Hive
Data Integration – Apache Flume, Apache Sqoop
Languages/Compilers – Apache Pig, Apache Hive
Fast Read/Write Access – Apache HBase
Hadoop
Coordination – Apache ZooKeeper
Installation Wizard – SCM Express
11. How to benefit from Big Data?
Choose the right data
Data should be in line with corporate objectives.
Build models that predict and optimize outcomes
Hypothesis-based model building is better.
Transform your company’s capabilities
Data Science is not a replacement for human judgment.
12. Transforming your company
Leadership – Companies succeed because they have leadership teams that set clear goals, define what success looks like and ask the right questions.
Talent Management – Companies need to manage a unique breed of individuals who are scientists but who are comfortable with the language of business.
Technology – The tools to handle the volume, velocity and variety of big data are always a necessary component of a big data strategy.
Decision Making – An effective organization puts information and the relevant decision rights in the same location.
Company Culture – Companies should NOT ask “What do we think?” but should ask “What do we know?”
13. Talent Specifications
Hybrid of data hacker, communicator and trusted adviser
Universal skill: the ability to write code
Can communicate in a language that their stakeholders understand
Can tell a story with data, whether verbally, visually or both
Many of the brightest data scientists hold PhDs in esoteric fields like ecology and systems biology
14. Talent Specifications - 2
Roumeliotis, who holds a PhD in astrophysics and heads the data science team at Intuit in Silicon Valley, begins his search for candidates by:
- asking candidates whether they can develop prototypes in a mainstream programming language such as Java
- seeking a skill set consisting of mathematics, statistics, probability and computer science, along with certain habits of mind (curiosity, inventiveness, discipline, endurance?)
- looking for people with a feel for business issues and empathy for customers
- immersing the candidate in on-the-job training, with an occasional course in a particular technology
15. Talent Specifications - 3
Many of the data scientists working in business today were formally trained in computer science, mathematics, statistics or economics.
They can emerge from any field that has a strong data and computational focus.
Hal Varian, the chief economist at Google, is known to have said, “The sexy job in the next 10 years will be statisticians.”
16. Courses
There are only a few formal courses being offered right now.
Data Science is at the center of Computer Science, Operations Research, Statistics and Business.
[Diagram: Data Science at the intersection of Statistics, Business, Operations Research and Computer Science]
17. Schools offering Data Science
Master of Science in Analytics (MSA)
Institute for Advanced Analytics
North Carolina State University
The class of 2012 has the following job statistics:
- 15 interviews per student
- Average base salary offer with professional experience: $99,600
- $65,000 to $160,000 for candidates with experience
- $60,000 to $100,000 for candidates with no experience
18. Schools offering Data Science -2
Insight Data Science Fellows Program
- a postdoctoral fellowship designed by Jake Klamka (a high-energy physicist by training) that takes scientists from academia and, in six weeks, prepares them to succeed as data scientists
Syracuse University’s School of Information Studies (iSchool)
Rensselaer Polytechnic’s Data Science Research Center
19. Conclusion
Big Data is now a reality with huge profit potential.
Tools and technologies are available as open source.
Each of us can benefit from working with Big Data, whether in its pure (dynamic) form or in its traditional (static) form.
Data Science is the path towards the full utilization of Big Data.
Schools are in the process of offering Data Science programs.
Students can pursue a career in Data Science through these programs.
Statistical interpretation is the everyday work routine of Data Science. (Many commercial implementations exist.)
20. Commercial Implementations
SAP HANA – Metscale
Microsoft Parallel Data Warehouse
Exadata Database Machine (Oracle)
Exalytics In-Memory Machine (Oracle)
Greenplum Data Computing Appliance (EMC)
Netezza Data Warehouse Appliance (IBM)
Vertica Analytics Platform (HP)
SolidDB (IBM)
Terracotta BigMemory (Software AG) …
21. References
1. McKinsey Global Institute Report, 2011
2. A Simple Introduction to Data Science
- Noreen Burlingame and Lars Nielsen, 2012
3. Big Data Now
- Allen Noren, 2011 (O’Reilly Radar Team)
4. What Is Data Science?
- Mike Loukides, 2011 (O’Reilly Media)
5. Big Data: The Management Revolution
- Andrew McAfee and Erik Brynjolfsson (Harvard Business Review, Oct 2012)
6. Data Scientist: The Sexiest Job of the 21st Century
- Thomas H. Davenport and D.J. Patil (Harvard Business Review, Oct 2012)
7. Making Advanced Analytics Work for You
- Dominic Barton and David Court (Harvard Business Review, Oct 2012)
8. Various YouTube materials / Hadoop, Stanford University