On Big Data

EXABYTE
PETABYTE
TERABYTE
GIGABYTE
KILOBYTE
Management Issues
Career Direction
Technical Resources
Speaker:
Dr. Kang Mun Arturo Tan
Assistant Professor
Management Information Systems
Management Sciences Department
Date: Dec 23, 2012
Time: 12:20 – 13:10
Place: Room 75
Yanbu University College

Report Highlights and Units
The McKinsey Global Institute Report
Capturing Big Data Value
What is Big Data?
Various Dimensions of Big Data
Tools and Technologies
Additional Tools and Technologies
How can we benefit from big data?
Transforming the organization
Talent Specifications
Educational Courses / Training
References
Cloudera Distribution including Hadoop
Conclusion

The Units
Multiples of bytes (SI decimal prefixes)
Name (Symbol)      Value
kilobyte (kB)      10^3
megabyte (MB)      10^6
gigabyte (GB)      10^9
terabyte (TB)      10^12
petabyte (PB)      10^15
exabyte (EB)       10^18
zettabyte (ZB)     10^21
yottabyte (YB)     10^24
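As a rough illustration of the unit table, the sketch below converts a raw byte count into the largest fitting SI unit. The function name format_bytes and the list SI_UNITS are hypothetical names chosen for this example; they are not part of the talk.

```python
# SI decimal prefixes from the unit table, smallest to largest.
SI_UNITS = [
    ("kB", 10**3), ("MB", 10**6), ("GB", 10**9), ("TB", 10**12),
    ("PB", 10**15), ("EB", 10**18), ("ZB", 10**21), ("YB", 10**24),
]

def format_bytes(n_bytes):
    """Express n_bytes in the largest SI unit that keeps the value >= 1."""
    label, factor = "B", 1
    for unit, size in SI_UNITS:
        if n_bytes >= size:
            label, factor = unit, size
    return f"{n_bytes / factor:g} {label}"

print(format_bytes(235 * 10**12))   # 235 TB
print(format_bytes(18 * 10**20))    # 1.8 ZB
```

These are decimal (power-of-ten) units, as on the slide; binary units such as the kibibyte (2^10 bytes) would need a different table.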
Talk Highlights
CERN generates 40 TB/sec of data
In 2009, nearly all sectors of the US economy
had at least 200 TB of stored data per company, on average.
One exabyte is roughly 4,000 times the
information stored in the US Library of
Congress.
The US Library of Congress had collected
235 terabytes of data by April 2011.
15 out of 17 sectors in the US have more data
stored per company than the US Library of
Congress.
By 2018, the US will face a shortage of 140,000 to
190,000 deep data analysts and 1.5 million
data-savvy managers.
Training on big data is still in its infancy.
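The figures in these highlights can be cross-checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes decimal units, and it assumes the 40 TB/sec rate is sustained for a full day, which the slide does not state.

```python
TB = 10**12
EB = 10**18

# "One exabyte is about 4,000 times the Library of Congress" implies
# a Library of Congress size of roughly:
loc_estimate_tb = EB / 4000 / TB
print(loc_estimate_tb)      # 250.0, close to the 235 TB quoted

# 40 TB/sec held for one day (86,400 seconds), expressed in exabytes:
cern_per_day_eb = 40 * TB * 86_400 / EB
print(cern_per_day_eb)      # 3.456
```

The two Library of Congress figures agree to within about 6%, so the "4,000 times" comparison is consistent with the 235 TB figure.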

The McKinsey Global Institute (MGI) Report
 In May 2011, MGI issued a report on organizations
being deluged with data.
 This large amount of data is generally referred to as
“Big Data.”
 The use and analysis of Big Data will be a basis of
innovation, competition and productivity.
 Corporations will use “Data Science” to properly
manage and utilize “Big Data.”
 Data Scientists are an elite, specialized class of
highly compensated experts in data cleaning, analysis and
visualization.

Capturing its Value
 $300 Billion/year - Potential annual value to US Health
Care
 €250 Billion/year – Potential annual value to the European
government administration
 $600 Billion – Potential annual consumer surplus from
using personal location data globally
 60% potential increase in retailers’ operating margins
possible with big data
 However, the USA alone needs 140,000 to 190,000 more
people with deep analytical skills, and
 1.5 million more data-savvy managers, to take
full advantage of big data.

What is Big Data?
 Large data sets which are impossible to manage with
conventional database tools.
 Size is relative. What is big today will be small
tomorrow.
 In 2011, our global output of data was estimated at 1.8
zettabytes.
 Big Data consists of:
 Structured, machine-friendly information
 Unstructured, human-friendly information (email,
social media, video, audio, click-streams and images)
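To make the structured/unstructured distinction concrete, here is a minimal sketch that turns one raw click-stream line into a structured record. The log layout, the pattern LOG_PATTERN and the function parse_click are all hypothetical, invented for this illustration; real click-stream formats vary.

```python
import re

# Hypothetical log layout: ip [timestamp] "METHOD path" status
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)" (?P<status>\d{3})'
)

def parse_click(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])  # numeric field for later analysis
    return record

raw = '10.0.0.7 [23/Dec/2012:12:20:05] "GET /products/42" 200'
print(parse_click(raw))
```

Once each line is a record with named fields, it can be loaded into conventional tables; free text, audio and video need heavier processing before they reach that stage.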

Various Dimensions
 Volume – terabytes … petabytes of information
 Variety – extends well beyond structured data: text,
audio, video, click streams, log files, etc.
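 Velocity – big data is frequently time-sensitive and must be
used as it streams into the enterprise in order to
maximize its value. Example, averaging readings as they arrive:
 Static (recompute from the full batch): at 7:00 AM the data are 1, 3 (4/2, average = 2); at 10:00 AM the data are 1, 3, 5 (9/3, average = 3)
 Dynamic (update a running state of new value, sum, count, average): after 1, 3 the state is (_, 4, 2, 2); when 5 arrives it becomes (5, 9, 3, 3)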
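The static versus dynamic averaging example can be sketched in code as follows. The class RunningAverage and the function static_average are illustrative names, not taken from the talk; the point is that the dynamic version keeps only a sum and a count, so it never has to re-read earlier data.

```python
def static_average(values):
    """Batch approach: re-read every stored value each time."""
    return sum(values) / len(values)

class RunningAverage:
    """Streaming approach: keep only the running sum and count."""

    def __init__(self):
        self.total = 0
        self.count = 0

    def add(self, value):
        """Fold one new value into the state and return the new average."""
        self.total += value
        self.count += 1
        return self.total / self.count

# Static: recompute at 7:00 AM and again at 10:00 AM.
print(static_average([1, 3]))      # 2.0
print(static_average([1, 3, 5]))   # 3.0

# Dynamic: update as each value arrives.
stream = RunningAverage()
stream.add(1)
print(stream.add(3))   # 2.0  (sum 4, count 2)
print(stream.add(5))   # 3.0  (sum 9, count 3)
```

Both approaches give the same answers here; the difference is cost. The batch version touches all the data on every query, while the streaming version does constant work per arriving value.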

Tools and Technologies
 Hadoop – a free, Java-based programming
framework that supports the processing of large data sets
in a distributed computing environment.
 Facebook, LinkedIn, Twitter, eBay use Hadoop.
 Hadoop is at the center of this decade’s Big Data
revolution.
 In 2011, five major companies embraced Hadoop:
EMC, IBM, Informatica, Microsoft and Oracle.
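Hadoop's processing model is MapReduce: a map step emits key/value pairs, the framework groups them by key, and a reduce step combines each group. The sketch below imitates that flow in a single Python process to show the idea. It is not Hadoop's API (real Hadoop jobs are typically written in Java and run across many machines), and the function names are hypothetical.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data is big", "data science uses big data"])
print(counts["big"], counts["data"])   # 3 3
```

Because each line is mapped independently and each key is reduced independently, both steps can be spread over many machines, which is what makes the model suitable for very large data sets.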

Additional Technologies
 Cassandra – a scalable multi-master database with no single
points of failure
 Chukwa – data collection system for managing large distributed
systems
 HBase – a scalable, distributed database that supports structured
data storage for large tables
 Hive – a data warehouse infrastructure that provides data
summarization and ad hoc querying
 Mahout – a scalable machine learning and data mining library
 Pig – a high-level data-flow language and execution framework
for parallel computation
 ZooKeeper – a high-performance coordination service for
distributed applications

Commercial Technology (CDH)
CDH (Cloudera Distribution including Hadoop) is built around Hadoop, with these components:
 File System Mount – Fuse-DFS
 UI Framework/SDK – Hue
 Data Mining – Apache Mahout
 Workflow – Apache Oozie
 Scheduling – Apache Oozie
 Metadata – Apache Hive
 Data Integration – Apache Flume, Apache Sqoop
 Languages/Compilers – Apache Pig, Apache Hive
 Fast Read/Write Access – Apache HBase
 Coordination – Apache ZooKeeper
 SCM Express – Installation Wizard

How to benefit from Big Data?
 Choose the right data
 Data should be in line with corporate objectives.
 Build models that predict and optimize outcomes
 Hypothesis-based model building is better.
 Transform your company’s capabilities
 Data Science is not a replacement for human judgment.

Transforming your company
 Leadership – companies succeed because they have leadership
teams that set clear goals, define what success looks like and ask
the right questions.
 Talent Management – companies need to manage a unique breed
of individuals who are scientists but who are comfortable with
the language of business.
 Technology – The tools to handle the volume, velocity and
variety of big data are always a necessary component of big data
strategy.
 Decision Making – An effective organization puts information
and the relevant decision rights in the same location.
 Company Culture – Companies should NOT ask “What do we
think?” but should ask “What do we know?”

Talent Specifications
 Hybrid of data hacker, communicator and trusted
adviser
 Universal skill: ability to write code
 Can communicate in a language that stakeholders
understand
 Can tell a story with data, whether verbally, visually or
both
 Many of the brightest data scientists hold PhDs in
esoteric fields like ecology and systems biology

Talent Specifications - 2
 Roumeliotis, a PhD in Astrophysics and head of the Data Science
team at Intuit in Silicon Valley, begins his search for
candidates by:
 asking candidates whether they can develop prototypes in a
mainstream programming language, like Java.
 seeking a skill set consisting of Mathematics, Statistics,
Probability and Computer Science, plus certain habits of
mind (curiosity, inventiveness, discipline, endurance?).
 looking for people with a feel for business issues and
empathy for customers.
 immersing the candidate in on-the-job training, with an
occasional course in a particular technology.

Talent Specifications - 3
 Many of the data scientists working in business today
were formally trained in computer science,
mathematics, statistics or economics.
 They can emerge from any field that has a strong data
and computational focus.
 Hal Varian, the chief economist at Google, is known to
have said, “The sexy job in the next 10 years
will be statisticians.”

Courses
 There are only a few formal courses being offered right
now.
 Data Science is at the center of:
 Computer Science, Operations Research, Statistics and
Business
[Diagram: Data Science at the intersection of Statistics, Business, Operations Research and Computer Science]

Schools offering Data Science
 Master of Science in Analytics (MSA)
 Institute for Advanced Analytics
 North Carolina State University
 = = =
 The class of 2012 has the following job statistics:
 -15 interviews per student
 - Average base salary offer with professional experience
$99,600
 $65,000 to $160,000 for candidates with experience
 $60,000 to $100,000 for candidates with no experience

Schools offering Data Science -2
 Insight Data Science Fellows Program
 - a postdoctoral fellowship designed by Jake Klamka (a
high-energy physicist by training) that takes scientists
from academia and, in six weeks, prepares them to
succeed as data scientists
 Syracuse University’s School of Information Studies
(iSchool)
 Rensselaer Polytechnic’s Data Science Research Center

Conclusion
 Big Data is now a reality with a huge profit potential.
 Tools and Technologies are available through Open-Source.
 Each of us can benefit from working with Big Data,
either in its pure (dynamic) form or in its traditional
(static) form.
 Data Science is the path towards the full utilization of Big
Data.
 Schools are beginning to offer Data Science
programs.
 Students could pursue a career in Data Science through these programs.
 Statistical interpretation is the everyday work
of Data Science. (Many commercial
implementations exist.)

Commercial Implementations
 SAP HANA – Metscale
 Microsoft Parallel Data Warehouse
 Exadata Database Machine (Oracle)
 Exalytics In-Memory Machine (Oracle)
 Greenplum Data Computing Appliance (EMC)
 Netezza Data Warehouse Appliance (IBM)
 Vertica Analytics Platform (HP)
 solidDB (IBM)
 Terracotta BigMemory (Software AG) …

References
 1. McKinsey Global Institute Report 2011
 2. A Simple Introduction to Data Science
 - Noreen Burlingame and Lars Nielsen – 2012
 3. Big Data Now
 - Allen Noren, 2011 (O’Reilly Radar Team)
 4. What is Data Science
 - Mike Loukides, 2011 (O’Reilly Media)
 5. Big Data: The Management Revolution
 - Andrew McAfee and Erik Brynjolfsson (Harvard Business Review, Oct 2012)
 6. Data Scientist: The Sexiest Job of the 21st Century
 - Thomas H. Davenport and D.J. Patil (Harvard Business Review, Oct 2012)
 7. Making Advanced Analytics Work for You
 - Dominic Barton and David Court (Harvard Business Review, Oct 2012)
 8. Various YouTube Materials / Hadoop - Stanford University
