3. 3
A Decade of Hadoop History on One Slide
Ten years ago, “Hadoop” referred to a scalable, fault-tolerant
filesystem (HDFS) and programming framework (MapReduce)
for distributed computing.
Today, it refers both to a kernel containing the aforementioned
pieces and to a constantly evolving ecosystem of 25+ data
stores, execution engines, programming and data access
frameworks, and other componentry.
Recognize this guy?
4. 4
Fast Historical Facts
• The code that eventually became Hadoop was written by
Doug Cutting and Mike Cafarella, open source developers
working in the search tech community, as part of the
Nutch project.
• The word “hadoop” originated with Cutting’s young son,
who owned a plush toy elephant to which he had given that name.
• Yahoo! was the first user of Hadoop in large-scale
production, and Cutting did early work on Hadoop there.
• Eventually, Cutting joined Cloudera as its chief architect
and remains there to this day.
7. 7
Timeline (Abridged): The Invention Years
• 2002 (October): Doug Cutting and Mike Cafarella create Nutch, an open source web crawler
• 2003 (October): Google publishes its “Google File System” paper
• 2004 (June): Cutting & Cafarella implement Nutch features that will become HDFS
• 2004 (October): Google publishes its “MapReduce” paper
8. 8
Timeline (Abridged): The Incubation Years
• 2005 (February): Cafarella spearheads an implementation of MapReduce in Nutch
• 2006 (January): Cutting joins Yahoo!; starts Hadoop subproject by carving code from Nutch
• 2006 (March): Yahoo! creates its first Hadoop cluster for R&D
• 2006 (April): First Apache release of Hadoop
• 2006 (November): Google publishes its “Bigtable” paper, which eventually will inspire the creation of HBase
• 2007 (October): First Hadoop User Group meeting (in Palo Alto, CA)
• 2007: Community contributions begin to rise steeply
9. 9
Timeline (Abridged): The Coming-Out Years
• 2008 (January): Hadoop becomes a Top Level ASF project
• 2008 (February): Yahoo! launches the world’s largest Hadoop application
• 2008 (June): Hive, Hadoop’s first SQL framework, becomes a Hadoop sub-project
• 2008 (August): Cloudera, the first company to commercialize Hadoop, is founded
• 2008 (November): Initial Apache release of Pig
• 2009 (June): Initial publication of Hadoop: The Definitive Guide, by Tom White
• 2009 (August): Cutting joins Cloudera as its chief architect
• 2009 (October): Inaugural Hadoop World conference convenes in New York
10. 10
Timeline (Abridged): The Rapid Adoption Years
• 2010–11: The extended Hadoop community busily builds out a plethora of new components (Crunch, Sqoop, Flume, Oozie, etc.) that extend Hadoop use cases and usability
• 2012 (March): HDFS NameNode HA, a significant new feature for enterprise adoption, merges into Hadoop trunk
• 2012 (August): YARN, another important advance for adoption, becomes a Hadoop subproject
• 2012 (October): Impala, the first native MPP query engine for Hadoop data, joins the ecosystem
• 2014 (February): Spark, the emerging default execution engine for Hadoop, becomes a Top Level ASF project
• 2015 (December): Kudu, the first native storage option for Hadoop since HBase, joins the ASF Incubator (as does Impala)
13. 13
Why Did Hadoop Succeed?
1. Open source community and license
A large and diverse community of developers has historically made, and continues to
make, the Hadoop ecosystem among the most active and engaged in history, while
the Apache License lowers the barrier to entry for users.
2. Extensibility/adaptability
With the possible exception of Linux, no other complex platform has evolved on so
many levels, and so quickly, to meet user requirements over time.
3. A strong focus on systems
The roots of Hadoop are in making distributed computing infrastructure more
accessible to application developers. That focus continues to bear fruit in
areas like resource management and security.
14. 14
Hadoop’s Next 10 Years
Interest in public-cloud
deployments is driving
native support for them
into the platform.
Rapid hardware
advances are forcing the
community to re-think
Hadoop’s foundations.
Data sources are more
numerous, distributed,
and diverse (IoT), and
Hadoop will adapt.
15. 15
The Use Case Only Gets Stronger
“Much of the progress we will make in this century
will come from increased understanding of the data
we generate.”
- Doug Cutting
Kick off the presentation by playing the Doug Cutting “Hadoop 10” video.
https://www.youtube.com/watch?v=XHz_R33QnsI
In the beginning, the word “Hadoop” referred to just two components. Fast forward a decade, and that word now refers to that “kernel” (aka Core Hadoop) as well as to a growing ecosystem of related projects. In that sense, Hadoop now has much in common with Linux, which is also both a kernel and an ecosystem.
These are the very high-level historical facts about Hadoop. The timeline to follow contains much more detail.
As a very basic explanation, Hadoop was originally an open source implementation of internal systems built by Google in the early ‘00s to deal with the extraordinarily resource-intensive problem of indexing the Internet every night. Those systems were first described in these papers, and Cutting and Cafarella, who faced similar problems with Nutch, took notice of them quickly. (Later, Google also published its “Bigtable” paper, which led other developers to create HBase.) As Cutting puts it, periodically, “Google sends us messages from the future.”
Cutting & Cafarella’s initial implementation of these systems consisted of just two components: MapReduce and HDFS.
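To give the audience a feel for what the MapReduce half of that pair does, here is a minimal, self-contained sketch of the MapReduce programming model in plain Python — a single-process illustration of the map/shuffle/reduce phases, not the actual Hadoop API, with a hypothetical word-count example as input.

```python
from collections import defaultdict

def map_phase(doc):
    # Map: emit an intermediate (word, 1) pair for every word in a document.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Shuffle: group all intermediate values by key, as the framework
    # would do between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: collapse each key's values into a final result (here, a sum).
    return key, sum(values)

# Hypothetical input "documents" standing in for files stored in HDFS.
docs = ["big data big ideas", "big clusters"]
intermediate = [pair for doc in docs for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts)  # {'big': 3, 'data': 1, 'ideas': 1, 'clusters': 1}
```

In real Hadoop, the map and reduce functions run in parallel across a cluster, reading from and writing to HDFS; the single-machine version above only shows the shape of the computation.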
This timeline is abridged for brevity, but it contains some major milestones.
The rapid expansion of the Hadoop ecosystem is further evidence of its meteoric adoption.
With the expansion of that ecosystem, “Hadoop” has grown much, much bigger than its original “core.”
Even with this history, one has to ask: Why did Hadoop succeed?
What does the future hold for Hadoop? There are many possible permutations, but these are just a couple of the obvious influences going forward.
Regardless of what Hadoop looks like in 10 or 20 years, it’s indisputable that the use cases for it will grow stronger as data volume, variety, and velocity expand. There will be no clearer driver of progress than the ability to translate raw data into actionable insight.