Thank you for coming to Cloudera World Tokyo 2015.
This session, titled "A Thorough Introduction to Spark," introduces what Spark is and how it works.
I should introduce myself: I'm Kawasaki from Cloudera.
Some of you may have met me before; I deliver technical training and do evangelism for Hadoop and related technologies.
I've been running technical training in Japan for a little over four years, and I've probably delivered more Hadoop and HBase training than anyone else in Japan.
I also teach Spark training.
These are typical examples of the type of things you’d like to be able to do
That said, this diagram is a bit misleading. When Spark performs the reduce step...
Historically, a Hadoop node had 64 GB of RAM for 24 TB of disk. Now, a Hadoop node tends to have 256 GB of RAM for the same 24 TB of disk. We expect this to quadruple again in the next 3 years, to 1 TB of RAM for 24 TB of disk. That is a HUGE increase in the RAM-to-disk ratio: roughly a 16x improvement overall.
Thus, engineering systems to work with lots of RAM is important. Spark is one such example, but so is Impala. In fact, we are focusing on making sure all of our systems are capable of leveraging RAM effectively.
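The ratio arithmetic above can be sanity-checked quickly (a sketch; the generation labels are mine, the GB figures come from the notes):

```python
# RAM per node across hardware generations, each paired with the same 24 TB of disk.
ram_gb = {"historical": 64, "current": 256, "projected": 1024}

# Improvement in the RAM-to-disk ratio relative to the historical baseline.
improvement = {gen: gb / ram_gb["historical"] for gen, gb in ram_gb.items()}

print(improvement["current"])    # 4x so far
print(improvement["projected"])  # 16x overall
```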
Caching Data is a Big Win
Avoid re-reading data
Most of Spark’s speed advantage
MapReduce: 110 seconds every run
Spark: first run: 80 seconds
Subsequent runs hit the cache and take about 1 second
Over 30 runs, that's roughly 3,300 seconds for MapReduce vs. about 110 seconds for Spark
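Using the per-run timings above, the 30-run totals work out as follows (a quick check using the slide's illustrative numbers):

```python
runs = 30

# MapReduce re-reads the data from disk on every run.
mapreduce_total = runs * 110  # seconds

# Spark pays the full cost once, then serves subsequent runs from cache.
spark_total = 80 + (runs - 1) * 1  # seconds

print(mapreduce_total)  # 3300
print(spark_total)      # 109
```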
Interactive exploration of data for data scientists – no need to develop “applications”
Developers can prototype applications on the live system as they build them
5. “count” is an action (note the red triangle!), meaning that it actually processes the data in the RDD and returns a value to the calling program (the driver). At the point that the operation is performed on the last RDD, Spark will recursively process each RDD and its parent.
We’ll talk more later about this concept of “lineage”. The important point here is that nothing actually *happens* until that final call.
Also worth mentioning: this data is not cached by default! If you call count again, it will re-execute all three transformations. Caching is covered later, but it is worth pointing out that the picture is misleading: it implies that the RDDs stay populated after the call, which isn’t the case.
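The laziness described above can be illustrated without a cluster using plain Python generators (a sketch of the concept only, not Spark's API): the "transformations" just build a pipeline, and nothing runs until a terminal "action" consumes it.

```python
log = []

def read_lines(lines):
    # Stand-in for reading input data: yields lazily, logging each read.
    for line in lines:
        log.append("read")
        yield line

def filter_errors(lines):
    # Stand-in for a transformation: also lazy.
    for line in lines:
        if "ERROR" in line:
            yield line

data = ["INFO ok", "ERROR disk full", "ERROR timeout"]

# Build the pipeline: like Spark transformations, nothing has executed yet.
pipeline = filter_errors(read_lines(data))
assert log == []  # no data touched so far

# The "action": consuming the pipeline triggers all the upstream work.
count = sum(1 for _ in pipeline)
print(count)     # 2
print(len(log))  # 3 -- every input line was read during the action
```

Note also that, as with an uncached RDD, re-running the "action" here would mean rebuilding and re-consuming the whole pipeline from scratch.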
What is it?
Run continuous processing of data using Spark’s core API
Extends Spark RDDs to DStreams (Discretized Streams)
Fault-tolerant, transformable streams
Re-use existing code for batch/offline processing
Adds “rolling window” operations
Example: Compute rolling averages or counts for data over last five minutes
Common Use Cases:
“On-the-fly” ETL as data is ingested into Hadoop/HDFS
Detect anomalous behavior and trigger alerts
Continuous reporting of summary metrics for incoming data
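The "rolling window" idea above can be sketched in plain Python (the windowing concept only, not the DStream API; in Spark Streaming you would use window operations such as window or reduceByKeyAndWindow):

```python
from collections import deque

class RollingAverage:
    """Keep a rolling average over the last `window` items of a stream."""
    def __init__(self, window):
        self.window = window
        self.items = deque()

    def add(self, value):
        self.items.append(value)
        if len(self.items) > self.window:
            self.items.popleft()  # drop data that fell out of the window
        return sum(self.items) / len(self.items)

# Example: a rolling average over the last 3 batch counts.
avg = RollingAverage(window=3)
results = [avg.add(v) for v in [10, 20, 30, 40]]
print(results)  # [10.0, 15.0, 20.0, 30.0]
```

In Spark Streaming the same idea is applied per micro-batch, with the window length and slide interval expressed in time rather than item counts.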
Link to account record in SFDC (valid for Cloudera employees only): https://na6.salesforce.com/0018000000tdY3k?srPos=0&srKp=001
Cerner saves lives with Big Data analytics that predict patient conditions.
Background: Cerner, a longtime leader in the healthcare IT space, is in the midst of an evolution. Today, its solutions and services are utilized in over 14,000 medical facilities around the world, such as hospitals, integrated delivery networks, ambulatory offices, and physicians’ offices. But Cerner’s goal is to deliver more than software and solutions. The company is expanding its historical focus on electronic medical records (EMR) to help improve health and care across the board. Cerner aims to bring together the world's healthcare data in order to reduce costs and increase efficiency of delivering healthcare, while improving patient outcomes.
Challenge: When the Cerner team set out to build this unified healthcare data platform, they established a few key objectives. It would need to be:
Capable of bringing together all of the world’s health data
Secure, traceable, and audited
Catalogued and explorable
Usable for any need at any time
Most tools available at the time weren't really a great fit for the magnitude or complexity of the global healthcare data challenge Cerner was trying to address. They started out just building search indexes for medical records, but even that required huge amounts of computational power. Cerner needed an achievable way to throw significant CPU power at a very large dataset without compromising agility. They needed to be able to iterate quickly on search processing algorithms.
Solution: Cerner has built an enterprise data hub on Cloudera that contains more than 2 PB of data in a multi-tenant environment, supporting several hundred clients. Cloudera Manager provides a holistic view of multiple clusters. The platform ingests different EMRs, HL7 feeds, Health Information Exchange information, claims data, and custom extracts from a variety of proprietary or client-owned data sources, uses Apache Kafka to ingest real-time data streams, and then pushes data to the appropriate HBase or HDFS cluster using Apache Storm.
Data moves from Cloudera to Cerner’s HP Vertica data marts via bulk loads, giving data scientists, SAP Business Objects users, and SAS users the ability to interact with Hadoop data for broad reporting and analysis using tools they’re familiar with. This helps them understand the most significant risks and opportunities for improvement across a population of people. Cerner is starting to use SAS on Hadoop for deep data science. In the healthcare space especially, a technical solution must provide a mechanism for threat mitigation in order to be considered a viable data management technology. Cloudera advised Cerner’s approach to encrypting data at rest and on its Kerberos integration, and Cerner is actively evaluating tools like Apache Sentry to complement what the team has already built.
Results: What’s unique about Cerner’s EDH is that it brings together data from an almost unlimited number of sources, and that data can be used to build a far more complete picture of any patient, condition, or trend. For example, using SAS on Hadoop is giving Cerner the ability to predict the probability of a discharged patient being re-admitted for the same or a similar condition. Using the same strategy, Cerner can accurately determine the probability that a person has a bloodstream infection. Its clients are reporting that the new system has actually saved hundreds of lives by being able to predict if a patient is septic more effectively than they could before.
Cloudera Spark Committers: 4
Intel (our close partner): 1
Hortonworks: 1
IBM: 0
MapR: 0
8 full-time engineers working on Spark
Contributed over 370 patches and 43,000 lines of code to Spark
Compare to Hortonworks (4), IBM (12), and MapR (1)