Introduction to Data Analysis with Mahout
Raymond@hadoopCon 2014 
1
Raymond http://systw.net 
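Interests
• Anomaly detection
• Data analysis
• Network security
• Software development
Current positions
• Senior Project Engineer, Information Security Systems Department, Acer eDC
• Adjunct Lecturer in Information Security, Kang-Ning Junior College
• PhD student, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology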
2
Acer eDC 
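• SOC (Security Operations Center)
• Helps customers detect security incidents
• Also helps enterprises build their own dedicated SOC platforms
3
Our mission
The whole process from the moment a log is generated until it becomes information
• Collect logs of every type
• Integrate them into the SOC
• Analyze them and find the important information
4
Our goal
Find the threats and tell the customer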
5
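Analysis is one part of the job
For example
• Statistics
• Data Mining
• Machine Learning
6
Which is today's topic
Introduction to data analysis
7
Outline
• Introduction to Mahout
• Quickly setting up the environment
• Hands-on clustering
• Hands-on recommender systems
• Other methods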
8
Analysis tools
• Python scikit-learn 
• R 
• Weka, Matlab 
• Mahout 
9
https://mahout.apache.org/ 
10
Introduction to Apache Mahout, BTI360 (on YouTube)
11
Introduction to Apache Mahout, BTI360 (on YouTube)
12
Mahout 
User 
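(uses Mahout's built-in commands directly)
Developer
(writes programs with the libraries Mahout provides)
13
When can you use Mahout?
• When the data is very large, e.g. Acer eDC
• Just for practice
14
How many nodes should you use to run Mahout?
• Add more when a Mahout job takes too long, e.g. 2 days
…omit
14/01/27 10:01:39 INFO driver.MahoutDriver: Program took 151260185 ms (Minutes: 2521.886416666666667)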
15
Hadoop and Mahout 
Quickly setting up the environment
16
Installing Hadoop
• Etu Virtual Appliance
• Hortonworks
• Fully manual installation
17
Installing Mahout
On Hortonworks
# yum install mahout 
…omit… 
# mahout 
18
Installing Mahout
Download Mahout
• mahout-distribution-0.9.tar.gz (user)
• mahout-distribution-0.9-src.tar.gz (developer)
# tar -zxvf mahout-distribution-xx.tar.gz
# cd mahout-distribution-xx
# bin/mahout
19
Running Mahout
# mahout 
Error: JAVA_HOME is not set. 
20
Running Mahout
# export JAVA_HOME=/usr/jdk64/jdk1.6.0_31
# mahout
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.8.0.2.0.6.1-101-job.jar 
An example program must be given as the first argument. 
Valid program names are: 
...omit... 
21
The environment is set up, quickly
22
Clustering
Hands-on clustering
23
Dataset
ID      Feature 1   Feature 2
ID 1    0           1
ID 2    1           0
ID 3    1           1
ID 4    2           1
ID 5    1           2
ID 6    2           2
ID 7    5           6
ID 8    6           5
ID 9    6           6
ID 10   9           9
24
Visualizing the dataset
25
Visualizing the clustered dataset
26
Clustered dataset
ID      Feature 1   Feature 2
ID 1    0           1
ID 2    1           0
ID 3    1           1
ID 4    2           1
ID 5    1           2
ID 6    2           2
ID 7    5           6
ID 8    6           5
ID 9    6           6
ID 10   9           9
27
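Isn't clustering too easy?
28
Try it
29
Use a clustering algorithm
30
Clustering
Common applications
31
Describing data
It tells you what the data looks like: based on its characteristics, the data can be divided into 8 major groups
32
Anomaly detection
33
So useful
34
One command does it all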
mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job 
35
Photo Demo 
36
Dataset (the same example as before)
IP             Upload(GB)   Download(GB)
172.16.10.1    0            1
172.16.10.2    1            0
172.16.10.3    1            1
172.16.10.4    2            1
172.16.10.5    1            2
172.16.10.6    2            2
172.16.10.7    5            6
172.16.10.8    6            5
172.16.10.9    6            6
172.16.10.10   9            9
37
ETL
# vi clustering.data
0 1 
1 0 
1 1 
2 1 
1 2 
2 2 
5 6 
6 5 
6 6 
9 9 
38
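The ETL step above can also be scripted instead of typed into vi. The following is a minimal Python sketch, assuming the traffic table is available as (IP, upload, download) tuples. The names TRAFFIC and to_clustering_lines are hypothetical, and the output layout (IP column dropped, one space-separated row per IP) is taken from the clustering.data file shown on the slide.

```python
# Hypothetical helper, not part of Mahout: build the rows of clustering.data
# from the per-IP traffic table shown on the dataset slide.
TRAFFIC = [
    ("172.16.10.1", 0, 1), ("172.16.10.2", 1, 0), ("172.16.10.3", 1, 1),
    ("172.16.10.4", 2, 1), ("172.16.10.5", 1, 2), ("172.16.10.6", 2, 2),
    ("172.16.10.7", 5, 6), ("172.16.10.8", 6, 5), ("172.16.10.9", 6, 6),
    ("172.16.10.10", 9, 9),
]


def to_clustering_lines(rows):
    """Drop the IP column and keep one 'upload download' line per row."""
    return [f"{upload} {download}" for _ip, upload, download in rows]


if __name__ == "__main__":
    # Print the lines; redirect them to clustering.data if desired.
    print("\n".join(to_clustering_lines(TRAFFIC)))
```

Because the IP column is dropped, the row order is the only link back to each IP address.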
Put the data into Hadoop
# hadoop fs -mkdir testdata 
# hadoop fs -put clustering.data testdata 
# hadoop fs -ls -R testdata 
-rw-r--r-- 3 root hdfs 288374 2014-02-05 21:53 testdata/clustering.data 
39
Start clustering
# mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job -t1 3 -t2 2 -i testdata -o output
...omit...
14/09/08 01:31:07 INFO clustering.ClusterDumper: Wrote 3 clusters
14/09/08 01:31:07 INFO driver.MahoutDriver: Program took 104405 ms (Minutes: 1.7400833333333334)
40
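…canopy.Job -t1 3 -t2 2 -i testdata -o…
(Figure: the -t1 3 and -t2 2 distances drawn on the data)
Finds 3 clusters
41
…canopy.Job -t1 6 -t2 5 -i testdata -o…
(Figure: the -t1 6 and -t2 5 distances drawn on the data)
Finds 2 clusters
42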
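To see why the two distance thresholds change the number of clusters, here is a minimal Python sketch of canopy clustering on the same ten points. It is a simplified two-pass version (one pass over the points, then one pass over the resulting centroids) with Euclidean distance. Whether this matches Mahout's internal implementation in every detail is an assumption, and the function names are hypothetical. A point closer than t1 to a canopy center joins that canopy; a point closer than t2 to any center does not start a new canopy.

```python
import math

POINTS = [(0, 1), (1, 0), (1, 1), (2, 1), (1, 2), (2, 2),
          (5, 6), (6, 5), (6, 6), (9, 9)]


def canopy_pass(points, t1, t2):
    """One sequential canopy pass; returns the centroid of each canopy."""
    canopies = []  # each canopy keeps its fixed center and its member points
    for point in points:
        strongly_bound = False
        for canopy in canopies:
            dist = math.dist(canopy["center"], point)
            if dist < t1:
                canopy["members"].append(point)  # loosely inside this canopy
            if dist < t2:
                strongly_bound = True  # too close to seed a new canopy
        if not strongly_bound:
            canopies.append({"center": point, "members": [point]})
    return [
        tuple(sum(axis) / len(c["members"]) for axis in zip(*c["members"]))
        for c in canopies
    ]


def canopy_cluster_count(points, t1, t2):
    """Cluster the points, then cluster the resulting centroids once more."""
    centroids = canopy_pass(points, t1, t2)
    return len(canopy_pass(centroids, t1, t2))


if __name__ == "__main__":
    print(canopy_cluster_count(POINTS, 3, 2))
    print(canopy_cluster_count(POINTS, 6, 5))
```

In this sketch the larger thresholds let the isolated point (9, 9) merge with the group around (6, 6), which is consistent with the drop from 3 clusters to 2 on the slides.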
Finding the final directory
# hadoop fs -ls -R output
...omit... 
drwxr-xr-x - root hdfs 0 2014-02-07 14:48 output/clusteredPoints 
drwxr-xr-x - root hdfs 0 2014-02-07 14:48 output/clusters-0 
drwxr-xr-x - root hdfs 0 2014-02-07 14:48 output/clusters-1-final 
drwxr-xr-x - root hdfs 0 2014-02-07 14:48 output/data 
drwxr-xr-x - root hdfs 0 2014-02-07 14:48 output/random-seeds 
...omit... 
43
Output
# mahout clusterdump --input output/clusters-1-final --pointsDir output/clusteredPoints
C-0{n=1 c=[9.000, 9.000] r=[]} 
Weight : [props - optional]: Point: 
1.0: [9.000, 9.000] 
C-1{n=2 c=[5.833, 5.583] r=[0.167, 0.083]} 
Weight : [props - optional]: Point: 
1.0: [5.000, 6.000] 
1.0: [6.000, 5.000] 
1.0: [6.000, 6.000] 
C-2{n=4 c=[1.313, 1.333] r=[0.345, 0.527]} 
Weight : [props - optional]: Point: 
1.0: [1:1.000] 
1.0: [0:1.000] 
1.0: [1.000, 1.000] 
1.0: [2.000, 1.000] 
1.0: [1.000, 2.000] 
1.0: [2.000, 2.000] 
44
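Two of the points in the dump above are printed as [1:1.000] and [0:1.000] rather than as coordinate pairs. Reading these as sparse index:value notation gives (0, 1) and (1, 0), the first two rows of clustering.data. The following Python sketch parses both notations; the function name parse_point and the fixed dimension of 2 are illustrative assumptions, not part of Mahout.

```python
def parse_point(text, dims=2):
    """Parse '[9.000, 9.000]' (dense) or '[1:1.000]' (sparse index:value)."""
    values = [0.0] * dims  # entries absent from the sparse form stay 0
    body = text.strip().strip("[]")
    for position, item in enumerate(body.split(",")):
        item = item.strip()
        if ":" in item:
            index, value = item.split(":")
            values[int(index)] = float(value)
        else:
            values[position] = float(item)
    return tuple(values)


if __name__ == "__main__":
    for raw in ["[9.000, 9.000]", "[1:1.000]", "[0:1.000]"]:
        print(raw, "->", parse_point(raw))
```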
Visualizing the results
org.apache.mahout.clustering.display 
45
Visualizing IP clustering
http://flowdm.openfoundry.org
46
Visualizing IP clustering
http://flowdm.openfoundry.org
47
Congratulations, everyone
You have gained the skill "clustering by mahout"
Your clustering skill level is now 1
48
Next step
Take on real data
IP             Upload(GB)   Download(GB)   Flow    Dtdport
172.16.10.1    53           2.2            2321k   5
172.16.10.2    10           2.1            251k    6
172.16.10.3    3.2          3.5            981k    4
172.16.10.4    1.3          5.3            12k     2
172.16.10.5    1.6          2.1            142k    4
172.16.10.6    2.8          2.6            652k    3
172.16.10.7    5.2          6.2            9k      2
172.16.10.8    6.1          1.5            65k     9
172.16.10.9    5.2          6.9            86k     3
172.16.10.10   1.8          1.9            1241k   7
49
Next step
Try different clustering methods
• kmeans clustering 
• fuzzykmeans clustering 
• dirichlet clustering 
• meanshift clustering 
50
Recommendation 
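Hands-on recommender systems
51
Recommender systems are all around you
52
Common applications
53
Online shopping
54
YouTube
55
Types of recommender systems
• content-based
• collaborative filtering
• hybrid
56
How collaborative filtering works
Two users have not rated book-c yet.
Can you guess how these two users would rate it?
Users' ratings of books (- = not rated yet)
         book-a  book-b  book-c
User 1   5       4       5
User 2   4       5       4
User 3   5       4       -
User 4   1       2       -
User 5   2       1       1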
57
How collaborative filtering works
Users' ratings of books (- = not rated yet)
         book-a  book-b  book-c
User 1   5       4       5
User 2   4       5       4
User 3   5       4       -
User 4   1       2       -
User 5   2       1       1
58
How collaborative filtering works
Users' ratings of books
         book-a  book-b  book-c
User 1   5       4       5
User 2   4       5       4
User 3   5       4       4~5
User 4   1       2       1~2
User 5   2       1       1
59
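The guess on the last slide follows a simple intuition: a user will probably rate book-c the way the most similar user did. The following Python sketch makes that intuition explicit with a nearest-neighbour lookup. It illustrates the idea only and is not the method Mahout uses; the names RATINGS and guess_rating are hypothetical.

```python
import math

RATINGS = {
    "User 1": {"book-a": 5, "book-b": 4, "book-c": 5},
    "User 2": {"book-a": 4, "book-b": 5, "book-c": 4},
    "User 3": {"book-a": 5, "book-b": 4},
    "User 4": {"book-a": 1, "book-b": 2},
    "User 5": {"book-a": 2, "book-b": 1, "book-c": 1},
}


def guess_rating(ratings, user, book):
    """Borrow the rating of the most similar user who has rated `book`."""

    def distance(other):
        # Euclidean distance over the books both users have rated.
        shared = [b for b in ratings[user] if b in ratings[other]]
        return math.sqrt(
            sum((ratings[user][b] - ratings[other][b]) ** 2 for b in shared)
        )

    candidates = [u for u in ratings if u != user and book in ratings[u]]
    nearest = min(candidates, key=distance)
    return nearest, ratings[nearest][book]


if __name__ == "__main__":
    print(guess_rating(RATINGS, "User 3", "book-c"))
    print(guess_rating(RATINGS, "User 4", "book-c"))
```

User 3 has exactly the same ratings as User 1 on book-a and book-b, and User 4 is closest to User 5, so the borrowed ratings fall inside the 4~5 and 1~2 ranges guessed on the slide.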
Recommender systems are too easy
60
What about this?
Users' ratings of books
book1 book2 book3 book4 book5 book6 book7 book8 book9
User1 3 2 1 5 5 1 3 1
User2 2 3 1 3 5 4 3
User3 1 2 3 3 2 1
User4 2 1 2 1 1 2
User5 3 3 1 3 2 2 3 3 2
User6 1 3 2 2 1
User7 4 4 1 5 1 3 3 4
61
One command does it all
mahout recommenditembased 
62
Photo Demo 
63
Dataset (the same example as before)
Users' ratings of books (- = not rated yet)
         book-a  book-b  book-c
User 1   5       4       5
User 2   4       5       4
User 3   5       4       -
User 4   1       2       -
User 5   2       1       1
64
ETL
# vi recom.data
1,1,5 
1,2,4 
1,3,5 
2,1,4 
2,2,5 
2,3,4 
3,1,5 
3,2,4 
4,1,1 
4,2,2 
5,1,2 
5,2,1 
5,3,1 
65
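The recom.data file above lists one rating per line as userID,itemID,rating, with book-a, book-b and book-c numbered 1 to 3 and unrated cells left out. The following Python sketch generates those lines from the rating table. The names TABLE and to_preference_lines are hypothetical, and representing a missing rating as None is an assumption of this sketch.

```python
# Rows are User 1..5, columns are book-a, book-b, book-c; None = not rated.
TABLE = [
    [5, 4, 5],
    [4, 5, 4],
    [5, 4, None],
    [1, 2, None],
    [2, 1, 1],
]


def to_preference_lines(table):
    """Emit 'userID,itemID,rating' for every rated cell, IDs starting at 1."""
    lines = []
    for user_id, row in enumerate(table, start=1):
        for item_id, rating in enumerate(row, start=1):
            if rating is not None:
                lines.append(f"{user_id},{item_id},{rating}")
    return lines


if __name__ == "__main__":
    print("\n".join(to_preference_lines(TABLE)))
```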
Put the data into Hadoop
# hadoop fs -mkdir testdata 
# hadoop fs -put recom.data testdata 
# hadoop fs -ls -R testdata 
-rw-r--r-- 3 root hdfs 288374 2014-02-05 21:53 testdata/recom.data 
66
Start the recommendation
# mahout recommenditembased -s SIMILARITY_EUCLIDEAN_DISTANCE -i testdata -o output
...omit...
File Input Format Counters
Bytes Read=287
File Output Format Counters
Bytes Written=32
14/09/04 05:46:56 INFO driver.MahoutDriver: Program took 434965 ms (Minutes: 7.249416666666667)
67
Show the recommendation results
# hadoop fs -cat output/part-r-00000 
3 [3:4.4787264] 
4 [3:1.5212735] 
68
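Each output line reads as userID [itemID:predicted rating], so user 3 is predicted to rate item 3 (book-c) at about 4.48 and user 4 at about 1.52. The following Python sketch shows one way such numbers can arise from item-based collaborative filtering. It makes two assumptions that are not stated on the slides: the similarity between two items is 1 / (1 + Euclidean distance) over their full rating vectors with missing ratings counted as 0, and the prediction is the similarity-weighted average of the user's own ratings. All function names are hypothetical.

```python
import math

# (userID, itemID, rating) triples, the same content as recom.data.
PREFS = [
    (1, 1, 5), (1, 2, 4), (1, 3, 5), (2, 1, 4), (2, 2, 5), (2, 3, 4),
    (3, 1, 5), (3, 2, 4), (4, 1, 1), (4, 2, 2), (5, 1, 2), (5, 2, 1),
    (5, 3, 1),
]


def item_vectors(prefs):
    """Group the ratings by item: {itemID: {userID: rating}}."""
    vectors = {}
    for user, item, rating in prefs:
        vectors.setdefault(item, {})[user] = float(rating)
    return vectors


def similarity(a, b):
    """Assumed formula: 1 / (1 + Euclidean distance), missing ratings as 0."""
    users = set(a) | set(b)
    dist = math.sqrt(sum((a.get(u, 0.0) - b.get(u, 0.0)) ** 2 for u in users))
    return 1.0 / (1.0 + dist)


def predict(prefs, user, item):
    """Similarity-weighted average of the user's ratings on the other items."""
    vectors = item_vectors(prefs)
    weighted_sum = weight_total = 0.0
    for other, ratings in vectors.items():
        if other == item or user not in ratings:
            continue
        weight = similarity(vectors[item], ratings)
        weighted_sum += weight * ratings[user]
        weight_total += weight
    return weighted_sum / weight_total


if __name__ == "__main__":
    print(predict(PREFS, 3, 3))
    print(predict(PREFS, 4, 3))
```

Under these assumptions the sketch reproduces the two values in the output above to within rounding, which suggests, but does not prove, that the command computes something similar.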
Manual intelligence = artificial intelligence
Users' ratings of books
         book-a  book-b  book-c
User 1   5       4       5
User 2   4       5       4
User 3   5       4       4~5
User 4   1       2       1~2
User 5   2       1       1
# hadoop fs -cat output/part-r-00000
3 [3:4.4787264]
4 [3:1.5212735]
69
Should we recommend book-c to User4?
Users' ratings of books
         book-a  book-b  book-c
User 1   5       4       5
User 2   4       5       4
User 3   5       4       4~5
User 4   1       2       1~2
User 5   2       1       1
We predict that User4 will not like book-c much,
so I would not recommend book-c to User4
70
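Should we recommend book-c to User3?
Users' ratings of books
         book-a  book-b  book-c
User 1   5       4       5
User 2   4       5       4
User 3   5       4       4~5
User 4   1       2       1~2
User 5   2       1       1
We predict that User3 will like book-c,
so I would recommend book-c to User3
71
Congratulations again, everyone
You have gained the skill "recommendation by mahout"
Your recommendation skill level is now 1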
72
Next step
Try it !!!
Users' ratings of books
book1 book2 book3 book4 book5 book6 book7 book8 book9
User1 3 2 1 5 5 1 3 1
User2 2 3 1 3 5 4 3
User3 1 2 3 3 2 1
User4 2 1 2 1 1 2
User5 3 3 1 3 2 2 3 3 2
User6 1 3 2 2 1
User7 4 4 1 5 1 3 3 4
73
Other 
• Frequent pattern analysis 
https://systw.net/note/af/sblog/more.php?id=265 
• Mahout fpgrowth 
https://systw.net/note/af/sblog/more.php?id=292 
74
Other 
• Classification and Prediction
http://systw.net/note/af/sblog/more.php?id=262 
• Mahout logistic 
http://systw.net/note/af/sblog/more.php?id=293 
75
Summary
• The environment was set up quickly
• Clustering with one command
• Recommendation with one command
• Other methods
76
Thanks
77

Editor's Notes

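  1. The word "introduction" in the title means this topic is extremely easy. My goal is that, after this talk, people who do not know Mahout can start using it right away. As for people who already know it, will they get any better? Actually, no, because today's talk is an introduction, and an introduction is still an introduction no matter how many times you hear it. So there is still time to switch to another session now.
  2. We have many years of accumulated data.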
  3. # mahout Error: JAVA_HOME is not set. # export JAVA_HOME=/usr/jdk64/jdk1.6.0_31 # mahout MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath. Running on hadoop, using /usr/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.8.0.2.0.6.1-101-job.jar An example program must be given as the first argument. Valid program names are: ...omit...
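  4. Suppose we have a dataset like this: 10 IDs, each with 2 feature values. Feel free to imagine that these are the heights and weights of 10 people, the traffic and connection counts of IPs, or anything else.
  5. If we plot the table we just saw on a chart, it looks roughly like this.
  6. If we apply a clustering algorithm to the attributes of each ID, we get a result like this.
  7. This one is easy to see with the naked eye, and the groups are clearly separated. You could split them yourself, so why use a clustering algorithm? Some people will surely think: this is so simple, I will just group the points by hand, why leave it to an algorithm? Even a child could split this chart, like this and this.
  8. But what about this one? How would you split it?
  9. Don't worry. This is where an algorithm comes in handy: it splits the data for you in no time.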
  10. 0 1 1 0 1 1 2 1 1 2 2 2 5 6 6 5 6 6 9 9
  11. 1,1,5 1,2,4 1,3,5 2,1,4 2,2,5 2,3,4 3,1,5 3,2,4 4,1,1 4,2,2 5,1,2 5,2,1 5,3,1