Running Word2Vec with Chinese Wikipedia dump
Similarity
1. If two words have high similarity, they have a strong relationship.
2. Use Wikipedia so the machine gains a general sense of our world.
"魯夫" (Luffy) is the main character of "海賊王" (One Piece)
"東京" (Tokyo) is the capital of "日本" (Japan)
Related Applications
1. voice-driven assistants
(Siri, Google Now, Microsoft Cortana)
2. e-commerce recommendation
(Alibaba, Rakuten)
3. question answering (IBM Watson)
4. others (Flipboard, SmartNews)
Related Applications
Build your own smart AI
My current progress
Download Wikipedia
1. https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2 (a minimal download sketch follows this list)
2. It contains both Traditional Chinese and Simplified Chinese articles.
3. ~1 GB file size, 230,000 articles, 150,000,000 words
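For completeness, a minimal Python 3 download sketch (wget or curl work just as well; streaming in chunks keeps the ~1 GB file out of memory):

import urllib.request

URL = ('https://dumps.wikimedia.org/zhwiki/latest/'
       'zhwiki-latest-pages-articles.xml.bz2')

with urllib.request.urlopen(URL) as resp, \
     open('zhwiki-latest-pages-articles.xml.bz2', 'wb') as out:
    while True:
        chunk = resp.read(1 << 20)  # 1 MB per read
        if not chunk:
            break
        out.write(chunk)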
Preprocessing
1. Use OpenCC to convert Simplified Chinese to Traditional Chinese (a Python wrapper sketch follows the command below).
2. Supports C, C++, Python, PHP, Java, Ruby, and Node.js.
3. Compatible with Linux, Windows, and Mac.
4. “智能手机” -> “智慧手機” ("smartphone"), “信息” -> “資訊” ("information")
5. You can try it online at http://opencc.byvoid.com/
opencc -i zhwiki.txt -o twwiki.txt -c /usr/share/opencc/s2twp.json
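If you would rather drive OpenCC from Python, a thin wrapper around the exact CLI call above is the safest bet (the file paths are the ones used in these slides):

import subprocess

def s2twp(inp='zhwiki.txt', outp='twwiki.txt',
          config='/usr/share/opencc/s2twp.json'):
    # same flags as the command above: -i input, -o output, -c config
    subprocess.check_call(['opencc', '-i', inp, '-o', outp, '-c', config])

if __name__ == '__main__':
    s2twp()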
Preprocessing
1. Use gensim to extract articles from the Wikipedia dump.
2. About 2 GB of memory is required.
Preprocessing
import sys
from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    inp, outp = sys.argv[1:3]
    output = open(outp, 'w')
    # WikiCorpus streams plain-text articles out of the compressed dump
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        output.write(' '.join(text) + '\n')
    output.close()
gensim provides an iterator that extracts sentences from the compressed wiki dump
Segmentation
1. English uses delimiters (whitespace, periods, etc.) to separate words, but not every language follows this practice.
2. "下雨天/留客天/留我/不留" vs. "下雨/天留客/天留/我不留" (the same characters segmented two ways, with opposite meanings)
3. New words keep being coined (such as "小確幸", "small but certain happiness", and "物聯網", "Internet of Things")
Segmentation
Jieba supports full mode and the default accurate mode
#encoding=UTF-8
import jieba

if __name__ == '__main__':
    input_str = u'今天讓我們來測試中文斷詞'
    seg_list = jieba.cut(input_str, cut_all=True)   # full mode
    print(', '.join(seg_list))
    seg_list = jieba.cut(input_str, cut_all=False)  # accurate mode (default)
    print(', '.join(seg_list))
今天, 讓, 我, 們, 來, 測, 試, 中文, 斷, 詞
今天, 讓, 我們, 來, 測試, 中文, 斷詞
Segmentation
sometimes the result is a little funny
#encoding=UTF-8
import jieba

if __name__ == '__main__':
    input_str = u'張無忌來大都找我吧!哈哈哈哈哈哈'
    seg_list = jieba.cut(input_str, cut_all=False)
    print(', '.join(seg_list))
張無忌, 來, 大都, 找, 我, 吧, !, 哈哈哈, 哈哈哈
Segmentation
good dictionary, good result
#encoding=UTF-8
import jieba

if __name__ == '__main__':
    input_str = u'舒潔衛生紙買一送一'
    seg_list = jieba.cut(input_str, cut_all=False)   # default dictionary
    print(', '.join(seg_list))
    jieba.set_dictionary('./data/dict.txt.big')      # switch to the big dictionary
    seg_list = jieba.cut(input_str, cut_all=False)
    print(', '.join(seg_list))
舒潔衛, 生紙, 買, 一送, 一
舒潔, 衛生紙, 買一送一
Segmentation
verb? noun? adjective? adverb?
#encoding=UTF-8
import jieba.posseg as pseg

if __name__ == '__main__':
    input_str = u'今天讓我們來測試中文斷詞'
    seg_list = pseg.cut(input_str)
    for seg, flag in seg_list:
        print(u'{}:{}'.format(seg, flag))
今天:t 讓:v 我們:r 來:v 測試:vn 中文:nz 斷詞:n
Segmentation
keyword extraction
#encoding=UTF-8
import jieba
import jieba.analyse

if __name__ == '__main__':
    input_str = u'我的故鄉在台灣, I am Taiwanese'
    jieba.set_dictionary('./data/dict.txt.big')
    seg_list = jieba.analyse.extract_tags(input_str, topK=3)
    print(', '.join(seg_list))
    jieba.analyse.set_stop_words('./data/stop_words.txt')  # drop stop words like "am"
    seg_list = jieba.analyse.extract_tags(input_str, topK=3)
    print(', '.join(seg_list))
台灣, am, 故鄉
台灣, 故鄉, Taiwanese
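Putting segmentation to work on the corpus: before training, each article line of the extracted wiki text must be segmented and re-joined with whitespace. A minimal sketch, assuming the file names from the earlier slides (twwiki.txt in, a new twwiki_seg.txt out):

#encoding=UTF-8
import io
import jieba

if __name__ == '__main__':
    jieba.set_dictionary('./data/dict.txt.big')  # the big dictionary from above
    with io.open('twwiki.txt', 'r', encoding='utf-8') as inp, \
         io.open('twwiki_seg.txt', 'w', encoding='utf-8') as out:
        for line in inp:
            seg_list = jieba.cut(line.strip(), cut_all=False)
            out.write(u' '.join(seg_list) + u'\n')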
Finding Similarity
1. How do we do that? Word2Vec is the superstar!
Word2Vec
Transforms each word into a vector; the distance between vectors reflects the degree of similarity.
||vector("首爾") - vector("日本")|| > ||vector("東京") - vector("日本")||
vector("東京") - vector("日本") + vector("南韓") ≈ vector("首爾")
(a gensim sketch of this analogy follows)
Word2Vec
In word2vec, a target word is asked to predict its surrounding context.
在日本,[ 青森 的 "蘋果" 又 甜 ]又好吃 ("In Japan, [Aomori's "apples" are sweet] and tasty")
今年,新版的[ Macbook 是 "蘋果" 發表 的 ]重點之一 ("This year, the new [Macbook is one of the highlights "Apple" announced]")
"青森" and "Macbook" have high similarity with "蘋果";
after training on the first window, "青森" and "日本" also end up with high similarity.
Word2Vec
Word2vec uses a skip-gram neural network to predict the neighboring context (a toy sketch of the pair generation follows).
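A toy sketch of how skip-gram slices a window into (target, context) training pairs; this is only the pair-generation idea, not gensim's actual implementation:

#encoding=UTF-8
def skipgram_pairs(words, window=2):
    # every word takes a turn as the target and predicts its neighbors
    pairs = []
    for i, target in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pairs.append((target, words[j]))
    return pairs

if __name__ == '__main__':
    print(skipgram_pairs([u'青森', u'的', u'蘋果', u'又', u'甜']))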
Training a Word2Vec model with gensim
The words are already preprocessed and separated by whitespace.
#encoding=UTF-8
import sys
import multiprocessing
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    inp = sys.argv[1]
    # LineSentence streams one whitespace-separated sentence per line
    model = Word2Vec(LineSentence(inp),
                     size=100,
                     window=10,
                     min_count=10,
                     workers=multiprocessing.cpu_count())
This didn't work for me: gensim's word2vec ran out of memory.
Move to Spark MLlib
1. Spark offers more than 80 operators that make it easy to build parallel applications.
2. Databricks used Spark to break the world record in the 2014 1 TB sort benchmark competition (Daytona GraySort).
3. MLlib is Spark's machine learning library.
Spark cluster overview
1. Spark has a master-slave architecture, similar to YARN.
2. The cluster manager is the master; it handles resource management and slave health monitoring.
3. When you launch an application,
the master assigns a slave to be the driver.
The driver requests resources from the master,
executes the main function, and assigns tasks to the slaves.
Spark cluster deployment
1. Use the Linode API to create and boot new instances rapidly.
2. Use a standalone Spark cluster;
it can also be deployed on a Mesos or YARN cluster.
3. Install Java and Scala, put the pre-built Spark package in place, and finally launch the slave executors!
4. Use Ansible to deploy the Spark executors, and use LZ4 to speed up decompressing the pre-built Spark package.
Training a Word2Vec model with a Spark cluster
An RDD is the basic abstraction in Spark: an immutable, partitioned collection of elements that can be operated on in parallel.
val input: RDD[String] = sc.textFile(inp, 5).cache()
// tokenize is a user-defined segmentation function (e.g. a whitespace split on pre-segmented text)
val token: RDD[Seq[String]] = input.map(article => tokenize(article))
val word2vec = new Word2Vec()
word2vec.setNumPartitions(5)
val model = word2vec.fit(token)
// persist the trained model to HDFS (path elided in the original)
sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs://....")
Querying Word2Vec model by Spark cluster
val model = sc.objectFile[Word2VecModel]("hdfs://....").first()
val synonyms = model.findSynonyms("熱火", 10)
for ((synonym, cosineSim) <- synonyms) {
  println(synonym + ":" + cosineSim)
}
Load the model from HDFS.
Compared with model training, finding similarity has cheap resource requirements.
(A PySpark sketch of the same flow follows below.)
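For reference, the same train-then-query flow is available from Python via pyspark.mllib; a sketch assuming a running cluster and the pre-segmented corpus (the HDFS paths are placeholders):

from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec

sc = SparkContext(appName='zhwiki-word2vec')

# the corpus is pre-segmented, so tokenizing is a plain whitespace split
token = sc.textFile('hdfs:///zhwiki/twwiki_seg.txt') \
          .map(lambda line: line.split(' '))

model = Word2Vec().setVectorSize(100).setMinCount(10) \
                  .setNumPartitions(5).fit(token)

for word, cosine_sim in model.findSynonyms(u'熱火', 10):
    print(u'{}:{}'.format(word, cosine_sim))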
Querying Word2Vec with a Spark cluster
Example of "man"
Example of "luffy" (One Piece's main character)
Example of "cell phone"
Thank you
