3. Gensim
Gensim
トピックモデル (pLSA, LDA) や deep learning(word2vec) を簡単に使えるラ
イブラリ [2][3]
公式サイトの tutorial は若干分かりにくいです
使い方は [4] や [1] に詳しい
Figure: Mentioned by the author:)
3 / 11
4. Gensim
Gensim
公式サイトの tutorial は若干分かりにくいです
使い方は [4] や [1] に詳しい
..........
..........
..........
..........
..........
..........
System
and human
system
documents
..........
..........
..........
..........
..........
..........
['system',
'and'
'human']
texts
形態素解析
{'and': 19,
'minors': 37, ...}
dic = corpora.Dictonary()
..........
..........
..........
..........
..........
..........
[(10, 2),
(19, 1),
(3, 1), ...]
corpus
dic.doc2bow()
辞書とtf値を対応付け
dic.save()
dict.dic
MmCorpus
.serialize()
corpus.mm
tf・idf
LSALSA
LDA
HDP
RP
log
entropy
word
2vec
models
model
.save()
lda.model
dic.load() MmCorpus()
model
.load()
similarities
文書の類似性判定
lda.model
topic
extraction
model
.show_topics()
文書のトピック抽出
Figure: Gensim を使った処理の一例
4 / 11
5. Gensim
Step0. documents
元の文書をリスト型で準備
1 # 元の文書
2 documents = [
3 ”Human machine interface for lab abc computer applications”,
4 ”A survey of user opinion of computer system response time”,
5 ”The EPS user interface management system”,
6 ”System and human system engineering testing of EPS”,
7 ”Relation of user perceived response time to error measurement”,
8 ”The generation of random binary unordered trees”,
9 ”The intersection graph of paths in trees”,
10 ”Graph minors IV Widths of trees and well quasi ordering”,
11 ”Graph minors A survey”]
5 / 11
6. Gensim
Step1. 形態素解析
1 def parse(doc):
2 # 日本語なら形態素解析
3 # stopwordを除去する
4 stoplist = set(’for a of the and to in’.split())
5 text = [word for word in doc.lower().split() if word not in stoplist]
6 return text
7
8 texts = [[w for w in parse(doc)] for doc in documents]
9 print texts
10 ’’’ [
11 [’human’, ’machine’, ’interface’, ...],
12 [’a’, ’survey’, ’of’, ’user’, ...],
13 ...] ’’’
6 / 11