Globally Scalable Web Document Classification
Using Word2Vec
Kohei Nakaji (SmartNews)
keyword: machine learning for discovery
SmartNews Demo
About SmartNews
Japan: launched 2013; 4M+ Monthly Active Users; 50% DAU/MAU; 100+ publishers; 2013 App of The Year
US: launched Oct 2014; 1M+ Monthly Active Users; same engagement; 80+ publishers; Top News Category App
International: launched Feb 2015; 10M downloads worldwide; same engagement; English beta; Featured App
Funding: $50M
Outline of our algorithm
Signals on the Internet → URLs Found (10 million/day) → Structure Analysis → Semantics Analysis → Importance Estimation → Diversification (→ 1000+/day)
Web Document Classification is one component (⊂) of this pipeline.
Web Document Classification

ENTERTAINMENT, SPORTS, TECHNOLOGY, LIFESTYLE, SCIENCE, WORLD, …

Task definition: when an arbitrary web document arrives, choose one category exclusively from a pre-determined category set.
Web Document Classification

There are roughly two steps:
① Main Content Extraction
② Text Classification
(example output: ENTERTAINMENT)
Main Content Extraction

Two approaches:
・Extract after rendering the whole page: easier, but takes time.
・Extract from HTML: difficult, but fast.
Our Approach: extract from HTML.
Main Content Extraction from HTML

Example (the two <p> blocks are the main content; the surrounding links are not):

<html>
<body>
<div>click <a>here</a> for </div>
<div>
<a>tweet</a><a>share</a>
<p>
Robert Bates was a volunteer deputy who'd never led an arrest for the Tulsa County Sheriff's Office.
</p>
<a>you also like this</a>
<p>
So how did the 73-year-old insurance company CEO end up joining a sting operation this month that ended when he pulled out his handgun and killed suspect Eric Harris instead of stunning him with a Taser?</p>
</div>
</body>
</html>
Main Content Extraction from HTML

A rule-based extraction algorithm is possible. English:

Rule 1: a div whose text length > 200 and number of 'a' tags < 3 is main content.
Rule 2: a div whose text length < 100 and number of 'p' tags > 4 is main content.
…
Rule N: …
But this is not scalable: Japanese (and every other language) would need its own growing set of rules.
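For illustration, rules like the two above can be sketched in a few lines of Python. The block representation (a dict of tag counts per div) and the exact thresholds are hypothetical, taken from the example rules:

```python
# Hypothetical sketch of rule-based main-content extraction.
# Each "block" stands for one parsed <div> with its text length and tag counts.

def is_main_content(block):
    # Rule 1: long text with few links is main content
    if block["text_len"] > 200 and block["num_a"] < 3:
        return True
    # Rule 2: short text but many paragraphs is main content
    if block["text_len"] < 100 and block["num_p"] > 4:
        return True
    return False

blocks = [
    {"text_len": 15, "num_a": 1, "num_p": 0},   # "click here for" navigation
    {"text_len": 320, "num_a": 2, "num_p": 2},  # article body
]
print([is_main_content(b) for b in blocks])  # → [False, True]
```

Each new site layout or language tends to need another such rule, which is exactly the scalability problem the slide points out.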
Main Content Extraction from HTML

We are using a machine learning approach; see Christian Kohlschütter et al. (http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf)

① training: block separation & feature extraction → block1: (features, main), block2: (features, not main), block3: (features, main), … → decision tree
② live data: block separation & feature extraction → block1: (features), block2: (features), block3: (features), … → decision tree
Feature Extraction from HTML

Step1: Separate the HTML (e.g. the example above) into 'text blocks'.
Step2:
Extract local features for every text block
ex: word count = 36, num of <a> = 0
Step3:
Define the feature of each text block as a combination of local features, e.g.:
word count (current block): 36
num of <a> (current block): 0
word count (previous block): 4
num of <a> (previous block): 1
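Steps 1 to 3 can be sketched with Python's stdlib html.parser. This is a simplified stand-in for the block separation described above: the boundary rule (new block at each <div>/<p>) and the feature names are illustrative, not the production code:

```python
from html.parser import HTMLParser

# Sketch: split HTML into text blocks at <div>/<p> boundaries (Step 1)
# and collect local features per block (Step 2).

class BlockFeatureExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.blocks = []
        self._words = 0
        self._links = 0

    def _flush(self):
        if self._words:
            self.blocks.append({"word_count": self._words, "num_a": self._links})
        self._words, self._links = 0, 0

    def handle_starttag(self, tag, attrs):
        if tag in ("div", "p"):
            self._flush()          # a new text block starts here
        elif tag == "a":
            self._links += 1

    def handle_data(self, data):
        self._words += len(data.split())

    def close(self):
        super().close()
        self._flush()              # emit the last block

parser = BlockFeatureExtractor()
parser.feed("<div>click <a>here</a> for</div><p>Robert Bates was a volunteer deputy.</p>")
parser.close()
print(parser.blocks)

# Step 3: combine current-block and previous-block local features.
combined = []
prev = {"word_count": 0, "num_a": 0}
for b in parser.blocks:
    combined.append({**{f"cur_{k}": v for k, v in b.items()},
                     **{f"prev_{k}": v for k, v in prev.items()}})
    prev = b
print(combined)
```

The combined feature dicts are what the decision tree in the next slide consumes, one per block.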
Making Main Content Using Decision Tree

block1: (features) → not main
block2: (features) → not main
block3: (features) → main
block4: (features) → not main
block5: (features) → main
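A minimal sketch of this classification step with scikit-learn's decision tree; the feature values (word count, num <a>, previous block's word count and num <a>) and labels are made up for illustration, not from the real training set:

```python
from sklearn.tree import DecisionTreeClassifier

# Train a decision tree on labelled block features, then label a new block.
X_train = [
    [3, 2, 0, 0],    # short, link-heavy block          -> not main
    [250, 1, 3, 2],  # long block after a short one     -> main
    [4, 1, 250, 1],  # short block after a long one     -> not main
    [300, 0, 4, 1],  # long block with no links         -> main
]
y_train = ["not_main", "main", "not_main", "main"]

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(clf.predict([[280, 2, 5, 1]]))  # → ['main']
```

On this toy data the tree separates blocks on the word-count feature alone; the real feature set (per Kohlschütter et al.) is richer, but the wiring is the same.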
Text Classification

Ordinary text classification architecture:

① training: (features, entertainment), (features, sports), (features, entertainment), (features, politics), … → training algorithm → classifier
② live data: feature extraction → (features) → classifier → entertainment? sports? …
Feature Extraction in Text Classification

Example: "Will LeBron James deliver an NBA championship to Cleveland?"

'Bag-of-words' is commonly used as a feature vector, with some feature engineering:
・tokens: Will / LeBron / James / deliver / an / NBA / championship / to / Cleveland
・stop words (Will, an, to) are removed
・a sports-players dictionary maps "LeBron James" to NBA_PLAYER
・remaining terms are weighted by tf-idf
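The feature engineering above can be sketched in plain Python (the stop-word list and player dictionary here are tiny illustrative stand-ins, and idf weighting is omitted for brevity):

```python
from collections import Counter

# Sketch: tokenize, substitute dictionary entities, drop stop words,
# count terms (term frequency; the idf step is omitted).

STOP_WORDS = {"will", "an", "to", "a", "the"}
PLAYER_DICT = {"lebron james": "NBA_PLAYER"}

def bag_of_words(text):
    text = text.lower().rstrip("?.!")
    for name, tag in PLAYER_DICT.items():
        text = text.replace(name, tag.lower())
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return Counter(tokens)

print(bag_of_words("Will LeBron James deliver an NBA championship to Cleveland?"))
```

Every language needs its own stop-word list and entity dictionaries, which is the feature-engineering cost that paragraph vectors later remove.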
Feature Extraction in Text Classification

The same approach is used in Japanese.

Example: 私は中路です。よろしくお願いします。("I am Nakaji. Nice to meet you.")
・tokens: 私 / は / 中路 / です / よろしく / お願い / し / ます
・stop words are removed
・a person dictionary maps 中路 to PERSON
・remaining terms are weighted by tf-idf
Another Option: Paragraph Vector

Example:
私は中路です。よろしくお願いします。 → [0.2, 0.3, …, 0.2]
Will LeBron James deliver an NBA championship to Cleveland? → [0.1, 0.4, …, 0.1]

Paragraph Vector (dimension: several hundred)
Outline of Distributed Representation

・word2vec: every word is mapped to a unique word vector. (https://code.google.com/p/word2vec/)
・paragraph vector: every document is mapped to a unique vector. (Quoc V. Le, Tomas Mikolov, http://arxiv.org/abs/1405.4053)
Word Vector in the word2vec Model

Every word is mapped to a unique word vector with good properties, e.g.:
vGermany = [0.1, 0.2, …, 0.2]
vBerlin = [0.1, 0.1, …, -0.1]
vFrance = [0.3, 0.4, …, 0]
vParis = [0.3, 0.3, …, 0.3]

such that "Germany - Berlin = France - Paris", i.e. vGermany - vBerlin ≈ vFrance - vParis.
Procedure to Create Word Vectors
Mikolov et al. (http://arxiv.org/pdf/1301.3781.pdf)

Training corpus (example): "A cat sat on the street." … "I love cat very much." … "He comes from Japan." …

Objective Function (cbow case):
L = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \cdots, w_{t+c})

Model (sum case):
P(w_t \mid w_{t-c}, \cdots, w_{t+c}) = \frac{\exp(u_{w_t} \cdot v)}{\sum_{W} \exp(u_W \cdot v)}, \quad v = \sum_{t' \ne t,\ |t'-t| \le c} v_{w_{t'}}

(u_w and v_w are model parameters; v_w is the word vector for w.)

Procedure:
① Maximize L for u_w and v_w.
② Use v_w as the word vector for w.

Word vectors are trained so that they become good features for predicting surrounding words.
Procedure to Create Paragraph Vectors
Mikolov et al. (http://arxiv.org/pdf/1301.3781.pdf)

Add a vector to the model for each document.

Training corpus (example): doc_1: "A cat sat on the street." … doc_2: "I love cat very much." "He comes from Japan." …

Objective Function (dbow case):
L = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \cdots, w_{t+c}, doc_i)

Model (sum case):
P(w_t \mid w_{t-c}, \cdots, w_{t+c}, doc_i) = \frac{\exp(u_{w_t} \cdot v)}{\sum_{W} \exp(u_W \cdot v)}, \quad v = \sum_{t' \ne t,\ |t'-t| \le c} v_{w_{t'}} + d_i

(doc_i is the document where w_t is included; d_i is its document vector.)

Procedure:
① Maximize L for u_w, v_w, and d_i.
② Preserve u_w, v_w as ũ_w, ṽ_w.
Procedure to Create Paragraph Vector

After training, we can get a good paragraph vector as a feature for a new document.

Objective Function (dbow case) for a new document doc (e.g. "We love SmartNews. … I love SmartNews very much."):
L_doc = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \cdots, w_{t+c}, doc)

Model (sum case):
P(w_t \mid w_{t-c}, \cdots, w_{t+c}, doc) = \frac{\exp(\tilde{u}_{w_t} \cdot \tilde{v})}{\sum_{W} \exp(\tilde{u}_W \cdot \tilde{v})}, \quad \tilde{v} = \sum_{t' \ne t,\ |t'-t| \le c} \tilde{v}_{w_{t'}} + d

Procedure:
① Maximize L for u_w, v_w, and d_i. (training)
② Preserve u_w, v_w as ũ_w, ṽ_w.
③ Maximize L_doc for d. (live data)
④ Use d as the paragraph vector.

In short: maximizing L on training data yields the feature extractor (ũ_w, ṽ_w); for live data, maximizing L_doc yields the paragraph vector d, e.g. [0.2, 0.3, …, 0.2].
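Inference steps ③ and ④ can be sketched numerically: hold the "trained" vectors fixed (here random stand-ins for ũ_w, ṽ_w) and gradient-ascend L_doc with respect to the new document's vector d. Sizes, corpus, and learning rate are toy values, not the production setup:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["we", "love", "smartnews", "i", "very", "much"]
idx = {w: i for i, w in enumerate(vocab)}
dim = 8
U = rng.normal(scale=0.1, size=(len(vocab), dim))  # "trained" ũ_w (frozen)
V = rng.normal(scale=0.1, size=(len(vocab), dim))  # "trained" ṽ_w (frozen)
doc = ["i", "love", "smartnews", "very", "much"]

def loglik(d):
    # L_doc = Σ_t log P(w_t | context, doc)
    total = 0.0
    for t, w in enumerate(doc):
        ctx = [x for j, x in enumerate(doc) if j != t]
        v = V[[idx[x] for x in ctx]].sum(axis=0) + d   # ṽ = Σ ṽ_w' + d
        s = U @ v
        total += s[idx[w]] - np.log(np.exp(s).sum())
    return total

d = np.zeros(dim)
for _ in range(100):                                   # maximize L_doc over d only
    grad = np.zeros(dim)
    for t, w in enumerate(doc):
        ctx = [x for j, x in enumerate(doc) if j != t]
        v = V[[idx[x] for x in ctx]].sum(axis=0) + d
        p = np.exp(U @ v)
        p /= p.sum()
        grad += U[idx[w]] - p @ U                      # ∂ log P / ∂d
    d += 0.5 * grad

print(loglik(np.zeros(dim)), loglik(d))  # the likelihood of the doc improves
```

The resulting d is the paragraph vector used as the document's feature; the word vectors themselves are never updated at inference time.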
Text Classification

Ordinary text classification architecture, with paragraph vectors as features:

① training: ([0.1, 0.3, …], entertainment), ([0.2, -0.3, …], sports), ([0.1, 0.1, …], entertainment), ([0.1, -0.2, …], politics), … → training algorithm → classifier
② live data: feature extraction → ([0.1, -0.1, …]) → classifier → entertainment? sports? …
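A minimal sketch of this architecture with vector features: train a nearest-centroid classifier on labelled vectors, then classify a live vector. The 2-d vectors and labels below are made-up stand-ins for real paragraph vectors; any standard classifier could take their place:

```python
import numpy as np

train = [
    (np.array([0.1, 0.3]), "entertainment"),
    (np.array([0.1, 0.1]), "entertainment"),
    (np.array([0.2, -0.3]), "sports"),
    (np.array([0.3, -0.2]), "sports"),
]

# ① training: one centroid per category
centroids = {}
for label in {l for _, l in train}:
    vs = [v for v, l in train if l == label]
    centroids[label] = np.mean(vs, axis=0)

# ② live data: assign the nearest centroid's label
def classify(v):
    return min(centroids, key=lambda l: np.linalg.norm(v - centroids[l]))

print(classify(np.array([0.1, -0.1])))  # → sports
```

Because the features are dense vectors rather than sparse token counts, the same classifier code works unchanged for Japanese and English documents.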
Benefits of Using Paragraph Vector

Good:
・High scalability: we don't need to work hard at feature engineering for each language.
・High precision in text classification: several percent better than Bag-of-Words with feature engineering on our Japanese/English data sets (labeled: several tens of thousands of documents; unlabeled: ~100,000).

Bad:
・Difficulty in analyzing errors: it is hard to understand the meaning of each component of a paragraph vector.
Benefits of Using Paragraph Vector

It is also important that Paragraph Vector has a different nature from Bag-of-Words: by combining two different types of classifiers, we can get a better classifier.
Our Use Case

Validation: use one classifier to validate the other.
Combination: use the more reliable result of the two classifiers (Bag-of-Words-based vs. Paragraph Vector-based).

Our Use Case (future)

In multilingual localization, use only the Paragraph Vector-based classifier, without any feature engineering.
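The "Combination" use case can be sketched as picking whichever classifier is more confident in its top prediction. The probability dicts are illustrative stand-ins for the outputs of real Bag-of-Words and Paragraph Vector classifiers; the confidence rule itself is an assumption, one simple way to combine them:

```python
# Pick the prediction of whichever classifier is more confident.

def combine(bow_probs, pv_probs):
    bow_best = max(bow_probs, key=bow_probs.get)
    pv_best = max(pv_probs, key=pv_probs.get)
    return bow_best if bow_probs[bow_best] >= pv_probs[pv_best] else pv_best

bow = {"sports": 0.55, "entertainment": 0.45}  # Bag-of-Words output
pv = {"sports": 0.20, "entertainment": 0.80}   # Paragraph Vector output
print(combine(bow, pv))  # → entertainment
```

Because the two feature spaces fail in different ways, disagreement between the classifiers is also a useful validation signal in itself.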
The Challenge
News is uncertainty-seeking for long-term value.

Exploitation vs. Exploration:
・What Big Data firms typically do: preference estimation and risk quantification (exploitation).
・What SmartNews does: uncertainty seeking, discovery (exploration).

What if parents don't feed vegetables to children who only like meat? What if you keep hearing only opinions that match yours?
The Challenge

We are searching for a form of exploration that is not optimal, but acceptable. Why? Humans are not rational enough to simply accept the optimum, and without acceptance, users will never read SmartNews.

We are developing:
① for better feature vectors of users and articles (user interests): topic extraction, image extraction
② for human-acceptable exploration: a multi-armed-bandit-based scoring model

(Feature vectors for 10 million users × real-time feature vectors for articles.)
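For intuition, the simplest multi-armed-bandit policy is epsilon-greedy: mostly exploit the article scored highest for the user, occasionally explore another one. This is a generic sketch with made-up scores, not SmartNews's actual scoring model:

```python
import random

# Epsilon-greedy bandit sketch: exploit the best-scored article most of the
# time, explore a random one with probability epsilon.

def choose_article(scores, epsilon=0.1, rng=random.Random(0)):
    if rng.random() < epsilon:
        return rng.choice(list(scores))   # explore
    return max(scores, key=scores.get)    # exploit

scores = {"article_a": 0.9, "article_b": 0.4, "article_c": 0.2}
picks = [choose_article(scores) for _ in range(1000)]
print(picks.count("article_a") / len(picks))  # mostly article_a
```

Keeping epsilon small is one way to make exploration "acceptable": users mostly see what they expect, with occasional discovery mixed in.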
We are building our engineering team in SF; please join us! We're hiring:
・ML/NLP Engineer
・Data Science Engineer
…
kohei.nakaji@smartnews.com
References

Main Content Extraction
・Christian Kohlschütter, Peter Fankhauser, Wolfgang Nejdl, "Boilerplate Detection Using Shallow Text Features"
・BoilerPipe (Google Code)

Text Classification
・Quoc V. Le, Tomas Mikolov, "Distributed Representations of Sentences and Documents"
・word2vec (Google Code)

Articles about SmartNews
・"Japan's SmartNews Raises Another $10M At A $320M Valuation To Expand In The U.S."
・"SmartNews, The Minimalist News App That's A Hit In Japan, Sets Its Sights On The U.S."
・"Japanese news app SmartNews nabs $10M bridge round, at pre-money valuation of $320M"
・About our company: SmartNews
More Related Content

What's hot

AutoGluonではじめるAutoML
AutoGluonではじめるAutoMLAutoGluonではじめるAutoML
AutoGluonではじめるAutoML西岡 賢一郎
 
音声コーパス設計と次世代音声研究に向けた提言
音声コーパス設計と次世代音声研究に向けた提言音声コーパス設計と次世代音声研究に向けた提言
音声コーパス設計と次世代音声研究に向けた提言Shinnosuke Takamichi
 
DeNAゲーム事業におけるデータエンジニアの貢献 [DeNA TechCon 2019]
DeNAゲーム事業におけるデータエンジニアの貢献 [DeNA TechCon 2019]DeNAゲーム事業におけるデータエンジニアの貢献 [DeNA TechCon 2019]
DeNAゲーム事業におけるデータエンジニアの貢献 [DeNA TechCon 2019]DeNA
 
Multi-agent Inverse reinforcement learning: 相互作用する行動主体の報酬推定
Multi-agent Inverse reinforcement learning: 相互作用する行動主体の報酬推定Multi-agent Inverse reinforcement learning: 相互作用する行動主体の報酬推定
Multi-agent Inverse reinforcement learning: 相互作用する行動主体の報酬推定Keiichi Namikoshi
 
正規表現入門 星の高さを求めて
正規表現入門 星の高さを求めて正規表現入門 星の高さを求めて
正規表現入門 星の高さを求めてRyoma Sin'ya
 
【DL輪読会】Monocular real time volumetric performance capture
【DL輪読会】Monocular real time volumetric performance capture 【DL輪読会】Monocular real time volumetric performance capture
【DL輪読会】Monocular real time volumetric performance capture Deep Learning JP
 
【DL輪読会】Trajectory Prediction with Latent Belief Energy-Based Model
【DL輪読会】Trajectory Prediction with Latent Belief Energy-Based Model【DL輪読会】Trajectory Prediction with Latent Belief Energy-Based Model
【DL輪読会】Trajectory Prediction with Latent Belief Energy-Based ModelDeep Learning JP
 
【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models
【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models
【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat ModelsDeep Learning JP
 
画像キャプションと動作認識の最前線 〜データセットに注目して〜(第17回ステアラボ人工知能セミナー)
画像キャプションと動作認識の最前線 〜データセットに注目して〜(第17回ステアラボ人工知能セミナー)画像キャプションと動作認識の最前線 〜データセットに注目して〜(第17回ステアラボ人工知能セミナー)
画像キャプションと動作認識の最前線 〜データセットに注目して〜(第17回ステアラボ人工知能セミナー)STAIR Lab, Chiba Institute of Technology
 
SSII2022 [OS3-02] Federated Learningの基礎と応用
SSII2022 [OS3-02] Federated Learningの基礎と応用SSII2022 [OS3-02] Federated Learningの基礎と応用
SSII2022 [OS3-02] Federated Learningの基礎と応用SSII
 
全文検索サーバ Fess 〜 全文検索システム構築時の悩みどころ
全文検索サーバ Fess 〜 全文検索システム構築時の悩みどころ全文検索サーバ Fess 〜 全文検索システム構築時の悩みどころ
全文検索サーバ Fess 〜 全文検索システム構築時の悩みどころShinsuke Sugaya
 
道路網における経路探索のための前処理データ構造
道路網における経路探索のための前処理データ構造道路網における経路探索のための前処理データ構造
道路網における経路探索のための前処理データ構造Atsushi Koike
 
パターン認識 05 ロジスティック回帰
パターン認識 05 ロジスティック回帰パターン認識 05 ロジスティック回帰
パターン認識 05 ロジスティック回帰sleipnir002
 
データ分析グループの組織編制とその課題 マーケティングにおけるKPI設計の失敗例 ABテストの活用と、機械学習の導入 #CWT2016
データ分析グループの組織編制とその課題 マーケティングにおけるKPI設計の失敗例 ABテストの活用と、機械学習の導入 #CWT2016データ分析グループの組織編制とその課題 マーケティングにおけるKPI設計の失敗例 ABテストの活用と、機械学習の導入 #CWT2016
データ分析グループの組織編制とその課題 マーケティングにおけるKPI設計の失敗例 ABテストの活用と、機械学習の導入 #CWT2016Tokoroten Nakayama
 
Optimizer入門&最新動向
Optimizer入門&最新動向Optimizer入門&最新動向
Optimizer入門&最新動向Motokawa Tetsuya
 
異次元のグラフデータベースNeo4j
異次元のグラフデータベースNeo4j異次元のグラフデータベースNeo4j
異次元のグラフデータベースNeo4j昌桓 李
 
【DL輪読会】大量API・ツールの扱いに特化したLLM
【DL輪読会】大量API・ツールの扱いに特化したLLM【DL輪読会】大量API・ツールの扱いに特化したLLM
【DL輪読会】大量API・ツールの扱いに特化したLLMDeep Learning JP
 

What's hot (20)

AutoGluonではじめるAutoML
AutoGluonではじめるAutoMLAutoGluonではじめるAutoML
AutoGluonではじめるAutoML
 
NLP2017 NMT Tutorial
NLP2017 NMT TutorialNLP2017 NMT Tutorial
NLP2017 NMT Tutorial
 
Mongo sharding
Mongo shardingMongo sharding
Mongo sharding
 
音声コーパス設計と次世代音声研究に向けた提言
音声コーパス設計と次世代音声研究に向けた提言音声コーパス設計と次世代音声研究に向けた提言
音声コーパス設計と次世代音声研究に向けた提言
 
DeNAゲーム事業におけるデータエンジニアの貢献 [DeNA TechCon 2019]
DeNAゲーム事業におけるデータエンジニアの貢献 [DeNA TechCon 2019]DeNAゲーム事業におけるデータエンジニアの貢献 [DeNA TechCon 2019]
DeNAゲーム事業におけるデータエンジニアの貢献 [DeNA TechCon 2019]
 
Multi-agent Inverse reinforcement learning: 相互作用する行動主体の報酬推定
Multi-agent Inverse reinforcement learning: 相互作用する行動主体の報酬推定Multi-agent Inverse reinforcement learning: 相互作用する行動主体の報酬推定
Multi-agent Inverse reinforcement learning: 相互作用する行動主体の報酬推定
 
Bloom filter
Bloom filterBloom filter
Bloom filter
 
正規表現入門 星の高さを求めて
正規表現入門 星の高さを求めて正規表現入門 星の高さを求めて
正規表現入門 星の高さを求めて
 
【DL輪読会】Monocular real time volumetric performance capture
【DL輪読会】Monocular real time volumetric performance capture 【DL輪読会】Monocular real time volumetric performance capture
【DL輪読会】Monocular real time volumetric performance capture
 
【DL輪読会】Trajectory Prediction with Latent Belief Energy-Based Model
【DL輪読会】Trajectory Prediction with Latent Belief Energy-Based Model【DL輪読会】Trajectory Prediction with Latent Belief Energy-Based Model
【DL輪読会】Trajectory Prediction with Latent Belief Energy-Based Model
 
【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models
【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models
【DL輪読会】Llama 2: Open Foundation and Fine-Tuned Chat Models
 
画像キャプションと動作認識の最前線 〜データセットに注目して〜(第17回ステアラボ人工知能セミナー)
画像キャプションと動作認識の最前線 〜データセットに注目して〜(第17回ステアラボ人工知能セミナー)画像キャプションと動作認識の最前線 〜データセットに注目して〜(第17回ステアラボ人工知能セミナー)
画像キャプションと動作認識の最前線 〜データセットに注目して〜(第17回ステアラボ人工知能セミナー)
 
SSII2022 [OS3-02] Federated Learningの基礎と応用
SSII2022 [OS3-02] Federated Learningの基礎と応用SSII2022 [OS3-02] Federated Learningの基礎と応用
SSII2022 [OS3-02] Federated Learningの基礎と応用
 
全文検索サーバ Fess 〜 全文検索システム構築時の悩みどころ
全文検索サーバ Fess 〜 全文検索システム構築時の悩みどころ全文検索サーバ Fess 〜 全文検索システム構築時の悩みどころ
全文検索サーバ Fess 〜 全文検索システム構築時の悩みどころ
 
道路網における経路探索のための前処理データ構造
道路網における経路探索のための前処理データ構造道路網における経路探索のための前処理データ構造
道路網における経路探索のための前処理データ構造
 
パターン認識 05 ロジスティック回帰
パターン認識 05 ロジスティック回帰パターン認識 05 ロジスティック回帰
パターン認識 05 ロジスティック回帰
 
データ分析グループの組織編制とその課題 マーケティングにおけるKPI設計の失敗例 ABテストの活用と、機械学習の導入 #CWT2016
データ分析グループの組織編制とその課題 マーケティングにおけるKPI設計の失敗例 ABテストの活用と、機械学習の導入 #CWT2016データ分析グループの組織編制とその課題 マーケティングにおけるKPI設計の失敗例 ABテストの活用と、機械学習の導入 #CWT2016
データ分析グループの組織編制とその課題 マーケティングにおけるKPI設計の失敗例 ABテストの活用と、機械学習の導入 #CWT2016
 
Optimizer入門&最新動向
Optimizer入門&最新動向Optimizer入門&最新動向
Optimizer入門&最新動向
 
異次元のグラフデータベースNeo4j
異次元のグラフデータベースNeo4j異次元のグラフデータベースNeo4j
異次元のグラフデータベースNeo4j
 
【DL輪読会】大量API・ツールの扱いに特化したLLM
【DL輪読会】大量API・ツールの扱いに特化したLLM【DL輪読会】大量API・ツールの扱いに特化したLLM
【DL輪読会】大量API・ツールの扱いに特化したLLM
 

Similar to [SmartNews] Globally Scalable Web Document Classification Using Word2Vec

Similar to [SmartNews] Globally Scalable Web Document Classification Using Word2Vec (20)

Bootcamp - Web Development Session 2
Bootcamp - Web Development Session 2Bootcamp - Web Development Session 2
Bootcamp - Web Development Session 2
 
HTML CSS JS in Nut shell
HTML  CSS JS in Nut shellHTML  CSS JS in Nut shell
HTML CSS JS in Nut shell
 
Ember
EmberEmber
Ember
 
Getting Started with jQuery
Getting Started with jQueryGetting Started with jQuery
Getting Started with jQuery
 
Caste a vote online
Caste a vote onlineCaste a vote online
Caste a vote online
 
Jquery library
Jquery libraryJquery library
Jquery library
 
Dotnetintroduce 100324201546-phpapp02
Dotnetintroduce 100324201546-phpapp02Dotnetintroduce 100324201546-phpapp02
Dotnetintroduce 100324201546-phpapp02
 
Introduction to jQuery
Introduction to jQueryIntroduction to jQuery
Introduction to jQuery
 
Overview of PHP and MYSQL
Overview of PHP and MYSQLOverview of PHP and MYSQL
Overview of PHP and MYSQL
 
Javascript libraries
Javascript librariesJavascript libraries
Javascript libraries
 
JS Libraries and jQuery Overview
JS Libraries and jQuery OverviewJS Libraries and jQuery Overview
JS Libraries and jQuery Overview
 
Medium TechTalk — iOS
Medium TechTalk — iOSMedium TechTalk — iOS
Medium TechTalk — iOS
 
DotNet Introduction
DotNet IntroductionDotNet Introduction
DotNet Introduction
 
Build a game with javascript (april 2017)
Build a game with javascript (april 2017)Build a game with javascript (april 2017)
Build a game with javascript (april 2017)
 
MLBox
MLBoxMLBox
MLBox
 
Web scraping using scrapy - zekeLabs
Web scraping using scrapy - zekeLabsWeb scraping using scrapy - zekeLabs
Web scraping using scrapy - zekeLabs
 
Continuous Integration - Live Static Analysis with Puma Scan
Continuous Integration - Live Static Analysis with Puma ScanContinuous Integration - Live Static Analysis with Puma Scan
Continuous Integration - Live Static Analysis with Puma Scan
 
R data interfaces
R data interfacesR data interfaces
R data interfaces
 
Timothy N. Tsvetkov, Rails 3.1
Timothy N. Tsvetkov, Rails 3.1Timothy N. Tsvetkov, Rails 3.1
Timothy N. Tsvetkov, Rails 3.1
 
JQuery
JQueryJQuery
JQuery
 

Recently uploaded

Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...Akihiro Suda
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecturerahul_net
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfStefano Stabellini
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 

Recently uploaded (20)

Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecture
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdf
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 

[SmartNews] Globally Scalable Web Document Classification Using Word2Vec

  • 1. Globally Scalable Web Document Classification Using Word2Vec Kohei Nakaji (SmartNews)
  • 2.
  • 5. About SmartNews Japan Launched 2013 4M+ Monthly Active Users 50% DAU/MAU 100+ Publishers 2013 App of The Year US Launched Oct 2014 1M+ Monthly Active Users Same engagement 80+ Publishers Top News Category App International Launched Feb 2015 10M Downloads WW Same engagement English beta Featured App Funding: $50M
  • 6. Outline of our algorithm Structure Analysis Semantics Analysis URLs Found Importance Estimation 10 million/day 1000+/day Diversification Signals on the Internet
  • 7. Outline of our algorithm Structure Analysis Semantics Analysis URLs Found Importance Estimation 10 million/day 1000+ /day Diversification Signals on the Internet Web Document Classification ⊂
  • 8. Web Document Classification ENTERTAINMENT SPORTS TECHNOLOGY LIFESTYLE SCIENCE … Task definition: When an arbitrary web document arrives, choose one category exclusively from a pre-determined category set. WORLD
  • 9. Web Document Classification ENTERTAINMENT ① Main Content Extraction ② Text Classification ① ② There are roughly two steps:
  • 10. There are roughly two steps: Web Document Classification ENTERTAINMENT ① Main Content Extraction ② Text Classification ① ②
  • 11. Main Content Extraction Two approaches: ・Extract after rendering the whole page (easier, but takes time) ・Extract from HTML (difficult, but fast)
  • 12. Main Content Extraction Two approaches: ・Extract after rendering the whole page (easier, but takes time) ・Extract from HTML (difficult, but fast) ← Our Approach
  • 13. Main Content Extraction from HTML <html> <body>
 <div>click <a>here</a> for </div>
 <div>
 <a>tweet</a><a>share</a> <p> Robert Bates was a volunteer deputy who'd never led an arrest for the Tulsa County Sheriff's Office.
 </p>
 <a>you also like this</a> <p> So how did the 73-year-old insurance company CEO end up joining a sting operation this month that ended when he pulled out his handgun and killed suspect Eric Harris instead of stunning him with a Taser?</p> </div> </body> </html> Example: main content not main content
  • 14. Main Content Extraction from HTML A rule-based extraction algorithm is possible. English: Rule1: a div with text length > 200 and num of ‘a’ tags < 3 is Main Content. Rule2: a div with text length < 100 and num of ‘p’ tags > 4 is Main Content. … RuleN: …
  • 15. Main Content Extraction from HTML A rule-based extraction algorithm is possible. English: Rule1: a div with text length > 200 and num of ‘a’ tags < 3 is Main Content. Rule2: a div with text length < 100 and num of ‘p’ tags > 4 is Main Content. … RuleN: … But not scalable. Japanese: … … … …
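The rule style above can be sketched in a few lines of plain Python, as a minimal illustration: Rule1's thresholds (text length > 200, fewer than 3 `<a>` tags) come from the slide, while the block representation is an assumption.

```python
# A minimal sketch of the slide's rule-based approach (illustrative only):
# a block is a dict with its text and its <a>-tag count.

def is_main_content(block):
    """block: dict with 'text' and 'num_a' (number of <a> tags)."""
    rules = [
        # Rule1 (English): long text with few links is main content.
        lambda b: len(b["text"]) > 200 and b["num_a"] < 3,
        # Rule2, ..., RuleN would follow, per layout and per language,
        # which is exactly why this approach does not scale.
    ]
    return any(rule(block) for rule in rules)

article = {"text": "Robert Bates was a volunteer deputy. " * 10, "num_a": 0}
sharebar = {"text": "tweet share", "num_a": 2}
print(is_main_content(article), is_main_content(sharebar))  # True False
```

Every new site layout and every new language would add more rules to the list, which motivates the learned approach on the next slides.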
  • 16. Main Content Extraction from HTML ② live data (features)block1: block2: block3: (features) (features) … ① training (features, main) (features, not main) (features, main) block1: block2: block3: … decision tree block separation & feature extraction We are using a machine learning approach; See Christian Kohlschütter et al. (http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf)
  • 17. Main Content Extraction from HTML ② live data (features)block1: block2: block3: (features) (features) … ① training (features, main) (features, not main) (features, main) block1: block2: block3: … decision tree block separation & feature extraction We are using a machine learning approach; See Christian Kohlschütter et al. (http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf)
  • 18. Feature Extraction from HTML <html> <body>
 <div>click <a>here</a> for </div>
 <div>
 <a>tweet</a><a>share</a> <p> Robert Bates was a volunteer deputy who'd never led an arrest for the Tulsa County Sheriff's Office.
 </p>
 <a>you also like this</a> <p> So how did the 73-year-old insurance company CEO end up joining a sting operation this month that ended when he pulled out his handgun and killed suspect Eric Harris instead of stunning him with a Taser?</p></div> </body> </html> Separate HTML into ‘text block’s Step1:
  • 19. Feature Extraction from HTML <html> <body>
 <div>click <a>here</a> for </div>
 <div>
 <a>tweet</a><a>share</a> <p> Robert Bates was a volunteer deputy who'd never led an arrest for the Tulsa County Sheriff's Office.
 </p>
 <a>you also like this</a> <p> So how did the 73-year-old insurance company CEO end up joining a sting operation this month that ended when he pulled out his handgun and killed suspect Eric Harris instead of stunning him with a Taser?</p></div> </body> </html> Step1: Separate HTML into ‘text block’s Step2: Extract local features for every text block ex: word count = 36, num of <a> = 0
  • 20. Feature Extraction from HTML <html> <body>
 <div>click <a>here</a> for </div>
 <div>
 <a>tweet</a><a>share</a> <p> Robert Bates was a volunteer deputy who'd never led an arrest for the Tulsa County Sheriff's Office.
 </p>
 <a>you also like this</a> <p> So how did the 73-year-old insurance company CEO end up joining a sting operation this month that ended when he pulled out his handgun and killed suspect Eric Harris instead of stunning him with a Taser?</p></div> </body> </html> Step1: Separate HTML into ‘text block’s Step2: Extract local features for every text block ex: word count = 36, num of <a> = 0 Step3: Define feature of each text block as combination of local features word count(current block) : 36, num of <a>(current block) : 0, word count (previous block) : 4, num of <a> (previous block) : 1 ex:
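Steps 2 and 3 can be sketched as follows, assuming Step 1 has already split the HTML into text blocks, here given as (text, num_a_tags) pairs; the feature layout [wc_cur, a_cur, wc_prev, a_prev] mirrors the slide's example, and the function name is an assumption.

```python
# Sketch of Steps 2-3: local features per block, combined with the
# previous block's local features into one feature vector per block.

def block_features(blocks):
    """blocks: list of (text, num_a_tags). Returns one feature vector per
    block: [word_count_cur, num_a_cur, word_count_prev, num_a_prev]."""
    feats = []
    prev_wc, prev_a = 0, 0  # a virtual empty block before the first one
    for text, num_a in blocks:
        wc = len(text.split())          # local feature: word count
        feats.append([wc, num_a, prev_wc, prev_a])
        prev_wc, prev_a = wc, num_a     # current block becomes "previous"
    return feats

blocks = [("click here for", 1),
          ("Robert Bates was a volunteer deputy who'd never led an arrest", 0)]
print(block_features(blocks))  # [[3, 1, 0, 0], [11, 0, 3, 1]]
```

Combining neighboring blocks' features gives the classifier context: a short, link-heavy block next to a long text block is usually navigation, not content.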
  • 21. Main Content Extraction from HTML ② live data (features)block1: block2: block3: (features) (features) … ① training (features, main) (features, not main) (features, main) block1: block2: block3: … decision tree block separation & feature extraction We are using a machine learning approach: See Christian Kohlschütter et al. (http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf)
  • 22. Main Content Extraction from HTML ② live data (features)block1: block2: block3: (features) (features) … ① training (features, main) (features, not main) (features, main) block1: block2: block3: … decision tree block separation & feature extraction We are using a machine learning approach; See Christian Kohlschütter et al. (http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf)
  • 23. Making Main Content Using Decision Tree (features)block1: not main (features)block2: not main (features)block3: main (features)block5: main (features)block4: not main
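The classification-plus-reassembly step above can be sketched as follows; the tiny hand-written tree stands in for the trained model, and its split thresholds are invented for illustration.

```python
# A toy stand-in for the trained decision tree: classify each block from
# its feature vector, then join the blocks predicted "main". A real tree
# is learned from labeled blocks; these splits are illustrative.

def predict_main(feat):
    """feat = [wc_cur, a_cur, wc_prev, a_prev]; True means main content."""
    if feat[0] > 10:          # enough words in the current block?
        return feat[1] < 3    # ...and few links -> main content
    return False

def extract_main_text(blocks):
    """blocks: list of (text, feature_vector). Concatenate the main blocks."""
    return " ".join(text for text, feat in blocks if predict_main(feat))

blocks = [
    ("click here for", [3, 1, 0, 0]),
    ("Robert Bates was a volunteer deputy who'd never led an arrest "
     "for the Tulsa County Sheriff's Office.", [17, 0, 3, 1]),
]
print(extract_main_text(blocks))
```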
  • 24. Main Content Extraction from HTML ② live data (features)block1: block2: block3: (features) (features) … ① training (features, main) (features, not main) (features, main) block1: block2: block3: … decision tree block separation & feature extraction We are using a machine learning approach; See Christian Kohlschütter et al. (http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf)
  • 25. There are roughly two steps: Web Document Classification ENTERTAINMENT ① Main Content Extraction ② Text Classification ① ②
  • 26. Text Classification Ordinary text classification architecture: ② live data (features) ① training (features, entertainment) (features, sports) (features, entertainment) features ? ? … entertainment sports (features, politics) … sports training algorithm classifier feature extraction
  • 27. Text Classification Ordinary text classification architecture: ② live data (features) ① training (features, entertainment) (features, sports) (features, entertainment) features ? ? … entertainment sports (features, politics) … sports training algorithm classifier feature extraction
  • 28. Feature Extraction in Text Classification Will LeBron James deliver an NBA championship to Cleveland? ‘Bag-of-words’ is commonly used as a feature vector: the sentence becomes the unordered set {Will, LeBron, James, deliver, an, NBA, championship, to, Cleveland}.
  • 29. Feature Extraction in Text Classification Will LeBron James deliver an NBA championship to Cleveland? ‘Bag-of-words’ is commonly used as a feature vector, with some feature engineering: stop words are removed, a sports-players dictionary maps names like LeBron James to NBA_PLAYER, and tf-idf weighting is applied.
  • 30. Feature Extraction in Text Classification Similarly used in Japanese: 私は中路です。よろしくお願いします。 (“I am Nakaji. Nice to meet you.”) is tokenized into 私 / は / 中路 / です / よろしく / お願い / し / ます, then stop words are removed, a person dictionary maps 中路 to PERSON, and tf-idf weighting is applied.
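The feature-engineering pipeline above (stop-word removal plus tf-idf on a bag of words) can be sketched in pure Python; the stop-word list and the tf-idf variant (raw term frequency, idf = log(N/df)) are illustrative assumptions.

```python
import math
from collections import Counter

STOP_WORDS = {"will", "an", "to", "the", "a", "is"}   # tiny illustrative list

def tfidf_features(docs):
    """docs: list of token lists. Returns one {term: tf-idf weight} per doc."""
    n = len(docs)
    df = Counter()                               # document frequency per term
    for doc in docs:
        df.update({t for t in doc if t not in STOP_WORDS})
    feats = []
    for doc in docs:
        tf = Counter(t for t in doc if t not in STOP_WORDS)
        feats.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return feats

docs = [["will", "lebron", "james", "deliver", "an", "nba", "championship"],
        ["nba", "finals", "schedule"]]
print(tfidf_features(docs)[0])
```

Note how a term appearing in every document ("nba" here) gets idf = log(1) = 0, so it carries no weight, which is the point of the idf factor.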
  • 32. Example: 私は中路です。 よろしくお願いします。 [0.2, 0.3, ……0.2] Will LeBron James deliver an NBA championship to Cleveland? [0.1, 0.4, ……0.1] Paragraph Vector (dimension ~ several hundred)
  • 33. Outline of Distributed Representation ・word2vec: every word is mapped to a unique word vector. (https://code.google.com/p/word2vec/) ・paragraph vector: every document is mapped to a unique vector. (Quoc V. Le, Tomas Mikolov http://arxiv.org/abs/1405.4053)
  • 34. Outline of Distributed Representation ・word2vec: every word is mapped to a unique word vector. (https://code.google.com/p/word2vec/) ・paragraph vector: every document is mapped to a unique vector. (Quoc V. Le, Tomas Mikolov http://arxiv.org/abs/1405.4053)
  • 35. Word Vector in the word2vec Model Every word is mapped to a unique word vector with good properties: Germany, Berlin, Paris, and France each map to a vector such as [0.1, 0.2, ……0.2], and the relation “Germany − Berlin = France − Paris” holds, i.e. \( v_{Germany} - v_{Berlin} = v_{France} - v_{Paris} \).
  • 36. Procedure to Create Word Vectors (Mikolov et al., http://arxiv.org/pdf/1301.3781.pdf) Word vectors are trained so that they become good features for predicting surrounding words, from a corpus such as “A cat sat on the street. … I love cat very much. … He comes from Japan. …”, with each word labeled \( w_1, w_2, \ldots \) Objective function (cbow case): \( L = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \cdots, w_{t+c}) \). Model (sum case): \( P(w_t \mid w_{t-c}, \cdots, w_{t+c}) = \frac{\exp(u_{w_t} \cdot v)}{\sum_{W} \exp(u_W \cdot v)} \) with \( v = \sum_{t' \neq t,\, |t'-t| \le c} v_{w_{t'}} \), where \( u_w \) and \( v_w \) are defined for each word, and \( v_w \) is the word vector for \( w \). Procedure: ① Maximize \( L \).
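As an illustrative sketch (not the actual training code), the windowed (context, target) pairs and the softmax model from the cbow objective above can be written in a few lines of plain Python; the toy vectors used below are assumptions.

```python
import math

def cbow_pairs(tokens, c=2):
    """Enumerate (context_words, target_word) pairs with window size c."""
    pairs = []
    for t, target in enumerate(tokens):
        context = tokens[max(0, t - c):t] + tokens[t + 1:t + 1 + c]
        pairs.append((context, target))
    return pairs

def cbow_prob(target, context, u, v):
    """P(target | context) = exp(u_target . v_sum) / sum_W exp(u_W . v_sum),
    where v_sum is the sum of the context words' input vectors v."""
    dim = len(next(iter(v.values())))
    v_sum = [sum(v[w][i] for w in context) for i in range(dim)]
    score = {w: math.exp(sum(uw[i] * v_sum[i] for i in range(dim)))
             for w, uw in u.items()}
    return score[target] / sum(score.values())

print(cbow_pairs(["a", "cat", "sat", "on", "the", "street"], c=2)[2])
# the context of "sat" is the surrounding window ["a", "cat", "on", "the"]
```

Training then adjusts u and v by gradient ascent so these probabilities rise for observed pairs; in practice the denominator is approximated with negative sampling or hierarchical softmax.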
  • 37. Outline of Distributed Representation ・word2vec: every word is mapped to a unique word vector. ・paragraph vector: every document is mapped to a unique vector. (Quoc V. Le, Tomas Mikolov http://arxiv.org/abs/1405.4053)
  • 38. Example: 私は中路です。 よろしくお願いします。 [0.2, 0.3, ……0.2] Will LeBron James deliver an NBA championship to Cleveland? [0.1, 0.4, ……0.1] Paragraph Vectors (dimension ~ several hundred)
  • 39. Procedure to Create Paragraph Vectors Add a vector to the model for each document (cf. Mikolov et al., http://arxiv.org/pdf/1301.3781.pdf). Corpus: doc_1: “A cat sat on the street. …”, doc_2: “I love cat very much. … He comes from Japan. …”, with each word labeled \( w_1, w_2, \ldots \) and each document labeled doc_1, doc_2, … Objective function (dbow case): \( L = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \cdots, w_{t+c}, doc_i) \). Model (sum case): \( P(w_t \mid w_{t-c}, \cdots, w_{t+c}, doc_i) = \frac{\exp(u_{w_t} \cdot v)}{\sum_{W} \exp(u_W \cdot v)} \) with \( v = \sum_{t' \neq t,\, |t'-t| \le c} v_{w_{t'}} + d_i \), where \( doc_i \) is the document in which \( w_t \) is included. Procedure: ① Maximize \( L \) for \( u_w \), \( v_w \), and \( d_i \). ② Preserve \( u_w \), \( v_w \) as \( \tilde{u}_w \), \( \tilde{v}_w \).
  • 40. Procedure to Create Paragraph Vector After training, we can get a good paragraph vector as a feature for a new document (live data), e.g. doc: “We love SmartNews. … I love SmartNews very much.” Objective function (dbow case): \( L_{doc} = \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \cdots, w_{t+c}, doc) \). Model (sum case): \( P(w_t \mid w_{t-c}, \cdots, w_{t+c}, doc) = \frac{\exp(\tilde{u}_{w_t} \cdot \tilde{v})}{\sum_{W} \exp(\tilde{u}_W \cdot \tilde{v})} \) with \( \tilde{v} = \sum_{t' \neq t,\, |t'-t| \le c} \tilde{v}_{w_{t'}} + d \). Procedure: ③ Maximize \( L_{doc} \) for \( d \) (with \( \tilde{u}_w \), \( \tilde{v}_w \) fixed). ④ Use \( d \) as the paragraph vector.
  • 41. Procedure to Create Paragraph Vector [Figure: the feature extractor is trained by maximizing \( L \) (yielding \( \tilde{u}_w \), \( \tilde{v}_w \)); for a new document, maximizing \( L_{doc} \) yields its paragraph vector \( d \), e.g. [0.2, 0.3, ……0.2].]
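Steps ③ and ④ can be sketched in pure Python: hold the trained word vectors fixed and run gradient ascent on d alone. The toy vocabulary, dimension, learning rate, and epoch count below are all assumptions for illustration, not SmartNews's implementation.

```python
import math

def infer_paragraph_vector(tokens, u, v, c=2, lr=0.1, epochs=100):
    """Infer d for a new document with frozen word vectors u (output) and
    v (input), by gradient ascent on sum_t log P(w_t | context_t, doc)."""
    dim = len(next(iter(u.values())))
    d = [0.0] * dim                        # only d is updated
    for _ in range(epochs):
        for t, target in enumerate(tokens):
            ctx = tokens[max(0, t - c):t] + tokens[t + 1:t + 1 + c]
            # combined input vector: sum of context word vectors plus d
            h = [sum(v[w][i] for w in ctx) + d[i] for i in range(dim)]
            exps = {w: math.exp(sum(uw[i] * h[i] for i in range(dim)))
                    for w, uw in u.items()}
            z = sum(exps.values())
            # gradient of log P(target | ctx, doc) with respect to d:
            # u_target minus the probability-weighted average of all u_W
            for i in range(dim):
                grad = u[target][i] - sum(e / z * u[w][i]
                                          for w, e in exps.items())
                d[i] += lr * grad
    return d

# toy 1-dimensional "trained" vectors for a two-word vocabulary
u = {"good": [1.0], "bad": [-1.0]}
v = {"good": [0.0], "bad": [0.0]}
print(infer_paragraph_vector(["good", "good"], u, v)[0] > 0)  # True
```

A document full of "good" pulls d toward the "good" output vector and a document full of "bad" pulls it the opposite way, which is exactly what makes d usable as a feature.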
  • 42. Text Classification Ordinary text classification architecture: ② live data ([0.1, -0.1, …]) ① training ([0.1, 0.3, …], entertainment) ([0.2, -0.3, …], sports) ([0.1, 0.1, …], entertainment) features ? ? … entertainment sports ([0.1, -0.2, …], politics) … sports training algorithm classifier feature extraction
  • 43. Benefits of Using Paragraph Vector Good: ・High precision in text classification: several percent better than Bag-of-Words with feature engineering on our Japanese/English data set (labeled: ~tens of thousands; unlabeled: ~100,000). ・High scalability: we don’t need to work hard on feature engineering for each language. Bad: ・Difficulty in analyzing errors: it is hard to understand the meaning of each component of a paragraph vector.
  • 44. Benefits of Using Paragraph Vector It is important that Paragraph Vector has a different nature than Bag-of-Words. Reason: we can get a better classifier by combining two different types of classifiers.
  • 45. Our Use Case Bag-of-Words-based classifier vs. Paragraph Vector-based classifier: ・Combination: use the more reliable result of the two classifiers. ・Validation: use one to validate the other.
  • 46. Our Use Case (future) In multilingual localization, use only the Paragraph Vector-based classifier, without any feature engineering.
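Both use cases can be sketched with a minimal interface, assuming each classifier returns a (category, confidence) pair; the API shape and the confidence scores below are assumptions, not SmartNews's actual interface.

```python
# Sketch of "combination" (pick the more reliable result) and
# "validation" (use one classifier to check the other).

def combine(bow_result, pv_result):
    """Each result is (category, confidence); keep the more confident one."""
    return max(bow_result, pv_result, key=lambda r: r[1])[0]

def validate(bow_result, pv_result):
    """Use one classifier to validate the other: do they agree?"""
    return bow_result[0] == pv_result[0]

bow = ("SPORTS", 0.92)          # Bag-of-Words-based classifier
pv = ("ENTERTAINMENT", 0.55)    # Paragraph Vector-based classifier
print(combine(bow, pv), validate(bow, pv))  # SPORTS False
```

Because the two feature spaces make different kinds of mistakes, disagreement is itself a useful signal, e.g. for routing an article to manual review.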
  • 47. Web Document Classification ENTERTAINMENT ① Main Content Extraction ② Text Classification ① ② There are roughly two steps:
  • 49. The Challenge News is uncertainty seeking for long-term values. Exploitation (what Big Data firms typically do): preference estimation and risk quantification. Exploration (what SmartNews does): uncertainty seeking, discovery. What if parents don't feed vegetables to children who only like meat? What if you keep hearing only opinions that match yours?
  • 50. The Challenge Searching for a not optimal, but acceptable form of exploration. Why? Humans are not rational enough to simply accept the optimum; without acceptance, users will never read SmartNews. We are developing: ① For better feature vectors of users and articles: ・topic extraction ・image extraction ② For human-acceptable exploration: ・multi-arm bandit based scoring model … feature vectors for 10 million users x real-time feature vectors for articles
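The slide names a multi-arm bandit based scoring model without specifying the algorithm; as one standard instance (an assumption, not SmartNews's actual model), here is a UCB1-style score that adds an exploration bonus for rarely shown articles.

```python
import math

def ucb1_scores(arms, total_plays):
    """arms: {article_id: (times_shown, total_reward)}. UCB1 trades off
    exploitation (mean reward, e.g. click rate) against exploration
    (an uncertainty bonus that is large for rarely shown articles)."""
    scores = {}
    for article, (n, reward) in arms.items():
        if n == 0:
            scores[article] = float("inf")   # always try unseen articles once
        else:
            scores[article] = reward / n + math.sqrt(
                2 * math.log(total_plays) / n)
    return scores

# same mean reward (0.6), but very different exposure counts
arms = {"a": (100, 60), "b": (5, 3), "c": (0, 0)}
s = ucb1_scores(arms, total_plays=105)
print(s["b"] > s["a"], s["c"] == float("inf"))  # True True
```

The bonus term shrinks as an article accumulates impressions, so the ranking naturally shifts from exploration toward exploitation over time.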
  • 51. We are building our engineering team in SF - please join us! We’re hiring: ・ML/NLP Engineer ・Data Science Engineer …
  • 53. References Main Content Extraction: ・Christian Kohlschütter, Peter Fankhauser, Wolfgang Nejdl, Boilerplate Detection using Shallow Text Features ・BoilerPipe (Google Code) Text Classification: ・Quoc V. Le, Tomas Mikolov, Distributed Representations of Sentences and Documents ・Word2Vec (Google Code)
  • 54. References About SmartNews Articles about SmartNews: ・Japan’s SmartNews Raises Another $10M At A $320M Valuation To Expand In The U.S. ・SmartNews, The Minimalist News App That's A Hit In Japan, Sets Its Sights On The U.S. ・Japanese news app SmartNews nabs $10M bridge round, at pre-money valuation of $320M ・About our Company SmartNews

Editor's Notes

  1. Hello, I am Kohei Nakaji, an engineer at SmartNews Inc. I'm developing the news delivery algorithm at SmartNews, using especially machine learning and natural language processing. My research background is not in ML but in particle physics theory: the beginning of the universe, dark matter and so on. So if you have an interest in physics, I can also talk about that another day. Anyway, today I'm going to talk about this topic: 'Globally Scalable Web Document Classification Using Word2Vec'. Because this talk is based on the technology in SmartNews, I will give a brief introduction of our company, SmartNews. We at SmartNews are developing the iOS/Android application SmartNews.
  2. How many of you use SmartNews here? Very few people. How many of you love machine learning? Great, then you will love SmartNews, because our app is made by machine learning. SmartNews is a news app for more than 100 countries, but we have no writers and no editors; the algorithm does everything. How many of you use a news app every day? Yeah, most news apps fail. Some apps have great download numbers but are annoying, with a low engagement ratio. We at SmartNews have 10M downloads globally and more than 50% of users are active, so we have the possibility of taking the position of the successful news app. Then what makes SmartNews different?
  3. The keyword is 'machine learning for discovery'. Some apps rely on human editors; they are not scalable and they can also be biased. Some apps use machine learning in their delivery algorithms, but they use it for personalization. We use machine learning for everyone on earth to discover and learn new things they might not otherwise have seen. This is our mission. We are trying to develop an algorithm for users to discover new things; that is what makes our engagement ratio high. Now let me show you a demo of our app.
  4. Let me show you how it works. First, when you open it up, you can see the top news right here. Top news are the latest important news chosen by our algorithm. Over here you've got tabs for different categories, which are the most straightforward result of web document classification. You see the latest important news in each category, chosen by our algorithm; you may understand how precise our web document classification has to be. One of the cool things is that when you find something you want to read, for example this article right here, you've got this option right here, which is the Smart View option. You'll like this option because it looks very, very clean: no banners, no ads. Over here you can see the web view, which is an ordinary web browser; you see a lot of things you don't want to read in web view, but Smart View is more simple and clean. You may understand how difficult it is to create Smart View from an arbitrary website; I will introduce some of the algorithms in this talk. Another cool thing about Smart View is that you can see it even offline. You can read it in the metro, on the airplane, anywhere.
  5. As I told you, we have 10M downloads and more than 50% of users are active. There are 3 types of editions: the Japanese edition, the US edition, and the international edition. In the international edition, users can read English articles localized for more than 100 countries, but there is no editor for each country.
  6. The UI is good and Smart View is cool, but as I told you, what makes us different is the algorithm to find articles from which users can discover new things. This is the outline of our algorithm for users' discovery. URLs are found from signals on the Internet by our crawler; HTML structures are automatically analyzed, for example the title, main text, and image are extracted; then the semantics of the articles are analyzed: what category it has, what subject it has, what image is in it, etc. Using signals and semantics, the importance score of each article for each category in each country is calculated; we diversify the topics of the delivery list, and then we deliver the articles to users. The list of articles is refreshed in real time. We crawl 10 million URLs/day and deliver only the top 1000 articles to users, about 100 per category per day. There are many things to say about this algorithm. In particular, how we do importance estimation, and whether we personalize or take another approach, is a key feature because it is related to our mission. I will talk about it later; now let's get into today's main topic.
  7. Web document classification, which is part of our structure analysis and semantics analysis. The reason why I chose web document classification for today's topic is that, for one thing, it is important for our application as you have already seen, and for another, classification of unstructured data is a common task in many applications, from simple spam filters to category tagging on e-commerce sites.
  8. The task definition is very simple: when an arbitrary web document arrives, choose one category exclusively from a pre-determined category set.
  9. There are roughly two steps. 1. Main content extraction: we have to detect the main content of a news website. It is difficult because there are so many websites, and different websites have different structures. 2. Text classification: we classify the main content into one category. First I'll briefly show one of our algorithms to detect the main content of a web document; next I will talk about text classification using a word2vec-extended model.
  10. Let's start with main content extraction. I want to add that in our app, main content extraction is also important for making the Smart View we have seen.
  11. When we do main content extraction, there are two approaches; we actually use the second one. The first approach is rendering the whole page, loading all CSS and JavaScript, and extracting the main content after that. It is relatively easier because we can use the position, width, and height of each component, but it takes time because we have to render all items. The second approach is to extract the main content directly from the HTML. It is more difficult but needs much less computing resource compared with the first approach.
  12. We use the second approach in our algorithm, because we have to process 10 million articles per day, about 100 articles per second.
  13. This is an example of main content extraction from HTML. It is the task of detecting which parts are main content and which are not.
  14. A rule-based extraction algorithm is of course possible, like 'a div whose text length is more than 200 is main content'. But because there are so many websites, the number of rules tends to be large.
  15. If we do it in multiple languages, it becomes much harder.
  16. So, as one of our algorithms to extract main content, we are using a machine learning approach based on a paper from 2011, so let me introduce it. In the training phase, first we prepare sets of HTML documents in which the main content is already labeled. In our case, we aggregate the articles with our crawler and annotators label the main content. Next, using the block separator, the HTML is separated into text blocks, and using the feature extractor, a feature vector for each block is extracted.
  17. let’s get into the block separation and feature extraction part.
  18. For step one, we separate the HTML into text blocks. The definition of a 'text block' in our case is, roughly, a run of text sandwiched between block-level tags.
  19. For step 2, local features for each block are extracted. We use, for example, the number of words and the number of <a> tags as local features.
  20. For step 3, we create the feature vector of each block as the combination of local features of different blocks. In this example, the feature vector of this text block has the elements 'word count and number of <a> tags in the previous and current blocks'.
  21. In the training phase, after block separation and feature extraction, we get sets of labeled feature vectors. The label is a binary value: main/not main. Using the labeled feature vectors, a decision tree is trained. When live data comes, the HTML is separated into text blocks with features, and using the already-trained decision tree, the final result is obtained.
  22. Let’s get into this part.
  23. The feature vector of each block is classified into main/not main using the already-trained decision tree. Now we know which text blocks are main content and which are not. By combining the results, we get the main text.
  24. This is the end of main content extraction: easy, simple, but not bad. If you want to know more about it, please see the link; there is also a library that includes an already-trained model for English, so please try it. I will share the references later.
  25. so let’s get into the text classification.
  26. Probably you know all of this already, but let me review the ordinary classification architecture. In the training phase, first we prepare sets of labeled texts as training data. Using the feature extractor, sets of labeled feature vectors are created; then, using a training algorithm like SVM or logistic regression, a classifier is trained. In the bag-of-words feature extractor case, the set of words in the document is extracted as the feature vector, and after training, roughly speaking, the classifier knows which words tend to show up in which category. When live data comes, the feature vector is extracted and the category is determined by the already-trained classifier.
  27. The training algorithm itself is ordinary logistic regression in our application, and there are many materials about it. So today, let's focus on the feature extraction part.
  28. As a feature vector, 'Bag-of-words' is commonly used. Bag-of-words is the set of words in the document; it does not care about the order of words. Very simple, but not bad if we use it for text classification.
  29. If we want to improve the quality of the feature vector, we create, for example, a stop-words dictionary for removing unnecessary words, create a specific dictionary for adding a specific feature, or use tf-idf. But Bag-of-Words is still the starting point.
  30. In the Japanese case, we have to use a technique to separate words, but Bag-of-Words with some feature engineering is still commonly used. But Bag-of-Words definitely does not seem to be a perfect feature vector of text; for example, it cannot include information about word order. For another example, we cannot use the information that two words are close to each other or not. We wonder whether we can easily get a better feature vector or not.
  31. As a better feature vector, we use Paragraph Vector, which is a word2vec-extended model. It is 'better' in the precision of text classification.
  32. by using the technique I will talk about today, every document is mapped to one dense vector with a few hundred dimensions named paragraph vector.
  33. Because paragraph vector is a kind of word2vec-extended model, I should start with word2vec. In the word2vec case, every word is mapped to a unique word vector. In the paragraph vector case, every document is mapped to a unique vector.
  34. So let’s get into word2vec.
  35. Every word is mapped to a unique vector. In this example, France, Paris, Germany, and Berlin are each mapped to a unique vector. What is surprising is the property 'Germany - Berlin = France - Paris'. From this property, we can see that some semantics is embedded in the vectors.
  36. This is a brief overview of training the word2vec model. First prepare sets of documents and label each word like w1, w2, then maximize the objective function. The value of c is arbitrary; 2 or 3 is commonly used. By looking at the shape of this objective function, you can see that maximizing it means maximizing the probability of predicting a word from its surrounding words. In the example of the right figure, the model is refreshed so that the probability of predicting 'on' from the surrounding words 'cat', 'sat', 'the', 'street' becomes higher. The model of the probability function is like this: for each word, two types of vectors are defined, the output vector u and the input vector v. Roughly speaking, when training converges, the more often a pair of two words shows up in the same sentence, the bigger the inner product of u and v for those two words becomes. After training, we use v for each word as the word vector. Technically, training this model directly is really heavy because of this sum, and two types of approximations, negative sampling and hierarchical softmax, are used. Detail about the approximations is beyond the scope of this talk. This is how we create word vectors using the word2vec model.
  37. Then let’s get into paragraph vector.
  38. As I told you, each document is mapped into one dense vector named paragraph vector.
  39. The procedure to create paragraph vectors is similar to the word2vec case. Prepare sets of documents, label each word like w1, w2, and also label each document like doc_1, doc_2. Then maximize this objective function. The difference from the word2vec model is that the objective function includes the document id where the word is included. So maximizing this objective function means maximizing the probability of predicting a word not only from the surrounding words but also from the document where the word is included. The model of the probability function is also a little bit different. As in the word2vec case, for each word an output vector u and an input vector v are defined. In addition, for each document, a vector d_i is also defined. When training converges, we get optimized u, v for each word and d_i for each document. The final result of vector d_i is the paragraph vector for each document. But what we really want to do is extract a paragraph vector from a new document. For doing that we need one more step.
  40. When a new document comes, we label the words in the document and maximize this objective function. This time, T is the number of words in the document. We don't need to maximize the objective function for u and v; we can use the u and v which are already trained. All we have to do is maximize the objective function for d. After the objective function is maximized, we get d as the paragraph vector for the document.
  41. It was a little bit confusing, so I show a simple figure. First, we train the feature extractor by putting the large set of documents, and when new document comes, by using the already trained feature extractor, paragraph vector is extracted. very simple right?
  42. By just using the paragraph vector as a feature vector, we can do ordinary text classification.
  43. The good things about using paragraph vectors compared with Bag-of-Words are these two. ① High precision: on our Japanese/English data set, the result of a 10-fold validation test becomes several percent better than the Bag-of-Words-with-feature-engineering case. ② High scalability: by just preparing the set of documents for each language, without feature engineering, we can get a good result. The bad thing is the difficulty in analyzing errors: it is hard to understand the meaning of each component of a paragraph vector. Because there is a trade-off, I don't know which you should choose in your use case, even if the precision of text classification is several percent higher using paragraph vectors.
  44. But still, I think it's good for you to try paragraph vectors. Paragraph Vector has a different nature from Bag-of-Words, so the combination of a Bag-of-Words-based classifier and a Paragraph-Vector-based classifier can be a much better classifier.
  45. In our app, there are many types of classifiers, like a sports classifier and an entertainment classifier, other than the main category classifier. Depending on the purpose of each classification, in some cases we use the more reliable result of the Bag-of-Words-based classifier and the Paragraph-Vector-based classifier. In other cases we validate the result of the Bag-of-Words-based classifier using the Paragraph-Vector-based classifier. Also, in the near future, when we expand our business into, let's say, 100 languages, it is quite possible that we will use only the Paragraph-Vector-based classifier, because of its high scalability and high precision.
  46. Also, in the near future, when we expand our business into, let's say, 100 languages, it is quite possible that we will use only the Paragraph-Vector-based classifier, because of its high scalability and high precision.
  47. This is the end of todays’ topic web document classification.
  48. News is uncertainty seeking for long-term values. What other big data firms typically do is recommend things people already have an interest in, using, for example, matrix factorization. What we are doing is not simply suggesting to users what they like, but expanding users' interests with our algorithm.
  49. How to explore users' interest space and suggest something new to users is a very challenging problem. We are now brushing up these two things. For a better understanding of the users' interest space, we are brushing up topic and subject extraction from articles and brushing up users' feature vectors. For doing good exploration, a multi-arm bandit based scoring model. Technically, we have to create and operate a good and reasonable model which includes feature vectors of 10 million users and real-time feature vectors of articles; it is really exciting. Actually, the number of people tackling these problems is 5, including an ML PhD and a theoretical physics PhD, but we need many, many more people to tackle this difficult problem.