Rakuten holds many kinds of text data, such as query keywords, product descriptions, and user reviews. This data is collected on our servers continuously, and its volume grows by the hour. To take advantage of this big data for text in our businesses, however, we first need to convert the massive unstructured data into structured data. In this presentation, we describe methods that use natural language processing to organize unstructured data automatically in support of Rakuten's businesses.
Organizing Big Data for Text in Rakuten
1. Organizing Big Data for Text in Rakuten
October 28, 2017
Keiji Shinzato
Rakuten Institute of Technology
Rakuten, Inc.
2. 2
Text in Rakuten
• Search Queries
• Reviews from Users
• Product Descriptions
• Etc.
→ Understanding →
Valuable Information
• User Interest
• User Experience
• Product Features
• Etc.
3. 3
The number of products has risen 2.6 times compared to five years ago.
• 100M products in 2012 [1] to 258M products in 2017 [2]
How much time would you need to read the descriptions of 258M products?
A. 4 years
B. 40 years
C. 400 years
D. 4,000 years
[1] https://corp.rakuten.co.jp/about/history.html
[2] https://www.rakuten.co.jp/ (as of October 2nd, 2017)
4. 4
Technology to organize big data for text is critical!
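The quiz on the previous slide comes down to simple arithmetic. The following is a rough sketch, assuming a reader spends one minute per product description and reads without breaks. That reading speed and the function name `years_to_read` are illustrative assumptions, not figures from the talk.

```python
MINUTES_PER_YEAR = 60 * 24 * 365  # non-stop reading, ignoring leap years


def years_to_read(num_products: int, minutes_per_description: float = 1.0) -> float:
    """Estimate the years needed to read every product description once."""
    return num_products * minutes_per_description / MINUTES_PER_YEAR


if __name__ == "__main__":
    # 258M products is the 2017 figure quoted on the slide.
    print(f"{years_to_read(258_000_000):.0f} years")  # prints "491 years"
```

Changing `minutes_per_description` scales the estimate linearly, so the answer stays far beyond one person's working life for any realistic reading speed.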
6. 6
Application
Unstructured Data
Crafted from sleek spazzolato leather (black), the Dorian shopper is an elegant carryall that's perfect for your essentials. 10"H x 13"L x 6"D. RALPH LAUREN
↓
Structured Data
Attribute  Value
Brand      Ralph Lauren
Color      Black
Material   Leather
Size       10"H x 13"L x 6"D
Faceted Navigation / Recommendation / Market Research
The bag image is designed by Freepik (http://www.freepik.com/free-vector/set-of-woman-s-bags-in-flat-style_960523.htm)
7. 7
Difficulty
• Ambiguity
  • パーカー (a luxury pen brand or a hoodie), PUMA (a sports brand or a knife brand)
• Diversity (long tail)
  • 風と光 (a natural foods company)
Dictionary-based approach
• System behavior is easy to control by editing entries in the dictionary.
• Errors are easy to understand.
8. 8
Input: Product Titles and their Genres
→ Morphological Analysis
• Tokenization
• PoS Tagging
→ Extraction (uses the Brand Dictionary)
• List tokens matched with the dictionary entries.
• Extract the leftmost candidate.
→ Normalization (uses the Synonym Dictionary)
• Retrieve brand IDs corresponding to the extracted brands.
Output: Input Data with Brand IDs
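The extraction step above (list the tokens that match dictionary entries, then keep the leftmost one) can be sketched as follows. This is a minimal illustration, not the production system: whitespace splitting stands in for morphological analysis, only single-token brands are handled, and `extract_leftmost_brand` and the toy dictionary are hypothetical names.

```python
from typing import Iterable, Optional, Set


def extract_leftmost_brand(tokens: Iterable[str], brand_dictionary: Set[str]) -> Optional[str]:
    """Return the leftmost token that appears in the brand dictionary, or None."""
    # Step 1: list every token that matches a dictionary entry, in title order.
    matches = [token for token in tokens if token in brand_dictionary]
    # Step 2: keep the candidate furthest to the left.
    return matches[0] if matches else None


if __name__ == "__main__":
    toy_dictionary = {"NIKE", "PUMA", "SONY"}  # assumed entries for illustration
    title = "SONY headphones for NIKE runners"
    print(extract_leftmost_brand(title.split(), toy_dictionary))  # prints "SONY"
```

A real pipeline would tokenize Japanese titles with a morphological analyzer and would also need to match multi-token expressions such as AKI WORLD.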
9. 9
Brand expression  Relevant Genre
力王  Unknown
中部電磁器工業  Computers & Networking
キメラパーク  Unknown
シュガーローズ  Women's Clothing
サスクワッチファブリックス  Women's Clothing
藤栄  Home Decor, Housewares & Furniture
ミキモト  Unknown
エドウィンゴルフ  Sports & Outdoors
AKI WORLD  Sports & Outdoors
工房飛竜  Toys, Hobbies & Games
パーカー  Home & Office Supplies
ハイライトキャバレー  Men's Clothing
杉野  Unknown
カウネット  Kitchen, Dining & Bar
The dictionary stores brand expressions together with their relevant genres.
• 190K entries
Relevant genres are critical for disambiguation.
• Use only brand expressions whose relevant genre is the same as the genre of the given product.
• For example, retrieve パーカー only for products in Home & Office Supplies.
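The genre-based disambiguation described above can be sketched as a lookup that accepts a brand expression only when the product's genre is among the expression's relevant genres. The entries below are copied from the table on this slide; the function name `brands_for_genre` is hypothetical, and entries whose genre is "Unknown" are left out because the slides do not say how they are treated.

```python
from typing import Dict, List, Set

# A few entries from the slide: brand expression -> relevant genres.
BRAND_DICTIONARY: Dict[str, Set[str]] = {
    "パーカー": {"Home & Office Supplies"},
    "エドウィンゴルフ": {"Sports & Outdoors"},
    "シュガーローズ": {"Women's Clothing"},
}


def brands_for_genre(tokens: List[str], genre: str,
                     dictionary: Dict[str, Set[str]] = BRAND_DICTIONARY) -> List[str]:
    """Keep only tokens that are brand expressions valid for the given genre."""
    return [token for token in tokens if genre in dictionary.get(token, set())]


if __name__ == "__main__":
    tokens = ["パーカー", "ボールペン"]  # assumed output of morphological analysis
    print(brands_for_genre(tokens, "Home & Office Supplies"))  # ['パーカー']
    print(brands_for_genre(tokens, "Men's Clothing"))          # []
```

In Men's Clothing the same token is ignored, which keeps a hoodie listing from being tagged with the pen brand.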
11. 11
Inputs: Product data in ICHIBA, Brand dictionary
1. Create training data → Annotated text
2. Train and run machine learning models → Candidates
3. Assign new relevant genres to existing brands → New entries
4. Check manually
5. Update → Brand dictionary
13. 13
Synonym Dictionary
Genre           ID     Synonym
:               :      :
Shoes, Bags, …  B2449  NIKE, ナイキ
Electronics     B2450  SONY, ソニー
:               :      :
Extraction Results
Genre     Product  Brand ID  Label
Shoes     ナイキ   B2449     NIKE
Shoes     NIKE     B2449     NIKE
Bags      NIKE     B2449     NIKE
Interior  ナイキ   --        ナイキ
Both shoe images are designed by Freepik (http://www.freepik.com/free-vector/coloured-tennis-collection_972296.htm)
The bag image is designed by Freepik (http://www.freepik.com/free-vector/gym-icons_788034.htm)
The office chair image is designed by Freepik (http://www.freepik.com/free-vector/office-chair-three-colors_983088.htm)
14. 14
Synonym Dictionary
Genre           ID     Synonym
:               :      :
Shoes, Bags, …  B2449  NIKE, ナイキ
Electronics     B2450  SONY, ソニー
:               :      :
Extraction Results
Genre     Product  Brand ID
Shoes     ナイキ   B2449
Shoes     NIKE     B2449
Bags      NIKE     B2449
Interior  ナイキ   --
15. 15
Synonym Dictionary
Genre           ID     Synonym
:               :      :
Shoes, Bags, …  B2449  NIKE, ナイキ
Electronics     B2450  SONY, ソニー
:               :      :
Extraction Results
Genre     Product  Brand ID
Shoes     ナイキ   B2449
Shoes     NIKE     B2449
Bags      NIKE     B2449
Interior  ナイキ   B3510
NAIKI Co.,LTD.: http://www.naiki.co.jp/index.html
16. 16
Synonym Dictionary
Genre           ID     Synonym
:               :      :
Shoes, Bags, …  B2449  NIKE, ナイキ
Electronics     B2450  SONY, ソニー
:               :      :
Extraction Results
Genre     Product  Brand ID
Shoes     ナイキ   B2449
Shoes     NIKE     B2449
Bags      NIKE     B2449
Interior  ナイキ   B3510
Information on when a synonym applies is important.
NAIKI Co.,LTD.: http://www.naiki.co.jp/index.html
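The normalization step shown on these slides can be sketched as a lookup keyed on both genre and surface form, so that the same expression maps to different brand IDs in different genres. This is an illustrative sketch: the rows mirror the slide (only the genres actually named are included, since the slide elides the rest), the pairing of B3510 with ナイキ in Interior follows the extraction result shown, and `build_index` and `normalize` are hypothetical names.

```python
from typing import Dict, List, Optional, Tuple

# Each row: (genres, brand ID, synonyms), mirroring the synonym dictionary on the slide.
SYNONYM_ROWS: List[Tuple[List[str], str, List[str]]] = [
    (["Shoes", "Bags"], "B2449", ["NIKE", "ナイキ"]),
    (["Electronics"], "B2450", ["SONY", "ソニー"]),
    (["Interior"], "B3510", ["ナイキ"]),  # NAIKI Co.,LTD. per the slide
]


def build_index(rows: List[Tuple[List[str], str, List[str]]]) -> Dict[Tuple[str, str], str]:
    """Expand rows into a (genre, expression) -> brand ID lookup table."""
    index: Dict[Tuple[str, str], str] = {}
    for genres, brand_id, synonyms in rows:
        for genre in genres:
            for synonym in synonyms:
                index[(genre, synonym)] = brand_id
    return index


def normalize(index: Dict[Tuple[str, str], str], genre: str, expression: str) -> Optional[str]:
    """Return the brand ID for an expression in a genre, or None if no entry applies."""
    return index.get((genre, expression))


if __name__ == "__main__":
    index = build_index(SYNONYM_ROWS)
    print(normalize(index, "Shoes", "ナイキ"))     # B2449
    print(normalize(index, "Interior", "ナイキ"))  # B3510
    print(normalize(index, "Interior", "SONY"))    # None
```

Keying on the genre as well as the expression is what lets ナイキ resolve to NIKE for shoes and to NAIKI for interior products.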
17. 17
Find candidates automatically, and then check them manually.
• JAN code
• Wikipedia
• Semantic similarity
206K triplets of <genre, brand ID, synonyms>
18. 18
Brands were manually assigned to 500 randomly selected product titles.
• Percentage of product titles that include a brand: 69.6% (348/500)
Performance
• Precision: 89.2% (224/251)
• Recall: 64.4% (224/348)
We can automatically extract correct brands for 100M of the 260M products!
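The figures on this slide can be reproduced from the raw counts. The sketch below recomputes precision and recall and then scales the share of correctly extracted titles in the sample up to the full catalog. That scaling step is an assumption about how the headline number could be reached; the slide does not state its derivation.

```python
SAMPLE_TITLES = 500        # randomly selected product titles
TITLES_WITH_BRAND = 348    # titles that actually contain a brand
EXTRACTED = 251            # titles for which the system output a brand
CORRECT = 224              # extracted brands that were correct
TOTAL_PRODUCTS = 260_000_000

precision = CORRECT / EXTRACTED        # correct / everything extracted
recall = CORRECT / TITLES_WITH_BRAND   # correct / everything that should be found

# Assumed projection: the sample's correct-extraction rate applied to all products.
projected_correct = TOTAL_PRODUCTS * CORRECT / SAMPLE_TITLES

if __name__ == "__main__":
    print(f"Precision: {precision:.1%}")                    # Precision: 89.2%
    print(f"Recall: {recall:.1%}")                          # Recall: 64.4%
    print(f"Projected: {projected_correct / 1e6:.0f}M products")  # Projected: 116M products
```

Under this reading the projection comes out near 116M, the same order of magnitude as the 100M quoted on the slide.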
20. 20
I ordered this a week ago, but no response from the store.
176,502 reviews
Stock / Information / Payment / Service / Package / Shipping
Snapshot of https://review.rakuten.co.jp/shop/4/261122_261122/cpmj-i0h5i-97x3lm_1_1/?l2-id=review_PC_sl_body_05 as of October 16th, 2017.
21. 21
• What aspects should we define?
• How do we build a system that performs this analysis?
Input: Merchant Review → Output: Aspect / Sentiment Polarity
s1: Item was nicely packaged. → Package / Pos
s2: A tracking # was given, but never worked. → Shipping / Neg
s3: Will shop again. → Repeat / Pos
The robot image is designed by Freepik (http://www.freepik.com/free-vector/cute-robots-collection_713858.htm)
22. 22
#  Aspect  Example
1  配送 (Shipping)  迅速な配送ありがとうございました。 (Thank you for the quick shipping.)
2  対応 (Service)  今まで買い物した店舗で一番対応が遅かった。 (I’ve never seen such slow service!)
3  連絡 (Communication)  注文受付の自動送信メールが届いたきり一週間何の連絡もなし。 (No contact for a week after ordering it.)
4  店舗 (Shop)  信頼できるショップ様でした。 (They are a reliable store.)
5  商品 (Item)  安全に使用できそうで、これからが楽しみです。 (I’m looking forward to using this product.)
6  リピート (Repeat)  また利用したいと思います。 (I’m going to purchase an item again.)
7  梱包 (Package)  梱包も破損のないよう、しっかりとされていました。 (It was tightly packaged to prevent damage.)
23. 23
#  Aspect  Example
8  品揃え (Stock/variety)  商品が多いので助かります。 (They have a big inventory.)
9  情報 (Information)  マネキンの身長を記載してあったのでかなり参考になりました。 (The description about the height of a mannequin is very useful.)
10  キャンセル/返品 (Cancel/return)  しかしたまに断りなく遅れたりキャンセルされている点に不満です。 (I’m not satisfied because they suddenly canceled without any notification.)
11  価格 (Price)  商品が安く、購入でき、まんぞくです。 (I’m satisfied with purchasing the item at a low price.)
12  楽天 (Rakuten)  楽天の全サービスに信用がなくなりました。 (Because of this experience, I can’t trust any services in Rakuten.)
13  支払い (Payment)  決済方法にEdyが使える方がよいと思います。 (It would be better if Rakuten Edy were acceptable.)
14  その他 (Other)  レビューがもう少し増えるといいですね。 (I hope the number of reviews increases.)
25. 25
Annotated 1,510 reviews (5,277 sentences)
• 配送も迅速で良かったです。 (I was very pleased at how quickly I received it.) → Shipping/Positive
• いつになっても商品が来ず、問い合わせても返信がない。 (No shipment, no reply to inquiry.) → Shipping/Negative, Communication/Negative
Cost: 103 hours of work for a well-trained annotator
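The annotation cost above can be broken down per sentence with a little arithmetic. The sketch below uses only the three numbers on the slide and assumes the 103 hours cover all 5,277 sentences; the variable names are illustrative.

```python
REVIEWS = 1_510
SENTENCES = 5_277
ANNOTATION_HOURS = 103

# Average time the annotator spent on each sentence, in seconds.
seconds_per_sentence = ANNOTATION_HOURS * 3600 / SENTENCES
# Average review length in sentences.
sentences_per_review = SENTENCES / REVIEWS

if __name__ == "__main__":
    print(f"{seconds_per_sentence:.0f} s per sentence")          # 70 s per sentence
    print(f"{sentences_per_review:.1f} sentences per review")    # 3.5 sentences per review
```

At roughly a minute per sentence, manual labeling does not scale to the full review collection, which is the motivation for training a model on this annotated sample.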
26. 26
Train models using the passive-aggressive algorithm and a CRF.
Features:
• Bag-of-words, aspect dictionary, sentiment polarity dictionary, and syntactic information.
Performance
• Aspect classification
  • Precision: 82.6%, Recall: 46.8%
• Sentiment classification
  • Precision: 84.8%, Recall: 77.5%
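To make the passive-aggressive algorithm mentioned above concrete, here is a minimal from-scratch sketch of the basic binary variant with bag-of-words features only. It is not the system described in the talk: the English toy sentences, whitespace tokenization, and all function names are assumptions for illustration, and the dictionary and syntactic features as well as the CRF are omitted.

```python
from collections import Counter, defaultdict

# Toy training data (assumed): +1 = positive sentiment, -1 = negative sentiment.
TOY_REVIEWS = [
    ("quick shipping thank you", 1),
    ("shipping was slow", -1),
    ("item nicely packaged", 1),
    ("tracking number never worked", -1),
]


def featurize(sentence):
    """Bag-of-words counts over lowercase whitespace tokens."""
    return Counter(sentence.lower().split())


def score(weights, features):
    """Dot product between the weight vector and a sparse feature vector."""
    return sum(weights[name] * value for name, value in features.items())


def train_passive_aggressive(examples, epochs=5):
    """Basic PA update: stay passive if the margin is at least 1, else correct just enough."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for sentence, label in examples:
            x = featurize(sentence)
            loss = max(0.0, 1.0 - label * score(weights, x))  # hinge loss
            squared_norm = sum(value * value for value in x.values())
            if loss > 0.0 and squared_norm > 0.0:
                tau = loss / squared_norm  # smallest step that reaches margin 1
                for name, value in x.items():
                    weights[name] += tau * label * value
    return weights


def predict(weights, sentence):
    """Return +1 or -1 according to the sign of the score."""
    return 1 if score(weights, featurize(sentence)) >= 0 else -1


if __name__ == "__main__":
    model = train_passive_aggressive(TOY_REVIEWS)
    print(predict(model, "slow shipping"))  # prints -1
```

The update changes the weights only when an example violates the margin, which makes the method cheap to train online on sparse text features.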
27. 27
• It is important to develop techniques that automatically pull valuable information from Big Data for Text.
  • e.g., users’ experiences described in reviews
• Rakuten develops techniques in-house to exploit Big Data for Text in its services.
  • Information extraction from product descriptions
  • Sentiment analysis on reviews of merchants