Organizing Big Data for Text in Rakuten
October 28, 2017
Keiji Shinzato
Rakuten Institute of Technology
Rakuten, Inc.
2
Text in Rakuten
• Search Queries
• Reviews from Users
• Product Descriptions
• Etc.
-> Understanding ->
Valuable Information
• User Interest
• User Experience
• Product Features
• Etc.
3
The number of products has risen 2.6 times compared to five years ago.
• 100M products in 2012 (1) to 258M products in 2017 (2)
How much time would you need to read the descriptions of 258M products?
A. 4 years
B. 40 years
C. 400 years
D. 4,000 years
1) https://corp.rakuten.co.jp/about/history.html
2) https://www.rakuten.co.jp/ (as of October 2nd, 2017)
Technology to organize big data for text is critical!
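To get a feel for the scale behind the quiz, each option can be converted into the reading speed it would imply. This is a rough sketch, assuming nonstop reading and 365-day years; the function name is illustrative, and the slides do not state which option is correct.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: nonstop reading, no leap years


def seconds_per_description(years: float, n_products: int) -> float:
    """Average seconds available per description if all are read within `years`."""
    return years * SECONDS_PER_YEAR / n_products


N_PRODUCTS = 258_000_000  # product count quoted on the slide (2017)

for label, years in [("A", 4), ("B", 40), ("C", 400), ("D", 4000)]:
    secs = seconds_per_description(years, N_PRODUCTS)
    print(f"{label}. {years:>5} years -> {secs:8.2f} s per description")
```

Whichever option is chosen, reading at this scale is far beyond what a person can do, which is the point of the slide.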
5
• Information Extraction from Product Data
• Sentiment Analysis on Review Data
6
Application
Unstructured Data
Crafted from sleek spazzolato leather (black), the Dorian shopper is an elegant carryall that's perfect for your essentials. 10"H x 13"L x 6"D. RALPH LAUREN
Structured Data
Attribute | Value
Brand | Ralph Lauren
Color | Black
Material | Leather
Size | 10"H x 13"L x 6"D
Faceted Navigation / Recommendation / Market Research
The bag image is designed by Freepik (http://www.freepik.com/free-vector/set-of-woman-s-bags-in-flat-style_960523.htm)
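The step from unstructured to structured data can be illustrated with a small sketch. It assumes tiny hand-made dictionaries and a regular expression for the size format; the names BRANDS, COLORS, MATERIALS, and extract_attributes are hypothetical and are not taken from the slides.

```python
import re
from typing import Dict, Optional

# Hypothetical mini-dictionaries; a production system would hold far more entries.
BRANDS = {"ralph lauren": "Ralph Lauren"}
COLORS = {"black": "Black", "white": "White"}
MATERIALS = {"leather": "Leather", "cotton": "Cotton"}

# Matches sizes written like: 10"H x 13"L x 6"D
SIZE_PATTERN = re.compile(r'\d+"H x \d+"L x \d+"D')


def first_match(text: str, dictionary: Dict[str, str]) -> Optional[str]:
    """Return the normalized value of the dictionary key that occurs earliest in text.

    Plain substring matching is used here; word boundaries are not enforced.
    """
    lowered = text.lower()
    hits = [(lowered.find(key), value) for key, value in dictionary.items() if key in lowered]
    return min(hits)[1] if hits else None


def extract_attributes(description: str) -> Dict[str, Optional[str]]:
    """Turn a free-text description into an attribute/value record."""
    size = SIZE_PATTERN.search(description)
    return {
        "Brand": first_match(description, BRANDS),
        "Color": first_match(description, COLORS),
        "Material": first_match(description, MATERIALS),
        "Size": size.group(0) if size else None,
    }


description = (
    "Crafted from sleek spazzolato leather (black), the Dorian shopper is an "
    "elegant carryall that's perfect for your essentials. "
    '10"H x 13"L x 6"D. RALPH LAUREN'
)
print(extract_attributes(description))
```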
7
Difficulty
• Ambiguity
  • パーカー (Parker, the luxury pen brand, or a hoodie), PUMA (a sports brand or a knife brand)
• Diversity (long tail)
  • 風と光 (Kaze to Hikari, a natural foods company)
Dictionary-based approach
• System behavior is easy to control by editing entries in the dictionary.
• Errors are easy to understand.
8
Input: Product Titles and their Genres
1. Morphological Analysis
  • Tokenization
  • PoS Tagging
2. Extraction (uses the Brand Dictionary)
  • List tokens matched with the dictionary entries.
  • Extract the leftmost candidate.
3. Normalization (uses the Synonym Dictionary)
  • Retrieve brand IDs corresponding to extracted brands.
Output: Input Data with Brand IDs
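The extraction step can be sketched as follows. This is only an illustration: whitespace splitting stands in for real morphological analysis (Japanese titles would need a proper analyzer), the dictionary contents are made up, and breaking ties in favor of the longer span is an assumption rather than something the slides state.

```python
from typing import List, Optional, Tuple

# Hypothetical brand dictionary: each entry is a tuple of lowercase tokens.
BRAND_DICTIONARY = {("ralph", "lauren"), ("nike",), ("puma",)}
MAX_ENTRY_LENGTH = max(len(entry) for entry in BRAND_DICTIONARY)


def tokenize(title: str) -> List[str]:
    """Stand-in for morphological analysis: lowercase whitespace split."""
    return title.lower().split()


def list_candidates(tokens: List[str]) -> List[Tuple[int, Tuple[str, ...]]]:
    """Return (start_index, entry) for every token span found in the dictionary."""
    candidates = []
    for start in range(len(tokens)):
        for length in range(1, MAX_ENTRY_LENGTH + 1):
            span = tuple(tokens[start:start + length])
            if len(span) == length and span in BRAND_DICTIONARY:
                candidates.append((start, span))
    return candidates


def extract_brand(title: str) -> Optional[str]:
    """Pick the leftmost candidate; ties go to the longer span (an assumption)."""
    candidates = list_candidates(tokenize(title))
    if not candidates:
        return None
    _, span = min(candidates, key=lambda c: (c[0], -len(c[1])))
    return " ".join(span)


print(extract_brand("Ralph Lauren Dorian shopper bag"))
print(extract_brand("Running shoes NIKE with PUMA laces"))
print(extract_brand("Plain cotton tote"))
```

Preferring the leftmost match reflects the rule on the slide; a title that mentions a second brand later on does not override the first one.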
9
Brand expression | Relevant Genre
力王 | Unknown
中部電磁器工業 | Computers & Networking
キメラパーク | Unknown
シュガーローズ | Women's Clothing
サスクワッチファブリックス | Women's Clothing
藤栄 | Home Decor, Housewares & Furniture
ミキモト | Unknown
エドウィンゴルフ | Sports & Outdoors
AKI WORLD | Sports & Outdoors
工房飛竜 | Toys, Hobbies & Games
パーカー | Home & Office Supplies
ハイライトキャバレー | Men's Clothing
杉野 | Unknown
カウネット | Kitchen, Dining & Bar
Brand expressions are stored together with their relevant genres.
• 190K entries
Relevant genres are critical for disambiguation.
• Use only brand expressions whose relevant genre matches the genre of the given product.
• For example, retrieve パーカー only for products in Home & Office Supplies.
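Genre-based disambiguation can be sketched as a lookup that accepts a match only when the product's genre is among the entry's relevant genres. The dictionary slice and the function name below are hypothetical, and how entries with an "Unknown" genre are treated is not described in the slides, so they are left out.

```python
from typing import Dict, Optional, Set

# Hypothetical slice of the brand dictionary: expression -> relevant genres.
BRAND_GENRES: Dict[str, Set[str]] = {
    "パーカー": {"Home & Office Supplies"},
    "エドウィンゴルフ": {"Sports & Outdoors"},
    "シュガーローズ": {"Women's Clothing"},
}


def match_brand(token: str, product_genre: str) -> Optional[str]:
    """Return the token as a brand only if its relevant genre matches the product's genre."""
    genres = BRAND_GENRES.get(token)
    if genres is not None and product_genre in genres:
        return token
    return None


# パーカー is accepted as the pen brand in office supplies ...
print(match_brand("パーカー", "Home & Office Supplies"))
# ... but ignored in clothing, where it most likely means a hoodie.
print(match_brand("パーカー", "Men's Clothing"))
```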
10
Screenshot of https://item.rakuten.co.jp/brandol-ec/gu-295710-j8400-8106/ as of October 10th, 2017.
11
Inputs: Brand dictionary, Product data in ICHIBA
1. Create training data -> Annotated text
2. Train and run machine learning models -> Candidates
3. Assign new relevant genres to existing brands
4. Check manually -> New entries
5. Update the brand dictionary
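The slides do not say how the training data in step 1 is created. One common option, shown here purely as an assumption, is to project existing dictionary entries onto tokenized product titles and emit BIO tags that a sequence model can learn from. The function name and the tag names are illustrative.

```python
from typing import List, Set, Tuple


def bio_tags(tokens: List[str], brands: Set[Tuple[str, ...]]) -> List[str]:
    """Tag tokens with B-BRAND / I-BRAND / O using greedy longest dictionary match."""
    tags = ["O"] * len(tokens)
    max_len = max((len(b) for b in brands), default=0)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest possible span first so multi-token brands win.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + length]) in brands:
                matched = length
                break
        if matched:
            tags[i] = "B-BRAND"
            for j in range(i + 1, i + matched):
                tags[j] = "I-BRAND"
            i += matched
        else:
            i += 1
    return tags


brands = {("ralph", "lauren"), ("nike",)}
tokens = ["ralph", "lauren", "leather", "tote"]
print(list(zip(tokens, bio_tags(tokens, brands))))
```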
13
Synonym Dictionary
Genre | ID | Synonym
: | : | :
Shoes, Bags, … | B2449 | NIKE, ナイキ
Electronics | B2450 | SONY, ソニー
: | : | :
Extraction Results
Genre | Product | Brand ID | Label
Shoes | ナイキ | B2449 | NIKE
Shoes | NIKE | B2449 | NIKE
Bags | NIKE | B2449 | NIKE
Interior | ナイキ | -- | ナイキ
Both shoe images are designed by Freepik (http://www.freepik.com/free-vector/coloured-tennis-collection_972296.htm)
The bag image is designed by Freepik (http://www.freepik.com/free-vector/gym-icons_788034.htm)
The office chair image is designed by Freepik (http://www.freepik.com/free-vector/office-chair-three-colors_983088.htm)
16
Synonym Dictionary
Genre | ID | Synonym
: | : | :
Shoes, Bags, … | B2449 | NIKE, ナイキ
Electronics | B2450 | SONY, ソニー
: | : | :
Extraction Results
Genre | Product | Brand ID
Shoes | ナイキ | B2449
Shoes | NIKE | B2449
Bags | NIKE | B2449
Interior | ナイキ | B3510
NAIKI Co.,LTD.: http://www.naiki.co.jp/index.html
Information about when each entry can be used is important.
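Genre-aware normalization can be sketched as a lookup keyed by the pair (genre, expression). The entries mirror the slide's example, including B3510 for ナイキ in the Interior genre; the flat key layout and the function name are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

# Keys are (genre, brand expression); values are brand IDs as shown on the slide.
SYNONYM_DICTIONARY: Dict[Tuple[str, str], str] = {
    ("Shoes", "NIKE"): "B2449",
    ("Shoes", "ナイキ"): "B2449",
    ("Bags", "NIKE"): "B2449",
    ("Bags", "ナイキ"): "B2449",
    ("Electronics", "SONY"): "B2450",
    ("Electronics", "ソニー"): "B2450",
    ("Interior", "ナイキ"): "B3510",  # NAIKI Co.,LTD., not NIKE
}


def normalize(genre: str, expression: str) -> Optional[str]:
    """Map an extracted brand expression to a brand ID, taking the genre into account."""
    return SYNONYM_DICTIONARY.get((genre, expression))


print(normalize("Shoes", "ナイキ"))     # same surface form ...
print(normalize("Interior", "ナイキ"))  # ... resolves to a different brand ID here
```

Keying on the genre is what lets the same surface form ナイキ resolve to two different brands.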
17
Find candidates automatically, and then check them manually.
• JAN code
• Wikipedia
• Semantic similarity
206K triplets of <genre, brand id, synonyms>
18
Manually assign brands to 500 randomly selected product titles
• Percent of product titles including brands: 69.6% (348/500)
Performance
• Precision: 89.2% (224/251)
• Recall: 64.4% (224/348)
This means we can automatically extract correct brands for about 100M of the 260M products!
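The figures above follow directly from the reported counts. The sketch below recomputes them and extrapolates to the full catalog, assuming the 500-title sample is representative of all 260M products; the variable names are illustrative.

```python
def ratio_percent(numerator: int, denominator: int) -> float:
    """Express a ratio as a percentage."""
    return 100.0 * numerator / denominator


# Counts reported on the slide.
correct, extracted, with_brand, sampled = 224, 251, 348, 500

precision = ratio_percent(correct, extracted)  # correct / all extracted brands
recall = ratio_percent(correct, with_brand)    # correct / titles that contain a brand
coverage = correct / sampled                   # share of all sampled titles extracted correctly

# Assumption: the sample is representative of the whole catalog.
estimated_products = coverage * 260_000_000

print(f"Precision: {precision:.1f}%")
print(f"Recall: {recall:.1f}%")
print(f"Estimated correctly labeled products: {estimated_products / 1e6:.0f}M")
```

Under that assumption the estimate comes out at roughly 116M, which is in line with the "about 100M" quoted on the slide.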
19
• Information Extraction from Product Data
• Sentiment Analysis on Review Data
20
"I ordered this a week ago, but no response from the store."
176,502 reviews
Aspects: Stock, Information, Payment, Service, Package, Shipping
Snapshot of https://review.rakuten.co.jp/shop/4/261122_261122/cpmj-i0h5i-97x3lm_1_1/?l2-id=review_PC_sl_body_05 as of October 16th, 2017.
21
• What aspects should we design?
• How do we build a system that performs this analysis?
Input: Merchant Review -> Output: Aspect / Sentiment Polarity
s1: Item was nicely packaged. -> Package / Pos
s2: A tracking # was given, but never worked. -> Shipping / Neg
s3: Will shop again. -> Repeat / Pos
The robot image is designed by Freepik (http://www.freepik.com/free-vector/cute-robots-collection_713858.htm)
22
# | Aspect | Example
1 | 配送 (Shipping) | 迅速な配送ありがとうございました。 (Thank you for the quick shipping.)
2 | 対応 (Service) | 今まで買い物した店舗で一番対応が遅かった。 (I've never seen such slow service!)
3 | 連絡 (Communication) | 注文受付の自動送信メールが届いたきり一週間何の連絡もなし。 (No contact for a week after ordering it.)
4 | 店舗 (Shop) | 信頼できるショップ様でした。 (They are a reliable store.)
5 | 商品 (Item) | 安全に使用できそうで、これからが楽しみです。 (I'm looking forward to using this product.)
6 | リピート (Repeat) | また利用したいと思います。 (I'm going to purchase an item again.)
7 | 梱包 (Package) | 梱包も破損のないよう、しっかりとされていました。 (It was tightly packaged to prevent damage.)
23
# | Aspect | Example
8 | 品揃え (Stock/variety) | 商品が多いので助かります。 (They have a big inventory.)
9 | 情報 (Information) | マネキンの身長を記載してあったのでかなり参考になりました。 (The description about the height of a mannequin is very useful.)
10 | キャンセル/返品 (Cancel/return) | しかしたまに断りなく遅れたりキャンセルされている点に不満です。 (I'm not satisfied because they suddenly canceled without any notification.)
11 | 価格 (Price) | 商品が安く、購入でき、まんぞくです。 (I'm satisfied with purchasing the item at a low price.)
12 | 楽天 (Rakuten) | 楽天の全サービスに信用がなくなりました。 (Because of this experience, I can't trust any services in Rakuten.)
13 | 支払い (Payment) | 決済方法にEdyが使える方がよいと思います。 (It would be better if Rakuten Edy were acceptable.)
14 | その他 (Other) | レビューがもう少し増えるといいですね。 (I hope the number of reviews increases.)
25
Annotated 1,510 reviews (5,277 sentences)
• 配送も迅速で良かったです。
(I was very pleased at how quickly I received it.)
-> Shipping/Positive
• いつになっても商品が来ず、問い合わせても返信がない。
(No shipment, no reply to inquiry.)
-> Shipping/Negative, Communication/Negative
Annotation cost: 103 hours of work by a well-trained annotator
26
Models are trained with the passive-aggressive algorithm and conditional random fields (CRF).
Features:
• Bag-of-words, aspect dictionary, sentiment polarity dictionary, and syntactic information.
Performance
• Aspect classification
  • Precision: 82.6%, Recall: 46.8%
• Sentiment classification
  • Precision: 84.8%, Recall: 77.5%
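To make the passive-aggressive part concrete, here is a minimal sketch of a binary PA-I classifier over bag-of-words features. It is not the system described in the slides: it covers only positive/negative polarity, leaves out the dictionary and syntactic features and the CRF, and uses a made-up toy training set.

```python
from collections import Counter
from typing import Dict, List, Tuple


def bag_of_words(sentence: str) -> Dict[str, float]:
    """Bag-of-words features from a lowercase whitespace split (a simplification)."""
    return dict(Counter(sentence.lower().split()))


class PassiveAggressive:
    """Binary PA-I classifier: labels are +1 / -1, and C caps the step size."""

    def __init__(self, C: float = 1.0) -> None:
        self.C = C
        self.weights: Dict[str, float] = {}

    def score(self, features: Dict[str, float]) -> float:
        return sum(self.weights.get(name, 0.0) * value for name, value in features.items())

    def predict(self, features: Dict[str, float]) -> int:
        return 1 if self.score(features) >= 0 else -1

    def update(self, features: Dict[str, float], label: int) -> None:
        loss = max(0.0, 1.0 - label * self.score(features))  # hinge loss
        norm_sq = sum(value * value for value in features.values())
        if loss == 0.0 or norm_sq == 0.0:
            return  # passive: the margin is already satisfied
        tau = min(self.C, loss / norm_sq)  # aggressive: step just large enough, capped by C
        for name, value in features.items():
            self.weights[name] = self.weights.get(name, 0.0) + tau * label * value

    def fit(self, data: List[Tuple[str, int]], epochs: int = 5) -> None:
        for _ in range(epochs):
            for sentence, label in data:
                self.update(bag_of_words(sentence), label)


# Toy training set (hypothetical): +1 = positive, -1 = negative.
training_data = [
    ("thank you for the quick shipping", 1),
    ("item was nicely packaged", 1),
    ("no reply and slow service", -1),
    ("tracking number never worked", -1),
]

model = PassiveAggressive(C=1.0)
model.fit(training_data)
print(model.predict(bag_of_words("quick shipping and nicely packaged")))
print(model.predict(bag_of_words("slow service and no reply")))
```

The update leaves the weights untouched when an example is already classified with enough margin, and otherwise moves them just far enough to correct it, which is where the name comes from.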
27
• It is important to develop techniques that automatically extract valuable information from Big Data for Text.
  • e.g., reviews -> users' experience
• Rakuten develops techniques in-house to exploit Big Data for Text in its services.
  • Information extraction from product descriptions
  • Sentiment analysis on reviews of merchants
Organizing Big Data for Text in Rakuten

More Related Content

Viewers also liked

トラブルシューティングのあれこれ Yoshihiko kamata
トラブルシューティングのあれこれ Yoshihiko kamataトラブルシューティングのあれこれ Yoshihiko kamata
トラブルシューティングのあれこれ Yoshihiko kamataRakuten Group, Inc.
 
Rakutenとsreと私 yanagimoto koichi
Rakutenとsreと私 yanagimoto koichiRakutenとsreと私 yanagimoto koichi
Rakutenとsreと私 yanagimoto koichiRakuten Group, Inc.
 
What i learned from translation of the sre ryuji tamagawa
What i learned from translation of the sre ryuji tamagawaWhat i learned from translation of the sre ryuji tamagawa
What i learned from translation of the sre ryuji tamagawaRakuten Group, Inc.
 
AI AND FUNDAMENTAL GAME TECHNOLOGIESIN FINAL FANTASY XV
AI AND FUNDAMENTAL GAME TECHNOLOGIESIN FINAL FANTASY XVAI AND FUNDAMENTAL GAME TECHNOLOGIESIN FINAL FANTASY XV
AI AND FUNDAMENTAL GAME TECHNOLOGIESIN FINAL FANTASY XVRakuten Group, Inc.
 
Value Delivery through RakutenBig Data Intelligence Ecosystem and Technology
Value Delivery through RakutenBig Data Intelligence Ecosystem  and  TechnologyValue Delivery through RakutenBig Data Intelligence Ecosystem  and  Technology
Value Delivery through RakutenBig Data Intelligence Ecosystem and TechnologyRakuten Group, Inc.
 
時間がないといって、オペレーション改善を怠るな~オペレーション改善奮闘記~ Emi muroya
時間がないといって、オペレーション改善を怠るな~オペレーション改善奮闘記~ Emi muroya時間がないといって、オペレーション改善を怠るな~オペレーション改善奮闘記~ Emi muroya
時間がないといって、オペレーション改善を怠るな~オペレーション改善奮闘記~ Emi muroyaRakuten Group, Inc.
 
Life of an enginner in rakuten osaka diarmaid lindsay
Life of an enginner in rakuten osaka diarmaid lindsayLife of an enginner in rakuten osaka diarmaid lindsay
Life of an enginner in rakuten osaka diarmaid lindsayRakuten Group, Inc.
 
Java ee7 with apache spark for the world's largest credit card core systems, ...
Java ee7 with apache spark for the world's largest credit card core systems, ...Java ee7 with apache spark for the world's largest credit card core systems, ...
Java ee7 with apache spark for the world's largest credit card core systems, ...Rakuten Group, Inc.
 
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platformcloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data PlatformRakuten Group, Inc.
 
RTC 2017 - The Power of Parallelism
RTC 2017 - The Power of ParallelismRTC 2017 - The Power of Parallelism
RTC 2017 - The Power of ParallelismRakuten Group, Inc.
 
Building your own static site Using Hugo
Building your own static site Using HugoBuilding your own static site Using Hugo
Building your own static site Using HugoRakuten Group, Inc.
 
Change the engineer life by batch system renewal
Change the engineer life by batch system renewalChange the engineer life by batch system renewal
Change the engineer life by batch system renewalRakuten Group, Inc.
 
Artificial Intelligence for Happiness of People
Artificial Intelligence for Happiness of PeopleArtificial Intelligence for Happiness of People
Artificial Intelligence for Happiness of PeopleRakuten Group, Inc.
 
Find it! Nail it! Boosting e-commerce search conversions with machine learnin...
Find it! Nail it!Boosting e-commerce search conversions with machine learnin...Find it! Nail it!Boosting e-commerce search conversions with machine learnin...
Find it! Nail it! Boosting e-commerce search conversions with machine learnin...Rakuten Group, Inc.
 
Deep learning for e-commerce: current status and future prospects
Deep learning for e-commerce: current status and future prospectsDeep learning for e-commerce: current status and future prospects
Deep learning for e-commerce: current status and future prospectsRakuten Group, Inc.
 

Viewers also liked (20)

トラブルシューティングのあれこれ Yoshihiko kamata
トラブルシューティングのあれこれ Yoshihiko kamataトラブルシューティングのあれこれ Yoshihiko kamata
トラブルシューティングのあれこれ Yoshihiko kamata
 
Rakutenとsreと私 yanagimoto koichi
Rakutenとsreと私 yanagimoto koichiRakutenとsreと私 yanagimoto koichi
Rakutenとsreと私 yanagimoto koichi
 
What i learned from translation of the sre ryuji tamagawa
What i learned from translation of the sre ryuji tamagawaWhat i learned from translation of the sre ryuji tamagawa
What i learned from translation of the sre ryuji tamagawa
 
AI AND FUNDAMENTAL GAME TECHNOLOGIESIN FINAL FANTASY XV
AI AND FUNDAMENTAL GAME TECHNOLOGIESIN FINAL FANTASY XVAI AND FUNDAMENTAL GAME TECHNOLOGIESIN FINAL FANTASY XV
AI AND FUNDAMENTAL GAME TECHNOLOGIESIN FINAL FANTASY XV
 
COBOL to Apache Spark
COBOL to Apache SparkCOBOL to Apache Spark
COBOL to Apache Spark
 
Value Delivery through RakutenBig Data Intelligence Ecosystem and Technology
Value Delivery through RakutenBig Data Intelligence Ecosystem  and  TechnologyValue Delivery through RakutenBig Data Intelligence Ecosystem  and  Technology
Value Delivery through RakutenBig Data Intelligence Ecosystem and Technology
 
One Hundred Languages
One Hundred LanguagesOne Hundred Languages
One Hundred Languages
 
時間がないといって、オペレーション改善を怠るな~オペレーション改善奮闘記~ Emi muroya
時間がないといって、オペレーション改善を怠るな~オペレーション改善奮闘記~ Emi muroya時間がないといって、オペレーション改善を怠るな~オペレーション改善奮闘記~ Emi muroya
時間がないといって、オペレーション改善を怠るな~オペレーション改善奮闘記~ Emi muroya
 
Don't manage too hard!
Don't manage too hard! Don't manage too hard!
Don't manage too hard!
 
Life of an enginner in rakuten osaka diarmaid lindsay
Life of an enginner in rakuten osaka diarmaid lindsayLife of an enginner in rakuten osaka diarmaid lindsay
Life of an enginner in rakuten osaka diarmaid lindsay
 
Java ee7 with apache spark for the world's largest credit card core systems, ...
Java ee7 with apache spark for the world's largest credit card core systems, ...Java ee7 with apache spark for the world's largest credit card core systems, ...
Java ee7 with apache spark for the world's largest credit card core systems, ...
 
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platformcloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
 
RTC 2017 - The Power of Parallelism
RTC 2017 - The Power of ParallelismRTC 2017 - The Power of Parallelism
RTC 2017 - The Power of Parallelism
 
Building your own static site Using Hugo
Building your own static site Using HugoBuilding your own static site Using Hugo
Building your own static site Using Hugo
 
Realizing AI Conversational Bot
Realizing AI Conversational BotRealizing AI Conversational Bot
Realizing AI Conversational Bot
 
Change the engineer life by batch system renewal
Change the engineer life by batch system renewalChange the engineer life by batch system renewal
Change the engineer life by batch system renewal
 
Artificial Intelligence for Happiness of People
Artificial Intelligence for Happiness of PeopleArtificial Intelligence for Happiness of People
Artificial Intelligence for Happiness of People
 
Find it! Nail it! Boosting e-commerce search conversions with machine learnin...
Find it! Nail it!Boosting e-commerce search conversions with machine learnin...Find it! Nail it!Boosting e-commerce search conversions with machine learnin...
Find it! Nail it! Boosting e-commerce search conversions with machine learnin...
 
Deep learning for e-commerce: current status and future prospects
Deep learning for e-commerce: current status and future prospectsDeep learning for e-commerce: current status and future prospects
Deep learning for e-commerce: current status and future prospects
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural Networks
 

More from Rakuten Group, Inc.

コードレビュー改善のためにJenkinsとIntelliJ IDEAのプラグインを自作してみた話
コードレビュー改善のためにJenkinsとIntelliJ IDEAのプラグインを自作してみた話コードレビュー改善のためにJenkinsとIntelliJ IDEAのプラグインを自作してみた話
コードレビュー改善のためにJenkinsとIntelliJ IDEAのプラグインを自作してみた話Rakuten Group, Inc.
 
楽天における安全な秘匿情報管理への道のり
楽天における安全な秘匿情報管理への道のり楽天における安全な秘匿情報管理への道のり
楽天における安全な秘匿情報管理への道のりRakuten Group, Inc.
 
Simple and Effective Knowledge-Driven Query Expansion for QA-Based Product At...
Simple and Effective Knowledge-Driven Query Expansion for QA-Based Product At...Simple and Effective Knowledge-Driven Query Expansion for QA-Based Product At...
Simple and Effective Knowledge-Driven Query Expansion for QA-Based Product At...Rakuten Group, Inc.
 
DataSkillCultureを浸透させる楽天の取り組み
DataSkillCultureを浸透させる楽天の取り組みDataSkillCultureを浸透させる楽天の取り組み
DataSkillCultureを浸透させる楽天の取り組みRakuten Group, Inc.
 
大規模なリアルタイム監視の導入と展開
大規模なリアルタイム監視の導入と展開大規模なリアルタイム監視の導入と展開
大規模なリアルタイム監視の導入と展開Rakuten Group, Inc.
 
楽天における大規模データベースの運用
楽天における大規模データベースの運用楽天における大規模データベースの運用
楽天における大規模データベースの運用Rakuten Group, Inc.
 
楽天サービスを支えるネットワークインフラストラクチャー
楽天サービスを支えるネットワークインフラストラクチャー楽天サービスを支えるネットワークインフラストラクチャー
楽天サービスを支えるネットワークインフラストラクチャーRakuten Group, Inc.
 
楽天の規模とクラウドプラットフォーム統括部の役割
楽天の規模とクラウドプラットフォーム統括部の役割楽天の規模とクラウドプラットフォーム統括部の役割
楽天の規模とクラウドプラットフォーム統括部の役割Rakuten Group, Inc.
 
Rakuten Services and Infrastructure Team.pdf
Rakuten Services and Infrastructure Team.pdfRakuten Services and Infrastructure Team.pdf
Rakuten Services and Infrastructure Team.pdfRakuten Group, Inc.
 
The Data Platform Administration Handling the 100 PB.pdf
The Data Platform Administration Handling the 100 PB.pdfThe Data Platform Administration Handling the 100 PB.pdf
The Data Platform Administration Handling the 100 PB.pdfRakuten Group, Inc.
 
Supporting Internal Customers as Technical Account Managers.pdf
Supporting Internal Customers as Technical Account Managers.pdfSupporting Internal Customers as Technical Account Managers.pdf
Supporting Internal Customers as Technical Account Managers.pdfRakuten Group, Inc.
 
Making Cloud Native CI_CD Services.pdf
Making Cloud Native CI_CD Services.pdfMaking Cloud Native CI_CD Services.pdf
Making Cloud Native CI_CD Services.pdfRakuten Group, Inc.
 
How We Defined Our Own Cloud.pdf
How We Defined Our Own Cloud.pdfHow We Defined Our Own Cloud.pdf
How We Defined Our Own Cloud.pdfRakuten Group, Inc.
 
Travel & Leisure Platform Department's tech info
Travel & Leisure Platform Department's tech infoTravel & Leisure Platform Department's tech info
Travel & Leisure Platform Department's tech infoRakuten Group, Inc.
 
Travel & Leisure Platform Department's tech info
Travel & Leisure Platform Department's tech infoTravel & Leisure Platform Department's tech info
Travel & Leisure Platform Department's tech infoRakuten Group, Inc.
 
Introduction of GORA API Group technology
Introduction of GORA API Group technologyIntroduction of GORA API Group technology
Introduction of GORA API Group technologyRakuten Group, Inc.
 
100PBを越えるデータプラットフォームの実情
100PBを越えるデータプラットフォームの実情100PBを越えるデータプラットフォームの実情
100PBを越えるデータプラットフォームの実情Rakuten Group, Inc.
 
社内エンジニアを支えるテクニカルアカウントマネージャー
社内エンジニアを支えるテクニカルアカウントマネージャー社内エンジニアを支えるテクニカルアカウントマネージャー
社内エンジニアを支えるテクニカルアカウントマネージャーRakuten Group, Inc.
 

More from Rakuten Group, Inc. (20)

コードレビュー改善のためにJenkinsとIntelliJ IDEAのプラグインを自作してみた話
コードレビュー改善のためにJenkinsとIntelliJ IDEAのプラグインを自作してみた話コードレビュー改善のためにJenkinsとIntelliJ IDEAのプラグインを自作してみた話
コードレビュー改善のためにJenkinsとIntelliJ IDEAのプラグインを自作してみた話
 
楽天における安全な秘匿情報管理への道のり
楽天における安全な秘匿情報管理への道のり楽天における安全な秘匿情報管理への道のり
楽天における安全な秘匿情報管理への道のり
 
What Makes Software Green?
What Makes Software Green?What Makes Software Green?
What Makes Software Green?
 
Simple and Effective Knowledge-Driven Query Expansion for QA-Based Product At...
Simple and Effective Knowledge-Driven Query Expansion for QA-Based Product At...Simple and Effective Knowledge-Driven Query Expansion for QA-Based Product At...
Simple and Effective Knowledge-Driven Query Expansion for QA-Based Product At...
 
DataSkillCultureを浸透させる楽天の取り組み
DataSkillCultureを浸透させる楽天の取り組みDataSkillCultureを浸透させる楽天の取り組み
DataSkillCultureを浸透させる楽天の取り組み
 
大規模なリアルタイム監視の導入と展開
大規模なリアルタイム監視の導入と展開大規模なリアルタイム監視の導入と展開
大規模なリアルタイム監視の導入と展開
 
楽天における大規模データベースの運用
楽天における大規模データベースの運用楽天における大規模データベースの運用
楽天における大規模データベースの運用
 
楽天サービスを支えるネットワークインフラストラクチャー
楽天サービスを支えるネットワークインフラストラクチャー楽天サービスを支えるネットワークインフラストラクチャー
楽天サービスを支えるネットワークインフラストラクチャー
 
楽天の規模とクラウドプラットフォーム統括部の役割
楽天の規模とクラウドプラットフォーム統括部の役割楽天の規模とクラウドプラットフォーム統括部の役割
楽天の規模とクラウドプラットフォーム統括部の役割
 
Rakuten Services and Infrastructure Team.pdf
Rakuten Services and Infrastructure Team.pdfRakuten Services and Infrastructure Team.pdf
Rakuten Services and Infrastructure Team.pdf
 
The Data Platform Administration Handling the 100 PB.pdf
The Data Platform Administration Handling the 100 PB.pdfThe Data Platform Administration Handling the 100 PB.pdf
The Data Platform Administration Handling the 100 PB.pdf
 
Supporting Internal Customers as Technical Account Managers.pdf
Supporting Internal Customers as Technical Account Managers.pdfSupporting Internal Customers as Technical Account Managers.pdf
Supporting Internal Customers as Technical Account Managers.pdf
 
Making Cloud Native CI_CD Services.pdf
Making Cloud Native CI_CD Services.pdfMaking Cloud Native CI_CD Services.pdf
Making Cloud Native CI_CD Services.pdf
 
How We Defined Our Own Cloud.pdf
How We Defined Our Own Cloud.pdfHow We Defined Our Own Cloud.pdf
How We Defined Our Own Cloud.pdf
 
Travel & Leisure Platform Department's tech info
Travel & Leisure Platform Department's tech infoTravel & Leisure Platform Department's tech info
Travel & Leisure Platform Department's tech info
 
Travel & Leisure Platform Department's tech info
Travel & Leisure Platform Department's tech infoTravel & Leisure Platform Department's tech info
Travel & Leisure Platform Department's tech info
 
OWASPTop10_Introduction
OWASPTop10_IntroductionOWASPTop10_Introduction
OWASPTop10_Introduction
 
Introduction of GORA API Group technology
Introduction of GORA API Group technologyIntroduction of GORA API Group technology
Introduction of GORA API Group technology
 
100PBを越えるデータプラットフォームの実情
100PBを越えるデータプラットフォームの実情100PBを越えるデータプラットフォームの実情
100PBを越えるデータプラットフォームの実情
 
社内エンジニアを支えるテクニカルアカウントマネージャー
社内エンジニアを支えるテクニカルアカウントマネージャー社内エンジニアを支えるテクニカルアカウントマネージャー
社内エンジニアを支えるテクニカルアカウントマネージャー
 

Recently uploaded

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 

Recently uploaded (20)

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven

Organizing Big Data for Text in Rakuten

  • 1. Organizing Big Data for Text in Rakuten October 28, 2017 Keiji Shinzato Rakuten Institute of Technology Rakuten, Inc.
  • 2. Text in Rakuten → Understanding → Valuable Information
      Text in Rakuten:
        • Search Queries
        • Reviews from Users
        • Product Descriptions
        • Etc.
      Valuable Information:
        • User Interest
        • User Experience
        • Product Features
        • Etc.
  • 4. Number of products has risen 2.6 times compared to five years ago.
      • 100M products in 2012 [1] to 258M products in 2017 [2]
      How much time do you need for reading descriptions of 258M products?
      A. 4 years   B. 40 years   C. 400 years   D. 4,000 years
      Technology to organize big data for text is critical!
      [1] https://corp.rakuten.co.jp/about/history.html
      [2] https://www.rakuten.co.jp/ (as of October 2nd, 2017.)
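The transcript does not record which option the speaker gave as the answer. As a rough back-of-the-envelope sketch, the estimate depends entirely on an assumed reading speed; the 60 seconds per description and the nonstop reading below are assumptions, not figures from the slides.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def reading_years(num_products, seconds_per_description):
    """Years needed to read every description back to back, without breaks."""
    return num_products * seconds_per_description / SECONDS_PER_YEAR


# 258M products comes from the slide; 60 seconds each is an assumed reading speed.
years = reading_years(258_000_000, 60)
print(f"about {years:.0f} years")
```

Changing the assumed reading speed scales the result linearly, so the point of the slide holds under any plausible value: nobody can read the catalog by hand.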
  • 5. • Information Extraction from Product Data
       • Sentiment Analysis on Review Data
  • 6. Application
      Unstructured Data: "Crafted from sleek spazzolato leather (black), the Dorian shopper is an elegant carryall that's perfect for your essentials. 10"H x 13"L x 6"D. RALPH LAUREN"
      Structured Data (Attribute: Value):
        Brand: Ralph Lauren
        Color: Black
        Material: Leather
        Size: 10"H x 13"L x 6"D
      Faceted Navigation / Recommendation / Market Research
      The bag image is designed by Freepik (http://www.freepik.com/free-vector/set-of-woman-s-bags-in-flat-style_960523.htm)
  • 7. Difficulty
      • Ambiguity
        • パーカー (luxury pen brand and hoodie), PUMA (sports brand and knife brand)
      • Diversity (long tail)
        • 風と光 (a company of natural foods)
      Dictionary-based approach
      • Easily control system behavior by editing entries in the dictionary.
      • Easily understand errors.
  • 8. Brand extraction pipeline
      Input: Product Titles and their Genres
      Morphological Analysis
        • Tokenization
        • PoS Tagging
      Extraction (uses the Brand Dictionary)
        • List tokens matched with the dictionary entries.
        • Extract the candidate to the furthest left.
      Normalization (uses the Synonym Dictionary)
        • Retrieve brand IDs corresponding to extracted brands.
      Output: Data with Brand IDs
  • 9. Brand dictionary (Brand expression | Relevant Genre)
        力王 | Unknown
        中部電磁器工業 | Computers & Networking
        キメラパーク | Unknown
        シュガーローズ | Women's Clothing
        サスクワッチファブリックス | Women's Clothing
        藤栄 | Home Decor, Housewares & Furniture
        ミキモト | Unknown
        エドウィンゴルフ | Sports & Outdoors
        AKI WORLD | Sports & Outdoors
        工房飛竜 | Toys, Hobbies & Games
        パーカー | Home & Office Supplies
        ハイライトキャバレー | Men's Clothing
        杉野 | Unknown
        カウネット | Kitchen, Dining & Bar
      Brand expressions are stored together with their relevant genres.
        • 190K entries
      Relevant genres are critical for disambiguation.
        • Use only brand expressions whose relevant genre is the same as that of the given product.
        • Retrieve パーカー only for products in Home & Office Supplies.
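The extraction and disambiguation rules above can be sketched as follows. This is a minimal illustration, not Rakuten's implementation: it assumes the title has already been tokenized (the real pipeline uses morphological analysis), the names `BRAND_DICTIONARY` and `extract_brand` are hypothetical, and entries whose genre is "Unknown" are left out because the slides do not say how they are handled.

```python
# A few entries copied from the slide: brand expression -> relevant genres.
BRAND_DICTIONARY = {
    "パーカー": {"Home & Office Supplies"},
    "シュガーローズ": {"Women's Clothing"},
    "エドウィンゴルフ": {"Sports & Outdoors"},
}


def extract_brand(tokens, genre, dictionary=BRAND_DICTIONARY):
    """Return the leftmost token that is a brand relevant to the product's genre."""
    candidates = [
        token
        for token in tokens  # tokens are scanned left to right
        if genre in dictionary.get(token, set())
    ]
    # "Extract the candidate to the furthest left."
    return candidates[0] if candidates else None


# パーカー is a pen brand in office supplies, but just "hoodie" in clothing.
print(extract_brand(["パーカー", "ボールペン", "黒"], "Home & Office Supplies"))
print(extract_brand(["メンズ", "パーカー", "黒"], "Men's Clothing"))
```

Because the genre filter is applied before the leftmost rule, an ambiguous word early in a title does not shadow a genuine brand that appears later.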
  • 11. Updating the brand dictionary
      1. Create training data (annotated text) from the brand dictionary and product data in ICHIBA.
      2. Train and run machine learning models to produce candidates.
      3. Assign new relevant genres to existing brands.
      4. Check manually to obtain new entries.
      5. Update the brand dictionary.
  • 12. Brand extraction pipeline
      Input: Product Titles and their Genres
      Morphological Analysis
        • Tokenization
        • PoS Tagging
      Extraction (uses the Brand Dictionary)
        • List tokens matched with the dictionary entries.
        • Extract the candidate to the furthest left.
      Normalization (uses the Synonym Dictionary)
        • Retrieve brand IDs corresponding to extracted brands.
      Output: Data with Brand IDs
  • 13. Synonym Dictionary (Genre | ID | Synonym)
        …
        Shoes, Bags, … | B2449 | NIKE, ナイキ
        Electronics | B2450 | SONY, ソニー
        …
      Extraction Results (Genre | Product | Brand ID | Label)
        Shoes | ナイキ | B2449 | NIKE
        Shoes | NIKE | B2449 | NIKE
        Bags | NIKE | B2449 | NIKE
        Interior | ナイキ | -- | ナイキ
      Both shoe images are designed by Freepik (http://www.freepik.com/free-vector/coloured-tennis-collection_972296.htm)
      The bag image is designed by Freepik (http://www.freepik.com/free-vector/gym-icons_788034.htm)
      The office chair image is designed by Freepik (http://www.freepik.com/free-vector/office-chair-three-colors_983088.htm)
  • 14. Synonym Dictionary (Genre | ID | Synonym)
        …
        Shoes, Bags, … | B2449 | NIKE, ナイキ
        Electronics | B2450 | SONY, ソニー
        …
      Extraction Results (Genre | Product | Brand ID)
        Shoes | ナイキ | B2449
        Shoes | NIKE | B2449
        Bags | NIKE | B2449
        Interior | ナイキ | --
  • 15. Synonym Dictionary (Genre | ID | Synonym)
        …
        Shoes, Bags, … | B2449 | NIKE, ナイキ
        Electronics | B2450 | SONY, ソニー
        …
      Extraction Results (Genre | Product | Brand ID)
        Shoes | ナイキ | B2449
        Shoes | NIKE | B2449
        Bags | NIKE | B2449
        Interior | ナイキ | B3510
      NAIKI Co.,LTD.: http://www.naiki.co.jp/index.html
  • 16. Synonym Dictionary (Genre | ID | Synonym)
        …
        Shoes, Bags, … | B2449 | NIKE, ナイキ
        Electronics | B2450 | SONY, ソニー
        …
      Extraction Results (Genre | Product | Brand ID)
        Shoes | ナイキ | B2449
        Shoes | NIKE | B2449
        Bags | NIKE | B2449
        Interior | ナイキ | B3510
      Information about when a synonym can be used is important.
      NAIKI Co.,LTD.: http://www.naiki.co.jp/index.html
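The normalization step on these slides can be sketched as a lookup over <genre, brand id, synonyms> triplets. This is an illustrative sketch only: the names `SYNONYM_TRIPLETS` and `normalize` are hypothetical, the first two triplets are copied from the slide, and the Interior triplet is an assumption modeled on the B3510 row (its full synonym list is not shown).

```python
# Each triplet: (genres where the synonyms apply, brand ID, synonyms).
SYNONYM_TRIPLETS = [
    ({"Shoes", "Bags"}, "B2449", {"NIKE", "ナイキ"}),
    ({"Electronics"}, "B2450", {"SONY", "ソニー"}),
    ({"Interior"}, "B3510", {"ナイキ"}),  # assumed entry for NAIKI Co.,LTD.
]


def normalize(genre, expression, triplets=SYNONYM_TRIPLETS):
    """Map an extracted brand expression to a brand ID, scoped by genre."""
    for genres, brand_id, synonyms in triplets:
        if genre in genres and expression in synonyms:
            return brand_id
    return None  # corresponds to "--" in the extraction results


print(normalize("Shoes", "ナイキ"))     # the sportswear brand
print(normalize("Interior", "ナイキ"))  # the furniture maker
```

Scoping each synonym set by genre is what lets the same surface form ナイキ resolve to two different brand IDs.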
  • 17. Find candidates automatically, and then check them manually.
      • JAN code
      • Wikipedia
      • Semantic similarity
      206K triplets of <genre, brand id, synonyms>
  • 18. Manually assign brands to 500 randomly selected product titles
      • Percentage of product titles including brands: 69.6% (348/500)
      Performance
      • Precision: 89.2% (224/251)
      • Recall: 64.4% (224/348)
      We can automatically extract correct brands for about 100M of the 260M products!
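The figures on this slide can be checked with a few lines of arithmetic. The counts come from the slide; extrapolating the 500-title sample to the whole catalog is an assumption made here to show where a figure near 100M could come from.

```python
sampled_titles = 500
titles_with_brand = 348  # titles that actually contain a brand
extracted = 251          # brands the system extracted
correct = 224            # extracted brands that were correct

precision = correct / extracted
recall = correct / titles_with_brand
print(f"precision {precision:.1%}, recall {recall:.1%}")

# Assumption: the sample rate of correct extractions holds for all 260M products.
catalog_size = 260_000_000
estimated_correct = catalog_size * correct // sampled_titles
print(f"estimated correct extractions: {estimated_correct:,}")
```

Under that assumption the estimate is roughly 116M products, which is consistent with the slide's rounded claim of about 100M.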
  • 19. • Information Extraction from Product Data
        • Sentiment Analysis on Review Data
  • 20. "I ordered this a week ago, but no response from the store."
      176,502 reviews
      Stock / Information / Payment / Service / Package / Shipping
      Snapshot of https://review.rakuten.co.jp/shop/4/261122_261122/cpmj-i0h5i-97x3lm_1_1/?l2-id=review_PC_sl_body_05 as of October 16th, 2017.
  • 21. • What aspects should we design?
        • How do we develop the system to perform it?
      Input: Merchant Review
        s1: Item was nicely packaged.
        s2: A tracking # was given, but never worked.
        s3: Will shop again.
      Output: Aspect / Sentiment Polarity
        s1: Package / Pos
        s2: Shipping / Neg
        s3: Repeat / Pos
      The robot image is designed by Freepik (http://www.freepik.com/free-vector/cute-robots-collection_713858.htm)
  • 22. Aspects (# | Aspect | Example)
        1 | 配送 (Shipping) | 迅速な配送ありがとうございました。 (Thank you for the quick shipping.)
        2 | 対応 (Service) | 今まで買い物した店舗で一番対応が遅かった。 (I’ve never seen such slow service!)
        3 | 連絡 (Communication) | 注文受付の自動送信メールが届いたきり一週間何の連絡もなし。 (No contact for a week after ordering it.)
        4 | 店舗 (Shop) | 信頼できるショップ様でした。 (They are a reliable store.)
        5 | 商品 (Item) | 安全に使用できそうで、これからが楽しみです。 (I’m looking forward to using this product.)
        6 | リピート (Repeat) | また利用したいと思います。 (I’m going to purchase an item again.)
        7 | 梱包 (Package) | 梱包も破損のないよう、しっかりとされていました。 (It was tightly packaged to prevent damage.)
  • 23. Aspects, continued (# | Aspect | Example)
        8 | 品揃え (Stock/variety) | 商品が多いので助かります。 (They have a big inventory.)
        9 | 情報 (Information) | マネキンの身長を記載してあったのでかなり参考になりました。 (The description about the height of a mannequin is very useful.)
        10 | キャンセル/返品 (Cancel/return) | しかしたまに断りなく遅れたりキャンセルされている点に不満です。 (I’m not satisfied because they suddenly canceled without any notification.)
        11 | 価格 (Price) | 商品が安く、購入でき、まんぞくです。 (I’m satisfied with purchasing the item at a low price.)
        12 | 楽天 (Rakuten) | 楽天の全サービスに信用がなくなりました。 (Because of this experience, I can’t trust any services in Rakuten.)
        13 | 支払い (Payment) | 決済方法にEdyが使える方がよいと思います。 (It would be better if Rakuten Edy were acceptable.)
        14 | その他 (Other) | レビューがもう少し増えるといいですね。 (I hope the number of reviews increases.)
  • 24. • What aspects should we design?
        • How do we develop the system to perform it?
      Input: Merchant Review
        s1: Item was nicely packaged.
        s2: A tracking # was given, but never worked.
        s3: Will shop again.
      Output: Aspect / Sentiment Polarity
        s1: Package / Pos
        s2: Shipping / Neg
        s3: Repeat / Pos
  • 25. Annotated 1,510 reviews (5,277 sentences)
      • 配送も迅速で良かったです。 (I was very pleased at how quickly I received it.) → Shipping/Positive
      • いつになっても商品が来ず、問い合わせても返信がない。 (No shipment, no reply to inquiry.) → Shipping/Negative, Communication/Negative
      103 hours / a well-trained annotator
  • 26. Train models using the passive-aggressive algorithm and CRF.
      Features are:
      • Bag-of-words, aspect dictionary, sentiment polarity dictionary, and syntactic information.
      Performance
      • Aspect classification
        • Precision: 82.6%, Recall: 46.8%
      • Sentiment classification
        • Precision: 84.8%, Recall: 77.5%
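For readers unfamiliar with the passive-aggressive algorithm named above, here is a minimal binary PA-I learner over bag-of-words features. It is a sketch under stated assumptions, not the system from the talk: the slides do not say which variant was used or how the models were combined with the CRF, the dictionary and syntactic features are omitted, and the function names, the `aggressiveness` parameter, and the three-sentence toy training set (taken from slide 21) are illustrative choices.

```python
import re
from collections import Counter, defaultdict


def bag_of_words(sentence):
    """Lowercased word counts; punctuation and symbols are dropped."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def train_passive_aggressive(examples, aggressiveness=1.0, epochs=5):
    """PA-I: update only on hinge loss, with a step size capped by aggressiveness."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for sentence, label in examples:  # label is +1 or -1
            features = bag_of_words(sentence)
            score = sum(weights[word] * count for word, count in features.items())
            loss = max(0.0, 1.0 - label * score)
            squared_norm = sum(count * count for count in features.values())
            if loss > 0.0 and squared_norm > 0:
                step = min(aggressiveness, loss / squared_norm)
                for word, count in features.items():
                    weights[word] += step * label * count
    return weights


def predict(weights, sentence):
    """Return +1 (positive) or -1 (negative) polarity."""
    features = bag_of_words(sentence)
    score = sum(weights.get(word, 0.0) * count for word, count in features.items())
    return 1 if score >= 0 else -1


# Toy data: the three example sentences from slide 21 with their polarities.
TRAINING = [
    ("Item was nicely packaged.", 1),
    ("A tracking # was given, but never worked.", -1),
    ("Will shop again.", 1),
]
weights = train_passive_aggressive(TRAINING)
print(predict(weights, "A tracking # was given, but never worked."))
```

The learner is "passive" when an example is already classified with a margin of at least 1 and "aggressive" otherwise, moving the weights just far enough to fix the current example.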
  • 27. • Important to develop technique to automatically pull valuable information from Big Data for Text.
          • e.g., reviews → users’ experience
        • Rakuten develops techniques in-house to exploit Big Data for Text in the services.
          • Information extraction from product descriptions
          • Sentiment analysis on reviews of merchants