This document provides an introduction to natural language processing (NLP) and discusses various NLP techniques. It begins by introducing the author and their background in NLP. It then defines NLP and the types of text data commonly analysed. The document outlines a typical NLP pipeline that involves pre-processing text, feature engineering, and both low-level and high-level NLP tasks. Part-of-speech tagging and sentiment analysis are discussed as examples. Deep learning techniques for NLP are also introduced, including word embeddings and recurrent neural networks.
2. About Myself – Ashwin Ittoo
Associate Professor HEC Liège, ULiège
Research Associate, JAIST (Japan)
Associate Editor, Elsevier (Computers in Industry)
3. • 3 PhDs, ULiège, Belgium
• Finance
• Marketing
• Medicine
• 1 PhD, JAIST, Japan (Aug. 2018)
Team
4. • Natural Language Processing (NLP)
• In French: traitement automatique des langues naturelles (TAL)
• Methods for “analysing” language
• Expressed in written form, text data
• Text data common in NLP
• Tweets
• Amazon/Yelp reviews
• Wikipedia
• Domain-specific articles (finance, medicine, …)
Introduction
5. • Variety of Analysis
• Document classification, e.g.
• Sentiment analysis
• Information extraction, e.g.
• Extracting facts from legal texts
• Machine translation
• Methods Evolution
• From formal logics, linguistics
• To machine learning, deep learning
Introduction (cont)
7. • Clean the data
• Removing stopwords (“a”, “the”, …)
• Removing non-ASCII characters
• Straightforward
• No learning (machine/deep) involved
Low-Level: Pre-processing
[Pipeline diagram: Pre-processing → Feature Engineering]
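The cleaning step above needs no learning at all, as the slide notes. A minimal sketch, assuming a tiny hand-picked stopword list (real pipelines would use a larger list, e.g. NLTK's):

```python
# Minimal pre-processing sketch: stopword and non-ASCII removal.
# STOPWORDS is a tiny illustrative sample, not a real stopword list.
STOPWORDS = {"a", "an", "the", "is", "of"}

def preprocess(text):
    # Drop non-ASCII characters, lowercase, then split on whitespace.
    ascii_text = text.encode("ascii", errors="ignore").decode("ascii")
    tokens = ascii_text.lower().split()
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The café is a nice place"))  # ['caf', 'nice', 'place']
```

Note how naive ASCII filtering mangles "café" to "caf": even "straightforward" cleaning involves design choices.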
8. • Text → number transformation
• Individual tokens from sentence
• Tokens: words, numbers, punctuation marks, …
• Tokens = features
• How to best represent features?
Low-Level: Feature Engineering
9. • As-is
• Each token = 1 feature
• Eat, ate, eaten: 3 tokens, 3 distinct features
• Huge number of features
• Curse of dimensionality
• Morphology
• Replace token with lemma (root)
• Eat, ate, eaten → eat: 3 tokens, 1 feature
• Demo
Feature Representation
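The collapse of "eat, ate, eaten" into one feature can be sketched as follows. The lemma dictionary here is a hand-made toy; in practice a lemmatiser such as NLTK's WordNetLemmatizer supplies this mapping:

```python
# Toy lemmatisation sketch: LEMMAS is an invented two-entry dictionary
# standing in for a real lemmatiser.
LEMMAS = {"ate": "eat", "eaten": "eat"}

def lemmatise(tokens):
    # Replace each token by its lemma if one is known, else keep it.
    return [LEMMAS.get(t, t) for t in tokens]

tokens = ["eat", "ate", "eaten"]
features = set(lemmatise(tokens))
print(features)  # {'eat'}  -> 3 tokens collapse to 1 feature
```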
10. • Grammatical Information
• Use Part-of-Speech (POS)/POS-tagging
• Defined in Penn Tree Bank (UPenn)
• E.g. “2 nice movies” → CD JJ NNS
• Several tools for POS-tagging
• Stanford NLP (Java)
• scikit-learn/NLTK (Python)
• Demo
Feature Representation (cont)
11. • Application of machine learning for NLP
• Large number of classes (each POS-tag)
• Temporal sequence of word occurrence
• Hidden Markov Model
• t_1..n = argmax over t_1..n of P(t_1..n | w_1..n)
  ≈ argmax over t_1..n of ∏_{i=1}^{n} P(w_i | t_i) · P(t_i | t_{i−1})
• P(w_i | t_i): prob. of word w_i given pos-tag t_i
• P(t_i | t_{i−1}): prob. of pos-tag t_i given preceding pos-tag t_{i−1}
Part-of-Speech Tagging
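The argmax above is usually computed with the Viterbi algorithm. A toy sketch over a two-tag set, where every probability is an invented illustrative number rather than a trained value:

```python
# Toy HMM tagger sketch; TRANS holds P(t_i | t_{i-1}) with "<s>" as the
# sentence start, EMIT holds P(w_i | t_i). All numbers are made up.
import math

TAGS = ["JJ", "NNS"]
TRANS = {("<s>", "JJ"): 0.6, ("<s>", "NNS"): 0.4,
         ("JJ", "JJ"): 0.3, ("JJ", "NNS"): 0.7,
         ("NNS", "JJ"): 0.5, ("NNS", "NNS"): 0.5}
EMIT = {("nice", "JJ"): 0.8, ("nice", "NNS"): 0.1,
        ("movies", "JJ"): 0.1, ("movies", "NNS"): 0.9}

def viterbi(words):
    # best[t] = (log-prob of best path ending in tag t, that path)
    best = {t: (math.log(TRANS[("<s>", t)] * EMIT[(words[0], t)]), [t])
            for t in TAGS}
    for w in words[1:]:
        best = {t: max((lp + math.log(TRANS[(prev, t)] * EMIT[(w, t)]),
                        path + [t])
                       for prev, (lp, path) in best.items())
                for t in TAGS}
    return max(best.values())[1]

print(viterbi(["nice", "movies"]))  # ['JJ', 'NNS']
```

Log-probabilities are summed instead of multiplying raw probabilities, which avoids numerical underflow on longer sentences.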
12. • How to select best features?
• Intuitively: some words are more important than others
• E.g. “doping” in sports documents
• Tf-Idf
• Term frequency-Inverse document frequency
• Standard statistical tests
• Chi-square
• Mutual Information
• Demo
Low-Level: Feature Engineering
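Tf-idf captures the intuition that "doping" is informative in a sports corpus while function words are not. A minimal sketch over an invented toy corpus:

```python
# Minimal tf-idf sketch; the three "documents" are invented toy examples.
import math

docs = [["doping", "scandal", "in", "cycling"],
        ["the", "match", "ended", "in", "a", "draw"],
        ["new", "doping", "tests", "in", "athletics"]]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)                 # term frequency
    df = sum(1 for d in corpus if term in d)        # document frequency
    idf = math.log(len(corpus) / df)                # inverse doc. freq.
    return tf * idf

# "doping" occurs in 2 of 3 documents; "in" occurs in all 3, so its
# idf is log(1) = 0 and it is useless as a discriminative feature.
print(tf_idf("doping", docs[0], docs))
print(tf_idf("in", docs[0], docs))  # 0.0
```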
13. • High-level tasks
• Features (low-level task) as input
• Sentiment Analysis
• Determine sentiment in customer reviews
• E.g. movie reviews, Amazon product reviews
• Classification Problem
• 2 (3) classes/categories
• +, - (neutral)
• Supervised Learning
• Movie reviews, annotated with sentiment class, available
• Train classification algorithm
• Naïve-Bayes, SVM, Random Forests, Neural Networks
High-Level: Sentiment Analysis
[Pipeline diagram: Low-level NLP tasks → Features → High-level NLP tasks (Sentiment Analysis, Machine Translation, …)]
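Of the classifiers listed above, Naïve Bayes is the simplest to sketch. The four training reviews below are invented, and add-one smoothing keeps unseen words from zeroing the product:

```python
# Tiny Naive Bayes sentiment sketch on invented training reviews.
import math
from collections import Counter

train = [("great wonderful movie", "+"), ("loved this great film", "+"),
         ("terrible boring movie", "-"), ("awful boring plot", "-")]

counts = {"+": Counter(), "-": Counter()}
for text, label in train:
    counts[label].update(text.split())
vocab = {w for c in counts.values() for w in c}

def classify(text):
    # Equal class priors (2 reviews each), so only P(w | class) matters;
    # add-one smoothing handles words unseen in a class.
    def log_score(label):
        c, total = counts[label], sum(counts[label].values())
        return sum(math.log((c[w] + 1) / (total + len(vocab)))
                   for w in text.split())
    return max(("+", "-"), key=log_score)

print(classify("great plot"))   # '+'
print(classify("awful boring")) # '-'
```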
14. • Confusion matrix
• True positives, false positives
• True negatives, false negatives
• Precision
• Fraction of reviews predicted positive that are truly positive
• How precise is the model?
• Recall
• Fraction of truly positive reviews (from the gold-standard set) that the model finds
• What is the coverage of the model?
• F1-score
• Harmonic mean of precision and recall
High-Level: Evaluation Metrics
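The three metrics follow directly from the confusion-matrix counts; the counts used below are invented for illustration:

```python
# Precision, recall and F1 from confusion-matrix counts.
def metrics(tp, fp, fn):
    precision = tp / (tp + fp)   # of the predicted positives, how many are right
    recall = tp / (tp + fn)      # of the true positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# e.g. 40 true positives, 10 false positives, 10 false negatives
p, r, f1 = metrics(tp=40, fp=10, fn=10)
print(p, r, f1)  # all 0.8 (up to float rounding)
```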
15. • Feature Engineering
• Core of machine learning and NLP, but…
• Manual, time-consuming
• Bottleneck in machine learning and NLP
• Deep Learning
• Neural network with many hidden layers
• Supervised Learning Approach
• Trained on annotated data
• Movie reviews with sentiment class
• Input: word (vectors) from reviews
• Output: class label (+,-, neutral)
• Hidden layers learn feature representation
• No (minimal) feature engineering
Deep Learning in NLP
16. • Different Deep Learning Architectures
• E.g. CNN for image processing
• RNN (Recurrent Neural Network)
• State of the art for text
• Considers temporal nature of tokens in sentence
Deep Learning in NLP (cont)
17. RNN for Sentiment Analysis
• Sentiment Challenge
• Each clause can express a different sentiment
• Need to keep track of word sequences
• Need to compose individual sentiments for overall sentiment
- This movie doesn't care about cleverness, wit or any other kind of intelligent humor.
- Those who find ugly meanings in beautiful things are corrupt without being charming.
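The "keep track of word sequences" point is exactly what the recurrence provides. A bare-bones sketch of one recurrent unit, with fixed toy weights (not trained) and scalar inputs for brevity:

```python
# Bare-bones Elman-style RNN forward pass with invented scalar weights.
import math

def rnn_forward(xs, w_x=0.5, w_h=0.8, b=0.0):
    h = 0.0
    for x in xs:
        # The new state depends on the current input AND the previous
        # state, which is how the network tracks word order.
        h = math.tanh(w_x * x + w_h * h + b)
    return h

# The same inputs in a different order give a different final state,
# unlike a bag-of-words representation.
print(rnn_forward([1.0, -1.0]))
print(rnn_forward([-1.0, 1.0]))
```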
18. Language Processing/Sentiment Analysis (cont)
• Trained over sentiment treebank
• Phrases, clauses, sentences, e.g. “This isn’t a new idea”
• Annotated with respective sentiments (blue: +, red: -)
Java Demo (Stanford Libraries)
19. Unsupervised Learning/Word Embeddings
• Neural language models/word embeddings
• Word2Vec (shallow neural network, not deep learning)
• Predict context words given centre word (skip-gram)
• E.g. given “bankrupt”, predict its surrounding words in “the bank went bankrupt last year”
• Words/contexts from Google news
20. Towards Unsupervised Learning (cont)
• Word vector representations capture semantic properties
• Word meaning and geometry
• king − man + woman ≈ queen
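The analogy can be checked with vector arithmetic and cosine similarity. The 3-dimensional "embeddings" below are invented so that the arithmetic works out; real Word2Vec vectors have hundreds of dimensions learned from corpora like Google News:

```python
# Toy word-analogy sketch; the 3-d vectors are invented illustrations,
# not real embeddings.
import math

vecs = {"king": [0.9, 0.8, 0.1], "queen": [0.9, 0.1, 0.8],
        "man": [0.1, 0.9, 0.1], "woman": [0.1, 0.2, 0.8]}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# king - man + woman should land nearest to queen
target = [k - m + w
          for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]
nearest = max((w for w in vecs if w != "king"),
              key=lambda w: cosine(target, vecs[w]))
print(nearest)  # 'queen'
```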