How to Measure Document Similarity and Build Text Classifiers: A First Look at Term Frequency-Inverse Document Frequency (TF-IDF) Representations
Text data is potentially valuable for many data science projects, but working with text differs from working with structured data. One representation that has worked well for many text mining and machine learning applications is the term frequency-inverse document frequency (TF-IDF) vector. Despite the long-winded name, this method is easy to understand, performs well in many applications, and is implemented in commonly used data science tools. This presentation introduces TF-IDF and shows examples of using it for document classification and for measuring the similarity between documents.
This presentation does not assume any background in text mining or natural language processing. Examples will use Python.
3. Challenges
No obvious structure
Fully understanding language is hard
Large number of documents
Want to
Find documents based on similarity
Classify documents
10. Example: Corpus of Machine Learning Papers
Some terms appear frequently
“Feature”
“Algorithm”
“Training”
Some less frequently
“Reinforcement”
“Non-linear”
“Convolution”
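The frequent/infrequent split above can be checked on any corpus by tallying how many documents each term appears in. A minimal sketch with a hypothetical toy corpus (the documents below are illustrative stand-ins, not the actual machine learning papers):

```python
from collections import Counter

# Hypothetical stand-in for a corpus of machine learning abstracts.
corpus = [
    "the training algorithm selects each feature",
    "a feature of the training set",
    "reinforcement learning uses a convolution layer",
]

# Count how many documents each term appears in (document frequency).
# Using set() so a term counts once per document, however often it occurs.
doc_freq = Counter()
for doc in corpus:
    doc_freq.update(set(doc.split()))

print(doc_freq["feature"])      # appears in 2 documents
print(doc_freq["convolution"])  # appears in 1 document
```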
11. Intuition
Combinations of words are good indicators of a document’s topic
Self-driving cars: “automobile”, “driver”, “radar”, “image”, “sensor”
Text mining: “corpus”, “term vector”, “syntax”
Social networks: “graph”, “communities”, “users”, “influence”
Words that appear frequently across documents in a corpus are not good indicators of topic
Words that appear frequently only within documents about a single topic are good indicators of topic
15. Formalizing Intuition: TF-IDF
Notation
t - a term
d - a document
D - a set of documents (corpus)
N - number of documents in the corpus
TF - term frequency
tf(t,d) is the number of times term t occurs in document d
IDF - inverse document frequency
idf(t,D) = log(N / |{d in D : t in d}|)
TF-IDF
tfidf(t,d,D) = tf(t,d) * idf(t,D)
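The definitions above translate directly into Python. A minimal sketch, using a made-up three-document corpus and the natural log (the base is unspecified in the formula; natural log is a common choice):

```python
import math

def tf(t, d):
    """tf(t, d): number of times term t occurs in document d (a list of tokens)."""
    return d.count(t)

def idf(t, D):
    """idf(t, D) = log(N / |{d in D : t in d}|)."""
    N = len(D)
    df = sum(1 for d in D if t in d)  # number of documents containing t
    return math.log(N / df)

def tfidf(t, d, D):
    return tf(t, d) * idf(t, D)

# Hypothetical pre-tokenized corpus for illustration.
D = [
    "the car is driven on the road".split(),
    "the truck is driven on the highway".split(),
    "a bicycle rides on the road".split(),
]

# "the" appears in every document, so idf = log(3/3) = 0 and its weight vanishes ...
print(tfidf("the", D[0], D))            # 0.0
# ... while "car" appears in only one document, so idf = log(3/1).
print(round(tfidf("car", D[0], D), 3))  # 1.099
```

Note that a term occurring in every document gets weight zero regardless of how often it occurs in any single document.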
18. TF-IDF is
Large when:
a term occurs many times in a document (large TF), and
few documents in the corpus contain the term (large IDF)
Small when:
the term appears in many documents in the corpus
[Figure: TF-IDF plotted against frequency, with regions labeled “Stop Words”, “Common Words”, and “Rare Words”]
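With a TF-IDF vector for each document, the similarity between documents (one of the stated goals) can be measured with cosine similarity. A self-contained sketch in plain Python, using a hypothetical toy corpus; real projects would typically use a library implementation instead:

```python
import math

def tfidf_vector(d, D, vocab):
    """Build a TF-IDF vector for document d over a fixed vocabulary."""
    N = len(D)
    vec = []
    for t in vocab:
        tf = d.count(t)
        df = sum(1 for doc in D if t in doc)
        idf = math.log(N / df) if df else 0.0
        vec.append(tf * idf)
    return vec

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical pre-tokenized corpus for illustration.
D = [
    "the car is driven on the road".split(),
    "the truck is driven on the highway".split(),
    "a bicycle rides on the road".split(),
]
vocab = sorted({t for d in D for t in d})
vecs = [tfidf_vector(d, D, vocab) for d in D]

# The two vehicle documents share distinctive terms like "driven", so they
# score higher with each other than either does with the bicycle document.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```

Shared stop words like “the” and “on” contribute nothing to the score here, because their TF-IDF weight is zero; only the rarer, topical terms drive the similarity.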