SlideShare a Scribd company logo
1 of 60
Download to read offline
Distributed
Natural Language Processing
Systems in Python
Clare Corthell,Founder of Luminant Data
@clarecorthell
thinkingmachin.es/events
Luminant Data
is a Machine Intelligence Consultancy
building more intelligent businesses
with data science strategy and
artificial intelligence technology
luminantdata.com
based in San Francisco
Clare Corthell, Founder
datasciencemasters.org
The Open Source Data Science Masters
The most popular curriculum for
learning data science
Author: Clare Corthell
development environment
python 2.7
pip, pandas, iPython, TextBlob, scikit-learn
@clarecorthell
hold on to your seats,
we’re doing NLP end-to-end
@clarecorthell
What is natural language processing 4?
@clarecorthell
how does this artificial assistant
know what you’re talking about?@clarecorthell
It’s simple, right?
@clarecorthell
wired.com
Natural Language Processing is the domain
concerned with translating human language
into something a computer can process
@clarecorthell
teaching computers to better
understand human language
today —
@clarecorthell
Natural Language Processing
Code
1.Intuition forTextAnalysis
2.Representations ofText for Computation
Case Study
3.Mining theWeb for ObscureWords
4.NLP in Production
@clarecorthell
Intuition for Text Analysis
one natural language processing
Worked Example:
http://github.com/clarecorthell/nlp_workshop
@clarecorthell
Representations of Text
for Computation
two natural language processing
Worked Example:
http://github.com/clarecorthell/nlp_workshop
@clarecorthell
Case Study
Mining the Web for Obscure Words
Finding new definitions of words for the Wordnik dictionary
codename: Serapis
three natural language processing
@clarecorthell
Important Note:
We combine Natural Language Processing
and Machine Learning in this framework
@clarecorthell
Wordnik’s Challenge
Find 1 million new words that are already defined online
@clarecorthell
The term “cheeseors” describes flighted
globules of intergalactic cheese, known to be
the scourge of the asteroid belt.
- Erin McKean, Wordnik Founder
what does this word mean?
@clarecorthell
Free-Range Definition or “FRD”
a sentence that contains and contextually defines a word
@clarecorthell
we’re given words without definitions (in Wordnik)
FRD sentences for that word (from the internet)we want
the natural language challenge:
@clarecorthell
This is a Supervised Classification Problem
Given a bunch of sentences containing a given word, we
want to know which ones are an FRD and which are not.
P(FRD | Sentence) = [0.0,1.0]
@clarecorthell
Development: Training
Production: Prediction
training labels
classificationalgorithm
training
input
new
input
Supervised Machine Learning:
Classification
test
input
learn* the difference
between labels
* Learning means finding the functions that define the difference between training labels, based on training input.
Development: Testing
once tests demonstrate the
classification algorithm works well,
push to production
predict
predict
abstract test quality of model
test
predictions
output
classifications
@clarecorthell
To get from word to FRD,
What data do we need?
@clarecorthell
real world text data is messy
user lookups from Wordnik
@clarecorthell
word —> sentence with word
search the web for
@clarecorthell
real world text data is messy
the internet
@clarecorthell
“ephemerable”
@clarecorthell
Both are messy!
We need to do some pre-processing.
Pre-processing is often custom, determined by:
- the domain of the text
- your goals
- dealing with edge cases
@clarecorthell
What does the solution look like so far?
@clarecorthell
Processing Natural
Language for Wordnik
Search Bing
qualify term
raw page
documents
word
cleaned
documents
sentences
containing
word
parse, tokenize,
sanitize text
word
english?
pre-process documents
find body text
replace term
in sentences
store for
learning or prediction
tag parts-of-speech sentences:
part-of-speech
@clarecorthell
All with the intent of creating features!
@clarecorthell
What are Features?
individual measurable properties of the
thing we’re observing.
@clarecorthell
The point of feature development and extraction
is to build derived values from the data
that represent characteristics that identify the data
or differentiate data points from one another
in such a way that the computer can observe that difference
@clarecorthell
Development: Training
Production: Prediction
training labels
classificationalgorithm
training
input
features
new
input
Supervised Machine Learning:
Classification
test
input
predicted
labels
test labels
metrics⊥
feature
extractor
learn*thedifference
betweenlabels
* Learning means finding the functions that define the difference between training labels, based on training input.
data
translations
Development: Testing
once tests demonstrate the
classification algorithm works well,
push to production
feature
extractor
data
translations
features
predict
predict
design
output
classifications
@clarecorthell
How do we construct features
that will differentiate our examples?
We have to get creative, make guesses, and
statistically test them to see what features will work
@clarecorthell
The term “cheeseors” describes flighted
globules of intergalactic cheese, known to be
the scourge of the asteroid belt.
What features do sentences with FRDs have?
@clarecorthell
Possible Feature: Length of Sentence
@clarecorthell
Possible Features: Length, Location, Position in Sentence
Chi-Squared (chi^2) Selection
Chi Squared is a statistical test that describes how
independent two given events are.
In selecting features, the two events are occurrence of the term
and occurrence of the class. If the score is high or significant, it
means that the occurrence is dependent on the class.
@clarecorthell
highest chi^2 feature scoring for tokens
@clarecorthell
highest chi^2 feature scoring for parts of speech
@clarecorthell
Final features for FRDs:
token context around the term (tf-idf)
POS tag context around the term
@clarecorthell
task.search collects URLs and the documents
behind those URLs for a given word
task.detect finds all sentences that are FRDs
within all documents collected for a given word
Final NLP Modules
@clarecorthell
Development: Training
Production: Prediction
training labels
classificationalgorithm
training
input
features
new
input
Supervised Machine Learning:
Classification
test
input
predicted
labels
test labels
metrics⊥
feature
extractor
learn*thedifference
betweenlabels
* Learning means finding the functions that define the difference between training labels, based on training input.
data
translations
Development: Testing
once tests demonstrate the
classification algorithm works well,
push to production
feature
extractor
data
translations
features
predict
predict
design
output
classifications
task.detect
@clarecorthell
Keys to Success in Supervised Learning:
• knowing what an FRD is
• knowing how to encode the FRD for computation
• using the right data sources
• choosing the right features
@clarecorthell
Ready to put it all together?
@clarecorthell
NLP in Production
four natural language processing
Lessons and Patterns from the distributed Wordnik system
codename: Serapis
@clarecorthell
Serapis is the Graeco-Egyptian god of knowledge, education —
and adding a million words into the dictionary.
@clarecorthell
Q: Can’t we just run the same thing on a server?
we want to find definitions for 1 million words
@clarecorthell
Q: Can’t we just run the same thing on a server?
we want to find definitions for 1 million words
which means we make 1 million search requests
@clarecorthell
Q: Can’t we just run the same thing on a server?
we want to find definitions for 1 million words
which means we make 1 million search requests
returning 40 results each for 40 million search results
@clarecorthell
Q: Can’t we just run the same thing on a server?
we want to find definitions for 1 million words
which means we make 1 million search requests
returning 40 results each for 40 million search results
if we follow those links to get the pages, that’s 40 million page requests
@clarecorthell
Q: Can’t we just run the same thing on a server?
we want to find definitions for 1 million words
which means we make 1 million search requests
returning 40 results each for 40 million search results
if we follow those links to get the pages, that’s 40 million page requests
each page request takes on average 2 sec, so 40,000,000 pages * 2 sec
@clarecorthell
Q: Can’t we just run the same thing on a server?
we want to find definitions for 1 million words
which means we make 1 million search requests
returning 40 results each for 40 million search results
if we follow those links to get the pages, that’s 40 million page requests
each page request takes on average 2 sec, so 40,000,000 pages * 2 sec
In series, it would take 5Years to get all the documents we want
@clarecorthell
Q: Can’t we just run the same thing on a server?
Scale and Page Requests
If we want to get search results for 1 million words, it would
take us 5 years to get all the documents if we processed
everything sequentially. We need to parallelize.
A: Nope.
@clarecorthell
Luminant Data &
Distributed Infrastructure for data collection, natural
language processing, and machine learning
resizable compute compute management
service
scalable storage index
@clarecorthell
AWS Lambda is a compute service that runs your
code in response to events and automatically manages
the underlying compute resources for you.
The promise of Lambda is that you don’t have to worry
about infrastructure, rather you set a task and a trigger,
such as a change to a document in an S3 bucket.
Lambda takes over scaling out the resources to
complete all the tasks.
@clarecorthell
1. Start an EC2 instance with the Amazon Linux AMI
2. Build shared libraries from source on EC2
3. Create a virtualenv with python dependencies
4. Write a python handler function to respond to a
given event and do its work (process text, etc)
5. Bundle the virtualenv, your code and the binary libs
into a zip file
6. Publish the zip file to AWS Lambda
Voila.
@clarecorthell
Luminant Data &
Distributed Infrastructure for data collection, natural
language processing, and machine learning
@clarecorthell
It has developed a mechanism to 'dye' very small bitcoin
transactions (called ‘bitcoin dust’) by adding extra data to them so
that they can represent bonds, shares or units of precious metals.
A Few New Words from FRDs
‘oxt weekend,’ in other words, means 'not this coming weekend
but the one after.'
This summer, a new, trendier one, emerged: NATU, for Netflix,
Airbnb, Tesla and Uber.
@clarecorthell
Clare Corthell
clare@luminantdata.com
@clarecorthell
Thank You!

More Related Content

What's hot

Text-mining and Automation
Text-mining and AutomationText-mining and Automation
Text-mining and Automationbenosteen
 
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538Krishna Sankar
 
How To Become a Data Scientist in Iran Marketplace
How To Become a Data Scientist in Iran Marketplace How To Become a Data Scientist in Iran Marketplace
How To Become a Data Scientist in Iran Marketplace Mohamadreza Mohtat
 
Machine Learning for Non-technical People
Machine Learning for Non-technical PeopleMachine Learning for Non-technical People
Machine Learning for Non-technical Peopleindico data
 
Artificial Intelligence: Natural Language Processing
Artificial Intelligence: Natural Language ProcessingArtificial Intelligence: Natural Language Processing
Artificial Intelligence: Natural Language ProcessingFrank Cunha
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkKrishna Sankar
 
The Future of Search and AI
The Future of Search and AIThe Future of Search and AI
The Future of Search and AITrey Grainger
 
The Next Generation of AI-powered Search
The Next Generation of AI-powered SearchThe Next Generation of AI-powered Search
The Next Generation of AI-powered SearchTrey Grainger
 
Real World NLP, ML, and Big Data
Real World NLP, ML, and Big DataReal World NLP, ML, and Big Data
Real World NLP, ML, and Big DataDevin Bost
 
Natural Language Processing with Graph Databases and Neo4j
Natural Language Processing with Graph Databases and Neo4jNatural Language Processing with Graph Databases and Neo4j
Natural Language Processing with Graph Databases and Neo4jWilliam Lyon
 
Teaching AI about human knowledge
Teaching AI about human knowledgeTeaching AI about human knowledge
Teaching AI about human knowledgeInes Montani
 
Implications of GPT-3
Implications of GPT-3Implications of GPT-3
Implications of GPT-3Raven Jiang
 
Natural Language Processing with Neo4j
Natural Language Processing with Neo4jNatural Language Processing with Neo4j
Natural Language Processing with Neo4jKenny Bastani
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDevashish Shanker
 
NLP & Machine Learning - An Introductory Talk
NLP & Machine Learning - An Introductory Talk NLP & Machine Learning - An Introductory Talk
NLP & Machine Learning - An Introductory Talk Vijay Ganti
 
Pandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data SciencePandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data ScienceKrishna Sankar
 
Myths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsMyths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsDavid Pittman
 
Life of a data scientist (pub)
Life of a data scientist (pub)Life of a data scientist (pub)
Life of a data scientist (pub)Buhwan Jeong
 
Nautral Langauge Processing - Basics / Non Technical
Nautral Langauge Processing - Basics / Non Technical Nautral Langauge Processing - Basics / Non Technical
Nautral Langauge Processing - Basics / Non Technical Dhruv Gohil
 
Machine Learning in NLP
Machine Learning in NLPMachine Learning in NLP
Machine Learning in NLPVijay Ganti
 

What's hot (20)

Text-mining and Automation
Text-mining and AutomationText-mining and Automation
Text-mining and Automation
 
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
 
How To Become a Data Scientist in Iran Marketplace
How To Become a Data Scientist in Iran Marketplace How To Become a Data Scientist in Iran Marketplace
How To Become a Data Scientist in Iran Marketplace
 
Machine Learning for Non-technical People
Machine Learning for Non-technical PeopleMachine Learning for Non-technical People
Machine Learning for Non-technical People
 
Artificial Intelligence: Natural Language Processing
Artificial Intelligence: Natural Language ProcessingArtificial Intelligence: Natural Language Processing
Artificial Intelligence: Natural Language Processing
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
 
The Future of Search and AI
The Future of Search and AIThe Future of Search and AI
The Future of Search and AI
 
The Next Generation of AI-powered Search
The Next Generation of AI-powered SearchThe Next Generation of AI-powered Search
The Next Generation of AI-powered Search
 
Real World NLP, ML, and Big Data
Real World NLP, ML, and Big DataReal World NLP, ML, and Big Data
Real World NLP, ML, and Big Data
 
Natural Language Processing with Graph Databases and Neo4j
Natural Language Processing with Graph Databases and Neo4jNatural Language Processing with Graph Databases and Neo4j
Natural Language Processing with Graph Databases and Neo4j
 
Teaching AI about human knowledge
Teaching AI about human knowledgeTeaching AI about human knowledge
Teaching AI about human knowledge
 
Implications of GPT-3
Implications of GPT-3Implications of GPT-3
Implications of GPT-3
 
Natural Language Processing with Neo4j
Natural Language Processing with Neo4jNatural Language Processing with Neo4j
Natural Language Processing with Neo4j
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language Processing
 
NLP & Machine Learning - An Introductory Talk
NLP & Machine Learning - An Introductory Talk NLP & Machine Learning - An Introductory Talk
NLP & Machine Learning - An Introductory Talk
 
Pandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data SciencePandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data Science
 
Myths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsMyths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data Scientists
 
Life of a data scientist (pub)
Life of a data scientist (pub)Life of a data scientist (pub)
Life of a data scientist (pub)
 
Nautral Langauge Processing - Basics / Non Technical
Nautral Langauge Processing - Basics / Non Technical Nautral Langauge Processing - Basics / Non Technical
Nautral Langauge Processing - Basics / Non Technical
 
Machine Learning in NLP
Machine Learning in NLPMachine Learning in NLP
Machine Learning in NLP
 

Viewers also liked

Scaling Harm: Designing Artificial Intelligence for Humans
Scaling Harm: Designing Artificial Intelligence for HumansScaling Harm: Designing Artificial Intelligence for Humans
Scaling Harm: Designing Artificial Intelligence for HumansClare Corthell
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemTrey Grainger
 
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...Trey Grainger
 
Hybrid Intelligence: The New Paradigm
Hybrid Intelligence: The New ParadigmHybrid Intelligence: The New Paradigm
Hybrid Intelligence: The New ParadigmClare Corthell
 
Engineering Ethics: Practicing Fairness
Engineering Ethics: Practicing FairnessEngineering Ethics: Practicing Fairness
Engineering Ethics: Practicing FairnessClare Corthell
 
The Perfect Fit: Scalable Graph for Big Data
The Perfect Fit: Scalable Graph for Big DataThe Perfect Fit: Scalable Graph for Big Data
The Perfect Fit: Scalable Graph for Big DataInside Analysis
 
Inside Wordnik's Architecture
Inside Wordnik's ArchitectureInside Wordnik's Architecture
Inside Wordnik's ArchitectureTony Tam
 
721P1_daic_revision3
721P1_daic_revision3721P1_daic_revision3
721P1_daic_revision3Dai Chen
 
Send Balls Into Orbit with Python3, AsyncIO, WebSockets and React
Send Balls Into Orbit with Python3, AsyncIO, WebSockets and ReactSend Balls Into Orbit with Python3, AsyncIO, WebSockets and React
Send Balls Into Orbit with Python3, AsyncIO, WebSockets and ReactTaras Lyapun
 
DUMP-2012 - Только хардкор! - "Архитектура и запуск облачного сервиса в Amazo...
DUMP-2012 - Только хардкор! - "Архитектура и запуск облачного сервиса в Amazo...DUMP-2012 - Только хардкор! - "Архитектура и запуск облачного сервиса в Amazo...
DUMP-2012 - Только хардкор! - "Архитектура и запуск облачного сервиса в Amazo...it-people
 
Прямая выгода BigData для бизнеса
Прямая выгода BigData для бизнесаПрямая выгода BigData для бизнеса
Прямая выгода BigData для бизнесаAlexey Lustin
 
Behind the Scenes at Coolblue - Feb 2017
Behind the Scenes at Coolblue - Feb 2017Behind the Scenes at Coolblue - Feb 2017
Behind the Scenes at Coolblue - Feb 2017Pat Hermens
 
PyCon Poland 2016: Maintaining a high load Python project: typical mistakes
PyCon Poland 2016: Maintaining a high load Python project: typical mistakesPyCon Poland 2016: Maintaining a high load Python project: typical mistakes
PyCon Poland 2016: Maintaining a high load Python project: typical mistakesViach Kakovskyi
 
Designing Invisible Software at x.ai
Designing Invisible Software at x.aiDesigning Invisible Software at x.ai
Designing Invisible Software at x.aiAlex Poon
 
Автоматизация анализа логов на базе Elasticsearch
Автоматизация анализа логов на базе ElasticsearchАвтоматизация анализа логов на базе Elasticsearch
Автоматизация анализа логов на базе ElasticsearchPositive Hack Days
 
Using Pivotal Cloud Foundry with Google’s BigQuery and Cloud Vision API
Using Pivotal Cloud Foundry with Google’s BigQuery and Cloud Vision APIUsing Pivotal Cloud Foundry with Google’s BigQuery and Cloud Vision API
Using Pivotal Cloud Foundry with Google’s BigQuery and Cloud Vision APIVMware Tanzu
 
Shadow Fight 2: архитектура системы аналитики для миллиарда событий
Shadow Fight 2: архитектура системы аналитики для миллиарда событийShadow Fight 2: архитектура системы аналитики для миллиарда событий
Shadow Fight 2: архитектура системы аналитики для миллиарда событийVyacheslav Nikulin
 
Optimizing Data Architecture for Natural Language Processing
Optimizing Data Architecture for Natural Language ProcessingOptimizing Data Architecture for Natural Language Processing
Optimizing Data Architecture for Natural Language ProcessingAlex Poon
 
2013-02-02 03 Голушко. Полнотекстовый поиск с Elasticsearch
2013-02-02 03 Голушко. Полнотекстовый поиск с Elasticsearch2013-02-02 03 Голушко. Полнотекстовый поиск с Elasticsearch
2013-02-02 03 Голушко. Полнотекстовый поиск с ElasticsearchОмские ИТ-субботники
 

Viewers also liked (20)

Scaling Harm: Designing Artificial Intelligence for Humans
Scaling Harm: Designing Artificial Intelligence for HumansScaling Harm: Designing Artificial Intelligence for Humans
Scaling Harm: Designing Artificial Intelligence for Humans
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data Ecosystem
 
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
 
Hybrid Intelligence: The New Paradigm
Hybrid Intelligence: The New ParadigmHybrid Intelligence: The New Paradigm
Hybrid Intelligence: The New Paradigm
 
Engineering Ethics: Practicing Fairness
Engineering Ethics: Practicing FairnessEngineering Ethics: Practicing Fairness
Engineering Ethics: Practicing Fairness
 
The Perfect Fit: Scalable Graph for Big Data
The Perfect Fit: Scalable Graph for Big DataThe Perfect Fit: Scalable Graph for Big Data
The Perfect Fit: Scalable Graph for Big Data
 
Inside Wordnik's Architecture
Inside Wordnik's ArchitectureInside Wordnik's Architecture
Inside Wordnik's Architecture
 
721P1_daic_revision3
721P1_daic_revision3721P1_daic_revision3
721P1_daic_revision3
 
Send Balls Into Orbit with Python3, AsyncIO, WebSockets and React
Send Balls Into Orbit with Python3, AsyncIO, WebSockets and ReactSend Balls Into Orbit with Python3, AsyncIO, WebSockets and React
Send Balls Into Orbit with Python3, AsyncIO, WebSockets and React
 
DUMP-2012 - Только хардкор! - "Архитектура и запуск облачного сервиса в Amazo...
DUMP-2012 - Только хардкор! - "Архитектура и запуск облачного сервиса в Amazo...DUMP-2012 - Только хардкор! - "Архитектура и запуск облачного сервиса в Amazo...
DUMP-2012 - Только хардкор! - "Архитектура и запуск облачного сервиса в Amazo...
 
Прямая выгода BigData для бизнеса
Прямая выгода BigData для бизнесаПрямая выгода BigData для бизнеса
Прямая выгода BigData для бизнеса
 
Behind the Scenes at Coolblue - Feb 2017
Behind the Scenes at Coolblue - Feb 2017Behind the Scenes at Coolblue - Feb 2017
Behind the Scenes at Coolblue - Feb 2017
 
PyCon Poland 2016: Maintaining a high load Python project: typical mistakes
PyCon Poland 2016: Maintaining a high load Python project: typical mistakesPyCon Poland 2016: Maintaining a high load Python project: typical mistakes
PyCon Poland 2016: Maintaining a high load Python project: typical mistakes
 
Designing Invisible Software at x.ai
Designing Invisible Software at x.aiDesigning Invisible Software at x.ai
Designing Invisible Software at x.ai
 
Pycon UA 2016
Pycon UA 2016Pycon UA 2016
Pycon UA 2016
 
Автоматизация анализа логов на базе Elasticsearch
Автоматизация анализа логов на базе ElasticsearchАвтоматизация анализа логов на базе Elasticsearch
Автоматизация анализа логов на базе Elasticsearch
 
Using Pivotal Cloud Foundry with Google’s BigQuery and Cloud Vision API
Using Pivotal Cloud Foundry with Google’s BigQuery and Cloud Vision APIUsing Pivotal Cloud Foundry with Google’s BigQuery and Cloud Vision API
Using Pivotal Cloud Foundry with Google’s BigQuery and Cloud Vision API
 
Shadow Fight 2: архитектура системы аналитики для миллиарда событий
Shadow Fight 2: архитектура системы аналитики для миллиарда событийShadow Fight 2: архитектура системы аналитики для миллиарда событий
Shadow Fight 2: архитектура системы аналитики для миллиарда событий
 
Optimizing Data Architecture for Natural Language Processing
Optimizing Data Architecture for Natural Language ProcessingOptimizing Data Architecture for Natural Language Processing
Optimizing Data Architecture for Natural Language Processing
 
2013-02-02 03 Голушко. Полнотекстовый поиск с Elasticsearch
2013-02-02 03 Голушко. Полнотекстовый поиск с Elasticsearch2013-02-02 03 Голушко. Полнотекстовый поиск с Elasticsearch
2013-02-02 03 Голушко. Полнотекстовый поиск с Elasticsearch
 

Similar to Distributed Natural Language Processing Systems in Python

The Relevance of the Apache Solr Semantic Knowledge Graph
The Relevance of the Apache Solr Semantic Knowledge GraphThe Relevance of the Apache Solr Semantic Knowledge Graph
The Relevance of the Apache Solr Semantic Knowledge GraphTrey Grainger
 
Webinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrWebinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrLucidworks
 
How to Build a Semantic Search System
How to Build a Semantic Search SystemHow to Build a Semantic Search System
How to Build a Semantic Search SystemTrey Grainger
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsDerek Kane
 
Building yourself with Python - Learn the Basics!!
Building yourself with Python - Learn the Basics!!Building yourself with Python - Learn the Basics!!
Building yourself with Python - Learn the Basics!!FRANKLINODURO
 
Searching for Meaning
Searching for MeaningSearching for Meaning
Searching for MeaningTrey Grainger
 
Querying data on the Web – client or server?
Querying data on the Web – client or server?Querying data on the Web – client or server?
Querying data on the Web – client or server?Ruben Verborgh
 
Patterns of Semantic Integration
Patterns of Semantic IntegrationPatterns of Semantic Integration
Patterns of Semantic IntegrationOptum
 
CSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web TutorialCSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web TutorialLeeFeigenbaum
 
Finding knowledge, data and answers on the Semantic Web
Finding knowledge, data and answers on the Semantic WebFinding knowledge, data and answers on the Semantic Web
Finding knowledge, data and answers on the Semantic Webebiquity
 
Introduction to Multimodal LLMs with LLaVA
Introduction to Multimodal LLMs with LLaVAIntroduction to Multimodal LLMs with LLaVA
Introduction to Multimodal LLMs with LLaVARobert McDermott
 
Introduction to Multimodal LLMs with LLaVA
Introduction to Multimodal LLMs with LLaVAIntroduction to Multimodal LLMs with LLaVA
Introduction to Multimodal LLMs with LLaVARobert McDermott
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesMax Irwin
 
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the CloudLeveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the CloudDatabricks
 
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerHaystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerOpenSource Connections
 
The Apache Solr Semantic Knowledge Graph
The Apache Solr Semantic Knowledge GraphThe Apache Solr Semantic Knowledge Graph
The Apache Solr Semantic Knowledge GraphTrey Grainger
 
Lambda The Extreme: Test-Driving a Functional Language
Lambda The Extreme: Test-Driving a Functional LanguageLambda The Extreme: Test-Driving a Functional Language
Lambda The Extreme: Test-Driving a Functional LanguageAccenture | SolutionsIQ
 
Declarative Multilingual Information Extraction with SystemT
Declarative Multilingual Information Extraction with SystemTDeclarative Multilingual Information Extraction with SystemT
Declarative Multilingual Information Extraction with SystemTLaura Chiticariu
 
Natural Language Search with Knowledge Graphs (Chicago Meetup)
Natural Language Search with Knowledge Graphs (Chicago Meetup)Natural Language Search with Knowledge Graphs (Chicago Meetup)
Natural Language Search with Knowledge Graphs (Chicago Meetup)Trey Grainger
 
Computer Programming for Lawyers
Computer Programming for LawyersComputer Programming for Lawyers
Computer Programming for LawyersNehal Madhani
 

Similar to Distributed Natural Language Processing Systems in Python (20)

The Relevance of the Apache Solr Semantic Knowledge Graph
The Relevance of the Apache Solr Semantic Knowledge GraphThe Relevance of the Apache Solr Semantic Knowledge Graph
The Relevance of the Apache Solr Semantic Knowledge Graph
 
Webinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrWebinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with Solr
 
How to Build a Semantic Search System
How to Build a Semantic Search SystemHow to Build a Semantic Search System
How to Build a Semantic Search System
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text Analytics
 
Building yourself with Python - Learn the Basics!!
Building yourself with Python - Learn the Basics!!Building yourself with Python - Learn the Basics!!
Building yourself with Python - Learn the Basics!!
 
Searching for Meaning
Searching for MeaningSearching for Meaning
Searching for Meaning
 
Querying data on the Web – client or server?
Querying data on the Web – client or server?Querying data on the Web – client or server?
Querying data on the Web – client or server?
 
Patterns of Semantic Integration
Patterns of Semantic IntegrationPatterns of Semantic Integration
Patterns of Semantic Integration
 
CSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web TutorialCSHALS 2010 W3C Semanic Web Tutorial
CSHALS 2010 W3C Semanic Web Tutorial
 
Finding knowledge, data and answers on the Semantic Web
Finding knowledge, data and answers on the Semantic WebFinding knowledge, data and answers on the Semantic Web
Finding knowledge, data and answers on the Semantic Web
 
Introduction to Multimodal LLMs with LLaVA
Introduction to Multimodal LLMs with LLaVAIntroduction to Multimodal LLMs with LLaVA
Introduction to Multimodal LLMs with LLaVA
 
Introduction to Multimodal LLMs with LLaVA
Introduction to Multimodal LLMs with LLaVAIntroduction to Multimodal LLMs with LLaVA
Introduction to Multimodal LLMs with LLaVA
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
 
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the CloudLeveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
 
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerHaystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
 
The Apache Solr Semantic Knowledge Graph
The Apache Solr Semantic Knowledge GraphThe Apache Solr Semantic Knowledge Graph
The Apache Solr Semantic Knowledge Graph
 
Lambda The Extreme: Test-Driving a Functional Language
Lambda The Extreme: Test-Driving a Functional LanguageLambda The Extreme: Test-Driving a Functional Language
Lambda The Extreme: Test-Driving a Functional Language
 
Declarative Multilingual Information Extraction with SystemT
Declarative Multilingual Information Extraction with SystemTDeclarative Multilingual Information Extraction with SystemT
Declarative Multilingual Information Extraction with SystemT
 
Natural Language Search with Knowledge Graphs (Chicago Meetup)
Natural Language Search with Knowledge Graphs (Chicago Meetup)Natural Language Search with Knowledge Graphs (Chicago Meetup)
Natural Language Search with Knowledge Graphs (Chicago Meetup)
 
Computer Programming for Lawyers
Computer Programming for LawyersComputer Programming for Lawyers
Computer Programming for Lawyers
 

Recently uploaded

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 

Recently uploaded (20)

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 

Distributed Natural Language Processing Systems in Python

  • 1. Distributed Natural Language Processing Systems in Python Clare Corthell,Founder of Luminant Data @clarecorthell thinkingmachin.es/events
  • 2. Luminant Data is a Machine Intelligence Consultancy building more intelligent businesses with data science strategy and artificial intelligence technology luminantdata.com based in San Francisco Clare Corthell, Founder
  • 3. datasciencemasters.org The Open Source Data Science Masters The most popular curriculum for learning data science Author: Clare Corthell
  • 4. development environment python 2.7 pip, pandas, iPython, TextBlob, scikit-learn @clarecorthell
  • 5. hold on to your seats, we’re doing NLP end-to-end @clarecorthell
  • 6. What is natural language processing 4? @clarecorthell
  • 7. how does this artificial assistant know what you’re talking about?@clarecorthell
  • 10. Natural Language Processing is the domain concerned with translating human language into something a computer can process @clarecorthell
  • 11. teaching computers to better understand human language today — @clarecorthell
  • 12. Natural Language Processing Code 1.Intuition forTextAnalysis 2.Representations ofText for Computation Case Study 3.Mining theWeb for ObscureWords 4.NLP in Production @clarecorthell
  • 13. Intuition for Text Analysis one natural language processing Worked Example: http://github.com/clarecorthell/nlp_workshop @clarecorthell
  • 14. Representations of Text for Computation two natural language processing Worked Example: http://github.com/clarecorthell/nlp_workshop @clarecorthell
  • 15. Case Study Mining the Web for Obscure Words Finding new definitions of words for the Wordnik dictionary codename: Serapis three natural language processing @clarecorthell
  • 16. Important Note: We combine Natural Language Processing and Machine Learning in this framework @clarecorthell
  • 17. Wordnik’s Challenge Find 1 million new words that are already defined online @clarecorthell
  • 18. The term “cheeseors” describes flighted globules of intergalactic cheese, known to be the scourge of the asteroid belt. - Erin McKean, Wordnik Founder what does this word mean? @clarecorthell
  • 19. Free-Range Definition or “FRD” a sentence that contains and contextually defines a word @clarecorthell
  • 20. we’re given words without definitions (in Wordnik) FRD sentences for that word (from the internet)we want the natural language challenge: @clarecorthell
  • 21. This is a Supervised Classification Problem Given a bunch of sentences containing a given word, we want to know which ones are an FRD and which are not. P(FRD | Sentence) = [0.0,1.0] @clarecorthell
  • 22. Development: Training Production: Prediction training labels classificationalgorithm training input new input Supervised Machine Learning: Classification test input learn* the difference between labels * Learning means finding the functions that define the difference between training labels, based on training input. Development: Testing once tests demonstrate the classification algorithm works well, push to production predict predict abstract test quality of model test predictions output classifications @clarecorthell
  • 23. To get from word to FRD, What data do we need? @clarecorthell
  • 24. real world text data is messy user lookups from Wordnik @clarecorthell
  • 25. word —> sentence with word search the web for @clarecorthell
  • 26. real world text data is messy the internet @clarecorthell
  • 28. Both are messy! We need to do some pre-processing. Pre-processing is often custom, determined by: - the domain of the text - your goals - dealing with edge cases @clarecorthell
  • 29. What does the solution look like so far? @clarecorthell
  • 30. Processing Natural Language for Wordnik Search Bing qualify term raw page documents word cleaned documents sentences containing word parse, tokenize, sanitize text word english? pre-process documents find body text replace term in sentences store for learning or prediction tag parts-of-speech sentences: part-of-speech @clarecorthell
  • 31. All with the intent of creating features! @clarecorthell
  • 32. What are Features? individual measurable properties of the thing we’re observing. @clarecorthell
  • 33. The point of feature development and extraction is to build derived values from the data that represent characteristics that identify the data or differentiate data points from one another in such a way that the computer can observe that difference @clarecorthell
  • 34. Development: Training Production: Prediction training labels classificationalgorithm training input features new input Supervised Machine Learning: Classification test input predicted labels test labels metrics⊥ feature extractor learn*thedifference betweenlabels * Learning means finding the functions that define the difference between training labels, based on training input. data translations Development: Testing once tests demonstrate the classification algorithm works well, push to production feature extractor data translations features predict predict design output classifications @clarecorthell
  • 35. How do we construct features that will differentiate our examples? We have to get creative, make guesses, and statistically test them to see what features will work @clarecorthell
  • 36. The term “cheeseors” describes flighted globules of intergalactic cheese, known to be the scourge of the asteroid belt. What features do sentences with FRDs have? @clarecorthell
  • 37. Possible Feature: Length of Sentence @clarecorthell
  • 38. Possible Features: Length, Location, Position in Sentence Chi-Squared (chi^2) Selection Chi Squared is a statistical test that describes how independent two given events are. In selecting features, the two events are occurrence of the term and occurrence of the class. If the score is high or significant, it means that the occurrence is dependent on the class. @clarecorthell
  • 39. highest chi^2 feature scoring for tokens @clarecorthell
  • 40. highest chi^2 feature scoring for parts of speech @clarecorthell
  • 41. Final features for FRDs: token context around the term (tf-idf) POS tag context around the term @clarecorthell
  • 42. task.search collects URLs and the documents behind those URLs for a given word task.detect finds all sentences that are FRDs within all documents collected for a given word Final NLP Modules @clarecorthell
  • 43. Development: Training Production: Prediction training labels classificationalgorithm training input features new input Supervised Machine Learning: Classification test input predicted labels test labels metrics⊥ feature extractor learn*thedifference betweenlabels * Learning means finding the functions that define the difference between training labels, based on training input. data translations Development: Testing once tests demonstrate the classification algorithm works well, push to production feature extractor data translations features predict predict design output classifications task.detect @clarecorthell
  • 44. Keys to Success in Supervised Learning: • knowing what an FRD is • knowing how to encode the FRD for computation • using the right data sources • choosing the right features @clarecorthell
  • 45. Ready to put it all together? @clarecorthell
  • 46. NLP in Production four natural language processing Lessons and Patterns from the distributed Wordnik system codename: Serapis @clarecorthell
  • 47. Serapis is the Graeco-Egyptian god of knowledge, education — and adding a million words into the dictionary. @clarecorthell
  • 48. Q: Can’t we just run the same thing on a server? we want to find definitions for 1 million words @clarecorthell
  • 49. Q: Can’t we just run the same thing on a server? we want to find definitions for 1 million words which means we make 1 million search requests @clarecorthell
  • 50. Q: Can’t we just run the same thing on a server? we want to find definitions for 1 million words which means we make 1 million search requests returning 40 results each for 40 million search results @clarecorthell
  • 51. Q: Can’t we just run the same thing on a server? we want to find definitions for 1 million words which means we make 1 million search requests returning 40 results each for 40 million search results if we follow those links to get the pages, that’s 40 million page requests @clarecorthell
  • 52. Q: Can’t we just run the same thing on a server? we want to find definitions for 1 million words which means we make 1 million search requests returning 40 results each for 40 million search results if we follow those links to get the pages, that’s 40 million page requests each page request takes on average 2 sec, so 40,000,000 pages * 2 sec @clarecorthell
  • 53. Q: Can’t we just run the same thing on a server? we want to find definitions for 1 million words which means we make 1 million search requests returning 40 results each for 40 million search results if we follow those links to get the pages, that’s 40 million page requests each page request takes on average 2 sec, so 40,000,000 pages * 2 sec In series, it would take 5Years to get all the documents we want @clarecorthell
  • 54. Q: Can’t we just run the same thing on a server? Scale and Page Requests If we want to get search results for 1 million words, it would take us 5 years to get all the documents if we processed everything sequentially. We need to parallelize. A: Nope. @clarecorthell
  • 55. Luminant Data & Distributed Infrastructure for data collection, natural language processing, and machine learning resizable compute compute management service scalable storage index @clarecorthell
  • 56. AWS Lambda is a compute service that runs your code in response to events and automatically manages the underlying compute resources for you. The promise of Lambda is that you don’t have to worry about infrastructure, rather you set a task and a trigger, such as a change to a document in an S3 bucket. Lambda takes over scaling out the resources to complete all the tasks. @clarecorthell
  • 57. 1. Start an EC2 instance with the Amazon Linux AMI 2. Build shared libraries from source on EC2 3. Create a virtualenv with python dependencies 4. Write a python handler function to respond to a given event and do its work (process text, etc) 5. Bundle the virtualenv, your code and the binary libs into a zip file 6. Publish the zip file to AWS Lambda Voila. @clarecorthell
  • 58. Luminant Data & Distributed Infrastructure for data collection, natural language processing, and machine learning @clarecorthell
  • 59. It has developed a mechanism to 'dye' very small bitcoin transactions (called ‘bitcoin dust’) by adding extra data to them so that they can represent bonds, shares or units of precious metals. A Few New Words from FRDs ‘oxt weekend,’ in other words, means 'not this coming weekend but the one after.' This summer, a new, trendier one, emerged: NATU, for Netflix, Airbnb, Tesla and Uber. @clarecorthell