Querying your database in natural language
PyData – Silicon Valley 2014
Daniel F. Moisset – dmoisset@machinalis.com
Data is everywhere
Collecting data is not the problem, but what to do with it
Any operation starts with selecting/filtering data
A classical approach: Search
Used by:
● Google
● Wikipedia
● Lucene/Solr
Performance can be improved:
● Stemming/synonyms
● Sorting data by relevance
Limits of keyword-based approaches
Query Languages
● SQL
● Many NoSQL approaches
● SPARQL
● MQL
These allow complex, accurate queries:
SELECT array_agg(players), player_teams
FROM (
    SELECT DISTINCT t1.t1player AS players, t1.player_teams
    FROM (
        SELECT
            p.playerid AS t1id,
            concat(p.playerid, ':', p.playername, ' ') AS t1player,
            array_agg(pl.teamid ORDER BY pl.teamid) AS player_teams
        FROM player p
        LEFT JOIN plays pl ON p.playerid = pl.playerid
        GROUP BY p.playerid, p.playername
    ) t1
    INNER JOIN (
        SELECT
            p.playerid AS t2id,
            array_agg(pl.teamid ORDER BY pl.teamid) AS player_teams
        FROM player p
        LEFT JOIN plays pl ON p.playerid = pl.playerid
        GROUP BY p.playerid, p.playername
    ) t2 ON t1.player_teams = t2.player_teams AND t1.t1id <> t2.t2id
) innerQuery
GROUP BY player_teams
Natural Language Queries
Getting popular:
● Wolfram Alpha
● Apple Siri
● Google Now
Pros and cons:
● Very accessible, trivial learning curve
● Still weak in its coverage: most applications have a list of “sample questions”
Outline of this talk: the Quepy approach
● Overview of our solution
● Simple example
● DSL
● Parser
● Question Templates
● Quepy applications
● Benefits
● Limitations
Quepy
● Open source (BSD license): https://github.com/machinalis/quepy
● Status: usable, 2 demos available (DBpedia + Freebase); online demo at http://quepy.machinalis.com/
● Complete documentation: http://quepy.readthedocs.org/en/latest/
● You're welcome to get involved!
Overview of the approach
● Parsing
● Match + Intermediate representation
● Query generation & DSL

“What is the airspeed velocity of an unladen swallow?”

What|what|WP is|be|VBZ the|the|DT
airspeed|airspeed|NN velocity|velocity|NN
of|of|IN an|an|DT unladen|unladen|JJ
swallow|swallow|NN

SELECT DISTINCT ?x1 WHERE {
    ?x0 kingdom "Animal".
    ?x0 name "unladen swallow".
    ?x0 airspeed ?x1.
}
Parsing
● Not done at the character level but at the word level
● Word = Token + Lemma + POS
  “is” → is|be|VBZ (VBZ means “verb, 3rd person singular, present tense”)
  “swallows” → swallows|swallow|NNS (NNS means “noun, plural”)
● NLTK is smart enough to know that “swallows” here means the bird (noun) and not the action (verb)
● Question rule = “regular expressions” over words:
  Token("what") + Lemma("be") + Question(Pos("DT")) + Plus(Pos("NN"))
  The word “what”, followed by any form of the verb “to be”, optionally followed by a determiner (articles, “all”, “every”), followed by one or more nouns
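As an illustration of the Token + Lemma + POS triple (not Quepy's internal tagger code), here is a minimal sketch using NLTK; the WordNet lemmatizer and the Penn-to-WordNet tag mapping are assumptions made for this example.

# Minimal sketch: tag a question into token|lemma|POS triples, in the same
# notation as "is|be|VBZ" above. Assumes the 'punkt', 'averaged_perceptron_tagger'
# and 'wordnet' NLTK data packages have been downloaded via nltk.download().
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS for lemmatization."""
    if tag.startswith("V"):
        return "v"
    if tag.startswith("J"):
        return "a"
    if tag.startswith("R"):
        return "r"
    return "n"

question = "What is the airspeed velocity of an unladen swallow?"
for token, tag in nltk.pos_tag(nltk.word_tokenize(question)):
    lemma = lemmatizer.lemmatize(token.lower(), penn_to_wordnet(tag))
    print("%s|%s|%s" % (token, lemma, tag))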
Intermediate representation
● Graph-like, with some known values and some holes (x0, x1, …). Always has a “root” (house-shaped in the slide's picture)
● Similar to knowledge databases
● Easy to build from Python code
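As a rough sketch (not Quepy's actual data structure), the intermediate representation for the “unladen swallow” question can be pictured as a set of triples with holes, plus a distinguished root:

# Rough sketch: the query graph as triples, with ?x0 and ?x1 as holes and
# ?x1 as the root/answer node.
query_root = "?x1"
triples = [
    ("?x0", "kingdom", '"Animal"'),
    ("?x0", "name", '"unladen swallow"'),
    ("?x0", "airspeed", "?x1"),
]

# A trivial rendering of this graph as the SPARQL shown earlier:
body = "\n".join("    %s %s %s." % t for t in triples)
print("SELECT DISTINCT %s WHERE {\n%s\n}" % (query_root, body))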
Code generator
● Built-in for MQL
● Built-in for SPARQL
● Possible approaches for SQL and other languages
● DSL-guided
● Outputs the query string (Quepy does not connect to a database)
Code examples
DSL
from quepy.dsl import FixedRelation, FixedType

class DefinitionOf(FixedRelation):
    relation = "/common/topic/description"
    reverse = True

class IsMovie(FixedType):
    fixedtype = "/film/film"

class IsPerformance(FixedType):
    fixedtype = "/film/performance"

class PerformanceOfActor(FixedRelation):
    relation = "/film/performance/actor"

class HasPerformance(FixedRelation):
    relation = "/film/film/starring"

class NameOf(FixedRelation):
    relation = "/type/object/name"
    reverse = True
DSL
Given a thing x0, its definition:
    DefinitionOf(x0)

Given an actor x2, the movies where x2 acts:
    performances = IsPerformance() + PerformanceOfActor(x2)
    movies = IsMovie() + HasPerformance(performances)
    x3 = NameOf(movies)
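Extending the DSL is just a matter of adding another small class in the same style. A hypothetical example follows; the Freebase property name is an assumption made for illustration, not part of the talk's demo.

from quepy.dsl import FixedRelation

class DirectorOf(FixedRelation):
    # Hypothetical: given a movie, relate it to its director(s).
    # "/film/film/directed_by" is assumed here for illustration only.
    relation = "/film/film/directed_by"
    reverse = True

# Composed exactly like the pieces above:
#   director = DirectorOf(IsMovie() + HasKeyword("Blade Runner"))
#   answer = NameOf(director)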
Parsing: Particles and templates
from refo import Question, Plus
from quepy.dsl import HasKeyword
from quepy.parsing import Lemma, Pos, QuestionTemplate, Particle

class Thing(Particle):
    regex = Question(Pos("JJ")) + Plus(Pos("NN") | Pos("NNP") | Pos("NNS"))

    def interpret(self, match):
        return HasKeyword(match.words.tokens)

class WhatIs(QuestionTemplate):
    # Thing() must be defined before it is used in this regex.
    regex = (Lemma("what") + Lemma("be") +
             Question(Pos("DT")) + Thing() + Question(Pos(".")))

    def interpret(self, match):
        label = DefinitionOf(match.thing)  # DefinitionOf comes from the DSL above
        return label
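For a question like “what is a banana?”, Thing captures “banana” and WhatIs wraps it, so the template produces an intermediate expression roughly equivalent to the following (a sketch, not literal Quepy output):

from quepy.dsl import HasKeyword
# DefinitionOf as defined in the DSL section above
expression = DefinitionOf(HasKeyword("banana"))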
Parsing: “movies starring <actor>”
● More DSL:

class IsPerson(FixedType):
    fixedtype = "/people/person"
    fixedtyperelation = "/type/object/type"

class IsActor(FixedType):
    fixedtype = "Actor"
    fixedtyperelation = "/people/person/profession"
Parsing: A more complex particle
● And then a new Particle:

class Actor(Particle):
    regex = Plus(Pos("NN") | Pos("NNS") | Pos("NNP") | Pos("NNPS"))

    def interpret(self, match):
        name = match.words.tokens
        return IsPerson() + IsActor() + HasKeyword(name)
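The + operator composes DSL expressions, so all three constraints end up on the same node of the intermediate graph. Roughly (a sketch, not literal Quepy output; the keyword relation is whatever the app's settings define):

# One node "?actor" constrained three ways, using the Freebase names above:
actor_pattern = [
    ("?actor", "/type/object/type", "/people/person"),                # IsPerson()
    ("?actor", "/people/person/profession", "Actor"),                 # IsActor()
    ("?actor", "<keyword relation from settings>", "Harrison Ford"),  # HasKeyword(name)
]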
Parsing: A more complex template
class ActedOnQuestion(QuestionTemplate):
    acted_on = (Lemma("appear") | Lemma("act") | Lemma("star"))
    movie = (Lemma("movie") | Lemma("movies") | Lemma("film"))
    regex = ((Question(Lemma("list")) + movie + Lemma("with") + Actor()) |
             (Question(Pos("IN")) + (Lemma("what") | Lemma("which")) +
              movie + Lemma("do") + Actor() + acted_on + Question(Pos("."))) |
             (Question(Lemma("list")) + movie + Lemma("star") + Actor()))

“list movies with Harrison Ford”
“list films starring Harrison Ford”
“In which film does Harrison Ford appear?”
Parsing: A more complex template
class ActedOnQuestion(QuestionTemplate):
    # ...

    def interpret(self, match):
        performance = IsPerformance() + PerformanceOfActor(match.actor)
        movie = IsMovie() + HasPerformance(performance)
        movie_name = NameOf(movie)
        return movie_name
Apps: gluing it all together
● You build a Python package with quepy startapp myapp
● There you add your DSL and question templates
● Then configure it by editing myapp/settings.py (output query language, data encoding)

You can then use it with:

import quepy

app = quepy.install("myapp")
question = "What is love?"
target, query, metadata = app.get_query(question)
db.execute(query)  # "db" is a placeholder: Quepy only builds the query string
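Since Quepy only generates the query string, running it against a database is up to the application. Here is a minimal sketch using the SPARQLWrapper library against the public DBpedia endpoint; both the library and the endpoint are assumptions for this example, not part of the talk.

import quepy
from SPARQLWrapper import SPARQLWrapper, JSON

app = quepy.install("myapp")  # hypothetical app name from above
target, query, metadata = app.get_query("What is love?")

if query is not None:  # query is None when no template matches
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row)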
The good things
● Effort to add question templates is small (minutes to hours), and the benefit grows linearly with the effort
● Good for industry applications
● Low specialization required to extend
● Human work is very parallelizable
● Easy to get many people to work on questions
● Better for domain-specific databases
Limitations
● Better for domain-specific databases
● It won't scale to massive numbers of question templates (they start to overlap/contradict each other)
● Hard to add computation (compare: Wolfram Alpha) or deduction (though deduction can be added in the database)
● Not very fast (an implementation issue, not a design issue)
● Requires a structured database
Future directions
● Testing this with other databases
● Improving performance
● Collecting uncovered questions and adding machine learning to learn new patterns
Q & A
You can also reach me at:
dmoisset@machinalis.com
Twitter: @dmoisset
http://machinalis.com/
Thanks!


Editor's Notes

  1. Hello everyone, my name is Daniel Moisset. I work at Machinalis, a company based in Argentina which builds data processing solutions for other companies. I&amp;apos;m not a native English speaker, so please just wave a bit if I&amp;apos;m not speaking clearly or just not making any sense. The topic I want to introduce today is about the use of natural language to query databases and a tool that implements a possible approach to solve this Issue Let me start by trying to show you why this problem is relevant. .
  2. The problem I&amp;apos;ll discuss today is not about how to get your data. If you&amp;apos;re here, chances are you have more data that you can handle. The big problem today is to put to work all the data that comes from different sources and is piling up in some database. And of course, the first step at least of that problem is getting the data you want, that is, making &amp;quot;queries&amp;quot;. Of course you&amp;apos;ll want to do more than queries later, but selecting the information you want is typically the first step
  3. A typical approach for large bodies of text-based data is the “keyword” based approach. The basic idea is that the user provides a list of keywords, and the items that contain those keywords are retrieved. There are a lot of well known tricks to improve this, like detecting the relevance of documents with respect to user keywords, doing some preprocessing of the input and the index so I can find documents without an exact keyword match but a similar word instead, etc. This approach has proven very successful in many different contexts, with Google as a leading example of a large database that probably all of us query frequently using keyword-based queries, and many tools to build search bars into your software. It works so well that you might wonder if there&amp;apos;s any significant improvement to make by trying a different approach.
  4. Keyword-based lookups are really good when you know what you&amp;apos;re looking for, typically the name of the entity you&amp;apos;re interested in, or some entity that is uniquely related to that other entity. It&amp;apos;s very simple to get information about Albert Einstein, or figuring out who proposed the Theory of Relativity even if I don&amp;apos;t remember Albert Einstein&amp;apos;s name.
  5. However, it&amp;apos;s not easy to Google &amp;quot;What&amp;apos;s the name of that place in California with a lot of movie studios?&amp;quot; &amp;quot;The one with the big white sign in the hill?&amp;quot;. None of the keywords I used to formulate that question are very good, and other similar formulations will not help us. It&amp;apos;s not a problem of having the data, even if I have a database containing records about movie studios and their locations, but a problem of how you interact with the database. Another problem of keyword-based lookups is that it is heavily dependent on data which is mainly textual. It works fine for the web, but if I have a database with flight schedules for many airlines, a keyword based search will provide me with a very limited interface for making queries. Even with a database with a lot of text, like the schedule for the conference, it&amp;apos;s not easy to answer questions like &amp;quot;Which PyData speakers are affiliated with the sponsors&amp;quot; (without doing it manually)
  6. The solution we have for this problem, which may be summarized as &amp;quot;finding data by the stuff related to it&amp;quot; are query languages. We have many of those, depending on how we want to structure our data. All of these allow us to write very accurate and very complicated queries. And by “us” I mean the people in this room, which are developers and data scientists. Which is the weakness of this approach: it&amp;apos;s not an interface that you can provide to end-users. There&amp;apos;s a lot of data that needs to be made available to people who can&amp;apos;t or won&amp;apos;t learn a complex language to access the information. Not because they&amp;apos;re stupid, but because their field of expertise is another one.
  7. That leaves us with a need to query structured, possibly non textual, related information in a way that does not require much expertise to the person making the queries. And a straightforward way to solve that need, is allowing the data to be queried in the language that the user already knows. Which brings us to the motivation for this talk. Natural language is getting as a popular way to make queries and/or enter commands. It provides a very user friendly experience, even when most current tools are somewhat limited in the coverage they can provide. By “coverage” here I mean how many of the relevant questions are actually understood by the computer. Currently, successful applications like the ones I show here have a guide to the user describing which forms of questions are &amp;quot;valid&amp;quot;
  8. After this introduction and the motivation to the problem, let me outline where I&amp;apos;m trying to get to during this talk: Some very smart people who work with me studied different approaches to a solution and came up with a tool called Quepy which implements that approach. Of course it&amp;apos;s not the only possible approach, but it has several nice properties that are valuable to us in an industrial context. I&amp;apos;ll describe the approach in general and get to a quick overview on how to code a simple quepy app. Then I&amp;apos;ll discuss what we most like about quepy, and the limits to the scope of the problem it solves.
  9. Just in case you&amp;apos;re eager to see the code instead of listening to me, all of it is available and online, so I&amp;apos;ll leave this slide for 10 seconds so you can get a picture, and then move on.
  10. At it&amp;apos;s core, the quepy approach is not unlike a compiler. The input is a string with a question, which is sent through a parser that builds a data structure, called an &amp;quot;intermediate representation&amp;quot;. That representation is then converted to a database query, which is the output from quepy. The parsing is guided by rules provided by the application writer, which describes what kind of questions are valid.
  11. The conversion is guided by some declarative information about the structure of the database that the application writer must define. We call this definition the &amp;quot;DSL&amp;quot;, for Domain Specific Language. As you might have noted from this description, what we built is not an universal solution that you can throw over your database, but something that requires programming customization, both regarding on how to interact with the user and how to interact with your database.
  12. Let&amp;apos;s take a deeper look at the parser. The first step of the parser provided by Quepy is splitting the text into parts, a process also known as tokenization. Once this is done you have a sequence of word objects, containing information on each word: the token, which is the original word as appears in the text, the lemma, which is the root word for the token (the base verb &amp;quot;speak&amp;quot; for a wordlike &amp;quot;speaking&amp;quot;), and a part of speech tag, which indicates if the word is a noun, an adjective, a verb, etc. This list of words is then matched against a set of question templates. Each question template defines a pattern, which is something that looks like a regular expression, where patterns can describe property matches over the token, lemma, and/or part of speech.
  13. Let&amp;apos;s assume a valid match on the question template. In that case, the question template provides a little piece of code that builds the intermediate representation. The intermediate representation of a query is a small graph, where vertices are entities in the database, edges are relations between entities, and both vertices and edges can be labeled or left open. There&amp;apos;s one special vertex called the &amp;quot;head&amp;quot; which is always open, that indicates what is the value for the &amp;quot;answer&amp;quot;. This is an abstract, backend independent representation of the query, although is thought mainly to use with knowledge databases, which usually have this graph structure and allow finding matching subgraphs. Quepy provides a way to build this trees from python code in a way that&amp;apos;s quite more natural than just describing the structure top down. Trees are built by composing tree parts that have some meaningful semantics on your domain. Those components, along with the mapping of those semantics to your database schema form what we call the DSL
  14. From the internal representation tree, and the DSL information it is possibleto automatically build a query string that can be sent to your database. At this time, we have built query generators for SPARQL, which is the defacto standard for knowledge databases, and MQL, the Metaweb Query Language (used by Google&amp;apos;s Freebase). It might be possible to build custom generators for other languages, or use some kind of adapter (I know there are SPARQL endpoints that you can put in front of a SQL database for example). The DSL information needed here is somewhat schema specific but is very simple to define, in a declarative way.
  15. Let me show you some code examples, making queries on Freebase with a couple of sample question templates. We want to answer "What are bananas?" and "In which movies did Harrison Ford appear?". We will be doing this on Freebase; but don't worry, there's no need for you to know the Freebase schema to understand this talk. We'll cover the information we need as we go. I'm going to show you some complete code, but this is not a tutorial, so I'm not going to go over it line by line explaining what everything does. The code I'm showing has the purpose of showing which parts you'll need to put together and how much (or how little) work is needed to build each.
  16. To build this example, the easiest way is to start with the DSL. We'll start by defining some simple concepts that look naturally related to the queries we want to make. Let's take a look at the `DefinitionOf` class. What we're saying here is how to get the definition of something. In Freebase, entities are related to their definitions by the "slash common slash topic slash description" attribute (this is why we say that this is a `FixedRelation`; in Freebase, attributes are also represented as relations). The "reverse equals true" indicates that we actually fix the left side of the relation to a known value and want to learn about the right side. Without it, this would be the opposite query: give me an object given its definition.
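  As a rough sketch (not the exact demo code), that DSL class might look like this in Python; the property id and the reverse flag come straight from the slide, the rest follows Quepy's `FixedRelation` API:

  from quepy.dsl import FixedRelation

  class DefinitionOf(FixedRelation):
      # Freebase relates an entity to its description through this attribute.
      relation = "/common/topic/description"
      # reverse=True: the entity we already know sits on the left of the
      # relation, and the open right-hand side (the description) is the answer.
      reverse = True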
  17. This is all the DSL we need to answer "What are bananas?". The other query we wanted to make is rather more complex. Our database has movies, where each movie can have many related entities called "performances". Each performance relates to an actor, a character, etc. So we define some basic relations to identify the type of certain entities using `FixedType`: `IsMovie` describes entities having the Freebase type "slash film slash film", and `IsPerformance` helps us recognize these "performance" objects. To link both types of entities, `PerformanceOfActor` queries which performances have a given actor, and `HasPerformance` allows us to query which movie has a given performance. Finally, in Freebase movies are complex objects, but when we show a result to the user we want to show a movie name, so `NameOf` gets the "slash type slash object slash name" attribute of a movie, which is the movie title.
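  A sketch of what those classes might look like. "/film/film" and "/type/object/name" are taken from the slides; the other Freebase ids are my assumptions, so check the demo app in the repo for the exact ones:

  from quepy.dsl import FixedType, FixedRelation

  class IsMovie(FixedType):
      fixedtype = "/film/film"
      # Depending on the backend you may also need to set fixedtyperelation
      # (e.g. "/type/object/type" for Freebase).

  class IsPerformance(FixedType):
      fixedtype = "/film/performance"          # assumed Freebase type id

  class PerformanceOfActor(FixedRelation):
      relation = "/film/performance/actor"     # assumed Freebase property id

  class HasPerformance(FixedRelation):
      relation = "/film/film/starring"         # assumed Freebase property id

  class NameOf(FixedRelation):
      relation = "/type/object/name"
      reverse = True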
  18. The intermediate representation of queries is built on instances of these objects. For example, given an actor “a”, this expression gives the movies with “a” (slide). Note that the operations on the bottom are abstract operations between queries which build a larger query; none of this is touching the database, it's just building a tree.
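  In code, the expression on the slide reads roughly like this (using the DSL classes sketched above, with `actor` standing for an already-built expression for “a”):

  performance = IsPerformance() + PerformanceOfActor(actor)  # performances featuring "a"
  movies = IsMovie() + HasPerformance(performance)           # movies having such a performance
  # Nothing here touches the database: `movies` is just an expression tree.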
  19. Let's now see how to code the parser for the queries mentioned before. For each kind of question we can build a "question template". The first thing that a question template specifies is how to match the questions. The matching has to be flexible enough to capture variants of the question like "what is X", "what are X", "what is an X", "what is X?", which you can see we write in the regex here: we have a "what"-like word, followed by some form of the verb "to be", optionally followed by a "determiner", which is a word like "a", "an", "the", followed by a thing, which is what we want to look up, and followed by a question mark. Note that I said "a thing" without being too explicit about what that means. Quepy allows you to define "particles", which are pieces of the question that you want to capture and that follow a particular pattern.
  20. Note that at the bottom I have defined what a Thing is, the definition consisting of a regular expression but also an intermediate representation for it. In this case, a thing is an optional adjective followed by one or more nouns. The semantics of a thing are given by the interpret method, where HasKeyword is a Quepy builtin with essentially the semantics of "the object with this primary key". It's shown in the slides as a dashed line. Our question template regex refers to Thing(), so in its interpret method it will have access to the already built graph for the matched thing. So if we ask "What is a banana?", you'll end up with a valid match that builds the graph on the right, which corresponds to the appropriate query.
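  Putting this and the previous slide together, a sketch of the "what is X" template could look like this; the pattern mirrors the one on the slide, HasKeyword and the parsing primitives are Quepy builtins, and DefinitionOf is the DSL class from before:

  from quepy.dsl import HasKeyword
  from quepy.parsing import Lemma, Pos, QuestionTemplate, Particle, Question, Plus
  from dsl import DefinitionOf  # assuming the DSL classes above live in dsl.py

  class Thing(Particle):
      # An optional adjective followed by one or more nouns.
      regex = Question(Pos("JJ")) + Plus(Pos("NN"))

      def interpret(self, match):
          # "The object whose primary key is the matched words."
          return HasKeyword(match.words.tokens)

  class WhatIs(QuestionTemplate):
      # Matches "what is X", "what are X", "what is an X", "what is X?"
      regex = Lemma("what") + Lemma("be") + Question(Pos("DT")) + \
              Thing() + Question(Pos("."))

      def interpret(self, match):
          # match.thing is the graph already built by Thing.interpret
          return DefinitionOf(match.thing), "define"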
  21. Let's work on the more complex example. The first thing we'll require is some additional DSL to write the "Actor" particle. In Freebase there's no actor type, but there's a "person" type and then an "actor" profession. That allows us to define "IsPerson" (that is, objects with the person type) and "IsActor" (that is, objects with the actor profession).
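  A possible sketch, assuming the usual Freebase ids for these and assuming FixedDataRelation takes the literal value as its constructor argument, the way HasKeyword does:

  from quepy.dsl import FixedType, FixedDataRelation

  class IsPerson(FixedType):
      fixedtype = "/people/person"             # assumed Freebase type id

  class IsActor(FixedDataRelation):
      # "has the profession Actor"
      relation = "/people/person/profession"   # assumed Freebase property id

      def __init__(self):
          super(IsActor, self).__init__("Actor")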
  22. This allows us to define the Actor particle, which matches a sequence of nouns and represents an object that is a person, works as an actor, and has as identifier the name in the match.
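  Continuing the sketch, the Actor particle might look like this (IsPerson and IsActor come from the previous snippet; the noun-sequence pattern is from the slide):

  from quepy.dsl import HasKeyword
  from quepy.parsing import Particle, Plus, Pos

  class Actor(Particle):
      # A sequence of (proper) nouns, e.g. "Harrison Ford".
      regex = Plus(Pos("NNP") | Pos("NN"))

      def interpret(self, match):
          name = match.words.tokens
          # A person, with the actor profession, identified by the matched name.
          return IsPerson() + IsActor() + HasKeyword(name)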
  23. The regex for this question is more complex because we allow several different forms, like the ones shown at the bottom. We allow several synonymous verbs to be used, like star vs. act vs. appear. We also allow synonyms like film and movie. Note that it's clearer to write this by defining intermediate regular expressions, but no Particle definition is needed if you don't want to capture the word used. There are possibly more ways to ask this question, but once you figure those out it's pretty easy to add them to the pattern. The pattern you see here is a simplified version of the pattern you'll find in the demo we have in the github repo; I simplified it to make it shorter to read.
  24. Once you've captured the actor, you just need to define, using the DSL, how to answer the query. Note that the definition here is very readable: we find performance objects referring to the matched actor, then we find movies with that performance, and then we find the names of those movies. Again, I described this sequentially, but you're actually describing declaratively how to build a query.
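  A simplified sketch of that question template, using the particle and DSL classes sketched above (the full pattern in the repo's demo covers more phrasings; the interpret method is the sequential description above, written declaratively):

  from quepy.parsing import Lemma, Pos, QuestionTemplate, Question

  class MoviesWithActorQuestion(QuestionTemplate):
      # Simplified: "In which movies did Harrison Ford appear?"
      #             "What films did Harrison Ford star in?"
      verb = Lemma("appear") | Lemma("act") | Lemma("star")
      movie = Lemma("movie") | Lemma("film")

      regex = Question(Pos("IN")) + (Lemma("which") | Lemma("what")) + movie + \
              Lemma("do") + Actor() + verb + Question(Pos("IN")) + Question(Pos("."))

      def interpret(self, match):
          # Performances featuring the matched actor...
          performance = IsPerformance() + PerformanceOfActor(match.actor)
          # ...movies that have such a performance...
          movie = IsMovie() + HasPerformance(performance)
          # ...and the names of those movies, which is what we show the user.
          return NameOf(movie), "enum"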
  25. Quepy also provides some tools to help you with the boilerplate, which are not very interesting to describe, but I just wanted you to know that they are there. There's the concept of a quepy app, which is a Python module where you fill out the DSL, question templates, settings like whether you want SPARQL or MQL, etc. Once you have that you can import that Python module with quepy dot install and get the query for a natural language question, ready to send to your database.
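  The end-to-end use then looks roughly like this ("myapp" is a placeholder name for your quepy application module, the one holding the DSL, the templates and a settings file selecting MQL or SPARQL):

  import quepy

  app = quepy.install("myapp")   # placeholder app name

  question = "In which movies did Harrison Ford appear?"
  target, query, metadata = app.get_query(question)
  print(query)   # the generated MQL/SPARQL string, ready to send to the database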
  26. As you have seen, the approach we've used for the problem is very simple, but it has some good properties I'd like to highlight. The first one, which is very important for us as a company that needs to build products based on this tool, is that you can add effort incrementally and get results that benefit the application, so it's very low risk. This is different from machine learning or statistical approaches, where you can spend a lot of project time building a model and you might end up hitting gold, or you might end up with something that adds zero visible results to a product. So, as much as we love machine learning where I work, we refrained from using it, getting something that's not state-of-the-art in terms of coverage, but a very safe approach, which is great value when interacting with customers.
  27. Another good thing about this is that extending or improving it requires work that can be done by a developer who doesn't need a strong linguistic specialization. So it's easy to get a large team working on improving an application. And many people can work at the same time, because question templates are really modular, not an opaque construct like machine learning models. This approach works well on domain-specific databases, where there's a limited number of relationships relevant within the data. For very general databases like Freebase and DBpedia, if you want to answer general questions, you will find that users start making up questions that fall outside your question templates.
  28. And that's also one of the weaknesses of this. If you have a general database, you'll have an explosion in the number of relevant queries and templates, which starts to produce problems between contradicting rules. Note that the limit here is not the number of entities in your dataset, but the number of relationships between them. The way this idea works also makes it a bit hard to integrate computation or deduction. The latter can be partly solved by using knowledge databases that have some deduction built in and apply it when they get a query, so it's something that you can work around.
  29. Something that's a limit of the implementation, but could be improved, is the performance of the conversion. What we have is something that works for us in contexts where we don't have many queries in a short time, but it would need some improvements if you wanted to provide a service available to a wide public. The last point that can be a limitation is the need for a structured database, which is something one doesn't always have access to. We actually built quepy as a component of a larger project, but we're also working on the other side of this problem with a tool called iepy.
  30. So that's all I have. I'll take a few questions, and of course you can get in touch with me later today or online for more information about this and other related work. Thanks for listening, and thanks to the people organizing this great conference.