SlideShare a Scribd company logo
1 of 29
Download to read offline
Finding relations in documents with IEPY
Daniel F. Moisset – dmoisset@machinalis.com
tw: @dmoisset
This is a talk about...
A (very brief) intro about IE
Example applications
IEPY, an open source tool to do it in python
https://github.com/machinalis/iepy
About me
I am a Computer Scientist
I am a cofounder and technical leader at Machinalis
Machinalis (http://www.machinalis.com) provides development
services for Machine Learning applications.
I participated in the development of IEPY
Information Extraction
Unstructured documents
(News articles, a wiki, a knowledge base)
Structured data
(“facts” on a database)
Some applications
Given… Obtain a table of
A collection of medical papers Genes and proteins expressed by them
Wikipedia Relevant people and their dates of birth
A set of business articles Company acquisitions and fundings
Social media posts Foodborne disease outbreaks
In general...
Extract RELATIONS
(tuples of a given arity expressing some known concept/meaning)
Between ENTITIES
(People, places, organizations, dates, amounts of money, …)
From DOCUMENTS
(text and maybe some associated metadata. Also possible over other media, not discussed in this talk)
Example: Pandas is a software library written for the Python programming language for data
manipulation and analysis. In particular, it offers data structures and operations for manipulating
numerical tables and time series. Pandas is free software released under the three-clause BSD license.
An example: iepycrunch
http://iepycrunch.machinalis.com/
Iepy pydata-amsterdam-2016
Iepy pydata-amsterdam-2016
Iepy pydata-amsterdam-2016
Iepy pydata-amsterdam-2016
IEPY
https://github.com/machinalis/iepy
Python based tool + framework for creating IE applications
● Some parts of the pipeline have defaults that can be replaced
● Supports automatic, manual, or hybrid approaches
● Includes a web UI for manual interaction
Python (+ external components), BSD open-source
IEPY pipeline: Load the documents
https://github.com/machinalis/iepy
IEPY manages its own database for storing documents
● A simple CSV importer is included
● You can write others or convert to CSV
● It is possible to do incremental updates
Some work is typically required here given the variety of possible inputs
IEPY pipeline: Preprocessing (1)
● Tokenization
● Segmentation in sentences
● Lemmatization
● Part of Speech tagging
IEPY uses off-the-shelf tools like Stanford NLP toolkit, you can
add/change steps.
IEPY pipeline: Preprocessing (2)
● Entity recognition
● NER tagger (Stanford's). This one does pronoun resolution and
coreferencing
● Gazettes resolution
● Custom rules
● Parsing (Stanford's)
● Segmentation
IEPY pipeline: Extraction
Simpler but least interesting option is human-based...
Allows integrating domain experts through a friendly UI
Manually entered data can feed the active learning core
IEPY pipeline: Extraction
Simpler but least interesting option is human-based...
Allows integrating domain experts through a friendly UI
Manually entered data can feed the active learning core
IEPY pipeline: Extraction
You can get precision (but not always high recall) with rules:
Rules are defined by regular expressions-like patterns on words
Some tools for debugging rules
IEPY pipeline: Extraction
You can enhance your results through active learning methods (and reduce
human time required which is the end-goal)
Learns from positive and negative examples found by other methods.
Some predefined features (entity distance, bag of words, bag of bigrams, …)
Many predefined classifiers (defaults to SVC from scikit-learn)
Tunable parameters for high precision or high recall
A real use-case: Archivo de la Memoria
Background history:
In 1976 Argentina, amidst heavy unrest and civil guerrilla/terrorism, the
military forces removed the president by coup and set up a new
government that lasted until 1983.
During that period, people considered “subversive” (members of the
terrorist organizations, but also any political dissenter) were taken away
without warrant, and many of them never seen again (killed without trial
and with no record). Those are called “Desaparecidos” (disappeared)
A real use-case: Archivo de la Memoria
In many of those cases, spouses and small children of desaparecidos were also
kidnapped. Also some pregnant women gave birth while imprisoned. Those
children were given under clandestine adoption to families in the military or
connected with them.
Today, most of the military top leaders of that age have been tried and
convicted for Human Rights violations. But many people in their 30's have
been lied about their real identity. Only 119 have been found as of December
of 2015. Nearly 400 haven't been located yet.
A real use-case: Archivo de la Memoria
A collaboration work between
● The NLP group at FaMAF, UNC (a research dept)
● The Province Archives of Remembrance (http://www.apm.gov.ar/ ,
government agency)
● Clearance to access classified military files
● Recovery and conservation of documents
A real use-case: Archivo de la Memoria
Input data: Army bulletins
● Personnel movements/assignments/transfers
● Training activities (“interrogation”, “intelligence”)
● Promotions, retirements.
Scanned and OCR'd from typewritten documents. Tagged with metadata
(location, date)
A real use-case: Archivo de la Memoria
Entity recognition
● Rule based: Stanford NLP is not very good in Spanish
● Rule based: To match military document conventions
A real use-case: Archivo de la Memoria
Relationship extraction
● Rule based: Taking advantage of having very regular language.
● ML based: SVM classifier to increase recall.
● Human based: some rounds of manual validation.
● ML based: retrained and relabeled after human intervention.
A real use-case: Archivo de la Memoria
Some results
● 67K people identified by rules, raised to 141K with a classifier.
● 10K transfers identified by rules, raised to 22K with a classifier
Where IEPY works well
● Single domain improves success rate
● Best on binary relations.
● Best on regular documents
● “Medium” sized datasets
Be careful about
● You are detecting occurrences, not actual entities
● Is ambiguity or “denormalization” harmful to you?
● For temporal events is a duplicate harmful?
● You get what the document says
● Are your documents consistent?
● Are your documents “truth”?
Thank you!
Daniel F. Moisset – dmoisset@machinalis.com
tw: @dmoisset

More Related Content

Viewers also liked

Bridging the gap from data science to service
Bridging the gap  from data science to serviceBridging the gap  from data science to service
Bridging the gap from data science to servicedmoisset
 
Sapeintia amoris 2 01
Sapeintia amoris 2 01Sapeintia amoris 2 01
Sapeintia amoris 2 01Enrique Ca
 
DERECHOS DEL ESTUDIANTE FITEC TRABAJO COLABORATIVO
DERECHOS DEL ESTUDIANTE FITEC TRABAJO COLABORATIVODERECHOS DEL ESTUDIANTE FITEC TRABAJO COLABORATIVO
DERECHOS DEL ESTUDIANTE FITEC TRABAJO COLABORATIVOleidy rocio jerez herrera
 
Darcie Huwe Financial Analyst Resume 10-21-14
Darcie Huwe Financial Analyst Resume 10-21-14Darcie Huwe Financial Analyst Resume 10-21-14
Darcie Huwe Financial Analyst Resume 10-21-14Darcie Huwe
 
national and international news agencies
national and international news agenciesnational and international news agencies
national and international news agenciesFatima Rifat
 
газета февраль
газета февральгазета февраль
газета февральgym1797
 
Scheduling - Sistem Operasi (Kelompok 3)
Scheduling - Sistem Operasi (Kelompok 3)Scheduling - Sistem Operasi (Kelompok 3)
Scheduling - Sistem Operasi (Kelompok 3)Ryan Aulia
 
Petunjuk bagi penulis jurnal aksata juni 2016
Petunjuk bagi penulis jurnal aksata juni 2016Petunjuk bagi penulis jurnal aksata juni 2016
Petunjuk bagi penulis jurnal aksata juni 2016jurnal aksata
 
Basics of Taking a Blood Pressure
Basics of Taking a Blood PressureBasics of Taking a Blood Pressure
Basics of Taking a Blood Pressuremeducationdotnet
 
Cum sa construiesti linkuri si sa nu fii penalizat de Google
Cum sa construiesti linkuri si sa nu fii penalizat de GoogleCum sa construiesti linkuri si sa nu fii penalizat de Google
Cum sa construiesti linkuri si sa nu fii penalizat de GoogleGomag
 
Mi autobiografía marlene
Mi autobiografía marleneMi autobiografía marlene
Mi autobiografía marlenemarle marrugo
 
NLIDB(Natural Language Interface to DataBases)
NLIDB(Natural Language Interface to DataBases)NLIDB(Natural Language Interface to DataBases)
NLIDB(Natural Language Interface to DataBases)Swetha Pallati
 
Price performance of opioids pain killers thta may kill
Price performance of opioids   pain killers thta may killPrice performance of opioids   pain killers thta may kill
Price performance of opioids pain killers thta may killAviMan
 
business ethics and social responsibility.
business ethics and social responsibility.business ethics and social responsibility.
business ethics and social responsibility.GiaHan Giang
 

Viewers also liked (18)

Quepy
QuepyQuepy
Quepy
 
Bridging the gap from data science to service
Bridging the gap  from data science to serviceBridging the gap  from data science to service
Bridging the gap from data science to service
 
Sapeintia amoris 2 01
Sapeintia amoris 2 01Sapeintia amoris 2 01
Sapeintia amoris 2 01
 
Cargus - Gian Sharp (CEO Cargus), CNEC, 27 mai 2013
Cargus - Gian Sharp (CEO Cargus), CNEC, 27 mai 2013Cargus - Gian Sharp (CEO Cargus), CNEC, 27 mai 2013
Cargus - Gian Sharp (CEO Cargus), CNEC, 27 mai 2013
 
DERECHOS DEL ESTUDIANTE FITEC TRABAJO COLABORATIVO
DERECHOS DEL ESTUDIANTE FITEC TRABAJO COLABORATIVODERECHOS DEL ESTUDIANTE FITEC TRABAJO COLABORATIVO
DERECHOS DEL ESTUDIANTE FITEC TRABAJO COLABORATIVO
 
Darcie Huwe Financial Analyst Resume 10-21-14
Darcie Huwe Financial Analyst Resume 10-21-14Darcie Huwe Financial Analyst Resume 10-21-14
Darcie Huwe Financial Analyst Resume 10-21-14
 
Portofilio v2 a
Portofilio v2 aPortofilio v2 a
Portofilio v2 a
 
national and international news agencies
national and international news agenciesnational and international news agencies
national and international news agencies
 
газета февраль
газета февральгазета февраль
газета февраль
 
Scheduling - Sistem Operasi (Kelompok 3)
Scheduling - Sistem Operasi (Kelompok 3)Scheduling - Sistem Operasi (Kelompok 3)
Scheduling - Sistem Operasi (Kelompok 3)
 
Petunjuk bagi penulis jurnal aksata juni 2016
Petunjuk bagi penulis jurnal aksata juni 2016Petunjuk bagi penulis jurnal aksata juni 2016
Petunjuk bagi penulis jurnal aksata juni 2016
 
Basics of Taking a Blood Pressure
Basics of Taking a Blood PressureBasics of Taking a Blood Pressure
Basics of Taking a Blood Pressure
 
Diccionario de matematica quechua castellano
Diccionario de matematica quechua castellanoDiccionario de matematica quechua castellano
Diccionario de matematica quechua castellano
 
Cum sa construiesti linkuri si sa nu fii penalizat de Google
Cum sa construiesti linkuri si sa nu fii penalizat de GoogleCum sa construiesti linkuri si sa nu fii penalizat de Google
Cum sa construiesti linkuri si sa nu fii penalizat de Google
 
Mi autobiografía marlene
Mi autobiografía marleneMi autobiografía marlene
Mi autobiografía marlene
 
NLIDB(Natural Language Interface to DataBases)
NLIDB(Natural Language Interface to DataBases)NLIDB(Natural Language Interface to DataBases)
NLIDB(Natural Language Interface to DataBases)
 
Price performance of opioids pain killers thta may kill
Price performance of opioids   pain killers thta may killPrice performance of opioids   pain killers thta may kill
Price performance of opioids pain killers thta may kill
 
business ethics and social responsibility.
business ethics and social responsibility.business ethics and social responsibility.
business ethics and social responsibility.
 

Similar to Iepy pydata-amsterdam-2016

2015 january technological intelligence_ surveillance_foresight
2015  january technological intelligence_ surveillance_foresight2015  january technological intelligence_ surveillance_foresight
2015 january technological intelligence_ surveillance_foresightEsther Arias Pérez-Ilzarbe
 
Data, data, data
Data, data, dataData, data, data
Data, data, dataandrewxhill
 
Introduction to the Venice Time Machine
Introduction to the Venice Time MachineIntroduction to the Venice Time Machine
Introduction to the Venice Time MachineGiovanni Colavizza
 
Cultural Heritage Insitutions and Big Data Collections
Cultural Heritage Insitutions and Big Data CollectionsCultural Heritage Insitutions and Big Data Collections
Cultural Heritage Insitutions and Big Data Collectionslljohnston
 
Setting the Scene for ViBRANT – Strategy, Philosophy and Communication
Setting the Scene for ViBRANT – Strategy, Philosophy and CommunicationSetting the Scene for ViBRANT – Strategy, Philosophy and Communication
Setting the Scene for ViBRANT – Strategy, Philosophy and Communicationvbrant
 
20110122 vibrant final
20110122 vibrant final20110122 vibrant final
20110122 vibrant finalagosti
 
I want to know more about compuerized text analysis
I want to know more about   compuerized text analysisI want to know more about   compuerized text analysis
I want to know more about compuerized text analysisLuke Czarnecki
 
Digital Tools, Trends and Methodologies in the Humanities and Social Sciences
Digital Tools, Trends and Methodologies in the Humanities and Social SciencesDigital Tools, Trends and Methodologies in the Humanities and Social Sciences
Digital Tools, Trends and Methodologies in the Humanities and Social SciencesShawn Day
 
Data Visualization in the Newsroom
Data Visualization in the NewsroomData Visualization in the Newsroom
Data Visualization in the NewsroomCarl V. Lewis
 
No, it is not "All about story."
No, it is not "All about story."No, it is not "All about story."
No, it is not "All about story."J T "Tom" Johnson
 
Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Paul Groth
 
creating a trading zone around twitter srchives. case study: paris attacks
creating a trading zone around twitter srchives. case study: paris attackscreating a trading zone around twitter srchives. case study: paris attacks
creating a trading zone around twitter srchives. case study: paris attacksFIAT/IFTA
 
Semantic Archive Integration for Holocaust Research: the EHRI Research Infras...
Semantic Archive Integration for Holocaust Research: the EHRI Research Infras...Semantic Archive Integration for Holocaust Research: the EHRI Research Infras...
Semantic Archive Integration for Holocaust Research: the EHRI Research Infras...Vladimir Alexiev, PhD, PMP
 
Data management for researchers
Data management for researchersData management for researchers
Data management for researchersDirk Roorda
 
Measuring System Performance in Cultural Heritage Systems
Measuring System Performance in Cultural Heritage SystemsMeasuring System Performance in Cultural Heritage Systems
Measuring System Performance in Cultural Heritage SystemsToine Bogers
 
Data Analytics and Industry-Academic Partnerships: An Irish Perspective
Data Analytics and Industry-Academic Partnerships: An Irish PerspectiveData Analytics and Industry-Academic Partnerships: An Irish Perspective
Data Analytics and Industry-Academic Partnerships: An Irish PerspectiveJohn Breslin
 
Getting to Know Your Data with R
Getting to Know Your Data with RGetting to Know Your Data with R
Getting to Know Your Data with RStephen Withington
 
2015-06-02-SCIA-Presentation-Infocodex-Final
2015-06-02-SCIA-Presentation-Infocodex-Final2015-06-02-SCIA-Presentation-Infocodex-Final
2015-06-02-SCIA-Presentation-Infocodex-FinalBeat Meyer
 
Nemeth Marton - Widening the limits of cognitive reception with online digita...
Nemeth Marton - Widening the limits of cognitive reception with online digita...Nemeth Marton - Widening the limits of cognitive reception with online digita...
Nemeth Marton - Widening the limits of cognitive reception with online digita...BOBCATSSS 2017
 

Similar to Iepy pydata-amsterdam-2016 (20)

2015 january technological intelligence_ surveillance_foresight
2015  january technological intelligence_ surveillance_foresight2015  january technological intelligence_ surveillance_foresight
2015 january technological intelligence_ surveillance_foresight
 
Data, data, data
Data, data, dataData, data, data
Data, data, data
 
Introduction to the Venice Time Machine
Introduction to the Venice Time MachineIntroduction to the Venice Time Machine
Introduction to the Venice Time Machine
 
Cultural Heritage Insitutions and Big Data Collections
Cultural Heritage Insitutions and Big Data CollectionsCultural Heritage Insitutions and Big Data Collections
Cultural Heritage Insitutions and Big Data Collections
 
Setting the Scene for ViBRANT – Strategy, Philosophy and Communication
Setting the Scene for ViBRANT – Strategy, Philosophy and CommunicationSetting the Scene for ViBRANT – Strategy, Philosophy and Communication
Setting the Scene for ViBRANT – Strategy, Philosophy and Communication
 
20110122 vibrant final
20110122 vibrant final20110122 vibrant final
20110122 vibrant final
 
I want to know more about compuerized text analysis
I want to know more about   compuerized text analysisI want to know more about   compuerized text analysis
I want to know more about compuerized text analysis
 
Digital Tools, Trends and Methodologies in the Humanities and Social Sciences
Digital Tools, Trends and Methodologies in the Humanities and Social SciencesDigital Tools, Trends and Methodologies in the Humanities and Social Sciences
Digital Tools, Trends and Methodologies in the Humanities and Social Sciences
 
Data Visualization in the Newsroom
Data Visualization in the NewsroomData Visualization in the Newsroom
Data Visualization in the Newsroom
 
No, it is not "All about story."
No, it is not "All about story."No, it is not "All about story."
No, it is not "All about story."
 
Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.
 
creating a trading zone around twitter srchives. case study: paris attacks
creating a trading zone around twitter srchives. case study: paris attackscreating a trading zone around twitter srchives. case study: paris attacks
creating a trading zone around twitter srchives. case study: paris attacks
 
Semantic Archive Integration for Holocaust Research: the EHRI Research Infras...
Semantic Archive Integration for Holocaust Research: the EHRI Research Infras...Semantic Archive Integration for Holocaust Research: the EHRI Research Infras...
Semantic Archive Integration for Holocaust Research: the EHRI Research Infras...
 
Data management for researchers
Data management for researchersData management for researchers
Data management for researchers
 
Measuring System Performance in Cultural Heritage Systems
Measuring System Performance in Cultural Heritage SystemsMeasuring System Performance in Cultural Heritage Systems
Measuring System Performance in Cultural Heritage Systems
 
Data Analytics and Industry-Academic Partnerships: An Irish Perspective
Data Analytics and Industry-Academic Partnerships: An Irish PerspectiveData Analytics and Industry-Academic Partnerships: An Irish Perspective
Data Analytics and Industry-Academic Partnerships: An Irish Perspective
 
Getting to Know Your Data with R
Getting to Know Your Data with RGetting to Know Your Data with R
Getting to Know Your Data with R
 
2015-06-02-SCIA-Presentation-Infocodex-Final
2015-06-02-SCIA-Presentation-Infocodex-Final2015-06-02-SCIA-Presentation-Infocodex-Final
2015-06-02-SCIA-Presentation-Infocodex-Final
 
Nemeth Marton - Widening the limits of cognitive reception with online digita...
Nemeth Marton - Widening the limits of cognitive reception with online digita...Nemeth Marton - Widening the limits of cognitive reception with online digita...
Nemeth Marton - Widening the limits of cognitive reception with online digita...
 
Web3uploaded
Web3uploadedWeb3uploaded
Web3uploaded
 

Recently uploaded

Bengaluru Tableau UG event- 2nd March 2024 Q1
Bengaluru Tableau UG event- 2nd March 2024 Q1Bengaluru Tableau UG event- 2nd March 2024 Q1
Bengaluru Tableau UG event- 2nd March 2024 Q1bengalurutug
 
STOCK PRICE ANALYSIS Furkan Ali TASCI --.pptx
STOCK PRICE ANALYSIS  Furkan Ali TASCI --.pptxSTOCK PRICE ANALYSIS  Furkan Ali TASCI --.pptx
STOCK PRICE ANALYSIS Furkan Ali TASCI --.pptxFurkanTasci3
 
Microeconomic Group Presentation Apple.pdf
Microeconomic Group Presentation Apple.pdfMicroeconomic Group Presentation Apple.pdf
Microeconomic Group Presentation Apple.pdfmxlos0
 
Data Analytics Fundamentals: data analytics types.potx
Data Analytics Fundamentals: data analytics types.potxData Analytics Fundamentals: data analytics types.potx
Data Analytics Fundamentals: data analytics types.potxEmmanuel Dauda
 
How to Build an Experimentation Culture for Data-Driven Product Development
How to Build an Experimentation Culture for Data-Driven Product DevelopmentHow to Build an Experimentation Culture for Data-Driven Product Development
How to Build an Experimentation Culture for Data-Driven Product DevelopmentAggregage
 
Neo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdf
Neo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdfNeo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdf
Neo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdfNeo4j
 
Prediction Of Cryptocurrency Prices Using Lstm, Svm And Polynomial Regression...
Prediction Of Cryptocurrency Prices Using Lstm, Svm And Polynomial Regression...Prediction Of Cryptocurrency Prices Using Lstm, Svm And Polynomial Regression...
Prediction Of Cryptocurrency Prices Using Lstm, Svm And Polynomial Regression...ferisulianta.com
 
Data Collection from Social Media Platforms
Data Collection from Social Media PlatformsData Collection from Social Media Platforms
Data Collection from Social Media PlatformsMahmoud Yasser
 
STOCK PRICE ANALYSIS Furkan Ali TASCI --.pptx
STOCK PRICE ANALYSIS  Furkan Ali TASCI --.pptxSTOCK PRICE ANALYSIS  Furkan Ali TASCI --.pptx
STOCK PRICE ANALYSIS Furkan Ali TASCI --.pptxFurkanTasci3
 
PPT for Presiding Officer.pptxvvdffdfgggg
PPT for Presiding Officer.pptxvvdffdfggggPPT for Presiding Officer.pptxvvdffdfgggg
PPT for Presiding Officer.pptxvvdffdfggggbhadratanusenapati1
 
Paul Martin (Gartner) - Show Me the AI Money.pdf
Paul Martin (Gartner) - Show Me the AI Money.pdfPaul Martin (Gartner) - Show Me the AI Money.pdf
Paul Martin (Gartner) - Show Me the AI Money.pdfdcphostmaster
 
Stochastic Dynamic Programming and You.pptx
Stochastic Dynamic Programming and You.pptxStochastic Dynamic Programming and You.pptx
Stochastic Dynamic Programming and You.pptxjkmrshll88
 
Enabling GenAI Breakthroughs with Knowledge Graphs
Enabling GenAI Breakthroughs with Knowledge GraphsEnabling GenAI Breakthroughs with Knowledge Graphs
Enabling GenAI Breakthroughs with Knowledge GraphsNeo4j
 
Understanding the Impact of video length on student performance
Understanding the Impact of video length on student performanceUnderstanding the Impact of video length on student performance
Understanding the Impact of video length on student performancePrithaVashisht1
 
TCFPro24 Building Real-Time Generative AI Pipelines
TCFPro24 Building Real-Time Generative AI PipelinesTCFPro24 Building Real-Time Generative AI Pipelines
TCFPro24 Building Real-Time Generative AI PipelinesTimothy Spann
 
Brain Tumor Detection with Machine Learning.pptx
Brain Tumor Detection with Machine Learning.pptxBrain Tumor Detection with Machine Learning.pptx
Brain Tumor Detection with Machine Learning.pptxShammiRai3
 
Using DAX & Time-based Analysis in Data Warehouse
Using DAX & Time-based Analysis in Data WarehouseUsing DAX & Time-based Analysis in Data Warehouse
Using DAX & Time-based Analysis in Data WarehouseThinkInnovation
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...PrithaVashisht1
 
Báo cáo Social Media Benchmark 2024 cho dân Marketing
Báo cáo Social Media Benchmark 2024 cho dân MarketingBáo cáo Social Media Benchmark 2024 cho dân Marketing
Báo cáo Social Media Benchmark 2024 cho dân MarketingMarketingTrips
 

Recently uploaded (20)

Bengaluru Tableau UG event- 2nd March 2024 Q1
Bengaluru Tableau UG event- 2nd March 2024 Q1Bengaluru Tableau UG event- 2nd March 2024 Q1
Bengaluru Tableau UG event- 2nd March 2024 Q1
 
STOCK PRICE ANALYSIS Furkan Ali TASCI --.pptx
STOCK PRICE ANALYSIS  Furkan Ali TASCI --.pptxSTOCK PRICE ANALYSIS  Furkan Ali TASCI --.pptx
STOCK PRICE ANALYSIS Furkan Ali TASCI --.pptx
 
Microeconomic Group Presentation Apple.pdf
Microeconomic Group Presentation Apple.pdfMicroeconomic Group Presentation Apple.pdf
Microeconomic Group Presentation Apple.pdf
 
Data Analytics Fundamentals: data analytics types.potx
Data Analytics Fundamentals: data analytics types.potxData Analytics Fundamentals: data analytics types.potx
Data Analytics Fundamentals: data analytics types.potx
 
How to Build an Experimentation Culture for Data-Driven Product Development
How to Build an Experimentation Culture for Data-Driven Product DevelopmentHow to Build an Experimentation Culture for Data-Driven Product Development
How to Build an Experimentation Culture for Data-Driven Product Development
 
Neo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdf
Neo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdfNeo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdf
Neo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdf
 
Prediction Of Cryptocurrency Prices Using Lstm, Svm And Polynomial Regression...
Prediction Of Cryptocurrency Prices Using Lstm, Svm And Polynomial Regression...Prediction Of Cryptocurrency Prices Using Lstm, Svm And Polynomial Regression...
Prediction Of Cryptocurrency Prices Using Lstm, Svm And Polynomial Regression...
 
Data Collection from Social Media Platforms
Data Collection from Social Media PlatformsData Collection from Social Media Platforms
Data Collection from Social Media Platforms
 
STOCK PRICE ANALYSIS Furkan Ali TASCI --.pptx
STOCK PRICE ANALYSIS  Furkan Ali TASCI --.pptxSTOCK PRICE ANALYSIS  Furkan Ali TASCI --.pptx
STOCK PRICE ANALYSIS Furkan Ali TASCI --.pptx
 
PPT for Presiding Officer.pptxvvdffdfgggg
PPT for Presiding Officer.pptxvvdffdfggggPPT for Presiding Officer.pptxvvdffdfgggg
PPT for Presiding Officer.pptxvvdffdfgggg
 
Target_Company_Data_breach_2013_110million
Target_Company_Data_breach_2013_110millionTarget_Company_Data_breach_2013_110million
Target_Company_Data_breach_2013_110million
 
Paul Martin (Gartner) - Show Me the AI Money.pdf
Paul Martin (Gartner) - Show Me the AI Money.pdfPaul Martin (Gartner) - Show Me the AI Money.pdf
Paul Martin (Gartner) - Show Me the AI Money.pdf
 
Stochastic Dynamic Programming and You.pptx
Stochastic Dynamic Programming and You.pptxStochastic Dynamic Programming and You.pptx
Stochastic Dynamic Programming and You.pptx
 
Enabling GenAI Breakthroughs with Knowledge Graphs
Enabling GenAI Breakthroughs with Knowledge GraphsEnabling GenAI Breakthroughs with Knowledge Graphs
Enabling GenAI Breakthroughs with Knowledge Graphs
 
Understanding the Impact of video length on student performance
Understanding the Impact of video length on student performanceUnderstanding the Impact of video length on student performance
Understanding the Impact of video length on student performance
 
TCFPro24 Building Real-Time Generative AI Pipelines
TCFPro24 Building Real-Time Generative AI PipelinesTCFPro24 Building Real-Time Generative AI Pipelines
TCFPro24 Building Real-Time Generative AI Pipelines
 
Brain Tumor Detection with Machine Learning.pptx
Brain Tumor Detection with Machine Learning.pptxBrain Tumor Detection with Machine Learning.pptx
Brain Tumor Detection with Machine Learning.pptx
 
Using DAX & Time-based Analysis in Data Warehouse
Using DAX & Time-based Analysis in Data WarehouseUsing DAX & Time-based Analysis in Data Warehouse
Using DAX & Time-based Analysis in Data Warehouse
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...
 
Báo cáo Social Media Benchmark 2024 cho dân Marketing
Báo cáo Social Media Benchmark 2024 cho dân MarketingBáo cáo Social Media Benchmark 2024 cho dân Marketing
Báo cáo Social Media Benchmark 2024 cho dân Marketing
 

Iepy pydata-amsterdam-2016

  • 1. Finding relations in documents with IEPY Daniel F. Moisset – dmoisset@machinalis.com tw: @dmoisset
  • 2. This is a talk about... A (very brief) intro about IE Example applications IEPY, an open source tool to do it in python https://github.com/machinalis/iepy
  • 3. About me I am a Computer Scientist I am a cofounder and technical leader at Machinalis Machinalis (http://www.machinalis.com) provides development services for Machine Learning applications. I participated in the development of IEPY
  • 4. Information Extraction Unstructured documents (News articles, a wiki, a knowledge base) Structured data (“facts” on a database)
  • 5. Some applications Given… Obtain a table of A collection of medical papers Genes and proteins expressed by them Wikipedia Relevant people and their dates of birth A set of business articles Company acquisitions and fundings Social media posts Foodborne disease outbreaks
  • 6. In general... Extract RELATIONS (tuples of a given arity expressing some known concept/meaning) Between ENTITIES (People, places, organizations, dates, amounts of money, …) From DOCUMENTS (text and maybe some associated metadata. Also possible over other media, not discussed in this talk) Example: Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. Pandas is free software released under the three-clause BSD license.
  • 12. IEPY https://github.com/machinalis/iepy Python based tool + framework for creating IE applications ● Some parts of the pipeline have defaults that can be replaced ● Supports automatic, manual, or hybrid approaches ● Includes a web UI for manual interaction Python (+ external components), BSD open-source
  • 13. IEPY pipeline: Load the documents https://github.com/machinalis/iepy IEPY manages its own database for storing documents ● A simple CSV importer is included ● You can write others or convert to CSV ● It is possible to do incremental updates Some work is typically required here given the variety of possible inputs
  • 14. IEPY pipeline: Preprocessing (1) ● Tokenization ● Segmentation in sentences ● Lemmatization ● Part of Speech tagging IEPY uses off-the-shelf tools like Stanford NLP toolkit, you can add/change steps.
  • 15. IEPY pipeline: Preprocessing (2) ● Entity recognition ● NER tagger (Stanford's). This one does pronoun resolution and coreferencing ● Gazettes resolution ● Custom rules ● Parsing (Stanford's) ● Segmentation
  • 16. IEPY pipeline: Extraction Simpler but least interesting option is human-based... Allows integrating domain experts through a friendly UI Manually entered data can feed the active learning core
  • 17. IEPY pipeline: Extraction Simpler but least interesting option is human-based... Allows integrating domain experts through a friendly UI Manually entered data can feed the active learning core
  • 18. IEPY pipeline: Extraction You can get precision (but not always high recall) with rules: Rules are defined by regular expressions-like patterns on words Some tools for debugging rules
  • 19. IEPY pipeline: Extraction You can enhance your results through active learning methods (and reduce human time required which is the end-goal) Learns from positive and negative examples found by other methods. Some predefined features (entity distance, bag of words, bag of bigrams, …) Many predefined classifiers (defaults to SVC from scikit-learn) Tunable parameters for high precision or high recall
  • 20. A real use-case: Archivo de la Memoria Background history: In 1976 Argentina, amidst heavy unrest and civil guerrilla/terrorism, the military forces removed the president by coup and set up a new government that lasted until 1983. During that period, people considered “subversive” (members of the terrorist organizations, but also any political dissenter) were taken away without warrant, and many of them never seen again (killed without trial and with no record). Those are called “Desaparecidos” (disappeared)
  • 21. A real use-case: Archivo de la Memoria In many of those cases, spouses and small children of desaparecidos were also kidnapped. Also some pregnant women gave birth while imprisoned. Those children were given under clandestine adoption to families in the military or connected with them. Today, most of the military top leaders of that age have been tried and convicted for Human Rights violations. But many people in their 30's have been lied about their real identity. Only 119 have been found as of December of 2015. Nearly 400 haven't been located yet.
  • 22. A real use-case: Archivo de la Memoria A collaboration work between ● The NLP group at FaMAF, UNC (a research dept) ● The Province Archives of Remembrance (http://www.apm.gov.ar/ , government agency) ● Clearance to access classified military files ● Recovery and conservation of documents
  • 23. A real use-case: Archivo de la Memoria Input data: Army bulletins ● Personnel movements/assignments/transfers ● Training activities (“interrogation”, “intelligence”) ● Promotions, retirements. Scanned and OCR'd from typewritten documents. Tagged with metadata (location, date)
  • 24. A real use-case: Archivo de la Memoria Entity recognition ● Rule based: Stanford NLP is not very good in Spanish ● Rule based: To match military document conventions
  • 25. A real use-case: Archivo de la Memoria Relationship extraction ● Rule based: Taking advantage of having very regular language. ● ML based: SVM classifier to increase recall. ● Human based: some rounds of manual validation. ● ML based: retrained and relabeled after human intervention.
  • 26. A real use-case: Archivo de la Memoria Some results ● 67K people identified by rules, raised to 141K with a classifier. ● 10K transfers identified by rules, raised to 22K with a classifier
  • 27. Where IEPY works well ● Single domain improves success rate ● Best on binary relations. ● Best on regular documents ● “Medium” sized datasets
  • 28. Be careful about ● You are detecting occurrences, not actual entities ● Is ambiguity or “denormalization” harmful to you? ● For temporal events is a duplicate harmful? ● You get what the document says ● Are your documents consistent? ● Are your documents “truth”?
  • 29. Thank you! Daniel F. Moisset – dmoisset@machinalis.com tw: @dmoisset