
A Focused Crawler for Romanian Words Discovery

Roedunet 2014 Conference paper


  1. 1. A Focused Crawler for Romanian Words Discovery • Authors: Ionuț-Gabriel Radu, Traian Rebedea (traian.rebedea@cs.pub.ro) • University Politehnica of Bucharest
  2. 2. Overview • Introduction • Objective • RWScraper • Related Work • RWScraper: Implementation • Results • Conclusions 19.08.15 RoEduNet Conference 2014 – Chișinău, R. Moldova 2
  3. 3. Introduction • All natural languages are subject to change over time • As the Web becomes more prevalent, it also constitutes a major source for identifying language evolution • Due to large amounts of Romanian web content, the rate of change has increased significantly
  4. 4. Objective • To provide a mechanism to identify new words (e.g. neologisms) that have entered the Romanian language • To develop a specialized (focused) web crawler for analyzing Romanian web pages and identifying new words
  5. 5. Focused Web Crawling • Crawling the web with a specific purpose: – “Focus” the spiders on specific content (e.g. people search, scientific publications, products, etc.) – Ignore other web pages and domains
  6. 6. Solution: RWScraper • RWScraper (Romanian Word Scraper) addresses the following problems: – Identify Romanian texts; – Distinguish between proper names and common nouns; – Create a database with new words along with context information and metadata. In order to identify new – Discover the most frequent spelling errors in Romanian online texts.
  7. 7. RWScraper – Text Processing • Each word discovered in a Romanian text is looked up in the database provided by www.dexonline.ro, which contains definitions from several Romanian dictionaries (DEX, DOOM, etc.) • Text Processing Pipeline – Text Normalization – Language Validation – Sentence Segmentation – Sentence-Level Language Identification – Word Tokenization
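The pipeline stages listed on this slide can be illustrated with a minimal Python sketch. It is not the RWScraper code: the two language-checking stages are left out, `KNOWN_WORDS` is a hypothetical stand-in for the dexonline.ro database, and the cedilla-to-comma mapping, the sentence splitter, and the token pattern are all simplifying assumptions.

```python
import re
import unicodedata

# Assumption: legacy cedilla letters are mapped to the comma-below forms.
CEDILLA_TO_COMMA = str.maketrans({"ş": "ș", "Ş": "Ș", "ţ": "ț", "Ţ": "Ț"})

# A word is a run of letters, optionally joined by hyphens (e.g. "l-am").
WORD = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")

# Hypothetical stand-in for the dexonline.ro word database.
KNOWN_WORDS = {"am", "primit", "un", "mesaj", "l-am", "citit", "pe"}


def normalize(text):
    """Unify Unicode form, diacritic variants, and whitespace."""
    text = unicodedata.normalize("NFC", text).translate(CEDILLA_TO_COMMA)
    return " ".join(text.split())


def split_sentences(text):
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def tokenize(sentence):
    """Return the lower-cased word tokens of one sentence."""
    return WORD.findall(sentence.lower())


def unknown_words(text, known=KNOWN_WORDS):
    """Return (word, sentence) pairs for tokens missing from the dictionary."""
    found = []
    for sentence in split_sentences(normalize(text)):
        for token in tokenize(sentence):
            if token not in known:
                found.append((token, sentence))
    return found


print(unknown_words("Am primit un mesaj.  L-am citit pe smartphone!"))
```

Keeping the sentence next to each unknown token mirrors the slide's goal of storing new words together with context information.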
  8. 8. Related Work: Neologisms Identification • A study for Japanese: – Scanning existing Japanese corpora for possible "new" words, typically by processing the texts through segmentation software and dealing with the "out-of-lexicon" problem – Simulating Japanese morphological processes to create possible new words and then testing for their presence in large corpora • Identification of lexical discriminants (e.g. termed, called, known as) and punctuation discriminants (e.g. single and double quotes) that introduce new words – This method identifies a significantly smaller number of potential new words because the number of lexical discriminant patterns is limited • Using data about the frequency of word usage over time
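The discriminant-based method mentioned on this slide can be sketched with two regular expressions. This is an illustrative approximation rather than the cited method: the patterns, the `WORD` definition, and the sample sentence are assumptions, and treating single quotes as delimiters will produce false positives on apostrophes (for example in "rock'n'roll").

```python
import re

# Letters only, optionally hyphen-joined (an assumed definition of a word).
WORD = r"[^\W\d_]+(?:-[^\W\d_]+)*"
OPEN_QUOTE = "[\"'“‘]"
CLOSE_QUOTE = "[\"'”’]"

# Lexical discriminants: a cue phrase followed by the candidate word.
LEXICAL = re.compile(
    r"\b(?:termed|called|known as)\s+" + OPEN_QUOTE + "?(" + WORD + ")",
    re.IGNORECASE,
)

# Punctuation discriminants: a single word wrapped in quotes.
QUOTED = re.compile(OPEN_QUOTE + "(" + WORD + ")" + CLOSE_QUOTE)


def candidate_new_words(text):
    """Return candidates grouped by the kind of discriminant that found them."""
    return {"lexical": LEXICAL.findall(text), "quoted": QUOTED.findall(text)}


sample = ('The habit, known as phubbing, annoys people. '
          'Others write about "nomophobia" instead.')
print(candidate_new_words(sample))
```

The short pattern list makes the limitation noted on the slide visible: a word is only found if one of the few cue phrases or a pair of quotes happens to introduce it.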
  9. 9. Related Work: Language Identification • Common Words Methods – Store and use a list of the most frequent words for each language • Unique Letter Combinations – Database with the most frequent sequences of letters in a language, not necessarily valid words – The main disadvantage: poor performance on short texts – The main advantage: it does not require word tokenization • Language Identification Using N-Grams – Every language has several specific, frequently used character n-grams – For a particular language L, the ordered n-gram dictionary is called the n-gram language profile – For a new text, we compute the distance to all computed language profiles • Markov Models for Language Identification – A word can be represented as a Markov chain in which letters are states – Compute a Markov model for each language
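The n-gram approach can be sketched as follows. The slide does not say which distance is used, so this example assumes the rank-based "out-of-place" measure commonly paired with ranked n-gram profiles; the two training sentences, the `top_k` cut-off, and all function names are illustrative.

```python
from collections import Counter


def ngram_profile(text, sizes=(2, 3), top_k=300):
    """Return character n-grams ordered from most to least frequent."""
    text = " ".join(text.lower().split())
    counts = Counter(
        text[i:i + n] for n in sizes for i in range(len(text) - n + 1)
    )
    # Sort by descending count, then alphabetically to keep ranks deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile
    receive the maximum penalty (the profile length)."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else penalty
        for r, gram in enumerate(doc_profile)
    )


def identify(text, profiles):
    """Return the language whose profile is closest to the text."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Tiny illustrative training texts; real profiles need far larger corpora.
TRAINING = {
    "ro": "aceasta este o propoziție scrisă în limba română și conține cuvinte obișnuite",
    "en": "this is a sentence written in the english language and it contains common words",
}
PROFILES = {lang: ngram_profile(text) for lang, text in TRAINING.items()}

print(identify("limba română", PROFILES))
```

With profiles this small the result on unseen text is fragile, which echoes the slide's remark about poor performance on short texts.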
  10. 10. RWScraper: Implementation • RWScraper is a focused crawler for Romanian web pages • Developed using Scrapy, an open-source scraping framework written in Python • It uses three main concepts: – Spiders: responsible for defining rules to restrict the crawled content to our area of interest – Items: data we want to scrape from the web pages – Pipelines: text processing tasks that act on the crawled web resources
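The three concepts can be mimicked in plain Python without installing Scrapy. The sketch below is hypothetical: `PageItem`, `is_in_scope`, and `TokenizePipeline` are invented names, and restricting the crawl to `.ro` hosts is an assumed rule, not one stated in the slides. The pipeline follows the classic Scrapy convention of a `process_item(self, item, spider)` method that returns the item, and Scrapy can work with dataclass objects as items.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class PageItem:
    """Item: the data kept for each crawled page (hypothetical fields)."""
    url: str
    text: str
    tokens: list = field(default_factory=list)


def is_in_scope(url):
    """Spider-side rule (an assumption): follow only hosts under .ro."""
    host = urlparse(url).hostname
    return host is not None and host.endswith(".ro")


class TokenizePipeline:
    """Pipeline stage: a text processing task applied to each item."""

    def process_item(self, item, spider):
        # A real stage could raise scrapy.exceptions.DropItem to discard
        # an item; this sketch only enriches it and passes it on.
        item.tokens = item.text.lower().split()
        return item


item = PageItem(url="http://www.example.ro/articol", text="Un text scurt")
if is_in_scope(item.url):
    item = TokenizePipeline().process_item(item, spider=None)
print(item)
```

In an actual Scrapy project the scope rule would live in a spider class and the pipeline would be enabled through the project settings.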
  11. 11. RWScraper Language Validation • Divide the texts into two categories: – Diacritics-free texts (DIAFREE) – Genuine Romanian texts (GEN) • 6.40% of the characters in the Romanian texts of the ro_eu_parliament corpus are diacritics • One problem with this approach: 4.14% of the texts contained ș, â, and î, and other languages also use these diacritics • Romanian is the only language that uses ț and ă • Our assumption: if a text has over 600 characters and contains no ț or ă – Then it is DIAFREE – Otherwise it is GEN
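The DIAFREE/GEN rule from this slide is small enough to write down directly. The sketch below assumes that uppercase letters and the legacy cedilla form ţ should also count as markers, and that length is measured in characters after NFC normalization; the slide specifies neither.

```python
import unicodedata

# ț and ă in both cases, plus the legacy cedilla ţ/Ţ (an assumption).
ROMANIAN_MARKERS = set("țȚăĂţŢ")
MIN_LENGTH = 600  # threshold stated on the slide


def classify_diacritics(text):
    """Label a text DIAFREE or GEN following the slide's rule."""
    text = unicodedata.normalize("NFC", text)
    has_marker = any(ch in ROMANIAN_MARKERS for ch in text)
    if len(text) > MIN_LENGTH and not has_marker:
        return "DIAFREE"
    return "GEN"


print(classify_diacritics("a" * 601))
print(classify_diacritics("Țară " * 200))
```

Note that short texts without any marker fall into GEN, because the rule only labels a text DIAFREE when it is long enough for the absence of ț and ă to be meaningful.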
  12. 12. RWScraper Language Validation • Build language profiles, consisting of: – Character bigram and trigram frequency – Common word frequency – Diacritic frequency – Rare character frequency – Double consonant frequency – Single quote frequency
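Some of the profile features listed here can be computed with a few lines of Python. The slides do not define the features precisely, so everything below is an assumption: frequencies are taken relative to the number of non-whitespace characters, the "rare" characters are taken to be k, q, w, and y (letters that occur mainly in loanwords in Romanian), and a double consonant is any consonant immediately repeated.

```python
import re

DIACRITICS = set("ăâîșțĂÂÎȘȚ")
RARE_CHARS = set("kqwyKQWY")   # assumed set of rare characters
SINGLE_QUOTES = set("'’")
# Any ASCII consonant immediately followed by itself.
DOUBLE_CONSONANT = re.compile(r"([b-df-hj-np-tv-z])\1")


def language_features(text):
    """Return per-character frequencies of four profile features."""
    chars = [c for c in text if not c.isspace()]
    total = len(chars) or 1  # avoid division by zero on empty input
    return {
        "diacritics": sum(c in DIACRITICS for c in chars) / total,
        "rare_chars": sum(c in RARE_CHARS for c in chars) / total,
        "double_consonants": len(DOUBLE_CONSONANT.findall(text.lower())) / total,
        "single_quotes": sum(c in SINGLE_QUOTES for c in chars) / total,
    }


print(language_features("Țară frumoasă"))
```

Character n-gram and common-word frequencies would be added to the same dictionary to complete a profile of the kind the slide describes.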
  13. 13. Results: Language Validation • The 105 test texts are divided into: 20 Romanian with diacritics (RO1-RO20), 20 Romanian without diacritics (RO21-RO40), 20 Italian, 15 English, 10 Spanish, 5 Latin, 5 French, 5 Turkish, 3 Catalan, and 2 Aromanian • The size of the texts varied from 9 KB to 2.5 MB, the average size being 253.4 KB • Average scores for the discriminator function – A lower score means a higher probability that the text is written in Romanian – The averages were used to set the discriminant score threshold to 0.77, which separates Romanian from non-Romanian texts
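The decision rule implied by this slide is a simple threshold test. In the sketch below only the 0.77 value and the test-set counts come from the slide; the function name, the handling of a score exactly equal to the threshold, and the example scores are assumptions.

```python
# Composition of the evaluation set as listed on the slide.
TEST_SET = {
    "Romanian with diacritics": 20,
    "Romanian without diacritics": 20,
    "Italian": 20,
    "English": 15,
    "Spanish": 10,
    "Latin": 5,
    "French": 5,
    "Turkish": 5,
    "Catalan": 3,
    "Aromanian": 2,
}

ROMANIAN_THRESHOLD = 0.77  # discriminant score reported on the slide


def is_romanian(score, threshold=ROMANIAN_THRESHOLD):
    """Lower scores mean the text is more likely Romanian.
    Treating a score equal to the threshold as non-Romanian is an assumption."""
    return score < threshold


print(sum(TEST_SET.values()))   # total number of test texts
print(is_romanian(0.40))        # hypothetical score of a Romanian text
print(is_romanian(1.20))        # hypothetical score of a non-Romanian text
```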
  14. 14. Results • Processed 264,328 online documents – Only 12,555 documents contained new words • From this set of texts, we extracted 698,341 phrases – Only 47,363 phrases contained new words • Discovered 53,724 new words – 21,343 are proper names • The remaining tokens are common words, divided into the following main categories: – Misspelled words (approximately 35%) – Technical words (approximately 15%) – Argotic words (approximately 10%) – Clitics, regionalisms, archaisms, and alternative forms of existing words account for the rest (approximately 40%)
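The proportions behind these counts can be derived with a short calculation. The input numbers are taken from the slide, assuming that the 698,341 figure counts extracted phrases; the per-category word counts are only estimates, because the slide gives approximate percentages.

```python
documents_total = 264_328
documents_with_new_words = 12_555
phrases_total = 698_341          # assumed to count extracted phrases
phrases_with_new_words = 47_363
new_words = 53_724
proper_names = 21_343

# Share of documents and phrases that contained at least one new word.
document_share = documents_with_new_words / documents_total
phrase_share = phrases_with_new_words / phrases_total

# New words that are not proper names.
common_words = new_words - proper_names

# Approximate category shares (in percent) from the slide.
category_percent = {"misspelled": 35, "technical": 15, "argotic": 10, "other": 40}
category_estimates = {
    name: round(common_words * percent / 100)
    for name, percent in category_percent.items()
}

print(f"documents with new words: {document_share:.2%}")
print(f"phrases with new words:   {phrase_share:.2%}")
print(f"common (non-proper) new words: {common_words}")
print(category_estimates)
```

The calculation shows that fewer than one in twenty documents contributed any new word, which is why a focused crawl over a large volume of pages is needed.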
  15. 15. Results • Most frequent new words
  16. 16. Conclusions • RWScraper is a simple system for discovering new Romanian words • The project has also created a large database of Romanian words extracted from the Web – It includes statistics about common proper names, frequent spelling mistakes, and newly invented words • Several elements could be further improved – The accuracy of the NLP components used by the system – A more pertinent analysis of the words identified by the system
  17. 17. Thank you! Questions? Discussion This work has been funded by the Sectorial Operational Programme Human Resources Development 2007-2013 of the Romanian Ministry of European Funds through the Financial Agreement POSDRU/159/1.5/S/132397
