Language Identification:
a Neural Network approach
Alberto Simões1 José João Almeida2 Simon D. Byers3
1CEHUM, Minho's University
ambs@ilch.uminho.pt
2CCTC, Minho's University
jj@di.uminho.pt
3AT&T Labs, Bedminster NJ
headers@gmail.com
SLATE2014, 19--20th June 2014
Alberto Simões, José João Almeida, Simon D. Byers Language Identification: a Neural Network approach
In which languages are these texts?
Malgranda Sablodezerto estas
dezerto de Okcidenta Aŭstralio
Esperanto
Po nepavykusių pirmųjų
bandymų su kukurūzais
Lithuanian
In which languages are these texts?
俄罗斯眼下不具备航母建造、
停泊和维护所需的基础设施和条件
Simplified Chinese
임금체계 개편은 기본적으로
노사 합의 또는
Korean
In which languages are these texts?
‫جلوگیری‬ .‫کردند‬ ‫گروه‬ ‫دوم‬ ‫هم‬ ‫به‬
Persian
আেবদনকারীেদর পক্েষ শুনািন কেরন িফদা
Bengali
In which languages are these texts?
ဦးသိန္းစိန္အစိုးရရဲ
ဝန္ကီးအမ်ားစုဟာ စစ္ဗုိလ္နဲ
စစ္ဗိုလ္လူထြက္ေတြ
Burmese
આ રસ મ લ િનચોડી સારી
રી િમકસ કરો અ લાસમ
Gujarati
Approaches
Using a dictionary of words for each language:
Problem: the number of word forms!
Using language features:
compute unigrams, bigrams, trigrams, …;
compute short words;
compute word beginnings or terminations;
Then use language models:
Naïve Bayes;
Hidden Markov Models (HMM);
Support Vector Machines (SVM);
Neural Networks (NN);
Motivation for a new tool
lack of a decent identification tool for Perl;
use of the Chrome Language Detection library is limited:
how to add new languages?
how to restrict results to specific languages?
there are tools for other programming languages:
language interoperability can be a hassle;
not clear how to add new languages;
Why use a Neural Network?
to learn how Neural Networks work!
an approach where:
training is tedious and slow;
identification is easy to implement;
identification is efficient when BLAS is available;
therefore:
it is possible to use the trained data in different programming languages;
it is easy to restrict analysis to a set of languages;
identification probabilities are comparable;
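As a hedged illustration of the last two points: assuming the network yields one score per language, restricting identification to a subset amounts to taking the maximum over the allowed languages only. The `scores` dictionary, its values, and the `best_language` helper are hypothetical and not part of the authors' tool.

```python
from typing import Dict, Iterable, Optional


def best_language(scores: Dict[str, float],
                  allowed: Optional[Iterable[str]] = None) -> str:
    """Return the highest-scoring language, optionally restricted to `allowed`."""
    candidates = set(scores) if allowed is None else set(allowed) & set(scores)
    if not candidates:
        raise ValueError("no candidate language has a score")
    return max(candidates, key=lambda lang: scores[lang])


# Hypothetical output-layer values for one text (illustrative numbers only).
scores = {"pt": 0.48, "pt-br": 0.51, "es": 0.07, "it": 0.02}

unrestricted = best_language(scores)
restricted = best_language(scores, allowed=["pt", "es"])
```

Because the scores are comparable across languages, no retraining is needed to narrow the candidate set in this sketch.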
Neural Network Architecture
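[Figure: feed-forward network. Input layer x1 … xn (features); hidden layer a(2)1 … a(2)s2; output layer y1 … yK. Θ(1) connects the input layer to the hidden layer, Θ(2) connects the hidden layer to the output layer.]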
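A minimal sketch of how identification could run through such a network, with one hidden layer between the features and the K outputs. The sigmoid activation, the bias unit prepended to each layer, and the toy weights are assumptions of this sketch; the slides only name the layers and the two weight matrices.

```python
import math
from typing import List

Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def layer(theta: Matrix, inputs: List[float]) -> List[float]:
    """Apply one layer: prepend a bias unit of 1, then sigmoid of each weighted sum."""
    with_bias = [1.0] + inputs
    return [sigmoid(sum(w * v for w, v in zip(row, with_bias))) for row in theta]


def forward(theta1: Matrix, theta2: Matrix, x: List[float]) -> List[float]:
    """x (n features) -> hidden activations (s2 units) -> y (K outputs)."""
    hidden = layer(theta1, x)
    return layer(theta2, hidden)


# Toy weights (assumed): n = 2 features, s2 = 2 hidden units, K = 2 languages.
theta1 = [[0.0, 4.0, -4.0],
          [0.0, -4.0, 4.0]]
theta2 = [[-2.0, 4.0, 0.0],
          [-2.0, 0.0, 4.0]]

y = forward(theta1, theta2, [1.0, 0.0])
predicted = max(range(len(y)), key=lambda k: y[k])  # index of the winning output
```

Identification is then only two matrix-vector products, which is where a BLAS library would help.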
Preparing Training Data
texts from TED website;
more than 105 languages available!
English texts were matched against an English dictionary;
out-of-vocabulary (OOV) items are removed from the English texts and from the
other-language texts (an attempt to remove named entities written in their
English form from the other texts).
Example
…began spoken word poet Sarah Kay, in a talk that inspired two
standing ovations at TED2011. She tells the story of her
metamorphosis — from a wide-eyed teenager soaking in verse at
New York's Bowery Poetry Club to a teacher connecting kids with
the power of self-expression through Project V.O.I.C.E. — and
gives two breathtaking performances of "B" and "Hiroshima."
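The OOV filtering described above could be sketched as follows. The tokenizer, the tiny dictionary, and the Portuguese sentence are illustrative assumptions; the slides do not say which dictionary or tokenization the authors used.

```python
import re
from typing import Set

WORD = re.compile(r"\w+")  # runs of letters/digits; punctuation is dropped


def oov_items(english_text: str, dictionary: Set[str]) -> Set[str]:
    """Tokens of the English text whose lowercase form is not in the dictionary."""
    return {tok for tok in WORD.findall(english_text)
            if tok.lower() not in dictionary}


def remove_items(text: str, items: Set[str]) -> str:
    """Drop every token listed in `items`, keeping the remaining tokens in order."""
    return " ".join(tok for tok in WORD.findall(text) if tok not in items)


# Hypothetical data: a toy dictionary and two versions of the same sentence.
dictionary = {"poet", "tells", "the", "story"}
english = "Poet Sarah Kay tells the story"
portuguese = "A poeta Sarah Kay conta a história"

oov = oov_items(english, dictionary)
clean_english = remove_items(english, oov)
clean_portuguese = remove_items(portuguese, oov)
```

The names found to be OOV in English are removed from every language version, so they cannot bias the trigram statistics toward English.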
Two Kinds of Features
Alphabet Used
Which characters are used in the text?
Are they typically used in Asiatic, Arabic or Latin text?
Character Sequences Used
Which unigrams, bigrams or trigrams are used?
Which are the most common for each language?
Alphabet Features
Count number of Unicode characters in the following classes:
C1 Latin characters, only a-z, without diacritics;
C2 Cyrillic characters (0x0410-0x042F and 0x0430-0x044F);
C3 Hiragana and Katakana characters (0x3040-0x30FF);
C4 Hangul characters (0xAC00-0xD7AF, 0x1100-0x11FF,
0x3130-0x318F, 0xA960-0xA97F and 0xD7B0-0xD7FF);
C5 Kanji characters (0x4E00-0x9FAF);
C6 Simplified Chinese characters (2877 hand-defined characters);
C7 Traditional Chinese characters (2663 hand-defined characters);
C8 Arabic characters (0x0600-0x06FF);
C9 Thai characters (0x0E00-0x0E7F);
C10 Greek characters (0x0370-0x03FF and 0x1F00-0x1FFF).
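A minimal sketch of computing these class counts as shares of the text, using the code-point ranges listed above. C6 and C7 are left out because the hand-defined character lists are not given in the slides. Lowercasing before matching a-z and dividing by the number of non-space characters are assumptions of this sketch.

```python
from typing import Dict, List, Tuple

# Ranges copied from the class list; C6/C7 omitted (lists not available here).
RANGES: Dict[str, List[Tuple[int, int]]] = {
    "C1": [(0x0061, 0x007A)],                      # Latin a-z
    "C2": [(0x0410, 0x042F), (0x0430, 0x044F)],    # Cyrillic
    "C3": [(0x3040, 0x30FF)],                      # Hiragana and Katakana
    "C4": [(0xAC00, 0xD7AF), (0x1100, 0x11FF), (0x3130, 0x318F),
           (0xA960, 0xA97F), (0xD7B0, 0xD7FF)],    # Hangul
    "C5": [(0x4E00, 0x9FAF)],                      # Kanji
    "C8": [(0x0600, 0x06FF)],                      # Arabic
    "C9": [(0x0E00, 0x0E7F)],                      # Thai
    "C10": [(0x0370, 0x03FF), (0x1F00, 0x1FFF)],   # Greek
}


def alphabet_shares(text: str) -> Dict[str, float]:
    """Share (0.0 to 1.0) of non-space characters that fall in each class."""
    chars = [c for c in text.lower() if not c.isspace()]
    counts = {name: 0 for name in RANGES}
    for c in chars:
        code_point = ord(c)
        for name, spans in RANGES.items():
            if any(low <= code_point <= high for low, high in spans):
                counts[name] += 1
    total = len(chars) or 1  # avoid dividing by zero on empty input
    return {name: n / total for name, n in counts.items()}


russian = alphabet_shares("Привет мир")
korean = alphabet_shares("한국어")
mixed = alphabet_shares("abc δ")
```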
Binarization of Alphabet Features
To reduce entropy in the NN, the
alphabet features are binarized using a set of rules:
set C1 ⇐ C1 > 0.20
set C2 ⇐ C2 > 0.20
set C3 ⇐ C3 > 0.20
set C4 ⇐ C4 > 0.20
set C6 ⇐ C5 > 0.30 ∧ C6 > C7
set C7 ⇐ C5 > 0.30 ∧ C6 < C7
set C8 ⇐ C8 > 0.20
set C9 ⇐ C9 > 0.20
set C10 ⇐ C10 > 0.20
where
set Ci ⇔ Ci ← 1 ∧ ∀ j ≠ i: Cj ← 0
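A hedged sketch of these rules in code. It reads each rule as a strict threshold on the class's share of characters, and takes C6 > C7 to select Simplified Chinese and C6 < C7 to select Traditional Chinese. Applying the rules in the listed order with the first match winning, and leaving all features at 0 when no rule matches, are also assumptions; the slides do not say how such cases are handled.

```python
from typing import Dict, Optional

CLASSES = ["C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8", "C9", "C10"]


def winning_class(share: Dict[str, float]) -> Optional[str]:
    """Return the class selected by the first matching rule, or None."""
    for name in ("C1", "C2", "C3", "C4"):
        if share.get(name, 0.0) > 0.20:
            return name
    if share.get("C5", 0.0) > 0.30:
        simplified = share.get("C6", 0.0)
        traditional = share.get("C7", 0.0)
        if simplified > traditional:
            return "C6"
        if simplified < traditional:
            return "C7"
    for name in ("C8", "C9", "C10"):
        if share.get(name, 0.0) > 0.20:
            return name
    return None


def binarize(share: Dict[str, float]) -> Dict[str, int]:
    """'set Ci': Ci becomes 1 and every other Cj becomes 0."""
    winner = winning_class(share)
    return {name: int(name == winner) for name in CLASSES}


latin = binarize({"C1": 0.90, "C10": 0.05})
chinese = binarize({"C5": 0.60, "C6": 0.40, "C7": 0.10})
undecided = binarize({"C1": 0.10})
```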
Trigram Features
Why Trigrams?
bigrams would be too small to tell apart very close
languages such as Portuguese and Spanish;
tetragrams would be too big for some languages (such as Asiatic ones),
where some glyphs represent whole words or morphemes;
since punctuation and numbers were removed and spaces
normalized, trigrams can also capture the beginning and end
of words, as well as single-character words surrounded by spaces.
Trigram Features: example
Für mich war das eine neue Erkenntnis. Und ich denke, mit der
Zeit, in den kommenden Jahren, Wir haben Künstler, aber leider
haben wir sie noch nicht entdeckt. Der visuelle Ausdruck ist nur
eine Form kultureller Integration. Wir haben erkannt, dass seit
kurzem immer mehr Leutea
Top occurring trigrams
en␣ 0.02299 er␣ 0.02682 ␣de 0.01533 abe 0.01533 der 0.01149
hab 0.01149 ich 0.01149 ir␣ 0.01149 it␣ 0.01149 r␣h 0.01149
␣wi 0.01149 ben 0.01149 ch␣ 0.01149 den 0.01149 wir 0.01149
␣ha 0.01149 ine 0.00766 ler 0.00766 lle 0.00766 n␣k 0.00766
mme 0.00766 ne␣ 0.00766 nnt 0.00766 r␣l 0.00766 r␣m 0.00766
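A minimal sketch of how such relative trigram frequencies could be computed. The cleaning steps follow the description of the features (punctuation and numbers removed, spaces normalized); lowercasing is an assumption suggested by the lowercase trigrams in the table, and the function names are illustrative. The sketch is not meant to reproduce the exact figures above.

```python
import re
from collections import Counter
from typing import Dict


def normalize(text: str) -> str:
    """Lowercase, replace runs of punctuation/digits/whitespace by one space."""
    return re.sub(r"[\W\d_]+", " ", text.lower()).strip()


def trigram_frequencies(text: str) -> Dict[str, float]:
    """Relative frequency of every character trigram in the cleaned text."""
    clean = normalize(text)
    grams = [clean[i:i + 3] for i in range(len(clean) - 2)]
    total = len(grams)
    if total == 0:
        return {}
    return {gram: n / total for gram, n in Counter(grams).items()}


freqs = trigram_frequencies("Wir haben, wir 2 haben")
top = sorted(freqs, key=freqs.get, reverse=True)[:5]  # five most frequent trigrams
```

Trigrams that span a space, such as "r h" or " wi", are what capture word endings and beginnings.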
Trigram Features: Merging
features ← {};
for L ∈ 𝓛 do
    trigrams ← ∅;
    for file ∈ Files_L do
        T ← computeTrigrams(file);    // Str → ℕ
        T ← mostOccurring(T);         // top 30 trigrams
        for t ∈ keys(T) do
            trigrams[t] ← trigrams[t] + 1;
    trigrams ← mostOccurring(trigrams);
    features ← features ∪ keys(trigrams);
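The merging procedure could look like this in Python. This is a sketch rather than the authors' code: texts are passed as strings instead of files, and the per-language cutoff `per_language` is an assumed parameter, since the slides only state the top-30 cutoff per file.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def char_trigrams(text: str) -> Counter:
    """Count every character trigram of the text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def top_keys(counts: Counter, n: int) -> List[str]:
    """Keys of the n most frequent entries."""
    return [key for key, _ in counts.most_common(n)]


def merge_features(texts_by_language: Dict[str, Iterable[str]],
                   per_file: int = 30, per_language: int = 30) -> Set[str]:
    features: Set[str] = set()
    for language, texts in texts_by_language.items():
        votes: Counter = Counter()
        for text in texts:
            # Each file gives one vote to each of its most frequent trigrams.
            votes.update(top_keys(char_trigrams(text), per_file))
        # Keep the trigrams that were top-ranked in the most files.
        features |= set(top_keys(votes, per_language))
    return features


# Toy corpus with made-up language codes, for illustration only.
corpus = {"xx": ["abcabc", "xabcabc"], "yy": ["zzzz"]}
selected = merge_features(corpus, per_file=1, per_language=1)
```

Counting files rather than raw occurrences keeps one very long file from dominating a language's feature list.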
Training Data Matrix (excerpt)
      Alphabet Features       Trigram Features
Lang  Latin  Greek  Cyril.    ␣pa     ới␣     par     nia     ест     ати.    ата
PT    1      0      0         0.0041  0       0.0038  0.0001  0       0       0
PT    1      0      0         0.0039  0       0.0036  0       0       0       0
RU    0      0      1         0       0       0       0       0.0020  0.0004  0.0003
RU    0      0      1         0       0       0       0       0.0026  0.0005  0.0002
UK    0      0      1         0       0       0       0       0.0003  0.0034  0.0001
UK    0      0      1         0       0       0       0       0.0003  0.0026  0.0001
VI    1      0      0         0       0.0028  0       0       0       0       0
VI    1      0      0         0       0.0029  0       0.0001  0       0       0
Experiment 1: 25 languages
Arabic (AR)
Bulgarian (BG)
German (DE)
Modern Greek (EL)
Spanish (ES)
Persian (FA)
French (FR)
Hebrew (HE)
Hungarian (HU)
Italian (IT)
Japanese (JA)
Korean (KO)
Dutch (NL)
Polish (PL)
Portuguese (PT)
Brazilian Portuguese (PT-BR)
Romanian (RO)
Russian (RU)
Serbian (SR)
Thai (TH)
Turkish (TR)
Ukrainian (UK)
Vietnamese (VI)
Traditional Chinese (ZH-TW)
Simplified Chinese (ZH-CN)
Exp 1: Training and Test Sets
Training Set (30 files/lang) | Test Set (21 files/lang)
Lang. Smallest Largest Mean (x̄) Std. dev. (σ) | Smallest Largest Mean (x̄) Std. dev. (σ)
ar 871921 969387 907562 21392 863 4618 2366 1210
bg 988450 1087435 1027581 23663 660 2099 1091 378
de 588200 653508 618463 16475 677 3890 1554 842
el 773265 885770 841203 22653 550 3297 1590 705
es 578806 651240 617341 17637 897 3850 2342 935
fa 651807 766206 697212 28994 600 5221 1338 967
fr 639582 705675 673414 15377 936 4088 1879 689
he 806098 877218 836222 20545 559 3649 1586 878
hu 406271 454506 431797 13131 729 6045 2175 1356
it 588147 643252 616391 14348 1260 6607 2991 1370
ja 538033 606053 569956 18871 323 785 495 133
ko 737118 817651 773168 20550 530 1603 780 233
nl 533497 580313 557724 14033 552 1949 1115 381
pl 521184 591299 551259 17938 435 3092 1605 694
pt-br 596158 643215 617734 14028 920 3189 1953 589
pt 338272 378872 355800 10605 486 5875 2031 1169
ro 592714 650375 616051 15442 718 3254 1438 695
ru 1019789 1144200 1069884 31232 662 2470 1444 526
sr 349389 433221 379344 20560 834 6493 1813 1263
th 529484 601244 565082 18551 334 3242 1396 734
tr 494191 549998 524271 12774 332 5390 1559 1121
uk 370785 434683 395312 16641 299 15435 2430 3553
vi 470057 541930 510409 17246 680 6237 1555 1359
zh-cn 536438 595027 562728 14457 495 6331 1695 1559
zh-tw 514993 588860 542879 16000 270 1721 925 428
Exp1: Accuracy
Language          1500 iters.  4000 iters.  Notes
ar, bg, de        100%         100%
el, es, fa        100%         100%
fr, he, hu        100%         100%
it, ja, ko        100%         100%
nl, pl            100%         100%
pt                5%           52%          misclassified as pt-br
pt-br             100%         76%          misclassified as pt
ro, ru, sr        100%         100%
th, tr, uk        100%         100%
vi, zh-cn, zh-tw  100%         100%
Exp1: Comparison of PT variants
[Figure: two panels, PT and PT-BR]
Experiment 2: 55 languages
Afrikaans
Albanian
Arabic
Bulgarian
Bengali
Catalan
Czech
Danish
German
Modern Greek
English
Esperanto
Spanish
Estonian
Persian
Finnish
French
Galician
Gujarati
Hebrew
Hindi
Hungarian
Armenian
Indonesian
Italian
Japanese
Georgian
Kannada
Korean
Kurdish
Lithuanian
Latvian
Macedonian
Malayalam
Marathi
Burmese
Nepali
Dutch
Polish
Portuguese
Romanian
Russian
Slovak
Slovenian
Somali
Serbian
Swedish
Tamil
Thai
Turkish
Ukrainian
Urdu
Vietnamese
Chinese (simplified)
Chinese (traditional)
Exp 2: Results
55 languages,
1,126 features,
the Θ(l) matrices take 11 MB on disk (binary format),
7500 iterations of the learning algorithm,
taking 6574 minutes and 50.386 seconds (more than 4.5 days),
still 21 test files per language,
46 seconds to run over the 1155 test files,
accuracy of 99.740%,
misidentifications:
2 Bulgarian texts detected as Macedonian,
1 Danish text detected as Dutch.
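The reported figures can be cross-checked with a few lines of arithmetic. The variable names are illustrative; all the numbers come from the list above.

```python
languages = 55
files_per_language = 21
misidentified = 2 + 1  # Bulgarian as Macedonian, Danish as Dutch

total_files = languages * files_per_language
accuracy = 100 * (total_files - misidentified) / total_files

training_minutes = 6574 + 50.386 / 60
training_days = training_minutes / (60 * 24)

print(f"{total_files} test files, accuracy {accuracy:.3f}%")
print(f"training time: {training_days:.2f} days")
```

The three misidentified files out of 1155 account for the stated accuracy, and the training time works out to a little over four and a half days.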
Conclusions
Up to 96% accuracy when testing a few languages that
include two Portuguese variants;
Over 99.7% accuracy for 55 languages;
NNs can grow to cover more languages, but training time grows steeply;
The choice of features is relevant
(if a specific detail is known to distinguish a
language well, add it to the network!);
The results obtained are not "deterministic". Although a similar
proportion of correct results is expected, the random initialization of
the network may lead to somewhat different results after a different
number of iterations.
Future Work
Reduce the number of trigrams per language and include unigrams;
Compute distribution differences between closely related languages;
Experiment with training a separate neural network for
each alphabet;
Include a regularization coefficient (λ ≠ 0);
Experiment with Deep Neural Networks;
Test language identification on short texts (namely Twitter
tweets).
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...Bert Jan Schrijver
 
VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024VictoriaMetrics
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 

Recently uploaded (20)

Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZ
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository world
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecture
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh IT
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration tools
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
 
VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 

Language Identification Using Neural Networks

  • 1. Language Identification: a Neural Network approach. Alberto Simões (1), José João Almeida (2), Simon D. Byers (3). (1) CEHUM, Minho's University, ambs@ilch.uminho.pt; (2) CCTC, Minho's University, jj@di.uminho.pt; (3) AT&T Labs, Bedminster NJ, headers@gmail.com. SLATE2014, 19-20th June 2014. Alberto Simões, José João Almeida, Simon D. Byers: Language Identification: a Neural Network approach
  • 2. In which languages are these texts? "Malgranda Sablodezerto estas dezerto de Okcidenta Aŭstralio" (Esperanto); "Po nepavykusių pirmųjų bandymų su kukurūzais" (Lithuanian)
  • 5. In which languages are these texts? "俄罗斯眼下不具备航母建造、停泊和维护所需的基础设施和条件" (Simplified Chinese); "임금체계 개편은 기본적으로 노사 합의 또는" (Korean)
  • 8. In which languages are these texts? "جلوگیری کردند. گروه دوم هم به" (Persian); "আবেদনকারীদের পক্ষে শুনানি করেন ফিদা" (Bengali)
  • 11. In which languages are these texts? "ဦးသိန္းစိန္အစိုးရရဲ ဝန္ကီးအမ်ားစုဟာ စစ္ဗုိလ္နဲ စစ္ဗိုလ္လူထြက္ေတြ" (Burmese); "આ રસ મ લ િનચોડી સારી રી િમકસ કરો અ લાસમ" (Gujarati)
  • 14. Approaches. Using a dictionary of words for each language (problem: the number of word forms!). Using language features: compute unigrams, bigrams, trigrams, …; compute short words; compute word beginnings or terminations. Then use language models: Naïve Bayes; Hidden Markov Models (HMM); Support Vector Machines (SVM); Neural Networks (NN).
  • 17. Motivation for a new tool. Lack of a decent identification tool for Perl. Use of the Chrome Language Detection library is limited: how to add new languages? How to restrict results to specific languages? There are tools for other programming languages, but language interoperability can be a hassle and it is not clear how to add new languages.
  • 18. Why use a Neural Network? To learn how Neural Networks work! It is an approach where: training is tedious and slow; identification is easy to implement; identification is efficient when BLAS is available. Therefore: trained data can be used from different programming languages; it is easy to restrict analysis to a set of languages; identification probabilities are comparable.
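The point about restricting analysis to a set of languages can be sketched in a few lines. This is only an illustration and not the authors' implementation: the function name `restrict_and_pick`, the dictionary-of-scores input format, and the renormalisation step are all assumptions made for this example.

```python
def restrict_and_pick(scores, allowed):
    """Pick the best language among `allowed`, given per-language scores.

    `scores` maps a language code to the network output for that language
    (a hypothetical input format). Languages outside `allowed` are ignored.
    """
    kept = {lang: s for lang, s in scores.items() if lang in allowed}
    if not kept:
        raise ValueError("no allowed language has a score")
    total = sum(kept.values())
    # Renormalise so the kept scores sum to 1. This step is an assumption;
    # the slide only states that the probabilities are comparable.
    if total > 0:
        kept = {lang: s / total for lang, s in kept.items()}
    best = max(kept, key=kept.get)
    return best, kept
```

Because the outputs for all languages come from one network, dropping the unwanted ones and comparing the rest is enough; no retraining is needed for the restricted set.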
  • 19. Neural Network Architecture [figure: input layer (features) x1, x2, x3, …, xn; hidden layer a(2)_1, a(2)_2, a(2)_3, …, a(2)_s2; output layer y1, y2, …, yK; weight matrices Θ(1) between input and hidden layer and Θ(2) between hidden and output layer]
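A forward pass through a network of this shape (inputs x, one hidden layer, outputs y, weight matrices Θ(1) and Θ(2)) might look like the sketch below. The slide does not state the activation function or whether bias units are used, so the sigmoid activation, the bias unit prepended to each layer, and the row-per-unit layout of the matrices are assumptions of this example.

```python
import math

def sigmoid(z):
    """Numerically safe logistic function (assumed activation)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def layer(theta, activations):
    """Apply one layer: prepend the bias unit 1.0, then sigmoid(theta . a).

    `theta` holds one row of weights per unit of the next layer; the first
    weight in each row multiplies the bias unit.
    """
    with_bias = [1.0] + list(activations)
    return [sigmoid(sum(w * a for w, a in zip(row, with_bias))) for row in theta]

def forward(theta1, theta2, x):
    """Feed features x through the hidden layer, then the output layer."""
    hidden = layer(theta1, x)
    return layer(theta2, hidden)
```

Identification is just these two matrix-vector products, which is why the trained Θ matrices can be reused from any programming language with a BLAS binding.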
  • 20. Preparing Training Data. Texts from the TED website; more than 105 languages available! English texts were matched against an English dictionary; OOV (out-of-vocabulary) items are removed from the English texts and from the other language texts (trying to remove named entities written in their English form from other texts). Example: "…began spoken word poet Sarah Kay, in a talk that inspired two standing ovations at TED2011. She tells the story of her metamorphosis - from a wide-eyed teenager soaking in verse at New York's Bowery Poetry Club to a teacher connecting kids with the power of self-expression through Project V.O.I.C.E. - and gives two breathtaking performances of "B" and "Hiroshima.""
  • 23. Two kinds of features. Used alphabet: which computer characters are used in the text? Are they usually used in Asiatic, Arabic or Latin text? Used sequences of characters: which unigrams, bigrams or trigrams are used? Which are most common for each language?
  • 25. Alphabet Features. Count the number of Unicode characters in the following classes:
      C1: Latin characters, only a-z, without diacritics;
      C2: Cyrillic characters (0x0410-0x042F and 0x0430-0x044F);
      C3: Hiragana and Katakana characters (0x3040-0x30FF);
      C4: Hangul characters (0xAC00-0xD7AF, 0x1100-0x11FF, 0x3130-0x318F, 0xA960-0xA97F and 0xD7B0-0xD7FF);
      C5: Kanji characters (0x4E00-0x9FAF);
      C6: Simplified Chinese characters (2877 hand-defined characters);
      C7: Traditional Chinese characters (2663 hand-defined characters);
      C8: Arabic characters (0x0600-0x06FF);
      C9: Thai characters (0x0E00-0x0E7F);
      C10: Greek characters (0x0370-0x03FF and 0x1F00-0x1FFF).
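Counting characters per class could be sketched as follows, using the code point ranges listed on the slide. Several details are assumptions of this example rather than facts from the slides: the text is lowercased before matching a-z, whitespace is not counted, the result is returned as a proportion of the counted characters, and the hand-defined Chinese sets C6 and C7 are left out because their contents are not given.

```python
# Code point ranges copied from the slide. C6 and C7 are hand-defined
# character sets that the slide does not list, so they are omitted here.
RANGES = {
    "C1": [(0x0061, 0x007A)],
    "C2": [(0x0410, 0x042F), (0x0430, 0x044F)],
    "C3": [(0x3040, 0x30FF)],
    "C4": [(0xAC00, 0xD7AF), (0x1100, 0x11FF), (0x3130, 0x318F),
           (0xA960, 0xA97F), (0xD7B0, 0xD7FF)],
    "C5": [(0x4E00, 0x9FAF)],
    "C8": [(0x0600, 0x06FF)],
    "C9": [(0x0E00, 0x0E7F)],
    "C10": [(0x0370, 0x03FF), (0x1F00, 0x1FFF)],
}

def alphabet_features(text):
    """Return, per class, the share of non-space characters in that class."""
    counts = {name: 0 for name in RANGES}
    total = 0
    for ch in text.lower():          # lowercasing is an assumption
        if ch.isspace():
            continue                 # spaces are not counted (assumption)
        total += 1
        code_point = ord(ch)
        for name, ranges in RANGES.items():
            if any(low <= code_point <= high for low, high in ranges):
                counts[name] += 1
    if total == 0:
        return {name: 0.0 for name in RANGES}
    return {name: count / total for name, count in counts.items()}
```

For instance, `alphabet_features("abc привет")` assigns three of the nine counted characters to C1 and six to C2.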
  • 26. Binarization of Alphabet Features. To reduce entropy in the NN, alphabet features are binarized using a set of rules:
      set C1 ⇐ C1 > 0.20
      set C2 ⇐ C2 > 0.20
      set C3 ⇐ C3 > 0.20
      set C4 ⇐ C4 > 0.20
      set C6 ⇐ C5 > 0.30 ∧ C6 > C7
      set C7 ⇐ C5 > 0.30 ∧ C6 < C7
      set C8 ⇐ C8 > 0.20
      set C9 ⇐ C9 > 0.20
      set C10 ⇐ C10 > 0.20
    where set Ci ⇔ Ci ← 1 ∧ ∀ j ≠ i: Cj ← 0
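The binarization rules can be sketched as below. This is a reading of the slide under stated assumptions, not the authors' code: each threshold rule is taken as a strict "greater than" comparison, the two Chinese rules are taken to compare C6 against C7, the rules are tried in the listed order with the first match winning, and all classes are set to 0 when no rule fires.

```python
CLASSES = ["C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8", "C9", "C10"]

def binarize(freqs):
    """Turn per-class proportions into a 0/1 vector with at most one class set.

    `freqs` maps class names to proportions; missing classes count as 0.
    """
    def set_only(name):
        # "set Ci": Ci becomes 1 and every other class becomes 0.
        return {c: (1 if c == name else 0) for c in CLASSES}

    def value(name):
        return freqs.get(name, 0.0)

    for name in ("C1", "C2", "C3", "C4"):
        if value(name) > 0.20:
            return set_only(name)
    if value("C5") > 0.30:
        # Simplified vs. Traditional Chinese is decided by which of the
        # hand-defined sets is better represented (assumed direction).
        if value("C6") > value("C7"):
            return set_only("C6")
        if value("C6") < value("C7"):
            return set_only("C7")
    for name in ("C8", "C9", "C10"):
        if value(name) > 0.20:
            return set_only(name)
    # No rule fired: leave every class at 0 (an assumption of this sketch).
    return {c: 0 for c in CLASSES}
```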
  • 27. Trigram Features. Why trigrams? Bigrams would be too small when comparing very close languages like Portuguese and Spanish; tetragrams would be too big for some languages (like Asiatic ones), where some glyphs represent words or morphemes; as punctuation and numbers were removed and spaces normalized, trigrams can also capture the beginning or end of words, as well as single-character words that appear surrounded by spaces.
  • 28. Trigram Features: example. Text: "Für mich war das eine neue Erkenntnis. Und ich denke, mit der Zeit, in den kommenden Jahren, Wir haben Künstler, aber leider haben wir sie noch nicht entdeckt. Der visuelle Ausdruck ist nur eine Form kultureller Integration. Wir haben erkannt, dass seit kurzem immer mehr Leutea". Top occurring trigrams (␣ marks a space):
      en␣ 0.02299   er␣ 0.02682   ␣de 0.01533   abe 0.01533   der 0.01149
      hab 0.01149   ich 0.01149   ir␣ 0.01149   it␣ 0.01149   r␣h 0.01149
      ␣wi 0.01149   ben 0.01149   ch␣ 0.01149   den 0.01149   wir 0.01149
      ␣ha 0.01149   ine 0.00766   ler 0.00766   lle 0.00766   n␣k 0.00766
      mme 0.00766   ne␣ 0.00766   nnt 0.00766   r␣l 0.00766   r␣m 0.00766
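Computing relative trigram frequencies like those in the example could look like the following sketch. It removes punctuation and numbers and normalizes spaces as the slides describe; lowercasing the text, padding it with one space at each end, and the function names are assumptions of this example.

```python
import re
import unicodedata
from collections import Counter

def trigram_frequencies(text):
    """Return a dict mapping each character trigram to its relative frequency."""
    # Drop punctuation (Unicode categories P*) and numbers (categories N*).
    kept = "".join(
        ch for ch in text.lower()            # lowercasing is an assumption
        if unicodedata.category(ch)[0] not in ("P", "N")
    )
    # Normalize runs of whitespace to a single space.
    cleaned = re.sub(r"\s+", " ", kept).strip()
    if not cleaned:
        return {}
    # Pad so that word starts and ends at the text boundaries are captured
    # too (an assumption of this sketch).
    padded = " " + cleaned + " "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    total = sum(counts.values())
    return {trigram: count / total for trigram, count in counts.items()}

def top_trigrams(freqs, n=30):
    """Return the n most frequent trigrams, ties broken alphabetically."""
    return sorted(freqs.items(), key=lambda item: (-item[1], item[0]))[:n]
```

Working on characters rather than words means the same code applies unchanged to scripts without word separators.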
  • 30. Trigram Features: Merging
      features ← {};
      for L ∈ ℒ do
          trigrams ← ∅;
          for file ∈ Files_L do
              T ← computeTrigrams(file);    // Str → ℕ
              T ← mostOccurring(T);         // Top 30 trigrams
              for t ∈ keys(T) do
                  trigrams[t] ← trigrams[t] + 1;
          T ← mostOccurring(T);
          features ← features ∪ keys(trigrams);
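The merging procedure can be sketched in Python as follows. This is an interpretation, not the authors' code: it reads the last selection step as keeping, for each language, the trigrams that appear most often among the per-file top lists, and the cut-off `per_language` is a hypothetical parameter because the slide gives no value for it. The helper `count_trigrams` is a simplified stand-in for computeTrigrams.

```python
from collections import Counter

def count_trigrams(text):
    """Simplified stand-in for computeTrigrams: raw trigram counts."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def most_occurring(counts, n):
    """Keys of the n largest counts, ties broken alphabetically."""
    return sorted(counts, key=lambda key: (-counts[key], key))[:n]

def merge_features(files_by_language, per_file=30, per_language=30):
    """Build the global trigram feature list from per-language file texts.

    `files_by_language` maps a language code to a list of file contents.
    `per_language` is a hypothetical cut-off, not taken from the slides.
    """
    features = set()
    for language, files in files_by_language.items():
        tally = Counter()
        for text in files:
            top = most_occurring(count_trigrams(text), per_file)
            tally.update(top)        # one vote per file for each top trigram
        features.update(most_occurring(tally, per_language))
    return sorted(features)
```

Voting per file rather than summing raw counts keeps one unusually long file from dominating a language's feature set.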
  • 31. Training Data Matrix (excerpt)
            Alphabet features       Trigram features
      Lang  Latin  Greek  Cyril.    ␣pa     ới␣     par     nia     ест     ати.    ата
      PT    1      0      0         0.0041  0       0.0038  0.0001  0       0       0
      PT    1      0      0         0.0039  0       0.0036  0       0       0       0
      RU    0      0      1         0       0       0       0       0.0020  0.0004  0.0003
      RU    0      0      1         0       0       0       0       0.0026  0.0005  0.0002
      UK    0      0      1         0       0       0       0       0.0003  0.0034  0.0001
      UK    0      0      1         0       0       0       0       0.0003  0.0026  0.0001
      VI    1      0      0         0       0.0028  0       0       0       0       0
      VI    1      0      0         0       0.0029  0       0.0001  0       0       0
  • 32. Experiment 1: 25 languages. Arabic (AR), Bulgarian (BG), German (DE), Modern Greek (EL), Spanish (ES), Persian (FA), French (FR), Hebrew (HE), Hungarian (HU), Italian (IT), Japanese (JA), Korean (KO), Dutch (NL), Polish (PL), Portuguese (PT), Brazilian Portuguese (PT-BR), Romanian (RO), Russian (RU), Serbian (SR), Thai (TH), Turkish (TR), Ukrainian (UK), Vietnamese (VI), Traditional Chinese (ZH-TW), Simplified Chinese (ZH-CN)
  • 33. Exp 1: Training and Test Sets
             Training Set (30 files/lang)           Test Set (21 files/lang)
      Lang.  Smaller  Larger   Mean     σ           Smaller  Larger  Mean  σ
      ar     871921   969387   907562   21392       863      4618    2366  1210
      bg     988450   1087435  1027581  23663       660      2099    1091  378
      de     588200   653508   618463   16475       677      3890    1554  842
      el     773265   885770   841203   22653       550      3297    1590  705
      es     578806   651240   617341   17637       897      3850    2342  935
      fa     651807   766206   697212   28994       600      5221    1338  967
      fr     639582   705675   673414   15377       936      4088    1879  689
      he     806098   877218   836222   20545       559      3649    1586  878
      hu     406271   454506   431797   13131       729      6045    2175  1356
      it     588147   643252   616391   14348       1260     6607    2991  1370
      ja     538033   606053   569956   18871       323      785     495   133
      ko     737118   817651   773168   20550       530      1603    780   233
      nl     533497   580313   557724   14033       552      1949    1115  381
      pl     521184   591299   551259   17938       435      3092    1605  694
      pt-br  596158   643215   617734   14028       920      3189    1953  589
      pt     338272   378872   355800   10605       486      5875    2031  1169
      ro     592714   650375   616051   15442       718      3254    1438  695
      ru     1019789  1144200  1069884  31232       662      2470    1444  526
      sr     349389   433221   379344   20560       834      6493    1813  1263
      th     529484   601244   565082   18551       334      3242    1396  734
      tr     494191   549998   524271   12774       332      5390    1559  1121
      uk     370785   434683   395312   16641       299      15435   2430  3553
      vi     470057   541930   510409   17246       680      6237    1555  1359
      zh-cn  536438   595027   562728   14457       495      6331    1695  1559
      zh-tw  514993   588860   542879   16000       270      1721    925   428
  • 34. Exp1: Accuracy
      Language          1500 iters.  4000 iters.  Note
      ar, bg, de        100%         100%
      el, es, fa        100%         100%
      fr, he, hu        100%         100%
      it, ja, ko        100%         100%
      nl, pl            100%         100%
      pt                5%           52%          wrongly classified as pt-br
      pt-br             100%         76%          wrongly classified as pt
      ro, ru, sr        100%         100%
      th, tr, uk        100%         100%
      vi, zh-cn, zh-tw  100%         100%
  • 35. Exp1: Comparison of PT variants [figure: PT versus PT-BR]
  • 36. Experiment 2: 55 languages. Afrikaans, Albanian, Arabic, Bulgarian, Bengali, Catalan, Czech, Danish, German, Modern Greek, English, Esperanto, Spanish, Estonian, Persian, Finnish, French, Galician, Gujarati, Hebrew, Hindi, Hungarian, Armenian, Indonesian, Italian, Japanese, Georgian, Kannada, Korean, Kurdish, Lithuanian, Latvian, Macedonian, Malayalam, Marathi, Burmese, Nepali, Dutch, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Serbian, Swedish, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, Chinese (simplified), Chinese (traditional)
  • 37. Exp 2: Results. 55 languages; 1,126 features; the Θ(l) matrices take 11 MB on disk (binary format); the learning algorithm ran 7500 iterations, taking 6574 minutes and 50.386 seconds (more than 4.5 days); still 21 test files per language; 46 seconds to run over the 1155 test files; accuracy of 99.740%; mis-identifications: 2 Bulgarian texts detected as Macedonian, 1 Danish text detected as Dutch.
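The figures on this slide are consistent with each other, as a quick check shows. The variable names below are chosen for this example only; all numbers come from the slide.

```python
languages = 55
files_per_language = 21
test_files = languages * files_per_language          # 1155 test files

misidentified = 2 + 1      # 2 Bulgarian texts + 1 Danish text
accuracy = 100.0 * (test_files - misidentified) / test_files

# Training time: 6574 minutes and 50.386 seconds, converted to days.
training_days = (6574 + 50.386 / 60) / (60 * 24)

print(test_files, round(accuracy, 3), round(training_days, 2))
```

Three wrong answers out of 1155 files give the reported 99.740%, and the training time works out to a little over four and a half days.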
  • 38. Conclusions. Up to 96% accuracy when testing few languages that include the two Portuguese variants; over 99.7% accuracy for 55 languages; neural networks can grow to cover more languages, but training time grows excessively; the choice of features is relevant (if we know a specific detail that helps distinguish a language, add it to the network!). The results obtained are not "deterministic": although the same proportion of correct results is expected, the random initialization of the network may lead to somewhat different results for different numbers of iterations.
  • 39. Future Work. Reduce the number of trigrams per language and include unigrams; compute distribution differences between near languages; experiment with training a different neural network for each alphabet; include a regularization coefficient (λ ≠ 0); experiment with Deep Neural Networks; test language identification on short texts (namely Twitter tweets).