Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pandas by Kostas Perifanos

Authorship Attribution & Forensic
Linguistics with Python/Scikit-Learn/Pandas
Kostas Perifanos, Search & Analytics Engineer
@perifanoskostas
Learner Analytics & Data Science Team

Definition
“Automated authorship attribution is the problem
of identifying the author of an anonymous text, or
text whose authorship is in doubt” [Love, 2002]

Domains of application
● Author attribution
● Author verification
● Plagiarism detection
● Author profiling [age, education, gender]
● Stylistic inconsistencies [multiple collaborators/authors]
● Can be also applied in computer code, music scores, ...

“Automated authorship attribution is the problem
of identifying the author of an anonymous text, or
text whose authorship is in doubt”
“Automation”, “identification”, “text”: Machine Learning

A classification problem
● Define classes
● Extract features
● Train ML classifier
● Evaluate

Class definition[s]
● AuthorA, AuthorB, AuthorC, …
● Author vs rest-of-the-world [1-class classification
problem]
● Or even, in extended contexts, a clustering problem

Feature extraction
● Lexical features
● Character features
● Syntactic features
● Application specific

Feature extraction
● Lexical features
● Word length, sentence length etc
● Vocabulary richness [lexical density: functional word vs content words ratio]
● Word frequencies
● Word n-grams
● Spelling errors

Feature extraction
● Character features
● Character types (letters, digits, punctuation)
● Character n-grams (fixed and variable length)
● Compression methods [Entropy, which is really nice but for another talk :) ]

Feature extraction
● Syntactic features
● Part-of-speech tags [eg Verbs (VB), Nouns (NN), Prepositions (PP) etc]
● Sentence and phrase structure
● Errors

Feature extraction
● Semantic features
● Synonyms
● Semantic dependencies
● Application specific features
● Structural
● Content specific
● Language specific

Demo application
Let’s apply a classification algorithm on texts, using word
and character n-grams and POS n-grams
Data set (1): 12867 tweets from 10 users, in Greek
Language, collected in 2012 [4]
Data set (2): 1157 judgments from 2 judges, in English [5]

But what’s an “n-gram”?
[…]an n-gram is a contiguous sequence of n items from a given sequence of
text. [http://en.wikipedia.org/wiki/N-gram]
So, for the sentence above:
word 2-grams (or bigrams): [ (an, n-gram), (n-gram, is), (is, a), (a,
contiguous), …]
char 2-grams: [ ‘an’, ‘n ‘, ‘ n’, ‘n-’, ‘-g’, …]
We will use the TF-IDF weighted frequencies of both word and character n-
grams as features.

Enter Python
Flashback [or, transforming experiments to accepted papers in t<=2h]
A few months earlier, Dec 13, just one day before my holidays I get this call...

Load the dataset
# assume we have the data in 10 tsv files, one file per author.
# each file consists of two columns, id and actual text
import pandas as pd
def load_corpus(input_dir):
trainfiles= [ f for f in listdir( input_dir ) if isfile(join(input_dir ,f)) ]
trainset = []
for filename in trainfiles:
df = pd.read_csv( input_dir + "/" + filename , sep="t",
dtype={ 'id':object, 'text':object } )
for row in df['text']:
trainset.append( { "label":filename, "text": row } )
return trainset

Extract features [1]
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion
word_vector = TfidfVectorizer( analyzer="word" , ngram_range=(2,2),
max_features = 2000, binary = False )
char_vector = TfidfVectorizer(ngram_range=(2, 3), analyzer="char",
max_features = 2000,binary=False, min_df=0 )
for item in trainset:
corpus.append( item[“text”] )
classes.append( item["label"] )
#our vectors are the feature union of word/char ngrams
vectorizer = FeatureUnion([ ("chars", char_vector),("words", word_vector) ] )
# load corpus, use fit_transform to get vectors
X = vectorizer.fit_transform(corpus)

Extract features [2]
import nltk
#generate POS tags using nltk, return the sequence as whitespace separated string
def pos_tags(txt):
tokens = nltk.word_tokenize(txt)
return " ".join( [ tag for (word, tag) in nltk.pos_tag( tokens ) ] )
#combine word and char ngrams with POS-ngrams
tag_vector = TfidfVectorizer( analyzer="word" , ngram_range=(2,2),
binary = False, max_features= 2000, decode_error = 'ignore' )
X1 = vectorizer.fit_transform( corpus )
X2 = tag_vector.fit_transform( tags )
#concatenate the two matrices
X = sp.hstack((X1, X2), format='csr')

Extract features [2.1]
#this last part is a little bit tricky
X = sp.hstack((X1, X2), format='csr')
There was no (obvious) way to use FeatureUnion
X1, X2 are sparse matrices - so, we are using hstack to stack two matrices horizontally
(column wise)
http://docs.scipy.org/doc/numpy/reference/generated/numpy.hstack.html

Put everything together
word n-
grams
character n-
grams
POS tags n-
grams
(optional)
feature vector components
Author: A function of

Fit the model and evaluate (10-fold-CV)
model = LinearSVC( loss='l1', dual=True)
scores = cross_validation.cross_val_score( estimator = model,
X = matrix.toarray(),
y= np.asarray(classes), cv=10 )
print "10-fold cross validation results:", "mean score = ", scores.mean(),
"std=", scores.std(), ", num folds =", len(scores)
Results: 96% accuracy for two authors, using 10-fold-
CV

Evaluate (train set vs test set)
from sklearn.cross_validation import train_test_split
model = LinearSVC( loss='l1', dual=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
y_pred = model.fit(X_train, y_train).predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
pl.matshow(cm)
pl.title('Confusion matrix')
pl.colorbar()
pl.ylabel('True label')
pl.xlabel('Predicted label')
pl.show()

[[ 57 1 2 0 4 8 3 27 13 2]
[ 0 71 1 1 0 13 0 8 6 0]
[ 3 0 51 1 3 5 4 8 25 0]
[ 0 1 0 207 2 8 8 8 82 2]
[ 5 4 3 7 106 30 10 25 23 3]
[ 9 11 3 15 11 350 14 46 42 12]
[ 3 1 3 8 13 16 244 21 38 5]
[ 8 12 10 3 11 46 13 414 39 8]
[ 8 4 7 59 11 21 31 49 579 10]
[ 2 6 1 4 3 24 13 29 15 61]]
Confusion Matrix

Interesting questions
● Many authors?
● Short texts / “micro messages"?
● Is writing style affected by time/age?
● Can we detect “mood”?
● Psychological profiles?
● What about obfuscation?
● Even more subtle problems [PAN Workshop 2013]
● Other applications (code, music scores etc)

References & Libraries
1. Authorship Attribution: An Introduction, Harold Love, 2002
2. A Survey of Modern Authorship Attribution Methods,Efstathios
Stamatatos, 2007
3. Authorship Attribution, Patrick Juola, 2008
4. Authorship Attribution in Greek Tweets Using Author's Multilevel
N-Gram Profiles, G. Mikros, Kostas Perifanos. 2012
5. Authorship Attribution with Latent Dirichlet Allocation,
Seroussi,Zukerman, Bohnert, 2011
Python libraries:
● Pandas: http://pandas.pydata.org/
● Scikit-learn: http://scikit-learn.org/stable/
● nltk, http://www.nltk.org/
Data:
www.csse.monash.edu.au/research/umnl/data
Demo Python code:
https://gist.github.com/kperi/f0730ff3028f7be86b15

Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pandas by Kostas Perifanos

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pandas by Kostas Perifanos

Similar to Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pandas by Kostas Perifanos (20)

More from PyData

More from PyData (20)

Recently uploaded

Recently uploaded (20)

Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pandas by Kostas Perifanos