Twitter Mention Graph
SOCIAL NETWORK ANALYSIS
Sotiris Baratsas
MSc in Business Analytics
TASK 1: Twitter Mention Graph
Our first task is to create a weighted, directed graph with igraph, using raw data from
Twitter. To do that, we will first clean the data and bring it as close as possible to the
format we want to import into R. The most efficient way to do this is with the native
commands of the Unix terminal (bash), which can process the data much faster than
the alternatives.
Step 1: Extract only the dates we want
The first thing we can do, in order to work faster with the dataset, is to extract only the
dates we want. To do that, we use the “grep” command to keep only the rows that
start with “T 2009-07-01”, as well as the next 2 rows after every match (requested with
the -A 2 flag). In this way, we keep a total of 3 lines for every match: the date, the user
and the tweet.
time grep -A 2 "^T.2009-07-01" tweets2009-07.txt > tweets1.txt
# real 1m45.001s
# user 1m31.585s
# sys 0m4.816s
grep -A 2 "^T.2009-07-02" tweets2009-07.txt > tweets2.txt
grep -A 2 "^T.2009-07-03" tweets2009-07.txt > tweets3.txt
grep -A 2 "^T.2009-07-04" tweets2009-07.txt > tweets4.txt
grep -A 2 "^T.2009-07-05" tweets2009-07.txt > tweets5.txt
As we can see, creating each file takes about 1.5 minutes, which is quite good.
Step 2: Clean the data
After extracting only the dates we want, we could load the data into Python and start
cleaning there; however, it is far more efficient to do part of the data cleaning inside
the terminal, using the sed command.
sed allows us to chain multiple expressions in a single invocation, so we will combine
the following sed commands:
This sed command will match the rows carrying the timestamp of each tweet and keep
only the date.
sed -i '' 's/.*\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\).*/\1-\2-\3/g' tweets1.txt
This sed command will match the rows containing the link to the user’s profile and
keep only the actual Twitter handle, removing “http://twitter.com/” and also the U at
the beginning of the line.
sed -i '' 's/\(U.http:\/\/twitter.com\/\)\(.*\)/\2/g' tweets1.txt
This sed command will match the tweets that include one or more mentions and
remove the words that are not mentions, which makes it much faster to later iterate
through each word in a tweet.
sed -i '' 's/W.[^@]*\(@[^ :,.]*\)*/\1 /g' tweets1.txt
This sed command will delete the “--” separator lines that grep -A inserts between
matches.
sed -i '' '/--/d' tweets1.txt
Next, we combine the previous commands in one sed invocation and execute it for each
of the 5 files.
§ the -i argument indicates that we want to overwrite the file with the results
§ the empty quotes after -i specify an empty backup suffix, so no backup copy is
kept (it’s needed on macOS/BSD sed; GNU sed on Linux does not take this argument)
time sed -i '' 's/.*\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\).*/\1-\2-\3/g; s/\(U.http:\/\/twitter.com\/\)\(.*\)/\2/g; s/W.[^@]*\(@[^ :,.]*\)*/\1 /g; /--/d' tweets1.txt
sed -i '' 's/.*\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\).*/\1-\2-\3/g; s/\(U.http:\/\/twitter.com\/\)\(.*\)/\2/g; s/W.[^@]*\(@[^ :,.]*\)*/\1 /g; /--/d' tweets2.txt
sed -i '' 's/.*\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\).*/\1-\2-\3/g; s/\(U.http:\/\/twitter.com\/\)\(.*\)/\2/g; s/W.[^@]*\(@[^ :,.]*\)*/\1 /g; /--/d' tweets3.txt
sed -i '' 's/.*\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\).*/\1-\2-\3/g; s/\(U.http:\/\/twitter.com\/\)\(.*\)/\2/g; s/W.[^@]*\(@[^ :,.]*\)*/\1 /g; /--/d' tweets4.txt
sed -i '' 's/.*\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\).*/\1-\2-\3/g; s/\(U.http:\/\/twitter.com\/\)\(.*\)/\2/g; s/W.[^@]*\(@[^ :,.]*\)*/\1 /g; /--/d' tweets5.txt
The resulting files have the following format:
For tweets that did not include a mention, the content has been cleared.
For tweets with mentions, the content before and between each mention has been
cleared. Each record takes 3 rows (date, user, tweet).
2009-07-04
dailynascar
2009-07-04
dcompanyau
@JezebellXOXO Please Come and See my Lasted pics http://short.to/h0r7
2009-07-04
donnamurrutia
Step 3: Put the data into tabular format
Next, we are ready to load the data into Python, put it in tabular format and
generate the needed CSVs.
To do that, we follow the process described below
(the process is described in detail inside the .ipynb file):
1. We read the file line-by-line.
2. We put the data into tabular format, iterating through every 3 lines and
placing the content in the appropriate column (i.e. date, from, to).
3. We create a function that looks for every word that starts with @ (a mention) and
splits multiple mentions into different rows, keeping the date and the user who
made the mention the same across mentions from the same tweet.
4. We group the data by “from”–“to” pairs and take the size() of each group to find
the frequency (weight) of mentions for each pair.
5. We export the resulting data frame as a CSV.
6. We run this process for every file and end up with 5 CSV files, one for each day
(a rough sketch of this logic follows the list).
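The actual implementation is Python code inside the .ipynb; as a minimal, purely
illustrative sketch of the same logic in R (the file name, handle pattern and object
names are assumptions, not the report’s code):

# Illustrative sketch (not the notebook's code): read 3-line records,
# explode @mentions into one edge per row, then aggregate into weights
library(dplyr)
library(stringr)
lines <- readLines("tweets1.txt", encoding = "UTF-8")
records <- data.frame(
  date  = lines[seq(1, length(lines), by = 3)],
  from  = lines[seq(2, length(lines), by = 3)],
  tweet = lines[seq(3, length(lines), by = 3)]
)
edges <- records %>%
  mutate(to = str_extract_all(tweet, "@[A-Za-z0-9_]+")) %>%  # simplified handle pattern
  tidyr::unnest(cols = to) %>%   # one row per mention; no-mention tweets drop out
  mutate(to = sub("^@", "", to)) %>%
  count(from, to, name = "weight")
write.csv(edges, "tweets1.csv", row.names = FALSE)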
The result is 5 CSV files with the following format.
"from","to","weight"
"suddenlyjamie","dmscott",1
"aruanpc","danilogentili",1
"gloriahansen","janedavila",2
"uluvsheena","PreciousSoHot",1
"adreamon","jlovely69",1
"cin7415","sdriven1",1
Finally, after having our CSV files ready, we load them into R and create the directed
igraph objects.
# Reading the CSV files we have created
tweets1 = read.csv(file="tweets1.csv", header=T, sep=",", fileEncoding = "utf-8")
tweets2 = read.csv(file="tweets2.csv", header=T, sep=",", fileEncoding = "utf-8")
tweets3 = read.csv(file="tweets3.csv", header=T, sep=",", fileEncoding = "utf-8")
tweets4 = read.csv(file="tweets4.csv", header=T, sep=",", fileEncoding = "utf-8")
tweets5 = read.csv(file="tweets5.csv", header=T, sep=",", fileEncoding = "utf-8")
# Checking the structure of the data
str(tweets1)
DUPLICATE NODES
Taking a look at the dataset, we observe that some records are duplicated because of
case sensitivity (e.g. 'OfficialTila' and 'officialtila' are treated as separate users,
although they are the same account). On Twitter, usernames are case-insensitive.
To correct the problem, we will convert all characters to lowercase and then merge the
resulting duplicate records. We need to be careful to sum up the weights of each
record, so that they remain correct after merging the different cases.
# To correct this problem, first we will make all characters lowercase
tweets1[,1:2] <- sapply(tweets1[,1:2], tolower)
tweets2[,1:2] <- sapply(tweets2[,1:2], tolower)
tweets3[,1:2] <- sapply(tweets3[,1:2], tolower)
tweets4[,1:2] <- sapply(tweets4[,1:2], tolower)
tweets5[,1:2] <- sapply(tweets5[,1:2], tolower)
# Then, we will use ddply to merge duplicate rows, but also sum their weights, so that we don't lose any values
library(plyr)
tweets1<- ddply(tweets1,~from + to,summarise,weight=sum(weight))
tweets2<- ddply(tweets2,~from + to,summarise,weight=sum(weight))
tweets3<- ddply(tweets3,~from + to,summarise,weight=sum(weight))
tweets4<- ddply(tweets4,~from + to,summarise,weight=sum(weight))
tweets5<- ddply(tweets5,~from + to,summarise,weight=sum(weight))
# CREATING THE iGRAPH OBJECTS (READ FROM DATA FRAME)
g1 <- graph_from_data_frame(tweets1, directed = TRUE, vertices = NULL)
g2 <- graph_from_data_frame(tweets2, directed = TRUE, vertices = NULL)
g3 <- graph_from_data_frame(tweets3, directed = TRUE, vertices = NULL)
g4 <- graph_from_data_frame(tweets4, directed = TRUE, vertices = NULL)
g5 <- graph_from_data_frame(tweets5, directed = TRUE, vertices = NULL)
TASK 2: Average Degree over time
Our next task is to create plots that visualize the 5-day evolution of some important
metrics of the network, such as the number of vertices, the number of edges, the
diameter and the average degrees.
To do that, we create a loop that takes each network, computes the needed metrics
and writes them into a data frame. The resulting table is the following:
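A minimal sketch of the loop that produces this table, assuming the igraph objects
g1–g5 from Task 1 (the column names are illustrative, not the report’s exact code;
note that diameter() uses the weight edge attribute when one is present):

# Sketch of the metrics loop (illustrative)
graphs <- list(g1, g2, g3, g4, g5)
metrics <- data.frame()
for (i in seq_along(graphs)) {
  g <- graphs[[i]]
  metrics <- rbind(metrics, data.frame(
    day           = i,
    vertices      = vcount(g),
    edges         = ecount(g),
    diameter      = diameter(g),                    # uses edge weights if present
    avg_in_deg    = mean(degree(g, mode = "in")),
    avg_out_deg   = mean(degree(g, mode = "out")),
    avg_w_in_deg  = mean(strength(g, mode = "in")),   # weighted degree
    avg_w_out_deg = mean(strength(g, mode = "out"))
  ))
}
metrics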
To get a better representation, we can also plot the 7 metrics using ggplot.
(In any directed graph, the sum of in-degrees equals the sum of out-degrees, since
every edge contributes one to each, so we plot the average in-degree and out-degree
on the same plot. Moreover, we calculate the average weighted in/out-degree.)
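One plausible way to draw these plots, assuming the metrics data frame from the loop
above (the reshaping and ggplot calls are illustrative, not the report’s exact code):

# Sketch: reshape to long format and facet one panel per metric
library(ggplot2)
library(tidyr)
metrics_long <- pivot_longer(metrics, -day, names_to = "metric", values_to = "value")
ggplot(metrics_long, aes(x = day, y = value)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ metric, scales = "free_y")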
There are a few things we can observe in the above graphs:
§ The number of users involved in mentioning others or being mentioned by
others is at its peak on Wednesday 01/07/2009 and steadily decreases every
day until Sunday 05/07/2009.
§ As expected, the number of mentions also decreases, following roughly the
same curve.
§ The diameter of the network starts at 71 on Wed 01/07, increases steadily to
75 by Friday, then drops to its lowest point on Saturday. On Sunday, however,
it reaches its highest point (85). What is fascinating about these numbers is
that, even though we only take direct mentions into account, the diameter is
still low for a global social network. Of course, this might have something to
do with the data being from 2009, when Twitter was known and widely used
mainly in the USA.
§ The average in-degree and the average out-degree move in the opposite
direction from the vertex and edge counts, starting just below 1.19 on
Wednesday, increasing to over 1.45 on Friday and decreasing to 1.23 on
Sunday. This means that the average user mentions about one person and is
also mentioned about once.
§ If we take the weights of the network into account, we can calculate the
weighted average degree, which gives results very similar to the unweighted
degree plot, just a little higher.
TASK 3: Important nodes
Next, we will identify the important nodes of the network per day. We will select the
nodes that rank highest for each day in 3 key metrics:
§ In-degree
§ Out-degree
§ Page-Rank
In-Degree: Shows us which users are mentioned the most for each day
The process we follow is to calculate the Top 10 for each day and then bind them into
a single data frame:
intop1 <- head(sort(degree(g1, mode="in"), decreasing=TRUE), 10)
intop2 <- head(sort(degree(g2, mode="in"), decreasing=TRUE), 10)
intop3 <- head(sort(degree(g3, mode="in"), decreasing=TRUE), 10)
intop4 <- head(sort(degree(g4, mode="in"), decreasing=TRUE), 10)
intop5 <- head(sort(degree(g5, mode="in"), decreasing=TRUE), 10)
intop <- data.frame(cbind(names(intop1), names(intop2), names(intop3),
                          names(intop4), names(intop5)))
colnames(intop) <- c("01-07-2009", "02-07-2009", "03-07-2009",
                     "04-07-2009", "05-07-2009")
intop
In-degree Top 10 per day
As we can see, the users who are mentioned the most are pretty much the same every
day: accounts that post memes or news, such as tweetmeme, mashable, addthis,
cnn, cnnbrk, breakingnews, and celebrities, such as mileycyrus, ddlovato, adamlambert,
souljaboytellem, officialtila. There are some exceptions, which might reflect something
significant or newsworthy that happened on that day and caused an account to receive
more mentions.
Since we have a weighted network, it might make more sense to calculate the Top 10
using the weighted in-degree, as sketched below.
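A minimal sketch for day 1, using igraph’s strength() as the weighted degree (the
variable name is illustrative; the same call is repeated for g2–g5):

# Weighted in-degree = sum of incoming mention weights
wintop1 <- head(sort(strength(g1, mode = "in"), decreasing = TRUE), 10)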
Weighted in-degree Top 10 per day
If we do that, we can see some variations in the positions of each user inside the Top
10, but the accounts are mostly the same as the previous results.
Out-Degree: Shows us which users mention other users the most for each day
We follow the same process to calculate the Top 10 using the out-degree, as sketched
below.
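A minimal sketch for day 1 (illustrative variable names; repeated for g2–g5):

# Out-degree and weighted out-degree Top 10
outtop1  <- head(sort(degree(g1, mode = "out"), decreasing = TRUE), 10)
wouttop1 <- head(sort(strength(g1, mode = "out"), decreasing = TRUE), 10)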
Out-degree Top 10 per day & Weighted out-degree Top 10 per day
On the contrary, when we identify the Top 10 users for each day using the out-degree
and weighted out-degree, we see a lot of variation, with only a few recurring users.
The explanation is that it takes a certain level of popularity to receive a lot of
mentions (in-degree), whereas any user, on any day, can mention as many users as
they want, as long as they don’t violate a limit imposed by Twitter.
Page-Rank: Shows us the users who accumulated the most PageRank value every day.
It takes into account whether a user was mentioned by many users who were in turn
mentioned by other users (e.g. influencers).
# PAGERANK
pgrnk1 <- page_rank(g1, algo="prpack" , directed=FALSE)$vector
pgrnk2 <- page_rank(g2, algo="prpack" , directed=FALSE)$vector
pgrnk3 <- page_rank(g3, algo="prpack" , directed=FALSE)$vector
pgrnk4 <- page_rank(g4, algo="prpack" , directed=FALSE)$vector
pgrnk5 <- page_rank(g5, algo="prpack" , directed=FALSE)$vector
# Top 10 users per day, ranked in descending order by their PageRank value
ranked1 <- head(sort(pgrnk1, decreasing=TRUE), 10)
ranked2 <- head(sort(pgrnk2, decreasing=TRUE), 10)
ranked3 <- head(sort(pgrnk3, decreasing=TRUE), 10)
ranked4 <- head(sort(pgrnk4, decreasing=TRUE), 10)
ranked5 <- head(sort(pgrnk5, decreasing=TRUE), 10)
ranked <- data.frame(cbind(names(ranked1), names(ranked2), names(ranked3),
                           names(ranked4), names(ranked5)))
colnames(ranked) <- c("01-07-2009", "02-07-2009", "03-07-2009",
                      "04-07-2009", "05-07-2009")
ranked
Top 10 users per day, based on Page-Rank
As far as PageRank is concerned, we can see that most of the Top 10 users are the
same as the Top 10 users based on in-degree. This makes sense: these users (e.g.
celebrities) receive a lot of mentions and retweets from other users, while only
mentioning a few other users themselves. In this way, they concentrate a lot of
PageRank value.
TASK 4: Communities
Our final task is to identify different communities, by applying fast greedy clustering,
infomap clustering, and Louvain clustering to the undirected versions of the 5 mention
graphs.
# Making the graphs undirected
ug1 <- as.undirected(g1)
ug2 <- as.undirected(g2)
ug3 <- as.undirected(g3)
ug4 <- as.undirected(g4)
ug5 <- as.undirected(g5)
# Finding communities with fast greedy clustering
communities_fast_greedy1 <- cluster_fast_greedy(ug1)
# Finding communities with infomap clustering
communities_infomap1 <- cluster_infomap(ug1)
# Finding communities with louvain clustering
communities_louvain1 <- cluster_louvain(ug1)
communities_louvain2 <- cluster_louvain(ug2)
communities_louvain3 <- cluster_louvain(ug3)
communities_louvain4 <- cluster_louvain(ug4)
communities_louvain5 <- cluster_louvain(ug5)
However, as it turns out, fast greedy clustering takes too long to execute (I got results
after about 45-50 minutes), and infomap clustering takes even longer. The only
method able to produce results in a matter of seconds is the Louvain community
detection algorithm: although it is based on greedy modularity optimization, it
includes an additional aggregation step that lets it handle very large networks.
compare(communities_fast_greedy1, communities_infomap1)
compare(communities_fast_greedy1, communities_louvain1)
compare(communities_infomap1, communities_louvain1)
Comparing different clustering methods
We can compare the resulting community structures using the compare() function,
which by default returns the variation-of-information (VI) distance (lower values mean
more similar partitions). It seems that Louvain is closest to the fast greedy method.
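Other similarity measures are available through the method argument; a small sketch:

# VI is a distance (lower = more similar); NMI is a similarity (1 = identical partitions)
compare(communities_fast_greedy1, communities_louvain1, method = "vi")
compare(communities_fast_greedy1, communities_louvain1, method = "nmi")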
EVOLUTION OF COMMUNITY MEMBERSHIP
Then, using the Louvain method, we will try to trace the evolution of the communities
the user “KimKardashian” belongs to.
To do that, we first identify the community Kim Kardashian belongs to in each
graph (i.e. each day) and then find the intersections of these communities.
# Detecting the evolution of communities to which user "KimKardashian" belongs
c1<-communities_louvain1[membership(communities_louvain1)["kimkardashian"]]
c2<-communities_louvain2[membership(communities_louvain2)["kimkardashian"]]
c3<-communities_louvain3[membership(communities_louvain3)["kimkardashian"]]
c4<-communities_louvain4[membership(communities_louvain4)["kimkardashian"]]
c5<-communities_louvain5[membership(communities_louvain5)["kimkardashian"]]
# Finding common members between the daily communities
intersect(c1$`54008`, c2$`41188`)
intersect(c1$`54008`, c3$`22013`)
intersect(c1$`54008`, c4$`8036`)
intersect(c1$`54008`, c5$`21162`)
intersect(c2$`41188`, c3$`22013`)
intersect(c2$`41188`, c4$`8036`)
intersect(c2$`41188`, c5$`21162`)
intersect(c3$`22013`, c4$`8036`)
intersect(c3$`22013`, c5$`21162`)
As we can see from the results, the communities that are most similar (in terms of
common members) are the community of Day 3 and the community of Day 5, along
with the community from Day 2.
On the other hand, the community from Day 4 is very small and has no members in
common with the other communities.
VISUALIZING THE COMMUNITIES
In order to visualize the communities (we will use the 1st day’s graph as an example),
first we need to:
§ set a color to the different communities (represented as levels of a factor)
§ check the sizes of each community to select our filtering parameters
§ filter to keep only some mid-sized communities
§ induce a subgraph using this filter to keep only the nodes that belong in these
communities
§ plot the subgraph and adjust the parameters to get a good visual result
# Setting colors for the different communities
V(g1)$color <- factor(membership(communities_louvain1))
#Get the sizes of each community of Graph1 (g1)
community_size <- sizes(communities_louvain1)
head(sort(community_size, decreasing=TRUE), 20)
head(sort(community_size, decreasing=FALSE), 20)
mean(community_size)
length(community_size)
# Keep only some mid-size communities with more than 50 and less than 90 members
in_mid_community1 <- unlist(communities_louvain1[community_size > 50 & community_size < 90])
# Induce a subgraph of graph 1 using in_mid_community
sub_g1 <- induced.subgraph(g1, in_mid_community1)
# Plot those mid-size communities
plot(sub_g1, vertex.label = NA, edge.arrow.width = 0.8,
edge.arrow.size = 0.2,
coords = layout_with_fr(sub_g1), margin = 0, vertex.size = 3)
Visualization of some mid-sized communities for each day (1 to 5). Each community is depicted in a
different color.