Monitoring and Analysis of Online Communities

Harith Alani
Knowledge Media Institute,
The Open University, UK

http://twitter.com/halani
http://delicious.com/halani
http://www.linkedin.com/pub/harith-alani/9/739/534

Web Science Summer School
Galway, 2011
1

Market value of Web Analytics

2

Agenda
•  Community monitoring

•  Offline and online social networking

•  Modeling and tracking behaviour

•  Analysing community features

•  Predicting discussion activity

3

Online community monitoring
•  Analysing and understanding activities and dynamics
•  Studying impact of social and technical features
•  Forecast future growth and evolution
•  Tracking behaviour and influence
•  Tracking reputation and buzz
•  Listening to customer opinion
•  Profiling the user base
•  Gauging customer sentiment

4

Measuring social media

Deloitte, Beeline Labs, & Society for New Communication Research surveyed 140 companies
with online communities, 2008
5


Deloitte, Beeline Labs, & Society for New Communication Research surveyed 140 companies
with online communities, 2008
6


“B2B Marketing Goes Social: A White Horse Survey Report” – March 2010 – study of 104 companies
7


“Social media usage, attitudes and measurability: What do marketers think?” – KingFishMedia, 2010
8

Tools for monitoring social media

9

•  Analytics:
–  Mention volume
–  Sentiment
–  Discussion clouds
–  Activity graphs and
metrics
–  Language and
geolocation filtering
–  Filter by social
platform
–  Comparisons

10
http://www.ubervu.com/

•  Analytics:
–  Influencing users
–  Sentiment and opinion analysis
–  Viral content analysis
–  Detecting sales leads
–  Filter by geo-location

11
http://www.viralheat.com/home

Monitoring and Analysis of
Online Communities
With a Web Science flavour

12

Online vs. Offline social
networking

13

Online vs. offline social networking: The Bad News!

•  Digital social networking
increases physical social
isolation
•  Causes
–  Genetic alterations
–  Weakened immune system
–  Less resistant to cancer
–  Higher risk of heart disease
–  Higher blood pressure
–  Faster dementia
–  Narrower arteries

Aric Sigman, “Well Connected? The Biological Implications of ‘Social Networking’”, Biologist, 56 (1), 2009
14

Online vs. offline social networking: The Good News!

•  Digital networking increases social interaction
–  Transforms little boxed societies to networked and networking societies
–  Creates more opportunities to network
–  Offers new methods to communicate easily and widely
–  Supports and increases F2F contact!
–  The stronger the offline social tie, the more intense the online communication
–  The stronger the offline social tie, the more diverse the online communications
–  F2F is the medium of choice in weaker social ties

Keith Hampton and Barry Wellman, Long Distance Community in the Network Society: Contact and
Support Beyond Netville, American Behavioral Scientist 45 (3), November, 2001.

Barry Wellman, The Glocal Village: Internet and Community, Idea’s - The Arts & Science Review, University of Toronto, 1(1), 2004
15

Physical online & digital offline

16

Sensor & Social Networks

17

Sensor & Social Networks
www.nabaztag.com

The Canine Twitterer

“Having my daily workout.
Already did 15 leg lifts!”

18

Location Sensors & Social Networking

Tag-Along Marketing
The New York Times,
November 6, 2010

“Everything is in place for location-based social
networking to be the next big thing. Tech
companies are building the platforms, venture
capitalists are providing the cash and marketers
are eager to develop advertising.”

19

Monitoring online/offline social activity
Where is everybody?

20


•  Generating
opportunities for
F2F networking

21


“There are more than 250 million active users
currently accessing Facebook through their mobile
devices“

“People that use Facebook on their mobile devices
are twice as active on Facebook than non-mobile
users”
http://www.facebook.com/press/info.php?statistics
22

Tracking of F2F contact networks
Sociometer, MIT, 2002
-  F2F and productivity
-  F2F dynamics
-  Who are key players?
-  F2F and office distance

TraceEncounters - 2004

23

SocioPatterns platform

http://www.sociopatterns.org/
24

Offline social networks

From a small conference
at ISI, Turin

by Ciro Cattuto
25

Offline social networks

•  Similarity features
–  Country of origin
–  Seniority
–  .. Age? Role? Projects? Interests?
•  What other info can we get to help us understand these network dynamics?

[Figure: F2F contact network with node groups labelled students, JR and SR]
26

Offline + online social networking
“Who should I talk to?”
“Anyone I know here?”
“Where have I met this guy?”
“Where should I go?”

ESWC2010
27

Live Social Semantics (LSS):
RFIDs + Social Web + Semantic Web
•  Integration of physical presence and online information
•  Semantic user profile generation
•  Logging of face-to-face contact
•  Social network browsing
•  Analysis of online vs offline social networks

Excerpt of the tagging ontology shown on the slide (cut off on the slide):

<?xml version="1.0"?>
<rdf:RDF
xmlns="http://tagora.ecs.soton.ac.uk/schemas/tagging#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xml:base="http://tagora.ecs.soton.ac.uk/schemas/tagging">
<owl:Ontology rdf:about=""/>
<owl:Class rdf:ID="Post"/>
<owl:Class rdf:ID="TagInfo"/>
<owl:Class rdf:ID="GlobalCooccurrenceInfo"/>
<owl:Class rdf:ID="DomainCooccurrenceInfo"/>
<owl:Class rdf:ID="UserTag"/>
<owl:Class rdf:ID="UserCooccurrenceInfo"/>
<owl:Class rdf:ID="Resource"/>
<owl:Class rdf:ID="GlobalTag"/>
<owl:Class rdf:ID="Tagger"/>
<owl:Class rdf:ID="DomainTag"/>
<owl:ObjectProperty rdf:ID="hasPostTag">
<rdfs:domain rdf:resource="#TagInfo"/>
</owl:ObjectProperty>
<owl:ObjectProperty rdf:ID="hasDomainTag">
<rdfs:domain rdf:resource="#UserTag"/>
<owl:ObjectProperty rdf:ID="isFilteredTo">
<rdfs:range rdf:resource="#GlobalTag"/>
<rdfs:domain rdf:resource="#GlobalTag"/>
<owl:ObjectProperty rdf:ID="hasResource">
<rdfs:domain rdf:resource="#Post"/>
<rdfs:range =…

SW sources

[Diagram labels: conference, proceedings, chair, author, CoP]

29

Social and information networks

30

Merging social networks

FOAF
31

Tag Filtering Service

Semantic modeling
Semantic analysis
Collective intelligence
Statistical analysis
Syntactical analysis
32

Tag Filtering Service

33

From Tags to Semantics

34

Tags to User Interests

35

From raw tags and social relations
to Structured Data

[Diagram labels: User raw data, Collective intelligence, Structured data, Ontologies, Semantic data]

36

RFIDs for tracking social contact

37

Convergence with online social networks

38

People contact → RFID → RDF Triples

[Diagram: an F2FContact node linked to foaf#Person1 and foaf#Person2 (hasContact, contactWith), to a Place (contactPlace), to an XMLSchema#date (contactDate) and to an XMLSchema#time (contactDuration)]
39

Real-time F2F networks with SNS links

42
http://www.vimeo.com/6590604

Live Social Semantics
Deployed at:

Data analysis
•  Face-to-face interactions across scientific conferences
•  Networking behaviour of frequent users
•  Correlations between scientific seniority and social networking
•  Comparison of F2F contact network with Twitter and Facebook
•  Social networking with online and offline friends
43

Analysis of LSS Results

The New Yorker 2/11/2008

44

Characteristics of F2F contact network
Network characteristics   ESWC 2009   HT 2009   ESWC 2010
Number of users           175         113       158
Average degree            54          39        55
Avg. strength (min)       143         123       130
Avg. weight (min)         2.65        3.15      2.35
Weights ≤ 1 min           70%         67%       74%
Weights ≤ 5 min           90%         89%       93%
Weights ≤ 10 min          95%         94%       96%

•  Degree is the number of people with whom the person had at least one F2F contact
•  Strength is the total time a person spent in F2F contact
•  Edge weight is the total time spent by a pair of users in F2F contact
45
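The three definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the SocioPatterns or LSS code: the `(person_a, person_b, duration_seconds)` event format and the names in the sample data are assumptions made for the example.

```python
from collections import defaultdict

def network_stats(contacts):
    """contacts: iterable of (person_a, person_b, duration_seconds) events
    (hypothetical input format)."""
    # Edge weight: total time a pair spent in F2F contact.
    weight = defaultdict(float)
    for a, b, seconds in contacts:
        weight[frozenset((a, b))] += seconds

    degree = defaultdict(int)      # number of distinct people met
    strength = defaultdict(float)  # total time spent in F2F contact
    for pair, w in weight.items():
        for person in pair:
            degree[person] += 1
            strength[person] += w
    return dict(degree), dict(strength), dict(weight)

# Made-up contact events for three attendees.
contacts = [("ann", "bob", 40), ("bob", "ann", 20), ("ann", "cat", 60)]
degree, strength, weight = network_stats(contacts)
```

With these definitions a person's strength is the sum of the weights of their edges, which is consistent with the table: the average strength is roughly the average degree times the average weight (54 × 2.65 ≈ 143 min for ESWC 2009).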

Characteristics of F2F contact events
Contact characteristics      ESWC 2009   HT 2009   ESWC 2010
Number of contact events     16258       9875      14671
Average contact length (s)   46          42        42
Contacts ≤ 1 min             87%         89%       88%
Contacts ≤ 2 min             94%         96%       95%
Contacts ≤ 5 min             99%         99%       99%
Contacts ≤ 10 min            99.8%       99.8%     99.8%

F2F contact pattern is very similar for all three conferences

F2F contacts of returning users
•  Degree: number of other participants with whom an attendee has interacted
•  Total time: total time spent in interaction by an attendee
•  Link weight: total time spent in F2F interaction by a pair of returning attendees in 2010, versus the same quantity measured in 2009

[Figure: three log-log scatter plots (Degree, Total interaction time, Links’ weights) comparing ESWC2009 with ESWC2010 for returning attendees]

ESWC 2009 & ESWC 2010        Pearson Correlation
Degree                       0.37
Total F2F interaction time   0.76
Link weight                  0.75

Time spent on F2F networking by frequent users is stable, even when the list of people they networked with changed
47
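For reference, a Pearson correlation between a measure in 2009 and the same measure in 2010 can be computed as below. This is a plain textbook implementation; the two lists of interaction times are made-up values, not the LSS data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical total F2F interaction time (s) of four returning attendees.
time_2009 = [1200, 5400, 800, 9600]
time_2010 = [1500, 4800, 1100, 8700]
r = pearson(time_2009, time_2010)
```

A value close to 1 means attendees who networked a lot in one year also did so in the other.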

Average seniority of neighbours in F2F networks

•  No clear pattern is observed if the unweighted average over all neighbours in the aggregated network is considered
•  A correlation is observed when each neighbour is weighted by the time spent with the main person
•  The correlation becomes much stronger when considering for each individual only the neighbour with whom the most time was spent

[Figure: average seniority of neighbours (y-axis) against seniority in number of papers (x-axis). Series: sen_nn, average seniority of the neighbours; sen_nn,w, weighted average; sen_nn,max, seniority of the user with the strongest link]

Conference attendees tend to network with others of similar levels of scientific seniority
48

Presence of Attendees HT2009

Importance of the bar?
Popularity of sessions? particular talks?

Number of cliques HT2009

Offline networking vs online networking
Twitterers                       Spearman Correlation (ρ)
Tweets – F2F Degree              -0.15
Tweets – F2F Strength            -0.15
Twitter Following – F2F Degree   -0.21

Users with Facebook and Twitter accounts in ESWC 2010

•  People who have a large number of friends on Twitter and/or Facebook don’t seem to be the most socially active in the offline world in comparison to other SNS users

No strong correlation between amount of F2F contact activity and size of online social networks
51
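Spearman's ρ is the Pearson correlation of the ranks of the two variables. A small self-contained sketch, with ties given their average rank, is shown below; the sample tweet counts and F2F degrees are invented.

```python
def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical tweet counts and F2F degrees for five attendees.
tweets = [500, 20, 3000, 150, 60]
f2f_degree = [40, 55, 35, 60, 50]
rho = spearman(tweets, f2f_degree)
```

A rank-based measure is a natural choice here because tweet and follower counts are heavily skewed.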

Scientific seniority vs Twitter followers
Twitter users                 Correlation
H-index – Twitter Followers   0.32
H-index – Tweets              -0.13

[Figure: Twitter Followers, H-index and Tweets plotted per user]

•  Comparison between people’s scientific seniority and the number of people following them on Twitter

People who have the highest number of Twitter followers are not necessarily the most scientifically senior, although they do have high visibility and experience
52

Conference Chairs
Characteristics                     all participants 2009   chairs 2009   all participants 2010   chairs 2010
average degree                      55                      77.7          54                      77.6
average strength (s)                8590                    19590         7807                    22520
average weight (s)                  159                     500           141                     674
average number of events per edge   3.44                    8             3.37                    12

•  Conf chairs interact with more distinct people (larger average degree)

•  Conf chairs spend more time in F2F interaction (almost three times as much
as a random participant)

Networking with online and offline ‘friends’
Characteristics                     all users   coauthors   Facebook friends   Twitter followers
average contact duration (s)        42          75          63                 72
average edge weight (s)             141         4470        830                1010
average number of events per edge   3.37        60          13                 14

•  Individuals sharing an online or professional social link meet much more often than other individuals
•  Average number of encounters, and total time spent in interaction, is highest for co-authors

F2F contacts with Facebook and Twitter friends were respectively 50% and 71% longer, and 286% and 315% more frequent, than with others

F2F contacts with co-authors were 79% longer, and attendees met their co-authors 1680% more often than they met non co-authors
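The percentages quoted above follow from the table rows for average contact duration and average number of events per edge, taking the "all users" column as the baseline. A short check (the helper name `pct_more` is just for this example):

```python
def pct_more(value, baseline):
    """Relative increase of value over baseline, as a rounded percentage."""
    return round(100 * (value - baseline) / baseline)

# Average contact duration (s): all users 42, Facebook 63, Twitter 72, co-authors 75.
longer_facebook = pct_more(63, 42)
longer_twitter = pct_more(72, 42)
longer_coauthors = pct_more(75, 42)

# Average number of events per edge: all users 3.37, Facebook 13, Twitter 14, co-authors 60.
more_often_facebook = pct_more(13, 3.37)
more_often_twitter = pct_more(14, 3.37)
more_often_coauthors = pct_more(60, 3.37)
```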

Twitterers vs Non-Twitterers

•  Time spent in conference rooms
–  Twitter users spent on average 11.4% more time in the
conf rooms than non-twitter users (mean is 26% higher)

•  Number of people met F2F during the conference
–  Twitter users met on average 9% more people F2F
(mean 8% higher)

•  Duration of F2F contacts
–  Twitter users spent on average 63% more time in F2F
contact than non twitter users (mean is 20% higher)

55

Analysis of behaviour in online
communities

56

Behaviour of individuals – micro level analysis
[Figure: H-index, F2F Degree and F2F Strength plotted per user, with annotated groups of users]
57

Why monitor behaviour?
•  Understand impact of behaviour on community evolution
•  Forecast community future
•  Learn when intervention might be needed
•  Learn which behaviour should be encouraged or
discouraged
•  Find what could trigger certain behaviours
•  What is the best mix of behaviour to increase
engagement in the community
•  To see which users need more support, which ones
should be confined, and which ones should be promoted

58

Behaviour analysis

Jeffrey Chan, Conor Hayes, and Elizabeth Daly. Decomposing discussion forums using
common user roles. In Proc. Web Science Conf. (WebSci10), Raleigh, NC: US, 2010

•  Behaviour compositions in Boards.ie:

Encoding Rules in Ontologies with SPIN

Approach for inferring User Roles
•  Features: structural, social network, reciprocity, persistence, participation
•  Feature levels change with the dynamics of the community
•  Associate Roles with a collection of feature-to-level Mappings, e.g. in-degree -> high, out-degree -> high
•  Run our rules over each user’s features and derive the role composition
62
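The rule-based role inference can be illustrated with a small Python stand-in. The slides encode the rules with SPIN in an ontology; the code below is not that implementation. The role names, the cut-off values and the feature names are all hypothetical, and the cut-offs would be recomputed for each time window since feature levels change with the community's dynamics.

```python
from collections import Counter

# Hypothetical roles, each defined by a feature-to-level mapping.
ROLE_RULES = {
    "popular_participant": {"in_degree": "high", "out_degree": "high"},
    "ignored": {"in_degree": "low"},
}

def to_level(value, low_cut, high_cut):
    """Map a numeric feature value to a level using per-window cut-offs."""
    if value < low_cut:
        return "low"
    if value > high_cut:
        return "high"
    return "mid"

def infer_roles(features, cuts):
    """Return every role whose mapping is satisfied by the user's levels."""
    levels = {name: to_level(value, *cuts[name]) for name, value in features.items()}
    return [role for role, rule in ROLE_RULES.items()
            if all(levels.get(name) == level for name, level in rule.items())]

def role_composition(users, cuts):
    """Fraction of users associated with each role."""
    counts = Counter(role for feats in users.values()
                     for role in infer_roles(feats, cuts))
    return {role: n / len(users) for role, n in counts.items()}

# Made-up cut-offs (low, high) and two users.
cuts = {"in_degree": (0.1, 0.5), "out_degree": (0.1, 0.5)}
users = {
    "u1": {"in_degree": 0.7, "out_degree": 0.6},
    "u2": {"in_degree": 0.05, "out_degree": 0.3},
}
composition = role_composition(users, cuts)
```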

Data from Boards.ie
•  Forum 246 (Commuting and Transport): Demonstrates a clear increase in
activity over time.
•  Forum 388 (Rugby): Exhibits periodic increase and decrease in activity and
hence it provides good examples of healthy/unhealthy evolutions.
•  Forum 411 (Mobile Phones and PDAs): Increase in activity over time with
some fluctuation - i.e. reduction and increase over various time windows.
•  Data covers the period from 2004-01 to 2006-12

Features

•  In-degree Ratio: The proportion of users U that reply to user υi, thus
indicating the concentration of users that reply to υi
•  Posts Replied Ratio: Proportion of posts by user υi that yield a reply, used
to gauge the popularity of the user’s content based on replies
•  Thread Initiation Ratio: Proportion of threads that have been started by υi.
•  Bi-directional Threads Ratio: Proportion of threads where user υi replies to
a user and receives a reply, thus forming a reciprocal communication
•  Bi-directional Neighbours Ratio: The proportion of neighbours where a
reciprocal interaction has taken place - e.g. υi replied to υj and υj replied to υi.
•  Average Posts per Thread: The average number of posts made in every
thread that user υi has participated in
•  Standard Deviation of Posts per Thread: The standard deviation of the
number of posts in every thread that user υi has participated in. This gauges
the distribution of the discussion lengths.
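Several of these ratios can be derived directly from a list of posts. The sketch below assumes each post is a dict with `id`, `author`, `thread` and `reply_to` keys, and that a post with `reply_to` set to `None` starts its thread; this data layout and the choice of denominators (all users, all threads) are assumptions, since the slide does not fix them.

```python
def user_features(posts, user):
    """Compute a subset of the behaviour features for one user."""
    by_id = {p["id"]: p for p in posts}
    users = {p["author"] for p in posts}
    mine = [p for p in posts if p["author"] == user]

    replied_to_me = set()     # users who replied to `user`
    i_replied_to = set()      # users `user` replied to
    posts_with_reply = set()  # posts by `user` that received a reply
    for p in posts:
        parent = by_id.get(p["reply_to"])
        if parent is None:
            continue
        if parent["author"] == user and p["author"] != user:
            replied_to_me.add(p["author"])
            posts_with_reply.add(parent["id"])
        if p["author"] == user and parent["author"] != user:
            i_replied_to.add(parent["author"])

    threads = {p["thread"] for p in posts}
    started = {p["thread"] for p in mine if p["reply_to"] is None}
    neighbours = replied_to_me | i_replied_to
    return {
        "in_degree_ratio": len(replied_to_me) / len(users),
        "posts_replied_ratio": len(posts_with_reply) / len(mine) if mine else 0.0,
        "thread_initiation_ratio": len(started) / len(threads),
        "bidirectional_neighbours_ratio":
            len(replied_to_me & i_replied_to) / len(neighbours) if neighbours else 0.0,
    }

# Made-up forum fragment: two threads, three users.
posts = [
    {"id": 1, "author": "a", "thread": "t1", "reply_to": None},
    {"id": 2, "author": "b", "thread": "t1", "reply_to": 1},
    {"id": 3, "author": "a", "thread": "t1", "reply_to": 2},
    {"id": 4, "author": "c", "thread": "t1", "reply_to": 1},
    {"id": 5, "author": "c", "thread": "t2", "reply_to": None},
]
features_a = user_features(posts, "a")
```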

Results
Commuting and Transport Rugby Mobile Phones and PDAs

•  Correlation of individual features in each of the three forums

(a) Forum 246: Commuting and Transport

Results

(b) Forum 388: Rugby
(c) Forum 411: Mobile Phones and PDAs
•  Variation in behaviour
composition & activity
•  Behaviour composition in/stability influences forum activity

Prediction analysis – preliminary results!
•  Predicting rise/fall in post submission numbers
•  Binary classification
•  Features : Community composition, roles and percentages of users
associated with each
Forum   P       R       F1      ROC
246     0.799   0.769   0.780   0.800
388     0.603   0.615   0.605   0.775
411     0.765   0.692   0.714   0.617
All     0.583   0.667   0.607   0.466

•  Cross-community predictions are less reliable than individual
community analysis due to the idiosyncratic behaviour observed in
each individual community
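As a reminder of what the P, R and F1 columns measure in a binary rise/fall task, here is a minimal sketch. The label encoding (1 for a rise in posts, 0 for a fall) and the sample predictions are assumptions; the slide does not say how the reported scores were averaged.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical time windows: 1 = post volume rose, 0 = post volume fell.
actual = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]
p, r, f1 = precision_recall_f1(actual, predicted)
```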

Observations so far
•  Growing communities contain more elitists and popular participants

•  Shrinking communities contain many taciturns and ignored users

•  A stable composition, with a mix of roles, is associated with
increased community activity

•  Different communities may require different behaviour compositions
to increase activity/health

What features make online
communities tick

•  How many do you
recognise? Use?

•  Which ones still exist?

•  Which are strong and
healthy?

•  Which are aging and
withering?

•  What health signs should
we look for?

•  How can we predict their
future evolution?

71

Rise and fall of social networks

72

Predicting engagement

•  Which posts will receive a reply?
–  What are the most influential features here?

•  How much discussion will it generate?
–  What are the key factors of lengthy discussions?

73

Common online community features

Paper excerpt: … user attributes - describing the reputation of the user - and attributes of a post’s content - generally referred to as content features. In Table 1 we define user and content features and study their influence on the discussion “continuation”.

Table 1. User and Content Features (# marks a count)

User Features
In Degree: Number of followers of U #
Out Degree: Number of users U follows #
List Degree: Number of lists U appears on. Lists group users by topic #
Post Count: Total number of posts the user has ever posted #
User Age: Number of minutes from user join date #
Post Rate: Posting frequency of the user $\frac{PostCount}{UserAge}$

Content Features
Post length: Length of the post in characters #
Complexity: Cumulative entropy of the unique words in post p, λ of total word length n and $p_i$ the frequency of each word $\frac{\sum_{i \in [1,n]} p_i(\log \lambda - \log p_i)}{\lambda}$
Uppercase count: Number of uppercase words #
Readability: Gunning fog index using average sentence length (ASL) [7] and the percentage of complex words (PCW) $0.4(ASL + PCW)$
Verb Count: Number of verbs #
Noun Count: Number of nouns #
Adjective Count: Number of adjectives #
Referral Count: Number of @user #
Time in the day: Normalised time in the day measured in minutes #
Informativeness: Terminological novelty of the post wrt other posts. The cumulative tfIdf value of each term t in post p $\sum_{t \in p} tfidf(t, p)$
Polarity: Cumulation of polar term weights in p (using Sentiwordnet3 lexicon) normalised by polar terms count $\frac{Po + Ne}{|terms|}$

Paper excerpt: 4.2 Experiments. Experiments are intended to test the performance of different classification models in identifying seed posts. Therefore we used four classifiers: discriminative classifiers Perceptron and SVM, the generative classifier Naive Bayes and the …

•  How do all these features influence activity generation in an online community?
–  Such knowledge leads to better use and management of the community
74
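Three of the formula-based features in Table 1 can be sketched as below. The reading of the complexity formula (λ as the total number of words in the post, p_i as the count of the i-th unique word) is an assumption based on the table text, and the whitespace tokenisation in the example is a simplification.

```python
from collections import Counter
from math import log

def post_rate(post_count, user_age_minutes):
    """Post Rate = PostCount / UserAge."""
    return post_count / user_age_minutes

def complexity(words):
    """(1/lambda) * sum_i p_i * (log(lambda) - log(p_i)),
    assuming lambda = total words and p_i = count of unique word i."""
    if not words:
        return 0.0
    lam = len(words)
    counts = Counter(words)
    return sum(c * (log(lam) - log(c)) for c in counts.values()) / lam

def gunning_fog(avg_sentence_length, pct_complex_words):
    """Readability = 0.4 * (ASL + PCW)."""
    return 0.4 * (avg_sentence_length + pct_complex_words)

# Made-up post, tokenised on whitespace for illustration.
post = "help needed help needed now".split()
post_complexity = complexity(post)
```

Under this reading a post that repeats a single word has complexity 0, and the value grows as the vocabulary of the post becomes more varied.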

Experiment for identifying seed posts

•  Twitter data on the Haiti earthquake, and the Union
Address

Dataset         Users    Tweets   Seeds   Non-seeds   Replies
Haiti           44,497   65,022   1,405   60,686      2,931
Union Address   66,300   80,272   7,228   55,169      17,875

•  Evaluated a binary classification task
–  Is this post a seed post or not?

75

Identifying seeds with different types of features

Paper excerpt: … first report on the results obtained from our model selection phase, before moving onto our results from using the best model with the top-k features.

Table 3. Results from the classification of seed posts using varying feature sets and classification models

(a) Haiti Dataset
Features   Model   P       R       F1      ROC
User       Perc    0.794   0.528   0.634   0.727
User       SVM     0.843   0.159   0.267   0.566
User       NB      0.948   0.269   0.420   0.785
User       J48     0.906   0.679   0.776   0.822
Content    Perc    0.875   0.077   0.142   0.606
Content    SVM     0.552   0.727   0.627   0.589
Content    NB      0.721   0.638   0.677   0.769
Content    J48     0.685   0.705   0.695   0.711
All        Perc    0.794   0.528   0.634   0.726
All        SVM     0.483   0.996   0.651   0.502
All        NB      0.962   0.280   0.434   0.852
All        J48     0.824   0.775   0.798   0.836

(b) Union Address Dataset
Features   Model   P       R       F1      ROC
User       Perc    0.658   0.697   0.677   0.673
User       SVM     0.510   0.946   0.663   0.512
User       NB      0.844   0.086   0.157   0.707
User       J48     0.851   0.722   0.782   0.830
Content    Perc    0.467   0.698   0.560   0.457
Content    SVM     0.650   0.589   0.618   0.638
Content    NB      0.762   0.212   0.332   0.649
Content    J48     0.740   0.533   0.619   0.736
All        Perc    0.630   0.762   0.690   0.672
All        SVM     0.499   0.990   0.664   0.506
All        NB      0.874   0.212   0.341   0.737
All        J48     0.890   0.810   0.848   0.877

•  User features are most important in Twitter
•  But combining user & content features gives best results

Paper excerpt: 4.3 Results. Our findings from Table 3 demonstrate the effectiveness of using solely user features for identifying seed posts. In both the Haiti and Union Address datasets training a classification model using user features shows improved performance over the same models trained using content features. In the case of the Union …
76

Impact of different features
•  What features have the highest impact on identification of seed posts?
•  Rank features by information gain ratio wrt seed post class label

Paper excerpt: … which we found to be 0.674 indicating a good correlation between the two lists and their respective ranks.

Table 4. Features ranked by Information Gain Ratio wrt Seed Post class label. The feature name is paired with its IG in brackets.

Rank Haiti Union Address
1 user-list-degree (0.275) user-list-degree (0.319)
2 user-in-degree (0.221) content-time-in-day (0.152)
3 content-informativeness (0.154) user-in-degree (0.133)
4 user-num-posts (0.111) user-num-posts (0.104)
5 content-time-in-day (0.089) user-post-rate (0.075)
6 user-post-rate (0.075) user-out-degree (0.056)
7 content-polarity (0.064) content-referral-count (0.030)
8 user-out-degree (0.040) user-age (0.015)
9 content-referral-count (0.038) content-polarity (0.015)
10 content-length (0.020) content-length (0.010)
11 content-readability (0.018) content-complexity (0.004)
12 user-age (0.015) content-noun-count (0.002)
13 content-uppercase-count (0.012) content-readability (0.001)
14 content-noun-count (0.010) content-verb-count (0.001)
15 content-adj-count (0.005) content-adj-count (0.0)
16 content-complexity (0.0) content-informativeness (0.0)
17 content-verb-count (0.0) content-uppercase-count (0.0)
77
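Information gain ratio divides the information gain of a feature by the entropy of the feature's own value distribution. The sketch below handles discrete feature values only; numeric features such as list degree would first need to be binned, and both the binning and the toy data are assumptions rather than the setup used for the table.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def gain_ratio(feature_values, labels):
    """Information gain of the feature wrt the labels, divided by split info."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info else 0.0

# Made-up data: 1 = seed post, 0 = non-seed post.
labels = [1, 1, 0, 0]
features = {
    "user-list-degree": ["high", "high", "low", "low"],
    "content-length": ["long", "short", "long", "short"],
}
ranking = sorted(features, key=lambda f: gain_ratio(features[f], labels),
                 reverse=True)
```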

Positive/negative impact of features
•  What is the correlation between seed posts and features?

Fig. 3. Contributions of top-5 features to identifying Non-seeds (N) and Seeds (S). Upper plots are for the Haiti dataset and the lower plots are for the Union Address dataset.
78

Identifying Seed Posts
•  Can we identify seed posts using the top-k features?

–  Stability is reached with
5 features

–  Classification with 5
features is sufficient for
identifying posts that
generate responses

79

Predicting Discussion Activity
•  Reply rates:
–  Haiti 1-74 responses, Union Address 1-75 responses
•  Compare rankings
–  Ground truth vs predicted
•  Experiments
–  Using Haiti and Union Address datasets
–  Evaluate predicted rank k where k = {1, 5, 10, 20, 50, 100}
–  Support Vector Regression with user, content, user+content
features

Dataset         Training size   Test size   Test Vol Mean   Test Vol SD
Haiti           980             210         1.664           3.017
Union Address   5,067           1,161       1.761           2.342
80
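One common way to compare a predicted ranking with the ground truth at rank k is normalised discounted cumulative gain (nDCG@k). The slide does not name the measure used, so treat the sketch below as one possible choice; the reply volumes and scores are invented.

```python
from math import log2

def dcg(relevances, k):
    """Discounted cumulative gain of the first k items."""
    return sum(rel / log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(true_volumes, predicted_scores, k):
    """nDCG@k: order posts by predicted score, gain = true reply volume."""
    order = sorted(range(len(true_volumes)),
                   key=lambda i: predicted_scores[i], reverse=True)
    ranked = [true_volumes[i] for i in order]
    ideal = sorted(true_volumes, reverse=True)
    best = dcg(ideal, k)
    return dcg(ranked, k) / best if best else 0.0

# Made-up seed posts: true reply counts and scores from a regression model.
true_volumes = [5, 1, 3]
good_scores = [0.9, 0.1, 0.5]
bad_scores = [0.1, 0.9, 0.5]
```

`ndcg_at_k(true_volumes, good_scores, 3)` gives 1.0 because the predicted order matches the true order, while the reversed scores are penalised most heavily at small k.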

Predicting Discussion Activity

Haiti dataset Union Address dataset

•  Content features are key for top ranks
•  User features more important for higher ranks

81

Identifying Seed Posts in Boards.ie

•  Used the same features as before
–  User features
•  In-degree, out-degree, post count, user age, post rate
–  Content features
•  Post Length, complexity, readability, referral count, time in day,
informativeness, polarity

•  New features designed to capture user affinity
–  Forum Entropy
•  Concentration of forum activity
•  Higher entropy = large forum spread
–  Forum Likelihood
•  Likelihood of forum post given user history
•  Combines post history with incoming data
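The two affinity features can be sketched from a user's posting history, represented here as a list of forum names (one entry per past post). The entropy follows the usual Shannon definition; the likelihood uses add-one (Laplace) smoothing, which is an assumption since the slide does not give the exact formula.

```python
from collections import Counter
from math import log2

def forum_entropy(forum_history):
    """Entropy of the user's posts across forums: higher = wider spread."""
    n = len(forum_history)
    counts = Counter(forum_history)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def forum_likelihood(forum_history, forum, n_forums, alpha=1.0):
    """Smoothed probability that the user posts in `forum`, given history.
    alpha is a hypothetical smoothing constant."""
    counts = Counter(forum_history)
    return (counts[forum] + alpha) / (len(forum_history) + alpha * n_forums)

# Made-up history of one user over the six-month window.
history = ["rugby", "rugby", "commuting"]
spread = forum_entropy(history)
likelihood_rugby = forum_likelihood(history, "rugby", n_forums=3)
```

A user who only ever posts in one forum has entropy 0, so a new post in that forum has a high likelihood under this estimate.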

82

Experiment for identifying seed posts
•  Used all posts from Boards.ie in 2006
•  Built features using a 6-month window prior to seed post date

Posts       Seeds    Non-Seeds   Replies     Users
1,942,030   90,765   21,800      1,829,465   29,908

•  Evaluated a binary classification task
–  Is this post a seed post or not?
–  Precision, Recall, F1 and Accuracy
–  Tested: user, content, focus features, and their combinations

83

Identifying seeds with different types of features

TABLE II. Results from the classification of seed posts using varying feature sets and classification models

Features          Model         P       R       F1      ROC
User              SVM           0.775   0.810   0.774   0.581
User              Naive Bayes   0.691   0.767   0.719   0.540
User              Max Ent       0.776   0.806   0.722   0.556
User              J48           0.778   0.809   0.734   0.582
Content           SVM           0.739   0.804   0.729   0.511
Content           Naive Bayes   0.730   0.794   0.740   0.616
Content           Max Ent       0.758   0.806   0.730   0.678
Content           J48           0.795   0.822   0.783   0.617
Focus             SVM           0.649   0.805   0.719   0.500
Focus             Naive Bayes   0.710   0.737   0.722   0.588
Focus             Max Ent       0.649   0.805   0.719   0.586
Focus             J48           0.649   0.805   0.719   0.500
User + Content    SVM           0.790   0.808   0.727   0.509
User + Content    Naive Bayes   0.712   0.772   0.732   0.593
User + Content    Max Ent       0.767   0.807   0.734   0.671
User + Content    J48           0.795   0.821   0.779   0.675
User + Focus      SVM           0.776   0.810   0.776   0.583
User + Focus      Naive Bayes   0.699   0.778   0.724   0.585
User + Focus      Max Ent       0.771   0.806   0.722   0.607
User + Focus      J48           0.777   0.810   0.742   0.617
Content + Focus   SVM           0.750   0.805   0.729   0.511
Content + Focus   Naive Bayes   0.732   0.787   0.746   0.658
Content + Focus   Max Ent       0.762   0.807   0.731   0.692
Content + Focus   J48           0.798   0.823   0.787   0.662
All               SVM           0.791   0.808   0.727   0.510
All               Naive Bayes   0.724   0.780   0.740   0.637
All               Max Ent       0.768   0.808   0.733   0.688
All               J48           0.798   0.824   0.792   0.692
84

Positive/negative impact of features on Boards.ie
•  What are the most important features for predicting seed posts?
•  Correlations:
–  Referral counts (non-seeds)
–  Forum likelihood (seeds)
–  Informativeness (non-seeds)
–  Readability (seeds)
–  User age (non-seeds)

TABLE III. Reduction in F1 levels as individual features are dropped from the J48 classifier

Feature Dropped    F1
-                  0.815
Post Count         0.815
In-Degree          0.811*
Out-Degree         0.811*
User Age           0.807***
Post Rate          0.815
Forum Entropy      0.815
Forum Likelihood   0.798***
Post Length        0.810**
Complexity         0.811**
Readability        0.802***
Referral Count     0.793***
Time in Day        0.810**
Informativeness    0.801***
Polarity           0.808***
Signif. codes: p-value < 0.001 *** 0.01 ** 0.05 * 0.1 .

Paper excerpt: … hyperlinks (e.g., ads and spams). This contrasts with work in Twitter which found that tweets containing many links were …
85

Predicting Discussion Activity in Boards.ie

•  Can we predict the level of
discussion activity?

86

Predicting Discussion Activity in Boards.ie

•  What impact do features have on discussion length?
–  Assessed Linear Regression model with focus and content
features

–  Forum Likelihood (pos)
–  Content Length (+/neutral)
–  Complexity (pos)
–  Readability (+/neutral)
–  Referral Count (neg)
–  Time in Day (+/neutral)
–  Informativeness (-/neutral)
–  Polarity (neg)

87

Stay tuned
•  More communities
–  SAP, IBM, StackOverflow, Reddit
–  Compare impact of features on their dynamics

•  Better behaviour analysis
–  Fewer features, more forums/communities, more graphs!
–  Healthy? posts, reciprocation, discussions, sentiment mixture

•  Churn analysis
–  Correlation of features/behaviour to ‘bounce rate’

•  Intervention!
–  Opportunities and mechanisms to influence behaviour
88

Upcoming events

Social Object Networks
IEEE Social Computing, 2011
October 9-10, Boston, USA

http://ir.ii.uam.es/socialobjects2011/
Deadline: August 5, 2011

Intelligent Web Services Meet Social Computing
AAAI Spring Symposium 2012,
March 26-28, Stanford, California

http://vitvar.com/events/aaai-ss12
Deadline: October 7, 2011

89

Questionnaire on user needs

http://socsem.open.ac.uk/limesurvey/index.php?sid=55487

Questionnaire is to identify the needs that community users have within online
communities and to learn the factors and issues that influence those needs.

90

Thanks to
My social semantics team
Sofia Angeletou, Research Associate
Matthew Rowe, Research Associate

Live Social Semantics team
Ciro Cattuto, ISI, Turin
Wouter van Den Broeck, ISI, Turin
Alain Barrat, CPT Marseille & ISI
Martin Szomszor, CeRC, City University, UK

Acknowledgements

Gianluca Correndo, Uni Southampton
Ivan Cantador, UAM, Madrid
STI International
ESWC09/10 & HT09 chairs and organisers
All LSS participants

91
