1. mining the social web
Aristides Gionis
Michael Mathioudakis
firstname.lastname@aalto.fi
Aalto University
Spring 2015
2. social web
facebook, twitter, linkedin, foursquare, flickr, instagram, pinterest, youtube, ustream, github, stackoverflow, wikipedia
3. social web
websites and platforms that enable users to
• produce content
– blog posts, ‘status’ messages, videos, pictures, podcasts
• consume content
– read text (blog posts, ‘status’ messages), listen to podcasts, watch videos
• interact with each other
– comment on each other’s posts, ‘like’ or rate items
4. mining the social web
a lot of users... a lot of data... what could we learn*?
* assuming we have the data - more on that later
gain insights into...
• social behavior
– how many connections does an average person have?
– do people connect with like-minded people?
• political sentiment
– what do people think about current political issues?
• how we experience our cities
– what’s the best neighborhood for food/nightlife?
• how we build our careers
– how often do people change careers?
– how beneficial is it to ‘network’ professionally?
• other?
5. mining the social web
there is already research that explores those questions
we will discuss some of it now and in the next two lectures
6. twitter
• a social sensor
– social network + news media
– what is happening?
– where, who?
– trends
– events
– opinions
– political views
– sentiments
– demographics
12. foursquare
• location-based social network
• users check in to different locations
• locations have types (hierarchy)
– restaurant, sport venue, museum, college, …
• questions:
– where do people hang out?
– where do events take place?
– do friends influence each other?
13. when/where do people check in?
[Figure (a): hourly check-in frequency by hour of day for New York, London, Barcelona, Helsinki, and in total]
(a) Hourly check-in frequency during the day. The activity is at its lowest around a.m. and after that, there are three peaks: one when people go to work in the morning, one in the middle of the day, and the last one at the end of the evening. Yet, depending on the city, these peaks do not happen at the same time, nor with the same intensity. Therefore, instead of working directly with the raw feature values, we use the number of standard deviations from the mean, or z-score.
Figure: Venues clustered by time of check-ins (time clusters in Paris; axes: hour and percentage).
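The z-score normalization mentioned in the caption can be illustrated with a minimal sketch. It assumes the population standard deviation is used and that each city is normalized against its own 24 hourly values; the source does not say which variant the authors applied. The function name `hourly_zscores` and the counts are hypothetical.

```python
from statistics import mean, pstdev


def hourly_zscores(counts):
    """Return the z-score of each hourly count relative to the whole day.

    Assumption: population standard deviation (pstdev); the source does
    not specify population vs. sample deviation.
    """
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:
        # A perfectly flat profile has no variation to normalize.
        return [0.0 for _ in counts]
    return [(c - mu) / sigma for c in counts]


# Hypothetical hourly check-in counts (24 values, hour 0 to hour 23).
city_a = [2, 1, 1, 1, 2, 4, 8, 12, 14, 10, 9, 11,
          15, 13, 9, 8, 9, 11, 14, 16, 15, 10, 6, 3]
# Same daily shape, ten times the volume (a much "busier" city).
city_b = [10 * c for c in city_a]

z_a = hourly_zscores(city_a)
z_b = hourly_zscores(city_b)
```

Because z-scores remove both the mean and the scale, `z_a` and `z_b` coincide up to floating-point error, which is why profiles of cities with very different check-in volumes become comparable.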
14. when/where do people check in?
(a) Venues in Paris and Barcelona with lowest and highest user entropy.

Lowest user entropy:
City       Name                        Category             Entropy
Barcelona  Castellers de Barcelona     Non-Profit           0.0139
Barcelona  Café de la Pompeu           Café                 0.0172
Barcelona  Ràdio                       Radio Station        0.0176
Paris      Boutique Orange             Electronics Store    0.0099
Paris      Métro Goncourt []           Subway               0.0105
Paris      Blue Acacia                 Office               0.0112

Highest user entropy:
City       Name                        Category             Entropy
Barcelona  Plaça de Catalunya          Plaza                0.5835
Barcelona  Sants Estació               Train Station        0.6298
Barcelona  Sagrada Família             Government Building  0.6309
Barcelona  Camp Nou                    Stadium              0.6852
Paris      Gare SNCF : Gare de Lyon    Train Station        0.6725
Paris      Gare SNCF : Paris Nord      Train Station        0.6911
Paris      Musée du Louvre             Museum               0.6924
Paris      Tour Eiffel                 Government Building  0.7167
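The slide does not define user entropy. One plausible reading, sketched here as an assumption, is the Shannon entropy of a venue's check-ins distributed over the users who made them: a venue visited again and again by the same few people scores low, while one visited by many different people scores high. The normalization by the log of the number of distinct users is also an assumption, so the values below are not on the same scale as the table. The function name `user_entropy` and the sample data are hypothetical.

```python
import math
from collections import Counter


def user_entropy(checkin_user_ids, normalize=True):
    """Shannon entropy of a venue's check-ins over the users who made them.

    Assumption: this is one common definition; the source does not state
    the exact formula or normalization behind its entropy column.
    """
    counts = Counter(checkin_user_ids)
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        # No check-ins, or a single user: no uncertainty about who checks in.
        return 0.0
    h = -sum((n / total) * math.log(n / total) for n in counts.values())
    if normalize:
        # Scale to [0, 1] by the maximum entropy for this many users.
        h /= math.log(len(counts))
    return h


# Hypothetical venues: an office with one dominant regular, and a plaza
# where the same number of check-ins is spread evenly over many visitors.
office = ["u1"] * 18 + ["u2", "u3"]
plaza = ["u%d" % i for i in range(20)]

office_entropy = user_entropy(office)
plaza_entropy = user_entropy(plaza)
```

Under this definition the office scores well below the plaza, matching the pattern in the table where offices and small shops sit at the bottom and stations and landmarks at the top.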
18. your project
• come up with a project idea
• implement it!
• report on your results and findings
19. types of projects
• form a hypothesis and set out to test it
– are rich people happier?
• start with an interesting question
– which are hipster neighborhoods in my city?
• start with a business idea
– recommend relevant music to music listeners
– recommend clothes to music listeners
• start with a problem that you (think) can solve
– how to identify trends in space and time?
• start with a cool dataset and explore it
20. your project
(1) set a goal for your project (what’s the question you want to answer?)
(2) study related literature (what has / hasn’t been done already? or do you think you can do it better?)
(3) collect data (some data are more difficult to come by)
(4) analyze data
(5) results
(6) evaluation (have you answered the question asked originally? possible improvements? future work?)
21. coming up with a project idea
• conferences: SIGKDD, ICWSM, WWW, WSDM
• themes
– urban computing, trend / event detection, social networks, political sentiment, privacy
– other
• google scholar
• talk with us
– office hours: Mon, 14:15-15:30 and by appointment
22. collecting the data
• what data are available?
– different platforms share different data about their users’ activity
– browse dev sites of social networks; find out about privacy policies and APIs
– browse public data repositories
– the data mining group has data for blog posts, twitter, google+, facebook, foursquare
• code
– Mining the Social Web (github): https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition
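Some public repositories distribute social graphs as plain-text edge lists: one pair of node ids per line, with `#` marking comment lines. As a minimal sketch of a first look at such a file, the function below computes the average number of connections per user. It assumes an undirected reading of the edges and whitespace-separated ids; the function name `average_degree` and the toy graph are hypothetical, and real datasets should be checked against their own documentation.

```python
from collections import defaultdict


def average_degree(edge_lines):
    """Average number of distinct neighbors per node in an edge list.

    Assumptions: each non-comment line starts with two whitespace-separated
    node ids, edges are treated as undirected, and self-loops are ignored.
    """
    neighbors = defaultdict(set)
    for line in edge_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        u, v = line.split()[:2]
        if u == v:
            continue  # ignore self-loops
        neighbors[u].add(v)
        neighbors[v].add(u)
    if not neighbors:
        return 0.0
    return sum(len(n) for n in neighbors.values()) / len(neighbors)


# Hypothetical toy graph; with a real file, pass the open file object
# instead, e.g. average_degree(open("edges.txt")).
sample = [
    "# toy friendship graph",
    "1\t2",
    "1\t3",
    "2\t3",
    "3\t4",
]
avg = average_degree(sample)
```

Using sets of neighbors means duplicate edges, or the same edge listed in both directions, are counted once.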
23. schedule
• Today: overview
• February 2nd: discuss literature (Aris)
• February 9th: discuss literature (Michael)
• February 16th and 23rd: present project proposals
• March 30th: students submit progress report
• March 30th and April 6th: intermediate presentations
• May 4th and May 11th: final presentations
• May 15th: final report due
24. final report
• introduction
• related work
• problem statement
• proposed technique (algorithms)
• data description
• empirical evaluation
– results
– comparison with state of the art
• future work
25. grading
• originality (has it been done before?)
• potential impact (how interesting it is and why)
• rigor of the proposed technique
• reproducibility (public code)
• presentation
• teams of 2 are encouraged
• presentations and reports are required
• surveys of existing techniques are ok, too
26. schedule
• Today: overview
• February 2nd: discuss literature (Aris)
• February 9th: discuss literature (Michael)
• February 16th and 23rd: students present project proposals
• March 30th: students submit progress report
• March 30th and April 6th: intermediate presentations
• May 4th and May 11th: final presentations
• May 15th: final report due
27. until then...
• browse literature
– see papers posted on noppa for a sample
– conferences: KDD, ICWSM, WWW, WSDM
– google scholar
• dev websites, for example... https://dev.twitter.com, https://developers.facebook.com, https://developer.github.com/, https://developer.foursquare.com
• code samples, https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition
• data repositories, http://snap.stanford.edu/, http://icwsm.org/2013/datasets/datasets/, http://wadam-data.dis.uniroma1.it
• and talk to us!
28. see you next week!
Aristides Gionis
Michael Mathioudakis
contact: firstname.lastname@aalto.fi
Office Hours: Mon, 14:15-15:30 and by appointment