Recent years have seen an increased use of social media data as a cheaper alternative to more traditional methods of market research. Social media services generate a large quantity of data every day and some of the data is available through their Application Programming Interfaces (APIs). This presentation outlines some of the research work carried out as part of the Uncertainty of Identity (http://www.uncertaintyofidentity.com) project. In particular, the use of social media data for activity pattern analysis and demographic profiling is explored.
Analysing the digital traces of Social Media users
1. Analysing the digital traces of Social Media users
Muhammad Adnan, Guy Lansley, Paul Longley
Consumer Research Data Centre, Department of Geography, University College London
Web: www.uncertaintyofidentity.com ; www.cdrc.ac.uk
Twitter: @gisandtech
2. Introduction
•Recent years have witnessed rapid growth in the use of online services
•Online shopping, bank transactions, social networking services
•Issues related to cybercrime, identity fraud, and hacking
•‘Uncertainty of Identity’ project: Combining real and virtual world datasets to better understand the identity of individuals
•Real world (Census, Demographic Classifications)
•Virtual world (Email addresses, Social media accounts)
3. Introduction
•Geodemographics
•Census data represent the night time geography
•Social media datasets can be used to provide day and travel time geographies
•Spatial and temporal analysis of social media users
•Activity pattern analysis
•Tweet content analysis
•Develop tools for Identity analysis
•E-mail addresses
•Social media accounts
4. Outline
•Some popular social media services
•Twitter
•Introduction
•Case Study 1: Social Media Geodemographics
•Case Study 2: Activity pattern analysis
•Temporal analysis of Twitter activity around different world cities
•Case Study 3: Twitter Geographic Profiler
•An Uncertainty of Identity tool
6. Some popular social media services
•Facebook
•2 billion total users
•1.28 billion active users
•Google Plus
•1.6 billion total users
•540 million active users
•Twitter
•More than 1 billion total users
•255 million active users
(1) Mediabistro. 2014. Social Media Stats 2014. Retrieved 17th November, 2014 from http://www.mediabistro.com/alltwitter/social-media-statistics-2014_b57746.
7. Twitter (www.twitter.com)
•Online social networking and micro-blogging web service
•Users can send messages of 140 characters or less
•Approx. 500 million tweets daily
•78% of Twitter’s active users are on mobile
•44% of users have never sent a tweet (inactive users)
•Twitter API: for downloading live tweet data
8. Data available through the Twitter API
•User Creation Date
•Followers
•Friends
•User ID
•Language
•Location
•Name
•Screen Name
•Time Zone
•Geo Enabled
•Latitude
•Longitude
•Tweet date and time
•Tweet text
•A database of 1.4 billion social media messages
•September, 2012 – February, 2014
•Geo-tagged tweets
•Latitude / Longitude
12. Social Media Geodemographics
•Geodemographics
•“Analysis of people by where they live” (2)
•Night time characteristics of the population
•Social Media Geodemographics
•Moving beyond the night time geography
•Who: Ethnicity, Gender, and Age of social media users
•When: What time of day conversations happen
•Where: Where social media conversations happen
(2) Sleight, P. (2004). Targeting Customers: How to Use Geodemographic and Lifestyle Data in Your Business.
13. Twitter data for the case study
•Approx. 8 million geo-tagged tweets (Jan – Dec, 2013)
•Sent by 385,050 unique users
•155,249 users sent 5 or more tweets (7.6 million tweets)
14. Flows of people and information
•Entropy is a measure of uncertainty in a random variable
•Shannon Entropy
•7.6 million tweets were aggregated to 4,765 LSOAs
•Entropy was calculated
•High values indicate high flows of people and information
$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_b p(x_i)$$
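The entropy calculation on this slide can be sketched in a few lines. The slide does not say which random variable was used, so this sketch assumes, purely for illustration, that the distribution is the share of each area's tweets contributed by each user. The function names and the `(area_id, user_id)` record layout are hypothetical.

```python
import math
from collections import Counter, defaultdict


def shannon_entropy(counts, base=2):
    """Shannon entropy H(X) = -sum p(x_i) * log_b p(x_i) of a list of counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts:
        if count > 0:  # zero-probability outcomes contribute nothing
            p = count / total
            entropy -= p * math.log(p, base)
    return entropy


def entropy_by_area(tweets, base=2):
    """Aggregate (area_id, user_id) records and return the entropy per area.

    Assumption: the distribution within an area is tweets per user.
    """
    per_area = defaultdict(Counter)
    for area_id, user_id in tweets:
        per_area[area_id][user_id] += 1
    return {
        area: shannon_entropy(list(users.values()), base)
        for area, users in per_area.items()
    }
```

Under this assumption, an area where many different users each send a few tweets scores higher than an area dominated by one prolific user, which matches the slide's reading of high values as high flows of people and information.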
16. Morning (6am – 11.59am)
Afternoon (12pm – 5.59pm)
Flows of people and information
17. Evening (6pm – 11.59pm)
Night (12 midnight – 6.59am)
Flows of people and information
18. Variables for creating a geo-temporal classification
1.Residence
•Where Twitter users live
2.Ethnicity
•Probable ethnic origins of Twitter users
3.Age
•Probable age of Twitter users
4.Land Use Category of a Tweet message
•Residential; Non-domestic building; Park etc.
5.Temporal Scales
•Day, Afternoon, Night, Peak travel hours
19. Residence of Twitter Users
•170m X 170m grid was used to find the probable residence of users
•Probable residence was found for 75,522 users
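One way to picture the grid step is sketched below. The slide does not state the rule used to pick a residence, so this sketch assumes the probable residence is simply the 170 m cell from which a user tweets most often, and that coordinates are already in a projected system measured in metres. The function names, the record layout, and the reuse of the five-tweet threshold from the case study data are all illustrative assumptions.

```python
from collections import Counter, defaultdict

CELL_SIZE_M = 170  # grid resolution quoted on the slide


def grid_cell(easting, northing, cell_size=CELL_SIZE_M):
    """Map projected coordinates (metres) to a (column, row) grid cell."""
    return (int(easting // cell_size), int(northing // cell_size))


def probable_residence(tweets, min_tweets=5):
    """Return {user_id: cell} using each user's most frequent cell.

    tweets: iterable of (user_id, easting, northing).
    Assumption: the modal cell stands in for the residence; users with
    fewer than min_tweets tweets are skipped.
    """
    cells = defaultdict(Counter)
    for user_id, easting, northing in tweets:
        cells[user_id][grid_cell(easting, northing)] += 1

    residences = {}
    for user_id, counter in cells.items():
        if sum(counter.values()) < min_tweets:
            continue
        cell, _count = counter.most_common(1)[0]
        residences[user_id] = cell
    return residences
```

A stricter variant could count only tweets sent at night, which would fit the night time geography idea, but that is not something the slide confirms.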
20. Extracting demographic attributes of Twitter users by using their forenames and surnames
A name is a statement of the bearer’s cultural, ethnic, and linguistic identity (3)
(3) Mateos P, Longley P A, O’Sullivan D 2011. Ethnicity and population structure in personal naming networks. PloS ONE (Public Library of Science) 6 (9) e22943.
21. Analysing Names on Twitter
•Some examples of NAME variations on Twitter
•Approx. 68% of the accounts have real names
Fake Names Castor 5. WHAT IS LOVE? MysticMind KIRILL_aka_KID Vanessa Justin Bieber Home
Real Names Kevin Hodge Andre Alves Jose de Franco Carolina Thomas, Dr. Prof. Martha Del Val Fabíola Sanchez Fernandes
22. Onomap: Names to Ethnicity classification
•Onomap was created by clustering names of 1 billion individuals around the world
•Applied ONOMAP (www.onomap.org) on forename – surname pairs
Kevin Hodge (English) Pablo Mateos (Spanish) … … … …
23. Top 10 Ethnic Groups of Twitter Users
•A total of 67 ethnic groups were identified
24. Age estimation from ‘forenames’
•Monica dataset provided by CACI Ltd, UK
•Supplemented with UK birth certificate records
25. Age distribution of Twitter users
Twitter Users vs. 2011 Census (Greater London)
(4) Longley, P., Adnan, M., Lansley, G. 2013. “The geo-temporal demographics of Twitter usage”. Environment and Planning A. (In Press)
27. Variables for creating a geo-temporal classification
1. Residence
•V1: Tweet made near probable London residence
•V2: Tweeter lives ‘outside the UK’
•V3: Tweeter lives in the rest of the UK outside London
2. Total Number of Tweets
•V4: Total number of tweets made by the user
3. Ethnicity
•V5: West European
•V6: East European
•V7: Greek or Turkish
•V8: South East Asian
•V9: Other Asian
•V10: African & Caribbean
•V11: Jewish
•V12: Chinese
•V13: Other minority
4. Age
•V14: <=20
•V15: 21 - 30
•V16: 31 - 40
•V17: 41 - 50
•V18: 50+
5. Tweets outside the UK
•V19: In West Europe (not including UK)
•V20: In East Europe
•V21: In North America
•V22: In Central or South America
•V23: In Australasia
•V24: In Africa
•V25: In Middle East
•V26: In Asia
•V27: In Paris
28. Variables for creating a geo-temporal classification
6. Number of countries visited
•V28: Number of countries tweeter has visited
7. London Land Use Category
•V29: Residential location
•V30: Non-domestic buildings
•V31: Transport links and locations
•V32: Green-spaces
•V33: All other land uses
8. 2011 London Output Area Classification
•V34: Intermediate Lifestyles
•V35: High Density and High Rise Flats
•V36: Settled Asians
•V37: Urban Elites
•V38: City Vibe
•V39: London Life-Cycle
•V40: Multi-Ethnic Suburbs
•V41: Ageing-City Fringe
9. Temporal Scales
•V42: Morning Peak Hours
•V43: Week Day
•V44: Afternoon
•V45: Week Night
•V46: Weekend
29. Computing the geo-temporal classifications
•Segmentations were created by using the K-means clustering algorithm
•K-means tries to find cluster centroids by minimising
$$V = \sum_{x=1}^{n} \sum_{y=1}^{n} (z_x - \mu_y)^2$$
•Seven clusters
•Group A: London Residents
•Group B: Commuting Professionals
•Group C: Student Lifestyle
•Group D: The Daily Grind
•Group E: Spectators
•Group F: Visitors
•Group G: Workplace and tourist activity
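The clustering step can be illustrated with a small pure-Python version of the standard K-means procedure (Lloyd's algorithm). This is a sketch under stated assumptions, not the implementation used in the study: the function names are hypothetical, the initial centroids are drawn at random with a fixed seed, and the quantity reported is the usual within-cluster sum of squared distances. In the study each point would be a user's vector of the 46 variables, with k = 7.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Return (centroids, assignment) after alternating assign/update steps."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [None] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


def within_cluster_sum_of_squares(points, centroids, assignment):
    """Objective that K-means tries to make small."""
    return sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))
```

Because the result depends on the starting centroids, a common practice is to run the algorithm several times with different seeds and keep the run with the lowest objective value.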
33. Group G: Workplace and tourist activity
•Tweets sent from non-domestic buildings
•Full range of Twitter age cohorts
•Tweets originate from a mix of residents and international visitors
Chart: values of variables V1 to V46 (scale 0 to 1)
34. Social Media Geodemographics
•Geo-temporal demographic classifications
•Census (night time geography)
•Social media data (day and travel time geography)
•Issues of representation
•An insight into the residential and travel geographies of individuals
•An insight into the spatial activity patterns of different kinds of social media users
35. Case Study 2: Analysis of Twitter activity around world cities
(5) Muhammad Adnan, Alistair Leak, Paul Longley. “A geocomputational analysis of Twitter activity around different world cities”. Geospatial Information Science.
36. Activity Pattern Analysis
•Comparison of the use of Twitter between different cities
•Weekly patterns of activity
•Seasonal shifts
•Data: 19th September, 2012 – 25th September, 2013
•Point-in-polygon operations were performed to extract data for different cities around the world
•Approx. 170 million tweets were sent from the top 30 cities
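A point-in-polygon test of the kind mentioned above can be sketched with the classic ray-casting method. The deck does not say which tool or algorithm was used, so this is only an illustration; the function names and the representation of a city boundary as a list of (longitude, latitude) vertices are assumptions.

```python
from collections import Counter


def point_in_polygon(lon, lat, polygon):
    """Ray casting: count how many polygon edges a ray to the east crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def count_tweets_by_city(tweets, city_polygons):
    """Count tweets per city.

    tweets: iterable of (lon, lat); city_polygons: {city_name: [(lon, lat), ...]}.
    """
    counts = Counter()
    for lon, lat in tweets:
        for city, polygon in city_polygons.items():
            if point_in_polygon(lon, lat, polygon):
                counts[city] += 1
                break  # assumes city boundaries do not overlap
    return counts
```

For hundreds of millions of tweets a spatial index or a GIS database would normally be used in front of a test like this; the loop above is only meant to show the logic.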
37. Top 30 cities on Twitter
Chart: Number of Tweets (Millions) by city (axis 0 to 40)
•Approx. 170 million tweets were sent from the following 30 cities.
38. Time zone issue
•By default, the Twitter API sends the data in GMT (UTC)
•Data was converted from GMT to the corresponding time zones
Date & Time (GMT) | Date & Time (UTC +1)
Wed Dec 05 00:04:23 2012 | Wed Dec 05 01:04:23 2012
Wed Dec 05 00:06:29 2012 | Wed Dec 05 01:06:29 2012
Wed Dec 05 00:07:35 2012 | Wed Dec 05 01:07:35 2012
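The conversion shown on this slide can be reproduced with Python's standard `datetime` module. This is a minimal sketch: the timestamp layout follows the slide's examples (without the stray comma), the function name is hypothetical, and a fixed UTC offset is assumed rather than a named time zone.

```python
from datetime import datetime, timedelta, timezone

# Layout of the timestamps as printed on the slide, e.g. "Wed Dec 05 00:04:23 2012".
SLIDE_FORMAT = "%a %b %d %H:%M:%S %Y"


def gmt_to_offset(timestamp, utc_offset_hours):
    """Convert a GMT timestamp string to a fixed UTC offset, same layout."""
    parsed = datetime.strptime(timestamp, SLIDE_FORMAT).replace(tzinfo=timezone.utc)
    local = parsed.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    return local.strftime(SLIDE_FORMAT)


print(gmt_to_offset("Wed Dec 05 00:04:23 2012", 1))
```

A fixed offset ignores daylight saving time. For a year-long dataset covering many cities, a named zone from the standard-library `zoneinfo` module would be the safer choice, since it applies the correct offset for each date.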
39. Temporal Analysis of Twitter Cities
Panels: Jakarta; Istanbul; Paris; Sao Paulo, Brazil; New York City; London
47. Case Study 3: Twitter Geographic Profiler (a part of Uncertainty of Identity Toolkit)
48. Introduction
•Uncertainty of Identity Toolkit is a framework for the identification and profiling of individuals from their
•Social media accounts
•E-mail addresses
•Twitter Geographic Profiler
•Maps ethno-cultural communities of a person’s friends
•Extracting identities of Twitter users
•Mapping them to probable ethnic origins
•Could have potential applications in targeted marketing
49. Twitter Geographic Profiler
•Given an individual’s Twitter Username or ID
•Extracts the information of individual’s friends
•Extracts the forename-surname pairs of the friends
•Maps forename-surname pairs to Onomap
•Builds an ethno-cultural profile of the person’s friends
•Maps the geographic distribution
50. Data available through the Twitter API
•User ID
•User Creation Date
•Followers
•Friends
•Language
•Location
•Name
•Screen Name or User Name
•Time Zone
•Geo Enabled
•Latitude
•Longitude
•Tweet date and time
•Tweet text
51. Twitter: getting the ids and usernames
•Given a Twitter username of a person, we use the Twitter API to get the list of friends’ ids
–A max of 15 requests every 15 minutes is allowed
–Each query can get up to 5000 ids
–Generally enough to download all the ids
•Using the ids, we fetch the name associated with each id
–Limited to 180 requests every 15 min
–Returns a single string from which we need to extract the name and surname tokens
–Not necessarily a valid forename + surname!
•E.g., “University of Birmingham”, “John1965”, “ What is Love”, “Mystic_mind”
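The request limits quoted on this slide translate into waiting time, which the following back-of-the-envelope sketch estimates. It assumes one name lookup per request and that the only delay is the wait between full 15-minute windows; both are simplifications, and the function names are hypothetical.

```python
import math

WINDOW_MINUTES = 15
ID_REQUESTS_PER_WINDOW = 15      # friend-id requests allowed per window (from the slide)
IDS_PER_REQUEST = 5000           # ids returned per request (from the slide)
NAME_REQUESTS_PER_WINDOW = 180   # name lookups allowed per window (from the slide)


def windows_needed(requests, per_window):
    """Number of rate-limit windows needed to issue the given requests."""
    return math.ceil(requests / per_window) if requests else 0


def estimate_wait_minutes(friend_count):
    """Rough waiting time to profile one user with friend_count friends.

    Assumption: one name lookup per friend, and no time is spent other
    than waiting for the next window to open.
    """
    id_requests = math.ceil(friend_count / IDS_PER_REQUEST)
    name_requests = friend_count
    waits = max(windows_needed(id_requests, ID_REQUESTS_PER_WINDOW) - 1, 0)
    waits += max(windows_needed(name_requests, NAME_REQUESTS_PER_WINDOW) - 1, 0)
    return waits * WINDOW_MINUTES
```

With these numbers the id download is rarely the bottleneck: a user with 1,000 friends needs a single id request but six windows of name lookups.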
52. Twitter: getting forename-surname pairs
•Name field was divided into different tokens
•Forenames and surnames were detected by matching the string tokens against a database of forename-surname pairs from 26 countries
•Users were discarded
–where tokens did not match a valid forename and surname
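The tokenising and matching steps can be sketched as follows. The two lookup sets are tiny hypothetical stand-ins for the 26-country name database, the function name is invented for illustration, and the rule of taking the first token as forename and the last as surname follows the description given later in the deck.

```python
import re

# Hypothetical stand-ins for the forename and surname database.
KNOWN_FORENAMES = {"kevin", "pablo", "martha"}
KNOWN_SURNAMES = {"hodge", "mateos"}


def extract_name_pair(name_field, forenames=KNOWN_FORENAMES, surnames=KNOWN_SURNAMES):
    """Return (forename, surname) or None if the name field is not usable."""
    # Split into runs of letters, dropping digits, underscores and punctuation.
    tokens = re.findall(r"[^\W\d_]+", name_field)
    if len(tokens) < 2:
        return None  # a single token cannot be a forename + surname pair
    forename, surname = tokens[0].lower(), tokens[-1].lower()
    if forename in forenames and surname in surnames:
        return forename.title(), surname.title()
    return None  # discard users whose tokens do not match the database


for candidate in ["Kevin Hodge", "John1965", "Mystic_mind", "University of Birmingham"]:
    print(candidate, "->", extract_name_pair(candidate))
```

Only the first candidate yields a pair; the other three are discarded, in line with the examples of invalid names given on the previous slide.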
53. Onomap: from names to ethnicity
•ONOMAP (www.onomap.org) was applied on forename – surname pairs
Kevin Hodge (English) Pablo Mateos (Spanish) … … … …
54. Friends’ Ethnicity Histogram
Once the entire list of friends’ forename + surname pairs has been parsed, we can easily estimate the distribution over the set of possible ethno-cultural groups of the Twitter user's friends
With the list of (surname, forename) pairs to hand, we query Onomap to get the ethno-cultural classification associated with each (surname, forename) pair, and the SearchSurnameTopCountries method to get the list of the countries where an instance of a given surname was observed.
Figure 1: Screenshot of the Twitter Geographic Profiler. The bottom part of the screen shows the histogram of the Twitter user's friends' ethno-cultural groups.
55. Friends’ Geographic Origins
Map showing the geographic origin of the Twitter user's friends’ surnames as assigned by our tool. Below the map the user is shown a list of the top 10 countries with the respective frequency.
In this work we mark as invalid any string that is composed of a single token. If this is the case, we skip the profile of the corresponding friend. If the string contains two or more tokens, we take the first one to be the forename and the last one to be the surname.
56. Twitter Geographic Profiler
•Potential applications include
–Measure the level of segregation/integration of a given individual (community) as the Shannon entropy of the (average) friends’ ethnicity histogram
–Outlier detection: identify uncommon behaviours, e.g., individuals that stand out in terms of the ethno-cultural groups they bond with
•Limitations
–Twitter data is very noisy
–Request limits
57. Conclusion
•Social media datasets can be used to create geo-temporal demographic classifications
•Day and travel time geographies
•Activity patterns
•Temporal analysis can identify some interesting patterns of a geographical area
•Weekly patterns of activity
•Seasonal shifts
•Twitter Geographic Profiler: Identification and profiling of ethno-cultural characteristics of individuals
•From their Twitter accounts
58. Conclusion
•Study of privacy implications on social media services
•Facebook, FourSquare
•Future work: Consumer Data Research Centre
•Use of social media for retail sector
•Spatial and temporal catchments of the social media users
59. Time catchments
•E.g. Day-time catchment
1.Identify the unique IDs of users frequently transmitting from a particular location at a given time or date range
2.Request their other activity through Twitter’s API, filter by time/date
3.Aggregate
The Twitter work-day time catchment of Bishopsgate
Activity at Bishopsgate in 2013
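The three catchment steps can be sketched on an in-memory list of tweets. This is an illustration under assumptions: step 2 is replaced by filtering records that are already downloaded rather than calling Twitter's API, and the record fields (`user_id`, `area`, `hour`), the working-hours window, and the visit threshold are all hypothetical choices.

```python
from collections import Counter


def daytime_catchment(tweets, site_area, start_hour=9, end_hour=17, min_visits=3):
    """Return (regular_users, other_area_counts) for a site.

    tweets: list of dicts with keys "user_id", "area" and "hour" (0-23).
    """
    # Step 1: users who frequently tweet from the site during the time window.
    visits = Counter(
        t["user_id"]
        for t in tweets
        if t["area"] == site_area and start_hour <= t["hour"] < end_hour
    )
    regulars = {user for user, count in visits.items() if count >= min_visits}

    # Step 2 (stand-in): take those users' other activity from the local data.
    # Step 3: aggregate that activity by area to form the catchment.
    other_areas = Counter(
        t["area"]
        for t in tweets
        if t["user_id"] in regulars and t["area"] != site_area
    )
    return regulars, other_areas
```

Mapping the resulting area counts gives a picture of where the site's regular daytime users are active at other times.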
60. Map locations: Waterloo; St Pancras; Victoria; Paddington; London Bridge; Liverpool Street; Kings Cross; Euston; Natural History Museum
61. Residential catchment of Twitter users
•First establish which users have tweeted from inside the building
•Create a customer catchment by identifying all of these users’ tweets sent from domestic land uses
•E.g. ASDA in Clapham Junction
The Twitter residential catchment of ASDA Supermarket at Clapham Junction