This document proposes a method to analyze communities on social networks using graph representation learning. It involves collecting data on brands and followers from Instagram, constructing graphs to model interactions, extracting embeddings using node2vec, classifying users, and clustering communities. Experiments on an Italian fashion brand found embeddings from reduced graphs performed well in classification. Clustering identified sub-communities validated by domain experts as related to professionals, holidays, and regular users. The method effectively analyzed social network communities through network modeling and representation learning.
Community analysis using graph representation learning on social networks
1. Community Analysis
Using Graph Representation Learning
On Social Networks
Marco Brambilla and Mattia Gasparini
Politecnico di Milano
2. Introduction
• Development of platforms such as Instagram and
Facebook increased levels of interaction among
people
• Variety of social networks data exploited to map
users behavior
• Graphs perfectly fit for modeling all the
interactions of these users
3. Problem Statement
• Analysis of communities on on-line social
networks, applying machine learning on graphs
• Representation learning is used to extract valuable
information about users inside the community
• Classification of consumer and business users
• Grouping of similar users
4. Representation Learning
• Define a continuous representation for each node of the
graph (embedding) to easily apply machine learning
techniques on graphs
• Embeddings are based on neighbourhood nodes:
[Figure: a node u and the neighbouring nodes used to learn its embedding]
5. Node2vec
• Embedding computation is performed using the
node2vec algorithm [1], included in the Stanford
Network Analysis Platform (SNAP) library
• The algorithm calculates the embeddings solving an
optimization problem:
$$\max_{f} \sum_{u \in V} \log \Pr\left(N_S(u) \mid f(u)\right)$$
[1] Grover and Leskovec. 2016. node2vec: Scalable Feature Learning for Networks.
7. Case Study
• Emerging Italian fashion brand: Emporio Le Sirenuse
• Products: luxury swimsuits and dresses
• Case study is focused on the brand, its competitors
and their communities, defined as the set of
follower users on the social network
http://www.fashiondatasensing.polimi.it/
8. Related Work
• Users’ communities defined using graph’s structural
properties [himelboim2017, deeb2017, guerrero2017]
• Brand-related communities have a specific role,
with business strategies as final target [ramadan2018,
kim2014, campbell2014]
• Fashion brands gain major advantages from social
media [brambilla2017, schmidt2017]
10. 1 – Data Collection
• Web scraping of 10 brands and their followers data
from Instagram
• Time window: from 1st January 2017 to 1st November 2017
• Final database: 400K users, 10M posts
11. 2 – Graph Construction
• Graphs are built using several entities: users that we
want to analyze ($U_t$), their posts ($P$), hashtags
referenced in the posts ($T$) and mentioned users ($U_m$)
• Correspondingly, three different types of edges are
defined:
o $E_{owner} = \{(e_1, e_2) \mid e_1 \in U_t,\ e_2 \in P\}$
o $E_{tag} = \{(e_1, e_2) \mid e_1 \in P,\ e_2 \in T\}$
o $E_{mention} = \{(e_1, e_2) \mid e_1 \in P,\ e_2 \in U_m\}$
12. 2 – Graph Construction
• Three graph models are used for the analysis:
1. Mixed network: $G_M = (\{U, P, T\}, \{E_{owner}, E_{tag}, E_{mention}\})$
2. Hashtags network: $G_h = (\{U_t, P, T\}, \{E_{owner}, E_{tag}\})$
3. Mentions network: $G_m = (\{U_t, U_m, P\}, \{E_{owner}, E_{mention}\})$
• $G_h$ and $G_m$ are subgraphs of $G_M$: they map the
influence of specific social media aspects
13. Example Hashtags
Network
The central part of the graph features
the most connected nodes, which
correspond to the users that
have many hashtags in common.
14. 3 – Graph Reduction
• A reduction process is applied to $G_h$ and $G_m$ to obtain «classical» social
networks, where the nodes are the users and the edges are weighted
based on the number of shared entities:
$$w_{ij} = \begin{cases} |t_i \cap t_j|, & \text{if } i, j \in G_h \\ |m_i \cap m_j|, & \text{if } i, j \in G_m \end{cases}$$
where $\{i, j\} \subset U_t$, $t_i, t_j \subseteq T$, $m_i, m_j \subseteq U_m$
• The reduced hashtags network and the reduced mentions network are
generated from $G_h$ and $G_m$, respectively
16. 4 – Features Extraction
• Both the heterogeneous networks $G_h$, $G_m$ and their reduced
counterparts are used to extract the embeddings
• The feature vector dimension is fixed for the two types
of networks: $d = 8$ for the heterogeneous networks and $d = 4$ for the reduced ones.
• Hyper-parameter tuning for 𝑝 and 𝑞 in supervised
setting
17. 5 – Classification
• Domain specific task:
«Discriminate between consumer and non-consumer
users»
• Ground-truth of 351 labelled users defined with
domain experts
• Three feature sets are tested:
• Social media account data (#followers, #following,
#posts, bio)
• Complete network embeddings
• Reduced network embeddings
18. 5 – Classification Experiment
Description of the user is valuable if a good fraction of the neighborhood
is exploited, which is not always feasible for complete networks.
19. 5 – Classification Experiment on Reduced Networks
Performance and the number of classified users increase with the number of user nodes
included in the model, even if those nodes are not classified: they enrich the neighborhood and,
as a consequence, the feature vector.
20. 6 – Clustering
• Hashtags reduced networks 𝐺ℎ used as a proxy for
content-based similarity
• K-means is applied to the extracted feature vectors
• Focus on 𝐺ℎ of Emporio Le Sirenuse community
21. 6 – Clustering
Network Input
Hashtags Reduced
Network 𝐺ℎ of
Emporio Le Sirenuse
community.
22. 6 – Clustering Features
Embeddings extracted from the
network.
First two features components
are used for visualization.
23. 6 – Clustering Output
K selection: plot of inertia
against number of clusters
25. 6 – Cluster Validation: Domain Experts
• Domain experts are provided with a subset of users for each
cluster
• Manual inspection of user profiles, providing feedback
about the patterns present in each cluster
26. 6 – Cluster Validation: Experts Feedback
• Clusters 0, 1 and 2 are very well defined: professional
users, such as showrooms and other brands
• Cluster 3 contains regular users that share contents
about holidays in Italy
• Clusters 4, 5 and 6 are composed mostly of regular
users, too
27. 6 – Cluster Labels
Cluster labels extracted using the set of hashtags shared by at least two users inside the
cluster.
29. Conclusion
• Results:
• Definition of an effective method to analyze
communities inside social network domain
• Modeling of user similarities through network features
• Detection of content-driven sub-communities
• Future work:
• Inclusion of time variable
Good morning. Today I am going to present our research work on community analysis using graph representation learning on social networks.
The starting point is that modern social networks such as Instagram and Facebook have sharply increased the number of interactions among people. This variety of data can be exploited to map user behavior, and the data itself fits a graph model well, capturing user interactions.
Our purpose is to analyze communities on online social networks by applying innovative machine learning techniques on graphs.
Specifically, we want to apply representation learning on graphs to describe users inside communities. Two main tasks have been developed: one classifies users as consumers or non-consumers, the other extracts subgroups of similar users.
Just a brief mention of the technique: representation learning defines a continuous feature vector for each node of the graph, referred to as an embedding. Embeddings can be learnt with different strategies: as one example, focusing on a specific node u, we can exploit its local neighbors, the blue nodes in the picture, to learn the feature vector of u.
Many algorithms can perform this operation; we chose node2vec, which provides a very flexible technique. It computes the embedding f(u) of a node u by solving the following optimization problem.
It maximizes the log-probability of observing a network neighborhood $N_S(u)$ for a node $u$, conditioned on its feature representation, given by $f$.
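For reference, the node2vec paper [1] makes this objective tractable with two assumptions: neighbours are conditionally independent given $f(u)$, and each source-neighbour pair is scored by a softmax over dot products. The formulas below restate that derivation from the paper (they are not taken from our slides), with $Z_u$ denoting the per-node normalisation term:

```latex
\[
\Pr(N_S(u) \mid f(u)) = \prod_{n_i \in N_S(u)} \Pr(n_i \mid f(u)),
\qquad
\Pr(n_i \mid f(u)) = \frac{\exp(f(n_i) \cdot f(u))}{\sum_{v \in V} \exp(f(v) \cdot f(u))}
\]
\[
\max_{f} \sum_{u \in V} \Big[ -\log Z_u + \sum_{n_i \in N_S(u)} f(n_i) \cdot f(u) \Big],
\qquad
Z_u = \sum_{v \in V} \exp(f(u) \cdot f(v))
\]
```

Because $Z_u$ sums over all nodes, it is expensive to compute exactly on large networks; the paper approximates it with negative sampling.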
We can see how it works here: the intuition is that nodes that are close in the graph are also close in the vector space.
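To make the role of the neighborhood $N_S(u)$ more concrete, here is a minimal sketch of the biased second-order random walk that node2vec uses to sample neighborhoods. It is an illustration on a made-up toy graph, not the SNAP implementation: the function name `biased_walk`, the adjacency dictionary and the unweighted, undirected setting are all assumptions of this sketch. The return parameter p and the in-out parameter q follow the definitions in the node2vec paper.

```python
import random


def biased_walk(adj, start, length, p=1.0, q=0.5, rng=None):
    """Sample one node2vec-style walk on an unweighted, undirected graph.

    adj maps each node to the set of its neighbours (toy structure, assumed).
    """
    rng = rng or random.Random(0)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbours = sorted(adj[cur])
        if not neighbours:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            # First step has no previous node, so it is uniform.
            walk.append(rng.choice(neighbours))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbours:
            if nxt == prev:
                weights.append(1.0 / p)  # go back to the previous node
            elif nxt in adj[prev]:
                weights.append(1.0)      # stay at distance 1 from prev
            else:
                weights.append(1.0 / q)  # move further away from prev
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk


# Hypothetical toy graph with four user nodes.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}
walk = biased_walk(adj, "a", length=10, p=1.0, q=0.5)
print(walk)
```

In the full algorithm, many such walks are sampled from every node and passed to a skip-gram model, which learns the embeddings; that part is left to the library.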
The scenario that we take as a case study is an emerging Italian fashion brand, Emporio Le Sirenuse: the brand is located in Positano, near Naples, and it mainly produces luxury swimsuits and dresses for women.
The work focuses on the community of the brand, defined as the set of its Instagram followers: the idea is that a brand can get valuable insights about the specific interests of its followers, and in this way better target its products and marketing campaigns.
1st group: community detection on social networks is a quite well-known domain, but the network structure is not really exploited.
2nd group: analysis of user reactions to brand network marketing, as well as content sharing as an indicator of brand trust and community commitment.
3rd group: Instagram is a visual social network with high potential for fashion brands, whose main feature lies in their visual aspects.
We defined an analysis pipeline and now I will go into the details of each step.
As step 1, data is gathered using web scraping to collect posts of the brands and their followers from Instagram, in a time window that spans from January 2017 to November 2017. We collected around 400K users and 10M posts.
The second step is the definition of the graph model.
We consider as entities users, posts and hashtags, and then we define three sets of edges: one connects users to the posts they produced, while the other two connect posts to the entities they reference, hashtags and mentioned users.
The heterogeneous graph $G_M$ contains all the entities and relationships. From this graph, two subgraphs are extracted: the hashtags network and the mentions network, which map two important aspects of social media interactions.
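The construction can be sketched in a few lines. This is only an illustration under assumptions: the post records, the `build_edges` helper and the node-id prefixes (`u:`, `p:`, `#`) are invented for the example and do not reflect our actual data schema.

```python
# Hypothetical toy posts: (post id, owner, hashtags, mentioned users).
posts = [
    ("p1", "alice", ["positano", "summer"], ["brand"]),
    ("p2", "bob", ["summer"], []),
    ("p3", "alice", ["fashion"], ["bob"]),
]


def build_edges(posts):
    """Build the three edge sets: user-post, post-hashtag, post-mentioned user."""
    e_owner, e_tag, e_mention = set(), set(), set()
    for post_id, owner, hashtags, mentions in posts:
        # Prefixes keep node types apart (a user and a hashtag may share a name).
        post_node = "p:" + post_id
        e_owner.add(("u:" + owner, post_node))
        for tag in hashtags:
            e_tag.add((post_node, "#" + tag))
        for user in mentions:
            e_mention.add((post_node, "u:" + user))
    return e_owner, e_tag, e_mention


e_owner, e_tag, e_mention = build_edges(posts)

# Edge sets of the three graph models.
g_mixed = e_owner | e_tag | e_mention
g_hashtags = e_owner | e_tag
g_mentions = e_owner | e_mention

print(len(g_mixed), len(g_hashtags), len(g_mentions))
```

Representing each graph as a set of typed edges makes the subgraph relation explicit: the hashtags and mentions edge sets are both subsets of the mixed one.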
As an example, this is a hashtags graph built from the gathered data: green nodes are the users (the ones with more connections), blue nodes are posts and hashtags are in orange. The important fact is that users that have many hashtags in common are concentrated in the centre of the network, so they are «near».
In a further step, a graph reduction is applied to the previously presented graphs to obtain homogeneous networks, where only user nodes are present.
The reduced graphs are weighted as well: the weight is based on the number of common entities, either hashtags or mentioned users.
In this way, the reduced versions of $G_h$ and $G_m$ are generated.
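As a small worked example of the reduction, the sketch below computes the edge weight between two users as the number of entities they share. The helper `reduce_graph` and the toy hashtag sets are hypothetical; the same function would apply to sets of mentioned users.

```python
from itertools import combinations


def reduce_graph(entities_by_user):
    """Return {(user_i, user_j): number of shared entities}, skipping zero weights.

    entities_by_user maps each user to the set of hashtags (or mentioned
    users) found in that user's posts.
    """
    weights = {}
    for i, j in combinations(sorted(entities_by_user), 2):
        shared = entities_by_user[i] & entities_by_user[j]
        if shared:
            weights[(i, j)] = len(shared)
    return weights


# Toy data, invented for illustration.
tags = {
    "alice": {"positano", "summer", "fashion"},
    "bob": {"summer", "fashion"},
    "carol": {"food"},
}
reduced = reduce_graph(tags)
print(reduced)
```

Here alice and bob share two hashtags, so they are linked with weight 2, while carol shares nothing and stays isolated. Comparing every pair of users is quadratic in the number of users; on a large community an inverted index from entity to users would be a more practical choice.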
In this example, you can see a reduced mentions network: edges connect each user to the ones that he or she mentioned, and the number of mentions, the weight, is mapped to a color, from low (blue) to high (red).
Embeddings are extracted for both heterogeneous networks and reduced networks. The number of dimensions of the output vectors is fixed a priori: it is set to 8 for heterogeneous networks (which are bigger) and to 4 for reduced networks.
Instead, the main parameters of the algorithm, p and q, are selected via hyper-parameter tuning.
The classification step is defined to prove the effectiveness of the features in our domain: we want to discriminate between consumer and non-consumer users on Instagram. To do so, we manually labelled a set of users with the help of domain experts from the Politecnico di Milano fashion department.
Then, a classifier is implemented to test three sets of features: social media quantitative features are used as a baseline, compared against features extracted from complete and reduced networks.
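The comparison protocol can be sketched as follows: the same classifier and the same evaluation are run on each feature set, and only the features change. The classifier is not named here, so the sketch swaps in a nearest-centroid classifier with leave-one-out evaluation purely as a stand-in; the feature values and labels are invented toy data, not our results.

```python
def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def predict(train_x, train_y, x):
    """Assign x to the class whose centroid is closest (squared Euclidean)."""
    best = None
    for label in sorted(set(train_y)):
        c = centroid([r for r, y in zip(train_x, train_y) if y == label])
        d = sum((a - b) ** 2 for a, b in zip(x, c))
        if best is None or d < best[0]:
            best = (d, label)
    return best[1]


def loo_accuracy(xs, ys):
    """Leave-one-out accuracy: each user is predicted from all the others."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        hits += predict(train_x, train_y, xs[i]) == ys[i]
    return hits / len(xs)


# Invented 2-d feature vectors for six labelled users.
labels = ["consumer"] * 3 + ["non-consumer"] * 3
feature_sets = {
    "toy_embeddings": [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
                       (1.0, 1.1), (1.2, 0.9), (0.9, 1.0)],
}
scores = {name: loo_accuracy(xs, labels) for name, xs in feature_sets.items()}
print(scores)
```

Adding a second entry to `feature_sets` (for example the account-level baseline) gives a like-for-like comparison, since the labels and the evaluation loop stay fixed.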
The results of the first experiment are shown in this table: reduced network features perform better than both complete network features and the social media baseline.
This is because, given a fixed computational power, reduced networks are smaller and so the neighborhood is easier to exploit.
At the same time, they are able to encode the main dynamics useful for our purpose.
Given the results of the first experiment, we performed a second experiment on reduced networks only.
In this experiment, the ground truth network is enlarged using a set of additional non-labelled users taken from the followers of different brands: results show that the more users are included, the richer the neighborhood of the labeled users is, and so the performance increases.
As the second task, we want to exploit the features to extract new subgroups of users from the community of the brand, defined as the set of its followers. So, the focus is on the community of Emporio Le Sirenuse, using the reduced hashtags graph as a proxy for content description.
This is the real reduced network over which we run our analysis: each node is a user and the edges connect pairs of users that shared the same hashtags.
We extract the embeddings of this graph: a 2-d visualization, using the first two components, is presented.
We use a standard parametrization of the algorithm (p=1, q=0.5), which allows us to exploit the local neighborhood.
We run K-means over this set of features: K is selected using inertia as a structural validation metric.
These are the 7 clusters obtained, as well as the plot of inertia with respect to K.
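The selection of K through inertia can be illustrated with a small, self-contained K-means. This is a plain sketch with a single random initialisation on invented 2-d points, not the implementation used for the plot; `kmeans`, `assign` and `sq_dist` are names chosen for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(pt, centroids[c]))
            for pt in points]


def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm; returns (labels, centroids, inertia)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids picked from the data
    for _ in range(iters):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            # An empty cluster keeps its previous centroid.
            updated.append(tuple(sum(col) / len(members) for col in zip(*members))
                           if members else centroids[c])
        if updated == centroids:
            break  # converged
        centroids = updated
    labels = assign(points, centroids)
    # Inertia: sum of squared distances of points to their assigned centroid.
    inertia = sum(sq_dist(pt, centroids[lab]) for pt, lab in zip(points, labels))
    return labels, centroids, inertia


# Toy points forming three visible groups (invented for illustration).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 5.2), (5.2, 5.1),
          (10.0, 0.0), (10.1, 0.2), (10.2, 0.1)]
inertias = {k: kmeans(points, k)[2] for k in range(1, 6)}
for k, value in inertias.items():
    print(k, round(value, 3))
```

Plotting these values against K and looking for the point where the decrease flattens is the usual "elbow" reading of the inertia curve. With a single random start the result can depend on the seed, so in practice several restarts are advisable.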
The output network is presented, with colors associated with clusters.
Clustering needs external measures to validate the results: for this reason, we provided domain experts with a subset of users for each cluster.
They manually inspected the social media profile of each user, providing feedback about the presence of patterns inside clusters.
The lists are ordered by distance from the centroid, which is used as a similarity quality measure.
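A minimal sketch of this ordering, assuming each user has an embedding vector and a cluster label (the helper name and the toy values are hypothetical, and the cluster is assumed to be non-empty):

```python
import math


def rank_by_centroid_distance(embeddings, labels, cluster):
    """Users of one cluster, sorted from closest to farthest from its centroid."""
    members = {u: v for u, v in embeddings.items() if labels[u] == cluster}
    dim = len(next(iter(members.values())))
    centroid = [sum(v[d] for v in members.values()) / len(members)
                for d in range(dim)]
    return sorted(members, key=lambda u: math.dist(members[u], centroid))


# Toy embeddings and cluster assignments, invented for illustration.
embeddings = {"u1": (0.0, 0.0), "u2": (1.0, 0.0), "u3": (4.0, 0.0), "u4": (9.0, 9.0)}
labels = {"u1": 0, "u2": 0, "u3": 0, "u4": 1}
ranking = rank_by_centroid_distance(embeddings, labels, cluster=0)
print(ranking)
```

Users at the top of the list are the most typical of their cluster, which is why they are the natural ones to show to the experts first.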
The insights are simple but quite interesting:
Clusters 0, 1 and 2 are users that share very specific contents, such as interior design or food: they are mainly professional profiles, such as showrooms or brands.
Cluster 3 is very well defined, too, but it contains regular users: they share contents about holidays in Italy, which matches the brand identity.
Clusters 4, 5 and 6 contain regular users with broader contents.
As additional validation, we provided a way to label each cluster: we compute the list of cluster hashtags T(c) as the set of hashtags shared by at least two users inside the cluster (i.e., hashtags that increase the weights and/or connections inside the cluster).
Then, the label is defined as the top 10 hashtags by frequency belonging to this list: these lists are presented in the table, showing a labeling consistent with the previous validation (e.g., cluster 3 uses hashtags related to Italian vacations, cluster 0 luxury accessories, cluster 1 food, …)
$$T(c) = \bigcup_{u, v \in c,\ u \neq v} \left(t_u \cap t_v\right)$$
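The labeling step can be sketched as below. The function `cluster_label` and the toy cluster are hypothetical, and "frequency" is assumed here to mean the total number of times a hashtag is used inside the cluster, which is one possible reading of the definition.

```python
from collections import Counter
from itertools import combinations


def cluster_label(tags_by_user, top_n=10):
    """Top hashtags of one cluster among those shared by at least two users.

    tags_by_user maps each user of the cluster to the list of hashtags used
    in that user's posts (repetitions allowed).
    """
    tag_sets = {u: set(tags) for u, tags in tags_by_user.items()}
    shared = set()  # T(c): union of pairwise intersections
    for u, v in combinations(tag_sets, 2):
        shared |= tag_sets[u] & tag_sets[v]
    freq = Counter(tag for tags in tags_by_user.values()
                   for tag in tags if tag in shared)
    return [tag for tag, _ in freq.most_common(top_n)]


# Toy cluster, invented for illustration.
cluster = {
    "u1": ["italy", "italy", "sea", "dog"],
    "u2": ["italy", "sea"],
    "u3": ["sea", "cat", "italy"],
}
label = cluster_label(cluster)
print(label)
```

In the toy cluster, "dog" and "cat" are each used by a single user, so they are excluded from T(c) and cannot appear in the label.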
What we obtain as the final result of clustering is a segmentation of users that can be used by the brand to better target its marketing campaigns [or to set up other collaborations (e.g., the luxury (0), food (1) and interior design (2) clusters are professionals)].
As a final conclusion, in this work we defined an effective method to characterize users inside online communities: users are described using features extracted from their network representation, and we are able to use these features to solve domain-specific classification tasks, as well as to define subgroups of users based on shared interests.
In this analysis, the time variable is missing and graphs are built using a single snapshot of all the data: time-varying graphs could potentially capture more fine-grained patterns.