2. Abdessamad Benlahbib et al. / Procedia Computer Science 148 (2019) 80–86 81
order to figure out user likes or dislikes [5]. In this paper, we propose to apply the LSA (Latent Semantic Analysis)
model and then the K-means algorithm to cluster opinions based on their semantic relations; by aggregating the ratings
attached to the fused opinions, we normalize the reputation of an entity. The paper is organized as follows. Section
2 gives a brief review of related work. Section 3 presents the details of our approach. Section 4 reports experimental
results, followed by additional analysis and discussion. Finally, Section 5 presents our conclusions.
2. Related work
Reputation is a measure that is derived from direct or indirect knowledge on earlier interactions of entities and is
used to assess the level of trust an entity puts into another entity [1].
Reputation systems are typically based on public information in order to reflect the community’s opinion in general
[2]. The simplest way to compute a reputation score is to sum the positive and the negative ratings separately
and to keep a total score equal to the positive score minus the negative score. This is the principle used in
eBay’s reputation forum, which is described in [3]. In [4], a more advanced scheme is proposed that computes the
reputation score as the average of all ratings; this principle is used in the reputation systems of numerous commercial
web sites, such as Epinions and Amazon. More advanced models in this category compute a weighted average of all
ratings, where the weight of a rating can be determined by factors such as the rater's trustworthiness or reputation,
the age of the rating, and the distance between the rating and the current score.
Recently, Yan et al. [5] proposed a novel reputation generation approach based on opinion fusion and mining. In their
approach, opinions are filtered to eliminate unrelated ones and then grouped into a number of fused principal opinion
sets, each containing opinions with a similar or the same attitude or preference. By aggregating the ratings attached
to the fused opinions, they normalize the reputation of an entity. They claimed that “No work has explored the opinions
expressed in natural languages, opinion voting, opinion citation and user feedback ratings in a comprehensive way
for reputation generation” [5].
3. Proposed method
In this section, we recall the LSA technique and the K-means algorithm, and then describe our proposed method
for reputation generation in depth.
3.1. Latent Semantic Analysis
Latent Semantic Analysis (LSA) is a natural language processing technique for analyzing the relationships between a
set of documents and the terms they contain by producing a set of concepts related to the documents and terms. In [6],
T.K. Landauer, P.W. Foltz and D. Laham describe LSA as follows: “Latent Semantic Analysis (LSA) is a theory and
method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a
large corpus of text (Landauer and Dumais, 1997). The underlying idea is that the aggregate of all the word contexts
in which a given word does and does not appear provides a set of mutual constraints that largely determines the
similarity of meaning of words and sets of words to each other. The adequacy of LSA's reflection of human knowledge
has been established in a variety of ways. For example, its scores overlap those of humans on standard vocabulary and
subject matter tests; it mimics human word sorting and category judgments; it simulates word-word and passage-word
lexical priming data, and, it accurately estimates passage coherence, learnability of passages by individual students,
and the quality of knowledge contained in an essay”.
3.2. K-means Algorithm
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster
analysis in data mining. The K-means algorithm aims to partition M points in N dimensions into K clusters so that the
within-cluster sum of squares is minimized [7][8].
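For concreteness, the within-cluster sum of squares can be written as below. The notation is introduced here for illustration only: x_i denotes a point, C_k the set of points assigned to cluster k, and \mu_k the centroid (mean) of C_k.

```latex
\[
\min_{C_1,\dots,C_K} \; \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
\]
```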
3.3. System overview
We propose the following procedure to cluster and mine opinions for reputation generation.
1. Opinion data collection and preprocessing. During this step, we collect the opinion data about an entity
(product, movie, etc.) from websites. Because the raw opinion data come in many forms and contain many words
and symbols, the collected data must be preprocessed, for example by word segmentation, stop-word
filtering, and the elimination of useless expressions and pictures.
2. Opinion clustering. After applying the LSA model, we group the opinions into clusters by using the K-means
algorithm. In this step, statistics needed for reputation generation are collected for each cluster, such as the
number of opinions, the sum of similarities and the sum of ratings.
3. Reputation generation. This step further aggregates the clustered opinions to generate a reputation value by
considering the popularity and other statistics of the principal opinions.
3.4. Opinion clustering
The opinion clustering algorithm is shown below (Algorithm 1).
Algorithm 1 Opinion clustering
Begin
Step 1: Apply the LSA model (we used TruncatedSVD from the scikit-learn library in Python).
Step 2: Set the number of clusters and apply the K-means clustering algorithm.
Step 3: Acquire the statistics of each cluster: the sum of the similarities in the cluster (using the cosine
similarity metric), the sum of the ratings in the cluster, and the number of similar reviews in the cluster.
End
Applying Algorithm 1 groups the opinions, represented in the LSA space, into several principal opinion sets
(clusters). The opinions in each cluster hold a similar or the same perspective. Once the processing is
complete, we also obtain the statistics of each cluster, i.e., the number of similar opinions it contains, the
sum of their ratings, and the sum of their similarities under the cosine similarity metric.
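The three steps of Algorithm 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in our experiments: the TF-IDF weighting, the value of n_components, the function name cluster_opinions, the sample reviews and ratings, and the choice to measure similarity between each opinion and its cluster centroid are all assumptions made for this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def cluster_opinions(reviews, ratings, n_clusters, n_components=2, seed=0):
    """Return one (sim_sum, rat_sum, num) tuple per non-empty cluster."""
    # Step 1: LSA = term-document matrix followed by truncated SVD.
    # TF-IDF weighting is an assumption of this sketch.
    tfidf = TfidfVectorizer().fit_transform(reviews)
    lsa = TruncatedSVD(n_components=n_components, random_state=seed)
    vectors = lsa.fit_transform(tfidf)

    # Step 2: K-means on the LSA vectors.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(vectors)

    # Step 3: per-cluster statistics. Similarity is taken between each
    # opinion and its cluster centroid (an assumption of this sketch).
    ratings = np.asarray(ratings, dtype=float)
    stats = []
    for k in range(n_clusters):
        members = km.labels_ == k
        if not members.any():
            continue
        centroid = km.cluster_centers_[k].reshape(1, -1)
        sims = cosine_similarity(vectors[members], centroid)
        stats.append((float(sims.sum()),
                      float(ratings[members].sum()),
                      int(members.sum())))
    return stats


# Hypothetical reviews and ratings, for illustration only.
sample_reviews = [
    "a moving story with great acting",
    "great acting and a moving ending",
    "the story is slow and boring",
    "boring plot and slow pacing",
    "wonderful acting, moving story",
    "slow, boring and far too long",
]
sample_ratings = [9, 10, 3, 4, 9, 2]
sample_stats = cluster_opinions(sample_reviews, sample_ratings, n_clusters=2)
```

Each returned tuple corresponds to one row of statistics (sum of similarities, sum of ratings, number of opinions) for a cluster.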
3.5. Reputation generation
Based on the result of opinion clustering, we propose a method for generating a single reputation value for an entity.
Overall, it is important to show users a concrete scale of reputation expressed by a single value. Presenting
reputation this way provides a good user experience, especially for mobile Internet users whose devices have
small screens.
We propose formula (1) to generate the reputation of an entity "A" based on the clustering of the opinions on "A".
\[
\mathrm{Rep}(A) = \frac{1}{n_{\mathrm{clusters}}} \sum_{k=1}^{n_{\mathrm{clusters}}} \frac{V_k \cdot S_k}{N_k \cdot N_k} \qquad (1)
\]
We denote:
n_clusters : The number of clusters.
Nk : The number of opinions in cluster k.
Sk : The sum of the similarity in cluster k.
Vk : The sum of ratings in cluster k.
In (1), we assume that each opinion has a rating on the entity attached to it. In our case, the rating is a number
ranging from 1 to 10 to represent a level of satisfaction.
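As a worked example, formula (1) can be evaluated on the per-cluster statistics reported in Table 2. The sketch below is illustrative; the function name reputation and the tuple layout are assumptions made here. Note that each term V_k S_k / (N_k N_k) is the mean rating of cluster k multiplied by its mean similarity.

```python
def reputation(clusters):
    """Formula (1): mean over clusters of V_k * S_k / (N_k * N_k).

    Each cluster is a tuple (S_k, V_k, N_k): sum of similarities,
    sum of ratings, number of opinions.
    """
    terms = [v * s / (n * n) for s, v, n in clusters]
    return sum(terms) / len(clusters)


# (SimSum, RatSum, Num) rows copied from Table 2.
TABLE_2 = [
    (12.99, 94, 13),
    (14.97, 120, 15),
    (14.94, 114, 15),
    (9.98, 61, 10),
    (10.95, 91, 11),
    (14.96, 128, 15),
    (9.98, 82, 10),
    (10.99, 90, 11),
]

rep = reputation(TABLE_2)
print(round(rep, 2))  # about 7.75 for these eight clusters
```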
4. Results and discussion
4.1. Dataset
We manually created a dataset of 600 reviews of six different movies from the IMDB website, which
hosts user reviews and ratings of movies.
The statistical information of the dataset is shown in Table 1.
Table 1. Statistical information of Datasets
The total number of reviews and ratings 600
The number of reviews per movie 100
4.2. Preprocessing reviews
After collecting all reviews, we applied tokenization, stemming and stop-word removal to them so that
they could be used for opinion clustering.
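The preprocessing pipeline can be sketched with the standard library alone. This is only an illustration: the stop-word list is a tiny hypothetical sample, and the suffix stripper is a crude stand-in for a real stemmer (such as a Porter stemmer), not the stemmer used in our experiments.

```python
import re

# Tiny illustrative stop-word list (an assumption of this sketch).
STOP_WORDS = {"a", "an", "and", "the", "is", "was", "of", "to", "in", "it", "this"}
# Crude stand-in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "ly", "s")


def strip_suffix(token):
    """Remove the first matching suffix if at least 3 letters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(review):
    """Tokenize, drop stop words, then stem each remaining token."""
    tokens = re.findall(r"[a-z]+", review.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [strip_suffix(t) for t in kept]


print(preprocess("The acting was amazing and the ending surprised me"))
# ['act', 'amaz', 'end', 'surpris', 'me']
```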
4.3. Evaluation measures
To measure the effectiveness of our system, we use the Absolute Error (AE) and the Mean Absolute Error (MAE),
which are defined as follows:
Absolute Error: The absolute difference between the measured or inferred value of a quantity and its actual value.
Mean Absolute Error: The average of the absolute differences between the predictions and the actual observations.
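The two measures can be sketched as follows. The function names and the sample values are hypothetical and serve only to illustrate the definitions.

```python
def absolute_error(predicted, actual):
    """AE: absolute difference between a computed value and the actual one."""
    return abs(predicted - actual)


def mean_absolute_error(predicted, actual):
    """MAE: average of the absolute errors over paired values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [absolute_error(p, a) for p, a in zip(predicted, actual)]
    return sum(errors) / len(errors)


# Hypothetical computed reputations and reference scores for two movies.
print(mean_absolute_error([8.0, 7.0], [8.5, 6.0]))  # 0.75
```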
4.4. Opinion clustering
The reviews can be grouped into a number of clusters. We can also acquire their statistics during clustering, such
as the number of similar opinions, the sum of similarities, and the sum of ratings in each cluster. To illustrate this
process, Table 2 shows example results of opinion clustering for the 100 reviews of one movie in the dataset.
We provide a Python implementation of the clustering step on GitHub (see footnote 1).
To determine the best value of n_clusters, we ran the clustering several times with different numbers of clusters.
4.5. Reputation generation
In order to evaluate our approach, we compared the final reputation computed by formula (1) with the users' weighted
average vote computed by the IMDB website (IMDBWAV), which represents the rating of a target movie as a number
ranging from 1 to 10, as shown in Fig. 1. We varied the number of clusters from 2 to 19. Figs. 2 and 3 show the Absolute
Error between IMDBWAV (IMDB users' weighted average vote) and the reputation value computed by our approach for
all movies.
1 https://github.com/abdessamadbenlahbib/Reputation-generation-K-mean/blob/master/Python_Code.txt
Table 2. Example results of opinion clustering (Algorithm 1).
Cluster SimSum RatSum Num
C1 12.99 94 13
C2 14.97 120 15
C3 14.94 114 15
C4 9.98 61 10
C5 10.95 91 11
C6 14.96 128 15
C7 9.98 82 10
C8 10.99 90 11
Legend: SimSum: the sum of the similarity in a cluster.
RatSum: the sum of ratings in a cluster.
Num: the number of similar reviews in a cluster.
Fig. 1. IMDBWAV (IMDB users weighted average vote) for The Shawshank Redemption movie
Fig. 2. Absolute Error between IMDBWAV and the reputation value computed by our approach for movies 1, 2 and 3
Fig. 3. Absolute Error between IMDBWAV and the reputation value computed by our approach for movies 4, 5 and 6
As Figs. 2 and 3 show, the Absolute Error between IMDBWAV and the reputation value computed by our
approach is high when n_clusters = 2 and n_clusters = 3, and then begins to decrease. As described in
Algorithm 1, different values of n_clusters can lead to different clusterings of the reviews, which in turn
yield different final reputation values. Therefore, choosing a suitable value of n_clusters is particularly
important. We conducted experiments to study the influence of the number of clusters n_clusters on reputation
generation: for each n_clusters in {2, 3, ..., 19}, we computed the MAE (Mean Absolute Error) between IMDBWAV
and the reputation values computed by our approach over all the reviews of the dataset.
Fig 4 shows the result of our experiments.
Fig. 4. MAE for different values of n_clusters
Both Fig. 4 and Table 3 show that our approach performs best when n_clusters = 9, since the MAE between
IMDBWAV and the values computed by (1) using all the reviews of the dataset reaches its minimum there.
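The selection of the best number of clusters amounts to taking the entry with the smallest MAE. The sketch below does this on the values reported in Table 3; the variable names are chosen here for illustration.

```python
# MAE per number of clusters, copied from Table 3.
MAE_BY_N_CLUSTERS = {
    2: 0.43478668, 3: 0.28583178, 4: 0.18399365, 5: 0.1158738,
    6: 0.08081721, 7: 0.06048751, 8: 0.04663486, 9: 0.0419782,
    10: 0.0535229, 11: 0.05158553, 12: 0.05617364, 13: 0.07550483,
    14: 0.05620427, 15: 0.05560771, 16: 0.05956126, 17: 0.08015718,
    18: 0.07338574, 19: 0.04463707,
}

# Pick the number of clusters whose MAE is smallest.
best_n_clusters = min(MAE_BY_N_CLUSTERS, key=MAE_BY_N_CLUSTERS.get)
print(best_n_clusters, MAE_BY_N_CLUSTERS[best_n_clusters])  # 9 0.0419782
```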
5. Conclusions
In this paper, we have proposed an approach to generate reputation based on opinion clustering. By performing
opinion clustering, we group the opinions into a number of clusters and obtain, for each cluster, its popularity
(number of opinions), its sum of similarities and its sum of ratings. It then becomes easy to aggregate all
clusters into a single reputation value.
The experimental results show that, with a suitable number of clusters, our approach produces reputation values
close to the IMDB weighted average vote for the target movies.
Table 3. MAE between IMDBWAV and the reputation values computed by our method using all the reviews of the 6 movies
n_clusters Mean Absolute Error
2 0.43478668
3 0.28583178
4 0.18399365
5 0.1158738
6 0.08081721
7 0.06048751
8 0.04663486
9 0.0419782
10 0.0535229
11 0.05158553
12 0.05617364
13 0.07550483
14 0.05620427
15 0.05560771
16 0.05956126
17 0.08015718
18 0.07338574
19 0.04463707
6. References
[1] Z. Yan. Trust Management in Mobile Environments - Usable and Autonomic Models. IGI Global, Hershey,
Pennsylvania, USA, 2013.
[2] A. Jøsang, R. Ismail, C. Boyd. A survey of trust and reputation systems for online service provision.
Decision Support Systems, Vol. 43, No. 2 (March 2007), pp. 618-644. DOI: 10.1016/j.dss.2005.05.019.
[3] P. Resnick, R. Zeckhauser. Trust Among Strangers in Internet Transactions: Empirical Analysis of
eBay’s Reputation System. In M.R. Baye, editor, The Economics of the Internet and E-Commerce, volume 11 of
Advances in Applied Microeconomics. Elsevier Science, 2002.
[4] J. Schneider et al. Disseminating Trust Information in Wearable Communities. In Proceedings of the 2nd
International Symposium on Handheld and Ubiquitous Computing (HUC2K), September 2000.
[5] Z. Yan, X. Jing, W. Pedrycz. Fusing and Mining Opinions for Reputation Generation. Information Fusion
(2016). DOI: 10.1016/j.inffus.2016.11.011.
[6] T. K. Landauer, P. W. Foltz, D. Laham. Introduction to Latent Semantic Analysis. Discourse Processes,
25 (1998), pp. 259-284.
[7] J. A. Hartigan, M. A. Wong. Algorithm AS 136: A K-Means Clustering Algorithm. Journal of
the Royal Statistical Society. Series C (Applied Statistics), Vol. 28, No. 1 (1979), pp. 100-108.
[8] J. A. Hartigan. Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA, 1975.
ISBN: 047135645X.