Applying Machine Learning to Product Categorization
Sushant Shankar and Irving Lin
Department of Computer Science, Stanford University

ABSTRACT

We present a method for classifying products into a set of known categories using supervised learning. That is, given a product with accompanying informational details such as a name and description, we assign the product to a category of similar products, e.g., 'Electronics' or 'Automotive'. To do this, we analyze product catalog information from different distributors on Amazon.com to build features for a classifier. Our implementation shows significant improvement over baseline results. Under suitable acceptance criteria, our implementation can substantially increase the automation of product categorization.

General Terms

Supervised and Unsupervised Learning, E-Commerce

Keywords

Amazon, Product Categorization, Classifier
1. INTRODUCTION

Small to medium sized businesses that sell products online spend a significant part of their time, money, and effort organizing the products they sell, understanding consumer behavior to better market their products, and determining which products to sell. We would like to use machine learning techniques to define product categories (e.g., 'Electronics') and potential subcategories (e.g., 'Printers'). This is useful when a business has a list of new products it wants to sell and wants to classify them automatically based on training data from the business's other products and their classifications. It will also be useful when a new product line has not previously been introduced in the market, or when the products are more densely populated than the training data (for example, if a business sells only electronic equipment, we would want to derive a more granular category structure). For this algorithm to be used in industry, we have consulted with a few small-to-medium sized companies and find that we will need accuracy in the range of 95% when we have enough prior training data and a dense set of categorizations.
2. PRIOR WORK

Millions of dollars have been spent developing software (e.g., IBM eCommerce) that maintains information about products, users' buying histories for particular products, and so on. As such, there is much work being done on how to create good product categories. However, while large companies (like Amazon) can afford such expensive software and sophisticated methods, most companies that sell goods online do not have such resources. We propose machine learning as a way to classify products automatically.

3. DATA

3.1 Initial Dataset

For our project, we acquired multiple datasets, each with thousands of products and their respective data, from Ingram Micro, the leading information technology distributor. Each dataset lists product descriptors and the resulting categorizations as a detailed CSV document whose fields are shown in Table 1. The datasets we chose for the project comprise the categories shown in Table 2.

Table 1: Product Descriptors

    Product Descriptor
    SKU
    Asin
    Model Number
    Title with Categorization
    Description
    1-10 Tech Details
    Seller
    Unspecified Number of Images
    URL

Table 2: Company A (left) and Company B (right) Catalogs

    Company A Category            #        Company B Category          #
    Electronics                   3514     Health                      768
    Camera & Photo                653      Beauty                      645
    Video Games                   254      Electronics                 585
    Musical Instruments           212      Home                        542
    Kitchen & Dining              173      Toys                        271
    Computer & Accessories        61       Office                      255
    Automotive                    48       Sports                      238
    Health & Personal Care        32       PC                          232
    Office Products               30       Personal Care Appliances    204
    Cell Phones & Accessories     29       Books                       203
    Home & Garden                 26       Kitchen                     160
    Home Improvement              24       Wireless                    154
    Sports & Outdoors             17       Grocery                     113
    Patio, Lawn & Garden          11       Pet                         112
    Software                      6        Video                       106
    Movies & TV                   5        Apparel                     91
    Toys & Games                  2        Lawn                        81
    Beauty                        2        Baby                        68
    Everything Else               1        Automotive                  52
    Clothing                      1        Shoes                       32
    Baby                          1        Jewelry                     22
                                           Watches                     21
                                           Software                    8
                                           Music                       6
3.2 Parsing the Data Set

Each product comes with a large amount of accompanying data that may not be relevant or may add unnecessary complexity that we felt would be detrimental to our classification purposes. Of the numerous product descriptors, we decided to keep the SKU, as a way to uniquely identify products, along with the title with categorization, the description, and the tech details. The title in particular contained a high concentration of important product information, such as brand and categorization. As a result, we were able to parse out the categorization, but decided to temporarily leave the brand as part of the title and treat it as another word. As a side note, we decided not to incorporate the images, as we felt that task would be better suited to computer vision and image processing.

3.3 Splitting Training and Test Sets

Before creating our training and test sets, we merged all categories containing less than 0.5% of the overall products into one "Other" category. We then trained on a 25% random subset of the data and measured our error rate on the rest.
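As a concrete illustration, here is a minimal sketch of this preprocessing step, assuming the catalog has been loaded into a pandas DataFrame with a 'category' column; the paper does not specify its data structures, so all names here are ours.

    import pandas as pd

    def merge_rare_and_split(catalog: pd.DataFrame, min_frac=0.005,
                             train_frac=0.25, seed=0):
        # Merge every category holding < 0.5% of all products into "Other".
        counts = catalog["category"].value_counts(normalize=True)
        rare = counts[counts < min_frac].index
        catalog = catalog.copy()
        catalog.loc[catalog["category"].isin(rare), "category"] = "Other"

        # Train on a 25% random subset; test on the remaining 75%.
        train = catalog.sample(frac=train_frac, random_state=seed)
        test = catalog.drop(train.index)
        return train, test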
                                                                 situation leads up to the next subsection.
4. NAÏVE IMPLEMENTATION

Our predominant source of data is text descriptors of each product; thus, we decided to implement a simple NaĂŻve Bayes classifier and focus on the features described in Section 5. The simplicity of NaĂŻve Bayes makes it a good environment for testing and feature building, which helped define the scope of the overall project.

4.1 First Implementation – Naïve Bayes

To begin, we implemented a simple multinomial event model Naive Bayes classifier with Laplace smoothing on just the 'title' field. Our accuracy with this implementation was around 70%.
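To make the classifier concrete, the following is a minimal sketch of a multinomial Naive Bayes with Laplace smoothing over title tokens. All names are ours (hypothetical), not the authors'; the use_prior flag anticipates the modification described in Section 4.2.

    import math
    from collections import Counter, defaultdict

    class TitleNaiveBayes:
        def fit(self, titles, labels):
            self.word_counts = defaultdict(Counter)  # token counts per class
            self.class_totals = Counter()            # total tokens per class
            self.class_docs = Counter()              # documents per class
            self.vocab = set()
            for title, y in zip(titles, labels):
                tokens = title.lower().split()
                self.word_counts[y].update(tokens)
                self.class_totals[y] += len(tokens)
                self.class_docs[y] += 1
                self.vocab.update(tokens)
            self.n_docs = len(titles)

        def predict(self, title, use_prior=True):
            tokens = title.lower().split()
            V = len(self.vocab)
            scores = {}
            for y in self.class_docs:
                # log p(y); Section 4.2 drops this term (use_prior=False)
                score = (math.log(self.class_docs[y] / self.n_docs)
                         if use_prior else 0.0)
                for t in tokens:
                    # Laplace-smoothed log p(token | class)
                    score += math.log((self.word_counts[y][t] + 1)
                                      / (self.class_totals[y] + V))
                scores[y] = score
            return max(scores, key=scores.get)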

4.2 Modified Implementation – Naïve Bayes

We realized that our classifier was classifying everything as 'Electronics' because the prior probability of the 'Electronics' category was enormous compared to every other category. Thus, we decided to remove the prior class probability from the Naive Bayes decision rule.

However, when we broke the errors down by category, we found that the less training data we had for a category, the worse its accuracy. Following this observation, we decided to use only categories containing at least 0.5% of the total number of products (i.e., more than roughly 25 products).

With these changes, our results immediately improved to an accuracy of about 75%.
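Concretely, under the multinomial event model this changes the decision rule from maximizing the posterior to maximizing the likelihood alone; as we read it,

    \hat{y} = \arg\max_y \sum_j x_j \log p(w_j \mid y)

where x_j is the count of word w_j in the title, in place of the standard rule \arg\max_y [\log p(y) + \sum_j x_j \log p(w_j \mid y)].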

4.3 Precision / Recall

We found that for Product Catalog A there was a large bias towards classifying into the large categories, e.g., 'Electronics'. In other words, the large categories had high recall but low precision, while the smaller categories had low recall but high precision. This effect was also seen in Product Catalog B, but to a much lesser degree, as that catalog is far less skewed than Product Catalog A.

4.4 Filtering Items

One issue we noted with this classification scheme was that often an item simply did not have enough information to classify it properly, or the words it did have would not allow even a human to classify it confidently one way or the other. In the former case, if there were not enough words, we removed the item from consideration and did not count it towards the classification accuracy. This was especially true for Company B's catalog, which included books whose titles were often 3-4 words or fewer in length and were consistently miscategorized. The latter situation leads to the next subsection.

4.5 Multiple Classifications / 'Strong Accuracy'

Upon manual inspection of misclassifications, we found that the classifier often produced arguably equally good categories. For the example below, the labeled category was 'Electronics' but we classified it as 'Office Products', which also makes sense:

    Ex: Fellowes Powershred MS-450C Safe Sense 7-sheet Confetti Cut Shredder

We therefore decided to count a prediction as a success if any of the top three categories returned by the classifier matched the 'correct' categorization. For each SKU, we took up to three categories whose probabilities were within the range of the category probabilities (max - min).

Accuracies were low for Product Catalog B (around 50%); including multiple classifications increased the accuracy to around 91%. This is not just a fudged statistic, but actually makes sense in the context of these catalogs: when a company has a user interested in an analogous category, we want to be able to show them a product to buy if the analogous category also makes sense.

Using multiple classifications, if only one category is returned, the classifier is reasonably certain that it is the correct categorization, since the probability that the item belongs to any other category is much lower (in fact, less than 50% of the top categorization's probability under our heuristic). We term this the strong accuracy. Indeed, for Product Catalog B with multiple categories, our accuracy is 72.7% while our strong accuracy is 84.5%.
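The paper's wording for the selection threshold is ambiguous, so the following sketch adopts one reading: keep up to three categories scoring at least 50% of the top score, matching the 50%-of-top remark in the section above. The function names and the exact cutoff are our assumptions.

    def top_categories(scores, max_k=3, frac=0.5):
        # scores: dict mapping category -> estimated probability.
        # frac=0.5 is an assumption based on the 50%-of-top remark.
        ranked = sorted(scores, key=scores.get, reverse=True)
        cutoff = frac * scores[ranked[0]]
        return [c for c in ranked[:max_k] if scores[c] >= cutoff]

    def multi_accuracy(predicted_lists, truths):
        # Credit a prediction when the true label is among the returned set.
        hits = sum(t in p for p, t in zip(predicted_lists, truths))
        return hits / len(truths)

Under this reading, the strong accuracy is the same statistic restricted to items for which top_categories returns a single category.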
5. FEATURES

As we use human-generated text content to define a product, we decided to use the bag-of-words model as the features for each item. However, because the text descriptors are human generated, we encounter many human eccentricities and differences in word choice for similar words. Thus, we incorporate a couple of domain-specific ideas to help with classification: the first is per-feature processing, and the second is determining which features to use.

5.1 Feature Processing

We apply some standard techniques in text processing to clean up our feature set: stemming/lemmatization, lowercasing, and stop word, punctuation, and number removal. A combined sketch of these steps follows; the individual techniques are discussed in the subsections below.
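This sketch wires the five normalization steps into one pipeline. The stop word list and stemmer are stand-ins: we assume NLTK's Porter stemmer and English stop word list, which may differ from the exact lists the authors used.

    import re
    import string
    from nltk.corpus import stopwords      # requires nltk.download('stopwords')
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english"))
    punct_table = str.maketrans("", "", string.punctuation)

    def normalize(text, lowercase=True, drop_stop=True, drop_punct=True,
                  drop_numbers=True, stem=True):
        tokens = text.split()              # the paper tokenizes on spaces
        if lowercase:
            tokens = [t.lower() for t in tokens]
        if drop_punct:                     # strip embedded punctuation
            tokens = [t.translate(punct_table) for t in tokens]
        if drop_numbers:                   # drop purely numeric tokens
            tokens = [t for t in tokens if not re.fullmatch(r"[\d.\-]+", t)]
        if drop_stop:
            tokens = [t for t in tokens if t.lower() not in stop_words]
        if stem:
            tokens = [stemmer.stem(t) for t in tokens]
        return [t for t in tokens if t]    # remove tokens emptied above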
                                                                  are often used differently.
5.1.1 Stemming/Lemmatization

We use stemming to narrow our overall feature space, which helps compensate for the small number of features per item.

    Ex: case, casing, cases -> cas

Overall, we see mixed results for stemming. As seen in Table 3, while stemming improves on the baseline without any feature processing for both companies, it clearly hurts overall accuracy when combined with the other feature processing for Company A, while it helps overall accuracy for Company B.
                                                                  capitalizing acronyms. Clearly, this is not always a useful
5.1.2 Stop Word Removal

As in most text, we found many common, short function words, often prepositions and other grammatical fillers, throughout the product descriptions. While we have not yet attempted to build a domain-specific stop word list (i.e., one for product descriptions), we used a standard English stop word list alongside the Porter stemmer.

    Ex: the, is, at, who, which, on

Stop word removal turned out to be by far the most effective feature processing technique we used, dominating the accuracy boosts in both catalogs (Table 3). In fact, for Company A it resulted in nearly a 10% increase in overall accuracy.
5.1.3 Number Removal

The decision to remove numbers was a result of looking at our data and realizing that many numbers had little to no relevance to a given category, and that the scatter (and range) of numbers would only adversely impact a given categorization. Because there is a limited, finite number of training examples but an infinite number of possible numbers, we believed we could not fully capture the information numbers carry for a given category.

    Ex: 5-10, 2, 6, 15.3

The slight performance increase for Company A is offset by a slight performance decrease for Company B (Table 3). We attribute this to the fact that for certain categories, specific numbers may show up more often and thus affect the resulting categorization.

5.1.4 Punctuation Removal

Removing punctuation was another result of looking at our data and noticing that humans vary greatly in phrasing and word choice. This is especially true for words that have multiple accepted forms, or words that contain punctuation. Also, because we tokenize the item descriptors on spaces, any punctuation inside a phrase is left in, including ellipses, periods, exclamation points, and others. In addition, concatenated words are often used differently.

    Ex: don't, dont, 3D, 3-D, period., period?

Punctuation removal was the second most effective feature normalization method for Company A, and a (close) third best for Company B (Table 3).

5.1.5 Lowercasing

Sentences in English tend to have the first word capitalized. In addition, different people will capitalize different words, intentionally or otherwise, depending on their interpretation of the word or their choices in capitalizing acronyms. Clearly, this is not always a useful normalization, and it can go very wrong, especially when dealing with product names versus generic objects.

    Ex. President, president, CD, cd, Windows, windows

As expected, lowercasing does not help very much (Table 3), most likely because any benefit we receive from lowercasing occurrences such as "Knives are great" to "knives are great" is offset by "Windows Vista" becoming "windows vista", which has a very different meaning. Perhaps it would have been more useful to lowercase a word only when there is no chance it is a product name, but that would be particularly difficult given the essentially unbounded set of product names.

Table 3: Feature Normalization vs. NB Classification Accuracy

    Feature Normalization          Company A       Company B
                                   NB Accuracy     NB Accuracy
    Stemming                       .775            .519
    Stopword Removal               .843            .521
    Number Removal                 .794            .492
    Punctuation Removal            .803            .515
    Lowercase                      .778            .507
    Include Certain Descriptors    .829            .498
    None                           .747            .497
    All                            .796            .524
5.2 Choosing Features

Secondly, we use domain-specific knowledge about the product catalog and what it contains. In addition to the title, items have brands and other descriptors, such as specifications, that we may use. We felt that many of the specifications, such as size and weight, would not benefit the overall classification, but that the brand might have a large impact; when we tested it, we saw that the impact really depends on the dataset used (Table 3).
5.3 Calculating the Importance of Features

We compute the mutual information (MI) of each feature in order to 1) understand which features we can add and 2) weight the features we have by how important they are in distinguishing between categorizations. When we used the conventional MI formula, we found that two of the four terms in the summation were actually skewing the MI calculation for Company Catalog A: in particular, the term for the feature being absent and the item in the class, and the term for the feature being absent and the item not in the class. This is due to the skewness in the distribution of categories (the Electronics categories make up roughly Ÿ of the catalog); the modification is not applicable to Company Catalog B, as its distribution of categories is more uniform. The modified MI formula we used is:

    MI(x_i, y) = \sum_{x_i \in \{1\}} \sum_{y \in \{0,1\}} p(x_i, y) \log \frac{p(x_i, y)}{p(x_i)\, p(y)}

that is, we keep only the two terms in which the feature is present (x_i = 1).
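A direct sketch of this modified MI in Python follows; `presence` and `in_class` are our (hypothetical) names for the per-item feature and class indicators.

    import math

    def modified_mi(presence, in_class):
        # presence[i]: True if feature x_i appears in item i (x_i = 1)
        # in_class[i]: True if item i belongs to the class y
        n = len(presence)
        p_x = sum(presence) / n                    # p(x_i = 1)
        mi = 0.0
        for y_val in (True, False):                # y in {1, 0}
            p_y = sum(1 for y in in_class if y == y_val) / n
            p_xy = sum(1 for x, y in zip(presence, in_class)
                       if x and y == y_val) / n    # p(x_i = 1, y)
            if p_xy > 0:
                mi += p_xy * math.log(p_xy / (p_x * p_y))
        return mi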
Table 4: Most and Least Informative Words in 'Health & Personal Care' for Company A's Catalog

    Most informative words    M.I.       Least informative words    M.I.
    tim                       0.00160    M30p20                     2.5e-8
    beard                     0.00084    pz462                      2.5e-8
    wahl                      0.00070    superspee                  2.5e-8
    hair                      0.00049    stick                      2.5e-8

Table 5: Most and Least Informative Words in 'Health & Personal Care' for Company A's Catalog

    Most informative words    M.I.       Least informative words    M.I.
    black                     0.1437     rapid                      0.1291
    pack                      0.1357     wet                        0.1291
    set                       0.1356     orthot                     0.1291
    cas                       0.1355     champ                      0.1291
                                                                                                             significantly less time. We thus stuck with our Naive Bayes
5.4 Feature Selection

As we are using a modified set of the words that describe products as our features, not all of these features will be useful for classification or for differentiating between categories. We considered forward or backward search to select our features, but as our features number in the thousands, we decided to try filter feature selection instead (see the sketch after this list). To do this, we first looked at the distribution of MI in our two catalogs. In Company Catalog A, the spread between the most and least informative features was large. We had some interesting findings (both obtained using a maximum of 3 possible categories):

    -   Using the most informative features increased the accuracy in both company catalogs.
    -   For the Company A product catalog, excluding the least informative features while using the most informative ones helped: we found a small but significant jump of 2% in accuracy. When we do use the least informative features, the accuracy halves.
    -   For the Company B product catalog, excluding the least informative features makes the accuracy drop, while including them gives a small and likely insignificant jump of 0.5% in accuracy.

We surmise that this odd behavior is due to the mutual information not properly describing or capturing the 'least informative features', owing to the skewness of the distribution of categories in the Company A product catalog.
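The following sketch of the filter step reuses modified_mi from the sketch in Section 5.3: score every token and keep the top k. Scoring a token by its best MI against any single class, and the helper names, are our assumptions; the paper does not give the exact selection rule.

    def select_features(items, labels, k=1000):
        # items: list of token lists; labels: list of category labels.
        vocab = {tok for toks in items for tok in toks}
        classes = set(labels)
        scores = {}
        for tok in vocab:
            presence = [tok in toks for toks in items]
            scores[tok] = max(
                modified_mi(presence, [y == c for y in labels])
                for c in classes
            )
        return set(sorted(vocab, key=scores.get, reverse=True)[:k])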
6. DIFFERENT CLASSIFIERS

We decided to continue working with our version of Naive Bayes after trying a few different classifiers. We used the Orange (http://orange.biolab.si/) machine learning library for Python to try out a few different machine learning algorithms on Company A's catalog; Table 6 shows the accuracy and time taken for each algorithm.

Table 6: Different Classifiers and Accuracies

    Algorithm                      Accuracy (Company A Catalog)    Time Taken
    NaĂŻve Bayes                    79.6%                           ~3 seconds
    K nearest neighbors (k = 5)    69.4%                           ~4 minutes
    Tree classifier                86.0%                           ~8 hours

As Table 6 shows, the Tree Classifier performed the best but took a few hours to complete. In a real commercial setting, we could see an algorithm like the Tree Classifier (or a more sophisticated algorithm such as Random Forests) running as a back-end daemon. In addition, such a tree offers the potential to understand a product catalog more deeply than a fixed set of categories does. However, Naive Bayes achieved almost the same accuracy as the Tree Classifier in significantly less time. We thus stuck with our Naive Bayes classifier to develop features for (see Section 5.2) and select features from (see Section 5.3). We tried kNN (the k-nearest neighbors algorithm) for different values of k, but all performed poorly, with k = 1 to 5 performing best.
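A comparison in the spirit of this section can be sketched with scikit-learn rather than the Orange library the authors actually used (a deliberate swap; we do not attempt to reproduce the 2011-era Orange API). `titles` and `labels` are assumed to be parallel lists of strings.

    import time
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X = CountVectorizer().fit_transform(titles)     # bag-of-words features
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, train_size=0.25)

    for name, clf in [("Naive Bayes", MultinomialNB()),
                      ("kNN (k=5)", KNeighborsClassifier(n_neighbors=5)),
                      ("Decision tree", DecisionTreeClassifier())]:
        start = time.time()
        clf.fit(X_tr, y_tr)
        acc = clf.score(X_te, y_te)
        print(f"{name}: accuracy={acc:.3f}, time={time.time() - start:.1f}s")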

7. CONCLUSIONS

We found two main difficulties in working with the two datasets. In Company A's product catalog, the distribution of categories was severely skewed towards one category ('Electronics'), and we tended to always classify products into the larger categories; as such, the larger the category, the better the accuracy. In Company B's product catalog, we found that many of the products had very short descriptions, with many fewer words than in Company A's catalog.

Along with implementing regular text processing techniques (stemming, etc.), we found that using multiple classifications, the notion of strong accuracy, and filtering out products with fewer than a certain number of words helped the accuracy significantly and made sense in the context of categorizing products. In particular, we believe that using multiple classifications is one of the main ways this work can be extended to incorporate user data from these companies' websites and to offer collaborative filtering capabilities.

We plan to continue this project in the context of a startup, and we would like to work towards a collaborative filtering platform for small and medium sized businesses, first by making our classification and categorization of products better for the current catalogs and robust across different types of product catalogs. We would also like to apply unsupervised learning to obtain sub-categorizations of existing categories.

7.1 Product-Dependent Features

So far, we have used the bag-of-words model and have not incorporated features that are domain-specific:

    -   Bi-words: many brand names and descriptions have words that occur together, e.g., 'digital camera'. We would like to capture these (see the sketch after this list).
    -   Upweight the title, picking the best upweighting percentage through a model selection method.
    -   Use our mutual information scores to weight the features (the ones that differentiate between classes are more important). We already have this, but it will become important when we have different types of features.
    -   Add spelling correction, as we noticed there are many misspelled words in the catalog.
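A sketch of the bi-word idea: emit adjacent token pairs alongside the unigrams so that phrases like 'digital camera' become single features. The joining convention is our choice, not the authors'.

    def with_biwords(tokens):
        bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
        return tokens + bigrams

    # e.g. with_biwords(["canon", "digital", "camera"])
    # -> ["canon", "digital", "camera", "canon_digital", "digital_camera"]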

7.2 Explore More Sophisticated Methods

We chose Naive Bayes as it is the simplest method we could implement. When we tried a few other algorithms using Orange, the Tree classifier looked promising. However, since it took around 8 hours on the entire dataset, we did not pursue this algorithm further for this project. We would like to pursue variants of the Tree classifier (such as Random Forests) by exploring a subset of the data.

7.3 Unsupervised Learning

Supervised classification with product catalogs is limited to the categories we train on. To find more granular sub-categories, we need to be able to discover structure within categories of products. We are considering an unsupervised hierarchical clustering approach. In addition, this would allow us to categorize and give structure to catalogs where no categorization exists yet (for example, products for a new or sparsely explored market).
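One possible shape for this direction, sketched with SciPy's agglomerative clustering over bag-of-words vectors; the vectorizer, distance metric, and cut criterion are all our assumptions, not the authors' design.

    from scipy.cluster.hierarchy import fcluster, linkage
    from sklearn.feature_extraction.text import TfidfVectorizer

    def subcategories(titles, n_clusters=5):
        X = TfidfVectorizer().fit_transform(titles).toarray()
        Z = linkage(X, method="average", metric="cosine")        # dendrogram
        return fcluster(Z, t=n_clusters, criterion="maxclust")   # flat cut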

8. REFERENCES

[1] Witschel, Hans F., and Fabian Schmidt. "Information Structuring and Product Classification." TSTT 2006, Beijing. Web. 2011.

[2] UNSPSC Homepage. 2011. <http://www.unspsc.org/>.

[3] "eCl@ss - The Leading Classification System." Content Development Platform. Paradine, 15 June 2011. Web. 2011. <http://www.eclass-cdp.com>.
More Related Content

Viewers also liked

Data mining with Google analytics
Data mining with Google analyticsData mining with Google analytics
Data mining with Google analyticsGreg Bray
 
Using NLP to find contextual relationships between fashion houses
Using NLP to find contextual relationships between fashion housesUsing NLP to find contextual relationships between fashion houses
Using NLP to find contextual relationships between fashion housesSushant Shankar
 
SF ElasticSearch Meetup 2013.04.06 - Monitoring
SF ElasticSearch Meetup 2013.04.06 - MonitoringSF ElasticSearch Meetup 2013.04.06 - Monitoring
SF ElasticSearch Meetup 2013.04.06 - MonitoringSushant Shankar
 
Supervised Classifcation Portland Metro
Supervised Classifcation Portland MetroSupervised Classifcation Portland Metro
Supervised Classifcation Portland MetroDonnych Diaz
 
The Evolution of Digital Ecommerce
The Evolution of Digital EcommerceThe Evolution of Digital Ecommerce
The Evolution of Digital EcommerceIncubeta NMPi
 
Machine Learning and Data Mining: 10 Introduction to Classification
Machine Learning and Data Mining: 10 Introduction to ClassificationMachine Learning and Data Mining: 10 Introduction to Classification
Machine Learning and Data Mining: 10 Introduction to ClassificationPier Luca Lanzi
 
Machine Learning with Applications in Categorization, Popularity and Sequence...
Machine Learning with Applications in Categorization, Popularity and Sequence...Machine Learning with Applications in Categorization, Popularity and Sequence...
Machine Learning with Applications in Categorization, Popularity and Sequence...Nicolas Nicolov
 
Big Data in e-Commerce
Big Data in e-CommerceBig Data in e-Commerce
Big Data in e-CommerceDivante
 
(CMP305) Deep Learning on AWS Made EasyCmp305
(CMP305) Deep Learning on AWS Made EasyCmp305(CMP305) Deep Learning on AWS Made EasyCmp305
(CMP305) Deep Learning on AWS Made EasyCmp305Amazon Web Services
 
Magento scalability from the trenches (Meet Magento Sweden 2016)
Magento scalability from the trenches (Meet Magento Sweden 2016)Magento scalability from the trenches (Meet Magento Sweden 2016)
Magento scalability from the trenches (Meet Magento Sweden 2016)Divante
 

Viewers also liked (10)

Data mining with Google analytics
Data mining with Google analyticsData mining with Google analytics
Data mining with Google analytics
 
Using NLP to find contextual relationships between fashion houses
Using NLP to find contextual relationships between fashion housesUsing NLP to find contextual relationships between fashion houses
Using NLP to find contextual relationships between fashion houses
 
SF ElasticSearch Meetup 2013.04.06 - Monitoring
SF ElasticSearch Meetup 2013.04.06 - MonitoringSF ElasticSearch Meetup 2013.04.06 - Monitoring
SF ElasticSearch Meetup 2013.04.06 - Monitoring
 
Supervised Classifcation Portland Metro
Supervised Classifcation Portland MetroSupervised Classifcation Portland Metro
Supervised Classifcation Portland Metro
 
The Evolution of Digital Ecommerce
The Evolution of Digital EcommerceThe Evolution of Digital Ecommerce
The Evolution of Digital Ecommerce
 
Machine Learning and Data Mining: 10 Introduction to Classification
Machine Learning and Data Mining: 10 Introduction to ClassificationMachine Learning and Data Mining: 10 Introduction to Classification
Machine Learning and Data Mining: 10 Introduction to Classification
 
Machine Learning with Applications in Categorization, Popularity and Sequence...
Machine Learning with Applications in Categorization, Popularity and Sequence...Machine Learning with Applications in Categorization, Popularity and Sequence...
Machine Learning with Applications in Categorization, Popularity and Sequence...
 
Big Data in e-Commerce
Big Data in e-CommerceBig Data in e-Commerce
Big Data in e-Commerce
 
(CMP305) Deep Learning on AWS Made EasyCmp305
(CMP305) Deep Learning on AWS Made EasyCmp305(CMP305) Deep Learning on AWS Made EasyCmp305
(CMP305) Deep Learning on AWS Made EasyCmp305
 
Magento scalability from the trenches (Meet Magento Sweden 2016)
Magento scalability from the trenches (Meet Magento Sweden 2016)Magento scalability from the trenches (Meet Magento Sweden 2016)
Magento scalability from the trenches (Meet Magento Sweden 2016)
 

Similar to Applying machine learning to product categorization

Odi best-practice-data-warehouse-168255
Odi best-practice-data-warehouse-168255Odi best-practice-data-warehouse-168255
Odi best-practice-data-warehouse-168255nm2013
 
Precise case b2b
Precise case b2bPrecise case b2b
Precise case b2bAbhijeet Kumar
 
Arya.ai artificial intelligence platform vinay kumar
Arya.ai   artificial intelligence platform  vinay kumarArya.ai   artificial intelligence platform  vinay kumar
Arya.ai artificial intelligence platform vinay kumarTechXpla
 
Bhadale group of companies manufacturing industry products catalogue
Bhadale group of companies manufacturing industry products catalogueBhadale group of companies manufacturing industry products catalogue
Bhadale group of companies manufacturing industry products catalogueVijayananda Mohire
 
Jewelry has been a really unstructuredsomewhat chaotic with resp.docx
Jewelry has been a really unstructuredsomewhat chaotic with resp.docxJewelry has been a really unstructuredsomewhat chaotic with resp.docx
Jewelry has been a really unstructuredsomewhat chaotic with resp.docxvrickens
 
Bhadale group of companies automobile products catalogue
Bhadale group of companies automobile products catalogueBhadale group of companies automobile products catalogue
Bhadale group of companies automobile products catalogueVijayananda Mohire
 
Automated Product Data Publishing from Oracle Product Hub Is the Way Forward
Automated Product Data Publishing from Oracle Product Hub Is the Way ForwardAutomated Product Data Publishing from Oracle Product Hub Is the Way Forward
Automated Product Data Publishing from Oracle Product Hub Is the Way ForwardCognizant
 
Engineering plant facilities 17 artificial intelligence algorithms &amp; proc...
Engineering plant facilities 17 artificial intelligence algorithms &amp; proc...Engineering plant facilities 17 artificial intelligence algorithms &amp; proc...
Engineering plant facilities 17 artificial intelligence algorithms &amp; proc...Luis Cabrera
 
Engineering plant facilities 17 artificial intelligence algorithms &amp; proc...
Engineering plant facilities 17 artificial intelligence algorithms &amp; proc...Engineering plant facilities 17 artificial intelligence algorithms &amp; proc...
Engineering plant facilities 17 artificial intelligence algorithms &amp; proc...Luis Cabrera
 
Selling Online – Do You Push Or Pull
Selling Online – Do You Push Or PullSelling Online – Do You Push Or Pull
Selling Online – Do You Push Or Pullrazorboy1
 
Internet of things_tutorial
Internet of things_tutorialInternet of things_tutorial
Internet of things_tutorialMuhammed Hassan M
 
Is Your Organization Ready to Embrace a Digital Twin?
Is Your Organization Ready to Embrace a Digital Twin?Is Your Organization Ready to Embrace a Digital Twin?
Is Your Organization Ready to Embrace a Digital Twin?Cognizant
 
Oracle ebs r12eam part2
Oracle ebs r12eam part2Oracle ebs r12eam part2
Oracle ebs r12eam part2jcvd12
 
Bhadale group of companies ai robotics products catalogue
Bhadale group of companies ai robotics products catalogueBhadale group of companies ai robotics products catalogue
Bhadale group of companies ai robotics products catalogueVijayananda Mohire
 
Applied AI for Startups
Applied AI for StartupsApplied AI for Startups
Applied AI for StartupsAdrien Hernandez
 
MeasureCamp #10 - WTF are Related Products in Google Analytics Ecommerce?
MeasureCamp #10 - WTF are Related Products in Google Analytics Ecommerce?MeasureCamp #10 - WTF are Related Products in Google Analytics Ecommerce?
MeasureCamp #10 - WTF are Related Products in Google Analytics Ecommerce?Michaela Linhart
 
Bhadale group of companies it-bt- bpo products catalogue
Bhadale group of companies it-bt- bpo products catalogueBhadale group of companies it-bt- bpo products catalogue
Bhadale group of companies it-bt- bpo products catalogueVijayananda Mohire
 
1 it 210 final project guidelines and rubric over
1 it 210 final project guidelines and rubric  over1 it 210 final project guidelines and rubric  over
1 it 210 final project guidelines and rubric overjasmin849794
 
Onlineshopping
OnlineshoppingOnlineshopping
Onlineshoppingamitesh2690
 
Make Spring Home (Spring Customization and Extensibility)
Make Spring Home (Spring Customization and Extensibility)Make Spring Home (Spring Customization and Extensibility)
Make Spring Home (Spring Customization and Extensibility)VMware Tanzu
 

Similar to Applying machine learning to product categorization (20)

Odi best-practice-data-warehouse-168255
Odi best-practice-data-warehouse-168255Odi best-practice-data-warehouse-168255
Odi best-practice-data-warehouse-168255
 
Precise case b2b
Precise case b2bPrecise case b2b
Precise case b2b
 
Arya.ai artificial intelligence platform vinay kumar
Arya.ai   artificial intelligence platform  vinay kumarArya.ai   artificial intelligence platform  vinay kumar
Arya.ai artificial intelligence platform vinay kumar
 
Bhadale group of companies manufacturing industry products catalogue
Bhadale group of companies manufacturing industry products catalogueBhadale group of companies manufacturing industry products catalogue
Bhadale group of companies manufacturing industry products catalogue
 
Jewelry has been a really unstructuredsomewhat chaotic with resp.docx
Jewelry has been a really unstructuredsomewhat chaotic with resp.docxJewelry has been a really unstructuredsomewhat chaotic with resp.docx
Jewelry has been a really unstructuredsomewhat chaotic with resp.docx
 
Bhadale group of companies automobile products catalogue
Bhadale group of companies automobile products catalogueBhadale group of companies automobile products catalogue
Bhadale group of companies automobile products catalogue
 
Automated Product Data Publishing from Oracle Product Hub Is the Way Forward
Automated Product Data Publishing from Oracle Product Hub Is the Way ForwardAutomated Product Data Publishing from Oracle Product Hub Is the Way Forward
Automated Product Data Publishing from Oracle Product Hub Is the Way Forward
 
Engineering plant facilities 17 artificial intelligence algorithms &amp; proc...
Engineering plant facilities 17 artificial intelligence algorithms &amp; proc...Engineering plant facilities 17 artificial intelligence algorithms &amp; proc...
Engineering plant facilities 17 artificial intelligence algorithms &amp; proc...
 
Engineering plant facilities 17 artificial intelligence algorithms &amp; proc...
Engineering plant facilities 17 artificial intelligence algorithms &amp; proc...Engineering plant facilities 17 artificial intelligence algorithms &amp; proc...
Engineering plant facilities 17 artificial intelligence algorithms &amp; proc...
 
Selling Online – Do You Push Or Pull
Selling Online – Do You Push Or PullSelling Online – Do You Push Or Pull
Selling Online – Do You Push Or Pull
 
Internet of things_tutorial
Internet of things_tutorialInternet of things_tutorial
Internet of things_tutorial
 
Is Your Organization Ready to Embrace a Digital Twin?
Is Your Organization Ready to Embrace a Digital Twin?Is Your Organization Ready to Embrace a Digital Twin?
Is Your Organization Ready to Embrace a Digital Twin?
 
Oracle ebs r12eam part2
Oracle ebs r12eam part2Oracle ebs r12eam part2
Oracle ebs r12eam part2
 
Bhadale group of companies ai robotics products catalogue
Bhadale group of companies ai robotics products catalogueBhadale group of companies ai robotics products catalogue
Bhadale group of companies ai robotics products catalogue
 
Applied AI for Startups
Applied AI for StartupsApplied AI for Startups
Applied AI for Startups
 
MeasureCamp #10 - WTF are Related Products in Google Analytics Ecommerce?
MeasureCamp #10 - WTF are Related Products in Google Analytics Ecommerce?MeasureCamp #10 - WTF are Related Products in Google Analytics Ecommerce?
MeasureCamp #10 - WTF are Related Products in Google Analytics Ecommerce?
 
Bhadale group of companies it-bt- bpo products catalogue
Bhadale group of companies it-bt- bpo products catalogueBhadale group of companies it-bt- bpo products catalogue
Bhadale group of companies it-bt- bpo products catalogue
 
1 it 210 final project guidelines and rubric over
1 it 210 final project guidelines and rubric  over1 it 210 final project guidelines and rubric  over
1 it 210 final project guidelines and rubric over
 
Onlineshopping
OnlineshoppingOnlineshopping
Onlineshopping
 
Make Spring Home (Spring Customization and Extensibility)
Make Spring Home (Spring Customization and Extensibility)Make Spring Home (Spring Customization and Extensibility)
Make Spring Home (Spring Customization and Extensibility)
 

Recently uploaded

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel AraĂșjo
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 

Recently uploaded (20)

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Advantages of Hiring UIUX Design Service Providers for Your Business
We would like to use machine learning techniques to define product categories (e.g., 'Electronics') and potential subcategories (e.g., 'Printers'). This is useful when a business has a list of new products that it wants to sell and wants to classify them automatically based on training data from the business's other products and classifications. It will also be useful when a new product line has not been introduced in the market before, or when the products are more densely populated than the training data (for example, if a business sells only electronic equipment, we would want to come up with a more granular category structure). For this algorithm to be used in industry, we have consulted with a few small-to-medium sized companies and found that we will need an accuracy of roughly 95% when we have enough prior training data and a dense set of categorizations.

Table 2: Company A (left) and Company B (right) Catalogs

Category | # | Category | #
Electronics | 3514 | Health | 768
Camera & Photo | 653 | Beauty | 645
Video Games | 254 | Electronics | 585
Musical Instruments | 212 | Home | 542
Kitchen & Dining | 173 | Toys | 271
Computer & Accessories | 61 | Office | 255
Automotive | 48 | Sports | 238
Health & Personal Care | 32 | PC | 232
Office Products | 30 | Personal Care Appliances | 204
Cell Phones & Accessories | 29 | Books | 203
Home & Garden | 26 | Kitchen | 160
Home Improvement | 24 | Wireless | 154
Sports & Outdoors | 17 | Grocery | 113
Patio, Lawn & Garden | 11 | Pet | 112
Software | 6 | Video | 106
Movies & TV | 5 | Apparel | 91
Toys & Games | 2 | Lawn | 81
Beauty | 2 | Baby | 68
Everything Else | 1 | Automotive | 52
Clothing | 1 | Shoes | 32
Baby | 1 | Jewelry | 22
 | | Watches | 21
 | | Software | 8
 | | Music | 6

2. PRIOR WORK

Millions of dollars have been spent developing software (e.g., IBM eCommerce) that maintains information about products, buying history of users for particular products, and so on. As such, there is much work being done on how to create good product categories. However, large companies (like Amazon) can afford such expensive software and sophisticated methods, while most companies that sell goods online cannot.

3.2 Parsing the Data Set

Each product comes with a large amount of accompanying data that may not be relevant, or may add unnecessary complexity that we feel would be detrimental to our classification purposes. Of the numerous product descriptors, we decided to keep the SKU as a way to uniquely identify products, the title with categorization, the description, and the tech details. The title in particular contained a high concentration of important product information such as brand and categorization. As a result, we were able to parse out the categorization, but decided to temporarily leave the brand as part of the title and treat it as another word. As a side note, we decided not to incorporate the images, as we felt that task would be better suited to computer vision and image processing.

3.3 Splitting Training and Test Sets

Before creating our training and test sets, we merged all of the categories holding less than 0.5% of the overall products into one "Other" category, then trained on a 25% random subset of the data and tested our error rate on the rest. A sketch of this procedure follows.
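As an illustration, here is a minimal sketch of the merge-and-split step, assuming products have already been parsed into (SKU, title, category) tuples; the function and field names are ours, not taken from the paper.

    import random
    from collections import Counter

    def merge_rare_categories(products, threshold=0.005):
        # Relabel categories holding less than 0.5% of all products as 'Other'.
        counts = Counter(cat for _, _, cat in products)
        cutoff = threshold * len(products)
        return [(sku, title, cat if counts[cat] >= cutoff else "Other")
                for sku, title, cat in products]

    def split_train_test(products, train_fraction=0.25, seed=0):
        # Train on a 25% random subset; test on the remaining 75%.
        rng = random.Random(seed)
        shuffled = products[:]
        rng.shuffle(shuffled)
        n_train = int(train_fraction * len(shuffled))
        return shuffled[:n_train], shuffled[n_train:]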
4. NAÏVE IMPLEMENTATION

Our predominant source of data is in the form of text descriptors about a product; thus, we decided to implement a simple Naïve Bayes classifier and focus on the features described in Section 5. The simplicity of Naïve Bayes makes for a great environment for testing and feature building, which helped to define the scope of the overall project.

4.1 First Implementation - Naïve Bayes

To begin, we implemented a simple multinomial event model Naive Bayes classifier with Laplace smoothing on just the 'title' field. Our accuracy using this implementation was around 70%.

4.2 Modified Implementation - Naïve Bayes

We realized that our classifier was classifying everything as 'Electronics' because the probability of the 'Electronics' category was astronomical compared to everything else. Thus, we decided to remove the prior class probability from the Naive Bayes equation.

However, when we break up errors by category, we find that the less training data we have on a category, the worse the accuracy. Thus, from our previous observation, we decided to only use categories that have at least 0.5% of the total number of products (so > ~25 products).

With these changes, our results immediately improved to an accuracy of about 75%.
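The paper does not show its implementation, so the following is only a minimal sketch of the classifier as described: a multinomial Naive Bayes over title tokens with Laplace smoothing, with the class prior dropped by default as in Section 4.2. The class structure and names are our own.

    import math
    from collections import Counter, defaultdict

    class TitleNaiveBayes:
        def __init__(self, use_prior=False):
            # use_prior=False removes log p(y) from the score, per Section 4.2.
            self.use_prior = use_prior

        def train(self, examples):
            # examples: list of (tokens, category) pairs
            self.word_counts = defaultdict(Counter)  # category -> token counts
            self.class_counts = Counter()
            self.vocab = set()
            for tokens, cat in examples:
                self.class_counts[cat] += 1
                self.word_counts[cat].update(tokens)
                self.vocab.update(tokens)

        def scores(self, tokens):
            out = {}
            n = sum(self.class_counts.values())
            v = len(self.vocab)
            for cat in self.class_counts:
                total = sum(self.word_counts[cat].values())
                s = math.log(self.class_counts[cat] / n) if self.use_prior else 0.0
                for t in tokens:
                    # Laplace smoothing: (count + 1) / (total + |V|)
                    s += math.log((self.word_counts[cat][t] + 1) / (total + v))
                out[cat] = s
            return out

        def classify(self, tokens):
            s = self.scores(tokens)
            return max(s, key=s.get)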
4.3 Precision / Recall

We found that for Product Catalog A, there was a large bias towards classifying into the large categories, e.g., 'Electronics'. In other words, for the large categories there was high recall but low precision; for the smaller categories there was low recall but high precision. This effect was also seen in Product Catalog B, but to a much lesser degree, as that catalog is far less skewed than Product Catalog A.

4.4 Filtering Items

One issue we noted with this classification scheme was that oftentimes an item simply did not have enough information to classify it properly, or the words did not provide enough information for even a human to classify it one way or another. In the former case, if there were not enough words, we removed the item from consideration and did not incorporate it into the classification accuracy. This was especially true for Company B's catalog, which included books whose titles were often 3-4 words or less in length and which were failing badly in their categorizations. The latter situation leads to the next subsection.

4.5 Multiple Classifications / 'Strong Accuracy'

Upon manual inspection of misclassifications, we found that the classifier often produced arguably equivalently good classifications. In the following case, the labeled category was 'Electronics' but we classified it as 'Office Products', which also makes sense:

Ex: Fellowes Powershred MS-450C Safe Sense 7-sheet Confetti Cut Shredder

We therefore redefined success as having the 'correct' categorization among the top three categories the classifier returned. For each SKU, we took up to three categories whose scores fell within the range of the category probabilities (max - min).

Accuracies were low when looking at Product Catalog B alone (around 50%). Including multiple classifications increased the accuracy to around 91%. This is not just a fudge statistic, but actually makes sense in the context of these catalogs: when a company gets a user interested in an analogous category, we want to be able to show them a product to buy if the analogous category also makes sense.

Using multiple classifications, if only one category was returned, our classifier is reasonably certain that this is the correct categorization, as the probability that the item belongs to the other categories is much lower (in fact, less than 50% of the top categorization according to our heuristic). We term this the strong accuracy. Indeed, for Product Catalog B with multiple categories, our accuracy is 72.7% while our strong accuracy is 84.5%.
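The cutoff heuristic is only loosely specified above, so the sketch below commits to one plausible reading: return up to three categories whose scores fall within half of the (max - min) spread below the top score, consistent with the "50% of the top categorization" remark. Both the 0.5 fraction and the cap of three are our assumptions.

    def multi_classify(scores, max_categories=3, fraction=0.5):
        # scores: dict mapping category -> (log-)probability score.
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        hi = ranked[0][1]
        lo = ranked[-1][1]
        cutoff = hi - fraction * (hi - lo)  # assumed reading of the paper's heuristic
        return [cat for cat, s in ranked[:max_categories] if s >= cutoff]

    # Accuracy counts a product correct if the true category is among the
    # returned ones; "strong accuracy" restricts this to single-category returns.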
5. FEATURES

As we use the human-generated text content to define a product, we decided on the bag of words model as features for each item. However, because the text descriptors are human generated, we encounter a lot of human eccentricities and differences in word choice for similar words. Thus, we incorporate a couple of domain-specific ideas to help with classification: the first is per-feature processing, and the second is determining which features to use.

5.1 Feature Processing

We apply some standard techniques in text processing to clean up our feature set: stemming/lemmatization, lowercasing, and stop word, punctuation, and number removal. (A sketch of the combined pipeline follows Section 5.1.5.)

5.1.1 Stemming/Lemmatization

We use stemming to narrow our overall feature space, so as to help with the lack of features per item.

Ex: case, casing, cases -> cas

Overall, we see mixed results for stemming. As seen in Table 3, while stemming shows improvement for both companies over the baseline without any feature processing, it clearly hurts the overall accuracy when combined with other feature processing for Company A, while it helps overall accuracy for Company B.

5.1.2 Stop Word Removal

As in much text, there were many common, short function words, often prepositional terms and other grammatical fillers, found in many of the product descriptions. While we have not yet attempted to build a domain-specific stop word list (i.e., one for product descriptions), we used a standard stop word set.

Ex: the, is, at, who, which, on

Stop word removal turned out to be by far the most effective feature processing technique we used, dominating both catalogs for accuracy boosts (Table 3). In fact, for Company A it resulted in nearly a 10% increase in overall accuracy.

5.1.3 Number Removal

The decision to remove numbers came from looking at our data and realizing that many numbers had little to no relevance to a given category, and that the scatter (and range) of numbers would only adversely impact a given categorization. Because there is a limited and finite number of training examples, but an infinite set of possible numbers, we believed that we could not fully capture the information of numbers for a given category.

Ex: 5-10, 2, 6, 15.3

The slight performance increase for Company A is offset by a slight performance decrease for Company B (Table 3). We attribute this to the fact that for certain categories, specific numbers may show up more often, and thus affect the resulting categorization.

5.1.4 Punctuation Removal

Removing punctuation was another result of looking at our data and noticing that humans have a very large variance in phrasing and word choice. This is especially true for words that have multiple accepted forms, or words with punctuation in them. Also, because we parse the item descriptors on spaces, any punctuation inside a phrase is left in, including ellipses, periods, exclamation points, and others. In addition, words that are concatenated are often used differently.

Ex: don't, dont, 3D, 3-D, period., period?

Punctuation removal was the second most effective feature normalization method for Company A, while it was a close third best for Company B (Table 3).

5.1.5 Lowercasing

Sentences in the English language tend to have the first word capitalized. In addition, different people will capitalize different words, intentionally or otherwise, depending on their interpretation of the word or their choice of capitalizing acronyms. Clearly, this is not always a useful normalization, and it can go very wrong, especially when dealing with product names vs. generic objects.

Ex: President, president, CD, cd, Windows, windows

As expected, lowercasing does not help very much (Table 3), most likely because any benefit we receive from lowercasing occurrences such as "Knives are great" to "knives are great" is offset by "Windows Vista" becoming "windows vista", which has a very different meaning. Perhaps it would have been more useful to lowercase a word only if there is no chance that it is a product name, but that would be particularly difficult given the endless variety of product names out there.
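Taken together, Sections 5.1.1-5.1.5 amount to a small normalization pipeline. Here is a minimal sketch, assuming NLTK's Porter stemmer is available; the stop list shown is a tiny illustrative subset, not the full standard list the paper used.

    import re
    from nltk.stem import PorterStemmer

    STOP_WORDS = {"the", "is", "at", "who", "which", "on"}  # illustrative subset
    stemmer = PorterStemmer()

    def normalize(title):
        tokens = [t.lower() for t in title.split()]                        # 5.1.5 lowercasing
        tokens = [t for t in tokens if not re.fullmatch(r"[\d.,\-]+", t)]  # 5.1.3 number removal
        tokens = [re.sub(r"[^\w]", "", t) for t in tokens]                 # 5.1.4 punctuation removal
        tokens = [t for t in tokens if t and t not in STOP_WORDS]          # 5.1.2 stop word removal
        return [stemmer.stem(t) for t in tokens]                           # 5.1.1 stemming

Note the ordering: numbers are filtered before punctuation is stripped so that tokens like "15.3" are recognized as numbers rather than collapsing to "153".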
Table 3: Feature Normalization vs. NB Classification Accuracy

Feature Normalization | Company A NB Accuracy | Company B NB Accuracy
Stemming | .775 | .519
Stopword Removal | .843 | .521
Number Removal | .794 | .492
Punctuation Removal | .803 | .515
Lowercase | .778 | .507
Include Certain Descriptors | .829 | .498
None | .747 | .497
All | .796 | .524

5.2 Choosing Features

Secondly, we use domain-specific knowledge about the product catalog and what is contained in it. In addition to the title, items have brands and other descriptions, such as specifications, that we may use. We felt many of the specifications, such as size and weight, would not benefit the overall classification, but we felt that the brand might have a large impact; when we tested it, we saw that the impact really depends on the dataset used (Table 3).
5.3 Calculating the Importance of Features

We calculate mutual information (MI) in order to 1) understand what features we can add and 2) weight the features we have by how important they are in distinguishing between categorizations. When we used the conventional MI formula, we found that two of the four terms in the summation were actually skewing the MI calculation for Company Catalog A: in particular, the terms for 'not this feature and in this class' and 'not this feature and not in this class'. This is due to the skewness in the distribution of categories (the Electronics category is roughly three-quarters of the catalog); the modification is not applicable for Company Catalog B, as its distribution of categories is more uniform. The modified MI formula we used, which keeps only the terms where the feature is present, is:

    MI(x_i, y) = \sum_{x_i \in \{1\}} \sum_{y \in \{0,1\}} p(x_i, y) \log \frac{p(x_i, y)}{p(x_i)\, p(y)}

Table 4: Most and Least Informative Words in 'Health & Personal Care' for Company A's Catalog

Most informative words | M.I. | Least informative words | M.I.
tim | 0.00160 | M30p20 | 2.5e-8
beard | 0.00084 | pz462 | 2.5e-8
wahl | 0.00070 | superspee | 2.5e-8
hair | 0.00049 | stick | 2.5e-8

Table 5: Most and Least Informative Words in 'Health & Personal Care' for Company A's Catalog

Most informative words | M.I. | Least informative words | M.I.
black | 0.1437 | rapid | 0.1291
pack | 0.1357 | wet | 0.1291
set | 0.1356 | orthot | 0.1291
cas | 0.1355 | champ | 0.1291

5.4 Feature Selection

As we are using a modified set of the words describing products as our features, not all of these features will be useful for classification or differentiation between categories. We considered forward or backward search to select our features, but as our features run in the order of thousands, we decided to try filter feature selection instead. To do this, we first looked at the distribution of MI in our two catalogs. In Company Catalog A, the spread between the most and least informative features was large. We had some interesting findings (both measured with a maximum of 3 possible categories):

- Using the most informative features increased the accuracy in both company catalogs.
- For the Company A product catalog, not using the least informative features along with using the most informative ones helped; we found a small but significant jump of 2% in accuracy. When we do use the least informative features, the accuracy halves.
- For the Company B product catalog, when we do not include the least informative features, the accuracy drops; when we do include them, there is a slight, insignificant jump of 0.5% in accuracy.

We surmise that this odd behavior is due to the mutual information not properly capturing the 'least informative features', owing to the skewness of the distribution of categories in the Company A product catalog. (A sketch of the MI computation is given below.)
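To make the modified formula concrete, here is a minimal sketch of the computation from binary presence counts over the training set; the function and variable names are ours.

    import math

    def modified_mi(n_feature_and_class, n_feature_not_class, n_class, n_total):
        # MI(x_i, y) summed over x_i in {1} and y in {0, 1} only.
        # n_feature_and_class: products containing the word and in the class
        # n_feature_not_class: products containing the word but not in the class
        # n_class:             products in the class
        # n_total:             all products
        p_x = (n_feature_and_class + n_feature_not_class) / n_total
        mi = 0.0
        for n_xy, p_y in ((n_feature_and_class, n_class / n_total),
                          (n_feature_not_class, 1 - n_class / n_total)):
            p_xy = n_xy / n_total
            if p_xy > 0:
                mi += p_xy * math.log(p_xy / (p_x * p_y))
        return mi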
6. DIFFERENT CLASSIFIERS

We decided to continue working with our version of Naive Bayes after trying a few different classifiers. We used the Orange (http://orange.biolab.si/) machine learning library for Python to try out a few different machine learning algorithms. We ran these on Company A's catalog; Table 6 shows the accuracy and time taken for each algorithm.

Table 6: Different Classifiers and Accuracies

Algorithm | Company A Catalog Accuracy | Time Taken
Naïve Bayes | 79.6% | ~3 seconds
K nearest neighbors (k = 5) | 69.4% | ~4 minutes
Tree classifier | 86.0% | ~8 hours

As Table 6 shows, the Tree Classifier performed the best, but took a few hours to complete. In a real commercial situation, we see an algorithm like the Tree Classifier (or a more sophisticated algorithm such as Random Forests) running as a back-end daemon. In addition, the tree offers the potential to understand a product catalog more deeply (more so than a fixed set of categories). However, Naive Bayes achieved almost the same accuracy as the Tree Classifier in significantly less time. We thus stuck with our Naive Bayes classifier to develop features for (see Section 5.2) and select features from (see Section 5.3). We also tried kNN (k-nearest neighbors) for different values of k, but all values performed poorly, with k = 1 to 5 performing the best.
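The comparison above was run with Orange's 2011-era API, which has since changed substantially; as a hedge, the sketch below reproduces an analogous experiment with scikit-learn instead (bag-of-words counts, 25% training split), so it is an approximation of the comparison rather than the original code. Loading of titles and labels is assumed.

    import time
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    def compare(titles, labels):
        X = CountVectorizer().fit_transform(titles)  # bag-of-words features
        X_tr, X_te, y_tr, y_te = train_test_split(X, labels, train_size=0.25)
        for name, clf in [("Naive Bayes", MultinomialNB()),
                          ("kNN (k=5)", KNeighborsClassifier(n_neighbors=5)),
                          ("Tree", DecisionTreeClassifier())]:
            start = time.time()
            clf.fit(X_tr, y_tr)
            acc = clf.score(X_te, y_te)
            print(f"{name}: accuracy {acc:.3f}, {time.time() - start:.1f}s")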
7. CONCLUSIONS

We found two main difficulties working with the two datasets. In Company A's product catalog, the distribution of categories was severely skewed towards one category ('Electronics'), and we tended to always classify products into the larger categories; as such, the larger the category, the better the accuracy. In Company B's product catalog, we found that many of the products had very short descriptions, with many fewer words than in Company A's catalog.

Along with implementing regular text processing techniques (stemming, etc.), we found that using multiple classifications, the notion of strong accuracy, and filtering out products that had fewer than a certain number of words helped the accuracy significantly and made sense in the context of categorizing products. In particular, we believe that using multiple classifications is one of the main ways this work can be extended to incorporate user data from these company websites and offer collaborative filtering capabilities.

We plan to continue this project in the context of a startup, and we would like to work towards a collaborative filtering platform for small and medium sized businesses by first making our classification and categorization of products better for the current catalogs and robust across different types of product catalogs. We would also like to implement unsupervised learning to obtain sub-categorizations of existing categories.

7.1 Product-Dependent Features

So far, we have used the bag of words model and have not incorporated features that are domain-specific:

- Bi-words: Many brand names and descriptions have words that occur together, e.g., 'digital camera'. We would like to capture these.
- Upweight the title, and pick the best percentage of upweighting through a model selection method.
- Use our mutual information scores to weight the features (the ones that differentiate between classes are more important). We already have this, but it will become important when we have different types of features.
- Add spelling correction, as we noticed there are many misspelled words in the catalogs.

7.2 Explore More Sophisticated Methods

We chose Naive Bayes as it is the simplest method we could implement. When we tried a few other algorithms using Orange, the Tree classifier looked promising. However, since it took around 8 hours on the entire set of data, we did not pursue this algorithm further for this project. We would like to pursue variants of the Tree classifier (such as Random Forest) by exploring a subset of the data.

7.3 Unsupervised Learning

Supervised classification with product catalogs is limited to the categories that we train on. To find more granular sub-categories, we need to be able to discover structure within categories of products. We are considering an unsupervised hierarchical clustering approach. In addition, this would allow us to categorize and give structure to catalogs where there is no existing product categorization (for example, products for a new or sparsely explored market).