Machine Learning in R

 Alexandros Karatzoglou1

     1 Telefonica
               Research
      Barcelona, Spain


   December 15, 2010




                           1
Outline
1   Introduction to R
       CRAN
       Objects and Operations
       Basic Data Structures
       Missing Values
       Entering Data
       File Input and Output
       Installing Packages
       Indexing and Subsetting
2   Basic Plots
3   Lattice Plots
4   Basic Statistics & Machine Learning
       Tests
5   Linear Models
6   Naive Bayes
7   Support Vector Machines
8   Decision Trees
9   Dimensionality Reduction
                                          2
R


    Environment for statistical data analysis, inference and
    visualization.
    Ports for Unix, Windows and MacOSX
    Highly extensible through user-defined functions
    Generic functions and conventions for standard operations like
    plot, predict etc.
    ∼ 1200 add-on packages contributed by developers from all over
    the world
    e.g. Multivariate Statistics, Machine Learning, Natural Language
    Processing, Bioinformatics (Bioconductor), SNA, etc.
    Interfaces to C, C++, Fortran, Java



                                                                       3
Outline
1   Introduction to R
       CRAN
       Objects and Operations
       Basic Data Structures
       Missing Values
       Entering Data
       File Input and Output
       Installing Packages
       Indexing and Subsetting
2   Basic Plots
3   Lattice Plots
4   Basic Statistics & Machine Learning
       Tests
5   Linear Models
6   Naive Bayes
7   Support Vector Machines
8   Decision Trees
9   Dimensionality Reduction
                                          4
Comprehensive R Archive Network (CRAN)



   CRAN includes packages which provide additional functionality
   beyond that available in base R
   Currently over 1200 packages in areas like multivariate statistics,
   time series analysis, Machine Learning, Geo-statistics,
   environmental statistics etc.
   packages are written mainly by academics, PhD students, or
   company staff
   Some of the packages have been organized into Task Views




                                                                     5
CRAN Task Views




   Some of the CRAN packages are ordered into categories called
   Task Views
   Task Views are maintained by people with experience in the
   corresponding field
   there are Task Views on Econometrics, Environmental Statistics,
   Finance, Genetics, Graphics, Machine Learning, Multivariate
   Statistics, Natural Language Processing, Social Sciences, and
   others.




                                                                     6
Online documentation




 1   Go to http://www.r-project.org
 2   On the left side, look under Documentation
 3   Official Manuals, FAQ, Newsletter, Books
 4   Under Other, click on other publications for a collection of
     contributed guides and manuals




                                                                      7
Mailing lists




    R-announce is for announcing new major enhancements in R
    R-packages is for announcing new versions of packages
    R-help is for R related user questions!
    R-devel is for R developers




                                                               8
Command Line Interface




   R does not have a “real” GUI
   All computations and statistics are performed with commands
   These commands are called “functions” in R




                                                                 9
help




   click on help
   search help functionality
   FAQ
   link to reference web pages
   apropos
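
These facilities can also be used from the command line; a small sketch
(the functions and search terms queried here are just examples):

> help(mean)                 # help page for a function
> ?mean                      # shortcut for help()
> help.search("regression")  # search the installed documentation
> apropos("mean")            # objects whose names match a pattern
> help.start()               # HTML help in the browser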




                                 10
Outline
1   Introduction to R
       CRAN
       Objects and Operations
       Basic Data Structures
       Missing Values
       Entering Data
       File Input and Output
       Installing Packages
       Indexing and Subsetting
2   Basic Plots
3   Lattice Plots
4   Basic Statistics & Machine Learning
       Tests
5   Linear Models
6   Naive Bayes
7   Support Vector Machines
8   Decision Trees
9   Dimensionality Reduction
                                          11
Objects



Everything in R is an object
Everything in R has a class.
    Data, intermediate results and even e.g. the result of a regression
    are stored in R objects
    The class of an object describes both what the object contains
    and how standard functions such as print and summary behave on it
    Objects are usually accessed by name. Syntactic names for
    objects are made up from letters, the digits 0 to 9 in any non-initial
    position and also the period “.”




                                                                       12
Assignment and Expression Operations




   R commands are either assignments or expressions
   Commands are separated either by a semicolon ; or newline
   An expression command is evaluated and (normally) printed
   An assignment command evaluates an expression and passes the
   value to a variable but the result is not printed




                                                               13
Expression Operations




> 1 + 1
[1] 2




                        14
Assignment Operations




> res <- 1 + 1

   “<-” is the assignment operator in R
   a series of commands in a file (script) can be executed with the
   command source(“myfile.R”)




                                                                     15
Sample session


> 1:5
[1] 1 2 3 4 5
> powers.of.2 <- 2^(1:5)
> powers.of.2
[1]   2   4   8 16 32
> class(powers.of.2)
[1] "numeric"
> ls()
[1] "powers.of.2" "res"
> rm(powers.of.2)


                           16
Workspace




   R stores objects in a workspace that is kept in memory
   When quitting, R asks whether you want to save that workspace
   The workspace containing all the objects you work on can then be
   restored the next time you start R, along with a history of the
   commands used.
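
A minimal sketch of the corresponding commands (the file names are
illustrative):

> ls()                            # objects in the current workspace
> save.image("mywork.RData")      # save the whole workspace
> savehistory("mywork.Rhistory")  # save the command history
> load("mywork.RData")            # restore it in a later session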




                                                                   17
Outline
1   Introduction to R
       CRAN
       Objects and Operations
       Basic Data Structures
       Missing Values
       Entering Data
       File Input and Output
       Installing Packages
       Indexing and Subsetting
2   Basic Plots
3   Lattice Plots
4   Basic Statistics & Machine Learning
       Tests
5   Linear Models
6   Naive Bayes
7   Support Vector Machines
8   Decision Trees
9   Dimensionality Reduction
                                          18
Basic Data Structures




   Vectors
   Factors
   Data Frame
   Matrices
   Lists




                        19
Vectors



In R the building blocks for storing data are vectors of various types.
The most common classes are:
    "character", a vector of character strings of varying length. These
    are normally entered and printed surrounded by double quotes.
    "numeric", a vector of real numbers.
    "integer", a vector of (signed) integers.
    "logical", a vector of logical (true or false) values. The values are
    printed and as TRUE and FALSE.




                                                                          20
Numeric Vectors

> vect <- c(1, 2, 99, 6, 8, 9)
> is(vect)
[1] "numeric" "vector"
> vect[2]
[1] 2
> vect[2:3]
[1]   2 99
> length(vect)
[1] 6
> sum(vect)
[1] 125

                                 21
Character Vectors
> vect3 <- c("austria", "spain",
+     "france", "uk", "belgium",
+     "poland")
> is(vect3)
[1]   "character"
[2]   "vector"
[3]   "data.frameRowLabels"
[4]   "SuperClassMethod"
> vect3[2]
[1] "spain"
> vect3[2:3]
[1] "spain"    "france"
> length(vect3)
[1] 6
                                   22
Logical Vectors




> vect4 <- c(TRUE, TRUE, FALSE, TRUE,
+     FALSE, TRUE)
> is(vect4)
[1] "logical" "vector"




                                        23
Factors
> citizen <- factor(c("uk", "us",
+     "no", "au", "uk", "us", "us"))
> citizen
[1] uk us no au uk us us
Levels: au no uk us
> unclass(citizen)
[1] 3 4 2 1 3 4 4
attr(,"levels")
[1] "au" "no" "uk" "us"
> citizen[5:7]
[1] uk us us
Levels: au no uk us
> citizen[5:7, drop = TRUE]
[1] uk us us
Levels: uk us
                                       24
Factors Ordered

> income <- ordered(c("Mid", "Hi",
+     "Lo", "Mid", "Lo", "Hi"), levels = c("Lo",
+     "Mid", "Hi"))
> income
[1] Mid Hi Lo Mid Lo     Hi
Levels: Lo < Mid < Hi
> as.numeric(income)
[1] 2 3 1 2 1 3
> class(income)
[1] "ordered" "factor"
> income[1:3]
[1] Mid Hi Lo
Levels: Lo < Mid < Hi
                                                   25
Data Frames




   A data frame is the type of object normally used in R to store a
   data matrix.
   It should be thought of as a list of variables of the same length, but
   possibly of different types (numeric, factor, character, logical, etc.).




                                                                        26
Data Frames
> library(MASS)
> data(painters)
> painters[1:3, ]
           Composition Drawing Colour
Da Udine            10        8    16
Da Vinci            15       16     4
Del Piombo           8      13     16
           Expression School
Da Udine            3      A
Da Vinci           14      A
Del Piombo          7      A
> names(painters)
[1] "Composition" "Drawing"
[3] "Colour"      "Expression"
[5] "School"
> row.names(painters)
 [1] "Da Udine"         "Da Vinci"
Data Frames




   Data frames are by far the commonest way to store data in R
   They are normally imported by reading a file or from a
   spreadsheet or database.
   However, vectors of the same length can be collected into a data
   frame by the function data.frame




                                                                  28
Data Frame


> mydf <- data.frame(vect, vect3,
+     income)
> summary(mydf)
      vect           vect3   income
 Min.   : 1.00   austria:1   Lo :2
 1st Qu.: 3.00   belgium:1   Mid:2
 Median : 7.00   france :1   Hi :2
 Mean   :20.83   poland :1
 3rd Qu.: 8.75   spain :1
 Max.   :99.00   uk     :1
> mydf <- data.frame(vect, I(vect3),
+     income)



                                       29
Matrices




   A data frame may be printed like a matrix, but it is not a matrix.
   Matrices, like vectors, have all their elements of the same type




                                                                        30
> mymat <- matrix(1:10, 2, 5)
> class(mymat)
[1] "matrix"
> dim(mymat)
[1] 2 5




                                31
Lists




    A list is a vector of other R objects.
    Lists are used to collect together items of different classes.
    For example, an employee record might be created by :




                                                                     32
Lists




> Empl <- list(employee = "Anna",
+     spouse = "Fred", children = 3,
+     child.ages = c(4, 7, 9))
> Empl$employee
[1] "Anna"
> Empl$child.ages[2]
[1] 7




                                       33
Basic Mathematical Operations



> 5 - 3
[1] 2
> a <- 2:4
> b <- rep(1, 3)
> a - b
[1] 1 2 3
> a * b
[1] 2 3 4




                                34
Matrix Multiplication




    "
> a <- matrix(1:9, 3, 3)
> b <- matrix(2, 3, 3)
> c <- a %*% b




                           35
Outline
1   Introduction to R
       CRAN
       Objects and Operations
       Basic Data Structures
       Missing Values
       Entering Data
       File Input and Output
       Installing Packages
       Indexing and Subsetting
2   Basic Plots
3   Lattice Plots
4   Basic Statistics & Machine Learning
       Tests
5   Linear Models
6   Naive Bayes
7   Support Vector Machines
8   Decision Trees
9   Dimensionality Reduction
                                          36
Missing Values




   Missing Values in R are labeled with the logical value NA




                                                               37
Missing Values



> mydf[mydf == 99] <- NA
> mydf
    vect   vect3 income
1      1 austria    Mid
2      2   spain     Hi
3     NA france      Lo
4      6      uk    Mid
5      8 belgium     Lo
6      9 poland      Hi




                           38
Missing Values




> a <- c(3, 5, 6)
> a[2] <- NA
> a
[1]   3 NA   6
> is.na(a)
[1] FALSE    TRUE FALSE




                          39
Special Values




   R also has the special values NaN, Inf and -Inf
> d <- c(-1, 0, 1)
> d/0
[1] -Inf    NaN    Inf




                                                      40
Outline
1   Introduction to R
       CRAN
       Objects and Operations
       Basic Data Structures
       Missing Values
       Entering Data
       File Input and Output
       Installing Packages
       Indexing and Subsetting
2   Basic Plots
3   Lattice Plots
4   Basic Statistics & Machine Learning
       Tests
5   Linear Models
6   Naive Bayes
7   Support Vector Machines
8   Decision Trees
9   Dimensionality Reduction
                                          41
Entering Data




   For all but the smallest datasets the easiest way to get data into R
   is to import it from a connection such as a file




                                                                     42
Entering Data




> a <- c(5, 8, 5, 4, 9, 7)
> b <- scan()




                             43
Entering Data




> mydf2 <- edit(mydf)




                        44
Outline
1   Introduction to R
       CRAN
       Objects and Operations
       Basic Data Structures
       Missing Values
       Entering Data
       File Input and Output
       Installing Packages
       Indexing and Subsetting
2   Basic Plots
3   Lattice Plots
4   Basic Statistics & Machine Learning
       Tests
5   Linear Models
6   Naive Bayes
7   Support Vector Machines
8   Decision Trees
9   Dimensionality Reduction
                                          45
File Input and Output




   read.table can be used to read data from text files such as CSV files
   read.table creates a data frame from the values and tries to guess
   the type of each variable
> mydata <- read.table("file.csv",
+     sep = ",")




                                                                  46
Packages



   Packages contain additional functionality in the form of extra
   functions and data
   Installing a package can be done with the function
   install.packages()
   The default R installation contains a number of contributed
   packages like MASS, foreign, utils
   Before we can access the functionality in a package we have to
   load the package with the function library()
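
For example, to install and then load a CRAN package (the package name
here is just an example):

> install.packages("kernlab")
> library(kernlab)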




                                                                    47
Outline
1   Introduction to R
       CRAN
       Objects and Operations
       Basic Data Structures
       Missing Values
       Entering Data
       File Input and Output
       Installing Packages
       Indexing and Subsetting
2   Basic Plots
3   Lattice Plots
4   Basic Statistics & Machine Learning
       Tests
5   Linear Models
6   Naive Bayes
7   Support Vector Machines
8   Decision Trees
9   Dimensionality Reduction
                                          48
Packages




   Help on the contents of the packages is available
> library(help = foreign)

   Help on installed packages is also available by help.start()
   Vignettes are also available for many packages




                                                                  49
SPSS files




   Package foreign contains functions that read data from SPSS
   (sav), STATA and other formats
> library(foreign)
> spssdata <- read.spss("ees04.sav",
+     to.data.frame = TRUE)




                                                                 50
Excel files




   To load Excel files into R, an external package such as xlsReadWrite
   is needed
   We will install the package directly from CRAN using
   install.packages
> install.packages("xlsReadWrite",
+     lib = "C:temp")




                                                                  51
Excel files




> library(xlsReadWrite)
> data <- read.xls("sampledata.xls")




                                       52
Data Export



   You can save your data by using the functions write.table()
   or save()
   write.table is more data specific, while save can save any R
   object
   save saves in a binary format while write.table() in text
   format.
> write.table(mydata, file = "mydata.csv",
+     quote = FALSE)
> save(mydata, file = "mydata.rda")
> load(file = "mydata.rda")




                                                               53
Outline
1   Introduction to R
       CRAN
       Objects and Operations
       Basic Data Structures
       Missing Values
       Entering Data
       File Input and Output
       Installing Packages
       Indexing and Subsetting
2   Basic Plots
3   Lattice Plots
4   Basic Statistics & Machine Learning
       Tests
5   Linear Models
6   Naive Bayes
7   Support Vector Machines
8   Decision Trees
9   Dimensionality Reduction
                                          54
Indexing




   R has a powerful set of indexing capabilities
   This facilitates data manipulation and can speed up data
   pre-processing
   Indexing vectors, data frames and matrices is done in a similar
   manner




                                                                 55
Indexing by numeric vectors



> letters[1:3]
[1] "a" "b" "c"
> letters[c(7, 9)]
[1] "g" "i"
> letters[-(1:15)]
 [1] "p" "q" "r" "s" "t" "u" "v" "w" "x"
[10] "y" "z"




                                           56
Indexing by logical vectors




> a <- 1:10
> a[c(3, 5, 7)] <- NA
> is.na(a)
 [1] FALSE FALSE TRUE FALSE      TRUE FALSE
 [7] TRUE FALSE FALSE FALSE
> a[!is.na(a)]
[1]   1   2   4   6   8   9 10




                                              57
Indexing Matrices and Data Frames



> mymat[1, ]
[1] 1 3 5 7 9
> mymat[, c(2, 3)]
     [,1] [,2]
[1,]    3    5
[2,]    4    6
> mymat[1, -(1:3)]
[1] 7 9




                                    58
Selecting Subsets
> attach(painters)
> painters[Colour >= 17, ]
          Composition Drawing Colour
Bassano              6        8   17
Giorgione            8        9   18
Pordenone            8       14   17
Titian             12        15   18
Rembrandt          15         6   17
Rubens             18        13   17
Van Dyck           15        10   17
          Expression School
Bassano            0       D
Giorgione          4       D
Pordenone          5       D
Titian             6       D
Rembrandt         12       G
Rubens            17       G
                                       59
Selecting Subsets
> painters[Colour >= 15 & Composition >
+      10, ]
            Composition Drawing Colour
Tintoretto           15       14    16
Titian               12       15    18
Veronese             15       10    16
Corregio             13       13    15
Rembrandt            15        6    17
Rubens               18       13    17
Van Dyck             15       10    17
            Expression School
Tintoretto           4      D
Titian               6      D
Veronese             3      D
Corregio            12      E
Rembrandt           12      G
Rubens              17      G
                                          60
Graphics




   R has a rich set of visualization capabilities
   scatter-plots, histograms, box-plots and more.




                                                    61
Histograms




> data(hills)
> hist(hills$time)




                     62
Scatter Plot




> plot(climb ~ time, hills)




                              63
Paired Scatterplot




> pairs(hills)




                     64
Box Plots




> boxplot(count ~ spray, data = InsectSprays,
+     col = "lightgray")




                                                65
Formula Interface




   A formula is of the general form response ∼ expression where the
   left-hand side, response, may in some uses be absent and the
   right-hand side, expression, is a collection of terms joined by
   operators usually resembling an arithmetical expression.
   The meaning of the right-hand side is context dependent
   e.g. in linear and generalized linear modelling it specifies the form
   of the model matrix
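
A few formulas as they appear in plotting and modelling calls later in
these slides:

> plot(climb ~ time, data = hills)             # climb plotted against time
> boxplot(count ~ spray, data = InsectSprays)  # count grouped by spray
> lm(time ~ dist + climb, data = hills)        # time modelled on dist and climb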




                                                                    66
Scatterplot and line




>   library(MASS)
>   data(hills)
>   attach(hills)
>   plot(climb ~ time, hills, xlab = "Time",
+       ylab = "Feet")
>   abline(0, 40)
>   abline(lm(climb ~ time))




                                               67
Multiple plots




>   par(mfrow = c(1, 2))
>   plot(climb ~ time, hills, xlab = "Time",
+       ylab = "Feet")
>   plot(dist ~ time, hills, xlab = "Time",
+       ylab = "Dist")
>   par(mfrow = c(1, 1))




                                               68
Plot to file




> pdf(file = "H:/My Documents/plot.pdf")
> plot(dist ~ time, hills, xlab = "Time",
+     ylab = "Dist", main = "Hills")
> dev.off()




                                            69
Paired Scatterplot




> library(lattice)
> splom(~hills)




                     70
Box-whisker plots




> data(michelson)
> bwplot(Expt ~ Speed, data = michelson,
+     ylab = "Experiment No.")
> title("Speed of Light Data")




                                           71
Box-whisker plots




> data(quine)
> bwplot(Age ~ Days | Sex * Lrn *
+     Eth, data = quine, layout = c(4,
+     2))




                                         72
Descriptive Statistics
> mean(hills$time)
[1] 57.87571
> colMeans(hills)
       dist       climb            time
   7.528571 1815.314286       57.875714
> median(hills$time)
[1] 39.75
> quantile(hills$time)
     0%        25%      50%      75%    100%
 15.950     28.000   39.750   68.625 204.617
> var(hills$time)
[1] 2504.073
> sd(hills$time)
[1] 50.04072
                                               73
Descriptive Statistics

> cor(hills)
           dist     climb      time
dist 1.0000000 0.6523461 0.9195892
climb 0.6523461 1.0000000 0.8052392
time 0.9195892 0.8052392 1.0000000
> cov(hills)
            dist       climb       time
dist    30.51387    5834.638   254.1944
climb 5834.63782 2621648.457 65243.2567
time   254.19442   65243.257 2504.0733
> cor(hills$time, hills$climb)
[1] 0.8052392


                                          74
Distributions




   R contains a number of functions for drawing samples from
   distributions such as the normal, uniform, etc.




                                                                   75
Sampling from a Normal Distribution




> nvec <- rnorm(100, 3)




                                      76
Outline
1   Introduction to R
       CRAN
       Objects and Operations
       Basic Data Structures
       Missing Values
       Entering Data
       File Input and Output
       Installing Packages
       Indexing and Subsetting
2   Basic Plots
3   Lattice Plots
4   Basic Statistics & Machine Learning
       Tests
5   Linear Models
6   Naive Bayes
7   Support Vector Machines
8   Decision Trees
9   Dimensionality Reduction
                                          77
Testing for the mean T-test


> t.test(nvec, mu = 1)
One Sample t-test

data: nvec
t = 21.187, df = 99, p-value <
2.2e-16
alternative hypothesis: true mean is not equal to 1
95 percent confidence interval:
 2.993364 3.405311
sample estimates:
mean of x
 3.199337



                                                  78
Testing for the mean T-test


> t.test(nvec, mu = 2)
One Sample t-test

data: nvec
t = 11.5536, df = 99, p-value <
2.2e-16
alternative hypothesis: true mean is not equal to 2
95 percent confidence interval:
 2.993364 3.405311
sample estimates:
mean of x
 3.199337



                                                  79
Testing for the mean Wilcox-test




> wilcox.test(nvec, mu = 3)
Wilcoxon signed rank test with
continuity correction

data: nvec
V = 3056, p-value = 0.06815
alternative hypothesis: true location is not equal to 3




                                                  80
Two Sample Tests


> nvec2 <- rnorm(100, 2)
> t.test(nvec, nvec2)
Welch Two Sample t-test

data: nvec and nvec2
t = 7.4455, df = 195.173, p-value =
3.025e-12
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.8567072 1.4741003
sample estimates:
mean of x mean of y
 3.199337 2.033933


                                                  81
Two Sample Tests




> wilcox.test(nvec, nvec2)
Wilcoxon rank sum test with
continuity correction

data: nvec and nvec2
W = 7729, p-value = 2.615e-11
alternative hypothesis: true location shift is not equal to 0




                                                  82
Paired Tests


> t.test(nvec, nvec2, paired = TRUE)
Paired t-test

data: nvec and nvec2
t = 7.6648, df = 99, p-value =
1.245e-11
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.863712 1.467096
sample estimates:
mean of the differences
               1.165404



                                                  83
Paired Tests




> wilcox.test(nvec, nvec2, paired = TRUE)
Wilcoxon signed rank test with
continuity correction

data: nvec and nvec2
V = 4314, p-value = 7.776e-10
alternative hypothesis: true location shift is not equal to 0




                                                  84
Density Estimation




> library(MASS)
> truehist(nvec, nbins = 20, xlim = c(0,
+     6), ymax = 0.7)
> lines(density(nvec, width = "nrd"))




                                           85
Density Estimation
>   data(geyser)
>   geyser2 <- data.frame(as.data.frame(geyser)[-1,
+       ], pduration = geyser$duration[-299])
>   attach(geyser2)
>   par(mfrow = c(2, 2))
>   plot(pduration, waiting, xlim = c(0.5,
+       6), ylim = c(40, 110), xlab = "previous duration",
+       ylab = "waiting")
>   f1 <- kde2d(pduration, waiting,
+       n = 50, lims = c(0.5, 6, 40,
+           110))
>   image(f1, zlim = c(0, 0.075), xlab = "previous duration",
+       ylab = "waiting")
>   f2 <- kde2d(pduration, waiting,
+       n = 50, lims = c(0.5, 6, 40,
+           110), h = c(width.SJ(duration),
+           width.SJ(waiting)))
                                                    86
Simple Linear Model

Fitting a simple linear model in R is done using the lm function
> data(hills)
> mhill <- lm(time ~ dist, data = hills)
> class(mhill)
[1] "lm"
> mhill
Call:
lm(formula = time ~ dist, data = hills)

Coefficients:
(Intercept)                dist
     -4.841               8.330


                                                                   87
Simple Linear Model




> summary(mhill)
> names(mhill)
> mhill$residuals




                      88
Multivariate Linear Model



> mhill2 <- lm(time ~ dist + climb,
+     data = hills)
> mhill2
Call:
lm(formula = time ~ dist + climb, data = hills)

Coefficients:
(Intercept)         dist       climb
   -8.99204      6.21796     0.01105




                                                  89
Multifactor Linear Model




> summary(mhill2)
> update(update(mhill2, weights = 1/hills$dist^2))
> predict(mhill2, hills[2:3, ])




                                                     90
Bayes Rule




                         p(y|x) = \frac{p(x|y)\, p(y)}{p(x)}                    (1)

Consider the problem of email filtering: x is the email e.g. in the form
of a word vector, y the label spam, ham




                                                                      91
Prior p(y )




              p(ham) \approx \frac{m_{ham}}{m_{total}}, \qquad p(spam) \approx \frac{m_{spam}}{m_{total}}        (2)




                                                         92
Likelihood Ratio



                         p(y|x) = \frac{p(x|y)\, p(y)}{p(x)}                    (3)

key problem: we do not know p(x|y) or p(x). We can get rid of p(x) by
settling for a likelihood ratio:

              L(x) := \frac{p(spam|x)}{p(ham|x)} = \frac{p(x|spam)\, p(spam)}{p(x|ham)\, p(ham)}                (4)

Whenever L(x) exceeds a given threshold c we decide that x is spam
and consequently reject the e-mail
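
As a sketch of this decision rule in R: assuming a fitted classifier that
returns the two class posteriors, e.g. the naiveBayes model built a few
slides later (model, spamtest and the threshold c = 1 are assumptions here):

> post <- predict(model, spamtest, type = "raw")  # posteriors p(y|x)
> L <- post[, "spam"]/post[, "nonspam"]           # likelihood ratio L(x)
> ifelse(L > 1, "spam", "nonspam")                # decide with c = 1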




                                                                   93
Computing p(x|y )



Key assumption in Naive Bayes is that the variables (or elements of x)
are conditionally independent i.e. words in emails are independent of
each other. We can then compute p(x|y ) in a naive fashion by
assuming that:
                      p(x|y) = \prod_{j=1}^{N_{words \in x}} p(w_j \mid y)               (5)

w_j denotes the j-th word in document x. Estimates for p(w|y) can be
obtained, for instance, by simply counting the frequency occurrence of
the word within documents of a given class




                                                                    94
Computing p(x|y)




         p(w|spam) \approx \frac{\sum_{i=1}^{m} \sum_{j=1}^{N_{words \in x_i}} \mathbf{1}\{y_i = spam \wedge w_{ij} = w\}}{\sum_{i=1}^{m} \sum_{j=1}^{N_{words \in x_i}} \mathbf{1}\{y_i = spam\}}            (6)

\mathbf{1}\{y_i = spam \wedge w_{ij} = w\} equals 1 if x_i is labeled as spam and w occurs
as the j-th word in x_i. The denominator counts the number of words in
spam documents.




                                                                    95
Laplacian smoothing




Issue: estimating p(w|y ) for words w which we might not have seen
before.
Solution:
increment all counts by 1. This method is commonly referred to as
Laplace smoothing.
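
In the naiveBayes function from package e1071 (used on the next slide) this
corresponds to the laplace argument; a minimal sketch, assuming the
spamtrain data frame defined there. Note that in e1071 Laplace smoothing
only affects categorical (factor) predictors; numeric predictors are
modelled with a Gaussian instead:

> library(e1071)
> model <- naiveBayes(type ~ ., data = spamtrain,
+     laplace = 1)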




                                                                     96
Naive Bayes in R


>   library(kernlab)
>   library(e1071)
>   data(spam)
>   idx <- sample(1:dim(spam)[1], 300)
>   spamtrain <- spam[-idx, ]
>   spamtest <- spam[idx, ]
>   model <- naiveBayes(type ~ ., data = spamtrain)
>   predict(model, spamtest)
>   table(predict(model, spamtest),
+       spamtest$type)
>   predict(model, spamtest, type = "raw")




                                                      97
k-Nearest Neighbors


An even simpler estimator than Naive Bayes is nearest neighbors. In
its most basic form it assigns the label of its nearest neighbor to an
observation x.
Identify the k nearest neighbors given a distance metric d(x, x') (e.g.
Euclidean distance) and determine the label by a majority vote.



                                                                      98
k-Nearest Neighbors




> library(animation)
> knn.ani(k = 4)




                       99
KNN in R




> library(class)
> pred <- knn(spamtrain[, -58], spamtest[,
+     -58], spamtrain[, 58])
> table(pred, spamtest$type)




                                     100
Support Vector Machines




                          101
Support Vector Machines




                          102
Support Vector Machines




                          103
Support Vector Machines




   Support Vector Machines work by maximizing a margin between a
   hyperplane and the data.
   SVM is a simple well understood linear method
   The optimization problem is convex thus only one optimal solution
   exists
   Excellent performance




                                                                 104
Kernels: Data Mapping
Data are implicitly mapped into a high dimensional feature space
Φ : X → H

                               \Phi : R^2 \to R^3

              (x_1, x_2) \to (z_1, z_2, z_3) := (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)




                                                                    105
Kernel Methods: Kernel Function

   In kernel methods learning takes place in the feature space; the data
   only appear inside inner products ⟨Φ(x), Φ(x′)⟩
   The inner product is given by a kernel function (“kernel trick”)

                         k(x, x') = \langle \Phi(x), \Phi(x') \rangle

   For the mapping above the inner product is:

        \langle \Phi(x), \Phi(x') \rangle = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)\, (x_1'^2, \sqrt{2}\, x_1' x_2', x_2'^2)^T
                                          = \langle x, x' \rangle^2
                                          = k(x, x')
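
This identity is easy to check numerically in R; a small sketch (the two
vectors are arbitrary examples):

> phi <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)
> x <- c(1, 2)
> z <- c(3, 1)
> sum(phi(x) * phi(z))  # inner product in feature space
[1] 25
> sum(x * z)^2          # kernel evaluated in input space
[1] 25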


                                                                          106
Kernel Functions




   Gaussian kernel  k(x, x') = \exp(-\sigma \|x - x'\|^2)
   Polynomial kernel  k(x, x') = (\langle x, x' \rangle + c)^p
   String kernels (for text):

           k(x, x') = \sum_{s \sqsubseteq x,\; s' \sqsubseteq x'} \lambda_s\, \delta_{s,s'} = \sum_{s \in A^*} num_s(x)\, num_s(x')\, \lambda_s

   Kernels on Graphs, Mismatch kernels etc.




                                                                             107
SVM Primal Optimization Problem


       minimize      t(w, \xi) = \frac{1}{2}\|w\|^2 + \frac{C}{m}\sum_{i=1}^{m} \xi_i

       subject to    y_i (\langle \Phi(x_i), w \rangle + b) \ge 1 - \xi_i        (i = 1, \ldots, m)
                     \xi_i \ge 0        (i = 1, \ldots, m)

where m is the number of training patterns, and y_i = \pm 1.
SVM Decision Function

                        f(x_i) = \langle w, \Phi(x_i) \rangle + b

                        f_i = \sum_{j=1}^{m} k(x_i, x_j)\, \alpha_j



                                                                                      108
SVM for Regression




                     109
SVM in R




>   filter <- ksvm(type ~ ., data = spamtrain,
+       kernel = "rbfdot", kpar = list(sigma = 0.05),
+       C = 5, cross = 3)
>   filter
>   mailtype <- predict(filter, spamtest[,
+       -58])
>   table(mailtype, spamtest[, 58])




                                                   110
SVM in R




>   filter <- ksvm(type ~ ., data = spamtrain,
+       kernel = "rbfdot", kpar = list(sigma = 0.05),
+       C = 5, cross = 3, prob.model = TRUE)
>   filter
>   mailpro <- predict(filter, spamtest[,
+       -58], type = "prob")
>   mailpro
SVM in R



>   x <- rbind(matrix(rnorm(120), ,
+       2), matrix(rnorm(120, mean = 3),
+       , 2))
>   y <- matrix(c(rep(1, 60), rep(-1,
+       60)))
>   svp <- ksvm(x, y, type = "C-svc",
+       kernel = "vanilladot", C = 200)
>   svp
>   plot(svp, data = x)




                                           112
SVM in R




> svp <- ksvm(x, y, type = "C-svc",
+     kernel = "rbfdot", kpar = list(sigma = 1),
+     C = 10)
> plot(svp, data = x)




                                                   113
SVM in R




> svp <- ksvm(x, y, type = "C-svc",
+     kernel = "rbfdot", kpar = list(sigma = 1),
+     C = 10)
> plot(svp, data = x)




                                                   114
SVM in R




>   data(reuters)
>   is(reuters)
>   tsv <- ksvm(reuters, rlabels, kernel = "stringdot",
+       kpar = list(length = 5), cross = 3,
+       C = 10)
>   tsv




                                                   115
SVM in R




>   x <- seq(-20, 20, 0.1)
>   y <- sin(x)/x + rnorm(401, sd = 0.03)
>   plot(x, y, type = "l")
>   regm <- ksvm(x, y, epsilon = 0.02,
+        kpar = list(sigma = 16), cross = 3)
>   regm
>   lines(x, predict(regm, x), col = "red")




                                               116
Decision Trees



[rpart classification tree for the iris data: the root node splits on
Petal.Length < 2.45, giving a setosa leaf; the remaining branch splits on
Petal.Width < 1.75 into versicolor and virginica leaves.]




                                                                       117
Decision Trees


Tree-based methods partition the feature space into a set of
rectangles, and then fit a simple model in each one. The partitioning is
done one variable at a time, choosing the split on that variable which
most reduces some impurity measure Q(x). In a node m of a classification
tree let

                        p_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k)                 (7)

be the proportion of class k observations in node m. Trees classify the
observations in node m to class

                         k(m) = \arg\max_k \, p_{mk}                        (8)




                                                                   118
Impurity Measures

Misclassification error:

                        \frac{1}{N_m} \sum_{i \in R_m} I(y_i \neq k(m))                 (9)

Gini index:

                        \sum_{k \neq k'} p_{mk}\, p_{mk'} = \sum_{k=1}^{K} p_{mk}(1 - p_{mk})           (10)

Cross-entropy or deviance:

                        -\sum_{k=1}^{K} p_{mk} \log(p_{mk})                (11)
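
As a quick sketch, all three measures can be computed in R from the class
proportions of a node; the proportions below are made-up example values:

> p <- c(0.6, 0.3, 0.1)
> 1 - max(p)            # misclassification error
[1] 0.4
> sum(p * (1 - p))      # Gini index
[1] 0.54
> -sum(p * log(p))      # cross-entropy / deviance
[1] 0.8979457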




                                                                 119
Classification Trees in R




>   library(rpart)
>   data(iris)
>   tree <- rpart(Species ~ ., data = iris)
>   plot(tree)
>   text(tree, digits = 3)




                                              120
Classification Trees in R

>   fit <- rpart(Kyphosis ~ Age + Number +
+       Start, data = kyphosis)
>   fit2 <- rpart(Kyphosis ~ Age +
+       Number + Start, data = kyphosis,
+       parms = list(prior = c(0.65,
+           0.35), split = "information"))
>   fit3 <- rpart(Kyphosis ~ Age +
+       Number + Start, data = kyphosis,
+       control = rpart.control(cp = 0.05))
>   par(mfrow = c(1, 3), xpd = NA)
>   plot(fit)
>   text(fit, use.n = TRUE)
>   plot(fit2)
>   text(fit2, use.n = TRUE)
>   plot(fit3)
>   text(fit3, use.n = TRUE)
Regression Trees in R




> data(cpus, package = "MASS")
> library(tree)
> cpus.ltr <- tree(log10(perf) ~
+     syct + mmin + mmax + cach +
+          chmin + chmax, cpus)
> cpus.ltr




                                    122
Random Forests

RF is an ensemble classifier that consists of many decision trees.
Decisions are made using a majority vote over the trees.
    Number of training cases be N, and the number of variables in the
    classifier be M.
    m is the number of input variables to be used to determine the
    decision at a node of the tree; m should be much less than M.
    Sample a training set for this tree by choosing N times with
    replacement from all N available training cases (i.e. take a
    bootstrap sample).
    For each node of the tree, randomly choose m variables on which
    to base the decision at that node.
    Calculate the best split based on these m variables in the training
    set.
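
In the randomForest function used on the next slide, the number of trees and
the number m of variables tried at each split are controlled by the ntree and
mtry arguments; an illustrative call (these particular values are arbitrary):

> library(randomForest)
> filter <- randomForest(type ~ ., data = spam,
+     ntree = 500, mtry = 7)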


                                                                     123
Random Forests in R




> library(randomForest)
> filter <- randomForest(type ~ .,
+     data = spam)
> filter




                                     124
Principal Component Analysis (PCA)


   A projection method finding projections of maximal variability
   i.e. it seeks linear combinations of the columns of X with maximal
   (or minimal) variance
   The first k principal components span a subspace containing the
   “best” k-dimensional view of the data
   Projecting the data on the first few principal components is often
   useful to reveal structure in the data
   PCA depends on the scaling of the original variables,
   thus it is conventional to do PCA using the correlation matrix,
   implicitly rescaling all the variables to unit sample variance.




                                                                     125
Principal Component Analysis (PCA)




                                     126
Principal Components Analysis (PCA)




>   ir.pca <- princomp(log(iris[, -5]),
+       cor = T)
>   summary(ir.pca)
>   loadings(ir.pca)
>   ir.pc <- predict(ir.pca)




                                          127
PCA Plot




> plot(ir.pc[, 1:2], xlab = "first principal component",
+     ylab = "second principal component")
> text(ir.pc[, 1:2], labels = as.character(iris[,
+     5]), col = as.numeric(iris[,
+     5]))




                                                 128
kernel PCA



>   test <- sample(1:150, 20)
>   kpc <- kpca(~., data = iris[-test,
+       -5], kernel = "rbfdot", kpar = list(sigma = 0.2),
+       features = 2)
>   plot(rotated(kpc), col = as.integer(iris[-test,
+       5]), xlab = "1st Principal Component",
+       ylab = "2nd Principal Component")
>   text(rotated(kpc), labels = as.character(iris[-test,
+       5]), col = as.numeric(iris[-test,
+       5]))




                                                   129
kernel PCA




> kpc <- kpca(reuters, kernel = "stringdot",
+     kpar = list(length = 5), features = 2)
> plot(rotated(kpc), col = as.integer(rlabels),
+     xlab = "1st Principal Component",
+     ylab = "2nd Principal Component")




                                                  130
Distance methods



   Distance Methods work by representing the cases in a low
   dimensional Euclidean space so that their proximity reflects the
   similarity of their variables
   A distance measure needs to be defined when using Distance
   Methods
   The function dist() implements four distance measures
   between the points in the p-dimensional space of variables; the
   default is Euclidean distance
   can be used with categorical variables!
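
The distance measure is chosen via the method argument of dist(); a small
sketch on the iris data (the methods shown are examples of those available):

> d1 <- dist(iris[, -5])                        # Euclidean (the default)
> d2 <- dist(iris[, -5], method = "manhattan")  # Manhattan distance
> d3 <- dist(iris[, -5], method = "maximum")    # maximum (Chebyshev) distance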




                                                                     131
Multidimensional scaling




> dm <- dist(iris[, -5])
> mds <- cmdscale(dm, k = 2)
> plot(mds, xlab = "1st coordinate",
+     ylab = "2nd coordinate", col = as.numeric(iris[,
+         5]))




                                                 132
Multidimensional scaling for categorical variables




>   library(kernlab)
>   library(cluster)
>   data(income)
>   inc <- income[1:300, ]
>   daisy(inc)
>   csc <- cmdscale(as.dist(daisy(inc)),
+       k = 2)
>   plot(csc, xlab = "1st coordinate",
+       ylab = "2nd coordinate")




                                              133
Sammon’s non-linear mapping




> snm <- sammon(dist(iris[-143, -5]))
> plot(snm$points, xlab = "1st coordinate",
+     ylab = "2nd coordinate", col = as.numeric(iris[-143,
+         5]))




                                                 134
Biplot




>   data(state)
>   state <- state.x77[, 2:7]
>   row.names(state) <- state.abb
>   biplot(princomp(state, cor = T),
+       pc.biplot = T, cex = 0.7, expand = 0.8)




                                                  135
Stars plot




> stars(state.x77[, c(7, 4, 6, 2, 5, 3)],
+     full = FALSE, key.loc = c(10, 2))




                                                                           136
Factor Analysis




   Principal component analysis looks for linear combinations of the
   data matrix that are uncorrelated and of high variance
   Factor analysis seeks linear combinations of the variables, called
   factors, that represent underlying fundamental quantities of which
   the observed variables are expressions
   the idea being that a small number of factors might explain a large
   number of measurements in an observational study




                                                                  137
Factor Analysis




>   data(swiss)
>   swiss.x <- as.matrix(swiss[, -1])
>   swiss.FA1 <- factanal(swiss.x,
+       factors = 2, method = "mle")
>   swiss.FA1
>   summary(swiss.FA1)




                                        138
k-means Clustering




Partition data into k sets S = {S_1, S_2, ..., S_k} so as to minimize the
within-cluster sum of squares (WCSS):

                         \arg\min_S \sum_{i=1}^{k} \sum_{x_j \in S_i} \|x_j - \mu_i\|^2                (12)




                                                                               139
k-means Clustering




> data(iris)
> clust <- kmeans(iris[, -5], centers = 3)
> clust




                                             140
k-means Clustering I




>   data(swiss)
>   swiss.x <- as.matrix(swiss[, -1])
>   km <- kmeans(swiss.x, 3)
>   swiss.pca <- princomp(swiss.x)
>   swiss.px <- predict(swiss.pca)




                                        141
k-means Clustering II




> dimnames(km$centers)[[2]] <- dimnames(swiss.x)[[2]]
> swiss.centers <- predict(swiss.pca,
+     km$centers)
> plot(swiss.px[, 1:2], xlab = "first principal component",
+     ylab = "second principal component",
+     col = km$cluster)
k-means Clustering III




> points(swiss.centers[, 1:2], pch = 3,
+     cex = 3)
> identify(swiss.px[, 1:2], cex = 0.5)




                                          143
kernel k-means




Partition data into k sets S = {S_1, S_2, ..., S_k} so as to minimize the
within-cluster sum of squares (WCSS) in kernel feature space:

                     \arg\min_S \sum_{i=1}^{k} \sum_{x_j \in S_i} \|\Phi(x_j) - \Phi(\mu_i)\|^2            (13)




                                                                               144
kernel k-means




> sc <- kkmeans(as.matrix(iris[,
+     -5]), kernel = "rbfdot", centers = 3)
> sc
> matchClasses(table(sc, iris[, 5]))




                                              145
kernel k-means




>   str <- stringdot(length = 4)
>   K <- kernelMatrix(str, reuters)
>   sc <- kkmeans(K, centers = 2)
>   sc
>   matchClasses(table(sc, rlabels))




                                       146
Spectral Clustering in kernlab




   Embedding data points into the subspace of eigenvectors of a
   kernel matrix
   Embedded points clustered using k-means
   Better performance (embedded points form tighter clusters)
   Can deal with clusters that do not form convex regions




                                                                  147
Spectral Clustering in R




>   data(spirals)
>   plot(spirals)
>   sc <- specc(spirals, centers = 2)
>   sc
>   plot(spirals, col = sc)




                                        148
Spectral Clustering


>   sc <- specc(reuters, kernel = "stringdot",
+       kpar = list(length = 5), centers = 2)
>   matchClasses(table(sc, rlabels))
>   par(mfrow = c(1, 2))
>   kpc <- kpca(reuters, kernel = "stringdot",
+       kpar = list(length = 5), features = 2)
>   plot(rotated(kpc), col = as.integer(rlabels),
+       xlab = "1st Principal Component",
+       ylab = "2nd Principal Component")
>   plot(rotated(kpc), col = as.integer(sc),
+       xlab = "1st Principal Component",
+       ylab = "2nd Principal Component")



                                                    149
Hierarchical Clustering




> data(state)
> h <- hclust(dist(state.x77), method = "single")
> plot(h)




                                                    150
Hierarchical Clustering




> library(cluster)
> pltree(diana(state.x77))




                             151

محاضرة برنامج التحليل الكمي   R program د.هديل القفيديمحاضرة برنامج التحليل الكمي   R program د.هديل القفيدي
محاضرة برنامج التحليل الكمي R program د.هديل القفيدي
 
R Programming Language
R Programming LanguageR Programming Language
R Programming Language
 
PPT ON MACHINE LEARNING by Ragini Ratre
PPT ON MACHINE LEARNING by Ragini RatrePPT ON MACHINE LEARNING by Ragini Ratre
PPT ON MACHINE LEARNING by Ragini Ratre
 
1_Introduction.pptx
1_Introduction.pptx1_Introduction.pptx
1_Introduction.pptx
 

More from Alexandros Karatzoglou

Deep Learning for Recommender Systems RecSys2017 Tutorial
Deep Learning for Recommender Systems RecSys2017 Tutorial Deep Learning for Recommender Systems RecSys2017 Tutorial
Deep Learning for Recommender Systems RecSys2017 Tutorial Alexandros Karatzoglou
 
Deep Learning for Recommender Systems - Budapest RecSys Meetup
Deep Learning for Recommender Systems  - Budapest RecSys MeetupDeep Learning for Recommender Systems  - Budapest RecSys Meetup
Deep Learning for Recommender Systems - Budapest RecSys MeetupAlexandros Karatzoglou
 
Machine Learning for Recommender Systems MLSS 2015 Sydney
Machine Learning for Recommender Systems MLSS 2015 SydneyMachine Learning for Recommender Systems MLSS 2015 Sydney
Machine Learning for Recommender Systems MLSS 2015 SydneyAlexandros Karatzoglou
 
Ranking and Diversity in Recommendations - RecSys Stammtisch at SoundCloud, B...
Ranking and Diversity in Recommendations - RecSys Stammtisch at SoundCloud, B...Ranking and Diversity in Recommendations - RecSys Stammtisch at SoundCloud, B...
Ranking and Diversity in Recommendations - RecSys Stammtisch at SoundCloud, B...Alexandros Karatzoglou
 
Learning to Rank for Recommender Systems - ACM RecSys 2013 tutorial
Learning to Rank for Recommender Systems -  ACM RecSys 2013 tutorialLearning to Rank for Recommender Systems -  ACM RecSys 2013 tutorial
Learning to Rank for Recommender Systems - ACM RecSys 2013 tutorialAlexandros Karatzoglou
 
ESSIR 2013 Recommender Systems tutorial
ESSIR 2013 Recommender Systems tutorial ESSIR 2013 Recommender Systems tutorial
ESSIR 2013 Recommender Systems tutorial Alexandros Karatzoglou
 
Multiverse Recommendation: N-dimensional Tensor Factorization for Context-awa...
Multiverse Recommendation: N-dimensional Tensor Factorization for Context-awa...Multiverse Recommendation: N-dimensional Tensor Factorization for Context-awa...
Multiverse Recommendation: N-dimensional Tensor Factorization for Context-awa...Alexandros Karatzoglou
 
TFMAP: Optimizing MAP for Top-N Context-aware Recommendation
TFMAP: Optimizing MAP for Top-N Context-aware RecommendationTFMAP: Optimizing MAP for Top-N Context-aware Recommendation
TFMAP: Optimizing MAP for Top-N Context-aware RecommendationAlexandros Karatzoglou
 
CLiMF: Collaborative Less-is-More Filtering
CLiMF: Collaborative Less-is-More FilteringCLiMF: Collaborative Less-is-More Filtering
CLiMF: Collaborative Less-is-More FilteringAlexandros Karatzoglou
 

More from Alexandros Karatzoglou (9)

Deep Learning for Recommender Systems RecSys2017 Tutorial
Deep Learning for Recommender Systems RecSys2017 Tutorial Deep Learning for Recommender Systems RecSys2017 Tutorial
Deep Learning for Recommender Systems RecSys2017 Tutorial
 
Deep Learning for Recommender Systems - Budapest RecSys Meetup
Deep Learning for Recommender Systems  - Budapest RecSys MeetupDeep Learning for Recommender Systems  - Budapest RecSys Meetup
Deep Learning for Recommender Systems - Budapest RecSys Meetup
 
Machine Learning for Recommender Systems MLSS 2015 Sydney
Machine Learning for Recommender Systems MLSS 2015 SydneyMachine Learning for Recommender Systems MLSS 2015 Sydney
Machine Learning for Recommender Systems MLSS 2015 Sydney
 
Ranking and Diversity in Recommendations - RecSys Stammtisch at SoundCloud, B...
Ranking and Diversity in Recommendations - RecSys Stammtisch at SoundCloud, B...Ranking and Diversity in Recommendations - RecSys Stammtisch at SoundCloud, B...
Ranking and Diversity in Recommendations - RecSys Stammtisch at SoundCloud, B...
 
Learning to Rank for Recommender Systems - ACM RecSys 2013 tutorial
Learning to Rank for Recommender Systems -  ACM RecSys 2013 tutorialLearning to Rank for Recommender Systems -  ACM RecSys 2013 tutorial
Learning to Rank for Recommender Systems - ACM RecSys 2013 tutorial
 
ESSIR 2013 Recommender Systems tutorial
ESSIR 2013 Recommender Systems tutorial ESSIR 2013 Recommender Systems tutorial
ESSIR 2013 Recommender Systems tutorial
 
Multiverse Recommendation: N-dimensional Tensor Factorization for Context-awa...
Multiverse Recommendation: N-dimensional Tensor Factorization for Context-awa...Multiverse Recommendation: N-dimensional Tensor Factorization for Context-awa...
Multiverse Recommendation: N-dimensional Tensor Factorization for Context-awa...
 
TFMAP: Optimizing MAP for Top-N Context-aware Recommendation
TFMAP: Optimizing MAP for Top-N Context-aware RecommendationTFMAP: Optimizing MAP for Top-N Context-aware Recommendation
TFMAP: Optimizing MAP for Top-N Context-aware Recommendation
 
CLiMF: Collaborative Less-is-More Filtering
CLiMF: Collaborative Less-is-More FilteringCLiMF: Collaborative Less-is-More Filtering
CLiMF: Collaborative Less-is-More Filtering
 

Machine Learning in R with R

  • 1. Machine Learning in R Alexandros Karatzoglou1 1 Telefonica Research Barcelona, Spain December 15, 2010 1
  • 2. Outline 1 Introduction to R CRAN Objects and Operations Basic Data Structures Missing Values Entering Data File Input and Output Installing Packages Indexing and Subsetting 2 Basic Plots 3 Lattice Plots 4 Basic Statistics & Machine Learning Tests 5 Linear Models 6 Naive Bayes 7 Support Vector Machines 8 Decision Trees 9 Dimensionality Reduction 2
  • 3. R Environment for statistical data analysis, inference and visualization. Ports for Unix, Windows and MacOSX Highly extensible through user-defined functions Generic functions and conventions for standard operations like plot, predict etc. ∼ 1200 add-on packages contributed by developers from all over the world e.g. Multivariate Statistics, Machine Learning, Natural Language Processing, Bioinformatics (Bioconductor), SNA, . Interfaces to C, C++, Fortran, Java 3
  • 4. Outline 1 Introduction to R CRAN Objects and Operations Basic Data Structures Missing Values Entering Data File Input and Output Installing Packages Indexing and Subsetting 2 Basic Plots 3 Lattice Plots 4 Basic Statistics & Machine Learning Tests 5 Linear Models 6 Naive Bayes 7 Support Vector Machines 8 Decision Trees 9 Dimensionality Reduction 4
  • 5. Comprehensive R Archive Network (CRAN) CRAN includes packages which provide additional functionality to the one existing in R Currently over 1200 packages in areas like multivariate statistics, time series analysis, Machine Learning, Geo-statistics, environmental statistics etc. packages are written mainly by academics, PhD students, or company staff Some of the package have been ordered into Task Views 5
  • 6. CRAN Task Views Some of the CRAN packages are ordered into categories called Task Views Task Views are maintained by people with experience in the corresponding field there are Task Views on Econometrics, Environmental Statistics, Finance, Genetics, Graphics Machine Learning, Multivariate Statistics, Natural Language Processing, Social Sciences, and others. 6
  • 7. Online documentation 1 Go to http://www.r-project.org 2 On the left side under Documentation 3 Official Manuals, FAQ, Newsletter, Books 4 Other and click on other publications contains a collection of contributed guides and manuals 7
  • 8. Mailing lists R-announce is for announcing new major enhancements in R R-packages announcing new version of packages R-help is for R related user questions! R-devel is for R developers 8
  • 9. Command Line Interface R does not have a “real” gui All computations and statistics are performed with commands These commands are called “functions” in R 9
  • 10. help click on help search help functionality FAQ link to reference web pages apropos 10
  • 11. Outline 1 Introduction to R CRAN Objects and Operations Basic Data Structures Missing Values Entering Data File Input and Output Installing Packages Indexing and Subsetting 2 Basic Plots 3 Lattice Plots 4 Basic Statistics & Machine Learning Tests 5 Linear Models 6 Naive Bayes 7 Support Vector Machines 8 Decision Trees 9 Dimensionality Reduction 11
  • 12. Objects Everything in R is an object Everything in R has a class. Data, intermediate results and even e.g. the result of a regression are stored in R objects The class of the object describes both what the object contains and what many standard functions do with it. Objects are usually accessed by name. Syntactic names for objects are made up from letters, the digits 0 to 9 in any non-initial position and also the period “.” 12
  • 13. Assignment and Expression Operations R commands are either assignments or expressions Commands are separated either by a semicolon ; or newline An expression command is evaluated and (normally) printed An assignment command evaluates an expression and passes the value to a variable but the result is not printed 13
  • 15. Assignment Operations > res <- 1 + 1 “<-” is the assignment operator in R. A series of commands in a file (script) can be executed with the command : source(“myfile.R”) 15
  • 16. Sample session > 1:5 [1] 1 2 3 4 5 > powers.of.2 <- 2^(1:5) > powers.of.2 [1] 2 4 8 16 32 > class(powers.of.2) [1] "numeric" > ls() [1] "powers.of.2" "res" > rm(powers.of.2) 16
  • 17. Workspace R stores objects in a workspace that is kept in memory. When quitting, R asks you if you want to save that workspace. The workspace containing all objects you work on can then be restored next time you work with R along with a history of the used commands. 17
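A quick sketch of the usual workspace housekeeping commands may help here; the file names below are illustrative, not from the slides:
> ls()                           # list the objects currently in the workspace
> save.image("mywork.RData")     # save the whole workspace to a file
> savehistory("mywork.Rhistory") # save the command history
> rm(list = ls())                # clear the workspace
> load("mywork.RData")           # restore the saved objects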
  • 18. Outline 1 Introduction to R CRAN Objects and Operations Basic Data Structures Missing Values Entering Data File Input and Output Installing Packages Indexing and Subsetting 2 Basic Plots 3 Lattice Plots 4 Basic Statistics & Machine Learning Tests 5 Linear Models 6 Naive Bayes 7 Support Vector Machines 8 Decision Trees 9 Dimensionality Reduction 18
  • 19. Basic Data Structures Vectors Factors Data Frame Matrices Lists 19
  • 20. Vectors In R the building blocks for storing data are vectors of various types. The most common classes are: "character", a vector of character strings of varying length. These are normally entered and printed surrounded by double quotes. "numeric", a vector of real numbers. "integer", a vector of (signed) integers. "logical", a vector of logical (true or false) values. The values are printed and as TRUE and FALSE. 20
  • 21. Numeric Vectors > vect <- c(1, 2, 99, 6, 8, 9) > is(vect) [1] "numeric" "vector" > vect[2] [1] 2 > vect[2:3] [1] 2 99 > length(vect) [1] 6 > sum(vect) [1] 125 21
  • 22. Character Vectors > vect3 <- c("austria", "spain", + "france", "uk", "belgium", + "poland") > is(vect3) [1] "character" [2] "vector" [3] "data.frameRowLabels" [4] "SuperClassMethod" > vect3[2] [1] "spain" > vect3[2:3] [1] "spain" "france" > length(vect3) [1] 6 22
  • 23. Logical Vectors > vect4 <- c(TRUE, TRUE, FALSE, TRUE, + FALSE, TRUE) > is(vect4) [1] "logical" "vector" 23
  • 24. Factors > citizen <- factor(c("uk", "us", + "no", "au", "uk", "us", "us")) > citizen [1] uk us no au uk us us Levels: au no uk us > unclass(citizen) [1] 3 4 2 1 3 4 4 attr(,"levels") [1] "au" "no" "uk" "us" > citizen[5:7] [1] uk us us Levels: au no uk us > citizen[5:7, drop = TRUE] [1] uk us us Levels: uk us 24
  • 25. Factors Ordered > income <- ordered(c("Mid", "Hi", + "Lo", "Mid", "Lo", "Hi"), levels = c("Lo", + "Mid", "Hi")) > income [1] Mid Hi Lo Mid Lo Hi Levels: Lo < Mid < Hi > as.numeric(income) [1] 2 3 1 2 1 3 > class(income) [1] "ordered" "factor" > income[1:3] [1] Mid Hi Lo Levels: Lo < Mid < Hi 25
  • 26. Data Frames A data frame is the type of object normally used in R to store a data matrix. It should be thought of as a list of variables of the same length, but possibly of different types (numeric, factor, character, logical, etc.). 26
  • 27. Data Frames > library(MASS) > data(painters) > painters[1:3, ] Composition Drawing Colour Da Udine 10 8 16 Da Vinci 15 16 4 Del Piombo 8 13 16 Expression School Da Udine 3 A Da Vinci 14 A Del Piombo 7 A > names(painters) [1] "Composition" "Drawing" [3] "Colour" "Expression" [5] "School" > row.names(painters) [1] "Da Udine" "Da Vinci"
  • 28. Data Frames Data frames are by far the commonest way to store data in R They are normally imported by reading a file or from a spreadsheet or database. However, vectors of the same length can be collected into a data frame by the function data.frame 28
  • 29. Data Frame > mydf <- data.frame(vect, vect3, + income) > summary(mydf) vect vect3 income Min. : 1.00 austria:1 Lo :2 1st Qu.: 3.00 belgium:1 Mid:2 Median : 7.00 france :1 Hi :2 Mean :20.83 poland :1 3rd Qu.: 8.75 spain :1 Max. :99.00 uk :1 > mydf <- data.frame(vect, I(vect3), + income) 29
  • 30. Matrices A data frame may be printed like a matrix, but it is not a matrix. Matrices like vectors have all their elements of the same type 30
  • 31. > mymat <- matrix(1:10, 2, 5) > class(mymat) [1] "matrix" > dim(mymat) [1] 2 5 31
  • 32. Lists A list is a vector of other R objects. Lists are used to collect together items of different classes. For example, an employee record might be created by : 32
  • 33. Lists > Empl <- list(employee = "Anna", + spouse = "Fred", children = 3, + child.ages = c(4, 7, 9)) > Empl$employee [1] "Anna" > Empl$child.ages[2] [1] 7 33
  • 34. Basic Mathematical Operations > 5 - 3 [1] 2 > a <- 2:4 > b <- rep(1, 3) > a - b [1] 1 2 3 > a * b [1] 2 3 4 34
  • 35. Matrix Multiplication > a <- matrix(1:9, 3, 3) > b <- matrix(2, 3, 3) > c <- a %*% b 35
  • 36. Outline 1 Introduction to R CRAN Objects and Operations Basic Data Structures Missing Values Entering Data File Input and Output Installing Packages Indexing and Subsetting 2 Basic Plots 3 Lattice Plots 4 Basic Statistics & Machine Learning Tests 5 Linear Models 6 Naive Bayes 7 Support Vector Machines 8 Decision Trees 9 Dimensionality Reduction 36
  • 37. Missing Values Missing Values in R are labeled with the logical value NA 37
  • 38. Missing Values > mydf[mydf == 99] <- NA > mydf vect vect3 income 1 1 austria Mid 2 2 spain Hi 3 NA france Lo 4 6 uk Mid 5 8 belgium Lo 6 9 poland Hi 38
  • 39. Missing Values > a <- c(3, 5, 6) > a[2] <- NA > a [1] 3 NA 6 > is.na(a) [1] FALSE TRUE FALSE 39
  • 40. Special Values R has the also special values NaN , Inf and -Inf > d <- c(-1, 0, 1) > d/0 [1] -Inf NaN Inf 40
  • 41. Outline 1 Introduction to R CRAN Objects and Operations Basic Data Structures Missing Values Entering Data File Input and Output Installing Packages Indexing and Subsetting 2 Basic Plots 3 Lattice Plots 4 Basic Statistics & Machine Learning Tests 5 Linear Models 6 Naive Bayes 7 Support Vector Machines 8 Decision Trees 9 Dimensionality Reduction 41
  • 42. Entering Data For all but the smallest datasets the easiest way to get data into R is to import it from a connection such as a file 42
  • 43. Entering Data > a <- c(5, 8, 5, 4, 9, 7) > b <- scan() 43
  • 44. Entering Data > mydf2 <- edit(mydf) 44
  • 45. Outline 1 Introduction to R CRAN Objects and Operations Basic Data Structures Missing Values Entering Data File Input and Output Installing Packages Indexing and Subsetting 2 Basic Plots 3 Lattice Plots 4 Basic Statistics & Machine Learning Tests 5 Linear Models 6 Naive Bayes 7 Support Vector Machines 8 Decision Trees 9 Dimensionality Reduction 45
  • 46. File Input and Output read.table can be used to read data from a text file such as a csv read.table creates a data frame from the values and tries to guess the type of each variable > mydata <- read.table("file.csv", + sep = ",") 46
  • 47. Packages Packages contain additional functionality in the form of extra functions and data Installing a package can be done with the function install.packages() The default R installation contains a number of contributed packages like MASS, foreign, utils Before we can access the functionality in a package we have to load the package with the function library() 47
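For instance, installing and then loading kernlab (one of the packages used later in this deck) looks like this:
> install.packages("kernlab")    # fetch and install the package from CRAN
> library(kernlab)               # load it into the current session
> library(help = kernlab)        # short overview of what the package contains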
  • 48. Outline 1 Introduction to R CRAN Objects and Operations Basic Data Structures Missing Values Entering Data File Input and Output Installing Packages Indexing and Subsetting 2 Basic Plots 3 Lattice Plots 4 Basic Statistics & Machine Learning Tests 5 Linear Models 6 Naive Bayes 7 Support Vector Machines 8 Decision Trees 9 Dimensionality Reduction 48
  • 49. Packages Help on the contents of the packages is available > library(help = foreign) Help on installed packages is also available by help.start() Vignettes are also available for many packages 49
  • 50. SPSS files Package foreign contains functions that read data from SPSS (sav), STATA and other formats > library(foreign) > spssdata <- read.spss("ees04.sav", + to.data.frame = TRUE) 50
  • 51. Excel files To load Excel file into R an external package xlsReadWrite is needed We will install the package directly from CRAN using install.packages > install.packages("xlsReadWrite", + lib = "C:temp") 51
  • 52. Excel files > library(xlsReadWrite) > data <- read.xls("sampledata.xls") 52
  • 53. Data Export You can save your data by using the functions write.table() or save() write.table is more data specific, while save can save any R object save saves in a binary format while write.table() in text format. > write.table(mydata, file = "mydata.csv", + quote = FALSE) > save(mydata, file = "mydata.rda") > load(file = "mydata.rda") 53
  • 54. Outline 1 Introduction to R CRAN Objects and Operations Basic Data Structures Missing Values Entering Data File Input and Output Installing Packages Indexing and Subsetting 2 Basic Plots 3 Lattice Plots 4 Basic Statistics & Machine Learning Tests 5 Linear Models 6 Naive Bayes 7 Support Vector Machines 8 Decision Trees 9 Dimensionality Reduction 54
  • 55. Indexing R has a powerful set of Indexing capabilities This facilitates Data manipulation and can speed up data pre-processing Indexing Vectors, Data Frames and matrices is done in a similar manner 55
  • 56. Indexing by numeric vectors > letters[1:3] [1] "a" "b" "c" > letters[c(7, 9)] [1] "g" "i" > letters[-(1:15)] [1] "p" "q" "r" "s" "t" "u" "v" "w" "x" [10] "y" "z" 56
  • 57. Indexing by logical vectors > a <- 1:10 > a[c(3, 5, 7)] <- NA > is.na(a) [1] FALSE FALSE TRUE FALSE TRUE FALSE [7] TRUE FALSE FALSE FALSE > a[!is.na(a)] [1] 1 2 4 6 8 9 10 57
  • 58. Indexing Matrices and Data Frames > mymat[1, ] [1] 1 3 5 7 9 > mymat[, c(2, 3)] [,1] [,2] [1,] 3 5 [2,] 4 6 > mymat[1, -(1:3)] [1] 7 9 58
  • 59. Selecting Subsets > attach(painters) > painters[Colour >= 17, ] Composition Drawing Colour Bassano 6 8 17 Giorgione 8 9 18 Pordenone 8 14 17 Titian 12 15 18 Rembrandt 15 6 17 Rubens 18 13 17 Van Dyck 15 10 17 Expression School Bassano 0 D Giorgione 4 D Pordenone 5 D Titian 6 D Rembrandt 12 G Rubens 17 G 59
  • 60. Selecting Subsets > painters[Colour >= 15 & Composition > + 10, ] Composition Drawing Colour Tintoretto 15 14 16 Titian 12 15 18 Veronese 15 10 16 Corregio 13 13 15 Rembrandt 15 6 17 Rubens 18 13 17 Van Dyck 15 10 17 Expression School Tintoretto 4 D Titian 6 D Veronese 3 D Corregio 12 E Rembrandt 12 G Rubens 17 G 60
  • 61. Graphics R has a rich set of visualization capabilities scatter-plots, histograms, box-plots and more. 61
  • 63. Scatter Plot > plot(climb ~ time, hills) 63
  • 65. Box Plots > boxplot(count ~ spray, data = InsectSprays, + col = "lightgray") 65
  • 66. Formula Interface A formula is of the general form response ∼ expression where the left-hand side, response, may in some uses be absent and the right-hand side, expression, is a collection of terms joined by operators usually resembling an arithmetical expression. The meaning of the right-hand side is context dependent e.g. in linear and generalized linear modelling it specifies the form of the model matrix 66
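A small sketch of the common formula operators, using the hills data that appears on the following slides; the quadratic term is only an illustration, not a model advocated by the slides:
> library(MASS)
> data(hills)
> lm(time ~ dist + climb, data = hills)      # additive terms
> lm(time ~ dist * climb, data = hills)      # main effects plus their interaction
> lm(time ~ dist + I(dist^2), data = hills)  # I() protects arithmetic inside a formula
> plot(climb ~ time, data = hills)           # the same syntax chooses the plot axes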
  • 67. Scatterplot and line > library(MASS) > data(hills) > attach(hills) > plot(climb ~ time, hills, xlab = "Time", + ylab = "Feet") > abline(0, 40) > abline(lm(climb ~ time)) 67
  • 68. Multiple plots > par(mfrow = c(1, 2)) > plot(climb ~ time, hills, xlab = "Time", + ylab = "Feet") > plot(dist ~ time, hills, xlab = "Time", + ylab = "Dist") > par(mfrow = c(1, 1)) 68
  • 69. Plot to file > pdf(file = "H:/My Documents/plot.pdf") > plot(dist ~ time, hills, xlab = "Time", + ylab = "Dist", main = "Hills") > dev.off() 69
  • 71. Box-whisker plots > data(michelson) > bwplot(Expt ~ Speed, data = michelson, + ylab = "Experiment No.") > title("Speed of Light Data") 71
  • 72. Box-whisker plots > data(quine) > bwplot(Age ~ Days | Sex * Lrn * + Eth, data = quine, layout = c(4, + 2)) 72
  • 73. Descriptive Statistics > mean(hills$time) [1] 57.87571 > colMeans(hills) dist climb time 7.528571 1815.314286 57.875714 > median(hills$time) [1] 39.75 > quantile(hills$time) 0% 25% 50% 75% 100% 15.950 28.000 39.750 68.625 204.617 > var(hills$time) [1] 2504.073 > sd(hills$time) [1] 50.04072 73
  • 74. Descriptive Statistics > cor(hills) dist climb time dist 1.0000000 0.6523461 0.9195892 climb 0.6523461 1.0000000 0.8052392 time 0.9195892 0.8052392 1.0000000 > cov(hills) dist climb time dist 30.51387 5834.638 254.1944 climb 5834.63782 2621648.457 65243.2567 time 254.19442 65243.257 2504.0733 > cor(hills$time, hills$climb) [1] 0.8052392 74
  • 75. Distributions R contains a number of functions for drawing from a number of distribution like e.g. normal, uniform etc. 75
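Each distribution comes with four functions following the r/d/p/q naming convention; a minimal illustration (the parameter values are arbitrary):
> rnorm(5, mean = 3)   # r*: random draws
> dnorm(0)             # d*: density at a point
> pnorm(1.96)          # p*: cumulative probability
> qnorm(0.975)         # q*: quantile function
> runif(3)             # the same pattern applies to the uniform distribution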
  • 76. Sampling from a Normal Distribution > nvec <- rnorm(100, 3) 76
  • 77. Outline 1 Introduction to R CRAN Objects and Operations Basic Data Structures Missing Values Entering Data File Input and Output Installing Packages Indexing and Subsetting 2 Basic Plots 3 Lattice Plots 4 Basic Statistics & Machine Learning Tests 5 Linear Models 6 Naive Bayes 7 Support Vector Machines 8 Decision Trees 9 Dimensionality Reduction 77
  • 78. Testing for the mean T-test > t.test(nvec, mu = 1) One Sample t-test data: nvec t = 21.187, df = 99, p-value < 2.2e-16 alternative hypothesis: true mean is not equal to 1 95 percent confidence interval: 2.993364 3.405311 sample estimates: mean of x 3.199337 78
  • 79. Testing for the mean T-test > t.test(nvec, mu = 2) One Sample t-test data: nvec t = 11.5536, df = 99, p-value < 2.2e-16 alternative hypothesis: true mean is not equal to 2 95 percent confidence interval: 2.993364 3.405311 sample estimates: mean of x 3.199337 79
  • 80. Testing for the mean Wilcoxon test > wilcox.test(nvec, mu = 3) Wilcoxon signed rank test with continuity correction data: nvec V = 3056, p-value = 0.06815 alternative hypothesis: true location is not equal to 3 80
  • 81. Two Sample Tests > nvec2 <- rnorm(100, 2) > t.test(nvec, nvec2) Welch Two Sample t-test data: nvec and nvec2 t = 7.4455, df = 195.173, p-value = 3.025e-12 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 0.8567072 1.4741003 sample estimates: mean of x mean of y 3.199337 2.033933 81
  • 82. Two Sample Tests > wilcox.test(nvec, nvec2) Wilcoxon rank sum test with continuity correction data: nvec and nvec2 W = 7729, p-value = 2.615e-11 alternative hypothesis: true location shift is not equal to 0 82
  • 83. Paired Tests > t.test(nvec, nvec2, paired = TRUE) Paired t-test data: nvec and nvec2 t = 7.6648, df = 99, p-value = 1.245e-11 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 0.863712 1.467096 sample estimates: mean of the differences 1.165404 83
  • 84. Paired Tests > wilcox.test(nvec, nvec2, paired = TRUE) Wilcoxon signed rank test with continuity correction data: nvec and nvec2 V = 4314, p-value = 7.776e-10 alternative hypothesis: true location shift is not equal to 0 84
  • 85. Density Estimation > library(MASS) > truehist(nvec, nbins = 20, xlim = c(0, + 6), ymax = 0.7) > lines(density(nvec, width = "nrd")) 85
  • 86. Density Estimation > data(geyser) > geyser2 <- data.frame(as.data.frame(geyser)[-1, + ], pduration = geyser$duration[-299]) > attach(geyser2) > par(mfrow = c(2, 2)) > plot(pduration, waiting, xlim = c(0.5, + 6), ylim = c(40, 110), xlab = "previous duration", + ylab = "waiting") > f1 <- kde2d(pduration, waiting, + n = 50, lims = c(0.5, 6, 40, + 110)) > image(f1, zlim = c(0, 0.075), xlab = "previous duration", + ylab = "waiting") > f2 <- kde2d(pduration, waiting, + n = 50, lims = c(0.5, 6, 40, + 110), h = c(width.SJ(duration), + width.SJ(waiting))) 86
  • 87. Simple Linear Model Fitting a simple linear model in R is done using the lm function > data(hills) > mhill <- lm(time ~ dist, data = hills) > class(mhill) [1] "lm" > mhill Call: lm(formula = time ~ dist, data = hills) Coefficients: (Intercept) dist -4.841 8.330 87
  • 88. Simple Linear Model > summary(mhill) > names(mhill) > mhill$residuals 88
  • 89. Multivariate Linear Model > mhill2 <- lm(time ~ dist + climb, + data = hills) > mhill2 Call: lm(formula = time ~ dist + climb, data = hills) Coefficients: (Intercept) dist climb -8.99204 6.21796 0.01105 89
  • 90. Multifactor Linear Model > summary(mhill2) > update(update(mhill2, weights = 1/hills$dist^2)) > predict(mhill2, hills[2:3, ]) 90
  • 91. Bayes Rule p(y|x) = p(x|y) p(y) / p(x) (1) Consider the problem of email filtering: x is the email e.g. in the form of a word vector, y the label spam, ham 91
  • 92. Prior p(y) p(ham) ≈ m_ham / m_total, p(spam) ≈ m_spam / m_total (2) 92
  • 93. Likelihood Ratio p(y|x) = p(x|y) p(y) / p(x) (3) key problem: we do not know p(x|y) or p(x). We can get rid of p(x) by settling for a likelihood ratio: L(x) := p(spam|x) / p(ham|x) = p(x|spam) p(spam) / (p(x|ham) p(ham)) (4) Whenever L(x) exceeds a given threshold c we decide that x is spam and consequently reject the e-mail 93
  • 94. Computing p(x|y) Key assumption in Naive Bayes is that the variables (or elements of x) are conditionally independent i.e. words in emails are independent of each other. We can then compute p(x|y) in a naive fashion by assuming that: p(x|y) = ∏_{j=1..N words in x} p(w_j | y) (5) w_j denotes the j-th word in document x. Estimates for p(w|y) can be obtained, for instance, by simply counting the frequency occurrence of the word within documents of a given class 94
  • 95. Computing p(x|y) p(w|spam) ≈ Σ_{i=1..m} Σ_{j=1..N words in x_i} 1{y_i = spam & w_ij = w} / Σ_{i=1..m} Σ_{j=1..N words in x_i} 1{y_i = spam} (6) 1{y_i = spam & w_ij = w} equals 1 if x_i is labeled as spam and w occurs as the j-th word in x_i. The denominator counts the number of words in spam documents. 95
  • 96. Laplacian smoothing Issue: estimating p(w|y ) for words w which we might not have seen before. Solution: increment all counts by 1. This method is commonly referred to as Laplace smoothing. 96
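A toy sketch of Laplace smoothing on word counts; the three documents and their labels are made up for illustration and are not the spam data used on the next slide:
> docs <- c("win money now", "meeting at noon", "win a prize now")
> labs <- c("spam", "ham", "spam")
> words <- strsplit(docs, " ")
> vocab <- unique(unlist(words))
> spamwords <- unlist(words[labs == "spam"])
> counts <- table(factor(spamwords, levels = vocab)) + 1   # add 1 to every count
> p.w.spam <- counts / sum(counts)                         # smoothed estimate of p(w | spam)
> p.w.spam["meeting"]   # non-zero even though "meeting" never occurs in a spam document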
  • 97. Naive Bayes in R > library(kernlab) > library(e1071) > data(spam) > idx <- sample(1:dim(spam)[1], 300) > spamtrain <- spam[-idx, ] > spamtest <- spam[idx, ] > model <- naiveBayes(type ~ ., data = spamtrain) > predict(model, spamtest) > table(predict(model, spamtest), + spamtest$type) > predict(model, spamtest, type = "raw") 97
  • 98. k-Nearest Neighbors An even simpler estimator than Naive Bayes is nearest neighbors. In its most basic form it assigns the label of its nearest neighbor to an observation x. Identify the k nearest neighbors given a distance metric d(x, x′) (e.g. Euclidean distance) and determine the label by a vote. 98
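Before the knn() call on slide 100, a from-scratch sketch of the nearest-neighbour vote may make the idea concrete; the train/test split on iris is only illustrative:
> data(iris)
> train <- iris[c(1:30, 51:80, 101:130), ]
> test <- iris[c(31, 81, 131), ]
> # Euclidean distance from the first test point to every training point
> d <- apply(train[, -5], 1, function(z) sqrt(sum((z - unlist(test[1, -5]))^2)))
> train$Species[which.min(d)]                            # 1-NN: label of the closest point
> names(which.max(table(train$Species[order(d)[1:5]])))  # k = 5: majority vote among the 5 closest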
  • 100. KNN in R > library(class) > pred <- knn(spamtrain[, -58], + spamtest[, -58], spamtrain[, + 58]) > table(pred, spamtest$type) 100
  • 104. Support Vector Machines Support Vector Machines work by maximizing a margin between a hyperplane and the data. SVM is a simple well understood linear method The optimization problem is convex thus only one optimal solution exists Excellent performance 104
  • 105. Kernels: Data Mapping Data are implicitly mapped into a high dimensional feature space Φ : X → H, e.g. Φ : R² → R³, (x1, x2) → (z1, z2, z3) := (x1², √2 x1 x2, x2²) 105
  • 106. Kernel Methods: Kernel Function In Kernel Methods learning takes place in the feature space, data only appear inside inner products ⟨Φ(x), Φ(x′)⟩ The inner product is given by a kernel function (“kernel trick”) k(x, x′) = ⟨Φ(x), Φ(x′)⟩ Inner product: ⟨Φ(x), Φ(x′)⟩ = (x1², √2 x1 x2, x2²)(x1′², √2 x1′ x2′, x2′²)ᵀ = ⟨x, x′⟩² = k(x, x′) 106
  • 107. Kernel Functions Gaussian kernel k(x, x′) = exp(−σ‖x − x′‖²) Polynomial kernel k(x, x′) = (⟨x, x′⟩ + c)^p String kernels (for text): k(x, x′) = Σ_{s⊑x, s′⊑x′} λ_s δ_{s,s′} = Σ_{s∈A*} num_s(x) num_s(x′) λ_s Kernels on Graphs, Mismatch kernels etc. 107
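These kernels can be evaluated directly with kernlab's kernel objects; a short sketch (the vectors and parameter values are arbitrary):
> library(kernlab)
> x <- c(1, 2); y <- c(2, 3)
> rbf <- rbfdot(sigma = 0.5)
> rbf(x, y)                                    # exp(-sigma * ||x - y||^2)
> poly <- polydot(degree = 2, offset = 1)
> poly(x, y)                                   # (<x, y> + 1)^2
> kernelMatrix(rbf, matrix(rnorm(10), 5, 2))   # pairwise kernel values for a small data matrix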
  • 108. SVM Primal Optimization Problem minimize t(w, ξ) = (1/2)‖w‖² + (C/m) Σ_{i=1..m} ξ_i subject to y_i(⟨Φ(x_i), w⟩ + b) ≥ 1 − ξ_i (i = 1, . . . , m) and ξ_i ≥ 0 (i = 1, . . . , m), where m is the number of training patterns, and y_i = ±1. SVM Decision Function f(x_i) = ⟨w, Φ(x_i)⟩ + b, f_i = Σ_{j=1..m} k(x_i, x_j) α_j 108
  • 110. SVM in R > filter <- ksvm(type ~ ., data = spamtrain, + kernel = "rbfdot", kpar = list(sigma = 0.05), + C = 5, cross = 3) > filter > mailtype <- predict(filter, spamtest[, + -58]) > table(mailtype, spamtest[, 58]) 110
  • 111. SVM in R > filter <- ksvm(type ~ ., data = spamtrain, + kernel = "rbfdot", kpar = list(sigma = 0.05), + C = 5, cross = 3, prob.model = TRUE) > filter > mailpro <- predict(filter, spamtest[, + -58], type = "prob") > mailpro
  • 112. SVM in R > x <- rbind(matrix(rnorm(120), , + 2), matrix(rnorm(120, mean = 3), + , 2)) > y <- matrix(c(rep(1, 60), rep(-1, + 60))) > svp <- ksvm(x, y, type = "C-svc", + kernel = "vanilladot", C = 200) > svp > plot(svp, data = x) 112
  • 113. SVM in R > svp <- ksvm(x, y, type = "C-svc", + kernel = "rbfdot", kpar = list(sigma = 1), + C = 10) > plot(svp, data = x) 113
  • 114. SVM in R > svp <- ksvm(x, y, type = "C-svc", + kernel = "rbfdot", kpar = list(sigma = 1), + C = 10) > plot(svp, data = x) 114
  • 115. SVM in R > data(reuters) > is(reuters) > tsv <- ksvm(reuters, rlabels, kernel = "stringdot", + kpar = list(length = 5), cross = 3, + C = 10) > tsv 115
  • 116. SVM in R > x <- seq(-20, 20, 0.1) > y <- sin(x)/x + rnorm(401, sd = 0.03) > plot(x, y, type = "l") > regm <- ksvm(x, y, epsilon = 0.02, + kpar = list(sigma = 16), cross = 3) > regm > lines(x, predict(regm, x), col = "red") 116
  • 117. Decision Trees [classification tree plot for the iris data: root split Petal.Length < 2.45, then Petal.Width < 1.75; leaves labelled setosa, versicolor, virginica] 117
  • 118. Decision Trees Tree-based methods partition the feature space into a set of rectangles, and then fit a simple model in each one. The partitioning is done taking one variable at a time so that the split on that variable reduces some impurity measure Q(x). In a node m of a classification tree let p̂_mk = (1/N_m) Σ_{x_i ∈ R_m} I(y_i = k) (7) be the proportion of class k observations in node m. Trees classify the observations in each node m to class k(m) = argmax_k p̂_mk (8) 118
  • 119. Impurity Measures Misclassification error: (1/N_m) Σ_{i ∈ R_m} I(y_i ≠ k(m)) (9) Gini index: Σ_{k ≠ k′} p̂_mk p̂_mk′ = Σ_{k=1..K} p̂_mk (1 − p̂_mk) (10) Cross-entropy or deviance: − Σ_{k=1..K} p̂_mk log(p̂_mk) (11) 119
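A quick numerical check of the three measures for one node (the class proportions are invented for illustration):
> p <- c(0.7, 0.2, 0.1)    # class proportions in a node
> 1 - max(p)               # misclassification error
> sum(p * (1 - p))         # Gini index
> -sum(p * log(p))         # cross-entropy / deviance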
  • 120. Classification Trees in R > library(rpart) > data(iris) > tree <- rpart(Species ~ ., data = iris) > plot(tree) > text(tree, digits = 3) 120
  • 121. Classification Trees in R > fit <- rpart(Kyphosis ~ Age + Number + + Start, data = kyphosis) > fit2 <- rpart(Kyphosis ~ Age + + Number + Start, data = kyphosis, + parms = list(prior = c(0.65, + 0.35), split = "information")) > fit3 <- rpart(Kyphosis ~ Age + + Number + Start, data = kyphosis, + control = rpart.control(cp = 0.05)) > par(mfrow = c(1, 3), xpd = NA) > plot(fit) > text(fit, use.n = TRUE) > plot(fit2) > text(fit2, use.n = TRUE) > plot(fit3) > text(fit3, use.n = TRUE)
  • 122. Regression Trees in R > library(tree) > data(cpus, package = "MASS") > cpus.ltr <- tree(log10(perf) ~ + syct + mmin + mmax + cach + + chmin + chmax, cpus) > cpus.ltr 122
  • 123. Random Forests RF is an ensemble classifier that consist of many decision trees. Decisions are taken using a majority vote from the trees. Number of training cases be N, and the number of variables in the classifier be M. m is the number of input variables to be used to determine the decision at a node of the tree; m should be much less than M. Sample a training set for this tree by choosing N times with replacement from all N available training cases (i.e. take a bootstrap sample). For each node of the tree, randomly choose m variables on which to base the decision at that node. Calculate the best split based on these m variables in the training set. 123
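The next slide fits a forest with the defaults; the sketch below makes the number of trees, the m candidate variables per split (mtry) and variable importance explicit. The particular values are illustrative, not recommendations:
> library(randomForest)
> library(kernlab)
> data(spam)
> rf <- randomForest(type ~ ., data = spam, ntree = 200, mtry = 7, importance = TRUE)
> rf                # out-of-bag error estimate and confusion matrix
> varImpPlot(rf)    # which variables the trees rely on most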
  • 124. Random Forests in R > library(randomForest) > filter <- randomForest(type ~ ., + data = spam) > filter 124
  • 125. Principal Component Analysis (PCA) A projection method finding projections of maximal variability i.e. it seeks linear combinations of the columns of X with maximal (or minimal) variance The first k principal components span a subspace containing the “best” k-dimensional view of the data Projecting the data on the first few principal components is often useful to reveal structure in the data PCA depends on the scaling of the original variables, thus it is conventional to do PCA using the correlation matrix, implicitly rescaling all the variables to unit sample variance. 125
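A short sketch of the effect of that rescaling, complementing the princomp call on the next slide (iris is just a convenient example here):
> data(iris)
> summary(princomp(iris[, -5]))               # covariance-based PCA: scale-dependent
> summary(princomp(iris[, -5], cor = TRUE))   # correlation-based PCA: variables standardised first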
  • 127. Principal Components Analysis (PCA) > ir.pca <- princomp(log(iris[, -5]), + cor = T) > summary(ir.pca) > loadings(ir.pca) > ir.pc <- predict(ir.pca) 127
  • 128. PCA Plot > plot(ir.pc[, 1:2], xlab = "first principal component", + ylab = "second principal component") > text(ir.pc[, 1:2], labels = as.character(iris[, + 5]), col = as.numeric(iris[, + 5])) 128
  • 129. kernel PCA > test <- sample(1:150, 20) > kpc <- kpca(~., data = iris[-test, + -5], kernel = "rbfdot", kpar = list(sigma = 0.2), + features = 2) > plot(rotated(kpc), col = as.integer(iris[-test, + 5]), xlab = "1st Principal Component", + ylab = "2nd Principal Component") > text(rotated(kpc), labels = as.character(iris[-test, + 5]), col = as.numeric(iris[-test, + 5])) 129
  • 130. kernel PCA > kpc <- kpca(reuters, kernel = "stringdot", + kpar = list(length = 5), features = 2) > plot(rotated(kpc), col = as.integer(rlabels), + xlab = "1st Principal Component", + ylab = "2nd Principal Component") 130
  • 131. Distance methods Distance Methods work by representing the cases in a low dimensional Euclidean space so that their proximity reflects the similarity of their variables A distance measure needs to be defined when using Distance Methods The function dist() implements four distance measures between the points in the p-dimensional space of variables; the default is Euclidean distance Distance methods can be used with categorical variables! 131
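A minimal illustration of dist() with different metrics; the exact set of available methods depends on the R version, and the four rows of iris are just a small example:
> data(iris)
> x <- iris[1:4, -5]
> dist(x)                                  # Euclidean distance, the default
> dist(x, method = "manhattan")            # sum of absolute differences
> as.matrix(dist(x, method = "maximum"))   # the full symmetric distance matrix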
  • 132. Multidimensional scaling > dm <- dist(iris[, -5]) > mds <- cmdscale(dm, k = 2) > plot(mds, xlab = "1st coordinate", + ylab = "2nd coordinate", col = as.numeric(iris[, + 5])) 132
  • 133. Multidimensional scaling for categorical v. > library(kernlab) > library(cluster) > data(income) > inc <- income[1:300, ] > daisy(inc) > csc <- cmdscale(as.dist(daisy(inc)), + k = 2) > plot(csc, xlab = "1st coordinate", + ylab = "2nd coordinate") 133
  • 134. Sammon’s non-linear mapping > snm <- sammon(dist(ir[-143, ])) > plot(snm$points, xlab = "1st coordinate", + ylab = "2nd coordinate", col = as.numeric(iris[, + 5])) 134
  • 135. Biplot > data(state) > state <- state.x77[, 2:7] > row.names(state) <- state.abb > biplot(princomp(state, cor = T), + pc.biplot = T, cex = 0.7, expand = 0.8) 135
  • 136. Stars plot stars(state.x77[, c(7, 4, 6, 2, 5, 3)], full = FALSE,key.loc = c(10, 2)) 136
  • 137. Factor Analysis Principal component analysis looks for linear combinations of the data matrix that are uncorrelated and of high variance Factor analysis seeks linear combinations of the variables, called factors, that represent underlying fundamental quantities of which the observed variables are expressions the idea being that a small number of factors might explain a large number of measurements in an observational study 137
  • 138. Factor Analysis > data(swiss) > swiss.x <- as.matrix(swiss[, -1]) > swiss.FA1 <- factanal(swiss.x, + factors = 2, method = "mle") > swiss.FA1 > summary(swiss.FA1) 138
  • 139. k-means Clustering Partition data into k sets S = {S1, S2, . . . , Sk} so as to minimize the within-cluster sum of squares (WCSS): argmin_S Σ_{i=1..k} Σ_{x_j ∈ S_i} ‖x_j − µ_i‖² (12) 139
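The minimised WCSS is reported by kmeans() as tot.withinss, which also gives a simple way to pick k by looking for an elbow; the range of k and the nstart value below are illustrative:
> data(iris)
> km <- kmeans(iris[, -5], centers = 3, nstart = 10)
> km$tot.withinss   # the minimised within-cluster sum of squares from (12)
> wcss <- sapply(1:6, function(k) kmeans(iris[, -5], centers = k, nstart = 10)$tot.withinss)
> plot(1:6, wcss, type = "b", xlab = "k", ylab = "within-cluster SS")   # look for the elbow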
  • 140. k-means Clustering > data(iris) > clust <- kmeans(iris[, -5], centers = 3) > clust 140
  • 141. k-means Clustering I > data(swiss) > swiss.x <- as.matrix(swiss[, -1]) > km <- kmeans(swiss.x, 3) > swiss.pca <- princomp(swiss.x) > swiss.px <- predict(swiss.pca) 141
  • 142. k-means Clustering II > dimnames(km$centers)[[2]] <- dimnames(swiss.x)[[2]] > swiss.centers <- predict(swiss.pca, + km$centers) > plot(swiss.px[, 1:2], xlab = "first principal component", + ylab = "second principal component", + col = km$cluster)
  • 143. k-means Clustering III > points(swiss.centers[, 1:2], pch = 3, + cex = 3) > identify(swiss.px[, 1:2], cex = 0.5) 143
  • 144. kernel k-means Partition data into k sets S = {S1, S2, . . . , Sk} so as to minimize the within-cluster sum of squares (WCSS) in kernel feature space: argmin_S Σ_{i=1..k} Σ_{x_j ∈ S_i} ‖Φ(x_j) − Φ(µ_i)‖² (13) 144
  • 145. kernel k-means > sc <- kkmeans(as.matrix(iris[, + -5]), kernel = "rbfdot", centers = 3) > sc > matchClasses(table(sc, iris[, 5])) 145
  • 146. kernel k-means > str <- stringdot(length = 4) > K <- kernelMatrix(str, reuters) > sc <- kkmeans(K, centers = 2) > sc > matchClasses(table(sc, rlabels)) 146
  • 147. Spectral Clustering in kernlab Embedding data points into the subspace of eigenvectors of a kernel matrix Embedded points clustered using k -means Better performance (embedded points form tighter clusters) Can deal with clusters that do not form convex regions 147
  • 148. Spectral Clustering in R > data(spirals) > plot(spirals) > sc <- specc(spirals, centers = 2) > sc > plot(spirals, col = sc) 148
  • 149. Spectral Clustering > sc <- specc(reuters, kernel = "stringdot", + kpar = list(length = 5), centers = 2) > matchClasses(table(sc, rlabels)) > par(mfrow = c(1, 2)) > kpc <- kpca(reuters, kernel = "stringdot", + kpar = list(length = 5), features = 2) > plot(rotated(kpc), col = as.integer(rlabels), + xlab = "1st Principal Component", + ylab = "2nd Principal Component") > plot(rotated(kpc), col = as.integer(sc), + xlab = "1st Principal Component", + ylab = "2nd Principal Component") 149
  • 150. Hierarchical Clustering > data(state) > h <- hclust(dist(state.x77), method = "single") > plot(h) 150
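A possible follow-up to the dendrogram, assuming data(state) from the previous slide is still loaded; cutting into 4 groups and the complete linkage are illustrative choices:
> h2 <- hclust(dist(state.x77), method = "complete")   # a different linkage
> plot(h2)
> grp <- cutree(h2, k = 4)                 # cut the dendrogram into 4 groups
> table(grp)
> rect.hclust(h2, k = 4, border = "red")   # outline the groups on the plot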