1. Introduction to R for Data Science
Lecturers
dipl. ing Branko Kovač
Data Analyst at CUBE / Data Science Mentor at Springboard
Institut za savremene nauke
Data Science zajednica Srbije
branko.kovac@gmail.com
dr Goran S. Milovanović
Data Scientist at DiploFoundation
Data Science zajednica Srbije
goran.s.milovanovic@gmail.com
goranm@diplomacy.edu
2. Vectors in R
• There are no scalars in R: after a <- 5, a is a vector of length one, so length(a) == 1 is TRUE
• Vectorizing your code is a priority in vector programming languages such as R (more
on vectorization later in this course…)
• !!! - An excellent read: http://www.noamross.net/blog/2014/4/16/vectorization-in-r--why.html
(a little bit advanced at this point - yet highly recommended)
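The two points above can be illustrated with a minimal sketch. The vector x and the names squares_vectorized and squares_loop are chosen here for illustration only; they do not come from the course script.

```r
# A single number is still a vector of length one.
a <- 5
is.vector(a)      # TRUE
length(a) == 1    # TRUE

# Vectorized arithmetic: one expression operates on every element.
x <- c(1, 2, 3, 4)
squares_vectorized <- x^2

# Equivalent explicit loop, shown only for comparison.
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

identical(squares_vectorized, squares_loop)  # TRUE
```

Both versions give the same result; the vectorized form is shorter and avoids the per-element loop at the R level.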
Intro to R for Data Science
Session 2: Vectors, Matrices & Data Frames
# Introduction to R for Data Science
# SESSION 2 :: 5 May, 2016
char_list <- character(length = 0) # empty character vector (not a list, despite the name)
> char_list
character(0)
num_list <- numeric(length = 10) # elements are initialized to 0
# length can be any non-negative integer; it defaults to 0
> num_list
[1] 0 0 0 0 0 0 0 0 0 0
log_list <- logical(length = 3) #default value is FALSE
> log_list
[1] FALSE FALSE FALSE
3. Vectors in R: c(), subsetting
log_list_2 <- c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE) # some Ts and Fs
> log_list_2
[1] TRUE FALSE FALSE TRUE TRUE TRUE
# Subsetting is an everyday operation in R
char_list_2[5] # a single element can be selected
log_list_2[2:4] # or a range of positions
num_list_2[3:length(num_list_2)] # or a range computed with length()
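The following is a self-contained sketch of the same subsetting patterns, plus negative and logical indexing. The vectors char_vec and num_vec and their values are assumptions made for illustration; they are not the vectors used in the course script.

```r
# Illustrative vectors; the values are assumptions, not from the course script.
char_vec <- c("a", "b", "c", "d", "e", "f")
num_vec  <- c(10, 20, 30, 40, 50)

char_vec[5]                    # single element: "e"
num_vec[2:4]                   # a range: 20 30 40
num_vec[3:length(num_vec)]     # from the 3rd element to the end: 30 40 50
num_vec[-1]                    # a negative index drops an element: 20 30 40 50
num_vec[num_vec > 25]          # a logical index keeps matching elements: 30 40 50
```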
4. Vectors in R: ordering, coercing while concatenating
# Vector ordering
sort(test, decreasing = TRUE) # using the sort() function
test[order(test, decreasing = TRUE)] # or with the order() function
# Concatenation
new_num_vect <- c(num_list, num_list_2) # using 2 vectors to create a new one
> new_num_vect # what do we get?
new_combo_vect <- c(num_list_2, log_list) # combining a numeric and a logical vector
new_combo_vect # all numbers? FALSE became 0: coercion in action
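A runnable sketch of ordering and coercion follows. The slides do not show how test was built, so the values below are hypothetical, as are the names sorted_desc, ordered_desc, num_log, and num_char.

```r
# Hypothetical input vector (the slides do not show how `test` was built).
test <- c(3, 1, 4, 1, 5)

sorted_desc  <- sort(test, decreasing = TRUE)          # 5 4 3 1 1
ordered_desc <- test[order(test, decreasing = TRUE)]   # same values via index positions
identical(sorted_desc, ordered_desc)                   # TRUE

# Coercion while concatenating: the result takes the most general type.
num_log  <- c(1.5, TRUE, FALSE)   # logical -> numeric: 1.5 1.0 0.0
num_char <- c(1.5, "a")           # numeric -> character: "1.5" "a"
typeof(num_log)                   # "double"
typeof(num_char)                  # "character"
```

order() returns positions rather than values, which is why it is also useful for reordering one vector by the values of another.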
5. Matrices in R: there are matrices in R, indeed
# Matrices are available in R
matr <- matrix(data = c(1,3,5,7,NA,11), nrow = 2, ncol = 3) # 2x3 matrix, filled column by column
class(matr) # yes, it's a matrix
typeof(matr) # double, as expected
# Again: R objects (like matrices) have classes, R data (like integers)
# have types; that is the difference between class() and typeof().
• There are countless (think 1e06) things that you can do with matrices in R. Only a few of them will
be discussed in the second (applied statistical modeling) part of the course.
• Matrices and vectors are fast: as fast as R (not quite a Roadrunner, beep-beep…) can
get. On the deepest implementation level, *everything in R is a vector*, in spite of the
widespread opinion that “everything in R is a list/an object”…
• Again !!! - An excellent read: http://www.noamross.net/blog/2014/4/16/vectorization-in-r--why.html
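As a small sketch of the "a matrix is a vector underneath" point, the same matr from the slide can be indexed both as a matrix and as a plain vector. The names doubled, col_sums, and flat are introduced here for illustration.

```r
matr <- matrix(data = c(1, 3, 5, 7, NA, 11), nrow = 2, ncol = 3)

dim(matr)               # 2 3
matr[2, 1]              # row 2, column 1 -> 3 (data is filled column by column)
matr[4]                 # same data addressed as a vector: 4th element -> 7
is.matrix(matr)         # TRUE
typeof(matr)            # "double"

# Vectorized operations apply to the whole matrix at once.
doubled  <- matr * 2
col_sums <- colSums(matr, na.rm = TRUE)   # 4 12 11

# Dropping the dim attribute leaves the underlying vector.
flat <- as.vector(matr)
length(flat)            # 6
```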
6. data.frame in R: mastering the Force
# Think of data frame columns as vectors! Because they are!
mean(cars_data$mpg) # mean of the cars_data mpg (miles per gallon) column
median(cars_data$cyl) # median of the cars_data cyl (cylinders) column
is.list(cars_data[1, ]) # but a single row is still a data frame, that is, a list!
> is.list(mtcars)
[1] TRUE
> length(mtcars)
[1] 11
> length(colnames(mtcars))
[1] 11
• A data.frame is…
• a list…
• whose components are its columns…
• which are, in turn, vectors.
• Consistency, as in any database:
• a column “is about” something –
but only about that one thing.
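The list-of-columns view summarized above can be checked directly on mtcars, which ships with base R. This is only a sketch; first_row is a name introduced here for illustration.

```r
# mtcars ships with base R (package datasets).
is.list(mtcars)               # TRUE
is.vector(mtcars$mpg)         # TRUE: a column is a plain vector
sapply(mtcars, typeof)        # one type per column

# List-style and data-frame-style access return the same column.
identical(mtcars[["mpg"]], mtcars$mpg)        # TRUE
identical(mtcars[["mpg"]], mtcars[, "mpg"])   # TRUE

# A single row is still a data frame, hence a list.
first_row <- mtcars[1, ]
is.data.frame(first_row)      # TRUE
is.list(first_row)            # TRUE
```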
7. data.frame in R: subsetting data.frames
cars_data[c(1,3)] #keeping 1st and 3rd column only
cars_data[-c(1,3)] #removing 1st and 3rd column
cars_data[ ,-c(1,3)] #same as the previous line of code
cars_data[!duplicated(cars_data$mpg), ] # maybe we want to remove all cars with the same mpg?
# remember: it keeps only the first occurrence!
subset(cars_data, mpg < 19) # this is one way (and it can be slow!)
cars_data[cars_data$mpg < 19, ] # this is another one (faster)
cars_data[which(cars_data$mpg < 19), ] # and another one (usually faster still; which() also drops NAs)
cars_data[cars_data$mpg > 20 & cars_data$am == 1, ] # multiple conditions
cars_data[grep("Merc", row.names(cars_data), value = TRUE), ] # filtering by pattern match on row names
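Beyond speed, the three row-filtering approaches differ in how they treat missing values. The sketch below uses a small hypothetical data frame df (not part of the course data) to show the difference.

```r
# Small illustrative data frame with a missing value (hypothetical data).
df <- data.frame(name = c("a", "b", "c", "d"),
                 mpg  = c(18, NA, 25, 15))

by_logical <- df[df$mpg < 19, ]          # an NA in the condition yields an all-NA row
by_which   <- df[which(df$mpg < 19), ]   # which() keeps only positions that are TRUE
by_subset  <- subset(df, mpg < 19)       # subset() also drops rows where the condition is NA

nrow(by_logical)   # 3
nrow(by_which)     # 2
nrow(by_subset)    # 2
```

When a column may contain NA, which() or subset() is usually the safer choice for filtering.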
8. data.frame in R: separation, joining, names(), rownames(), and colnames()
# Separation and joining of data frames
low_mpg <- cars_data[cars_data$mpg < 15, ] #new data frame with mpg < 15
high_mpg <- cars_data[cars_data$mpg >= 15, ] #new data frame with mpg >= 15
mpg_join <- rbind(low_mpg, high_mpg) # we can combine 2 data frames like this
car_condition <- data.frame(sample(c("old","new"), replace = TRUE, size = 32)) # a random
# data frame with "old" and "new" values
names(car_condition) <- "condition" # for all kinds of objects
colnames(car_condition) <- "condition" # for "matrix-like" objects, but same effect here
rownames(car_condition) <- rownames(cars_data) # use the row names of one data frame as the
# row names of another
# or combine data frames like this (cbind() pairs rows by position, not by row name):
mpg_join <- cbind(mpg_join, car_condition)
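Because cbind() pairs rows by position, it can silently misalign data frames whose rows are in different orders. A minimal sketch of the alternative, joining on row names with merge(), is shown below; the data frames left and right and their values are hypothetical.

```r
# Hypothetical example: two data frames with the same row names in different order.
left  <- data.frame(mpg = c(21, 14), row.names = c("car_a", "car_b"))
right <- data.frame(condition = c("old", "new"), row.names = c("car_b", "car_a"))

by_position <- cbind(left, right)                   # pairs rows by position, ignoring row names
by_name     <- merge(left, right, by = "row.names") # pairs rows by row name

by_position["car_a", "condition"]                   # "old": taken from right's first row (car_b)
by_name[by_name$Row.names == "car_a", "condition"]  # "new": matched on the row name
```

merge() stores the matched row names in a column called Row.names, so they can be restored with rownames() afterwards if needed.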