A Random Forest example using the Iris dataset in R

In this document I will show a simple example of using Random Forest to make some predictions. First I will do some data exploration of the iris dataset, including Principal Component Analysis with prcomp. The iris dataset ships with R. The Random Forest example is at the end of the document.

Most of the RF code was adapted from http://rischanlab.github.io/RandomForest.html
For the PCA I took inspiration from https://www.r-bloggers.com/computing-and-visualizing-pca-in-r/

Some preliminary exploratory analysis of the iris dataset

 set.seed(12345)
 data(iris)
 dim(iris)
## [1] 150   5
 iris[1:5,1:5]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
 table(iris$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50
 boxplot(iris$Sepal.Length ~ iris$Species,ylab="Sepal.Length")

 boxplot(iris$Sepal.Width ~ iris$Species,ylab="Sepal.Width")

 boxplot(iris$Petal.Length ~ iris$Species,ylab="Petal.Length")

 boxplot(iris$Petal.Width ~ iris$Species,ylab="Petal.Width")

We can see that the petal descriptors show more differentiation between species than the sepal descriptors, and that virginica and versicolor are closer to each other than to setosa for most variables.
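The boxplot impression can be checked numerically. As a quick sketch (not part of the original analysis), aggregate computes the per-species means of the four measurements:

aggregate(. ~ Species, data = iris, FUN = mean)  # mean of each descriptor per species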

A simple plot using petal descriptors

 plot(iris$Petal.Length,iris$Petal.Width,col=iris$Species,pch=16)
 legend( x="topleft", 
        legend=levels(as.factor(iris$Species)),
        col=c("black","red","green"), 
        pch=c(16) )

Using PCA to show differentiation between species. We won’t try to predict anything with PCA for this example.

log.ir <- log(iris[, 1:4])
ir.species <- iris[, 5]
 
ir.pca <- prcomp(log.ir,
                 center = TRUE,
                 scale. = FALSE) 
ir.pca
## Standard deviations (1, .., p=4):
## [1] 1.14655938 0.13835264 0.13525746 0.05380941
## 
## Rotation (n x k) = (4 x 4):
##                      PC1           PC2        PC3         PC4
## Sepal.Length  0.10090019 -0.0008537483 -0.4891583  0.86633858
## Sepal.Width  -0.05759298  0.5745110809 -0.7140592 -0.39590340
## Petal.Length  0.50527032 -0.6870939247 -0.4269180 -0.30057416
## Petal.Width   0.85510473  0.4447900940  0.2618865  0.04871476
pca.results<-cbind(as.data.frame(ir.pca$x),ir.species)
head(pca.results)
##          PC1         PC2         PC3          PC4 ir.species
## 1 -1.6736850  0.02055683 -0.06042345  0.015217731     setosa
## 2 -1.6688436 -0.06797029  0.06921821  0.041588345     setosa
## 3 -1.7142099  0.02006252  0.07515646  0.002209602     setosa
## 4 -1.6422467 -0.09648282  0.04725452 -0.046865053     setosa
## 5 -1.6773055  0.03675822 -0.07085250 -0.013090995     setosa
## 6 -0.9833353  0.25757951 -0.06701669 -0.002697639     setosa
plot(pca.results$PC1,pca.results$PC2,col=pca.results$ir.species,pch=16)
 legend( x="topleft", 
        legend=levels(as.factor(pca.results$ir.species)),
        col=c("black","red","green"), 
        pch=c(16) )

We notice that the species separate along PC1, though there is still some overlap between versicolor and virginica.
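To quantify this, summary on the prcomp object reports the proportion of variance explained by each component; from the standard deviations printed above, PC1 accounts for roughly 97% of the total variance.

summary(ir.pca)  # standard deviation, proportion of variance, and cumulative proportion per PC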

Now using the Random Forest method

First we split the iris data into a training and a testing set.

Approximately 70% of the data will be used for training the model; each row is assigned to the training set with probability 0.7, so the exact split varies (here it is reproducible thanks to the set.seed call above).

ind <- sample(2,nrow(iris),replace=TRUE,prob=c(0.7,0.3))
table(ind)
## ind
##  1  2 
## 97 53
trainData <- iris[ind==1,]
testData <- iris[ind==2,]
dim(trainData)
## [1] 97  5
dim(testData)
## [1] 53  5
head(trainData)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 11          5.4         3.7          1.5         0.2  setosa
## 12          4.8         3.4          1.6         0.2  setosa
head(testData)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
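Note that sampling labels with prob=c(0.7,0.3) only gives an approximately 70/30 split (here 97/53). If an exact split is preferred, one could instead sample a fixed-size set of row indices, as in this alternative sketch (the objects train.idx, trainData2 and testData2 are illustrative and not used below):

# Alternative: an exact 70/30 split by sampling row indices (sketch, unused below)
train.idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
trainData2 <- iris[train.idx, ]   # 105 rows
testData2  <- iris[-train.idx, ]  #  45 rows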

Loading the randomForest package and fitting the Random Forest model

If the package has not been installed previously you might need to install it with install.packages("randomForest")
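A small sketch that installs the package only when it is missing:

# Install randomForest only if it is not already available
if (!requireNamespace("randomForest", quietly = TRUE)) {
  install.packages("randomForest")
}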

library(randomForest)


iris_rf <- randomForest(Species~.,data=trainData,ntree=100,proximity=TRUE)
print(iris_rf)
## 
## Call:
##  randomForest(formula = Species ~ ., data = trainData, ntree = 100,      proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 7.22%
## Confusion matrix:
##            setosa versicolor virginica class.error
## setosa         34          0         0  0.00000000
## versicolor      0         28         3  0.09677419
## virginica       0          4        28  0.12500000
#table(predict(iris_rf),trainData$Species)

We can see that using the current training data RF has no problem identifying the setosa individuals (class.error = 0), but there is more uncertainty in the assignment to the classes versicolor and virginica (class.error of about 0.10 and 0.13 respectively).
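It is also worth checking that 100 trees are enough. Calling plot on a randomForest object draws the OOB error and the per-class errors against the number of trees, so we can see whether the error has stabilised:

# OOB and per-class error rates as a function of the number of trees
plot(iris_rf)
legend("topright", legend = colnames(iris_rf$err.rate), col = 1:4, lty = 1:4)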

Importance of the descriptors

importance(iris_rf)
##              MeanDecreaseGini
## Sepal.Length         5.714656
## Sepal.Width          3.101009
## Petal.Length        24.865751
## Petal.Width         30.130563

We can see that Petal.Width and Petal.Length are the most important descriptors for differentiating between species, as shown before by the PCA and by simply plotting these variables.
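The same ranking can be shown graphically with varImpPlot, a helper from the randomForest package:

varImpPlot(iris_rf)  # dot chart of the MeanDecreaseGini importance values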

Now we use the fitted random forest to predict the species in the testing data

  irisPred<-predict(iris_rf,newdata=testData)
  table(irisPred, testData$Species)
##             
## irisPred     setosa versicolor virginica
##   setosa         16          0         0
##   versicolor      0         19         2
##   virginica       0          0        16
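Besides hard class labels, predict can also return the fraction of trees voting for each class (type="prob"), which is useful for seeing how confident the forest is near the versicolor/virginica boundary. The object name irisProbs is just for illustration:

# Per-class vote fractions for the test set; each row sums to 1
irisProbs <- predict(iris_rf, newdata = testData, type = "prob")
head(irisProbs)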

Plotting the margins of the forest, computed from the out-of-bag votes on the training data. Positive margins mean a correct classification.

plot(margin(iris_rf))

And checking the classification accuracy

  # the number of correct predictions
  print(sum(irisPred==testData$Species))
## [1] 51
  # out of the total number of test data points
  print(length(testData$Species)) 
## [1] 53
  # the accuracy: correct predictions / total
  print(sum(irisPred==testData$Species)/length(testData$Species))
## [1] 0.9622642
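The same accuracy can be computed in one step, since averaging a logical vector treats TRUE as 1 and FALSE as 0:

mean(irisPred == testData$Species)  # equivalent one-liner for the accuracy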