In this document I will show a simple example of using Random Forest to make some predictions. First I will do some data exploration using the IRIS dataset, including Principal Component Analysis using prcomp
. The Iris dataset is included in R core. The Random Forest example is at the end of the document.
Most of the RF code found here (http://rischanlab.github.io/RandomForest.html)
For PCA I took inspiration from https://www.r-bloggers.com/computing-and-visualizing-pca-in-r/
set.seed(12345)
data(iris)
dim(iris)
## [1] 150 5
iris[1:5,1:5]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
table(iris$Species)
##
## setosa versicolor virginica
## 50 50 50
boxplot(iris$Sepal.Length ~ iris$Species,ylab="Sepal.Length")
boxplot(iris$Sepal.Width ~ iris$Species,ylab="Sepal.Width")
boxplot(iris$Petal.Length ~ iris$Species,ylab="Petal.Length")
boxplot(iris$Petal.Width ~ iris$Species,ylab="Petal.Width")
We can see that petal descriptors show more differentiation betwen species tat sepal descriptors and that virginica and versicolor are closer to each other than to setosa for most variables.
plot(iris$Petal.Length,iris$Petal.Width,col=iris$Species,pch=16)
legend( x="topleft",
legend=levels(as.factor(iris$Species)),
col=c("black","red","green"),
pch=c(16) )
log.ir <- log(iris[, 1:4])
ir.species <- iris[, 5]
ir.pca <- prcomp(log.ir,
center = TRUE,
scale. = FALSE)
ir.pca
## Standard deviations (1, .., p=4):
## [1] 1.14655938 0.13835264 0.13525746 0.05380941
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Sepal.Length 0.10090019 -0.0008537483 -0.4891583 0.86633858
## Sepal.Width -0.05759298 0.5745110809 -0.7140592 -0.39590340
## Petal.Length 0.50527032 -0.6870939247 -0.4269180 -0.30057416
## Petal.Width 0.85510473 0.4447900940 0.2618865 0.04871476
pca.results<-cbind(as.data.frame(ir.pca$x),ir.species)
head(pca.results)
## PC1 PC2 PC3 PC4 ir.species
## 1 -1.6736850 0.02055683 -0.06042345 0.015217731 setosa
## 2 -1.6688436 -0.06797029 0.06921821 0.041588345 setosa
## 3 -1.7142099 0.02006252 0.07515646 0.002209602 setosa
## 4 -1.6422467 -0.09648282 0.04725452 -0.046865053 setosa
## 5 -1.6773055 0.03675822 -0.07085250 -0.013090995 setosa
## 6 -0.9833353 0.25757951 -0.06701669 -0.002697639 setosa
plot(pca.results$PC1,pca.results$PC2,col=pca.results$ir.species,pch=16)
legend( x="topleft",
legend=levels(as.factor(pca.results$ir.species)),
col=c("black","red","green"),
pch=c(16) )
We notice the species separate over PC1. Still some overlp between versicolor and virginica.
70% of the data will be used for training the model.
ind <- sample(2,nrow(iris),replace=TRUE,prob=c(0.7,0.3))
table(ind)
## ind
## 1 2
## 97 53
trainData <- iris[ind==1,]
testData <- iris[ind==2,]
dim(trainData)
## [1] 97 5
dim(testData)
## [1] 53 5
head(trainData)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
head(testData)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
If the package has not been installed previously you might need to install it with install.packages("randomForest")
library(randomForest)
iris_rf <- randomForest(Species~.,data=trainData,ntree=100,proximity=TRUE)
print(iris_rf)
##
## Call:
## randomForest(formula = Species ~ ., data = trainData, ntree = 100, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 7.22%
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 34 0 0 0.00000000
## versicolor 0 28 3 0.09677419
## virginica 0 4 28 0.12500000
#table(predict(iris_rf),trainData$Species)
We can see that usin the current training data RF has no problem identifying the setosa individuals (class.error=0) but there is more uncertainty for assignment to the classes versicolor and virginica (clas.error ~ 0.08).
importance(iris_rf)
## MeanDecreaseGini
## Sepal.Length 5.714656
## Sepal.Width 3.101009
## Petal.Length 24.865751
## Petal.Width 30.130563
We can see that Petal.width and Petal.length are the more important descriptors that differentiate between species as shown before by PCA and just plotting this variables.
irisPred<-predict(iris_rf,newdata=testData)
table(irisPred, testData$Species)
##
## irisPred setosa versicolor virginica
## setosa 16 0 0
## versicolor 0 19 2
## virginica 0 0 16
plot(margin(iris_rf,testData$Species))
#The number of correct predictions
print(sum(irisPred==testData$Species))
## [1] 51
#out of this many datapoints used to test the prediction
print(length(testData$Species))
## [1] 53
#the accuracy
print(sum(irisPred==testData$Species)/length(testData$Species))
## [1] 0.9622642