Logistic Regression implemented step by step in R
What is Logistic Regression?
Many times, situations
arise where the dependent variable isn't normally distributed; i.e., the
assumption of normality is violated. For example, think of a problem where the
dependent variable is binary (Male/Female). Would you still use Multiple Regression?
Of course not! Why? We'll look at it below.
Let's take a peek into the
history of data analysis.
So, until 1972, people
didn't know how to analyze data with a non-normal error distribution in
the dependent variable. Then, in 1972, came a breakthrough by John Nelder and Robert Wedderburn in the form of Generalized
Linear Models. I'm sure you're familiar with the term. Now, let's
understand it in detail.
Generalized Linear Models
are an extension of the linear model framework that also accommodates
non-normal dependent variables. In general, they possess
three characteristics:
- These
models comprise a linear combination of input features.
- The
mean of the response variable is related to the linear combination of
input features via a link function.
- The
response variable is considered to have an underlying probability distribution
belonging to the exponential family, such as the binomial
distribution, Poisson distribution, or Gaussian distribution. Practically,
the binomial distribution is used when the response variable is binary,
the Poisson distribution when the response variable represents a count,
and the Gaussian distribution when the response variable is
continuous.
Logistic Regression belongs
to the family of generalized linear models. It is a binary classification
algorithm used when the response variable is dichotomous (1 or 0). Inherently,
it returns the probability of each observation belonging to the target class. But, we can also
obtain response labels by applying a probability threshold. Following are
the assumptions made by Logistic Regression:
- The
response variable must follow a binomial distribution.
- Logistic
Regression assumes a linear relationship between the independent variables
and the link function (logit).
- The
dependent variable should have mutually exclusive and exhaustive
categories.
In R, we use the glm() function
to apply Logistic Regression. In Python, we can import LogisticRegression
from the sklearn.linear_model module.
Note: We don't use Linear
Regression for binary classification because its linear function can produce
values outside the [0,1] interval, making them invalid as probabilities.
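To see this concretely, here is a quick illustration on toy data made up for this example: fitting an ordinary least-squares line to a 0/1 outcome yields predictions that escape the [0,1] interval at extreme predictor values.

```r
# toy data (hypothetical): a binary outcome and a continuous predictor
x <- c(1, 2, 3, 10, 11, 12)
y <- c(0, 0, 0, 1, 1, 1)

fit <- lm(y ~ x)
predict(fit, newdata = data.frame(x = c(0, 20)))
# at x = 0 the prediction is below 0, at x = 20 it is above 1 --
# both are invalid as probabilities
```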
What are the types of
Logistic Regression techniques?
Logistic Regression isn't
just limited to solving binary classification problems. To solve problems that
have multiple classes, we can use extensions of Logistic Regression, which
include Multinomial Logistic Regression and Ordinal
Logistic Regression. Let's get their basic idea:
1. Multinomial Logistic
Regression: Let's
say our target variable has K = 4 classes. This technique handles the
multi-class problem by fitting K - 1 independent binary logistic classifier
models. It chooses one target class as the reference
class and fits K - 1 regression models that compare each of the remaining classes
to the reference class.
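As a sketch (assuming the nnet package is installed), a multinomial model can be fit in R with nnet::multinom(), which uses the first factor level as the reference class:

```r
library(nnet)
# iris has K = 3 classes, so multinom fits K - 1 = 2 sets of coefficients,
# each comparing one class to the reference level ("setosa")
fit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris)
summary(fit)  # one row of coefficients per non-reference class
```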
Due to its restrictive
nature, it isn't used widely because it does not scale very well in the
presence of a large number of target classes. In addition, since it builds K -
1 models, we would require a much larger data set to achieve reasonable
accuracy.
2. Ordinal Logistic Regression: This technique is
used when the target variable is ordinal in nature. Let's say we want to
predict years of work experience (1, 2, 3, 4, 5, etc.). So, there exists an order in
the values, i.e., 5 > 4 > 3 > 2 > 1. Unlike the multinomial model, which trains
K - 1 models, Ordinal Logistic Regression builds a single model with multiple
threshold values.
If we have K classes, the
model will require K - 1 thresholds or cutoff points. Also, it makes an
imperative assumption of proportional odds. The assumption says that the effect of
each predictor is the same across all thresholds, so on the logit scale the
cumulative log-odds curves are parallel straight lines.
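A minimal sketch (assuming the MASS package is installed): polr() fits a proportional-odds model, estimating one set of slopes plus K - 1 intercepts (the cutoff points). The housing data set used below ships with MASS and has an ordered response Sat with three levels.

```r
library(MASS)
# proportional-odds model: single set of slopes, K - 1 = 2 cutpoints
fit <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)
summary(fit)  # coefficients section, then the two intercepts (cutpoints)
```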
Note: Logistic Regression
is not a great choice for solving multi-class problems. But, it's good to be aware
of its types. In this tutorial, we'll focus on Logistic Regression
for the binary classification task.
How does Logistic Regression work?
Now comes the interesting part!
As we know, Logistic Regression assumes that
the dependent (or response) variable follows a binomial distribution. Now, you
may wonder, what is binomial distribution? Binomial distribution can be
identified by the following characteristics:
- There must be a fixed number of
trials denoted by n, i.e. in the data set, there must be a
fixed number of rows.
- Each trial can have only two
outcomes; i.e., the response variable can have only two unique
categories.
- The outcome of each trial must
be independent of each other; i.e., the unique levels of the response
variable must be independent of each other.
- The probability of success (p)
and failure (q) should be the same for each trial.
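These probabilities are available directly in R via dbinom(). For instance, with n trials and success probability p, the probability of observing exactly k successes is:

```r
# probability of exactly 3 successes in 10 independent trials with p = 0.5
dbinom(3, size = 10, prob = 0.5)

# the probabilities over all possible outcomes (0 to 10 successes) sum to 1
sum(dbinom(0:10, size = 10, prob = 0.5))
```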
Let's understand how Logistic Regression
works. For Linear Regression, where the output is a linear combination of
input feature(s), we write the equation as:
`Y = β0 + β1X + ε`
In Logistic Regression, we use the same
equation but with some modifications made to Y. Let's reiterate a fact
about Logistic Regression: we calculate probabilities. And, probabilities
always lie between 0 and 1. In other words, we can say:
- The response value must be
positive.
- It should be lower than 1.
First, we'll meet the above two criteria. We
know the exponential of any value is always a positive number. And, any number divided
by that number + 1 will always be lower than 1. Let's implement these two
findings:
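Concretely, using the linear predictor from above (these are the standard logistic regression equations, written with this tutorial's notation):

- Taking the exponential guarantees a positive value: `e^(β0 + β1X) > 0`.
- Dividing that value by itself plus 1 keeps the result below 1: `p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X))`.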
Now we are convinced that the probability
value will always lie between 0 and 1. To determine the link function, follow
the algebraic calculations carefully. P(Y=1|X) can be read as
"probability that Y =1 given some value for x." Y can take only two
values, 1 or 0. For ease of calculation, let's rewrite P(Y=1|X) as p(X).
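Rearranging the probability equation to isolate the linear part gives:

`log(p(X) / (1 - p(X))) = β0 + β1X`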
As you might recognize, the right side of
the equation above is the linear combination of
independent variables. The left side is known as the log-odds, or
logit, and is the link function for
Logistic Regression. The inverse of this link function is the sigmoid
function, which constrains the output probabilities to the range between 0 and 1.
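A quick sketch of this in R, using plogis(), which is R's built-in sigmoid (inverse logit):

```r
# the sigmoid maps any real-valued linear predictor into (0, 1)
z <- c(-10, -1, 0, 1, 10)
p <- plogis(z)   # equivalent to exp(z) / (1 + exp(z))
p                # all values lie strictly between 0 and 1; plogis(0) is 0.5
qlogis(p)        # the logit link, log(p / (1 - p)), recovers z
```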
How can you evaluate Logistic Regression model fit
and accuracy?
In Linear Regression, we
check adjusted R², the F statistic, MAE, and RMSE to evaluate model fit
and accuracy. Logistic Regression, however, employs an entirely different set of
metrics, since here we deal with probabilities and categorical values. Following are
the evaluation metrics used for Logistic Regression:
1. Akaike Information
Criterion (AIC)
You can look at AIC as the
counterpart of adjusted R² in multiple regression. It's an important
indicator of model fit. It follows the rule: the smaller, the better. AIC penalizes a
model for an increasing number of coefficients. In other words, adding more
variables lowers AIC only if they improve the fit enough to offset the penalty.
This helps to avoid overfitting.
Looking at the AIC of a single
model isn't very informative. It is more useful for comparing models (model
selection). So, build 2 or 3 Logistic Regression models and compare their AIC.
The model with the lowest AIC will be relatively better.
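In R, this comparison is a one-liner. The snippet below assumes model and model2 are fitted glm objects, like the ones built later in this tutorial:

```r
# returns one row per model; the row with the smallest AIC is preferred
AIC(model, model2)
```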
2. Null Deviance and
Residual Deviance
The deviance of an observation
is computed as -2 times its log likelihood. The importance of
deviance can be further understood using its two types: null and residual
deviance. Null deviance is calculated from the model with no features,
i.e., only the intercept. The null model predicts the class via a constant probability.
Residual deviance is
calculated from the model containing all the features. In comparison with Linear
Regression, think of residual deviance as the residual sum of squares (RSS) and null
deviance as the total sum of squares (TSS). The larger the difference between null
and residual deviance, the better the model.
Also, you can use these
metrics to compare multiple models fitted on the same data: the lower the
residual deviance, the better the model.
Practically, though, AIC is usually preferred over deviance for evaluating model
fit, since it accounts for model complexity.
3. Confusion Matrix
Confusion matrix is the
most crucial metric commonly used to evaluate classification models. It's quite
confusing but make sure you understand it by heart. If you still don't
understand anything, ask me in comments.
Accuracy - It measures the overall
predictive accuracy of the model. It is calculated as Accuracy
= (TP + TN) / (TP + TN + FP + FN).
True Positive Rate (TPR) - It indicates how many
positive values, out of all the positive values, have been correctly predicted:
TPR = TP / (TP + FN).
Also, TPR = 1 - False Negative
Rate. It is also known as Sensitivity or Recall.
False Positive Rate (FPR) - It indicates how many
negative values, out of all the negative values, have been incorrectly predicted:
FPR = FP / (FP + TN).
Also, FPR = 1 - True Negative Rate.
True Negative Rate (TNR) - It indicates how many
negative values, out of all the negative values, have been correctly predicted:
TNR = TN / (TN + FP). It
is also known as Specificity.
False Negative Rate (FNR) - It indicates how many
positive values, out of all the positive values, have been incorrectly
predicted: FNR = FN / (FN + TP).
Precision - It indicates how many
values, out of all the predicted positive values, are actually positive:
Precision = TP / (TP + FP).
F Score - The F score is the harmonic mean of
precision and recall. It lies between 0 and 1; the higher the value, the better the
model. It is calculated as F = 2 * (Precision * Recall)
/ (Precision + Recall).
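All of these can be computed from a 2x2 table in R. A minimal sketch, using made-up 0/1 label vectors (in practice these would be actual labels and thresholded predictions, like new_test$Survived and log_predict later in this tutorial):

```r
# hypothetical labels for illustration
actual    <- c(1, 1, 1, 0, 0, 0, 1, 0)
predicted <- c(1, 0, 1, 0, 1, 0, 1, 0)

tab <- table(predicted, actual)   # rows: predicted, columns: actual
TP <- tab["1", "1"]; TN <- tab["0", "0"]
FP <- tab["1", "0"]; FN <- tab["0", "1"]

accuracy  <- (TP + TN) / (TP + TN + FP + FN)
tpr       <- TP / (TP + FN)   # sensitivity / recall
fpr       <- FP / (FP + TN)
tnr       <- TN / (TN + FP)   # specificity
precision <- TP / (TP + FP)
f_score   <- 2 * (precision * tpr) / (precision + tpr)
```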
4.
Receiver Operating Characteristic (ROC)
The ROC curve summarizes a
classification model's performance across all possible threshold values.
It is plotted as the True Positive Rate (Y axis) against the False Positive Rate
(X axis), and the model's accuracy is summarized by the Area Under the
Curve (AUC), also referred to as the index of accuracy (A) or the concordance index.
The higher the area, the better the model.
In this plot, our aim is to push the curve toward the top-left corner
and maximize the area under it; the higher the curve, the better the
model. The diagonal line corresponds to a random classifier (AUC = 0.5).
Practical code
```r
path <- "E:/Data/Titanic"
setwd(path)

library(data.table)
library(plyr)
library(stringr)

train <- fread("train.csv", na.strings = c("", " ", NA, "NA"))
test  <- fread("test.csv",  na.strings = c("", " ", NA, "NA"))

head(train)
head(test)

# check missing values
colSums(is.na(train))
colSums(is.na(test))

# quick data exploration
summary(train$Age)
summary(test$Age)

train[, .N/nrow(train), Pclass]
test[, .N/nrow(test), Pclass]

train[, .N/nrow(train), Sex]
test[, .N/nrow(test), Sex]

train[, .N/nrow(train), SibSp]
test[, .N/nrow(test), SibSp]

train[, .N/nrow(train), Parch]
test[, .N/nrow(test), Parch]  # an extra level 9 appears in Parch (imputed to 0 below)

summary(train$Fare)  # skewed variable
summary(test$Fare)

train[, .N/nrow(train), Cabin]
test[, .N/nrow(test), Cabin]

train[, .N/nrow(train), Embarked]
test[, .N/nrow(test), Embarked]

# combine data
alldata <- rbind(train, test, fill = TRUE)

# new variables
# extract passenger titles
alldata[, title := strsplit(Name, split = "[,.]")]
alldata[, title := ldply(.data = title, .fun = function(x) x[2])]
alldata[, title := str_trim(title, side = "left")]

# combine titles
alldata[, title := replace(title, which(title %in% c("Capt","Col","Don","Jonkheer","Major","Rev","Sir")), "Mr"), by = title]
alldata[, title := replace(title, which(title %in% c("Lady","Mlle","Mme","Ms","the Countess","Dr")), "Mrs"), by = title]

# ticket binary coding
alldata[, abs_col := strsplit(x = Ticket, split = " ")]
alldata[, abs_col := ldply(.data = abs_col, .fun = function(x) length(x))]
alldata[, abs_col := ifelse(abs_col > 1, 1, 0)]

# impute Age with the median
for (i in "Age")
  set(alldata, i = which(is.na(alldata[[i]])), j = i, value = median(alldata$Age, na.rm = TRUE))

# remove rows containing NA in Embarked
alldata <- alldata[!is.na(Embarked)]

# impute Fare with the median
for (i in "Fare")
  set(alldata, i = which(is.na(alldata[[i]])), j = i, value = median(alldata$Fare, na.rm = TRUE))

# replace missing values in Cabin with "Miss"
alldata[is.na(Cabin), Cabin := "Miss"]

# log-transform Fare
alldata$Fare <- log(alldata$Fare + 1)

# impute Parch 9 to 0
alldata[Parch == 9L, Parch := 0]

# collect train and test
train <- alldata[!(is.na(Survived))]
train[, Survived := as.factor(Survived)]

test <- alldata[is.na(Survived)]
test[, Survived := NULL]

# logistic regression
model <- glm(Survived ~ ., family = binomial(link = "logit"),
             data = train[, -c("PassengerId", "Name", "Ticket")])
summary(model)  # categorical variables are k-1 dummy encoded

# run anova
anova(model, test = "Chisq")

# model 2 with the important predictors
model2 <- glm(Survived ~ Pclass + Sex + Age + SibSp + Fare + title,
              data = train, family = binomial(link = "logit"))
summary(model2)  # the lower the AIC, the better the model

# compare the two models (chi-square / likelihood ratio test)
anova(model, model2, test = "Chisq")

# make predictions
test[title == "Dona", title := "Miss"]
pred_titanic <- predict(model2, newdata = test, type = "response")

# partition the training data
library(caret)
split <- createDataPartition(y = train$Survived, p = 0.6, list = FALSE)
new_train <- train[split]
new_test  <- train[-split]

# model training
log_model <- glm(Survived ~ Pclass + Sex + Age + SibSp + Fare + title,
                 data = new_train[, -c("PassengerId", "Name", "Ticket")],
                 family = binomial(link = "logit"))
summary(log_model)

log_predict <- predict(log_model, newdata = new_test, type = "response")
log_predict <- ifelse(log_predict > 0.5, 1, 0)

# plot ROC
library(ROCR)
library(Metrics)
pr <- prediction(log_predict, new_test$Survived)
perf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(perf)
auc(new_test$Survived, log_predict)  # 0.78

# plot ROC with a different threshold value
log_predict <- predict(log_model, newdata = new_test, type = "response")
log_predict <- ifelse(log_predict > 0.6, 1, 0)
pr <- prediction(log_predict, new_test$Survived)
perf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(perf)
auc(new_test$Survived, log_predict)  # 0.8008
```