In this lesson, we will be learning about simple linear regression.
Regression is EZ-PZ if you follow through this lesson.
So, let’s say there are two features about you, height and weight, which together determine your BMI.
By regression, we mean that we are trying to predict your weight by looking at your height. So, if we give our system the height and weight data of the 50 people in our class, then for a 51st or 52nd person added later on, the system will try to predict that person’s weight from their height.
Here, we are taking weight as the TARGET. Let’s just use a nickname for it: the dependent variable. It’s a bad nickname, but statisticians seem to like it, so OK.
So, if weight is our target variable, which is dependent, that makes height our independent variable (the nickname for the other variable).
In other words, we are trying to predict weight (dependent, the target) using height (independent).
Note: For linear regression, the dependent variable should always be numeric. We will be predicting values of weight based on the values of height.
So let’s start by importing our data. BTW, you can download the data here: htwt
htwt <- read.csv("htwt.csv")
Here we are only interested in the weight (in kg) and height (in cm) variables.
So, now let’s use the plot() function like we did in the previous tutorial:
plot(weight ~ height, data = htwt)
There seems to be a positive linear relationship. As the height increases (X axis), the weight increases (Y axis).
Let’s look at the Pearson correlation coefficient using the cor() function.
In statistics, the Pearson correlation coefficient, also referred to as Pearson’s r, is a measure of the linear correlation between two variables X and Y. We will refer to this value as r.
We are looking at r’s value because we want to know how strongly the 2 variables are related.
r’s value is always between -1 and 1, and if it is close to 0, it means there is very little linear correlation between the variables.
But here, r’s value is 0.77, which is good, as it tells us that there is a strong positive linear relationship between the variables. In other words, as height increases, weight tends to increase too. Keep in mind that 0.77 tells us how closely the points follow a straight line, not how many kg the weight changes per cm of height.
Note: There can be cases when the correlation is near 0 or negative.
1. If the correlation is near 0, it means the 2 variables have no linear relationship (they could still be related in some non-linear way).
2. If the correlation is negative (let’s say -0.6), it means that when one variable increases, the other tends to decrease.
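Here is a small sketch of what a cor() call looks like. The htwt.csv file is not reproduced in this lesson, so the block builds a made-up data frame called htwt_demo (a hypothetical stand-in, with invented numbers) that has the same height and weight columns. With the real data, the equivalent call would be cor(htwt$height, htwt$weight).

```r
# Synthetic stand-in for htwt: an assumption for illustration only,
# these are NOT the real class measurements.
set.seed(42)
height <- round(rnorm(50, mean = 170, sd = 10))     # heights in cm
weight <- 1.2 * height - 135 + rnorm(50, sd = 8)    # weights in kg, made-up relationship
htwt_demo <- data.frame(height = height, weight = weight)

# Pearson's r (the default method of cor) between the two columns
r <- cor(htwt_demo$height, htwt_demo$weight)
print(r)

# Flipping the sign of one variable flips the sign of r but keeps its size,
# which is what a negative correlation looks like
r_neg <- cor(htwt_demo$height, -htwt_demo$weight)
print(r_neg)
```

The exact value of r here depends on the made-up data, so do not expect it to match the 0.77 from the real file.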
Coming back to linear regression: it fits a straight line, which has the equation Y = mX + c, where:
1. Y is Dependent variable
2. m is Slope of line
3. X is Independent variable
4. c is Intercept
Once we calculate m and c, we will be able to predict weight from any value of height.
So let’s now use the lm() function to calculate the values of the intercept (c) and the slope (m):
linear <- lm(weight ~ height, data = htwt)
So we have calculated the intercept and slope for the equation of linear regression.
These numbers are easier to interpret once we draw the fitted line through the dots we got in the plot.
To do that, we use the abline() function, but first we need to draw the plot again:
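To actually see the two numbers, you can pull them out of the fitted model with coef(). The sketch below uses a made-up data frame (htwt_demo and linear_demo are hypothetical names, and the data is invented) because the real file is not included here. It also checks lm() against the textbook least-squares formulas for a single predictor: slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x).

```r
# Synthetic stand-in for htwt: invented numbers, for illustration only
set.seed(1)
height <- rnorm(50, mean = 170, sd = 10)
weight <- 1.2 * height - 135 + rnorm(50, sd = 8)
htwt_demo <- data.frame(height = height, weight = weight)

# Fit the same kind of model as in the lesson
linear_demo <- lm(weight ~ height, data = htwt_demo)

# coef() returns a named vector: "(Intercept)" is c, "height" is m
coefs <- coef(linear_demo)
c_hat <- unname(coefs["(Intercept)"])
m_hat <- unname(coefs["height"])

# The same values from the closed-form least-squares formulas
m_manual <- cov(htwt_demo$height, htwt_demo$weight) / var(htwt_demo$height)
c_manual <- mean(htwt_demo$weight) - m_manual * mean(htwt_demo$height)

print(c(slope_lm = m_hat, slope_manual = m_manual))
print(c(intercept_lm = c_hat, intercept_manual = c_manual))
```

With the real data, coef(linear) would give the intercept and slope used in the rest of this lesson.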
plot(weight ~ height, data = htwt)
Note: First write the code above and execute it by pressing enter and then write the code below:
abline(lm(weight ~ height, data = htwt), col = 4, lty = 4, lwd = 2)
Here, col is the color, lty is the line type, and lwd is the line width.
The plot() function first plots the data, and then the abline() function draws its line on top of that existing plot. This is why the order of execution matters.
So this will produce the following output:
So, we have our line. We have our coefficients. Now, what inference can we make out of this?
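The intercept means that if we were to extend the line back, all the way to where height becomes 0, the predicted weight would be -136.51. That is far outside the range of our data, so it has no physical meaning by itself; it just fixes where the line sits. The slope is the rise over run: for every 1 cm increase in height, the predicted weight goes up by about 1.18 kg.
So, now our equation Y = mX + c will become: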
Y = 1.184 * X – 136.51
Now, if we plug in any value for height, i.e. X, we will be able to produce a predicted value of weight, i.e. Y.
In R, we use function predict() like below:
predict(linear, data.frame("height" = 190))
So, we predicted that if our height (X), the independent variable, is 190 cm, then our weight (Y), the dependent variable, is going to be about 88.37 kg.
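As a quick sanity check, you can plug 190 into the equation by hand. This sketch uses the rounded coefficients from the equation above (the variable names slope, intercept and x are just for illustration), so it lands close to, but not exactly on, the 88.37 reported by predict(), which works with the unrounded coefficients.

```r
# Rounded coefficients taken from the equation Y = 1.184 * X - 136.51
slope <- 1.184
intercept <- -136.51

# Height we want a prediction for, in cm
x <- 190

# Y = mX + c
y <- slope * x + intercept
print(y)   # about 88.45
```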
We made our model do predictions for us.
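Note, though, that 88.37 is only a single best estimate: roughly the average weight the model expects for people who are 190 cm tall. It does not mean there is a 95% chance that such a person weighs exactly 88.37 kg.
What if we want a range that covers the prediction with 95% probability?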
We can tweak the predict function like below:
predict(linear, data.frame("height" = 190), interval = "prediction")
This gives a 95% prediction interval (95% is the default level of predict()).
In other words, according to the model, there is about a 95% chance that a person with a height of 190 cm weighs between 71 and 105 kg.
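If you want to play with this, predict() also accepts interval = "confidence" and a level argument. The sketch below runs on made-up data (htwt_demo, linear_demo and new_person are hypothetical names, and the numbers are invented), so the output will not match the 71 to 105 kg range above. It shows that the prediction interval for one new person is wider than the confidence interval for the average weight at that height.

```r
# Synthetic stand-in for htwt: invented numbers, for illustration only
set.seed(7)
height <- rnorm(50, mean = 170, sd = 10)
weight <- 1.2 * height - 135 + rnorm(50, sd = 8)
htwt_demo <- data.frame(height = height, weight = weight)
linear_demo <- lm(weight ~ height, data = htwt_demo)

new_person <- data.frame(height = 190)

# Range for ONE new person's weight (what the lesson uses)
pred_int <- predict(linear_demo, new_person, interval = "prediction", level = 0.95)

# Range for the AVERAGE weight of all people with this height
conf_int <- predict(linear_demo, new_person, interval = "confidence", level = 0.95)

# Both are matrices with columns fit, lwr and upr
print(pred_int)
print(conf_int)
```

Both calls return the same fit value; only the lwr and upr columns differ.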
Congratulations on learning this lesson. Go ahead and learn more.