Multiple Linear Regression

Simple Linear Regression Vs Multiple Linear Regression

Perfect Linear Regression


Real Life


Linear Regression

We are good if and only if the model's assumptions are met.




Case Study:

The dataset contains 9568 data points collected from a Combined Cycle Power Plant while the plant was set to work at full load. The features are hourly averages of the ambient variables

Temperature (AT),
Ambient Pressure (AP),
Relative Humidity (RH) and
Exhaust Vacuum (V),
which are used to predict the net hourly electrical energy output (PE) of the plant.

A gas turbine generator generates electricity while the waste heat from the gas turbine is used to make steam to generate additional electricity via a steam turbine.

Read the dataset and check the summary
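
A minimal sketch of this step in base R. The source does not name the data file, so the example writes a small synthetic stand-in to a temporary CSV and reads it back; `csv_path`, `demo`, and the synthetic values are assumptions for illustration only, and only the column names (AT, V, AP, RH, PE) and the name `edata` come from the notes.

```r
# Synthetic stand-in for the plant data (assumption, NOT real measurements).
set.seed(1)
n <- 200
demo <- data.frame(
  AT = runif(n, 2, 37),
  V  = runif(n, 25, 81),
  AP = rnorm(n, mean = 1013, sd = 6),
  RH = runif(n, 25, 100)
)
demo$PE <- 480 - 2 * demo$AT + rnorm(n, sd = 5)

# Hypothetical file location; replace with the path to the actual CSV.
csv_path <- tempfile(fileext = ".csv")
write.csv(demo, csv_path, row.names = FALSE)

edata <- read.csv(csv_path)   # read the dataset
str(edata)                    # column types and dimensions
summary(edata)                # min, quartiles, mean, max per column
colSums(is.na(edata))         # missing values per column
```

With the real file, only the `read.csv()` call and the three inspection calls are needed.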


Check boxplot for outliers

> boxplot(edata$AT, main = "Temperature", horizontal = TRUE)

> boxplot(edata$V, main = "Exhaust Vacuum", horizontal = TRUE)

> boxplot(edata$AP, main = "Ambient Pressure", horizontal = TRUE)

> boxplot(edata$RH, main = "Relative Humidity", horizontal = TRUE)

> boxplot(edata$PE, main = "Energy Production", horizontal = TRUE)


Split the dataset and check the correlation matrix
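
One possible way to do the split and inspect the correlations, sketched on synthetic data. The 70/30 ratio, the seed, and the names `train`, `test`, and `cor_mat` are assumptions; the notes do not state how the split was made.

```r
# Synthetic stand-in data (assumption), not the real plant measurements.
set.seed(42)
n <- 500
AT <- runif(n, 2, 37)
edata <- data.frame(
  AT = AT,
  V  = 25 + 1.3 * AT + rnorm(n, sd = 6),   # built to correlate with AT
  AP = rnorm(n, mean = 1013, sd = 6),
  RH = runif(n, 25, 100)
)
edata$PE <- 480 - 2 * edata$AT - 0.2 * edata$RH + rnorm(n, sd = 4)

# Random 70/30 train/test split (ratio is an assumption).
set.seed(123)
train_idx <- sample(seq_len(nrow(edata)), size = floor(0.7 * nrow(edata)))
train <- edata[train_idx, ]
test  <- edata[-train_idx, ]

# Pairwise Pearson correlations on the training set.
cor_mat <- cor(train)
round(cor_mat, 2)

# Correlation of each variable with the response, weakest to strongest.
sort(abs(cor_mat[, "PE"]))
```

A large off-diagonal entry between two predictors (here AT and V by construction) is an early warning of multicollinearity.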


Check linearity of dependent variable with independent variables.
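
A hedged sketch of one way to check linearity: plot PE against each predictor and overlay a straight-line fit and a lowess smoother. If the two curves roughly agree, a linear term is plausible. The data are synthetic and the name `train` is an assumption.

```r
# Synthetic stand-in training data (assumption), not the real measurements.
set.seed(42)
n <- 350
AT <- runif(n, 2, 37)
train <- data.frame(
  AT = AT,
  V  = 25 + 1.3 * AT + rnorm(n, sd = 6),
  AP = rnorm(n, mean = 1013, sd = 6),
  RH = runif(n, 25, 100)
)
train$PE <- 480 - 2 * train$AT - 0.2 * train$RH + rnorm(n, sd = 4)

predictors <- c("AT", "V", "AP", "RH")

op <- par(mfrow = c(2, 2))              # 2 x 2 grid of scatterplots
for (p in predictors) {
  x <- train[[p]]
  plot(x, train$PE, xlab = p, ylab = "PE", main = paste("PE vs", p))
  abline(lm(train$PE ~ x), col = "red")                # straight-line fit
  lines(lowess(x, train$PE), col = "blue", lty = 2)    # local smoother
}
par(op)                                 # restore plotting parameters
```

`pairs(train)` gives a quicker all-against-all view of the same relationships.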


Fit the model, check summary and check plots for assumptions
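
The fit and the standard diagnostic plots can be sketched as follows. `model_fit` with the four predictors matches the `vif(model_fit)` output shown in these notes; the data here are synthetic stand-ins.

```r
# Synthetic stand-in training data (assumption), not the real measurements.
set.seed(42)
n <- 350
AT <- runif(n, 2, 37)
train <- data.frame(
  AT = AT,
  V  = 25 + 1.3 * AT + rnorm(n, sd = 6),
  AP = rnorm(n, mean = 1013, sd = 6),
  RH = runif(n, 25, 100)
)
train$PE <- 480 - 2 * train$AT - 0.2 * train$RH + rnorm(n, sd = 4)

# Fit the multiple linear regression with all four predictors.
model_fit <- lm(PE ~ AT + V + AP + RH, data = train)

# Coefficients, standard errors, t-tests, R-squared, F-statistic.
summary(model_fit)

# Four default diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage.
op <- par(mfrow = c(2, 2))
plot(model_fit)
par(op)
```

A flat residuals-vs-fitted panel supports linearity, a flat scale-location panel supports constant variance, and points close to the Q-Q line support normal residuals.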


Check multicollinearity, drop the variable that inflates the VIF, remove outliers, and refit the model on the new dataset “new_train”


Check Multicollinearity

> vif(model_fit)

      AT        V       AP       RH
5.911819 3.882701 1.467918 1.693694

Because AT and V are highly correlated, the variance inflation factor (VIF) is highest for AT (5.91, above the common rule-of-thumb cutoff of 5).

Hence we have to drop either AT or V.

We will keep AT, as it is highly correlated with the dependent variable.
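
To make the VIF concrete: for predictor j it is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. The base-R sketch below computes it by hand on synthetic data; `vif_manual` is a hypothetical name. The `vif()` call in these notes presumably comes from an add-on package such as car, which is an assumption.

```r
# Synthetic stand-in training data (assumption), not the real measurements.
set.seed(42)
n <- 350
AT <- runif(n, 2, 37)
train <- data.frame(
  AT = AT,
  V  = 25 + 1.3 * AT + rnorm(n, sd = 6),   # built to correlate with AT
  AP = rnorm(n, mean = 1013, sd = 6),
  RH = runif(n, 25, 100)
)

predictors <- c("AT", "V", "AP", "RH")

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j
# on the remaining predictors.
vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  aux_fit <- lm(reformulate(others, response = p), data = train)
  1 / (1 - summary(aux_fit)$r.squared)
})

round(vif_manual, 3)
```

A VIF near 1 means the predictor is almost unrelated to the others; the larger the value, the more its coefficient's variance is inflated by the overlap.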

Remove the outliers, prepare the new dataset, and run the model on it.
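
The notes do not say which rule was used to remove outliers. The sketch below assumes the boxplot rule (points beyond 1.5 × IQR from the hinges, the same points a boxplot draws individually) applied to the retained variables; `is_boxplot_outlier`, `vars`, and `flagged` are hypothetical names, and the data are synthetic with a few planted outliers.

```r
# Synthetic stand-in training data (assumption), not the real measurements.
set.seed(42)
n <- 350
train <- data.frame(
  AT = runif(n, 2, 37),
  AP = rnorm(n, mean = 1013, sd = 6),
  RH = runif(n, 25, 100)
)
train$PE <- 480 - 2 * train$AT - 0.2 * train$RH + rnorm(n, sd = 4)
train$AP[1:3] <- c(960, 1060, 955)   # planted outliers for the demo

# TRUE for values that boxplot() would draw as individual points.
is_boxplot_outlier <- function(x) x %in% boxplot.stats(x)$out

vars <- c("AT", "AP", "RH", "PE")

# Flag a row if any of its values is a boxplot outlier.
flagged <- Reduce(`|`, lapply(train[vars], is_boxplot_outlier))

new_train <- train[!flagged, ]
c(before = nrow(train), after = nrow(new_train), removed = sum(flagged))
```

Dropping rows changes the quartiles, so a second pass can flag new points; whether to iterate is a judgment call.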


> Model_new_fit <- lm(PE ~ AT + AP + RH, data = new_train)

Residual plots for assumptions


Normality plot for residuals
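
A minimal sketch of a normality check on the residuals, using a normal Q-Q plot and a histogram. The data are synthetic; `res` is a hypothetical name.

```r
# Synthetic stand-in for new_train (assumption), not the real measurements.
set.seed(42)
n <- 350
new_train <- data.frame(
  AT = runif(n, 2, 37),
  AP = rnorm(n, mean = 1013, sd = 6),
  RH = runif(n, 25, 100)
)
new_train$PE <- 480 - 2 * new_train$AT - 0.2 * new_train$RH +
  rnorm(n, sd = 4)

Model_new_fit <- lm(PE ~ AT + AP + RH, data = new_train)
res <- residuals(Model_new_fit)

# Points close to the reference line suggest approximately normal residuals.
qqnorm(res, main = "Normal Q-Q plot of residuals")
qqline(res, col = "red")

# A roughly symmetric, bell-shaped histogram supports the same conclusion.
hist(res, breaks = 30, main = "Histogram of residuals", xlab = "Residual")
```

Systematic curvature at the ends of the Q-Q plot points to heavy tails or skew.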




Cook's Distance
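
Cook's distance measures how much the fitted values change when a single observation is left out. The sketch below computes it on synthetic data and flags points above 4/n, which is one common rule of thumb and an assumption here, not a cutoff stated in these notes.

```r
# Synthetic stand-in for new_train (assumption), not the real measurements.
set.seed(42)
n <- 350
new_train <- data.frame(
  AT = runif(n, 2, 37),
  AP = rnorm(n, mean = 1013, sd = 6),
  RH = runif(n, 25, 100)
)
new_train$PE <- 480 - 2 * new_train$AT - 0.2 * new_train$RH +
  rnorm(n, sd = 4)

Model_new_fit <- lm(PE ~ AT + AP + RH, data = new_train)

cooks <- cooks.distance(Model_new_fit)   # one value per observation
threshold <- 4 / length(cooks)           # rule-of-thumb cutoff (assumption)

plot(cooks, type = "h", ylab = "Cook's distance",
     main = "Cook's distance per observation")
abline(h = threshold, lty = 2, col = "red")

# Observations whose removal would move the fit the most.
influential <- which(cooks > threshold)
head(sort(cooks[influential], decreasing = TRUE))
```

`plot(Model_new_fit, which = 4)` draws a similar chart directly from the model object.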


Final fitted model:

PE = 481.76 - 2.37*AT + 0.03*AP - 0.203*RH
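
To show how the fitted equation is used, the sketch below wraps it in a function and evaluates it for one hypothetical set of ambient conditions. The coefficients are the ones reported above; the input values and the name `predict_pe` are illustrative assumptions.

```r
# Fitted equation from the notes, wrapped as a function.
predict_pe <- function(AT, AP, RH) {
  481.76 - 2.37 * AT + 0.03 * AP - 0.203 * RH
}

# Hypothetical ambient conditions (illustrative values only).
predict_pe(AT = 20, AP = 1010, RH = 70)
# 481.76 - 47.40 + 30.30 - 14.21 = 450.45

# Effect of a one-unit rise in AT with AP and RH held fixed: -2.37
predict_pe(AT = 21, AP = 1010, RH = 70) - predict_pe(AT = 20, AP = 1010, RH = 70)
```

Each coefficient is the expected change in PE for a one-unit change in that variable with the others held constant, so temperature has by far the largest per-unit effect of the three.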