Background Info
Linear Regression and the need for Diagnostic Plots
Linear regression is one of the most commonly used machine learning models. It assumes that the relationship between the predictor (x) and the outcome variable (y) is linear, and the slope coefficients and the R^2 value are typically used to judge how well the model represents the data. This is not the whole story: some relationships are better explained by other mathematical models (for example, polynomial or logarithmic).
The reason this is important to discuss is that operating under false assumptions will result in poor model performance.
A residual is the difference between the observed value and the mean value that the model predicts for that observation. Residuals are useful for showing how well or poorly a model represents the data and, more importantly, whether the linear regression assumptions are met.
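As a minimal sketch of this definition, the snippet below fits a straight line to a small made-up dataset and computes the residuals as observed minus fitted values. The data and variable names are illustrative assumptions, not taken from the project.

```python
import numpy as np

# Hypothetical toy data (not from the project): y is roughly 2x + 1 plus noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# Least-squares straight line; for deg=1, np.polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)

# Fitted value for each observation, and the residual (observed - fitted).
fitted = slope * x + intercept
residuals = y - fitted

print(residuals)
```

For a least-squares fit that includes an intercept, the residuals sum to zero up to floating-point error, so it is their pattern, not their total, that carries the diagnostic information.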
Linear Regression Assumptions
Other key points that may influence and/or leverage the regression model.
Diagnostic Plots for Linear Regression
Why is this important?
To test whether the data can be described by a linear regression, several diagnostic plots have been developed. Learning how to use these tools is essential when conducting a regression analysis. Five plots are described in this post.
The project goals
Data
Analysis
The programming language Python was used in this project. The matplotlib and seaborn libraries were used to visualize the data. Pandas was used to wrangle the data, while numpy and statsmodels were used for the calculations and machine learning models.
Data extraction, transformation, loading, and exploring pipeline
Residual Histogram
Residuals vs Fitted
Normal Q-Q Plot
Scale-Location
Residuals vs Leverage