Basics of Linear Regression

Mayank Yogi · Published in Analytics Vidhya · 5 min read · Sep 29, 2020


In machine learning, there are various models and techniques for solving real-world problems across many fields, and each model, or algorithm, performs differently on different data.

Data is the key word in the field of data science; the whole game is about data and how to get more insight from it to solve business problems or enhance the business.

So, there are three types of learning in machine learning: supervised, unsupervised, and reinforcement learning.


Linear regression falls under supervised learning, so we should discuss supervised learning a little more.

Supervised learning, as the name indicates, implies the presence of a supervisor acting as a teacher. Basically, supervised learning is learning in which we teach or train the machine using data that is well labeled, meaning the data is already tagged with the correct answer. After that, the machine is given a new set of examples (data) so that the supervised learning algorithm analyses the training data (the set of training examples) and produces a correct outcome from the labeled data.

Now back to linear regression. As the name itself suggests, this is a linear model. Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered the explanatory (independent) variable, and the other is considered the dependent variable.

For example, a modeler might want to relate the weights of individuals to their heights using a linear regression model.
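A minimal sketch of that weight-vs-height example with scikit-learn; the measurements below are made up purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: heights in cm (explanatory) and weights in kg (dependent)
heights = np.array([[150], [160], [165], [170], [180], [185]])
weights = np.array([50, 56, 61, 65, 74, 80])

model = LinearRegression()
model.fit(heights, weights)

# The fitted line: weight = slope * height + intercept
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("predicted weight at 175 cm:", model.predict([[175]])[0])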

Types of Linear Regression

Linear regression can be further divided into two types of algorithm:

  • Simple Linear Regression:
    If a single independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Simple Linear Regression.
  • Multiple Linear Regression:
    If more than one independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Multiple Linear Regression.
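A short sketch contrasting the two, with made-up feature names (area, bedrooms, age) used only for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# Simple linear regression: a single independent variable (area) predicts price
X_simple = np.array([[50], [80], [100], [120]])
y = np.array([150, 220, 260, 310])
simple_model = LinearRegression().fit(X_simple, y)

# Multiple linear regression: several independent variables predict price
X_multi = np.array([[50, 1, 10],
                    [80, 2, 5],
                    [100, 3, 8],
                    [120, 3, 2]])  # columns: area, bedrooms, age
multi_model = LinearRegression().fit(X_multi, y)

print("simple coefficient:", simple_model.coef_)
print("multiple coefficients:", multi_model.coef_)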

Basic Assumptions to Start with Linear Regression

  • Linearity: the relationship between x (the independent features) and the mean of y (the dependent feature) should be linear.
  • Normality: for any fixed value of x, y is normally distributed.
  • Independence: observations are independent of each other.
  • Little or no multicollinearity between the features: due to multicollinearity, it may be difficult to find the true relationship between the predictors and the target variable (see the sketch after this list).
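One quick way to sanity-check the multicollinearity assumption is a sketch like the one below; the feature names are hypothetical, and statsmodels' variance_inflation_factor is just one common check:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical feature frame
X = pd.DataFrame({
    "area": [50, 80, 100, 120, 150],
    "bedrooms": [1, 2, 3, 3, 4],
    "age": [10, 5, 8, 2, 1],
})

# Pairwise correlations: large absolute values hint at multicollinearity
print(X.corr())

# Variance Inflation Factor: values well above roughly 5-10 are usually a warning sign
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)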

Pros:

  • It works well for linearly separable data. In Euclidean geometry, linear separability is a property of two sets of points: the two sets are linearly separable if there exists at least one line in the plane with all of the points of one set on one side of the line and all of the points of the other set on the other side.
  • Easy to implement and train.
  • It can handle overfitting using dimensionality reduction, cross-validation and regularization (a short sketch follows this list).
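A hedged sketch of that last point, using scikit-learn's Ridge (L2 regularization) together with cross-validation on synthetic data:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=100)

# Ridge adds an L2 penalty that shrinks the coefficients and curbs overfitting;
# cross-validation estimates how well the model generalizes to unseen data
ridge = Ridge(alpha=1.0)
scores = cross_val_score(ridge, X, y, cv=5, scoring="r2")
print("mean cross-validated R^2:", scores.mean())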

Cons:

  • It sometimes requires a lot of feature engineering; if our data is not normally distributed, then we have to apply a transformation, such as a log function, to make it normally distributed.
  • If independent features are strongly correlated with each other, it may affect performance; this is called the multicollinearity problem, and to avoid it, again we have to do feature engineering.

Multicollinearity: suppose our dataset contains 100 features and 40 of those features are strongly correlated with each other (correlation greater than 0.9); this is called multicollinearity.

  • It is prone to noise; because of outliers, our best-fit line may change.
  • It is sensitive to missing values; we have to handle them with feature engineering (replace the missing values with the mean or mode, or drop the null rows). A small sketch of these fixes follows.
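A minimal sketch of those feature-engineering fixes, assuming a pandas DataFrame with made-up column names:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [30000, 45000, np.nan, 120000, 55000],
    "age": [25, 32, 41, np.nan, 29],
})

# Handle missing values first: replace them with the mean (or mode), or drop the rows
df["income"] = df["income"].fillna(df["income"].mean())
df["age"] = df["age"].fillna(df["age"].mean())
# Alternatively: df = df.dropna()

# Log transform to reduce skew (log1p also handles zero values safely)
df["income_log"] = np.log1p(df["income"])
print(df)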

Important Point:

In linear regression, feature scaling is important to follow because of the concept of gradient descent, in which we have to minimize the loss function and reach the global minimum; features on very different scales slow this convergence down. So feature scaling is required here, and the same idea is also used in ANNs and CNNs.

Feature scaling can be done by normalization and standardization (MinMaxScaler, StandardScaler), as in the sketch below.
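A minimal sketch of both scalers from scikit-learn on a toy matrix (in practice, fit the scaler on the training data only and then transform the test data):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Standardization: zero mean and unit variance per feature
print(StandardScaler().fit_transform(X))

# Normalization: rescale each feature to the [0, 1] range
print(MinMaxScaler().fit_transform(X))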

Metrics:

The quality of a regression model is judged by how well its predictions match up against the actual values, but how do we actually evaluate that quality? Luckily, statisticians have developed error metrics to judge the quality of a model and enable us to compare one regression against other regressions with different parameters. These metrics are short and useful summaries of the quality of our model; the common ones are listed below, followed by a short sketch computing them.

  • Mean Absolute Error
  • Mean Square Error
  • Mean Absolute Percentage Error
  • Mean Percentage Error
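A hedged sketch computing these metrics with NumPy and scikit-learn; Mean Percentage Error has no scikit-learn helper, so it is written by hand, and mean_absolute_percentage_error assumes a reasonably recent scikit-learn version (the y values are made up):

import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    mean_absolute_percentage_error,
)

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

print("MAE :", mean_absolute_error(y_true, y_pred))
print("MSE :", mean_squared_error(y_true, y_pred))
print("MAPE:", mean_absolute_percentage_error(y_true, y_pred))

# Mean Percentage Error keeps the sign, so over- and under-predictions can cancel out
mpe = np.mean((y_true - y_pred) / y_true)
print("MPE :", mpe)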

Model Performance:

The goodness of fit determines how well the regression line fits the set of observations. The process of finding the best model out of various models is called optimization. It can be measured with the method below:

1. R-squared method:

  • R-squared is a statistical measure that determines the goodness of fit.
  • It measures the strength of the relationship between the dependent and independent variables on a scale of 0–100%.
  • A high value of R-squared means a small difference between the predicted values and the actual values, and hence represents a good model.
  • It is also called the coefficient of determination, or the coefficient of multiple determination for multiple regression.
  • It can be calculated from the formula R² = 1 − (SS_res / SS_tot), where SS_res is the residual sum of squares and SS_tot is the total sum of squares; a small computation sketch follows.
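A quick sketch of R-squared, both via scikit-learn's r2_score and directly from the formula above (made-up values):

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

print("R^2 (sklearn):", r2_score(y_true, y_pred))

# The same value from the formula: R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
print("R^2 (manual) :", 1 - ss_res / ss_tot)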

Fitting the Linear Regression:

from sklearn.linear_model import LinearRegression

# Create the model and fit it on the training data
regressor = LinearRegression()
regressor.fit(x_train, y_train)

Prediction:

# Predict on the test set and, optionally, on the training set
y_pred = regressor.predict(x_test)
x_pred = regressor.predict(x_train)
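Putting the pieces together, here is a minimal end-to-end sketch that shows where x_train, x_test and y_train come from; the DataFrame and its column names are assumptions made purely for illustration:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical dataset
df = pd.DataFrame({
    "area": [50, 80, 100, 120, 150, 60, 90, 110],
    "bedrooms": [1, 2, 3, 3, 4, 2, 2, 3],
    "price": [150, 220, 260, 310, 400, 170, 240, 280],
})

x = df[["area", "bedrooms"]]
y = df["price"]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)

regressor = LinearRegression()
regressor.fit(x_train, y_train)

y_pred = regressor.predict(x_test)
print("R^2 on the test data:", r2_score(y_test, y_pred))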

Uses:

This model can be used on various platforms where the data has a linear relationship and the target is continuous in nature, for example:

  • House price prediction.
  • Evaluation of trends
  • Making estimates
  • Forecasts etc.

For further queries, you may connect with me on my LinkedIn or GitHub.
