Experimenting with non-linear regression!

V Venkataramanan
7 min read · Jan 10, 2021


Image source : https://lucidar.me/en/neural-networks/curve-fitting-nonlinear-regression/

I recently got some free time during the New Year 2021 long weekend, and I was wondering what I could do, other than watching movies and sleeping, to spend the time productively. I decided to search the internet for some data related to covid19 and found an interesting dataset. I was also doing a course on Udemy called Machine Learning A-Z : Hands on Python & R in Data Science, so I decided to apply the algorithms I had learnt to the dataset and see some results.

After going through this blog, you will get an idea of how to leverage the scikit-learn library to build various non-linear regression models. The intention of the blog is to shed light on how to apply the models to a dataset, not on the predictions themselves.

The non-linear regression algorithms I used are :

  1. Polynomial Regression
  2. Support Vector Regression
  3. Decision Tree Regression
  4. Random Forest Regression

The dataset

The dataset I chose is from https://ourworldindata.org/coronavirus-source-data. It has a bunch of columns, 54 to be precise. In order to keep things simple, I decided to use the covid19 data from India alone. Also, in order to visualize the results, I kept it to 2 dimensions: I took the number of days from 30/01/2020 to 01/01/2021 as the independent variable and the daily new cases as the dependent variable.

The preprocessing stage

There are a few steps which are common to all the algorithms. This includes importing the required libraries, importing the dataset, and assigning X (independent variable) and y (dependent variable).

Importing libraries :

We need pandas for importing the dataset and matplotlib to plot the data.
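A minimal version of those imports looks like this (numpy is also handy for building the day index later):

```python
import numpy as np                # array handling and the day index
import pandas as pd               # reading the CSV dataset
import matplotlib.pyplot as plt   # plotting the data and the fitted models
```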

Importing the dataset :

We read the csv file and store it in a variable called dataset. We use pandas' .loc accessor to get the covid19 data for India alone. From that, we filter the dataset to get only the values of “new_cases”, which will be the dependent variable. X (independent variable) is a range of values from 0 to 337; y (dependent variable) is the daily new cases.
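A sketch of that step is below. In the post the data comes from the downloaded OWID csv; here a tiny synthetic frame with the same “location” and “new_cases” columns stands in so the snippet runs on its own (the file name in the comment is an assumption):

```python
import numpy as np
import pandas as pd

# In the post this comes from the downloaded OWID file, e.g.:
#   dataset = pd.read_csv('owid-covid-data.csv')
# A tiny synthetic frame with the same columns stands in here.
dataset = pd.DataFrame({
    'location': ['India'] * 5 + ['Brazil'] * 5,
    'new_cases': [1, 3, 7, 12, 20, 2, 4, 6, 8, 10],
})

# Keep India's rows only, then pull out the dependent variable.
india = dataset.loc[dataset['location'] == 'India']
y = india['new_cases'].values

# The independent variable is just the day index 0, 1, 2, ...
# sklearn expects a 2D array of shape (n_samples, n_features).
X = np.arange(len(y)).reshape(-1, 1)
```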

Visualizing the Data

Let’s plot the data we are going to fit using our model. The X axis is the number of days from 30/01/2020 to 01/01/2021, so we get 338 data points. The y axis is the daily new cases for each day. The visualization looks like this :

The X axis is a range of numbers from 0 to 337, 0 being 0 days since 30/01/2020 and 337 being 337 days since 30/01/2020 (which is 01/01/2021). The y axis contains the daily new cases.

It is clear that the data is non-linear.

The code to get the plot is as follows :
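Roughly like this, with X and y built as above (a synthetic bell-shaped series stands in for the real case counts, and the styling is a guess):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless-safe backend; drop this line when running interactively
import matplotlib.pyplot as plt

# Synthetic stand-in for the real 338-day new-cases series.
X = np.arange(338).reshape(-1, 1)
y = 1000.0 * np.exp(-((X.ravel() - 230) / 60.0) ** 2)

plt.scatter(X, y, color='red', s=8)
plt.title('Covid19 new cases - India')
plt.xlabel('Days since 30/01/2020')
plt.ylabel('New cases')
plt.show()
```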

Now let’s try to fit each of the non-linear regression models.

Polynomial Regression

In polynomial regression, the relationship between independent variable x and the dependent variable y is modelled as an nth degree polynomial in x.

Polynomial regression is similar to linear regression, except that the model uses several powers of x (x, x², x³, …) as features, whereas linear regression uses just x.

Linear Regression

Image source : https://contentsimplicity.com/machine-learning-simple-linear-regression/

Polynomial Regression

Image source : https://medium.com/analytics-vidhya/understanding-polynomial-regression-5ac25b970e18

We are going to use PolynomialFeatures from sklearn’s preprocessing module and LinearRegression from sklearn’s linear_model module to build the model. PolynomialFeatures generates the different powers of the independent variable, namely x, x², x³ and so on. We then use LinearRegression to fit a model with these powers of the independent variable as input features.

We use a degree of 7, so PolynomialFeatures expands the X variable into X, X², X³, X⁴, X⁵, X⁶ and X⁷ (plus a constant term) and we save the result in X_poly. After fitting on X_poly, the linear_regressor variable contains our Polynomial Regression model.
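Put together, the fitting step looks roughly like this (the synthetic series again stands in for the real X and y):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the real (X, y) built earlier.
X = np.arange(338).reshape(-1, 1)
y = 1000.0 * np.exp(-((X.ravel() - 230) / 60.0) ** 2)

# Expand X into [1, X, X^2, ..., X^7] ...
poly_features = PolynomialFeatures(degree=7)
X_poly = poly_features.fit_transform(X)

# ... and fit an ordinary linear regression on the expanded features.
linear_regressor = LinearRegression()
linear_regressor.fit(X_poly, y)
```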

In order to visualize the model that has been fit we have to do the following :
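A self-contained sketch of that plotting step (synthetic data; raw points in red, the fitted curve in blue):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless-safe; not needed interactively
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = np.arange(338).reshape(-1, 1)               # synthetic stand-in
y = 1000.0 * np.exp(-((X.ravel() - 230) / 60.0) ** 2)

poly_features = PolynomialFeatures(degree=7)
X_poly = poly_features.fit_transform(X)
linear_regressor = LinearRegression().fit(X_poly, y)

# Raw points in red, the degree-7 curve on top in blue.
plt.scatter(X, y, color='red', s=8)
plt.plot(X, linear_regressor.predict(X_poly), color='blue')
plt.title('Polynomial Regression (degree 7)')
plt.xlabel('Days since 30/01/2020')
plt.ylabel('New cases')
plt.show()
```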

And we get the following result :

Now let’s use the model to predict the new cases 350 days after 30/01/2020, which is 14/01/2021.
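The key detail is that the query day must go through the same PolynomialFeatures transform as the training data (synthetic data again, so the numeric result here is not the article's):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = np.arange(338).reshape(-1, 1)               # synthetic stand-in
y = 1000.0 * np.exp(-((X.ravel() - 230) / 60.0) ** 2)

poly_features = PolynomialFeatures(degree=7)
linear_regressor = LinearRegression().fit(poly_features.fit_transform(X), y)

# Day 350 must be expanded into the same powers before predicting.
day_350 = poly_features.transform([[350]])
prediction = linear_regressor.predict(day_350)[0]
```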

The model predicts there will be around 27000 new cases in India on 14/01/2021.

Note: Various other factors like precautions, vaccinations, spreading pattern etc. weren’t considered while making this prediction. This is just to play around with algorithms and data.

Support Vector Regression

Linear support vector regression tries to fit the best line to the data points within a threshold. The line is called the hyperplane, and the threshold value ε is taken above and below the hyperplane to plot the boundary lines. All data points that fall within these boundary lines have zero error. The data points on either side of the hyperplane that are closest to the boundary lines are called support vectors, and they determine where the boundary lines sit.

In non-linear support vector regression, a kernel function implicitly transforms the data into a higher-dimensional space where a linear fit is possible. We are using the RBF kernel for our problem.

We have to perform feature scaling before trying to fit the model, to bring both variables onto a comparable scale (SVR is sensitive to the scale of its inputs).

We use StandardScaler from sklearn’s preprocessing library to do feature scaling.
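A sketch of the scaling step, with one scaler per variable so each can be inverted independently later (synthetic data stands in for the real series):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.arange(338).reshape(-1, 1)               # synthetic stand-in
y = 1000.0 * np.exp(-((X.ravel() - 230) / 60.0) ** 2)

# StandardScaler standardises to zero mean and unit variance.
sc_X = StandardScaler()
sc_y = StandardScaler()
X_scaled = sc_X.fit_transform(X)
y_scaled = sc_y.fit_transform(y.reshape(-1, 1)).ravel()
```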

Now we are ready to fit the SVR model on our scaled dataset.
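The fitting step, end to end on the scaled synthetic data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X = np.arange(338).reshape(-1, 1)               # synthetic stand-in
y = 1000.0 * np.exp(-((X.ravel() - 230) / 60.0) ** 2)

sc_X, sc_y = StandardScaler(), StandardScaler()
X_scaled = sc_X.fit_transform(X)
y_scaled = sc_y.fit_transform(y.reshape(-1, 1)).ravel()

# The RBF kernel lets the SVR capture the non-linear shape of the curve.
regressor = SVR(kernel='rbf')
regressor.fit(X_scaled, y_scaled)
```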

In order to visualize the model that has been fit we have to do the following :
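A sketch of that step. Because the SVR was trained on scaled values, its predictions must be inverse-transformed back to case counts before plotting:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless-safe; not needed interactively
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X = np.arange(338).reshape(-1, 1)               # synthetic stand-in
y = 1000.0 * np.exp(-((X.ravel() - 230) / 60.0) ** 2)

sc_X, sc_y = StandardScaler(), StandardScaler()
X_scaled = sc_X.fit_transform(X)
y_scaled = sc_y.fit_transform(y.reshape(-1, 1)).ravel()
regressor = SVR(kernel='rbf').fit(X_scaled, y_scaled)

# Predictions come out in scaled units; invert them for plotting.
y_pred = sc_y.inverse_transform(
    regressor.predict(X_scaled).reshape(-1, 1)).ravel()

plt.scatter(X, y, color='red', s=8)
plt.plot(X, y_pred, color='blue')
plt.title('Support Vector Regression (RBF kernel)')
plt.xlabel('Days since 30/01/2020')
plt.ylabel('New cases')
plt.show()
```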

And we get the following result :

Now let’s use the model to predict the new cases 350 days after 30/01/2020, which is 14/01/2021.
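The query day has to be scaled with sc_X before predicting, and the scaled prediction inverted with sc_y (synthetic data, so the numeric result is not the article's):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X = np.arange(338).reshape(-1, 1)               # synthetic stand-in
y = 1000.0 * np.exp(-((X.ravel() - 230) / 60.0) ** 2)

sc_X, sc_y = StandardScaler(), StandardScaler()
X_scaled = sc_X.fit_transform(X)
y_scaled = sc_y.fit_transform(y.reshape(-1, 1)).ravel()
regressor = SVR(kernel='rbf').fit(X_scaled, y_scaled)

# Scale day 350 like the training days, predict, then unscale the answer.
day_350 = sc_X.transform([[350]])
prediction = sc_y.inverse_transform(
    regressor.predict(day_350).reshape(-1, 1))[0, 0]
```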

The model predicts there will be around 25000 new cases in India on 14/01/2021.

Note: Various other factors like precautions, vaccinations, spreading pattern etc. weren’t considered while making this prediction. This is just to play around with algorithms and data.

Decision Tree Regression

The decision tree regression algorithm breaks the data points down into smaller and smaller subsets based on true/false questions. The end result is a tree structure with decision nodes and leaf nodes.

We are going to use the DecisionTreeRegressor from sklearn’s tree library.
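The fitting step is short; no feature scaling is needed for trees. The random_state value is an arbitrary choice to make the tree reproducible:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.arange(338).reshape(-1, 1)               # synthetic stand-in
y = 1000.0 * np.exp(-((X.ravel() - 230) / 60.0) ** 2)

# random_state pins the tie-breaking so results are reproducible.
regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(X, y)
```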

In order to visualize the model that has been fit we have to do the following :
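A sketch of that step. Predicting on a grid much finer than the training days makes the tree's step-wise shape visible (the 0.1 step is a common choice, not the author's confirmed code):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless-safe; not needed interactively
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor

X = np.arange(338).reshape(-1, 1)               # synthetic stand-in
y = 1000.0 * np.exp(-((X.ravel() - 230) / 60.0) ** 2)
regressor = DecisionTreeRegressor(random_state=0).fit(X, y)

# A fine grid shows the piecewise-constant predictions between the days.
X_grid = np.arange(X.min(), X.max(), 0.1).reshape(-1, 1)
plt.scatter(X, y, color='red', s=8)
plt.plot(X_grid, regressor.predict(X_grid), color='blue')
plt.title('Decision Tree Regression')
plt.xlabel('Days since 30/01/2020')
plt.ylabel('New cases')
plt.show()
```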

And we get the following result :

As we can see, decision tree regression is not well adapted to this two-dimensional setting: with a single feature it effectively memorises every training point, which leads to overfitting. But it is one of the most commonly used regression algorithms for higher-dimensional data.

Now let’s use the model to predict the new cases 350 days after 30/01/2020, which is 14/01/2021.
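One caveat worth knowing: a tree cannot extrapolate, so any day beyond the training range falls into the last leaf and gets that leaf's constant value (synthetic data, so the number is not the article's):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.arange(338).reshape(-1, 1)               # synthetic stand-in
y = 1000.0 * np.exp(-((X.ravel() - 230) / 60.0) ** 2)
regressor = DecisionTreeRegressor(random_state=0).fit(X, y)

# Every day past the last training day lands in the same leaf,
# so the tree predicts the same constant for all of them.
prediction = regressor.predict([[350]])[0]
```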

The model predicts there will be around 20000 new cases in India on 14/01/2021.

Note: Various other factors like precautions, vaccinations, spreading pattern etc. weren’t considered while making this prediction. This is just to play around with algorithms and data.

Random Forest Regression

Random forest regression is an extension of decision tree regression, where it constructs a multitude of decision trees and gives the average of the decision tree results as the overall result. It is an example of ensemble learning.

We are going to use the RandomForestRegressor from sklearn’s ensemble library.

Here, the n_estimators parameter refers to the number of decision trees used by the algorithm.
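A sketch of the fitting step; the n_estimators=10 and random_state values here are arbitrary choices, not the author's confirmed ones:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.arange(338).reshape(-1, 1)               # synthetic stand-in
y = 1000.0 * np.exp(-((X.ravel() - 230) / 60.0) ** 2)

# n_estimators is the number of trees whose predictions get averaged.
regressor = RandomForestRegressor(n_estimators=10, random_state=0)
regressor.fit(X, y)
```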

In order to visualize the model that has been fit we have to do the following :
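The plotting code follows the same fine-grid pattern as for the decision tree, since the forest's averaged prediction is also piecewise constant:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless-safe; not needed interactively
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor

X = np.arange(338).reshape(-1, 1)               # synthetic stand-in
y = 1000.0 * np.exp(-((X.ravel() - 230) / 60.0) ** 2)
regressor = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Fine grid to expose the step-wise averaged predictions.
X_grid = np.arange(X.min(), X.max(), 0.1).reshape(-1, 1)
plt.scatter(X, y, color='red', s=8)
plt.plot(X_grid, regressor.predict(X_grid), color='blue')
plt.title('Random Forest Regression')
plt.xlabel('Days since 30/01/2020')
plt.ylabel('New cases')
plt.show()
```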

And we get the following result :

Again, random forest regression is not suited for 2D data, for the same reason as decision tree regression.

But the steps to apply decision tree and random forest regression to higher-dimensional data are the same.

Now let’s use the model to predict the new cases 350 days after 30/01/2020, which is 14/01/2021.
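The prediction call mirrors the decision tree's, and the same extrapolation caveat applies (synthetic data, so the number is not the article's):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.arange(338).reshape(-1, 1)               # synthetic stand-in
y = 1000.0 * np.exp(-((X.ravel() - 230) / 60.0) ** 2)
regressor = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Average of the 10 trees' predictions for day 350.
prediction = regressor.predict([[350]])[0]
```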

The model predicts there will be around 20700 new cases in India on 14/01/2021.

Note: Various other factors like precautions, vaccinations, spreading pattern etc. weren’t considered while making this prediction. This is just to play around with algorithms and data.

Let’s see which model’s prediction comes close! :P

This blog post is not intended to compare the various regression models. Each algorithm is suited to its own domain and use case. It is just an effort from an aspiring data scientist to explore and learn! Hope it helps others! :-)

scikit-learn is one of the most popular machine learning libraries. It provides many utilities that make a data scientist’s job much easier. Here’s the link to the official scikit-learn website.

https://scikit-learn.org/stable/

Happy learning! :-)
