Car Price Prediction using Lasso and Linear Regression in Python

In this tutorial, we will build a car price prediction model using two types of regression: Linear Regression and Lasso Regression. We will use the scikit-learn Python library.

After training the models, we will plot the actual car prices against the prices predicted by each machine learning model to see how well they perform.

You can download the dataset from here: car data.

So let’s implement the code step by step.

Import the necessary Python libraries, including scikit-learn.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
#import linear regression model
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn import metrics #to compute the R squared score

The first step is Data collection.

1. Load dataset.

#Data collection and processing
#load the data from the csv file into a pandas DataFrame
car_dataset = pd.read_csv("C:\\Users\\users drive\\OneDrive\\Desktop\\car data.csv")

To check whether the dataset is loaded correctly, print the first five rows using the head method.

#lets check the first five rows of the data frame
car_dataset.head()

OUTPUT

Now print the last five rows by using the tail method.

#lets check the last five rows of the data frame
car_dataset.tail()

OUTPUT

There are 9 columns in the dataset.

To check how many rows and columns are present in the dataset, use the shape attribute.

car_dataset.shape

OUTPUT

(301, 9)
#301 rows and 9 columns

Now let’s check the basic information about the dataset.

#some basic information about the dataset
car_dataset.info()

OUTPUT

From the above output, we can see that there are no null values in the dataset.

Let’s confirm this by counting the missing values in each column.

#To check the missing values in the given dataset
car_dataset.isnull().sum()

OUTPUT

Let’s check the distribution of categorical data.

#checking the distribution of the categorical data
print(car_dataset.Fuel_Type.value_counts())
print(car_dataset.Seller_Type.value_counts())
print(car_dataset.Transmission.value_counts())

OUTPUT

We need to encode the categorical data because:
1. Machine learning models cannot work with text directly, so in the Fuel_Type column we will change Petrol to 0, Diesel to 1, and CNG to 2.
2. We will do the same for the Seller_Type and Transmission columns.
3. Numeric values are much easier for the model to process.

Encoding the categorical data.

#Encoding the categorical data
#lets encode the fuel type column
car_dataset.replace({'Fuel_Type':{'Petrol':0,'Diesel':1,'CNG':2}},inplace=True)
#lets encode the seller type column
car_dataset.replace({'Seller_Type':{'Dealer':0,'Individual':1}},inplace=True)
#lets encode the Transmission column
car_dataset.replace({'Transmission':{'Manual':0,'Automatic':1}},inplace=True)

After encoding, print the DataFrame again to check whether the text values have been converted to numbers.

#lets check the first five rows to confirm the encoding is done
car_dataset.head()

OUTPUT

The 3 categorical columns have been converted to numeric codes using 0, 1 and 2.
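Another quick way to confirm the encoding (assuming the same DataFrame as above) is to re-check the value counts on the encoded columns; the counts should match the ones printed before encoding, only the labels are now numbers.

#optional check: counts are unchanged, only the labels are now 0, 1 and 2
print(car_dataset.Fuel_Type.value_counts())
print(car_dataset.Seller_Type.value_counts())
print(car_dataset.Transmission.value_counts())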

Next, we separate the features and the target before splitting the dataset into training data and test data.

Keep the selling price in Y and all the remaining columns in X.

X=car_dataset.drop(['Car_Name','Selling_Price'],axis=1)#this keeps all the columns except the car name and selling price
Y=car_dataset['Selling_Price']#selling price will be stored in y
print(X)

OUTPUT

print(Y)

OUTPUT

The next step is to split the data into training and test sets.
X is split into X_train and X_test, and Y is split into Y_train and Y_test, with 10 percent of the rows kept aside for testing.

X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.1,random_state=2)#random_state makes the split reproducible
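As a quick sanity check (this step is optional and not part of the original script), we can print the shapes to confirm that roughly 10 percent of the 301 rows ended up in the test set:

#optional: confirm the 90/10 split
print(X.shape, X_train.shape, X_test.shape)
#expected: (301, 7) for X, with roughly 270 rows in X_train and 31 rows in X_test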

MODEL TRAINING

1. Linear Regression

line_regr_model=LinearRegression()
line_regr_model.fit(X_train,Y_train)

OUTPUT

LinearRegression()
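Once the model is fitted, we can inspect what it has learned. The following optional sketch (not part of the original flow) prints one learned coefficient per feature plus the intercept:

#one learned coefficient per feature column, plus the intercept term
for feature, coef in zip(X_train.columns, line_regr_model.coef_):
    print(feature, coef)
print('intercept:', line_regr_model.intercept_)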

Model Evaluation:-

#Prediction on training data
training_data_prediction=line_regr_model.predict(X_train)

Error checking:-

#R squared error
error_score=metrics.r2_score(Y_train,training_data_prediction)
print('R squared error : ',error_score)

OUTPUT

R squared error :  0.8799451660493695
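Strictly speaking, r2_score returns the R squared score (coefficient of determination) rather than an error: 1.0 means a perfect fit and lower values mean a worse fit. As an illustration, here is a minimal sketch of what it computes, using NumPy (the numpy import is an addition and is not part of the original script):

import numpy as np
#R squared = 1 - (residual sum of squares / total sum of squares)
y_true = np.array(Y_train)
ss_res = np.sum((y_true - training_data_prediction) ** 2)   #residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)              #total sum of squares
print(1 - ss_res / ss_tot)  #should match metrics.r2_score(Y_train, training_data_prediction)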

Graph Representation:-

#Visualize the actual prices and predicted prices
plt.scatter(Y_train,training_data_prediction)
plt.xlabel("Actual Price")
plt.ylabel("Predicted Prices")
plt.title("Actual prices vs Predicted Prices")
plt.show()

OUTPUT

From the above graph, the predicted prices are close to the actual prices for most cars, although the gap widens slightly as the price values increase.

Now let’s compare with the test data, which contains 10 percent of the dataset.

#prediction of test data
test_data_prediction=line_regr_model.predict(X_test)
#R squared error
error_score=metrics.r2_score(Y_test,test_data_prediction)
print('R squared error : ',error_score)

OUTPUT

R squared error :  0.8365766715026903

Graph representation:-

plt.scatter(Y_test,test_data_prediction)
plt.xlabel("Actual Price")
plt.ylabel("Predicted Prices")
plt.title("Actual prices vs Predicted Prices")
plt.show()

OUTPUT

There is a visible difference between the training and test graphs because the test set contains only 10 percent of the data.

That covers the linear regression model.

The second model is the Lasso regression model.

Lasso regression often performs better because its regularization shrinks the coefficients of less useful features, sometimes all the way to zero.

Plain linear regression performs well when the features are directly correlated with the target.
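The strength of that shrinkage is controlled by Lasso's alpha parameter (the scikit-learn default is 1.0). The following optional sketch, reusing the training and test split from above with a few illustrative (untuned) alpha values, shows how larger alpha values push more coefficients to exactly zero:

#try a few regularization strengths and see how many coefficients become zero
for alpha in (0.1, 1.0, 10.0):
    model = Lasso(alpha=alpha)
    model.fit(X_train, Y_train)
    zeroed = sum(coef == 0 for coef in model.coef_)
    print('alpha =', alpha, '| test R squared =', model.score(X_test, Y_test), '| zero coefficients =', zeroed)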

2. Lasso Regression

las_regr_model=Lasso()
las_regr_model.fit(X_train,Y_train)

OUTPUT

Lasso()

Model Evaluation:-

training_data_prediction=las_regr_model.predict(X_train)

Error checking:-

#R squared error
error_score=metrics.r2_score(Y_train,training_data_prediction)
print('R squared error : ',error_score)

OUTPUT

R squared error :  0.8427856123435794

Graph Visualization (training data):-

plt.scatter(Y_train,training_data_prediction)
plt.xlabel("Actual Price")
plt.ylabel("Predicted Prices")
plt.title("Actual prices vs Predicted Prices")
plt.show()

OUTPUT

Test data prediction:-

test_data_prediction=las_regr_model.predict(X_test)

Error checking:-

error_score=metrics.r2_score(Y_test,test_data_prediction)
print('R squared error : ',error_score)

OUTPUT

R squared error :  0.8709167941173195

Graph Visualization:-

plt.scatter(Y_test,test_data_prediction)
plt.xlabel("Actual Price")
plt.ylabel("Predicted Prices")
plt.title("Actual prices vs Predicted Prices")
plt.show()

OUTPUT

In the case of the test dataset, the predicted values are now nearer to the actual values, which matches the slightly higher test R squared score of the Lasso model.

These are the main differences between the two models we are using.
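To put the two models side by side, here is a short optional sketch (reusing the fitted models and the test split from above) that prints both test R squared scores:

#compare both fitted models on the same test data
for name, model in (('Linear Regression', line_regr_model), ('Lasso Regression', las_regr_model)):
    test_pred = model.predict(X_test)
    print(name, 'test R squared:', metrics.r2_score(Y_test, test_pred))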
