Car Price Prediction using Lasso and Linear Regression in Python
In this tutorial, we will predict car prices using two types of regression, Linear Regression and Lasso Regression, with the scikit-learn Python library.
For each model, we will plot the actual prices against the predicted prices to see how closely they match.
You can download the dataset here: car data.
Let's implement the code.
Import the necessary Python libraries including the popular scikit-learn.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
# import the regression models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn import metrics  # to compute the R squared score
The first step is Data collection.
1. Load dataset.
# Data collection and processing
# Load the data from the CSV file into a pandas DataFrame
car_dataset = pd.read_csv("C:\\Users\\users drive\\OneDrive\\Desktop\\car data.csv")
To check that the dataset loaded correctly, print the first five rows using the head method.
# check the first five rows of the data frame
car_dataset.head()
OUTPUT:-
Now print the last five rows by using the tail method.
# check the last five rows of the data frame
car_dataset.tail()
OUTPUT:-
There are 9 columns in the data frame.
Check how many rows and columns the dataset has.
car_dataset.shape
OUTPUT
(301, 9) #301 rows and 9 columns
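The loading and inspection steps above can be tried without the CSV file. The sketch below reads a few made-up rows from an in-memory string. The rows and the `Year` column are illustrative assumptions, not values taken from the real dataset.

```python
import io
import pandas as pd

# Hypothetical sample rows. Only the column names Car_Name, Selling_Price,
# Fuel_Type, Seller_Type and Transmission appear in this tutorial;
# Year and all values are assumptions for illustration.
csv_text = """Car_Name,Year,Selling_Price,Fuel_Type,Seller_Type,Transmission
car_a,2014,3.35,Petrol,Dealer,Manual
car_b,2013,4.75,Diesel,Dealer,Manual
car_c,2017,7.25,Petrol,Individual,Automatic
"""

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.head())   # first rows (all three here)
print(sample_df.tail(2))  # last two rows
print(sample_df.shape)    # (number of rows, number of columns)
```

With the real file, the same calls apply to `car_dataset` once the path passed to `pd.read_csv` points to your copy of the CSV.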
Now let’s check the basic information for the data.
# some basic information about the dataset
car_dataset.info()
OUTPUT
The output above shows that the dataset has no null values.
Let's confirm this by counting the missing values in each column.
# count the missing values in each column
car_dataset.isnull().sum()
OUTPUT
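To see what the missing-value check reports when a value really is absent, here is a minimal sketch on a made-up frame. The values are hypothetical, and dropping incomplete rows is only one common option, not a step this tutorial needs.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing selling price
toy = pd.DataFrame({
    "Selling_Price": [3.35, np.nan, 7.25],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol"],
})

# isnull() marks each missing cell as True; sum() counts them per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# One possible way to handle missing rows (an assumption, not part of the tutorial)
cleaned = toy.dropna()
print(cleaned.shape)
```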
Let’s check the distribution of categorical data.
# check the distribution of the categorical data
print(car_dataset.Fuel_Type.value_counts())
print(car_dataset.Seller_Type.value_counts())
print(car_dataset.Transmission.value_counts())
OUTPUT
Encode the data because:
1. The model cannot work with text, so in the fuel type column we change Petrol to 0, Diesel to 1, and CNG to 2.
2. The same applies to the seller type and transmission columns.
3. Numeric values are easy for the machine to process.
Encoding the categorical data.
# Encoding the categorical data
# encode the fuel type column
car_dataset.replace({'Fuel_Type': {'Petrol': 0, 'Diesel': 1, 'CNG': 2}}, inplace=True)
# encode the seller type column
car_dataset.replace({'Seller_Type': {'Dealer': 0, 'Individual': 1}}, inplace=True)
# encode the transmission column
car_dataset.replace({'Transmission': {'Manual': 0, 'Automatic': 1}}, inplace=True)
After encoding, print the data frame to check that the three columns now contain numbers.
# check the first five rows to confirm that the encoding was applied
car_dataset.head()
OUTPUT
The 3 columns now hold numeric codes (0, 1, and 2) instead of text.
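The encoding step can also be sketched with `Series.map`, which is a swapped-in alternative to the `replace` calls used in this tutorial. The toy rows below are made up; the mappings are the same ones listed above.

```python
import pandas as pd

# Hypothetical rows covering every category
toy = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Diesel", "CNG", "Petrol"],
    "Seller_Type": ["Dealer", "Individual", "Dealer", "Dealer"],
    "Transmission": ["Manual", "Automatic", "Manual", "Manual"],
})

# Same label-to-number mappings as in the tutorial
encodings = {
    "Fuel_Type": {"Petrol": 0, "Diesel": 1, "CNG": 2},
    "Seller_Type": {"Dealer": 0, "Individual": 1},
    "Transmission": {"Manual": 0, "Automatic": 1},
}

encoded = toy.copy()
for column, mapping in encodings.items():
    # map() returns NaN for any label missing from the mapping,
    # so unexpected categories are easy to spot afterwards
    encoded[column] = toy[column].map(mapping)

print(encoded)
print(encoded.isnull().sum())  # all zeros means every label was encoded
```

Note that integer codes imply an order (Petrol < Diesel < CNG) that the categories do not really have. That is a known limitation of this kind of encoding, and whether it matters depends on the model.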
Splitting our original dataset into training data and test data.
Keep the selling price in Y and all the remaining columns, except the car name, in X.
X = car_dataset.drop(['Car_Name', 'Selling_Price'], axis=1)  # all columns except the car name and selling price
Y = car_dataset['Selling_Price']  # the selling price is the target
print(X)
OUTPUT
print(Y)
OUTPUT
The next step is to split the data into training and test sets.
X_train and Y_train hold the features and selling prices used to fit the model.
X_test and Y_test hold the remaining 10 percent, which is kept aside to evaluate the model.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=2)  # random_state makes the split reproducible
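To make the effect of `test_size` and `random_state` concrete, the sketch below splits placeholder arrays with the same number of rows as the dataset (301). The array contents are arbitrary stand-ins, not the car data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and target with 301 rows (values are arbitrary)
X_demo = np.arange(301 * 2).reshape(301, 2)
y_demo = np.arange(301)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=2
)
print(X_tr.shape, X_te.shape)  # roughly 90 percent and 10 percent of the rows

# Repeating the call with the same random_state gives the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=2
)
print(np.array_equal(X_te, X_te2))
```

Changing `random_state` gives a different split, and with a test set this small the scores can vary noticeably from one split to another.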
MODEL TRAINING
1. Linear Regression
line_regr_model=LinearRegression()
line_regr_model.fit(X_train,Y_train)
OUTPUT
LinearRegression()
Model Evaluation:-
# Prediction on training data
training_data_prediction = line_regr_model.predict(X_train)
Error checking:-
# R squared score (closer to 1 means a better fit)
error_score = metrics.r2_score(Y_train, training_data_prediction)
print('R squared error : ', error_score)
OUTPUT
R squared error : 0.8799451660493695
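The value printed here is the R squared score, defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. As a small worked example on made-up numbers (not the car data), the sketch below computes it by hand and compares it with `metrics.r2_score`.

```python
import numpy as np
from sklearn import metrics

# Made-up actual and predicted values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Residual sum of squares: how far the predictions are from the actual values
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: how far the actual values are from their mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
r2_sklearn = metrics.r2_score(y_true, y_pred)

print(r2_manual, r2_sklearn)
```

A score of 1 means perfect predictions, and a score of 0 means the model does no better than always predicting the mean.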
Graph Representation:-
# Visualize the actual prices and predicted prices
plt.scatter(Y_train, training_data_prediction)
plt.xlabel("Actual Price")
plt.ylabel("Predicted Prices")
plt.title("Actual prices vs Predicted Prices")
plt.show()
OUTPUT
In the graph above, the predicted values are close to the actual values, although the gap widens as the price increases.
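One way to make such a plot easier to read is to add a diagonal reference line where the predicted price equals the actual price. The helper below is a hypothetical sketch, not part of the original code, and it uses made-up values.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs in scripts
import matplotlib.pyplot as plt
import numpy as np

def plot_actual_vs_predicted(actual, predicted):
    """Scatter plot with a y = x reference line (hypothetical helper)."""
    fig, ax = plt.subplots()
    ax.scatter(actual, predicted)
    # Points on this dashed line would be perfect predictions
    low = min(np.min(actual), np.min(predicted))
    high = max(np.max(actual), np.max(predicted))
    ax.plot([low, high], [low, high], linestyle="--", color="gray")
    ax.set_xlabel("Actual Price")
    ax.set_ylabel("Predicted Prices")
    ax.set_title("Actual prices vs Predicted Prices")
    return ax

# Made-up values for illustration
actual = np.array([1.0, 3.5, 7.0, 12.0])
predicted = np.array([1.2, 3.1, 7.6, 10.5])
ax = plot_actual_vs_predicted(actual, predicted)
```

In a notebook, drop the `matplotlib.use("Agg")` line and call `plt.show()` after the helper, passing `Y_train` and `training_data_prediction` instead of the made-up arrays.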
Now compare with the test data, which holds 10 percent of the dataset.
# prediction on test data
test_data_prediction = line_regr_model.predict(X_test)
# R squared score on the test data
error_score = metrics.r2_score(Y_test, test_data_prediction)
print('R squared error : ', error_score)
OUTPUT
R squared error : 0.8365766715026903
Graph representation:-
plt.scatter(Y_test, test_data_prediction)
plt.xlabel("Actual Price")
plt.ylabel("Predicted Prices")
plt.title("Actual prices vs Predicted Prices")
plt.show()
OUTPUT
The test graph looks much sparser than the training graph because the test set holds only 10 percent of the data.
That covers the linear regression model.
The other model is the Lasso regression model.
Lasso adds an L1 penalty that shrinks the coefficients, which can help when some features contribute little or are correlated with each other.
Linear regression performs well when the target is directly (linearly) correlated with the features.
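The difference between the two models can be illustrated on synthetic data. In the sketch below, only the first two of five features influence the target; the data, the coefficients, and the noise level are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 5))
# Only the first two features matter; the other three are pure noise
y_syn = 3.0 * X_syn[:, 0] + 0.5 * X_syn[:, 1] + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X_syn, y_syn)
# alpha=1.0 is the default penalty strength, the same as calling Lasso()
lasso = Lasso(alpha=1.0).fit(X_syn, y_syn)

print("Linear coefficients:", np.round(ols.coef_, 3))
print("Lasso coefficients: ", np.round(lasso.coef_, 3))
```

Lasso pulls every coefficient toward zero and sets the weakest ones to exactly zero, while linear regression keeps a small nonzero weight on the noise features. A larger `alpha` shrinks more; which setting works best depends on the data.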
2. Lasso Regression
las_regr_model=Lasso()
las_regr_model.fit(X_train,Y_train)
OUTPUT
Lasso()
Model Evaluation:-
training_data_prediction=las_regr_model.predict(X_train)
Error checking:-
# R squared score on the training data
error_score = metrics.r2_score(Y_train, training_data_prediction)
print('R squared error : ', error_score)
OUTPUT
R squared error : 0.8427856123435794
Graph Visualization (training data):-
plt.scatter(Y_train, training_data_prediction)
plt.xlabel("Actual Price")
plt.ylabel("Predicted Prices")
plt.title("Actual prices vs Predicted Prices")
plt.show()
OUTPUT
Test data prediction:-
test_data_prediction=las_regr_model.predict(X_test)
Error checking:-
error_score = metrics.r2_score(Y_test, test_data_prediction)
print('R squared error : ', error_score)
OUTPUT
R squared error : 0.8709167941173195
Graph Visualization:-
plt.scatter(Y_test, test_data_prediction)
plt.xlabel("Actual Price")
plt.ylabel("Predicted Prices")
plt.title("Actual prices vs Predicted Prices")
plt.show()
OUTPUT
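In the test graph, the predicted values are not as close to the actual values.
The main difference between the two models is that linear regression scores higher on the training data (0.880 against 0.843), while Lasso scores higher on the test data (0.871 against 0.837).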
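To put the four scores side by side, the short sketch below uses the R squared values reported in the outputs of this tutorial and computes the gap between training and test score for each model. The dictionary layout is just one convenient way to organize them.

```python
# R squared values as reported in this tutorial's outputs
scores = {
    "Linear Regression": {"train": 0.8799451660493695, "test": 0.8365766715026903},
    "Lasso Regression": {"train": 0.8427856123435794, "test": 0.8709167941173195},
}

gaps = {}
for name, s in scores.items():
    # A positive gap means the model scored higher on training than on test data
    gaps[name] = s["train"] - s["test"]
    print(f"{name}: train={s['train']:.3f} test={s['test']:.3f} gap={gaps[name]:+.3f}")

# Model with the highest score on the held-out test data
best_on_test = max(scores, key=lambda name: scores[name]["test"])
print("Best on test data:", best_on_test)
```

Since the test set is small, these differences come from a single split and may change with another `random_state`.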