House Price Prediction using Linear, Lasso and Ridge Regression in Python
In this tutorial, we will discuss house price prediction for a major city like Bengaluru using Linear, Lasso and Ridge regression with the help of Python programming.
You can read up on the details of Linear, Lasso and Ridge regression before starting.
The dataset contains the attributes a buyer would check when looking for their dream house, such as location, size, total square footage, number of bathrooms and price.
You can download the dataset from here Bengaluru_House_Data.
So let’s try to implement the code
Import the necessary Python libraries.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score
The first step is:
Data Collection
1. Load data
data=pd.read_csv("C:\\Users\\usersdrive\\OneDrive\\Desktop\\Bengaluru_House_Data.csv")
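Note the doubled backslashes needed to escape the Windows path; a raw string or forward slashes are equivalent ways to write the same path:
# Equivalent ways to point at the same file:
# data = pd.read_csv(r"C:\Users\usersdrive\OneDrive\Desktop\Bengaluru_House_Data.csv")
# data = pd.read_csv("C:/Users/usersdrive/OneDrive/Desktop/Bengaluru_House_Data.csv")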
To check whether the dataset loaded correctly, print the first five rows using the head method.
data.head()
OUTPUT:-
Now print the last five rows by using the tail method.
data.tail()
OUTPUT:-
To check the dataset size.
data.shape
OUTPUT:-
(13320, 9)
Let's check the basic information about the dataset.
data.info()
OUTPUT:-
From the above output, only three columns (bath, balcony and price) are floats; note that total_sqft is stored as text, which we will fix later.
To check the value counts of every column:
for column in data.columns:
    print(data[column].value_counts())
    print("*" * 20)
OUTPUT:-
To check whether any null values are present in the dataset:
data.isnull().sum()
OUTPUT:-
There are 5502 null values in the society column and 609 in the balcony column, so those columns (along with area_type and availability, which we won't use) need to be dropped; otherwise the algorithm can't predict correctly.
To drop the columns:
data.drop(columns=['area_type','availability','society','balcony'],inplace=True)
To check whether the columns were dropped:
data.info()
OUTPUT:-
The remaining null values can be filled with suitable statistics: the most frequent value for the categorical columns and the median for bath.
First, check the summary statistics of the numeric columns.
data.describe()
OUTPUT:-
Now fill in the missing values one by one.
Start with the location column.
data['location'].value_counts()
OUTPUT:-
Fill the missing locations with the most frequent value, 'Sarjapur Road'.
data['location']=data['location'].fillna('Sarjapur Road')
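If you would rather not hard-code the value, a small sketch like this derives the most frequent location automatically (same result on this dataset):
# Derive the most frequent location instead of hard-coding it
most_common_location = data['location'].mode()[0]  # 'Sarjapur Road' here
data['location'] = data['location'].fillna(most_common_location)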
To check the values in the size column:
data['size'].value_counts()
OUTPUT:-
Now fill the missing values in size with '2 BHK', the most common value.
data['size']=data['size'].fillna('2 BHK')
There are 73 missing values in bath; fill them with the median.
data['bath']=data['bath'].fillna(data['bath'].median())
To check whether the values were updated:
data.info()
OUTPUT:-
Next, extract the number of bedrooms from the size column into a new numeric bhk column.
data['bhk']=data['size'].str.split().str.get(0).astype(int)
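As a quick illustration of how the extraction works on a single value (illustrative input):
# '2 BHK' -> ['2', 'BHK'] -> '2' -> 2
print('2 BHK'.split())          # ['2', 'BHK']
print(int('2 BHK'.split()[0]))  # 2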
Print the rows where bhk is greater than 20; these are outliers in the data.
data[data.bhk > 20]
OUTPUT:-
To check the square feet values.
data['total_sqft'].unique()
OUTPUT:-
From the above output, some total_sqft entries are ranges written with a hyphen (e.g. '2100 - 2850'); for those we add the two endpoints and divide by two.
def convertRange(x):
    temp = x.split('-')
    if len(temp) == 2:
        # a range like '2100 - 2850': return the midpoint
        return (float(temp[0]) + float(temp[1])) / 2
    try:
        return float(x)
    except:
        # unparseable entries (e.g. other units) become None
        return None
data['total_sqft']=data['total_sqft'].apply(convertRange)
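As a quick sanity check, the helper handles all three kinds of input (illustrative values):
print(convertRange('2100 - 2850'))     # 2475.0, the midpoint of the range
print(convertRange('1056'))            # 1056.0, a plain number
print(convertRange('34.46Sq. Meter'))  # None, an unparseable entry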
data.head()
OUTPUT:-
Now create a new column, price_per_sqft; it will help with removing outliers. Since the price column is in lakhs, multiply by 100000 to get rupees.
data['price_per_sqft']=data['price']*100000/data['total_sqft']
data['price_per_sqft']
OUTPUT:-
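As a worked example of the formula (illustrative figures): a flat priced at 39.07 lakh with 1056 sq. ft. comes out to roughly 3700 rupees per square foot.
# price is in lakhs, so multiply by 100000 to get rupees
print(39.07 * 100000 / 1056)  # ~3699.8 rupees per square foot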
data.describe()
OUTPUT:-
The minimum price per square foot is about 260 rupees, while the maximum is several orders of magnitude larger; such extreme values are outliers that we will deal with shortly.
To check the location values.
data['location'].value_counts()
OUTPUT:-
From the above output there are about 1,306 distinct location values; one-hot encoding that many categories would produce far too many dummy columns, so we need to reduce them.
To reduce them, first strip stray whitespace with a lambda function and recount:
data['location']=data['location'].apply(lambda x: x.strip())
location_count=data['location'].value_counts()
location_count
OUTPUT:-
Show the locations that appear 10 times or fewer.
location_count_less_10=location_count[location_count<=10]
location_count_less_10
OUTPUT:-
If a location appears 10 times or fewer, replace it with 'other'; otherwise keep the location as it is.
data['location']=data['location'].apply(lambda x: 'other' if x in location_count_less_10 else x)
data['location'].value_counts()
OUTPUT:-
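A quick check that the grouping worked (a small sketch using the column above):
# Far fewer categories than the original ~1300 locations
print(data['location'].nunique())
# Number of rows that were bucketed into 'other'
print((data['location'] == 'other').sum())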
OUTLIER DETECTION AND REMOVAL
data.describe()
OUTPUT:-
(data['total_sqft']/data['bhk']).describe()
OUTPUT:-
Keep only the rows where the area per bedroom (total_sqft/bhk) is at least 300 square feet; anything smaller is unrealistically cramped and treated as an outlier.
data=data[((data['total_sqft']/data['bhk'])>=300)]
data.describe()  # the minimum has changed
OUTPUT:-
To check the data shape.
data.shape
OUTPUT:-
(12530, 7)
data.price_per_sqft.describe()
OUTPUT:-
To remove price outliers, define a function that, for each location, keeps only the rows whose price_per_sqft lies within one standard deviation of that location's mean.
def remove_outliers_sqft(df):
    df_output = pd.DataFrame()
    for key, subdf in df.groupby('location'):
        m = np.mean(subdf.price_per_sqft)
        st = np.std(subdf.price_per_sqft)
        # keep rows within one standard deviation of the location mean
        gen_df = subdf[(subdf.price_per_sqft > (m - st)) & (subdf.price_per_sqft <= (m + st))]
        df_output = pd.concat([df_output, gen_df], ignore_index=True)
    return df_output

data = remove_outliers_sqft(data)
data.describe()
OUTPUT:-
The maximum value has come down considerably. Next, remove BHK-level outliers: within each location, drop flats whose price per square foot is below the mean price of flats with one fewer bedroom in the same location (computed only when there are more than five such flats to compare against).
def bhk_outlier_remover(df):
    exclude_indices = np.array([])
    for location, location_df in df.groupby('location'):
        bhk_stats = {}
        # per-location statistics for each BHK size
        for bhk, bhk_df in location_df.groupby('bhk'):
            bhk_stats[bhk] = {
                'mean': np.mean(bhk_df.price_per_sqft),
                'std': np.std(bhk_df.price_per_sqft),
                'count': bhk_df.shape[0]
            }
        # drop flats cheaper per sqft than the mean of flats with one fewer bedroom
        for bhk, bhk_df in location_df.groupby('bhk'):
            stats = bhk_stats.get(bhk - 1)
            if stats and stats['count'] > 5:
                exclude_indices = np.append(exclude_indices, bhk_df[bhk_df.price_per_sqft < (stats['mean'])].index.values)
    return df.drop(exclude_indices, axis='index')
data=bhk_outlier_remover(data)
data.shape
OUTPUT:-
(7361, 7)
This is the cleaned data.
data
OUTPUT:-
Drop the size and price_per_sqft column.
data.drop(columns=['size','price_per_sqft'],inplace=True)
To check the cleaned data.
#cleaned data
data.head()
OUTPUT:-
Save the cleaned data to a CSV file.
data.to_csv('cleaned_data.csv')
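If you don't want the DataFrame index written as an extra column, pass index=False (a common choice):
# Skip the index column when writing the CSV
data.to_csv('cleaned_data.csv', index=False)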
Split the data into features (X) and target (y), then into train and test sets with an 80/20 split.
X=data.drop(columns=['price'])
y=data['price']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=0)
Print the shapes of the split data.
print(X_train.shape)
print(X_test.shape)
OUTPUT:-
(5888, 4)
(1473, 4)
Now apply the three types of regression.
LINEAR REGRESSION
# one-hot encode the location column and pass the numeric columns through
# (on scikit-learn < 1.2 use sparse=False instead of sparse_output=False)
column_trans=make_column_transformer((OneHotEncoder(sparse_output=False),['location']),remainder='passthrough')
scaler=StandardScaler()
# the normalize argument was removed from LinearRegression in newer scikit-learn;
# the StandardScaler step in the pipeline already handles the scaling
lr=LinearRegression()
pipe=make_pipeline(column_trans,scaler,lr)
pipe.fit(X_train,y_train)
OUTPUT:-
To check the R² score for linear regression:
y_pred_lr=pipe.predict(X_test)
r2_score(y_test,y_pred_lr)
OUTPUT:-
0.823438105512173
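With the fitted pipeline you can sketch a prediction for a single listing; the input values below are purely illustrative, and the pipeline handles the one-hot encoding and scaling internally:
# Predict the price (in lakhs) for one hypothetical listing;
# the columns must match the training frame: location, total_sqft, bath, bhk
sample = pd.DataFrame([['Sarjapur Road', 1200, 2.0, 2]],
                      columns=['location', 'total_sqft', 'bath', 'bhk'])
print(pipe.predict(sample)[0])  # estimated price in lakhs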
Now do the same with Lasso regression.
LASSO REGRESSION
lasso=Lasso()
pipe=make_pipeline(column_trans,scaler,lasso)
pipe.fit(X_train,y_train)
OUTPUT:-
To check the R² score of Lasso regression:
y_pred_lasso=pipe.predict(X_test)
r2_score(y_test,y_pred_lasso)
OUTPUT:-
0.8128285650772719
The R² score is about 0.81, which is reasonably good.
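Lasso uses a default regularization strength of alpha=1.0; a small sketch of trying a few strengths (illustrative values) could look like this:
# Try a few regularization strengths for Lasso (illustrative values)
for alpha in [0.1, 1.0, 10.0]:
    candidate = make_pipeline(column_trans, scaler, Lasso(alpha=alpha))
    candidate.fit(X_train, y_train)
    print(alpha, r2_score(y_test, candidate.predict(X_test)))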
Now apply the ridge regression.
RIDGE REGRESSION
ridge=Ridge()
pipe=make_pipeline(column_trans,scaler,ridge)
pipe.fit(X_train,y_train)
OUTPUT:-
To check the R² score for Ridge regression:
y_pred_ridge=pipe.predict(X_test)
r2_score(y_test,y_pred_ridge)
OUTPUT:-
0.8234146633312649
Now let's compare the R² scores of all three regressions.
print("No Regularization: ",r2_score(y_test,y_pred_lr )) print("Lasso: ",r2_score(y_test,y_pred_lasso)) print("Ridge: ",r2_score(y_test,y_pred_ridge))
OUTPUT:-
Of the three, Ridge regression comes out best, edging past plain linear regression by a small margin, so it is the one to apply here.