Data preprocessing in Python
Hey Everyone,
In this tutorial, let’s learn about data preprocessing in Python, which is a very important step in data mining. Data preprocessing prepares the raw data we have collected from our sources and makes it suitable for a machine-learning model. Let’s first look at why data preprocessing is needed:
Need for Data preprocessing
Raw data generally contains a lot of noise or missing values, or it may not be in a format that machine-learning models can use directly. Data preprocessing performs tasks such as data cleaning to make the data suitable for our models. Let’s perform data preprocessing on the Titanic dataset. You can download the dataset using the link given below:
https://www.kaggle.com/competitions/titanic/data
As you can see in the screenshot below this dataset contains a lot of missing values:
Performing data preprocessing on the dataset
Let’s start by importing the required libraries, NumPy and pandas:
import numpy as np
import pandas as pd
Now we will import the dataset using the pandas Python library:
dataset = pd.read_csv('titanic.csv')
Let’s first check the description of the dataset using the info() function:
dataset.info()
Output:
We can also check for missing values using isnull():
dataset.isnull()
Output:
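The True/False table that isnull() returns is hard to read on a large dataset, so it is common to summarize it per column. Here is a minimal sketch on a tiny made-up DataFrame (the values are illustrative, not taken from the Titanic file) that chains isnull() with sum() and mean():

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the real data (illustrative values only).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() marks each cell True/False; sum() counts the True values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# mean() of the same mask gives the share of missing values per column.
missing_share = toy.isnull().mean()
print(missing_share)
```

The same two lines should work on the full dataset by replacing toy with dataset.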
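Let’s fill the missing values by replacing each null value with the previous valid value in the same column (forward fill). This used to be written as fillna(method='pad'), but recent pandas versions deprecate the method argument of fillna() in favor of ffill():
dataset.ffill()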
Output:
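To see what filling with the previous value does, here is a small sketch on a made-up column (the numbers are illustrative). It uses ffill(), which performs the same forward fill as the 'pad' method:

```python
import numpy as np
import pandas as pd

# Made-up column with gaps at the start, middle and end.
toy = pd.DataFrame({"Age": [np.nan, 22.0, np.nan, 35.0, np.nan]})

# Each NaN takes the last valid value that appears above it.
filled = toy.ffill()
print(filled)

# The first row has nothing above it, so it stays NaN.
print(filled["Age"].isnull().sum())
```

Note that a missing value in the very first row is left untouched, so you may still need another method for it.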
You can also drop rows that contain missing values using the dropna() function:
dataset = dataset.dropna()
dataset.info()
Output:
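How much data dropna() removes depends on which columns you let it look at. The sketch below uses a made-up three-row frame (illustrative values) to compare dropping on every column with dropping only where a chosen column is missing, using the subset parameter:

```python
import numpy as np
import pandas as pd

# Made-up frame in which every row has a gap somewhere.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
    "Fare": [7.25, 71.28, 7.92],
})

# Default behaviour: drop a row if ANY column is missing.
dropped_any = toy.dropna()

# Only drop rows where the 'Age' column is missing.
dropped_age = toy.dropna(subset=["Age"])

print(len(toy), len(dropped_any), len(dropped_age))
```

In this toy case the default call removes every row, while restricting the check to one column keeps two of the three.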
But by dropping rows we throw away data, so it is usually preferable to fill the missing values with substitute values, as we did before.
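We can also convert the categorical columns into dummy variables and then drop the original columns after conversion:
dummies = []
cols = ['Pclass', 'Sex', 'Embarked']
for col in cols:
    # One indicator column per category, prefixed with the source column name
    dummies.append(pd.get_dummies(dataset[col], prefix=col))
titanic = pd.concat(dummies, axis=1)
# Drop the original categorical columns and attach the dummy columns in their place
dataset = pd.concat([dataset.drop(cols, axis=1), titanic], axis=1)
dataset.info()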
Output:
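If the idea of dummy variables is new, this small sketch may help. It uses a made-up frame (illustrative values) and passes the column names to get_dummies() directly through its columns parameter, which is one possible shortcut for the loop-and-concat approach:

```python
import pandas as pd

# Made-up frame with two categorical columns and one numeric column.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
    "Fare": [7.25, 71.28, 7.92],
})

# Each listed column is replaced by one indicator column per category;
# columns that are not listed (here 'Fare') are kept unchanged.
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"])
print(encoded.columns.tolist())
print(encoded)
```

Each original category becomes its own column, named after the source column and the category value.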
Now let’s replace the NaN values in the data with -99:
dataset.replace(to_replace=np.nan, value=-99)
Output:
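A placeholder such as -99 is easy to spot, but it is treated like any other number in later calculations. The sketch below, on a made-up column (illustrative values), shows the replacement and how it shifts the column mean:

```python
import numpy as np
import pandas as pd

# Made-up column with one missing value.
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0]})

# mean() skips NaN by default, so only 22 and 26 are averaged.
mean_before = toy["Age"].mean()

# Replace every NaN with the placeholder value -99.
replaced = toy.replace(to_replace=np.nan, value=-99)

# Now -99 is counted as a real age and drags the mean down.
mean_after = replaced["Age"].mean()

print(mean_before, mean_after)
```

So if you use a placeholder like this, it is worth keeping in mind when you compute statistics or train a model on that column.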
We can also use the interpolate() function to fill the missing values using the linear method:
dataset['Age'] = dataset['Age'].interpolate()
dataset
Output:
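Linear interpolation fills a gap by placing the missing points on a straight line between the nearest known values. A minimal sketch on a made-up series (illustrative values):

```python
import numpy as np
import pandas as pd

# Made-up series with two consecutive gaps between 20 and 32.
ages = pd.Series([20.0, np.nan, np.nan, 32.0])

# The default method is 'linear': the gaps are spaced evenly between 20 and 32.
interpolated = ages.interpolate()
print(interpolated.tolist())
```

The two missing entries are filled with evenly spaced values between the known neighbours.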
You can use any of the above methods to deal with missing values in a dataset. You don’t need to use all of them; choose the functions that suit your data. After performing data preprocessing, we can proceed with building a machine-learning model.