**This blog focuses on data preprocessing: mostly EDA, finding and filling missing values, and feature scaling.**

**TABLE OF CONTENTS**

- Importing Libraries
- Creating Dataset
- EDA on Dataset
- Find Missing Values
- Replace The Missing Values
- Visualizing Data
- Categorical variables
- Splitting the dataset into train and test sets
- Feature Scaling

```
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import random
```

```
df = pd.DataFrame({"Country": ['France', 'Spain', 'Germany', 'Spain', 'Germany', 'France', 'Spain', 'Germany', 'Spain', 'France'], 'Age': [44, 27, 30, 38, 40 , 35, np.nan, 48 , 50 , 37], 'Salary': [72000, 48000, 54000, 61000, np.nan, 58000, 52000, 79000, 83000, 67000], 'Purchased': ['No','Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes' ]})
```

```
print(df.info())
```

```
print(df.head(4))
```

```
print(df.describe())
```

```
import missingno as mp
mp.matrix(df) #will plot the missing value chart
```

```
print(df.isna().sum()) #will print the total of null values
```

```
df = df.fillna(df.mean(numeric_only=True)) # replace the missing values with the mean of each numeric column
```

```
mp.matrix(df) #will plot the missing value chart
```

##### We can also use an imputer to do the same task

Here is the code for it:

```
from sklearn.impute import SimpleImputer

# 'mean' replaces each missing value with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# columns 1 and 2 are Age and Salary
imputer = imputer.fit(df.iloc[:, 1:3])
df.iloc[:, 1:3] = imputer.transform(df.iloc[:, 1:3])
```
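As a quick sanity check, the sketch below rebuilds only the Age and Salary columns of the dataset and compares the two ways of filling missing values with the mean. The names `raw`, `filled_pandas` and `filled_sklearn` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same Age and Salary values as the dataset in this post, missing entries included.
raw = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan, 58000, 52000, 79000, 83000, 67000],
})

# Approach 1: pandas fills each column with its own mean (NaN is skipped when averaging).
filled_pandas = raw.fillna(raw.mean())

# Approach 2: SimpleImputer learns the column means in fit() and applies them in transform().
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled_sklearn = imputer.fit_transform(raw)

print(imputer.statistics_)  # the learned column means for Age and Salary
print(np.allclose(filled_pandas.to_numpy(), filled_sklearn))  # True if both approaches agree
```

The imputer has one practical advantage: it stores the learned means, so the same values can later be applied to new data with `transform`.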


```
x = df.iloc[: , 0:3].values # independent variable
y = df.iloc[: , 3:4].values # dependent variable
```

```
print(x)
```

```
print(y)
```


## Categorical variables

#### Now we will take care of the categorical variables

We will use three scikit-learn classes for this: LabelEncoder, OneHotEncoder and ColumnTransformer.

- LabelEncoder encodes labels as integers, i.e. here France becomes 0, Germany 1 and Spain 2.
- OneHotEncoder turns a categorical column into one column of 0s and 1s per category. By default the result is returned as a sparse matrix.
- ColumnTransformer applies the encoder only to the columns we select (here column 0, Country) and passes the remaining columns through unchanged.
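To see the difference between integer encoding and one-hot encoding, here is a small sketch on a hypothetical four-row `countries` array (not the full dataset).

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical sample of the Country column.
countries = np.array(["France", "Spain", "Germany", "Spain"])

# LabelEncoder assigns one integer per category, in sorted order of the category names.
labels = LabelEncoder().fit_transform(countries)
print(labels)  # [0 2 1 2] -> France=0, Germany=1, Spain=2

# OneHotEncoder expects 2D input (rows x features) and returns a sparse matrix by default,
# so we reshape to a single column and convert the result to a dense array for printing.
one_hot = OneHotEncoder().fit_transform(countries.reshape(-1, 1)).toarray()
print(one_hot)  # one column per country, exactly one 1 in each row
```

One-hot encoding avoids implying an order between countries, which the integers 0, 1, 2 would otherwise suggest to a model.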


```
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [0])], remainder = 'passthrough') #creating an instance
x = np.array(ct.fit_transform(x))
```

```
print(x)
```

```
print(y)
```

```
# LabelEncoder is the preferred way to encode a binary target such as Purchased
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
y = encoder.fit_transform(y.ravel()) # LabelEncoder expects a 1D array
y = y.reshape(-1, 1) # back to a column vector without hard-coding the number of rows
```

```
from sklearn.model_selection import train_test_split
x_train , x_test , y_train , y_test = train_test_split(x, y , test_size = 0.3 , random_state = 0)
```

```
print('x_train: ', x_train, '\n', 'y_train: ', y_train, '\n' ,'x_test: ', x_test, '\n', 'y_test: ', y_test)
```

```
np.random.seed(2) # seed NumPy's generator so that we get the same results
splt = np.random.rand(len(df)) < 0.8 # random mask: roughly 80% train, 20% test
x_train , y_train = x[splt] , y[splt]
x_test , y_test = x[~splt], y[~splt]
```

```
print('x_train: ', x_train, '\n', 'y_train: ', y_train, '\n' ,'x_test: ', x_test, '\n', 'y_test: ', y_test)
```

## Feature Scaling

#### There are two main techniques in feature scaling

- Standardization
- Normalization

The terms normalization and standardization are sometimes used interchangeably, but they usually refer to different things.

**Standardization** rescales data to have a mean (𝜇) of 0 and a standard deviation (𝜎) of 1 (unit variance).
The formula for standardization is
$X_{changed} = \frac{X - \mu}{\sigma}$
For most applications standardization is recommended.

**Normalization** usually means scaling a variable to have values between 0 and 1,
while standardization transforms data to have a mean of zero and a standard deviation of 1.

Normalization rescales the values into the range [0, 1]. This might be useful in some cases where all parameters need to have the same positive scale. However, it is sensitive to outliers: a single extreme value sets $X_{min}$ or $X_{max}$ and squeezes the remaining values into a narrow part of the range. The formula for normalization is $X_{changed} = \frac{X - X_{min}}{X_{max}-X_{min}}$
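Both formulas can be checked by hand. The sketch below applies them to a hypothetical `ages` column and compares the results with scikit-learn's StandardScaler and MinMaxScaler. Note that StandardScaler uses the population standard deviation, which is also NumPy's default for `std`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature, shaped as a column (rows x 1) as scikit-learn expects.
ages = np.array([[27.0], [30.0], [38.0], [44.0], [50.0]])

# Standardization: (X - mean) / std
standardized = (ages - ages.mean()) / ages.std()

# Normalization: (X - min) / (max - min)
normalized = (ages - ages.min()) / (ages.max() - ages.min())

# Both comparisons print True when the manual formulas match the library results.
print(np.allclose(standardized, StandardScaler().fit_transform(ages)))
print(np.allclose(normalized, MinMaxScaler().fit_transform(ages)))
print(normalized.ravel())  # smallest age maps to 0, largest to 1
```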

```
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
```

**We must standardize the data only after the split, and the scaler should be fitted only to x_train. If we fitted it on the full dataset, it would learn the mean and standard deviation of the values in x_test, which should stay hidden from us. So we fit the scaler to the training set only and then use that fitted scaler to transform x_test.**
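A minimal sketch of this rule, using hypothetical `train` and `test` arrays in place of the real split: the scaler's stored mean comes from the training rows only, and the test rows are transformed with those same training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric feature that has already been split into train and test rows.
train = np.array([[27.0], [30.0], [38.0], [44.0]])
test = np.array([[50.0], [35.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from the training rows only
test_scaled = scaler.transform(test)        # reuses the training mean and std, learns nothing new

print(scaler.mean_)         # [34.75], the mean of the four training values
print(test_scaled.ravel())  # test values expressed relative to the training distribution
```

Because the test rows are scaled with training statistics, their scaled mean is generally not 0, which is expected.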

**Do we have to apply standardization to the dummy variables (the Country columns here) in the matrix of features?**

**The answer is no.**

Standardization rescales each column to a mean of 0 and a standard deviation of 1, so most values end up roughly in the range (-3, +3). But here most of the data is already in 0s and 1s after we converted it using ColumnTransformer, OneHotEncoder and LabelEncoder, so it is already on a small, comparable scale.

There is nothing extra to be done here.

Moreover, standardization would replace the 0s and 1s with arbitrary positive and negative values, which would worsen our understanding of the data because the numbers would no longer read as "belongs to this country or not".

Feature scaling on the dataset often helps the model, but when we do the same on the dummy variables it makes the data unreadable and we can no longer relate the country to the salary etc.

So feature scaling should be applied to the numeric features but not to the dummy variables, as they were already encoded using ColumnTransformer, OneHotEncoder and LabelEncoder.
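To illustrate what standardizing dummy variables would do, here is a short sketch on a hypothetical four-row one-hot matrix (`dummies` is an illustrative name, not a variable from the notebook).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-hot columns for France, Germany and Spain (four rows).
dummies = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Standardizing replaces the readable 0/1 flags with values that depend on
# how often each country appears in the data.
scaled = StandardScaler().fit_transform(dummies)
print(scaled)
```

In the France column, for example, the single 1 becomes about 1.73 and the 0s become about -0.58, so the column no longer reads as a simple yes/no flag.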

```
x_train[: , 3:] = sc.fit_transform(x_train[:, 3:])
```

```
print(x_train)
```

```
x_test[: , 3:] = sc.transform(x_test[:, 3:])
```

```
print(x_test)
```