An important part of working on data science and machine learning problems is data preprocessing. If our data is noisy, or much of the information is not in a clean, structured format, it is not a good basis for model building. Feature engineering, or the selection of important features, therefore becomes critical. When we get data for model building or prediction, it usually contains many features, but not all columns are important, and some are interdependent (redundant), so we can remove some of them to reduce the dimensionality of the data. Below are the steps we will follow to perform feature engineering, i.e. to remove unimportant features from our dataset.
In this step we will remove constant features, which are not useful for solving the problem statement.
Variance threshold:
This feature selection technique removes all low-variance features. It looks only at the features (X), not the desired output (y), and can therefore also be used for unsupervised learning.
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

data = pd.DataFrame({
    "A": [1, 2, 4, 1, 2, 4],
    "B": [4, 5, 6, 7, 8, 9],
    "C": [0, 0, 0, 0, 0, 0],
    "D": [1, 1, 1, 1, 1, 1]
})

# Remove features whose variance is 0, i.e. constant columns
var_thres = VarianceThreshold(threshold=0)
var_thres.fit(data)
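As a follow-up, here is a minimal sketch that builds on the fitted var_thres above and uses get_support() to identify and drop the constant columns (the variable names below are illustrative):

# Boolean mask of the columns kept by the variance threshold
constant_columns = [col for col in data.columns
                    if col not in data.columns[var_thres.get_support()]]
print(constant_columns)          # ['C', 'D'] for the toy data above
data_reduced = data.drop(columns=constant_columns)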
Next, we will use the correlation (or covariance) matrix to find relationships between columns and set a threshold: if the absolute correlation between two columns is greater than that threshold, one of them can be dropped. Here we plot the correlation matrix for all the columns and then apply this condition to every pair of columns.
import seaborn as sns
import matplotlib.pyplot as plt

# Using Pearson correlation
plt.figure(figsize=(12, 10))
cor = X_train.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.CMRmap_r)
plt.show()
From the heatmap we can see that some values are positive and some are negative, with differing magnitudes. A large magnitude (whether positive or negative) means the two features are strongly related, while a value close to zero means they are nearly independent.
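To act on this, a common approach is to drop one feature from every highly correlated pair. The sketch below assumes X_train is the training DataFrame used above and uses 0.8 as an illustrative threshold:

# Collect one feature from every pair whose absolute correlation exceeds the threshold
def correlated_features(dataset, threshold):
    col_corr = set()
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold:
                col_corr.add(corr_matrix.columns[i])
    return col_corr

to_drop = correlated_features(X_train, threshold=0.8)   # 0.8 is an assumption
X_train_reduced = X_train.drop(columns=to_drop)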
Feature Selection Using Mutual Information for Classification Problems
This type of approach is mainly used in classification problems. Mutual information (MI) between two variables is a non-negative value. A high value means the two variables are strongly dependent on each other, while a value of zero means they are independent.
The function relies on nonparametric methods based on entropy estimation from k-nearest neighbors distances.
In short, the value obtained is called information gain: the amount of information about one variable that you gain by observing the other.
The information gain of variables X and Y is defined as: I(X; Y) = H(X) – H(X | Y), where I(X; Y) is the mutual information for X and Y, H(X) is the entropy for X, and H(X | Y) is the conditional entropy for X given Y. When the logarithm is taken in base 2, the result has units of bits.
This is the same notion of information gain that we use in decision trees and random forests, where splits are chosen using entropy or the Gini index.
df = pd.read_csv('https://gist.githubusercontent.com/tijptjik/9408623/raw/b237fa5848349a14a14e5d4107dc7897c21951f5/wine.csv')
print(df['Wine'].unique())

# Train test split to avoid overfitting
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop(labels=['Wine'], axis=1),
                                                     df['Wine'],
                                                     test_size=0.3,
                                                     random_state=0)
Above, we import the dataset and split it into train and test sets. Now we can move on to the next part of the analysis.
from sklearn.feature_selection import mutual_info_classif

# Determine the mutual information between each feature and the target
mutual_info = mutual_info_classif(X_train, y_train)
mutual_info = pd.Series(mutual_info)
mutual_info.index = X_train.columns
mutual_info.sort_values(ascending=False)

# Let's plot the ordered mutual_info values per feature
import matplotlib.pyplot as plt
mutual_info.sort_values(ascending=False).plot.bar(figsize=(20, 8))
plt.savefig('mutual.png')
In the bar plot the values decrease from left to right: features with lower mutual information scores share less information with the target and are therefore weaker candidates to keep.
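To actually select features based on these scores, one option is scikit-learn's SelectKBest; the sketch below keeps the five highest-scoring features (k=5 is an arbitrary choice for illustration):

from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Keep the five features with the highest mutual information scores
select_five = SelectKBest(mutual_info_classif, k=5)
select_five.fit(X_train, y_train)
print(X_train.columns[select_five.get_support()])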
Feature Selection Using Mutual Information for Regression Problems
This technique is used for regression problems. As before, mutual information between two variables is a non-negative value: a high value means the two variables are strongly dependent on each other, and a value of zero means they are independent. The function relies on nonparametric methods based on entropy estimation from k-nearest neighbors distances.
In short, the value obtained is called information gain, which represents the amount of information about one variable that you gain by observing the other.
The information gain of variables X and Y is again: I(X; Y) = H(X) – H(X | Y), where I(X; Y) is the mutual information for X and Y, H(X) is the entropy for X, and H(X | Y) is the conditional entropy for X given Y.
from sklearn.model_selection import train_test_split

# housing_df is a housing dataset with a 'SalePrice' target column
X_train, X_test, y_train, y_test = train_test_split(housing_df.drop(labels=['SalePrice'], axis=1),
                                                     housing_df['SalePrice'],
                                                     test_size=0.3,
                                                     random_state=0)

from sklearn.feature_selection import mutual_info_regression

# Determine the mutual information (fill missing values with 0 first)
mutual_info = mutual_info_regression(X_train.fillna(0), y_train)
mutual_info = pd.Series(mutual_info)
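As in the classification case, we can sort these scores and keep only the strongest features. Here is a sketch using SelectPercentile; the 20% figure is an arbitrary choice for illustration:

mutual_info.index = X_train.columns
print(mutual_info.sort_values(ascending=False))

from sklearn.feature_selection import SelectPercentile, mutual_info_regression

# Keep the top 20% of features by mutual information (20 is an assumption)
selected_top = SelectPercentile(mutual_info_regression, percentile=20)
selected_top.fit(X_train.fillna(0), y_train)
print(X_train.columns[selected_top.get_support()])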
We have already covered hypothesis testing in past articles; please visit Hypothesis Testing for more information, as I will not go deeply into the chi-square method here. In brief, the chi-square test measures the dependence between a (non-negative, typically categorical) feature and the target: a small p-value suggests the feature and the target are not independent.
So we will move directly to the practical implementation. I will assume the dataset has already been loaded and split into train and test data.
from sklearn.feature_selection import chi2
import pandas as pd

# chi2 returns (chi-square statistics, p-values) for each feature
f_p_values = chi2(X_train, y_train)
p_values = pd.Series(f_p_values[1])
p_values.index = X_train.columns
p_values
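Features with the smallest p-values are the strongest candidates to keep. A sketch of selecting them with SelectKBest and the chi2 score follows; k=10 is an arbitrary choice, and chi2 requires non-negative feature values:

from sklearn.feature_selection import SelectKBest, chi2

# Keep the ten features with the highest chi-square statistics (k=10 is an assumption)
chi2_selector = SelectKBest(chi2, k=10)
chi2_selector.fit(X_train, y_train)
print(X_train.columns[chi2_selector.get_support()])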
There are other important techniques in feature engineering as well, such as handling and removing null values in the dataset.
When some values are missing in the dataset, the methods below can be used to fill the nulls or otherwise handle NaN values.
Removing the examples (or features) with missing values from the dataset. This can be done if your dataset is big enough that you can afford to sacrifice some training data, but it is not the best approach because we lose information.
Using machine learning algorithms to predict the missing values, treating the column with missing values as the target and the other columns as training features.
Replacing the null values with the mean, median, or mode.
Filling the null values with dummy values.
Using data imputation techniques.
One such imputation technique consists of replacing the missing values of a feature with the average value of that feature in the dataset.
Another technique is to replace the missing values with a value outside the normal range: for example, if a feature normally lies in [-1, 1], you can replace missing entries with 2 or -2, so that the algorithm can easily learn that this value signals a missing entry.
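A minimal sketch of mean imputation with scikit-learn's SimpleImputer follows; the toy data and column names are assumptions for illustration:

import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with missing values (assumed for illustration)
df = pd.DataFrame({"age": [25, 30, None, 40], "salary": [50, None, 70, 80]})

# Replace each missing value with the mean of its column
imputer = SimpleImputer(strategy="mean")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Alternative: flag missingness with an out-of-range sentinel value
df_sentinel = df.fillna(-999)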
As we know, machine learning algorithms cannot directly understand categorical data, because they are built on mathematical equations that expect numbers. There are two common ways to handle categorical data:
Label Encoding
One Hot Encoding
Label encoding is a simple approach where we replace each unique value in the categorical data with a number. For example, banana, mango, and apple become 1, 2, and 3, and so on.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data['numerical'] = le.fit_transform(data['text'])
One hot encoding converts each categorical value into a multidimensional binary vector, which increases the dimensionality of your dataset. If you simply transform banana to 1, mango to 2, and apple to 3, the machine learning algorithm may be misled into finding a spurious ordering or pattern in 1, 2, 3; one hot encoding avoids this.
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
encoded = ohe.fit_transform(data[['text']])  # one binary column per category (sparse by default)
Binning is the opposite situation: you have numerical features but want to convert them into categorical ones. Binning is also called bucketing. It is the process of converting a continuous feature into multiple binary features called bins or buckets, typically based on value ranges. For example:
0 to 5 = bucket 1
5 to 10 = bucket 2
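A sketch of this bucketing with pandas follows; the column name and bin edges are assumptions for illustration:

import pandas as pd

scores = pd.DataFrame({"value": [1, 4, 6, 9, 3, 7]})

# Assign each value to a bucket: (0, 5] -> bucket1, (5, 10] -> bucket2
scores["bucket"] = pd.cut(scores["value"], bins=[0, 5, 10],
                          labels=["bucket1", "bucket2"])

# The categorical buckets can then be one-hot encoded into binary features
scores = pd.get_dummies(scores, columns=["bucket"])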