Data collection, cleaning, and preparation are the first steps in any machine learning project. We tend to picture data analysts and data scientists spending their time reasoning about algorithms, but most of their time actually goes into data cleaning.
Data cleaning is the process of identifying incomplete, duplicate, inaccurate, or inconsistent records and then removing or correcting the invalid and unwanted information, thereby increasing data quality.
When businesses combine datasets from various sources, such as data received from clients or scraped from the web, inconsistencies and quality issues creep in. Common problems include the following:
Incomplete Data: The data contains missing values, blank fields, etc. For example, employee payroll cannot be processed if social security numbers are missing from the records.
Duplicate Data: The same record exists more than once in the database. Because of this, a logistics team can misread the count of inventory items, which can lead to unnecessary billing.
Invalid Data: Attribute values do not follow the expected standard or format.
Example: a person's first name that contains digits is an invalid name.
Conflicting Data: Different versions of the same record exist and the current version is not marked. Example: the same company appears with several addresses, and none of them is flagged as the current one. (A small sketch after this list shows several of these issues in one table.)
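As a hedged illustration (all names and values below are made up), a tiny pandas table can exhibit several of these issues at once: a missing social security number, an exact duplicate row, and a first name that contains digits.

import pandas as pd

# Hypothetical employee records showing incomplete, duplicate, and invalid data
employees = pd.DataFrame({
    "first_name": ["Alice", "B0b", "Carol", "Carol"],                   # "B0b" contains a digit -> invalid
    "ssn":        ["123-45-6789", None, "987-65-4321", "987-65-4321"],  # missing SSN -> incomplete
    "salary":     [52000, 48000, 61000, 61000],                         # last two rows are exact duplicates
})
print(employees)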
Software Bugs: Software applications can overwrite data due to bugs, or data can be written with some errors.
Synchronization Issues: Data is not properly synchronized between systems, so different systems end up holding conflicting or outdated copies.
Information Obfuscation by Users: People may give incorrect or incomplete data to protect their privacy.
Data is an asset to any organization, whether it is a bank, a retailer, or anything else. Raw real-world data is often of too poor a quality to analyze directly, which is why we need data cleaning. Let's look at some scenarios to understand this better.
Health Sector: In healthcare, poor-quality data can lead to wrong treatments, with consequences as serious as patient deaths.
Sales: A salesperson may fail to keep in touch with existing customers because their contact data is incorrect.
Logistics and Manufacturing: Inventory valuations depend on valid data. Inconsistent data can cause delivery problems, dissatisfied customers, and even lost customers.
Finance and Accounting: Inconsistent, incomplete, and inaccurate data delays decisions because manual checks have to be made every time.
These are all problems that can be avoided if the data is clean.
Better Sales: Most key marketing decisions depend on customer data. If the marketing department works with high-quality data, the sales team gets better leads.
Boost Results and Revenue: With high-quality data, an organization can segment its target audience, which improves campaign results and leads to a greater return on investment.
Save Time and Increase Productivity: If a salesperson has accurate data, they won't waste time contacting invalid customers or prospects, which makes the sales team and the whole business more productive.
Streamlined Business Practices: With no errors, duplicates, or inconsistencies in the data, daily activities run more smoothly and decisions are not delayed.
There are several ways to find the missing or null values present in data. Let's look at some of them below:
Using the isnull() function: It is used to check for null values in a dataset. The syntax below returns True wherever a value is null; if every value returned is False, there are no null values in the dataset.
Syntax: Data.isnull()
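As a minimal, hypothetical sketch (the Data frame and its columns are made up here), isnull() returns a DataFrame of booleans, and chaining .sum() counts the missing values per column:

import pandas as pd
import numpy as np

# Hypothetical dataset with a couple of missing values
Data = pd.DataFrame({"age": [25, np.nan, 32], "city": ["Pune", "Delhi", None]})

print(Data.isnull())        # True wherever a value is missing
print(Data.isnull().sum())  # number of missing values in each column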
Using the isna() function: It works the same way as isnull() and returns True wherever a value is NA (missing) in the dataset.
Syntax: Data.isna()
Using isna().any().sum(): Chaining .any() returns True for every column that contains at least one null value, so adding .sum() gives the number of columns with nulls. To count every null value in the dataset, use Data.isna().sum().sum().
Syntax: Data.isna().any().sum()
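Continuing with the same hypothetical Data frame, the sketch below contrasts the two counts: .any().sum() gives the number of columns that contain at least one null, while .sum().sum() gives the total number of null values.

print(Data.isna().any())          # True for each column that has at least one null value
print(Data.isna().any().sum())    # 2 -> both columns contain a null
print(Data.isna().sum().sum())    # 2 -> total number of null values in the dataset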
De-duplicate: The first step is to find the duplicates. The duplicated() function flags duplicate rows in the data so they can be reviewed and removed, which helps us get an accurate analysis.
Syntax: Data.duplicated()
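A minimal sketch with made-up inventory rows: duplicated() returns a boolean Series marking every repeat of an earlier row, so the duplicates can be inspected before anything is deleted.

import pandas as pd

# Hypothetical inventory records where the second "pen" row is an exact repeat
Data = pd.DataFrame({"item": ["pen", "pen", "book"], "qty": [10, 10, 3]})

print(Data.duplicated())        # False, True, False
print(Data[Data.duplicated()])  # shows only the duplicate rows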
drop_duplicates: The drop_duplicates() function keeps the first occurrence of every duplicate row and deletes the rest.
Syntax: Data.drop_duplicates(subset=None, keep='first')
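Applied to the same hypothetical inventory frame, drop_duplicates() with keep='first' keeps the first occurrence of each duplicate row and drops the rest; passing a column list as subset restricts the comparison to those columns.

clean = Data.drop_duplicates(subset=None, keep='first')
print(clean)                     # the repeated "pen" row is removed

# Hypothetical variant: treat rows as duplicates based on the "item" column only
clean_by_item = Data.drop_duplicates(subset=['item'], keep='first')
print(clean_by_item)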
In normalization, the data values are rescaled to the range 0 to 1, which is why it is also called min-max scaling. The formula for normalization is: X' = (X − Xmin) / (Xmax − Xmin)
When X equals the minimum value of the column (Xmin), the numerator is zero, so X' is 0.
When X equals the maximum value (Xmax), the numerator and denominator are equal, so X' is 1.
If X lies between Xmin and Xmax, then X' falls between 0 and 1.
Code:
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training data, then apply the same scaling to the train and test sets
normal = MinMaxScaler().fit(X_train)
X_train_normal = normal.transform(X_train)
X_test_normal = normal.transform(X_test)
Standardization is the second feature scaling technique. Here the data values are centered around the mean with a unit standard deviation, i.e. X' = (X − μ) / σ, where μ is the mean and σ is the standard deviation of the column.
Code:
from sklearn.preprocessing import StandardScaler

X_train_set = X_train.copy()
X_test_set = X_test.copy()

numeric_cols = ['Item_visibility', 'Outlet_Establishment_Year', 'Item_MRP', 'Item_Weight']

# Fit a scaler on each numeric column of the training set, then transform both sets with it
for i in numeric_cols:
    scale = StandardScaler().fit(X_train_set[[i]])
    X_train_set[i] = scale.transform(X_train_set[[i]])
    X_test_set[i] = scale.transform(X_test_set[[i]])
Now the question is when to use which technique?
We should use normalization when the data does not follow a Gaussian distribution.
Standardization is useful when the dataset follows a Gaussian distribution. Unlike normalization, standardization does not have a bounding range, so outliers are not squeezed into a fixed interval and they distort the scaled values less than they do with min-max scaling.
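To make the comparison concrete, the sketch below (with a made-up column containing one large outlier) applies both techniques: min-max scaling squeezes every value into the 0–1 range, while standardization leaves the scaled values unbounded.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

values = np.array([[1.0], [2.0], [3.0], [100.0]])       # hypothetical column with one large outlier

print(MinMaxScaler().fit_transform(values).ravel())     # every value lies between 0 and 1
print(StandardScaler().fit_transform(values).ravel())   # values are unbounded and centred on 0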
Thanks for reading about what data cleaning is and some of the methods you can use for it.