Data cleaning is a critical step in the data preprocessing pipeline, ensuring that your data is accurate, complete, and ready for analysis. Clean data leads to more reliable insights and better decision-making. This guide will explore the significance of data cleaning, common challenges with raw data, and detailed techniques for cleaning and preparing your data.
Introduction to Data Cleaning
Data cleaning, also known as data cleansing or data scrubbing, involves identifying and correcting (or removing) errors and inconsistencies in data to improve its quality. This process is essential for ensuring that the data used in analysis and modeling is accurate and reliable.
Identify Missing Values: Use functions to detect missing values in the dataset.
import pandas as pd
df = pd.read_csv('data.csv')
missing_values = df.isnull().sum()
Remove Missing Values: Drop rows or columns with missing values.
df.dropna(axis=0, inplace=True)  # Drop rows with any missing values
Impute Missing Values: Fill missing values using various imputation techniques.
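# Pick one strategy per column; once the first fill runs, the later ones have nothing left to replace.
# Assigning the result back avoids chained in-place calls, which newer pandas versions warn about or ignore.
df['column'] = df['column'].fillna(df['column'].mean())  # Replace with mean
df['column'] = df['column'].fillna(df['column'].median())  # Replace with median
df['column'] = df['column'].fillna(df['column'].mode()[0])  # Replace with mode
df['column'] = df['column'].interpolate(method='linear')  # Linear interpolation
Identify Outliers: Use statistical methods or visualization tools to detect outliers.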
import numpy as np
from scipy import stats
z_scores = stats.zscore(df['column'])
abs_z_scores = np.abs(z_scores)
outliers = df[abs_z_scores > 3]  # Rows more than 3 standard deviations from the mean
Remove Outliers: Remove rows with outlier values.
df = df[abs_z_scores <= 3]  # Keep rows within 3 standard deviations of the mean
Cap Outliers: Replace outlier values with a specified percentile value.
q_low = df['column'].quantile(0.01)
q_high = df['column'].quantile(0.99)
df['column'] = np.where(df['column'] < q_low, q_low, df['column'])
df['column'] = np.where(df['column'] > q_high, q_high, df['column'])
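To see the outlier steps above run end to end on a small sample, here is a minimal sketch. It swaps in the interquartile range (IQR) rule for detection instead of z-scores, because with only ten rows no value can exceed |z| = 3 under the population standard deviation that `stats.zscore` uses by default. Capping uses `Series.clip`, which has the same effect as the two `np.where` calls. The toy data and the helper names `iqr_outlier_mask` and `cap_to_percentiles` are hypothetical, chosen only for illustration.

```python
import pandas as pd


def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


def cap_to_percentiles(s: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    """Clamp values to the given lower and upper percentiles (hypothetical helper)."""
    return s.clip(lower=s.quantile(lower), upper=s.quantile(upper))


# Made-up sample: nine values near 10 and one extreme reading.
df = pd.DataFrame(
    {"column": [10.0, 11.0, 9.0, 10.5, 9.5, 10.2, 9.8, 10.1, 9.9, 500.0]}
)

mask = iqr_outlier_mask(df["column"])      # Boolean Series, True where a value is an outlier
outliers = df[mask]                        # Rows flagged by the IQR rule
without_outliers = df[~mask]               # Option 1: drop the flagged rows
capped = cap_to_percentiles(df["column"])  # Option 2: keep every row, clamp the extremes

print(outliers)
print(capped.describe())
```

Dropping rows shrinks the dataset, while capping keeps every row and only limits the extremes. Which is appropriate depends on whether the extreme values are errors or genuine observations.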
Identify Duplicates: Find duplicate rows in the dataset.
duplicates = df.duplicated()
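`df.duplicated()` returns a boolean Series that marks every repeat of an earlier row, so it is usually combined with a count or a filter. The sketch below, on a made-up table with hypothetical `id` and `city` columns, shows how the `subset` and `keep` parameters change what counts as a duplicate, and how `drop_duplicates` removes exact copies. It is an illustration under those assumptions, not a prescription for any particular dataset.

```python
import pandas as pd

# Made-up sample: one exact repeat (id 2) and one id reused with different data (id 3).
df = pd.DataFrame(
    {
        "id": [1, 2, 2, 3, 3],
        "city": ["Lahore", "Karachi", "Karachi", "Multan", "Quetta"],
    }
)

# Exact duplicates: True for the second and later copies of an identical row.
full_dupes = df.duplicated()
n_full = int(full_dupes.sum())

# Duplicates on a key column only; keep=False marks every member of a repeated group.
id_dupes = df.duplicated(subset=["id"], keep=False)

# Drop exact copies, keeping the first occurrence, and renumber the index.
deduped = df.drop_duplicates().reset_index(drop=True)

print(n_full)
print(df[id_dupes])
print(deduped)
```

Rows that share a key but differ elsewhere, like the two `id` 3 rows here, are not exact duplicates. They usually need a manual decision about which record to trust.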