What is Data Cleaning in Data Science?
📌 Introduction
Data cleaning, a core part of data preprocessing, is the process of detecting and then correcting or removing inaccurate, incomplete, or irrelevant data from datasets. It is a crucial step in data science and machine learning, because poor-quality data can lead to incorrect insights and inaccurate predictions.
💡 Often-cited estimate: data scientists spend up to 80% of their time cleaning and preparing data.
🛠️ Why is Data Cleaning Important?
🔹 Improves Data Accuracy – Reduces errors and inconsistencies
🔹 Enhances Model Performance – Clean data leads to better predictions
🔹 Reduces Bias – Removes duplicate or misleading records that would otherwise be over-counted
🔹 Ensures Data Consistency – Standardizes formats and how missing values are represented
📌 Example:
Imagine a company analyzing customer transactions. If the dataset contains missing prices, incorrect dates, or duplicate entries, the sales analysis will be flawed.
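Before fixing anything, it helps to measure how widespread these problems are. The sketch below counts missing prices, duplicate rows, and invalid dates in a small made-up transactions table. The column names (order_id, price, order_date), the sample values, and the audit helper are all assumptions for illustration, not part of any real dataset.

```python
import pandas as pd

# Hypothetical transactions table; column names and values are made up for illustration.
transactions = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "price": [19.99, None, None, 5.00],
    "order_date": ["2024-01-15", "2024-02-30", "2024-02-30", "2024-03-01"],
})

def audit(df: pd.DataFrame) -> dict:
    """Count common data-quality problems before any cleaning is done."""
    # Dates that cannot be parsed (such as February 30) become NaT
    parsed = pd.to_datetime(df["order_date"], format="%Y-%m-%d", errors="coerce")
    return {
        "missing_prices": int(df["price"].isna().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "invalid_dates": int(parsed.isna().sum()),
    }

report = audit(transactions)
print(report)
```

A report like this gives a baseline, so the same counts can be checked again after cleaning.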
🔍 Key Steps in Data Cleaning
1️⃣ Handling Missing Data
✅ Techniques to handle missing values:
Drop rows with missing values (if the dataset is large and only a few rows are affected)
Fill with the mean or median (for numerical data) or the mode (for categorical data)
Use forward or backward fill (for time-series data)
📌 Example in Python:
import pandas as pd
# Assumes df is an existing DataFrame; fill numeric columns with their column means
df = df.fillna(df.mean(numeric_only=True))
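The other techniques listed above can be sketched in the same way. This is a minimal illustration on made-up data: the column names (age, city, temperature) and their values are assumptions, and the right strategy for a real dataset depends on why the values are missing.

```python
import pandas as pd

# Made-up sample data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "age": [25.0, None, 40.0, 31.0],
    "city": ["Pune", "Mumbai", None, "Pune"],
    "temperature": [30.5, None, None, 29.0],
})

# Median for a numeric column (less sensitive to outliers than the mean)
df["age"] = df["age"].fillna(df["age"].median())

# Mode (most frequent value) for a categorical column
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Forward fill for ordered, time-series-like data: carry the last known value forward
df["temperature"] = df["temperature"].ffill()

# Dropping is the alternative when only a few rows are affected;
# here nothing is left to drop because every gap has been filled
remaining = df.dropna()
```

Backward fill works the same way with bfill() in place of ffill().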
2️⃣ Removing Duplicates
✅ Duplicates can skew analysis and lead to incorrect conclusions.
📌 Example in Python:
df = df.drop_duplicates()  # Keep the first occurrence of each fully identical row
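Real duplicates often differ in one field, such as a timestamp, so an exact row match misses them. The sketch below deduplicates on a key column and keeps the most recent record. The table, its column names (customer_id, email, updated_at), and the "keep the latest" rule are assumptions for illustration.

```python
import pandas as pd

# Made-up customer records; column names are assumptions for illustration.
customers = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "email": ["a@example.com", "b@example.com", "b@example.com", "c@example.com"],
    "updated_at": ["2024-01-01", "2024-01-05", "2024-02-10", "2024-01-07"],
})

# Inspect how many rows repeat a customer_id before deleting anything
n_dupes = int(customers.duplicated(subset=["customer_id"]).sum())

# Keep only the most recent record per customer
# (ISO-formatted date strings sort chronologically)
deduped = (
    customers.sort_values("updated_at")
    .drop_duplicates(subset=["customer_id"], keep="last")
    .sort_values("customer_id")
    .reset_index(drop=True)
)
```

Counting duplicates first makes it easier to notice when a supposed key column is not actually unique.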
3️⃣ Standardizing Data Formats
✅ Ensure uniform formats for:
Date formats (YYYY-MM-DD vs. MM/DD/YYYY)
Text cases (uppercase/lowercase)
Units of measurement (e.g., km vs. miles)
📌 Example in Python:
df['date_column'] = pd.to_datetime(df['date_column']) # Standardize date format
df['name'] = df['name'].str.lower() # Convert text to lowercase
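Units of measurement can be standardized in a similar way. The following sketch converts mixed km and mile readings to kilometres and tidies a text column. The trips table and its column names (city, distance, unit) are made up for illustration.

```python
import pandas as pd

# Made-up data: distances recorded in mixed units (column names are assumptions).
trips = pd.DataFrame({
    "city": ["  Pune", "MUMBAI ", "pune"],
    "distance": [10.0, 5.0, 3.0],
    "unit": ["km", "miles", "km"],
})

KM_PER_MILE = 1.609344  # exact by definition of the international mile

# Convert every distance recorded in miles to kilometres
is_miles = trips["unit"] == "miles"
trips.loc[is_miles, "distance"] = trips.loc[is_miles, "distance"] * KM_PER_MILE
trips["unit"] = "km"

# Trim stray whitespace and unify case
trips["city"] = trips["city"].str.strip().str.lower()
```

Converting the values and the unit label together keeps the two columns from drifting out of sync.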
4️⃣ Handling Outliers
✅ Outliers can distort analysis and affect ML models.
🔹 Techniques:
Remove extreme values using IQR (Interquartile Range)
Use log transformations to reduce skew, so extreme values have less influence
📌 Example in Python:
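Q1 = df['column'].quantile(0.25)  # First quartile
Q3 = df['column'].quantile(0.75)  # Third quartile
IQR = Q3 - Q1
# Keep rows within 1.5 * IQR of the quartiles (rows with missing values are dropped too)
df = df[df['column'].between(Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)]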
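The log transformation mentioned above keeps every row and compresses large values instead of removing them. This is a minimal sketch on made-up, strongly right-skewed amounts. The series name and values are assumptions, and the approach only applies to non-negative data.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed values (for example, transaction amounts); the name is an assumption.
amount = pd.Series([10, 100, 1_000, 10_000, 100_000], name="amount")

# log1p computes log(1 + x), so it is also defined when a value is zero
log_amount = np.log1p(amount)

# Compare skewness before and after the transformation
print(amount.skew(), log_amount.skew())

# expm1 reverses the transformation when original units are needed again
restored = np.expm1(log_amount)
```

Because the transformation is reversible, predictions made on the log scale can be converted back to the original units.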
5️⃣ Correcting Data Entry Errors
✅ Common issues:
Variant spellings and abbreviations (e.g., "USA" vs. "U.S.A")
Inconsistent naming ("Male" vs. "M")
Typos and incorrect spellings
📌 Example in Python:
df['country'] = df['country'].replace({'U.S.A': 'USA', 'United States': 'USA'})
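Inconsistent labels such as "Male" vs. "M" can be handled with a mapping after normalizing case and whitespace. In the sketch below, the gender column, the sample responses, and the canonical mapping are assumptions for illustration. Values the mapping does not cover are surfaced for manual review rather than guessed.

```python
import pandas as pd

# Made-up survey responses; the column name and mapping are assumptions.
df = pd.DataFrame({"gender": ["Male", "M", " male ", "F", "Female", "femal"]})

# Normalize whitespace and case first so the mapping needs fewer entries
cleaned = df["gender"].str.strip().str.lower()

# Map known variants to one canonical label
canonical = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}
df["gender_clean"] = cleaned.map(canonical)

# Values not covered by the mapping become NaN, so they can be reviewed by hand
unmapped = df.loc[df["gender_clean"].isna(), "gender"].unique().tolist()
print(unmapped)
```

Unlike replace(), map() leaves no unrecognized value silently unchanged, which makes misspellings easier to catch.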