Data Cleaning with Python: A Real-World Guide with Examples

Three months into my first data project, I made a mistake that cost me a week of work. I had skipped proper data cleaning, assuming our dataset was "clean enough," and our analysis ended up being completely wrong. That painful lesson taught me something crucial: in data science, cleaning isn't just busywork – it's the foundation of everything that follows.

If you've worked with real-world data, you know the frustration. Missing values, inconsistent formats, duplicate entries, and mysterious outliers – these aren't just textbook problems, they're daily challenges. While everyone loves to talk about sophisticated machine learning models and stunning visualizations, the truth is that without clean data, even the most advanced algorithms will fail.

In this guide, I'll share the exact data cleaning process I now use in my day-to-day work as a data scientist. We'll work with a messy e-commerce dataset that contains all the classic problems you'll encounter in the real world. No theoretical examples – just practical, hands-on code you can start using today.

By the end of this tutorial, you'll know how to:

  • Handle missing and duplicate data confidently

  • Standardize inconsistent formats

  • Deal with outliers systematically

  • Create reusable cleaning functions

  • Validate your cleaned dataset

Whether you're just starting your data science journey or you're looking to refine your cleaning process, this guide will give you the tools you need to build a solid foundation for your analysis.

Let's dive in and turn that messy data into something useful.

The Dataset We'll Use

We'll work with a messy e-commerce dataset that has common real-world problems:

  • Missing values

  • Inconsistent date formats

  • Duplicate entries

  • Incorrect data types

  • Outliers

  • Inconsistent text formatting

import pandas as pd
import numpy as np
from datetime import datetime

# Sample messy data
data = {
    'customer_id': ['A001', 'A002', 'A001', 'A003', np.nan, 'A004'],
    'purchase_date': ['2024-01-01', '01/15/2024', '2024-01-01', '2024-02-01', '2024-01-05', '2024-02-15'],
    'product': ['Laptop', 'laptop ', 'LAPTOP', 'Phone', 'Tablet', 'phone'],
    'price': ['1200', 1500, '1200', '800', '400', '750'],
    'quantity': [1, 2, 1, -1, 3, 2],
}

df = pd.DataFrame(data)

Step 1: Initial Assessment

First, let's understand what we're dealing with:

# Basic information about our dataset
print(df.info())
print("\nMissing values:\n", df.isnull().sum())
print("\nDuplicate rows:", df.duplicated().sum())

Step 2: Handling Missing Values

# Fill missing customer IDs with a placeholder
df['customer_id'] = df['customer_id'].fillna('UNKNOWN')

# Verify no missing values remain
print("Missing values after cleaning:", df.isnull().sum())

Step 3: Standardizing Date Formats

def standardize_date(date_str):
    # Try each known format in turn; fall back to NaT if none match
    for fmt in ['%Y-%m-%d', '%m/%d/%Y']:
        try:
            return pd.to_datetime(date_str, format=fmt)
        except (ValueError, TypeError):
            continue
    return pd.NaT

df['purchase_date'] = df['purchase_date'].apply(standardize_date)
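
If you're on pandas 2.0 or later, the same normalization can be done in one vectorized call; a sketch, assuming that newer API is available:

# pandas >= 2.0 only: infer each element's format individually
df['purchase_date'] = pd.to_datetime(df['purchase_date'], format='mixed', errors='coerce')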

Step 4: Cleaning Text Data

# Standardize product names
df['product'] = df['product'].str.strip().str.title()

# Verify unique products after cleaning
print("\nUnique products:", df['product'].unique())

Step 5: Converting Data Types

# Convert price to numeric, stripping any currency symbols
# (regex=False treats '$' as a literal character, not a regex anchor)
df['price'] = pd.to_numeric(df['price'].astype(str).str.replace('$', '', regex=False), errors='coerce')

# Verify data types
print("\nUpdated data types:\n", df.dtypes)

Step 6: Handling Duplicates

# Identify duplicate transactions (keep=False marks every copy, not just the extras)
duplicates = df[df.duplicated(subset=['customer_id', 'purchase_date', 'product', 'price'], keep=False)]
print(f"Duplicate transactions found: {len(duplicates)}")

# Remove duplicates, keeping the first occurrence
df = df.drop_duplicates(subset=['customer_id', 'purchase_date', 'product', 'price'])

Step 7: Handling Invalid Values

# Fix negative quantities (the sign here is treated as a data-entry error)
df['quantity'] = df['quantity'].abs()

# Create price boundaries (mean +/- 2 standard deviations)
price_mean = df['price'].mean()
price_std = df['price'].std()
price_bounds = (price_mean - 2 * price_std, price_mean + 2 * price_std)

# Flag potential price outliers
df['price_outlier'] = np.where((df['price'] < price_bounds[0]) | (df['price'] > price_bounds[1]), 'Yes', 'No')
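
The two-standard-deviation rule is easy to explain, but the mean and standard deviation are themselves pulled around by the very outliers you're hunting. An IQR-based fence is a common, more robust alternative; a sketch:

# IQR fence: flag values beyond 1.5 * IQR from the quartiles
q1, q3 = df['price'].quantile([0.25, 0.75])
iqr = q3 - q1
iqr_bounds = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df['price_outlier_iqr'] = np.where((df['price'] < iqr_bounds[0]) | (df['price'] > iqr_bounds[1]), 'Yes', 'No')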

Step 8: Creating a Cleaning Function

def clean_ecommerce_data(df):
    """
    Clean e-commerce data with standard preprocessing steps
    """
    # Create a copy to avoid modifying original data
    df_clean = df.copy()
    
    # Apply all our cleaning steps
    df_clean['customer_id'] = df_clean['customer_id'].fillna('UNKNOWN')
    df_clean['purchase_date'] = df_clean['purchase_date'].apply(standardize_date)
    df_clean['product'] = df_clean['product'].str.strip().str.title()
    df_clean['price'] = pd.to_numeric(df_clean['price'].astype(str).str.replace('$', '', regex=False), errors='coerce')
    df_clean['quantity'] = df_clean['quantity'].abs()
    
    # Remove exact duplicate rows (pass subset=... here to mirror Step 6's narrower check)
    df_clean = df_clean.drop_duplicates()
    
    return df_clean

# Run the full pipeline in one call (in practice, point this at the raw data)
clean_df = clean_ecommerce_data(df)

Common Pitfalls to Avoid

  1. Not Making a Copy: Always create a copy of your original data before cleaning (see the sketch after this list)

  2. Assuming Consistency: Check for multiple formats, especially in dates and categories

  3. Blind Removal: Don't remove outliers without understanding why they exist

  4. Missing Documentation: Document your cleaning steps for reproducibility
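
To make the first and third pitfalls concrete, here's a minimal sketch: work on a copy, and inspect flagged outliers before deciding their fate.

# Pitfall 1: clean a copy so the original data survives your experiments
df_work = df.copy()

# Pitfall 3: eyeball flagged outliers before dropping anything --
# an extreme price might be a legitimate bulk order, not an error
print(df_work[df_work['price_outlier'] == 'Yes'])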

Validation After Cleaning

def validate_clean_data(df):
    """
    Validate cleaned dataset meets our requirements
    """
    checks = {
        'missing_values': df.isnull().sum().sum() == 0,
        'negative_quantities': (df['quantity'] >= 0).all(),
        'valid_dates': df['purchase_date'].notna().all(),
        'valid_prices': (df['price'] > 0).all()
    }
    
    return pd.Series(checks)

# Run validation
validation_results = validate_clean_data(clean_df)
print("\nValidation Results:\n", validation_results)

Next Steps

Once your data is clean, you can:

  • Create summary statistics (see the quick example after this list)

  • Build visualizations

  • Conduct analysis

  • Train machine learning models
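
As a first taste of that analysis stage, here's a quick revenue summary on the cleaned data:

# Revenue per product from the cleaned dataset
clean_df['revenue'] = clean_df['price'] * clean_df['quantity']
print(clean_df.groupby('product')['revenue'].sum())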

Resources for Learning More

  • Pandas documentation

  • Real Python's data cleaning tutorials

  • Stack Overflow's pandas tag

Remember: Good data cleaning is the foundation of all data analysis. Take the time to do it right.

Next week, we'll look at automated data cleaning pipelines using Python. Subscribe to get notified!
