Data Cleaning with Python: A Real-World Guide with Examples
Three months into my first data project, I made a mistake that cost me a week of work. I had skipped proper data cleaning, assuming our dataset was "clean enough," and our analysis ended up being completely wrong. That painful lesson taught me something crucial: in data science, cleaning isn't just busywork – it's the foundation of everything that follows.
If you've worked with real-world data, you know the frustration. Missing values, inconsistent formats, duplicate entries, and mysterious outliers – these aren't just textbook problems, they're daily challenges. While everyone loves to talk about sophisticated machine learning models and stunning visualizations, the truth is that without clean data, even the most advanced algorithms will fail.
In this guide, I'll share the exact data cleaning process I now use in my day-to-day work as a data scientist. We'll work with a messy e-commerce dataset that contains all the classic problems you'll encounter in the real world. No theoretical examples – just practical, hands-on code you can start using today.
By the end of this tutorial, you'll know how to:
Handle missing and duplicate data confidently
Standardize inconsistent formats
Deal with outliers systematically
Create reusable cleaning functions
Validate your cleaned dataset
Whether you're just starting your data science journey or you're looking to refine your cleaning process, this guide will give you the tools you need to build a solid foundation for your analysis.
Let's dive in and turn that messy data into something useful.
Nobody warned me when I started out that I'd spend roughly 80% of my time cleaning data rather than modeling it. What follows are the techniques I now use daily, with real examples you can follow along with.
The Dataset We'll Use
We'll work with a messy e-commerce dataset that has common real-world problems:
Missing values
Inconsistent date formats
Duplicate entries
Incorrect data types
Outliers
Inconsistent text formatting
import pandas as pd
import numpy as np

# Sample messy data
data = {
    'customer_id': ['A001', 'A002', 'A001', 'A003', np.nan, 'A004'],
    'purchase_date': ['2024-01-01', '01/15/2024', '2024-01-01', '2024-02-01', '2024-01-05', '2024-02-15'],
    'product': ['Laptop', 'laptop ', 'LAPTOP', 'Phone', 'Tablet', 'phone'],
    'price': ['1200', 1500, '1200', '800', '400', '750'],
    'quantity': [1, 2, 1, -1, 3, 2],
}
df = pd.DataFrame(data)
Step 1: Initial Assessment
First, let's understand what we're dealing with:
# Basic information about our dataset
print(df.info())
print("\nMissing values:\n", df.isnull().sum())
print("\nDuplicate rows:", df.duplicated().sum())
Step 2: Handling Missing Values
# Fill missing customer IDs with a placeholder
df['customer_id'] = df['customer_id'].fillna('UNKNOWN')

# Verify no missing values remain
print("Missing values after cleaning:", df.isnull().sum())
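A placeholder fits here because we want to keep every transaction. Depending on your data, dropping incomplete rows or imputing values can be better choices. Here's a minimal sketch of both alternatives (the median strategy and the df_no_missing name are illustrative, not part of the main pipeline):

# Option 1: drop rows where a critical key is missing
df_no_missing = df.dropna(subset=['customer_id'])

# Option 2: impute a numeric column with its median
# (a common default; the right strategy depends on your data)
prices = pd.to_numeric(df['price'], errors='coerce')
prices = prices.fillna(prices.median())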
Step 3: Standardizing Date Formats
def standardize_date(date_str):
    # Try each known format in turn; return NaT if none match
    for fmt in ['%Y-%m-%d', '%m/%d/%Y']:
        try:
            return pd.to_datetime(date_str, format=fmt)
        except (ValueError, TypeError):
            continue
    return pd.NaT

df['purchase_date'] = df['purchase_date'].apply(standardize_date)
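If you're on pandas 2.0 or later, you can often skip the helper entirely: to_datetime accepts format='mixed' and infers the format element by element. A sketch of that alternative (use it instead of, not after, the apply above):

# pandas >= 2.0: infer the format per element; unparseable values become NaT
df['purchase_date'] = pd.to_datetime(df['purchase_date'], format='mixed', errors='coerce')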
Step 4: Cleaning Text Data
# Standardize product names: trim whitespace, apply title case
df['product'] = df['product'].str.strip().str.title()

# Verify unique products after cleaning
print("\nUnique products:", df['product'].unique())
Step 5: Converting Data Types
# Convert price to numeric, removing any currency symbols
# (regex=False so '$' is treated literally, not as a regex anchor)
df['price'] = pd.to_numeric(
    df['price'].astype(str).str.replace('$', '', regex=False),
    errors='coerce'
)

# Verify data types
print("\nUpdated data types:\n", df.dtypes)
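Real exports often sneak in thousands separators or other stray characters beyond '$'. If a single literal replace isn't enough, one more aggressive sketch is to strip everything that isn't a digit, decimal point, or minus sign, assuming prices never use ',' as a decimal separator:

# Strip everything except digits, decimal points, and minus signs
# before converting -- assumes ',' is never a decimal separator
df['price'] = pd.to_numeric(
    df['price'].astype(str).str.replace(r'[^0-9.\-]', '', regex=True),
    errors='coerce'
)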
Step 6: Handling Duplicates
# Identify duplicate transactions
key_cols = ['customer_id', 'purchase_date', 'product', 'price']
duplicates = df[df.duplicated(subset=key_cols, keep=False)]

# Remove duplicates, keeping the first occurrence
df = df.drop_duplicates(subset=key_cols)
Step 7: Handling Invalid Values
# Fix negative quantities
df['quantity'] = df['quantity'].abs()

# Create price boundaries using mean +/- 2 standard deviations
price_mean = df['price'].mean()
price_std = df['price'].std()
price_bounds = (price_mean - 2 * price_std, price_mean + 2 * price_std)

# Flag potential price outliers instead of dropping them
df['price_outlier'] = df['price'].apply(
    lambda x: 'Yes' if x < price_bounds[0] or x > price_bounds[1] else 'No'
)
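One caveat: the mean and standard deviation are themselves inflated by the very outliers you're hunting. A common, more robust alternative is the interquartile-range (IQR) rule; here's a minimal sketch (the price_outlier_iqr column name is my own):

# IQR-based outlier bounds: more robust to extreme values than mean +/- std
q1 = df['price'].quantile(0.25)
q3 = df['price'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

df['price_outlier_iqr'] = np.where(
    (df['price'] < lower) | (df['price'] > upper), 'Yes', 'No'
)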
Step 8: Creating a Cleaning Function
def clean_ecommerce_data(df):
    """Clean e-commerce data with standard preprocessing steps."""
    # Create a copy to avoid modifying the original data
    df_clean = df.copy()

    # Apply all our cleaning steps
    df_clean['customer_id'] = df_clean['customer_id'].fillna('UNKNOWN')
    df_clean['purchase_date'] = df_clean['purchase_date'].apply(standardize_date)
    df_clean['product'] = df_clean['product'].str.strip().str.title()
    df_clean['price'] = pd.to_numeric(
        df_clean['price'].astype(str).str.replace('$', '', regex=False),
        errors='coerce'
    )
    df_clean['quantity'] = df_clean['quantity'].abs()

    # Remove duplicates
    df_clean = df_clean.drop_duplicates()

    return df_clean

# Clean the entire dataset
clean_df = clean_ecommerce_data(df)
Common Pitfalls to Avoid
Not Making a Copy: Always create a copy of your original data before cleaning
Assuming Consistency: Check for multiple formats, especially in dates and categories
Blind Removal: Don't remove outliers without understanding why they exist
Missing Documentation: Document your cleaning steps for reproducibility (see the sketch after this list for one lightweight approach)
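For that last point, even a tiny audit trail helps. One lightweight approach (a sketch, not a framework; the log_step helper is my own) is to record the row count after every step so you can see exactly where data was dropped:

# Lightweight audit trail: record the row count after each cleaning step
cleaning_log = []

def log_step(df, step_name):
    """Append the step name and current row count to the cleaning log."""
    cleaning_log.append({'step': step_name, 'rows': len(df)})
    return df

df_logged = log_step(df.copy(), 'start')
df_logged = log_step(df_logged.drop_duplicates(), 'drop_duplicates')
print(pd.DataFrame(cleaning_log))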
Validation After Cleaning
def validate_clean_data(df):
    """Validate that the cleaned dataset meets our requirements."""
    checks = {
        'missing_values': df.isnull().sum().sum() == 0,
        'negative_quantities': (df['quantity'] >= 0).all(),
        'valid_dates': df['purchase_date'].notna().all(),
        'valid_prices': (df['price'] > 0).all(),
    }
    return pd.Series(checks)

# Run validation
validation_results = validate_clean_data(clean_df)
print("\nValidation Results:\n", validation_results)
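In a pipeline you usually want validation to fail loudly rather than just print. A small wrapper around the function above (the assert_clean name and error format are my own choices) could look like this:

def assert_clean(df):
    """Raise if any validation check fails, listing the failing checks."""
    results = validate_clean_data(df)
    failed = results[~results].index.tolist()
    if failed:
        raise ValueError(f"Data validation failed: {failed}")

assert_clean(clean_df)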
Next Steps
Once your data is clean, you can:
Create summary statistics (a quick example follows this list)
Build visualizations
Conduct analysis
Train machine learning models
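For example, a first pass at summary statistics on the cleaned data (column names match our example dataset) might be:

# Quick numeric summary of the cleaned dataset
print(clean_df.describe())

# Revenue per product (price * quantity, summed per product)
revenue = (clean_df['price'] * clean_df['quantity']).groupby(clean_df['product']).sum()
print(revenue)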
Resources for Learning More
Pandas documentation
Real Python's data cleaning tutorials
Stack Overflow's pandas tag
Remember: Good data cleaning is the foundation of all data analysis. Take the time to do it right.
Next week, we'll look at automated data cleaning pipelines using Python. Subscribe to get notified!