Most data scientists spend 60 to 80 percent of their time cleaning and preparing data, not training models. This is not inefficiency. Raw data collected from real systems is full of missing values, duplicates, inconsistent formats, schema drift, and distribution shifts. A model trained on dirty data produces confidently wrong predictions. Cleaning that data correctly requires understanding the domain, the collection mechanism, and what the corruption means.
Analysis Briefing
- Topic: Data preprocessing, feature engineering, and why clean data is rare
- Analyst: Mike D (@MrComputerScience)
- Context: A collaborative deep dive triggered by Claude Sonnet 4.6
- Source: Pithy Cyborg | AI News Made Simple
- Key Question: Why can’t you just feed raw data to a model and let it figure out the mess?
The Types of Data Corruption That Actually Appear in Production
Missing values are the most common. But missing values are not all equal. A missing age field in a customer record might mean the customer declined to answer (informative missingness). It might mean the data collection system had a bug during a specific date range (systematic missingness). It might mean a value below a detection threshold was not recorded (censored data). Each requires a different treatment. Filling all three with the column median produces a model that is wrong in three different ways.
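The three missingness types can be handled side by side. Below is a minimal sketch with hypothetical columns: `age` (informative missingness, so keep an indicator) and `sensor_ppm` (censored below an assumed detection limit, so impute at half the limit rather than the median).

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'age' has informative missingness,
# 'sensor_ppm' is censored below an assumed detection limit of 0.5.
df = pd.DataFrame({
    "age": [34, np.nan, 51, np.nan, 29],
    "sensor_ppm": [1.2, np.nan, 0.8, 2.4, np.nan],
})

# Informative missingness: preserve the fact of absence as its own feature,
# then impute so downstream models get a complete column.
df["age_missing"] = df["age"].isna().astype(int)
df["age"] = df["age"].fillna(df["age"].median())

# Censored data: the value existed but fell below the detection threshold,
# so a common convention is to impute at half the limit, not the median.
DETECTION_LIMIT = 0.5
df["sensor_ppm"] = df["sensor_ppm"].fillna(DETECTION_LIMIT / 2)
```

The indicator column is what lets the model learn that missingness itself carries signal; median-filling alone erases that.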
Label leakage is the most dangerous corruption because it is invisible in validation metrics and catastrophic in production. A model predicting loan default trained on features that were collected after the default decision was made will achieve near-perfect validation accuracy and fail completely in production because the future features are not available at prediction time. Detection requires understanding the temporal relationship between every feature and the target variable.
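A temporal audit can be partly mechanized if you record when each feature's value became known. This sketch uses hypothetical feature names and timestamps; any feature stamped after the prediction cutoff could not exist at inference time and is a leakage candidate.

```python
import pandas as pd

# Hypothetical audit table: timestamp at which each feature value
# became known, versus the moment the prediction must be made.
feature_known_at = {
    "credit_score_at_application": "2023-01-10",
    "days_past_due_at_chargeoff": "2023-06-02",  # recorded AFTER default
    "income_verified": "2023-01-12",
}
prediction_cutoff = pd.Timestamp("2023-01-15")

# Flag anything known only after the cutoff as a leakage candidate.
leaky = [
    name for name, ts in feature_known_at.items()
    if pd.Timestamp(ts) > prediction_cutoff
]
print(leaky)
```

This only catches leakage you have timestamps for; features with no recorded provenance still need manual review.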
Distribution shift occurs when the data collected during training does not match the data the model will see in production. A fraud detection model trained on 2022 transaction data may perform poorly on 2024 transactions because fraud patterns evolve. A churn model trained on pre-pandemic user behavior fails on post-pandemic cohorts. The model is not wrong on the data it was trained on. The world changed.
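One common way to quantify this kind of shift is the population stability index (PSI), a comparison of binned distributions between training and production samples. This is a sketch, not the article's method; the thresholds in the docstring are a widely used rule of thumb, not a guarantee.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a training sample and a production sample.
    Common rule of thumb (an assumption, not from this article):
    < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid log(0) on empty bins
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train = rng.normal(100, 15, 10_000)  # e.g. 2022 transaction amounts
prod = rng.normal(130, 25, 10_000)   # e.g. 2024: the world changed
print(population_stability_index(train, prod))
```

Run on a schedule against live data, a check like this turns "the world changed" from a post-mortem finding into an alert.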
The Preprocessing Steps That Take the Most Time
Deduplication is harder than it sounds. Exact duplicate rows are trivial. Fuzzy duplicates, where the same entity appears multiple times with slightly different representations (“Mike Smith” vs “Michael Smith” vs “M. Smith” at the same address), require record linkage techniques and domain judgment.
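A toy record-linkage pass illustrates the idea: block candidate pairs on a shared field (address), then score name similarity. The records, the 0.6 threshold, and the use of the standard library's `difflib` are illustrative assumptions; production systems use dedicated record-linkage tooling.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records: three variants of the same person at one address.
records = [
    {"name": "Mike Smith", "address": "12 Oak St"},
    {"name": "Michael Smith", "address": "12 Oak St"},
    {"name": "M. Smith", "address": "12 Oak St"},
    {"name": "Mia Smythe", "address": "98 Elm Ave"},
]

def name_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Block on address, then score names; pairs above the threshold are
# flagged for merging (or for human review in ambiguous cases).
likely_dupes = [
    (i, j)
    for (i, r1), (j, r2) in combinations(enumerate(records), 2)
    if r1["address"] == r2["address"]
    and name_similarity(r1["name"], r2["name"]) > 0.6
]
```

The blocking step is what keeps this tractable: without it, pairwise comparison is quadratic in the number of records.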
Schema reconciliation is the time sink in large organizations. Data collected by System A in Q1 2022 has different column names, units, and encoding conventions than the same data collected by System B in Q4 2023. Mapping them to a common schema requires either documentation (rarely complete) or reverse engineering from the data itself.
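In code, the reverse-engineered mapping usually ends up as explicit, versioned rename tables plus unit conversions per source system. Everything below (system names, column names, the cents-to-dollars convention) is a hypothetical sketch of that pattern.

```python
import pandas as pd

# Hypothetical mappings from two source systems to one canonical schema.
SYSTEM_A_MAP = {"cust_id": "customer_id", "amt_usd": "amount_usd"}
SYSTEM_B_MAP = {"CustomerID": "customer_id", "amount_cents": "amount_usd"}

def to_canonical(df: pd.DataFrame, source: str) -> pd.DataFrame:
    if source == "A":
        return df.rename(columns=SYSTEM_A_MAP)
    df = df.rename(columns=SYSTEM_B_MAP)
    df["amount_usd"] = df["amount_usd"] / 100  # System B stores cents
    return df

a = pd.DataFrame({"cust_id": [1], "amt_usd": [19.99]})
b = pd.DataFrame({"CustomerID": [2], "amount_cents": [2599]})
unified = pd.concat(
    [to_canonical(a, "A"), to_canonical(b, "B")], ignore_index=True
)
```

Keeping the maps as data rather than scattered renames is what makes the reconciliation auditable when System C arrives.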
```python
import pandas as pd
import numpy as np

def preprocess_customer_data(df: pd.DataFrame) -> pd.DataFrame:
    # Remove exact duplicates
    df = df.drop_duplicates(subset=['customer_id', 'timestamp'])
    # Handle missing age with separate indicator column
    df['age_missing'] = df['age'].isna().astype(int)
    df['age'] = df['age'].fillna(df['age'].median())
    # Clip outliers at 99th percentile, not mean impute
    p99 = df['transaction_amount'].quantile(0.99)
    df['transaction_amount'] = df['transaction_amount'].clip(upper=p99)
    # Parse dates consistently regardless of source format
    df['created_at'] = pd.to_datetime(df['created_at'], utc=True)
    # Drop features with > 40% missing (document why)
    missing_rate = df.isnull().mean()
    high_missing = missing_rate[missing_rate > 0.4].index
    df = df.drop(columns=high_missing)
    return df
```
Feature engineering is where domain knowledge converts raw data into signals. A raw timestamp becomes day-of-week, hour-of-day, days-since-last-event, and is-weekend. A transaction amount becomes amount-relative-to-customer-average and amount-relative-to-merchant-average. None of this is obvious from the data alone. It requires understanding what patterns the model needs to detect.
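The timestamp expansion described above can be sketched with pandas' `.dt` accessors; the column names and example dates are illustrative.

```python
import pandas as pd

# Hypothetical event log: one raw timestamp becomes several signals.
df = pd.DataFrame({"event_time": pd.to_datetime([
    "2024-03-01 09:15", "2024-03-02 22:40", "2024-03-09 14:05",
])})
df["day_of_week"] = df["event_time"].dt.dayofweek  # Monday = 0
df["hour_of_day"] = df["event_time"].dt.hour
df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)
# Rows must be sorted by time for this diff to mean "since last event".
df["days_since_last_event"] = df["event_time"].diff().dt.days
```

The relative features (amount versus customer average, amount versus merchant average) follow the same pattern with `groupby(...).transform("mean")`.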
Why You Cannot Skip Preprocessing and Let the Model Handle It
Deep learning models are frequently cited as capable of learning from raw data without manual feature engineering. This is true for domains with abundant data and well-structured inputs (images, text). It is not true for tabular data in business domains.
Gradient boosted trees (XGBoost, LightGBM) dominate tabular data competitions and production deployments not because neural networks cannot learn from tables but because tabular data in business applications has small-to-medium sample sizes, high feature-to-sample ratios, mixed data types, and domain-specific missingness patterns that require human intervention to handle correctly.
A model that receives label-leaking features will learn to use them and achieve perfect accuracy until production reveals the problem. No architecture prevents this. The model does what the data tells it to do.
What This Means For You
- Create a missingness indicator column for every imputed feature, because the fact that a value is missing is often a stronger predictor than any imputed value, and dropping that information throws away a real signal.
- Audit every feature for temporal leakage before training by asking explicitly whether its value could have been known at prediction time, because label leakage produces the most dangerous failure mode in supervised learning (great metrics, terrible production performance).
- Version your preprocessing code as carefully as your model code, because a preprocessing change that was not applied consistently between training and serving is a data distribution shift that will degrade model performance in production without any obvious error.
- Profile your dataset before building any preprocessing pipeline with df.describe(), df.isnull().sum(), and histograms, because the specific corruption in your dataset is always different from what you expect, and preprocessing built without this inspection addresses the wrong problems.
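That profiling pass is only a few lines. A minimal sketch, with a toy dataset standing in for yours:

```python
import numpy as np
import pandas as pd

# Toy data: two missing ages and one suspicious transaction amount.
df = pd.DataFrame({
    "age": [25, np.nan, 41, 33, np.nan],
    "amount": [12.5, 9.9, 10_000.0, 15.2, 11.1],
})

print(df.describe())        # ranges and means; scan min/max for nonsense
missing_counts = df.isnull().sum()
print(missing_counts)       # per-column missing counts
# Histograms (df.hist() or df["amount"].plot.hist()) reveal skew that
# summary statistics hide, like the outlier dominating 'amount' here.
```

Ten minutes of this inspection routinely changes which preprocessing steps you build at all.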
Enjoyed this deep dive? Join my inner circle:
- Pithy Cyborg | AI News Made Simple → AI news made simple without hype.
