Data engineering builds and maintains the pipelines that move and store data. Data science uses that data to produce insights and models. Machine learning engineering takes those models and puts them into production systems that serve predictions at scale. The three roles depend on each other and break down when any one is missing.
Analysis Briefing
- Topic: Data science vs machine learning vs data engineering roles and collaboration
- Analyst: Mike D (@MrComputerScience)
- Context: Sparked by a question from Claude Sonnet 4.6
- Source: Pithy Cyborg | AI News Made Simple
- Key Question: Why do companies hire all three separately when their work seems so similar on the surface?
What Data Engineers Actually Build
Data engineers are infrastructure engineers whose domain is data pipelines. They design and maintain the systems that ingest data from operational databases, event streams, and third-party APIs, transform it into usable formats, and load it into analytical storage (data warehouses, data lakes).
The core stack is ETL (Extract, Transform, Load) tooling: Apache Spark or dbt for transformation, Airflow or Prefect for orchestration, Snowflake, BigQuery, or Redshift as analytical storage, and Kafka or Kinesis for real-time ingestion. A data engineer writes Spark jobs that process terabytes of clickstream data daily, builds dbt models that clean and join tables in the warehouse, and maintains the Airflow DAGs that schedule everything.
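The extract-transform-load flow those tools implement can be sketched in plain Python. This is a minimal illustration of the pattern, not any particular tool's API; the table name `clean_events` and the row fields are hypothetical.

```python
# Minimal ETL sketch. A real pipeline would read from a database or event
# stream and write to a warehouse; here a list stands in for the source and
# a dict stands in for the warehouse.

def extract(raw_rows):
    """Extract: pull raw event rows from the source."""
    return list(raw_rows)

def transform(rows):
    """Transform: drop malformed rows and normalize fields."""
    cleaned = []
    for row in rows:
        if row.get("user_id") is None:
            continue  # drop rows missing a required key
        cleaned.append({"user_id": row["user_id"],
                        "event": row.get("event", "unknown").lower()})
    return cleaned

def load(rows, warehouse):
    """Load: append cleaned rows to the analytical store."""
    warehouse.setdefault("clean_events", []).extend(rows)
    return len(rows)

warehouse = {}
raw = [{"user_id": 1, "event": "CLICK"}, {"user_id": None, "event": "view"}]
loaded = load(transform(extract(raw)), warehouse)
print(loaded)  # prints 1: the row without a user_id was dropped
```

An orchestrator like Airflow or Prefect schedules and retries exactly these stages; dbt replaces the hand-written `transform` with SQL models in the warehouse.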
Their output is reliable, queryable data. A data scientist who sits down to build a model and finds that the feature tables have null values, duplicates, schema drift, and 48-hour data delays is working without a functioning data engineering team. Everything downstream depends on the foundation.
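The failure modes above (null values, duplicates, stale data) are exactly what a data engineering team automates checks for. A toy version of such a check, assuming an illustrative row schema with a `loaded_at` timestamp:

```python
from datetime import datetime, timedelta, timezone

def quality_report(rows, required_keys, max_age_hours=48):
    """Count nulls in required keys, duplicate rows, and rows older than
    max_age_hours. A sketch of the idea, not a production framework."""
    nulls = sum(1 for r in rows for k in required_keys if r.get(k) is None)
    seen, dupes = set(), 0
    for r in rows:
        key = tuple(r.get(k) for k in required_keys)
        dupes += key in seen
        seen.add(key)
    cutoff = datetime.now(timezone.utc) - timedelta(hours=max_age_hours)
    stale = sum(1 for r in rows if r["loaded_at"] < cutoff)
    return {"null_values": nulls, "duplicates": dupes, "stale_rows": stale}
```

In practice a pipeline runs checks like this after every load and blocks downstream consumers, or pages someone, when a threshold is breached.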
What Data Scientists Actually Do That ML Engineers Don’t
Data scientists are applied statisticians and domain problem solvers. They explore datasets, formulate hypotheses, select features, train and evaluate models, and communicate findings to decision makers. Their primary output is insight: a recommendation, a model prototype, an analysis that answers a business question.
The gap between data science and machine learning engineering is the gap between a working Jupyter notebook and a production API serving 10,000 predictions per second. A data scientist trains an XGBoost model that achieves a 0.91 AUC on the validation set. That is real, valuable work. But the model lives in a notebook on their laptop. It cannot serve production traffic, does not monitor for data drift, has no fallback when it fails, and will silently degrade as the data distribution shifts over months.
```python
# What a data scientist delivers
import xgboost as xgb
from sklearn.metrics import roc_auc_score

# X_train, y_train, X_valid, y_valid come from an earlier train/validation split
model = xgb.XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.05)
model.fit(X_train, y_train)
auc = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])
print(f"Validation AUC: {auc:.4f}")
model.save_model("model.json")  # saves model to disk -- work complete
```
What ML Engineers Add Between the Notebook and Production
ML engineers productionize what data scientists prototype. They wrap models in serving infrastructure (FastAPI, TorchServe, BentoML), build feature stores that serve real-time features at inference time, implement model registries that track versions and enable rollback, and set up monitoring for prediction drift, feature drift, and latency regressions.
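Of the responsibilities listed, the model registry is the easiest to show in miniature. The sketch below is a toy in plain Python, assuming nothing beyond the article; real registries (MLflow and similar) add persistent storage, metadata, and access control, but the version-and-rollback contract is the same.

```python
class ModelRegistry:
    """Toy model registry: tracks versions and supports rollback."""

    def __init__(self):
        self._versions = []  # models in registration order
        self._live = None    # index of the version currently serving

    def register(self, model):
        """Store a new model; returns its 1-based version number."""
        self._versions.append(model)
        return len(self._versions)

    def promote(self, version):
        """Make the given version the one that serves traffic."""
        self._live = version - 1

    def rollback(self):
        """Fall back to the previous version after a bad deploy."""
        if self._live is not None and self._live > 0:
            self._live -= 1

    def live_model(self):
        return self._versions[self._live] if self._live is not None else None
```

A serving layer (FastAPI, TorchServe, BentoML) would call `live_model()` on each request, so a rollback changes what production serves without redeploying code.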
The practical gap is significant. A model served behind a feature store at 50 ms p99 latency, with automatic A/B testing and a monitoring dashboard that pages on-call when AUC drops below threshold, is a completely different engineering artifact than a pickle file on a shared drive.
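The monitoring half of that gap can be sketched simply. The drift check below flags a feature whose live mean has shifted far from the training baseline; production stacks use proper statistical tests (PSI, Kolmogorov-Smirnov), so treat this as the shape of the idea, with the 0.85 AUC floor as a made-up threshold.

```python
from statistics import mean, stdev

def feature_drift(baseline, live, z_threshold=3.0):
    """Crude drift check: has the live window's mean moved more than
    z_threshold standard errors from the training baseline?"""
    mu, sigma = mean(baseline), stdev(baseline)
    standard_error = sigma / (len(live) ** 0.5)
    z = abs(mean(live) - mu) / standard_error
    return z > z_threshold

def should_page(current_auc, auc_threshold=0.85):
    """Page on-call when model quality drops below the agreed floor."""
    return current_auc < auc_threshold
```

An ML engineer wires checks like these into the serving path so degradation is caught in hours, not discovered months later in a quarterly business review.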
In small companies, one person often covers two or three of these roles. A “data scientist” at a 30-person startup is probably doing data engineering on Mondays, feature engineering on Tuesdays, model training on Wednesdays, and deploying their own Flask app on Thursdays. The role specialization happens as data volume, team size, and system complexity grow. At companies like Uber, Netflix, and Airbnb, the three roles are entirely separate career tracks with distinct toolchains and hiring bars.
What This Means For You
- Hire data engineers before data scientists if you are building a data capability from scratch, because data scientists without reliable data infrastructure spend 80% of their time doing data cleaning instead of modeling, which is a waste of a senior hire.
- Distinguish between a model prototype and a production model in your project planning, because the work to go from a working notebook to a production-grade serving system typically takes 3 to 10 times longer than the modeling work itself.
- Build a feature store early if you have more than 2 models in production, because duplicated feature engineering logic across multiple models produces inconsistent predictions when implementations diverge and debugging that inconsistency is extremely expensive.
- Define model success metrics at the business level before the data scientist starts training, because optimizing for validation AUC without connecting it to a downstream business metric (conversion rate, fraud rate, churn) produces technically correct models that do not move the needle.
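The feature-store advice above comes down to one rule: define each feature transformation once and reuse it for both training and serving, so the two code paths cannot diverge. A minimal sketch, with hypothetical feature names:

```python
from datetime import date

# Each feature is defined exactly once. Training pipelines and the serving
# API both call compute_features, so the logic cannot drift apart.
FEATURES = {
    "days_since_signup": lambda user: (user["now"] - user["signup"]).days,
    "is_power_user": lambda user: user["sessions_30d"] >= 20,
}

def compute_features(user, names=None):
    """Single code path for both offline training and online inference."""
    return {name: FEATURES[name](user) for name in (names or FEATURES)}

user = {"now": date(2024, 1, 31), "signup": date(2024, 1, 1), "sessions_30d": 25}
print(compute_features(user))  # {'days_since_signup': 30, 'is_power_user': True}
```

Real feature stores (Feast, Tecton, and the in-house systems at large companies) add low-latency online storage and point-in-time-correct backfills, but this shared-definition idea is the core of why they prevent the inconsistent-prediction problem.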
Enjoyed this deep dive? Join my inner circle:
- Pithy Cyborg | AI News Made Simple → AI news made simple without hype.
