
Data science interviews have the widest variance of any technical interview. A DS role at a consumer tech company needs strong A/B testing and product analytics depth. A DS role at an ML-first startup needs research intuition and system design. A DS role at a bank needs statistics and regulatory awareness. This guide covers all five question categories, calibrated to the current (2026) bar.
How much each category matters depends on the role:
| Role type | Stats | SQL | ML | Case Study | Behavioral |
|---|---|---|---|---|---|
| Analytics DS (product) | Heavy | Heavy | Light | Medium | Heavy |
| Applied ML / MLE | Medium | Medium | Heavy | Heavy | Medium |
| Research scientist | Heavy | Light | Heavy | Heavy | Light |
| DS at bank / consulting | Heavy | Medium | Medium | Heavy | Heavy |
Statistics is the category most self-taught data scientists underinvest in. Companies don't test it with trick questions; they test it with scenarios where the wrong statistical intuition leads to a wrong business decision.
1. What's the difference between Type I and Type II errors? When would you prefer one over the other?
Type I (false positive): you reject the null when it's true — you think an effect exists, but it doesn't. Type II (false negative): you fail to reject the null when it's false — a real effect exists, but you miss it.
Preference: in medical screening for a serious disease, prefer low Type II error (catch all cases, even at cost of false positives). In a spam filter, prefer low Type I error (don't incorrectly flag legitimate mail as spam).
2. Explain p-value in plain language.
The p-value is the probability of seeing a result at least as extreme as yours assuming the null hypothesis is true. It's not the probability that the null is true, and it's not the probability that your result happened by chance. The most common misconception: "p < 0.05 means there's a 95% chance the effect is real." Wrong. It means: if the null were true, we'd see results like this less than 5% of the time.
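A quick way to make this concrete in an interview is a null simulation. A minimal sketch (all numbers illustrative): when there is truly no effect, roughly 5% of experiments still come back with p < 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments = 5_000
false_positives = 0
for _ in range(n_experiments):
    # Both groups come from the same distribution: the null hypothesis is true.
    control = rng.normal(loc=0.0, scale=1.0, size=200)
    treatment = rng.normal(loc=0.0, scale=1.0, size=200)
    _, p = stats.ttest_ind(control, treatment)
    false_positives += p < 0.05

# About 5% of null experiments are "significant" at alpha = 0.05 by chance alone.
print(f"Fraction of null experiments with p < 0.05: {false_positives / n_experiments:.3f}")
```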
3. What's the difference between Bayesian and frequentist statistics?
Frequentists treat probability as the long-run frequency of events. Parameters are fixed (unknown) and data is random. Bayesians treat probability as a degree of belief. Parameters have distributions. Bayesian analysis lets you incorporate prior knowledge (useful when data is limited) and produce probability statements about parameters directly.
4. What is statistical power? How do you increase it?
Power = 1 − P(Type II error) = probability of correctly detecting an effect that exists. To increase power: increase sample size (most common), target a larger minimum detectable effect (large effects are easier to detect than small ones), raise the significance level α (at the cost of more false positives), or reduce noise (covariate adjustment, better measurement).
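In practice you rarely compute power by hand; a library solver does it. A hedged sketch using statsmodels (effect size, alpha, and power are illustrative choices):

```python
from statsmodels.stats.power import TTestIndPower

# Solve for the per-group sample size of a two-sample t-test.
analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.2,          # Cohen's d: a small standardized effect
    alpha=0.05,               # significance level
    power=0.8,                # target power
    alternative="two-sided",
)
print(f"Required sample size per group: {n_per_group:.0f}")  # roughly 390-400
```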
5. What's the central limit theorem and why does it matter for A/B testing?
With sufficiently large samples, the distribution of sample means approaches normal regardless of the underlying distribution. This is why A/B tests work even when the metric (e.g., revenue per user) is skewed: the test statistic is approximately normally distributed, and standard z-tests and t-tests are valid.
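A simulation makes the point quickly. A minimal sketch with assumed numbers: revenue per user is modeled as a heavily skewed exponential, yet the means of repeated samples cluster in a tight, roughly normal bell.

```python
import numpy as np

rng = np.random.default_rng(0)
# Skewed "population" of per-user revenue (illustrative scale).
revenue = rng.exponential(scale=20.0, size=1_000_000)

# Distribution of the sample mean across many repeated samples of size 500.
sample_means = np.array([rng.choice(revenue, size=500).mean() for _ in range(5_000)])
print(f"Population mean ~ {revenue.mean():.1f}; "
      f"sample means: mean={sample_means.mean():.2f}, std={sample_means.std():.2f}")
```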
6. You run an A/B test. The result is p = 0.04. Should you ship?
Weak answer: "Yes, p < 0.05."
Strong answer:
"I'd check several things before shipping:
- Was the sample size determined in advance? Or did I peek and stop when I hit p = 0.04? (Peeking inflates false positives dramatically.)
- Is the effect size practically meaningful — not just statistically significant?
- Are the guardrail metrics (load time, crash rate, support tickets) neutral or positive?
- Is the effect uniform across segments, or driven by one subgroup? (Simpson's Paradox risk.)
- How long did the test run? Did it clear novelty effects?
If all of these check out: yes, ship."
7. What is Simpson's Paradox? Give a real-world example.
A trend appears in several subgroups but reverses when they're combined. Classic example: UC Berkeley admissions data appeared to show bias against women overall, but when broken down by department, most departments admitted women at higher rates. The aggregate result was driven by women applying to more competitive departments. In product analytics: a change can appear to help overall while hurting every user segment, if the mix of segments shifts between the groups or periods being compared.
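A toy pandas example (hypothetical counts, modeled on the Berkeley case) reproduces the reversal:

```python
import pandas as pd

df = pd.DataFrame({
    "dept":     ["A", "A", "B", "B"],
    "group":    ["men", "women", "men", "women"],
    "applied":  [800, 100, 200, 900],
    "admitted": [560,  80,  40, 270],
})

# Within each department, women are admitted at the higher rate (0.80 vs 0.70, 0.30 vs 0.20).
print(df.assign(rate=df.admitted / df.applied)[["dept", "group", "rate"]])

# Aggregated across departments, the comparison reverses (men 0.60, women ~0.35).
overall = df.groupby("group")[["applied", "admitted"]].sum()
print(overall.admitted / overall.applied)
```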
8. What's multicollinearity and why does it matter in regression?
Two or more predictor variables are highly correlated. This doesn't affect predictions much, but it inflates standard errors of individual coefficients, making them unreliable. You can't trust the coefficient interpretation ("a unit increase in X leads to Y change in Z") when multicollinearity is high.
9. Explain bias-variance tradeoff.
Bias: error from wrong assumptions (underfitting, too simple a model). Variance: error from sensitivity to training-data noise (overfitting, too complex a model). Increasing model complexity typically decreases bias but increases variance; the goal is to find the sweet spot. The standard tools for managing the tradeoff in 2026: regularization (L1/L2), dropout, early stopping, cross-validation, and ensemble methods.
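A minimal sketch on synthetic data: as polynomial degree grows, cross-validated error falls and then rises again, which is the tradeoff in one small experiment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)  # noisy nonlinear signal

for degree in (1, 3, 9, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    cv_mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    # Low degree underfits (high bias); very high degree overfits (high variance).
    print(f"degree={degree:2d}  CV MSE={cv_mse:.3f}")
```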
10. What's bootstrapping? When is it useful?
Resampling the observed dataset with replacement to estimate the sampling distribution of a statistic. Useful when you can't assume a distribution for your estimator, when sample size is small, or when the statistic is something other than the mean (median, percentile, correlation). More computationally expensive than parametric methods.
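A minimal numpy sketch (data is simulated): a 95% bootstrap confidence interval for a median, where no tidy closed-form standard error exists.

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.lognormal(mean=3.0, sigma=1.0, size=300)  # skewed sample, e.g. session durations

# Resample with replacement and recompute the statistic many times.
boot_medians = np.array([
    np.median(rng.choice(data, size=data.size, replace=True))
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_medians, [2.5, 97.5])
print(f"Median ~ {np.median(data):.1f}, 95% bootstrap CI ~ ({lo:.1f}, {hi:.1f})")
```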
11. How do you handle class imbalance?
Several strategies: resampling (oversample minority, undersample majority), synthetic sampling (SMOTE), adjust class weights in the loss function, use precision/recall/F1 instead of accuracy as the evaluation metric, use probability calibration. The right choice depends on the imbalance ratio and whether the cost of false positives and false negatives is symmetric.
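One of the cheaper levers is class weighting plus a sensible metric. A hedged scikit-learn sketch on a synthetic 95/5 split:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem: ~95% negatives, ~5% positives.
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" upweights the minority class in the loss.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)

# Judge on per-class precision/recall/F1, not overall accuracy.
print(classification_report(y_test, clf.predict(X_test)))
```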
12. What's the difference between correlation and causation? How do you establish causation?
Correlation: two variables move together. Causation: one variable causes changes in the other. Correlation can come from a common cause (confounder) or reverse causation. To establish causation: randomized controlled experiments (gold standard), or causal inference methods on observational data (instrumental variables, difference-in-differences, regression discontinuity, propensity score matching).
13. When would you use a t-test vs. z-test?
t-test: when population standard deviation is unknown and/or sample size is small (n < 30). z-test: when population standard deviation is known or sample size is large. In practice, the t-test is almost always the right choice — it converges to the z-test for large samples anyway.
14. What's regularization? When does it help?
Adding a penalty term to the loss function to shrink model coefficients. L1 (Lasso): drives some coefficients to exactly zero (built-in feature selection). L2 (Ridge): shrinks all coefficients toward zero. Elastic Net: a combination of both. Helps when you have many features relative to observations, or when features are correlated. Hurts when every feature is genuinely informative and the true coefficients are large, because the penalty biases them toward zero.
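A quick illustration (synthetic data, illustrative alpha): L1 zeroes out uninformative coefficients, L2 only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, only 5 of which actually drive the target.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=1)

lasso = Lasso(alpha=5.0).fit(X, y)
ridge = Ridge(alpha=5.0).fit(X, y)

print("L1 coefficients exactly zero:", int(np.sum(lasso.coef_ == 0)))  # typically most of the 15 noise features
print("L2 coefficients exactly zero:", int(np.sum(ridge.coef_ == 0)))  # usually none
```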
15. What's cross-validation and when should you use k-fold vs. LOOCV?
k-fold: split data into k folds, train on k-1, test on the remaining fold, rotate. LOOCV: leave one sample out at a time — expensive but low bias. Use k-fold (k=5 or 10) by default. Use LOOCV only for small datasets where you can't afford to sacrifice any data. In time-series, use time-series split (no data leakage from future to past).
Almost every DS role requires SQL fluency. The question format has evolved from "write a basic SELECT" to "debug this production query that returns wrong numbers."
1. Find the second highest salary in an employees table.
SELECT MAX(salary) AS second_highest
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
Or with window functions (preferred at most companies):
SELECT salary
FROM (
    SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
    FROM employees
) ranked
WHERE rnk = 2
LIMIT 1;
2. Calculate 7-day rolling average of daily active users.
SELECT
    date,
    AVG(dau) OVER (
        ORDER BY date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_7d_avg
FROM daily_active_users
ORDER BY date;
3. Write a query to detect users who churned (active in month N-1 but not in month N).
WITH month_n1 AS (
    SELECT DISTINCT user_id FROM events
    WHERE DATE_TRUNC('month', event_date) = DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '1 month'
),
month_n AS (
    SELECT DISTINCT user_id FROM events
    WHERE DATE_TRUNC('month', event_date) = DATE_TRUNC('month', CURRENT_DATE)
)
SELECT m1.user_id
FROM month_n1 m1
LEFT JOIN month_n m ON m1.user_id = m.user_id
WHERE m.user_id IS NULL;
4. Given a sessions table (session_id, user_id, start_time, end_time), find users with overlapping sessions.
SELECT DISTINCT a.user_id
FROM sessions a
JOIN sessions b
    ON a.user_id = b.user_id
    AND a.session_id <> b.session_id
    AND a.start_time < b.end_time
    AND a.end_time > b.start_time;
5. What's the difference between ROW_NUMBER(), RANK(), and DENSE_RANK()?
All assign a number to each row in a window partition.
- ROW_NUMBER(): unique sequential integers; ties get different numbers.
- RANK(): tied rows get the same number, and the next rank skips (1, 1, 3).
- DENSE_RANK(): tied rows get the same number, and the next rank doesn't skip (1, 1, 2).

Use DENSE_RANK when you want "top N values" without gaps. Use ROW_NUMBER when you need strictly unique row identifiers.
6. How do you calculate retention cohorts in SQL?
This is a common take-home problem. The key insight: you need the cohort date (first event date) and a self-join or window function to check if the user returned in subsequent periods.
WITH cohorts AS (
    SELECT user_id, MIN(DATE_TRUNC('month', event_date)) AS cohort_month
    FROM events
    GROUP BY user_id
),
user_activity AS (
    SELECT DISTINCT user_id, DATE_TRUNC('month', event_date) AS active_month
    FROM events
)
SELECT
    c.cohort_month,
    DATEDIFF('month', c.cohort_month, ua.active_month) AS months_since_cohort,
    COUNT(DISTINCT c.user_id) AS retained_users
FROM cohorts c
JOIN user_activity ua ON c.user_id = ua.user_id
GROUP BY 1, 2
ORDER BY 1, 2;
1. Walk through how you'd approach a new ML problem from scratch.
- Define the problem: what are we predicting? What's the loss function?
- Understand the data: distribution, missingness, leakage risks.
- Establish a baseline: random, majority class, simple heuristic.
- Feature engineering and selection.
- Model selection and training (start simple — logistic regression before XGBoost).
- Evaluation: appropriate metrics for the problem (not just accuracy).
- Error analysis: where does the model fail, and is that a data problem or model problem?
- Deployment and monitoring.
2. How would you evaluate a classifier beyond accuracy?
Precision: of positives predicted, how many are correct? Recall: of actual positives, how many did we catch? F1: harmonic mean of precision and recall (useful when classes are imbalanced). AUC-ROC: separability between classes regardless of threshold. PR-AUC: better than ROC when positive class is rare. Log loss: penalizes confident wrong predictions. Choose based on the business cost of false positives vs false negatives.
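A hedged scikit-learn sketch computing these metrics on an illustrative imbalanced dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, average_precision_score, log_loss)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]       # scores for threshold-free metrics
pred = (proba >= 0.5).astype(int)           # hard labels for threshold-based metrics

print("precision:", precision_score(y_te, pred))
print("recall:   ", recall_score(y_te, pred))
print("F1:       ", f1_score(y_te, pred))
print("ROC AUC:  ", roc_auc_score(y_te, proba))
print("PR AUC:   ", average_precision_score(y_te, proba))
print("log loss: ", log_loss(y_te, proba))
```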
3. What's gradient boosting and how does it differ from random forests?
Both are ensemble methods using decision trees. Random forests: parallel ensemble of trees trained on bootstrap samples; averaging reduces variance. Gradient boosting: sequential ensemble where each tree corrects the errors of the previous; reduces bias. XGBoost / LightGBM / CatBoost are implementations. Gradient boosting typically achieves better accuracy but is slower to train and more sensitive to hyperparameters.
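A minimal side-by-side sketch using scikit-learn's built-in implementations (synthetic data, near-default settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

for model in (RandomForestClassifier(n_estimators=200, random_state=0),   # parallel, bagged trees
              HistGradientBoostingClassifier(random_state=0)):            # sequential, boosted trees
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{type(model).__name__}: {score:.3f}")
```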
4. How do you handle missing data?
First, understand why data is missing (MCAR, MAR, MNAR). Then:
- Imputation: mean/median (numeric), mode (categorical), model-based, or multiple imputation.
- Indicator variable: add a binary flag for "was this field missing."
- Deletion: only if missingness is random and low rate.
- Model that handles it natively: XGBoost handles NaN; tree-based models generally tolerate missingness better than linear models.
Never compute imputation statistics on the full dataset before the train/test split: fit the imputer on training data only, then apply those same statistics to the test data (see the sketch below).
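A minimal scikit-learn sketch (simulated data) combining median imputation, a missingness indicator, and a pipeline so the imputer is fit on training data only:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # label depends on the first two features
X[rng.random(X.shape) < 0.1] = np.nan            # knock out ~10% of values at random

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pipe = make_pipeline(
    SimpleImputer(strategy="median", add_indicator=True),  # medians come from training data only
    LogisticRegression(max_iter=1000),
)
pipe.fit(X_tr, y_tr)
print("test accuracy:", pipe.score(X_te, y_te))
```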
5. What is data leakage and how do you detect it?
Data leakage: information from the future (or from the target) contaminating the training data, making the model look better than it is. Signs: unrealistically high CV scores, model performance drops sharply in production. Common sources: imputing using whole-dataset statistics before splitting, using features computed after the event you're predicting, using proxies for the target variable. Detection: temporal validation (train on past, test on future), check feature importances for suspiciously predictive variables.
6. What's the difference between L1 and L2 regularization in practice?
L1 (Lasso): sum of absolute values of coefficients. Promotes sparsity — some coefficients go exactly to zero (feature selection built in). L2 (Ridge): sum of squared coefficients. Shrinks all coefficients but rarely zeroes them out. Use L1 when you suspect only a few features matter. Use L2 when you think all features contribute but are noisy.
7. How would you approach a recommendation system for a new product (cold start problem)?
Cold start has two sub-problems: new user (no interaction history) and new item (no ratings/clicks).
New user: use available context (demographics, device, referral source), content-based filtering from their stated preferences, or popular items as fallback.
New item: use content-based features (metadata, embeddings), introduce it via exploration to a fraction of users, hybrid approach.
Long-term: collaborative filtering kicks in once you have interaction data. Matrix factorization, two-tower neural networks, or transformers for session-based recommendations.
8. Walk through how A/B testing works at scale and common failure modes.
Design: define primary metric, guardrails, minimum detectable effect, and sample size before running. Run until reaching predetermined sample size (no peeking). Analyze: check treatment/control balance (SRM — sample ratio mismatch), then primary metric and guardrails.
Common failure modes: peeking and stopping early (inflates α), network effects (treatment contaminating control in social apps), novelty effects (short-term engagement boost that fades), SRM (assignment bug that creates biased groups), interaction effects with other running experiments.
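The SRM check in particular is a one-liner worth knowing. A hedged sketch with illustrative counts: compare observed assignment counts against the intended 50/50 split with a chi-square test.

```python
from scipy.stats import chisquare

observed = [50_421, 49_112]               # users actually assigned to control / treatment
expected = [sum(observed) / 2] * 2        # intended 50/50 split

stat, p = chisquare(observed, f_exp=expected)
# A very small p-value (well below 0.001 here) points to an assignment bug, not chance.
print(f"SRM check: chi2={stat:.1f}, p={p:.2e}")
```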
9. How do you evaluate and monitor an ML model in production?
Metrics to monitor: (1) data drift — input feature distributions shifting; (2) model performance — label drift, accuracy decay; (3) operational metrics — latency, error rates, resource usage. Tools: Evidently, Arize, Fiddler, or custom monitoring dashboards. Retraining strategy: scheduled retraining (time-based) vs triggered retraining (when performance drops below threshold). Shadow mode: run new model alongside old model without serving predictions to compare.
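A minimal drift-check sketch (simulated feature values): compare a feature's production distribution against a training-time reference with a two-sample Kolmogorov-Smirnov test.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
reference = rng.normal(loc=0.0, scale=1.0, size=10_000)   # feature distribution at training time
current = rng.normal(loc=0.3, scale=1.0, size=10_000)     # production distribution, slightly shifted

stat, p = ks_2samp(reference, current)
print(f"KS statistic={stat:.3f}, p={p:.3g}")  # a tiny p-value flags a shift worth investigating
```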
10. What's the difference between online and batch inference?
Batch: predictions generated periodically for a fixed set of inputs, stored, and served from storage. Low latency for serving, but predictions can be stale. Use for: content recommendations pre-computed nightly, credit scores computed weekly.
Online (real-time): model called at inference time, fresh prediction every request. Higher serving latency, more infrastructure complexity. Use for: fraud detection, search ranking, real-time pricing.
Hybrid: batch pre-compute features and candidates, then real-time reranking. Common for recommendation systems.
11. Explain transformer architecture and why it replaced RNNs for most NLP tasks.
Transformers use self-attention mechanisms that compute relationships between all tokens simultaneously, not sequentially. This enables: (1) parallelization during training (vs RNN's sequential dependency); (2) better long-range dependencies (no vanishing gradient over many steps); (3) scalability (more data + more compute → consistently better). Downside: quadratic memory in attention (O(n²) for sequence length n), which is why sparse/linear attention variants exist.
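A minimal numpy sketch of scaled dot-product self-attention, the operation behind that O(n²) cost. Shapes and values are illustrative; real transformers add learned projections, multiple heads, and masking.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, d = 5, 8                                   # 5 tokens, model dimension 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

scores = Q @ K.T / np.sqrt(d)                 # every token scores every other token: O(n^2) pairs
weights = softmax(scores, axis=-1)            # attention weights per token
output = weights @ V                          # weighted mix of value vectors
print(output.shape)                           # (5, 8): one contextualized vector per token
```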
12. What unique challenges do LLM-powered products introduce for data scientists?
This is the 2026 question. Key points:
- Evaluation is hard: no ground truth for open-ended generation. Need human evals, LLM-as-judge, task-specific automated metrics.
- Non-determinism: same input → different output. Statistical testing requires many samples.
- Latency/cost tradeoff: bigger models are better but slower and more expensive.
- Prompt sensitivity: small prompt changes → large output changes. Need prompt versioning.
- Hallucination: factual claims may be wrong. Need retrieval-augmented generation (RAG) + citation verification.
- Data contamination: training data may include your test set.
These are increasingly common at mid-to-senior levels. Expect 3–5 hours of work on a real dataset. The rubric is usually: did you frame the problem correctly, is your code clean and reproducible, are your conclusions sound, and can you defend every decision in the follow-up discussion.
Same STAR structure as any behavioral round, but the content skews toward:
Top 5 DS behavioral questions:
See our behavioral questions guide for full STAR structure.
Week 1: Statistics deep review — probability, hypothesis testing, Bayesian basics, experimental design. Do 10 statistics questions without notes.
Week 2: SQL fluency — 30 SQL exercises (LeetCode SQL, StrataScratch, Mode Analytics). Window functions, CTEs, self-joins, recursive CTEs.
Week 3: ML review — 20 concept questions (use this guide). One take-home project from Kaggle or a real-world dataset. Practice explaining models to a non-technical audience.
Week 4: Mock interviews. At least 3 full mock sessions with a timer — 30 min statistics, 30 min SQL, 30 min ML concepts. Polish behavioral stories.
At most companies yes — Python for data manipulation (pandas), ML (scikit-learn, PyTorch/TensorFlow), and scripting. SQL is also required everywhere. R is accepted at some research roles. Scala for Spark is occasionally tested at data engineering-adjacent roles.
Depends on the role. Applied ML / research roles: yes, including transformer architecture, fine-tuning, evaluation. Product analytics DS: not deeply, but you should know the difference between ML and statistical models and when each applies.
Meta DS interviews are widely reported as among the hardest: heavy A/B testing design, product metrics, and SQL. Google DS interviews go deep on statistics and experiment design. Amazon DS roles vary widely by team.
Very. It's the closest thing to actual work. The rubric is: (1) did you frame the problem correctly, (2) is your code clean and reproducible, (3) are your conclusions sound, (4) can you defend every decision. Treat it like a work deliverable.
Practice the structured approach: (1) verify the drop is real, (2) segment by time, geography, platform, user cohort, (3) form 3–5 hypotheses, (4) rank by likelihood, (5) propose tests. Do this out loud, with a timer, until it's automatic.
HiredPathway runs DS-specific mock interviews — statistics, SQL logic, ML concepts, and behavioral — with structured feedback on each answer. Most DS candidates need 15–20 sessions to feel genuinely ready across all five categories.
Midjourney:
Editorial photograph of a data scientist at a standing desk, large monitor showing a Jupyter notebook with data visualizations (heatmap, retention curve), natural window light, whiteboard with feature importance chart sketched in marker, warm neutral tones, shallow depth of field --ar 16:9 --v 6 --style raw
Ideogram:
Clean editorial infographic showing four quadrants: "Type I vs Type II", "Bias vs Variance", "Precision vs Recall", "Correlation vs Causation", each with a minimal diagram, muted academic palette (navy, cream, rust), clean sans-serif typography, textbook style --ar 4:3
Ideogram:
Bold editorial poster, large serif headline "Data Scientist Interview Questions" on the left, right side shows a stylized scatter plot with a regression line, clean data visualization aesthetic, warm paper background --ar 1.91:1
Ready to practice?
HiredPathway gives you AI-powered mock interviews with real-time feedback. Free to start.
Start practicing free →