Question 1

What is the accuracy paradox and how does it expose the failure of accuracy as a metric?

Accepted Answer

The accuracy paradox occurs when a trivial model — one that always predicts the majority class — achieves high accuracy on an imbalanced dataset despite having zero predictive power for the minority class. A model that predicts 'not fraud' on every transaction achieves 99.9% accuracy if fraud is 0.1% of the data, but its recall for fraud is zero. Accuracy is only meaningful when classes are roughly balanced.
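
The fraud example can be checked with a few lines of arithmetic. This is a minimal sketch on made-up data (1,000 transactions, one of them fraud); the variable names are illustrative only.

```python
# Hypothetical data: 1,000 transactions, exactly 1 of them fraud (0.1%).
y_true = [1] + [0] * 999        # 1 = fraud, 0 = not fraud
y_pred = [0] * len(y_true)      # trivial model: always predict "not fraud"

# Accuracy: share of all predictions that match the truth.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of actual fraud cases the model caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy={accuracy:.3f} recall={recall:.3f}")
```

The accuracy comes out at 0.999 while recall is 0, which is the paradox in miniature.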

Question 2

What is the difference between classification and regression, and how do you choose between them?

Accepted Answer

Classification predicts a discrete class label; regression predicts a continuous numeric value. The choice is determined by the nature of the target variable, not by the algorithm family — many algorithms (e.g., decision trees, neural nets) handle both.

Question 3

What is a confusion matrix and what four quantities does it report?

Accepted Answer

For binary classification, a confusion matrix tallies predictions against ground truth in a 2x2 table: true positives, true negatives, false positives, and false negatives. From those four cells every threshold-based classification metric (accuracy, precision, recall, F1, specificity) can be derived. It exposes *which kind* of error a model makes, not just how often it errs.
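
As a rough illustration, the four cells and the metrics derived from them can be computed by hand. The labels below are invented, and `confusion_counts` is a hypothetical helper, not a library function.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn

# Made-up labels: 3 positives, 5 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)

# Every metric below is a ratio of the four cells.
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)
```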

Question 4

What is feature engineering, and can you walk through how you'd engineer features to improve a model?

Accepted Answer

Feature engineering is creating, transforming, or selecting input variables so a model can capture patterns more easily. Common techniques include scaling, encoding categoricals, binning, interaction and ratio features, date/time decomposition, and domain-derived aggregates. It often matters more than the choice of algorithm because models can only learn from the signal present in their inputs.

Question 5

How do you extract useful features from datetime columns for a machine learning model?

Accepted Answer

Raw timestamps are meaningless to most models. Useful features extracted from a datetime column include calendar components (hour, day of week, month, quarter, year), cyclical encodings of periodic components (sin/cos of hour or day-of-week), lag and rolling-window aggregates, time-since-event features, and business-calendar flags like is_weekend or is_holiday.
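
A minimal sketch of the calendar and cyclical features, using only the standard library. The function name and the choice of features are assumptions for illustration; lag, rolling-window and holiday features need the surrounding dataset and are left out.

```python
import math
from datetime import datetime

def datetime_features(ts):
    """Derive calendar and cyclical features from a single datetime."""
    hour_angle = 2 * math.pi * ts.hour / 24
    return {
        "hour": ts.hour,
        "day_of_week": ts.weekday(),           # Monday = 0, Sunday = 6
        "month": ts.month,
        "quarter": (ts.month - 1) // 3 + 1,
        "is_weekend": ts.weekday() >= 5,
        # sin/cos place hour 23 next to hour 0 on a circle.
        "hour_sin": math.sin(hour_angle),
        "hour_cos": math.cos(hour_angle),
    }

features = datetime_features(datetime(2024, 3, 16, 18, 30))
```

The sin/cos pair is used because a plain integer hour would put 23:00 and 00:00 at opposite ends of the scale even though they are adjacent in time.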

Question 6

Which models require feature scaling and which don't, and why?

Accepted Answer

Distance-based and gradient-based models (KNN, K-means, SVM, PCA, linear/logistic regression with regularization, neural networks) need scaling because they're sensitive to feature magnitudes. Tree-based models (decision trees, random forests, gradient boosting) are scale-invariant because they split on thresholds per feature. Standardization and min-max scaling are the usual choices, fit on training data only.

Question 7

What's the difference between feature selection and dimensionality reduction like PCA?

Accepted Answer

Feature selection keeps a subset of the original features and discards the rest, so the surviving features stay interpretable. Dimensionality reduction like PCA creates new features that are combinations of the originals, compressing information but losing direct interpretability. Choose feature selection when you need to explain which inputs matter, and PCA when you mainly need a compact representation and don't need named features.

Question 8

What is the difference between Gini impurity and entropy as splitting criteria in decision trees?

Accepted Answer

Both measure node impurity but differ in computation and sensitivity. Gini is faster to compute (no logarithm) and slightly favors larger partitions, while entropy (information gain) reacts more strongly to small changes in class probabilities near 0 and 1. In practice the splits they produce are nearly identical.
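
To make the comparison concrete, here is a small sketch of both measures for a binary node, where `p` is the fraction of positive samples. The function names are illustrative, not taken from any library.

```python
import math

def gini(p):
    """Gini impurity of a binary node with positive-class fraction p."""
    return 1 - p**2 - (1 - p) ** 2

def entropy(p):
    """Shannon entropy (in bits) of a binary node with positive-class fraction p."""
    if p in (0, 1):
        return 0.0  # a pure node has no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Both are zero for a pure node and peak at p = 0.5.
for p in (0.0, 0.1, 0.3, 0.5):
    print(f"p={p:.1f} gini={gini(p):.3f} entropy={entropy(p):.3f}")
```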

Question 9

How do you handle skewed features in a machine learning dataset, and why does skew matter?

Accepted Answer

Right-skewed features (long tail on the right) concentrate most values near zero while a few extreme values pull the mean up, which distorts distance-based models and linear regression. Common fixes are log, square-root, or Box-Cox transformations that compress the tail and make the distribution closer to normal, improving model convergence and reducing the undue influence of large values.

Question 10

How does k-means clustering work?

Accepted Answer

K-means partitions n points into k clusters by alternating between two steps: assigning each point to its nearest centroid, then recomputing each centroid as the mean of its assigned points. It repeats until assignments stop changing. Neither step can increase the within-cluster sum of squares, so the procedure always converges, but only to a local optimum that depends on the initial centroids.
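
The two alternating steps can be sketched in plain Python. This is an illustrative version for 2-D points that takes the initial centroids as an argument; real implementations add smarter initialisation and multiple restarts.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps; returns (centroids, assignments)."""
    assignments = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for each point.
        new_assignments = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing: converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its assigned points.
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                updated.append((sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members)))
            else:
                updated.append(old)  # leave an empty cluster's centroid in place
        centroids = updated
    return centroids, assignments

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignments = kmeans(points, centroids=[(0, 0), (10, 10)])
```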

Question 11

What is information gain and how does it relate to entropy in a decision tree split?

Accepted Answer

Information gain measures how much a split reduces uncertainty (entropy) in the target variable. It is the difference between the parent node's entropy and the weighted average entropy of the child nodes. The split that maximises information gain is selected at each node.
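
A short sketch of that calculation on made-up labels. `entropy` and `information_gain` are illustrative helpers written for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 4 + ["no"] * 4
perfect_split = [["yes"] * 4, ["no"] * 4]             # both children pure
useless_split = [["yes", "yes", "no", "no"]] * 2      # children as mixed as the parent

gain_perfect = information_gain(parent, perfect_split)
gain_useless = information_gain(parent, useless_split)
```

The perfect split recovers the full 1 bit of parent entropy, while the useless split gains nothing.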

Question 12

What's the difference between k-means and k-nearest neighbors? People confuse them.

Accepted Answer

K-means is an unsupervised clustering algorithm that partitions unlabeled data into k groups by iteratively updating centroids. KNN is a supervised algorithm that classifies or predicts a new point using the labels of its k closest training points. They share the letter k and the use of distances but solve completely different problems.

Question 13

How does k-nearest neighbours work, and why is it called a lazy learner?

Accepted Answer

KNN stores the entire training set and defers all computation to prediction time: for a new point it finds the k closest training examples by distance, then returns the majority class (classification) or mean value (regression). It is called lazy because there is no training phase — the model is the data itself.
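
A minimal classification sketch, assuming Euclidean distance and a simple majority vote. The data and the `knn_predict` name are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # There is no fit step: the stored training data is the model,
    # and all distance computation happens here at prediction time.
    by_distance = sorted(zip(train_X, train_y),
                         key=lambda pair: math.dist(pair[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = ["a", "a", "a", "b", "b", "b"]

near_a = knn_predict(train_X, train_y, (2, 2))
near_b = knn_predict(train_X, train_y, (7, 7))
```

Sorting the whole training set on every query is what makes naive KNN slow at inference time.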

Question 14

Why is KNN called a lazy learner, and what are the practical tradeoffs at inference time?

Accepted Answer

KNN is lazy because it does no real training; it just stores the training data and defers all computation to prediction time, when it searches for the nearest neighbors. The tradeoff is fast (zero) training but slow, memory-heavy inference that scales with dataset size. Approximate nearest-neighbor indexes and dimensionality reduction make it practical at scale.

Question 15

How does Naive Bayes work, and why is it called 'naive'?

Accepted Answer

Naive Bayes applies Bayes' theorem to classify by computing the posterior probability of each class given the features. It is naive because it assumes all features are conditionally independent given the class label — an assumption that is almost never true in practice, yet the classifier still works surprisingly well for text and other sparse data.

Question 16

What is the zero-probability problem in Naive Bayes and how do you fix it?

Accepted Answer

If a feature value never appears with a given class in training, its conditional probability is zero, and since Naive Bayes multiplies probabilities, the whole posterior for that class becomes zero regardless of other evidence. The fix is Laplace (additive) smoothing, which adds a small count to every feature-class combination so no probability is ever exactly zero. This is essential for text where many words are unseen per class.
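
The smoothing formula is short enough to sketch directly. The counts below are hypothetical, and `alpha = 1` is the classic Laplace choice.

```python
def smoothed_likelihood(word_count, total_words_in_class, vocab_size, alpha=1.0):
    """P(word | class) with additive (Laplace) smoothing."""
    return (word_count + alpha) / (total_words_in_class + alpha * vocab_size)

# Hypothetical counts: "refund" never appears among the 100 words seen for the
# "ham" class, and the vocabulary holds 50 distinct words.
unsmoothed = 0 / 100                          # zero wipes out the whole product
smoothed = smoothed_likelihood(0, 100, 50)    # (0 + 1) / (100 + 50)
```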

Question 17

When should you use one-hot encoding versus label encoding for categorical features?

Accepted Answer

Label encoding assigns each category an integer and implies an ordinal relationship that most algorithms will treat as meaningful distance. One-hot encoding creates a binary column per category and is correct for nominal data fed to linear or distance-based models. Use label encoding only when the category genuinely has an order, or when feeding tree-based models that handle it cleanly.

Question 18

What is the out-of-bag error in a random forest and how reliable is it as a validation estimate?

Accepted Answer

The OOB error is computed by predicting each training sample only with the trees that did not include it in their bootstrap sample. It is nearly unbiased and tracks closely with cross-validation accuracy, making it a free, practical validation estimate that does not require a separate hold-out split.

Question 19

What are overfitting and underfitting, and how do you fix each?

Accepted Answer

Overfitting occurs when a model memorizes training noise and fails to generalize; underfitting occurs when the model is too simple to capture the true signal. Fixes differ: overfitting requires regularization, more data, or reduced complexity; underfitting requires a more expressive model or better features.

Question 20

Why does R-squared always increase when you add features, and when should you use adjusted R-squared instead?

Accepted Answer

R-squared measures the proportion of variance explained by the model. For a least-squares fit evaluated on the training data, it can only increase or stay the same as features are added, even if those features are pure noise. Adjusted R-squared penalizes the number of predictors, making it the better metric for comparing models with different numbers of features.
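
A small sketch of the adjustment, using the standard formula with n samples and p predictors. The R-squared values below are invented to show the effect, not measured from a real fit.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical fits on 50 samples: adding 10 noise features nudges R^2 up
# from 0.800 to 0.805, but the adjusted value goes down.
base = adjusted_r2(0.800, n_samples=50, n_features=5)
bloated = adjusted_r2(0.805, n_samples=50, n_features=15)
```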

Question 21

What is the difference between standardization and normalization, and which models require feature scaling?

Accepted Answer

Standardization rescales features to zero mean and unit variance; normalization squashes values into a fixed range, usually [0, 1]. Distance-based and gradient-based models are sensitive to scale and require one of these; tree-based models split on rank order and are scale-invariant.
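
Both rescalings can be sketched with the standard library. The income values are made up; in a real pipeline the mean, standard deviation, minimum and maximum would be computed on the training data and reused unchanged on validation and test data.

```python
import statistics

def standardize(values):
    """Rescale to zero mean and unit variance (population standard deviation)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def min_max(values):
    """Rescale linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

incomes = [20_000, 40_000, 60_000, 80_000, 100_000]
z_scores = standardize(incomes)
scaled = min_max(incomes)
```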

Question 22

What is the difference between supervised, unsupervised, and reinforcement learning?

Accepted Answer

Supervised learning trains on labeled input-output pairs to predict a target. Unsupervised learning finds structure in unlabeled data. Reinforcement learning trains an agent to maximize cumulative reward through trial-and-error interaction with an environment.

Question 23

Why do we split data into train, validation, and test sets, and what are the typical proportions?

Accepted Answer

The train set fits the model, the validation set tunes hyperparameters and guides model selection, and the held-out test set provides an unbiased estimate of final generalization error. Using the test set during development causes optimistic bias because the evaluation signal leaks into decisions. Typical splits range from about 60/20/20 to 80/10/10 (train/validation/test), with the held-out shares shrinking as the dataset grows.

Question 24

What is AutoML, what does it automate, and where does it fall short?

Accepted Answer

AutoML automates parts of the ML pipeline such as data preprocessing, feature engineering, model selection, hyperparameter tuning, and sometimes neural architecture search, lowering the barrier to building models. It falls short on problem framing, data quality, domain feature engineering, careful evaluation against leakage, fairness, and deployment concerns, which still need human expertise. It's best as an accelerator and strong baseline generator, not a replacement for an ML engineer.

Question 25

Why is Naive Bayes called 'naive,' and why does it still work so well for text classification?

Accepted Answer

It's 'naive' because it assumes all features are conditionally independent given the class, which is almost never literally true. It still works for text because, even when the independence assumption is violated, the predicted class (the argmax) is often correct even if the probability estimates are miscalibrated. It's also fast, needs little data, and handles high-dimensional sparse word counts well.

Question 26

Why is linear regression unsuitable for binary classification, and what specific problems does logistic regression fix?

Accepted Answer

Linear regression predicts unbounded real values, so it can output probabilities below 0 or above 1, and its loss function penalizes confident correct predictions. Logistic regression fixes this by applying the sigmoid to map any real score to (0,1) and optimizing log-loss, which is a proper scoring rule aligned with probability calibration.
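
The contrast between the two losses can be sketched numerically. The score of 4.0 is an arbitrary example of a confident, correct prediction for a positive label.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

def log_loss(y, p):
    """Negative log-likelihood of binary label y under predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def squared_error(y, score):
    """Squared-error loss as used by linear regression on 0/1 targets."""
    return (y - score) ** 2

score = 4.0   # confidently positive raw score, true label y = 1
linear_loss = squared_error(1, score)          # punished for overshooting 1
logistic_loss = log_loss(1, sigmoid(score))    # close to zero
```

Squared error charges 9.0 for a prediction that is on the right side of the boundary, while log-loss on the sigmoid output is under 0.02.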

Question 27

How do you approach anomaly detection, and why is accuracy a bad metric for it?

Accepted Answer

Anomaly detection finds rare points that deviate from normal patterns, using statistical, distance, density, or model-based methods like isolation forest and one-class SVM, often trained mostly on normal data. Accuracy is misleading because anomalies are extremely rare, so a model that predicts 'normal' for everything scores high accuracy while catching nothing. Use precision, recall, F1, PR-AUC, or ROC-AUC instead, chosen by the cost of false positives vs false negatives.

Question 28

What is the difference between bagging and boosting, and what error component does each primarily reduce?

Accepted Answer

Bagging trains many independent models on bootstrap samples in parallel and averages their predictions, primarily reducing variance. Boosting trains models sequentially, each correcting the errors of its predecessor, primarily reducing bias.

Question 29

Bagging vs boosting — how do they differ, and when does each help?

Accepted Answer

Bagging trains many independent models in parallel on bootstrap samples and averages them, which mainly reduces variance; boosting trains models sequentially so each corrects its predecessor's errors, which mainly reduces bias. Use bagging (e.g. random forests) when your base learner is high-variance and overfits; use boosting (e.g. gradient boosting) when you need to squeeze out bias and maximize accuracy, accepting more tuning and overfitting risk.

Question 30

What is the bias–variance tradeoff?

Accepted Answer

A model's expected test error splits into bias (error from over-simplified assumptions, causing underfitting), variance (sensitivity to the particular training sample, causing overfitting), and irreducible noise. Adding complexity lowers bias but raises variance, so the best model minimises their sum on unseen data — not the training error.

Question 31

How do you choose the number of clusters k in k-means?

Accepted Answer

The elbow method plots inertia against k and looks for the bend where adding another cluster gives diminishing returns. The silhouette score measures how similar each point is to its own cluster versus its nearest rival, with values closer to 1 indicating tighter, better-separated clusters. Both should be used together, not in isolation.

Question 32

What is the curse of dimensionality, and how does it affect machine learning models?

Accepted Answer

As the number of features grows, the volume of the feature space increases exponentially, so training data becomes exponentially sparse. Distance-based algorithms degrade because points become approximately equidistant; density estimation requires data that grows exponentially; and overfitting risk rises for any fixed training set size.

Question 33

What is data leakage in machine learning, and what are the most common ways it occurs?

Accepted Answer

Data leakage happens when information that would not be available at prediction time influences model training, producing overly optimistic evaluation metrics that collapse in production. Common sources include fitting preprocessors on the full dataset, including target-derived features, and using future data in time-series pipelines.

Question 34

What is pruning in decision trees and when would you use pre-pruning versus post-pruning?

Accepted Answer

Pruning removes splits that do not improve generalisation. Pre-pruning stops growth early via hyperparameters like max_depth or min_samples_leaf. Post-pruning (for example cost-complexity pruning) grows the full tree, then collapses subtrees whose contribution to held-out accuracy is too small to justify their added complexity.

Question 35

Walk me through exactly how a decision tree chooses a split at each node.

Accepted Answer

At each node the algorithm iterates over every feature and every candidate threshold, scores each candidate split by the weighted impurity of the two child nodes, and selects the pair that gives the largest impurity reduction. It then recurses on each child until a stopping criterion is met.
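
The search at a single node can be sketched as a double loop. This illustrative version uses Gini impurity and midpoints between consecutive distinct values as candidate thresholds; both are assumptions, and other impurity measures or threshold schemes fit the same loop.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Return (feature_index, threshold, weighted_child_impurity) of the best split."""
    best = None
    n = len(y)
    for f in range(len(X[0])):                        # every feature
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):        # every candidate threshold
            t = (lo + hi) / 2
            left = [label for row, label in zip(X, y) if row[f] <= t]
            right = [label for row, label in zip(X, y) if row[f] > t]
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            # Parent impurity is fixed, so the lowest weighted child impurity
            # is the largest impurity reduction.
            if best is None or score < best[2]:
                best = (f, t, score)
    return best

X = [[2.0, 7.0], [3.0, 1.0], [6.0, 8.0], [7.0, 2.0]]
y = [0, 0, 1, 1]
split = best_split(X, y)
```

Here feature 0 at threshold 4.5 separates the classes perfectly; a full tree would call the same function recursively on each child.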

Question 36

What is the difference between discriminative and generative models, and when would you prefer each?

Accepted Answer

Discriminative models learn the conditional distribution P(y|x) directly and focus entirely on the decision boundary; generative models learn the joint distribution P(x,y) and can generate new samples. Discriminative models typically achieve higher classification accuracy with sufficient labeled data; generative models excel when data is scarce, you need to synthesize data, or the problem requires modeling the input distribution.

Question 37

How does early stopping work in gradient boosting, and why is it necessary?

Accepted Answer

Early stopping evaluates a held-out validation metric after each tree is added and stops training once the metric has not improved for a given number of rounds. It is necessary because adding trees keeps lowering the training loss, so the ensemble eventually overfits: validation loss bottoms out and then starts to rise.
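
The stopping rule itself is a few lines. This sketch takes a precomputed list of validation losses for simplicity; in a real boosting loop each loss would be measured right after the corresponding tree is added. The function name and the numbers are hypothetical.

```python
def early_stopping_round(val_losses, patience):
    """Return the index of the best round, stopping after `patience` rounds without improvement."""
    best_loss, best_round = float("inf"), -1
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, i
        elif i - best_round >= patience:
            break  # no improvement for `patience` rounds: stop adding trees
    return best_round

# Made-up validation losses, one per boosting round.
val_losses = [0.60, 0.45, 0.38, 0.35, 0.36, 0.37, 0.39, 0.30]
best = early_stopping_round(val_losses, patience=3)
```

With a patience of 3, training halts at round 6 and keeps round 3; the later dip at round 7 is never seen, which is the price of a small patience value.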

Question 38

What problem does ElasticNet solve that neither Lasso nor Ridge can handle alone?

Accepted Answer

When predictors are highly correlated, Lasso tends to arbitrarily pick one and discard the others, producing unstable feature selection. Ridge retains all correlated features but cannot zero any out. ElasticNet combines both penalties to achieve stable, sparse solutions — it groups correlated features and can shrink the whole group together.

Question 39

Explain the bias-variance tradeoff and how you'd diagnose which one you have.

Accepted Answer

Bias is error from oversimplifying assumptions (underfitting); variance is error from sensitivity to the training set (overfitting). Total error decomposes into bias squared, variance, and irreducible noise, and reducing one often increases the other. You diagnose by comparing training and validation error: high error on both means high bias, while a large gap (low train, high validation) means high variance.

Question 40

What is the kernel trick in SVM, and why does it work?

Accepted Answer

The kernel trick lets an SVM find a nonlinear decision boundary by implicitly mapping data into a higher-dimensional space where it becomes linearly separable, without ever computing that mapping explicitly. It works because the SVM's dual formulation depends only on dot products between points, and a kernel function computes that dot product directly in the high-dimensional space. Common kernels are linear, polynomial, and RBF.
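
A small numeric check of the idea, using the degree-2 homogeneous polynomial kernel on 2-D inputs as an example. The explicit feature map is written out only to show that the kernel returns the same dot product without ever building it.

```python
import math

def poly_kernel(x, z):
    """Degree-2 homogeneous polynomial kernel: (x . z)^2, computed in 2-D."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def explicit_map(x):
    """Explicit 3-D feature map whose dot product equals poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

x, z = (1.0, 2.0), (3.0, 4.0)
implicit = poly_kernel(x, z)
explicit = sum(a * b for a, b in zip(explicit_map(x), explicit_map(z)))
```

Both routes give 121. For the RBF kernel the corresponding feature space is infinite-dimensional, so only the implicit route is available.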

Question 41

What is the F1 score, why use the harmonic mean, and when is it the wrong metric?

Accepted Answer

F1 is the harmonic mean of precision and recall: 2PR/(P+R). The harmonic mean penalises extreme imbalance between the two: a model with 1.0 precision and 0.01 recall gets F1 ≈ 0.02, not 0.505. F1 is the wrong metric when true negatives matter (it ignores them entirely) or when the costs of false positives and false negatives differ sharply, in which case F-beta, PR-AUC, or a cost-weighted metric is more appropriate.
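
The worked numbers can be reproduced with a short sketch. `f_beta` is an illustrative helper using the standard F-beta formula, where beta = 1 gives F1.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta = 1 gives F1, beta > 1 weights recall more heavily."""
    denom = beta**2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / denom

precision, recall = 1.0, 0.01
f1 = f_beta(precision, recall)                  # harmonic mean
arithmetic_mean = (precision + recall) / 2      # what a plain average would report
f2 = f_beta(precision, recall, beta=2.0)        # recall-weighted variant
```

F1 lands near 0.02 against an arithmetic mean of 0.505, and F2 is lower still because it gives the poor recall more weight.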

Question 42

Why does regularization require feature scaling, and what happens if you skip it?

Accepted Answer

Regularization penalizes large coefficient magnitudes uniformly. If features are on different scales, a feature measured in thousands will naturally have a small coefficient while one measured in fractions will have a large one, so the penalty disproportionately shrinks some features and nearly ignores others. Standardization ensures the penalty is applied equally across all features.

Question 43

What are filter, wrapper, and embedded feature selection methods, and when do you use each?

Accepted Answer

Filter methods score features independently of the model using statistics like mutual information or correlation; they are fast but ignore feature interactions. Wrapper methods search subsets by actually training the model, finding better subsets at high computational cost. Embedded methods perform selection during training — LASSO and tree-based feature importances are the most common — offering a balance of quality and speed.

Question 44

Compare filter, wrapper, and embedded feature-selection methods. When would you use each?

Accepted Answer

Filter methods score features by statistical relevance to the target independently of any model, so they're fast but ignore feature interactions. Wrapper methods (like recursive feature elimination) search subsets by training a model and evaluating performance, which is accurate but computationally expensive. Embedded methods select features as part of model training (like lasso or tree importances), giving a good balance of accuracy and efficiency.

Question 45

How does a Gaussian Mixture Model differ from k-means, and when would you prefer it?

Accepted Answer

A GMM models data as a mixture of Gaussian distributions and assigns soft probabilities of cluster membership, fitting clusters that can be elliptical and different sizes via the EM algorithm. K-means does hard assignment to the nearest centroid and implicitly assumes spherical, equal-size clusters. Prefer a GMM when clusters overlap, have different shapes or covariances, or when you need probabilistic (soft) assignments.

Question 46

Explain how gradient boosting fits residuals. What role does the learning rate play?

Accepted Answer

Gradient boosting builds an additive model by fitting each new tree to the negative gradient of the loss with respect to the current ensemble's predictions — effectively the residuals for squared error. The learning rate shrinks each tree's contribution, keeping the ensemble from over-correcting and acting as a regulariser.
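
A toy version for squared error on one feature, with depth-1 trees (stumps) as base learners. All names and data here are illustrative assumptions; production libraries add subsampling, regularisation terms and much faster split finding.

```python
def fit_stump(xs, residuals):
    """Least-squares stump on one feature; assumes at least two distinct x values."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Squared-error boosting: each stump is fit to the current residuals."""
    base = sum(ys) / len(ys)              # initial prediction: the mean
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For loss 0.5 * (y - pred)^2 the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # The learning rate shrinks each stump's correction.
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gradient_boost(xs, ys)
baseline = sum(ys) / len(ys)
mse_baseline = sum((y - baseline) ** 2 for y in ys) / len(ys)
mse_boosted = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
```

A smaller learning rate needs more rounds to reach the same training error, which is why the two are usually tuned together.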

Question 47

When should you use gradient descent over the normal equation to fit a linear regression?

Accepted Answer

The normal equation gives an exact closed-form solution, but it costs O(np²) to form X^T X and O(p³) to solve the resulting system, so it becomes impractical when the number of features p is large (roughly above ~10,000). Gradient descent costs O(np) per iteration, so it scales to large feature spaces, and its stochastic variants also support online learning.
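
For a single feature plus intercept, both approaches fit in a few lines and can be compared directly. This is a sketch on noiseless made-up data (y = 2x + 1); the learning rate and iteration count are arbitrary choices that happen to converge here.

```python
def closed_form(xs, ys):
    """Normal-equation solution for one feature plus intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def gradient_descent(xs, ys, learning_rate=0.05, n_iters=5000):
    """Batch gradient descent on mean squared error."""
    n = len(xs)
    slope, intercept = 0.0, 0.0
    for _ in range(n_iters):
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        grad_slope = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = 2 / n * sum(errors)
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]     # exactly y = 2x + 1
exact = closed_form(xs, ys)
approx = gradient_descent(xs, ys)
```

The closed form is exact in one shot, while gradient descent approaches the same answer iteratively without ever forming or inverting a matrix.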

Question 48

When should you use grid search vs random search vs Bayesian optimisation for hyperparameter tuning?

Accepted Answer

Grid search exhaustively tries every combination in a predefined grid, which is only practical for 1–2 hyperparameters. Random search samples combinations uniformly at random and finds good values faster per compute budget, especially when only a few hyperparameters actually matter. Bayesian optimisation fits a surrogate model of the objective and proposes the next trial intelligently, giving the best sample efficiency for expensive evaluations.

Question 49

How do decision trees and gradient boosting libraries handle categorical features natively, and when is label encoding safe?

Accepted Answer

scikit-learn trees require numeric input and treat label-encoded integers as ordinal, which imposes a false ordering on nominal categories; label encoding is safe only when the categories have a genuine order. One-hot encoding avoids the false ordering but is expensive for high-cardinality features. LightGBM and recent XGBoost releases support native categorical splits that partition the categories into two groups without ordinal assumptions.

Question 50

How do you handle class imbalance in a machine-learning model?

Accepted Answer

Class imbalance is handled at the data level (oversampling with SMOTE, undersampling), the algorithm level (class weights, balanced bagging), and the decision level (threshold tuning). The right approach depends on how severe the imbalance is, how much data you have, and whether the minority class has sufficient local density to synthesise meaningfully. Always choose your evaluation metric first — accuracy is useless on imbalanced data.