datarekha — Frequently Asked Questions

Question 1

Do I need to know math to start learning Python?

Accepted Answer

No. Python's core syntax — variables, loops, functions, lists, and dictionaries — needs nothing beyond basic arithmetic. Math only matters later for specific domains like data science, and even then the Python itself stays simple. Start with the syntax and add math when a project demands it.

Question 2

What's the difference between a list and a tuple in Python?

Accepted Answer

A list is mutable (you can add, remove, or change items) and uses square brackets; a tuple is immutable (fixed once created) and uses parentheses. Use a list when the collection will change, and a tuple for fixed records or as a dictionary key, which must be hashable (a list is not).
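A minimal sketch of the difference, with illustrative variable names (`scores`, `point`, `distances`) that are not from any particular codebase:

```python
# A list can be changed in place.
scores = [90, 85]
scores.append(70)

# A tuple cannot: item assignment raises TypeError.
point = (3, 4)
try:
    point[0] = 10
    tuple_was_mutated = True
except TypeError:
    tuple_was_mutated = False

# A tuple of hashable items can be a dictionary key; a list cannot.
distances = {(0, 0): 0.0, (3, 4): 5.0}
```

Looking up `distances[point]` works because `point` compares equal to the key `(3, 4)`.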

Question 3

Why is my Python code slow — is the GIL to blame?

Accepted Answer

For CPU-bound work, the Global Interpreter Lock (GIL) stops threads from running Python bytecode in true parallel, so threading won't help; use multiprocessing or vectorised libraries like NumPy instead. For I/O-bound work the GIL is released during waits, so threads or asyncio do help. Most 'slow Python' comes from unvectorised loops or an inefficient algorithm, not from the GIL.

Question 4

When should I use a list comprehension instead of a for loop?

Accepted Answer

Use a comprehension when you're building a new list by transforming or filtering an iterable — it's more concise and usually faster. Stick with a regular for loop when the body has side effects, multiple statements, or complex logic, where a comprehension would hurt readability.
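As a small illustration (the data and names are made up), here is the same filter-and-transform written both ways:

```python
numbers = [1, 2, 3, 4, 5, 6]

# Loop version: build the list step by step.
even_squares_loop = []
for n in numbers:
    if n % 2 == 0:
        even_squares_loop.append(n * n)

# Comprehension version: the same filter and transform in one expression.
even_squares = [n * n for n in numbers if n % 2 == 0]
```

Both produce the same list; the comprehension simply states the intent in one line.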

Question 5

What's the difference between == and is in Python?

Accepted Answer

`==` checks whether two values are equal; `is` checks whether two names point to the exact same object in memory. Use `==` for value comparison (the common case) and reserve `is` for identity checks like `x is None`.
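A short sketch with illustrative names shows how the two can disagree:

```python
a = [1, 2, 3]
b = [1, 2, 3]   # a separate list with the same contents
c = a           # a second name for the same list object

same_value = a == b     # equal contents
same_object = a is b    # but two distinct objects
alias = a is c          # c refers to the very same object as a

result = None
is_missing = result is None   # the idiomatic identity check
```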

Question 6

Why use NumPy instead of plain Python lists?

Accepted Answer

NumPy stores data in contiguous, typed arrays and runs operations in optimised C, so element-wise math on large arrays is often 10–100× faster than Python loops and uses far less memory. Pandas and scikit-learn are built on it, and PyTorch tensors follow the same array model and convert to and from NumPy arrays.

Question 7

What is broadcasting in NumPy?

Accepted Answer

Broadcasting is how NumPy applies an operation between arrays of different shapes without copying data — it virtually stretches the smaller array to match the larger one. For example, subtracting a 1D row of column means from a 2D matrix works per-column without a loop, as long as the trailing dimensions are compatible.
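The column-mean example can be sketched as follows, assuming NumPy is installed; the matrix values are arbitrary:

```python
import numpy as np

matrix = np.array([[1.0, 10.0],
                   [3.0, 20.0],
                   [5.0, 30.0]])   # shape (3, 2)

col_means = matrix.mean(axis=0)    # shape (2,), one mean per column
centered = matrix - col_means      # (3, 2) minus (2,) broadcasts across rows
```

The trailing dimension of both operands is 2, so the shapes are compatible and every column of `centered` ends up with mean zero.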

Question 8

What does the axis argument mean in NumPy?

Accepted Answer

`axis` names the dimension the operation collapses along. For a 2D array, `axis=0` reduces down the rows (one result per column) and `axis=1` reduces across the columns (one result per row). The common mistake is reading 'axis=0' as 'rows' when it actually aggregates over them.
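A quick check on a small array (values chosen only for illustration) makes the direction concrete:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])    # 2 rows, 3 columns

col_sums = a.sum(axis=0)     # collapses the rows: one result per column
row_sums = a.sum(axis=1)     # collapses the columns: one result per row
```

`col_sums` has three entries and `row_sums` has two, matching the dimension that survives in each case.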

Question 9

What's the difference between a NumPy view and a copy?

Accepted Answer

Basic slicing returns a view — a window into the same memory — so changing the slice changes the original array. Fancy indexing (with a list or boolean mask) returns a copy. Call `.copy()` when you need an independent array to avoid surprising in-place mutations.
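A minimal sketch of both behaviours, using throwaway arrays:

```python
import numpy as np

a = np.arange(5)           # [0 1 2 3 4]
view = a[1:4]              # basic slice: shares memory with a
view[0] = 99               # also changes a[1]

b = np.arange(5)
picked = b[[1, 2, 3]]      # fancy indexing: a new, independent array
picked[0] = 99             # b is left unchanged

shares = np.shares_memory(a, view)
```

`np.shares_memory` is a handy way to check which case you are in when you are unsure.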

Question 10

Should I ever use a Python loop over a NumPy array?

Accepted Answer

Rarely. Prefer vectorised operations, broadcasting, and built-in functions, which run in C and are far faster. Reach for a loop only when the logic genuinely can't be vectorised, and even then consider tools like Numba.

Question 11

What's the difference between loc and iloc in Pandas?

Accepted Answer

`loc` selects by label (index and column names); `iloc` selects by integer position. Use `loc` when you know the row index or column name, and `iloc` when you want the Nth row or column regardless of its label.
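For example, on a small frame with string labels (the data here is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"price": [10, 20, 30]},
    index=["apple", "banana", "cherry"],
)

by_label = df.loc["banana", "price"]   # row labelled "banana", column "price"
by_position = df.iloc[1, 0]            # second row, first column
last_row_price = df.iloc[-1]["price"]  # last row, whatever its label is
```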

Question 12

How do I avoid the SettingWithCopyWarning?

Accepted Answer

That warning means you may be assigning to a copy of a slice, so the change might not stick. Do the selection and assignment in a single `.loc` call — e.g. `df.loc[df.x > 0, 'y'] = 1` — or take an explicit `.copy()` first if you intend to work on a separate frame.
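Both options can be sketched on a toy frame; the column names `x` and `y` follow the example above, and `positives` is an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({"x": [-1, 2, 3], "y": [0, 0, 0]})

# Selection and assignment in one .loc call: writes to df itself.
df.loc[df["x"] > 0, "y"] = 1

# An explicit copy makes it clear you want a separate frame.
positives = df[df["x"] > 0].copy()
positives["y"] = 5   # does not touch df
```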

Question 13

When should I use apply versus a vectorised operation?

Accepted Answer

Prefer vectorised operations and built-in methods — they run in optimised C and are far faster than `apply`, which loops in Python. Use `apply` only for genuinely custom row/column logic that can't be expressed with vectorised functions.

Question 14

What's the difference between merge, join, and concat?

Accepted Answer

`merge` combines frames by matching key columns, like a SQL join; `join` does the same but matches on the index by default. `concat` stacks frames along an axis (rows on top of each other or columns side by side) without matching keys. Use `merge` for relational joins and `concat` for appending or aligning by index.

Question 15

Why did my groupby explode the number of rows?

Accepted Answer

That usually happens when you merge before aggregating — a one-to-many join multiplies rows, and a later sum double-counts. Aggregate to the right grain first, or confirm your join keys are unique, before combining tables.
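Here is a small sketch of the double-counting problem and one way to avoid it. The `orders` and `items` tables are hypothetical; `validate` is a real `merge` parameter that raises if the keys are not unique as declared:

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2], "amount": [100, 50]})
items = pd.DataFrame({"order_id": [1, 1, 1, 2], "sku": ["a", "b", "c", "d"]})

# One-to-many join: each order row repeats once per item.
joined = orders.merge(items, on="order_id")
inflated_total = joined["amount"].sum()   # order 1 is counted three times
true_total = orders["amount"].sum()

# Aggregate items to one row per order first, then join and assert the grain.
items_per_order = items.groupby("order_id").size().reset_index(name="n_items")
safe = orders.merge(items_per_order, on="order_id", validate="one_to_one")
```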

Question 16

When should I use Matplotlib versus Seaborn?

Accepted Answer

Seaborn is built on Matplotlib and is quicker to write for common statistical charts (distributions, categories, correlations), with attractive defaults. Drop to Matplotlib when you need fine-grained control over a custom figure. In practice you plot with Seaborn and tweak with Matplotlib.

Question 17

What's the difference between a figure and axes in Matplotlib?

Accepted Answer

The figure is the whole canvas; axes are the individual plot areas inside it, each with its own x/y coordinates. One figure can hold many axes (subplots). Understanding this split is the key to building multi-panel charts cleanly.

Question 18

Which chart should I use for my data?

Accepted Answer

Match the chart to the question: line for trends over time, bar for comparing categories, scatter for relationships between two numeric variables, histogram for one variable's distribution, and box or violin for comparing distributions across groups. Avoid pie charts beyond a few categories.

Question 19

How do I make charts that work in both light and dark mode?

Accepted Answer

Avoid hard-coded colors; use a palette tied to your theme and keep sufficient contrast against either background. Test the figure on both backgrounds, and choose colors that stay distinguishable for color-blind readers.

Question 20

Why does my plot look cramped or get cut off when saved?

Accepted Answer

Labels often overflow the default bounding box. Call `plt.tight_layout()` (or save with `bbox_inches='tight'`) to fit everything, and set an explicit figure size for the medium you're targeting.

Question 21

What's the difference between revenue, profit, and margin?

Accepted Answer

Revenue is total money from sales; profit is what's left after costs; margin is profit as a percentage of revenue. A business can have high revenue and still lose money if costs exceed it — which is why margin, not revenue, tells you whether the model actually works.

Question 22

What are CAC and LTV, and why do they matter together?

Accepted Answer

CAC (Customer Acquisition Cost) is what you spend to win a customer; LTV (Lifetime Value) is the total profit that customer brings. The business is healthy when LTV comfortably exceeds CAC — a common benchmark is roughly 3:1 — otherwise you lose money on every customer you acquire.

Question 23

What is a break-even analysis?

Accepted Answer

Break-even is the point where total revenue equals total cost, so profit is zero. It tells you how many units you must sell, or what price you must charge, to cover fixed and variable costs before you start making money — essential for pricing and go/no-go decisions.
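As a rough sketch, assuming a single product with a constant price and a constant variable cost per unit, break-even units are fixed costs divided by the contribution per unit. The function name and the figures below are illustrative only:

```python
import math

def break_even_units(fixed_costs, price, variable_cost):
    """Units needed so that revenue covers fixed plus variable costs."""
    margin_per_unit = price - variable_cost
    if margin_per_unit <= 0:
        raise ValueError("price must exceed variable cost per unit")
    # Round up: you cannot sell a fraction of a unit.
    return math.ceil(fixed_costs / margin_per_unit)

units = break_even_units(fixed_costs=10_000, price=25, variable_cost=15)
profit_at_break_even = units * 25 - (10_000 + units * 15)
```

With these numbers each unit contributes 10 toward fixed costs, so 1,000 units are needed before profit turns positive.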

Question 24

What is RFM segmentation?

Accepted Answer

RFM scores customers on Recency (how recently they bought), Frequency (how often), and Monetary value (how much) to group them into actionable segments like loyal, at-risk, or new. It's a simple, powerful way to target retention and marketing without a complex model.

Question 25

Why can the average customer be misleading?

Accepted Answer

Averages hide skew. If a few large customers dominate, the mean misrepresents the typical one, and decisions based on it misfire. Look at the median and the full distribution — segments and percentiles usually tell a truer story than a single average.
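A toy example with invented spend figures shows how far apart the two can be when one customer dominates:

```python
from statistics import mean, median

# Nine ordinary customers and one very large one (made-up numbers).
spend = [40, 50, 50, 60, 60, 70, 80, 90, 100, 5000]

avg = mean(spend)        # pulled up by the single large customer
typical = median(spend)  # closer to what most customers spend
```

Here the mean is 560 while the median is 65, so a plan built around a "560 customer" would fit almost nobody.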

Question 26

What's the difference between WHERE and HAVING?

Accepted Answer

`WHERE` filters individual rows before grouping; `HAVING` filters groups after `GROUP BY` has aggregated them. Use `WHERE` for row conditions and `HAVING` for conditions on aggregates like `COUNT(*) > 5`.
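To see both filters in one query, here is a sketch using Python's built-in `sqlite3` module with a hypothetical `orders` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("ana", 20, "paid"), ("ana", 30, "paid"), ("ana", 99, "refunded"),
        ("ben", 15, "paid"),
        ("cy", 40, "paid"), ("cy", 10, "paid"),
    ],
)

rows = conn.execute(
    """
    SELECT customer, COUNT(*) AS n, SUM(amount) AS total
    FROM orders
    WHERE status = 'paid'      -- row filter, applied before grouping
    GROUP BY customer
    HAVING COUNT(*) > 1        -- group filter, applied after aggregation
    ORDER BY customer
    """
).fetchall()
```

The refunded row is dropped by `WHERE` before any counting happens, and `ben` is dropped by `HAVING` because his group has only one paid order.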

Question 27

When should I use a window function instead of GROUP BY?

Accepted Answer

Use a window function when you need an aggregate alongside the original rows — a running total, a rank within each group, or each row's share of its group's total — without collapsing the rows. `GROUP BY` is for when you only want one summary row per group.
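The contrast can be sketched with `sqlite3` and a made-up `sales` table. This assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10), ("north", 30), ("south", 20)],
)

# GROUP BY: one summary row per region.
grouped = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Window function: every original row, plus its region's total alongside.
windowed = conn.execute(
    """
    SELECT region, amount,
           SUM(amount) OVER (PARTITION BY region) AS region_total
    FROM sales
    ORDER BY region, amount
    """
).fetchall()
```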

Question 28

What's the difference between an INNER JOIN and a LEFT JOIN?

Accepted Answer

An INNER JOIN keeps only rows that match in both tables; a LEFT JOIN keeps every row from the left table and fills NULLs where the right has no match. Use LEFT JOIN when you want to keep all records from your primary table even when related data is missing.
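A small sketch with `sqlite3` and two hypothetical tables, where one customer has no orders:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'ana'), (2, 'ben');
    INSERT INTO orders VALUES (1, 50);
    """
)

inner = conn.execute(
    "SELECT c.name, o.amount FROM customers c "
    "INNER JOIN orders o ON o.customer_id = c.id ORDER BY c.id"
).fetchall()

left = conn.execute(
    "SELECT c.name, o.amount FROM customers c "
    "LEFT JOIN orders o ON o.customer_id = c.id ORDER BY c.id"
).fetchall()
```

`ben` disappears from the inner join but survives the left join with a NULL amount, which Python reports as `None`.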

Question 29

Why is my SQL query slow?

Accepted Answer

Usually it's missing indexes on join or filter columns, functions wrapped around indexed columns (which usually stop the planner from using the index), or returning far more rows than needed. Read the query plan with EXPLAIN, index the columns you filter and join on, and avoid `SELECT *` on wide tables.

Question 30

What is a CTE and when should I use one?

Accepted Answer

A CTE (Common Table Expression, the `WITH` clause) names a subquery so you can reference it like a temporary table. Use it to break a complex query into readable steps, or recursively to walk hierarchies like org charts. It mainly improves readability.

Question 31

When do I actually need Spark instead of Pandas?

Accepted Answer

Reach for Spark when your data is too large for one machine's memory or you need a cluster to process it in parallel — typically tens of gigabytes and up. For data that fits comfortably in RAM, Pandas (or Polars) is simpler and faster; Spark's distributed overhead only pays off at scale.

Question 32

What does lazy evaluation mean in Spark?

Accepted Answer

Spark doesn't run transformations as you write them — it builds a plan and only executes when an action (like `count`, `collect`, or `write`) is called. This lets its Catalyst optimiser reorder and combine steps, but it also means errors can surface only at the action.

Question 33

What's the difference between a transformation and an action in Spark?

Accepted Answer

Transformations (`select`, `filter`, `join`, `groupBy`) describe what to compute and return a new DataFrame lazily; actions (`count`, `show`, `collect`, `write`) trigger execution and return results. Nothing runs until an action is called.

Question 34

Why is my Spark job slow or running out of memory?

Accepted Answer

The usual cause is data skew or wide shuffles — joins and groupBy operations that move data across the cluster. Check for skewed keys, avoid `collect()` on large data, cache reused DataFrames, and let Adaptive Query Execution (AQE) tune partitions.

Question 35

What's the difference between the Spark driver and executors?

Accepted Answer

The driver runs your program and builds the execution plan; executors are the worker processes across the cluster that run tasks on partitions of the data. Pulling too much data back to the driver (e.g. `collect()`) is a common source of out-of-memory errors.

Question 36

How much math do I really need for machine learning?

Accepted Answer

Enough to reason, not to derive everything from scratch: linear algebra (vectors, matrices, dot products), the basics of calculus (gradients and the chain rule), and probability and statistics (distributions, expectation, Bayes). You can start applying models with less and deepen the math as you go.

Question 37

Why is linear algebra so important for ML?

Accepted Answer

Data is represented as vectors and matrices, and nearly every model — from linear regression to neural networks — is built on matrix multiplication. Linear algebra is the language that makes these operations fast and lets you reason about transformations, projections, and dimensionality.

Question 38

What is a gradient, intuitively?

Accepted Answer

A gradient is the vector of partial derivatives that points in the direction of steepest increase of a function. Training follows the negative gradient downhill to reduce the loss — that's gradient descent — and the gradient's size tells you how steep the slope is at the current point.
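Here is a one-dimensional sketch of gradient descent on a toy loss, (w - 3)^2, chosen only because its minimum is obvious; the learning rate and step count are arbitrary assumptions:

```python
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    # Derivative of the loss with respect to w.
    return 2.0 * (w - 3.0)

w = 0.0
learning_rate = 0.1
for _ in range(100):
    # Step against the gradient to move downhill.
    w -= learning_rate * grad(w)
```

At the start the gradient is large and negative, so `w` moves quickly to the right; as it nears 3 the gradient shrinks and the steps get smaller.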

Question 39

What's the difference between probability and statistics?

Accepted Answer

Probability reasons forward — given a known model, how likely is an outcome. Statistics reasons backward — given observed data, what model or parameters likely produced it. ML uses both: probability to define models, statistics to fit and evaluate them.

Question 40

What is Bayes' theorem used for?

Accepted Answer

Bayes' theorem updates a prior belief into a posterior after seeing evidence, by weighing how likely the evidence is under each hypothesis. It underpins spam filters and medical-test interpretation, and it explains why a positive result on a rare-condition test can still mean low actual risk.
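The rare-condition case can be worked through numerically. The figures below (1% prevalence, 99% sensitivity, 5% false-positive rate) are assumptions picked for illustration, not data from any real test:

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' theorem."""
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

p = posterior(prior=0.01, sensitivity=0.99, false_positive_rate=0.05)
```

Under these assumptions a positive result implies roughly a 1-in-6 chance of having the condition, because false positives among the healthy majority outnumber true positives.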

Question 41

Do neural networks always beat traditional ML?

Accepted Answer

No. On most tabular data, gradient-boosted trees (XGBoost, LightGBM, CatBoost) match or beat neural networks with less tuning and far less data. Deep learning dominates for images, text, and audio, but for structured business data, classical models are usually the stronger and simpler choice.

Question 42

What's the difference between training, validation, and test sets?

Accepted Answer

You fit the model on the training set, tune hyperparameters on the validation set, and report final performance once on the untouched test set. Keeping the test set truly held out is what makes your reported accuracy honest rather than optimistic.

Question 43

Why is my model 99% accurate but useless?

Accepted Answer

Accuracy is misleading on imbalanced data — a model that always predicts the majority class can score 99% while catching none of the rare cases that matter. Use metrics suited to the problem, like precision, recall, F1, or AUC, and read the confusion matrix.
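A minimal sketch with synthetic labels (10 positives in 1,000 cases) and a model that always predicts the majority class:

```python
y_true = [1] * 10 + [0] * 990   # 1% positive class (synthetic)
y_pred = [0] * 1000             # always predict the majority class

pairs = list(zip(y_true, y_pred))
correct = sum(t == p for t, p in pairs)
true_pos = sum(t == 1 and p == 1 for t, p in pairs)
false_neg = sum(t == 1 and p == 0 for t, p in pairs)

accuracy = correct / len(y_true)
recall = true_pos / (true_pos + false_neg)
```

Accuracy comes out at 99% while recall is zero: the model never catches a single positive case.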

Question 44

What is data leakage and why is it dangerous?

Accepted Answer

Data leakage is when information unavailable at prediction time sneaks into training — like scaling using the whole dataset before splitting, or a feature derived from the target. It produces great validation scores that collapse in production, making it one of the most costly silent bugs in ML.

Question 45

When should I use XGBoost versus a random forest?

Accepted Answer

Random forests are robust and need little tuning, so they make a strong baseline. Gradient boosting (XGBoost and friends) usually reaches higher accuracy but is more sensitive to hyperparameters and overfitting. Start with a random forest, then try boosting to squeeze out more performance.

Question 46

What does PyTorch's autograd actually do?

Accepted Answer

Autograd records the operations in your forward pass as a graph, then automatically computes gradients of the loss with respect to every parameter by applying the chain rule backward. That's what lets you train with `loss.backward()` without deriving gradients by hand.

Question 47

Why do neural networks need activation functions?

Accepted Answer

Without a non-linear activation, stacking layers just composes linear functions, which collapses to a single linear model no matter how deep. Activations like ReLU add non-linearity, letting the network approximate complex, curved decision boundaries.
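The collapse is easy to check numerically. This sketch uses random weights with NumPy and omits biases for brevity; the shapes and names are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))     # 5 samples, 4 features
W1 = rng.normal(size=(4, 8))    # first layer weights
W2 = rng.normal(size=(8, 3))    # second layer weights

# Two stacked linear layers equal one linear layer with weights W1 @ W2.
two_linear_layers = (x @ W1) @ W2
one_linear_layer = x @ (W1 @ W2)
collapses = np.allclose(two_linear_layers, one_linear_layer)

# Putting ReLU between the layers breaks that equivalence.
with_relu = np.maximum(x @ W1, 0.0) @ W2
relu_differs = not np.allclose(with_relu, one_linear_layer)
```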

Question 48

What's the difference between SGD, Adam, and AdamW?

Accepted Answer

SGD updates weights using the raw gradient (optionally with momentum). Adam adapts the step size per parameter using running averages of the gradient and of its square, often training faster. AdamW decouples weight decay from the gradient update, which fixes how Adam handles it and gives better regularisation; it is the common default for transformers.

Question 49

What loss function should I use?

Accepted Answer

Use cross-entropy for classification — it heavily penalises confident wrong predictions — and mean squared error (or a robust variant like Huber) for regression. The loss must match the output layer, for example softmax outputs paired with cross-entropy.

Question 50

What is a transformer and why did it change everything?

Accepted Answer

A transformer is an architecture built on self-attention, which lets every token directly weigh every other token regardless of distance, and processes a whole sequence in parallel rather than step by step like an RNN. That parallelism and long-range modeling are what made modern large language models possible.

Question 51

What is MLOps, in one sentence?

Accepted Answer

MLOps is the practice of reliably deploying, monitoring, and maintaining machine-learning models in production — applying DevOps discipline (version control, CI/CD, automation, monitoring) to the messier reality of data and models that drift over time.

Question 52

Why do models that work in a notebook fail in production?

Accepted Answer

Common causes are training/serving skew (production data or preprocessing differs from training), data drift over time, environment and dependency differences, and no monitoring to catch the decline. Reproducible pipelines and Docker containers close most of these gaps.

Question 53

What is model drift and how do I detect it?

Accepted Answer

Drift is when the live data (data drift) or the input–target relationship (concept drift) changes from what the model trained on, quietly degrading accuracy. Detect it by monitoring input distributions and prediction quality over time, and retrain or alert when they shift beyond a threshold.

Question 54

Why use Docker for machine learning?

Accepted Answer

Docker packages your code, libraries, and system dependencies into one image that runs identically on your laptop, in CI, and in production. That eliminates 'it works on my machine' failures, which are especially common in ML because of heavy, version-sensitive dependencies.

Question 55

What does MLflow do?

Accepted Answer

MLflow tracks experiments — logging parameters, metrics, and artifacts so you can compare runs — and provides a model registry to version and stage models for deployment. It's the system of record that keeps ML work reproducible instead of scattered across notebooks.

Question 56

What actually is a large language model?

Accepted Answer

An LLM is a neural network (a transformer) trained to predict the next token in text. From that single objective, at scale, it learns grammar, facts, reasoning patterns, and style, then generates by predicting one token at a time. It doesn't look things up — it produces statistically likely continuations.

Question 57

What is RAG and why use it?

Accepted Answer

RAG (Retrieval-Augmented Generation) retrieves relevant documents and feeds them into the model's context so it answers from your data instead of only its training. It reduces hallucination, lets you use private or up-to-date information, and avoids the cost of retraining the model.

Question 58

What do temperature, top-k, and top-p control?

Accepted Answer

They control randomness in generation. Temperature scales how sharply the model favors high-probability tokens — low is focused and near-deterministic, high is creative and riskier. Top-k and top-p (nucleus) limit sampling to the most likely tokens. Use low temperature for factual tasks, higher for brainstorming.
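The effect of temperature can be sketched with a plain softmax over made-up logits. This shows temperature only; top-k and top-p would additionally drop low-probability tokens before sampling:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                     # illustrative scores for three tokens
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)
```

At low temperature the top token takes most of the probability mass; at high temperature the distribution flattens and less likely tokens get sampled more often.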

Question 59

Why does an LLM hallucinate, and how do I reduce it?

Accepted Answer

Because it generates plausible continuations rather than retrieving facts, it can state confident falsehoods, especially outside its training data. Reduce it by grounding answers with RAG, asking for citations, lowering temperature, and constraining the task — but you can't eliminate it, so verify critical outputs.

Question 60

What's the difference between fine-tuning and RAG?

Accepted Answer

Fine-tuning adjusts the model's weights to change its style or specialise its behavior; RAG leaves the model unchanged and supplies knowledge at query time. Use RAG for facts that change or are private, and fine-tuning for consistent format, tone, or task behavior — they're often combined.

Question 61

What makes an AI agent different from a chatbot?

Accepted Answer

An agent doesn't just answer — it plans and acts in a loop: it calls tools, observes the results, and decides the next step until a goal is met. A chatbot produces one reply; an agent can search, run code, query a database, and chain those steps autonomously.

Question 62

What is the Model Context Protocol (MCP)?

Accepted Answer

MCP is an open standard for connecting AI models to tools and data sources through a uniform interface, so any MCP-compatible client can use any MCP server. It acts like a universal adapter that replaces bespoke, per-integration glue code for agent tooling.

Question 63

What's the difference between LangChain and LangGraph?

Accepted Answer

LangChain offers building blocks and chains for composing LLM apps; LangGraph models an agent as an explicit graph of nodes and edges with shared state, which makes loops, branching, retries, and human-in-the-loop control far easier to reason about. Use LangGraph when the control flow gets complex.

Question 64

Why do multi-agent systems often fail?

Accepted Answer

Common failure modes are agents losing shared context, compounding each other's errors over long chains, looping without progress, and ballooning cost and latency. Reliability usually comes from constraining each agent's scope, adding verification steps, and keeping the control flow explicit rather than fully open-ended.
