How would you reduce the cost of serving an ML or LLM model in production without hurting quality?

Work top-down: start at the model layer with quantization, distillation, or routing easy requests to cheaper models, since model choices drive every downstream cost. Then optimize the runtime with batching, caching, and techniques like prompt caching for LLMs, and finally match infrastructure to the load using autoscaling on queue depth and spot or batch capacity. Track cost per token or per prediction alongside latency percentiles and accuracy so that no optimization silently degrades quality.

What techniques reduce LLM cost and latency in production?

Cost scales with input plus output tokens; latency scales mainly with output tokens and model size. The highest-leverage levers are model routing (use a small model when the task is simple), prompt caching (reuse expensive prefix computation), output length control, and batching. Combined, these can cut spend by as much as 60–90% on favourable workloads, but confirm on an evaluation set that quality has not regressed.
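To make the token arithmetic concrete, here is a minimal sketch of a per-request cost estimate. The prices, the model names `small` and `large`, the `cache_discount` value, and the `route` helper are all hypothetical placeholders for illustration, not any provider's real figures.

```python
# Hypothetical prices in dollars per million tokens; real prices vary by provider.
PRICES = {
    "small": {"input": 0.25, "output": 1.00},
    "large": {"input": 3.00, "output": 15.00},
}


def request_cost(model, input_tokens, output_tokens,
                 cached_fraction=0.0, cache_discount=0.9):
    """Estimate the dollar cost of one request.

    cached_fraction: share of input tokens served from a prompt cache.
    cache_discount: assumed price reduction on cached input tokens.
    """
    price = PRICES[model]
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    billable_input = fresh + cached * (1 - cache_discount)
    input_cost = billable_input * price["input"] / 1_000_000
    output_cost = output_tokens * price["output"] / 1_000_000
    return input_cost + output_cost


def route(is_simple):
    """Toy router: send simple tasks to the cheaper model."""
    return "small" if is_simple else "large"


baseline = request_cost("large", 4000, 500)
routed = request_cost(route(True), 4000, 500)
cached = request_cost("large", 4000, 500, cached_fraction=0.75)

for label, cost in [("baseline", baseline), ("routed", routed), ("cached", cached)]:
    print(f"{label}: ${cost:.4f} per request")
```

Under these assumed prices, routing changes the price per token while caching only shrinks the input side of the bill, which is why the two levers stack.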

How do you attribute and control ML spend across teams and models (FinOps for ML)?

Apply FinOps to ML by tagging every workload (training jobs, endpoints, GPU pools) by team, model, and environment so cost is attributable, then track unit-economics metrics like cost per prediction or per training run rather than just total spend. Set budgets and alerts, identify idle GPUs and overprovisioned endpoints, and enforce guardrails like autoscaling and instance-type policies. The goal is continuous visibility and accountability so teams optimize cost without killing experimentation.
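As a rough illustration of tag-based attribution, the sketch below rolls hypothetical billing records up into cost per prediction for each team and model. The record fields, team names, and figures are invented for the example; a real billing export will have its own schema.

```python
from collections import defaultdict

# Hypothetical billing records: each workload carries team/model/env tags.
records = [
    {"team": "search", "model": "ranker-v2", "env": "prod",
     "cost_usd": 120.0, "predictions": 400_000},
    {"team": "search", "model": "ranker-v2", "env": "staging",
     "cost_usd": 30.0, "predictions": 0},
    {"team": "support", "model": "chat-llm", "env": "prod",
     "cost_usd": 300.0, "predictions": 60_000},
]


def unit_costs(records, keys=("team", "model")):
    """Group spend by the given tags and compute cost per prediction."""
    totals = defaultdict(lambda: {"cost_usd": 0.0, "predictions": 0})
    for record in records:
        group = tuple(record[k] for k in keys)
        totals[group]["cost_usd"] += record["cost_usd"]
        totals[group]["predictions"] += record["predictions"]

    report = {}
    for group, total in totals.items():
        served = total["predictions"]
        # No predictions served means the unit cost is undefined, not zero.
        per_prediction = total["cost_usd"] / served if served else None
        report[group] = {**total, "cost_per_prediction": per_prediction}
    return report


report = unit_costs(records)
for group, row in report.items():
    print(group, row)
```

Grouping by a different tag set, for example `keys=("env",)`, gives the per-environment view from the same records.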

How does LLMOps differ from classical MLOps, and what new operational challenges do LLMs introduce?

LLMOps extends classical MLOps to handle foundation model scale, prompt-based configuration, non-deterministic outputs, and evaluation without a scalar ground truth. Key new concerns include prompt versioning, output quality evaluation via LLM judges or human review, hallucination monitoring, cost management, and RAG pipeline observability.

Optimization and Linear Programming — Business Analytics

Forecasting handed us a prediction — next month’s demand will be about so-many units. But a prediction isn’t an action. Knowing demand doesn’t tell you what to do when your machine-hours, your materials, and your budget are all finite and fought over. This lesson is the last rung of the analytics ladder: turning goals and limits into a precise best decision.

You run a small workshop. Product A earns $40 per unit. Product B earns $30 per unit. Every instinct says: make as many A’s as possible. By the end of this lesson you will see exactly why that instinct is wrong — and how a method called linear programming finds the true best mix in seconds.

Two kinds of analytics

Analytics is usually split into four types. Three of them look at what happened, why it happened, or what will happen. The fourth, prescriptive analytics (also called optimization), goes further: it tells you the best decision to make, not just what the outcome might be.

Optimization needs two ingredients:

An objective — the number you want to maximize (profit, throughput) or minimize (cost, waste).
Constraints — the real-world limits you must stay within (hours, materials, budget).

Without constraints the answer is trivial (“make infinite units”). Without an objective the answer is vague (“make more”). Together they define a real decision problem.

The product-mix problem

Here are the exact numbers for our workshop.

Decision variables (the quantities you control):

A = units of Product A to make
B = units of Product B to make

Objective function (maximize total profit): 40A + 30B

Constraints (the limits):

Labour:   2A + B  <= 100   (A needs 2 hrs, B needs 1 hr; 100 hrs available)
Material: A  + B  <=  80   (each product uses 1 unit; 80 units available)
Non-neg:  A >= 0,  B >= 0

This is an example of linear programming (LP) — optimization where every constraint and the objective are linear, meaning they graph as straight lines (or flat planes). “Linear” is the key word: no squares, no products of variables, just additions and multiplications by constants.
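The model above can be written down directly as two small functions. This is only a sketch; the names `profit` and `is_feasible` and the list of candidate mixes are chosen here for illustration.

```python
def profit(a, b):
    """Objective: total profit in dollars for a units of A and b units of B."""
    return 40 * a + 30 * b


def is_feasible(a, b):
    """True when the mix (a, b) respects every constraint."""
    labour_ok = 2 * a + b <= 100   # 100 labour-hours available
    material_ok = a + b <= 80      # 80 units of material available
    return labour_ok and material_ok and a >= 0 and b >= 0


# A few candidate mixes to try against the constraints.
candidates = [(50, 0), (50, 10), (30, 30), (0, 80)]
for a, b in candidates:
    status = "feasible" if is_feasible(a, b) else "infeasible"
    print(f"A={a}, B={b}: {status}, profit ${profit(a, b)}")
```

Note that both functions use only additions and multiplications by constants, which is exactly what makes the problem linear.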

The feasible region

Every point (A, B) that satisfies all four constraints at once belongs to the feasible region — the set of all mixes the workshop can legally produce. When you draw the two constraint lines on a graph, the feasible region is the area below both lines and above the axes.

For our problem the feasible region is a four-sided polygon (quadrilateral) with these corner points (also called vertices):

Corner	A	B
Origin	0	0
Labour limit only	50	0
Both limits cross	20	60
Material limit only	0	80

The crossing point deserves a close look. Where do 2A + B = 100 and A + B = 80 meet? Subtract the second from the first:

(2A + B) - (A + B) = 100 - 80
A = 20

Substitute back: 20 + B = 80, so B = 60. Confirmed: the two constraints cross at (20, 60).
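The same elimination can be checked mechanically. The sketch below solves the two boundary equations with Cramer's rule, using exact fractions to avoid rounding; the `intersect` helper is a name introduced here for illustration.

```python
from fractions import Fraction


def intersect(line1, line2):
    """Solve two equations of the form p*A + q*B = r.

    Each line is a tuple (p, q, r). Returns (A, B), or None when the
    lines are parallel and never cross at a single point.
    """
    p1, q1, r1 = line1
    p2, q2, r2 = line2
    det = p1 * q2 - p2 * q1
    if det == 0:
        return None
    a = Fraction(r1 * q2 - r2 * q1, det)
    b = Fraction(p1 * r2 - p2 * r1, det)
    return a, b


labour = (2, 1, 100)    # 2A + B = 100
material = (1, 1, 80)   # A + B = 80
point = intersect(labour, material)
print(point)
```

Here the crossing point comes out as A = 20 and B = 60, matching the hand calculation.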

The feasible region diagram

The shaded polygon is the feasible region. Every corner is labelled with its profit. The optimal corner (20, 60) at $2,600 is highlighted in green.

The corner-point theorem — why it works

Here is the most important theorem in LP, stated without jargon:

If an LP has an optimal solution, at least one optimal solution sits at a corner (vertex) of the feasible region.

Why? Imagine walking across the feasible region in the direction of increasing profit. Because the objective is linear, that direction is the same everywhere (here, toward more A and more B). You keep walking until the boundary stops you, and the last point you can reach is a corner. The one exception is a tie: if the profit line runs exactly parallel to an edge, every point on that edge is equally good, but the corners at its two ends are among those best points, so checking corners still finds the optimum.

This matters enormously in practice. Even this small LP has infinitely many feasible points, because any fractional mix inside the polygon is allowed. The theorem says you only need to check the corners: a finite, enumerable list.

Evaluating the corners

Corner `(A, B)`	Profit calculation	Profit
(0, 0)	`40(0) + 30(0)`	$0
(50, 0)	`40(50) + 30(0)`	$2,000
(0, 80)	`40(0) + 30(80)`	$2,400
(20, 60)	`40(20) + 30(60) = 800 + 1,800`	$2,600

The winner is (20, 60): make 20 units of A and 60 units of B for a profit of $2,600.
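The corner check above can be automated for small problems. The following sketch intersects every pair of constraint boundaries, keeps the crossings that are feasible, and picks the most profitable one. It is a brute-force illustration for two variables, not how production LP solvers work, and the names `CONSTRAINTS` and `corners` are introduced here.

```python
from fractions import Fraction
from itertools import combinations

# Each constraint is (p, q, r), meaning p*A + q*B <= r.
# Non-negativity is rewritten as -A <= 0 and -B <= 0.
CONSTRAINTS = [
    (2, 1, 100),   # labour
    (1, 1, 80),    # material
    (-1, 0, 0),    # A >= 0
    (0, -1, 0),    # B >= 0
]


def corners(constraints):
    """Return the feasible crossing points of all pairs of boundaries."""
    points = set()
    for (p1, q1, r1), (p2, q2, r2) in combinations(constraints, 2):
        det = p1 * q2 - p2 * q1
        if det == 0:
            continue  # parallel boundaries never cross at a single point
        a = Fraction(r1 * q2 - r2 * q1, det)
        b = Fraction(p1 * r2 - p2 * r1, det)
        if all(p * a + q * b <= r for p, q, r in constraints):
            points.add((a, b))
    return points


def profit(a, b):
    return 40 * a + 30 * b


vertices = corners(CONSTRAINTS)
best = max(vertices, key=lambda pt: profit(*pt))
print(sorted(vertices), best, profit(*best))
```

Infeasible crossings such as (0, 100) and (80, 0) are discarded by the feasibility filter, leaving the same four corners as the table.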

Notice that both constraints are binding at this corner — meaning both inequalities are satisfied with equality (2(20) + 60 = 100 exactly, and 20 + 60 = 80 exactly). A binding constraint is one that is tight; it is actually limiting you. A non-binding constraint has slack — you could do more but some other constraint stops you first.
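Slack is simply the unused amount of each resource, so it is easy to compute at any corner. A minimal sketch follows, with the helper names `slack` and `binding` chosen here for illustration.

```python
def slack(a, b):
    """Unused capacity on each resource for the mix (a, b)."""
    return {
        "labour": 100 - (2 * a + b),
        "material": 80 - (a + b),
    }


def binding(a, b):
    """Names of the constraints with zero slack at (a, b)."""
    return [name for name, unused in slack(a, b).items() if unused == 0]


for corner in [(50, 0), (0, 80), (20, 60)]:
    print(corner, slack(*corner), binding(*corner))
```

At (50, 0) only labour is tight and 30 units of material sit idle, whereas at (20, 60) both resources are fully used.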

What makes a constraint “binding”

A binding constraint is actively limiting your profit — remove or relax it and your optimal profit rises. In this problem, both constraints are binding at the optimum. If you could get just 10 more labour-hours (100 to 110), the feasible region would expand and you could earn more. Knowing which constraints bind tells you where to invest: buying extra capacity on a non-binding constraint wastes money; buying it on a binding one has direct payoff.
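One way to see the payoff from extra capacity is to re-solve the problem with a relaxed limit and compare profits. The sketch below does this for the workshop by re-checking the candidate corners; `best_profit` is a helper written for this example, and it is tailored to these two constraints rather than being a general solver.

```python
def best_profit(labour_hours=100, material_units=80):
    """Best profit for the workshop given the two capacity limits."""
    # Crossing of 2A + B = labour_hours and A + B = material_units,
    # found by subtracting the second equation from the first.
    a_cross = labour_hours - material_units
    b_cross = material_units - a_cross
    candidates = [
        (0, 0),
        (labour_hours / 2, 0),   # labour limit meets the A axis
        (material_units, 0),     # material limit meets the A axis
        (0, labour_hours),       # labour limit meets the B axis
        (0, material_units),     # material limit meets the B axis
        (a_cross, b_cross),
    ]
    feasible = [
        (a, b) for a, b in candidates
        if a >= 0 and b >= 0
        and 2 * a + b <= labour_hours
        and a + b <= material_units
    ]
    return max(40 * a + 30 * b for a, b in feasible)


base = best_profit()
more_labour = best_profit(labour_hours=110)
more_material = best_profit(material_units=90)
print(base, more_labour - base, more_material - base)
```

With these numbers, 10 extra labour-hours move the optimum to (30, 50) and add $100 of profit, while 10 extra units of material move it to (10, 80) and add $200, so each resource has its own marginal value.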

In one breath

Optimization (prescriptive analytics) finds the best decision, not just a prediction. It needs an objective to maximize or minimize and constraints you must stay within. When the objective and constraints are all linear, it’s a linear program: every legal mix forms the feasible region, and the corner-point theorem guarantees an optimum sits at one of its corners, so you only check a finite list of vertices. Here, making 20 of A and 60 of B earns $2,600, beating the “just make the higher-margin A” instinct by $600. A eats two labour-hours per unit, so it earns $20 per labour-hour while B earns $30; an all-A plan runs out of labour at 50 units and leaves 30 units of material idle. At the optimum both labour and material are binding. The lesson that generalises: headline margin misleads; the scarce, binding resources decide the mix and tell you exactly where extra capacity would pay off.

Practice

Quick check

Q1. The workshop in this lesson chooses (20, 60) instead of (50, 0). What is the primary reason?

Q2. The corner-point theorem says the optimal LP solution is always at a corner of the feasible region. Which statement best explains why?

Q3. Transfer question. A new health-and-safety rule now limits total output: A + B must be no more than 60 units (a third constraint). All other constraints remain. Without solving the full new LP, which corner of the ORIGINAL feasible region is most likely cut off, and what would you expect to happen to the maximum profit?

A question to carry forward

Look at what made this whole calculation possible: we knew the numbers. A earns exactly $40, B exactly $30, labour is capped at exactly 100 hours. Optimization is only as trustworthy as those inputs — feed it a fantasy and it confidently produces a fantasy answer.

So the question to carry forward is: where do reliable numbers actually come from? When someone proposes a change — a new checkout button that “converts better,” a new flow that “feels faster” — how do you know the improvement is real and not a flattering accident? The next lesson is A/B testing: the discipline of proving a change works with randomized evidence and a p-value, instead of trusting the gut feeling that produced the $40-margin guess in the first place.
