Text processing: sort, uniq, wc, cut

The last lesson ended on this exact promise: the pipe’s real power shows only when each stage transforms the stream, and a question like “top ten IPs” becomes a short assembly line of tiny tools. This lesson covers those tools (wc, cut, sort, uniq, tr, head, and tail) and ends by building that very one-liner, stage by stage.

Why a dedicated toolkit?

Python can do all of this, but writing the script (importing collections, writing a loop, printing results) takes minutes, while the equivalent pipeline takes seconds to type. The tools below are small compiled programs, and most of them stream their input line by line instead of loading it all into memory. sort is the exception, since it must read every line before it can emit the first. For ad-hoc log analysis on multi-gigabyte files they are hard to beat, and every Unix-like system already has them.

The design philosophy is the same as pipes: each tool does one thing well, and you chain them together with |.

`wc` — count things

wc (word count) reads standard input or a file and reports counts.

wc access.log

  42318  338544 3209876 access.log

The three numbers are lines, words, and bytes. Use flags to ask for just one:

Flag	Reports
`-l`	line count
`-w`	word count
`-c`	byte count

wc -l access.log

42318 access.log

Pipe into wc -l to count the output of any command:

grep "POST" access.log | wc -l
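As a small sketch of counting a file versus counting a stream, the snippet below builds a three-line sample log (the contents are made up for illustration) and counts its POST requests. When wc reads from a redirect or a pipe there is no filename to print, so the output is just the number, which makes it easy to capture in a variable.

```shell
# Hypothetical sample data standing in for access.log
log=$(mktemp)
cat > "$log" <<'EOF'
203.0.113.42 - - [10/Jun/2025:09:32:41] "GET /index.html HTTP/1.1" 200 512
198.51.100.7 - - [10/Jun/2025:09:32:45] "POST /api/v2/users HTTP/1.1" 201 87
203.0.113.42 - - [10/Jun/2025:09:33:02] "POST /api/v2/login HTTP/1.1" 200 64
EOF

total=$(wc -l < "$log")               # input redirect: wc prints only the number
posts=$(grep "POST" "$log" | wc -l)   # count the lines that survive the filter

echo "total lines: $total"
echo "POST lines:  $posts"
rm -f "$log"
```

Some wc implementations pad the number with leading spaces, so compare it numerically rather than as a string if you script around it.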

`cut` — extract columns

cut slices fields (columns) from each line. A field is a chunk of text separated by a delimiter — a character that marks where one column ends and the next begins.

In a typical Apache/Nginx access log, fields are space-separated and the IP address lives in field 1:

203.0.113.42 - frank [10/Jun/2025:09:32:41] "GET /api/v2/users HTTP/1.1" 200 1234

cut -d' ' -f1 access.log

203.0.113.42
198.51.100.7
203.0.113.42
10.0.0.5
...

Key flags:

Flag	Meaning
`-d' '`	use space as the delimiter
`-f1`	extract field 1
`-f1,3`	extract fields 1 and 3
`-f2-4`	extract fields 2 through 4
`-c1-10`	extract characters 1–10 (no delimiter needed)

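For a simple CSV file you would write -d',' instead. cut has no notion of CSV quoting, so a quoted field that itself contains a comma will be split in the wrong place.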
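The flags above can be tried on a few inline rows without touching a real file. This is a minimal sketch; the CSV content is invented for the example and deliberately contains no quoted commas.

```shell
# Made-up CSV rows for illustration; no field contains a quoted comma
csv='id,name,country
1,Ada,GB
2,Linus,FI'

# Keep columns 1 and 3; cut always emits fields in their original order
picked=$(printf '%s\n' "$csv" | cut -d',' -f1,3)
printf '%s\n' "$picked"
# id,country
# 1,GB
# 2,FI

# Character mode: the first four characters of each line, no delimiter involved
prefix=$(printf '%s\n' "$csv" | cut -c1-4)
printf '%s\n' "$prefix"
```

Because cut keeps the input order, -f3,1 produces the same output as -f1,3. Reordering columns needs a different tool such as awk.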

`sort` — order lines

sort reads lines and emits them in sorted order. By default the sort is lexicographic (dictionary order), which means 10 comes before 9 because '1' comes before '9' as characters.

sort -n     # numeric sort (10 after 9)
sort -r     # reverse (descending)
sort -nr    # numeric, descending; very common
sort -k2    # sort by the text from field 2 to the end of the line
sort -k2,2  # sort by the second whitespace-delimited field only
sort -u     # sort AND remove duplicates (same output as sort | uniq when no key options are used)

Example — sort a list of numbers correctly:

printf '10\n2\n30\n1\n' | sort -n
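To see the difference between text order and numeric order side by side, and to try sorting on a single field, here is a short sketch. The "name score" rows are hypothetical data, and LC_ALL=C is set on the text sort only to make the character-by-character comparison independent of the machine's locale.

```shell
nums='10
9
100'

# Default comparison is character by character: "10" < "100" < "9"
as_text=$(printf '%s\n' "$nums" | LC_ALL=C sort)
# -n compares numeric values: 9 < 10 < 100
as_numbers=$(printf '%s\n' "$nums" | sort -n)

# Hypothetical "name score" rows: rank by the second field only, highest first
rows='alice 30
bob 4
carol 12'
ranked=$(printf '%s\n' "$rows" | sort -k2,2nr)

printf '%s\n' "text order:" "$as_text" "numeric order:" "$as_numbers" "ranked:" "$ranked"
```

The n and r letters attached to -k2,2 apply to that key alone, which matters once a command sorts on more than one field.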

`uniq` — collapse adjacent duplicates

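uniq removes adjacent duplicate lines. The word adjacent is the critical detail: uniq compares each line only with the one before it, so repeats scattered through the input are not collapsed. Sorting first brings identical lines together, which is why uniq almost always follows sort.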

sort access_ips.txt | uniq

Add -c to count occurrences — this is the key flag for frequency analysis:

sort access_ips.txt | uniq -c

      5 10.0.0.5
    312 198.51.100.7
   1047 203.0.113.42

The count appears as a left-padded number before each unique line. Now you have the raw material for a frequency table.
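The "adjacent" rule is easiest to believe after watching it fail. The sketch below feeds the same three made-up lines to uniq -c twice, once as they are and once sorted.

```shell
ips='10.0.0.5
203.0.113.42
10.0.0.5'

# Without sort, the two 10.0.0.5 lines are not neighbours, so uniq keeps both
unsorted=$(printf '%s\n' "$ips" | uniq -c)
# With sort, they sit together and collapse into one line with a count of 2
sorted=$(printf '%s\n' "$ips" | sort | uniq -c)

echo "unsorted:"; printf '%s\n' "$unsorted"
echo "sorted:";   printf '%s\n' "$sorted"
```

The unsorted run prints three lines, each with a count of 1, while the sorted run prints two lines. The exact width of the padding in front of each count differs between implementations, so scripts should not rely on it.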

`tr` — translate or delete characters

tr is a lightweight character transformer. It reads stdin and replaces or deletes individual characters.

tr 'a-z' 'A-Z'          # lowercase to uppercase
tr -d '\r'               # delete carriage returns (Windows line endings)
tr -s ' '                # squeeze runs of spaces into one

It does not understand fields or delimiters — only single characters. Use cut or awk when you need column awareness.
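One place where tr and cut work well together is input padded with runs of spaces. cut treats every single space as a delimiter, so consecutive spaces produce empty fields; squeezing them first restores the columns you expect. A minimal sketch with an invented input line:

```shell
line='alice   42    ok'   # made-up line with runs of spaces between columns

# Field 2 is the empty string between the first and second space
raw=$(printf '%s\n' "$line" | cut -d' ' -f2)
# After squeezing, field 2 is the value we actually wanted
squeezed=$(printf '%s\n' "$line" | tr -s ' ' | cut -d' ' -f2)

# Strip a Windows carriage return, then upper-case what is left
clean=$(printf 'get /index\r\n' | tr -d '\r' | tr 'a-z' 'A-Z')

echo "raw=[$raw] squeezed=[$squeezed] clean=[$clean]"
```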

`head` and `tail` — take top or bottom N lines

head -n 10 file.txt  # first 10 lines (head -10 is a common shorthand)
tail -n 10 file.txt  # last 10 lines
tail -f app.log      # follow a growing log in real time

Piped at the end of a sorted frequency list, head -10 gives you the top 10.
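A quick sketch of both tools on five invented lines. It also shows tail -n +2, a form not covered above that starts output at line 2 and is a handy way to skip a header row before sorting or counting.

```shell
lines=$(printf '%s\n' header row1 row2 row3 row4)

first_two=$(printf '%s\n' "$lines" | head -n 2)    # header, row1
last_two=$(printf '%s\n' "$lines" | tail -n 2)     # row3, row4
no_header=$(printf '%s\n' "$lines" | tail -n +2)   # everything from line 2 onward

printf '%s\n' "first two:" "$first_two" "last two:" "$last_two" "no header:" "$no_header"
```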

Putting it all together

Here is the canonical one-liner that answers the opening question:

cut -d' ' -f1 access.log | sort | uniq -c | sort -nr | head -10

Stage-by-stage walkthrough

Stage 1 — extract IPs

cut -d' ' -f1 access.log

203.0.113.42
198.51.100.7
203.0.113.42
10.0.0.5
203.0.113.42

Stage 2 — sort so duplicates are adjacent

cut -d' ' -f1 access.log | sort

10.0.0.5
198.51.100.7
203.0.113.42
203.0.113.42
203.0.113.42

Stage 3 — count occurrences of each IP

cut -d' ' -f1 access.log | sort | uniq -c

      1 10.0.0.5
      1 198.51.100.7
      3 203.0.113.42

Stage 4 — sort by count, descending

cut -d' ' -f1 access.log | sort | uniq -c | sort -nr

      3 203.0.113.42
      1 198.51.100.7
      1 10.0.0.5

Stage 5 — take the top 10

cut -d' ' -f1 access.log | sort | uniq -c | sort -nr | head -10

On a real log with thousands of unique IPs, the output looks like:

  18423 203.0.113.42
   9211 198.51.100.7
   4405 192.0.2.15
   3871 198.51.100.22
   2994 203.0.113.101
   1847 192.0.2.88
   1203 198.51.100.55
    987 203.0.113.9
    654 192.0.2.3
    401 198.51.100.200
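If you run this pipeline often, one option is to wrap it in a small shell function. The sketch below is only an illustration: the name top_ips and its arguments are hypothetical, and the sample log repeats the five requests from the walkthrough with made-up request details.

```shell
# Hypothetical helper: top_ips FILE [N]
# Prints the N most frequent first fields of FILE (default 10), highest first.
top_ips() {
  cut -d' ' -f1 "$1" | sort | uniq -c | sort -nr | head -n "${2:-10}"
}

# Sample log matching the five lines used in the stage-by-stage walkthrough
log=$(mktemp)
cat > "$log" <<'EOF'
203.0.113.42 - - [10/Jun/2025:09:32:41] "GET / HTTP/1.1" 200 512
198.51.100.7 - - [10/Jun/2025:09:32:45] "GET /about HTTP/1.1" 200 734
203.0.113.42 - - [10/Jun/2025:09:33:02] "POST /login HTTP/1.1" 200 64
10.0.0.5 - - [10/Jun/2025:09:33:10] "GET /health HTTP/1.1" 200 2
203.0.113.42 - - [10/Jun/2025:09:33:15] "GET /api/v2/users HTTP/1.1" 200 1234
EOF

ranking=$(top_ips "$log" 2)
printf '%s\n' "$ranking"
rm -f "$log"
```

The first line of the output is the busiest address, 203.0.113.42 with a count of 3. The order of rows whose counts tie depends on how sort breaks ties, so treat it as arbitrary unless you add a second sort key.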

Pipeline diagram

Figure: the five-stage pipeline that turns a raw access log into a ranked frequency table: cut → sort → uniq -c → sort -nr → head.

Quick reference

Tool	Core job	Key flags
`wc`	count lines / words / bytes	`-l` `-w` `-c`
`cut`	extract fields or characters	`-d` delimiter, `-f` fields, `-c` chars
`sort`	order lines	`-n` numeric, `-r` reverse, `-k` by field, `-u` unique
`uniq`	collapse adjacent duplicates	`-c` count, `-d` show only dupes
`tr`	translate / delete chars	`-d` delete, `-s` squeeze
`head`	first N lines	`-n` (default 10)
`tail`	last N lines	`-n`, `-f` follow

In one breath

These are the stream-transforming building blocks, each doing one job so pipes can chain them. wc counts (-l lines, -w words, -c bytes). cut slices columns — -d sets the delimiter, -f picks fields (-c picks character positions instead). sort orders lines, and -n (numeric) is the flag people forget — without it 10 sorts before 9 as text; pair with -r for descending, -k to sort by a field. uniq collapses adjacent duplicates, so it is almost always sort | uniq -c for a frequency count (unsorted, it misses scattered repeats). tr swaps or deletes single characters. Chain them and the canonical cut | sort | uniq -c | sort -nr | head turns a raw log into a ranked leaderboard.

Practice

Before the quiz, build a pipeline from scratch. Given a CSV orders.csv whose 3rd column is a country code, produce the five countries with the most orders, highest first. Write the full pipe. Then catch the classic bug: if you wrote uniq -c without a sort in front of it, what would the counts look like, and why?

Quick check


Q1. You run `uniq -c` on a log file and every count is 1, even though you know many IPs appear repeatedly. What is most likely wrong?

Q2. Which command extracts the third and fifth comma-separated fields from each line of `data.csv`?

Q3. You have a TSV file where column 2 is a response-time float. You want the 5 slowest requests. Which pipeline is correct?

A question to carry forward

You are now genuinely fluent at wrangling text — finding it, filtering it, slicing and ranking it. But notice that the shell was never only about data. It is how you operate a whole machine, and the instant you step onto a real server a wall appears that has nothing to do with text: you try to read a system log and get Permission denied; you write a deploy script and it refuses to run. Every file on a Unix system carries an owner and a compact permission code deciding exactly who may read it, change it, or execute it — and a few actions demand powers your normal account simply does not have. Who is allowed to touch what, and how you safely borrow elevated power when you truly need it, is where the next chapter — Power Tools — begins.
