Text processing: sort, uniq, wc, cut
Master the Unix text-wrangling toolkit — sort, uniq, wc, cut, and tr — and learn to compose them into powerful data-analysis pipelines without leaving the terminal.
What you'll learn
- Count lines, words, and bytes with wc and slice columns with cut
- Sort streams numerically or lexicographically and count duplicates with uniq -c
- Build the canonical top-10-IPs one-liner by chaining cut, sort, uniq, and head in a single pipeline
Why a dedicated toolkit?
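Python can do all of this, but writing the script (import collections, loop over the lines, print the results) takes minutes, while a pipeline takes seconds to type. The tools below are small compiled programs that process their input as a stream, so most of them never hold the whole file in memory. sort is the exception: it must read every line before it can emit the first one, and implementations such as GNU sort spill to temporary files when the input is large. That makes the toolkit well suited to ad-hoc analysis of multi-gigabyte logs, and virtually every Unix-like system already ships with it.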
The design philosophy is the same as pipes: each tool does one thing well, and you chain them together with |.
wc — count things
wc (word count) reads standard input or a file and reports counts.
wc access.log
42318 338544 3209876 access.log
The three numbers are lines, words, and bytes. Use flags to ask for just one:
| Flag | Reports |
|---|---|
| -l | line count |
| -w | word count |
| -c | byte count |
wc -l access.log
42318 access.log
Pipe into wc -l to count the output of any command:
grep "POST" access.log | wc -l
8741
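To check the three counts on data small enough to verify by hand, the sketch below writes a three-line sample to a temporary file. The sample contents and variable names are invented for illustration. Reading through < keeps the filename out of the output, and tr -d ' ' strips the leading padding that some wc implementations print.

```shell
# Hypothetical three-line sample, written to a temporary file.
tmp=$(mktemp)
printf 'GET /a\nPOST /b\nGET /c\n' > "$tmp"

# Redirecting with < means wc prints only the number, not the filename.
# tr -d ' ' removes the leading padding that some implementations emit.
lines=$(wc -l < "$tmp" | tr -d ' ')
words=$(wc -w < "$tmp" | tr -d ' ')
bytes=$(wc -c < "$tmp" | tr -d ' ')

echo "lines=$lines words=$words bytes=$bytes"   # lines=3 words=6 bytes=22
rm -f "$tmp"
```

Capturing a bare number like this is handy when a script needs to compare or do arithmetic on the count.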
cut — extract columns
cut slices fields (columns) from each line. A field is a chunk of text separated by a delimiter — a character that marks where one column ends and the next begins.
In a typical Apache/Nginx access log, fields are space-separated and the IP address lives in field 1:
203.0.113.42 - frank [10/Jun/2025:09:32:41] "GET /api/v2/users HTTP/1.1" 200 1234
cut -d' ' -f1 access.log
203.0.113.42
198.51.100.7
203.0.113.42
10.0.0.5
...
Key flags:
| Flag | Meaning |
|---|---|
| -d' ' | use space as the delimiter |
| -f1 | extract field 1 |
| -f1,3 | extract fields 1 and 3 |
| -f2-4 | extract fields 2 through 4 |
| -c1-10 | extract characters 1–10 (no delimiter needed) |
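For a CSV file you would write -d',' instead. Note that cut splits on every comma, including commas inside quoted values, so this only works for simple CSV without quoted fields.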
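The field and character selectors from the table can be tried on a single record. This is a minimal sketch using a made-up CSV line (name, age, city); the variable names are illustrative only.

```shell
# Hypothetical CSV record: name,age,city
record='alice,30,london'

name_city=$(printf '%s\n' "$record" | cut -d',' -f1,3)   # fields 1 and 3
from_age=$(printf '%s\n' "$record" | cut -d',' -f2-)     # field 2 to the end
first_five=$(printf '%s\n' "$record" | cut -c1-5)        # characters 1 to 5

echo "$name_city"    # alice,london
echo "$from_age"     # 30,london
echo "$first_five"   # alice
```

When several fields are selected, cut joins them with the same delimiter it split on, which is why the first result still contains a comma.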
sort — order lines
sort reads lines and emits them in sorted order. By default the sort is lexicographic (dictionary order), which means 10 comes before 9 because '1' comes before '9' as characters.
sort -n # numeric sort (10 after 9)
sort -r # reverse (descending)
sort -nr # numeric, descending — very common
sort -k2 # sort by a key that starts at field 2 and runs to the end of the line; use -k2,2 to compare field 2 only
sort -u # sort AND remove duplicates (equivalent to sort | uniq)
Example — sort a list of numbers correctly:
printf '10\n2\n30\n1\n' | sort -n
1
2
10
30
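The difference between the default and numeric orderings, and the effect of restricting the key to one field, can be seen side by side. The sketch below uses invented input; LC_ALL=C is set on the default sort only to make the character ordering independent of the local collation settings.

```shell
# The same three numbers, sorted as text and then as numbers.
lexical=$(printf '10\n9\n2\n' | LC_ALL=C sort)
numeric=$(printf '10\n9\n2\n' | sort -n)

# Hypothetical "name score" rows, ordered by the second field only, numerically.
by_score=$(printf 'bob 2\nalice 10\ncarol 1\n' | sort -k2,2n)

printf 'default:\n%s\n' "$lexical"    # 10, 2, 9
printf 'numeric:\n%s\n' "$numeric"    # 2, 9, 10
printf 'by score:\n%s\n' "$by_score"  # carol 1, bob 2, alice 10
```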
uniq — collapse adjacent duplicates
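uniq removes adjacent duplicate lines. The word adjacent is the critical detail: two identical lines separated by a different line both survive, so you almost always sort the input first.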
sort access_ips.txt | uniq
Add -c to count occurrences — this is the key flag for frequency analysis:
sort access_ips.txt | uniq -c
5 10.0.0.5
312 198.51.100.7
1047 203.0.113.42
The count appears as a left-padded number before each unique line. Now you have the raw material for a frequency table.
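To see why the sort matters, the following sketch feeds the same made-up three-line input to uniq with and without sorting first. Because the width of the padding in front of each count differs between implementations, the last command passes the result through awk to print the count and the value separated by a single space.

```shell
# Hypothetical input: "a" appears twice, but not on adjacent lines.
input=$(printf 'a\nb\na\n')

# Without sorting, the two "a" lines are not adjacent, so nothing is removed.
unsorted=$(printf '%s\n' "$input" | uniq | wc -l | tr -d ' ')

# After sorting, the duplicates sit next to each other and collapse.
sorted=$(printf '%s\n' "$input" | sort | uniq | wc -l | tr -d ' ')

# Counts per value, with the padding normalized by awk.
counts=$(printf '%s\n' "$input" | sort | uniq -c | awk '{print $1, $2}')

echo "uniq alone:  $unsorted lines"   # 3
echo "sort | uniq: $sorted lines"     # 2
echo "$counts"                        # 2 a, then 1 b
```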
tr — translate or delete characters
tr is a lightweight character transformer. It reads stdin and replaces or deletes individual characters.
tr 'a-z' 'A-Z' # lowercase to uppercase
tr -d '\r' # delete carriage returns (Windows line endings)
tr -s ' ' # squeeze runs of spaces into one
It does not understand fields or delimiters — only single characters. Use cut or awk when you need column awareness.
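One place where tr and cut complement each other is input padded with runs of spaces. cut treats every single space as a delimiter, so a run of spaces produces empty fields. The sketch below, using an invented record, shows the problem and how squeezing the runs first avoids it.

```shell
# Hypothetical record padded with runs of spaces.
row='alice    30   london'

# Each space is its own delimiter, so field 2 is the empty string
# between the first and second spaces.
raw=$(printf '%s\n' "$row" | cut -d' ' -f2)

# Squeezing the runs first makes the fields line up.
squeezed=$(printf '%s\n' "$row" | tr -s ' ' | cut -d' ' -f2)

echo "without tr: '$raw'"        # without tr: ''
echo "with tr -s: '$squeezed'"   # with tr -s: '30'

# Deleting the carriage return from a Windows-style line ending.
clean=$(printf 'hello\r\n' | tr -d '\r')
echo "$clean"                    # hello
```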
head and tail — take top or bottom N lines
head -n 10 file.txt # first 10 lines (head -10 is a widely supported shorthand)
tail -n 10 file.txt # last 10 lines
tail -f app.log # follow a growing log in real time
Piped at the end of a sorted frequency list, head -10 gives you the top 10.
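A short sketch on invented five-line input shows both ends of a stream. It also uses tail -n +2, which starts output at line 2 and is a common way to skip a header row.

```shell
# Hypothetical five-line input with a header row.
data=$(printf 'name\nalice\nbob\ncarol\ndave')

first_two=$(printf '%s\n' "$data" | head -n 2)    # name, alice
last_two=$(printf '%s\n' "$data" | tail -n 2)     # carol, dave
no_header=$(printf '%s\n' "$data" | tail -n +2)   # everything from line 2 on

printf 'first two:\n%s\n' "$first_two"
printf 'last two:\n%s\n' "$last_two"
printf 'without header:\n%s\n' "$no_header"
```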
Putting it all together
Here is the canonical one-liner that answers the question "which 10 IP addresses sent the most requests?":
cut -d' ' -f1 access.log | sort | uniq -c | sort -nr | head -10
Stage-by-stage walkthrough
Stage 1 — extract IPs
cut -d' ' -f1 access.log
203.0.113.42
198.51.100.7
203.0.113.42
10.0.0.5
203.0.113.42
Stage 2 — sort so duplicates are adjacent
cut -d' ' -f1 access.log | sort
10.0.0.5
198.51.100.7
203.0.113.42
203.0.113.42
203.0.113.42
Stage 3 — count occurrences of each unique IP
cut -d' ' -f1 access.log | sort | uniq -c
1 10.0.0.5
1 198.51.100.7
3 203.0.113.42
Stage 4 — sort by count, descending
cut -d' ' -f1 access.log | sort | uniq -c | sort -nr
3 203.0.113.42
1 198.51.100.7
1 10.0.0.5
Stage 5 — take the top 10
cut -d' ' -f1 access.log | sort | uniq -c | sort -nr | head -10
On a real log with thousands of unique IPs, the output looks like:
18423 203.0.113.42
9211 198.51.100.7
4405 192.0.2.15
3871 198.51.100.22
2994 203.0.113.101
1847 192.0.2.88
1203 198.51.100.55
987 203.0.113.9
654 192.0.2.3
401 198.51.100.200
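To try the whole pipeline without a real server log, the sketch below builds a six-request sample in a temporary file and asks for the top 2 instead of the top 10. The log lines are fabricated for illustration, and the final awk stage is an addition that only normalizes the padding uniq -c puts in front of each count.

```shell
# Hypothetical six-request log; only the first field (the IP) matters here.
log=$(mktemp)
cat > "$log" <<'EOF'
203.0.113.42 - - [10/Jun/2025:09:32:41] "GET / HTTP/1.1" 200 512
198.51.100.7 - - [10/Jun/2025:09:32:42] "GET /a HTTP/1.1" 200 128
203.0.113.42 - - [10/Jun/2025:09:32:43] "POST /b HTTP/1.1" 201 64
10.0.0.5 - - [10/Jun/2025:09:32:44] "GET /c HTTP/1.1" 404 0
198.51.100.7 - - [10/Jun/2025:09:32:45] "GET /d HTTP/1.1" 200 256
203.0.113.42 - - [10/Jun/2025:09:32:46] "GET /e HTTP/1.1" 200 1024
EOF

# Same stages as the one-liner, limited to the top 2.
top=$(cut -d' ' -f1 "$log" | sort | uniq -c | sort -nr | head -n 2 | awk '{print $1, $2}')

echo "$top"
# 3 203.0.113.42
# 2 198.51.100.7
rm -f "$log"
```

The sample gives each IP a different request count, so the order of the result does not depend on how a particular sort implementation breaks ties.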
Pipeline diagram
Quick reference
| Tool | Core job | Key flags |
|---|---|---|
| wc | count lines / words / bytes | -l -w -c |
| cut | extract fields or characters | -d delimiter, -f fields, -c chars |
| sort | order lines | -n numeric, -r reverse, -k by field, -u unique |
| uniq | collapse adjacent duplicates | -c count, -d show only dupes |
| tr | translate / delete chars | -d delete, -s squeeze |
| head | first N lines | -n (default 10) |
| tail | last N lines | -n, -f follow |