datarekha

Text processing: sort, uniq, wc, cut

Master the Unix text-wrangling toolkit — sort, uniq, wc, cut, and tr — and learn to compose them into powerful data-analysis pipelines without leaving the terminal.

9 min read · Intermediate · Command Line · Lesson 8 of 14

What you'll learn

  • Count lines, words, and bytes with wc and slice columns with cut
  • Sort streams numerically or lexicographically, then collapse and count duplicates with uniq -c
  • Build the canonical top-10-IPs one-liner by chaining cut, sort, uniq, and head in a single pipeline

Why a dedicated toolkit?

Python can do all of this, but writing the script (importing collections, writing a loop, printing the results) takes minutes, and every run pays the interpreter's start-up cost. The tools below are compiled C programs that work line by line. Most of them stream their input without loading it all into memory; sort, which has to see every line before it can emit the first one, typically spills to temporary files instead. For ad-hoc log analysis on multi-gigabyte files they are hard to beat, and practically every Unix-like system already ships them.

The design philosophy is the same as pipes: each tool does one thing well, and you chain them together with |.


wc — count things

wc (word count) reads standard input or a file and reports counts.

wc access.log
  42318  338544 3209876 access.log

The three numbers are lines, words, and bytes. Use flags to ask for just one:

Flag  Reports
-l    line count
-w    word count
-c    byte count

wc -l access.log
42318 access.log

Pipe into wc -l to count the output of any command:

grep "POST" access.log | wc -l
8741
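
To try these flags without a real log, here is a minimal sketch that builds a small sample file first. The three-line sample and the variable names are assumptions made for illustration; they are not part of the lesson's access.log.

```shell
# Hypothetical three-line sample standing in for access.log.
tmp=$(mktemp)
printf 'GET /a\nPOST /b\nGET /c\n' > "$tmp"

# Reading from stdin (note the <) makes wc omit the file name.
# tr -d ' ' strips the padding that some wc implementations add.
lines=$(wc -l < "$tmp" | tr -d ' ')
words=$(wc -w < "$tmp" | tr -d ' ')
bytes=$(wc -c < "$tmp" | tr -d ' ')
echo "lines=$lines words=$words bytes=$bytes"

# Count only the matching lines by piping grep into wc -l.
posts=$(grep "POST" "$tmp" | wc -l | tr -d ' ')
echo "POST requests: $posts"

rm -f "$tmp"
```

Redirecting the file into wc is handy in scripts, because the result is a bare number rather than a number followed by a file name.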

cut — extract columns

cut slices fields (columns) from each line. A field is a chunk of text separated by a delimiter — a character that marks where one column ends and the next begins.

In a typical Apache/Nginx access log, fields are space-separated and the IP address lives in field 1:

203.0.113.42 - frank [10/Jun/2025:09:32:41] "GET /api/v2/users HTTP/1.1" 200 1234

cut -d' ' -f1 access.log
203.0.113.42
198.51.100.7
203.0.113.42
10.0.0.5
...

Key flags:

Flag     Meaning
-d' '    use space as the delimiter
-f1      extract field 1
-f1,3    extract fields 1 and 3
-f2-4    extract fields 2 through 4
-c1-10   extract characters 1–10 (no delimiter needed)

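For a CSV file you would write -d',' instead. cut splits on every comma, so this only works for simple CSV in which no quoted field contains a comma. If you omit -d entirely, cut splits on tabs.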
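
The flags above can be tried on the sample log line directly. This is a sketch under one assumption: the field numbers (8 for the status code, 6 for the request path) hold only for a line shaped exactly like the sample, and a log with a timezone inside the brackets would shift them. The variable names and the tiny CSV are made up for the example.

```shell
# The sample log line from this section, held in a variable.
line='203.0.113.42 - frank [10/Jun/2025:09:32:41] "GET /api/v2/users HTTP/1.1" 200 1234'

# Fields 1 and 8: client IP and status code (positions assume this exact layout).
ip_status=$(printf '%s\n' "$line" | cut -d' ' -f1,8)
# Field 6: the request path.
req_path=$(printf '%s\n' "$line" | cut -d' ' -f6)
# Characters 1-7, no delimiter involved.
first7=$(printf '%s\n' "$line" | cut -c1-7)

echo "$ip_status"
echo "$req_path"
echo "$first7"

# Simple CSV (no quoted commas): keep columns 1 and 3.
csv_cols=$(printf 'id,name,city\n1,Ada,London\n' | cut -d',' -f1,3)
echo "$csv_cols"
```

Note that cut keeps the original delimiter between the fields it outputs, so the CSV result is still comma-separated.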


sort — order lines

sort reads lines and emits them in sorted order. By default the sort is lexicographic (dictionary order), which means 10 comes before 9 because '1' comes before '9' as characters.

sort -n   # numeric sort (10 after 9)
sort -r   # reverse (descending)
sort -nr  # numeric, descending — very common
sort -k2  # sort by a key that starts at field 2 and runs to the end of the line (use -k2,2 for field 2 alone)
sort -u   # sort AND remove duplicates (equivalent to sort | uniq)

Example — sort a list of numbers correctly:

printf '10\n2\n30\n1\n' | sort -n
1
2
10
30
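
The difference between the modes is easiest to see side by side. The following sketch uses made-up input; LC_ALL=C is set on the lexicographic case as an assumption to keep the ordering independent of the machine's locale.

```shell
# Lexicographic: compares character by character, so "10" sorts before "2".
lexical=$(printf '10\n9\n2\n' | LC_ALL=C sort)

# Numeric: compares values, so 2 < 9 < 10.
numeric=$(printf '10\n9\n2\n' | sort -n)

# Sort on field 2 only (-k2,2), numerically (the trailing n).
by_field=$(printf 'b 10\na 9\nc 2\n' | sort -k2,2n)

# -u sorts and drops repeated lines in one step.
unique=$(printf 'b\na\nb\n' | sort -u)

printf '%s\n' "$lexical" "--" "$numeric" "--" "$by_field" "--" "$unique"
```

Writing the key as -k2,2 rather than -k2 stops the key at the end of field 2, which matters as soon as lines have more than two columns.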

uniq — collapse adjacent duplicates

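uniq removes adjacent duplicate lines. The word adjacent is the critical detail: uniq only compares each line with the one immediately before it, so duplicates scattered through unsorted input survive. That is why uniq is almost always preceded by sort.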

sort access_ips.txt | uniq

Add -c to count occurrences — this is the key flag for frequency analysis:

sort access_ips.txt | uniq -c
      5 10.0.0.5
    312 198.51.100.7
   1047 203.0.113.42

The count appears as a left-padded number before each unique line. Now you have the raw material for a frequency table.
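
To see why the input has to be sorted first, compare uniq -c with and without sort on the same data. This is an illustrative sketch: the four-line IP list is invented for the example, and the exact width of the count padding varies between implementations.

```shell
# Hypothetical input where the repeated IP is NOT adjacent everywhere.
ips='10.0.0.5
198.51.100.7
10.0.0.5
10.0.0.5'

# Without sort: the first 10.0.0.5 is separated from the others,
# so it is reported as its own group.
unsorted=$(printf '%s\n' "$ips" | uniq -c)

# With sort: identical lines sit next to each other and are counted together.
sorted=$(printf '%s\n' "$ips" | sort | uniq -c)

# -d prints only the lines that occur more than once.
dupes=$(printf '%s\n' "$ips" | sort | uniq -d)

echo "unsorted:"; echo "$unsorted"
echo "sorted:";   echo "$sorted"
echo "dupes:";    echo "$dupes"
```

The unsorted run reports three groups, the sorted run two, which is the symptom to look for when counts come out suspiciously low.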


tr — translate or delete characters

tr is a lightweight character transformer. It reads stdin and replaces or deletes individual characters.

tr 'a-z' 'A-Z'          # lowercase to uppercase
tr -d '\r'               # delete carriage returns (Windows line endings)
tr -s ' '                # squeeze runs of spaces into one

It does not understand fields or delimiters — only single characters. Use cut or awk when you need column awareness.
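
A common pairing is tr -s in front of cut, because cut treats every single space as a delimiter and sees empty fields between repeated spaces. The sketch below uses invented input to show that, along with the other two tr forms.

```shell
# Uppercase every lowercase letter.
upper=$(printf 'hello\n' | tr 'a-z' 'A-Z')

# Strip the carriage return from a Windows-style line ending.
clean=$(printf 'hello\r\n' | tr -d '\r')

# Three spaces between columns: cut sees empty fields, so field 2 is empty.
naive=$(printf 'alice   42\n' | cut -d' ' -f2)

# Squeeze the run of spaces first, then field 2 is the value we wanted.
squeezed=$(printf 'alice   42\n' | tr -s ' ' | cut -d' ' -f2)

echo "upper=$upper clean=$clean naive=[$naive] squeezed=[$squeezed]"
```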


head and tail — take top or bottom N lines

head -10 file.txt    # first 10 lines
tail -10 file.txt    # last 10 lines
tail -f app.log      # follow a growing log in real time

Piped at the end of a sorted frequency list, head -10 gives you the top 10.
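
As a quick illustration on made-up input, the sketch below uses the -n spelling of the line count, which is equivalent to the shorthand above. It also shows tail -n +2, which starts output at line 2 and is a convenient way to skip a header row.

```shell
# First two and last two lines of a five-line stream.
first2=$(printf '%s\n' a b c d e | head -n 2)
last2=$(printf '%s\n' a b c d e | tail -n 2)

# tail -n +2 prints from line 2 onward, dropping the header.
no_header=$(printf 'id,name\n1,Ada\n2,Bob\n' | tail -n +2)

printf '%s\n' "$first2" "--" "$last2" "--" "$no_header"
```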


Putting it all together

Here is the canonical one-liner that finds the ten IP addresses responsible for the most requests in an access log:

cut -d' ' -f1 access.log | sort | uniq -c | sort -nr | head -10

Stage-by-stage walkthrough

Stage 1 — extract IPs

cut -d' ' -f1 access.log
203.0.113.42
198.51.100.7
203.0.113.42
10.0.0.5
203.0.113.42

Stage 2 — sort so duplicates are adjacent

cut -d' ' -f1 access.log | sort
10.0.0.5
198.51.100.7
203.0.113.42
203.0.113.42
203.0.113.42

Stage 3 — count how many times each IP appears

cut -d' ' -f1 access.log | sort | uniq -c
      1 10.0.0.5
      1 198.51.100.7
      3 203.0.113.42

Stage 4 — sort by count, descending

cut -d' ' -f1 access.log | sort | uniq -c | sort -nr
      3 203.0.113.42
      1 198.51.100.7
      1 10.0.0.5

Stage 5 — take the top 10

cut -d' ' -f1 access.log | sort | uniq -c | sort -nr | head -10

On a real log with thousands of unique IPs, the output looks like:

  18423 203.0.113.42
   9211 198.51.100.7
   4405 192.0.2.15
   3871 198.51.100.22
   2994 203.0.113.101
   1847 192.0.2.88
   1203 198.51.100.55
    987 203.0.113.9
    654 192.0.2.3
    401 198.51.100.200
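
To run the whole pipeline without a production log, the following sketch generates the five-request sample from the walkthrough into a temporary file. The log lines are synthetic (only the IPs come from the walkthrough), and the file and variable names are assumptions for the example.

```shell
# Build a synthetic five-line log; printf repeats the format once per IP.
log=$(mktemp)
printf '%s - - [10/Jun/2025:09:32:41] "GET / HTTP/1.1" 200 512\n' \
  203.0.113.42 198.51.100.7 203.0.113.42 10.0.0.5 203.0.113.42 > "$log"

# Extract IPs, group them, count each group, rank by count, keep the top ten.
top=$(cut -d' ' -f1 "$log" | sort | uniq -c | sort -nr | head -10)
echo "$top"

rm -f "$log"
```

With only three distinct IPs the output has three lines, led by 203.0.113.42 with a count of 3. The order of IPs that tie on count depends on how sort breaks ties.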

Pipeline diagram

cut -d' ' -f1 (extract IPs) -> sort (group dupes) -> uniq -c (count each) -> sort -nr (rank desc.) -> head -10 (top ten) -> ranked IPs (final output)
The five-stage pipeline that turns a raw access log into a ranked frequency table.

Quick reference

Tool   Core job                        Key flags
wc     count lines / words / bytes     -l -w -c
cut    extract fields or characters    -d delimiter, -f fields, -c chars
sort   order lines                     -n numeric, -r reverse, -k by field, -u unique
uniq   collapse adjacent duplicates    -c count, -d show only dupes
tr     translate / delete chars        -d delete, -s squeeze
head   first N lines                   -n (default 10)
tail   last N lines                    -n, -f follow

Quick check

Q1. You run `uniq -c` on a log file and every count is 1, even though you know many IPs appear repeatedly. What is most likely wrong?
Q2. Which command extracts the third and fifth comma-separated fields from each line of `data.csv`?
Q3. You have a TSV file where column 2 is a response-time float. You want the 5 slowest requests. Which pipeline is correct?
