You have a CSV file with thousands of rows and need to extract specific columns, calculate totals, or filter data. You could write a Python script, but awk can do it in one elegant line. Let me show you why awk is the secret weapon for data extraction.
The Problem
You have a server log file where each line holds a date, time, IP address, status code, and response time. You need to find the average response time for requests that returned status 200.
# Sample log format:
2025-10-24 10:23:45 192.168.1.100 200 0.234
2025-10-24 10:23:46 192.168.1.101 404 0.012
2025-10-24 10:23:47 192.168.1.102 200 0.445
Writing a full script seems excessive for this simple task.
The Hack: awk for Data Extraction
awk processes text line by line and treats each line as fields separated by whitespace (or any delimiter you specify):
awk '$4 == 200 {sum += $5; count++} END {print sum/count}' access.log
One line. Instant result. No external dependencies.
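Breaking that down: an awk program is a series of pattern { action } pairs applied to every input line, plus an optional END block that runs once after the last line. Here is the same one-liner spread out with comments:
awk '
  $4 == 200 { sum += $5; count++ }   # lines whose 4th field is 200: accumulate field 5, count the line
  END       { print sum / count }    # after the last line, print the average response time
' access.log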
Understanding awk Basics
awk automatically splits each line into fields and exposes a few built-in variables (a quick demo follows the list):
- $0 – entire line
- $1 – first field
- $2 – second field
- $NF – last field
- NR – current line number
- NF – number of fields in current line
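To see these in action, print them for every line of the sample log from above:
# Show line number, field count, first field, and last field for each line
awk '{print "line " NR ": " NF " fields, first=" $1 ", last=" $NF}' access.log
# line 1: 5 fields, first=2025-10-24, last=0.234
# line 2: 5 fields, first=2025-10-24, last=0.012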
Essential awk One-Liners
Print Specific Columns
# Print first and third columns
awk '{print $1, $3}' file.txt
# Print last column
awk '{print $NF}' file.txt
# Print all except first column
awk '{$1=""; print $0}' file.txt
# Swap first two columns
awk '{print $2, $1}' file.txt
Filtering Data
# Lines where column 3 > 100
awk '$3 > 100' data.txt
# Lines containing "ERROR"
awk '/ERROR/' log.txt
# Lines where status is 200 or 201
awk '$4 == 200 || $4 == 201' access.log
# Lines NOT matching pattern
awk '!/DEBUG/' log.txt
# Numeric comparison
awk '$5 >= 0.5' response_times.txt
Calculations and Aggregations
# Sum of column 2
awk '{sum += $2} END {print sum}' numbers.txt
# Average
awk '{sum += $1; count++} END {print sum/count}' values.txt
# Count lines
awk 'END {print NR}' file.txt
# Min and max
awk 'NR==1 {min=$1; max=$1} {if ($1 < min) min=$1; if ($1 > max) max=$1} END {print "Min:", min, "Max:", max}' data.txt
# Count occurrences
awk '{count[$1]++} END {for (i in count) print i, count[i]}' data.txt
Working with CSV Files
Comma-Separated Values
# Set field separator to comma
awk -F, '{print $1, $3}' data.csv
# Print second column of CSV
awk -F',' '{print $2}' users.csv
# Handle CSV with spaces
awk -F', *' '{print $1}' messy.csv
# Convert CSV to TSV
awk -F',' '{print $1"\t"$2"\t"$3}' data.csv
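One caveat: a plain -F',' split breaks on quoted fields that themselves contain commas (e.g. "Doe, Jane",42). If you use GNU awk (gawk), its FPAT variable describes what a field looks like instead of what separates fields; here is a minimal sketch (quoted.csv is just a placeholder file name):
# gawk only: a field is either a run of non-commas or a double-quoted string
gawk 'BEGIN {FPAT = "([^,]+)|(\"[^\"]+\")"} {print $2}' quoted.csv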
Skip Header Row
# Skip first line
awk 'NR > 1' data.csv
# Process only data rows
awk -F',' 'NR > 1 {sum += $3} END {print sum}' sales.csv
Advanced Patterns
Multiple Conditions
# AND condition
awk '$3 > 100 && $4 == "ACTIVE"' data.txt
# OR condition
awk '$1 == "ERROR" || $1 == "CRITICAL"' log.txt
# Range of lines
awk 'NR >= 10 && NR <= 20' file.txt
# Pattern range
awk '/START/,/END/' file.txt
String Operations
# String length
awk '{print length($1)}' words.txt
# Substring
awk '{print substr($1, 1, 5)}' data.txt
# Concatenation
awk '{print $1 "-" $2}' file.txt
# To uppercase
awk '{print toupper($0)}' file.txt
# To lowercase
awk '{print tolower($0)}' file.txt
Custom Output Formatting
# Format numbers
awk '{printf "%.2f\n", $1}' decimals.txt
# Aligned columns
awk '{printf "%-20s %10d\n", $1, $2}' data.txt
# Add prefix/suffix
awk '{print "User: " $1}' users.txt
# Custom field separator for output
awk 'BEGIN{OFS="|"} {print $1, $2, $3}' data.csv
Real-World Use Cases
Log Analysis
# Count errors by type
awk '/ERROR/ {count[$3]++} END {for (type in count) print type, count[type]}' app.log
# Find slowest requests
awk '$5 > 1.0 {print $1, $2, $5}' access.log | sort -k3 -nr | head -10
# Traffic by hour
awk '{print substr($2, 1, 2)}' access.log | sort | uniq -c
# Error rate percentage
awk '{total++; if ($4 >= 400) errors++} END {print errors/total*100 "%"}' access.log
System Monitoring
# Memory usage by process
ps aux | awk 'NR > 1 {print $11, $4}' | sort -k2 -nr | head -10
# Total used disk space across filesystems (KB)
df -k | awk 'NR > 1 {sum += $3} END {print sum " KB"}'
# Network connections by state
netstat -an | awk '/^tcp/ {state[$6]++} END {for (s in state) print s, state[s]}'
# CPU usage over threshold
top -bn1 | awk '$9 > 50 {print $12, $9}'
Database Query Results
# Calculate total sales
mysql -e "SELECT product, amount FROM sales" | awk 'NR > 1 {sum += $2} END {print sum}'
# Group by category
mysql -e "SELECT category, price FROM products" | awk 'NR > 1 {sum[$1] += $2} END {for (cat in sum) print cat, sum[cat]}'Data Validation
# Flag rows whose first column is not a valid-looking email
awk -F',' '$1 !~ /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/ {print NR, $0}' emails.csv
# Detect duplicate IDs
awk -F',' '{if (seen[$1]++) print "Duplicate ID:", $1}' users.csv
# Find missing fields
awk -F',' 'NF != 5 {print "Line", NR, "has", NF, "fields"}' data.csv
Complex Example: Web Server Analysis
#!/bin/bash
# Analyze Apache access log
analyze_log() {
    local logfile=$1
    
    echo "=== Web Server Analysis ==="
    echo
    
    # Total requests
    echo "Total Requests:"
    awk 'END {print NR}' "$logfile"
    echo
    
    # Requests by status code
    echo "Requests by Status:"
    awk '{print $9}' "$logfile" | sort | uniq -c | sort -rn
    echo
    
    # Top 10 IPs
    echo "Top 10 IP Addresses:"
    awk '{print $1}' "$logfile" | sort | uniq -c | sort -rn | head -10
    echo
    
    # Top 10 requested URLs
    echo "Top 10 URLs:"
    awk '{print $7}' "$logfile" | sort | uniq -c | sort -rn | head -10
    echo
    
    # Average response size
    echo "Average Response Size:"
    awk '{sum += $10; count++} END {print sum/count " bytes"}' "$logfile"
    echo
    
    # Requests by hour
    echo "Requests by Hour:"
    awk '{print substr($4, 14, 2)}' "$logfile" | sort | uniq -c
    echo
    
    # 404 errors
    echo "404 Errors:"
    awk '$9 == 404 {print $7}' "$logfile" | sort | uniq -c | sort -rn | head -10
}
analyze_log access.log
BEGIN and END Blocks
Special blocks that execute before and after processing:
# Set variables before processing
awk 'BEGIN {FS=","; OFS="|"} {print $1, $2}' data.csv
# Print header and footer
awk 'BEGIN {print "=== Report ==="} {print $0} END {print "=== End ==="}' data.txt
# Initialize counters
awk 'BEGIN {total=0; count=0} {total+=$1; count++} END {print total/count}' numbers.txt
Combining awk with Other Tools
With sort and uniq
# Most common values in column 3
awk '{print $3}' data.txt | sort | uniq -c | sort -rn | head -10
With grep
# Filter then extract
grep ERROR app.log | awk '{print $1, $4}'
With xargs
# Process list of files
ls *.log | awk '{print $1}' | xargs wc -l
Performance Tips
- Filter early: Use grep before awk when possible
- Avoid regex when a simple comparison works: $1 == "text" is faster than /text/
- Use built-in variables: $NF instead of $10 when the last column position varies
- Pre-compile patterns: Store complex patterns in variables (see the sketch after this list)
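A rough illustration of the first and last tips, assuming the five-field access log format from the top of the post:
# Filter early: grep discards non-matching lines before awk does any field splitting
grep ' 200 ' access.log | awk '$4 == 200 {sum += $5; n++} END {print sum/n}'
# Parameterize a pattern once with -v instead of hard-coding it in several places
awk -v pat='ERROR|CRITICAL' '$0 ~ pat {n++} END {print n+0}' app.log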
Common Patterns Cheat Sheet
# Sum column
awk '{sum+=$1} END {print sum}'
# Average
awk '{sum+=$1; n++} END {print sum/n}'
# Count occurrences
awk '{count[$1]++} END {for(i in count) print i, count[i]}'
# Print unique values
awk '!seen[$1]++' 
# Print duplicates
awk 'seen[$1]++' 
# Remove empty lines
awk 'NF'
# Number lines
awk '{print NR, $0}'
# Print every Nth line
awk 'NR % 5 == 0'
Debugging awk Scripts
# Print what awk sees
awk '{print "Line:", NR, "Fields:", NF, "Content:", $0}' file.txt
# Debug field separator
awk '{print "Fields:", NF; for(i=1; i<=NF; i++) print i, $i}' file.txt
# Test pattern matching
awk '/pattern/ {print "Matched:", $0}' file.txt
When to Use awk vs Other Tools
Use awk when:
- Working with columnar data
- Need calculations or aggregations
- Processing CSV/TSV files
- Extracting specific fields
- Need conditional logic
Use sed when:
- Simple line-by-line substitution
- Don't need field-based processing
Use cut when:
- Just extracting columns, no processing
- Fixed-width or simple delimited data
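To make those rules of thumb concrete, here is the same kind of job done with each tool (file names are the same placeholders used throughout this post):
cut -d' ' -f3 access.log                     # plain column extraction: cut is enough
awk '$4 >= 400 {print $3, $4}' access.log    # extraction plus a condition: reach for awk
sed 's/DEBUG/TRACE/g' app.log                # pure substitution, no fields: sed reads better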
Pro Tips
- Test on small samples first: Use head -100 file | awk '...'
- Use meaningful variable names: {total+=$1}, not {t+=$1}
- Comment complex awk scripts: Save them in files with comments
- Know your data: Check field count with awk '{print NF}' file | sort -u
- Handle missing fields: Check NF before accessing fields (see the example below)
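For that last tip, a minimal guard might look like this (again assuming the five-field access log format):
# Only trust $5 on lines that actually have five fields, and avoid dividing by zero
awk 'NF >= 5 && $4 == 200 {sum += $5; n++} END {if (n) print sum/n; else print "no matching lines"}' access.log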
Conclusion
awk is the ultimate tool for extracting and manipulating columnar data. It's fast, built-in, and incredibly powerful once you learn the basics. Whether you're analyzing logs, processing CSV files, or calculating statistics, awk can do it in one elegant line.
Master these awk patterns, and you'll handle data extraction tasks that would take dozens of lines in other languages!