Shell Script Hack #6: awk Magic for Data Extraction and Analysis

You have a CSV file with thousands of rows and need to extract specific columns, calculate totals, or filter data. You could write a Python script, but awk can do it in one elegant line. Let me show you why awk is the secret weapon for data extraction.

The Problem

You have a server log file with whitespace-separated columns: date, time, IP address, status code, and response time. You need to find the average response time for requests that returned status code 200.

# Sample log format:
2025-10-24 10:23:45 192.168.1.100 200 0.234
2025-10-24 10:23:46 192.168.1.101 404 0.012
2025-10-24 10:23:47 192.168.1.102 200 0.445

Writing a full script seems excessive for this simple task.

The Hack: awk for Data Extraction

awk processes text line by line and treats each line as fields separated by whitespace (or any delimiter you specify):

awk '$4 == 200 {sum += $5; count++} END {print sum/count}' access.log

One line. Instant result. No external dependencies.

Understanding awk Basics

awk automatically splits each line into fields:

  • $0 – entire line
  • $1 – first field
  • $2 – second field
  • $NF – last field
  • NR – current line number
  • NF – number of fields in current line
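
A quick way to see these variables in action (a minimal sketch; access.log is the sample log from above):

# Print the line number, field count, first field, and last field of each line
awk '{print "line", NR, "has", NF, "fields:", $1, "...", $NF}' access.log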

Essential awk One-Liners

Print Specific Columns

# Print first and third columns
awk '{print $1, $3}' file.txt

# Print last column
awk '{print $NF}' file.txt

# Print all except first column
awk '{$1=""; print $0}' file.txt

# Swap first two columns
awk '{print $2, $1}' file.txt

Filtering Data

# Lines where column 3 > 100
awk '$3 > 100' data.txt

# Lines containing "ERROR"
awk '/ERROR/' log.txt

# Lines where status is 200 or 201
awk '$4 == 200 || $4 == 201' access.log

# Lines NOT matching pattern
awk '!/DEBUG/' log.txt

# Numeric comparison
awk '$5 >= 0.5' response_times.txt

Calculations and Aggregations

# Sum of column 2
awk '{sum += $2} END {print sum}' numbers.txt

# Average
awk '{sum += $1; count++} END {print sum/count}' values.txt

# Count lines
awk 'END {print NR}' file.txt

# Min and max
awk 'NR==1 {min=$1; max=$1} {if ($1 < min) min = $1; if ($1 > max) max = $1} END {print "Min:", min, "Max:", max}' data.txt

# Count occurrences
awk '{count[$1]++} END {for (i in count) print i, count[i]}' data.txt

Working with CSV Files

Comma-Separated Values

# Set field separator to comma
awk -F, '{print $1, $3}' data.csv

# Print second column of CSV
awk -F',' '{print $2}' users.csv

# Handle CSV with spaces
awk -F', *' '{print $1}' messy.csv

# Convert CSV to TSV
awk -F',' '{print $1"\t"$2"\t"$3}' data.csv

Skip Header Row

# Skip first line
awk 'NR > 1' data.csv

# Process only data rows
awk -F',' 'NR > 1 {sum += $3} END {print sum}' sales.csv

Advanced Patterns

Multiple Conditions

# AND condition
awk '$3 > 100 && $4 == "ACTIVE"' data.txt

# OR condition
awk '$1 == "ERROR" || $1 == "CRITICAL"' log.txt

# Range of lines
awk 'NR >= 10 && NR <= 20' file.txt

# Pattern range
awk '/START/,/END/' file.txt

String Operations

# String length
awk '{print length($1)}' words.txt

# Substring
awk '{print substr($1, 1, 5)}' data.txt

# Concatenation
awk '{print $1 "-" $2}' file.txt

# To uppercase
awk '{print toupper($0)}' file.txt

# To lowercase
awk '{print tolower($0)}' file.txt

Custom Output Formatting

# Format numbers
awk '{printf "%.2f\n", $1}' decimals.txt

# Aligned columns
awk '{printf "%-20s %10d\n", $1, $2}' data.txt

# Add prefix/suffix
awk '{print "User: " $1}' users.txt

# Custom field separator for output
awk 'BEGIN{OFS="|"} {print $1, $2, $3}' data.txt

Real-World Use Cases

Log Analysis

# Count errors by type
awk '/ERROR/ {count[$3]++} END {for (type in count) print type, count[type]}' app.log

# Find slowest requests
awk '$5 > 1.0 {print $1, $2, $5}' access.log | sort -k3 -nr | head -10

# Traffic by hour (hour = first two characters of the time field)
awk '{print substr($2, 1, 2)}' access.log | sort | uniq -c

# Error rate percentage
awk '{total++; if ($4 >= 400) errors++} END {print errors/total*100 "%"}' access.log

System Monitoring

# Memory usage by process
ps aux | awk 'NR > 1 {print $11, $4}' | sort -k2 -nr | head -10

# Disk usage summary (use -k so the Used column stays numeric)
df -k | awk 'NR > 1 {sum += $3} END {print sum " KB used"}'

# Network connections by state
netstat -an | awk '/^tcp/ {state[$6]++} END {for (s in state) print s, state[s]}'

# CPU usage over threshold
top -bn1 | awk '$9 > 50 {print $12, $9}'

Database Query Results

# Calculate total sales
mysql -e "SELECT product, amount FROM sales" | awk 'NR > 1 {sum += $2} END {print sum}'

# Group by category
mysql -e "SELECT category, price FROM products" | awk 'NR > 1 {sum[$1] += $2} END {for (cat in sum) print cat, sum[cat]}'

Data Validation

# Find invalid email formats (assumes the address is in the first column)
awk -F',' '$1 !~ /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/ {print NR, $0}' emails.csv

# Detect duplicate IDs
awk -F',' '{if (seen[$1]++) print "Duplicate ID:", $1}' users.csv

# Find missing fields
awk -F',' 'NF != 5 {print "Line", NR, "has", NF, "fields"}' data.csv

Complex Example: Web Server Analysis

#!/bin/bash

# Analyze Apache access log
analyze_log() {
    local logfile=$1
    
    echo "=== Web Server Analysis ==="
    echo
    
    # Total requests
    echo "Total Requests:"
    awk 'END {print NR}' "$logfile"
    echo
    
    # Requests by status code
    echo "Requests by Status:"
    awk '{print $9}' "$logfile" | sort | uniq -c | sort -rn
    echo
    
    # Top 10 IPs
    echo "Top 10 IP Addresses:"
    awk '{print $1}' "$logfile" | sort | uniq -c | sort -rn | head -10
    echo
    
    # Top 10 requested URLs
    echo "Top 10 URLs:"
    awk '{print $7}' "$logfile" | sort | uniq -c | sort -rn | head -10
    echo
    
    # Average response size
    echo "Average Response Size:"
    awk '{sum += $10; count++} END {print sum/count " bytes"}' "$logfile"
    echo
    
    # Requests by hour
    echo "Requests by Hour:"
    awk '{print substr($4, 14, 2)}' "$logfile" | sort | uniq -c
    echo
    
    # 404 errors
    echo "404 Errors:"
    awk '$9 == 404 {print $7}' "$logfile" | sort | uniq -c | sort -rn | head -10
}

analyze_log access.log

BEGIN and END Blocks

Special blocks that execute before and after processing:

# Set variables before processing
awk 'BEGIN {FS=","; OFS="|"} {print $1, $2}' data.csv

# Print header and footer
awk 'BEGIN {print "=== Report ==="} {print $0} END {print "=== End ==="}' data.txt

# Initialize counters
awk 'BEGIN {total=0; count=0} {total+=$1; count++} END {print total/count}' numbers.txt

Combining awk with Other Tools

With sort and uniq

# Most common values in column 3
awk '{print $3}' data.txt | sort | uniq -c | sort -rn | head -10

With grep

# Filter then extract
grep ERROR app.log | awk '{print $1, $4}'

With xargs

# Use awk to select files (here: logs larger than 1 MB), then hand them to xargs
ls -l *.log | awk '$5 > 1000000 {print $NF}' | xargs wc -l

Performance Tips

  • Filter early: Use grep before awk when possible
  • Avoid regex when simple comparison works: $1 == "text" is faster than /text/
  • Use built-in variables: $NF instead of $10 when last column varies
  • Reuse patterns: store a complex regex in a variable and match it with ~, as in the sketch after this list
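
For that last tip, a minimal sketch of storing a regex in a variable and reusing it with the ~ match operator (the pattern and the file name app.log are illustrative):

# Store the pattern once, then match fields against the variable
awk -v pat='^(ERROR|CRITICAL|FATAL)$' '$1 ~ pat {count++} END {print count+0, "matching lines"}' app.log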

Common Patterns Cheat Sheet

# Sum column
awk '{sum+=$1} END {print sum}'

# Average
awk '{sum+=$1; n++} END {print sum/n}'

# Count occurrences
awk '{count[$1]++} END {for(i in count) print i, count[i]}'

# Print unique values
awk '!seen[$1]++' 

# Print duplicates
awk 'seen[$1]++' 

# Remove empty lines
awk 'NF'

# Number lines
awk '{print NR, $0}'

# Print every Nth line
awk 'NR % 5 == 0'

Debugging awk Scripts

# Print what awk sees
awk '{print "Line:", NR, "Fields:", NF, "Content:", $0}' file.txt

# Debug field separator
awk '{print "Fields:", NF; for(i=1; i<=NF; i++) print i, $i}' file.txt

# Test pattern matching
awk '/pattern/ {print "Matched:", $0}' file.txt

When to Use awk vs Other Tools

Use awk when:

  • Working with columnar data
  • Need calculations or aggregations
  • Processing CSV/TSV files
  • Extracting specific fields
  • Need conditional logic

Use sed when:

  • Simple line-by-line substitution
  • Don't need field-based processing

Use cut when:

  • Just extracting columns, no processing
  • Fixed-width or simple delimited data
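
To make the trade-off concrete, here is the same second-column extraction in each tool (a sketch; a comma-separated data.csv is assumed):

# cut: plain column extraction, no logic
cut -d',' -f2 data.csv

# sed: possible, but awkward for field work
sed 's/^[^,]*,\([^,]*\).*/\1/' data.csv

# awk: the same extraction, with room for conditions and arithmetic
awk -F',' '{print $2}' data.csv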

Pro Tips

  • Test on small samples first: Use head -100 file | awk '...'
  • Use meaningful variable names: {total+=$1} not {t+=$1}
  • Comment complex awk scripts: Save them in files with comments
  • Know your data: Check field count with awk '{print NF}' file | sort -u
  • Handle missing fields: check NF before reading a field, as in the sketch after this list
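
A minimal sketch of that NF guard (the five-field layout and the file name data.csv are assumptions):

# Only read $5 on lines that actually have at least five fields
awk -F',' 'NF >= 5 {sum += $5} END {print sum + 0}' data.csv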

Conclusion

awk is the ultimate tool for extracting and manipulating columnar data. It's fast, built-in, and incredibly powerful once you learn the basics. Whether you're analyzing logs, processing CSV files, or calculating statistics, awk can do it in one elegant line.

Master these awk patterns, and you'll handle data extraction tasks that would take dozens of lines in other languages!
