How to count word frequency in a text file using Perl?

Counting word frequency in Perl is a classic text processing task that showcases Perl's strength with hashes and regular expressions. The basic approach is to extract each word with a regular expression, then use a hash to track how many times it occurs.

Here's a complete example that processes sample text to demonstrate word frequency counting:

#!/usr/bin/perl
use strict;
use warnings;

# Sample text (in practice, you'd read from a file)
my $text = <<'END_TEXT';
Perl is a powerful programming language. Perl makes text
processing easy. Programming in Perl is fun and Perl is
widely used for text processing tasks.
END_TEXT

# Hash to store word frequencies
my %word_count;

# Process the text: lowercase, extract words, count them
while ($text =~ /\b([a-z]+)\b/gi) {
    my $word = lc($1);
    $word_count{$word}++;
}

# Display results sorted by frequency (descending)
print "Word Frequency Report:\n";
print "-" x 40, "\n";

foreach my $word (sort { $word_count{$b} <=> $word_count{$a} } keys %word_count) {
    printf "%-15s : %d\n", $word, $word_count{$word};
}

print "\nTotal unique words: ", scalar(keys %word_count), "\n";

Key Concepts

The code demonstrates several Perl idioms:

  • Implicit entry creation: The expression $word_count{$word}++ creates the hash entry on first use (often loosely called autovivification); the new value starts as undef, which ++ treats as 0
  • Regular expression matching: The /\b([a-z]+)\b/gi pattern uses word boundaries (\b), captures letter sequences, and uses the g (global) and i (case-insensitive) modifiers
  • Hash sorting: The sort function with a custom comparison block ($word_count{$b} <=> $word_count{$a}) sorts by frequency in descending order
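One subtlety: words with equal counts come back in Perl's unspecified hash order, so ties can shuffle between runs. A minimal sketch of a deterministic version, adding an alphabetical tie-break (the sample counts here are hypothetical stand-ins for %word_count from the main example):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical counts standing in for %word_count from the main example
my %word_count = (perl => 4, is => 3, text => 2, processing => 2, easy => 1);

# Primary key: count, descending; secondary key: alphabetical,
# so words with equal counts always print in the same order.
my @ordered = sort {
    $word_count{$b} <=> $word_count{$a}
        || $a cmp $b
} keys %word_count;

print join(", ", @ordered), "\n";   # perl, is, processing, text, easy
```

The || chains the two comparisons: $a cmp $b only runs when the numeric <=> returns 0, i.e. on a tie.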

Reading from a File

To process an actual file, replace the sample text section with:

open my $fh, '<', 'filename.txt' or die "Cannot open file: $!";
while (my $line = <$fh>) {
    while ($line =~ /\b([a-z]+)\b/gi) {
        $word_count{lc($1)}++;
    }
}
close $fh;
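When the file comfortably fits in memory, an alternative is to slurp it whole by localizing $/ (the input record separator). A sketch that writes a small throwaway file so it runs standalone (words_demo.txt is just an illustrative name):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Create a tiny input file purely for demonstration
my $path = 'words_demo.txt';
open my $out, '>', $path or die "Cannot write file: $!";
print $out "Perl is fun. Perl is fast.\n";
close $out;

my %word_count;
open my $fh, '<', $path or die "Cannot open file: $!";
my $text = do { local $/; <$fh> };   # undef $/ => read the whole file at once
close $fh;
unlink $path;                        # clean up the demo file

$word_count{ lc $1 }++ while $text =~ /\b([a-z]+)\b/gi;
print "$_ => $word_count{$_}\n" for sort keys %word_count;
```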

Common Pitfalls

  • Case sensitivity: Always normalize with lc() unless you want "Perl" and "perl" counted separately
  • Punctuation: The word boundary \b handles most cases, but contractions like "don't" may need special handling
  • Memory with large files: For huge files, process line-by-line (as the file-reading snippet above does) rather than loading everything into memory
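The contraction pitfall above can be handled by letting a word contain internal apostrophes. A sketch with an adjusted pattern, where (?:'[a-z]+)* allows apostrophe-joined parts so "don't" stays a single token:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %word_count;
my $text = "Don't stop. I don't know, and you don't either.";

# Allow apostrophes inside a word so "don't" stays one token
while ($text =~ /\b([a-z]+(?:'[a-z]+)*)\b/gi) {
    $word_count{ lc $1 }++;
}

# "don't" is counted as one word (3 times), not split into "don" and "t"
print "$_ => $word_count{$_}\n" for sort keys %word_count;
```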

Verified Code

Executed in a sandbox (Perl v5.34.1) to capture the real output below.

STDOUT
Word Frequency Report:
----------------------------------------
perl            : 4
is              : 3
text            : 2
processing      : 2
programming     : 2
makes           : 1
language        : 1
in              : 1
used            : 1
and             : 1
easy            : 1
tasks           : 1
a               : 1
widely          : 1
fun             : 1
powerful        : 1
for             : 1

Total unique words: 17
