Tags: text-processing, beginner
How to count word frequency in a text file using Perl?
Counting word frequency in Perl is a classic text processing task that showcases Perl's strength with hashes and regular expressions. The basic approach involves splitting text into words, then using a hash to track occurrences of each word.
Here's a complete example that processes sample text to demonstrate word frequency counting:
#!/usr/bin/perl
use strict;
use warnings;
# Sample text (in practice, you'd read from a file)
my $text = <<'END_TEXT';
Perl is a powerful programming language. Perl makes text
processing easy. Programming in Perl is fun and Perl is
widely used for text processing tasks.
END_TEXT
# Hash to store word frequencies
my %word_count;
# Process the text: lowercase, extract words, count them
while ($text =~ /\b([a-z]+)\b/gi) {
    my $word = lc($1);
    $word_count{$word}++;
}
# Display results sorted by frequency (descending)
print "Word Frequency Report:\n";
print "-" x 40, "\n";
foreach my $word (sort { $word_count{$b} <=> $word_count{$a} } keys %word_count) {
    printf "%-15s : %d\n", $word, $word_count{$word};
}
print "\nTotal unique words: ", scalar(keys %word_count), "\n";
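If you only care about the most frequent words, you can take a list slice of the sorted keys rather than printing everything. This is not part of the original script, just a minimal sketch (the limit of 5 and the sample counts are arbitrary):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in counts; in practice this hash comes from the loop above
my %word_count = (perl => 4, is => 3, text => 2, easy => 1);

# Sort by descending count, then slice off at most the first 5 entries
my $limit = 5;
my @top = (sort { $word_count{$b} <=> $word_count{$a} } keys %word_count)[0 .. $limit - 1];

# A slice past the end of the list yields undef entries; drop them
@top = grep { defined } @top;

printf "%-10s : %d\n", $_, $word_count{$_} for @top;
```

Note the parentheses around the sort: a list slice needs a real list on its left, so (sort ...)[0 .. $limit - 1] is the idiomatic way to take the top N.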
Key Concepts
The code demonstrates several Perl idioms:
- Hash autovivification: The expression $word_count{$word}++ automatically creates the hash entry if it doesn't exist; the initial undef evaluates to 0 in numeric context.
- Regular expression matching: The /\b([a-z]+)\b/gi pattern uses word boundaries (\b), captures letter sequences, and applies the g (global) and i (case-insensitive) modifiers.
- Hash sorting: The sort function with a custom comparison block ($word_count{$b} <=> $word_count{$a}) orders words by frequency in descending order.
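One subtlety of that sort: words with equal counts come out in unpredictable hash order. Not something the original code does, but if you want a deterministic report, a common refinement is an alphabetical tie-break chained with ||:

```perl
use strict;
use warnings;

my %word_count = (text => 2, processing => 2, perl => 4, is => 3);

# Primary key: descending count; tie-break: ascending word (cmp)
my @sorted = sort {
    $word_count{$b} <=> $word_count{$a}
        ||
    $a cmp $b
} keys %word_count;

print join(", ", @sorted), "\n";
```

Here "processing" and "text" both have count 2, so the cmp tie-break guarantees they always appear in alphabetical order.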
Reading from a File
To process an actual file, replace the sample text section with:
open my $fh, '<', 'filename.txt' or die "Cannot open file: $!";
while (my $line = <$fh>) {
    while ($line =~ /\b([a-z]+)\b/gi) {
        $word_count{lc($1)}++;
    }
}
close $fh;
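The same line-by-line loop can be exercised without a file on disk: Perl's three-argument open accepts a reference to a scalar, giving you an in-memory filehandle. A self-contained sketch (the sample string is made up for illustration):

```perl
use strict;
use warnings;

# An in-memory filehandle over a string stands in for a real file here;
# with a real file you'd write: open my $fh, '<', 'filename.txt' or die ...
my $data = "Perl counts words.\nPerl counts lines too.\n";
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

my %word_count;
while (my $line = <$fh>) {
    while ($line =~ /\b([a-z]+)\b/gi) {
        $word_count{ lc($1) }++;
    }
}
close $fh;

print "perl: $word_count{perl}\n";
```

This is handy for unit-testing the counting logic, since the loop body is identical to the file-reading version.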
Common Pitfalls
- Case sensitivity: Always normalize with lc() unless you want "Perl" and "perl" counted separately.
- Punctuation: The word boundary \b handles most cases, but contractions like "don't" may need special handling.
- Memory with large files: For huge files, process line-by-line rather than loading everything into memory.
Verified Output
The script above was executed with Perl v5.34.1 and produced the following output:
Word Frequency Report:
----------------------------------------
perl : 4
is : 3
text : 2
processing : 2
programming : 2
makes : 1
language : 1
in : 1
used : 1
and : 1
easy : 1
tasks : 1
a : 1
widely : 1
fun : 1
powerful : 1
for : 1
Total unique words: 17