How to remove duplicate lines from a file in Perl?
Question
How to remove duplicate lines from a file in Perl?
Removing Duplicate Lines from a File in Perl
Removing duplicate lines from a file is a common text-processing task. In Perl, it’s straightforward to accomplish this using a hash to keep track of lines you’ve already seen. Hashes provide efficient lookups, perfect for this kind of deduplication.
Here’s the general approach:
- Open the input file for reading.
- Read the file line by line.
- Use a hash to store lines that have already been encountered.
- If a line is new (not in the hash), print it (or save it).
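The steps above can be sketched with explicit filehandles, which makes the "open the input file" step visible. This is a minimal illustration, not the only way to structure it: dedup_handle is a hypothetical helper name, and the demonstration reads from an in-memory string so that no real files are touched.

```perl
use strict;
use warnings;

# Copy lines from $in_fh to $out_fh, skipping any line already written.
# dedup_handle is a hypothetical helper name, not part of any module.
sub dedup_handle {
    my ($in_fh, $out_fh) = @_;
    my %seen;                          # lines encountered so far
    while (my $line = <$in_fh>) {      # read line by line
        chomp $line;
        next if $seen{$line}++;        # already encountered: skip it
        print {$out_fh} "$line\n";     # new line: emit it
    }
    return scalar keys %seen;          # number of unique lines
}

# Demonstration on in-memory data instead of real files.
my $input = "apple\nbanana\napple\ncherry\nbanana\n";
open my $in,  '<', \$input     or die "Cannot open input: $!";
open my $out, '>', \my $output or die "Cannot open output: $!";
my $unique = dedup_handle($in, $out);
close $out;

print $output;                         # apple, banana, cherry (one per line)
print "$unique unique lines\n";
```

For a real file, the in-memory references would be replaced by file names, for example open my $in, '<', 'yourfile.txt' or die "Cannot open yourfile.txt: $!";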
Perl Concepts Used
- %seen: a hash keyed by line content; a true value marks a line as already encountered
- chomp: removes the trailing newline for consistent comparison
- while (<>) { ... }: a convenient reading loop that works with files named in @ARGV or with STDIN
- Perl's idiom of testing and post-incrementing a hash entry ($seen{$line}++) to identify duplicates
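To see why a single expression can both test and record a line, here is a small sketch of the post-increment idiom on its own. The helper name first_time and the sample keys are made up for illustration. A plain exists test would only answer the question; the post-increment also stores the count for next time.

```perl
use strict;
use warnings;

# %seen is file-scoped here only to keep the illustration short.
my %seen;

# Returns true the first time a key is passed, false on every later call.
sub first_time {
    my ($key) = @_;
    # Post-increment yields the OLD value (undef, which is false, on the
    # first call) and then stores the incremented count.
    return !$seen{$key}++;
}

my @flags = map { first_time($_) ? 1 : 0 } qw(a b a c b);
print "@flags\n";                       # prints "1 1 0 1 0"
print "a was seen $seen{a} times\n";    # prints "a was seen 2 times"
```

As a side effect, %seen ends up holding the number of occurrences of each key, which can be handy for reporting how many duplicates were dropped.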
Example: Remove Duplicate Lines from a File
#!/usr/bin/perl
use strict;
use warnings;

# Hash to store seen lines
my %seen;

# Read from standard input or files passed as arguments
while (my $line = <>) {
    chomp($line);                  # remove newline for consistent comparison
    unless ($seen{$line}++) {      # if not seen before
        print "$line\n";           # print the unique line with newline restored
    }
}
This script can be used in several ways:
- Save it to a file, say dedup.pl, then run:
perl dedup.pl yourfile.txt > unique.txt
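- Or use it as a one-liner directly in the terminal (this variant compares lines with their newline still attached, so a final line that lacks a newline is treated as distinct from an otherwise identical earlier line):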
perl -ne 'print unless $seen{$_}++' yourfile.txt > unique.txt
Key Points and Potential Gotchas
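- Preserving order: Lines are printed as they are read, so the output keeps the input order, with the first occurrence of each line retained and later duplicates skipped. The hash is used only for lookups, so the fact that Perl hashes are unordered does not affect the result.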
- Trailing newlines: We chomp each line to normalize it before checking for duplicates, then restore the newline explicitly when printing. This way a final line without a newline still matches an identical earlier line.
- Large files: This method stores every unique line in memory. For extremely large files, memory usage may be an issue.
- Context: The <> operator reads from the files passed as command-line arguments, or from STDIN if none are given.
- Line endings: If the file uses Windows-style line endings (\r\n), consider removing the trailing \r as well so that otherwise identical lines compare equal.
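Two of the gotchas above, mixed line endings and memory use, can be addressed together. The sketch below is one possible approach rather than a definitive fix: it strips either \n or \r\n before comparing, and it keys the hash on a 16-byte MD5 digest from the core Digest::MD5 module instead of on the full line. It assumes lines are read as bytes (no :encoding layer), it only saves memory when lines are longer than the digest, and it accepts a very small theoretical risk of two different lines sharing a digest. dedup_compact is a hypothetical helper name.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);   # core module; used here only to shrink hash keys

# Deduplicate lines from $in_fh into $out_fh, treating "\n" and "\r\n"
# endings as equivalent and keying the hash on a digest of each line.
sub dedup_compact {
    my ($in_fh, $out_fh) = @_;
    my %seen;
    while (my $line = <$in_fh>) {
        $line =~ s/\r?\n\z//;             # strip an LF or CRLF line ending
        next if $seen{ md5($line) }++;    # the digest is the key, not the line
        print {$out_fh} "$line\n";        # output always uses "\n"
    }
    return;
}

# Demonstration on in-memory data with mixed line endings.
my $input = "alpha\r\nbeta\nalpha\nbeta\r\ngamma";
open my $in,  '<', \$input     or die "Cannot open input: $!";
open my $out, '>', \my $output or die "Cannot open output: $!";
dedup_compact($in, $out);
close $out;

print $output;    # alpha, beta, gamma (one per line)
```

If the original line endings must be preserved in the output, the stripped ending would need to be captured and printed back instead of a fixed "\n".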
Version Notes
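This code works on all active Perl 5 versions and uses only core features. It requires no additional modules. The say function could simplify printing, since it appends the newline automatically; it is available from 5.10 onward and must be enabled with use feature 'say' or use v5.10, but it is not necessary here.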
With its clear, idiomatic use of hashes and line handling, this method showcases Perl’s “There’s More Than One Way To Do It” approach elegantly and efficiently.