
How to split a string by regex pattern in Perl?

Splitting a String by a Regex Pattern in Perl

In Perl, the split function breaks a string into a list of substrings at every match of a pattern. That pattern is a regular expression (regex), so the same call handles a fixed delimiter and a complex separator alike.

Here's a quick refresher on the split syntax:

split /PATTERN/, EXPR, LIMIT
  • /PATTERN/: The regex pattern defining where to split.
  • EXPR: The string to split. Defaults to $_ if omitted.
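  • LIMIT: Optional integer. A positive value caps the number of fields returned, with the final field holding the rest of the string; a negative value keeps trailing empty fields.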
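As a minimal sketch of how LIMIT changes the result, the snippet below splits a colon-separated record with and without a limit. The sample record and the variable names are illustrative assumptions, not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical sample record, used only for illustration.
my $record = "root:x:0:0:root:/root:/bin/sh";

# No LIMIT: every colon separates a field.
my @all = split /:/, $record;

# LIMIT of 3: at most three fields; the last one keeps the rest of the string.
my @first_three = split /:/, $record, 3;

print scalar(@all), " fields\n";          # 7 fields
print join(" | ", @first_three), "\n";    # root | x | 0:0:root:/root:/bin/sh
```

A positive LIMIT is handy when only the leading fields are structured and the remainder should stay intact.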

Using a Regex Pattern

The pattern can be any valid Perl regex, allowing you to split on single characters, character classes, or complex expressions. For example:

  • Split on whitespace: split /\s+/, $string
  • Split on commas or semicolons: split /[;,]/, $string
  • Split on multiple delimiters or more complex patterns
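The patterns listed above can be tried side by side. This is a small sketch with made-up input strings; the third pattern is one possible reading of "more complex", a separator with optional surrounding whitespace.

```perl
use strict;
use warnings;

# Illustrative inputs (assumptions, not from the original answer).
my $words  = "alpha  beta\tgamma";
my $list   = "a,b;c";
my $padded = "one ; two,three ,  four";

my @by_space  = split /\s+/, $words;           # runs of whitespace
my @by_punct  = split /[;,]/, $list;           # a comma or a semicolon
my @by_padded = split /\s*[;,]\s*/, $padded;   # delimiter plus surrounding whitespace

print join("|", @by_space), "\n";    # alpha|beta|gamma
print join("|", @by_punct), "\n";    # a|b|c
print join("|", @by_padded), "\n";   # one|two|three|four
```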

Context & Sigils

Note that split returns the list of fields in list context and the number of fields in scalar context. It's common to assign the result to an array so you can work with the fields individually:

my @fields = split /PATTERN/, $string;

Remember that in regex, certain characters have special meaning. For example, to split on a dot ('.'), you need to escape it like /\./, because '.' matches any character.
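The sketch below illustrates both points: the difference between list and scalar context, and what happens when the dot is left unescaped. The version string and the $sep variable are assumptions chosen for the example; \Q...\E is one way to quote metacharacters when the separator arrives in a variable.

```perl
use strict;
use warnings;

my $version = "5.34.1";   # illustrative input

my @parts = split /\./, $version;   # list context: the fields themselves
my $count = split /\./, $version;   # scalar context: the number of fields

# Unescaped, the dot matches every character, so every field is empty
# and split returns an empty list.
my @wrong = split /./, $version;

# \Q...\E quotes regex metacharacters in an interpolated separator.
my $sep    = ".";
my @quoted = split /\Q$sep\E/, $version;

print "parts: ", join(" ", @parts), "\n";     # parts: 5 34 1
print "count: $count\n";                      # count: 3
print "wrong: ", scalar(@wrong), "\n";        # wrong: 0
print "quoted: ", join(" ", @quoted), "\n";   # quoted: 5 34 1
```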

Practical Example

The following runnable Perl script splits a string wherever it finds a comma followed by optional whitespace, or a run of one or more whitespace characters. It prints each resulting field on its own line, wrapped in brackets so stray whitespace would be visible.

#!/usr/bin/perl
use strict;
use warnings;

my $string = "apple, banana,orange,  grape,melon";

# Split on commas (optionally with spaces) or whitespace
my @fruits = split /,\s*|\s+/, $string;

print "Split fields:\n";
foreach my $fruit (@fruits) {
    print "[$fruit]\n";
}

Explanation

  • The pattern /,\s*|\s+/ uses alternation (|) to split on either a comma followed by optional whitespace or a run of one or more whitespace characters.
  • This covers many real-world inputs where the separator may be a comma with varying whitespace after it, or whitespace alone.
  • The separators themselves are discarded, so each field contains only the text between them.
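One design consequence of the alternation is worth seeing: each comma counts as its own separator, so two commas in a row produce an empty field. A character class with a quantifier, shown here as an alternative and not as the original author's choice, collapses any run of delimiters instead. The input string is an assumption for illustration.

```perl
use strict;
use warnings;

# Illustrative input with a doubled comma and a doubled space.
my $input = "apple,,banana, cherry  date";

# Alternation: every comma is a separator, so ",," yields an empty field.
my @alternation = split /,\s*|\s+/, $input;

# Character class: any run of commas and whitespace is one separator.
my @char_class = split /[,\s]+/, $input;

print join("|", @alternation), "\n";   # apple||banana|cherry|date
print join("|", @char_class), "\n";    # apple|banana|cherry|date
```

Which behavior is right depends on whether an empty value between two commas is meaningful in your data.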

Common Pitfalls

  • Forgetting to escape regex metacharacters. Example: splitting on a dot requires /\./, not /./.
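  • Calling split with no pattern, an empty pattern, or a single-space string triggers special cases: with no arguments it splits $_ on whitespace, // splits the string into individual characters, and ' ' skips leading whitespace before splitting on /\s+/.
  • Trailing empty fields are dropped by default, so "a,b,," yields only two fields. Pass a negative LIMIT (for example -1) to keep them. A leading empty field, produced when the string starts with a separator, is kept.
  • The pattern follows normal regex rules: quantifiers are greedy, and any capturing group causes the captured separator text to be returned as extra fields. Use a non-capturing group (?:...) when you only need grouping.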
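These edge cases are easy to check directly. The following sketch uses small made-up inputs and prints field counts, since empty fields are otherwise hard to see.

```perl
use strict;
use warnings;

my $line = "a,b,,,";   # illustrative input ending in separators

my @default = split /,/, $line;       # trailing empty fields are dropped
my @keep    = split /,/, $line, -1;   # negative LIMIT keeps them

my @chars  = split //, "abc";               # empty pattern: one character per field
my @tokens = split /([+-])/, "1+2-3";       # capturing group: separators are returned too
my @regex  = split /\s+/, "  a b  ";        # leading empty field is kept
my @awk    = split ' ', "  a b  ";          # ' ' special case skips leading whitespace

print scalar(@default), "\n";         # 2
print scalar(@keep), "\n";            # 5
print join("|", @chars), "\n";        # a|b|c
print join("|", @tokens), "\n";       # 1|+|2|-|3
print scalar(@regex), "\n";           # 3
print scalar(@awk), "\n";             # 2
```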

Version Notes

Splitting with a regex has been stable and consistent since early Perl 5 releases. Newer regex features, such as the named captures and possessive quantifiers added in Perl 5.10, can be used inside a split pattern but are not specific to split.

In summary, splitting by regex in Perl is straightforward and powerful, leveraging Perl's rich regex engine. Adjust the pattern to capture the precise separators you need, test your pattern carefully, and you'll have fine-grained control over string tokenization.
