Regular Expressions in Cygwin grep

Introduction

A regular expression ("regex") is a pattern that you use to match lines of text. You might, for example, want to extract all the lines from a file that include the word "delinquent," or find all the files whose names include an upper-case letter. Regular expressions provide a very powerful way to set up queries of this sort. They are intended for text, although they can be used on binary files (like Word documents, say) under some circumstances.

Most scripting languages support regular expressions, and libraries are available for C, Java, and essentially every other language.

Glob: The simplest pattern-matching approach is the glob, which is also known as the wild-card match. Regular expressions are much richer than globs (and, unsurprisingly, much more complicated). The glob characters are the asterisk, *, which matches any run of characters (including none); the question mark, ?, which matches exactly one character; and the square brackets, [ and ], which match any one of the characters listed between them.

Globbing is used at the bash command-line for filename expansion. So the command ls J* will list all files starting with upper-case J; ls J? will list all files with two-character names of which the first is J; ls [Jd]* lists file names that start with either J or d.
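Here is a small sketch of those three globs at work. The scratch directory and the file names are made up for the demonstration, and I use echo rather than ls to make the point that bash performs the expansion itself, before the command ever runs. Setting LC_ALL=C is my assumption for getting the same sort order on every machine.

```shell
# Scratch directory and file names are hypothetical, chosen only for this demo.
LC_ALL=C; export LC_ALL      # keeps the sort order of the expansions predictable
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch J1 Jo June data.txt dj

all_j=$(echo J*)       # names starting with upper-case J
two_j=$(echo J?)       # two-character names starting with J
j_or_d=$(echo [Jd]*)   # names starting with J or d

echo "J*    expands to: $all_j"
echo "J?    expands to: $two_j"
echo "[Jd]* expands to: $j_or_d"
```

Notice that data.txt and dj appear only in the last expansion, and that June is too long to match J?.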

Types of Regular Expressions

There are two main flavors of regular expressions: Perl-style and POSIX. Perl is the name of a scripting language, and POSIX is the name of a set of standards. This document is about POSIX regular expressions only!

POSIX regular expressions come in two forms: "basic" and "extended." The difference is primarily in which characters are "special." See escaping below. Regular expressions are implemented in the grep command (and its siblings), which you run from bash (found in Cygwin on Windows and Terminal on OSX), and in grep() and its siblings in R. By default, grep uses basic regular expressions but R uses extended ones. To use extended regular expressions in Cygwin, run egrep or pass the -E flag (note that it's upper-case). The -E form is the safer habit, since recent releases of GNU grep warn that egrep is obsolescent.
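The difference is easiest to see with a character that is special in one form and ordinary in the other. This sketch feeds three made-up lines to grep through a pipe; the sample text is my own invention, not part of the running example.

```shell
# Hypothetical three-line sample, piped in rather than read from a file.
sample=$(printf '%s\n' 'thin' 'scan' 'th|sc')

basic=$(printf '%s\n' "$sample" | grep 'th|sc')        # basic: | is an ordinary character
extended=$(printf '%s\n' "$sample" | grep -E 'th|sc')  # extended: | means "or"

echo "basic regex matched:"
echo "$basic"
echo "extended regex matched:"
echo "$extended"
```

The basic form finds only the line that literally contains th|sc, while the extended form finds every line containing th or sc.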

The Basics of Regular Expressions

First, it's worth noting that this is a big subject. There are entire books written on regular expressions, and numerous web tutorials as well. This document represents my effort to introduce you to the idea, while clarifying a couple of points that have confused me in the past.

Running Example

Imagine that we have one text file, which I'll call f.txt. Suppose f.txt contains these seven lines:

Line one ends in a dollar sign; here is $$$
Line two has "$$$" in the middle and \ near the end
$12223.45, on line three, is a lot of money!
1223.45 (line four) has, at one point, exactly two 2's in a row
line five has no dollar signs and no punctuation but it does end in x
$$$ , on line six, isn't a thing with meaning , but here's a pipe: |
the seventh, with internal Capital Letter, is 2 lines past # 4
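If you want to follow along, one way to create f.txt is with a here-document. This is only a sketch: the scratch directory is hypothetical, and the quoted 'EOF' delimiter is there so that bash leaves the dollar signs and the backslash alone.

```shell
workdir=$(mktemp -d)   # hypothetical scratch directory
cd "$workdir" || exit 1

# The quotes around EOF stop bash from expanding $ or \ inside the text.
cat > f.txt <<'EOF'
Line one ends in a dollar sign; here is $$$
Line two has "$$$" in the middle and \ near the end
$12223.45, on line three, is a lot of money!
1223.45 (line four) has, at one point, exactly two 2's in a row
line five has no dollar signs and no punctuation but it does end in x
$$$ , on line six, isn't a thing with meaning , but here's a pipe: |
the seventh, with internal Capital Letter, is 2 lines past # 4
EOF

line_count=$(wc -l < f.txt)
echo "f.txt has $line_count lines"
```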

A regular expression is a pattern. When an ordinary character appears in a regular expression, it means that you are requiring a match to that character. So this command:

grep 2 f.txt

says "extract all the lines in which 2 appears (anywhere)," and the result is

$12223.45, on line three, is a lot of money!
1223.45 (line four) has, at one point, exactly two 2's in a row
the seventh, with internal Capital Letter, is 2 lines past # 4

Protecting Patterns From `bash`

In that example, the "2" was a perfectly good pattern, but in many more complicated expressions, bash will try to grab the pattern before it gets to grep. So it's a good idea to enclose the pattern in quotation marks. Here I want to find all the lines with a semi-colon.

egrep ; f.txt

produces an error, because bash thinks you wanted to use the semi-colon to separate two commands. To protect the semi-colon from bash and make sure it gets to grep, we enclose it in quotes:

egrep ";" f.txt
Line one ends in a dollar sign; here is $$$

Note on quotes in bash: Single quotes and double quotes are not interchangeable in bash. Inside single quotes, every character is taken literally, including the backslash. As a result, you can't embed a single quote inside single quotes, since bash pairs the embedded quote with the opening one. Inside double quotes, every character is taken literally except for $ (dollar sign), \ (backslash), and ` (backtick); a double quote itself must also be protected. If you need to use double quotes and you want one of those characters to reach grep, protect it with a backslash. Because bash then removes that backslash, a backslash that is meant for grep has to be doubled. For example, we will learn in a moment that the dollar sign is a special character meaning "end of line." In order to use the dollar sign to mean "match the dollar sign character," it needs to be "escaped," which (as I say below) means telling the regular expression evaluator to treat the character literally. In single quotes that is '\$'; in double quotes it is "\\$", because bash turns \\ into the single backslash that grep needs to see.
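A handy way to see what bash does to a quoted pattern is to hand the same text to printf instead of grep, since printf shows its argument exactly as it arrived. This is just a sketch of that trick; the variable names are mine.

```shell
# printf stands in for grep here and reveals what is left of each pattern
# after bash has finished processing the quotes.
single=$(printf '%s' '\$')       # single quotes: passed through untouched
double_one=$(printf '%s' "\$")   # double quotes: bash consumes the backslash
double_two=$(printf '%s' "\\$")  # double quotes: \\ becomes one backslash

printf 'single quotes, one backslash:   %s\n' "$single"
printf 'double quotes, one backslash:   %s\n' "$double_one"
printf 'double quotes, two backslashes: %s\n' "$double_two"
```

Only the first and third forms deliver a backslash followed by a dollar sign, which is what the regular expression engine needs in order to look for a literal dollar sign.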

Escaping

There is another kind of "protecting" (called "escaping") that we have to do. "Protecting" is my word for how you deal with bash. "Escaping" is what you do when you're dealing with the regular expression evaluator (also called the "engine"). Some characters have special uses inside a pattern. When you want to use such a character as itself, you need to "escape" it by preceding it with a backslash (the \ character). Don't count on a backslash in front of an ordinary character being ignored: POSIX leaves the result undefined, and in GNU grep some of these combinations, such as \w and \<, have meanings of their own (see below).

Escaping examples: Now consider these examples.

grep '$' f.txt        # returns all lines that have an end (i.e. all lines)
grep "$" f.txt        # same

grep '\$' f.txt       # returns lines with dollar signs
grep "\$" f.txt       # returns all lines, because bash uses up the backslash protecting the dollar sign and grep sees only $
grep "\\$" f.txt      # returns lines with dollar signs, because you've made the backslash explicit

grep '\$$' f.txt      # returns lines that end with a dollar sign; the '\$' says "literal dollar sign" and the second '$' says "end of line"
grep "\\$\$" f.txt    # same; bash turns the "\\" into a backslash, leaves the '$' alone, and turns
                      # the "\$" into '$', so grep sees \$$

grep '\$\$' f.txt     # returns lines with two dollar signs in a row
grep '\$\$$' f.txt    # returns lines that end with two dollar signs
grep "\$\$" f.txt     # does not look for two dollar signs: bash removes both backslashes and grep sees $$; with -E that is
                      # "end of line, followed by end of line" (all lines), while basic GNU grep reads the first $ as a literal
grep "\\$\\$\$" f.txt # returns lines that end with two dollar signs (grep sees \$\$$)

grep '\\' f.txt       # returns lines with backslashes; the first "escapes" the second
grep "\\\\" f.txt     # returns lines with backslashes; grep needs to receive two backslashes, and since we're in double quotes
                      # the first protects the second, producing one; the third protects the fourth, producing a second
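One way to check examples like these without reading through the output is to count matching lines with -c. The sketch below rebuilds f.txt in a hypothetical scratch directory; count is a helper I made up for the purpose, not a standard command.

```shell
workdir=$(mktemp -d)   # hypothetical scratch directory
cd "$workdir" || exit 1
cat > f.txt <<'EOF'
Line one ends in a dollar sign; here is $$$
Line two has "$$$" in the middle and \ near the end
$12223.45, on line three, is a lot of money!
1223.45 (line four) has, at one point, exactly two 2's in a row
line five has no dollar signs and no punctuation but it does end in x
$$$ , on line six, isn't a thing with meaning , but here's a pipe: |
the seventh, with internal Capital Letter, is 2 lines past # 4
EOF

# count is a hypothetical helper: how many lines of f.txt match the pattern
count() { grep -c -e "$1" f.txt; }

all_lines=$(count '$')         # every line has an end
has_dollar=$(count '\$')       # a literal dollar sign anywhere
ends_dollar=$(count '\$$')     # a literal dollar sign at end of line
two_dollars=$(count '\$\$')    # two literal dollar signs in a row
via_double=$(count "\\$\\$")   # the same pattern written inside double quotes
has_backslash=$(count '\\')    # a literal backslash

echo "all lines: $all_lines, with a dollar sign: $has_dollar, ending in one: $ends_dollar"
echo "two in a row: $two_dollars (single quotes), $via_double (double quotes)"
echo "with a backslash: $has_backslash"
```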

Special Characters

Special characters are what give regular expressions their power. The special characters, and their uses, are these. Notice the asterisk column: items with an asterisk in that column are special characters in extended regular expressions, but not in basic regular expressions. So to use, for example, the pipe's function in an extended regular expression, just type the pipe; you need to escape the pipe to use it literally in an extended regular expression. Conversely, to use the pipe's function in a basic regular expression, precede the pipe with a backslash (this works in GNU grep, which is what Cygwin provides). To use the pipe literally in a basic regular expression, just type the pipe. Notice that space is not a special character; it just matches the space character. You'll be tempted to put spaces into your patterns to make them easier to read, but that is a mistake, because every space you add must then be matched in the text.

Character	Name	*	Purpose	Example
`.`	Period		Matches any single character	`t.e` matches lines with `tae`, `tbe`, ..., `t1e`, `t;e`, etc. anywhere
`[` and `]`	Brackets		Match any one of the characters between them	`t[13579]e` matches `t1e`, `t3e`, ..., `t9e`; `t[1-6]e` matches `t1e`, `t2e`, ..., `t6e`, but watch out: `[a-d]` might mean `[abcd]` or it might mean `[aAbBcCdD]`, depending on your computer's locale. See character classes below.
`^`	Caret		(i) Matches "start of line"	`^Line` matches any line starting with `Line`
			(ii) Acts as "not" as first character inside square brackets	`t[^h]e` matches any line containing a `t`, then something that is not an `h`, then an `e`
			(iii) Acts as itself inside square brackets if not first	`t[h^i]e` matches `the`, `t^e`, or `tie`
`$`	Dollar sign		Matches "end of line"	`the$` matches lines ending in `the`
`|`	Pipe	*	"Or" operator	`th|sc` matches either `th` or `sc`
`(` and `)`	Parentheses	*	Grouping operators	`(ab)+` matches `ab`, `abab`, `ababab`...
`\`	Backslash		Escape character	See Escaping above

These characters handle repetitions

Character	Name	*	Purpose	Example
`{` and `}`	Braces	*	Enclose repetition counts	`(a|b){3}` matches any line with three (a or b) characters in a row: `aba`, `baa`, `bbb`, ...
`,`	Comma		Separates repetition counts inside braces	`b{2,4}` matches `bb`, `bbb` or `bbbb`
`+`	Plus sign	*	Match one or more times	`ab+` matches `ab`, `abb`, `abbb`...
`*`	Asterisk		Match zero or more times	`ab*` matches `a`, `ab`, `abb`, `abbb`...
`?`	Question mark	*	Match zero or 1 times	`ab?` matches `a` or `ab`
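To try the table entries out, it can be easier to pipe a few short lines into grep than to hunt through a file. In this sketch the sample lines and the hits helper are made up, and I use -x (the whole line must match), which the text has not introduced, so that the repetition examples are not satisfied by a mere substring.

```shell
# Hypothetical sample lines; hits is a hypothetical helper that counts matches.
words=$(printf '%s\n' tae t1e t9e the tie te 'Line up' 'in the')
hits() { printf '%s\n' "$words" | grep -c "$@"; }

any_char=$(hits 't.e')          # t, any one character, e
odd_digit=$(hits 't[13579]e')   # t, an odd digit, e
not_h=$(hits 't[^h]e')          # t, anything but h, e
at_start=$(hits '^Line')        # Line at the start of the line
at_end=$(hits 'the$')           # the at the end of the line
either=$(hits -E 'ta|ti')       # extended regex: ta or ti

# Repetition operators, tested against whole lines with -x.
runs=$(printf '%s\n' a ab abb abbb)
one_or_more=$(printf '%s\n' "$runs" | grep -c -x -E 'ab+')
zero_or_one=$(printf '%s\n' "$runs" | grep -c -x -E 'ab?')
zero_or_more=$(printf '%s\n' "$runs" | grep -c -x 'ab*')
two_to_three=$(printf '%s\n' "$runs" | grep -c -x -E 'ab{2,3}')

echo "t.e: $any_char  t[13579]e: $odd_digit  t[^h]e: $not_h"
echo "^Line: $at_start  the\$: $at_end  ta|ti: $either"
echo "ab+: $one_or_more  ab?: $zero_or_one  ab*: $zero_or_more  ab{2,3}: $two_to_three"
```

Notice that te never matches t.e, because the period insists on exactly one character between the t and the e.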

Repetition is Greedy

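In an earlier example, I noted that the pattern b{2,4} matches bb, bbb or bbbb. It will also match bbbbb, since the regular expression "engine" sees the first two b's and declares a match. How might you restrict your interest to items that contain 2, 3, or 4 b's, but not 5 or 6? Following the run with a non-b, as in b{2,4}[^b], is not enough on its own: in bbbbbx the engine can start at the second b and still find four b's followed by a non-b. The run has to be fenced on both sides, and it may also sit at the very start or end of a line. One way is the extended pattern (^|[^b])b{2,4}([^b]|$). This says "give me lines in which a run of two, three, or four b's has a non-b or a line edge on each side."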
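One thing worth checking by experiment is how each candidate pattern treats a run of five b's. This sketch uses made-up sample lines and compares the bare pattern, a variant with a non-b on the right only, and a variant (my own suggestion) that demands a non-b or a line edge on both sides.

```shell
# Hypothetical sample: runs of one, two, four, and five b's, plus a bare "bbb".
runs=$(printf '%s\n' 'ab' 'abbc' 'abbbbc' 'abbbbbc' 'bbb')

bare=$(printf '%s\n' "$runs" | grep -E 'b{2,4}')
right_only=$(printf '%s\n' "$runs" | grep -E 'b{2,4}[^b]')
fenced=$(printf '%s\n' "$runs" | grep -E '(^|[^b])b{2,4}([^b]|$)')

echo "bare pattern:";            echo "$bare"
echo "non-b on the right only:"; echo "$right_only"
echo "fenced on both sides:";    echo "$fenced"
```

Only the fenced pattern rejects abbbbbc, and it is also the only one of the last two that accepts the line consisting of bbb alone.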

Character Classes

There is a set of predefined "character classes" defining sets of characters. These consist of a square bracket, a colon, a name, then another colon, and then a closing square bracket. To use these, though, you have to include them inside another pair of square brackets. Useful character classes include:

[:alpha:]: Letters (also available: [:upper:] and [:lower:])
[:digit:]: Numbers (use [:xdigit:] for hex digits)
[:alnum:]: Letters and numbers
[:punct:]: Punctuation
[:space:]: Space, tab, new-line etc.

Example: the pattern ^[[:upper:]].* matches every line that starts with an upper-case letter. Notice that the character class [:upper:] is enclosed in square brackets. You might have used ^[A-Z].* instead, but what a range like A-Z covers depends on your locale's collating order, whereas the character-class form means the same thing everywhere.

The pattern ^[^[:upper:]].* will match every non-empty line that doesn't start with an upper-case letter. (An empty line has no first character for the bracket expression to match.) Here the caret character, ^, is being used in two different ways. At the beginning of the pattern, it means "start of line"; the second caret, appearing as the first character inside square brackets, is the negation operation.

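\w is a synonym for [_[:alnum:]], that is, letters, digits, and the underscore (so in this case, the outer square brackets are already in place), and \W is its complement, [^_[:alnum:]]. So the pattern ^\W matches lines that start with a character that is not a letter, a digit, or an underscore. Both are GNU extensions rather than part of POSIX.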
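Here is a sketch that applies these classes to the first character of each line of f.txt. It rebuilds the file in a hypothetical scratch directory, assumes GNU grep for \W, and sets LC_ALL=C (my choice) so the results do not depend on your locale.

```shell
LC_ALL=C; export LC_ALL   # assumption: plain ASCII rules for the classes
workdir=$(mktemp -d)      # hypothetical scratch directory
cd "$workdir" || exit 1
cat > f.txt <<'EOF'
Line one ends in a dollar sign; here is $$$
Line two has "$$$" in the middle and \ near the end
$12223.45, on line three, is a lot of money!
1223.45 (line four) has, at one point, exactly two 2's in a row
line five has no dollar signs and no punctuation but it does end in x
$$$ , on line six, isn't a thing with meaning , but here's a pipe: |
the seventh, with internal Capital Letter, is 2 lines past # 4
EOF

starts_upper=$(grep -c '^[[:upper:]]' f.txt)    # first character is upper-case
starts_other=$(grep -c '^[^[:upper:]]' f.txt)   # first character is anything else
starts_digit=$(grep -c '^[[:digit:]]' f.txt)    # first character is a digit
starts_nonword=$(grep -c '^\W' f.txt)           # GNU: not a letter, digit, or underscore

echo "upper-case: $starts_upper, not upper-case: $starts_other"
echo "digit: $starts_digit, non-word character: $starts_nonword"
```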

Word boundaries

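You can find a word (that is, a pattern with no letter, digit, or underscore directly on either side of it) by using the \< and \> symbols, which match the empty string at the start and at the end of a word. So the pattern \<line\> matches "line" but not "declined."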
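A quick sketch of the difference between matching a substring and matching a word, using four made-up lines. It assumes a grep that supports \< and \>, as GNU grep does.

```shell
# Hypothetical sample lines, piped in rather than read from a file.
sample=$(printf '%s\n' 'line' 'declined' 'two lines' 'a line, here')

substring=$(printf '%s\n' "$sample" | grep -c 'line')      # "line" anywhere
whole_word=$(printf '%s\n' "$sample" | grep '\<line\>')    # "line" as a word
dash_w=$(printf '%s\n' "$sample" | grep -w 'line')         # same idea via -w

echo "lines containing the substring: $substring"
echo "lines containing the word:"
echo "$whole_word"
```

The comma after line in the last sample is not a letter, digit, or underscore, so it counts as a word boundary just as a space would.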

Important command-line arguments

grep has a number of useful command-line arguments. I recommend that you look at the man page, so that you know what's available. Here are some of the arguments I find most useful:

-c: count (report the number of lines that match)
-F: treat the pattern as fixed (literal)
-i: ignore case (in both pattern and text)
-n: show matching line numbers, in addition to the lines themselves
-r: recurse (go into sub-directories)
-v: invert (find only non-matching lines)
-w: word match (the match must be at the beginning of the line or preceded by a character that isn't a letter, digit, or underscore, and at the end of the line or followed by such a character)

As I say, there are many more.
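The sketch below runs each of those flags (except -r) against f.txt, rebuilt here in a hypothetical scratch directory. I use cut, which is not otherwise part of this document, to keep just the line numbers from the -n output.

```shell
workdir=$(mktemp -d)   # hypothetical scratch directory
cd "$workdir" || exit 1
cat > f.txt <<'EOF'
Line one ends in a dollar sign; here is $$$
Line two has "$$$" in the middle and \ near the end
$12223.45, on line three, is a lot of money!
1223.45 (line four) has, at one point, exactly two 2's in a row
line five has no dollar signs and no punctuation but it does end in x
$$$ , on line six, isn't a thing with meaning , but here's a pipe: |
the seventh, with internal Capital Letter, is 2 lines past # 4
EOF

with_2=$(grep -c 2 f.txt)                  # -c: how many lines contain a 2
where_2=$(grep -n 2 f.txt | cut -d: -f1)   # -n: which line numbers they are
without_2=$(grep -v -c 2 f.txt)            # -v: how many lines do not
exact_case=$(grep -c line f.txt)           # lower-case "line" only
any_case=$(grep -i -c line f.txt)          # -i: "line" or "Line"
as_word=$(grep -w -c line f.txt)           # -w: "line" but not "lines"
fixed=$(grep -F -c '$$$' f.txt)            # -F: the dollar signs are literal

echo "with a 2: $with_2 (lines $(echo $where_2)), without: $without_2"
echo "line: $exact_case, ignoring case: $any_case, as a word: $as_word"
echo "literal \$\$\$: $fixed"
```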

Examples

Here I give some examples based on the f.txt file above. Remember that # is the comment character, so from that character to the end of the line everything is ignored. I use it here to put my comments in line with the commands. The tab characters that push the responses to the right, relative to the commands, are just for readability.


grep 2 f.txt # extract all lines with a 2 anywhere
	$12223.45, on line three, is a lot of money!
	1223.45 (line four) has, at one point, exactly two 2's in a row
	the seventh, with internal Capital Letter, is 2 lines past # 4

grep '\\' f.txt    # extract all lines with a \ anywhere
grep "\\\\" f.txt  # equivalent form if using double-quotes
	Line two has "$$$" in the middle and \ near the end

grep '|' f.txt     # extract all lines with a | anywhere (basic regex)
grep -E '\|' f.txt # equivalent form with extended regex
	$$$ , on line six, isn't a thing with meaning , but here's a pipe: |

grep -E '(#|!)' f.txt   # Find lines with either # or !
	$12223.45, on line three, is a lot of money!
	the seventh, with internal Capital Letter, is 2 lines past # 4

grep '^\W' f.txt    # Find lines that start with a non-letter, non-number (and non-underscore)
	$12223.45, on line three, is a lot of money!
	$$$ , on line six, isn't a thing with meaning , but here's a pipe: |

grep -E '[[:digit:]].*\.[[:digit:]]{2}' f.txt # get lines with a digit, then anything at all, then a dot, then two more digits
	$12223.45, on line three, is a lot of money!
	1223.45 (line four) has, at one point, exactly two 2's in a row

Backreferences

grep can remember matches it made earlier in a line. These are called backreferences. You create a backreference by enclosing part of a pattern in parentheses; you refer to it with \1 for the first, \2 for the second, and so on. Originally you were limited to nine backreferences in a single regex, but in some implementations you may now have more.

This first example uses a backreference to locate every line that contains a doubled lower-case character.

grep '\([a-z]\)\1' f.txt # equivalent: grep -E '([a-z])\1' f.txt
Line one ends in a dollar sign; here is $$$
Line two has "$$$" in the middle and \ near the end
$12223.45, on line three, is a lot of money!
line five has no dollar signs and no punctuation but it does end in x
the seventh, with internal Capital Letter, is 2 lines past # 4

Here we locate every line in which a word is duplicated:

grep -E '( .+ ).*\1' f.txt # equivalent: grep '\( .\+ \).*\1' f.txt
Line two has "$$$" in the middle and \ near the end
line five has no dollar signs and no punctuation but it does end in x
$$$ , on line six, isn't a thing with meaning , but here's a pipe: |
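When a backreference pattern matches, it is not always obvious which part of the line did the matching. One way to find out is the -o flag, which prints only the matched text; it is not covered above, but both GNU and BSD grep have it. This sketch rebuilds f.txt in a hypothetical scratch directory and sets LC_ALL=C (my assumption) so that [a-z] means lower-case letters only.

```shell
LC_ALL=C; export LC_ALL   # assumption: makes [a-z] mean exactly a through z
workdir=$(mktemp -d)      # hypothetical scratch directory
cd "$workdir" || exit 1
cat > f.txt <<'EOF'
Line one ends in a dollar sign; here is $$$
Line two has "$$$" in the middle and \ near the end
$12223.45, on line three, is a lot of money!
1223.45 (line four) has, at one point, exactly two 2's in a row
line five has no dollar signs and no punctuation but it does end in x
$$$ , on line six, isn't a thing with meaning , but here's a pipe: |
the seventh, with internal Capital Letter, is 2 lines past # 4
EOF

matching_lines=$(grep -c '\([a-z]\)\1' f.txt)   # lines with a doubled letter
pairs=$(grep -o '\([a-z]\)\1' f.txt)            # -o: just the doubled letters

echo "$matching_lines lines contain a doubled lower-case letter:"
echo "$pairs"
```

The pairs come from dollar, middle, three, dollar again, and Letter. The runs of 2's and dollar signs are ignored because they are not lower-case letters.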

Operating on Multiple Files

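When grep examines multiple files, by default it acts as if it had operated separately on each file, and it puts the file name and a colon in front of every line it reports so that you can tell the files apart. Often you will want to modify this behavior, and the command-line arguments, especially -c and -n, will help.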
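Here is a sketch with two small made-up files, a.txt and b.txt, in a hypothetical scratch directory. The last command shows one way to get a single count across both files, by concatenating them first with cat.

```shell
workdir=$(mktemp -d)   # hypothetical scratch directory and file names
cd "$workdir" || exit 1
printf '%s\n' 'alpha 2' 'beta' > a.txt
printf '%s\n' 'gamma 2' 'delta 22' 'epsilon' > b.txt

plain=$(grep 2 a.txt b.txt)            # each hit is prefixed with its file name
counts=$(grep -c 2 a.txt b.txt)        # one count per file
numbered=$(grep -n 2 a.txt b.txt)      # file name, then line number within that file
total=$(cat a.txt b.txt | grep -c 2)   # a single count across both files

echo "$plain"
echo "$counts"
echo "$numbered"
echo "total: $total"
```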