A regular expression ("regex") is a pattern that you use to match lines of text. You might, for example, want to extract all the lines from a file that include the word "delinquent," or find all the files whose names include an upper-case letter. Regular expressions provide a very powerful way to set up queries of this sort. They are intended for text, although they can be used on binary files (like Word documents, say) under some circumstances.
Nearly every scripting language supports regular expressions, and libraries are available for C, Java, and essentially all other languages, because regular expressions are so widely useful.
Glob: The simplest pattern-matching approach is the glob, which is also known as the wild-card match. Regular expressions are much richer than globs (and, unsurprisingly, much more complicated). The glob characters are the asterisk, *, which matches any sequence of characters (including none); the question mark, ?, which matches exactly one character; and the square brackets, [ and ], which let you choose among the characters listed between them.
Globbing is used at the bash command-line for filename expansion. So the command ls J* will list all files starting with upper-case J; ls J? will list all files with two-character names of which the first is J; ls [Jd]* lists file names that start with either J or d.
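The three globs above can be tried safely in a scratch directory. This is a minimal sketch; the file names (J1, Jane.txt, data.csv, notes) are made up for the illustration, and echo is used instead of ls so that the shell's own expansion is what gets printed.

```shell
# Create a scratch directory holding a few hypothetical file names.
demo_dir=$(mktemp -d)
touch "$demo_dir/J1" "$demo_dir/Jane.txt" "$demo_dir/data.csv" "$demo_dir/notes"

# Expand each glob from inside the directory so only bare names are printed.
star=$(cd "$demo_dir" && echo J*)        # names starting with J: J1 Jane.txt
one=$(cd "$demo_dir" && echo J?)         # two-character names starting with J: J1
choice=$(cd "$demo_dir" && echo [Jd]*)   # names starting with J or d (order depends on locale)

echo "J*    -> $star"
echo "J?    -> $one"
echo "[Jd]* -> $choice"

# Clean up the scratch directory.
rm -r "$demo_dir"
```

Note that when a glob matches nothing, bash by default passes the pattern through unchanged, which is why every pattern in the sketch is chosen to match at least one file.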
POSIX regular expressions come in two forms: "basic" and "extended." The difference is primarily in which characters are "special." See escaping below. Regular expressions are implemented in the grep command (and its siblings), which you run from bash (found in Cygwin on Windows and Terminal on OS X), and in grep() and its siblings in R. By default, the grep command uses basic regular expressions but R uses extended ones. To use extended regular expressions with the grep command, run egrep or pass grep the -E flag (note that it's upper-case).
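As a small illustration of the basic/extended difference, the sketch below runs the same pattern both ways. The three-line sample file is invented for the example; the pipe character is one of the characters that is special only in extended mode.

```shell
# A made-up sample file: two ordinary words and one line with a literal pipe.
sample=$(mktemp)
printf '%s\n' 'cat' 'dog' 'a|b' > "$sample"

# Basic mode: | is an ordinary character, so only the line "a|b" matches.
basic_count=$(grep -c 'a|b' "$sample")

# Extended mode: | means "or", so any line containing an a or a b matches
# ("cat" and "a|b").
extended_count=$(grep -E -c 'a|b' "$sample")

echo "basic: $basic_count, extended: $extended_count"   # basic: 1, extended: 2

rm "$sample"
```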
    Line one ends in a dollar sign; here is $$$
    Line two has "$$$" in the middle and \ near the end
    $12223.45, on line three, is a lot of money!
    1223.45 (line four) has, at one point, exactly two 2's in a row
    line five has no dollar signs and no punctuation but it does end in x
    $$$ , on line six, isn't a thing with meaning , but here's a pipe: |
    the seventh, with internal Capital Letter, is 2 lines past # 4

A regular expression is a pattern. When a regular character appears in a regular expression, it means that you are requiring a match to that character. So this command:
    $12223.45, on line three, is a lot of money!
    1223.45 (line four) has, at one point, exactly two 2's in a row
    seventh line, with internal Capital Letter, is 2 lines past line 4
    egrep ; f.txt

produces an error, because bash thinks you wanted to use the semicolon to separate two commands. To protect the semicolon from bash and make sure it gets to grep, we enclose it in quotes:
    egrep ";" f.txt
    Line one ends in a dollar sign; $$$

Note on quotes in bash: Single quotes and double quotes are not interchangeable in bash. Inside single quotes, every character is literal, including the backslash, \ (unlike in R strings, where the backslash is an escape character). As a result, you can't embed a single quote inside other single quotes, since bash will think that the embedded quote is paired with the opening one. Inside double quotes, every character is what you think it is, except for $ (dollar sign), \ (backslash), and ` (backtick). If you need to use double quotes and you want one of those characters to reach the command as itself, you need to protect it with a backslash. And if the character you want to pass along is a backslash, then, because backslash is itself special, you have to protect it with a second backslash. For example, we will learn in a moment that the dollar sign is a special character meaning "end of line." In order to use the dollar sign to mean "match the dollar sign character," it needs to be "escaped," which (as I say below) means to tell the regular expression evaluator that this character has a different use than it does usually.
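One way to check what bash actually hands to a command is to print the argument with printf, which shows its argument unchanged when given the %s format. The sketch below is only an illustration; the string \$HOME is an arbitrary choice that happens to contain both a backslash and a dollar sign.

```shell
# Single quotes: bash passes every character through untouched.
single=$(printf '%s\n' '\$HOME')

# Double quotes: bash consumes the backslash and passes a plain dollar sign.
double=$(printf '%s\n' "\$HOME")

# Double quotes again: "\\" becomes one backslash and "\$" becomes a dollar sign.
twice=$(printf '%s\n' "\\\$HOME")

echo "single quotes : $single"   # \$HOME
echo "double quotes : $double"   # $HOME
echo "double, twice : $twice"    # \$HOME
```

The same trick works for any pattern you are about to give to grep: print it first, and you will see exactly what the regular expression engine is going to receive.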
There is another kind of "protecting" (called "escaping") that we have to do. "Protecting" is my word for how you deal with bash. "Escaping" is when you're dealing with the regular expression evaluator (also called the "engine"). Some characters have special uses inside a pattern. When you want to use that character as itself, you need to "escape" it by preceding it with a backslash (the \ character). If you escape a character that isn't special, the backslash is usually just ignored, although some escaped ordinary characters (such as \w, described later) have meanings of their own.

Escaping Examples: Now consider these examples:
    grep '$' f.txt          # returns all lines with ends (i.e. all lines)
    grep "$" f.txt          # same
    grep '\$' f.txt         # returns lines with dollar signs
    grep "\$" f.txt         # returns all lines, because bash treats the backslash as protecting the dollar sign, so grep sees only $
    grep "\\$" f.txt        # returns lines with dollar signs, because you've made the backslash explicit
    grep '\$$' f.txt        # returns lines that end with dollar signs; the '\$' says "literal dollar sign" and the second '$' says "end of line"
    grep "\\$\$" f.txt      # returns lines that end with dollar signs; the "\\" says "backslash," the '$' says "literal $" and the "\$" says "end of line"
    grep '\$\$' f.txt       # returns lines with two dollar signs in a row
    grep '\$\$$' f.txt      # returns lines that end with two dollar signs
    grep "\$\$" f.txt       # grep sees $$; with -E this means "end of line, followed by end of line" and returns all lines, but in basic mode POSIX treats a $ that is not at the end of the pattern as a literal, so expect lines that end with a dollar sign
    grep "\\$\\$\$" f.txt   # returns lines that end with two dollar signs
    grep '\\' f.txt         # returns lines with backslashes; the first "escapes" the second
    grep "\\\\" f.txt       # returns lines with backslashes; we need to pass two backslashes, but since we're in double quotes, the first protects the second, producing one; the third protects the fourth, producing a second
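To make a few of these cases checkable, here is a minimal sketch that counts matches in a small file whose contents are known exactly. The four-line file is invented for the illustration (it is not f.txt), and grep's -c flag reports the number of matching lines.

```shell
# A made-up sample file; the quoted 'EOF' stops bash from touching $ or \.
sample=$(mktemp)
cat > "$sample" <<'EOF'
price is $5
ends with $
no money here
back\slash
EOF

all_lines=$(grep -c '$' "$sample")         # end of line: every line, so 4
with_dollar=$(grep -c '\$' "$sample")      # literal dollar sign anywhere: 2
ends_dollar=$(grep -c '\$$' "$sample")     # literal dollar sign at end of line: 1
dq_dollar=$(grep -c "\\$" "$sample")       # bash passes \$ to grep, so again 2
with_backslash=$(grep -c '\\' "$sample")   # literal backslash: 1

echo "$all_lines $with_dollar $ends_dollar $dq_dollar $with_backslash"   # 4 2 1 2 1

rm "$sample"
```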
| Character | Name | * | Purpose | Example |
|---|---|---|---|---|
| . | Period | | Matches any character | t.e matches lines with tae, tbe, ..., t1e, t;e, etc. anywhere |
| [ and ] | Brackets | | Matches any character between them | t[13579]e matches t1e, t3e, ..., t9e; t[1-6]e matches t1e, t2e, ..., t6e, but watch out: [a-d] might mean [abcd] or it might mean [aAbBcCdD], depending on your computer (specifically, its locale settings). See character classes below. |
| ^ | Caret | | (i) Matches "start of line" | ^Line matches any line starting Line |
| | | | (ii) Acts as "not" as first character inside square brackets | t[^h]e matches any line containing a t, then something that is not an h, then an e |
| | | | (iii) Acts as itself inside square brackets if not first | t[h^i]e matches the, t^e, or tie |
| $ | Dollar sign | | Matches "end of line" | the$ matches lines ending in the |
| \| | Pipe | * | "Or" operator | th\|sc matches either th or sc |
| ( and ) | Parentheses | * | Grouping operators | |
| \ | Backslash | | Escape character | See the discussion of escaping |

These characters handle repetitions:

| Character | Name | * | Purpose | Example |
|---|---|---|---|---|
| { and } | Braces | * | Enclose repetition operators | (a\|b){3} matches any line with three (a or b) characters in a row: aba, baa, bbb, ... |
| , | Comma | | Separate repetition operators | b{2,4} matches bb, bbb or bbbb |
| + | Plus sign | * | Match one or more times | ab+ matches ab, abb, abbb... |
| * | Asterisk | | Match zero or more times | ab* matches a, ab, abb, abbb... |
| ? | Question mark | * | Match zero or 1 times | ab? matches a or ab |
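Several rows of the tables can be checked against a short word list. The sketch below is an illustration only: the word list is made up, and each pattern is anchored with ^ and $ so that the whole line has to match, which makes the counts easy to verify by eye.

```shell
# A made-up word list, one word per line.
words=$(mktemp)
printf '%s\n' 'a' 'ab' 'abb' 'abbb' 'the' 'tie' 't^e' 'tae' > "$words"

plus=$(grep -E -c '^ab+$' "$words")          # ab, abb, abbb          -> 3
star=$(grep -E -c '^ab*$' "$words")          # a, ab, abb, abbb       -> 4
question=$(grep -E -c '^ab?$' "$words")      # a, ab                  -> 2
braces=$(grep -E -c '^ab{2,3}$' "$words")    # abb, abbb              -> 2
either=$(grep -E -c '^(th|ti)e$' "$words")   # the, tie               -> 2
period=$(grep -c '^t.e$' "$words")           # the, tie, t^e, tae     -> 4
negated=$(grep -c '^t[^h]e$' "$words")       # tie, t^e, tae          -> 3
literal=$(grep -c '^t[h^i]e$' "$words")      # the, tie, t^e          -> 3

echo "$plus $star $question $braces $either $period $negated $literal"

rm "$words"
```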
The pattern ^[^[:upper:]].* will match every non-empty line that doesn't start with an upper-case letter. Here the caret character, ^, is being used in two different ways. At the beginning of the pattern, it means "start of line"; the second caret, appearing as the first character inside square brackets, is the negation operation.
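A quick check of this pattern, as a sketch on an invented four-line file whose last line is empty:

```shell
# A made-up sample: one capitalized line, two others, and one empty line.
sample=$(mktemp)
printf '%s\n' 'Apple' 'banana' '3 pears' '' > "$sample"

# Lines whose first character exists and is not an upper-case letter.
not_upper=$(grep -c '^[^[:upper:]].*' "$sample")   # banana, 3 pears -> 2

# Lines that do start with an upper-case letter, for comparison.
upper=$(grep -c '^[[:upper:]]' "$sample")          # Apple -> 1

echo "not upper: $not_upper, upper: $upper"

rm "$sample"
```

The empty line is counted by neither pattern, because a bracket expression, negated or not, always has to match one actual character.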
\w is a synonym for [_[:alnum:]], that is, any letter, digit, or underscore (so in this case, the outer square brackets are already in place), and \W is its complement, [^_[:alnum:]]. So the pattern ^\W matches lines whose first character is not a letter, a digit, or an underscore.
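The sketch below tries \w and \W on a made-up five-line file. It assumes a grep that supports these shorthands (GNU grep does; they are not part of the POSIX standard).

```shell
# A made-up sample; the last line is empty.
sample=$(mktemp)
printf '%s\n' 'Apple' '_under' ' indented' '#hash' '' > "$sample"

# Lines starting with a letter, digit, or underscore: Apple, _under.
word_start=$(grep -c '^\w' "$sample")

# Lines starting with any other character: " indented", "#hash".
nonword_start=$(grep -c '^\W' "$sample")

echo "word start: $word_start, non-word start: $nonword_start"   # 2 and 2

rm "$sample"
```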
-c: count (report the number of lines that match)
-F: treat the pattern as fixed (literal)
-i: ignore case (in both pattern and text)
-n: show matching line numbers, in addition to the lines themselves
-r: recurse (go into sub-directories)
-v: invert (find only non-matching lines)
-w: word match (the match must be at the beginning of the line or preceded by a character that isn't a letter, digit, or underscore, and it must also be at the end of the line or followed by such a character)
As I say, there are many more.
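Most of the flags listed above can be seen in action on a tiny file. This is a sketch only; the file contents are invented so that each flag changes the result in a visible way.

```shell
# A made-up four-line sample file.
notes=$(mktemp)
printf '%s\n' 'The cat sat' 'concatenate' 'a.c file' 'abc file' > "$notes"

count=$(grep -c 'cat' "$notes")         # -c: "The cat sat" and "concatenate" -> 2
nocase=$(grep -c -i 'the' "$notes")     # -i: "The" matches "the" -> 1
fixed=$(grep -F 'a.c' "$notes")         # -F: the period is literal -> a.c file
numbered=$(grep -n 'abc' "$notes")      # -n: line number prefix -> 4:abc file
inverted=$(grep -v -c 'cat' "$notes")   # -v: lines without "cat" -> 2
whole=$(grep -w 'cat' "$notes")         # -w: "cat" as a whole word -> The cat sat

echo "$count | $nocase | $fixed | $numbered | $inverted | $whole"

rm "$notes"
```

Without -F, the pattern a.c would also match "abc file", because the period would mean "any character."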
This first example uses a backreference to locate every line that contains a doubled lower-case character.
    grep '\([a-z]\)\1' f.txt    # equivalent: grep -E '([a-z])\1' f.txt
    Line one ends in a dollar sign; here is $$$
    Line two has "$$$" in the middle and \ near the end
    $12223.45, on line three, is a lot of money!
    line five has no dollar signs and no punctuation but it does end in x
    the seventh, with internal Capital Letter, is 2 lines past # 4

Here we locate every line in which a word is duplicated:
    grep -E '( .+ ).*\1' f.txt    # equivalent: grep '\( .\+ \).*\1' f.txt
    Line two has "$$$" in the middle and \ near the end
    $$$ , on line six, isn't a thing with meaning , but here's a pipe: |
    the seventh, with internal Capital Letter, is 2 lines past # 4
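Both backreference patterns can be tried on a smaller file where the expected matches are easy to see. The sketch below uses an invented four-line file; it assumes a grep that accepts backreferences in extended mode, as GNU grep does.

```shell
# A made-up sample file.
sample=$(mktemp)
printf '%s\n' 'hello world' 'abc def' 'it is what it is ok' 'one two three' > "$sample"

# Doubled lower-case letter: "hello" (ll) and "three" (ee).
doubled_basic=$(grep -c '\([a-z]\)\1' "$sample")
doubled_extended=$(grep -E -c '([a-z])\1' "$sample")

# A space-delimited string that appears twice: " is " in the third line.
repeated=$(grep -E '( .+ ).*\1' "$sample")

echo "doubled: $doubled_basic (basic), $doubled_extended (extended)"   # 2 and 2
echo "repeated: $repeated"                                             # it is what it is ok

rm "$sample"
```

Because the second pattern requires a space on both sides of the repeated string, a word at the very start or end of a line cannot take part in the match.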