Welcome to today’s tutorial on how to search text files using regular expressions with tools such as grep, egrep, fgrep, sed, and find. String-searching algorithms are widely used by data-processing tasks, so much so that Unix-like operating systems have their own ubiquitous implementation: regular expressions, often abbreviated to REs. Regular expressions consist of character sequences that make up a generic pattern used to locate, and sometimes modify, a corresponding sequence in a larger string of characters. Regular expressions greatly expand the ability to:
- Write parsing rules for requests in HTTP servers, nginx in particular.
- Write scripts that convert text-based datasets to another format.
- Search for occurrences of interest in journal entries or documents.
- Filter markup documents, keeping semantic content.
Differences Between Basic and Extended Regular Expressions
Basic Regular Expressions
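Basic regular expressions (BREs) include characters such as a dot followed by an asterisk (.*) to represent any number of characters and a single dot (.) to represent exactly one character. They may also use brackets to match one character out of a set, such as [aeiou] (do not separate the members with commas, because a comma inside the brackets is treated as one more character to match), or out of a range of characters, such as [A-Za-z]. Avoid writing the range as [A-z]: in ASCII order it also covers the punctuation characters that sit between Z and a. When brackets are employed, it is called a bracket expression.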
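To make the dot and bracket rules concrete, here is a minimal sketch that runs grep against a small sample file. The file contents and the variable names are made up for this illustration, and the last check assumes the C locale so that ranges follow ASCII order.

```shell
# Hypothetical sample data, created only for this demonstration.
sample=$(mktemp)
printf '%s\n' 'bag' 'beg' 'b,g' 'bg' 'b_g' > "$sample"

# [aeiou] matches exactly one vowel between b and g.
vowels=$(grep 'b[aeiou]g' "$sample")

# A comma inside the brackets is a literal member of the set, so "b,g" matches too.
with_comma=$(grep 'b[a,e]g' "$sample")

# A single dot matches any one character; only "bg" has nothing between b and g.
any_char=$(grep -c 'b.g' "$sample")

# In the C locale, [A-z] also covers the punctuation between Z and a, including "_".
wide_range=$(LC_ALL=C grep 'b[A-z]g' "$sample")

printf '%s\n' "$vowels" '---' "$with_comma" '---' "$any_char" '---' "$wide_range"
rm -f "$sample"
```

The last command is the reason [A-Za-z] is the safer way to write "any ASCII letter".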
Anchor characters: To find text file records that begin with particular characters, you can precede them with a caret (^) symbol. To find text file records where particular characters are at the record’s end, follow them with a dollar sign ($) symbol. Both the caret and the dollar sign symbols are called anchor characters for BREs, because they fasten the pattern to the beginning or the end of a text line.
Using the grep command with a BRE pattern:
$ grep root /etc/passwd
root:x:0:0:root:/root:/bin/bash
nm-openvpn:x:115:121:NetworkManager OpenVPN,,,:/var/lib/openvpn/chroot:/usr/sbin/nologin
The above command searches for instances of the word root within the password file. Notice that it displays two lines from the file.
Using grep to display only lines that begin with the PATTERN:
$ grep ^root /etc/passwd
root:x:0:0:root:/root:/bin/bash
The above command employs the BRE ^ character and places it before the word root. This regular expression pattern causes grep to display only lines in the password file that begin with root.
Using the grep command to audit the password file:
$ grep -v nologin$ /etc/passwd
root:x:0:0:root:/root:/bin/bash
sync:x:4:65534:sync:/bin:/bin/sync
tss:x:103:108:TPM software stack,,,:/var/lib/tpm:/bin/false
speech-dispatcher:x:111:29:Speech Dispatcher,,,:/run/speech-dispatcher:/bin/false
hplip:x:116:7:HPLIP system user,,,:/run/hplip:/bin/false
whoopsie:x:117:122::/nonexistent:/bin/false
gnome-initial-setup:x:121:65534::/run/gnome-initial-setup/:/bin/false
gdm:x:122:127:Gnome Display Manager:/var/lib/gdm3:/bin/false
frank:x:1000:1000:frank_bett,,,:/home/frank:/usr/bin/zsh
mysql:x:995:1001::/home/mysql:/bin/sh
pilot:x:1001:1002:,,,:/home/pilot:/bin/bash
The -v option is useful when auditing your configuration files with the grep utility. It produces a list of text file records that do not contain the pattern. The above output shows an example of finding all the records in the password file that do not end in nologin. Notice that the BRE pattern puts the $ at the end of the word. If you were to place the $ before the word, the shell would treat it as a variable name instead of passing it to grep as part of the BRE pattern.
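The anchor and quoting behaviour described above can be tried safely on a throwaway file instead of the real /etc/passwd. This is only a sketch: the three records, including the nologin-admin account, are invented for the example.

```shell
# Hypothetical password-style records, created only for this demonstration.
passwd_sample=$(mktemp)
printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'sshd:x:123:65534::/run/sshd:/usr/sbin/nologin' \
  'nologin-admin:x:1500:1500::/home/nologin-admin:/bin/bash' > "$passwd_sample"

# Single quotes keep the shell from interpreting ^ or $.
starts_with_root=$(grep '^root' "$passwd_sample")

# Only records that END in nologin are dropped; nologin-admin stays.
no_nologin_shell=$(grep -v 'nologin$' "$passwd_sample")

# With the $ in front, the shell expands the (unset) variable to an empty
# string, and an empty pattern matches every line.
unset nologin
expanded=$(grep -c "$nologin" "$passwd_sample")

printf '%s\n' "$starts_with_root" '---' "$no_nologin_shell" '---' "$expanded"
rm -f "$passwd_sample"
```

Quoting the pattern in single quotes is a good habit even when, as in the tutorial's command, the unquoted form happens to work.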
Character classes: A special group of bracket expressions are character classes. These bracket expressions have predefined names and could be considered bracket expression shortcuts. Their interpretation is based on the LC_CTYPE locale environment variable.
Commonly Used Character Classes:
[:alnum:]: Represents an alphanumeric character.
[:alpha:]: Represents an alphabetic character.
[:ascii:]: Represents a character that fits into the ASCII character set. This class is not part of POSIX, so not every tool accepts it.
[:blank:]: Represents a blank character, that is, a space or a tab.
[:cntrl:]: Represents a control character.
[:digit:]: Represents a digit (0 through 9).
[:graph:]: Represents any printable character except space.
[:lower:]: Represents a lowercase character.
[:print:]: Represents any printable character including space.
[:punct:]: Represents any printable character which is not a space or an alphanumeric character.
[:space:]: Represents white-space characters: space, form-feed (\f), newline (\n), carriage return (\r), horizontal tab (\t), and vertical tab (\v).
[:upper:]: Represents an uppercase letter.
[:xdigit:]: Represents hexadecimal digits (0 through F).
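Given our file users.txt, let’s check what it contains with cat:
$ cat users.txt
pilot 3434 frank 64646 sshd 77767
Using the grep command with a character class. Note the double brackets: the class name [:digit:] must itself sit inside a bracket expression.
$ grep "[[:digit:]]" users.txt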
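Here is a small self-contained sketch of character classes in action. It does not reuse users.txt; the sample file and its contents are made up so that the result of each command is unambiguous.

```shell
# Hypothetical sample data, created only for this demonstration.
accounts=$(mktemp)
printf '%s\n' 'pilot 3434' 'frank' 'sshd 77767' 'Guest' > "$accounts"

# Lines containing at least one digit.
with_digits=$(grep '[[:digit:]]' "$accounts")

# Lines containing no digit at all (-v inverts the match).
no_digits=$(grep -v '[[:digit:]]' "$accounts")

# Lines whose first character is an uppercase letter.
starts_upper=$(grep '^[[:upper:]]' "$accounts")

printf '%s\n' "$with_digits" '---' "$no_digits" '---' "$starts_upper"
rm -f "$accounts"
```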
Quantifiers: An atom is a single character, which may or may not have special meaning, or a bracket expression. The reach of an atom can be adjusted using an atom quantifier. Atom quantifiers define atom sequences; that is, a match occurs when a contiguous repetition of the atom is found in the string. The substring corresponding to the match is called a piece. However, quantifiers and other features of regular expressions are treated differently depending on which standard is being used.
The * quantifier has the same function in both basic and extended REs (the atom occurs zero or more times), and it is a literal character if it appears at the beginning of the regular expression or if it is preceded by a backslash (\).
In extended REs, the plus sign quantifier (+) selects pieces containing one or more atom matches in sequence. With the question mark quantifier (?), a match occurs if the corresponding atom appears once or if it does not appear at all. If preceded by a backslash (\), their special meaning is not considered.
Basic regular expressions also support the + and ? quantifiers, but they need to be preceded by a backslash (this is a GNU extension rather than part of POSIX). Unlike in extended regular expressions, + and ? by themselves are literal characters in basic regular expressions.
. (dot): Atom matches any single character.
^ (caret): Atom matches the beginning of a line.
$ (dollar sign): Atom matches the end of a line.
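The difference between the two flavors is easiest to see by running the same quantifier through grep with and without -E. The sketch below uses an invented word list; the \+ form assumes GNU grep, since backslash-escaped quantifiers in BREs are an extension.

```shell
# Hypothetical sample data, created only for this demonstration.
words=$(mktemp)
printf '%s\n' 'ac' 'abc' 'abbc' 'ab+c' > "$words"

# * means "zero or more" in both flavors: ac, abc and abbc match.
star=$(grep -c 'ab*c' "$words")

# In a BRE, a bare + is a literal plus sign.
bre_plus=$(grep 'ab+c' "$words")

# With -E, + means "one or more".
ere_plus=$(grep -E 'ab+c' "$words")

# Assumption: GNU grep, which accepts \+ as a quantifier in a BRE.
bre_escaped=$(grep 'ab\+c' "$words")

# With -E, ? means "zero or one".
ere_question=$(grep -E '^ab?c$' "$words")

printf '%s\n' "$star" '---' "$bre_plus" '---' "$ere_plus" '---' "$bre_escaped" '---' "$ere_question"
rm -f "$words"
```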
Extended Regular Expressions
Extended regular expressions (EREs) allow more complex patterns. For example, a vertical bar symbol (|) allows you to specify two possible words or character sets to match. You can also employ parentheses to designate additional subexpressions. The best examples of EREs are the grep -E commands discussed below.
One of the most common uses of grep is to facilitate the inspection of long files, using the regular expression as a filter applied to each line. It can be used to show only the lines starting with a certain term. The grep command is powerful in its use of regular expressions, which will help with filtering text files.
grep [OPTION] PATTERN [FILE…]
Commonly Used Options with grep Command
|Option||Description|
|-c||Display a count of text file records that contain a PATTERN match.|
|-d action||When a file is a directory, if action is set to read, read the directory as if it were a regular text file; if action is set to skip, ignore the directory; and if action is set to recurse, act as if the -R, -r, or --recursive option was used.|
|-E||Designate the PATTERN as an extended regular expression.|
|-i||Ignore the case in the PATTERN as well as in any text file records.|
|-R, -r||Search a directory’s contents, and for any subdirectory within the original directory tree, consecutively search its contents as well (recursively).|
|-v||Display only text file records that do not contain a PATTERN match.|
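The options in the table can be combined freely. The following sketch exercises -c, -i, -v and -r on a throwaway directory; the directory layout, file names and log lines are all invented for the example.

```shell
# Hypothetical directory tree, created only for this demonstration.
workdir=$(mktemp -d)
mkdir "$workdir/sub"
printf '%s\n' 'Error: disk full' 'all good' 'error: retry' > "$workdir/app.log"
printf '%s\n' 'ERROR: timeout' > "$workdir/sub/old.log"

# -c counts matching records; the match is case-sensitive by default.
count=$(grep -c 'error' "$workdir/app.log")

# -i ignores case, so "Error" is counted as well.
count_ci=$(grep -c -i 'error' "$workdir/app.log")

# -v keeps only the records that do NOT match.
inverted=$(grep -v -i 'error' "$workdir/app.log")

# -r descends into sub/ too; wc -l counts every matching record found.
recursive=$(grep -r -i 'error' "$workdir" | wc -l)

printf '%s\n' "$count" "$count_ci" "$inverted" "$recursive"
rm -rf "$workdir"
```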
Using a simple grep command to search a file. No options are used, and the grep utility is used to search for the word frank (PATTERN) within /etc/passwd (FILE).
$ grep frank /etc/passwd
frank:x:1000:1000:frank_bett,,,:/home/frank:/usr/bin/zsh
Notice that in the above output the grep command returns each file record (line) that contains an instance of the PATTERN, which in this case was the word frank.
The egrep command is the same as grep -E. It interprets PATTERNS as extended regular expressions. EREs allow more complex patterns.
Using the grep -E command with an ERE pattern:
$ grep -E "^root|^pilot" /etc/passwd
root:x:0:0:root:/root:/bin/bash
pilot:x:1001:1002:,,,:/home/pilot:/bin/bash
In the above output, the grep command uses the -E option to indicate the pattern is an extended regular expression. Without the -E option, the vertical bar would be treated as a literal character under basic regular expression rules, and the command would not return these records. Quotation marks around the ERE pattern protect it from misinterpretation by the shell. The command searches for any password file records that start with either the word root or the word pilot. Thus, a caret (^) is placed prior to each word, and a vertical bar (|) separates the words to indicate that the record can start with either word.
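To see what actually happens when -E is left out, the sketch below runs the same alternation pattern in both modes against an invented three-record file. The \| form is shown as well, on the assumption that GNU grep is in use, since alternation in a BRE is a GNU extension.

```shell
# Hypothetical password-style records, created only for this demonstration.
records=$(mktemp)
printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'frank:x:1000:1000::/home/frank:/usr/bin/zsh' \
  'pilot:x:1001:1002::/home/pilot:/bin/bash' > "$records"

# Without -E the | is a literal character, so nothing matches.
# grep exits non-zero when nothing matches, hence the "|| true".
bre_count=$(grep -c '^root|^pilot' "$records" || true)

# With -E the | separates two alternatives.
ere_matches=$(grep -E '^root|^pilot' "$records")

# Assumption: GNU grep, which accepts \| for alternation in a BRE.
gnu_bre_count=$(grep -c '^root\|^pilot' "$records")

printf '%s\n' "$bre_count" '---' "$ere_matches" '---' "$gnu_bre_count"
rm -f "$records"
```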
Using the egrep command with an ERE pattern:
$ egrep "(sshd|s).*sh" /etc/passwd
frank:x:1000:1000:frank_bett,,,:/home/frank:/usr/bin/zsh
sshd:x:123:65534::/run/sshd:/usr/sbin/nologin
mysql:x:995:1001::/home/mysql:/bin/sh
In the above output, notice that the egrep command is employed. The egrep command is equivalent to using the grep -E command. The ERE pattern here also uses quotation marks to avoid misinterpretation and employs parentheses to create a subexpression. The subexpression consists of a choice, indicated by the vertical bar (|), between the word sshd and the letter s. Also in the ERE pattern, the .* symbols are used to indicate there can be anything in between the subexpression choice and the word sh in the text file record.
The fgrep command is the same as grep -F. It interprets PATTERNS as fixed strings, not regular expressions. Combined with the -f option, it can search for patterns stored in a text file.
Using the grep -F command to search for patterns stored in a text file. We have a file users.txt; let’s look at its contents with cat:
$ cat users.txt
pilot
frank
sshd
Using the fgrep command to search the /etc/passwd file for the stored patterns:
$ fgrep -f users.txt /etc/passwd
frank:x:1000:1000:frank_bett,,,:/home/frank:/usr/bin/zsh
sshd:x:123:65534::/run/sshd:/usr/sbin/nologin
pilot:x:1001:1002:,,,:/home/pilot:/bin/bash
In the above output, the patterns are stored in the users.txt file, which is first displayed using the cat command. Next, the fgrep command is employed, along with the -f option to indicate the file that holds the patterns. The /etc/passwd file is searched for all the patterns stored within the users.txt file, and the results are displayed.
Using the grep -F command to search the /etc/passwd file for the stored patterns:
$ grep -F -f users.txt /etc/passwd
frank:x:1000:1000:frank_bett,,,:/home/frank:/usr/bin/zsh
sshd:x:123:65534::/run/sshd:/usr/sbin/nologin
pilot:x:1001:1002:,,,:/home/pilot:/bin/bash
Notice that the grep -F command is equivalent to using the fgrep command, which is why the two commands produce identical results.
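Fixed-string matching matters most when the search text contains characters that a regular expression would treat specially. The sketch below searches for an IP address both ways; the host entries and the patterns file are invented for the example.

```shell
# Hypothetical sample data, created only for this demonstration.
hosts=$(mktemp)
patterns=$(mktemp)
printf '%s\n' '10.0.0.1 gateway' '10a0b0c1 decoy' > "$hosts"
printf '%s\n' 'gateway' 'decoy' > "$patterns"

# As a regular expression, each dot matches any character,
# so the decoy line matches as well.
as_regex=$(grep -c '10.0.0.1' "$hosts")

# With -F the dots are literal, so only the real address matches.
as_fixed=$(grep -F '10.0.0.1' "$hosts")

# -f reads one pattern per line from a file; -F treats each as a fixed string.
from_file=$(grep -c -F -f "$patterns" "$hosts")

printf '%s\n' "$as_regex" "$as_fixed" "$from_file"
rm -f "$hosts" "$patterns"
```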
Searching with Regular Expressions
The immediate benefit offered by regular expressions is to improve searches on filesystems and in text documents. The -regex option of the find command allows you to test every path in a directory hierarchy against a regular expression.
$ find $HOME -regex '.*/\..*' -size +100M
/home/frank/.vagrant.d/boxes/generic-VAGRANTSLASH-fedora28/3.2.10/virtualbox/generic-fedora28-virtualbox-disk001.vmdk
/home/frank/.vagrant.d/boxes/ubuntu-VAGRANTSLASH-focal64/20210302.0.0/virtualbox/ubuntu-focal-20.04-cloudimg.vmdk
/home/frank/.vagrant.d/boxes/generic-VAGRANTSLASH-centos8/3.2.10/virtualbox/generic-centos8-virtualbox-disk001.vmdk
/home/frank/.vagrant.d/boxes/gusztavvargadr-VAGRANTSLASH-docker-linux/2010.0.2012/virtualbox/gusztavvargadr-u1604s-dc-2012.0.0-1608130612-disk001.vmdk
/home/frank/.vagrant.d/boxes/batrusi-VAGRANTSLASH-suse_minimal/0.0.1/virtualbox/box-disk001.vmdk
The above command searches for files greater than 100 megabytes (100 units of 1048576 bytes), but only in paths inside the user’s home directory that match .*/\..*, that is, a /. surrounded by any number of other characters. In other words, only hidden files or files inside hidden directories will be listed, regardless of the position of /. in the corresponding path.
For case-insensitive regular expressions, the -iregex option is used instead:
$ find /usr/share/fonts -regextype posix-extended -iregex '.*(dejavu|liberation).*sans.*(italic|oblique).*'
/usr/share/fonts/truetype/liberation2/LiberationSans-Italic.ttf
/usr/share/fonts/truetype/liberation2/LiberationSans-BoldItalic.ttf
/usr/share/fonts/truetype/liberation/LiberationSans-Italic.ttf
/usr/share/fonts/truetype/liberation/LiberationSans-BoldItalic.ttf
/usr/share/fonts/truetype/liberation/LiberationSansNarrow-Italic.ttf
/usr/share/fonts/truetype/liberation/LiberationSansNarrow-BoldItalic.ttf
In the above example, the regular expression contains branches (written in extended style) to list only specific font files under the /usr/share/fonts directory hierarchy. Extended regular expressions are not supported by default, but find allows for them to be enabled with -regextype posix-extended or -regextype egrep. The default RE standard for find is findutils-default, which is virtually a basic regular expression clone.
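Both find examples can be reproduced on a small scratch directory. The sketch below assumes GNU find (the -regextype option is not available everywhere), and the directory layout and file names are invented. Keep in mind that -regex has to match the whole path, not just a part of it.

```shell
# Hypothetical directory tree, created only for this demonstration.
tree=$(mktemp -d)
mkdir -p "$tree/.cache" "$tree/docs"
touch "$tree/.cache/index.db" "$tree/.profile" "$tree/docs/README.txt" "$tree/docs/notes.TXT"

# Hidden files and anything below a hidden directory: the path must
# contain "/." somewhere. Running from inside the tree keeps paths short.
hidden=$(cd "$tree" && find . -regex '.*/\..*' | LC_ALL=C sort)

# Assumption: GNU find. Extended syntax plus case-insensitive matching
# picks up both .txt and .TXT.
text_files=$(cd "$tree" && find . -regextype posix-extended -iregex '.*\.(txt|md)' | LC_ALL=C sort)

printf '%s\n' "$hidden" '---' "$text_files"
rm -rf "$tree"
```

Note that -regextype is placed before -iregex, because it only affects the tests that follow it on the command line.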
Using Sed (stream editor)
There are times when you will want to edit text without having to pull out a full-fledged text editor. A stream editor modifies text that is passed to it via a file or output from a pipeline. This editor uses special commands to make text changes as the text “streams” through the editor utility.
The sed editor changes data based on commands either entered on the command line or stored in a text file. The process the editor goes through is as follows:
- Reads one text line at a time from the input stream
- Matches that text with the supplied editor commands
- Modifies the text as specified in the commands
- Displays the modified text
After the sed editor matches all the specified commands against a text line, it reads the next text line and repeats the editorial process. Once sed reaches the end of the text lines, it stops.
sed [OPTIONS] [SCRIPT]… [FILENAME]
Using sed to modify/substitute file text
You can modify text stored in a file using the sed command. Given our file AboutLinux.txt, let’s check its contents with the cat command:
$ cat AboutLinux.txt
Linus Torvalds developed Linux OS
Linux OS made everything simple
Linux OS has many Distros
We love Linux OS
We love Technology
Using sed to modify file text:
$ sed 's/Linux/Unix/' AboutLinux.txt
Linus Torvalds developed Unix OS
Unix OS made everything simple
Unix OS has many Distros
We love Unix OS
We love Technology
The stream editor only displays the modified text to STDOUT. You could save the modified text to another file name via a STDOUT redirection operator, if desired.
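Two details of the substitute command are worth trying out: by default only the first match on each line is replaced, and the input file itself is never touched. The sketch below uses invented file contents and temporary files to show both, together with the redirection mentioned above.

```shell
# Hypothetical sample data, created only for this demonstration.
src=$(mktemp)
out=$(mktemp)
printf '%s\n' 'Linux OS runs Linux apps' 'We love Technology' > "$src"

# Without a flag, only the first occurrence on each line is replaced.
first_only=$(sed 's/Linux/Unix/' "$src")

# The g flag replaces every occurrence on the line.
all_matches=$(sed 's/Linux/Unix/g' "$src")

# Redirect STDOUT to keep the result; the source file stays as it was.
sed 's/Linux/Unix/g' "$src" > "$out"
saved=$(cat "$out")
original=$(cat "$src")

printf '%s\n' "$first_only" '---' "$all_matches" '---' "$original"
rm -f "$src" "$out"
```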
Using sed to delete file text
You can also delete lines using the stream editor. Use the syntax ‘/PATTERN/d’ for the sed command’s SCRIPT to accomplish it.
Using sed to delete file text:
$ sed '/Torvalds/d' AboutLinux.txt
Linux OS made everything simple
Linux OS has many Distros
We love Linux OS
We love Technology
Notice that the AboutLinux.txt file line that contains the word Torvalds is not displayed to STDOUT. It was “deleted” in the output, but it still exists within the text file.
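The d command accepts a line number as its address just as well as a pattern. This sketch recreates the five lines of AboutLinux.txt in a temporary file (the variable names are made up) and confirms that the file on disk keeps all of its lines.

```shell
# A temporary copy of the tutorial's sample text.
notes=$(mktemp)
printf '%s\n' \
  'Linus Torvalds developed Linux OS' \
  'Linux OS made everything simple' \
  'Linux OS has many Distros' \
  'We love Linux OS' \
  'We love Technology' > "$notes"

# Delete by pattern: every line containing "Torvalds" is dropped from the output.
without_torvalds=$(sed '/Torvalds/d' "$notes" | grep -c '')

# Delete by line number: only line 5 is dropped, so line 4 is now the last one.
last_after_delete=$(sed '5d' "$notes" | tail -n 1)

# The file itself is unchanged and still has five lines.
lines_on_disk=$(grep -c '' "$notes")

printf '%s\n' "$without_torvalds" "$last_after_delete" "$lines_on_disk"
rm -f "$notes"
```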
Using sed to change an entire file line
You can also change an entire line of text. To accomplish this, you use the syntax ‘ADDRESScNEWTEXT’ for the sed command’s SCRIPT. The ADDRESS refers to the file’s line number, and the NEWTEXT is the different text line you want displayed.
Using sed to change an entire file line:
$ sed '5cLinux OS examples Ubuntu, ArchLinux, Elementary, Manjaro and many more' AboutLinux.txt
Linus Torvalds developed Linux OS
Linux OS made everything simple
Linux OS has many Distros
We love Linux OS
Linux OS examples Ubuntu, ArchLinux, Elementary, Manjaro and many more
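The one-line form of the c command shown above is convenient but is a GNU sed shorthand. A more portable way to write it, sketched below, puts a backslash and a newline after the c. The replacement texts and variable names are invented for the illustration, and the address can be a pattern instead of a line number.

```shell
# A temporary copy of the tutorial's sample text.
notes=$(mktemp)
printf '%s\n' \
  'Linus Torvalds developed Linux OS' \
  'Linux OS made everything simple' \
  'Linux OS has many Distros' \
  'We love Linux OS' \
  'We love Technology' > "$notes"

# Change line 5; the new text goes on the line after "c\".
by_number=$(sed '5c\
We love open source' "$notes" | tail -n 1)

# Change every line that matches a pattern instead of a line number.
by_pattern=$(sed '/Torvalds/c\
Linux began in 1991' "$notes" | head -n 1)

# As with the other commands, only the output changes, not the file.
untouched=$(tail -n 1 "$notes")

printf '%s\n' "$by_number" "$by_pattern" "$untouched"
rm -f "$notes"
```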
That’s all on how to search text files using regular expressions with grep, egrep, fgrep, sed, and find. Stay tuned for more LPIC – 101 guides.