Welcome to today’s tutorial on searching text files with regular expressions, using grep, egrep, fgrep, sed, and the -regex option of find. String-searching algorithms are used in so many data-processing tasks that Unix-like operating systems ship their own ubiquitous implementation: regular expressions, often abbreviated REs. A regular expression is a sequence of characters that makes up a generic pattern, used to locate and sometimes modify a corresponding sequence in a larger string of characters. Regular expressions greatly expand the ability to:

  • Write parsing rules for requests in HTTP servers, nginx in particular.
  • Write scripts that convert text-based datasets to another format.
  • Search for occurrences of interest in journal entries or documents.
  • Filter markup documents, keeping semantic content.

Differences Between Basic and Extended Regular Expressions

Basic Regular Expressions

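Basic regular expressions (BREs) include characters such as a single dot (.) to represent any one character and a dot followed by an asterisk (.*) to represent any run of zero or more characters. They may also use brackets to represent a set of characters, such as [aeiou], or a range of characters, such as [a-z]. Do not separate the set members with commas: a comma inside the brackets is matched literally. Take care with ranges that span both cases, too: in the C locale, [A-z] also matches the punctuation characters that sit between Z and a in ASCII, so write [A-Za-z] when you mean letters only. When brackets are employed, it is called a bracket expression.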
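To make the point about ranges concrete, here is a minimal sketch that compares [A-z] with [A-Za-z]. The sample tokens are made up for illustration, and LC_ALL=C is set on the assumption that byte-order ranges are wanted; other locales may order characters differently.

```shell
# Sample input: one token per line (made up for this sketch).
sample=$(printf '%s\n' 'abc' '_' '123' 'XYZ')

# In the C locale, [A-z] spans every character from 'A' to 'z',
# including the punctuation [ \ ] ^ _ ` that sits between 'Z' and 'a'.
wide=$(printf '%s\n' "$sample" | LC_ALL=C grep '[A-z]')

# [A-Za-z] lists the two letter ranges explicitly, so '_' no longer matches.
letters=$(printf '%s\n' "$sample" | LC_ALL=C grep '[A-Za-z]')

printf 'wide range:\n%s\n\nletters only:\n%s\n' "$wide" "$letters"
```

The line holding only an underscore shows up in the first result but not in the second.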

Anchor characters: To find text file records that begin with particular characters, precede them with a caret (^) symbol. To find text file records where particular characters are at the record’s end, follow them with a dollar sign ($) symbol. Both the caret and the dollar sign symbols are called anchor characters for BREs, because they fasten the pattern to the beginning or the end of a text line.

Using the grep command with a BRE pattern;

$ grep root /etc/passwd
root:x:0:0:root:/root:/bin/bash
nm-openvpn:x:115:121:NetworkManager OpenVPN,,,:/var/lib/openvpn/chroot:/usr/sbin/nologin

The above command searches for instances of the string root within the password file. It displays two lines from the file; the second one matches because chroot contains root.

Using grep with the ^ anchor to display only lines that begin with the PATTERN;

$ grep ^root /etc/passwd
root:x:0:0:root:/root:/bin/bash

The above command employs the BRE ^ character and places it before the word root. This regular expression pattern causes grep to display only lines in the password file that begin with root.

Using the grep command to audit the password file;

$ grep -v 'nologin$' /etc/passwd
root:x:0:0:root:/root:/bin/bash
sync:x:4:65534:sync:/bin:/bin/sync
tss:x:103:108:TPM software stack,,,:/var/lib/tpm:/bin/false
speech-dispatcher:x:111:29:Speech Dispatcher,,,:/run/speech-dispatcher:/bin/false
hplip:x:116:7:HPLIP system user,,,:/run/hplip:/bin/false
whoopsie:x:117:122::/nonexistent:/bin/false
gnome-initial-setup:x:121:65534::/run/gnome-initial-setup/:/bin/false
gdm:x:122:127:Gnome Display Manager:/var/lib/gdm3:/bin/false
frank:x:1000:1000:frank_bett,,,:/home/frank:/usr/bin/zsh
mysql:x:995:1001::/home/mysql:/bin/sh
pilot:x:1001:1002:,,,:/home/pilot:/bin/bash

The -v option is useful when auditing your configuration files with the grep utility. It produces a list of text file records that do not contain the pattern. The above output shows an example of finding all the records in the password file that do not end in nologin. Notice that the BRE pattern puts the $ at the end of the word. If you were to place the $ before the word, the shell would treat it as the start of a variable name and expand it before grep ever saw the pattern. Putting the pattern in single quotes keeps the shell from interpreting any of its characters.
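The anchors and the -v option can be tried safely on a throwaway file. The sketch below assumes a small made-up passwd.sample file (not the real /etc/passwd) and a temporary directory created with mktemp.

```shell
# Work on a small made-up copy so the sketch does not depend on the real /etc/passwd.
workdir=$(mktemp -d)
cat > "$workdir/passwd.sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
sshd:x:123:65534::/run/sshd:/usr/sbin/nologin
nm-openvpn:x:115:121:NetworkManager OpenVPN,,,:/var/lib/openvpn/chroot:/usr/sbin/nologin
pilot:x:1001:1002:,,,:/home/pilot:/bin/bash
EOF

# Single quotes keep the shell from touching ^ and $.
starts_with_root=$(grep '^root' "$workdir/passwd.sample")
no_nologin=$(grep -v 'nologin$' "$workdir/passwd.sample")

# Outside single quotes, $nologin is a shell variable reference.
# It is unset here, so it expands to an empty string.
unset nologin
expanded=$(printf '[%s]' "$nologin")

printf 'begins with root:\n%s\n\n' "$starts_with_root"
printf 'does not end in nologin:\n%s\n\n' "$no_nologin"
printf 'what the shell makes of $nologin: %s\n' "$expanded"

rm -r "$workdir"
```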

Character classes: A special group of bracket expressions are character classes. These bracket expressions have predefined names and could be considered bracket expression shortcuts. Their interpretation is based on the LC_CTYPE locale environment variable.

Commonly Used Character Classes;

  • [:alnum:]: Represents an alphanumeric character.
  • [:alpha:]: Represents an alphabetic character.
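  • [:ascii:]: Represents a character that fits into the ASCII character set. This class is not one of the POSIX character classes, so some tools may reject it in basic and extended regular expressions.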
  • [:blank:]: Represents a blank character, that is, a space or a tab.
  • [:cntrl:]: Represents a control character.
  • [:digit:]: Represents a digit (0 through 9).
  • [:graph:]: Represents any printable character except space.
  • [:lower:]: Represents a lowercase character.
  • [:print:]: Represents any printable character including space.
  • [:punct:]: Represents any printable character which is not a space or an alphanumeric character.
  • [:space:]: Represents white-space characters: space, form-feed (\f), newline (\n), carriage return (\r), horizontal tab (\t), and vertical tab (\v).
  • [:upper:]: Represents an uppercase letter.
  • [:xdigit:]: Represents hexadecimal digits (0 through 9, A through F, and a through f).

We have a file named users.txt; let’s check what it contains with the cat command;

$ cat users.txt
pilot
3434
frank
64646
sshd
77767

Using the grep command and a character class;

$ grep "[[:digit:]]" users.txt
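As a rough check of what the command selects, the sketch below rebuilds the same users.txt in a temporary directory (the directory is an assumption of this sketch) and runs the [[:digit:]] search next to an [[:alpha:]] search for contrast.

```shell
# Recreate the users.txt shown above in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' pilot 3434 frank 64646 sshd 77767 > "$workdir/users.txt"

# Lines that contain at least one digit.
digits=$(grep '[[:digit:]]' "$workdir/users.txt")

# Lines made up only of alphabetic characters, anchored at both ends.
names=$(grep '^[[:alpha:]]*$' "$workdir/users.txt")

printf 'lines with digits:\n%s\n\nalphabetic lines:\n%s\n' "$digits" "$names"
rm -r "$workdir"
```

With this file, the digit search selects 3434, 64646 and 77767, while the alphabetic search selects pilot, frank and sshd.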

Quantifiers: An atom is a single character, with or without special meaning, or a bracket expression. The reach of an atom can be adjusted using an atom quantifier. Atom quantifiers define atom sequences, that is, a match occurs when a contiguous repetition of the atom is found in the string. The substring corresponding to the match is called a piece. Note, however, that quantifiers and other features of regular expressions are treated differently depending on which standard (basic or extended) is being used.

The * quantifier has the same function in both basic and extended REs (the atom occurs zero or more times). It is a literal character if it is preceded by a backslash \ or, in basic REs, if it appears at the beginning of the regular expression.

In extended regular expressions, the plus sign quantifier + selects pieces containing one or more atom matches in sequence. With the question mark quantifier ?, a match occurs if the corresponding atom appears once or does not appear at all. If preceded by a backslash \, they lose their special meaning and are treated as literal characters.

Basic regular expressions as implemented by GNU tools also support the + and ? quantifiers, but they need to be preceded by a backslash (\+ and \?); this is an extension to the POSIX standard. Unlike in extended regular expressions, + and ? by themselves are literal characters in basic regular expressions.
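The difference between the two standards is easiest to see side by side. This is a minimal sketch on four made-up sample lines; it sticks to behaviour that is the same in POSIX and GNU grep and leaves the backslashed GNU forms out.

```shell
# Four sample lines (made up): the last one contains a literal plus sign.
sample=$(printf '%s\n' 'ac' 'abc' 'abbc' 'ab+c')

# * means "zero or more b" in both basic and extended REs.
star=$(printf '%s\n' "$sample" | grep 'ab*c')

# In an ERE, + means "one or more b" and ? means "zero or one b".
ere_plus=$(printf '%s\n' "$sample" | grep -E 'ab+c')
ere_qmark=$(printf '%s\n' "$sample" | grep -E 'ab?c')

# In a BRE, an unescaped + is an ordinary character.
bre_plus=$(printf '%s\n' "$sample" | grep 'ab+c')

printf 'BRE ab*c:\n%s\n\nERE ab+c:\n%s\n\n' "$star" "$ere_plus"
printf 'ERE ab?c:\n%s\n\nBRE ab+c:\n%s\n' "$ere_qmark" "$bre_plus"
```

Only the basic pattern ab+c selects the line with the literal plus sign.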

Special characters;

  • . (dot): Atom matches with any character.
  • ^ (caret): Atom matches with the beginning of a line.
  • $ (dollar sign): Atom matches with the end of a line.

Extended Regular Expressions

Extended regular expressions (EREs) allow more complex patterns. For example, a vertical bar symbol (|) allows you to specify two possible words or character sets to match. You can also employ parentheses to designate additional subexpressions. The egrep and grep -E commands discussed below both interpret their patterns as EREs.

Using grep

One of the most common uses of grep is to facilitate the inspection of long files, using the regular expression as a filter applied to each line. It can be used to show only the lines starting with a certain term. The grep command is powerful in its use of regular expressions, which will help with filtering text files.

Syntax;

grep [OPTION] PATTERN [FILE…]

Commonly Used Options with grep Command

  • -c, --count: Display a count of text file records that contain a PATTERN match.
  • -d action, --directories=action: When a file is a directory, if action is set to read, read the directory as if it were a regular text file; if action is set to skip, ignore the directory; and if action is set to recurse, act as if the -R, -r, or --recursive option was used.
  • -E, --extended-regexp: Designate the PATTERN as an extended regular expression.
  • -i, --ignore-case: Ignore the case in the PATTERN as well as in any text file records.
  • -R, -r, --recursive: Search a directory’s contents, and for any subdirectory within the original directory tree, consecutively search its contents as well (recursively).
  • -v, --invert-match: Display only text file records that do not contain a PATTERN match.
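A few of these options can be tried together on scratch files. The sketch below assumes two made-up log files, app.log and logs/old.log, inside a temporary directory; none of these names come from the tutorial.

```shell
# Build a tiny directory tree with two made-up log files.
workdir=$(mktemp -d)
mkdir "$workdir/logs"
printf '%s\n' 'Error: disk full' 'ok' 'error: retry' > "$workdir/app.log"
printf '%s\n' 'ok' 'ERROR: timeout' > "$workdir/logs/old.log"

# -c counts matching lines; the search is case sensitive by default.
count_exact=$(grep -c 'error' "$workdir/app.log")

# -i ignores case, so Error and error both count.
count_nocase=$(grep -c -i 'error' "$workdir/app.log")

# -v inverts the match and keeps only the lines without the pattern.
non_matching=$(grep -v -i 'error' "$workdir/app.log")

# -r descends into subdirectories; wc -l counts the matching lines found.
recursive_total=$(grep -r -i 'error' "$workdir" | wc -l | tr -d ' ')

printf 'exact: %s, ignoring case: %s, whole tree: %s\n' \
  "$count_exact" "$count_nocase" "$recursive_total"
printf 'lines without a match: %s\n' "$non_matching"
rm -r "$workdir"
```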

Using a simple grep command to search a file.

No options are used, and the grep utility is used to search for the word frank (PATTERN) within /etc/passwd (FILE).

$ grep frank /etc/passwd
frank:x:1000:1000:frank_bett,,,:/home/frank:/usr/bin/zsh

We notice that in the above output the grep command returns each file record (line) that contains an instance of the PATTERN, which in this case was the word frank.

egrep (Extended-regexp)

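The egrep command is the same as grep -E. It interprets PATTERNS as extended regular expressions, which allow more complex patterns. Recent GNU grep releases treat egrep as obsolescent and print a warning, so grep -E is the safer choice in new scripts.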

Using the grep -E command with an ERE pattern;

$ grep -E "^root|^pilot" /etc/passwd
root:x:0:0:root:/root:/bin/bash
pilot:x:1001:1002:,,,:/home/pilot:/bin/bash

In the above output, the grep command uses the -E option to indicate the pattern is an extended regular expression. Without the -E option, the vertical bar would be treated as a literal character of a basic regular expression, and the command would not find these lines. Quotation marks around the ERE pattern protect it from misinterpretation by the shell. The command searches for any password file records that start with either the word root or the word pilot. Thus, a caret (^) is placed prior to each word, and a vertical bar (|) separates the words to indicate that the record can start with either word.

Using the egrep command with an ERE pattern;
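The effect of leaving out -E can be checked on a scratch file. The sketch below assumes a made-up passwd.sample file in a temporary directory, and also shows two -e options as one portable way to get the same alternation from basic regular expressions.

```shell
# A small made-up sample in place of the real /etc/passwd.
workdir=$(mktemp -d)
cat > "$workdir/passwd.sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
sshd:x:123:65534::/run/sshd:/usr/sbin/nologin
frank:x:1000:1000:frank_bett,,,:/home/frank:/usr/bin/zsh
pilot:x:1001:1002:,,,:/home/pilot:/bin/bash
EOF

# With -E the vertical bar separates two alternatives.
with_e=$(grep -E '^root|^pilot' "$workdir/passwd.sample")

# Without -E the bar is a literal character, so nothing matches here.
# "|| true" only hides grep's non-zero exit status for "no match".
without_e=$(grep '^root|^pilot' "$workdir/passwd.sample" || true)

# Two -e options give the same alternation using basic REs only.
multi_e=$(grep -e '^root' -e '^pilot' "$workdir/passwd.sample")

printf 'with -E:\n%s\n\nwithout -E: [%s]\n\ntwo -e options:\n%s\n' \
  "$with_e" "$without_e" "$multi_e"
rm -r "$workdir"
```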

$ egrep "(sshd|s).*sh" /etc/passwd
frank:x:1000:1000:frank_bett,,,:/home/frank:/usr/bin/zsh
sshd:x:123:65534::/run/sshd:/usr/sbin/nologin
mysql:x:995:1001::/home/mysql:/bin/sh

In the above example, the egrep command is employed. The egrep command is equivalent to using the grep -E command. The ERE pattern here also uses quotation marks to avoid misinterpretation and employs parentheses to group a subexpression. The subexpression consists of a choice, indicated by the vertical bar (|), between the word sshd and the letter s. Also in the ERE pattern, the .* symbols are used to indicate there can be anything in between the subexpression choice and the string sh in the text file record.

fgrep (fixed-strings)

The fgrep command is the same as grep -F. It interprets PATTERNS as fixed strings, not regular expressions. Either form can be combined with the -f option to read the patterns from a text file.

Using the fgrep and grep -F commands to search for patterns stored in a text file. We have a file users.txt; let’s look at its contents with the cat command.

$ cat users.txt
pilot
frank
sshd

Let’s use the fgrep command to search the /etc/passwd file for the patterns stored in users.txt;

$ fgrep -f users.txt /etc/passwd
frank:x:1000:1000:frank_bett,,,:/home/frank:/usr/bin/zsh
sshd:x:123:65534::/run/sshd:/usr/sbin/nologin
pilot:x:1001:1002:,,,:/home/pilot:/bin/bash

In the above output, the patterns are stored in the users.txt file, which is first displayed using the cat command. Next, the fgrep command is employed, along with the -f option to indicate the file that holds the patterns. The /etc/passwd file is searched for all the patterns stored within the users.txt file, and the results are displayed.

Let’s use the grep -F command to search the /etc/passwd file for the patterns stored in users.txt;

$ grep -F -f users.txt /etc/passwd
frank:x:1000:1000:frank_bett,,,:/home/frank:/usr/bin/zsh
sshd:x:123:65534::/run/sshd:/usr/sbin/nologin
pilot:x:1001:1002:,,,:/home/pilot:/bin/bash

The grep -F command is equivalent to the fgrep command, which is why the two commands produce identical results.
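The user names above contain no special characters, so they would match the same way as regular expressions. The difference shows up once a pattern contains a character such as a dot. The sketch below assumes two made-up files, notes.txt and patterns.txt, in a temporary directory.

```shell
# Two made-up files: text to search and a one-line pattern file.
workdir=$(mktemp -d)
printf '%s\n' 'version 1.5 released' 'version 105 released' > "$workdir/notes.txt"
printf '%s\n' '1.5' > "$workdir/patterns.txt"

# As a regular expression, the dot in 1.5 matches any character, so 105 matches too.
regex_hits=$(grep -c -f "$workdir/patterns.txt" "$workdir/notes.txt")

# With -F the pattern is a fixed string, so only the literal 1.5 matches.
fixed_hits=$(grep -c -F -f "$workdir/patterns.txt" "$workdir/notes.txt")

printf 'regex matches: %s, fixed-string matches: %s\n' "$regex_hits" "$fixed_hits"
rm -r "$workdir"
```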

Searching with Regular Expressions

The immediate benefit offered by regular expressions is to improve searches on filesystems and in text documents. The -regex option of the find command tests every path in a directory hierarchy against a regular expression.

Using -regex with find command;

$ find $HOME -regex '.*/\..*' -size +100M

/home/frank/.vagrant.d/boxes/generic-VAGRANTSLASH-fedora28/3.2.10/virtualbox/generic-fedora28-virtualbox-disk001.vmdk
/home/frank/.vagrant.d/boxes/ubuntu-VAGRANTSLASH-focal64/20210302.0.0/virtualbox/ubuntu-focal-20.04-cloudimg.vmdk
/home/frank/.vagrant.d/boxes/generic-VAGRANTSLASH-centos8/3.2.10/virtualbox/generic-centos8-virtualbox-disk001.vmdk
/home/frank/.vagrant.d/boxes/gusztavvargadr-VAGRANTSLASH-docker-linux/2010.0.2012/virtualbox/gusztavvargadr-u1604s-dc-2012.0.0-1608130612-disk001.vmdk
/home/frank/.vagrant.d/boxes/batrusi-VAGRANTSLASH-suse_minimal/0.0.1/virtualbox/box-disk001.vmdk

The above command searches for files greater than 100 mebibytes (100 units of 1048576 bytes), but only in paths inside the user’s home directory that match .*/\..*, that is, a /. surrounded by any number of other characters. Because find tests the regular expression against the whole path, the leading and trailing .* are required. In other words, only hidden files or files inside hidden directories will be listed, regardless of the position of /. in the corresponding path.

For case-insensitive regular expressions, the -iregex option is used instead;
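The same regular expression can be tried on a scratch tree without large files. In the sketch below, the directory layout is made up, and the -size test is left out so that empty files are enough; -regex itself is an extension offered by GNU and BSD find rather than a POSIX feature.

```shell
# Made-up tree: one hidden file, one file in a hidden directory, one visible file.
workdir=$(mktemp -d)
mkdir -p "$workdir/.config/app" "$workdir/visible"
touch "$workdir/.bashrc" "$workdir/.config/app/settings.ini" "$workdir/visible/report.txt"

# Run find from inside the tree so the paths start with "./".
# The regex must match the whole path: anything, then "/.", then anything.
hidden=$(cd "$workdir" && find . -type f -regex '.*/\..*' | LC_ALL=C sort)

printf 'hidden files, or files under hidden directories:\n%s\n' "$hidden"
rm -r "$workdir"
```

The file visible/report.txt is left out because no component of its path starts with a dot.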

$ find /usr/share/fonts -regextype posix-extended -iregex '.*(dejavu|liberation).*sans.*(italic|oblique).*'
/usr/share/fonts/truetype/liberation2/LiberationSans-Italic.ttf
/usr/share/fonts/truetype/liberation2/LiberationSans-BoldItalic.ttf
/usr/share/fonts/truetype/liberation/LiberationSans-Italic.ttf
/usr/share/fonts/truetype/liberation/LiberationSans-BoldItalic.ttf
/usr/share/fonts/truetype/liberation/LiberationSansNarrow-Italic.ttf
/usr/share/fonts/truetype/liberation/LiberationSansNarrow-BoldItalic.ttf

In the above example, the regular expression contains branches (written in extended style) to list only specific font files under the /usr/share/fonts directory hierarchy. Extended regular expressions are not supported by default, but find allows for them to be enabled with -regextype posix-extended or -regextype egrep. The option must come before the -regex or -iregex test it applies to. The default RE type for GNU find is findutils-default, an Emacs-style syntax that, like basic regular expressions, needs a backslash before the alternation bar and the grouping parentheses.
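The two syntaxes can be compared on a handful of empty files. This sketch assumes GNU find, since -regextype is a GNU option, and the font file names are made up to resemble the listing above.

```shell
# Empty files named like fonts, in a scratch directory.
workdir=$(mktemp -d)
touch "$workdir/LiberationSans-Italic.ttf" "$workdir/DejaVuSans-Oblique.ttf" \
      "$workdir/LiberationSans-Regular.ttf" "$workdir/LiberationMono-Italic.ttf"

# Extended syntax: plain ( | ) for grouping and alternation.
fonts=$(cd "$workdir" && find . -regextype posix-extended \
  -iregex '.*(dejavu|liberation).*sans.*(italic|oblique).*' | LC_ALL=C sort)

# Default GNU find syntax: the same pattern needs \( \| \) instead.
fonts_default=$(cd "$workdir" && find . \
  -iregex '.*\(dejavu\|liberation\).*sans.*\(italic\|oblique\).*' | LC_ALL=C sort)

printf 'posix-extended:\n%s\n\ndefault syntax:\n%s\n' "$fonts" "$fonts_default"
rm -r "$workdir"
```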

Using Sed (stream editor)

There are times when you will want to edit text without having to pull out a full-fledged text editor. A stream editor modifies text that is passed to it via a file or output from a pipeline. This editor uses special commands to make text changes as the text “streams” through the editor utility.

The sed editor changes data based on commands either entered into the command line or stored in a text file. The process the editor goes through is as follows:

  • Reads one text line at a time from the input stream
  • Matches that text with the supplied editor commands
  • Modifies the text as specified in the commands
  • Displays the modified text

After the sed editor matches all the specified commands against a text line, it reads the next text line and repeats the editorial process. Once sed reaches the end of the text lines, it stops.

Syntax;

sed [OPTIONS] [SCRIPT]… [FILENAME]

Using sed to modify/substitute file text

You can modify text stored in a file using the sed command. We have a file named AboutLinux.txt; let’s check its contents with the cat command;

$ cat AboutLinux.txt
Linus Torvalds developed Linux OS
Linux OS made everything simple
Linux OS has many Distros
We love Linux OS
We love Technology

Using sed to modify file text;

$ sed 's/Linux/Unix/' AboutLinux.txt
Linus Torvalds developed Unix OS
Unix OS made everything simple
Unix OS has many Distros
We love Unix OS
We love Technology

The stream editor only displays the modified text to STDOUT. You could save the modified text to another file name via a STDOUT redirection operator, if desired.
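Two details of the substitute command are worth trying out: by default it replaces only the first match on each line, and the input file is left untouched. The sketch below assumes made-up file names, about.txt and about-unix.txt, in a temporary directory.

```shell
# A made-up input file in which one line mentions Linux twice.
workdir=$(mktemp -d)
printf '%s\n' 'Linux OS, because Linux is fun' 'We love Technology' > "$workdir/about.txt"

# Without a flag, s replaces only the first match on each line.
first_only=$(sed 's/Linux/Unix/' "$workdir/about.txt")

# The g flag replaces every match on the line.
every=$(sed 's/Linux/Unix/g' "$workdir/about.txt")

# Redirect STDOUT to keep the result; the input file itself is not modified.
sed 's/Linux/Unix/g' "$workdir/about.txt" > "$workdir/about-unix.txt"
saved=$(cat "$workdir/about-unix.txt")
original_lines_with_linux=$(grep -c 'Linux' "$workdir/about.txt")

printf 'first match only:\n%s\n\nevery match:\n%s\n' "$first_only" "$every"
rm -r "$workdir"
```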

Using sed to delete file text

You can also delete lines using the stream editor. Use the syntax of '/PATTERN/d' for the sed command’s SCRIPT to accomplish it.

Using sed to delete file text;

$ sed '/Torvalds/d' AboutLinux.txt
Linux OS made everything simple
Linux OS has many Distros
We love Linux OS
We love Technology

You notice that the AboutLinux.txt file line that contains the word Torvalds is not displayed to STDOUT. It was “deleted” in the output, but it still exists within the text file.
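That the file on disk keeps all its lines can be checked by counting them. This is a minimal sketch that recreates AboutLinux.txt in a temporary directory; the second pattern, ^We love, is an extra example not taken from the text above.

```shell
# Recreate the sample file in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'Linus Torvalds developed Linux OS' 'Linux OS made everything simple' \
  'Linux OS has many Distros' 'We love Linux OS' 'We love Technology' \
  > "$workdir/AboutLinux.txt"

# Lines printed after deleting the one that mentions Torvalds.
shown=$(sed '/Torvalds/d' "$workdir/AboutLinux.txt" | wc -l | tr -d ' ')

# Lines still stored in the file afterwards.
on_disk=$(wc -l < "$workdir/AboutLinux.txt" | tr -d ' ')

# The pattern is a regular expression, so anchors work here too.
without_love=$(sed '/^We love/d' "$workdir/AboutLinux.txt")

printf 'printed: %s lines, still in the file: %s lines\n' "$shown" "$on_disk"
printf '%s\n' "$without_love"
rm -r "$workdir"
```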

Using sed to change an entire file line

You can also change an entire line of text. To accomplish this, you use the syntax of 'ADDRESScNEWTEXT' for the sed command’s SCRIPT. The ADDRESS refers to the file’s line number, and the NEWTEXT is the different text line you want displayed.

Using sed to change an entire file line;

$ sed '5cLinux OS examples Ubuntu, ArchLinux, Elementary, Manjaro and many more' AboutLinux.txt
Linus Torvalds developed Linux OS
Linux OS made everything simple
Linux OS has many Distros
We love Linux OS
Linux OS examples Ubuntu, ArchLinux, Elementary, Manjaro and many more
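Writing the new text on the same line as the c command is a GNU sed convenience. A more portable spelling, sketched below, puts a backslash and a newline after c. The three-line sample file and the replacement texts are made up for this sketch, and the second command assumes that a /PATTERN/ address may be used in place of a line number.

```shell
# A shortened, made-up version of the sample file.
workdir=$(mktemp -d)
printf '%s\n' 'Linus Torvalds developed Linux OS' 'Linux OS made everything simple' \
  'We love Technology' > "$workdir/AboutLinux.txt"

# Portable form: c, a backslash, a newline, then the new text.
by_number=$(sed '3c\
We love open source' "$workdir/AboutLinux.txt")

# The address can also be a pattern: change the line that mentions Torvalds.
by_pattern=$(sed '/Torvalds/c\
This line was replaced by pattern' "$workdir/AboutLinux.txt")

printf 'changed by line number:\n%s\n\nchanged by pattern:\n%s\n' "$by_number" "$by_pattern"
rm -r "$workdir"
```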

Conclusion

That’s all on searching text files using regular expressions with grep, egrep, fgrep, sed, and find -regex. Stay tuned for more LPIC – 101 guides.
