Scripting Clinic: Your Pre-Fab Text Processing Toolkit

Wednesday Aug 11th 2004 by Carla Schroder

Scripting he-men are fond of 'writing a few lines of Perl' whenever a file needs munging. Too bad they're ignoring the overflowing toolbox of Unix and Linux text-processing utilities.

Because Linux/Unix relies on text files for practically everything, a veritable thicket of text-processing utilities has grown up to satisfy every text-processing whim. Many of them are specialized and do only a few things. This is a real lifesaver when you just need to do one little thing, and don't want to wade through reams of Perl or Bash documentation to figure out how. It is especially handy when you're scripting; it's often easier to keep a list of useful text-processing utilities than to explore all the meeelyuns of possibilities of more complex tools.

Convert tabs to spaces, or spaces to tabs. Some programs are finicky about whitespace; either you must use only tabs, or only spaces, or simply pick either one, and don't mix them. To convert tabs to spaces:

$ expand filename

And that's it. Every tab in the file is replaced with enough spaces to reach the next tab stop, and the tab stops sit 8 columns apart by default. You don't have to settle for 8; you may select any number you like:

$ expand -t 4 filename

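That spits the output to stdout, which may not be what you want. This sends the output to a new file. Don't redirect back into the file you are reading, because the shell empties it before expand ever sees it:

$ expand -t 4 filename > newfile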

You don't have to convert all the tabs in the file, you can convert only the leading tabs on each line:

$ expand -i filename

unexpand does the reverse, it converts leading spaces to tabs:

$ unexpand filename

Or, convert every run of two or more spaces that ends on a tab stop, not just the leading ones:

$ unexpand -a filename

If your text editor does not have a way of displaying tabs, cat can do it:

$ cat -v -t testfile
^IThis section shows how to put various ^Ipieces of information into the Bash prompt. ^I There are an infin^Iite^I number of things that ^I could be put in your prompt.

Each ^I marks a tab, so you can see for yourself how expand and unexpand work.
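
To watch this happen without touching a real file, here is a small sketch. The throwaway file, its contents, and the use of mktemp and printf are my own additions for illustration; only expand, unexpand, and cat come from the text above.

```shell
# Throwaway sample file: one tab between fields, one leading tab.
sample=$(mktemp)
printf 'a\tb\n\tindented\n' > "$sample"

# Tabs show up as ^I.
shown=$(cat -t "$sample")
printf '%s\n' "$shown"

# Each tab is padded with spaces up to the next 4-column tab stop.
expanded=$(expand -t 4 "$sample")
printf '%s\n' "$expanded"

# Eight leading spaces collapse back into one tab (default 8-column stops).
back=$(printf '        indented\n' | unexpand | cat -t)
printf '%s\n' "$back"

rm -f "$sample"
```

Note that the tab after "a" becomes three spaces, not four: the "a" already fills the first column, so only three more are needed to reach the tab stop.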

tr, "translate", is a deceptive little utility. It does a whole lot of things. A handy use for it is converting text to lowercase, like Windows filenames. You can test this at the command prompt:

$ tr "[:upper:]" "[:lower:]"

Hit ctrl + c to stop. This can easily be applied to files. This example uses character classes to convert the contents of trtest to lowercase, and output it to trtestout:

$ tr "[:upper:]" "[:lower:]" < trtest > trtestout

< means read from this file, > means send the output to a new file.
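
The same trick handles the filenames themselves. What follows is only a sketch: lowercase_names is a hypothetical helper I made up for this example, the filenames are invented, and it runs against a scratch directory rather than anything you care about.

```shell
# Rename every file in a directory to its lowercase equivalent.
# lowercase_names is a hypothetical helper, not a standard command.
lowercase_names() {
    dir=$1
    for path in "$dir"/*; do
        [ -e "$path" ] || continue            # empty directory: nothing to do
        name=$(basename "$path")
        lower=$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')
        # Rename only when the name changes and the target is not already taken.
        if [ "$name" != "$lower" ] && [ ! -e "$dir/$lower" ]; then
            mv "$path" "$dir/$lower"
        fi
    done
}

# Demo on a scratch directory with made-up filenames.
demo=$(mktemp -d)
touch "$demo/README.TXT" "$demo/Photo.JPG" "$demo/notes.txt"
lowercase_names "$demo"
listing=$(ls "$demo")
printf '%s\n' "$listing"
rm -rf "$demo"
```

The existence check keeps the loop from clobbering a file when two names differ only by case.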

Sometimes you want to change a specific character. Suppose you've written an article, and after spending hours on it you realize you inadvertently slipped into l33t-sp34k. Well this will never do. Use tr to change the numbers to letters:

$ tr "340" "eao"
I 4m a l33t hax0r. Ph34r m3!
I am a leet haxor. Phear me!

Pretty simple -- first list the characters you want changed, all grouped together, then the second group is what you want the new characters to be. Just make sure they are in order.
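
In a script you would feed tr through a pipe instead of typing at it. A minimal sketch follows; the rot13 line is my own extra example, added to show that ranges pair up by position in the same way.

```shell
# Map 3->e, 4->a, 0->o: position in the first set picks the replacement in the second.
fixed=$(printf '%s\n' 'I 4m a l33t hax0r. Ph34r m3!' | tr '340' 'eao')
printf '%s\n' "$fixed"

# Ranges pair up by position too: rot13 swaps each letter with the one 13 places on.
rot=$(printf '%s\n' 'Hello' | tr 'A-Za-z' 'N-ZA-Mn-za-m')
printf '%s\n' "$rot"
```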

tr can delete characters, using the -d flag:

$ tr -d "leet"
I am a leet haxor e e lt et el
I am a  haxor     

You can see the limitations of tr in this example -- it looks for characters, not words. Every l, e, and t is deleted wherever it appears, not just the word "leet".
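
To make the difference concrete, here is a sketch that runs the same line through tr -d and through sed. sed is a different tool that I am swapping in for the word-level case; it is not covered in this article, and the sample sentence is made up.

```shell
line='I am a leet haxor, see the letter'

# tr -d removes every l, e, and t, one character at a time.
by_chars=$(printf '%s\n' "$line" | tr -d 'leet')
printf '%s\n' "$by_chars"

# sed (a different tool, swapped in here) removes the literal word and its trailing space.
by_word=$(printf '%s\n' "$line" | sed 's/leet //')
printf '%s\n' "$by_word"
```

tr mangles "see", "the", and "letter" along the way, while sed leaves them alone.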

awk is way fun. awk lets you pluck things out of lines of text according to their position. This example sorts out the human users from Linux system users, assuming you have stuck to a sensible user numbering scheme:

$  awk -F: '$3 > 999 { print $0}' /etc/passwd
carla:x:1000:1000:carla schroder,,,:/home/carla:/bin/bash
dawns:x:1002:1002:Dawn Marie Schroder,,,,foo:/home/dawns:
nikitah:x:1003:1003:Nikita Horse,,123-4567,equine:/home/nikitah:
rubst:x:1004:1004:Rubs The Cat,101,,234-5678,,test test:/home/rubst:

How does awk know what the delimiter is? You tell it with the -F flag, which selects the colon as the delimiter in this example. $3 means "the third field."

On a Debian system, human users start at UID 1000. On Red Hat and SuSE, they start at 500. You can select subsets easily enough:

$ awk -F: '($3 >= 1000) && ($3 <= 1050) { print $0 }' /etc/passwd

OK, but suppose all you really want are the logins. This is an awk specialty; change print $0, which means "print the whole line," to print $1:

$ awk -F: '$3 > 999 { print $1 }' /etc/passwd
nobody
carla
dawns
nikitah
rubst
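
Since the starting UID differs between distributions, it can help to pass the cutoff in rather than hard-coding it. This is a sketch under a few assumptions: the sample file is made up (two of its lines are borrowed from the output above) so the demo does not depend on your real /etc/passwd, and min_uid is a name I chose.

```shell
# Made-up passwd-style sample so the demo does not depend on the real /etc/passwd.
passwd_sample=$(mktemp)
cat > "$passwd_sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
carla:x:1000:1000:carla schroder,,,:/home/carla:/bin/bash
dawns:x:1002:1002:Dawn Marie Schroder,,,,foo:/home/dawns:
EOF

# -v passes the cutoff in as an awk variable: 1000 for Debian, 500 for Red Hat or SuSE.
min_uid=1000
logins=$(awk -F: -v min="$min_uid" '$3 >= min { print $1 }' "$passwd_sample")
printf '%s\n' "$logins"

# Print two fields at once: the login ($1) and the home directory ($6).
homes=$(awk -F: -v min="$min_uid" '$3 >= min { print $1, $6 }' "$passwd_sample")
printf '%s\n' "$homes"

rm -f "$passwd_sample"
```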

Well that's all very nice and everything, but they're in the order they appear in /etc/passwd. What if you want them in alphabetical order? Easy. Throw sort into the fray:

$ awk -F: '($3 >= 1000) && ($3 <= 1050) { print $1 }' /etc/passwd | sort

And they will be listed alphabetically. You can do all sorts of things with this, like copy and paste mass users into groups, check for duplicate logins, and generate an index of users.
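
Checking for duplicate logins, for instance, might look like the sketch below. uniq is one more standard utility that this article does not otherwise cover, and the list of logins is invented for the demo.

```shell
# Hypothetical list of logins gathered from several places, with one repeat.
logins=$(printf '%s\n' rubst carla nikitah carla dawns)

# sort puts identical lines next to each other; uniq -d keeps only the repeated ones.
dupes=$(printf '%s\n' "$logins" | sort | uniq -d)
printf 'duplicates: %s\n' "$dupes"

# sort -u gives the clean alphabetical index.
index=$(printf '%s\n' "$logins" | sort -u)
printf '%s\n' "$index"
```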

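sort has its own flag for sending the output to a file, -o. An ordinary redirect works too, with one exception: if the output file is also one of the input files, the shell empties it before sort gets to read it. The -o option is safe even then, because sort reads all of its input before it writes the output file:

$ awk -F: '($3 >= 1000) && ($3 <= 1050) { print $1 }' /etc/passwd | sort -o list1.txt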
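
The place where -o really earns its keep is sorting a file in place. Here is a small sketch with throwaway files and made-up contents, comparing it with a redirect back onto the input file:

```shell
names=$(mktemp)
copy=$(mktemp)
printf '%s\n' rubst carla nikitah > "$names"
cp "$names" "$copy"

# Safe: sort reads all of its input before it opens the -o file.
sort -o "$names" "$names"
in_place=$(cat "$names")
printf '%s\n' "$in_place"

# Unsafe: the shell empties the file before sort ever reads it.
sort "$copy" > "$copy"
leftover=$(wc -c < "$copy")
printf 'bytes left after redirect: %s\n' "$leftover"

rm -f "$names" "$copy"
```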

Now suppose you have the ever-so-fun chore of merging lists of logins from two different systems. sort can do this too. Take your two files containing the logins, which are already sorted, and do this:

$ sort -um -o merged_list list1 list2

-u checks for duplicates; if it finds any, it only prints one of them. -m means merge.
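
Here is the merge as a self-contained sketch. The contents of the second list are my assumption, chosen so the result lines up with the merged list used in the rest of this article. The sort -c check is an extra step I added, since -m takes it on trust that its inputs are already sorted.

```shell
list1=$(mktemp)
list2=$(mktemp)
merged=$(mktemp)
# list1 holds logins used in this article; list2's contents are assumed.
printf '%s\n' carla dawns nikitah nobody > "$list1"
printf '%s\n' carla foober goober helen nanana rubst > "$list2"

# -c reports whether a file is already sorted; it prints nothing on success.
sort -c "$list1" && sort -c "$list2" && echo "both inputs are sorted"

# -m merges the sorted inputs; -u drops the repeated "carla".
sort -um -o "$merged" "$list1" "$list2"
result=$(cat "$merged")
printf '%s\n' "$result"

rm -f "$list1" "$list2" "$merged"
```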

Suppose you want to add line numbers? Why, all you need is nl.

This shows line numbers on the screen, but does not change the file:

$ nl merged_list
     1  carla
     2  dawns
     3  foober
     4  goober
     5  helen
     6  nanana
     7  nikitah
     8  nobody
     9  rubst

Or you can create a file containing the line numbers:

$ nl merged_list > merged_list_numbered

nl can do some other interesting things, like number only the lines that contain a specific string. For example, you want to number only the lines in an article containing your name:

$ nl -bpCarla article.text

You don't have to settle for a boring old tab stop delimiting the line numbers from the lines. Add custom text with the -s flag:

$ nl -bpCarla -s "***hey, here I am*** " article.text
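
A quick way to try this is with a throwaway file. In the sketch below the three-line "article" is made up, and the grep at the end is simply my way of counting how many lines received a number.

```shell
# Made-up three-line article.
article=$(mktemp)
printf '%s\n' 'Carla wrote this.' 'A line about awk.' 'Thanks, Carla.' > "$article"

# -bp<regex> numbers only the lines that match; the rest pass through unnumbered.
numbered=$(nl -bpCarla "$article")
printf '%s\n' "$numbered"

# Count how many lines received a number.
count=$(printf '%s\n' "$numbered" | grep -c '^ *[0-9]')
printf 'numbered lines: %s\n' "$count"

rm -f "$article"
```

The text after -bp is a regular expression, not just a fixed string, so a pattern such as -bp'^Carla' would number only the lines that start with the name.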

You may need leading zeroes, or numbers of a specific length. Suppose you want to use 4-digit IDs on your login list, and you want to start numbering from 0500, with three spaces between the numbers and the logins:

$ nl -nrz -w4 -v0500 -s"   " list1
0500   carla
0501   dawns
0502   nikitah
0503   nobody
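
The same numbering works in a pipeline, which is handy in scripts. A minimal sketch follows; the second command uses -i to set the increment, a flag this article has not covered, and the logins are fed in with printf instead of a file.

```shell
# -nrz: right-justified with leading zeroes; -w4: four digits; -v500: start at 500.
ids=$(printf '%s\n' carla dawns nikitah nobody | nl -nrz -w4 -v500 -s'   ')
printf '%s\n' "$ids"

# -i10 steps the counter by ten, leaving room to slot in new IDs later.
stepped=$(printf '%s\n' carla dawns | nl -nrz -w4 -v500 -i10 -s' ')
printf '%s\n' "$stepped"
```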

That's a mere scratch on the surface of the Wide Wide World Of Super-Specialized Text Utilities. Next month on Scripting Clinic we'll look at putting some of these together for bringing sanity to reading log files.

Check out the man pages for awk, tr, sort, expand/unexpand, and nl. awk is a rather complicated beast, being a full-grown programming language; an excellent reference book is sed & awk, 2nd Edition, by Dale Dougherty and Arnold Robbins. Of course it is an O'Reilly book.
