June 2007

Shell Corner: Littera Delenda Est (part two)

Hosted by Ed Schaefer

In part 2 of his healing process, Royce Williams continues his unrelenting assault on files containing "funny characters".

Littera Delenda Est: on the Removal of Files With Unusual Characters in Their Filenames, Part Two

by Royce Williams

In last month's column, we learned that unusual characters in filenames fall into some discrete categories, and started to tackle the hypothetical example of a directory full of them. The techniques already covered include escaping, workarounds for special shell characters, and dramatic overuse of battle metaphors. We continue in similar fashion, attacking the remaining character types in rough order of increasing difficulty.

Falling Back to Regroup

After our first round of skirmishes, let's see what work remains to delete these unusual files.

We also take this opportunity to remove those files that have already served their purpose as obstacles (so that we can better see what it is that we're having trouble seeing). Drawing on what we've learned so far, here is how to perform that removal, followed by a list of the survivors:

admin@unixlike$ rm -- -keeper \!keeper 0 A Z a z \~keep-me-too
admin@unixlike$ /bin/ls -lA
-rw-r-----  1 admin  admin  22 May  5  2006 ?
-rw-r-----  1 admin  admin  27 May  5  2006 ?
-rw-r-----  1 admin  admin  20 May  5  2006 ?
-rw-r-----  1 admin  admin  26 May  5  2006 ?
-rw-r-----  1 admin  admin  32 May  5  2006 ?
-rw-r-----  1 admin  admin  23 May  5  2006
-rw-r-----  1 admin  admin  29 May  5  2006  ? 
-rw-r-----  1 admin  admin  29 May  5  2006  
-rw-r-----  1 admin  admin  21 May  5  2006 ?

Up to this point, we've been going after what we might call the "low-hanging fruit": those characters that have been easy to identify. The survivors have been better at protecting themselves — they're not just waiting out in the open for us to pick them off, so to speak. We'll have to drop back and perform some additional reconnaissance to determine our next strategy.

Our remaining tasks are also complicated by the fact that differences in the various platforms are going to become more significant. It's almost as if the lesser-used features in a given operating system are more likely to be diverse. As we explore those approaches to revealing unprintable characters that are available on most Unix-likes, we'll try to take these differences into account. We'll also defer doing any more deleting until we've learned a few new identification methods.

cat and Mouse

The -v option to cat, available on many systems, can reveal more about which characters are in play. It replaces many characters that are usually non-printing with either C escape codes or other human-readable representations of their control functions, depending on platform.

To make some of our impending under-the-hood work easier, let's switch to single-column mode for ls with the -1 option (that's the number one). Piping the output through cat -v yields the following:

admin@unixlike$ /bin/ls -1A | cat -v




From the output above, you can see that our stragglers aren't all the same after all, that some of them are still not visible, and that there's a "^M" with some whitespace in front of it.

To gather more intel (pun intended), you can also use cat -e to mark the end of each line with a dollar sign. On most platforms, using -e implies -v. Most of the platforms tested produce similar output:

admin@unixlike$ /bin/ls -1A | cat -e
 ^M $

With the trailing dollar signs, we can now see the boundaries of any whitespace. Since the third entry in our reduced listing was represented by ls as a single character, but its width now appears to be quite a few columns wider than that, we might guess at this point that we're dealing with a tab. There's also some additional whitespace trailing our floating ^M, and we're not sure yet what M-% means.
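
For readers who want to reproduce this view without risking real files, here is a minimal sketch. The scratch directory from mktemp and the variable names are illustrative assumptions, and it presumes a printf that expands C escapes and a cat that accepts -e:

```shell
# Work in a throwaway directory so no real files are touched.
demo=$(mktemp -d) && cd "$demo" || exit 1

# Create a file whose name is: space, carriage return, space.
touch -- "$(printf ' \r ')"

# -e makes control characters visible and marks each line end with '$'.
shown=$(ls -1A | cat -e)
printf '%s\n' "$shown"
```

The dollar sign makes the trailing space after the ^M visible, which a bare listing cannot do.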

That's a much better mugshot lineup than we've had before, but before we start deleting again, let's explore what other discovery methods are available in other circumstances (and on other platforms).

When the cat's away

If your cat doesn't support -v, or if you find that its output isn't giving you enough information, then your ls may be able to help. Most ls implementations support some kind of -b or -B (binary output) option, which renders unprintable characters in some visible way. Some variants of this option output C escape codes, while others display the octal value of the underlying ASCII character. Some flavors support both. Even the output of the same rendering type varies across platforms.

Solaris and HP-UX only support octal dumps with -b:

admin@flare$ /bin/ls -1A -b



NetBSD's -b substitutes escape codes:

admin@ofcourse$ /bin/ls -1A -b




NetBSD's -B output gives the same octal output as Solaris and HP-UX -b:

admin@ofcourse$ /bin/ls -1A -B




FreeBSD can display one more of our test files than NetBSD can:

admin@beastie$ /bin/ls -1A -b


admin@beastie$ /bin/ls -1A -B


Mac OS X appears to be missing an entire file (whatever \245 is), not merely failing to display it:

admin@sonofnext$ /bin/ls -1A -B


There's a significant difference in the last output example. It turns out that the Mac filesystem (HFS+) doesn't appear to allow single-byte characters higher than ASCII 127, which is why our \245 is conspicuously absent. The HFS+ documentation that I could find said that all Unicode characters were supported in filenames, but my testing script showed that every single byte with the high bit set was refused. One plausible explanation is that the filesystem expects names to be valid UTF-8, and a lone byte above 127 is not a valid UTF-8 sequence:

[ Output for 1 through 125 omitted ...]

 Trying to create ASCII 126 (hex 7e, tilde)
 Trying to create ASCII 127 (hex 7f, delete)
 Trying to create ASCII 128 (hex 80)
  Could not create ASCII 128 (hex 80): Invalid argument
 Trying to create ASCII 129 (hex 81)
  Could not create ASCII 129 (hex 81): Invalid argument
 Trying to create ASCII 130 (hex 82)
  Could not create ASCII 130 (hex 82): Invalid argument

[ ... output for 131 through 255 omitted ]
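
A probe of this kind can be sketched in a few lines. This is not the script used above, only an illustration of the idea; the scratch directory, the variable names, and the handful of byte values tried are all assumptions:

```shell
# Work in a throwaway directory so no real files are touched.
demo=$(mktemp -d) && cd "$demo" || exit 1

created=0
for n in 126 127 128 165; do
    # Convert the decimal code to octal, then to the byte itself.
    octal=$(printf '%03o' "$n")
    name=$(printf "\\$octal")
    if touch -- "$name" 2>/dev/null; then
        created=$((created + 1))
        printf 'Created byte %d (octal %s)\n' "$n" "$octal"
    else
        printf 'Could not create byte %d (octal %s)\n' "$n" "$octal"
    fi
done
```

How many of these succeed depends on the filesystem: one that stores names as raw bytes should accept all four, while one that insists on valid UTF-8 should refuse the last two.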

On some Linux flavors, -b has the most useful information so far. Unlike all preceding examples, it produces visible indicators for every line of output. Our previously unseen characters look like whitespace of some kind, and there are three whitespace "somethings" in the next-to-the-last filename:

admin@emperor$ /bin/ls -1A -b
\ \r\ 
\ \ \ 

With cat -e/v and ls -b/-B, we now know a lot more than we did before. Some pockets of resistance remain — but the arsenal isn't empty yet.
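
To see what your own ls does with -b, a scratch directory is the safest proving ground. The sketch below assumes an ls that accepts -b and a printf that expands C escapes; the directory and variable names are illustrative:

```shell
# Work in a throwaway directory so no real files are touched.
demo=$(mktemp -d) && cd "$demo" || exit 1

# One name is space-CR-space, the other is a lone tab.
touch -- "$(printf ' \r ')" "$(printf '\t')"

# -b asks ls to print escapes instead of raw control bytes; whether they
# come out as C escapes or octal values depends on the platform.
escaped=$(ls -1A -b)
printf '%s\n' "$escaped"
```

Comparing this output across the machines you administer is a quick way to learn which flavor of -b each one has.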

Munitions dump

A powerful weapon at our disposal is the venerable od (which stands for "octal dump," though it dumps other formats as well). od is available on many Unix-likes. Piping the output of our ls command to od, and using the -c option (character output), you can see that the output of the ls -1A command here contains some familiar escapes:

admin@unixlike$ /bin/ls -1A | od -c
0000000  \a  \n  \b  \n  \t  \n  \f  \n  \r  \n      \n      \r      \n
0000020              \n 245  \n

If you expect a lot of characters that have no C escapes, using the -x option (hex dump) will render the output as columns of hex, with starting offsets listed in the first column:

admin@unixlike$ /bin/ls -1A | od -x
0000000 0a07 0a08 0a09 0a0c 0a0d 0a20 0d20 0a20
0000020 2020 0a20 0aa5

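Reading and using this output takes a little bit of parsing, since it is showing you the underlying codes instead of using them to alter the appearance of your terminal (newlines, whitespace, etc.). Note also that od -x prints 16-bit words, so on a little-endian machine each pair of bytes appears swapped: 0a07 is the byte 07 followed by the byte 0a. Here is the od -x output rearranged into byte order, one filename per line (each ending in the newline, 0a, that ls adds), so that you can compare it to the Linux ls -b output:

07 0a			\a
08 0a			\b
09 0a			\t
0c 0a			\f
0d 0a			\r
20 0a			\ 
20 0d 20 0a		\ \r\ 
20 20 20 0a		\ \ \ 
a5 0a			\245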
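
If untangling 16-bit words by eye is tedious, od can be asked for one byte at a time. This is a small sketch assuming an od that supports the POSIX -A and -t options; the scratch directory and variable name are illustrative:

```shell
# Work in a throwaway directory so no real files are touched.
demo=$(mktemp -d) && cd "$demo" || exit 1
touch -- "$(printf ' \r ')"

# -An drops the offset column; -tx1 prints one hex byte at a time, so the
# bytes appear in file order regardless of the machine's endianness.
ls -1A | od -An -tx1

# Squeeze out the spacing to get a compact string for comparisons.
bytes=$(ls -1A | od -An -tx1 | tr -d ' \n')
printf '%s\n' "$bytes"
```

The compact form is handy when you want to compare a typed or generated name against what is actually in the directory.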

Note that on some systems, od appears to have been deprecated in favor of hexdump, which usually accepts all of the flags demonstrated.

We have seen the enemy

We've produced some numbers and escape codes for our characters, but what do they mean, and how can we use them?

With your favorite search engine, we can figure out how the listings and dumps above correspond to the values and characters underneath. Combining the results from searching for things like "character codes", "ASCII table" and "C escape sequences", and using those references to look up the characters in the filenames that remain, yields the information represented in the following table:

ASCII name  Decimal  C escape  Mnemonic         Octal  Hex   Control key
BEL         7        \a        alarm            \007   0x07  ^G
BS          8        \b        backspace        \010   0x08  ^H
HT          9        \t        tab              \011   0x09  ^I
LF          10       \n        newline          \012   0x0a  ^J
FF          12       \f        formfeed         \014   0x0c  ^L
CR          13       \r        carriage return  \015   0x0d  ^M
SP          32       n/a       space            \040   0x20  spacebar
n/a         165      n/a       yen              \245   0xa5  n/a

That last one isn't strictly part of the original 7-bit ASCII character set. Depending on the character set and your selected locale, this character could be represented any number of ways. In ISO 8859-1, it's the yen symbol.

Looking back at our directory dump, what at first appeared to be a floating question mark is now clearly a carriage return (^M, 0x0d) with a space (0x20) on either side:

20 0d 20 0a

The file following it is three spaces in a row:

20 20 20 0a

Using commonly available tables and references, the other characters can also be easily looked up. Our bad characters are in serious trouble.
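
When no table is at hand, the numeric columns can be regenerated on the spot. The following sketch assumes only a printf that supports the %d, %o, and %x conversions; the variable names are illustrative:

```shell
# Print decimal, octal, and hex for each character code of interest.
for code in 7 8 9 10 12 13 32 165; do
    printf '%3d  \\%03o  0x%02x\n' "$code" "$code" "$code"
done

# Keep one row in a variable to show the format.
yen_row=$(printf '%3d  \\%03o  0x%02x' 165 165 165)
printf '%s\n' "$yen_row"
```

This covers only the conversions between bases; the names and Control-key equivalents still have to come from a reference.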

You have no chance to survive — make your time

Now that we know exactly what these characters are, how can we delete them?

Knowing their corresponding Control keys will help. It turns out that most of the characters represented by ^ (Control) followed by a key on the keyboard can actually be typed just as they appear ... if you know the secret handshake.

Let's select one of our single-character-long filenames: ^M. If we try to remove it by hitting Control-M, the system will respond as if we had pressed the Enter key:

### Key sequence here is r,m,space,press-and-hold Control,M,release Control
admin@unixlike$ rm 
rm: not enough arguments
usage: rm [-blah]

How can we get our shell to interpret these keystrokes as characters, instead of carrying out their usual functions? It turns out that we can do so using the relatively unknown Control-V shell feature, which tells many shells to interpret the next input character as a literal character (instead of as a control character):

### The key sequence here is l,s,[space],^V,^M
admin@unixlike$ /bin/ls ^M

That question mark is our listing of the file named "^M". To verify that the whitespace really corresponds to what we think that we're typing, we can throw in an od -x or one of the other tools that we've covered. Here, we first test our keystroke by echoing it, and then verify that it matches our filename:

### Each control sequence here is immediately preceded by typing Control-V
admin@unixlike$ echo ^M | od -x
0000000 0d0a
admin@unixlike$ /bin/ls ^M | od -x
0000000 0d0a

Now that we know how to reveal, type and verify some of our characters, we can make short work of them:

### Each control sequence here is immediately preceded by typing Control-V
admin@unixlike$ rm ^M
admin@unixlike$ rm ^G		# That beeping should go away now.
admin@unixlike$ rm ^L		# Form feed clears the screen; fixed.
admin@unixlike$ rm ^H
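
Control-V only helps at an interactive prompt. In a script, one possible equivalent is to let printf generate the control characters from their C escapes. This is a sketch under that assumption, with an illustrative scratch directory and variable names:

```shell
# Work in a throwaway directory so no real files are touched.
demo=$(mktemp -d) && cd "$demo" || exit 1

# Generate each control character from its C escape.
cr=$(printf '\r'); bel=$(printf '\a'); ff=$(printf '\f'); bs=$(printf '\b')
touch -- "$cr" "$bel" "$ff" "$bs"

# Verify a generated name before trusting it with rm.
check=$(printf '%s\n' "$cr" | od -An -tx1 | tr -d ' \n')

# Quote every variable so the shell passes the bytes through untouched.
rm -- "$cr" "$bel" "$ff" "$bs"
remaining=$(ls -1A | wc -l)
```

Note that command substitution strips trailing newlines, so this approach does not work for a name that ends in a newline.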

For our files that are just one or more spaces, we must escape them:

### The key sequence here is l,s,[space],backslash,space
admin@unixlike$ /bin/ls \ 

admin@unixlike$ rm \ 
### Three escaped spaces here.
admin@unixlike$ rm \ \ \ 

For our ^M surrounded by spaces, we use Control-V, and then simply escape the spaces:

### The key sequence here is l,s,[space],backslash,space,^V,^M,backslash,space
admin@unixlike$ /bin/ls \ ^M\ 
admin@unixlike$ rm \ ^M\ 
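
Quoting is an alternative to backslash-escaping each space, and the two can be mixed freely. Here is a short sketch; the scratch directory and variable names are illustrative, and a printf that expands \r is assumed:

```shell
# Work in a throwaway directory so no real files are touched.
demo=$(mktemp -d) && cd "$demo" || exit 1
touch -- ' ' '   ' "$(printf ' \r ')"

# Three backslash-escaped spaces, then a single-quoted lone space.
rm -- \ \ \  ' '

# Double quotes protect the spaces around a generated carriage return.
rm -- " $(printf '\r') "
remaining=$(ls -1A | wc -l)
```

Quotes are often easier to read than a run of backslashes, especially when trailing whitespace is involved.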

Our file that is just a tab (0x09, ^I) is a little different. Because some shells interpret a tab as a Control key of sorts (for command and filename completion), we must both escape the tab and use the Control-V trick to type it. You can type the actual character by either using the Control key for tab (^I) or simply by pressing the Tab key itself:

### The key sequence here is l,s,[space],backslash,^V,^I
admin@unixlike$ /bin/ls \  
### The key sequence here is r,m,[space],backslash,^V,[tab]
admin@unixlike$ rm \ 

The same holds true for the backspace: it must be escaped, and either typing Control-H or pressing the Backspace key itself will do.

Destruction is at hand

For the most part, we've taken care of all of the characters that we can type on the command line, directly or indirectly. Only one file remains:

admin@unixlike$ /bin/ls -1A | cat -ev

But for any characters that have no known keystroke, or if our shell does not support the handy Control-V feature, are there other options?

It would be nice if we knew of some way to enter a character with the keyboard using its decimal, octal or hex value. Surprisingly, some systems support this by using the Alt key and the numeric keypad.

If you have access to a Microsoft machine, give it a try: Open a Command Prompt window and type the following, remembering to use the numeric keys, and making sure that you prepend zeroes to make the field four digits long:

### Key sequence here is Alt-down,numeric 0,1,6,5,Alt-up:

Padding with zeroes appears to invoke ISO-8859-1 on some systems, while not padding invokes another (sometimes the so-called ASCII-II that contains a number of line-drawing characters).

In my testing, the Knoppix system actually supported 8-bit ASCII, including a proper rendering of the yen symbol, as long as I was connected via SSH or on the physical, non-X console. Most of the other Unix-likes properly accepted these keystrokes as long as they were 7-bit ASCII (under 128), and attempts to enter values above 127 were turned into their 7-bit equivalents by stripping their high bit (subtracting 128, which turns 129 into 1, 130 into 2, and so on). This sheds some light on why this character was rendered as M-% ("Meta-%") earlier.

Here's a small demonstration in which I type "HI" followed by Enter (decimal 72, 73, and 13) using nothing but the numeric keypad. I then attempt to type the yen (165), but produce a percent sign (decimal 37, or 165 minus 128) instead:

### The key sequence here is:
###   Alt-down,7,2,Alt-up,Alt-down,7,3,Alt-up,Alt-down,1,3,Alt-up
admin@unixlike$ HI
-bash: HI: command not found

### The sequence here is:
###   Alt-down,0,1,6,5,Alt-up
admin@unixlike$ %
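
The arithmetic behind that percent sign can be checked in the shell. This sketch assumes a shell with POSIX arithmetic expansion and a printf that expands octal escapes; the variable names are illustrative:

```shell
code=165

# Clearing the high bit is the same as subtracting 128 for values 128-255.
stripped=$((code & 127))

# Turn the result back into a character via an octal escape.
char=$(printf "\\$(printf '%03o' "$stripped")")
printf '%d -> %d (%s)\n' "$code" "$stripped" "$char"
```

The same relationship explains the M- notation used by cat -v: the "M-" prefix stands for the high bit, and what follows is the 7-bit character that remains once it is stripped.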

Desperate times call for desperate measures

If we cannot use any of these arcane keyboard tricks to get at the character we want, we'll have to try other angles. Unfortunately, figuring out some way to generate the right characters on the command line and only delete the desired files isn't as easy as it might be. printf(1), for example, is best known for making things human-readable; POSIX-style versions also expand octal escapes such as \245 in the format string, so it is worth a try, but it may be missing or behave differently on older systems.

But there are a number of utilities available on many systems that allow you to perform character substitution. If your version of tr supports replacing characters with their control equivalents, then you're in luck. We can accomplish our mission in a roundabout way with tr by transforming an arbitrary single character (here, a 'T') into the one that we need (in this case, the yen):

admin@unixlike$ /bin/ls -1A `echo T | tr 'T' '\245'`
admin@unixlike$ rm `echo T | tr 'T' '\245'`
admin@unixlike$ /bin/ls -1A `echo T | tr 'T' '\245'`
: No such file or directory

A Perl one-liner also allows for a similar trick:

admin@unixlike$ /bin/ls -lA `perl -e 'print "\245";'`
-rw-r-----  1 admin  admin  0 May  5 2007 ?
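
Where a POSIX-style printf is available, its octal escapes offer a third route to the same byte. The sketch below assumes such a printf and a filesystem that accepts the raw byte 0xa5 in a name (HFS+, as noted earlier, may not); the scratch directory and variable names are illustrative:

```shell
# Work in a throwaway directory so no real files are touched.
demo=$(mktemp -d) && cd "$demo" || exit 1

# Generate the single byte 0xa5 from its octal escape.
yen=$(printf '\245')
touch -- "$yen"

# Confirm what is really in the directory before deleting.
seen=$(ls -1A | od -An -tx1 | tr -d ' \n')

rm -- "$yen"
remaining=$(ls -1A | wc -l)
```

Storing the generated byte in a variable and quoting it avoids repeating the command substitution for every command.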

If you're desperate (for example, if you're in a barren wasteland in which neither tr nor Perl is available), then there are a couple of techniques of last resort that can be used even if you cannot determine what your strange characters are.

If you can't refer to the files by name, you can get a handhold using the file's inode. Many Unix-like ls variants support the -i option, revealing the inodes of any files listed:

admin@unixlike$ ls -li
17878240 -rw-r--r--  1 admin  admin  0 May  5 2007 ?

... which can then be passed to find, as long as your find also has an -inum option:

admin@unixlike$ find . -inum 17878240 -exec ls -la '{}' \; 
-rw-r-----  1 admin  admin  0 May  5 2007 ?
admin@unixlike$ find . -inum 17878240 -exec rm '{}' \; 
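
The two steps can be chained so that the inode number never has to be retyped. This is a sketch assuming an ls with -i, a find with -inum and -xdev, and awk; the scratch directory and variable names are illustrative:

```shell
# Work in a throwaway directory so no real files are touched.
demo=$(mktemp -d) && cd "$demo" || exit 1
touch -- "$(printf '\r')"

# The first field of 'ls -i' output is the inode number.
inode=$(ls -iA | awk 'NR == 1 { print $1 }')

# -xdev keeps find on one filesystem, because inode numbers are only
# unique within a single filesystem.
find . -xdev -inum "$inode" -exec rm -- {} \;
remaining=$(ls -1A | wc -l)
```

If the file has additional hard links elsewhere under the starting directory, they share the inode number and would be removed as well, so it is worth running the find with -exec ls first, as shown above.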

But the true approach of last resort that still allows you some precision (and a shred of dignity) involves dumping the contents of the directory itself to a temporary file, editing it as needed, and then deleting every file listed in the temporary file. This is about as elegant as a drunk rhinoceros, but it gets the job done. (In the following example, note that you must leave out the name of your tempfile itself, because it is created very early on, with zero length, as part of the command pipelining process):

admin@unixlike$ /bin/ls -1 |grep -v temp.list >temp.list
admin@unixlike$ cat -v temp.list

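To verify that you have the right filenames captured, you can script a quick one-liner to list each file. Quoting the variable, and using IFS= with read -r, keeps whitespace and backslashes in the names intact:

admin@unixlike$ while IFS= read -r file; do ls -lA -- "$file"; done < temp.list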
-rw-r-----  1 admin  admin  21 May  5 2007 ?

... and then modify that one-liner slightly to delete them:

admin@unixlike$ while IFS= read -r file; do rm -- "$file"; done < temp.list
admin@unixlike$ /bin/ls -1
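
The whole list-and-delete procedure, including a file that must survive, might look like the following in a scratch directory. The file names, the keeper, and the variable names are illustrative assumptions:

```shell
# Work in a throwaway directory so no real files are touched.
demo=$(mktemp -d) && cd "$demo" || exit 1
touch -- ' ' "$(printf ' \r ')" keeper

# List everything except the list file itself and the file to keep.
# -x matches whole lines and -F treats the patterns as fixed strings.
ls -1A | grep -vxF -e temp.list -e keeper > temp.list

# IFS= and -r keep leading/trailing spaces and backslashes intact.
while IFS= read -r file; do
    rm -- "$file"
done < temp.list

rm -- temp.list
left=$(ls -1A)
printf '%s\n' "$left"
```

Because the list holds one name per line, this approach cannot cope with a filename that itself contains a newline; the inode method is the safer choice there.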

At long last, our character assassination is complete. You should now have the power to delete many files that would have stumped you before.

You've probably also figured out that your new deletion powers could just as easily be applied to creating unusual filenames. To quote sudo, "With great power comes great responsibility." Don't be tempted to torture the new junior sys admin. Save it for your peers.

Once more unto the breach

I hope that I've given you the ability to handle files with unusual filenames in a more "sysadminly" fashion. If you can figure out exactly what the unusual characters are, then you can delete them with precision. When you have a big directory full of bad filenames mixed in with ones that you need to keep, these methods can help you avoid having to resort to moving files out of the way in wildcarded groups or the slash-and-burn technique of deleting entire directories. Good luck!

Appendix: Testing Platforms

I tried most of these techniques on the following platforms (listed here by the output of uname -srm).

While I also mentioned Cygwin in this column, it is so different from the other flavors that it will have to wait for another day.

Most sessions were conducted using PuTTY and bash v3, and most of the systems were using the ISO-8859-1 character set. While the specifics vary, these techniques should at least give users of different locales a starting point.


All accessed April 2007 unless otherwise indicated.

Royce Williams is a Unix-like systems administrator for an Alaskan telecommunications company. He was included in the package when they acquired the first Alaskan ISP. When not flushing bad characters to ground, Royce likes watching indie movies and trying to put FreeBSD on ancient hardware. He also has an Alaskan license plate problem. You can reach him at royce@tycho.org.

Copyright © 2007 Royce D. Williams. All rights reserved.