Making ‘find’ go faster

The mysterious Unix find command has some interesting things about it. For one, the order of the command-line parameters is important—earlier commands affect later commands. The ramifications of this finally became clear to me after using find for 12 years.

So here is this and a few things I’ve learned over the years to make find go faster:

Idea Number 1

find has command-line arguments which are called “expressions” or “primaries”. These expressions are tests or rules that are applied to each file as find crawls the file system, and the results of each test are used to determine if find should proceed.

“Files” (and we really mean any file system entry: directories, devices, links, etc.) not matching the criteria are skipped.

Example 1a

find . -type f -name "joe*.jpg" -print

This example will find all regular files (not directories) with the name “joe(something).jpg”:

joe.jpg
joeschmoe.jpg
joe is awesome.jpg

Example 1b

find . -type d -path "*/lib/*" -print

This will find all directories that have /lib/ as part of their pathname (but not lib itself).


Idea Number 2

find has logical operators (“and” and “or”) and group operators (parentheses) which allow you to group expressions.

Corollary 2

Group expressions with parentheses, and use logical operators to include or exclude files.

Example 2a

find . ( -name "*.php" -or -name "*.phtml" ) -type f -print

This will find all regular files that end in “.php” or “.phtml”. The parentheses group the two -name expressions, so that if either one matches, the entire expression (between the parentheses) is true (and find then continues on to evaluate the next expression).

Note 2

Depending on the shell you are using, you may need to escape the parentheses so that the shell doesn’t see them first:

bash $ find . \( -name "*.php" -or -name "*.phtml" \) -type f -print

Example 2b

find . -name "*.[Pp][Hh][Pp]" -type f -print

You can use character classes to match filenames that may have varying case:

foo.PHP
foo.php
foo.Php
foo.pHp

Idea Number 3

find evaluates its expressions in the order they appear. This has important ramifications. When find runs, it works like this:

find . -type f -name "*.jpg" -print

This repeats for all files rooted in the current directory (.). You’ll notice that find operates on its expressions with an implicit ‘and’ operator between them, and uses shortcut logic. In other words, for ‘and’ conditions, the second expression will not be evaluated if the first one is false. For ‘or’ conditions, subsequent expressions will not be evalutated if the first one is found true.

Corollary 3

Skip as much as possible as early as possible by putting the ‘and’ expressions most likely to fail (or ‘or’ expressions most likely to succeed) near the beginning of the expression list.

Example 3a

Avoid entire directory hierarchies with -prune:

find share ( -path "share/doc" -prune ) -o -type f -print

This will significantly reduce the number of comparisons we have to make while find is traversing the file tree.


Benchmarks

With filesystem read caching, it becomes very difficult to benchmark disk I/O easily. The best way (that I’ve heard of) is to unmount and remount the partition, which clears the read cache (but setting up that test environment takes more time than I have).

Instead we’ll count system calls, which should give us a rough idea of speed, especially when comparing two different find calls to achieve the same thing.

What to look for:

Look at the lstat64 syscall, as well as the calls to chdir, open, close, and fstat64.

First we’ll look at this call to find. Notice that the -type f test occurs before the -path test:

find share -type f ! -path "share/doc/*" -print

An strace helps us see what’s happening:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 75.38    0.115423           2     53671           lstat64
  9.56    0.014634           3      4396           getdents64
  4.83    0.007399           2      4115           chdir
  4.22    0.006467           3      2064         1 open
  1.83    0.002797           9       324           write
  1.71    0.002614           1      2062           close
  1.21    0.001850           1      2063           fstat64
  1.12    0.001716           1      2058           fcntl64
------ ----------- ----------- --------- --------- ----------------
100.00    0.153119                 70776         1 total

(I’ve omitted all of the calls that were called only once and took less than 1% of time time). We’ve got 53671 lstat calls and 2k to 4k other common syscalls.

Next we’ll look at putting the -type f test after we’ve checked the path:

find share ! -path "share/doc/*" -type f -print

Here’s the strace output:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 72.91    0.100304           2     43210           lstat64
 10.88    0.014970           3      4396           getdents64
  5.39    0.007410           2      4115           chdir
  3.41    0.004694           2      2064         1 open
  2.24    0.003076           9       324           write
  2.05    0.002817           1      2062           close
  1.59    0.002184           1      2063           fstat64
  1.40    0.001922           1      2058           fcntl64
------ ----------- ----------- --------- --------- ----------------
100.00    0.137577                 60315         1 total

Notice that we’re making 10k fewer calls to lstat64 than before, and everything else is roughly equal. You begin to see the effect of ordering your expressions!

Finally, we’ll use find’s -prune expression to simply avoid an entire hierarchy:

find share ( -path "share/doc" -prune ) -o -type f -print

And the strace output:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 78.35    0.092499           2     39040           lstat64
  8.04    0.009487           4      2602           getdents64
  3.50    0.004130           2      2361           chdir
  3.01    0.003558           3      1181           fcntl64
  2.55    0.003011           9       324           write
  2.24    0.002642           2      1187         1 open
  1.23    0.001454           1      1185           close
  0.91    0.001074           1      1186           fstat64
------ ----------- ----------- --------- --------- ----------------
100.00    0.118052                 49089         1 total

Now we’re a few thousand calls to lstat64 less than before, and half as many calls to chdir, open, fcntl, close, and fstat64, for a total of over 20k fewer syscalls.

You can see the effect -prune as well as the ordering of expressions has on execution speed.


Summary

find examines each entry in a directory hierarchy; if the entry (be it a regular file, symlink, directory, device, or whatever) matches the given criteria, it will be printed. If no criteria are given, it’s an automatic match (and will be printed).

find evaluates its expressions in the order you specify them; by carefully choosing the order of the expressions, you can shave tons of time off the cost of the search and get your wanted results quicker.

As is true with most programming endeavours, a little investment up front in crafting the expressions will yield better results later. If you’re only running a find operation once, maybe you don’t want to take too much time fussing over saving a few syscalls, but if you will be running find frequently (e.g., as part of a cron, or some other regular occurance), do your disk a favor and let find skip as much as possible.

categories: /tech/sysadmin
posted on Wed, 24 Oct 2007 at 14:51 | permanent link |