Better Living Through Thinking
Making 'find' go faster
Wed, 24 Oct 2007
Here are a few things I've learned over the years to make the mysterious Unix 'find' go faster.
Idea Number 1

"Files" (and we really mean any file system entry: directories, devices, links, etc.) not matching the criteria are skipped.

Example 1a

    find . -type f -name "joe*.jpg" -print

This example will find all regular files (not directories) with the name "joe(something).jpg":

    joe.jpg
    joeschmoe.jpg
    joe is awesome.jpg

Example 1b

    find . -type d -path "*/lib/*" -print

This will find all directories that have "/lib/" as part of their pathname (but not "lib" itself).

Idea Number 2
Corollary 2

Group expressions with parentheses, and use logical operators to include or exclude files.

Example 2a

    find . ( -name "*.php" -or -name "*.phtml" ) -type f -print

This will find all regular files that end in ".php" or ".phtml". The parentheses group the two '-name' expressions, so that if either one matches, the entire expression (between the parentheses) is true.

Note 2

Depending on the shell you are using, you may need to escape the parentheses so that the shell doesn't see them first:

    bash $ find . \( -name "*.php" -or -name "*.phtml" \) -type f -print

Example 2b

    find . -name "*.[Pp][Hh][Pp]" -type f -print

You can use character classes to match filenames that may have varying case:

    foo.PHP
    foo.php
    foo.Php
    foo.pHp

Idea Number 3
    find . -type f -name "*.jpg" -print

This repeats for all files rooted in the current directory ('.').

Corollary 3

Skip as much as possible as early as possible by putting the 'and' expressions most likely to *fail* (or 'or' expressions most likely to succeed) near the beginning of the expression list.

Example 3a

Avoid entire directory hierarchies with '-prune':

    find share ( -path "share/doc" -prune ) -o -type f -print

This will significantly reduce the number of comparisons we have to
make.

Benchmarks

With filesystem read caching, it becomes very difficult to benchmark disk I/O easily. The best way (that I've heard of) is to unmount and remount the partition, which clears the read cache (but setting up that test environment takes more time than I have). Instead we'll count system calls, which should give us a rough idea of
speed, especially when comparing two different 'find' commands.

What to look for: Look at the 'calls' column.

First we'll look at this call to 'find':

    find share -type f ! -path "share/doc/*" -print

An 'strace' helps us see what's happening:

    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     75.38    0.115423           2     53671           lstat64
      9.56    0.014634           3      4396           getdents64
      4.83    0.007399           2      4115           chdir
      4.22    0.006467           3      2064         1 open
      1.83    0.002797           9       324           write
      1.71    0.002614           1      2062           close
      1.21    0.001850           1      2063           fstat64
      1.12    0.001716           1      2058           fcntl64
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.153119                 70776         1 total

(I've omitted all of the calls that were called only once and took less than 1% of the time.)

We've got 53671 calls to lstat64.

Next we'll look at putting the '-type f' test after we've checked the path:

    find share ! -path "share/doc/*" -type f -print

Here's the 'strace' output:

    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     72.91    0.100304           2     43210           lstat64
     10.88    0.014970           3      4396           getdents64
      5.39    0.007410           2      4115           chdir
      3.41    0.004694           2      2064         1 open
      2.24    0.003076           9       324           write
      2.05    0.002817           1      2062           close
      1.59    0.002184           1      2063           fstat64
      1.40    0.001922           1      2058           fcntl64
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.137577                 60315         1 total

Notice that we're making 43210 calls to lstat64, about 10,000 fewer than before.

Finally, we'll use '-prune':

    find share ( -path "share/doc" -prune ) -o -type f -print

And the 'strace' output:

    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     78.35    0.092499           2     39040           lstat64
      8.04    0.009487           4      2602           getdents64
      3.50    0.004130           2      2361           chdir
      3.01    0.003558           3      1181           fcntl64
      2.55    0.003011           9       324           write
      2.24    0.002642           2      1187         1 open
      1.23    0.001454           1      1185           close
      0.91    0.001074           1      1186           fstat64
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.118052                 49089         1 total

Now we're down to 39040 calls to lstat64, a few thousand fewer again.

You can see the effect that '-prune', as well as the ordering of expressions, has on execution speed.

Summary
As is true with most programming endeavours, a little investment up
front in crafting the expressions will yield better results later.
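To try the three benchmark variants yourself, here is a minimal sketch that runs them against a small scratch tree and checks that they select the same files. The directory layout and file names are made up for illustration (they are not the 'share' tree measured above), and the script assumes a 'find' that supports '-path' and '-prune', plus 'mktemp -d'.

```shell
#!/bin/sh
# Scratch tree with a hypothetical layout: one hierarchy we want to
# skip (share/doc) and two we want to keep.
tmp=$(mktemp -d)
mkdir -p "$tmp/share/doc/pkg" "$tmp/share/man/man1" "$tmp/share/icons"
touch "$tmp/share/doc/README" "$tmp/share/doc/pkg/NOTES" \
      "$tmp/share/man/man1/find.1" "$tmp/share/icons/joe.jpg"

# Variant 1: '-type f' first, then exclude by path.
v1=$(cd "$tmp" && find share -type f ! -path "share/doc/*" -print | sort)

# Variant 2: path test first, then '-type f'.
v2=$(cd "$tmp" && find share ! -path "share/doc/*" -type f -print | sort)

# Variant 3: prune share/doc so find never descends into it.
# The parentheses are escaped so the shell passes them through to find.
v3=$(cd "$tmp" && find share \( -path "share/doc" -prune \) -o -type f -print | sort)

# All three should list the same regular files outside share/doc.
if [ "$v1" = "$v2" ] && [ "$v2" = "$v3" ]; then
    echo "all three variants agree:"
    printf '%s\n' "$v3"
else
    echo "variants differ" >&2
fi

# Remove the scratch tree.
rm -rf "$tmp"
```

The variants differ only in how much work 'find' does to reach the same answer. If 'strace' is installed, running each command under 'strace -c' should produce a per-syscall summary like the tables above, though the counts will depend on your own directory tree.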