bash - Counting lines or enumerating line numbers so I can loop over them - why is this an anti-pattern?

Question

Welcome To Ask or Share your Answers For Others

bash - Counting lines or enumerating line numbers so I can loop over them - why is this an anti-pattern?

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

bash - Counting lines or enumerating line numbers so I can loop over them - why is this an anti-pattern?

I posted the following code and got scolded. Why is this not acceptable?

numberOfLines=$(wc -l <"$1")
for ((i=1; $i<=$numberOfLines; ++$i)); do
  lineN=$(sed -n "$i!d;p;q" "$1")
  # ... do things with "$lineN"
done

We collect the number of lines in the input file into numberOfLines, then loop from 1 to that number, pulling out the next line from the file with sed in each iteration.

The feedback I received complained that reading the same file repeatedly with sed inside the loop to get the next line is inefficient. I guess I could use head -n "$i" "$1" | tail -n 1 but that's hardly more efficient, is it?

Is there a better way to do this? Why would I want to avoid this particular approach?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-17T01:33:42+0000

The shell (and basically every programming language which is above assembly language) already knows how to loop over the lines in a file; it does not need to know how many lines there will be to fetch the next one — strikingly, in your example, sed already does this, so if the shell couldn't do it, you could loop over the output from sed instead.

The proper way to loop over the lines in a file in the shell is with while read. There are a couple of complications — commonly, you reset IFS to avoid having the shell needlessly split the input into tokens, and you use read -r to avoid some pesky legacy behavior with backslashes in the original Bourne shell's implementation of read, which have been retained for backward compatibility.

while IFS='' read -r lineN; do
    # do things with "$lineN"
done <"$1"

Besides being much simpler than your sed script, this avoids the problem that you read the entire file once to obtain the line count, then read the same file again and again in each loop iteration. With a typical modern disk driver, some repeated reading can be avoided by caching, but the basic fact is still that reading information from disk is on the order of 1000x slower than not doing it when you can avoid it. Especially with a large file, the cache will fill up eventually, and so you end up reading in and discarding the same bytes over and over, adding a significant amount of CPU overhead and an even more significant amount of the CPU simply doing something else while waiting for the disk to deliver the bytes you read.

In a shell script, you also want to avoid the overhead of an external process if you can. Invoking sed (or the functionally equivalent but even more expensive two-process head -n "$i"| tail -n 1) thousands of times in a tight loop will add significant overhead for any non-trivial input file. (On the other hand, if the body of your loop could be done in e.g. sed or Awk instead, that's going to be a lot more efficient than a native shell while read loop, because of the way read is implemented. This is why while read is also frequently regarded as an antipattern.)

The q in the sed script is a very partial remedy; frequently, you see variations where the sed script will read the entire input file through to the end each time, even if it only wants to fetch one of the very first lines out of the file.

With a small input file, the effects are negligible, but perpetuating this bad practice just because it's not immediately harmful when the input file is small is just irresponsible. Just don't teach this technique to beginners. At all.

If you really need to display the number of lines in the input file, at least make sure you don't spend a lot of time seeking through to the end just to obtain that number. Maybe stat the file and keep track of how many bytes there are on each line, so you can project the number of lines you have left (and instead of line 1/10345234 display something like line 1/approximately 10000000?) ... or use an external tool like pv.

Tangentially, there is a vaguely related antipattern you want to avoid, too; you want to avoid reading an entire file into memory when you are only going to process one line at a time. Doing that in a for loop also has some additional gotchas, so don't do that, either; see https://mywiki.wooledge.org/DontReadLinesWithFor

Categories

bash - Counting lines or enumerating line numbers so I can loop over them - why is this an anti-pattern?

bash - Counting lines or enumerating line numbers so I can loop over them - why is this an anti-pattern?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags