c - Why does this implementation of strlen() work?

Question

Welcome To Ask or Share your Answers For Others

c - Why does this implementation of strlen() work?

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

c - Why does this implementation of strlen() work?

_{(Disclaimer: I've seen this question, and I am not re-asking it -- I am interested in why the code works, and not in how it works.)}

So here's this implementation of Apple's (well, FreeBSD's) strlen(). It uses a well-known optimization trick, namely it checks 4 or 8 bytes at once, instead of doing a byte-by-byte comparison to 0:

size_t strlen(const char *str)
{
    const char *p;
    const unsigned long *lp;

    /* Skip the first few bytes until we have an aligned p */
    for (p = str; (uintptr_t)p & LONGPTR_MASK; p++)
        if (*p == '')
            return (p - str);

    /* Scan the rest of the string using word sized operation */
    for (lp = (const unsigned long *)p; ; lp++)
        if ((*lp - mask01) & mask80) {
        p = (const char *)(lp);
        testbyte(0);
        testbyte(1);
        testbyte(2);
        testbyte(3);
#if (LONG_BIT >= 64)
        testbyte(4);
        testbyte(5);
        testbyte(6);
        testbyte(7);
#endif
    }

    /* NOTREACHED */
    return (0);
}

Now my question is: maybe I'm missing the obvious, but can't this read past the end of a string? What if we have a string of which the length is not divisible by the word size? Imagine the following scenario:

|<---------------- all your memories are belong to us --------------->|<-- not our memory -->
+-------------+-------------+-------------+-------------+-------------+ - -
|     'A'     |     'B'     |     'C'     |     'D'     |      0      |
+-------------+-------------+-------------+-------------+-------------+ - -
^                                                      ^^
|                                                      ||
+------------------------------------------------------++-------------- - -
                       long word #1                      long word #2

When the second long word is read, the program accesses bytes that it shouldn't in fact be accessing... isn't this wrong? I'm pretty confident that Apple and the BSD folks know what they are doing, so could someone please explain why this is correct?

One thing I've noticed is that beerboy asserted this to be undefined behavior, and I also believe it indeed is, but he was told that it isn't, because "we align to word size with the initial for loop" (not shown here). However, I don't see at all why alignment would be any relevant if the array is not long enough and we are reading past its end.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-16T23:59:35+0000

Although this is technically undefined behavior, in practice no native architecture checks for out-of-bounds memory access at a finer granularity than the size of a word. So while garbage past the terminator may end up being read, the result will not be a crash.

Categories

c - Why does this implementation of strlen() work?

c - Why does this implementation of strlen() work?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags