Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
324 views
in Technique[技术] by (71.8m points)

c - Is strtok broken? Or just tricky?

Is strtok hopelessly broken?

On many StackOverflow questions about text-parsing in C, someone will suggest using strtok, and one common reply is that strtok should never be used, that it is hopelessly broken.

Some posters have claimed that strtok's problems are limited to multi-threading issues, and it is safe in a single-threaded environment.

What is the right answer?
Does it work?
Is it hopelessly broken?
Can you back up your answer with examples?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Yes, strtok is hopelessly broken, even in a simple single-threaded program, and I will demonstrate this failure with some sample code:

Let us begin with a simple text-analyzer function to gather statistics about sentences of text, using strtok. This code will lead to undefined behavior.

In this example, a sentence is a set of words separated by spaces, commas, semi-colons, and periods.

// Example:
//     int words, longest;
//     GetSentenceStats("There were a king with a large jaw and a queen with a plain face, on the throne of England.", &words, &longest);
// will report there are 20 words, and the longest word has 7 characters ("England").
void GetSentenceStats(const char* sentence, int* pWordCount, int* pMaxWordLen)
{
    char* delims = " ,;.";           // In a sentence, words are separated by spaces, commas, semi-colons or period.
    char* input = strdup(sentence);  // Make an local copy of the sentence, to be modified without affecting the caller.

    *pWordCount = 0;                 // Initialize the output to Zero
    *pMaxWordLen = 0;

    char* word = strtok(input, delims);
    while(word)
    {
        (*pWordCount)++;
        *pMaxWordLen = MAX(*pMaxWordLen, (int)strlen(word));
        word = strtok(NULL, delims);
    }
    free(input);
}

This simple function works. There are no bugs so far.


Now let us augment our library to add a function that gathers stats on Paragraphs of text.
A paragraph is a set of sentences separated by Exclamation Marks, Question Marks and Periods.

It will return the number of sentences in the paragraph, and the number of words in the longest sentence.
And perhaps most importantly, it will use the earlier function GetSentenceStats to help

void GetParagraphStats(const char* paragraph, int* pSentenceCount, int* pMaxWords)
{
    char* delims = ".!?";             // Sentences in a paragraph are separated by Period, Question-Mark, and Exclamation.
    char* input = strdup(paragraph);  // Make an local copy of the paragraph, to be modified without affecting the caller.

    *pSentenceCount = 0;
    *pMaxWords = 0;
    char* sentence = strtok(input, delims);
    while(sentence)
    {
        (*pSentenceCount)++;

        int wordCount;
        int longestWord;
        GetSentenceStats(sentence, &wordCount, &longestWord);
        *pMaxWords = MAX(*pMaxWords, wordCount);
        sentence = strtok(NULL, delims);    // This line returns garbage data, 
    }
    free(input);
}

This function also looks very simple and straightforward.
But it does not work, as demonstrated by this sample program.

int main(void)
{
    int cnt;
    int len;

    // First demonstrate that the SentenceStats function works properly:
    char *sentence = "There were a king with a large jaw and a queen with a plain face, on the throne of England."; 
    GetSentenceStats(sentence, &cnt, &len);
    printf("Word Count: %d
Longest Word: %d
", cnt, len);
    // Correct Answer:
    // Word Count: 20
    // Longest Word: 7   ("England")


    printf("

At this point, expected output is 20/7.
Everything is working fine

");

    char paragraph[] =  "It was the best of times!"   // Literary purists will note I have changed Dicken's original text to make a better example
                        "It was the worst of times?"
                        "It was the age of wisdom."
                        "It was the age of foolishness."
                        "We were all going direct to Heaven!";
    int sentenceCount;
    int maxWords;
    GetParagraphStats(paragraph, &sentenceCount, &maxWords);
    printf("Sentence Count: %d
Longest Sentence: %d
", sentenceCount, maxWords);
    // Correct Answer:
    // Sentence Count: 5
    // Longest Sentence: 7  ("We were all going direct to Heaven")

    printf("

At the end, expected output is 5/7.
But Actual Output is Undefined Behavior! Strtok is hopelessly broken
");
    _getch();
    return 0;
}

All calls to strtok are entirely correct, and are on separate data.
But the result is Undefined Behavior!

Why does this happen?
When GetParagraphStats is called, it begins a strtok-loop to get sentences. On the first sentence it will call GetSentenceStats. GetSentenceStats will also being a strtok-loop, losing all state established by GetParagraphStats. When GetSentenceStats returns, the caller (GetParagraphStats) will call strtok(NULL) again to get the next sentence.

But strtok will think this is a call to continue the previous operation, and will continue tokenizing memory that has now been freed! The result is the dreaded Undefined Behavior.

When is it safe to use strtok?
Even in a single-threaded environment, strtok can only be used safely when the programmer/architect is sure of two conditions:

  • The function using strtok must never call any function that may also use strtok.
    If it calls a subroutine that also uses strtok, its own use of strtok may be interrupted.

  • The function using strtok must never be called by any function that may also use strtok.
    If this function ever called by another routine using strtok, then this function will interrupt the callers use of strtok.

In a multi-threaded environment, use of strtok is even more impossible, because the programmer needs to be sure that there is only one use of strtok on the current thread, and also, no other threads are using strtok either.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

56.9k users

...