The linked article is full of basic errors, and should not be relied upon.
The process of parsing C or C++ is defined as a series of transformations:1
- Backslash-newline is replaced with nothing whatsoever -- not even a space.
- Comments are removed and replaced with a single space each.
- The surviving text is converted into a series of preprocessing tokens. These are less specific than the tokens used by the language proper: for instance, the keyword
if
is an IF token to the language proper, but just an IDENT token to the preprocessor.
- Preprocessing directives are executed and macros are expanded.
- Each preprocessing token is converted into a token.
- the stream of tokens is parsed into an abstract syntax tree, and the rest of the compiler takes it from there.
Your example program
#include<stdio.h>
#define max 100
int main()
{
printf("max is %d", max);
return 0;
}
will, after transformation 3, be this series of 23 preprocessing tokens:
PUNCT:# IDENT:include INCLUDE-ARG:<stdio.h>
PUNCT:# IDENT:define IDENT:max PP-NUMBER:100
IDENT:int IDENT:main PUNCT:( PUNCT:)
PUNCT:{
IDENT:printf PUNCT:( STRING:"max is %d" PUNCT:, IDENT:max PUNCT:) PUNCT:;
IDENT:return PP-NUMBER:0 PUNCT:;
PUNCT:}
The directives are still present at this stage. Please notice that #include
and #define
are each two tokens: the #
and the directive name are separate. Some people like to write complex #if
nests with the hashmarks all in column 1 but the directive names indented.
After transformation 5, though, the directives are gone and we have this series of 16+n tokens:
[ ... some large volume of tokens produced from the contents of stdio.h ... ]
INT IDENT:main LPAREN RPAREN
LBRACE
IDENT:printf LPAREN STRING:"max is %d" COMMA DECIMAL-INTEGER:100 RPAREN SEMICOLON
RETURN DECIMAL-INTEGER:0 SEMICOLON
RBRACE
where 'n' is however many tokens came from stdio.h.
Preprocessing directives (#include
, #define
, #if
, etc.) are always removed from the token stream and perhaps replaced with something else, so you will never have tokens after transformation 6 that directly result from the text of a directive line. But you will usually have tokens that result from the effects of each directive, such as the contents of stdio.h
, and DECIMAL-INTEGER:100
replacing IDENT:max
.
Finally, C and C++ do this series of operations almost, but not quite, the same, and the specifications are formally independent. You can usually rely on preprocessing operations to behave the same in both languages, as long as you're only doing simple things with the preprocessor, which is best practice nowadays anyway.
1 You will sometimes see people talking about translation phases, which are the way the C and C++ standards officially describe this series of operations. My list is not the list of translation phases; it includes separate bullet points for some things that are grouped as a single phase by the standards, and leaves out several steps that aren't relevant to this discussion.