Dummies are any variables that are either one or zero for each observation. pd.get_dummies
when applied to a column of categories where we have one category per observation will produce a new column (variable) for each unique categorical value. It will place a one in the column corresponding to the categorical value present for that observation. This is equivalent to one hot encoding.
One-hot encoding is characterized by having only one one per set of categorical values per observation.
Consider the series s
s = pd.Series(list('AABBCCABCDDEE'))
s
0 A
1 A
2 B
3 B
4 C
5 C
6 A
7 B
8 C
9 D
10 D
11 E
12 E
dtype: object
pd.get_dummies
will produce one-hot encoding. And yes! it is absolutely appropriate to not fit the intercept.
pd.get_dummies(s)
A B C D E
0 1 0 0 0 0
1 1 0 0 0 0
2 0 1 0 0 0
3 0 1 0 0 0
4 0 0 1 0 0
5 0 0 1 0 0
6 1 0 0 0 0
7 0 1 0 0 0
8 0 0 1 0 0
9 0 0 0 1 0
10 0 0 0 1 0
11 0 0 0 0 1
12 0 0 0 0 1
However, if you had s
include different data and used pd.Series.str.get_dummies
s = pd.Series('A|B,A,B,B,C|D,D|B,A,B,C,A|D'.split(','))
s
0 A|B
1 A
2 B
3 B
4 C|D
5 D|B
6 A
7 B
8 C
9 A|D
dtype: object
Then get_dummies
produces dummy variables that are not one-hot encoded and you could theoretically leave the intercept.
s.str.get_dummies()
A B C D
0 1 1 0 0
1 1 0 0 0
2 0 1 0 0
3 0 1 0 0
4 0 0 1 1
5 0 1 0 1
6 1 0 0 0
7 0 1 0 0
8 0 0 1 0
9 1 0 0 1
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…