Yes, this is expected behaviour; it stems from the dtype (storage type) Pandas picks for each series (column). The first input results in a series of floating-point numbers, while the second holds references to Python objects:
>>> pd.Series([0,1,None]).dtype
dtype('float64')
>>> pd.Series([0,1,None,'test']).dtype
dtype('O')
The float version of None is NaN, or Not a Number, which converts to True when interpreted as a boolean (as it is not equal to 0):
>>> pd.Series([0,1,None])[2]
nan
>>> bool(pd.Series([0,1,None])[2])
True
In the other case, the original None object was preserved, which converts to False:
>>> pd.Series([0,1,None,'test'])[2] is None
True
>>> bool(None)
False
So this comes down to automatic type inference: what type Pandas thinks is best suited for each column; see the DataFrame.infer_objects() method. The goal is to minimise storage requirements and maximise operation performance; storing numbers as native 64-bit floating-point values leads to faster numeric operations and a smaller memory footprint, while still being able to represent 'missing' values as NaN. However, when you pass in a mix of numbers and strings, Pandas can't use a dedicated specialised array type and so falls back to the "Python object" type, which stores references to the original Python objects.
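You can see that fallback in action: with object dtype, each element is a reference to the original Python object, so the integers stay Python ints and the string stays a str (a small sketch using the same example data):

```python
import pandas as pd

s = pd.Series([0, 1, None, 'test'])
print(s.dtype)       # object: no specialised array type fits this mix
print(type(s[0]))    # <class 'int'>: the original Python int is preserved
print(s[2] is None)  # True: None itself is stored, not NaN
```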
Instead of letting Pandas guess what type you need, you can specify the dtype explicitly. You could use one of the nullable integer types (which use pd.NA instead of NaN); converting these to booleans results in missing values converting to False:
>>> pd.Series([0,1,None], dtype=pd.Int64Dtype()).astype(bool)
0 False
1 True
2 False
dtype: bool
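The nullable integer dtype also has a string alias, "Int64" (note the capital I, as opposed to NumPy's lowercase "int64"), which names the same dtype as pd.Int64Dtype(); a quick sketch:

```python
import pandas as pd

# "Int64" is the registered alias for the nullable integer dtype.
s = pd.Series([0, 1, None], dtype="Int64")
print(s.dtype)        # Int64
print(s[2] is pd.NA)  # True: the missing entry is stored as pd.NA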
Another option is to convert to a nullable boolean type, and so preserve the None / NaN indicators of missing data:
>>> pd.Series([0,1,None]).astype("boolean")
0 False
1 True
2 <NA>
dtype: boolean
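One practical consequence: a nullable boolean that still contains <NA> cannot be used directly as a filter mask; you have to decide what missing means first, e.g. by filling with False. A small sketch (the data series here is hypothetical):

```python
import pandas as pd

mask = pd.Series([0, 1, None]).astype("boolean")  # False, True, <NA>
data = pd.Series(['a', 'b', 'c'])

# Boolean indexing refuses masks containing <NA>; fill them explicitly.
print(data[mask.fillna(False)])  # keeps only 'b'
```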
Also see the Working with missing data section in the user manual, as well as the nullable integer and nullable boolean data type manual pages.
Note that the Pandas notion of the NA value, representing missing data, is still considered experimental, which is why it is not yet the default. But if you want to 'opt in' for dataframes you just created, you can call the DataFrame.convert_dtypes() method right after creating the frame:
>>> df = pd.DataFrame({'prime_member':[0,1,None]}).convert_dtypes()
>>> df.prime_member
0 0
1 1
2 <NA>
Name: prime_member, dtype: Int64
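convert_dtypes() works per column, so a frame mixing numbers and text gets a suitable nullable dtype for each; a sketch with a made-up frame:

```python
import pandas as pd

# Each column is converted independently: the numeric column becomes
# nullable Int64, the text column becomes Pandas' dedicated string dtype.
df = pd.DataFrame({'n': [0, 1, None], 'name': ['a', 'b', None]}).convert_dtypes()
print(df.dtypes)
```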