Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - How to convert type to bool in pandas with `None` values in the series?

Why does the value None convert to both True and False in this series?

Env:

  • Jupyter Notebook 6.0.3 in JupyterLab
  • Python 3.7.6

Imports:

from IPython.display import display
import pandas as pd

Converts None to True:

df_test1 = pd.DataFrame({'test_column':[0,1,None]})
df_test1['test_column'] = df_test1.test_column.astype(bool)
display(df_test1)

Converts None to False:

df_test2 = pd.DataFrame({'test_column':[0,1,None,'test']})
df_test2['test_column'] = df_test2.test_column.astype(bool)
display(df_test2)

Is this expected behavior?

question from:https://stackoverflow.com/questions/66067314/how-to-convert-type-to-bool-in-pandas-with-none-values-in-the-series


1 Reply


Yes, this is expected behaviour; it follows from the initial dtype (storage type) of each series (column). The first input results in a series of floating point numbers, while the second contains references to Python objects:

>>> pd.Series([0,1,None]).dtype
dtype('float64')
>>> pd.Series([0,1,None,'test']).dtype
dtype('O')

The float version of None is NaN, or Not a Number, which converts to True when interpreted as a boolean (as it is not equal to 0):

>>> pd.Series([0,1,None])[2]
nan
>>> bool(pd.Series([0,1,None])[2])
True
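
The same truthiness rule can be checked without pandas. The following is a minimal sketch, assuming NumPy is installed (pandas depends on it); the variable names are illustrative only:

```python
import math
import numpy as np

# NaN is a float that is not equal to zero (or to anything, itself included),
# so Python's truth test treats it as True.
nan_is_truthy = bool(math.nan)

# Casting a float array to bool applies the same "non-zero is True" rule
# element by element, which is what astype(bool) relies on for float64 data.
floats = np.array([0.0, 1.0, np.nan])
as_bools = floats.astype(bool).tolist()

print(nan_is_truthy, as_bools)
```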

In the other case, the original None object was preserved, which converts to False:

>>> pd.Series([0,1,None,'test'])[2] is None
True
>>> bool(None)
False

So this comes down to automatic type inference: Pandas picks the type it thinks is best suited for each column (see the DataFrame.infer_objects() method). The goal is to minimise storage requirements and maximise operation performance; storing numbers as native 64-bit floating point values leads to faster numeric operations and a smaller memory footprint, while still being able to represent 'missing' values as NaN.

However, when you pass in a mix of numbers and strings, Pandas can't use a dedicated specialised array type and so falls back to the "Python object" type, which stores references to the original Python objects.
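
To see what the object fallback means for the boolean conversion, here is a small sketch using the question's second input. The names `mixed`, `element_types` and `mixed_bools` are made up for this illustration:

```python
import pandas as pd

mixed = pd.Series([0, 1, None, 'test'])

# Each element keeps its original Python type in an object-dtype series.
element_types = [type(v).__name__ for v in mixed]

# astype(bool) then applies ordinary Python truthiness to each object:
# 0 and None are falsy, 1 and the non-empty string are truthy.
mixed_bools = mixed.astype(bool).tolist()

print(element_types)
print(mixed_bools)
```

This matches the second result in the question: None ends up as False only because the original None object was never replaced by NaN.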

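Instead of letting Pandas guess which type you need, you can specify the type explicitly. You could use one of the nullable integer types (which use pandas.NA instead of NaN); in the Pandas version used for this answer, converting these to booleans results in missing values converting to False (other versions may instead refuse to cast NA to a plain bool, so check the result on your own installation):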

>>> pd.Series([0,1,None], dtype=pd.Int64Dtype).astype(bool)
0    False
1     True
2    False
dtype: bool
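
If the intent is simply "treat missing as False", saying so explicitly avoids relying on how a particular dtype or Pandas version casts missing values. A minimal sketch follows; choosing 0 as the fill value is an assumption about what a missing entry should mean in your data:

```python
import pandas as pd

s = pd.Series([0, 1, None])  # inferred as float64, so None becomes NaN

# Replace NaN with 0 first, so the later cast never sees a missing value.
filled = s.fillna(0)
as_bool = filled.astype(bool)

print(as_bool.tolist())
```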

Another option is to convert to a nullable boolean type, and so preserve the None / NaN indicators of missing data:

>>> pd.Series([0,1,None]).astype("boolean")
0    False
1     True
2     <NA>
dtype: boolean

Also see the "Working with missing data" section of the user guide, as well as the manual pages for the nullable integer and nullable boolean data types.

Note that the Pandas notion of the NA value, representing missing data, is still considered experimental, which is why it is not yet the default. But if you want to 'opt in' for dataframes you just created, you can call the DataFrame.convert_dtypes() method right after creating the frame:

>>> df = pd.DataFrame({'prime_member':[0,1,None]}).convert_dtypes()
>>> df.prime_member
0       0
1       1
2    <NA>
Name: prime_member, dtype: Int64
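
Once a column uses a nullable integer dtype, it can be cast to the nullable boolean dtype so that the missing entry stays missing. This is only a sketch that reuses the `prime_member` frame from above; `flags`, `missing_mask` and `known_values` are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({'prime_member': [0, 1, None]}).convert_dtypes()

# Casting the nullable integer column to the nullable boolean dtype
# keeps the missing entry as <NA> instead of forcing it to True or False.
flags = df['prime_member'].astype('boolean')

missing_mask = flags.isna().tolist()     # which rows are still missing
known_values = flags.dropna().tolist()   # booleans for the rows that have data

print(flags.dtype, missing_mask, known_values)
```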
