You have three main options for converting types in pandas:
to_numeric() - provides functionality to safely convert non-numeric types (e.g. strings) to a suitable numeric type. (See also to_datetime() and to_timedelta().)
astype() - converts (almost) any type to (almost) any other type (even if it's not necessarily sensible to do so). Also allows you to convert to categorical types (very useful).
infer_objects() - a utility method to convert object columns holding Python objects to a pandas type if possible.
Read on for more detailed explanations and usage of each of these methods.
1. to_numeric()
The best way to convert one or more columns of a DataFrame to numeric values is to use pandas.to_numeric().
This function will try to change non-numeric objects (such as strings) into integers or floating point numbers as appropriate.
Basic usage
The input to to_numeric() is a Series or a single column of a DataFrame.
>>> s = pd.Series(["8", 6, "7.5", 3, "0.9"]) # mixed string and numeric values
>>> s
0 8
1 6
2 7.5
3 3
4 0.9
dtype: object
>>> pd.to_numeric(s) # convert everything to float values
0 8.0
1 6.0
2 7.5
3 3.0
4 0.9
dtype: float64
As you can see, a new Series is returned.
Remember to assign this output to a variable or column name to continue using it:
# convert Series
my_series = pd.to_numeric(my_series)
# convert column "a" of a DataFrame
df["a"] = pd.to_numeric(df["a"])
You can also use it to convert multiple columns of a DataFrame via the apply() method:
# convert all columns of DataFrame
df = df.apply(pd.to_numeric)
# convert just columns "a" and "b"
df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)
As long as your values can all be converted, that's probably all you need.
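To make the snippets above concrete, here is a minimal self-contained sketch. The DataFrame and its column names are made up for illustration; only columns "a" and "b" are converted, and "c" is deliberately left as strings.

```python
import pandas as pd

# Hypothetical data: every column holds numbers stored as strings.
df = pd.DataFrame({
    "a": ["1", "2", "3"],
    "b": ["4.5", "5.5", "6.5"],
    "c": ["7", "8", "9"],
})

# Convert just columns "a" and "b"; to_numeric picks an integer or
# floating point dtype per column depending on the values it finds.
df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)

print(df.dtypes)
```

Column "a" should end up with an integer dtype and "b" with a floating point dtype, while "c" keeps its original string values.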
Error handling
But what if some values can't be converted to a numeric type?
to_numeric() also takes an errors keyword argument that allows you to force non-numeric values to be NaN, or simply ignore columns containing these values.
Here's an example using a Series of strings s which has the object dtype:
>>> s = pd.Series(['1', '2', '4.7', 'pandas', '10'])
>>> s
0 1
1 2
2 4.7
3 pandas
4 10
dtype: object
The default behaviour is to raise an exception if a value can't be converted.
In this case, it can't cope with the string 'pandas':
>>> pd.to_numeric(s) # or pd.to_numeric(s, errors='raise')
ValueError: Unable to parse string
Rather than fail, we might want 'pandas' to be considered a missing/bad numeric value.
We can coerce invalid values to NaN as follows using the errors keyword argument:
>>> pd.to_numeric(s, errors='coerce')
0 1.0
1 2.0
2 4.7
3 NaN
4 10.0
dtype: float64
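Since coercion silently replaces bad values, it can be worth checking which inputs were affected. This is a small sketch of one way to do that, not something to_numeric() does for you; the variable names converted and failed are my own.

```python
import pandas as pd

s = pd.Series(["1", "2", "4.7", "pandas", "10"])
converted = pd.to_numeric(s, errors="coerce")

# A value that was present in the input but is NaN in the output
# is one that could not be parsed as a number.
failed = s[converted.isna() & s.notna()]

print(failed.tolist())
```

Here failed contains only the string 'pandas', the one value that could not be parsed.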
The third option for errors is just to ignore the operation if an invalid value is encountered:
>>> pd.to_numeric(s, errors='ignore')
# the original Series is returned untouched
This last option is particularly useful when you want to convert your entire DataFrame but don't know which of its columns can be converted reliably to a numeric type.
In that case, just write:
df.apply(pd.to_numeric, errors='ignore')
The function will be applied to each column of the DataFrame.
Columns that can be converted to a numeric type will be converted, while columns that cannot (e.g. they contain non-digit strings or dates) will be left alone.
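One caveat: to my knowledge, errors='ignore' is deprecated in recent pandas releases (2.2 onward), with the documentation suggesting you catch the exception yourself instead. A rough equivalent is sketched below; to_numeric_if_possible is a hypothetical helper name, not part of pandas, and the DataFrame is made up.

```python
import pandas as pd


def to_numeric_if_possible(column):
    """Hypothetical helper: return a numeric column, or the original on failure."""
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        # The column holds something that cannot be parsed; leave it alone.
        return column


# Hypothetical data: "x" is convertible, "y" is not.
df = pd.DataFrame({"x": ["1", "2"], "y": ["a", "b"]})
df = df.apply(to_numeric_if_possible)

print(df.dtypes)
```

This keeps the "convert what you can" behaviour while making it explicit which exceptions are being swallowed.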
Downcasting
By default, conversion with to_numeric() will give you either an int64 or float64 dtype (or whatever integer width is native to your platform).
That's usually what you want, but what if you wanted to save some memory and use a more compact dtype, like float32 or int8?
to_numeric() gives you the option to downcast to one of 'integer', 'signed', 'unsigned' or 'float'.
Here's an example for a simple series s of integer type:
>>> s = pd.Series([1, 2, -7])
>>> s
0 1
1 2
2 -7
dtype: int64
Downcasting to 'integer' uses the smallest possible integer that can hold the values:
>>> pd.to_numeric(s, downcast='integer')
0 1
1 2
2 -7
dtype: int8
Downcasting to 'float' similarly picks a smaller than normal floating type:
>>> pd.to_numeric(s, downcast='float')
0 1.0
1 2.0
2 -7.0
dtype: float32
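To see what downcasting buys you, a quick sketch comparing memory use before and after may help. It also tries downcast='unsigned' on the same data: as I understand it, when the values cannot fit the requested kind (here because of the negative number) no downcasting is done and the original dtype is kept.

```python
import pandas as pd

s = pd.Series([1, 2, -7], dtype="int64")

small = pd.to_numeric(s, downcast="integer")
# -7 cannot be stored in an unsigned type, so this should leave the dtype alone.
unsigned = pd.to_numeric(s, downcast="unsigned")

print(small.dtype, unsigned.dtype)
# Bytes used by the values only (the index is excluded).
print(s.memory_usage(index=False), small.memory_usage(index=False))
```

For three values the saving is trivial, but on columns with millions of rows the difference between 8 bytes and 1 byte per value adds up.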
2. astype()
The astype() method enables you to be explicit about the dtype you want your DataFrame or Series to have.
It's very versatile in that you can try to go from one type to any other.
Basic usage
Just pick a type: you can use a NumPy dtype (e.g. np.int16), some Python types (e.g. bool), or pandas-specific types (like the categorical dtype).
Call the method on the object you want to convert and astype() will try and convert it for you:
# convert all DataFrame columns to the int64 dtype
df = df.astype(int)
# convert column "a" to int64 dtype and "b" to complex type
df = df.astype({"a": int, "b": complex})
# convert Series to float16 type
s = s.astype(np.float16)
# convert Series to Python strings
s = s.astype(str)
# convert Series to categorical type - see docs for more details
s = s.astype('category')
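For a version you can run as-is, here is a small sketch that applies a per-column mapping to a made-up DataFrame. The column names and target types are just examples mixing the three kinds of type mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: numeric strings, floats and repeated labels.
df = pd.DataFrame({
    "a": ["1", "2", "3"],
    "b": [1.0, 2.0, 3.0],
    "c": ["x", "y", "x"],
})

# One target dtype per column: a Python type, a NumPy dtype
# and a pandas-specific dtype.
df = df.astype({"a": int, "b": np.float32, "c": "category"})

print(df.dtypes)
```

Columns not named in the mapping would keep whatever dtype they already had.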
Notice I said "try" - if