Let's begin with a bit of general advice: if you're interested in finding the bottlenecks of Python code, you can use a profiler to find the functions or parts that eat up most of the time. In this case I use a line profiler, because it shows the implementation together with the time spent on each line.
However, these tools don't work with C or Cython by default. Given that CPython (that's the Python interpreter I'm using), NumPy and pandas make heavy use of C and Cython, there will be a limit to how far I'll get with profiling.
Actually, one could probably extend the profiling to the Cython code, and probably also to the C code, by recompiling it with debug symbols and tracing. However, compiling these libraries is not an easy task, so I won't do that (but if someone would like to, the Cython documentation includes a page about profiling Cython code).
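If you only need the function-level picture first, the standard library's cProfile can give a rough ranking before you reach for a line profiler. This is just a minimal sketch: `slow_part`, `fast_part` and `workload` are made-up names for illustration and have nothing to do with pandas.

```python
import cProfile
import io
import pstats


def slow_part(n):
    # Deliberately expensive: converts every integer to a string.
    return [str(i) for i in range(n)]


def fast_part(n):
    # Cheap by comparison.
    return n * 2


def workload():
    # Hypothetical workload that calls one slow and one fast function.
    slow_part(200_000)
    fast_part(200_000)


profiler = cProfile.Profile()
profiler.runcall(workload)  # profile exactly one call of workload()

# Sort by cumulative time and write the most expensive entries to a string.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)
report = stream.getvalue()
print(report)
```

The report lists functions rather than lines, so it tells you where to point the line profiler, not what exactly is slow inside that function.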
But let's see how far I can get:
Line-Profiling Python code
I'm going to use line-profiler and a Jupyter Notebook here:
%load_ext line_profiler
import numpy as np
import pandas as pd
x = pd.Series(np.random.randint(0, 100, 100000))
Profiling x.astype
%lprun -f x.astype x.astype(str)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
87 @wraps(func)
88 def wrapper(*args, **kwargs):
89 1 12 12.0 0.0 old_arg_value = kwargs.pop(old_arg_name, None)
90 1 5 5.0 0.0 if old_arg_value is not None:
91 if mapping is not None:
...
118 1 663354 663354.0 100.0 return func(*args, **kwargs)
So that's simply a decorator, and 100% of the time is spent in the decorated function. Let's profile the decorated function instead:
%lprun -f x.astype.__wrapped__ x.astype(str)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
3896 @deprecate_kwarg(old_arg_name='raise_on_error', new_arg_name='errors',
3897 mapping={True: 'raise', False: 'ignore'})
3898 def astype(self, dtype, copy=True, errors='raise', **kwargs):
3899 """
...
3975 """
3976 1 28 28.0 0.0 if is_dict_like(dtype):
3977 if self.ndim == 1: # i.e. Series
...
4001
4002 # else, only a single dtype is given
4003 1 14 14.0 0.0 new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors,
4004 1 685863 685863.0 99.9 **kwargs)
4005 1 340 340.0 0.0 return self._constructor(new_data).__finalize__(self)
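In case the `__wrapped__` attribute used above looks like magic: `functools.wraps` stores the original function on the wrapper under that name, which is why the undecorated function can be handed to the profiler. Here is a small sketch; `deprecate_kwarg_sketch` and `convert` are hypothetical names, and the decorator is a heavily simplified stand-in, not the pandas implementation.

```python
import functools


def deprecate_kwarg_sketch(old_arg_name, new_arg_name):
    """Hypothetical, simplified keyword-renaming decorator."""
    def decorator(func):
        @functools.wraps(func)  # copies metadata and sets wrapper.__wrapped__ = func
        def wrapper(*args, **kwargs):
            # Move the value of the old keyword over to the new keyword.
            old_value = kwargs.pop(old_arg_name, None)
            if old_value is not None:
                kwargs[new_arg_name] = old_value
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecate_kwarg_sketch("raise_on_error", "errors")
def convert(value, errors="raise"):
    return str(value), errors


# The undecorated function is reachable through the wrapper.
inner = convert.__wrapped__
print(convert(1, raise_on_error="ignore"))
print(inner(1))
```

Profiling `convert` would only show the wrapper's lines, while profiling `convert.__wrapped__` shows the body of the real function.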
Again, one line is the bottleneck, so let's check the _data.astype method:
%lprun -f x._data.astype x.astype(str)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
3461 def astype(self, dtype, **kwargs):
3462 1 695866 695866.0 100.0 return self.apply('astype', dtype=dtype, **kwargs)
Okay, another delegate. Let's see what _data.apply does:
%lprun -f x._data.apply x.astype(str)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
3251 def apply(self, f, axes=None, filter=None, do_integrity_check=False,
3252 consolidate=True, **kwargs):
3253 """
...
3271 """
3272
3273 1 12 12.0 0.0 result_blocks = []
...
3309
3310 1 10 10.0 0.0 aligned_args = dict((k, kwargs[k])
3311 1 29 29.0 0.0 for k in align_keys
3312 if hasattr(kwargs[k], 'reindex_axis'))
3313
3314 2 28 14.0 0.0 for b in self.blocks:
...
3329 1 674974 674974.0 100.0 applied = getattr(b, f)(**kwargs)
3330 1 30 30.0 0.0 result_blocks = _extend_blocks(applied, result_blocks)
3331
3332 1 10 10.0 0.0 if len(result_blocks) == 0:
3333 return self.make_empty(axes or self.axes)
3334 1 10 10.0 0.0 bm = self.__class__(result_blocks, axes or self.axes,
3335 1 76 76.0 0.0 do_integrity_check=do_integrity_check)
3336 1 13 13.0 0.0 bm._consolidate_inplace()
3337 1 7 7.0 0.0 return bm
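The pattern in the profile above (a manager's `astype` calling `apply('astype', ...)`, which then looks the method up by name on each block with `getattr`) can be sketched with toy classes. `ToyBlock` and `ToyManager` are hypothetical names; this only illustrates the dispatch and leaves out everything else the real `apply` does.

```python
class ToyBlock:
    """Hypothetical stand-in for one block of values."""

    def __init__(self, values):
        self.values = values

    def astype(self, dtype):
        # The actual work happens here, one level below the manager.
        return ToyBlock([dtype(v) for v in self.values])


class ToyManager:
    """Hypothetical stand-in for the object behind x._data."""

    def __init__(self, blocks):
        self.blocks = blocks

    def apply(self, f, **kwargs):
        result_blocks = []
        for b in self.blocks:
            # Look the method up by name and call it on each block.
            applied = getattr(b, f)(**kwargs)
            result_blocks.append(applied)
        return ToyManager(result_blocks)

    def astype(self, dtype):
        # Pure delegation: all the time is spent inside apply().
        return self.apply("astype", dtype=dtype)


manager = ToyManager([ToyBlock([1, 2, 3])])
converted = manager.astype(str)
print(converted.blocks[0].values)  # ['1', '2', '3']
```

That's why each profile shows a single line with roughly 100% of the time: the outer layers only forward the call.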
And again, one function call is taking all the time. This time it's x._data.blocks[0].astype:
%lprun -f x._data.blocks[0].astype x.astype(str)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
542 def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs):
543 1 18 18.0 0.0 return self._astype(dtype, copy=copy, errors=errors, values=values,
544 1 671092 671092.0 100.0 **kwargs)
... which is another delegate, this time to _astype:
%lprun -f x._data.blocks[0]._astype x.astype(str)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
546 def _astype(self, dtype, copy=False, errors='raise', values=None,
547 klass=None, mgr=None, **kwargs):
548 """
...
557 """
558 1 11 11.0 0.0 errors_legal_values = ('raise', 'ignore')
559
560 1 8 8.0 0.0 if errors not in errors_legal_values:
561 invalid_arg = ("Expected value of kwarg 'errors' to be one of {}. "
562 "Supplied value is '{}'".format(
563 list(errors_legal_values), errors))
564 raise ValueError(invalid_arg)
565
566 1 23 23.0 0.0 if inspect.isclass(dtype) and issubclass(dtype, ExtensionDtype):
567 msg = ("Expected an instance of {}, but got the class instead. "
568 "Try instantiating 'dtype'.".format(dtype.__name__))
569 raise TypeError(msg)
570
571 # may need to convert to categorical
572 # this is only called for non-categoricals
573 1 72 72.0 0.0 if self.is_categorical_astype(dtype):
...
595
596 # astype processing
597 1 16 16.0 0.0 dtype = np.dtype(dtype)
598 1 19 19.0 0.0 if self.dtype == dtype:
...
603 1 8 8.0 0.0 if klass is None:
604 1 13 13.0 0.0 if dtype == np.object_:
605 klass = ObjectBlock
606 1 6 6.0 0.0 try:
607 # force the copy here
608 1 7 7.0 0.0 if values is None:
609
610 1 8 8.0 0.0 if issubclass(dtype.type,
611 1 14 14.0 0.0 (compat.text_type, compat.string_types)):
612
613 # use native type formatting for datetime/tz/timedelta
614 1 15 15.0 0.0 if self.is_datelike:
615 values = self.to_native_types()
616
617 # astype formatting
618 else:
619 1 8 8.0 0.0 values = self.values
620
621 else:
622 values = self.get_values(dtype=dtype)
623
624 # _astype_nansafe works fine with 1-d only
625 1 665777 665777.0 99.9 values = astype_nansafe(values.ravel(), dtype, copy=True)
626 1 32 32.0 0.0 valu