There is an example in map_partitions
docs to achieve exactly what are trying to do:
ddf.map_partitions(lambda df: df.assign(z=df.x * df.y))
When you call map_partitions
(just like when you call .apply()
on pandas.DataFrame
), the function that you try to map
(or apply
) will be given dataframe as a first argument.
In case of dask.dataframe.map_partitions
this first argument will be a partition and in case of pandas.DataFrame.apply
- a whole dataframe.
Which means that your function has to accept dataframe(partition) as a first argument and and in your case could look like this:
def test_f(df, col_1, col_2):
return df.assign(result=df[col_1] * df[col_2])
Note that assignment of a new column in this case happens (i.e. gets scheduled to happen) BEFORE you call .compute()
.
In your example you assign column AFTER you call .compute()
, which kind of defeats the purpose of using dask. I.e. after you call .compute()
the results of that operation are loaded into memory if there is enough space for those results (if not you just get MemoryError
).
So for you example to work you could:
1) Use function (with column names as arguments):
def test_f(df, col_1, col_2):
return df.assign(result=df[col_1] * df[col_2])
ddf_out = ddf.map_partitions(test_f, 'col_1', 'col_2')
# Here is good place to do something with BIG ddf_out dataframe before calling .compute()
result = ddf_out.compute(get=get) # Will load the whole dataframe into memory
2) Use lambda
(with column names hardcoded in the function):
ddf_out = ddf.map_partitions(lambda df: df.assign(result=df.col_1 * df.col_2))
# Here is good place to do something with BIG ddf_out dataframe before calling .compute()
result = ddf_out.compute(get=get) # Will load the whole dataframe into memory
Update:
To apply function on a row-by-row basis, here is a quote from the post you linked:
map
/ apply
You can map a function row-wise across a series with map
df.mycolumn.map(func)
You can map a function row-wise across a dataframe with apply
df.apply(func, axis=1)
I.e. for the example function in your question, it might look like this:
def test_f(dds, col_1, col_2):
return dds[col_1] * dds[col_2]
Since you will be applying it on a row-by-row basis the function's first argument will be a series (i.e. each row of a dataframe is a series).
To apply this function then you might call it like this:
dds_out = ddf.apply(
test_f,
args=('col_1', 'col_2'),
axis=1,
meta=('result', int)
).compute(get=get)
This will return a series named 'result'
.
I guess you could also call .apply
on each partition with a function but it does not look to be any more efficient then calling .apply
on dataframe directly. But may be your tests will prove otherwise.