It sounds like you are overcomplicating the requirement. For column multiplication, the regular pandas syntax works: df['c'] = df['a'] * df['b']. In your case, you can use pd.eval to convert the views strings into actual numeric values:
import pandas as pd
import numpy as np
import dask.dataframe as dd
import random
# Sample data: strings such as '12.34K views' or '5.6M views', plus a group key
df = pd.DataFrame(15*np.random.rand(15), columns=['views'])
df['views'] = df['views'].round(2).astype('str') + [random.choice(['K views', 'M views']) for _ in range(len(df))]
df['group'] = [random.choice([1,2,3]) for _ in range(len(df))]
ddf = dd.from_pandas(df, npartitions=2)
# Turn each suffix into a multiplier ('12.34K views' -> '12.34*1e3'), then evaluate the expression
ddf['views_digits'] = ddf['views'].replace({'K views': '*1e3', 'M views': '*1e6'}, regex=True).map(pd.eval, meta=('views_digits', 'float64'))
# Sum the numeric views per group
aggregate_df = ddf.groupby(['group']).agg({'views_digits': 'sum'}).compute()
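To see what the replace-then-eval step does without dask, here is a minimal pandas-only sketch. The three sample strings are made up for illustration; it assumes every value ends in exactly 'K views' or 'M views', as in the code above.

```python
import pandas as pd

# Hypothetical sample values in the same format as the 'views' column
views = pd.Series(["1.5K views", "2M views", "0.25K views"])

# Step 1: rewrite the suffix as a multiplication, e.g. '1.5K views' -> '1.5*1e3'
expressions = views.replace({"K views": "*1e3", "M views": "*1e6"}, regex=True)

# Step 2: evaluate each arithmetic expression string to a number
views_digits = expressions.map(pd.eval)

print(expressions.tolist())   # ['1.5*1e3', '2*1e6', '0.25*1e3']
print(views_digits.tolist())  # [1500.0, 2000000.0, 250.0]
```

Note that pd.eval evaluates whatever expression string it is given, so this approach is probably best kept to columns whose format you control.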