Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - How can I compare columns of a dask dataframe?

I have a large array of data which I have read into a dask dataframe. This data frame has two columns that I believe to be redundant (i.e., have identical values). These columns are string-valued -- they give the names of growth media used for incubating colonies of cells.

I would like to check my hypothesis that the two columns are identical before dropping one of them.

The simplest solution I could come up with was the following:

(df['growth_media_1'] == df['growth_media_2']).all().compute()

But this gives me the following error:

ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

+--------+---------+----------+
| Column | Found   | Expected |
+--------+---------+----------+
| input  | float64 | int64    |
| output | float64 | int64    |
+--------+---------+----------+

Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:

dtype={'input': 'float64',
       'output': 'float64'}

I thought this might be because there were some NaNs in the columns, so I tried calling .dropna() before the comparison. But that did not fix the problem.

After extensive flailing, I ended up with this arcane mess:

(df['growth_media_1'].dropna() == df['growth_media_2'].dropna()).astype('bool').all().compute()

but even that did not solve my problem.

The error message really isn't helpful, since neither pd.read_csv nor pd.read_table is involved, as far as I can tell. However, pandas.read_text is in the backtrace, so perhaps dask is writing files for different shards of the data.

(I'm using dask version 1.2.2, if that helps. I'm running this on a high-performance cluster, whose software lags the bleeding edge.)


1 Reply

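This is probably because you have inconsistent types in your Dask dataframe. Dask evaluates lazily, so the file is only parsed when you call .compute(), which is why an error from the CSV reader surfaces during the comparison. The table in the message names the columns input and output, not the two media columns, which suggests that some partitions contain missing or non-integer values there and are read as float64 instead of the inferred int64; passing the dtype mapping from the message to the call that reads the file is the fix the message itself proposes. Without looking at your data, it's hard to say more. If the two media columns themselves have mixed types, you could also coerce them before comparing: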

# The media columns hold strings, so cast to str rather than to a numeric dtype.
df[['growth_media_1', 'growth_media_2']] = df[['growth_media_1', 'growth_media_2']].astype(str)
(df['growth_media_1'] == df['growth_media_2']).all().compute()

