python - How can I compare columns of a dask dataframe?

Question

posted Feb 19, 2021 in Technique[技术] by 深蓝 (71.8m points)

I have a large array of data which I have read into a dask dataframe. This data frame has two columns that I believe to be redundant (i.e., have identical values). These columns are string-valued -- they give the names of growth media used for incubating colonies of cells.

I would like to check my hypothesis that the two columns are identical before dropping one of them.

The simplest solution I could come up with was the following:

(df['growth_media_1'] == df['growth_media_2']).all().compute()

But this gives me the following error:

ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

+--------+---------+----------+
| Column | Found   | Expected |
+--------+---------+----------+
| input  | float64 | int64    |
| output | float64 | int64    |
+--------+---------+----------+

Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:

dtype={'input': 'float64',
       'output': 'float64'}

I thought this might be because there were some NaNs in the columns, so I tried calling .dropna() before the comparison. But that did not fix the problem.

After extensive flailing, I ended up with this arcane mess:

(df['growth_media_1'].dropna() == df['growth_media_2'].dropna()).astype('bool').all().compute()

but even that did not solve my problem.

The error message really isn't helpful, since neither pd.read_csv nor pd.read_table is involved, as far as I can tell. However, pandas.read_text is in the backtrace, so perhaps dask is writing files for different shards of the data.

(I'm using dask version 1.2.2, if that helps. I'm using this on a high performance cluster, which lags the bleeding edge of software.)

1 Reply

深蓝 · Answer 1 · 2021-02-19T04:07:43+0000

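This is probably because you have inconsistent types in your Dask dataframe. Dask evaluates lazily, so the read that created df only runs over every partition when you call .compute(), which is why the error mentions pd.read_csv even though your comparison never calls it. Note that the table in the error lists the columns input and output, not the growth media columns, so the most direct fix is the one the message suggests: pass that dtype mapping to the call that reads the data. Without looking at your data, it's hard to say more than that.

Once the data loads, you can also coerce the two columns to a common type before comparing. Your columns hold strings, so cast to str rather than to a numeric type such as float64, which would fail on media names:

df[['growth_media_1', 'growth_media_2']] = df[['growth_media_1', 'growth_media_2']].astype(str)
(df['growth_media_1'] == df['growth_media_2']).all().compute()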
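
One more thing to watch for when comparing the two columns: NaN never compares equal to NaN, so a plain == check can report a mismatch even when both columns are missing in exactly the same rows. Dropping NaNs from each column separately does not help, because the two sides may then cover different rows. Below is a minimal sketch of a NaN-aware check. The helper name columns_identical and the sample values are hypothetical, and the demo uses a small pandas frame so it runs anywhere; the same expression should work on a dask dataframe if you call .compute() on the result.

```python
import numpy as np
import pandas as pd


def columns_identical(df, left, right):
    """Return True (lazily, on dask) when two columns match row by row.

    Rows where both columns are missing count as equal, because an
    elementwise NaN == NaN comparison evaluates to False.
    """
    same_value = df[left] == df[right]
    both_missing = df[left].isnull() & df[right].isnull()
    return (same_value | both_missing).all()


# Hypothetical sample data standing in for the real table.
sample = pd.DataFrame(
    {
        "growth_media_1": ["LB", "M9", np.nan, "YPD"],
        "growth_media_2": ["LB", "M9", np.nan, "YPD"],
    }
)

# Plain comparison: the shared NaN row counts as a mismatch.
naive = bool((sample["growth_media_1"] == sample["growth_media_2"]).all())

# NaN-aware comparison: rows missing in both columns count as equal.
nan_aware = bool(columns_identical(sample, "growth_media_1", "growth_media_2"))

print(naive, nan_aware)
```

If the NaN-aware check passes, the columns agree wherever either one has a value, and dropping one of them loses nothing.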
