
apache spark - get datatype of column using pyspark

We are reading data from a MongoDB collection. A collection column can hold values of two different types (e.g. (bson.Int64, int) or (int, float)).

I am trying to get each column's datatype using PySpark.

My problem is that some columns hold mixed datatypes.

Assume quantity and weight are the columns (a sketch for inspecting the inferred types follows the table):

quantity           weight
---------          --------
12300              656
123566000000       789.6767
1238               56.22
345                23
345566677777789    21
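
For reference, a quick way to see what Spark inferred for each column is the schema API. A minimal sketch, assuming `df` is the dataframe loaded from the collection:

# Assuming `df` is the dataframe loaded from the MongoDB collection
df.printSchema()  # tree view of every column and its inferred type
print(df.dtypes)  # list of (column_name, type_string) pairs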

We never defined an explicit datatype for any column of the Mongo collection.

When I query the count of the PySpark dataframe:

dataframe.count()

I get an exception like this:

"Cannot cast STRING into a DoubleType (value: BsonString{value='200.0'})"
Question from: https://stackoverflow.com/questions/45033315/get-datatype-of-column-using-pyspark


1 Reply

import pandas as pd

pd.set_option('display.max_colwidth', None)  # prevent truncating of columns in Jupyter

def count_column_types(spark_df):
    """Count the number of columns per Spark data type."""
    # spark_df.dtypes is a list of (column_name, type_string) tuples
    return (pd.DataFrame(spark_df.dtypes, columns=["name", "type"])
            .groupby("type", as_index=False)
            .agg(count=("name", "count"),
                 names=("name", lambda x: " | ".join(sorted(set(x))))))

Example output in a Jupyter notebook for a Spark dataframe with 4 columns:

count_column_types(my_spark_df)

[Image: example output, a pandas table with columns type, count, and names]
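
To pull a single column's type rather than per-type counts, the same `dtypes` list can be turned into a lookup; the column name below is hypothetical, for illustration:

# Hypothetical usage: look up one column's Spark type by name
dtype_by_name = dict(my_spark_df.dtypes)
print(dtype_by_name["quantity"])                # e.g. 'bigint' or 'string'
print(my_spark_df.schema["quantity"].dataType)  # the DataType object itself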

