apache spark - get datatype of column using pyspark

Question

posted Oct 6, 2021 in Technique by 深蓝 (71.8m points)

We are reading data from a MongoDB collection. A column in the collection can hold values of two different types (e.g. (bson.Int64, int) or (int, float)).

I am trying to get the datatype of each column using PySpark.

My problem is that some columns contain values of more than one datatype.

Assume quantity and weight are the columns:

quantity           weight
---------          --------
12300              656
123566000000       789.6767
1238               56.22
345                23
345566677777789    21

We did not define a data type for any column of the Mongo collection.

When I query the count from the PySpark dataframe

dataframe.count()

I get an exception like this:

"Cannot cast STRING into a DoubleType (value: BsonString{value='200.0'})"
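The exception suggests that at least one value in a numeric-looking column is stored as a string. One way to confirm this is to count the Python types per field on the raw documents before Spark infers a schema. The following is a minimal sketch, assuming the documents are available as plain Python dicts (for example, fetched with a MongoDB driver); the helper name field_type_counts and the sample data are hypothetical and only mirror the quantity/weight example above.

```python
from collections import Counter, defaultdict


def field_type_counts(documents):
    """Count how often each Python type occurs per field (hypothetical helper)."""
    counts = defaultdict(Counter)
    for doc in documents:
        for field, value in doc.items():
            counts[field][type(value).__name__] += 1
    return {field: dict(counter) for field, counter in counts.items()}


# Hypothetical sample documents; "200.0" is stored as a string on purpose.
sample_docs = [
    {"quantity": 12300, "weight": 656},
    {"quantity": 123566000000, "weight": 789.6767},
    {"quantity": 1238, "weight": "200.0"},
]

type_counts = field_type_counts(sample_docs)

# Fields whose values have more than one type are candidates for cast errors.
mixed = sorted(field for field, c in type_counts.items() if len(c) > 1)
print(type_counts)
print(mixed)
```

Fields reported in mixed are the ones worth cleaning or casting explicitly before loading them into a dataframe.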

question from:https://stackoverflow.com/questions/45033315/get-datatype-of-column-using-pyspark

1 Reply

深蓝 · Answer 1 · 2021-10-06T05:16:22+0000

import pandas as pd

pd.set_option('display.max_colwidth', None)  # prevent truncation of wide columns in Jupyter (pandas >= 1.0)

def count_column_types(spark_df):
    """Count the number of columns per type and list the column names."""
    # spark_df.dtypes is a list of (column name, type string) pairs
    types = pd.DataFrame(spark_df.dtypes, columns=["name", "type"])
    return (types.groupby("type")["name"]
            .agg(count="count", names=lambda x: " | ".join(sorted(x)))
            .reset_index())

Example output in a Jupyter notebook for a Spark dataframe with 4 columns:

count_column_types(my_spark_df)
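To get the datatype of a single column rather than a per-type summary, the same dtypes list can be turned into a lookup. This is a minimal sketch, not part of the original answer: column_type is a hypothetical helper, and example_dtypes is a hand-written stand-in for the list of (name, type) pairs that my_spark_df.dtypes would return, so the snippet runs without a Spark session.

```python
def column_type(dtypes, column):
    """Return the type string for `column` from a list of (name, type) pairs."""
    lookup = dict(dtypes)
    if column not in lookup:
        raise KeyError(f"no such column: {column!r}")
    return lookup[column]


# Hypothetical stand-in for my_spark_df.dtypes
example_dtypes = [("quantity", "bigint"), ("weight", "double")]

weight_type = column_type(example_dtypes, "weight")
print(weight_type)
```

With a real dataframe the call would be column_type(my_spark_df.dtypes, "weight").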
