Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
721 views
in Technique[技术] by (71.8m points)

machine learning - Spark data type guesser UDAF

Wanted to take something like this https://github.com/fitzscott/AirQuality/blob/master/HiveDataTypeGuesser.java and create a Hive UDAF to create an aggregate function that returns a data type guess.

Does Spark have something like this already built-in? Would be very useful for new wide datasets to explore data. Would be helpful for ML too, e.g. to decide categorical vs numerical variables.

How do you normally determine data types in Spark?

P.S. Frameworks like h2o automatically determine data type scanning a sample of data, or whole dataset. So then one can decide e.g. if a variable should be a categorical variable or numerical.

P.P.S. Another use case is if you get an arbitrary data set (we get them quite often), and want to save as a Parquet table. Providing correct data types make parquet more space effiecient (and probably more query-time performant, e.g. better parquet bloom filters than just storing everything as string/varchar).

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Does Spark have something like this already built-in?

Partially. There are some tools in Spark ecosystem which perform schema inference like spark-csv or pyspark-csv and category inference (categorical vs. numerical) like VectorIndexer.

So far so good. Problem is that schema inference has limited applicability, is not an easy task in general, can introduce hard to diagnose problems and can be quite expensive:

  1. There are not so many formats which can be used with Spark and may require schema inference. In practice it is limited to different variants of CSV and Fixed Width Formatted data.
  2. Depending on a data representation it can be impossible to determine correct data type or inferred type can lead to information loss:

    • interpreting numeric data as float or double can lead to unacceptable loss of precision, especially if working with financial data
    • date or number formats can differ based on a locale
    • some common identifiers can look like numerics while having some internal structure which can lost in conversion
  3. Automatic schema inference can mask different problems with input data and if it is not supported by additional tools which can highlight possible issues it can be dangerous. Moreover any mistakes during data loading and cleaning can be propagated through complete data processing pipeline.

    Arguably we should develop good understanding of input data before we even start to think about possible representation and encoding.

  4. Schema inference and / or category inference may require full data scan and / or large lookup tables. Both can be expensive or even not feasible on large datasets.

Edit:

It looks like schema inference capabilities on CSV files have been added directly to Spark SQL. See CSVInferSchema.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...