I would prefer to avoid the hassle of encoding and decoding,
You cannot really avoid this completely. The metadata required for a categorical variable is actually a mapping between value and index. Still, there is no need to do it manually or to create a custom transformer. Let's assume you have a data frame like this:
import numpy as np
import pandas as pd
df = sqlContext.createDataFrame(pd.DataFrame({
    "x1": np.random.random(1000),
    "x2": np.random.choice(3, 1000),
    "x4": np.random.choice(5, 1000)
}))
All you need is an assembler and indexer:
from pyspark.ml.feature import VectorAssembler, VectorIndexer
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=df.columns, outputCol="features_raw"),
    VectorIndexer(
        inputCol="features_raw", outputCol="features", maxCategories=10)])
transformed = pipeline.fit(df).transform(df)
transformed.schema.fields[-1].metadata
## {'ml_attr': {'attrs': {'nominal': [{'idx': 1,
## 'name': 'x2',
## 'ord': False,
## 'vals': ['0.0', '1.0', '2.0']},
## {'idx': 2,
## 'name': 'x4',
## 'ord': False,
## 'vals': ['0.0', '1.0', '2.0', '3.0', '4.0']}],
## 'numeric': [{'idx': 0, 'name': 'x1'}]},
## 'num_attrs': 3}}
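As a side note, metadata of this shape can be inspected with plain Python to see which vector positions were marked as categorical. The following is a minimal sketch, not part of the Spark API: `categorical_arity` is a hypothetical helper, and the `meta` literal is just a copy of the output printed above so that the snippet runs without a Spark session.

```python
# Copy of the metadata printed above, so the sketch runs without Spark.
meta = {
    "ml_attr": {
        "attrs": {
            "nominal": [
                {"idx": 1, "name": "x2", "ord": False,
                 "vals": ["0.0", "1.0", "2.0"]},
                {"idx": 2, "name": "x4", "ord": False,
                 "vals": ["0.0", "1.0", "2.0", "3.0", "4.0"]},
            ],
            "numeric": [{"idx": 0, "name": "x1"}],
        },
        "num_attrs": 3,
    }
}


def categorical_arity(metadata):
    """Map vector position -> number of categories (hypothetical helper)."""
    attrs = metadata.get("ml_attr", {}).get("attrs", {})
    # Only entries under "nominal" describe categorical elements.
    return {attr["idx"]: len(attr["vals"]) for attr in attrs.get("nominal", [])}


arity = categorical_arity(meta)
print(arity)  # {1: 3, 2: 5}
```

With a real data frame you would pass `transformed.schema.fields[-1].metadata` instead of the literal, assuming the features column is the last field as in the example above.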
The metadata printed above also shows what information you have to provide to mark a given element of the vector as a categorical variable:
{
    'idx': 2,       # Index (position in vector)
    'name': 'x4',   # Name
    'ord': False,   # Is ordinal?
    # Mapping between value and label
    'vals': ['0.0', '1.0', '2.0', '3.0', '4.0']
}
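Assembling the whole dictionary by hand could look roughly like the sketch below. The helper names (`nominal_attr`, `numeric_attr`, `ml_attr_metadata`) are hypothetical and not part of PySpark; the layout simply mirrors the keys visible in the output above, so treat it as an assumption that no other keys are needed for your use case.

```python
def nominal_attr(idx, name, values, ordinal=False):
    """Describe one categorical element of the vector (hypothetical helper)."""
    return {
        "idx": idx,                        # position in the vector
        "name": name,
        "ord": ordinal,                    # is the variable ordinal?
        "vals": [str(v) for v in values],  # labels are stored as strings
    }


def numeric_attr(idx, name):
    """Describe one continuous element of the vector (hypothetical helper)."""
    return {"idx": idx, "name": name}


def ml_attr_metadata(numeric, nominal):
    """Combine attribute descriptions into the 'ml_attr' layout shown above."""
    attrs = {}
    if nominal:
        attrs["nominal"] = list(nominal)
    if numeric:
        attrs["numeric"] = list(numeric)
    return {"ml_attr": {"attrs": attrs,
                        "num_attrs": len(numeric) + len(nominal)}}


meta = ml_attr_metadata(
    numeric=[numeric_attr(0, "x1")],
    nominal=[
        nominal_attr(1, "x2", [0.0, 1.0, 2.0]),
        nominal_attr(2, "x4", [0.0, 1.0, 2.0, 3.0, 4.0]),
    ],
)
```

For the three example columns, the resulting `meta` is equal to the dictionary printed by the pipeline.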
So if you want to build this from scratch, all you have to do is provide the correct schema:
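from pyspark.sql.types import StructField, StructType
# Spark 1.x import; on Spark 2.0+ the ML pipeline produces pyspark.ml.linalg
# vectors, so import VectorUDT from pyspark.ml.linalg instead
from pyspark.mllib.linalg import VectorUDT
# Let's assume we have only a vector
raw = transformed.select("features_raw")
# Dictionary equivalent to transformed.schema.fields[-1].metadata shown above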
meta = ...
schema = StructType([StructField("features", VectorUDT(), metadata=meta)])
sqlContext.createDataFrame(raw.rdd, schema)
However, this is quite inefficient, because converting to an RDD and back requires serializing and deserializing the data.
Since Spark 2.2 you can also use the metadata argument of Column.alias:
from pyspark.sql.functions import col
df.withColumn("features", col("features").alias("features", metadata=meta))
See also: Attach metadata to vector column in Spark