
apache spark - PySpark random forest feature importance mapping after column transformations

I am trying to plot the feature importances of certain tree-based models with column names. I am using PySpark.

Since I have both textual categorical variables and numeric ones, I had to use a pipeline, which goes something like this:

  1. Use StringIndexer to index the string columns
  2. Use OneHotEncoderEstimator on the indexed categorical columns
  3. Use a VectorAssembler to create the feature column containing the feature vector

    Some sample code from the docs for steps 1-3:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

    categoricalColumns = ["workclass", "education", "marital_status",
                          "occupation", "relationship", "race", "sex",
                          "native_country"]
    stages = []  # stages in our Pipeline
    for categoricalCol in categoricalColumns:
        # Category indexing with StringIndexer
        stringIndexer = StringIndexer(inputCol=categoricalCol,
                                      outputCol=categoricalCol + "Index")
        # Use OneHotEncoderEstimator to convert categorical variables
        # into binary SparseVectors
        encoder = OneHotEncoderEstimator(
            inputCols=[stringIndexer.getOutputCol()],
            outputCols=[categoricalCol + "classVec"])
        # Add stages.  These are not run here, but will run all at once later on.
        stages += [stringIndexer, encoder]

    numericCols = ["age", "fnlwgt", "education_num", "capital_gain",
                   "capital_loss", "hours_per_week"]
    assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
    assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
    stages += [assembler]

    # Create a Pipeline.
    pipeline = Pipeline(stages=stages)
    # Run the feature transformations.
    #  - fit() computes feature statistics as needed.
    #  - transform() actually transforms the features.
    pipelineModel = pipeline.fit(dataset)
    dataset = pipelineModel.transform(dataset)
    
  4. Finally, train the model.
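
    A minimal sketch of step 4, assuming a random forest and a numeric label column named "label" (both the classifier choice and the column name are illustrative):

    from pyspark.ml.classification import RandomForestClassifier

    # Train a random forest on the assembled "features" column.
    # "label" is an assumed name for the target column.
    rf = RandomForestClassifier(featuresCol="features", labelCol="label",
                                numTrees=100)
    dtModel_1 = rf.fit(dataset)  # the fitted model referenced below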

    After training and evaluation, I can use model.featureImportances to get the feature rankings; however, I don't get the feature/column names, just the feature indices, something like this:

    print(dtModel_1.featureImportances)
    
    (38895,[38708,38714,38719,38720,38737,38870,38894],[0.0742343395738,0.169404823667,0.100485791055,0.0105823115814,0.0134236162982,0.194124862158,0.437744255667])
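
    (For reference, the printed value is a SparseVector: the first number is the total number of feature slots, followed by parallel lists of the non-zero slot indices and their importances. A toy example with made-up numbers:)

    from pyspark.ml.linalg import SparseVector

    v = SparseVector(10, [2, 5], [0.3, 0.7])
    print(v[2])  # 0.3
    print(v[0])  # 0.0, since indices not listed are zero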
    

How do I map these indices back to the initial column names and their values, so that I can plot them?


1 Reply


Extract the metadata as shown here by user6910411:

from itertools import chain

# The assembled "features" column carries per-slot metadata; each attribute
# dict holds the slot index ("idx") and the generated feature name ("name").
attrs = sorted(
    (attr["idx"], attr["name"]) for attr in (chain(*dataset
        .schema["features"]
        .metadata["ml_attr"]["attrs"].values())))

and combine with feature importance:

# Keep only the features whose importance is non-zero.
[(name, dtModel_1.featureImportances[idx])
 for idx, name in attrs
 if dtModel_1.featureImportances[idx]]
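
To actually plot, the (name, importance) pairs can go straight into pandas; a minimal sketch, assuming pandas and matplotlib are available on the driver (variable names follow the snippets above):

import pandas as pd

# Collect (feature, importance) pairs from the metadata extracted above
# and draw a horizontal bar chart of the non-zero importances.
fi = pd.DataFrame(
    [(name, dtModel_1.featureImportances[idx]) for idx, name in attrs],
    columns=["feature", "importance"])
(fi[fi["importance"] > 0]
    .sort_values("importance")
    .plot.barh(x="feature", y="importance", legend=False))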
