apache spark - PySpark & MLLib: Random Forest Feature Importances

Question

Welcome To Ask or Share your Answers For Others

apache spark - PySpark & MLLib: Random Forest Feature Importances

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

apache spark - PySpark & MLLib: Random Forest Feature Importances

I'm trying to extract the feature importances of a random forest object I have trained using PySpark. However, I do not see an example of doing this anywhere in the documentation, nor is it a method of RandomForestModel.

How can I extract feature importances from a RandomForestModel regressor or classifier in PySpark?

Here's the sample code provided in the documentation to get us started; however, there is no mention of feature importances in it.

from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils

# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
#  Empty categoricalFeaturesInfo indicates all features are continuous.
#  Note: Use larger numTrees in practice.
#  Setting featureSubsetStrategy="auto" lets the algorithm choose.
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=3, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)

I don't see a model.__featureImportances_ attribute available -- where can I find this?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:48:27+0000

UPDATE for version > 2.0.0

From the version 2.0.0, as you can see here, FeatureImportances is available for Random Forest.

In fact, you can find here that:

The DataFrame API supports two major tree ensemble algorithms: Random Forests and Gradient-Boosted Trees (GBTs). Both use spark.ml decision trees as their base models.

Users can find more information about ensemble algorithms in the MLlib Ensemble guide. In this section, we demonstrate the DataFrame API for ensembles.

The main differences between this API and the original MLlib ensembles API are:

support for DataFrames and ML Pipelines

separation of classification vs. regression

use of DataFrame metadata to distinguish continuous and categorical features

more functionality for random forests: estimates of feature importance, as well as the predicted probability of each class (a.k.a. class conditional probabilities) for classification.

If you want to have Feature Importance values, you have to work with ml package, not mllib, and use dataframes.

Below there is an example that you can find here:

# IMPORT
>>> import numpy
>>> from numpy import allclose
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.feature import StringIndexer
>>> from pyspark.ml.classification import RandomForestClassifier

# PREPARE DATA
>>> df = spark.createDataFrame([
...     (1.0, Vectors.dense(1.0)),
...     (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
>>> si_model = stringIndexer.fit(df)
>>> td = si_model.transform(df)

# BUILD THE MODEL
>>> rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="indexed", seed=42)
>>> model = rf.fit(td)

# FEATURE IMPORTANCES
>>> model.featureImportances
SparseVector(1, {0: 1.0})

Categories

apache spark - PySpark & MLLib: Random Forest Feature Importances

apache spark - PySpark & MLLib: Random Forest Feature Importances

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags