In a tree-based learning model built with **Apache Spark**, the **feature importances** indicate which explanatory variables mattered in the trained model. This article explains how to link those importances with the variable names of the training data set. In Scala this can be written neatly with zipWithIndex, so it is not much trouble, but in Java it is a little tedious, so I packaged it as a library: https://github.com/riversun/spark-ml-feature-importance-helper
**Feature importances** can be calculated with **Apache Spark** using the following tree-based learning models:
| Model class | Description |
|---|---|
| GBTRegressionModel | Gradient-boosted tree regression model |
| GBTClassificationModel | Gradient-boosted tree classification model |
| RandomForestRegressionModel | Random forest regression model |
| RandomForestClassificationModel | Random forest classification model |
| DecisionTreeRegressionModel | Decision tree regression model |
| DecisionTreeClassificationModel | Decision tree classification model |
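All of these models expose the same accessor, so the approach described in this article works with any of them. As a minimal sketch (the model variables here are assumed to be already-trained instances, not code from this article):

```java
import org.apache.spark.ml.classification.GBTClassificationModel;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.regression.RandomForestRegressionModel;

// Every trained tree-based model returns its importance scores as a Vector.
Vector rfImportances = randomForestRegressionModel.featureImportances();
Vector gbtImportances = gbtClassificationModel.featureImportances();
```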
A convenient aspect of tree-based algorithms is that you can [calculate](https://web.stanford.edu/~hastie/ElemStatLearn/printings/ESLII_print12.pdf) which variables were important in building the model as a result of training, as **variable importance** (see p. 593 of the linked text).
However, since random forests and gradient-boosted trees are **ensemble learning** methods that combine multiple models, note that **feature importance** is not an index to be interpreted as directly as, say, a regression coefficient.
A model trained with **spark.ml** exposes this directly. For example, when using random forest as the algorithm, the **RandomForestRegressionModel** class has a **RandomForestRegressionModel#featureImportances** method, and calling it computes the feature importances.
However, the return value of **RandomForestRegressionModel#featureImportances** is of type **SparseVector**, and since the vector contains only the numerical importance scores, it is hard to tell which score corresponds to which variable.
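For example, printing the raw return value yields output along the following lines (a sketch; the scores shown echo the example at the end of this article):

```java
import org.apache.spark.ml.linalg.Vector;

Vector importances = randomForestRegressionModel.featureImportances();
System.out.println(importances);
// => (5,[0,1,2,3,4],[0.3016,0.2708,0.1132,0.2070,0.1076])
// Only index-score pairs; nothing says which index maps to which column.
```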
Therefore, the purpose of this article is to associate the **feature importances** with their variable names.
To make the association, prepare the three arguments in the following three steps.
It is assumed that training has already finished and that a trained model object named randomForestRegressionModel has been obtained, as sketched below.
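A minimal training sketch, assuming a DataFrame `df` with a numeric column `weight`, a categorical column `material`, and a label column `price` (these names are illustrative, echoing the jewelry example later in this article, and are not taken from the article's actual code):

```java
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.ml.regression.RandomForestRegressionModel;
import org.apache.spark.ml.regression.RandomForestRegressor;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Index the categorical column, assemble the features, then train a random forest.
StringIndexer materialIndexer = new StringIndexer()
        .setInputCol("material")
        .setOutputCol("materialIndex");

VectorAssembler assembler = new VectorAssembler()
        .setInputCols(new String[] { "weight", "materialIndex" })
        .setOutputCol("features");

RandomForestRegressor regressor = new RandomForestRegressor()
        .setLabelCol("price")
        .setFeaturesCol("features");

PipelineModel pipelineModel = new Pipeline()
        .setStages(new PipelineStage[] { materialIndexer, assembler, regressor })
        .fit(df);

// The trained forest is the last pipeline stage; "predictions" supplies
// the schema (and metadata) read in STEP 3 below.
RandomForestRegressionModel randomForestRegressionModel =
        (RandomForestRegressionModel) pipelineModel.stages()[pipelineModel.stages().length - 1];
Dataset<Row> predictions = pipelineModel.transform(df);
```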
**STEP 1**: First, get the **feature importances** as follows.

```java
Vector importances = randomForestRegressionModel.featureImportances();
```
**STEP 2**: Next, get the name of the column used for the features. (Since you decided this column name yourself in the first place, hard-coding it is also fine.)

```java
String featuresCol = randomForestRegressionModel.getFeaturesCol();
```
**STEP 3**: Finally, get the schema of the DataFrame that contains the features column (here, the predictions obtained from the model's transform).

```java
StructType schema = predictions.schema();
```
Pass the three values you just prepared to the following method, which performs the association (and also the sorting).
```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collector;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

List<Importance> zipImportances(Vector featureImportances, String featuresCol, StructType schema) {

    // Locate the features column in the schema and read its metadata,
    // which holds the index-to-name mapping written by the Transformers.
    final int indexOfFeaturesCol = (Integer) schema.getFieldIndex(featuresCol).get();
    final StructField featuresField = schema.fields()[indexOfFeaturesCol];
    final Metadata metadata = featuresField.metadata();
    final Metadata featuresFieldAttrs = metadata
            .getMetadata("ml_attr")
            .getMetadata("attrs");

    // Build an idx -> name map. Attributes may be stored under
    // "nominal", "numeric" or "binary" depending on the variable type.
    final Map<Integer, String> idNameMap = new HashMap<>();
    final String[] fieldKeys = { "nominal", "numeric", "binary" };

    final Collector<Metadata, ?, HashMap<Integer, String>> metaDataMapperFunc = Collectors
            .toMap(
                    metaData -> (int) metaData.getLong("idx"), // key of map
                    metaData -> metaData.getString("name"), // value of map
                    (oldVal, newVal) -> newVal,
                    HashMap::new);

    for (String fieldKey : fieldKeys) {
        if (featuresFieldAttrs.contains(fieldKey)) {
            idNameMap.putAll(Arrays
                    .stream(featuresFieldAttrs.getMetadataArray(fieldKey))
                    .collect(metaDataMapperFunc));
        }
    }

    // Pair each importance score with its variable name.
    final double[] importanceScores = featureImportances.toArray();
    final List<Importance> rawImportanceList = IntStream
            .range(0, importanceScores.length)
            .mapToObj(idx -> new Importance(idx, idNameMap.get(idx), importanceScores[idx], 0))
            .collect(Collectors.toList());

    // Assign ranks in descending order of score.
    final List<Importance> descSortedImportanceList = rawImportanceList
            .stream()
            .sorted(Comparator.comparingDouble((Importance ifeature) -> ifeature.score).reversed())
            .collect(Collectors.toList());

    for (int i = 0; i < descSortedImportanceList.size(); i++) {
        descSortedImportanceList.get(i).rank = i;
    }

    // "sort" is a field of the enclosing class (ASCENDING/DESCENDING/UNSORTED).
    final List<Importance> finalImportanceList;

    switch (this.sort) {
    case ASCENDING:
        final List<Importance> ascSortedImportantFeatureList = descSortedImportanceList
                .stream()
                .sorted(Comparator.comparingDouble((Importance ifeature) -> ifeature.score))
                .collect(Collectors.toList());
        finalImportanceList = ascSortedImportantFeatureList;
        break;
    case DESCENDING:
        finalImportanceList = descSortedImportanceList;
        break;
    case UNSORTED:
    default:
        finalImportanceList = rawImportanceList;
        break;
    }

    return finalImportanceList;
}
```
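A usage sketch tying STEP 1-3 together (here `Importance` and the `sort` field are members of the enclosing helper class, and `predictions` is the DataFrame whose schema carries the metadata):

```java
// Prepare the three arguments and let zipImportances do the association.
Vector importances = randomForestRegressionModel.featureImportances(); // STEP 1
String featuresCol = randomForestRegressionModel.getFeaturesCol();     // STEP 2
StructType schema = predictions.schema();                               // STEP 3

List<Importance> ranking = zipImportances(importances, featuresCol, schema);
ranking.forEach(System.out::println); // one line per variable: rank, score, name
```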
The key to this method is as follows.
```java
final StructField featuresField = schema.fields()[indexOfFeaturesCol];
final Metadata metadata = featuresField.metadata();
final Metadata featuresFieldAttrs = metadata
        .getMetadata("ml_attr")
        .getMetadata("attrs");
```
The **schema** stores the structure of the data set; among its columns is the feature vector used for training. This code extracts that column's metadata (design information), which holds the mapping between each position (index) in the vector and the corresponding variable.
The actual metadata can be extracted in the JSON format shown below. (The format changes depending on which Transformers are inserted in the pipeline, so the following is just one example.)
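If you want to dump this metadata yourself, one way (a sketch; `Metadata#json()` serializes the metadata object to a JSON string) is:

```java
// Print the features column's metadata as JSON (like the example below).
int idx = (Integer) predictions.schema().getFieldIndex(featuresCol).get();
Metadata meta = predictions.schema().fields()[idx].metadata();
System.out.println(meta.json());
```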
Metadata example (predicting the price of jewelry accessories):
```json
{
  "ml_attr":{
    "attrs":{
      "numeric":[
        {
          "idx":0,
          "name":"weight"
        }
      ],
      "nominal":[
        {
          "vals":[
            "platinum",
            "gold",
            "Silver",
            "Dia"
          ],
          "idx":1,
          "name":"materialIndex"
        },
        {
          "vals":[
            "brooch",
            "necklace",
            "Earrings",
            "ring",
            "bracelet"
          ],
          "idx":2,
          "name":"shapeIndex"
        },
        {
          "vals":[
            "Domestic famous brand",
            "Overseas super famous brand",
            "No brand",
            "Famous overseas brands"
          ],
          "idx":3,
          "name":"brandIndex"
        },
        {
          "vals":[
            "Department store",
            "Directly managed store",
            "Cheap shop"
          ],
          "idx":4,
          "name":"shopIndex"
        }
      ]
    },
    "num_attrs":5
  }
}
```
As this data shows, pairs of **idx** and **name** are available. By mapping them to the **Vector** obtained from the **featureImportances** method, we can see how important (**score**) each variable (**name**) turned out to be as a result of training, as follows.
Explanatory variable importance ranking
```
FeatureInfo [rank=0, score=0.3015643580333446, name=weight]
FeatureInfo [rank=1, score=0.2707593044437997, name=materialIndex]
FeatureInfo [rank=2, score=0.20696065038166056, name=brandIndex]
FeatureInfo [rank=3, score=0.11316392134864546, name=shapeIndex]
FeatureInfo [rank=4, score=0.10755176579254973, name=shopIndex]
```
- In Apache Spark's **spark.ml**, I wrote a Java helper that links the feature importances of a tree-based model to the names of the explanatory variables.
- You might say "that sort of thing is quick with zipWithIndex in Scala!", but the point is that I wanted a library that saves that time and effort in Java.
The source code is available at https://github.com/riversun/spark-ml-feature-importance-helper
It is also available from the Maven repository:

```xml
<dependency>
    <groupId>org.riversun</groupId>
    <artifactId>spark-ml-feature-importance-helper</artifactId>
    <version>1.0.0</version>
</dependency>
```