Japanese language processing with Python 3 (5): Ensemble learning of different models with VotingClassifier

What is ensemble learning?

When you have several reasonably accurate classifiers for a task such as sentiment analysis in natural language processing, where a given sentence is classified into one of several classes, combining them through ensemble learning can sometimes produce a stronger model.

Ensemble learning: a method of constructing a high-accuracy learner by combining weak learners, that is, predictors that can predict with higher accuracy than the worst possible predictor, one that outputs answers at random. Techniques such as bagging and boosting are well known. ([Crested Ibis Forest Wiki: Ensemble Learning](http://ibisforest.org/index.php?%E3%82%A2%E3%83%B3%E3%82%B5%E3%83%B3%E3%83%96%E3%83%AB%E5%AD%A6%E7%BF%92))

In real life, asking several experts to discuss a policy proposal may well work better than asking a single expert for advice. Roughly speaking, each expert here corresponds to a learner (a random forest, a support vector machine, and so on), and ensemble learning means combining the results (predicted values) obtained from multiple learners. Incidentally, a random forest is itself called an ensemble learner, because it derives its prediction by a majority vote over the results of many decision trees. This time I look at VotingClassifier, which makes it quick to combine several conceptually different models that are each already reasonably accurate.

What is VotingClassifier?

VotingClassifier is a class in sklearn.ensemble that was added in scikit-learn v0.17. The official documentation describes it as follows:

The idea behind the voting classifier implementation is to combine conceptually different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels. Such a classifier can be useful for a set of equally well performing model in order to balance out their individual weaknesses.

VotingClassifier takes the results of different kinds of learners that each already reach a certain level of accuracy (for example RandomForestClassifier, LogisticRegression, and GaussianNB, the Gaussian naive Bayes classifier) and decides the final result by majority vote or by averaging probabilities. The concept is very simple, but it is easy to use and surprisingly powerful.

Hard Vote: adopts the label chosen by the majority of the models. For a given input X, suppose three learners predict as follows: learner 1 -> class 1, learner 2 -> class 1, learner 3 -> class 2. The majority label, 1, is taken, and X is classified as class 1.
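As a rough illustration of the majority rule, here is a minimal sketch in plain Python. The helper name `hard_vote` is made up for this example and is not part of scikit-learn. The tie-breaking rule (pick the smallest label) is my reading of how the scikit-learn docs describe ties, so treat it as an assumption.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the label predicted by the most learners.

    Assumption: on a tie, the smallest label wins (ascending sort order).
    """
    counts = Counter(predictions)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)

# Learner 1 -> class 1, learner 2 -> class 1, learner 3 -> class 2
print(hard_vote([1, 1, 2]))  # -> 1
```

With an odd number of learners and two classes a tie cannot occur, which is one reason three-model ensembles are a common starting point.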

Soft Vote: each learner's predicted class probabilities are multiplied by that learner's weight and averaged, and the label with the highest weighted average probability is chosen. See the official example for details: 1.11.5.2. Weighted Average Probabilities (Soft Voting).
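The arithmetic behind the weighted average can be sketched in a few lines of plain Python. The function name `soft_vote` and the probability values are invented for illustration. The numbers are chosen so that a simple majority would pick class 1, while the unweighted average picks class 0 because the first learner is very confident.

```python
def soft_vote(probas, weights=None):
    """Return (index of winning class, weighted average probabilities).

    probas: one list of class probabilities per learner.
    weights: one weight per learner (defaults to equal weights).
    """
    if weights is None:
        weights = [1.0] * len(probas)
    total = sum(weights)
    n_classes = len(probas[0])
    avg = [
        sum(w * p[k] for w, p in zip(weights, probas)) / total
        for k in range(n_classes)
    ]
    winner = max(range(n_classes), key=lambda k: avg[k])
    return winner, avg

# Hypothetical probabilities for classes [0, 1] from three learners
probas = [[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]]

print(soft_vote(probas)[0])                     # -> 0 (equal weights)
print(soft_vote(probas, weights=[1, 3, 3])[0])  # -> 1 (learners 2 and 3 weighted up)
```

Note that soft voting only makes sense when every member model can output class probabilities.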

A caveat when using VotingClassifier

One thing to keep in mind is that the condition **equally well performing model** is a pitfall: if some of the models being combined perform poorly, voting may not improve the result and can even make it worse. See: Why is my VotingClassifier accuracy less than my individual classifier?
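A toy example may make this caveat concrete. The labels and predictions below are hand-made, not the result of any real model, and the helper names are hypothetical. One strong model is outvoted by two weaker models that tend to make the same mistakes.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority(preds_per_model):
    """Hard vote per sample across models (three binary voters, so no ties)."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*preds_per_model)]

# Hand-made labels and predictions, purely illustrative
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
strong = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]  # 9 of 10 correct
weak_a = [1, 1, 0, 0, 0, 1, 1, 1, 0, 0]  # 6 of 10 correct
weak_b = [1, 1, 0, 0, 0, 1, 1, 0, 0, 1]  # 6 of 10 correct

ensemble = majority([strong, weak_a, weak_b])
print(accuracy(y_true, strong))    # 0.9
print(accuracy(y_true, ensemble))  # 0.6
```

Here the ensemble is no better than the weak models, because their errors overlap and outvote the strong model.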

Trying it out

Following the example in the official docs, let's first run a hard vote on the iris dataset with VotingClassifier.

`voting_classifier.py`


from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier

def execute_voting_classifier():
    # Load the iris dataset, using only two features (sepal length and petal length)
    iris = datasets.load_iris()
    X = iris.data[:, [0, 2]]
    y = iris.target

    # Set up the classifiers: logistic regression, a random forest, and Gaussian naive Bayes
    clf1 = LogisticRegression(random_state=1)
    clf2 = RandomForestClassifier(random_state=1)
    clf3 = GaussianNB()

    # Create the ensemble learner. voting='hard' decides the label by a simple majority vote
    eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='hard')

    # Evaluate each classifier and the ensemble with 5-fold cross-validation
    for clf, label in zip([clf1, clf2, clf3, eclf], ['Logistic Regression', 'Random Forest', 'naive Bayes', 'Ensemble']):
        scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
        print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

if __name__ == '__main__':
    execute_voting_classifier()

Output result:
    Accuracy: 0.92 (+/- 0.03) [Logistic Regression]
    Accuracy: 0.91 (+/- 0.05) [Random Forest]
    Accuracy: 0.91 (+/- 0.06) [naive Bayes]
    Accuracy: 0.93 (+/- 0.06) [Ensemble]

The accuracy did improve, if only slightly. Models that each have several parameters can be combined just as easily, and you can also run a grid search over those parameters.

`ensemble.py`


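from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

# Three models, each with its own hyperparameters
clf1 = SVC(kernel='rbf', random_state=0, gamma=0.3, C=5, class_weight='balanced')
clf2 = LogisticRegression(C=5, random_state=0, class_weight='balanced')
clf3 = RandomForestClassifier(criterion='entropy', n_estimators=250, random_state=1, max_depth=20, n_jobs=2, class_weight='balanced')

# Combine them by hard voting; X_train and y_train are assumed to be prepared elsewhere
eclf = VotingClassifier(estimators=[('svm', clf1), ('lr', clf2), ('rfc', clf3)], voting='hard')
eclf.fit(X_train, y_train)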

In this way, it looks useful in situations like "I have a few models that seem worth combining, and I'd like an easy way to combine them".
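The grid search mentioned earlier can be sketched roughly as follows, reusing the iris setup from the first example. Parameters of a member model are addressed as `<estimator name>__<parameter>`. The candidate values in `param_grid` and the `max_iter=1000` setting are arbitrary choices for this illustration, not tuned recommendations.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
X, y = iris.data[:, [0, 2]], iris.target

eclf = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=1, max_iter=1000)),
        ('rf', RandomForestClassifier(random_state=1)),
        ('gnb', GaussianNB()),
    ],
    voting='hard',
)

# Keys follow the '<estimator name>__<parameter>' convention.
# The candidate values are arbitrary picks for illustration.
param_grid = {'lr__C': [1.0, 100.0], 'rf__n_estimators': [20, 100]}

# Try every combination with 5-fold cross-validation
grid = GridSearchCV(estimator=eclf, param_grid=param_grid, cv=5, scoring='accuracy')
grid.fit(X, y)

print(grid.best_params_)
print("Best CV accuracy: %0.2f" % grid.best_score_)
```

Because every combination refits all member models, the search cost grows quickly with the number of candidate values, so it is probably wise to keep the grid small at first.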