Feature Selection Datasets
Feature Selection Datasets is a collection of datasets that appears to have been assembled for studying and benchmarking machine learning methods.
http://featureselection.asu.edu/datasets.php
Because the collection is large, I wanted to list its contents and find suitable datasets, so I ran a quick analysis.
In addition to downloading each file and examining its structure, I used scikit-learn's RandomForestClassifier to gauge how difficult each classification problem is.
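As a rough illustration of what examining the structure involves, the sketch below writes a tiny synthetic .mat file and reads it back with scipy. The file name demo.mat and the arrays X_demo and Y_demo are made up for this example; the only assumption carried over from the real files is that features are stored under X and labels under Y, which is what the script in this post relies on.

```python
import os
import tempfile

import numpy as np
from scipy import io

# Hypothetical stand-in for a downloaded dataset: 4 samples, 3 features.
X_demo = np.arange(12, dtype=float).reshape(4, 3)
Y_demo = np.array([[1], [2], [1], [2]])  # labels stored as a column vector

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.mat")
    io.savemat(path, {"X": X_demo, "Y": Y_demo})
    # squeeze_me=True drops length-1 dimensions, so Y comes back as 1-D.
    matdata = io.loadmat(path, squeeze_me=True)

# loadmat also returns metadata entries whose names start with "__".
keys = sorted(k for k in matdata if not k.startswith("__"))
X = matdata["X"]
y = matdata["Y"].flatten()

print(keys, X.shape, y.shape, len(np.unique(y)))
```

Listing the non-metadata keys first is a cheap way to check that a file really uses the X and Y names before indexing into it.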
import os
import timeit
from scipy import io
import pandas as pd
import urllib.request
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
Here is the list of datasets I retrieved. I corrected a few incorrect URLs.
dataset_url = [
"http://featureselection.asu.edu/files/datasets/BASEHOCK.mat",
"http://featureselection.asu.edu/files/datasets/PCMAC.mat",
"http://featureselection.asu.edu/files/datasets/RELATHE.mat",
"http://featureselection.asu.edu/files/datasets/COIL20.mat",
"http://featureselection.asu.edu/files/datasets/ORL.mat",
"http://featureselection.asu.edu/files/datasets/orlraws10P.mat",
"http://featureselection.asu.edu/files/datasets/pixraw10P.mat",
"http://featureselection.asu.edu/files/datasets/warpAR10P.mat",
"http://featureselection.asu.edu/files/datasets/warpPIE10P.mat",
"http://featureselection.asu.edu/files/datasets/Yale.mat",
"http://featureselection.asu.edu/files/datasets/USPS.mat",
"http://featureselection.asu.edu/files/datasets/ALLAML.mat",
"http://featureselection.asu.edu/files/datasets/Carcinom.mat",
"http://featureselection.asu.edu/files/datasets/CLL_SUB_111.mat",
"http://featureselection.asu.edu/files/datasets/colon.mat",
"http://featureselection.asu.edu/files/datasets/GLI_85.mat",
"http://featureselection.asu.edu/files/datasets/GLIOMA.mat",
"http://featureselection.asu.edu/files/datasets/leukemia.mat",
"http://featureselection.asu.edu/files/datasets/lung.mat",
"http://featureselection.asu.edu/files/datasets/lung_discrete.mat",
"http://featureselection.asu.edu/files/datasets/lymphoma.mat",
"http://featureselection.asu.edu/files/datasets/nci9.mat",
"http://featureselection.asu.edu/files/datasets/Prostate_GE.mat",
"http://featureselection.asu.edu/files/datasets/SMK_CAN_187.mat",
"http://featureselection.asu.edu/files/datasets/TOX_171.mat",
"http://featureselection.asu.edu/files/datasets/arcene.mat",
"http://featureselection.asu.edu/files/datasets/gisette.mat",
"http://featureselection.asu.edu/files/datasets/Isolet.mat",
"http://featureselection.asu.edu/files/datasets/madelon.mat"
]
result = {
'dataset':[],
'byte':[],
'X.shape':[],
'X_type':[],
'y.shape':[],
'n_class':[],
'RF_max':[],
'RF_mean':[],
'RF_min':[],
'sec':[],
}
for url in dataset_url:
    result['dataset'].append(url.split("/")[-1])

    # Download the file (overwriting the previous one) and record its size.
    filename = 'dataset.mat'
    urllib.request.urlretrieve(url, filename)
    result['byte'].append(os.path.getsize(filename))

    # Features are stored under 'X' and labels under 'Y'.
    matdata = io.loadmat(filename, squeeze_me=True)
    X = matdata['X']
    y = matdata['Y'].flatten()

    result['X.shape'].append(X.shape)
    # X_type: number of distinct values in the first feature column.
    result['X_type'].append(pd.DataFrame(X).nunique()[0])
    result['y.shape'].append(y.shape)
    result['n_class'].append(pd.DataFrame(y).nunique()[0])

    # Fit a default random forest on 10 random train/test splits.
    scores = []
    times = []
    for _ in range(10):
        X_train, X_test, y_train, y_test = train_test_split(X, y)
        model = RandomForestClassifier()
        times.append(timeit.timeit(lambda: model.fit(X_train, y_train), number=1))
        scores.append(model.score(X_test, y_test))

    result['RF_max'].append(max(scores))
    result['RF_mean'].append(sum(scores) / len(scores))
    result['RF_min'].append(min(scores))
    result['sec'].append(sum(times) / len(times))  # mean fit time in seconds
pd.DataFrame(result).sort_values("RF_max")
index | dataset | byte | X.shape | X_type | y.shape | n_class | RF_max | RF_mean | RF_min | sec |
---|---|---|---|---|---|---|---|---|---|---|
21 | nci9.mat | 169288 | (60, 9712) | 3 | (60,) | 9 | 0.666667 | 0.433333 | 0.266667 | 0.183649 |
23 | SMK_CAN_187.mat | 11861244 | (187, 19993) | 171 | (187,) | 2 | 0.723404 | 0.655319 | 0.574468 | 0.670948 |
28 | madelon.mat | 1496573 | (2600, 500) | 40 | (2600,) | 2 | 0.733846 | 0.707385 | 0.680000 | 2.456003 |
13 | CLL_SUB_111.mat | 5875157 | (111, 11340) | 111 | (111,) | 3 | 0.750000 | 0.657143 | 0.464286 | 0.326307 |
24 | TOX_171.mat | 3470586 | (171, 5748) | 169 | (171,) | 4 | 0.813953 | 0.772093 | 0.697674 | 0.405085 |
16 | GLIOMA.mat | 1462087 | (50, 4434) | 50 | (50,) | 4 | 0.846154 | 0.669231 | 0.538462 | 0.154852 |
9 | Yale.mat | 161021 | (165, 1024) | 77 | (165,) | 15 | 0.857143 | 0.769048 | 0.595238 | 0.306511 |
25 | arcene.mat | 1900005 | (200, 10000) | 82 | (200,) | 2 | 0.900000 | 0.788000 | 0.680000 | 0.417719 |
20 | lymphoma.mat | 110185 | (96, 4026) | 3 | (96,) | 9 | 0.916667 | 0.829167 | 0.708333 | 0.169875 |
2 | RELATHE.mat | 226918 | (1427, 4322) | 2 | (1427,) | 2 | 0.921569 | 0.898880 | 0.876751 | 1.218853 |
14 | colon.mat | 36319 | (62, 2000) | 3 | (62,) | 2 | 0.937500 | 0.768750 | 0.687500 | 0.135427 |
7 | warpAR10P.mat | 279711 | (130, 2400) | 63 | (130,) | 10 | 0.939394 | 0.851515 | 0.757576 | 0.274956 |
1 | PCMAC.mat | 191131 | (1943, 3289) | 4 | (1943,) | 2 | 0.944444 | 0.922634 | 0.899177 | 1.491283 |
4 | ORL.mat | 376584 | (400, 1024) | 151 | (400,) | 40 | 0.950000 | 0.921000 | 0.830000 | 1.216780 |
15 | GLI_85.mat | 8743262 | (85, 22283) | 85 | (85,) | 2 | 0.954545 | 0.863636 | 0.772727 | 0.269521 |
27 | Isolet.mat | 3652673 | (1560, 617) | 1340 | (1560,) | 26 | 0.956410 | 0.938205 | 0.905128 | 2.222803 |
18 | lung.mat | 4762671 | (203, 3312) | 203 | (203,) | 5 | 0.960784 | 0.929412 | 0.882353 | 0.380843 |
22 | Prostate_GE.mat | 1524983 | (102, 5966) | 29 | (102,) | 2 | 0.961538 | 0.900000 | 0.807692 | 0.207986 |
10 | USPS.mat | 15138167 | (9298, 256) | 1617 | (9298,) | 10 | 0.965161 | 0.960258 | 0.955699 | 9.295629 |
26 | gisette.mat | 10619742 | (7000, 5000) | 345 | (7000,) | 2 | 0.974286 | 0.968971 | 0.961714 | 9.597926 |
12 | Carcinom.mat | 6917199 | (174, 9182) | 156 | (174,) | 11 | 0.977273 | 0.868182 | 0.772727 | 0.557979 |
0 | BASEHOCK.mat | 279059 | (1993, 4862) | 2 | (1993,) | 2 | 0.985972 | 0.974349 | 0.965932 | 1.789281 |
3 | COIL20.mat | 3024549 | (1440, 1024) | 10 | (1440,) | 20 | 1.000000 | 0.998889 | 0.994444 | 1.873450 |
11 | ALLAML.mat | 3639219 | (72, 7129) | 66 | (72,) | 2 | 1.000000 | 0.938889 | 0.833333 | 0.183536 |
6 | pixraw10P.mat | 520463 | (100, 10000) | 11 | (100,) | 10 | 1.000000 | 0.972000 | 0.920000 | 0.338596 |
17 | leukemia.mat | 154743 | (72, 7070) | 3 | (72,) | 2 | 1.000000 | 0.950000 | 0.777778 | 0.155346 |
8 | warpPIE10P.mat | 458267 | (210, 2420) | 36 | (210,) | 10 | 1.000000 | 0.962264 | 0.924528 | 0.410544 |
5 | orlraws10P.mat | 951783 | (100, 10304) | 46 | (100,) | 10 | 1.000000 | 0.988000 | 0.960000 | 0.415471 |
19 | lung_discrete.mat | 7516 | (73, 325) | 3 | (73,) | 7 | 1.000000 | 0.800000 | 0.526316 | 0.131734 |
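The RF_max, RF_mean and RF_min columns summarize ten random train/test splits per dataset. A minimal sketch of that measurement as a reusable helper is shown below, run on synthetic data from make_classification so that it works offline. The helper name rf_difficulty, its parameters, and the fixed random seeds are assumptions added for reproducibility; the script in this post does not set any seed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def rf_difficulty(X, y, n_repeats=10, seed=0):
    """Return (max, mean, min) test accuracy over repeated random splits."""
    scores = []
    for i in range(n_repeats):
        # A different seed per repeat gives a different split each time.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, random_state=seed + i
        )
        model = RandomForestClassifier(random_state=seed + i)
        model.fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))
    return max(scores), float(np.mean(scores)), min(scores)


# Synthetic stand-in for one of the datasets (not real data).
X, y = make_classification(
    n_samples=200, n_features=50, n_informative=5, random_state=0
)
rf_max, rf_mean, rf_min = rf_difficulty(X, y, n_repeats=3)
print(rf_max, rf_mean, rf_min)
```

Because train_test_split holds out 25% of the samples by default, a 60-sample dataset such as nci9 is scored on only 15 samples per split, which helps explain the wide gap between its RF_max and RF_min.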
I figured that solving a problem that is too easy would be boring, so I sorted the table in ascending order of RF_max, which puts the hardest datasets first.
I hope this helps you choose a dataset.
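For reference, sort_values sorts in ascending order by default, which is why the table runs from the lowest RF_max to the highest. The short sketch below uses three rows copied from the table to show both orderings; the variable names hardest_first and easiest_first are only illustrative.

```python
import pandas as pd

# Three rows taken from the results table.
df = pd.DataFrame({
    "dataset": ["COIL20.mat", "nci9.mat", "madelon.mat"],
    "RF_max": [1.000000, 0.666667, 0.733846],
})

# Default is ascending=True: lowest best-case accuracy (hardest) first.
hardest_first = df.sort_values("RF_max")

# Pass ascending=False to list the easiest datasets first instead.
easiest_first = df.sort_values("RF_max", ascending=False)

print(hardest_first["dataset"].tolist())
print(easiest_first["dataset"].tolist())
```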