Feature Selection Datasets

Feature Selection Datasets is a collection of datasets that appears to have been assembled for machine learning research and for benchmarking methods.

http://featureselection.asu.edu/datasets.php

The collection is large, so I ran a light analysis to list its contents and make it easier to find suitable data.

Besides downloading each dataset and inspecting its structure, I trained scikit-learn's RandomForestClassifier on it to gauge how hard each classification problem is.

code

import os
import timeit
from scipy import io
import pandas as pd
import urllib.request
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

Here is the list of datasets I retrieved. I fixed two wrong URLs.

dataset_url = [
        "http://featureselection.asu.edu/files/datasets/BASEHOCK.mat",
        "http://featureselection.asu.edu/files/datasets/PCMAC.mat",
        "http://featureselection.asu.edu/files/datasets/RELATHE.mat",
        "http://featureselection.asu.edu/files/datasets/COIL20.mat",
        "http://featureselection.asu.edu/files/datasets/ORL.mat",
        "http://featureselection.asu.edu/files/datasets/orlraws10P.mat",
        "http://featureselection.asu.edu/files/datasets/pixraw10P.mat",
        "http://featureselection.asu.edu/files/datasets/warpAR10P.mat",
        "http://featureselection.asu.edu/files/datasets/warpPIE10P.mat",
        "http://featureselection.asu.edu/files/datasets/Yale.mat",
        "http://featureselection.asu.edu/files/datasets/USPS.mat",
        "http://featureselection.asu.edu/files/datasets/ALLAML.mat",
        "http://featureselection.asu.edu/files/datasets/Carcinom.mat",
        "http://featureselection.asu.edu/files/datasets/CLL_SUB_111.mat",
        "http://featureselection.asu.edu/files/datasets/colon.mat",
        "http://featureselection.asu.edu/files/datasets/GLI_85.mat",
        "http://featureselection.asu.edu/files/datasets/GLIOMA.mat",
        "http://featureselection.asu.edu/files/datasets/leukemia.mat",
        "http://featureselection.asu.edu/files/datasets/lung.mat",
        "http://featureselection.asu.edu/files/datasets/lung_discrete.mat",
        "http://featureselection.asu.edu/files/datasets/lymphoma.mat",
        "http://featureselection.asu.edu/files/datasets/nci9.mat",
        "http://featureselection.asu.edu/files/datasets/Prostate_GE.mat",
        "http://featureselection.asu.edu/files/datasets/SMK_CAN_187.mat",
        "http://featureselection.asu.edu/files/datasets/TOX_171.mat",
        "http://featureselection.asu.edu/files/datasets/arcene.mat",
        "http://featureselection.asu.edu/files/datasets/gisette.mat",
        "http://featureselection.asu.edu/files/datasets/Isolet.mat",
        "http://featureselection.asu.edu/files/datasets/madelon.mat"
]

result = {
    'dataset':[],
    'byte':[],
    'X.shape':[],
    'X_type':[],
    'y.shape':[],
    'n_class':[],
    'RF_max':[],
    'RF_mean':[],
    'RF_min':[],
    'sec':[],
    }

for url in dataset_url:
    result['dataset'].append(url.split("/")[-1])

    # Download to a scratch file that is overwritten on every iteration.
    filename = 'dataset.mat'
    urllib.request.urlretrieve(url, filename)
    result['byte'].append(os.path.getsize(filename))

    matdata = io.loadmat(filename, squeeze_me=True)
    X = matdata['X']
    y = matdata['Y'].flatten()
    result['X.shape'].append(X.shape)
    # Number of distinct values in the first feature column only.
    result['X_type'].append(pd.DataFrame(X).nunique()[0])

    result['y.shape'].append(y.shape)
    # Number of distinct labels, i.e. the number of classes.
    result['n_class'].append(pd.DataFrame(y).nunique()[0])

    # Fit a default random forest on 10 random train/test splits.
    scores = []
    times = []
    for _ in range(10):
        X_train, X_test, y_train, y_test = train_test_split(X, y)
        model = RandomForestClassifier()
        # Only the fit call is timed.
        times.append(timeit.timeit(lambda: model.fit(X_train, y_train), number=1))
        scores.append(model.score(X_test, y_test))

    result['RF_max'].append(max(scores))
    result['RF_mean'].append(sum(scores) / len(scores))
    result['RF_min'].append(min(scores))

    result['sec'].append(sum(times) / len(times))

result
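The loop above writes every download to the same dataset.mat, so rerunning the analysis fetches all 29 files again. A minimal sketch of one way to avoid that is to keep each file under its own name and skip the download when a copy already exists. The helper names `local_path_for` and `fetch_once` and the cache directory are assumptions for illustration, not part of the original code.

```python
import os
import urllib.request


def local_path_for(url, cache_dir):
    """Map a dataset URL to a file path inside cache_dir (hypothetical helper)."""
    return os.path.join(cache_dir, url.split("/")[-1])


def fetch_once(url, cache_dir):
    """Download url into cache_dir unless a copy is already there."""
    os.makedirs(cache_dir, exist_ok=True)
    path = local_path_for(url, cache_dir)
    if not os.path.exists(path):
        # Network access happens only on the first call for a given file.
        urllib.request.urlretrieve(url, path)
    return path
```

In the loop, `filename = fetch_once(url, "datasets")` could then stand in for the two download lines, with "datasets" being an arbitrary directory name.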

dataset: Name of the dataset file
byte: Size of the downloaded file in bytes
X.shape: Shape of the explanatory variables (samples, features)
X_type: Number of distinct values in the first column of the explanatory variables, used as a rough indicator of the value type. If it is 2, the feature is probably discrete with only two values. If it is sufficiently large, the feature can be treated as effectively continuous.
y.shape: Shape of the objective variable
n_class: Number of distinct values of the objective variable, that is, the number of classes
RF_max, RF_mean, RF_min: Maximum, mean, and minimum test accuracy of the random forest over the 10 random train/test splits
sec: Average time in seconds needed to fit the random forest once
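Note that `pd.DataFrame(X).nunique()[0]` in the code above counts distinct values in the first column only, so X_type can understate the variety in the rest of the matrix. If a count over the whole matrix is preferred, something like the following sketch could be used. `distinct_values` and the toy matrix are made up for illustration, and the numbers in the result table were not produced this way.

```python
import numpy as np


def distinct_values(X):
    """Return (distinct values in column 0, distinct values in the whole matrix)."""
    X = np.asarray(X)
    first_column = np.unique(X[:, 0]).size
    whole_matrix = np.unique(X).size
    return first_column, whole_matrix


# Toy matrix: the first column is binary, but the second column is not.
X_toy = np.array([[0, 1.5], [1, 2.5], [0, 3.5], [1, 4.5]])
print(distinct_values(X_toy))  # (2, 6)
```

For the larger datasets the whole-matrix count has to sort every element, so it is noticeably slower than looking at a single column.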

pd.DataFrame(result).sort_values("RF_max")

	dataset	byte	X.shape	X_type	y.shape	n_class	RF_max	RF_mean	RF_min	sec
21	nci9.mat	169288	(60, 9712)	3	(60,)	9	0.666667	0.433333	0.266667	0.183649
23	SMK_CAN_187.mat	11861244	(187, 19993)	171	(187,)	2	0.723404	0.655319	0.574468	0.670948
28	madelon.mat	1496573	(2600, 500)	40	(2600,)	2	0.733846	0.707385	0.680000	2.456003
13	CLL_SUB_111.mat	5875157	(111, 11340)	111	(111,)	3	0.750000	0.657143	0.464286	0.326307
24	TOX_171.mat	3470586	(171, 5748)	169	(171,)	4	0.813953	0.772093	0.697674	0.405085
16	GLIOMA.mat	1462087	(50, 4434)	50	(50,)	4	0.846154	0.669231	0.538462	0.154852
9	Yale.mat	161021	(165, 1024)	77	(165,)	15	0.857143	0.769048	0.595238	0.306511
25	arcene.mat	1900005	(200, 10000)	82	(200,)	2	0.900000	0.788000	0.680000	0.417719
20	lymphoma.mat	110185	(96, 4026)	3	(96,)	9	0.916667	0.829167	0.708333	0.169875
2	RELATHE.mat	226918	(1427, 4322)	2	(1427,)	2	0.921569	0.898880	0.876751	1.218853
14	colon.mat	36319	(62, 2000)	3	(62,)	2	0.937500	0.768750	0.687500	0.135427
7	warpAR10P.mat	279711	(130, 2400)	63	(130,)	10	0.939394	0.851515	0.757576	0.274956
1	PCMAC.mat	191131	(1943, 3289)	4	(1943,)	2	0.944444	0.922634	0.899177	1.491283
4	ORL.mat	376584	(400, 1024)	151	(400,)	40	0.950000	0.921000	0.830000	1.216780
15	GLI_85.mat	8743262	(85, 22283)	85	(85,)	2	0.954545	0.863636	0.772727	0.269521
27	Isolet.mat	3652673	(1560, 617)	1340	(1560,)	26	0.956410	0.938205	0.905128	2.222803
18	lung.mat	4762671	(203, 3312)	203	(203,)	5	0.960784	0.929412	0.882353	0.380843
22	Prostate_GE.mat	1524983	(102, 5966)	29	(102,)	2	0.961538	0.900000	0.807692	0.207986
10	USPS.mat	15138167	(9298, 256)	1617	(9298,)	10	0.965161	0.960258	0.955699	9.295629
26	gisette.mat	10619742	(7000, 5000)	345	(7000,)	2	0.974286	0.968971	0.961714	9.597926
12	Carcinom.mat	6917199	(174, 9182)	156	(174,)	11	0.977273	0.868182	0.772727	0.557979
0	BASEHOCK.mat	279059	(1993, 4862)	2	(1993,)	2	0.985972	0.974349	0.965932	1.789281
3	COIL20.mat	3024549	(1440, 1024)	10	(1440,)	20	1.000000	0.998889	0.994444	1.873450
11	ALLAML.mat	3639219	(72, 7129)	66	(72,)	2	1.000000	0.938889	0.833333	0.183536
6	pixraw10P.mat	520463	(100, 10000)	11	(100,)	10	1.000000	0.972000	0.920000	0.338596
17	leukemia.mat	154743	(72, 7070)	3	(72,)	2	1.000000	0.950000	0.777778	0.155346
8	warpPIE10P.mat	458267	(210, 2420)	36	(210,)	10	1.000000	0.962264	0.924528	0.410544
5	orlraws10P.mat	951783	(100, 10304)	46	(100,)	10	1.000000	0.988000	0.960000	0.415471
19	lung_discrete.mat	7516	(73, 325)	3	(73,)	7	1.000000	0.800000	0.526316	0.131734
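When reading the accuracies above, it can help to compare them with a trivial baseline, because the number of classes ranges from 2 to 40 and an accuracy of 0.43 means something different with 9 classes than with 2. The following is a minimal sketch and not part of the original analysis. `majority_baseline` and the toy labels are hypothetical, and the baseline here is computed on all labels rather than on a test split.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Hypothetical labels: 6 samples of class 1, 3 of class 2, 1 of class 3.
y_toy = [1] * 6 + [2] * 3 + [3]
print(majority_baseline(y_toy))  # 0.6
```

Applied to a real dataset, `majority_baseline(y.tolist())` would give a floor against which RF_mean can be judged.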

I thought it would be boring to solve problems that are too easy, so I sorted the table in ascending order of RF_max, which puts the hardest datasets first.
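Sorting is one way to narrow things down. Filtering on several columns at once is another. As a sketch, the snippet below rebuilds a few rows of the table by hand and keeps the datasets that are multi-class, not already close to solved, and quick to fit. The thresholds are arbitrary assumptions chosen for illustration.

```python
import pandas as pd

# A few rows copied from the result table.
table = pd.DataFrame(
    {
        "dataset": ["nci9.mat", "madelon.mat", "Yale.mat", "USPS.mat", "COIL20.mat"],
        "n_class": [9, 2, 15, 10, 20],
        "RF_mean": [0.433333, 0.707385, 0.769048, 0.960258, 0.998889],
        "sec": [0.183649, 2.456003, 0.306511, 9.295629, 1.873450],
    }
)

# Hypothetical criteria: more than two classes, mean accuracy below 0.9,
# and an average fit time under one second.
candidates = table[
    (table["n_class"] > 2) & (table["RF_mean"] < 0.9) & (table["sec"] < 1.0)
].sort_values("RF_mean")

print(candidates["dataset"].tolist())  # ['nci9.mat', 'Yale.mat']
```

With the full table, `pd.DataFrame(result)` can take the place of the hand-built frame.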

I hope it will be helpful when choosing a dataset.