Parameter tuning with luigi

I've recently been using a data-flow control framework called luigi. I find it fairly easy to use, so I'm writing a short piece to spread the word. (There isn't much material on it in Japanese...)

For a detailed description of luigi itself, please refer to these articles: http://qiita.com/colspan/items/453aeec7f4f420b91241 http://qiita.com/keisuke-nakata/items/0717c0c358658964f81e

To briefly explain its appeal: luigi runs each subclass of luigi.Task (= task) to completion, one after another, and so obtains the overall result. Because data is passed between tasks only through files, the parts already computed are preserved even if a bug surfaces midway or a computation time limit is exceeded, and the run can be resumed from there. (Maybe data cannot be passed in memory...? → Addendum: see luigi.mock)
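
The resume behaviour comes down to a simple rule: a step whose output file already exists is skipped. Below is a minimal sketch of that idea in plain Python. It is not luigi's implementation, and the helper name run_once and the file names are made up for illustration.

```python
import os
import tempfile


def run_once(path, compute):
    """Run compute() only if `path` does not exist yet; return the file's content."""
    if not os.path.exists(path):
        result = compute()
        # Write to a temporary name first, then rename, so a crash mid-write
        # never leaves a half-written file that looks "done".
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as f:
            f.write(str(result))
        os.replace(tmp_path, path)
    with open(path) as f:
        return f.read()


calls = []  # records how many times the expensive step really ran


def expensive_step():
    calls.append(1)
    return 42


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "step1.txt")

first = run_once(target, expensive_step)   # computes and writes the file
second = run_once(target, expensive_step)  # file exists, so nothing is recomputed
```

If the process dies after step1.txt has been written, the next run picks up from the file instead of recomputing it.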

Machine learning with luigi

During parameter tuning, it is painful to redo every calculation when everything is held in memory and the process dies partway through. luigi's strengths seemed like a good fit here, so I wrote some code as a first attempt. https://github.com/keisuke-yanagisawa/study/blob/20151204/luigi/param_tuning.py

You need these three packages:

numpy
scikit-learn
luigi

Run this program with

python param_tuning.py task_param_tuning --local-scheduler

The argument task_param_tuning names the root task. luigi is normally used with a scheduler process running, but since that is a hassle, I run it standalone with --local-scheduler.

Let's take a look at the tasks.

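import luigi
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

class task_param_eval(luigi.Task):
    data = luigi.Parameter()
    C = luigi.FloatParameter()
    gamma = luigi.FloatParameter()

    def requires(self):
        return []
    def output(self):
        # Put C and gamma in the name in order, so that swapped values
        # (C=1, gamma=2 vs C=2, gamma=1) get different files.
        return luigi.LocalTarget("temp/C%s_gamma%s.txt" % (self.C, self.gamma))
    def run(self):
        # self.data is only a hashable stand-in, so load the dataset here.
        dataset = datasets.load_diabetes()
        model = svm.SVR(C=self.C, gamma=self.gamma)

        # cross_val_score returns a "score" (higher is better), not an error,
        # so negate it to get the mean absolute error.
        # Older scikit-learn: cross_validation.cross_val_score with
        # scoring="mean_absolute_error".
        results = -cross_val_score(model, dataset.data, dataset.target, scoring="neg_mean_absolute_error")
        with self.output().open("w") as out_file:
            out_file.write(str(np.mean(results)))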

The code itself is quite simple, isn't it? It computes an evaluation value by cross-validating an SVR and writes the average to a file.
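
The sign flip in run is easy to get wrong, so here is a small sketch of the convention in plain Python. scikit-learn scorers follow "higher is better", so an error metric such as mean absolute error is reported negated, and negating it again gives the error back. The numbers and helper names below are made up for illustration.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted|; lower is better."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def score(y_true, y_pred):
    """'Higher is better' convention: the error with its sign flipped."""
    return -mean_absolute_error(y_true, y_pred)


# Made-up targets and predictions for two folds.
folds = [
    ([3.0, 5.0], [2.0, 7.0]),   # absolute errors 1 and 2
    ([1.0, 4.0], [1.5, 4.5]),   # absolute errors 0.5 and 0.5
]

scores = [score(y_true, y_pred) for y_true, y_pred in folds]
errors = [-s for s in scores]            # flip the sign back, as in run()
mean_error = sum(errors) / len(errors)   # the kind of value written to the file
```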

Keep in mind that a luigi task basically overrides a set of three methods: requires, output, and run.

requires ... tasks that must be completed before this task can run
output ... output file path(s) of this task (more than one can be specified)
run ... the work the task performs

The output file path is wrapped in luigi.LocalTarget(), which you can treat as a magic incantation.
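
To make the roles of the three methods concrete, here is a toy sketch in plain Python of how a runner might use them. It is an illustration under my own assumptions, not luigi's code: ToyTask, build, and the two example tasks are hypothetical names, and real luigi adds scheduling, targets, and error handling on top.

```python
import os
import tempfile

WORKDIR = tempfile.mkdtemp()


class ToyTask:
    """Hypothetical stand-in for a task: subclasses override the three methods."""

    def requires(self):
        return []                  # tasks that must finish first

    def output(self):
        raise NotImplementedError  # path of the file this task produces

    def run(self):
        raise NotImplementedError  # the actual work


class WriteNumber(ToyTask):
    def output(self):
        return os.path.join(WORKDIR, "number.txt")

    def run(self):
        with open(self.output(), "w") as f:
            f.write("21")


class DoubleNumber(ToyTask):
    def requires(self):
        return [WriteNumber()]

    def output(self):
        return os.path.join(WORKDIR, "doubled.txt")

    def run(self):
        # Read the dependency's output file, then write our own.
        with open(self.requires()[0].output()) as f:
            value = int(f.read())
        with open(self.output(), "w") as f:
            f.write(str(value * 2))


executed = []  # names of tasks whose run() was actually called, in order


def build(task):
    """Run dependencies first, then the task itself, unless its output exists."""
    if os.path.exists(task.output()):
        return
    for dep in task.requires():
        build(dep)
    task.run()
    executed.append(type(task).__name__)


build(DoubleNumber())
```

Only the root task is handed to build; the dependency is discovered through requires and runs first.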

Task arguments are declared with luigi.Parameter() and its variants. Internally, luigi seems to look at these parameters to decide what counts as the same task: tasks with the same name but different parameters are each executed, while a task with the same name and the same parameters is not executed twice. (This is why parameter values are required to be hashable.)
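
The behaviour described above can be pictured as a lookup table keyed by task name and parameter values. The sketch below is plain Python and only illustrates the idea; task_key and run_task are hypothetical helpers, not luigi functions.

```python
def task_key(name, **params):
    """Identity of a task: its name plus its parameter values."""
    return (name, tuple(sorted(params.items())))


done = {}   # task key -> result; plays the role of "already executed"
calls = []  # records each real evaluation


def run_task(name, func, **params):
    key = task_key(name, **params)   # used as a dict key, so it must be hashable
    if key not in done:
        done[key] = func(**params)
    return done[key]


def evaluate(C, gamma):
    calls.append((C, gamma))
    return C * gamma  # placeholder for a real evaluation


run_task("task_param_eval", evaluate, C=1.0, gamma=2.0)
run_task("task_param_eval", evaluate, C=1.0, gamma=2.0)  # same key: not re-run
run_task("task_param_eval", evaluate, C=2.0, gamma=1.0)  # different key: runs

# A list is not hashable, so a key containing one cannot be looked up.
try:
    hash(task_key("task_param_eval", C=[1.0, 2.0]))
    list_is_hashable = True
except TypeError:
    list_is_hashable = False
```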

Next, let's look at a task that calls the above task multiple times.

class task_param_tuning(luigi.Task):

    cost_list = luigi.Parameter(default="1,2,5,10")
    gamma_list = luigi.Parameter(default="1,2,5,10")
    
    data = datasets.load_diabetes()

    def requires(self):
        # One evaluation task per (C, gamma) pair.
        return [task_param_eval(data=frozenset(self.data), # values should be hashable
                                C=float(C), gamma=float(gamma))
                for C in self.cost_list.split(",")
                for gamma in self.gamma_list.split(",")]
    def output(self):
        return luigi.LocalTarget("results.csv")
    def run(self):

        results = {}

        for task in self.requires():
            with task.output().open() as taskfile:
                results[(task.C, task.gamma)] = float(taskfile.read())
        
        best_key = min(results,  key=results.get)
        with self.output().open("w") as out_file:
            out_file.write("%s,%s,%.4f\n" %(best_key[0], best_key[1], results[best_key]))

I haven't studied enough to know how to pass multiple values as one parameter (I'll probably get scolded for this), so for now I pass them as comma-separated strings; please overlook that. In this code I wanted access to the C and gamma parameters of each task_param_eval, so run loops with for task in self.requires(). If you only need to read the files produced by the required tasks, you can use self.input(), which gives the same result as calling output() on each task returned by self.requires().
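
The comma-separated strings can be turned into the full grid of (C, gamma) pairs with itertools.product. This is a small standard-library sketch; parse_floats is a hypothetical helper, and the two strings mirror the defaults used in task_param_tuning.

```python
from itertools import product


def parse_floats(text):
    """Split a comma-separated string such as "1,2,5,10" into floats."""
    return [float(item) for item in text.split(",") if item.strip()]


cost_list = "1,2,5,10"
gamma_list = "1,2,5,10"

# Every combination of C and gamma, with C varying slowest.
grid = list(product(parse_floats(cost_list), parse_floats(gamma_list)))
```

Each pair in grid would then correspond to one task_param_eval instance returned from requires.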