I've recently started playing with xgboost, but honestly, setting up a machine-learning environment is a chore.
Even if you write up the procedure or turn it into a script, you later forget where you put that file. With EC2, spot instances let you keep costs down and run several machines in parallel with different parameters, but if the setup relies on a written procedure, you quickly lose track of which terminal is running which parameter set.
For that kind of workflow, building "something that sets up the whole environment with one button, or one command, via AWS Lambda" makes it much easier to work ad hoc and keep making progress.
Below I describe what I tried: launching an EC2 spot instance via Lambda and getting a working xgboost environment in one shot.
Lambda function
Code
Paste the following over Lambda's hello-world template, then make the settings described below.
Since it is long, I have also posted it as a gist: https://gist.github.com/pyr-revs/d768984ed68500bdbeb9
console.log('Loading function');
var ec2Region = 'ap-northeast-1';
var s3Region = ec2Region;
var snsRegion = ec2Region;
var s3Bucket = 'mybucket';
var shellScriptS3Key = 'sh/launch_xgboost.sh';
var shellScriptS3Path = 's3://' + s3Bucket + '/' + shellScriptS3Key;
var iamInstanceProfile = 'my_ec2_role';
var availabilityZone = ec2Region + 'a';
var spotPrice = '0.1';
var imageId = 'ami-9a2fb89a';
var instanceType = 'c3.2xlarge';
var securityGroup = 'launch-wizard-1';
var keyName = 'my_ssh_keypair';
var userData = (function () {/*#!/bin/bash
tmp=/root/sudoers_tmp
cat /etc/sudoers > $tmp
cat >> $tmp <<EOF
Defaults:ec2-user !requiretty
EOF
cat $tmp > /etc/sudoers
yum -y update
yum groupinstall -y "Development tools"
yum -y install gcc-c++ python27-devel atlas-sse3-devel lapack-devel
pip install numpy
pip install scipy
pip install pandas
aws s3 cp %s /home/ec2-user/launch_xgboost.sh
chown ec2-user /home/ec2-user/launch_xgboost.sh
chmod +x /home/ec2-user/launch_xgboost.sh
su - ec2-user /home/ec2-user/launch_xgboost.sh
*/}).toString().match(/[^]*\/\*([^]*)\*\/\}$/)[1];
var shellScriptContents = (function () {/*#!/bin/bash
git clone --recursive https://github.com/dmlc/xgboost.git
cd xgboost
./build.sh > build.log 2>&1
cd python-package
sudo -s python setup.py install > setup.log 2>&1
export AWS_DEFAULT_REGION=%s
aws sns publish --topic-arn arn:aws:sns:ap-northeast-1:xxxxxxxxxxxx:My-Sns-Topic --subject "Launch xgboost Done" --message "Launch xgboost Done!!"
*/}).toString().match(/[^]*\/\*([^]*)\*\/\}$/)[1];
exports.handler = function(event, context) {
    var util = require('util');
    var AWS = require('aws-sdk');

    // Write the xgboost launch script to S3
    AWS.config.region = s3Region;
    var shellScriptContentsFormatted = util.format(shellScriptContents, snsRegion);
    var s3 = new AWS.S3();
    var s3Params = {Bucket: s3Bucket, Key: shellScriptS3Key, Body: shellScriptContentsFormatted};
    var s3Options = {partSize: 10 * 1024 * 1024, queueSize: 1};
    //console.log(shellScriptContentsFormatted);
    s3.upload(s3Params, s3Options, function(err, data) {
        if (err) {
            console.log(err, err.stack);
            context.fail('[Fail]');
        }
        else {
            console.log(data);
            // Launch an EC2 spot instance with the UserData
            var userDataFormatted = util.format(userData, shellScriptS3Path);
            var userDataBase64 = new Buffer(userDataFormatted).toString('base64');
            var ec2LaunchParams = {
                SpotPrice: spotPrice,
                LaunchSpecification: {
                    IamInstanceProfile: {
                        Name: iamInstanceProfile
                    },
                    ImageId: imageId,
                    InstanceType: instanceType,
                    KeyName: keyName,
                    Placement: {
                        AvailabilityZone: availabilityZone
                    },
                    SecurityGroups: [
                        securityGroup
                    ],
                    UserData: userDataBase64
                }
            };
            //console.log(ec2LaunchParams);
            AWS.config.region = ec2Region;
            var ec2 = new AWS.EC2();
            ec2.requestSpotInstances(ec2LaunchParams, function(err, data) {
                if (err) {
                    console.log(err, err.stack);
                    context.fail('[Fail]');
                }
                else {
                    console.log(data);
                    context.succeed('[Succeed]');
                }
            });
        }
    });
};
This is basically an extension of the following article:
Create a spot instance with AWS Lambda and let UserData automatically execute environment construction and long processing http://qiita.com/pyr_revs/items/c7f33d1cdee3118590e3
Lambda's default timeout is only 3 seconds, so raise it to 60 seconds. The default 128 MB of memory is fine as-is.
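If you prefer the CLI, the same settings can be made there. A sketch (the function name here is a placeholder; substitute your own):

```shell
# Raise the timeout to 60 s; keep the default 128 MB of memory.
# "launch-xgboost-spot" is a hypothetical function name.
aws lambda update-function-configuration \
    --function-name launch-xgboost-spot \
    --timeout 60 \
    --memory-size 128
```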
IAM Role for Lambda
For the IAM role that the Lambda function runs under, you need a role with permission to launch EC2 instances and to access S3. In principle, attaching the managed policies "AmazonEC2FullAccess" and "AmazonS3FullAccess" should be enough, but for some reason writes to S3 kept failing, so I reluctantly attached "AdministratorAccess"...
Create an IAM Role for your users-Getting started with AWS Lambda (1). Handle events from user applications http://dev.classmethod.jp/cloud/aws/getting-started-with-aws-lambda-1st-user-application/#prepare-iam-role
For the role attached to the EC2 instance profile, you need a role with permission to access S3 and SNS; attaching the managed policies "AmazonS3FullAccess" and "AmazonSNSFullAccess" will do. Once it is created, rewrite the following part:
var iamInstanceProfile = 'my_ec2_role';
Grant permissions to applications running on Amazon EC2 instances using IAM roles http://docs.aws.amazon.com/ja_jp/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html
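As a rough sketch, the instance-profile role could also be set up from the CLI like this. The role name matches the one in the code above, but the trust-policy file name is a placeholder (it should allow ec2.amazonaws.com to assume the role):

```shell
# trust-ec2.json is assumed to contain a trust policy for ec2.amazonaws.com.
aws iam create-role --role-name my_ec2_role \
    --assume-role-policy-document file://trust-ec2.json
aws iam attach-role-policy --role-name my_ec2_role \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam attach-role-policy --role-name my_ec2_role \
    --policy-arn arn:aws:iam::aws:policy/AmazonSNSFullAccess
# An instance profile of the same name, with the role added to it.
aws iam create-instance-profile --instance-profile-name my_ec2_role
aws iam add-role-to-instance-profile --instance-profile-name my_ec2_role \
    --role-name my_ec2_role
```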
SNS Topic
Create a suitable SNS topic; it will be used to notify you when the xgboost installation finishes. Then rewrite the ARN in the following line at the end of the shell-script section:
aws sns publish --topic-arn arn:aws:sns:ap-northeast-1:xxxxxxxxxxxx:My-Sns-Topic --subject "Launch xgboost Done" --message "Launch xgboost Done!!"
It's not essential, but if you add an email subscription to the topic, you can follow progress on your phone even when you're away from the PC, so I recommend it.
Get started with Amazon Simple Notification Service https://docs.aws.amazon.com/ja_jp/sns/latest/dg/GettingStarted.html
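For reference, a sketch of creating the topic and the email subscription from the CLI (the topic name matches the script above; the email address is a placeholder, and the subscription must be confirmed from the mail that arrives):

```shell
# Create the topic; the command prints the topic ARN to use in the script.
aws sns create-topic --name My-Sns-Topic --region ap-northeast-1
# Subscribe an email address (placeholder); confirm via the confirmation mail.
aws sns subscribe \
    --topic-arn arn:aws:sns:ap-northeast-1:xxxxxxxxxxxx:My-Sns-Topic \
    --protocol email \
    --notification-endpoint you@example.com \
    --region ap-northeast-1
```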
The remaining settings are self-explanatory, so I'll skip the details; change them as needed. The S3 bucket name and key pair, however, must be changed. The imageId is "Amazon Linux AMI 2015.09 / HVM (SSD) EBS-Backed 64-bit" for the Asia Pacific (Tokyo) region, so it has to be changed if you use a different region, and again whenever an AMI update comes out.
var s3Bucket = 'mybucket';
var availabilityZone = ec2Region + 'a';
var spotPrice = '0.1';
var imageId = 'ami-9a2fb89a';
var instanceType = 'c3.2xlarge';
var securityGroup = 'launch-wizard-1';
var keyName = 'my_ssh_keypair';
The easiest way to run it is from Lambda's Test button. Of course, you can also invoke it from the AWS CLI.
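A sketch of the CLI route (the function name is a placeholder, as above):

```shell
# Invoke the function and capture its return value in out.json.
aws lambda invoke \
    --function-name launch-xgboost-spot \
    --region ap-northeast-1 \
    out.json
# out.json should contain [Succeed] if the spot request was accepted.
cat out.json
```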
After confirming that the spot instance request has been accepted, wait a while; roughly 15-20 minutes in my case. Installing numpy, scipy, and pandas takes quite a long time.
Once the EC2 instance is up and xgboost is installed, log in via ssh and run the binary_classification demo to verify everything works.
cd /home/ec2-user/xgboost/demo/binary_classification
./runexp.sh
It finished almost immediately, with the following output:
6513x126 matrix with 143286 entries is loaded from agaricus.txt.train
6513x126 matrix with 143286 entries is saved to agaricus.txt.train.buffer
1611x126 matrix with 35442 entries is loaded from agaricus.txt.test
1611x126 matrix with 35442 entries is saved to agaricus.txt.test.buffer
boosting round 0, 0 sec elapsed
tree prunning end, 1 roots, 12 extra nodes, 0 pruned nodes ,max_depth=3
[0] test-error:0.016139 train-error:0.014433
boosting round 1, 0 sec elapsed
tree prunning end, 1 roots, 10 extra nodes, 0 pruned nodes ,max_depth=3
[1] test-error:0.000000 train-error:0.001228
updating end, 0 sec in all
1611x126 matrix with 35442 entries is loaded from agaricus.txt.test.buffer
start prediction...
writing prediction to pred.txt
...(snip)...
booster[1]:
0:[odor=none] yes=2,no=1
1:[bruises?=bruises] yes=4,no=3
3:leaf=1.1457
4:[gill-spacing=close] yes=8,no=7
7:leaf=-6.87558
8:leaf=-0.127376
2:[spore-print-color=green] yes=6,no=5
5:[gill-size=broad] yes=10,no=9
9:leaf=-0.0386054
10:leaf=-1.15275
6:leaf=0.994744
To watch something that runs a bit longer, let's try the kaggle-higgs demo. The data has to be downloaded from Kaggle separately and copied to the EC2 instance.
cd /home/ec2-user/xgboost/demo/kaggle-higgs
mkdir data
# Copy training.csv and test.csv
./run.sh
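The copy in the comment above can be done with scp, for example (run from your local machine; the host and key path are placeholders):

```shell
# Replace the key path and public DNS name with your own instance's values.
scp -i ~/.ssh/my_ssh_keypair.pem \
    training.csv test.csv \
    ec2-user@<instance-public-dns>:/home/ec2-user/xgboost/demo/kaggle-higgs/data/
```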
The result; this also finishes in under a minute:
[0] train-auc:0.910911 train-ams@0.15:3.699574
[1] train-auc:0.915308 train-ams@0.15:3.971228
[2] train-auc:0.917743 train-ams@0.15:4.067463
...(snip)...
[118] train-auc:0.945648 train-ams@0.15:5.937291
[119] train-auc:0.945800 train-ams@0.15:5.935622
finish training
finish loading from csv
So, as far as I can tell, it is working without any problems.
UserData Section
yum groupinstall -y "Development tools"
yum -y install gcc-c++ python27-devel atlas-sse3-devel lapack-devel
Development tools such as gcc, plus BLAS/LAPACK and friends, are required, so they are installed here. I referred to the following:
Installing scikit-learn on Amazon EC2 http://dacamo76.com/blog/2012/12/07/installing-scikit-learn-on-amazon-ec2/
pip install numpy
pip install scipy
pip install pandas
numpy and scipy are required. pandas takes a long time to install, which gave me pause, but the pandas integration has seen various improvements recently, so I think it is worth including.
Improved Python XGBoost + pandas integration http://sinhrks.hatenablog.com/entry/2015/10/03/080719
/etc/sudoers
tmp=/root/sudoers_tmp
cat /etc/sudoers > $tmp
cat >> $tmp <<EOF
Defaults:ec2-user !requiretty
EOF
cat $tmp > /etc/sudoers
On EC2 (or rather, on this AMI), sudo by anyone other than root requires a tty by default, which means sudo cannot be used from within a shell script. That becomes a problem later, so the script appends Defaults:ec2-user !requiretty to the end of /etc/sudoers so that ec2-user can run sudo from a script.
This is done rather crudely, on the assumption that the instance will be terminated soon after the job finishes. I referred to the following:
Omit password entry required many times http://qiita.com/yuku_t/items/5f995bb0dfc894c9f2df
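The same copy-append-copy pattern can be tried safely on throwaway temp files instead of the real /etc/sudoers:

```shell
# Demonstrate the append pattern on a scratch file, not the real sudoers.
src=$(mktemp)
tmp=$(mktemp)
echo '# pretend this is /etc/sudoers' > "$src"
# Copy the original, append the new directive via a heredoc, copy back.
cat "$src" > "$tmp"
cat >> "$tmp" <<EOF
Defaults:ec2-user !requiretty
EOF
cat "$tmp" > "$src"
grep -c 'requiretty' "$src"   # prints 1
```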
ShellScript Section
git clone --recursive https://github.com/dmlc/xgboost.git
cd xgboost
./build.sh > build.log 2>&1
cd python-package
sudo -s python setup.py install > setup.log 2>&1
xgboost is installed from source: first xgboost/build.sh builds the binary, then xgboost/python-package/setup.py installs the Python package.
python setup.py install requires sudo. If you skip the /etc/sudoers change above, you will get stuck on "sudo: sorry, you must have a tty to run sudo".
If you want the instance to terminate itself when the job finishes, or want to run batches overnight, the following may be helpful:
Cron Lambda from AWS Data Pipeline (without launching EC2) http://qiita.com/pyr_revs/items/d2ec88a8fafeace7da4a
Create a spot instance with AWS Lambda and let UserData automatically execute environment construction and long processing http://qiita.com/pyr_revs/items/c7f33d1cdee3118590e3