I've recently started playing with xgboost, but honestly, setting up a machine-learning environment is a chore.
Even if you write up the procedure or turn it into a script, you later forget where you put that file. With EC2, spot instances let you keep costs down and run several machines in parallel with different parameters, but if the setup relies on a written procedure, you quickly lose track of which terminal is running which parameter set.
For that kind of workflow, building "something that sets up the whole environment with one button, or one command, via AWS Lambda" makes it much easier to work ad hoc and keep making progress.
Below I describe what I tried: launching an EC2 spot instance via Lambda and getting a working xgboost environment in one shot.
Lambda function
Code
Paste the following over Lambda's hello-world template, then make the settings described below.
Since it is long, I have also posted it as a gist: https://gist.github.com/pyr-revs/d768984ed68500bdbeb9
console.log('Loading function');
var ec2Region = 'ap-northeast-1';
var s3Region = ec2Region;
var snsRegion = ec2Region;
var s3Bucket = 'mybucket';
var shellScriptS3Key = 'sh/launch_xgboost.sh';
var shellScriptS3Path = 's3://' + s3Bucket + '/' + shellScriptS3Key;
var iamInstanceProfile = 'my_ec2_role';
var availabilityZone = ec2Region + 'a';
var spotPrice = '0.1';
var imageId = 'ami-9a2fb89a';
var instanceType = 'c3.2xlarge';
var securityGroup = 'launch-wizard-1';
var keyName = 'my_ssh_keypair';
var userData = (function () {/*#!/bin/bash
tmp=/root/sudoers_tmp
cat /etc/sudoers > $tmp
cat >> $tmp <<EOF
Defaults:ec2-user !requiretty
EOF
cat $tmp > /etc/sudoers
yum -y update
yum groupinstall -y "Development tools"
yum -y install gcc-c++ python27-devel atlas-sse3-devel lapack-devel
pip install numpy
pip install scipy
pip install pandas
aws s3 cp %s /home/ec2-user/launch_xgboost.sh
chown ec2-user /home/ec2-user/launch_xgboost.sh
chmod +x /home/ec2-user/launch_xgboost.sh
su - ec2-user /home/ec2-user/launch_xgboost.sh
*/}).toString().match(/[^]*\/\*([^]*)\*\/\}$/)[1];
var shellScriptContents = (function () {/*#!/bin/bash
git clone --recursive https://github.com/dmlc/xgboost.git
cd xgboost
./build.sh > build.log 2>&1
cd python-package
sudo -s python setup.py install > setup.log 2>&1
export AWS_DEFAULT_REGION=%s
aws sns publish --topic-arn arn:aws:sns:ap-northeast-1:xxxxxxxxxxxx:My-Sns-Topic --subject "Launch xgboost Done" --message "Launch xgboost Done!!"
*/}).toString().match(/[^]*\/\*([^]*)\*\/\}$/)[1];
exports.handler = function(event, context) {
    var util = require('util');
    var AWS = require('aws-sdk');

    // Write the xgboost launch script to S3
    AWS.config.region = s3Region;
    var shellScriptContentsFormatted = util.format(shellScriptContents, snsRegion);
    var s3 = new AWS.S3();
    var s3Params = {Bucket: s3Bucket, Key: shellScriptS3Key, Body: shellScriptContentsFormatted};
    var s3Options = {partSize: 10 * 1024 * 1024, queueSize: 1};
    //console.log(shellScriptContentsFormatted);
    s3.upload(s3Params, s3Options, function(err, data) {
        if (err) {
            console.log(err, err.stack);
            context.fail('[Fail]');
        }
        else {
            console.log(data);
            // Launch an EC2 spot instance with the UserData
            var userDataFormatted = util.format(userData, shellScriptS3Path);
            var userDataBase64 = new Buffer(userDataFormatted).toString('base64');
            var ec2LaunchParams = {
                SpotPrice: spotPrice,
                LaunchSpecification: {
                    IamInstanceProfile: {
                        Name: iamInstanceProfile
                    },
                    ImageId: imageId,
                    InstanceType: instanceType,
                    KeyName: keyName,
                    Placement: {
                        AvailabilityZone: availabilityZone
                    },
                    SecurityGroups: [
                        securityGroup
                    ],
                    UserData: userDataBase64
                }
            };
            //console.log(ec2LaunchParams);
            AWS.config.region = ec2Region;
            var ec2 = new AWS.EC2();
            ec2.requestSpotInstances(ec2LaunchParams, function(err, data) {
                if (err) {
                    console.log(err, err.stack);
                    context.fail('[Fail]');
                }
                else {
                    console.log(data);
                    context.succeed('[Succeed]');
                }
            });
        }
    });
};
This is basically an extension of the following article:
Create a spot instance with AWS Lambda and let UserData automatically execute environment construction and long processing http://qiita.com/pyr_revs/items/c7f33d1cdee3118590e3
Lambda's default timeout is only 3 seconds, so raise it to 60 seconds. The default 128 MB of memory is fine as-is.
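If you prefer the CLI, the same settings can be made there. A sketch (the function name here is a placeholder; substitute your own):

```shell
# Raise the timeout to 60 s; keep the default 128 MB of memory.
# "launch-xgboost-spot" is a hypothetical function name.
aws lambda update-function-configuration \
    --function-name launch-xgboost-spot \
    --timeout 60 \
    --memory-size 128
```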
IAM Role for Lambda
For the IAM role that the Lambda function runs under, you need a role with permission to launch EC2 instances and to access S3. In principle, attaching the managed policies "AmazonEC2FullAccess" and "AmazonS3FullAccess" should be enough, but for some reason writes to S3 kept failing, so I reluctantly attached "AdministratorAccess"...
Create an IAM Role for your users-Getting started with AWS Lambda (1). Handle events from user applications http://dev.classmethod.jp/cloud/aws/getting-started-with-aws-lambda-1st-user-application/#prepare-iam-role
For the role attached to the EC2 instance profile, you need a role with permission to access S3 and SNS; attaching the managed policies "AmazonS3FullAccess" and "AmazonSNSFullAccess" will do. Once it is created, rewrite the following part:
var iamInstanceProfile = 'my_ec2_role';
Grant permissions to applications running on Amazon EC2 instances using IAM roles http://docs.aws.amazon.com/ja_jp/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html
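As a rough sketch, the instance-profile role could also be set up from the CLI like this. The role name matches the one in the code above, but the trust-policy file name is a placeholder (it should allow ec2.amazonaws.com to assume the role):

```shell
# trust-ec2.json is assumed to contain a trust policy for ec2.amazonaws.com.
aws iam create-role --role-name my_ec2_role \
    --assume-role-policy-document file://trust-ec2.json
aws iam attach-role-policy --role-name my_ec2_role \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam attach-role-policy --role-name my_ec2_role \
    --policy-arn arn:aws:iam::aws:policy/AmazonSNSFullAccess
# An instance profile of the same name, with the role added to it.
aws iam create-instance-profile --instance-profile-name my_ec2_role
aws iam add-role-to-instance-profile --instance-profile-name my_ec2_role \
    --role-name my_ec2_role
```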
SNS Topic
Create a suitable SNS topic; it will be used to notify you when the xgboost installation finishes. Then rewrite the ARN in the following line at the end of the shell-script section:
aws sns publish --topic-arn arn:aws:sns:ap-northeast-1:xxxxxxxxxxxx:My-Sns-Topic --subject "Launch xgboost Done" --message "Launch xgboost Done!!"
It's not essential, but if you add an email subscription to the topic, you can follow progress on your phone even when you're away from the PC, so I recommend it.
Get started with Amazon Simple Notification Service https://docs.aws.amazon.com/ja_jp/sns/latest/dg/GettingStarted.html
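For reference, a sketch of creating the topic and the email subscription from the CLI (the topic name matches the script above; the email address is a placeholder, and the subscription must be confirmed from the mail that arrives):

```shell
# Create the topic; the command prints the topic ARN to use in the script.
aws sns create-topic --name My-Sns-Topic --region ap-northeast-1
# Subscribe an email address (placeholder); confirm via the confirmation mail.
aws sns subscribe \
    --topic-arn arn:aws:sns:ap-northeast-1:xxxxxxxxxxxx:My-Sns-Topic \
    --protocol email \
    --notification-endpoint you@example.com \
    --region ap-northeast-1
```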
The remaining settings are self-explanatory, so I'll skip the details; change them as needed. The S3 bucket name and key pair, however, must be changed. The imageId is "Amazon Linux AMI 2015.09 / HVM (SSD) EBS-Backed 64-bit" for the Asia Pacific (Tokyo) region, so it has to be changed if you use a different region, and again whenever an AMI update comes out.
var s3Bucket = 'mybucket';
var availabilityZone = ec2Region + 'a';
var spotPrice = '0.1';
var imageId = 'ami-9a2fb89a';
var instanceType = 'c3.2xlarge';
var securityGroup = 'launch-wizard-1';
var keyName = 'my_ssh_keypair';
The easiest way to run it is from Lambda's Test button. Of course, you can also invoke it from the AWS CLI.
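A sketch of the CLI route (the function name is a placeholder, as above):

```shell
# Invoke the function and capture its return value in out.json.
aws lambda invoke \
    --function-name launch-xgboost-spot \
    --region ap-northeast-1 \
    out.json
# out.json should contain [Succeed] if the spot request was accepted.
cat out.json
```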
After confirming that the spot instance request has been accepted, wait a while; roughly 15-20 minutes in my case. Installing numpy, scipy, and pandas takes quite a long time.
Once the EC2 instance is up and xgboost is installed, log in via ssh and run the binary_classification demo to verify everything works.
cd /home/ec2-user/xgboost/demo/binary_classification
./runexp.sh
It finished almost immediately, with the following output:
6513x126 matrix with 143286 entries is loaded from agaricus.txt.train
6513x126 matrix with 143286 entries is saved to agaricus.txt.train.buffer
1611x126 matrix with 35442 entries is loaded from agaricus.txt.test
1611x126 matrix with 35442 entries is saved to agaricus.txt.test.buffer
boosting round 0, 0 sec elapsed
tree prunning end, 1 roots, 12 extra nodes, 0 pruned nodes ,max_depth=3
[0] test-error:0.016139 train-error:0.014433
boosting round 1, 0 sec elapsed
tree prunning end, 1 roots, 10 extra nodes, 0 pruned nodes ,max_depth=3
[1] test-error:0.000000 train-error:0.001228
updating end, 0 sec in all
1611x126 matrix with 35442 entries is loaded from agaricus.txt.test.buffer
start prediction...
writing prediction to pred.txt
...(snip)...
booster[1]:
0:[odor=none] yes=2,no=1
1:[bruises?=bruises] yes=4,no=3
3:leaf=1.1457
4:[gill-spacing=close] yes=8,no=7
7:leaf=-6.87558
8:leaf=-0.127376
2:[spore-print-color=green] yes=6,no=5
5:[gill-size=broad] yes=10,no=9
9:leaf=-0.0386054
10:leaf=-1.15275
6:leaf=0.994744
To watch something that runs a bit longer, let's try the kaggle-higgs demo. The data has to be downloaded from Kaggle separately and copied to the EC2 instance.
cd /home/ec2-user/xgboost/demo/kaggle-higgs
mkdir data
# Copy training.csv and test.csv
./run.sh
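The copy in the comment above can be done with scp, for example (run from your local machine; the host and key path are placeholders):

```shell
# Replace the key path and public DNS name with your own instance's values.
scp -i ~/.ssh/my_ssh_keypair.pem \
    training.csv test.csv \
    ec2-user@<instance-public-dns>:/home/ec2-user/xgboost/demo/kaggle-higgs/data/
```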
The result; this also finishes in under a minute:
[0] train-auc:0.910911 train-ams@0.15:3.699574
[1] train-auc:0.915308 train-ams@0.15:3.971228
[2] train-auc:0.917743 train-ams@0.15:4.067463
...(snip)...
[118] train-auc:0.945648 train-ams@0.15:5.937291
[119] train-auc:0.945800 train-ams@0.15:5.935622
finish training
finish loading from csv
So, as far as I can tell, it is working without any problems.
UserData Section
yum groupinstall -y "Development tools"
yum -y install gcc-c++ python27-devel atlas-sse3-devel lapack-devel
Development tools such as gcc, plus BLAS/LAPACK and friends, are required, so they are installed here. I referred to the following:
Installing scikit-learn on Amazon EC2 http://dacamo76.com/blog/2012/12/07/installing-scikit-learn-on-amazon-ec2/
pip install numpy
pip install scipy
pip install pandas
numpy and scipy are required. pandas takes a long time to install, which gave me pause, but the pandas integration has seen various improvements recently, so I think it is worth including.
Improved Python XGBoost + pandas integration http://sinhrks.hatenablog.com/entry/2015/10/03/080719
/etc/sudoers
tmp=/root/sudoers_tmp
cat /etc/sudoers > $tmp
cat >> $tmp <<EOF
Defaults:ec2-user !requiretty
EOF
cat $tmp > /etc/sudoers
On EC2 (or rather, on this AMI), sudo by anyone other than root requires a tty by default, which means sudo cannot be used from within a shell script. That becomes a problem later, so the script appends Defaults:ec2-user !requiretty to the end of /etc/sudoers so that ec2-user can run sudo from a script.
This is done rather crudely, on the assumption that the instance will be terminated soon after the job finishes. I referred to the following:
Omit password entry required many times http://qiita.com/yuku_t/items/5f995bb0dfc894c9f2df
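The same copy-append-copy pattern can be tried safely on throwaway temp files instead of the real /etc/sudoers:

```shell
# Demonstrate the append pattern on a scratch file, not the real sudoers.
src=$(mktemp)
tmp=$(mktemp)
echo '# pretend this is /etc/sudoers' > "$src"
# Copy the original, append the new directive via a heredoc, copy back.
cat "$src" > "$tmp"
cat >> "$tmp" <<EOF
Defaults:ec2-user !requiretty
EOF
cat "$tmp" > "$src"
grep -c 'requiretty' "$src"   # prints 1
```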
ShellScript Section
git clone --recursive https://github.com/dmlc/xgboost.git
cd xgboost
./build.sh > build.log 2>&1
cd python-package
sudo -s python setup.py install > setup.log 2>&1
xgboost is installed from source: first xgboost/build.sh builds the binary, then xgboost/python-package/setup.py installs the Python package.
python setup.py install requires sudo. If you skip the /etc/sudoers change above, you will get stuck on "sudo: sorry, you must have a tty to run sudo".
If you want the instance to terminate itself when the job finishes, or want to run batches overnight, the following may be helpful:
Cron Lambda from AWS Data Pipeline (without launching EC2) http://qiita.com/pyr_revs/items/d2ec88a8fafeace7da4a
Create a spot instance with AWS Lambda and let UserData automatically execute environment construction and long processing http://qiita.com/pyr_revs/items/c7f33d1cdee3118590e3