SageMaker Autopilot is AWS's AutoML offering on SageMaker: it automatically preprocesses the data, selects algorithms, and optimizes hyperparameters. A sample for Autopilot is available, so in this post I actually run it.

Autopilot sample

First, import the necessary libraries and create a session.
jupyter
import sagemaker
import boto3
from sagemaker import get_execution_role

# Resolve the current region and create a SageMaker session
region = boto3.Session().region_name
session = sagemaker.Session()

# Default S3 bucket and key prefix used throughout this sample
bucket = session.default_bucket()
prefix = 'sagemaker/autopilot-dm'

# Execution role and low-level SageMaker client
role = get_execution_role()
sm = boto3.Session().client(service_name='sagemaker', region_name=region)
Next, download the dataset. This sample uses the Bank Marketing Data Set from the UCI repository: records of a bank's direct marketing campaigns, where the target is whether the customer subscribed to a term deposit.
jupyter
# Download and extract the UCI Bank Marketing Data Set
!wget -N https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
!unzip -o bank-additional.zip
local_data_path = './bank-additional/bank-additional-full.csv'
Next, split the downloaded data into training and test sets, and drop the target column "y" from the test set.
jupyter
import pandas as pd

# The CSV in this dataset is ';'-delimited
data = pd.read_csv(local_data_path, sep=';')

# 80/20 train/test split with a fixed seed for reproducibility
train_data = data.sample(frac=0.8, random_state=200)
test_data = data.drop(train_data.index)

# Remove the target column 'y' from the test set
test_data_no_target = test_data.drop(columns=['y'])
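This is not part of the sample, but before training it can be worth a quick look at the target's class balance and the sizes of the two splits; a minimal sketch:

jupyter
# Sanity check (my addition, not in the sample):
# class balance of the target and sizes of the two splits
print(data['y'].value_counts(normalize=True))
print('train:', train_data.shape, 'test:', test_data.shape)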
Then upload each split to S3.
jupyter
# Write the training set (with header) and upload it to S3
train_file = 'train_data.csv'
train_data.to_csv(train_file, index=False, header=True)
train_data_s3_path = session.upload_data(path=train_file, key_prefix=prefix + "/train")
print('Train data uploaded to: ' + train_data_s3_path)

# Write the test set without a header; it is used for inference later
test_file = 'test_data.csv'
test_data_no_target.to_csv(test_file, index=False, header=False)
test_data_s3_path = session.upload_data(path=test_file, key_prefix=prefix + "/test")
print('Test data uploaded to: ' + test_data_s3_path)
Next, set up Autopilot. This sample uses only the minimal settings below, but many other options are available; they are described in the Autopilot documentation, so please have a look. (A sketch of a few optional settings follows the config block.)
jupyter
input_data_config = [{
    'DataSource': {
        'S3DataSource': {
            'S3DataType': 'S3Prefix',
            'S3Uri': 's3://{}/{}/train'.format(bucket, prefix)
        }
    },
    'TargetAttributeName': 'y'
}]

output_data_config = {
    'S3OutputPath': 's3://{}/{}/output'.format(bucket, prefix)
}
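For reference, here is a sketch of a few of those optional settings (not used in this run; the concrete values are illustrative). You can pin the problem type and objective metric instead of letting Autopilot infer them, and cap the number of candidates to shorten the job:

jupyter
# Optional settings (illustrative values; not used in this run)
problem_type = 'BinaryClassification'          # Autopilot infers this if omitted
auto_ml_job_objective = {'MetricName': 'F1'}   # metric candidates are ranked by

# Limit the number of candidate pipelines to shorten the job
auto_ml_job_config = {
    'CompletionCriteria': {
        'MaxCandidates': 50
    }
}
# These would be passed to create_auto_ml_job as ProblemType=...,
# AutoMLJobObjective=..., and AutoMLJobConfig=... respectively.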
Now that the settings are complete, let's actually run the job.
jupyter
from time import gmtime, strftime, sleep

# Give the job a unique, timestamped name
timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())
auto_ml_job_name = 'automl-banking-' + timestamp_suffix
print('AutoMLJobName: ' + auto_ml_job_name)

# Launch the Autopilot job
sm.create_auto_ml_job(AutoMLJobName=auto_ml_job_name,
                      InputDataConfig=input_data_config,
                      OutputDataConfig=output_data_config,
                      RoleArn=role)
The following loop polls the job and prints its status every 30 seconds.
jupyter
print('JobStatus - Secondary Status')
print('------------------------------')

describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
print(describe_response['AutoMLJobStatus'] + " - " + describe_response['AutoMLJobSecondaryStatus'])
job_run_status = describe_response['AutoMLJobStatus']

# Poll every 30 seconds until the job reaches a terminal state
while job_run_status not in ('Failed', 'Completed', 'Stopped'):
    describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
    job_run_status = describe_response['AutoMLJobStatus']
    print(describe_response['AutoMLJobStatus'] + " - " + describe_response['AutoMLJobSecondaryStatus'])
    sleep(30)
Model creation is complete when the status reaches "Completed". In my case it took a little over two hours.
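Once the job has completed, you can look up the best candidate Autopilot found. A minimal sketch using the describe_auto_ml_job response (the BestCandidate fields below come from that API):

jupyter
# Retrieve the best candidate found by Autopilot
best_candidate = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['BestCandidate']
print('CandidateName: ' + best_candidate['CandidateName'])
print('Objective metric: ' + best_candidate['FinalAutoMLJobObjectiveMetric']['MetricName'])
print('Objective value: ' + str(best_candidate['FinalAutoMLJobObjectiveMetric']['Value']))

The best candidate's inference containers can then be registered with create_model and used for real-time or batch inference.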
In this post, I created a model automatically with SageMaker Autopilot. It struck me again how impressive AutoML is: you can build a model just by preparing the data. I hope this lowers the barrier to model building and helps ML spread more widely.