This article is the day-14 entry of the Python Advent Calendar 2016.
NewsDigest uses machine learning to classify the news articles it delivers into categories. Specifically, about 1,000 articles a day are classified into 10 categories such as "entertainment," "politics," and "sports."

Rather than tightly coupling this categorization to the delivery server, NewsDigest exposes it internally as a **general-purpose API**.

To make this general-purpose API more scalable, we looked into building it as a serverless (AWS Lambda) machine learning API, so this article is an introduction and tutorial for creating such a serverless API.
The working API is at https://3lxb3g0cx5.execute-api.us-east-1.amazonaws.com/prod/classify and the repository is at https://github.com/yamitzky/serverless-machine-learning.
The API implemented here is built on the following assumptions:

- It is a classification API based on supervised learning. That is, **there is a training stage and a classification (prediction) stage**
- It is **not for big data**, so it is implemented with scikit-learn alone, without Spark or the like
- As mentioned above, it is a serverless API
**I will omit explanations of machine learning itself, morphological analysis, how to use scikit-learn, and so on.**
In this tutorial, you will follow the steps below:
- First, write a minimal machine learning program, independent of any API
- Use bottle to turn it into a (non-serverless) API
- Deploy it to AWS Lambda to make it serverless
First, let's write a minimal implementation without worrying about turning it into an API. As a premise, I prepared a corpus in the following format (generated from the Reuters corpus):
```
category\tmorphologically analyzed text
money-fx\tu.k. money market given 120 mln stg late help london, march 17 - the bank of england said it provided the money market with late assistance of around 120 mln stg. this brings the bank's total help today to some 136 mln stg and compares with its forecast of a 400 mln stg shortage in the system.
grain\tu.s. export inspections, in thous bushels soybeans 20,349 wheat 14,070 corn 21,989 blah blah blah.
earn\tsanford corp <sanf> 1st qtr feb 28 net bellwood, ill., march 23 - shr 28 cts vs 13 cts net 1,898,000 vs 892,000 sales 16.8 mln vs 15.3 mln
...
```
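For reference, a corpus in this format can be generated from, for example, NLTK's Reuters corpus. The following is only a rough sketch under that assumption; the repository's actual gen_corpus.py / download_corpus.sh may do this differently.

```python
# Hypothetical sketch: build corpus.txt from NLTK's Reuters corpus.
# The repository's gen_corpus.py may differ from this.
import nltk
from nltk.corpus import reuters

nltk.download('reuters')  # fetch the corpus data once

with open('corpus.txt', 'w') as f:
    for fileid in reuters.fileids():
        cats = reuters.categories(fileid)
        if len(cats) != 1:
            continue  # keep only single-label documents for simplicity
        # The NLTK Reuters corpus is already tokenized, so joining the
        # tokens with spaces gives "morphologically analyzed" text
        text = ' '.join(reuters.words(fileid)).lower()
        f.write('{0}\t{1}\n'.format(cats[0], text))
```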
A minimal implementation of categorization with Naive Bayes looks like this:
```python
from gensim.corpora.dictionary import Dictionary
from gensim.matutils import corpus2csc
from sklearn.naive_bayes import MultinomialNB


def load_corpus(path):
    """Get corpus from file"""
    categories = []
    docs = []
    with open(path) as f:
        for line in f:
            category, line = line.split('\t')
            doc = line.strip().split(' ')
            categories.append(category)
            docs.append(doc)
    return categories, docs


def train_model(documents, categories):
    """Train the model"""
    dictionary = Dictionary(documents)
    X = corpus2csc([dictionary.doc2bow(doc) for doc in documents]).T
    return MultinomialNB().fit(X, categories), dictionary


def predict(classifier, dictionary, document):
    """Predict the category of an unknown sentence with the trained model"""
    X = corpus2csc([dictionary.doc2bow(document)], num_terms=len(dictionary)).T
    return classifier.predict(X)[0]


# Train the model
categories, documents = load_corpus('corpus.txt')
classifier, dictionary = train_model(documents, categories)

# Classify with the trained model
predict_sentence = 'a dollar of 115 yen or more at the market price of the trump market 4% growth after the latter half of next year'.split()  # NOQA
predict(classifier, dictionary, predict_sentence)  # money-fx
```
This minimal implementation:

- reads data from the corpus and trains a model
- classifies unknown sentences with the trained model

so it has the minimum functionality of supervised learning. Let's turn it into an API.
Before going serverless, let's first expose the categorization as an API using bottle, a simple web framework.
```python
from bottle import route, run, request


def load_corpus(path):
    """Get corpus from file (same as above)"""


def train_model(documents, categories):
    """Train the model (same as above)"""


def predict(classifier, dictionary, document):
    """Classify with the trained model (same as above)"""


@route('/classify')
def classify():
    categories, documents = load_corpus('corpus.txt')
    classifier, dictionary = train_model(documents, categories)
    sentence = request.params.sentence.split()
    return predict(classifier, dictionary, sentence)

run(host='localhost', port=8080)
```
If you hit this with curl, you get the classification result:
curl "http://localhost:8080/classify?sentence=a%20dollar%20of%20115%20yen%20or%20more%20at%20the%20market%20price%20of%20the%20trump%20market%204%%20growth%20after%20the%20latter%20half%20of%20next%20year"
# money-fx
But of course, **this implementation has a big problem**: it is slow, because it trains and classifies at the same time every time the classification endpoint (`/classify`) is hit.

In general, in machine learning, training takes a long time while classification finishes quickly. So let's split training into its own endpoint and persist the trained model.
This time I prepared two endpoints, `/train` and `/classify`. The model is persisted with joblib, as described in scikit-learn's 3.4. Model persistence. The trick is joblib's compression: compressing the model shrinks it from about 200 MB to about 2 MB (file size matters because it is a constraint when deploying to Lambda).
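As a quick sketch of what that looks like (the 200 MB and 2 MB figures are the ones mentioned above; `compress` is joblib's compression level, 0-9):

```python
from sklearn.externals import joblib

# Without compression: a large pickle, around 200 MB for this model
joblib.dump((classifier, dictionary), 'model_raw.pkl')

# With maximum compression: a single file of roughly 2 MB, small
# enough to stay within Lambda's deployment package size limits
joblib.dump((classifier, dictionary), 'model.pkl', compress=9)
```

The full API, with training split out into its own endpoint, looks like this: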
```python
import os.path

from bottle import route, run, request
from sklearn.externals import joblib


def load_corpus(path):
    """Get corpus from file (same as above)"""


def train_model(documents, categories):
    """Train the model (same as above)"""


def predict(classifier, dictionary, document):
    """Classify with the trained model (same as above)"""


@route('/train')
def train():
    categories, documents = load_corpus('corpus.txt')
    classifier, dictionary = train_model(documents, categories)
    # Persist the trained model (compress=9 shrinks the file drastically)
    joblib.dump((classifier, dictionary), 'model.pkl', compress=9)
    return "trained"


@route('/classify')
def classify():
    if os.path.exists('model.pkl'):
        classifier, dictionary = joblib.load('model.pkl')
        sentence = request.params.sentence.split()
        return predict(classifier, dictionary, sentence)
    else:
        # No model file means the model has not been trained yet
        return "model not trained. call `/train` endpoint"

run(host='localhost', port=8080)
```
With this API, once the model is trained, it is persisted as `model.pkl`. Initially no model has been trained, so "model not trained" is returned:
curl "http://localhost:8080/?sentence=a%20dollar%20of%20115%20yen%20or%20more%20at%20the%20market%20price%20of%20the%20trump%20market%204%%20growth%20after%20the%20latter%20half%20of%20next%20year"
# model not trained
If you train first and then classify again, the API classifies normally:
```sh
curl http://localhost:8080/train
# trained

curl "http://localhost:8080/classify?sentence=a%20dollar%20of%20115%20yen%20or%20more%20at%20the%20market%20price%20of%20the%20trump%20market%204%25%20growth%20after%20the%20latter%20half%20of%20next%20year"
# money-fx
```
Now for the main topic: deploying the API built with bottle to AWS Lambda.

To make the machine learning API serverless, the training phase and the classification phase are set up as follows:
- Training phase
  - Use Docker to download the dataset, generate the corpus, train, and save the trained model file
  - Zip the code + trained model and deploy it to AWS Lambda
- Classification phase
  - With API Gateway + AWS Lambda, load the trained model and classify whenever a classification request arrives
It's a bit of a top-down recipe, but prepare the following Dockerfile:
```dockerfile
# Use the anaconda (miniconda) base image, which makes it easy to build
# machine-learning-related libraries
FROM continuumio/miniconda

RUN mkdir -p /usr/src/app
WORKDIR /usr/src/app

# Download the dataset
COPY download_corpus.sh /usr/src/app/
RUN sh download_corpus.sh

# Install the machine-learning-related libraries
COPY conda-requirements.txt /usr/src/app/
RUN conda create -y -n deploy --file conda-requirements.txt
# The libraries end up in /opt/conda/envs/deploy/lib/python2.7/site-packages

COPY . /usr/src/app/

# Train and write out the model
RUN python gen_corpus.py \
    && /bin/bash -c "source activate deploy && python train.py"

# Prepare the artifacts for deployment:
# pack the code, the trained model, the .so files needed at runtime, etc.
RUN mkdir -p build/lib \
    && cp main.py model.pkl build/ \
    && cp -r /opt/conda/envs/deploy/lib/python2.7/site-packages/* build/ \
    && cp /opt/conda/envs/deploy/lib/libopenblas* /opt/conda/envs/deploy/lib/libgfortran* build/lib/
```
Building this Dockerfile produces a Docker image packed with the code, the trained model, and the .so files needed to run it. In other words, building the image builds the model.

To extract the artifacts from the Docker image and create the package to upload to Lambda, run something like the following:
```sh
docker build -t serverless-ml .

# Retrieve the build artifacts from the Docker image
id=$(docker create serverless-ml)
docker cp $id:/usr/src/app/build ./build
docker rm -v $id

# Reduce the package size
rm build/**/*.pyc
rm -rf build/**/test
rm -rf build/**/tests

# Zip the artifacts
cd build/ && zip -q -r -9 ../build.zip ./
```
You now have a zip file packed with code and models.
I'll omit the details here, since this is standard AWS Lambda usage. [Step 2.3: Create a Lambda function and test it manually](https://docs.aws.amazon.com/ja_jp/lambda/latest/dg/with-s3-example-upload-deployment-pkg.html#walkthrough-s3-events-adminuser-create-test-function-upload-zip-test-upload) may be helpful.
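By the way, the deployment package's `main.py` needs to expose a Lambda handler. The article doesn't show it, so here is a hypothetical sketch (the handler name `handler` and the event shape are assumptions; the actual main.py in the repository may differ):

```python
# main.py -- hypothetical sketch of the Lambda handler; the actual
# implementation in the repository may differ.
from gensim.matutils import corpus2csc
from sklearn.externals import joblib

# Load the trained model at import time (once per container), so that
# warm invocations can skip the expensive deserialization
classifier, dictionary = joblib.load('model.pkl')


def handler(event, context):
    # Assumes API Gateway maps the JSON request body straight into
    # `event`, e.g. {"sentence": "a dollar of 115 yen ..."}
    document = event['sentence'].split()
    X = corpus2csc([dictionary.doc2bow(document)],
                   num_terms=len(dictionary)).T
    return classifier.predict(X)[0]
```

With a layout like this, the Lambda function's handler setting would be `main.handler`.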
Use Amazon API Gateway to create a serverless "API".
This is also omitted, since it is standard API Gateway usage. [Create an API to expose your Lambda function](http://docs.aws.amazon.com/ja_jp/apigateway/latest/developerguide/getting-started.html) may be helpful.
As a working example, I prepared the following API:
https://3lxb3g0cx5.execute-api.us-east-1.amazonaws.com/prod/classify
Let's actually hit the API with curl.
```sh
curl -X POST https://3lxb3g0cx5.execute-api.us-east-1.amazonaws.com/prod/classify \
  -H "Content-type: application/json" \
  -d '{"sentence": "a dollar of 115 yen or more at the market price of the trump market 4% growth after the latter half of next year"}'
```
You get the classification result `money-fx`, as expected.
So, is a serverless machine learning API practical? To cut to the conclusion: **no**.

The reason is that the API returns results far too slowly: in the example above, a request takes about 5 seconds. An API that takes 5 seconds to respond is, well, a **no** (wry smile).
The cause of the slow response is clear: loading the pkl file from disk into memory takes a long time. So if the model file is huge, **a serverless machine learning API is too slow to be practical**.
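To see where the time goes, you can time the model load on its own (a minimal sketch; absolute numbers will of course depend on the environment):

```python
import time

from sklearn.externals import joblib

start = time.time()
classifier, dictionary = joblib.load('model.pkl')
print('loading model.pkl took %.2f seconds' % (time.time() - start))
```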
Conversely, if there is no model file at all, for example an API that just computes with `numpy`, or if the model file is very lightweight, I think it can serve reasonably well.