Advance preparation: http://qiita.com/GushiSnow/items/9ab8761082e29002f735
GitHub repository with the hands-on code: https://github.com/SnowMasaya/Chainer-with-Neural-Networks-Language-model-Hands-on.git
Enable the virtual environment (Mac/Linux):
source my_env/bin/activate
pyenv install 3.4.1
pyenv rehash
pyenv local 3.4.1
Operation check: in the cloned repository directory, run the following:
ipython notebook
The Chainer notebook will now open.
Open the iPython notebook. The steps for building a recurrent neural language model are laid out in order. Since the code in the notebook can actually be executed, we will explain and run it from the top (for detailed usage of iPython notebook, see http://qiita.com/icoxfog417/items/175f69d06f4e590face9 ).
Work through the ipython notebook from here down to the coding part.
The model is defined in a separate class, so you can freely change the model in this part.
The purpose of this part is to understand the characteristics unique to a recurrent neural language model.
-F.EmbedID
converts the dictionary data into vectors with the number of input units, i.e. a mapping into a latent vector space.
-The output is quadrupled (4*n_units) because an LSTM's input is split four ways: the cell input, the input gate, the forget gate, and the output gate (a small sketch follows this list).
-h1_in = self.l1_x(F.dropout(h0, ratio=dropout_ratio, train=train)) + self.l1_h(state['h1'])
specifies what fraction of units to drop out while still carrying the past hidden state forward.
See below for dropout.
http://olanleed.hatenablog.com/entry/2013/12/03/010945
-c1, h1 = F.lstm(state['c1'], h1_in)
is where the LSTM, a mechanism that lets the recurrent network learn well without memory failure or vanishing gradients, updates the cell and hidden state. If you want to know more, please see below.
http://www.slideshare.net/nishio/long-shortterm-memory
-return state, F.softmax_cross_entropy(y, t)
is where the loss is computed by comparing the predicted character with the actual character. The softmax function is used because it determines the output while taking into account all inputs of the layer just before the output layer, so it is the standard choice for computing the output layer.
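To make the quadrupled output concrete, here is a minimal numpy sketch of one LSTM step, assuming a simple 4-way split of the pre-activation into the cell input and the three gates (Chainer's F.lstm performs this split internally and its exact slicing order may differ):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(c_prev, pre_activation):
    # pre_activation has shape (batch, 4*n_units): cell input a plus gates i, f, o.
    a, i, f, o = np.split(pre_activation, 4, axis=1)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(a)  # forget old memory, write new memory
    h = sigmoid(o) * np.tanh(c)                        # expose part of the memory as the hidden state
    return c, h

batch, n_units = 2, 8
c_prev = np.zeros((batch, n_units), dtype=np.float32)
pre = np.random.randn(batch, 4 * n_units).astype(np.float32)  # what l1_x(...) + l1_h(...) produces
c1, h1 = lstm_step(c_prev, pre)
print(c1.shape, h1.shape)  # (2, 8) (2, 8)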
#-------------Explain2 in the Qiita-------------
class CharRNN(FunctionSet):
    """
    This is the part that defines the neural network.
    In order from the top: the dictionary vector space of the input is converted to the number of
    hidden-layer units, and the input and hidden-layer connections are set up.
    The same processing is applied to both hidden layers, and the output layer converts back to the
    vocabulary size for output.
    The initial parameters are set randomly between -0.08 and 0.08.
    """
    def __init__(self, n_vocab, n_units):
        super(CharRNN, self).__init__(
            embed = F.EmbedID(n_vocab, n_units),
            l1_x = F.Linear(n_units, 4*n_units),
            l1_h = F.Linear(n_units, 4*n_units),
            l2_x = F.Linear(n_units, 4*n_units),
            l2_h = F.Linear(n_units, 4*n_units),
            l3 = F.Linear(n_units, n_vocab),
        )
        for param in self.parameters:
            param[:] = np.random.uniform(-0.08, 0.08, param.shape)

    """
    A description of forward propagation.
    The inputs and the answers are wrapped in Variable and passed in.
    The input layer uses the embed defined above.
    For the hidden-layer input, l1_x defined above is applied to the dropped-out embedding
    (passing the dropout ratio), and l1_h is applied to the previous hidden state.
    The first LSTM layer takes the cell state and h1_in.
    The second layer is written in the same way, and the output layer is defined without a state.
    Each state is kept so that it can be used for the next input.
    The output is compared with the answer label, and the loss and the state are returned.
    """
    def forward_one_step(self, x_data, y_data, state, train=True, dropout_ratio=0.5):
        x = Variable(x_data, volatile=not train)
        t = Variable(y_data, volatile=not train)
        h0 = self.embed(x)
        h1_in = self.l1_x(F.dropout(h0, ratio=dropout_ratio, train=train)) + self.l1_h(state['h1'])
        c1, h1 = F.lstm(state['c1'], h1_in)
        h2_in = self.l2_x(F.dropout(h1, ratio=dropout_ratio, train=train)) + self.l2_h(state['h2'])
        c2, h2 = F.lstm(state['c2'], h2_in)
        y = self.l3(F.dropout(h2, ratio=dropout_ratio, train=train))
        state = {'c1': c1, 'h1': h1, 'c2': c2, 'h2': h2}
        return state, F.softmax_cross_entropy(y, t)

    """
    The same forward pass written as a separate method for prediction, with dropout removed.
    dropout has an argument called train and is disabled when train is False, so you could
    switch between training and prediction just by changing the argument you pass; however,
    to make the difference explicit, it is written here as a separate method.
    """
    def predict(self, x_data, state):
        x = Variable(x_data, volatile=True)
        h0 = self.embed(x)
        h1_in = self.l1_x(h0) + self.l1_h(state['h1'])
        c1, h1 = F.lstm(state['c1'], h1_in)
        h2_in = self.l2_x(h1) + self.l2_h(state['h2'])
        c2, h2 = F.lstm(state['c2'], h2_in)
        y = self.l3(h2)
        state = {'c1': c1, 'h1': h1, 'c2': c2, 'h2': h2}
        return state, F.softmax(y)

"""
Initialization of the state.
"""
def make_initial_state(n_units, batchsize=100, train=True):
    return {name: Variable(np.zeros((batchsize, n_units), dtype=np.float32),
                           volatile=not train)
            for name in ('c1', 'h1', 'c2', 'h2')}
#-------------Explain2 in the Qiita-------------
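As a quick sanity check, the class above can be exercised on dummy data roughly as follows (a minimal sketch: the sizes and the random batch are made up for illustration, and the notebook's imports of numpy as np and chainer's Variable/functions are assumed):
n_vocab, n_units, batchsize = 100, 128, 10
model = CharRNN(n_vocab, n_units)
state = make_initial_state(n_units, batchsize=batchsize)

# One training step on dummy data: current character IDs x and next character IDs t.
x_batch = np.random.randint(0, n_vocab, (batchsize,)).astype(np.int32)
y_batch = np.random.randint(0, n_vocab, (batchsize,)).astype(np.int32)
state, loss = model.forward_one_step(x_batch, y_batch, state)
print(loss.data)  # averaged softmax cross-entropy for the batch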
Handson #2 Explanation
Regarding the character-string data used for prediction: normally the training data and the test data are kept separate, but here, so that the effect can be seen within the hands-on, the training data and the test data are the same. In the prediction step you change the model and predict character strings.
- Change the model.
- Predict the character string.
To change the model used for prediction, change the code below in the iPython notebook. The trained models are in the cv folder; there are not many of them, but please check what is there.
# load model
#-------------Explain5 in the Qiita-------------
model = pickle.load(open("cv/charrnn_epoch_x.chainermodel", 'rb'))
#-------------Explain5 in the Qiita-------------
-state, prob = model.predict(np.array([index], dtype=np.int32), state)
retrieves the predicted probabilities and the state; the state is kept so that it can be used for the next prediction.
-index = np.argmax(cuda.to_cpu(prob.data))
takes the highest-probability entry: cuda.to_cpu(prob.data) gives the probability weight of every word, so the index of the most probable character is returned as the prediction.
-index = np.random.choice(prob.data.argsort()[0, -sampling_range:][::-1], 1)[0]
is used because a recurrent model tends to keep emitting similar characters; this line instead samples randomly from the top candidates (the top 5 here). Normally you would simply take the maximum, but since we want to see a variety of outputs, sampling is used here.
#-------------Explain7 in the Qiita-------------
state, prob = model.predict(np.array([index], dtype=np.int32), state)
#index = np.argmax(prob.data)
index = np.random.choice(prob.data.argsort()[0,-sampling_range:][::-1], 1)[0]
#-------------Explain7 in the Qiita-------------
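Putting the loaded model and the snippet above together, a character-generation loop looks roughly like the sketch below. The dictionaries vocab (character to ID) and ivocab (ID to character) and the seed character are assumptions about the notebook's variable names, so adjust them to the actual ones:
sampling_range = 5
state = make_initial_state(n_units, batchsize=1, train=False)
index = vocab['t']                      # assumed seed character present in the vocabulary
generated = [ivocab[index]]

for _ in range(100):
    state, prob = model.predict(np.array([index], dtype=np.int32), state)
    # On GPU, wrap prob.data with cuda.to_cpu(...) first.
    # Sample from the top sampling_range candidates instead of argmax, for variety.
    index = np.random.choice(prob.data.argsort()[0, -sampling_range:][::-1], 1)[0]
    generated.append(ivocab[index])

print(''.join(generated))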
In this hands-on we train only for a limited time, so the model we get has poor accuracy. Let's adjust the parameters and retrain the model. The parameters to adjust are:
#-------------Explain7 in the Qiita-------------
n_epochs = 30
n_units = 625
batchsize = 100
bprop_len = 10
grad_clip = 0.5
#-------------Explain7 in the Qiita-------------
Role of each parameter:
n_epochs is the number of training epochs. A complex model will not converge unless it is trained for more epochs, so for a complex model this needs to be set larger.
n_units is the number of hidden-layer units. The larger this number, the more complex the model; if you increase it, training will not converge unless the number of epochs is also increased. For a language model in particular, it should be chosen with the vocabulary size in mind: if the number of units exceeds the vocabulary size, the mapping into the latent space no longer compresses anything and the processing becomes meaningless.
batchsize is the number of examples learned at one time. It depends on the size of the data and is usually tuned empirically; roughly, increasing it improves learning accuracy but slows learning down, while decreasing it lowers accuracy but speeds learning up.
bprop_len is a parameter specific to recurrent neural networks and controls how many past characters are retained (the truncation length of backpropagation through time). It depends on the problem: set it larger when you want to predict long sentences and smaller for relatively short ones.
optimizer.clip_grads(grad_clip) puts an upper limit on the gradient norm (the weight-update size) to keep the weights from exploding. A larger value allows larger updates; a smaller value restrains them.
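For reference, here is a rough sketch of where these parameters sit in a Chainer 1.x style training loop, modeled on the chainer-char-rnn code this hands-on is based on. The optimizer choice (RMSprop), whole_len, and make_minibatch are assumptions for illustration, so the notebook's actual loop may differ in detail:
from chainer import optimizers

optimizer = optimizers.RMSprop()
optimizer.setup(model)                    # some older Chainer releases use model.collect_parameters()

state = make_initial_state(n_units, batchsize=batchsize)
accum_loss = Variable(np.zeros((), dtype=np.float32))
total_iterations = n_epochs * (whole_len // batchsize)   # whole_len: length of the training data

for i in range(total_iterations):
    x_batch, y_batch = make_minibatch(i)  # hypothetical helper returning int32 ID arrays
    state, loss_i = model.forward_one_step(x_batch, y_batch, state)
    accum_loss += loss_i

    if (i + 1) % bprop_len == 0:          # truncated backpropagation through time
        optimizer.zero_grads()
        accum_loss.backward()
        accum_loss.unchain_backward()     # cut the history so memory does not keep growing
        optimizer.clip_grads(grad_clip)   # cap the gradient norm at grad_clip
        optimizer.update()
        accum_loss = Variable(np.zeros((), dtype=np.float32))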
If you want to know more about hyperparameter optimization, please see below.
http://colinraffel.com/wiki/neural_network_hyperparameters
Handson Advance
Language processing takes a lot of time, so using a GPU is recommended. That does not mean a GPU should be used unconditionally, though: it is effective for the kinds of workloads below. If you want to know the details of the mechanism, please see http://www.kumikomi.net/archives/2008/06/12gpu1.php?page=1
GPUs are good at: matrix computation, sequential memory access, and calculations without conditional branching (work with high arithmetic density).
GPUs are poor at: binary search, random memory access, and code with many conditional branches.
http://sla.hatenablog.com/entry/chainer_on_ec2
GitHub repository for this GPU version (uses a GPU instance, based on Amazon Linux, with the CUDA environment published by NVIDIA):
https://github.com/SnowMasaya/Chainer-with-Neural-Networks-Language-model-Hands-on-Advance.git
The GPU setup on AWS followed the site below.
http://tleyden.github.io/blog/2014/10/25/cuda-6-dot-5-on-aws-gpu-instance-running-ubuntu-14-dot-04/
apt-get update && apt-get install build-essential
Get the Cuda installer
wget http://developer.download.nvidia.com/compute/cuda/6_5/rel/installers/cuda_6.5.14_linux_64.run
Extract the individual installers from the CUDA package
chmod +x cuda_6.5.14_linux_64.run
mkdir nvidia_installers
./cuda_6.5.14_linux_64.run -extract=`pwd`/nvidia_installers
Install linux-image-extra
sudo apt-get install linux-image-extra-virtual
Reboot
reboot
Create the blacklist file for the nouveau driver
vi /etc/modprobe.d/blacklist-nouveau.conf
Set nouveau and lbm-nouveau not to load:
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
Disable the kernel's nouveau KMS:
echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
Rebuild the initramfs (the file system image loaded into memory at kernel startup) so the setting takes effect, then reboot:
update-initramfs -u
reboot
Get kernel source
apt-get install linux-source
apt-get install linux-headers-3.13.0-37-generic
Install NVIDIA driver
cd nvidia_installers
./NVIDIA-Linux-x86_64-340.29.run
Check if the driver is installed with the following command.
nvidia-smi
Wed Aug 5 07:48:36 2015
+------------------------------------------------------+
| NVIDIA-SMI 340.29 Driver Version: 340.29 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GRID K520 Off | 0000:00:03.0 Off | N/A |
| N/A 54C P0 80W / 125W | 391MiB / 4095MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0 8013 python 378MiB |
+-----------------------------------------------------------------------------+
The number assigned to the above GPU (0 in this output) is the GPU ID; it will be used later when running Chainer (a short sketch follows).
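As a hedged sketch of how that GPU ID is used on the Chainer side (model is the CharRNN instance from the notebook, and the exact calls differ between Chainer 1.x releases, so treat this as illustrative rather than the notebook's exact code):
from chainer import cuda
import numpy as np

gpu_id = 0                     # the ID shown by nvidia-smi
cuda.init(gpu_id)              # pycuda-based Chainer; later 1.x releases use cuda.get_device(gpu_id).use()
model.to_gpu()                 # move the FunctionSet parameters onto the GPU

# Minibatches are sent to the GPU before the forward pass, and results are
# brought back to the CPU for numpy processing, e.g. cuda.to_cpu(prob.data).
x_gpu = cuda.to_gpu(np.array([0], dtype=np.int32))
x_cpu = cuda.to_cpu(x_gpu)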
If you see the error "Error while loading shared libraries: libcurand.so.5.5: cannot open shared object file: No such file or directory", see the page below.
http://linuxtoolkit.blogspot.jp/2013/09/error-while-loading-shared-libraries.html
Also add CUDA to the PATH:
export PATH=$PATH:/usr/local/cuda-6.5/bin/
Python 3 is set up as follows.
Run the following commands to install the prerequisites:
apt-get update
apt-get install gcc g++ kmod perl python-dev
sudo reboot
pip installation procedure https://pip.pypa.io/en/stable/installing.html
Pyenv installation procedure https://github.com/yyuu/pyenv
pip install virtualenv
pyenv install 3.4.0
virtualenv my_env -p ~/.pyenv/versions/3.4.0/bin/python3.4
requirement.txt is already provided:
numpy
scikit-learn
Mako
six
chainer
scikit-cuda
Install the required libraries
pip install -r requirement.txt
Download "install-headers" from below.
https://android.googlesource.com/toolchain/python/+/47a24ea6662f20c8e165d541ab6facdf009bfee4/Python-2.7.5/Lib/distutils/command/install_headers.py
Install PyCuda
wget https://pypi.python.org/packages/source/p/pycuda/pycuda-2015.1.2.tar.gz
tar zxvf pycuda-2015.1.2.tar.gz
cd pycuda-2015.1.2
./configure.py
make
make install
Handson Advance2
Run ipython notebook on the server (AWS) and check that it works.
https://thomassileo.name/blog/2012/11/19/setup-a-remote-ipython-notebook-server-with-numpyscipymaltplotlibpandas-in-a-virtualenv-on-ubuntu-server/
Create configuration file
ipython profile create myserver
Modify configuration file
vim /home/ec2-user/.ipython/profile_myserver/ipython_config.py
Add the following lines:
c = get_config()
c.IPKernelApp.pylab = 'inline'
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.password = u'sha1:yourhashedpassword'
c.NotebookApp.port = 9999
Add CUDA to the PATH:
export PATH=$PATH:/usr/local/cuda-6.5/bin/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-6.5/lib64/
Open the required port from the AWS security group:
1: Select the security group. 2: Edit to add a rule. 3: Type: Custom TCP rule. 4: Protocol: TCP. 5: Port: 9999. 6: Source: Anywhere.
Run. Because the profile is not picked up by the normal procedure, pass the config file directly:
sudo ipython notebook --config=/home/ec2-user/.ipython/profile_myserver/ipython_config.py --no-browser
"Statistical Language Models based on Neural Networks" on the following site is very organized and easy to understand. Although it is in English.
http://rnnlm.org/
Description of language model coverage, perplexity
http://marujirou.hatenablog.com/entry/2014/08/22/235215
Running the deep learning framework Chainer on an EC2 g2.2xlarge GPU instance
http://ukonlly.hatenablog.jp/entry/2015/07/04/210149
Dropout
http://olanleed.hatenablog.com/entry/2013/12/03/010945
Learning to Forget: Continual Prediction with LSTM
http://www.slideshare.net/FujimotoKeisuke/learning-to-forget-continual-prediction-with-lstm
Zaremba, Wojciech, Ilya Sutskever, and Oriol Vinyals. "Recurrent neural network regularization." arXiv preprint arXiv:1409.2329 (2014).
Tomas Mikolov's RNNLM toolkit
http://www.rnnlm.org/
A language model (LM) built with a neural network (NN), i.e. a neural network language model (NNLM); in particular the recurrent neural network language model (RNNLM), which uses a recurrent neural network (RNN)
http://kiyukuta.github.io/2013/12/09/mlac2013_day9_recurrent_neural_network_language_model.html
Long Short-term Memory
http://www.slideshare.net/nishio/long-shortterm-memory
While walking through Chainer's ptb sample, train on your own sentences and automatically generate sentences in your own style
http://d.hatena.ne.jp/shi3z/20150714/1436832305
RNNLM
http://www.slideshare.net/uchumik/rnnln
Sparse estimation overview: model, theory, application
http://www.is.titech.ac.jp/~s-taiji/tmp/sparse_tutorial_2014.pdf
Optimization method in regularization learning method
http://imi.kyushu-u.ac.jp/~waki/ws2013/slide/suzuki.pdf
Recurrent neural language model creation reference https://github.com/yusuketomoto/chainer-char-rnn
Neural network natural language processing http://www.orsj.or.jp/archive2/or60-4/or60_4_205.pdf
Language model creation http://www.slideshare.net/uchumik/rnnln
Natural Language Processing Programming Study Group n-gram Language Model http://www.phontron.com/slides/nlp-programming-ja-02-bigramlm.pdf
Introduction to Statistical Semantic ~ From distribution hypothesis to word2vec ~ http://www.slideshare.net/unnonouno/20140206-statistical-semantics
linux source code https://github.com/torvalds/linux
Actual GPU computing using CUDA technology (Part 1) -Application of parallel processing technology refined in the graphics field to general-purpose numerical calculation http://www.kumikomi.net/archives/2008/06/12gpu1.php?page=1
GPGPU https://ja.wikipedia.org/wiki/GPGPU#.E7.89.B9.E5.BE.B4.E3.81.A8.E8.AA.B2.E9.A1.8C
Natural language processing theory I http://www.jaist.ac.jp/~kshirai/lec/i223/02.pdf
STATISTICAL LANGUAGE MODELS BASED ON NEURAL NETWORKS http://www.rnnlm.org/
Neural Network Hyperparameters http://colinraffel.com/wiki/neural_network_hyperparameters
Random Search for Hyper-Parameter Optimization http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf