Explanation and implementation of ESIM algorithm

An explanation and implementation of the algorithm from the following paper.

Title: Enhanced LSTM for Natural Language Inference
Author: Chen, et al.
Year: 2017
URL: https://arxiv.org/abs/1609.06038

What is the story? Decomposable Attention is an algorithm that finds the relationship between two documents (e.g. a premise and a hypothesis, whether they mean the same thing, and so on). ESIM, an improved version of it, seems to perform well, so I tried implementing it with Keras.

The article below is about Decomposable Attention, the predecessor of this algorithm.

Explanation and implementation of Decomposable Attention algorithm

What is ESIM?

- ESIM is an abbreviation for Enhanced Sequential Inference Model, an improved version of Decomposable Attention [^2].
- It is a classification algorithm that takes two documents as input.
- It was also used by the winner of Kaggle's Quora competition [^5]. (It was the best-performing single model.)

(Figure: the whole algorithm, quoted from the original paper [^1])

Difference from Decomposable Attention

- Uses Tree-LSTM [^12]. (BiLSTM [^13] is also acceptable.)
- Uses the difference and the element-wise product in the Compare step.
- Uses average pooling in addition to max pooling during Aggregation.

Algorithm

❏ Flow

The flow itself is the same as Decomposable Attention.

- Attend
  - Align words with similar meanings in similar positions so that they are easy to compare.
- Compare
  - Compare the aligned vectors and convert them into two feature vectors.
- Aggregate
  - Combine the two feature vectors and calculate the likelihood of each class.

❏ Input variables

The input documents $a$ and $b$ consist of $l_a$ and $l_b$ words respectively, and each word is converted into a $d$-dimensional vector using a distributed representation (e.g. GloVe [^7] or Word2Vec [^8]).

a = (a_1, ..., a_{l_a})^\mathrm{T} \\
b = (b_1, ..., b_{l_b})^\mathrm{T} \\

The words are converted into distributed representations (the experiment uses 300-dimensional GloVe [^7]) and then encoded with a BiLSTM:

\bar{a}_i = \mathrm{BiLSTM}(a, i),\quad \forall i \in \left\{1,...,l_a\right\} \\
\bar{b}_j = \mathrm{BiLSTM}(b, j),\quad \forall j \in \left\{1,...,l_b\right\} \\
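
As a concrete illustration, the following is a minimal Keras sketch of this encoding step. The sequence length, vocabulary size, and hidden size below are assumptions for illustration, not the settings used in the paper or the experiment.

```python
# Minimal sketch of the input encoding: words -> embeddings -> BiLSTM.
# All sizes here are assumptions for illustration.
from keras.layers import Input, Embedding, Bidirectional, LSTM

max_len = 42        # assumed maximum sequence length
vocab_size = 20000  # assumed vocabulary size
embed_dim = 300     # GloVe dimension used in the experiment
hidden_dim = 300    # assumed BiLSTM hidden size

# The two documents a and b as sequences of word indices.
a_input = Input(shape=(max_len,), dtype="int32")
b_input = Input(shape=(max_len,), dtype="int32")

# Shared embedding layer: word indices -> distributed representations.
embed = Embedding(vocab_size, embed_dim)

# Shared BiLSTM produces the context-aware vectors \bar{a}_i and \bar{b}_j.
encoder = Bidirectional(LSTM(hidden_dim, return_sequences=True))
a_bar = encoder(embed(a_input))  # shape: (batch, l_a, 2 * hidden_dim)
b_bar = encoder(embed(b_input))  # shape: (batch, l_b, 2 * hidden_dim)
```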

❏ Attend

This is the same as Decomposable Attention.

\begin{align}
e_{ij} &= \bar{a}_i^\mathrm{T}\bar{b}_j \\
\tilde{a}_i &= \sum_{j=1}^{l_b} \frac{\exp(e_{ij})}{\sum_{k=1}^{l_b}\exp(e_{ik})} \bar{b}_j, \quad \forall i \in \left\{1,...,l_a\right\} \\
\tilde{b}_j &= \sum_{i=1}^{l_a} \frac{\exp(e_{ij})}{\sum_{k=1}^{l_a}\exp(e_{kj})} \bar{a}_i, \quad \forall j \in \left\{1,...,l_b\right\}  \\
\end{align}

What is this step doing?

- A) I baked an apple and ate it.
- B) I ate an orange yesterday.

Given these two sentences, the alignment looks roughly like this:

| $\bar{a}$   | I | baked | apple  | ate |
|-------------|---|-------|--------|-----|
| $\tilde{a}$ | I | -     | orange | ate |

Comparison becomes easier when similar elements are aligned so that they form pairs. The attention weights emphasize the words that are likely to be related across the two sentences, and the documents are rearranged accordingly. (In practice, this is a computation between vectors.)
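
To make the computation concrete, here is a small NumPy sketch of the Attend step with toy shapes; the sizes and random vectors are assumptions for illustration only.

```python
# NumPy sketch of the Attend step: soft alignment of the two encoded
# sequences (toy shapes, assumed for illustration).
import numpy as np

l_a, l_b, d = 5, 4, 8              # toy sequence lengths and hidden size
a_bar = np.random.randn(l_a, d)    # encoded document a
b_bar = np.random.randn(l_b, d)    # encoded document b

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# e_{ij} = \bar{a}_i^T \bar{b}_j
e = a_bar @ b_bar.T                     # shape: (l_a, l_b)

# \tilde{a}_i: weighted sum of \bar{b}_j, softmax over j
a_tilde = softmax(e, axis=1) @ b_bar    # shape: (l_a, d)
# \tilde{b}_j: weighted sum of \bar{a}_i, softmax over i
b_tilde = softmax(e, axis=0).T @ a_bar  # shape: (l_b, d)
```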

❏ Compare

\begin{align}
m_a &= (\bar{a}, \tilde{a}, \bar{a}-\tilde{a}, \bar{a} \odot \tilde{a})^\mathrm{T} \\
m_b &= (\bar{b}, \tilde{b}, \bar{b}-\tilde{b}, \bar{b} \odot \tilde{b})^\mathrm{T} \\
\end{align}

In Decomposable Attention, only $\bar{a}, \tilde{a}, \bar{b}, \tilde{b}$ were used, but in ESIM the difference and the element-wise product are added as well, which apparently improves performance.
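
Continuing the NumPy sketch from the Attend section, the Compare step simply stacks each vector with its aligned counterpart, their difference, and their element-wise product:

```python
# Compare step (continues the NumPy sketch above): concatenate each vector
# with its aligned counterpart, their difference, and their element-wise product.
m_a = np.concatenate([a_bar, a_tilde, a_bar - a_tilde, a_bar * a_tilde], axis=-1)
m_b = np.concatenate([b_bar, b_tilde, b_bar - b_tilde, b_bar * b_tilde], axis=-1)
print(m_a.shape, m_b.shape)  # -> (5, 32) and (4, 32), i.e. (l_a, 4*d) and (l_b, 4*d)
```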

❏ Aggregate

v_{a,t} = \mathrm{BiLSTM}(F(m_a), t) \\
v_{b,t} = \mathrm{BiLSTM}(F(m_b), t)

Here, the original paper proposes both a Tree-LSTM [^12] and a BiLSTM [^13] variant, but this article uses BiLSTM for ease of implementation. The function $F$ is a feedforward neural network.

v_{a,\mathrm{ave}} = \sum_{i=1}^{l_a}\frac{v_{a,i}}{l_a}, \quad 
v_{a,\max} = \max_{i \in \left\{1,...,l_a\right\}} v_{a,i}, \\

v_{b,\mathrm{ave}} = \sum_{j=1}^{l_b}\frac{v_{b,j}}{l_b}, \quad 
v_{b,\max} = \max_{j \in \left\{1,...,l_b\right\}} v_{b,j}, \\

v = (v_{a,\mathrm{ave}}, v_{a,\max}, v_{b,\mathrm{ave}}, v_{b,\max})^\mathrm{T}

Decomposable Attention aggregated with a single summation, which is sensitive to sequence length and hard to tune, so here both $\mathrm{avg}$ and $\max$ pooling are used.
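
Again continuing the NumPy sketch, the pooling part of the Aggregate step looks like this. Here `v_a` and `v_b` are stand-ins (random toy data) for the per-timestep outputs of $\mathrm{BiLSTM}(F(m_a))$ and $\mathrm{BiLSTM}(F(m_b))$:

```python
# Aggregate step (continues the NumPy sketch): average and max pooling
# over time, then concatenation into the final feature vector v.
v_a = np.random.randn(l_a, d)   # stand-in for BiLSTM(F(m_a))
v_b = np.random.randn(l_b, d)   # stand-in for BiLSTM(F(m_b))

v_a_ave, v_a_max = v_a.mean(axis=0), v_a.max(axis=0)
v_b_ave, v_b_max = v_b.mean(axis=0), v_b.max(axis=0)

# v is fed into a feedforward classifier with a softmax output.
v = np.concatenate([v_a_ave, v_a_max, v_b_ave, v_b_max])  # shape: (4*d,)
```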

Implementation

The implementation is based on the code in [^3].

https://gist.github.com/namakemono/b74547e82ef9307da9c29057c650cdf1
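
For reference, below is a minimal, self-contained Keras sketch that wires the steps above into one model with BiLSTM composition. It is only an illustration under assumed hyperparameters (sequence length, vocabulary size, layer sizes, number of classes); masking, dropout, and the pretrained GloVe embeddings used in the actual experiments are omitted. See the gist above for the full code.

```python
# A minimal ESIM-style model in Keras (illustrative sketch; assumed sizes,
# no masking/dropout, no pretrained embeddings).
from keras.layers import (Input, Embedding, Bidirectional, LSTM, Dense,
                          Dot, Permute, Activation, Concatenate, Subtract,
                          Multiply, GlobalAveragePooling1D, GlobalMaxPooling1D,
                          TimeDistributed)
from keras.models import Model

max_len, vocab_size, embed_dim, hidden_dim, num_classes = 42, 20000, 300, 300, 3

a_in = Input(shape=(max_len,), dtype="int32")
b_in = Input(shape=(max_len,), dtype="int32")

# Input encoding: shared embedding + shared BiLSTM.
embed = Embedding(vocab_size, embed_dim)
encode = Bidirectional(LSTM(hidden_dim, return_sequences=True))
a_bar, b_bar = encode(embed(a_in)), encode(embed(b_in))

# Attend: e_{ij} = a_bar_i . b_bar_j, softmax over the other sentence.
e = Dot(axes=2)([a_bar, b_bar])                         # (batch, l_a, l_b)
a_tilde = Dot(axes=(2, 1))([Activation("softmax")(e), b_bar])
e_t = Permute((2, 1))(e)                                # (batch, l_b, l_a)
b_tilde = Dot(axes=(2, 1))([Activation("softmax")(e_t), a_bar])

# Compare: vector, aligned vector, difference, element-wise product.
def enhance(x, x_tilde):
    return Concatenate()([x, x_tilde,
                          Subtract()([x, x_tilde]),
                          Multiply()([x, x_tilde])])

# F: feedforward projection applied at every time step.
F = TimeDistributed(Dense(hidden_dim, activation="relu"))
m_a, m_b = F(enhance(a_bar, a_tilde)), F(enhance(b_bar, b_tilde))

# Aggregate: composition BiLSTM, then average + max pooling and an MLP.
compose = Bidirectional(LSTM(hidden_dim, return_sequences=True))
v_a, v_b = compose(m_a), compose(m_b)
v = Concatenate()([GlobalAveragePooling1D()(v_a), GlobalMaxPooling1D()(v_a),
                   GlobalAveragePooling1D()(v_b), GlobalMaxPooling1D()(v_b)])
out = Dense(num_classes, activation="softmax")(Dense(hidden_dim, activation="tanh")(v))

model = Model(inputs=[a_in, b_in], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```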

References

[^1]: Chen et al., Enhanced LSTM for Natural Language Inference, 2017. (Original paper of the ESIM algorithm)
[^2]: Parikh et al., A Decomposable Attention Model for Natural Language Inference, 2016. (Original paper of Decomposable Attention)
[^3]: Dang, Quora Question Pairs - DL models, 2017. (Includes ESIM and Decomposable Attention code)
[^4]: Kaggle, Quora Question Pairs, 2017. (Competition to determine whether two documents mean the same thing)
[^5]: Maximilien @ DAMI, Quora Question Pairs - 1st place solution, 2017. (Article by the Kaggle Quora winner)
[^6]: explosion, spaCy, 2017. (Natural language processing library)
[^7]: Pennington et al., GloVe: Global Vectors for Word Representation, 2014. (Original paper of the distributed representation algorithm GloVe)
[^8]: Mikolov et al., Efficient Estimation of Word Representations in Vector Space, 2013. (Original paper of the distributed representation algorithm Word2Vec)
[^9]: Chollet, Keras, 2016. (Keras: library for deep learning)
[^10]: lystdo, LSTM with word2vec embeddings, 2017. (Keras code for LSTM + Word2Vec)
[^11]: namakemono, Implementation of ESIM, 2017. (ESIM implementation)
[^12]: Tai et al., Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks, 2015. (Tree-structured LSTM)
[^13]: Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate, 2014. (Machine translation paper that uses a BiLSTM encoder)
