Start studying: Saturday, December 7th
Teaching materials, etc.:
- Miyuki Oshige, "Details! Python 3 Introductory Note" (Sotec, 2017): read 12/7 (Sat) - 12/19 (Thu)
- Progate Python course (5 courses in total): finished 12/19 (Thu) - 12/21 (Sat)
- Andreas C. Müller & Sarah Guido, "Introduction to Machine Learning with Python" (Japanese edition, O'Reilly Japan, 2017): 12/21 (Sat) - 12/23 (Mon)
- Kaggle: Real or Not? NLP with Disaster Tweets: 12/28 (Sat) - 1/3 (Fri), submission and adjustments
- Wes McKinney, "Python for Data Analysis" (Japanese edition, O'Reilly Japan, 2018): read 1/4 (Sat) - 1/13 (Mon)
- Yasuki Saito, "Deep Learning from Scratch" (O'Reilly Japan, 2016): 1/15 (Wed) - 1/20 (Mon)
- **François Chollet, "Deep Learning with Python and Keras" (Japanese edition, Queep, 2018): 1/21 (Tue) -**
Finished reading Chapter 6, "Deep Learning for Text and Sequences," up to p. 244.
- Pretrained network (word embeddings): a network that was previously trained on a large dataset and then saved. **If the original dataset is large and general enough, the hierarchy of features the network has learned can effectively serve as a general-purpose model of its domain.**
Just as with CNNs for image classification (translation invariance of patterns, learning of spatial hierarchies), pretrained word embeddings are useful in natural language processing when the features the problem requires are fairly generic, i.e., common visual or semantic features.
The pretrained vectors are loaded into the Embedding layer. The Embedding layer is easiest to think of as a dictionary that maps an integer index representing a particular word to a dense vector: word index → **Embedding layer** → corresponding word vector.
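To make that mapping concrete, here is a minimal sketch; the vocabulary size and the word indices are arbitrary placeholders of my own, and I use the tensorflow.keras namespace rather than the standalone keras imports used in the book.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding

# Hypothetical sizes: a 1,000-word vocabulary embedded into 100-dimensional vectors.
model = Sequential([Embedding(input_dim=1000, output_dim=100)])

# Two samples of five word indices each (0 acting as padding).
word_indices = np.array([[4, 20, 7, 0, 0],
                         [11, 3, 99, 5, 1]])

vectors = model.predict(word_indices)
print(vectors.shape)  # (2, 5, 100): every integer index was looked up as a dense vector
```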
The Kaggle competition I tried earlier, Real or Not? NLP with Disaster Tweets, is a natural-language-processing problem, so this time I am experimenting with applying a pretrained embedding (gensim: glove-twitter) to that dataset.
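As an aside, instead of unzipping the archive by hand as below, the same vectors can also be fetched through gensim's downloader API. A sketch; the example words are my own and only assumed to exist in the Twitter vocabulary.

```python
import gensim.downloader as api

# Downloads glove-twitter-100 into ~/gensim-data/ on first use and returns a KeyedVectors object.
glove_vectors = api.load('glove-twitter-100')

print(glove_vectors['disaster'][:5])                # first 5 of 100 dimensions
print(glove_vectors.most_similar('flood', topn=3))  # nearest neighbours in the embedding space
```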
Build an index that maps words to their embedding vectors
```python
import os
import numpy as np

# Directory where the downloaded glove-twitter-100 archive was extracted beforehand.
glove_dir = '/Users/***/gensim-data/glove-twitter-100'

# Parse the GloVe text file: each line is "<word> <100 float coefficients>".
embedding_index = {}
f = open(os.path.join(glove_dir, 'glove-twitter-100'))
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embedding_index[word] = coefs
f.close()

print('Found %s word vectors.' % len(embedding_index))
# Found 1193515 word vectors.
```
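A quick sanity check on the loaded index; I assume here that a common word such as 'happy' exists in the Twitter vocabulary.

```python
vec = embedding_index.get('happy')  # .get() returns None for out-of-vocabulary words
print(vec.shape)  # (100,): glove-twitter-100 vectors have 100 dimensions
print(vec[:5])
```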
- Tokenizing train.csv['text']: last time I could convert the whole column in one go with tfidf_vectorizer, but this time the text has to be tokenized into integer indices in advance because it is fed through the Embedding layer... and for some reason it does not work. The book processes text with the Keras built-in Tokenizer, so I tried the same procedure, but ran into an error (the flow I was aiming for is sketched below).
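For reference, a sketch of the intended flow following the book's procedure; max_words, maxlen, and the train.csv path are my own placeholder assumptions, and this is not the exact code that produced the error.

```python
import numpy as np
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_words = 10000    # keep only the 10,000 most frequent words (assumption)
maxlen = 30          # pad/truncate every tweet to 30 tokens (assumption)
embedding_dim = 100  # matches glove-twitter-100

train = pd.read_csv('train.csv')  # Kaggle "Real or Not?" training file
texts = train['text'].astype(str).tolist()

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)                    # build the word -> index vocabulary
sequences = tokenizer.texts_to_sequences(texts)  # each tweet becomes a list of integer indices
data = pad_sequences(sequences, maxlen=maxlen)   # 2D array that can be fed to the Embedding layer

# Build the weight matrix for the Embedding layer from the GloVe index loaded above.
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in tokenizer.word_index.items():
    if i < max_words:
        vector = embedding_index.get(word)
        if vector is not None:
            embedding_matrix[i] = vector  # words missing from GloVe stay all-zero
```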
Making full use of Google to track the error down.