I would rather heal my eyes with the blue of the ocean than burn them with blue light.
2015 is coming to an end. How is everyone doing? Since I would like to get back to nature at the end of the year, this time I will show how to build an application that interactively proposes travel plans using a topic model.
This article is a companion to the previously published Creating an application using the topic model. At the time, despite the title, we never actually got as far as building an application, so this post fills that gap. I will not touch on the topic model itself here, so if you are interested, please refer to the article above.
I will leave the detailed explanation to that article, but a topic model is, as the name suggests, a method for classifying documents by topic. Concretely, a "topic" here looks something like the following.
This is a word cloud created from travel blogs. A "topic" is thus composed of words, some frequent and some not. The main job of a topic model is to estimate the probability distribution that defines which words appear and with what probability. Once this distribution is known, documents with similar distributions can be grouped together, and the relevance between documents can be estimated from the distance between their distributions.
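To make the idea concrete, here is a toy sketch of what such word distributions might look like. The words and probabilities below are made up for illustration and are not taken from the actual model:

```python
# A "topic" is just a probability distribution over words.
# Hypothetical values for two travel-flavored topics:
topic_beach = {"beach": 0.30, "sea": 0.25, "resort": 0.20, "sunset": 0.15, "swim": 0.10}
topic_city = {"museum": 0.35, "shopping": 0.25, "cafe": 0.20, "metro": 0.12, "night": 0.08}
```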
Since topics can be represented as probability distributions, the distance between distributions can be calculated (this time I used KL divergence). The policy I implement is simple: using this distance, propose a spot from topic A, and if the answer is No, propose one from a distant topic (topic B, the farthest in the figure).
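As a rough sketch of that policy, the snippet below computes the KL divergence between the toy topics above and picks the farthest one. It uses scipy's `entropy(p, q)`, which returns KL(p‖q) for two distributions; note that KL divergence is asymmetric, so treating it as a "distance" is a simplification:

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

def kl_distance(p, q, vocab):
    """KL divergence between two word distributions aligned on a shared vocabulary."""
    eps = 1e-12  # smooth words one topic lacks, so the divergence stays finite
    p_vec = np.array([p.get(w, 0.0) + eps for w in vocab])
    q_vec = np.array([q.get(w, 0.0) + eps for w in vocab])
    return entropy(p_vec, q_vec)  # scipy normalizes the vectors internally

topics = {"A": topic_beach, "B": topic_city}
vocab = sorted(set(topic_beach) | set(topic_city))

# If the user rejects a proposal from topic "A", jump to the farthest topic.
farthest = max((name for name in topics if name != "A"),
               key=lambda name: kl_distance(topics["A"], topics[name], vocab))
print(farthest)  # -> B
```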
The application implemented this time is here.
The app suggests about three candidates from the same topic, which you can flip through with the arrows below. If you like a spot, or if it does not match what you had in mind, you can rate it with the Good/Bad buttons. Based on the rating it receives, the app then proposes spots from a similar or a distant topic.
The repository has a Heroku Button, so you can deploy the app to your own Heroku environment and try it with the topic model I built. The data comes from the AB-ROAD API, so registration for that API is required.
The application is structured as follows.
- `data`: stores the trained model files. Using a database felt like too much trouble, so the spot data is also included here this time, though normally it would be excluded via `.gitignore`.
- `pola`: contains the topic model and the implementation of the dialogue that uses it. `pola` is the name of the engine that conducts this "dialogue using the topic model" (I picked a foreign-sounding name because the app deals with overseas travel).
- `scripts`: various scripts for extracting and formatting the data and for training the model.
- `tests`: test code.

In this structure, I paid attention to the following points.

**Separating the application from the machine learning part (the app and `pola`)**
When several teams develop together, these responsibilities are likely to be split between them, and I think the separation also improves the portability of the machine learning part.

**Separating the machine learning model from data processing (`pola` and `scripts`)**
The machine learning part divides broadly into code for "data extraction and formatting" and the "machine learning model" itself. The former usually depends heavily on the particular data being handled, and if it is baked into the model implementation, the model itself becomes tightly coupled to the data source, so the two are separated here. I think it is desirable to abstract the machine learning model to some extent, along the lines of "it can be applied to any data supplied in this format" (see the sketch after this list).

**Separating the model code from the trained model (`pola` and `data`)**
Modifying the machine learning model itself (changing the algorithm, etc.) and updating the trained model produced by training are different concerns, so the two are explicitly separated this time. Of course, managing the code together with the trained model is also a reasonable approach.

Beyond that, as with the application itself, write proper test code for the machine learning model, and attach its documentation as an iPython notebook.
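To make the second point concrete, here is a minimal sketch of what such an abstracted model could look like. `TravelTopicModel` and the file paths are hypothetical names for illustration, not pola's actual API; the point is that the model class only assumes "documents arrive as lists of tokens", while extraction lives in `scripts` and the trained files live in `data`:

```python
from gensim import corpora, models

class TravelTopicModel:
    """Knows nothing about where documents come from, only their format."""

    def __init__(self, model_path="data/topic.model", dict_path="data/words.dict"):
        self.lda = models.LdaModel.load(model_path)           # trained model kept in data/
        self.dictionary = corpora.Dictionary.load(dict_path)  # vocabulary kept in data/

    def topics_of(self, tokens):
        """Apply the model to any document supplied as a list of tokens."""
        bow = self.dictionary.doc2bow(tokens)
        return self.lda[bow]  # list of (topic_id, probability) pairs
```

With this shape, swapping the data source only touches `scripts`, and retraining only replaces the files under `data`.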
The assumptions behind the topic model built this time, and its verification, can be found in the following iPython notebook.
enigma_abroad/pola/machine/topic_model_evaluation.ipynb
Of course, to make proposals we first need to build the topic model, the brain of the application.
This time, as in the companion article Creating an application using the topic model, I built it with [gensim](https://radimrehurek.com/gensim/) (I also tried pymc, but shelved it because training exhausted the memory). And sadly, the accuracy was not as good as I had hoped... but I will press on.
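For reference, here is a minimal gensim training sketch. `docs` stands in for the tokenized spot descriptions (the real corpus comes from the AB-ROAD API), and the hyperparameters are placeholders, not the values used in the app:

```python
from gensim import corpora, models

# `docs` stands in for the tokenized spot descriptions.
docs = [["beach", "resort", "sea"], ["museum", "cafe", "shopping"]]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Train LDA; num_topics here is a placeholder.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)

# Inspect the most probable words of each estimated topic.
for topic_id in range(lda.num_topics):
    print(topic_id, lda.show_topic(topic_id, topn=3))
```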
Besides, when you actually use machine learning in an application, you rarely see the "99% accuracy!" of tutorials, and when you do, it is usually either an illusion caused by overfitting or a bug of your own making.
Overcoming this takes steady data collection and steady preprocessing. Ah... come to think of it, I set out to do something cool with machine learning, but before I knew it I was meticulously curating the words to exclude from the corpus... Content-based recommendation like this topic model has an advantage over the collaborative filtering commonly used for recommendation: it can make recommendations even when user rating data is scarce. Still, it does not work well unless there is enough content and it is properly formatted (you need not just a good number of documents, but a decent volume of text in each document as well).
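That meticulous stop-word curation looks roughly like this. The word list below is purely illustrative; in practice the list grows as you inspect the corpus:

```python
# Domain words that pollute every topic get excluded by hand.
STOP_WORDS = {"the", "a", "is", "tour", "day"}  # illustrative, not the real list

def preprocess(text):
    tokens = text.lower().split()  # real tokenization would be language-aware
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]

print(preprocess("A 3 day beach tour is the best"))  # -> ['beach', 'best']
```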
So although I turned it into an application, the topic model at its heart never came out well. Last time I worked with hair salon data and this time with travel plans, but both ended with the sad result that topics could not be classified well.
I think the cause of this lies in the data.
In short, I think topic models are best applied in situations where the documents vary widely and each one is reasonably long. If you want finer-grained classification within a single category, I think you need to build in some prior knowledge; a sketch of one such direction follows.
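As one possible direction (a sketch, not something done in this project): gensim's LdaModel accepts an `eta` prior over topic-word weights, which can be used to seed a topic with words you expect it to contain. Reusing `dictionary` and `corpus` from the training sketch above, with illustrative values:

```python
import numpy as np
from gensim import models

num_topics, num_words = 2, len(dictionary)
eta = np.full((num_topics, num_words), 0.01)  # weak symmetric prior everywhere
eta[0, dictionary.token2id["beach"]] = 1.0    # nudge topic 0 toward "beach"

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=num_topics, eta=eta)
```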
I am sure there are many other ideas, so please try building your own model and create an application that will take you to the blue ocean.