There may be mistakes, so if you find one, please point it out.
This time, I would like to introduce Bayesian thinking, an idea you cannot avoid in machine learning. The explanation is based on the following tutorial.
reference: https://github.com/markdregan/Bayesian-Modelling-in-Python
1: The Bayesian way of thinking
2: Trying out Bayesian inference in an IPython notebook
3: How to evaluate a model fitted with Bayesian inference
The purpose of machine learning is to learn patterns from data and handle unknown data. However, perfect data, that is, data that is "complete, consistent, correct and descriptive", is extremely rare.
The Bayesian approach is close to the idea of supplying knowledge in advance so that the model is not easily thrown off by strange data.
I think people do the same thing: they make a prediction before they act, then use the feedback to improve.
However, if the initial prediction is badly wrong, poor results follow, so be careful there.
For example, suppose I want to pick a meat restaurant to take a woman to (this is based purely on my own assumptions and prejudice):
- Women in their early twenties: take them to a fashionable place they have not been to, or one that offers something they would not normally experience.
- Women in their late twenties: they have already been to most of the fashionable places, so simply picking a fashionable one is not enough. A hideaway or a personal favorite would work better.
In this way, even for the same task of hosting a woman, the prior distribution differs with age, and getting the prior wrong leads to a terrible outcome.
Rather than following the mathematical formulas, the tutorial teaches Bayesian ideas using actual data. It is in English, so I translated it into Japanese as part of my study and am leaving these notes as a record of working through it.
Chapter 3 and later cover advanced material, so if you work through Chapter 2 you will have covered everything from building a model to evaluating it.
If you are interested in the original tutorial, see below.
https://github.com/markdregan/Bayesian-Modelling-in-Python
The tutorial uses your own Google Hangouts data. Acquiring the data takes time, so I recommend starting the download while you do other work.
If you don't have a Python environment set up, you can install the necessary libraries using the requirement.txt below.
numpy==1.9.2
ipython==4.0.0
notebook==4.0.4
jinja2==2.8
pyzmq==14.7.0
tornado==4.1
matplotlib
simplejson
pandas
seaborn
datetime
scipy
patsy
statsmodels
git+https://github.com/pymc-devs/pymc3
I tried it in an OSX environment, but I got the following error.
RuntimeError: Python is not installed as a framework. The Mac OS X backend will not be able to function correctly if Python is not installed as a framework. See the Python documentation for more information on installing Python as a framework on Mac OS X. Please either reinstall Python as a framework, or try one of the other backends.
The fix is described here:
http://stackoverflow.com/questions/21784641/installation-issue-with-matplotlib-python
The JSON file you obtain is deeply nested, so here is a brief description of the data fields.
Field | Description | Example |
---|---|---|
conversation_id | Conversation id representing the chat thread | Ugw5Xrm3ZO5mzAfKB7V4AaABAQ |
participants | List of participants in the chat thread | [Mark, Peter, John] |
event_id | Id representing an event such as chat message or video hangout | 7-H0Z7-FkyB7-H0au2avdw |
timestamp | Timestamp | 2014-08-15 01:54:12 |
message | Content of the message sent | Went to the local wedding photographer today |
sender | Sender of the message | Mark Regan |
The tutorial parses the JSON data and converts it to a pandas DataFrame with one row per message.
Please note that the code below keeps only messages sent by the tutorial author (Mark Regan) and excludes his conversation with Alison Darcy, so with your own data you will get no rows unless you comment it out or change the names.
messages = messages[(messages['sender'] == 'Mark Regan') & (messages['participants_str'] != 'Alison Darcy, Mark Regan')]
1: Does response time depend on who you are talking to?
2: What factors affect the response time?
3: Which day is the worst?
In my case, I don't use Google Hangouts much, so the result I got was not usable for the problem I want to solve, and I decided to use the tutorial's data as a reference instead.
Here is the main part of the Bayesian tutorial. The diagrams make the results easy to understand, which is interesting.
First, the Bayesian way of thinking. As an example:
A boy counts the number of cars passing in front of his house each day and writes the count in his notebook.
From the Bayesian point of view, the observed data are random, but they are generated by some probability distribution.
For discrete counts like the ones in this example, consider using the Poisson distribution.
The example shows Poisson distributions with means of 5, 20, and 40.
Green is the distribution with a mean of 5, orange the one with a mean of 20, and pink the one with a mean of 40.
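To make the three curves concrete, here is a minimal sketch that evaluates the Poisson probability mass function for the same three means. It uses only the standard library rather than the tutorial's own plotting code, and the function name `poisson_pmf` is my own, not something from the tutorial.

```python
import math


def poisson_pmf(k, mu):
    """Probability of observing exactly k events when the mean is mu."""
    # Work in log space so that large k does not overflow the factorial.
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


# The three means used in the example figure.
for mu in (5, 20, 40):
    pmf = [poisson_pmf(k, mu) for k in range(0, 121)]
    total = sum(pmf)                                   # should be very close to 1
    mean = sum(k * p for k, p in enumerate(pmf))       # should be very close to mu
    print(f"mu={mu:2d}  total probability={total:.6f}  mean={mean:.3f}")
```

As the mean grows, the distribution shifts to the right and spreads out, because a Poisson distribution's variance is equal to its mean.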
By fitting the response times to a Poisson distribution and estimating its parameter with Bayesian inference, we try to answer the questions raised above.
First, the mean of the Poisson distribution is estimated by maximum likelihood, using the log-likelihood. The likelihood for each candidate value of the mean can be checked below.
The likelihood is highest when the mean is close to 20. The fitted Poisson distribution of response times is as follows, and the most common response time is 18 seconds.
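The maximum likelihood step can be sketched as a simple grid search over candidate means. The response times below are made-up numbers for illustration, not the tutorial's data, and the function name is my own. For a Poisson model the maximum likelihood estimate is known to equal the sample mean, so the grid search should land on it.

```python
import math


def poisson_log_likelihood(mu, data):
    """Log-likelihood of the counts in data under a Poisson(mu) model."""
    return sum(k * math.log(mu) - mu - math.lgamma(k + 1) for k in data)


# Hypothetical response times in seconds (assumption, not the tutorial's data).
response_times = [12, 25, 17, 31, 9, 22, 18, 14, 27, 15]

# Candidate means from 1.0 to 60.0 in steps of 0.1.
candidates = [m / 10 for m in range(10, 601)]
best_mu = max(candidates, key=lambda mu: poisson_log_likelihood(mu, response_times))

sample_mean = sum(response_times) / len(response_times)
print("grid-search estimate:", best_mu)
print("sample mean:         ", sample_mean)
```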
The prior information used to apply Bayes in this example is that the value falls within the range of 10 to 60. The task is to define a Poisson distribution on top of that prior and estimate its mean.
MCMC
This technique repeatedly proposes a new value for the mean, checks how well it explains the data, and tends to keep values that explain the data well. A good point of this method is that the starting values are drawn at random from the prior distribution, so it can begin before any data have been taken into account. Note that it does not stop once the likelihood is maximized: it keeps sampling, and the collected samples concentrate around the values with high likelihood.
However, it has the disadvantages that convergence is difficult when there are many parameters to estimate, and that it works poorly when the prior distribution is not appropriate.
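As a rough illustration of the idea, here is a minimal random-walk Metropolis sampler for the Poisson mean. This is a sketch and not the sampler the tutorial uses (the tutorial relies on PyMC). I am assuming a uniform prior on the mean over the range 10 to 60, the data are made up, and the function names, starting value, and step size are arbitrary choices of mine.

```python
import math
import random


def log_posterior(mu, data, low=10.0, high=60.0):
    """Unnormalised log posterior: Uniform(low, high) prior times Poisson likelihood."""
    if not (low <= mu <= high):
        return -math.inf  # the assumed prior gives zero probability outside the range
    return sum(k * math.log(mu) - mu - math.lgamma(k + 1) for k in data)


def metropolis(data, n_samples=5000, start=40.0, step=1.0, seed=0):
    """Random-walk Metropolis sampler for the Poisson mean."""
    rng = random.Random(seed)
    current = start
    current_lp = log_posterior(current, data)
    trace = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_posterior(proposal, data)
        # Accept with probability min(1, posterior(proposal) / posterior(current)).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        trace.append(current)  # a rejected proposal repeats the current value
    return trace


# Hypothetical response times in seconds (assumption, not the tutorial's data).
response_times = [12, 25, 17, 31, 9, 22, 18, 14, 27, 15]

trace = metropolis(response_times)
kept = trace[1000:]  # drop the early samples taken before the chain settled
print("posterior mean estimate:", sum(kept) / len(kept))
```

The chain starts far from the data at 40 and drifts toward the region where the posterior is high. That is why the early samples are discarded before summarising.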
If you run the tutorial in an IPython notebook, you can watch the parameters being estimated as samples are generated, so please try it and see how it works.
This is the result estimated by MCMC. The sampled values of the mean fall between 17 and 19, centered just over 18, so it is about as accurate as the simpler estimate obtained earlier.
The trace of the values sampled by MCMC can be checked below. Since the estimated mean does not always converge as expected, check this trace to see how the chain moved.
It is also necessary to check the autocorrelation, that is, the correlation between each sampled value and the values sampled before it.
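The autocorrelation check can be sketched as follows. The chain here is a synthetic one generated only for illustration (each value is 0.9 times the previous one plus noise), not output from the tutorial, and the function name is my own.

```python
import random


def autocorrelation(chain, lag):
    """Sample autocorrelation of a sequence at the given lag."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain)
    cov = sum((chain[i] - mean) * (chain[i + lag] - mean) for i in range(n - lag))
    return cov / var


# Synthetic, strongly correlated chain (assumption for illustration only).
rng = random.Random(1)
chain = [0.0]
for _ in range(4999):
    chain.append(0.9 * chain[-1] + rng.gauss(0.0, 1.0))

for lag in (1, 5, 20, 50):
    print(f"lag {lag:2d}: {autocorrelation(chain, lag):.3f}")
```

If the autocorrelation stays high at large lags, the samples are far from independent and the chain needs to run longer or be thinned.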
There are two points to check:
1: Does the model represent the data?
2: How do the candidate models compare?
Let's compare the data with the predicted distribution. The mode of the predicted distribution and the most frequent response time do not match, so this model is not suitable in this case.
The negative binomial distribution is similar to the Poisson distribution, but it has a separate parameter for the variance as well as the mean, so let's replace the Poisson with it.
The distribution is similar as shown below.
The negative binomial distribution has two parameters, α and μ, which are estimated as follows.
The estimates are as follows: α falls between 1.4 and 2.4, and the extra variance parameter makes the distribution more expressive.
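To see what the extra parameter buys, here is a minimal sketch of the negative binomial probability mass function in the (μ, α) parametrization, in which the mean is μ and the variance is μ + μ²/α. The values μ = 18 and α = 2 are illustrative picks of mine (α chosen inside the 1.4 to 2.4 range mentioned above), not the tutorial's estimates.

```python
import math


def neg_binomial_pmf(k, mu, alpha):
    """Negative binomial pmf with mean mu and dispersion alpha."""
    log_p = (
        math.lgamma(k + alpha) - math.lgamma(k + 1) - math.lgamma(alpha)
        + alpha * math.log(alpha / (mu + alpha))
        + k * math.log(mu / (mu + alpha))
    )
    return math.exp(log_p)


mu, alpha = 18.0, 2.0  # illustrative values (assumption)

pmf = [neg_binomial_pmf(k, mu, alpha) for k in range(0, 1000)]
mean = sum(k * p for k, p in enumerate(pmf))
variance = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))

print("mean:              ", round(mean, 4))
print("variance:          ", round(variance, 4))
print("mu + mu**2 / alpha:", mu + mu ** 2 / alpha)
print("Poisson variance would be:", mu)
```

A Poisson distribution with the same mean is forced to have a variance of 18, while the negative binomial can spread much wider. A smaller α means a larger variance.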
Below is a plot of the observed response times together with the distribution built from the α and μ estimated above. The distribution resembles the response time distribution and captures its shape better.
The tutorial also builds a model that combines the Poisson and negative binomial distributions in order to compare them.
It calculates the Bayes factor and decides which model to use according to the criteria below.
This time we covered the basics up to Chapter 2. Chapter 3 is the advanced part, so please give it a try.
https://github.com/markdregan/Bayesian-Modelling-in-Python/blob/master/Section%203.%20Hierarchical%20modelling.ipynb
Pakutaso https://www.pakutaso.com/
Learn statistical modeling with Stan (2): What was MCMC in the first place? http://tjo.hatenablog.com/entry/2014/02/08/173324
Knowledge to pretend to know Bayesian inference http://www.anlyznews.com/2012/01/blog-post_31.html
Bayesian-Modelling-in-Python https://github.com/markdregan/Bayesian-Modelling-in-Python