Following my previous article on the ROC curve, this time I will use animated graphs to explain the meaning of the **Q-Q plot**, which appears in the official textbook for Statistics Test Level 2. It is another slightly unusual graph that takes a little know-how to read, so I would like to walk through it. In R you can draw a Q-Q plot with qqnorm, but that is a black box and does not show how the plot works, so I wrote my own in Python.
The data used is the condominium rent data from the textbook. Here it is.
Mansion2.data
| | Walk_min | distance | Price | Type | Area | Direction | Year |
|---|---|---|---|---|---|---|---|
| 0 | 8 | B | 7900 | 1K | 30.03 | South | 3 |
| 1 | 9 | B | 8500 | 1K | 21.9 | South | 5 |
| 2 | 10 | B | 10800 | 1K | 27.05 | South | 4 |
| 3 | 10 | B | 10800 | 1K | 29.67 | South | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 186 | 8 | B | 7100 | 1K | 22 | West | 17 |
| 187 | 9 | B | 18400 | 1LDK | 54.68 | West | 10 |
You can download this data from the "Data for download" link partway down the page for the official textbook of Statistics Test Level 2. Unzip the downloaded zip file; **Mansion2.data** in the [Chapter 2]-[Body] folder is the data used this time.
Now that I have the data, I would first like to draw some graphs to get a feel for it :blush:
Fig.1
The prices are concentrated toward the left, giving a histogram with a long tail on the right. There also appears to be a correlation between price and area.
Since this Q-Q plot focuses on price, let's dig one step further into price and ask the question the graph is meant to answer: does this distribution follow a normal distribution?
Actually, when I overlay the normal density function with the mean and standard deviation computed from this data, as shown below, it clearly does not fit, but I will carry on regardless (lol).
Fig.2
Fig.3
The Python code for drawing the above set of graphs is here.
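Separately from the linked code, here is a minimal sketch of the comparison behind Fig.2: fit a normal distribution using the sample mean and standard deviation, then compare its density with the histogram bin by bin. The `prices` list is a made-up right-skewed stand-in, not the real Mansion2.data values, and the bin count of 10 is an arbitrary choice. Only the standard library is used.

```python
import math
from statistics import NormalDist, mean, stdev

# Hypothetical stand-in for the 188 rents: a deterministic, right-skewed
# (log-normal shaped) sample. These are NOT the values in Mansion2.data.
n = 188
std_normal = NormalDist()
prices = [math.exp(9.2 + 0.5 * std_normal.inv_cdf((i + 0.5) / n)) for i in range(n)]

# Normal distribution with the same mean and standard deviation as the data.
fitted = NormalDist(mean(prices), stdev(prices))

# Histogram counts over equal-width bins (10 bins is an arbitrary choice).
bins = 10
lo, hi = min(prices), max(prices)
width = (hi - lo) / bins
counts = [0] * bins
for p in prices:
    k = min(int((p - lo) / width), bins - 1)  # put the maximum in the last bin
    counts[k] += 1

# Compare the histogram density with the fitted normal density in each bin.
for k, c in enumerate(counts):
    centre = lo + (k + 0.5) * width
    hist_density = c / (n * width)
    print(f"{centre:9.0f}  data={hist_density:.2e}  normal={fitted.pdf(centre):.2e}")

# For a symmetric distribution about half the data lies below the mean;
# a long right tail pushes this share above one half.
share_below_mean = sum(p < fitted.mean for p in prices) / n
print(f"share of data below the mean: {share_below_mean:.3f}")
```

With skewed data like this, the bins near the peak sit well above the normal curve and the right-hand bins stay occupied where the normal density has almost vanished, which is the kind of mismatch described above.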
Well, anyway, let's take a look at the Q-Q plot itself.
Fig.4
This is the Q-Q plot for the "Price" column, that is, the rent data we are looking at this time. At first glance it is not obvious what the graph shows. The textbook explains that "a Q-Q plot is a graph for comparing the obtained data with a theoretical distribution and examining their similarity." **If they are similar, the plotted points line up along a straight line**, it says.
So how should the graph above be read? Fig.4 can be thought of as a reshaped version of Fig.2. In other words, it lets you judge visually how similar the rent data is to the theoretical distribution, here the normal distribution, by whether or not the points fall on a straight line.
The graph measures similarity to the theoretical distribution by how straight the points are, but to see why that works you need to understand how the graph is drawn, so I will explain that next.
Let's use the rent data again. This is the shape of its distribution. Fig.5
Two intermediate graphs are used to build the Q-Q plot from here.
The first is made by sorting the rent data in ascending order and plotting one dot per value. There are 188 data points in total, and they are spaced evenly between 0 and 1 along the horizontal axis. Fig.6
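To make this first step concrete, here is a small sketch using only the six rents visible in the table excerpt above. The positions `(i + 0.5) / n` are an assumption about what "evenly spaced between 0 and 1" means; other conventions exist and the original code may use a different one.

```python
# Six of the rents visible in the table excerpt (the full file has 188 rows).
prices = [7900, 8500, 10800, 10800, 7100, 18400]

n = len(prices)
sorted_prices = sorted(prices)

# Assumed plotting positions: the midpoints of n equal slices of [0, 1].
positions = [(i + 0.5) / n for i in range(n)]

# Each pair is one dot: horizontal = quantile position, vertical = rent.
for q, p in zip(positions, sorted_prices):
    print(f"{q:.3f}  {p}")
```

Using slice midpoints keeps every position strictly between 0 and 1, which matters later because the normal distribution has no finite value at exactly 0 or 1.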
For the second graph, since the theoretical distribution assumed this time is the normal distribution, draw the normal cumulative distribution function. It is also drawn with 188 points, the same number as the rent data. Fig.7
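A possible way to produce those 188 points is sketched below with `statistics.NormalDist` from the standard library. Two things here are assumptions: the standard normal (mean 0, standard deviation 1) is used, and the positions are the midpoints `(i + 0.5) / n`. The original code may differ on both.

```python
from statistics import NormalDist

n = 188  # same number of points as the rent data
std_normal = NormalDist()  # assumption: standard normal, mean 0 and sd 1

# Quantile positions on the vertical axis (assumed midpoint convention).
positions = [(i + 0.5) / n for i in range(n)]

# Horizontal coordinate of each point: the value whose CDF equals the position.
normal_values = [std_normal.inv_cdf(q) for q in positions]

# Sanity check: pushing the values back through the CDF returns the positions.
round_trip = [std_normal.cdf(x) for x in normal_values]

print(normal_values[0], normal_values[n // 2], normal_values[-1])
```

Plotting `positions` against `normal_values` traces the S-shaped cumulative distribution function, sampled at evenly spaced heights rather than evenly spaced horizontal values.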
Combining these two graphs gives the Q-Q plot. Let's watch it as an animation.
Fig.8
Intermediate graph Fig.6 is at the upper right, and Fig.7 is at the lower left. The upper left is the Q-Q plot we are after. The horizontal axis of the rent data graph at the upper right represents the quantile, and so does the vertical axis of the normal cumulative distribution function at the lower left. Slide this quantile from 0 to 1 simultaneously in both graphs; the black lines show its current position. The points where the black lines cross the curves are shown as red dots, and the Q-Q plot is what you get by plotting these two red dots against each other, as the dotted lines indicate. The "Q" in Q-Q plot stands for quantile, and I think the name comes from moving the quantile in the upper-right and lower-left graphs at the same time.
(Python code is here)
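Apart from the animation code, the pairing step itself can be sketched in a few lines: for each quantile position, read the rent from the sorted data and the value from the normal distribution, and plot one against the other. The function name `qq_points`, the midpoint positions, and the use of the standard normal are all assumptions made for illustration; the data is just the six rents visible in the table excerpt.

```python
from statistics import NormalDist

def qq_points(data, dist=NormalDist()):
    """Return (theoretical quantile, sample quantile) pairs for a Q-Q plot.

    Hypothetical helper: positions (i + 0.5) / n are an assumed convention.
    """
    n = len(data)
    sample_q = sorted(data)
    theoretical_q = [dist.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical_q, sample_q))

# Six of the rents visible in the table excerpt (the full file has 188 rows).
prices = [7900, 8500, 10800, 10800, 7100, 18400]

points = qq_points(prices)
for t, s in points:
    print(f"normal = {t:+.3f}   rent = {s}")
```

Each printed row corresponds to one position of the sliding black line: the normal value is the horizontal coordinate of the red dot in the lower-left graph, and the rent is the vertical coordinate of the red dot in the upper-right graph.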
If the data and the theoretical distribution are the same, the Q-Q plot should be a straight line, so let's check that too, using random numbers that follow a normal distribution. Here is a histogram of 188 normally distributed random numbers.
Drawing the Q-Q plot... it is indeed very close to a straight line :relaxed:
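One way to put a number on "how straight" is the correlation coefficient between the theoretical quantiles and the sorted data. This is a sketch, not something the textbook or the original code necessarily does; the helper name `qq_straightness`, the seed, and the midpoint positions are assumptions.

```python
import random
from statistics import NormalDist, mean

def qq_straightness(data):
    """Pearson correlation between normal quantiles and the sorted data.

    Hypothetical helper: values close to 1 mean the Q-Q points are
    close to a straight line.
    """
    n = len(data)
    x = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    y = sorted(data)
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
normal_sample = [rng.gauss(0, 1) for _ in range(188)]

r_normal = qq_straightness(normal_sample)
print(f"r = {r_normal:.4f}")
```

Because the sample is random, the points wobble around the line, most visibly at the two ends, so the coefficient is close to 1 rather than exactly 1.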
Next is the exponential distribution, which has a long tail on the right.
For a shape like this, the normal Q-Q plot is convex toward the lower right.
Next is an F distribution, with a slightly long tail on the right. Its Q-Q plot is also slightly convex toward the lower right.
Next, let's draw a Q-Q plot for a distribution with a long tail on the left, the beta distribution with $\alpha = 6, \beta = 2$. This time, conversely, the Q-Q plot is convex toward the upper left.
Something a little different is the beta distribution with $\alpha = 0.5, \beta = 0.5$, which has peaks at both ends. In this case the Q-Q plot is convex toward the lower right in the first half and convex toward the upper left in the second half.
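These bending directions can be checked without random noise by using each distribution's exact quantile function instead of a sample. The sketch below does this for three of the distributions above; the F distribution is left out because it has no simple closed-form quantile function. The helper names `invert_cdf` and `bend`, the exponential rate of 1, and the choice of probe quantiles are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def invert_cdf(cdf, p, lo=0.0, hi=1.0):
    """Solve cdf(x) = p on [lo, hi] by bisection (cdf must be increasing)."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Exact quantile functions Q(p).
quantile = {
    # Exponential with rate 1 (assumed): F(x) = 1 - exp(-x).
    "exponential": lambda p: -math.log(1 - p),
    # Beta(6, 2): F(x) = 7x^6 - 6x^7, inverted numerically.
    "beta(6, 2)": lambda p: invert_cdf(lambda x: 7 * x**6 - 6 * x**7, p),
    # Beta(0.5, 0.5): F(x) = (2 / pi) * asin(sqrt(x)).
    "beta(0.5, 0.5)": lambda p: math.sin(math.pi * p / 2) ** 2,
}

def bend(q, p_lo, p_mid, p_hi):
    """Height of the Q-Q curve at p_mid relative to the chord from p_lo to p_hi.

    Negative: the curve lies below the chord (convex toward the lower right).
    Positive: the curve lies above the chord (convex toward the upper left).
    """
    z = NormalDist().inv_cdf  # horizontal axis: standard normal quantiles
    x0, x1, x2 = z(p_lo), z(p_mid), z(p_hi)
    y0, y1, y2 = q(p_lo), q(p_mid), q(p_hi)
    chord = y0 + (y2 - y0) * (x1 - x0) / (x2 - x0)
    return y1 - chord

for name, q in quantile.items():
    whole = bend(q, 0.1, 0.5, 0.9)
    first_half = bend(q, 0.1, 0.3, 0.5)
    second_half = bend(q, 0.5, 0.7, 0.9)
    print(f"{name:15s} whole={whole:+.3f} first={first_half:+.3f} second={second_half:+.3f}")
```

The signs follow the descriptions above: negative over the whole range for the exponential, positive for the beta distribution with $\alpha = 6, \beta = 2$, and negative in the first half but positive in the second half for the beta distribution with $\alpha = 0.5, \beta = 0.5$.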
The full text of the Python code for drawing the graphs on this page is here