Using Python, this article looks at what the sum, mean, median, mode, variance, and standard deviation represent, and at what processing is performed to compute each of them.
Environment: macOS 10.10.5 (OS X Yosemite), Python 3.6.1 |Anaconda 4.4.0 (x86_64)|
Reference books:
Introduction to mathematics starting from Python
Introduction to Complete Self-study Statistics
↑ I recommend this one because it makes statistics very easy to understand.
If you just want the answer rather than the calculation method, you can use the statistics module as shown below. The format specification {0:.2f} displays the value to two decimal places.
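from statistics import mean, median, variance, stdev
data = [100, 200, 300, 400, 500, 500, 600, 700, 800, 800]
# Store the results under names that do not shadow the imported functions
m = mean(data)
med = median(data)
# Note: variance() and stdev() are the sample versions (they divide by N - 1);
# statistics.pvariance() and statistics.pstdev() divide by N instead
var = variance(data)
sd = stdev(data)
print('Average: {0:.2f}'.format(m))
print('Median: {0:.2f}'.format(med))
print('Variance: {0:.2f}'.format(var))
print('Standard deviation: {0:.2f}'.format(sd))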
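One point worth checking here: variance and stdev in the statistics module are the sample versions (dividing by N - 1), while the hand-written calculation later in this article divides by N. The following is a minimal sketch, using the same data, that compares them with the population versions pvariance and pstdev from the same module. The variable names are only for illustration.

```python
from statistics import variance, stdev, pvariance, pstdev

data = [100, 200, 300, 400, 500, 500, 600, 700, 800, 800]

# Sample statistics: the sum of squared deviations is divided by N - 1
sample_var = variance(data)
sample_sd = stdev(data)

# Population statistics: the sum of squared deviations is divided by N
population_var = pvariance(data)
population_sd = pstdev(data)

print('Sample variance: {0:.2f}'.format(sample_var))
print('Sample standard deviation: {0:.2f}'.format(sample_sd))
print('Population variance: {0:.2f}'.format(population_var))
print('Population standard deviation: {0:.2f}'.format(population_sd))
```

The two pairs differ noticeably for a list of only ten values, so it helps to decide up front which definition you need.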
There are various other approaches as well, such as using NumPy (see: Introduction to Python Numerical Library NumPy).
About the average
data = [800,200,700,300,100,400,500,500,600,800]
s = sum(data)
N = len(data)
mean = s / N
print('average:{0:.2f}'.format(mean))
The average can be calculated with the code above. First, add up all the numbers in the data list with the sum function, and count how many numbers it contains with the len function. Dividing the total by the count gives the average.
The median is the value in the middle of a collection of numbers. In other words, it is the value whose rank is the same whether you count from the top or from the bottom. If the number of values is odd, there is exactly one middle value, and that value is the median. If the number of values is even, there are two middle values, and the median is the average of the two.
Consider the test scores of three people (Mr. A, Mr. B, and Mr. C). If Mr. A has 80 points, Mr. B has 60 points, and Mr. C has 100 points, the median is Mr. A's 80 points, because Mr. A has the same rank whether counting from the top or from the bottom. If Mr. D (70 points) is added, Mr. A and Mr. D share the middle of the ranking, so the median is the average of their scores: (80 + 70) / 2 = 75 points.
The data list used here has an even number of elements, but an if statement is used for conditional branching so that the median can also be found when the number of elements is odd. Finding the median also requires the data to be in ascending order, so the sort() method is used to sort the list first.
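As a quick check of the example above, here is a minimal sketch using statistics.median from the standard library. The list names are made up for illustration.

```python
from statistics import median

# Scores of Mr. A, Mr. B, and Mr. C (odd number of values)
three_scores = [80, 60, 100]
median_of_three = median(three_scores)

# Add Mr. D's 70 points (even number of values)
four_scores = three_scores + [70]
median_of_four = median(four_scores)

print('Median of three scores:', median_of_three)
print('Median of four scores:', median_of_four)
```

With three scores the single middle value is returned, and with four scores the two middle values (70 and 80) are averaged.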
data = [100, 200, 300, 400, 500, 500, 600, 700, 800, 800]
N = len(data)
# The data must be in ascending order before picking the middle values
data.sort()
if N % 2 == 0:
    # Even count: average the two middle values.
    # Their 1-based positions are N/2 and N/2 + 1; Python indexes from 0, so subtract 1.
    # Integer division (//) is used because / returns a float even when
    # the result is a whole number (6 / 3 = 2.0).
    median1 = N // 2 - 1
    median2 = N // 2
    median = (data[median1] + data[median2]) / 2
else:
    # Odd count: the single middle value is at 1-based position (N + 1)/2
    median = data[(N + 1) // 2 - 1]
print('The median of the data is:', median)
The mode is the value that appears most often. For example, in the array [1,1,1,1,2,2,3,4], 1 appears four times, so 1 is the mode. To find the most frequent elements, it is convenient to use the most_common() method of the Counter class, which returns each element with its count, ordered from most to least frequent.
>>> from collections import Counter
>>> numbers = [1,1,2,2,3,4,5,5,5]
>>> c = Counter(numbers)
>>> c.most_common()
[(5, 3), (1, 2), (2, 2), (3, 1), (4, 1)]
When you want only the most frequent element, pass 1 as the argument:
>>> c.most_common(1)
[(5, 3)]
If you want just the most frequent value, or just its number of appearances, index into the result:
>>> mode = c.most_common(1)
>>> mode[0]
(5, 3)
>>> mode[0][0]
5
>>> mode[0][1]
3
The data list used in this article has more than one mode, so consider the case where there are multiple modes.
from collections import Counter
def calculate_mode(data):
    c = Counter(data)
    # Extract all elements and their number of occurrences, most frequent first
    freq_scores = c.most_common()
    # The first entry [0] is the most frequent element; [0][1] is its count
    max_count = freq_scores[0][1]
    modes = []
    # Collect every element whose count equals the maximum count
    for num in freq_scores:
        if num[1] == max_count:
            modes.append(num[0])
    return modes
if __name__ == '__main__':
    data = [100, 200, 300, 400, 500, 500, 600, 700, 800, 800]
    modes = calculate_mode(data)
    print('The most frequent number is:')
    for mode in modes:
        print(mode)
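As an aside, newer Python versions offer a shortcut. The sketch below assumes Python 3.8 or later, where statistics.multimode is available; it will not run on the Python 3.6.1 environment used in this article.

```python
# Assumes Python 3.8+; statistics.multimode does not exist in Python 3.6
from statistics import multimode

data = [100, 200, 300, 400, 500, 500, 600, 700, 800, 800]

# multimode returns a list of every value tied for the highest count,
# in the order the values first appear in the data
modes = multimode(data)

print('The most frequent numbers are:', modes)
```

Writing the function by hand as above is still useful for seeing what is going on inside.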
Understanding the variance and standard deviation requires the idea of mean and deviation, so I will explain them together.
name | Mathematics (score) |
---|---|
Mr. A | 60 |
Mr. B | 80 |
Mr. C | 90 |
Mr. D | 40 |
Mr. E | 70 |
Based on the above five math scores, consider the mean, deviation, variance, and standard deviation.
Value to find | Formula |
---|---|
Average score | Total of the 5 math scores ÷ number of people |
Deviation | Each individual's score − average score |
Variance | Total of the squared deviations ÷ number of people |
Standard deviation | Square root of the variance |
First, find the average score from the test results of the 5 people above.
(60 + 80 + 90 + 40 + 70) ÷ 5 = **68** is the average score. It is the total score of the 5 people divided by the number of people who took the test.
Next, subtract the average score from the score of each individual who took the test to get the deviation.
Name | Formula (score − average score) | Deviation |
---|---|---|
Mr. A | 60-68 | -8 |
Mr. B | 80-68 | 12 |
Mr. C | 90-68 | 22 |
Mr. D | 40-68 | -28 |
Mr. E | 70-68 | 2 |
The deviation can be calculated with the formula above. Because each deviation is the difference from the average, adding up all the deviations gives **0**.
Variance is a measure of how scattered the data is. The deviations (each score minus the average) look like they could show the spread directly, but their total is always 0. Instead, each deviation is squared, and the average of the squared deviations is taken as the variance.
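The deviation table can be reproduced with a short sketch like the following. The dictionary and variable names are only for illustration.

```python
# Math scores from the table above
scores = {'A': 60, 'B': 80, 'C': 90, 'D': 40, 'E': 70}

# Average score: total of the scores divided by the number of people
average = sum(scores.values()) / len(scores)

# Deviation: each individual's score minus the average score
deviations = {name: score - average for name, score in scores.items()}

for name, dev in deviations.items():
    print('Mr. {0}: {1:+.0f}'.format(name, dev))

# The deviations cancel each other out
total = sum(deviations.values())
print('Sum of deviations: {0:.0f}'.format(total))
```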
Name | Formula | Squared deviation |
---|---|---|
Mr. A | (-8)² | 64 |
Mr. B | 12² | 144 |
Mr. C | 22² | 484 |
Mr. D | (-28)² | 784 |
Mr. E | 2² | 4 |
Total | --- | 1480 |
Variance | 1480 ÷ 5 | 296 |
The sum of the squared deviations of the 5 people above (1480) ÷ number of people (5) = **296** is the variance.
Because the variance is built from squared values, it becomes very large, which makes it hard to read how scattered the data is. Taking the square root of the variance gives a value that is easier to interpret, and this value is the standard deviation. The square root of 296 is 17.20..., so the standard deviation is 17.20....
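To confirm the numbers for the five math scores, here is a minimal sketch that follows the same steps and cross-checks the result against pvariance and pstdev from the standard statistics module (the population versions, which divide by the number of people). The variable names are only for illustration.

```python
from statistics import pvariance, pstdev

scores = [60, 80, 90, 40, 70]

# Step 1: average score
average = sum(scores) / len(scores)

# Step 2: squared deviation of each score from the average
squared_deviations = [(s - average) ** 2 for s in scores]

# Step 3: variance = total of squared deviations / number of people
variance = sum(squared_deviations) / len(scores)

# Step 4: standard deviation = square root of the variance
std = variance ** 0.5

print('Variance: {0:.2f}'.format(variance))
print('Standard deviation: {0:.2f}'.format(std))

# Cross-check with the standard library's population statistics
print('pvariance: {0:.2f}'.format(pvariance(scores)))
print('pstdev: {0:.2f}'.format(pstdev(scores)))
```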
The following code finds the variance and standard deviation in Python:
def calculate_mean(data):
    s = sum(data)
    N = len(data)
    mean = s / N
    return mean
# Find the deviation of each value from the mean
def find_difference(data):
    mean = calculate_mean(data)
    diff = []
    for num in data:
        diff.append(num - mean)
    return diff
def calculate_variance(data):
    diff = find_difference(data)
    # Find the square of each deviation
    squared_diff = []
    for d in diff:
        squared_diff.append(d**2)
    # Find the variance (average of the squared deviations, dividing by N)
    sum_squared_diff = sum(squared_diff)
    variance = sum_squared_diff / len(data)
    return variance
if __name__ == '__main__':
    data = [100, 200, 300, 400, 500, 500, 600, 700, 800, 800]
    variance = calculate_variance(data)
    print('The value of the variance is:{0}'.format(variance))
    # The standard deviation is the square root of the variance
    std = variance**0.5
    print('The standard deviation is:{0}'.format(std))
That's it.