When doing something on a PC, using Excel is inevitable. Imadoki, I'm sure there are people who use Lotus 1-2-3 saying "I hate Microsoft". Some measurement data etc. are saved on the Excel sheet. When it becomes, it is better to be able to read the data from the Excel file.

Is it Ikan in CSV?

If you can afford to open the file in Excel and save it in CSV or tab-delimited text format, of course, that's fine. But sometimes it happens. The results of the 30-day experiment are 30 files. In that case, repeat the operation of opening each file one by one and saving it in CSV again, or write Excel VBA that will do the operation automatically, hmm, it's good, Excel is Excel ..

I can't say that, so look for a package that can read and write Excel files provided by conda. At times like this, there is no choice but to google. There seems to be a package called xlrd.

% anaconda search -t conda xlrd
Using Anaconda API: https://api.anaconda.org/
Run 'anaconda show <USER/PACKAGE>' to get more details:
Packages:
     Name                      |  Version | Package Types   | Platforms
     ------------------------- |   ------ | --------------- | ---------------
     RMG/xlrd                  |    0.9.4 | conda           | linux-64, win-32, win-64, linux-32, osx-64
     RahulJain/xlrd            |    1.0.0 | conda           | win-64
     aetrial/xlrd              |          | conda           | linux-64, osx-64
     anaconda/xlrd             |    1.0.0 | conda           | linux-64, win-32, win-64, linux-32, osx-64
     auto/xlrd                 |    0.8.0 | conda           | linux-64
                                          : http://www.python-excel.org/
     jetztcast/xlrd            |    0.9.3 | conda           | linux-64, osx-64
                                          : Library for developers to extract data from Microsoft Excel (tm) spreadsheet files
Found 6 packages
% anaconda show anaconda/xlrd
Using Anaconda API: https://api.anaconda.org/
Name:    xlrd
Summary:
Access:  public
Package Types:  conda
Versions:
   + 0.9.2
   + 0.9.4
   + 0.9.3
   + 0.9.0
   + 0.9.1
   + 1.0.0

To install this package with conda run:
     conda install --channel https://conda.anaconda.org/anaconda xlrd

Read data from Excel sheet

When I was assigned to work alone and lived in Kyoto, it was often said that "Kyoto would be hot", so I downloaded the daily maximum temperature data for Tokyo and Kyoto in July 2016 from the Meteorological Agency for verification. I tried to summarize it in Excel.

It's hotter in Kyoto just by looking at it. You don't even have to parse it in Python. But let's take a second look and calculate the average temperature in both cities.

import xlrd
import os.path
import numpy as np
xlfile = "test.xlsx"
if os.path.exists(xlfile):
    xls = xlrd.open_workbook(xlfile)
    sheet1 = xls.sheet_by_index(0)
    nrows = sheet1.nrows-1
    ncols = sheet1.ncols 
    data = np.zeros(ncols*nrows).reshape((nrows, ncols))
    for r in range(1, nrows+1):
        for c in range(0, ncols):
            data[r-1,c] = sheet1.cell(r,c).value
    tokyo = data[:,1].mean()
    kyoto = data[:,2].mean()
    msg = "Tokyo(mean): %.2f\nKyoto(mean): %.2f" % (tokyo, kyoto)
    print(msg)

Create an array with 0 contents with np.zeros () and transform it into a two-dimensional array with reshape (). Since the first row of Excel is the header, subtract 1 from the value of sheet1.nrows. It's a numpy array, so if you want to find the mean, just do mean (). Looking at the results, it's still hotter in Kyoto.

Test

By the way, is it okay to decide that Kyoto is really hot just by comparing the average values? After all, it is better to test whether there is a statistically significant difference here.

You can easily test the difference between the mean values using scipy's stats.

import xlrd
import os.path
import numpy as np
from scipy import stats
xlfile = "test.xlsx"
if os.path.exists(xlfile):
    xls = xlrd.open_workbook(xlfile)
    sheet1 = xls.sheet_by_index(0)
    nrows = sheet1.nrows-1
    ncols = sheet1.ncols
    data = np.zeros(ncols*nrows).reshape((nrows, ncols))
    for r in range(1, nrows):
        for c in range(0, ncols):
            data[r-1,c] = sheet1.cell(r,c).value
    tokyo = data[:,1]
    kyoto = data[:,2]
    t,p = stats.ttest_ind(tokyo, kyoto, equal_var=False))
    msg = "p-value: %.5f" % p
    print(msg)

In addition, in the temperature of Tokyo and Kyoto, when testing the hypothesis that "Tokyo is lower", it is a one-sided test, but the option to specify that it is a one-sided test is in the stats.ttest_ind () method. It doesn't seem to exist. So, in this case, we will substitute it by showing that the t-value is negative and the p-value is 0.025 or less.

So, when I actually tried it, it became as follows.

t: -4.09874
p-value: 0.00014

Graph drawing

I read the data from the Excel sheet, so let's make a graph. In R, ggplot2 is used, but in Python in Anaconda environment, matplotlib can be used.

import xlrd
import os.path
import pandas as pd
import numpy as np
from scipy import stats
from matplotlib import pyplot as plt
%matplotlib inline 

xlfile = "test.xlsx"
if os.path.exists(xlfile):
    xls = xlrd.open_workbook(xlfile)
    sheet1 = xls.sheet_by_index(0)
    nrows = sheet1.nrows - 1
    ncols = sheet1.ncols
    data = np.zeros(ncols*nrows).reshape((nrows, ncols))
    date = []
    for r in range(1, nrows+1):
        for c in range(0, ncols):
            if c==0:
                d =  xlrd.xldate.xldate_as_datetime(sheet1.cell(r,c).value, xls.datemode)
                date.append(d)
            else:
                data[r-1,c] = sheet1.cell(r,c).value

    tokyo = data[:,1]
    kyoto = data[:,2]
    
    plt.plot(date, tokyo, label="Tokyo")
    plt.plot(date, kyoto, label="Kyoto")
    plt.legend()
    plt.show()

When drawing a line graph, I want to use the date read from Excel for the X axis. When the date 2016/07/01 is displayed on the Excel file, the data is "41090". Looking at this, I don't think it's 7/1, so I'll convert it. That is the following line.

d =  xlrd.xldate.xldate_as_datetime(sheet1.cell(r,c).value, xls.datemode)

With this, it is converted to datetime type, and "41090" becomes "2016-07-01 00:00:00". So, date.append (d) one by one in the list initialized by date = [].

By the way, when printing () the variable d, if you use d.strftime (“% m /% d”), you can print 7/1 only for the month / day. I can do it, but when I append the converted string to the list date, an error occurs at plt.plot (), so what should I do?

That's why the graph I could draw looks like this. After all the X axis is hard to read. I want to do something about it. Looking at this graph, we can see that Kyoto is hotter without testing (that is, data visualization is important).

This time code and data

https://github.com/fukuit/Python_SelfLearning/blob/master/read_excel.py
https://github.com/fukuit/Python_SelfLearning/blob/master/temperature.xlsx

Excel with Python

Is it Ikan in CSV?

Read data from Excel sheet

Test

Graph drawing

This time code and data