[Python] Mastering csv file reading: a list of the main options for pandas.read_csv

This is the advanced follow-up on reading csv files with pandas. `read_csv` can do more than you might expect, such as restricting which rows and columns are read.

For most purposes it is enough to know the options in the [List of main options for read_csv method](# 1-List of main options for read_csv method).

> ・ Click here for the basics of reading csv files with python (https://qiita.com/yuta-38/items/8f7a332651cd5a02e986)

・ The official page is here

**Table of contents**

1. [List of main options for read_csv method](# 1-List of main options for read_csv method)
2. [Data read by default](# 2-Data read by default)
3. [Blank rows / columns / cells of the original file](# 3-Blank matrix cells of the original file)
4. [Header](# 4-Header)
    - [Read a file without a header](#Read a file without a header)
    - [Specify the header line](#Specify the header line)
    - [Specify header name](#Specify header name to read)
    - [Specify a common prefix for the header name](#Specify a common prefix for the header name)
5. [Specify Heading Column (Index)](# 5-Specify Heading Column Index)
6. [Read Columns](# 6-Read Columns)
    - [Specify by column number](#Specify by column number)
    - [Specify by column name](#Specify by column name)
7. [Read Line](# 7-Read Line)
    - [Specify the number of lines to read from the beginning](#Specify the number of lines to read from the beginning)
    - [Specify the number of lines to be excluded from the beginning](#Specify the number of lines to be excluded from the beginning)
    - [Exclude specified line](#Exclude specified line)
    - [Specify the number of lines to be excluded from the end](#Specify the number of lines to be excluded from the end)
8. [Read by specifying type](# 8-Read by specifying type)
9. [Read files on the web](# 9-Read files on the web)
10. [Read compressed file](# 10-Read compressed file)
11. [Read by specifying the delimiter](# 11-Read by specifying the delimiter)

## 1. List of main options for read_csv method

| option | Example of use | Contents |
|---|---|---|
| sep | `sep=';'` | Delimiter character |
| delimiter | `delimiter=';'` | Delimiter character (same as sep) |
| header | `header=1` | Line to use as the header (default is `'infer'`; use `header=None` if there is no header. Note the uppercase "N") |
| names | (1) `names=['AA','BB','CC', ...]` (2) `names='1234567'` | Give column titles (combine with `header=0` to replace an existing header) |
| index_col | `index_col=0` | Column to use as the row headings (index) |
| usecols | `usecols=[1,2,5]` | Columns to read. Use list format even for a single column, e.g. `usecols=[0]`. Column titles can also be used |
| prefix | `prefix="line number", header=None` | Prefix for the column titles. With `prefix='line number'` the titles become line number0, line number1, and so on. Valid only when `header=None` is specified |
| dtype | `dtype=str` | Read with the specified type. Raises an error if the data cannot be converted (such as reading text as float) |
| skiprows | (1) `skiprows=5` (2) `skiprows=[1,3,6]` | Lines to skip at the beginning. An integer skips that many lines from the top; a list skips the listed line numbers (counted from 0) |
| skipfooter | `skipfooter=2, engine='python', encoding='utf_8'` | Number of lines to exclude from the bottom. Requires `engine='python'`. If the characters are garbled, specify the character code |
| nrows | `nrows=5` | Number of lines to read |
| encoding | `encoding='shift_jis'` | Character code used when reading the file |
| (compression) | `compression='zip'` | Open a compressed file. Currently the format is inferred even without this option. (Conversely, specifying `compression='gzip'` for a zip file raises an error) |
| (skipinitialspace) | `skipinitialspace=True` | Remove whitespace right after the delimiter. In my test it appeared to be removed even without this option, although the documented default is `False` |

## 2. Data read by default

### ■ Original file

Suppose the following csv file is read.

**▼ Columns**

- Column A is the index (row headings)
- Column F is empty
- Column G contains text and blank cells

**▼ Rows**

- The first row holds the column titles
- The 9th row is empty
- The 10th row contains a formula error (#NUM!)

### ■ Reading result

**▼ Points**

- **A heading column is added as the first column** (index numbers starting from 0)
- **The first row becomes the title row**
    - Blank title cells are **filled with "Unnamed: column number"**
- **Blank cells become NaN**
- The formula error is displayed as #NUM!

These additions remain when you write the result back to a file (NaN becomes blank).

### ■ Attributes of each column

Column attributes:

```
Unnamed: 0     object
Column 1       object
Column 2       float64
Column 3       object
Column 4       float64
Unnamed: 5     float64
Column 5       object
```

- Dates: object type
- Numbers: float64 type
    - Both integers and decimals (every column here contains a blank row, and a numeric column with NaN is read as float64)
    - NaN is ignored
- Column with a formula error: object type
- Empty column: float64 type
- Text: object type
    - A single text cell makes the whole column object type
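
The inference rules above can be tried without a file on disk. The following is a minimal sketch that assumes a small made-up csv held in a string (the column names and values are hypothetical and only imitate the layout of test.csv).

```python
import io
import pandas as pd

# Hypothetical inline data: two blank title cells, one empty column,
# one numeric column with a blank cell, one column with a formula error.
csv_text = (
    ",Column 1,Column 2,,Column 3\n"
    "a,2020/1/1,1,,x\n"
    "b,2020/1/2,2.5,,\n"
    "c,2020/1/3,,,#NUM!\n"
)

df = pd.read_csv(io.StringIO(csv_text))

# Blank title cells are named "Unnamed: <column number>"
print(list(df.columns))

# Numeric and empty columns are float64; the other columns hold text
print(df.dtypes)
```

Reading from `io.StringIO` behaves like reading from a path, so the same options can be tested quickly this way.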

### ■ Output reading result

When the result is output as a csv file in utf8:

- The headings automatically inserted in the 1st row and 1st column remain.
- NaN becomes a blank cell.

## 3. Blank rows / columns / cells of the original file

Blanks are treated as "NaN" (empty data). The following strings are also treated as NaN.

- `''` (empty string)
- `#N/A`
- `#N/A N/A`
- `#NA`
- `-1.#IND`
- `-1.#QNAN`
- `-NaN`
- `-nan`
- `1.#IND`
- `1.#QNAN`
- `<NA>`
- `N/A`
- `NA`
- `NULL`
- `NaN`
- `n/a`
- `nan`
- `null`
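
As a quick check of this list, here is a small sketch using made-up inline data. The `keep_default_na=False` option is not covered in this article; it is shown only as an assumption-level illustration of how to keep such strings as they are.

```python
import io
import pandas as pd

# Hypothetical data: several of the strings listed above plus one normal value
csv_text = "id,value\n1,NA\n2,null\n3,\n4,n/a\n5,ok\n"

# Default behaviour: the listed strings are converted to NaN
df = pd.read_csv(io.StringIO(csv_text))
print(df["value"].isna().tolist())  # [True, True, True, True, False]

# keep_default_na=False turns the default list off, so the text is kept as-is
raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(raw["value"].tolist())  # ['NA', 'null', '', 'n/a', 'ok']
```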

## 4. Header

The default when reading is `header='infer'`.

Basically, the top line is read as the header.

▼ Original file

▼ Reading result

```python
# read csv file
import pandas as pd

df = pd.read_csv('~/desktop/test.csv')
df
```

└ Reads and displays the test.csv file on the desktop.

### ① Read a file without a header

Use the option `header=None` to state that there is no header.

**▼ Original file** ("test2.csv" on the desktop)

**▼ Read the file** `pd.read_csv('~/desktop/test2.csv', header=None)`

```python
import pandas as pd
df = pd.read_csv('~/desktop/test2.csv', header=None)
df
```

The N in `None` is a capital letter (lowercase `none` raises an error).

**▼ If not specified** `df2 = pd.read_csv('~/desktop/test2.csv')`

![image.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/563526/fb3d7fa6-e1cc-8afc-cca6-d1a49ab21179.png)

### ② Specify the line that will be the header

* Lines above the specified line are not read.

**▼ When a line to be the header is specified**

Option: `header=integer`

```python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', header=6)
df
```

**▼ If not specified**

![image.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/563526/73ce0f60-b776-d8f6-9b71-7f38cd2f7874.png)

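
The two header settings can be compared side by side. This is a minimal sketch with hypothetical inline data instead of the desktop files.

```python
import io
import pandas as pd

# header=None: every line is data and the columns are numbered 0, 1, ...
no_header = pd.read_csv(io.StringIO("1,2\n3,4\n"), header=None)
print(list(no_header.columns))  # [0, 1]

# header=1: line 1 (counted from 0) becomes the header,
# and the line above it is not read
csv_text = "memo,line\nx,y\n1,2\n3,4\n"
with_header = pd.read_csv(io.StringIO(csv_text), header=1)
print(list(with_header.columns))  # ['x', 'y']
print(with_header["x"].tolist())  # [1, 3]
```
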
### ③ Read by specifying the header name

Pass `names=` as an option. There are two ways to write it.

(1) Continuous character string (2) List format

**▼ Points**

- If the file already has a header, overwrite it by combining with `header=0`.
- When fewer names are given than the number of columns read: the leading columns that receive no name have a blank title
- When more names are given: the surplus columns at the end are filled with NaN
- Different columns cannot be given the same name (error)

The string form reflects the pandas version used for this article; newer versions may require a list-like value, so the list format is the safer choice.

### ■ Example (results tried on data with 7 columns)

#### (1) Specify as a continuous character string

**▼ Example 1: `names='12345'`**

```python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', names='12345')
df
```

The first two columns, which receive no name, have blank titles.

**▼ Example 2: `names='abcdefghi'`**

```python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', names='abcdefghi')
df
```

The surplus column titles become empty (NaN) columns.

**▼ Example 3: `names='aaabbbccc'` Error if duplicated**

```python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', names='aaabbbccc')
df

# output
# ValueError: Duplicate names are not allowed.
```

#### (2) Specify in list format

**▼ Example 1: `names=['aaa','bbb','ccc','ddd','eee','fff']`**

```python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', names=['aaa','bbb','ccc','ddd','eee','fff'])
df
```

**▼ Example 2: `names=['aaa','bbb','aaa','ddd']` Duplicate is an error**

```python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', names=['aaa','bbb','aaa','ddd'])
df
```
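
The role of `header=0` when overwriting an existing header is easy to miss, so here is a small sketch with hypothetical inline data that shows both cases.

```python
import io
import pandas as pd

csv_text = "old1,old2,old3\n1,2,3\n4,5,6\n"
new_names = ["aaa", "bbb", "ccc"]

# header=0 marks the first line as a header, which names then replaces
renamed = pd.read_csv(io.StringIO(csv_text), header=0, names=new_names)
print(list(renamed.columns))  # ['aaa', 'bbb', 'ccc']
print(len(renamed))           # 2

# Without header=0 the old header line stays as the first data row
kept = pd.read_csv(io.StringIO(csv_text), names=new_names)
print(kept.iloc[0].tolist())  # ['old1', 'old2', 'old3']
```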

### Specify a common prefix for header names

`prefix='string', header=None`

└ Valid only when `header=None` (ignored otherwise)
└ The column number is appended to the specified string.

```python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', prefix="XXX", header=None)
df
```
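
In newer pandas releases the `prefix` option is no longer available (it was deprecated in 1.4 and removed in 2.0). One possible substitute, shown here as a sketch with hypothetical inline data, is to read with `header=None` and then call `DataFrame.add_prefix`.

```python
import io
import pandas as pd

csv_text = "1,2,3\n4,5,6\n"

# Read without a header, then prepend "XXX" to the default labels 0, 1, 2
df = pd.read_csv(io.StringIO(csv_text), header=None).add_prefix("XXX")
print(list(df.columns))  # ['XXX0', 'XXX1', 'XXX2']
```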

## 5. Specify the heading column (index)

Pass `index_col=integer` as an option. By default, a column of index numbers is added automatically.

```python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', index_col=0)
df
```

**▼ Default (not specified)**

![image.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/563526/c24e0186-6c71-e4b8-9b26-81fbeaa279f7.png)

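
The difference between the default index and `index_col=0` can be sketched with hypothetical inline data as follows.

```python
import io
import pandas as pd

csv_text = "id,name,score\na1,Ann,90\nb2,Bob,85\n"

# Default: a running index 0, 1, ... is added
default_df = pd.read_csv(io.StringIO(csv_text))
print(list(default_df.index))  # [0, 1]

# index_col=0: the first column becomes the row headings
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(indexed_df.index))    # ['a1', 'b2']
print(list(indexed_df.columns))  # ['name', 'score']
```
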
## 6. Read columns

Columns can be specified by column number or by column name.

① Specify by column number ② Specify by column name

▼ The following original file is used

![image.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/563526/3046793e-146a-441a-8c88-938a8f3d9f14.png)

```python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv')
df
```

### ① Specify by column number

Pass `usecols=[]` as an option.

└ List type
└ Use [] even when only one column is specified

```python
# Specify multiple columns
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', usecols=[0,3,6])
df
```

**▼ For one column (example: 0th column only)** `usecols=[0]`

```python
# Specify only one column (example: 0th column only)
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', usecols=[0])
df
```

**▼ Error if not list type**

```python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', usecols=0)
df

# output
# ValueError: 'usecols' must either be list-like of all strings, all unicode, all integers or a callable.
```

### ② Specify by column name

It is also possible to extract only the specified column name.

▼ Example: `usecols=['Column 1','Column 4']`
└ Specify Column 1 and Column 4.

```python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', usecols=['Column 1','Column 4'])
df
```

▼ It is also possible to give column names when reading and extract by those names.

Example:

- `header=0`
- `names='ABCDEFG'`
- `usecols=['A','C']`

```python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', header=0, names='ABCDEFG', usecols=['A','C'])
df
```
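
The error message shown earlier also mentions a callable. The sketch below, which uses hypothetical inline data, puts the three forms of `usecols` next to each other.

```python
import io
import pandas as pd

csv_text = "a,b,c,d\n1,2,3,4\n5,6,7,8\n"

# By column number
by_number = pd.read_csv(io.StringIO(csv_text), usecols=[0, 2])
print(list(by_number.columns))  # ['a', 'c']

# By column name
by_name = pd.read_csv(io.StringIO(csv_text), usecols=["b", "d"])
print(list(by_name.columns))  # ['b', 'd']

# By callable: it receives each column name and returns True to keep it
by_callable = pd.read_csv(io.StringIO(csv_text), usecols=lambda name: name != "a")
print(list(by_callable.columns))  # ['b', 'c', 'd']
```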

## 7. Read line

- ① Specify the number of lines to read from the beginning
- ② Specify the number of lines to be excluded from the beginning
- ③ Exclude the specified line
- ④ Specify the number of lines to exclude from the end

### ① Specify the number of lines to read from the beginning

Pass `nrows=integer` as an option. This is useful for checking the contents when the number of lines is huge.

▼ Example: `nrows=3` Reads up to the third line from the top.

```python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', nrows=3)
df
```
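
As a rough illustration of previewing a large file, the following sketch builds a hypothetical 1000-line csv in memory and reads only its first three data lines.

```python
import io
import pandas as pd

# Hypothetical large file: one header line and 1000 data lines
csv_text = "n\n" + "\n".join(str(i) for i in range(1000)) + "\n"

head = pd.read_csv(io.StringIO(csv_text), nrows=3)
print(head["n"].tolist())  # [0, 1, 2]
```

The header line is not counted in `nrows`; only data lines are.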

### ② Specify the number of lines to be excluded from the beginning

Pass `skiprows=integer` as an option.

▼ Example: `skiprows=6` Skips the first 6 lines from the top. If no header is specified, the next line (the 7th) becomes the header.

`skiprows=0` skips nothing.

```python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', skiprows=6)
df
```

### ③ Exclude the specified line

Pass `skiprows=[integer]` as an option.

▼ Example: `skiprows=[2,3,6,7,8]` Skips lines 2, 3, 6, 7 and 8, counting from 0 at the top.

Use [] even to skip only one line
└ `skiprows=[6]`: skips line 6

```python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', skiprows=[2,3,6,7,8])
df
```
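
The integer and list forms of `skiprows` behave differently, which the following sketch with hypothetical inline data tries to make concrete.

```python
import io
import pandas as pd

# Integer: skip that many lines from the top; the next line becomes the header
csv_with_junk = "junk0\njunk1\nx,y\n1,2\n3,4\n5,6\n"
after_int = pd.read_csv(io.StringIO(csv_with_junk), skiprows=2)
print(list(after_int.columns))  # ['x', 'y']
print(after_int["x"].tolist())  # [1, 3, 5]

# List: skip only the listed line numbers (line 0 is the header here)
csv_plain = "x,y\n1,2\n3,4\n5,6\n"
after_list = pd.read_csv(io.StringIO(csv_plain), skiprows=[1, 3])
print(after_list["x"].tolist())  # [3]
```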

### ④ Specify the number of lines to exclude from the end

Pass `skipfooter=integer, engine='python'` as options.

If the characters are garbled, specify the character code. Example: `encoding='utf_8'`

▼ Example: `skipfooter=6, engine='python', encoding='utf_8'` Skips the last 6 lines.

```python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', skipfooter=6, engine='python', encoding='utf_8')
df
```

▼ When no character code is specified: `skipfooter=6, engine='python'`

```python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', skipfooter=6, engine='python')
df
```

Japanese characters are garbled.

▼ When the python engine is not specified: `skipfooter=6`

```python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', skipfooter=6)
df

# output
# <ipython-input-81-77b6fdc5c66e>:2: ParserWarning: Falling back to the 'python' engine
# because the 'c' engine does not support skipfooter;
# you can avoid this warning by specifying engine='python'.
```

A warning is displayed, telling you to write `engine='python'`. The file is still read, using the python engine as a fallback.
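
A typical use of `skipfooter` is dropping summary lines at the end of an exported file. The sketch below assumes a hypothetical file whose last two lines are not data.

```python
import io
import pandas as pd

# Hypothetical export: two data lines followed by two footer lines
csv_text = "x,y\n1,2\n3,4\ntotal,6\ngenerated by export tool\n"

# engine='python' is given explicitly to avoid the ParserWarning
df = pd.read_csv(io.StringIO(csv_text), skipfooter=2, engine="python")
print(df["x"].tolist())  # [1, 3]
```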

## 8. Specify the type and read

Pass `dtype=type` as an option. If the data cannot be converted, an error occurs.

The `dtypes` attribute shows the types of the table that was read. Note the **plural vs. singular** difference: the option is `dtype`, the attribute is `dtypes`.

▼ Convert to character strings with `dtype=str` and check the types with `.dtypes`.

```python
# Convert to string
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', dtype=str)

df.dtypes

# output
# Unnamed: 0    object
# Column 1      object
# Column 2      object
# Column 3      object
# Column 4      object
# Unnamed: 5    object
# Column 5      object
# dtype: object
```

▼ Default

```python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv')

df.dtypes

# output
# Unnamed: 0     object
# Column 1       object
# Column 2       float64
# Column 3       object
# Column 4       float64
# Unnamed: 5     float64
# Column 5       object
# dtype: object
```

▼ Convert character strings to float (error)

```python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', dtype=float)
df.dtypes

# output
# ValueError: could not convert string to float
```
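
Besides a single type for the whole table, `dtype` also accepts a dictionary that maps column names to types. This sketch uses hypothetical inline data to show why that matters for codes with leading zeros.

```python
import io
import pandas as pd

csv_text = "code,amount\n001,10\n002,20\n"

# Default: the codes are read as integers and lose their leading zeros
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df["code"].tolist())  # [1, 2]

# Per-column types: keep the code as text, read the amount as float
typed_df = pd.read_csv(io.StringIO(csv_text), dtype={"code": str, "amount": float})
print(typed_df["code"].tolist())  # ['001', '002']
print(typed_df["amount"].dtype)   # float64
```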

## 9. Read files on the WEB

Files on the WEB can also be read.

`pd.read_csv('URL', encoding='character code')`

If the characters are garbled, or if you get an error saying that the character code is different, specify `encoding='character code'`.

**▼ Read the government's statistical data on population by prefecture and gender**
・ Reference page: e-Stat

```python
import pandas as pd

dfurl = pd.read_csv('https://www.e-stat.go.jp/stat-search/file-download?statInfId=000031524010&fileKind=1', encoding='shift_jis')
dfurl
```

### ▼ If no character code is specified (an error will occur)

```python
import pandas as pd

dfurl = pd.read_csv('https://www.e-stat.go.jp/stat-search/file-download?statInfId=000031524010&fileKind=1')
dfurl

# output
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 0: invalid start byte
```
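
The same encoding behaviour can be reproduced offline. The following sketch assumes a few made-up Shift_JIS bytes held in memory in place of the downloaded file; the column names and values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical in-memory bytes standing in for a downloaded Shift_JIS file
raw_bytes = "都道府県,人口\n東京都,100\n".encode("shift_jis")

# With the matching character code the text is decoded correctly
df = pd.read_csv(io.BytesIO(raw_bytes), encoding="shift_jis")
print(list(df.columns))

# Without it, pandas tries UTF-8 and the decoding fails
decode_failed = False
try:
    pd.read_csv(io.BytesIO(raw_bytes))
except UnicodeDecodeError as err:
    decode_failed = True
    print("decode failed:", err.reason)
```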

## 10. Read the compressed file

Compressed files such as zip can be read without specifying anything. Readable compression formats: 'gzip', 'bz2', 'zip', 'xz'

This works because the compression format is inferred.
└ Default: `compression='infer'`

- An archive that contains multiple files cannot be read.
- An archive with a password (PW) set cannot be read.

**▼ Read the zip file test.zip on the desktop**

```python
import pandas as pd
df = pd.read_csv('~/desktop/test.zip')
df
```

■ The above is the same as `compression='zip'`.

```python
import pandas as pd
df = pd.read_csv('~/desktop/test.zip', compression='zip')
df
```

▼ An error will occur if the compression format is specified incorrectly.

```python
import pandas as pd
df = pd.read_csv('~/desktop/test.zip', compression='gzip')
df

# output
# BadGzipFile: Not a gzipped file (b'PK')
```

▼ Error if two or more files are compressed

```python
import pandas as pd
df = pd.read_csv('~/desktop/2files.zip')
df

# output
# ValueError: Multiple files found in compressed zip file ['test.csv', 'space.csv']
```

▼ Error if a password (PW) is set

```python
import pandas as pd
df = pd.read_csv('~/desktop/test.zip')
df

# output
# RuntimeError: File 'test.csv' is encrypted, password required for extraction
```
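
To try the inference without preparing an archive by hand, the sketch below creates a hypothetical single-file zip in a temporary directory with the standard `zipfile` module and reads it both ways.

```python
import os
import tempfile
import zipfile

import pandas as pd

csv_text = "x,y\n1,2\n3,4\n"

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical archive containing exactly one csv file
    zip_path = os.path.join(tmp, "test.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("test.csv", csv_text)

    # compression='infer' (the default) picks zip from the .zip extension
    inferred = pd.read_csv(zip_path)
    # Stating the format explicitly gives the same table
    explicit = pd.read_csv(zip_path, compression="zip")

print(inferred.equals(explicit))  # True
```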

## 11. Read by specifying the delimiter

`sep='delimiter'`
└ The same applies to `delimiter='delimiter'`.

**▼ Example: File to read**

There are multiple pieces of data in one cell.
└ Data separated by "@"
└ Data separated by ";"

- Two delimiters cannot be specified as a list
- The same option cannot be repeated
- `delimiter` and `sep` should not be used together.
  └ Depending on the pandas version, `delimiter` takes priority or an error is raised.

**▼ Default loading**

```python
import pandas as pd

df = pd.read_csv('~/desktop/test2.csv')
df
```

**▼ `sep='@'`** Separated by "@"

```python
# Separated by "@" (sep)
import pandas as pd

df = pd.read_csv('~/desktop/test2.csv', sep='@')
df
```

**▼ `delimiter='@'`** Separated by "@"

```python
# Separated by "@" (delimiter)
import pandas as pd

df = pd.read_csv('~/desktop/test2.csv', delimiter='@')
df
```

**▼ `sep=';'`** Separated by ";"

```python
# Separated by ";" (sep)
import pandas as pd

df = pd.read_csv('~/desktop/test2.csv', sep=';')
df
```

**▼ Options cannot be repeated.**

```python
import pandas as pd

df = pd.read_csv('~/desktop/test2.csv', sep=';', sep='@')
df

# output
# SyntaxError: keyword argument repeated
```

**▼ Two delimiters cannot be specified as a list**

```python
import pandas as pd

df = pd.read_csv('~/desktop/test2.csv', sep=[';','@'])
df

# output
# TypeError: unhashable type: 'list'
```

**▼ `delimiter` and `sep` should not be used together.**
└ Depending on the pandas version, `delimiter` takes priority or an error is raised.
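
Although a list is rejected, a separator longer than one character is treated by pandas as a regular expression, so a character class may serve when a file mixes two delimiters. This is a sketch with hypothetical inline data, not a technique covered in the text above.

```python
import io
import pandas as pd

# Hypothetical data that mixes "@" and ";" as delimiters
csv_text = "a@b;c\n1@2;3\n4@5;6\n"

# "[;@]" is a regular expression matching either character;
# regex separators need the python engine
df = pd.read_csv(io.StringIO(csv_text), sep="[;@]", engine="python")
print(list(df.columns))  # ['a', 'b', 'c']
print(df["c"].tolist())  # [3, 6]
```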
