Advanced version of reading csv files with pandas Note that there are many things you can do more than you think, such as specifying the rows and columns to read.
Basically it's OK if you hold down this much [List of main options for read_csv method](# 1-List of main options for read_csv method)
・ The official page is here
option | Example of use | Contents |
---|---|---|
sep | sep=';' | Separate |
delimiter | delimiter=';' | Split(Same as sep) |
header | header=1 | Specify header line (default is guess, if not header=None * "N" is uppercase) |
names | ①names=['AA','BB','CC',,] ②names='1234567' | Give column title (header if there is header)=In combination with "0") |
index_col | index_col=0 | Line heading(index)Specify the column to be |
usecols | usecols=[1,2,5] | Specify the line to read. Specify only one line in list format "usecols=[0].. Can also be specified in the column title " |
prefix | prefix="line number", header=None | Specify the prefix of the line title. Example "prefix='line number'」ならline number0、line number1、、、となる。 hedar=Valid only when None is specified. |
dtype | dtype=str | Read by specifying the type. If not applicable, an error (such as reading str with float) |
skiprows | ①skiprows=5 ②skiprows=[1,3,6] | Specify the line number not to be read at the beginning. For integers, from 0 to the specified integer. |
skipfooter | skipfooter=2, engine='python', encoding='utf_8' | Specify the number of lines to exclude from the bottom. Need to describe what to use in python. If the characters are garbled, specify the character code. |
nrows | nrows=5 | Specify the number of lines to read. |
encoding | encoding='shift_jis' | Character code specification when reading a file |
(compression) | compression='zip' | Open the compressed file. At present, it opens by analogy without description. (Conversely, compression in the zip file='gzip'Error if you specify |
(skipinitialspace) | skipinitialspace=True | delimiter(Character delimiter)Later, remove the leading whitespace. Currently, it seems to be a specification that is deleted by default |
When the following csv file is read
** ▼ Column ** ・ Column A is index (heading) ・ Column F is empty -Column G is a character and a blank cell
** ▼ line ** ・ The first row is the title of the column ・ The 9th line is empty ・ There is a formula error (#NUM!) On the 10th line.
** ▼ Point ** ・ ** Heading column added to the first column ** (index number from 0) ・ ** Title line added to the first line ** --Blank cells are ** filled with "Unnamed: column number" ** -** Blank cells become NaN **. -Formula error is displayed as #NUM !.
Column attributes
Unnamed: 0 object
Column 1 object
Column 2 float64
Column 3 object
Column 4 float64
Unnamed: 5 float64
Column 5 object
-Date: object type ・ Numerical value: float64 type └ Both integers and decimals └ NaN is ignored -Column with function error: object type -Empty column: float64 type -Text: object type └ If there is one text cell, it will be an object type
When output as a csv file with utf8.
-Headings automatically inserted in the 1st row and 1st column remain. ・ NaN becomes a blank line
Blanks are treated as "NaN" (empty data). The following are also treated as NaN.
The default when reading is "analog".
▼ Original file
▼ Reading result
read csv file
import pandas as pd
df = pd.read_csv('~/desktop/test.csv')
df
└ Read and display the test.csv file on the desktop.
** ▼ Original file ** ("Desktop test2.csv")
** ▼ Read file **
pd.read_csv('~/desktop/test2.csv' ,header=None)
python
import pandas as pd
df = pd.read_csv('~/desktop/test2.csv' ,header=None)
df
** ▼ When a line to be the header is specified **
Optional header = integer
import pandas as pd
df = pd.read_csv('~/desktop/test.csv' ,header=6)
df
(1) Consecutive character strings (2) list format
** ▼ Point **
-If you already have a header, overwrite it with header = 0
.
-When the number of specified characters is less than the number of columns to be read: The column title of the other party is blank
-If the number of specified characters is larger: The title of the last column is NaN
-Different columns cannot be given the same name (error)
** ▼ Example 1: When names = '123345'
**
import pandas as pd
df = pd.read_csv('~/desktop/test.csv' ,names='12345')
df
The first two missing columns are blank.
import pandas as pd
df = pd.read_csv('~/desktop/test.csv' ,names='abcdefghi')
df
Many column titles are empty (NaN) columns.
python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv' ,names='aaabbbccc')
df
#output
# ValueError: Duplicate names are not allowed.
** ▼ Example 1: When names = ['aaa','bbb','ccc','ddd','eee','fff']
**
python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv' ,names=['aaa','bbb','ccc','ddd','eee','fff'])
df
python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv' ,names=['aaa','bbb','aaa','ddd'])
df
python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', prefix="XXX", header=None)
df
① Specify by column number ② Specify by column name
import pandas as pd
df = pd.read_csv('~/desktop/test.csv')
df
Specify multiple columns
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', usecols=[0,3,6])
df
** ▼ For 1 column (example 0th column only) **
usecols=[0]
Specify only one column (Example: 0th column only)
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', usecols=[0])
df
** ▼ Error if not list type **
error
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', usecols=0)
df
#output
# ValueError: 'usecols' must either be list-like of all strings, all unicode, all integers or a callable.
It is also possible to extract only the specified column name.
▼ Example: ʻusecols = ['column 1','column 4'] ` └ Specify column 1 and column 4.
python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', usecols=['Column 1','Column 4'])
df
Example:
・ Header = 0
・ Names ='ABCDEFG'
・ ʻUscols = ['A','C']`
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', header=0, names='ABCDEFG' ,usecols=['A','C'])
df
① Specify the number of lines to read from the beginning (2) Specify the number of lines to be excluded from the beginning ③ Exclude the specified line ④ Specify the number of lines to exclude from the end
python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', nrows=3)
df
▼ Example: skiprows = 6
Skip to the 6th line from the top.
If no header is specified, the 6th line will be the header.
python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', skiprows=6)
df
▼ Example: skiprows = [2,3,6,7,8]
Skip the 2nd, 3rd, 5th, 7th and 8th lines from the top.
python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', skiprows=[2,3,6,7,8])
df
▼ Example: skipfooter = 6, engine ='python', encoding ='utf_8'
Skip the 6th line from the bottom.
python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', skipfooter=6, engine='python', encoding='utf_8')
df
python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', skipfooter=6, engine='python')
df
Japanese characters are garbled.
python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', skipfooter=6)
df
#output
# <ipython-input-81-77b6fdc5c66e>:2: ParserWarning: Falling back to the 'python' engine
#because the 'c' engine does not support skipfooter;
#you can avoid this warning by specifying engine='python'.
An error is displayed. Instructions to write "engine ='python'".
▼ Convert to a character string with dtype = str
and check the type with .dtypes
(dtypes method).
Convert to string
import pandas as pd
df = pd.read_csv('~/desktop/test.csv', dtype=str)
df.dtypes
#output
Unnamed: 0 object
Column 1 object
Column 2 object
Column 3 object
Column 4 object
Unnamed: 5 object
Column 5 object
dtype: object
python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv')
df.dtypes
#output
Unnamed: 0 object
Column 1 object
Column 2 float64
Column 3 object
Column 4 float64
Unnamed: 5 float64
Column 5 object
dtype: object
python
import pandas as pd
df = pd.read_csv('~/desktop/test.csv' ,dtype=float)
df.dtypes
#output
# ValueError: could not convert string to float
pd.read_csv ('URL', encoding ='character code')
** ▼ Read the statistical data of the government's population by prefecture and gender ** ・ Reference page: e-Start
python
import pandas as pd
dfurl = pd.read_csv('https://www.e-stat.go.jp/stat-search/file-download?statInfId=000031524010&fileKind=1', encoding='shift_jis')
dfurl
error
import pandas as pd
dfurl = pd.read_csv('https://www.e-stat.go.jp/stat-search/file-download?statInfId=000031524010&fileKind=1')
dfurl
#output
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 0: invalid start byte
Because it reads the compressed format by analogy.
└ Default: compression = infer
python
import pandas as pd
df = pd.read_csv('~/desktop/test.zip')
df
■ The above is the same as compression ='zip'
.
import pandas as pd
df = pd.read_csv('~/desktop/test.zip', compression='zip')
df
error
import pandas as pd
df = pd.read_csv('~/desktop/test.zip', compression='gzip')
df
#output
# BadGzipFile: Not a gzipped file (b'PK')
error
import pandas as pd
df = pd.read_csv('~/desktop/2files.zip')
df
#output
# ValueError: Multiple files found in compressed zip file ['test.csv', 'space.csv']
error
import pandas as pd
df = pd.read_csv('~/desktop/2files.zip')
df
#output
# ValueError: Multiple files found in compressed zip file ['test.csv', 'space.csv']
error
import pandas as pd
df = pd.read_csv('~/desktop/test.zip')
df
#output
# RuntimeError: File 'test.csv' is encrypted, password required for extraction
** ▼ Example: File to read ** There are multiple data in one cell. └ Data separated by "@" └ Data separated by ";"
python
import pandas as pd
df = pd.read_csv('~/desktop/test2.csv')
df
「@Separated by "(sep)
import pandas as pd
df = pd.read_csv('~/desktop/test2.csv', sep='@')
df
** ▼ delimita ='@'
** Separated by "@"
「@Delimiter
import pandas as pd
df = pd.read_csv('~/desktop/test2.csv', delimiter='@')
df
「;Separated by "(sep)
import pandas as pd
df = pd.read_csv('~/desktop/test2.csv', sep=';')
df
error
import pandas as pd
df = pd.read_csv('~/desktop/test2.csv', sep=';', sep='@')
df
#output
# SyntaxError: keyword argument repeated
error
import pandas as pd
df = pd.read_csv('~/desktop/test2.csv', sep=[';','@'])
df
#output
# TypeError: unhashable type: 'list'
Recommended Posts