This article uses pandas 0.18.1.
If you do not specify dtype, pandas infers the type of each column on its own. For example, suppose you have the following tab-delimited data:
data_1.txt
id x01 x02 x03 x04 x05 x06 x07 x08 x09 x10
0001 0.54 0.54 0.85 0.79 0.54 0.36 0.28 0.52 0.21 0.49
0002 0.72 0.68 0.77 0.69 0.07 na 0.29 0.42 0.32 0.51
0003 0.68 0.99 0.19 0.16 0.31 0.76 0.57 0.08 0.07 0.98
0004 0.98 na 0.49 0.47 0.09 0.52 0.42 0.35 0.83 0.64
0005 0.37 0.35 0.99 0.88 0.81 0.46 0.57 0.47 0.06 0.55
# coding: UTF-8
import pandas as pd
df = pd.read_csv('data_1.txt', header = 0, sep = '\t', na_values = 'na')
print(df)
id x01 x02 x03 x04 x05 x06 x07 x08 x09 x10
0 1 0.54 0.54 0.85 0.79 0.54 0.36 0.28 0.52 0.21 0.49
1 2 0.72 0.68 0.77 0.69 0.07 NaN 0.29 0.42 0.32 0.51
2 3 0.68 0.99 0.19 0.16 0.31 0.76 0.57 0.08 0.07 0.98
3 4 0.98 NaN 0.49 0.47 0.09 0.52 0.42 0.35 0.83 0.64
4 5 0.37 0.35 0.99 0.88 0.81 0.46 0.57 0.47 0.06 0.55
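If you do not specify the type, the result is as above and the leading zeros of id are dropped. Checking the data types with df.dtypes shows that id was read as an integer (int64).
To keep id exactly as it appears in the file, specify dtype as follows: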
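As a quick check, the sketch below reproduces this behaviour with two rows of the sample data. It reads from an in-memory string through io.StringIO only so that it runs without data_1.txt; that substitution, and the names sample, id_dtype and id_values, are assumptions made for illustration.

```python
import io
import pandas as pd

# Two rows of the sample data, held in memory instead of data_1.txt
sample = "id\tx01\tx02\n0001\t0.54\t0.54\n0002\t0.72\tna\n"

# No dtype given, so pandas infers the type of every column
df = pd.read_csv(io.StringIO(sample), header=0, sep="\t", na_values="na")

print(df.dtypes)  # one inferred dtype per column

id_dtype = str(df["id"].dtype)   # inferred type of the id column
id_values = df["id"].tolist()    # the zero padding is gone at this point
print(id_dtype, id_values)
```

The id column comes back as plain integers, so the original zero-padded strings cannot be recovered from df alone.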
df = pd.read_csv('data_1.txt', header = 0, sep = '\t', na_values = 'na',
dtype = {'id':'object', 'x01':'float', 'x02':'float','x03':'float','x04':'float','x05':'float','x06':'float',
'x07':'float','x08':'float','x09':'float','x10':'float'})
print(df)
id x01 x02 x03 x04 x05 x06 x07 x08 x09 x10
0 0001 0.54 0.54 0.85 0.79 0.54 0.36 0.28 0.52 0.21 0.49
1 0002 0.72 0.68 0.77 0.69 0.07 NaN 0.29 0.42 0.32 0.51
2 0003 0.68 0.99 0.19 0.16 0.31 0.76 0.57 0.08 0.07 0.98
3 0004 0.98 NaN 0.49 0.47 0.09 0.52 0.42 0.35 0.83 0.64
4 0005 0.37 0.35 0.99 0.88 0.81 0.46 0.57 0.47 0.06 0.55
In this way, specifying dtype preserves the original form of the data. This corresponds to colClasses in R. I also have the impression that data is read faster when dtype is specified, although I have not measured it.
You can also read everything as object first, and then convert only the columns you need afterwards.
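When there are many columns, writing out the dtype dictionary by hand gets tedious. One possible approach, shown here only as a sketch, is to read the header row first and build the dictionary from the column names. The in-memory sample and the name dtype_map are assumptions for illustration, and I have only tried the nrows=0 trick on recent pandas versions.

```python
import io
import pandas as pd

# Two rows of the sample data, held in memory instead of data_1.txt
sample = "id\tx01\tx02\n0001\t0.54\t0.54\n0002\t0.72\tna\n"

# Read only the header row to learn the column names
columns = pd.read_csv(io.StringIO(sample), sep="\t", nrows=0).columns

# id stays a string; every other column is read as float
dtype_map = {col: (str if col == "id" else float) for col in columns}

df = pd.read_csv(io.StringIO(sample), header=0, sep="\t",
                 na_values="na", dtype=dtype_map)
print(df)
```

With a real file, the two io.StringIO(sample) arguments would simply be replaced by the file name.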
# First, read everything as object
df = pd.read_csv('data_1.txt', header = 0, sep = '\t', na_values = 'na', dtype = 'object')
var_lst = ['x01','x02','x03','x04','x05','x06','x07','x08','x09','x10']
df[var_lst] = df[var_lst].astype(float) #Change data type to float
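If a column may still contain strings that cannot be converted, astype(float) raises an error. A possible variation, not part of the method above, is pd.to_numeric with errors='coerce', which turns such values into NaN. The sketch below deliberately omits na_values so that the string 'na' survives the read; the in-memory sample is again an assumption for illustration.

```python
import io
import pandas as pd

# Two rows of the sample data, held in memory instead of data_1.txt
sample = "id\tx01\tx02\n0001\t0.54\t0.54\n0002\t0.72\tna\n"

# Read everything as object; 'na' stays a plain string here
df = pd.read_csv(io.StringIO(sample), header=0, sep="\t", dtype="object")

var_lst = ["x01", "x02"]

# Convert column by column; values that are not numbers become NaN
df[var_lst] = df[var_lst].apply(pd.to_numeric, errors="coerce")
print(df.dtypes)
```

The id column is left untouched, so it keeps its zero-padded strings.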