Be careful when reading data with pandas (specify dtype)

When reading data with pandas, it is safer to specify dtype

This article uses pandas 0.18.1.

If you do not specify dtype, pandas infers the type of each column on its own. For example, suppose you have the following tab-delimited data:

data_1.txt

id	x01	x02	x03	x04	x05	x06	x07	x08	x09	x10
0001	0.54	0.54	0.85	0.79	0.54	0.36	0.28	0.52	0.21	0.49
0002	0.72	0.68	0.77	0.69	0.07	na	0.29	0.42	0.32	0.51
0003	0.68	0.99	0.19	0.16	0.31	0.76	0.57	0.08	0.07	0.98
0004	0.98	na	0.49	0.47	0.09	0.52	0.42	0.35	0.83	0.64
0005	0.37	0.35	0.99	0.88	0.81	0.46	0.57	0.47	0.06	0.55

# coding: UTF-8

import pandas as pd
df = pd.read_csv('data_1.txt', header = 0, sep = '\t', na_values = 'na')
print(df)

   id   x01   x02   x03   x04   x05   x06   x07   x08   x09   x10
0   1  0.54  0.54  0.85  0.79  0.54  0.36  0.28  0.52  0.21  0.49
1   2  0.72  0.68  0.77  0.69  0.07   NaN  0.29  0.42  0.32  0.51
2   3  0.68  0.99  0.19  0.16  0.31  0.76  0.57  0.08  0.07  0.98
3   4  0.98   NaN  0.49  0.47  0.09  0.52  0.42  0.35  0.83  0.64
4   5  0.37  0.35  0.99  0.88  0.81  0.46  0.57  0.47  0.06  0.55

If you do not specify the type, the result is as above: the leading zeros of id are dropped (0001 becomes 1). Checking the data type of id with df.dtypes shows that it was read as an integer (int64).
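The check with df.dtypes can be reproduced without the file. The following is a minimal sketch, assuming a shortened two-row, two-column stand-in for data_1.txt that is fed in through io.StringIO (the `sample` string is made up for illustration and is not the article's full data).

```python
import io
import pandas as pd

# Shortened stand-in for data_1.txt (tab-delimited); an assumption for illustration
sample = "id\tx01\tx02\n0001\t0.54\t0.54\n0002\t0.72\tna\n"

# No dtype given, so pandas infers the type of every column
df = pd.read_csv(io.StringIO(sample), header=0, sep="\t", na_values="na")

print(df.dtypes)          # id is inferred as an integer column
print(df["id"].tolist())  # the leading zeros are gone
```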

In such a case, specify dtype for each column as follows:

df = pd.read_csv('data_1.txt', header = 0, sep = '\t', na_values = 'na',
                 dtype = {'id':'object', 'x01':'float', 'x02':'float','x03':'float','x04':'float','x05':'float','x06':'float',
                          'x07':'float','x08':'float','x09':'float','x10':'float'})

print(df)

     id   x01   x02   x03   x04   x05   x06   x07   x08   x09   x10
0  0001  0.54  0.54  0.85  0.79  0.54  0.36  0.28  0.52  0.21  0.49
1  0002  0.72  0.68  0.77  0.69  0.07   NaN  0.29  0.42  0.32  0.51
2  0003  0.68  0.99  0.19  0.16  0.31  0.76  0.57  0.08  0.07  0.98
3  0004  0.98   NaN  0.49  0.47  0.09  0.52  0.42  0.35  0.83  0.64
4  0005  0.37  0.35  0.99  0.88  0.81  0.46  0.57  0.47  0.06  0.55

In this way, you can keep the values in their original form by specifying dtype. This corresponds to colClasses in R. My impression is also that the data is read faster when dtype is specified, although I have not measured it.
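When there are many columns, writing the dtype dictionary by hand gets long. One possible shortcut is to build the mapping programmatically. This is only a sketch: the names `cols` and `dtype_map` are introduced here for illustration, and the inline `sample` holds just the first two rows of data_1.txt so the snippet runs without the file.

```python
import io
import pandas as pd

# Column names 'x01' ... 'x10', generated instead of typed out
cols = ["x%02d" % i for i in range(1, 11)]

# Same mapping as the hand-written dictionary: floats for x01-x10, object for id
dtype_map = {c: "float" for c in cols}
dtype_map["id"] = "object"

# First two rows of data_1.txt, inlined so the example is self-contained
header = "\t".join(["id"] + cols)
rows = [
    "0001\t0.54\t0.54\t0.85\t0.79\t0.54\t0.36\t0.28\t0.52\t0.21\t0.49",
    "0002\t0.72\t0.68\t0.77\t0.69\t0.07\tna\t0.29\t0.42\t0.32\t0.51",
]
sample = "\n".join([header] + rows) + "\n"

df = pd.read_csv(io.StringIO(sample), header=0, sep="\t",
                 na_values="na", dtype=dtype_map)

print(df["id"].tolist())  # id keeps its leading zeros
print(df.dtypes)
```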

You can also read everything as an object for the time being, and then change only the necessary parts later.

# First, read everything as object
df = pd.read_csv('data_1.txt', header = 0, sep = '\t', na_values = 'na', dtype = 'object')

var_lst = ['x01','x02','x03','x04','x05','x06','x07','x08','x09','x10']
df[var_lst] = df[var_lst].astype(float)    #Change data type to float
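As a variation on the conversion step, DataFrame.astype also accepts a dictionary that maps column names to types, so the conversion and a dtype check can be written as below. This is a sketch under assumptions: the inline `sample` is a shortened stand-in for data_1.txt with only two value columns.

```python
import io
import pandas as pd

# Shortened stand-in for data_1.txt (tab-delimited); an assumption for illustration
sample = "id\tx01\tx02\n0001\t0.54\t0.54\n0002\t0.72\tna\n"

# Read every column as object first
df = pd.read_csv(io.StringIO(sample), header=0, sep="\t",
                 na_values="na", dtype="object")

# Convert only the value columns to float; id stays as it was read
var_lst = ["x01", "x02"]
df = df.astype({c: float for c in var_lst})

print(df.dtypes)
print(df["id"].tolist())  # leading zeros are still there
```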