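How to use a regular expression in pandas to extract only the required strings from a file that cannot be parsed with a "," delimiter, and turn them into a DataFrame
If you pass the sample data below to read_csv as it is, the commas inside the parentheses are also treated as delimiters, so the rows split into different numbers of fields. The values end up in the wrong columns, and a parse error is raised whenever a later row has more fields than the first one.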
test.csv
value1=12333,value2(fuga,hoge),value3=fuga
value1=111,value2(hoge),value3=fugahoge
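First, read the file so that each line becomes a single value in one column. Specifying a separator that does not appear in the data (here, a tab) keeps the lines from being split.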
In [2]: import pandas as pd
In [3]: df = pd.read_csv('test.csv', header=None, sep='\t')
In [4]: df
Out[4]:
                                            0
0  value1=12333,value2(fuga,hoge),value3=fuga
1     value1=111,value2(hoge),value3=fugahoge
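The read step above can be reproduced without a file on disk. The following is a minimal sketch, assuming the sample text is held in a string (the names `sample` and `lines` are made up for illustration). It also shows one possible alternative that bypasses the CSV parser entirely, which may be safer if the real data could contain tab characters.

```python
import io

import pandas as pd

# The two sample lines from test.csv, held in memory for this sketch
sample = (
    "value1=12333,value2(fuga,hoge),value3=fuga\n"
    "value1=111,value2(hoge),value3=fugahoge\n"
)

# Same call as in the article: a tab separator never matches,
# so each whole line lands in a single column labelled 0
df = pd.read_csv(io.StringIO(sample), header=None, sep="\t")

# Alternative (an assumption, not from the article): build the Series
# directly from the lines, so no separator needs to be chosen at all
lines = pd.Series(sample.splitlines())
```

Both `df[0]` and `lines` hold one string per input line, so either can be passed on to the `.str` methods.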
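Use Series.str.extract() to split each line with a regular expression. Write the pattern as a raw string (r'...') so that backslashes such as \d and \( are passed to the regex engine unchanged.
In [5]: df[0].str.extract(r'value1=(?P<val1>\d+),value2\((?P<val2>[\w,]+)\),value3=(?P<val3>.*)')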
Out[5]:
    val1       val2      val3
0  12333  fuga,hoge      fuga
1    111       hoge  fugahoge
The column names can be specified with the "?P<name>" part of each capture group (a named group).
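To illustrate how group names affect the result, here is a small sketch that compares unnamed and named groups on a shortened, made-up input line (the Series `s` and the simplified patterns are assumptions for illustration only).

```python
import pandas as pd

# A shortened line used only for this comparison
s = pd.Series(["value1=111,value3=fugahoge"])

# Unnamed groups: the resulting columns are numbered 0, 1, ...
unnamed = s.str.extract(r"value1=(\d+),value3=(.*)")

# Named groups: each group name becomes the column label
named = s.str.extract(r"value1=(?P<val1>\d+),value3=(?P<val3>.*)")
```

Named groups save a separate renaming step when the extracted columns need meaningful labels.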
Also, the extracted values are returned as strings (object dtype), so convert them to int or another type as needed.
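The type conversion can be sketched as follows. This is a minimal example that rebuilds the sample lines in memory and converts val1 with astype; the variable names `lines`, `pattern`, and `extracted` are chosen here for illustration.

```python
import pandas as pd

# The sample lines, one string per row
lines = pd.Series([
    "value1=12333,value2(fuga,hoge),value3=fuga",
    "value1=111,value2(hoge),value3=fugahoge",
])

# Same pattern as in the article, written as a raw string
pattern = r"value1=(?P<val1>\d+),value2\((?P<val2>[\w,]+)\),value3=(?P<val3>.*)"
extracted = lines.str.extract(pattern)

# Every extracted column holds strings at first; convert val1 to integers
extracted["val1"] = extracted["val1"].astype(int)
```

If some rows might not match the pattern, extract returns missing values for them and astype(int) would fail; pd.to_numeric with errors="coerce" is one option for that case.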
http://sinhrks.hatenablog.com/entry/2014/12/06/233032