I do a lot of data pre-processing. Most data is fine as long as pandas can read it, but you occasionally come across files it cannot read, so this is a record of my struggle with one of them.
- The delimiter is a comma `,`, and every field is enclosed in double quotes `"`.
- A line feed code appears inside the data (between a `"` and a `"`).
- Commas appear inside the data.
- Double quotes appear inside the data.
↓ Concretely, the file looks like this
"a","b"
"1","Ho
ge""
"2","Fu,ga'"
- Create a csv file that pandas' `read_csv()` can read.
- Since this is the first step of the pre-processing, I want to end up with a DataFrame for the subsequent processing.
- Commas and quotes inside the data may carry meaning, so keep them as part of the data.
- A line feed code inside the data is nothing but trouble, so remove it.
↓ In other words, I want to make this kind of DataFrame.
a | b |
---|---|
1 | Hoge" |
2 | Fu,ga' |
If you just read it as is:

import pandas as pd

df = pd.read_csv('hoge.csv')
print(df)
 | a | b |
---|---|---|
1 | Ho\nge"\n2" | Fu,ga' |

The `""` inside the data is interpreted as an escaped quote, so the parser reads straight past the record boundary and everything collapses into one broken row (the `\n` above marks the line feeds left inside the value).
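The stdlib `csv` module shows the same mechanism as pandas here. A minimal check, assuming the file contents shown above (`raw` stands in for `hoge.csv`): the `""` inside a quoted field is treated as one escaped quote, so the parser never sees the end of the record.

```python
import csv
import io

# The raw file contents from the example above (quotes in the data are not escaped).
raw = '"a","b"\n"1","Ho\nge""\n"2","Fu,ga\'"\n'

# Because "" inside a quoted field means an escaped quote, the second field
# swallows the line break and the start of the next record.
rows = list(csv.reader(io.StringIO(raw)))
for row in rows:
    print(row)
```

Only two rows come back: the header, and one mangled data row containing pieces of both records.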
- The Python version is 3.7.
import re
import pandas as pd

# Read as text
with open('hoge.csv', 'r') as f:
    text = f.read()

tmp_text = re.sub('([^"])\n([^"])', r'\1\2', text)  # remove line feed codes (\n) in the middle of the data
tmp_text = re.sub('","', '\t', tmp_text)  # convert the delimiters to tabs
tmp_text = re.sub('(^"|"$)', '', tmp_text)  # remove the first and last quotes of the file
tmp_text = re.sub('"\n"', '\n', tmp_text)  # remove the quotes around the record breaks

# Write it out to a file once
with open('data.csv', 'w') as f:
    f.write(tmp_text)

# Check
df = pd.read_csv('data.csv', sep='\t')
print(df)
output

   a       b
0  1   Hoge"
1  2  Fu,ga'
It reads properly now. The contents of the intermediate file look like this:
data.csv
a	b
1	Hoge"
2	Fu,ga'
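The detour through `data.csv` is not strictly necessary; the same pipeline can run in memory with `io.StringIO`. A minimal sketch, where `raw` stands in for the contents of `hoge.csv` from the example:

```python
import io
import re

import pandas as pd

# Same cleanup as above, but done in memory (no intermediate data.csv).
raw = '"a","b"\n"1","Ho\nge""\n"2","Fu,ga\'"\n'

tmp = re.sub('([^"])\n([^"])', r'\1\2', raw)  # drop line feeds inside the data
tmp = re.sub('","', '\t', tmp)                # delimiter -> tab
tmp = re.sub('(^"|"$)', '', tmp)              # strip the first/last quote of the text
tmp = re.sub('"\n"', '\n', tmp)               # strip quotes around record breaks

df = pd.read_csv(io.StringIO(tmp), sep='\t')
print(df)
```

The writing-out step in the article is still handy when you want to eyeball the cleaned text before trusting it.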
Of course, this code breaks if the data contains `\t`. The same goes for a different line feed code. You need to check which characters actually appear in the data and decide what to replace them with. → Maybe write a method that does that check and takes the delimiter as a parameter?
I'll do it when I feel like it. If you work in Jupyter you can inspect the data and tweak the code easily, so maybe it isn't even necessary...?
Once pandas can read it, you're basically done. This should be useful not only before pandas, but also before the data is fed into BI tools.
**Regular expressions are convenient!**