I was asked to extract rows containing a certain string from a CSV of 50 columns x 3 million rows, where the match could appear in any column. The file was too large to open in Excel, so I tried it with pandas.
If your answer is "just use grep", say no more... This is purely pandas practice.
A CSV of 50 columns x 3 million rows is too big to reproduce here, so the following is sample data. (The actual data also contained NaN, so it was a little more complicated.)
import pandas as pd

df = pd.DataFrame({"NAME": ["Alice", "Bob", "Charly", "Eve", "Frank"],
                   "AGE": [10, 20, 30, 20, 10],
                   "ADDRESS": ["TOKYO", "OSAKA", "TOKYO", "OSAKA", "AICHI"],
                   "COMPANY_PLACE": ["TOKYO", "TOKYO", "AICHI", "OSAKA", "OSAKA"],
                   "BIRTH_PLACE": ["TOKYO", "OSAKA", "TOKYO", "OSAKA", "OSAKA"]
                   })
df.head()
Why not concatenate all the columns into one and do a partial-match search on that single column?
df["P"] = df['ADDRESS'].str.cat(df['COMPANY_PLACE'], sep='-').str.cat(df['BIRTH_PLACE'], sep='-')
df
df[df["P"].str.contains("OSAKA")]
The sample has only 5 columns, but the actual data has 50. Chaining str.cat for all of them by hand is a bit much...
df = df.drop("P", axis=1)

df["P"] = [""] * len(df)
for column in df.columns.values:
    if column != "P":
        df["P"] = df["P"].str.cat(df[column].astype(str), sep='-')
df[df["P"].str.contains("OSAKA")]
Since strings and numbers can be mixed across the columns, it is important to call astype(str) before cat-ing.
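To see why the astype(str) matters, here is a minimal sketch (using a trimmed-down version of the sample data): cat-ing a numeric column directly fails, while converting it to string first works.

```python
import pandas as pd

df = pd.DataFrame({"NAME": ["Alice", "Bob"], "AGE": [10, 20]})

# str.cat requires the other operand to contain only strings, so passing
# the numeric AGE column directly raises an error.
try:
    df["NAME"].str.cat(df["AGE"], sep="-")
except (TypeError, AttributeError) as e:
    print("failed without astype(str):", e)

# Converting to string first makes the concatenation work.
joined = df["NAME"].str.cat(df["AGE"].astype(str), sep="-")
print(joined.tolist())  # ['Alice-10', 'Bob-20']
```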
df["P"] = [False] * len(df)
for column in df.columns.values:
    if column != "P":
        df["P"] = df["P"] | df[column].astype(str).str.contains("OSAKA")
df[df["P"]]
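As a side note, the same any-column match can also be written without a temporary P column at all, by building a boolean mask with apply and any(axis=1). This is just an alternative sketch using the same sample data, not what I actually used at the time:

```python
import pandas as pd

df = pd.DataFrame({"NAME": ["Alice", "Bob", "Charly", "Eve", "Frank"],
                   "AGE": [10, 20, 30, 20, 10],
                   "ADDRESS": ["TOKYO", "OSAKA", "TOKYO", "OSAKA", "AICHI"],
                   "COMPANY_PLACE": ["TOKYO", "TOKYO", "AICHI", "OSAKA", "OSAKA"],
                   "BIRTH_PLACE": ["TOKYO", "OSAKA", "TOKYO", "OSAKA", "OSAKA"]
                   })

# Cast every column to string, test each one for the keyword, then keep
# rows where any column matched.
mask = df.astype(str).apply(lambda col: col.str.contains("OSAKA")).any(axis=1)
hits = df[mask]
print(hits["NAME"].tolist())  # ['Bob', 'Eve', 'Frank']
```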
In the end, grep + awk is still the strongest tool for this job. It was good pandas practice, though it felt like more effort than the result was worth...