Overview (TL; DR)

Any column can be used in csv of 50 columns x 3 million rows, so when I was asked to extract a row containing a certain character, I couldn't open it in Excel, so I tried it with Pandas.

Caution

*** grep *** If you use it, never say ... It's just a practice of pandas.

Sample data

It is troublesome to write in csv of 50 columns x 3 million rows, so the following is sample data. (The actual data is NaN, so it was a little more complicated.)

df = pd.DataFrame({"NAME":["Alice","Bob","Charly","Eve","Frank"],
                   "AGE":[10,20,30,20,10],
                  "ADDRESS":["TOKYO","OSAKA","TOKYO","OSAKA","AICHI"],
                  "COMPANY_PLACE":["TOKYO","TOKYO","AICHI","OSAKA","OSAKA"],
                  "BIRTH_PLACE":["TOKYO","OSAKA","TOKYO","OSAKA","OSAKA"]
                  })
df.head()

スクリーンショット 2020-03-27 23.17.58.png

Strategy A: Create a column that combines all columns

Why not combine all the columns and do a partial match search for that column!

df["P"] = df['ADDRESS'].str.cat(df['COMPANY_PLACE'], sep='-').str.cat(df['BIRTH_PLACE'], sep='-')
df

スクリーンショット 2020-03-27 23.20.40.png

df[df["P"].str.contains("OSAKA")]

The sample has 5 columns, but the actual data has 50 columns. It's a little to add them all ...

Strategy B: Let's devise the connection of columns


df = df.drop("P",axis=1)
df["P"] = [""] * len(df)
for column in df.columns.values:
    if column != "P":
        df["P"] = df["P"].str.cat(df[column].astype(str), sep='-')
df[df["P"].str.contains("OSAKA")]

There is a risk that strings and numbers will be mixed, so it is important to use astype (str) when cating.

Strategy C: Search each column and summarize the results.

df["P"] = [False] * len(df)
for column in df.columns.values:
    df["P"] = df["P"] | df[column].astype(str).str.contains("OSAKA")
df[df["P"]]

Conclusion

After all grep + awk is the strongest. It was a good training, but I felt drowned in the plan ...

Recommended Posts

How to find out if there is an arbitrary value in "somewhere" of pandas DataFrame

How to find the memory address of a Pandas dataframe value

How to check if a value exists in an enum

How to get an overview of your data in Pandas

How to reassign index in pandas dataframe

Is there NaN in the pandas DataFrame?

How to find out what kind of files are stored in S3 in Python

python: Tips for displaying an array (list) with an index (how to find out what number an element of an array is)

How to check in Python if one of the elements of a list is in another list

How to find the optimal number of clusters in k-means

[Pandas] If the first row data is in the header in DataFrame

[Python] How to write an if statement in one sentence.

How to check if the contents of the dictionary are the same in Python by hash value

Find out how many each character is in the string.

In the python dictionary, if a non-existent key is accessed, initialize it with an arbitrary value

I tried to find out in which language that software that I always take care of is written

How to know the internal structure of an object in Python

How to create a heatmap with an arbitrary domain in Python

I tried to find out if ReDoS is possible with Python

What to do if "Unnamed: 0" is added in to_csv-> read_csv in pandas

I will explain how to use Pandas in an easy-to-understand manner.

How to write soberly in pandas

What to do if there is a decimal in python json .dumps

How to change multiple columns of csv in Pandas (Unixtime-> Japan Time)

How to get a specific column name and index name in pandas DataFrame

How to find out the number of CPUs without using the sar command

How to compare if the contents of the objects in scipy.sparse.csr_matrix are the same

How to find a specific type (str, float etc) column in a DataFrame column

I want to leave an arbitrary command in the command history of Shell

In python pandas SettingWithCopyWarning A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc [row_indexer, col_indexer] = value instead

[Python] Summary of how to use pandas

[Pandas] What is set_option [How to use]

How to get the "name" of a field whose value is limited by the choice attribute in Django's model

How to read CSV files in Pandas

How to use is and == in Python

Find out how to divide a file with a certain number of lines evenly

How to replace with Pandas DataFrame, which is useful for data analysis (easy)

How to find the coefficient of the trendline that passes through the vertices in Python

How to find the area of the Voronoi diagram

How to keep track of work in Powershell

Summary of how to import files in Python 3

How to get help in an interactive shell

Delete rows with arbitrary values in pandas DataFrame

Summary of how to use MNIST in Python

Find the divisor of the value entered in python

How to store Python function in Value of dictionary (dict) and call function according to Key

How to use any or all to check if it is in a dictionary (Hash)

What to do if the progress bar is not displayed in tqdm of python

I used Python to find out about the role choices of the 51 "Yachts" in the world.

How to find out which process is using the localhost port and stop it