How to find out if there is an arbitrary value in "somewhere" of pandas DataFrame

Overview (TL; DR)

Any column can be used in csv of 50 columns x 3 million rows, so when I was asked to extract a row containing a certain character, I couldn't open it in Excel, so I tried it with Pandas.

Caution

*** grep *** If you use it, never say ... It's just a practice of pandas.

Sample data

It is troublesome to write in csv of 50 columns x 3 million rows, so the following is sample data. (The actual data is NaN, so it was a little more complicated.)

df = pd.DataFrame({"NAME":["Alice","Bob","Charly","Eve","Frank"],
                   "AGE":[10,20,30,20,10],
                  "ADDRESS":["TOKYO","OSAKA","TOKYO","OSAKA","AICHI"],
                  "COMPANY_PLACE":["TOKYO","TOKYO","AICHI","OSAKA","OSAKA"],
                  "BIRTH_PLACE":["TOKYO","OSAKA","TOKYO","OSAKA","OSAKA"]
                  })
df.head()

スクリーンショット 2020-03-27 23.17.58.png

Strategy A: Create a column that combines all columns

Why not combine all the columns and do a partial match search for that column!

df["P"] = df['ADDRESS'].str.cat(df['COMPANY_PLACE'], sep='-').str.cat(df['BIRTH_PLACE'], sep='-')
df

スクリーンショット 2020-03-27 23.20.40.png

df[df["P"].str.contains("OSAKA")]
スクリーンショット 2020-03-27 23.23.07.png

The sample has 5 columns, but the actual data has 50 columns. It's a little to add them all ...

Strategy B: Let's devise the connection of columns


df = df.drop("P",axis=1)
df["P"] = [""] * len(df)
for column in df.columns.values:
    if column != "P":
        df["P"] = df["P"].str.cat(df[column].astype(str), sep='-')
df[df["P"].str.contains("OSAKA")]

There is a risk that strings and numbers will be mixed, so it is important to use astype (str) when cating.

スクリーンショット 2020-03-28 0.18.40.png

Strategy C: Search each column and summarize the results.

df["P"] = [False] * len(df)
for column in df.columns.values:
    df["P"] = df["P"] | df[column].astype(str).str.contains("OSAKA")
df[df["P"]]
スクリーンショット 2020-03-28 0.10.06.png

Conclusion

After all grep + awk is the strongest. It was a good training, but I felt drowned in the plan ...

Recommended Posts

How to find out if there is an arbitrary value in "somewhere" of pandas DataFrame
How to find the memory address of a Pandas dataframe value
How to check if a value exists in an enum
How to get an overview of your data in Pandas
How to reassign index in pandas dataframe
Is there NaN in the pandas DataFrame?
How to find out what kind of files are stored in S3 in Python
python: Tips for displaying an array (list) with an index (how to find out what number an element of an array is)
How to check in Python if one of the elements of a list is in another list
How to find the optimal number of clusters in k-means
[Pandas] If the first row data is in the header in DataFrame
[Python] How to write an if statement in one sentence.
How to check if the contents of the dictionary are the same in Python by hash value
Find out how many each character is in the string.
In the python dictionary, if a non-existent key is accessed, initialize it with an arbitrary value
I tried to find out in which language that software that I always take care of is written
How to know the internal structure of an object in Python
How to create a heatmap with an arbitrary domain in Python
I tried to find out if ReDoS is possible with Python
What to do if "Unnamed: 0" is added in to_csv-> read_csv in pandas
I will explain how to use Pandas in an easy-to-understand manner.
How to write soberly in pandas
What to do if there is a decimal in python json .dumps
How to change multiple columns of csv in Pandas (Unixtime-> Japan Time)
How to get a specific column name and index name in pandas DataFrame
How to find out the number of CPUs without using the sar command
How to compare if the contents of the objects in scipy.sparse.csr_matrix are the same
How to find a specific type (str, float etc) column in a DataFrame column
I want to leave an arbitrary command in the command history of Shell
In python pandas SettingWithCopyWarning A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc [row_indexer, col_indexer] = value instead
[Python] Summary of how to use pandas
[Pandas] What is set_option [How to use]
How to get the "name" of a field whose value is limited by the choice attribute in Django's model
How to read CSV files in Pandas
How to use is and == in Python
Find out how to divide a file with a certain number of lines evenly
How to replace with Pandas DataFrame, which is useful for data analysis (easy)
How to find the coefficient of the trendline that passes through the vertices in Python
How to find the area of the Voronoi diagram
How to keep track of work in Powershell
Summary of how to import files in Python 3
How to get help in an interactive shell
Delete rows with arbitrary values in pandas DataFrame
Summary of how to use MNIST in Python
Find the divisor of the value entered in python
How to store Python function in Value of dictionary (dict) and call function according to Key
How to use any or all to check if it is in a dictionary (Hash)
What to do if the progress bar is not displayed in tqdm of python
I used Python to find out about the role choices of the 51 "Yachts" in the world.
How to find out which process is using the localhost port and stop it