When working with data in Python, Pandas comes up constantly, but until now, whenever I searched for specific values, I processed rows one by one with a for loop or df.iterrows(). As the data grows, that gets slow. After looking into better approaches, it seems much better to pass a boolean mask (filter) to df.loc[].
For example, suppose you have a data frame like this:
print(df)
A B C D E
TIME
2001-01-08 06:00:00 -0.5925 -1.11 0.055023 -0.074103 0.129126
2001-01-08 06:05:00 -0.4213 NaN NaN NaN NaN
2001-01-08 06:10:00 -0.3613 -2.19 NaN NaN NaN
2001-01-08 06:15:00 -0.3613 NaN NaN NaN 0.224574
2001-01-08 06:20:00 -0.3613 NaN 0.129036 -0.000281 1.094003
2001-01-08 06:25:00 -0.8243 -0.65 0.137242 -0.022502 2.145931
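For reproducibility, here is one way to construct an equivalent frame. The values are copied from the printout above; the index name and the 5-minute frequency are read off the TIME column.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the printed values above.
idx = pd.date_range("2001-01-08 06:00", periods=6, freq="5min", name="TIME")
df = pd.DataFrame({
    "A": [-0.5925, -0.4213, -0.3613, -0.3613, -0.3613, -0.8243],
    "B": [-1.11, np.nan, -2.19, np.nan, np.nan, -0.65],
    "C": [0.055023, np.nan, np.nan, np.nan, 0.129036, 0.137242],
    "D": [-0.074103, np.nan, np.nan, np.nan, -0.000281, -0.022502],
    "E": [0.129126, np.nan, np.nan, 0.224574, 1.094003, 2.145931],
}, index=idx)
print(df)
```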
I want the rows where column B is NaN
print(df.loc[df["B"].isnull()])
A B C D E
TIME
2001-01-08 06:05:00 -0.4213 NaN NaN NaN NaN
2001-01-08 06:15:00 -0.3613 NaN NaN NaN 0.224574
2001-01-08 06:20:00 -0.3613 NaN 0.129036 -0.000281 1.094003
I want the rows where column E is not NaN
print(df.loc[df["E"].notnull()])
A B C D E
TIME
2001-01-08 06:00:00 -0.5925 -1.11 0.055023 -0.074103 0.129126
2001-01-08 06:15:00 -0.3613 NaN NaN NaN 0.224574
2001-01-08 06:20:00 -0.3613 NaN 0.129036 -0.000281 1.094003
2001-01-08 06:25:00 -0.8243 -0.65 0.137242 -0.022502 2.145931
I want the rows that have NaN in any column
print(df.loc[df.isnull().any(axis=1)])
A B C D E
TIME
2001-01-08 06:05:00 -0.4213 NaN NaN NaN NaN
2001-01-08 06:10:00 -0.3613 -2.19 NaN NaN NaN
2001-01-08 06:15:00 -0.3613 NaN NaN NaN 0.224574
2001-01-08 06:20:00 -0.3613 NaN 0.129036 -0.000281 1.094003
I want only the first three columns of the rows where column C is NaN
print(df.loc[df["C"].isnull(), df.columns[:3]])
A B C
TIME
2001-01-08 06:05:00 -0.4213 NaN NaN
2001-01-08 06:10:00 -0.3613 -2.19 NaN
I want columns A and B of the rows where column D is NaN
print(df.loc[df["D"].isnull(), ["A", "B"]])
A B
TIME
2001-01-08 06:05:00 -0.4213 NaN
2001-01-08 06:10:00 -0.3613 -2.19
2001-01-08 06:15:00 -0.3613 NaN
Modify the values in column A for the rows where column D is NaN (try subtracting 1)
df.loc[df["D"].isnull(), "A"] -= 1
print(df["A"])
A
TIME
2001-01-08 06:00:00 -0.5925
2001-01-08 06:05:00 -1.4213
2001-01-08 06:10:00 -1.3613
2001-01-08 06:15:00 -1.3613
2001-01-08 06:20:00 -0.3613
2001-01-08 06:25:00 -0.8243
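The masked assignment above can be reproduced with a minimal self-contained frame; the values here are borrowed from the first three rows of the example.

```python
import numpy as np
import pandas as pd

# First three rows of the example, columns A and D only.
df = pd.DataFrame({"A": [-0.5925, -0.4213, -0.3613],
                   "D": [-0.074103, np.nan, np.nan]})

# Subtract 1 from A, but only on the rows where D is NaN.
df.loc[df["D"].isnull(), "A"] -= 1
print(df["A"])
```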
Conditional search on the data in this state
print(df.loc[df["A"] <= -1])
A B C D E
TIME
2001-01-08 06:05:00 -1.4213 NaN NaN NaN NaN
2001-01-08 06:10:00 -1.3613 -2.19 NaN NaN NaN
2001-01-08 06:15:00 -1.3613 NaN NaN NaN 0.224574
The example here is small, but this approach stays fast even on large data frames.
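To illustrate the speed claim, here is a rough sketch that times a row-by-row iterrows() search against a boolean mask on hypothetical data (the sizes, seed, and NaN fraction are made up for the comparison); both approaches select the same rows, but the mask is vectorized.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical data: 10,000 rows with NaN scattered through column "B".
rng = np.random.default_rng(0)
df = pd.DataFrame({"A": rng.normal(size=10_000),
                   "B": rng.normal(size=10_000)})
df.loc[df.sample(frac=0.3, random_state=0).index, "B"] = np.nan

# Slow: collect matching index labels one row at a time.
t0 = time.perf_counter()
slow = [idx for idx, row in df.iterrows() if pd.isnull(row["B"])]
t_loop = time.perf_counter() - t0

# Fast: a single vectorized boolean mask.
t0 = time.perf_counter()
fast = df.loc[df["B"].isnull()]
t_mask = time.perf_counter() - t0

print(f"iterrows: {t_loop:.4f}s  mask: {t_mask:.4f}s")
```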
Postscript
df.apply(function, axis=1)
seems to be the best choice when you want to transform values row by row with a complicated conditional expression.
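As a sketch of that pattern: with axis=1, the function receives each row as a Series, so a condition can combine several columns at once. The rule and column values below are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [-0.5, -1.4, 0.3],
                   "B": [np.nan, -2.19, 0.1]})

# Hypothetical rule: flag rows where A is negative AND B is either
# missing or below -1.
def flag(row):
    return row["A"] < 0 and (pd.isnull(row["B"]) or row["B"] < -1)

df["flagged"] = df.apply(flag, axis=1)
print(df)
```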