While processing some data, I needed to flag odd or even rows. I tried a few things to speed this up, so here are my notes.
As a prerequisite, assume you have a DataFrame with 10,000 rows like the one below.
import numpy as np
import pandas as pd

df = pd.DataFrame({'hoge': np.zeros(10000)})
df
| | hoge |
|---|---|
| 0 | 0.0 |
| 1 | 0.0 |
| 2 | 0.0 |
| 3 | 0.0 |
| 4 | 0.0 |
| ... | ... |
| 9995 | 0.0 |
| 9996 | 0.0 |
| 9997 | 0.0 |
| 9998 | 0.0 |
| 9999 | 0.0 |
Now add a column called 'target_record' to this DataFrame, flagging the odd or even rows.
df['target_record'] = [1,0,1,0,1,...,0,1,0,1,0]
df
| | hoge | target_record |
|---|---|---|
| 0 | 0.0 | 1 |
| 1 | 0.0 | 0 |
| 2 | 0.0 | 1 |
| 3 | 0.0 | 0 |
| 4 | 0.0 | 1 |
| ... | ... | ... |
| 9995 | 0.0 | 0 |
| 9996 | 0.0 | 1 |
| 9997 | 0.0 | 0 |
| 9998 | 0.0 | 1 |
| 9999 | 0.0 | 0 |
Let's measure the time it takes to create this target_record column. The processing times below are averages over 10,000 runs.
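The post doesn't show the measurement code itself; a minimal sketch of how such an average could be taken with `timeit` (the run count is reduced here to keep it quick, and the function name `flag_with_loc` is mine):

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'hoge': np.zeros(10000)})

def flag_with_loc():
    # The first approach below: fill with 0, then set 1 on every other row
    df['target_record'] = 0
    df.loc[0::2, 'target_record'] = 1

# Average over repeated runs (the post averages over 10,000 runs)
n_runs = 100
avg = timeit.timeit(flag_with_loc, number=n_runs) / n_runs
print(f"Average processing time: {avg} sec")
```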
First, the simplest(?) approach: add a 'target_record' column with 0 assigned to all records, then assign 1 to the target rows with loc + slicing.
df['target_record'] = 0
df.loc[0::2, 'target_record'] = 1  # for even rows: df.loc[1::2, 'target_record'] = 1
# Average processing time: 0.0009912237882614137 sec
By the way, with iloc:
df['target_record'] = 0
df.iloc[0::2, 1] = 1
# Average processing time: 0.0009658613920211792 sec
It seems slightly faster than loc.
It's well known that loc is slow, so let's create an array with NumPy and assign 1 via a slice instead.
target_record = np.zeros(10000, dtype=int)
target_record[0::2] = 1  # for even rows: target_record[1::2] = 1
df['target_record'] = target_record
# Average processing time: 0.00035130116939544677 sec
The processing time is reduced to about one third.
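Both approaches should produce identical columns; here is a quick sanity check (the column names `via_loc` and `via_numpy` are mine, just for comparison):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'hoge': np.zeros(10000)})

# loc + slice version
df['via_loc'] = 0
df.loc[0::2, 'via_loc'] = 1

# np.zeros + slice version
flags = np.zeros(10000, dtype=int)
flags[0::2] = 1
df['via_numpy'] = flags

# The two columns should match exactly
assert (df['via_loc'] == df['via_numpy']).all()
```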
Next, create an array from 0 to 9999 with np.arange(10000) and assign the remainder when divided by 2.
target_record = np.arange(10000)
df['target_record'] = (target_record + 1) % 2  # for even rows: df['target_record'] = target_record % 2
# Average processing time: 0.00046031529903411863 sec
It's a bit more clever, but np.zeros + slice is still about 0.0001 seconds faster.
So when flagging odd or even rows, is np.zeros + slice the fastest?
By the way, when I timed each step separately, I could see where the difference comes from: whether you compute a remainder or assign via a slice. There was almost no difference between np.zeros and np.arange themselves (zeros was faster by about 1.0e-06 seconds).
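A rough sketch of how just those two steps could be compared, leaving the DataFrame out of it (the function names are mine, and exact timings will vary by machine):

```python
import timeit

import numpy as np

n = 10000

def zeros_slice():
    # Allocate zeros, then assign 1 via a slice
    a = np.zeros(n, dtype=int)
    a[0::2] = 1
    return a

def arange_mod():
    # Allocate 0..n-1, then compute the remainder
    a = np.arange(n)
    return (a + 1) % 2

runs = 1000
t_slice = timeit.timeit(zeros_slice, number=runs) / runs
t_mod = timeit.timeit(arange_mod, number=runs) / runs
print(f"zeros + slice: {t_slice} sec, arange + mod: {t_mod} sec")
```

Both functions return the same flag array, so any timing gap is attributable to the slice assignment versus the remainder computation.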