memo.

Extract periods that match a particular pattern from pandas time series qualitative data

import numpy as np
import pandas as pd

np.random.seed(99)
df = pd.DataFrame(index=pd.date_range('2020-01-01', periods=60, freq='D'),
                  data=np.array(list('Sunny cloudy rain snow'))[np.random.randint(0, 4, 60)],
                  columns=['weather'], dtype='string')
df

weather
2020-01-01 Cloudy
2020-01-02 Snow
2020-01-03 Cloudy
2020-01-04 fine
2020-01-05 Cloudy
2020-01-06 fine
2020-01-07 Rain
2020-01-08 fine
2020-01-09 Cloudy
2020-01-10 fine
2020-01-11 Cloudy
2020-01-12 snow
2020-01-13 rain
2020-01-14 snow
2020-01-15 snow
2020-01-16 cloudy
2020-01-17 Cloudy
2020-01-18 fine
2020-01-19 cloudy
2020-01-20 snow
2020-01-21 rain
2020-01-22 snow
2020-01-23 fine
2020-01-24 snow
2020-01-25 rain
2020-01-26 fine
2020-01-27 fine
2020-01-28 fine
2020-01-29 fine
2020-01-30 fine
2020-01-31 cloudy
2020-02-01 Snow
2020-02-02 Snow
2020-02-03 Snow
2020-02-04 Rain
2020-02-05 Snow
2020-02-06 Snow
2020-02-07 Rain
2020-02-08 Cloudy
2020-02-09 rain
2020-02-10 fine
2020-02-11 fine
2020-02-12 cloudy
2020-02-13 rain
2020-02-14 cloudy
2020-02-15 cloudy
2020-02-16 fine
2020-02-17 Cloudy
2020-02-18 snow
2020-02-19 cloudy
2020-02-20 snow
2020-02-21 snow
2020-02-22 rain
2020-02-23 rain
2020-02-24 rain
2020-02-25 snow
2020-02-26 cloudy
2020-02-27 fine
2020-02-28 fine
2020-02-29 fine

It is a record when you want to extract the period showing a specific weather pattern when you have the above data.

Extract sunny days

Judgment on a daily basis is easy.

df[df['weather'] == 'Fine']

weather
2020-01-04 fine
2020-01-06 fine
2020-01-08 fine
2020-01-10 fine
2020-01-18 fine
2020-01-23 fine
2020-01-26 fine
2020-01-27 fine
2020-01-28 fine
2020-01-29 fine
2020-01-30 fine
2020-02-10 fine
2020-02-11 fine
2020-02-16 fine
2020-02-27 fine
2020-02-28 fine
2020-02-29 fine

Extract the period of "sunny for 2 days or more"

I just want to finish it.

df[(df['weather'] == 'Fine') & (df['weather'].shift(-1, fill_value='') == 'Fine')]

weather
2020-01-26 fine
2020-01-27 fine
2020-01-28 fine
2020-01-29 fine
2020-02-10 fine
2020-02-27 fine
2020-02-28 fine

This has extracted the day of "the day offineand the next day of fine", and failed to extract the "period offine". The last day of the period, specifically January 30, February 11, and February 29, is missed in this example. The day after the day when the result of the conditional expression is True must also be True.

So the following is the correct answer.

cond = (df['weather'] == 'Fine') & (df['weather'].shift(-1, fill_value='') == 'Fine')
cond |= cond.shift(fill_value=False)
out_df = df[cond]

out_df

weather
2020-01-26 fine
2020-01-27 fine
2020-01-28 fine
2020-01-29 fine
2020-01-30 fine
2020-02-10 fine
2020-02-11 fine
2020-02-27 fine
2020-02-28 fine
2020-02-29 fine

cond |= cond.shift(fill_value=False)Iscond = cond | cond.shift(fill_value=False)is what it means.

By the way, in order to assign a group number for each period, you can see if it will be fine the next day, so use diff () to

ngroup = cond.diff()[cond].cumsum()

out_df.assign(ngroup=ngroup)

Weather ngroup
2020-01-26 fine 1
2020-01-27 fine 1
2020-01-28 fine 1
2020-01-29 fine 1
2020-01-30 fine 1
2020-02-10 fine 2
2020-02-11 fine 2
2020-02-27 fine 3
2020-02-28 fine 3
2020-02-29 fine 3

Extract the period of "sunny for 3 days or more"

It is an application of the previous section.

cond = ((df['weather'] == 'Fine') & (df['weather'].shift(-1, fill_value='') == 'Fine')
        & (df['weather'].shift(-2, fill_value='') == 'Fine'))
cond |= cond.shift(1, fill_value=False) | cond.shift(2, fill_value=False)
out_df = df[cond]

out_df

weather
2020-01-26 fine
2020-01-27 fine
2020-01-28 fine
2020-01-29 fine
2020-01-30 fine
2020-02-27 fine
2020-02-28 fine
2020-02-29 fine

Extract the period of "sunny for 5 days or more"

How to write as follows

cond = ((df['weather'] == 'Fine')
        & (df['weather'].shift(-1, fill_value='') == 'Fine')
        & (df['weather'].shift(-2, fill_value='') == 'Fine')
        & (df['weather'].shift(-3, fill_value='') == 'Fine')
        & (df['weather'].shift(-4, fill_value='') == 'Fine'))
cond |= (cond.shift(1, fill_value=False) | cond.shift(2, fill_value=False)
         | cond.shift(3, fill_value=False) | cond.shift(4, fill_value=False))
out_df = df[cond]

out_df

Is painful to go this far, so use scipy.ndimage.binary_closing () to

from scipy.ndimage import binary_closing

days = 5
cond = binary_closing((df['weather'] != 'Fine').to_numpy(dtype=bool),
                      np.ones(days, dtype=bool), border_value=True)
df[~cond]

weather
2020-01-26 fine
2020-01-27 fine
2020-01-28 fine
2020-01-29 fine
2020-01-30 fine

Extract the first two days of the "sunny for more than two days" period

If you want to get rain [rainy] sunny sunny sunny rain when there is rain fine sunny sunny sunny rain.

Applying the diff () used when assigning the group number,

cond = (df['weather'] == 'Fine') & (df['weather'].shift(-1, fill_value='') == 'Fine')
cond &= cond.diff()
cond |= cond.shift(fill_value=False)
out_df = df[cond]

out_df

weather
2020-01-26 fine
2020-01-27 fine
2020-02-10 fine
2020-02-11 fine
2020-02-27 fine
2020-02-28 fine

Extract the last two days of the "sunny for more than two days" period

If you want to get rain, sunny, sunny, sunny, rain when you have rain, fine, sunny, sunny, rain.

It is an application of the previous section.

cond = (df['weather'] == 'Fine') & (df['weather'].shift(-1, fill_value='') == 'Fine')
cond = ~cond & cond.diff()
cond |= cond.shift(-1, fill_value=False)
out_df = df[cond]

out_df

weather
2020-01-29 fine
2020-01-30 fine
2020-02-10 fine
2020-02-11 fine
2020-02-28 fine
2020-02-29 fine

Extract the period of "2 days in a row and sunny (excluding 3 days or more)"

It's easy to think of it as an application of the previous section, and add "not sunny the day before" and "not sunny the day after tomorrow" as the conditions. .. ..

cond = ((df['weather'] == 'Fine') & (df['weather'].shift(-1, fill_value='') == 'Fine')
        & (df['weather'].shift(1, fill_value='') != 'Fine')
        & (df['weather'].shift(-2, fill_value='') != 'Fine'))
cond |= cond.shift(fill_value=False)
out_df = df[cond]

out_df

weather
2020-02-10 fine
2020-02-11 fine

For the time being, you can also do this.

is_sunny = (df['weather'] == 'Fine').to_numpy(dtype=int)
# array([0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
#        1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
#        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

base = 2**np.arange(4)
#Weather base
# 2020-02-09 rain* 1
# 2020-02-10 fine* 2
# 2020-02-11 fine* 4
# 2020-02-12 cloudy* 8
# 0*1 + 1*2 + 1*4 + 0*8 = 6
target_number = 6
cond = np.convolve(is_sunny, base) == target_number
out_df = df[cond[2:-1] | cond[3:]]

Extract the period from "from a sunny day to the next rainy day"

First, extract only the sunny and rainy days, and get the sunny day and the next rainy day. Then use cumsum () to separate the groups and extract.

is_sun_rain = df['weather'].isin(set('Sunny rain'))

subdf = df[is_sun_rain]
subcond = ((subdf['weather'] == 'Fine')
           & (subdf['weather'].shift(-1, fill_value='') == 'rain'))
subcond |= subcond.shift(fill_value=False)

cond = ((is_sun_rain & subcond).cumsum() % 2).astype('boolean')
cond |= cond.shift(fill_value=False)
out_df = df[cond]

ngroup = cond.diff()[cond].cumsum()
out_df.assign(ngroup=ngroup)

Weather ngroup
2020-01-06 fine 1
2020-01-07 rain 1
2020-01-10 fine 2
2020-01-11 cloudy 2
2020-01-12 snow 2
2020-01-13 rain 2
2020-01-18 fine 3
2020-01-19 cloudy 3
2020-01-20 snow 3
2020-01-21 rain 3
2020-01-23 fine 4
2020-01-24 snow 4
2020-01-25 rain 4
2020-01-30 fine 5
2020-01-31 cloudy 5
2020-02-01 snow 5
2020-02-02 Snow 5
2020-02-03 snow 5
2020-02-04 rain 5
2020-02-11 fine 6
2020-02-12 cloudy 6
2020-02-13 rain 6
2020-02-16 fine 7
2020-02-17 cloudy 7
2020-02-18 snow 7
2020-02-19 cloudy 7
2020-02-20 snow 7
2020-02-21 snow 7
2020-02-22 rain 7

Extract the period from a sunny day to the next rainy day with a cloudy day in between

It is an application of the previous section.

is_sun_rain = df['weather'].isin(set('Sunny rain'))

subdf = df[is_sun_rain]
subcond = ((subdf['weather'] == 'Fine')
           & (subdf['weather'].shift(-1, fill_value='') == 'Cloudy')
           & (subdf['weather'].shift(-2, fill_value='') == 'rain'))
subcond |= subcond.shift(2, fill_value=False)

cond = ((is_sun_rain & subcond).cumsum() % 2).astype('boolean')
cond |= cond.shift(fill_value=False)

out_df = df[cond]

ngroup = cond.diff()[cond].cumsum()
out_df.assign(ngroup=ngroup)

Weather ngroup
2020-01-10 fine 1
2020-01-11 cloudy 1
2020-01-12 snow 1
2020-01-13 rain 1
2020-01-18 fine 2
2020-01-19 cloudy 2
2020-01-20 snow 2
2020-01-21 rain 2
2020-01-30 fine 3
2020-01-31 cloudy 3
2020-02-01 snow 3
2020-02-02 Snow 3
2020-02-03 snow 3
2020-02-04 rain 3
2020-02-11 fine 4
2020-02-12 cloudy 4
2020-02-13 rain 4

arr, _ = df['weather'].factorize()
# 0: 'Cloudy', 1: 'snow', 2: 'Fine', 3: 'rain'
subarr, = (arr != 1).nonzero()
subcond = np.convolve(arr[subarr], [1, 10, 100]) == 203

flags = np.zeros(arr.size, dtype=int)
flags[subarr[subcond[2:]]] = 1
flags[subarr[subcond[:-2]]+1] = -1
cond = flags.cumsum().astype(bool)
out_df = df[cond]

ngroup = flags[cond].cumsum()
out_df.assign(ngroup=ngroup)