Recently, I've been reviewing Python code that processes data and runs simple alerting jobs in batch.
In it, I was surprised to find the following function (slightly rewritten).
import pandas as pd

def select_this_month(df: pd.DataFrame) -> pd.DataFrame:
    """Extract only this month's data

    Args:
        df: Data frame to be filtered
    Returns:
        Filtered data frame
    """
    now = df["Date"].max()
    y, m = now[0:4], now[5:7]
    res = df[df['Date'] >= (y + "-" + m + "-01")]
    return res
Reading this code, I was confused: it takes the maximum value with pandas `max` (so is the column numeric or datetime?), and then extracts substrings with slices (so is it a string?). I wondered, "What is this? Some built-in pandas date type?"
On closer inspection, the column simply held plain strings, and Python's str type defines the comparison operators (inequality signs).
a = '2016-12-31'
b = '2016-01-01'
a > b
# => True
a < b
# => False
I also found the place in the official Python documentation that explicitly states how this comparison works.
Strings (instances of str) compare lexicographically using the numerical Unicode code points (the result of the built-in function ord()) of their characters.
Strings and binary sequences cannot be directly compared.
The documentation for the ord function is consistent with this, so strings do appear to be compared by Unicode code point.
For a string that represents a single Unicode character, returns an integer that represents the Unicode code point for that character. For example, ord('a') returns the integer 97 and ord('€') (the euro sign) returns 8364. This is the opposite of chr().
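To make the quoted rule concrete, here is a minimal sketch that compares two strings character by character using `ord()`. The helper name `compare_by_code_points` is hypothetical and written only for illustration; it is not how CPython implements comparison internally, just a model of the documented behavior.

```python
def compare_by_code_points(a: str, b: str) -> int:
    """Return -1, 0, or 1 by comparing code points from left to right.

    Hypothetical helper for illustration only.
    """
    for ca, cb in zip(a, b):
        if ord(ca) != ord(cb):
            # The first differing character decides the result.
            return -1 if ord(ca) < ord(cb) else 1
    # All shared positions are equal, so the shorter string sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))


a, b = "2016-12-31", "2016-01-01"
print(ord("1"), ord("0"))            # 49 48
print(compare_by_code_points(a, b))  # 1, decided at index 5: "1" > "0"
print(a > b)                         # True, matching the helper
```

The first five characters are identical, so the comparison is decided by the first digit of the month, which is why date strings in this format happen to sort chronologically.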
So, as long as every date is a zero-padded string in the same YYYY-MM-DD format, the code seems to behave correctly even though the column is a string type.
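That result depends on the zero padding. As a small sketch with made-up sample values, the same comparison gives the wrong answer once the month and day are written without leading zeros:

```python
# Sample values invented for illustration.
padded = ["2016-02-01", "2016-10-01"]
unpadded = ["2016-2-1", "2016-10-1"]

# With zero padding, lexicographic order equals chronological order.
print(max(padded))    # 2016-10-01

# Without padding, "2" > "1" at index 5, so February sorts after October.
print(max(unpadded))  # 2016-2-1
```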
Even so, I think it's better to use a proper date type after all.
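As one possible sketch of what that could look like, the function below parses the column with `pd.to_datetime` before comparing. The name `select_this_month_dt` is hypothetical, and it assumes the `Date` column holds values that `pd.to_datetime` can parse; it is not the code from the reviewed project.

```python
import pandas as pd


def select_this_month_dt(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows whose Date falls in the month of the latest Date.

    Hypothetical variant for illustration; assumes df["Date"] is parseable.
    """
    dates = pd.to_datetime(df["Date"])
    latest = dates.max()
    # First day of the latest month, at midnight.
    month_start = latest.replace(day=1).normalize()
    return df[dates >= month_start]


# Sample data invented for illustration.
sample = pd.DataFrame(
    {"Date": ["2016-11-30", "2016-12-01", "2016-12-31"], "value": [1, 2, 3]}
)
print(select_this_month_dt(sample))
```

Because the comparison is done on real timestamps, it no longer relies on how the strings happen to be formatted.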