It is said that there is a pattern that behaves unintentionally if the date format string type is used carelessly when calculating TimeStampType of PySpark.
Therefore, it is possible to calculate with string for TimeStampType, but it is safer to use datetime.
A specific example will be described.
The example shown here is from PySPark 2.4.4.
With the following code, create a Spark DataFrame with date data from January 1, 2000 to January 5, 2000, and perform conditional processing on this date data.
Verification data
import pandas as pd
from pyspark.sql import functions as F
pdf = pd.DataFrame(pd.date_range(start='1/1/2000', periods=5), columns=['date'])
sdf = spark.createDataFrame(pdf, ['timestamp'])
The operation using datetime for TimeStampType works normally.
TimeStampType and datetime operations
target_datetime = datetime.strptime('2000-01-03', '%Y-%m-%d')
print('== datetime(2000-01-03)')
sdf.where(F.col('timestamp') == datetime.strptime('2000-01-03', '%Y-%m-%d')).show()
print('> datetime(2000-01-03)')
sdf.where(F.col('timestamp') > datetime.strptime('2000-01-03', '%Y-%m-%d')).show()
print('>= datetime(2000-01-03)')
sdf.where(F.col('timestamp') >= datetime.strptime('2000-01-03', '%Y-%m-%d')).show()
print('< datetime(2000-01-03)')
sdf.where(F.col('timestamp') < datetime.strptime('2000-01-03', '%Y-%m-%d')).show()
print('<= datetime(2000-01-03)')
sdf.where(F.col('timestamp') <= datetime.strptime('2000-01-03', '%Y-%m-%d')).show()
print('between datetime(2000-01-02) and datetime(2000-01-04)')
sdf.where(F.col('timestamp').between(datetime.strptime('2000-01-02', '%Y-%m-%d'), datetime.strptime('2000-01-04', '%Y-%m-%d'))).show()
Output result
== datetime(2000-01-03)
+-------------------+
| timestamp|
+-------------------+
|2000-01-03 00:00:00|
+-------------------+
> datetime(2000-01-03)
+-------------------+
| timestamp|
+-------------------+
|2000-01-04 00:00:00|
|2000-01-05 00:00:00|
+-------------------+
>= datetime(2000-01-03)
+-------------------+
| timestamp|
+-------------------+
|2000-01-03 00:00:00|
|2000-01-04 00:00:00|
|2000-01-05 00:00:00|
+-------------------+
< datetime(2000-01-03)
+-------------------+
| timestamp|
+-------------------+
|2000-01-01 00:00:00|
|2000-01-02 00:00:00|
+-------------------+
<= datetime(2000-01-03)
+-------------------+
| timestamp|
+-------------------+
|2000-01-01 00:00:00|
|2000-01-02 00:00:00|
|2000-01-03 00:00:00|
+-------------------+
between datetime(2000-01-02) and datetime(2000-01-04)
+-------------------+
| timestamp|
+-------------------+
|2000-01-02 00:00:00|
|2000-01-03 00:00:00|
|2000-01-04 00:00:00|
+-------------------+
Next, the result when string is given in datetime format (yyyy-mm-dd hh: mm: ss) is shown. The string seems to be cast implicitly, and the operation can be performed without any problem.
TimeStampType and string(datetime format)Calculation
print('== string(2000-01-03 00:00:00)')
sdf.where(F.col('timestamp') == '2000-01-03 00:00:00').show()
print('> string(2000-01-03 00:00:00)')
sdf.where(F.col('timestamp') > '2000-01-03 00:00:00').show()
print('>= string(2000-01-03 00:00:00)')
sdf.where(F.col('timestamp') >= '2000-01-03 00:00:00').show()
print('< string(2000-01-03 00:00:00)')
sdf.where(F.col('timestamp') < '2000-01-03 00:00:00').show()
print('<= string(2000-01-03 00:00:00)')
sdf.where(F.col('timestamp') <= '2000-01-03 00:00:00').show()
print('between string(2000-01-02 00:00:00) and string(2000-01-04 00:00:00)')
sdf.where(F.col('timestamp').between('2000-01-02 00:00:00', '2000-01-04 00:00:00')).show()
Output result
== string(2000-01-03 00:00:00)
+-------------------+
| timestamp|
+-------------------+
|2000-01-03 00:00:00|
+-------------------+
> string(2000-01-03 00:00:00)
+-------------------+
| timestamp|
+-------------------+
|2000-01-04 00:00:00|
|2000-01-05 00:00:00|
+-------------------+
>= string(2000-01-03 00:00:00)
+-------------------+
| timestamp|
+-------------------+
|2000-01-03 00:00:00|
|2000-01-04 00:00:00|
|2000-01-05 00:00:00|
+-------------------+
< string(2000-01-03 00:00:00)
+-------------------+
| timestamp|
+-------------------+
|2000-01-01 00:00:00|
|2000-01-02 00:00:00|
+-------------------+
<= string(2000-01-03 00:00:00)
+-------------------+
| timestamp|
+-------------------+
|2000-01-01 00:00:00|
|2000-01-02 00:00:00|
|2000-01-03 00:00:00|
+-------------------+
between string(2000-01-02 00:00:00) and string(2000-01-04 00:00:00)
+-------------------+
| timestamp|
+-------------------+
|2000-01-02 00:00:00|
|2000-01-03 00:00:00|
|2000-01-04 00:00:00|
+-------------------+
Finally, the result when string is given in date format (yyyy-mm-dd) is shown. In this case, there are patterns that result in results that are not intuitively intended.
TimeStampType and string(date format)Calculation
print('== string(2000-01-03)')
sdf.where(F.col('timestamp') == '2000-01-03').show()
print('> string(2000-01-03)') #Unintended pattern
sdf.where(F.col('timestamp') > '2000-01-03').show()
print('>= string(2000-01-03)')
sdf.where(F.col('timestamp') >= '2000-01-03').show()
print('< string(2000-01-03)')
sdf.where(F.col('timestamp') < '2000-01-03').show()
print('<= string(2000-01-03)') #Unintended pattern
sdf.where(F.col('timestamp') <= '2000-01-03').show()
print('between string(2000-01-02) and string(2000-01-04)') #Unintended pattern
sdf.where(F.col('timestamp').between('2000-01-02', '2000-01-04')).show()
Output result
== string(2000-01-03)
+-------------------+
| timestamp|
+-------------------+
|2000-01-03 00:00:00|
+-------------------+
> string(2000-01-03)
+-------------------+
| timestamp|
+-------------------+
|2000-01-03 00:00:00|
|2000-01-04 00:00:00|
|2000-01-05 00:00:00|
+-------------------+
>= string(2000-01-03)
+-------------------+
| timestamp|
+-------------------+
|2000-01-03 00:00:00|
|2000-01-04 00:00:00|
|2000-01-05 00:00:00|
+-------------------+
< string(2000-01-03)
+-------------------+
| timestamp|
+-------------------+
|2000-01-01 00:00:00|
|2000-01-02 00:00:00|
+-------------------+
<= string(2000-01-03)
+-------------------+
| timestamp|
+-------------------+
|2000-01-01 00:00:00|
|2000-01-02 00:00:00|
+-------------------+
between string(2000-01-02) and string(2000-01-04)
+-------------------+
| timestamp|
+-------------------+
|2000-01-02 00:00:00|
|2000-01-03 00:00:00|
+-------------------+
As a result of implicitly casting string to TimeStamp at the time of operation, TimeStamp (2000-01-01 00:00:00)
and TimeStamp (string (2000-01-01 00:00:00))
are equivalent. And the result is like TimeStamp (2000-01-01 00:00:00) <TimeStamp (string (2000-01-01))
.
From this, it can be imagined that the values for hours (00: 00: 00
in the above example) are not processed properly. (For strict specifications, you need to check the Scala source for the details of the operation)
By the way, when casting from StringType to TimeStampType, it seems that the cast is correct. An example is not shown here, but as with the above pattern, even if arithmetic processing with TimeStampType is performed, it operates normally (naturally because it is between TimeStampType types).
Cast from StringType to TimeStampType
df = spark.createDataFrame([('2000',), ('2000-01-01',), ('2000-01-01 00:00:00',) ], ['str'])
df = df.withColumn('timestamp', F.col('str').cast('timestamp'))
df.show()
Output result
+-------------------+-------------------+
| str| timestamp|
+-------------------+-------------------+
| 2000|2000-01-01 00:00:00|
| 2000-01-01|2000-01-01 00:00:00|
|2000-01-01 00:00:00|2000-01-01 00:00:00|
+-------------------+-------------------+
Recommended Posts