[Spark] Getting tripped up by "", null, and [] in DataFrame

Spark 2 really can be scary. This is a story about DataFrame.

Development environment

The versions are old because I wrote this article a while ago.

An example of getting tripped up when reading CSV

input.csv


x,y,z
1,,2

pyspark


>>> df = spark.read.csv("input.csv", header=True)
>>> df
DataFrame[x: string, y: string, z: string]
>>> df.show()
+---+----+---+
|  x|   y|  z|
+---+----+---+
|  1|null|  2|
+---+----+---+

When you load a file like this, the empty field becomes null rather than "". In other words, once data has been saved as CSV, "" and null can no longer be told apart, so be careful. If the data will be read by another Spark application, however, you can avoid CSV altogether and save it as Parquet or Avro instead.
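The ambiguity can also be seen outside Spark. The following is a minimal sketch in plain Python, using the standard csv module rather than Spark's own CSV writer; the helper name to_csv_line is made up for this illustration. It only shows that a missing value and an empty string end up as the same text once written as an unquoted CSV field.

```python
import csv
import io


def to_csv_line(row):
    """Serialize one row with the standard csv module and return the raw text.

    Hypothetical helper for illustration only; this is not how Spark writes CSV.
    """
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(row)
    return buf.getvalue()


# None (Python's closest analogue of null) and "" produce the same text.
line_with_none = to_csv_line(["1", None, "2"])
line_with_empty = to_csv_line(["1", "", "2"])

print(repr(line_with_none))               # '1,,2\n'
print(repr(line_with_empty))              # '1,,2\n'
print(line_with_none == line_with_empty)  # True
```

Because both rows serialize to the same line, a reader of the file has to pick one interpretation for the empty field, and in the transcript above Spark picked null.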

An example of getting tripped up by string functions

This uses the df from the previous example.

pyspark


>>> df.select(pyspark.sql.functions.length("y")).show()
+---------+
|length(y)|
+---------+
|     null|
+---------+
# Understandable.

>>> df.select(pyspark.sql.functions.split("y", " ")).show()
+-----------+
|split(y,  )|
+-----------+
|       null|
+-----------+
# That makes sense, too.

>>> df.select(pyspark.sql.functions.size(pyspark.sql.functions.split("y", " "))).show()
+-----------------+
|size(split(y,  ))|
+-----------------+
|               -1|
+-----------------+
# -1? Well, okay...

>>> df.fillna("").show()
+---+---+---+
|  x|  y|  z|
+---+---+---+
|  1|   |  2|
+---+---+---+
# null has been replaced with "".

>>> df.fillna("").select(pyspark.sql.functions.length("y")).show()
+---------+
|length(y)|
+---------+
|        0|
+---------+
# It is "", so that's right.

>>> df.fillna("").select(pyspark.sql.functions.split("y", " ")).show()
+-----------+
|split(y,  )|
+-----------+
|         []|
+-----------+
# Yeah, sure.

>>> df.fillna("").select(pyspark.sql.functions.size(pyspark.sql.functions.split("y", " "))).show()
+-----------------+
|size(split(y,  ))|
+-----------------+
|                1|
+-----------------+
# It's not 0??

>>> df2 = spark.createDataFrame([[[]]], "arr: array<string>")
>>> df2
DataFrame[arr: array<string>]
>>> df2.show()
+---+
|arr|
+---+
| []|
+---+

>>> df2.select(pyspark.sql.functions.size("arr")).show()
+---------+
|size(arr)|
+---------+
|        0|
+---------+
# Why was that one 1 while this one is 0...

>>> df.fillna("").select(pyspark.sql.functions.split("y", " ")).collect()
[Row(split(y,  )=[u''])]
# Oh... Sure enough, len("".split(" ")) == 1 in Python, too.
# So I simply fell into the trap... orz

In the output of show(), you cannot distinguish the empty array [] from the array [""] that contains a single empty string... Surprisingly, this behavior is not properly described in the documentation. It really caught me off guard.
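The same trap can be reproduced in plain Python, without Spark. The sketch below is only an illustration: render_like_show is a hypothetical helper that prints elements without quoting them, which is an assumption about why the two arrays look alike, not Spark's actual rendering code.

```python
def render_like_show(values):
    """Hypothetical renderer that joins elements without quoting them.

    It mimics how the two arrays looked in the show() output above;
    it is not Spark's real implementation.
    """
    return "[" + ", ".join(values) + "]"


empty_array = []                # a truly empty array
split_of_empty = "".split(" ")  # [''], one element that is an empty string

print(len(empty_array))                  # 0
print(len(split_of_empty))               # 1
print(render_like_show(empty_array))     # []
print(render_like_show(split_of_empty))  # []
print(repr(split_of_empty))              # ['']
```

Only the quoted representation reveals the hidden empty string, which is why collect() in the transcript above exposed the difference while show() did not.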
