[Spark Data Frame] Change a column from horizontal to vertical (Scala)

Introduction

In SparkDf, when changing from horizontal holding to vertical holding for a column with DF, it took some time, so I will record it as a memorandum. For the time being, the source code is Scala, but it shouldn't change that much with Python.

Data frame manipulation

Creating a test DataFrame

`testDf.scala`


val testDf = Seq(
    (1,"a",30,20),
    (2,"a",40,30),
    (3,"c",50,40),
    (4,"d",60,50),
    (2,"d",40,70),
    (2,"d",20,10),
    (1,"a",60,90)
   )
   .toDF("Column1", "Column2", "Column3","Column4")

/*
+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
|      1|      a|     30|     20|
|      2|      a|     40|     30|
|      3|      c|     50|     40|
|      4|      d|     60|     50|
|      2|      d|     40|     70|
|      2|      d|     20|     10|
|      1|      a|     60|     90|
+-------+-------+-------+-------+
*/

Vertical holding → Horizontal holding

Vertical holding → horizontal holding is easy, and you can easily do it with pivot. When using it, you need to use three methods like "df.groupby (). Pivot (). Count ()".

`pivot.scala`


val test2Df = testDf
.groupBy("Column1") 
.pivot("Column2") //Column you want to hold horizontally
.count() //Aggregate function
.na.fill(0)　//Zero padding

/*
+-------+---+---+---+
|Column1|  a|  c|  d|
+-------+---+---+---+
|      1|  2|  0|  0|
|      3|  0|  1|  0|
|      4|  0|  0|  1|
|      2|  1|  0|  2|
+-------+---+---+---+
*/

Horizontal holding → Vertical holding

In the main subject, if you have test2Df and want to hold it vertically for some reason. Horizontal holding → Vertical holding does not have a function that can be done in one shot like pivot, so is it a little troublesome? Add. The ones used are "selectExpr ()" and "stack". The source code is as follows.

`unpivot.scala`


val test3Df = test2Df
    .selectExpr(
        "Column1",//Target column that serves as a reference for vertical holding
        "stack(3,'a',a,'c',c,'d',d)".//Original to be held vertically
    )
/*
+-------+----+----+
|Column1|col0|col1|
+-------+----+----+
|      1|   a|   2|
|      1|   c|   0|
|      1|   d|   0|
|      3|   a|   0|
|      3|   c|   1|
|      3|   d|   0|
|      4|   a|   0|
|      4|   c|   0|
|      4|   d|   1|
|      2|   a|   1|
|      2|   c|   0|
|      2|   d|   2|
+-------+----+----+
*/

It can be transformed like this. First, specify the target column that will be held vertically, and then use stack () to specify the source that will be held vertically. In the above case, a, c, and d are vertically held for Column1. 3 in stack () refers to the number of vertical holdings (in this case, 3 of a, c, d), "a" "c" "d" refers to the name, and acd refers to the original numerical value of the column name. (You can also simply put "name1" or a string). Also, the column name is col0 by default, so if you want to change it, you can change it by using **. WithColumnRenamed ("col0", "○○") **.

After that, I think that if you use join well, you can change from vertical holding to horizontal holding.

Reference link

How to Pivot and Unpivot a Spark DataFrame