[JAVA] About class upgrade when using Spark Streaming on YARN

Suppose you write a Spark Streaming application in Java or Scala and run it on a YARN cluster with spark-submit. If the submitted JAR file contains a new version of a class whose FQCN matches a class already in use by a running Spark Streaming application, you can run into trouble.

In Spark Streaming, the operations that make up the DAG are serialized as lambda objects, transferred over the network to the node where the executor process runs, and executed there. When the executor process deserializes a serialized lambda object and tries to restore it in memory, the class version of the object being deserialized can conflict with the version of the class already loaded in the executor process's memory. Because the versions differ, deserialization fails with an InvalidClassException.
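As a rough illustration of that mechanism (the application and names below are hypothetical), the lambda passed to map in this minimal Spark Streaming job is serialized on the driver and deserialized inside each executor, so the class behind it has to match on both sides:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class WordLengthApp {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("WordLengthApp");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // This lambda is serialized on the driver, shipped over the network,
        // and deserialized inside each executor process. If the executor has
        // already loaded a different version of the class the lambda depends on,
        // deserialization fails with an InvalidClassException.
        JavaDStream<Integer> lengths = lines.map(line -> line.length());

        lengths.print();
        jssc.start();
        jssc.awaitTermination();
    }
}
```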

It would be possible to devise a way to avoid the InvalidClassException itself, but we want new applications to use the new version of the class, so we don't want the class already loaded in memory to keep being used.
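One such workaround (an assumption on my part, not necessarily what was tried here) would be to pin serialVersionUID in the serialized function class; a minimal sketch:

```java
import java.io.Serializable;

// Hypothetical function class shipped to executors inside the application JAR.
public class LineLengthFunction implements Serializable {
    // Pinning serialVersionUID makes the JVM treat old and new compiled versions
    // of this class as compatible, so deserialization no longer throws
    // InvalidClassException. The downside: an executor that has already loaded
    // the old version keeps running the old code, which is exactly what we
    // don't want.
    private static final long serialVersionUID = 1L;

    public int apply(String line) {
        return line.length();
    }
}
```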

That's why I keep track of the classes each application uses and, when a conflict would occur, stop the running application in advance and then upgrade it, as sketched below.
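A rough sketch of that bookkeeping (hypothetical code, using only the standard java.util.jar API) is to list the .class entries of each application's JAR and flag the FQCNs that both JARs provide:

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public class ClassConflictChecker {

    // Collect the fully qualified class names contained in a JAR file.
    static Set<String> classesIn(String jarPath) throws IOException {
        Set<String> classes = new HashSet<>();
        try (JarFile jar = new JarFile(jarPath)) {
            jar.stream()
               .map(JarEntry::getName)
               .filter(name -> name.endsWith(".class"))
               .map(name -> name.replace('/', '.').replaceAll("\\.class$", ""))
               .forEach(classes::add);
        }
        return classes;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: JAR of the application already running on the cluster
        // args[1]: JAR about to be submitted
        Set<String> running = classesIn(args[0]);
        Set<String> submitted = classesIn(args[1]);

        // Classes present in both JARs are candidates for version conflicts;
        // if any show up, stop the running application before submitting.
        running.retainAll(submitted);
        running.forEach(System.out::println);
    }
}
```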

But ... this is manageable as long as the conflict is in a class you developed yourself; it becomes a real pain when it happens in a class pulled in as a dependency.

Especially if the YARN cluster is shared by many users ...
