A memo on getting pyspark running in a local environment with conda, so that you can install and run pyspark just like any other popular Python library.
The situation I mainly have in mind:
- I want an environment that works in a few steps, leaving the detailed settings for later.
- Being able to run the sample code from online articles and reference books, and to write and try out functions against small-scale test data, is good enough for now.
- Downloading from the official site or a mirror, setting PATH and PYTHONPATH, installing Java, and so on, is a hassle.
- **I don't want to write to or manage things like .bashrc**
- I want to manage the Spark and Java versions separately for each virtual environment
  - I want to keep the Java used by Spark separate from the Java used elsewhere on the PC.
  - I want to switch between Spark 2.4 and Spark 3.0 (or, install Spark separately for each project).
- But I don't want to use Docker or virtual machines.

That is the kind of situation I am assuming.
Activate the target conda virtual environment and run one of the following.

When using Apache Spark 3.0:

```shell
conda install -c conda-forge pyspark=3.0 openjdk=8
```

When using Apache Spark 2.4:

```shell
# Note: Python 3.8 is not supported, so use an environment with Python 3.7.x
conda install -c conda-forge pyspark=2.4 openjdk=8
```
This installs not only the pyspark library but Apache Spark itself under the virtual environment. (Incidentally, pandas and pyarrow, which handles the data exchange between pandas and Spark, come along as well.)

**At this point you should already be able to use pyspark.**

Also, if you install openjdk with conda as in the example above, `JAVA_HOME` is automatically set to the conda-installed Java whenever you enter the virtual environment with `conda activate`. (With the conda-forge channel, the version is 1.8.0_192 (Azul Systems, Inc.) as of 2020-08-14.)
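A quick way to confirm this from inside the environment (a minimal sketch; the exact path and version string will differ on your machine):

```python
# Check that the conda-provided Java is the one picked up inside the environment.
import os
import subprocess

print(os.environ.get("JAVA_HOME"))    # should point inside the conda virtual environment
subprocess.run(["java", "-version"])  # java -version prints its output to stderr
```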
After `conda activate <virtual environment name>`, launch `pyspark` on the CLI. In the environment with Spark 3.0:
```shell
$ pyspark
Python 3.8.5 (default, Aug 5 2020, 08:36:46)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
20/08/14 22:00:15 WARN Utils: Your hostname, <***> resolves to a loopback address: 127.0.1.1; using 192.168.3.17 instead (on interface wlp3s0)
20/08/14 22:00:15 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/08/14 22:00:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.0
      /_/

Using Python version 3.8.5 (default, Aug 5 2020 08:36:46)
SparkSession available as 'spark'.
>>>
```
And in the environment with Spark 2.4:

```shell
$ pyspark
Python 3.7.7 (default, May 7 2020, 21:25:33)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
20/08/14 22:16:09 WARN Utils: Your hostname, <***> resolves to a loopback address: 127.0.1.1; using 192.168.3.17 instead (on interface wlp3s0)
20/08/14 22:16:09 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/08/14 22:16:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.6
      /_/

Using Python version 3.7.7 (default, May 7 2020 21:25:33)
SparkSession available as 'spark'.
>>>
```
You can confirm that pyspark works in each environment this way. Since all you did was create a virtual environment with conda and run `conda install`, you can install and run pyspark in exactly the same way as any other ordinary Python library.
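For example, a small script like the following should run in either environment (a minimal sketch; any tiny DataFrame will do):

```python
# Minimal smoke test for the conda-installed pyspark, run as a plain Python script.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.show()             # prints the two-row DataFrame
print(spark.version)  # 3.0.0 or 2.4.6, depending on the environment
spark.stop()
```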
Note that Spark 3 also supports Java 11, but when I gave it a quick try I ran into a memory-related error and could not get it to run properly. From what I have seen elsewhere, additional settings seem to be required when using Java 11 (apparently a different issue from the error I hit). So if, as the title says, you just want something that runs **easily for the time being**, I think Java 8 is the safe choice even with Spark 3. (With the Spark 2 series, Java 8 is required anyway.)
Basic functionality works as described above, but on Windows you will by default hit permission-related errors when operating on database tables via spark.sql. As described [here](https://qiita.com/tomotagwork/items/1431f692387242f4a636#apache-spark%E3%81%AE%E3%82%A4%E3%83%B3%E3%82%B9%E3%83%88%E3%83%BC%E3%83%AB) and elsewhere, you additionally need to:

- download winutils.exe for Hadoop 2.7 (e.g. from the repository at https://github.com/cdarlint/winutils),
- add it to PATH, and
- set the environment variable HADOOP_HOME to the download location (see the sketch below).
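As a sketch of what that setup could look like from inside Python (the `C:\hadoop` path is just a hypothetical download location; setting the variables system-wide in the Windows environment-variable settings works as well):

```python
# Hypothetical Windows setup sketch: HADOOP_HOME must point at the folder that
# contains bin\winutils.exe, and that bin folder should also be on PATH.
import os

hadoop_home = r"C:\hadoop"  # assumed location of the downloaded winutils files
os.environ["HADOOP_HOME"] = hadoop_home
os.environ["PATH"] = os.path.join(hadoop_home, "bin") + os.pathsep + os.environ["PATH"]
# Create the SparkSession only after these variables are set.
```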
Up to this point you can run pyspark easily (**with default settings**), but sometimes you need to set and tweak the configuration.
Customizing things in earnest goes beyond the "easy" scope of the title, but I will add just the minimum here. (I will leave out general topics that are not specific to conda, such as setting ordinary environment variables.)
SPARK_HOME

I wrote above that conda sets the environment variable `JAVA_HOME` (needed to run Spark) on its own, but the environment variable `SPARK_HOME`, which is commonly set when using Apache Spark, is in fact not set. (Things work reasonably well even without it, but it occasionally causes trouble.)
You just need to point it at the installation location inside the virtual environment, but that location is a little hard to find. There are probably several ways to look it up; my personal approach is to use `spark-shell`, the Scala Spark shell (it should also be on your PATH by now): run `spark-shell` on the CLI, enter `sc.getConf.get("spark.home")`, and set the string that comes back in the environment variable `SPARK_HOME`.
For example, it looks like this:
```shell
$ spark-shell
20/08/16 12:32:18 WARN Utils: Your hostname, <***> resolves to a loopback address: 127.0.1.1; using 192.168.3.17 instead (on interface wlp3s0)
20/08/16 12:32:18 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/08/16 12:32:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.3.17:4040
Spark context available as 'sc' (master = local[*], app id = local-1597548749526).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_192)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sc.getConf.get("spark.home")
res0: String = <Virtual environment PATH>/lib/python3.8/site-packages/pyspark

# ↑ The absolute path of the Spark installation location appears after `String = `.
# Exit with `Ctrl-C` and set the environment variable as follows:
$ export SPARK_HOME=<Virtual environment PATH>/lib/python3.8/site-packages/pyspark
```
Alternatively, you can launch `spark-shell` in the same way, look up `spark.home` on the Environment tab of the Spark web UI, and set that value in the environment variable `SPARK_HOME`. (I borrowed this method from another article.)
Either way, you end up with something like `SPARK_HOME=/path/to/miniconda3-latest/envs/<virtual environment name>/lib/python3.7/site-packages/pyspark`.
In short, Scala's spark-shell automatically sets an appropriate `spark.home` in its SparkSession, while for some reason pyspark does not, so the trick is simply to look it up via spark-shell.
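Incidentally, since conda installs Spark inside the pyspark package itself, you can also find the same path from Python; a minimal sketch (the exact path will of course differ per environment):

```python
# Locate the Spark distribution bundled with the conda-installed pyspark package.
# The result should match the spark.home value reported by spark-shell above.
import os
import pyspark

spark_home = os.path.dirname(pyspark.__file__)
print(spark_home)  # e.g. <virtual environment path>/lib/python3.8/site-packages/pyspark
```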
A Spark distribution downloaded from the official site and the like comes with a `conf` directory, but the one installed automatically by conda apparently does not.
However, if you create a `conf` directory in the right place and put configuration files in it, they do get read. (I verified this with `spark-defaults.conf`.)
That place is `$SPARK_HOME/conf/`, using the `SPARK_HOME` path you looked up earlier.
So, for example, you can set the config by creating `$SPARK_HOME/conf/spark-defaults.conf` and filling it in.
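To check that a file you put there is actually picked up, a minimal sketch like this works (assuming you added a line such as `spark.sql.shuffle.partitions  8` to the file, which is just an arbitrary example setting):

```python
# Verify that $SPARK_HOME/conf/spark-defaults.conf is being read.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
# If the conf file was picked up, this prints the value set there (8) instead of the default (200).
print(spark.conf.get("spark.sql.shuffle.partitions"))
spark.stop()
```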
(Note that setting the environment variable `SPARK_HOME` is not required for this file to be read; it was read even when `SPARK_HOME` was not set.)
I haven't tried other configuration files (e.g. `conf/spark-env.sh`), but I expect they would work if you create and fill them in the same way. (Apologies if that turns out not to be the case.)
Personally I am not too keen on this, since modifying the contents of individual conda-installed packages makes the setup less portable and messier (the "easy" appeal of the title starts to fade); it is simply an option if you need it.
Still, even then, the advantage of being able to keep the settings independent for each virtual environment remains.
We have seen that pyspark can be installed and managed easily with conda, and that the configuration files can be customized as well if you feel like it.