This tutorial will show you, step by step, how to configure **PySpark** on an **Alibaba Cloud** ECS instance running the CentOS 7.x operating system.
Before you start, here are the resources you need to set up a PySpark node on Alibaba Cloud:
1. One Alibaba Cloud ECS instance
2. One EIP

You may also need additional resources depending on your needs, for example extra storage, SLB instances, or other components. This tutorial uses the minimum required resources: a single ECS instance with one EIP acting as both the master and the slave node. The techniques described here can easily be extended to configurations with multiple master and slave nodes across several ECS instances.
Apart from these resources, you need to install the following:
1. Python
2. Java
3. Spark

If you already have Python installed and have the cloud resources in place, you can move on to Section 3.
Alibaba Cloud Elastic Compute Service (ECS) is an elastic, virtual computing environment provided by Alibaba Cloud. ECS provides basic computing components such as CPU and memory. You can select an ECS instance with the appropriate CPU cores, memory, system disk, optional additional data disks, and network capacity as needed (for example, compute-optimized or I/O-optimized instance families).
Alibaba Cloud Elastic IP Address (EIP) is an instance-independent public IP address provided by Alibaba Cloud. It can be purchased independently and associated with an ECS instance or another appropriate cloud resource (SLB, NAT Gateway).
If your ECS instance does not have a public IP, you can purchase an EIP and bind it to the instance. In this tutorial, the public IP is used to download the related packages from the internet.
Before you can set up Python on Alibaba Cloud, you need to purchase an ECS instance. Depending on your needs, you can choose Pay-As-You-Go (PAYG) mode with no long-term commitment, or choose subscription mode and commit to your usage up front to save money.
Select an ECS instance with the required memory, CPU, and sufficient system storage. We recommend at least 2 vCPUs, 4 GB of memory, and a 30 GB ultra cloud system disk, which you can scale as needed.
By default, you get a private IP, but to connect your ECS instance to the Internet you need an Elastic IP, which is charged by traffic. This is required to download the associated packages to your ECS instance. If you did not allocate public network bandwidth with the ECS instance, you need to purchase an EIP and bind it to your instance. By default, the ECS security group allows Internet traffic. To protect your ECS instance, you can either unbind the EIP after downloading the required packages or use a security group that allows only the relevant traffic.
**What is Python?** Python is a powerful, general-purpose, high-level programming language known for being easy for humans to interpret: the code is easy to read and easy to understand. In addition, its huge support community and vast collection of libraries make it a popular choice among data scientists, big data developers, and machine learning practitioners. From statistics to deep learning, there is a Python library for it.
After logging in successfully, the following screen will be displayed.
yum install gcc openssl-devel bzip2-devel libffi-devel
**Note**: To avoid typing such a long statement at the prompt, you can copy the command above and paste it into your ECS instance using the Alibaba Cloud console. At the top right there is a button called "Enter Copy Commands", which pastes the text copied on your machine into your ECS instance.
Download the Python package with the `wget` command. In this guide, the directory for downloading the Python package is changed to `/usr/src`. Specifically, execute the following commands:
cd /usr/src
wget https://www.python.org/ftp/python/3.7.2/Python-3.7.2.tgz
Then unpack the file by running the `tar xzf Python-3.7.2.tgz` command.
cd Python-3.7.2
./configure --enable-optimizations
After the configuration completes, execute the `make altinstall` command. This command builds and installs Python and its dependencies on your system. If the command runs successfully, output like the screen below is displayed; the final message will be "Successfully installed".
(Optional) Execute the `rm /usr/src/Python-3.7.2.tgz` command to remove the downloaded Python package.
Check the Python version to make sure Python was installed successfully. Execute the `python3.7 -V` command.
In the above example, both Python 2.x and Python 3.x are installed and can be launched with different commands: running `python` starts Python 2.x, and running `python3.7` starts Python 3.x.
Spark is an open source cluster computing framework, in other words a resilient distributed data processing engine. It was introduced as an improvement over Hadoop, adding features such as in-memory processing, stream processing, and low latency. Spark is written in Scala, but it also supports other languages such as Java, Python, and R. Spark's main uses include ETL and SQL over large datasets, streaming analytics, and machine learning on big data. The main components of Spark are:
1. **Spark SQL**: a Spark component for processing data using SQL syntax.
2. **Spark Streaming**: a core library for processing and handling streaming data.
3. **MLlib (Machine Learning Library)**: a library for clustering and predictive analysis of data and for applying basic machine learning and data mining algorithms.
4. **GraphX**: a library for working with networks and graphs.
PySpark is a combination of Apache Spark and Python. The integration of the two platforms allows you to take advantage of the simplicity of the Python language to work with big data while interacting with powerful Spark components (discussed in Section 2).
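As an illustration only (a minimal sketch assuming Spark and the `pyspark` package are already installed, which is covered later in this tutorial), a few lines of Python are enough to drive the Spark SQL component through a DataFrame:

```python
from pyspark.sql import SparkSession

# Start a Spark session in local mode (no cluster required for this sketch).
spark = SparkSession.builder.master("local").appName("sketch").getOrCreate()

# Build a small DataFrame and query it with SQL syntax via Spark SQL.
df = spark.createDataFrame([(1, "spark"), (2, "python")], ["id", "name"])
df.createOrReplaceTempView("tools")
spark.sql("SELECT name FROM tools WHERE id = 2").show()

# Release the resources held by the session.
spark.stop()
```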
Let's run the `java -version` command to see if Java is installed.
If Java is not installed, follow Step 2 below to install Java for Spark; otherwise, go to Step 4.
sudo yum update
sudo yum install java-1.8.0-openjdk-headless
Type `y` and press **Enter** to install.
Execute the `java -version` command to confirm that the installation was successful.
Run the `cd /opt` command to change directory, then run the following command to download the Spark binaries.
wget https://www-eu.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
If the link is broken, check the Apache Spark download site to see whether the link has been updated.
Execute the `tar -xzf spark-2.4.0-bin-hadoop2.7.tgz` command to unpack the binaries.
Enter the `cd spark-2.4.0-bin-hadoop2.7` command.
Let's go through a basic configuration of Spark.
You are now ready to set up your Spark cluster using the shell scripts located in Spark's `sbin` directory, which are based on Hadoop's deployment scripts.
- `sbin/start-master.sh`: Starts a master instance on the machine the script is executed on.
- `sbin/start-slaves.sh`: Starts a slave instance on each machine specified in the `conf/slaves` file.
- `sbin/start-slave.sh`: Starts a slave instance on the machine the script is executed on.
- `sbin/start-all.sh`: Starts both a master and a number of slaves as described above.
- `sbin/stop-master.sh`: Stops the master that was started via the `sbin/start-master.sh` script.
- `sbin/stop-slaves.sh`: Stops all slave instances on the machines specified in the `conf/slaves` file.
- `sbin/stop-all.sh`: Stops both the master and the slaves as described above.
To set the ECS node as the master, run the `sbin/start-master.sh` script from the list above. Then check the master log to find the master URL:
cat /opt/spark-2.4.0-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.master.Master-1-centos.out
The master URL is `spark://centos:7077`.
9. Now set up the slave node (you can run any number of slave nodes and connect them to the master node).
While still in the Spark directory, start a slave process (on the second node, if you are using one) with:
./sbin/start-slave.sh <master-spark-URL>
In my case:
./sbin/start-slave.sh spark://centos:7077
You can now reopen the master log to see if it is connected.
10. The worker has been registered.
Now it's time to update the PATH environment variable.
export SPARK_HOME=/opt/spark-2.4.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
11. Now let's run Spark and make sure it is installed correctly.
bin/pyspark
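Once the interactive PySpark shell starts, a quick sanity check can confirm that Spark is working. This is an illustrative example only; in the `pyspark` shell a `SparkContext` is already available as `sc`:

```python
# Inside the interactive pyspark shell, `sc` is already defined.
rdd = sc.parallelize(range(100))

# A couple of simple actions to confirm that Spark executes jobs.
print(rdd.count())  # 100
print(rdd.sum())    # 4950
```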
You can exit the shell by entering the `exit()` command.
12. This completes both the Python and Spark setup. All you need to use the Python API on Spark is the `pyspark` package, which can be downloaded and installed from the PyPI repository.
Run the `pip install pyspark` command.
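As a quick check (an illustrative sketch; the exact version string depends on the release that pip installed), you can confirm that the package is importable:

```python
# Verify that the pyspark package installed from PyPI can be imported.
import pyspark

print(pyspark.__version__)  # e.g. 2.4.x, depending on the installed release
```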
Here, we will use the pyspark library to create a basic example.py file that utilizes Spark with the Python API.
Enter the `python` command.
To see how the Python API leverages Spark, run the following commands line by line:
from pyspark import SparkContext
outFile = "file:///opt/spark-2.4.0-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.master.Master-1-centos.out"
sc = SparkContext("local", "example app")
outData = sc.textFile(outFile).cache()
numAs = outData.filter(lambda s: 'a' in s).count()
print("Lines with a: %i " % (numAs))