This tutorial will show you, step by step, how to configure **PySpark** on an **Alibaba Cloud** ECS instance running the CentOS 7.x operating system.
Before you start, here are the resources you need to set up a PySpark node on Alibaba Cloud:
1. One Alibaba Cloud ECS instance
2. One EIP

You may also need additional resources depending on your needs, for example extra storage, SLB instances, or other components. This tutorial uses the minimum required resources: a single ECS instance with one EIP acting as both the master and the slave node. The techniques described here can easily be extended to configurations with multiple master and slave nodes across several ECS instances.
Apart from these resources, you need to install the following:
1. Python
2. Java
3. Spark

If you already have Python installed and have the cloud resources in place, you can move on to Section 3.
Alibaba Cloud Elastic Compute Service (ECS) is an elastic, virtual computing environment provided by Alibaba Cloud. ECS provides basic computing components such as CPU and memory. You can select an ECS instance with the appropriate CPU cores, memory, system disk, optional additional data disks, and network capacity as needed (for example, compute-optimized or I/O-optimized instance families).
Alibaba Cloud Elastic IP Address (EIP) is an instance-independent public IP address provided by Alibaba Cloud. It can be purchased independently and associated with an ECS instance or another appropriate cloud resource (SLB, NAT Gateway).
If your ECS instance does not have a public IP, you can purchase an EIP and bind it to the instance. In this tutorial, the public IP is used to download the related packages from the internet.
Before you can set up Python on Alibaba Cloud, you need to purchase an ECS instance. Depending on your needs, you can choose Pay-As-You-Go (PAYG) mode with no long-term commitment, or choose subscription mode and commit to your usage up front to save money.
Select an ECS instance with the required memory, CPU, and sufficient system storage. We recommend at least 2 vCPUs, 4 GB of memory, and a 30 GB ultra cloud system disk, which you can scale as needed.
By default, you get a private IP, but to connect your ECS instance to the Internet you need an Elastic IP, which is charged by traffic. This is required to download the associated packages to your ECS instance. If you did not allocate public network bandwidth with the ECS instance, you need to purchase an EIP and bind it to your instance. By default, the ECS security group allows Internet traffic. To protect your ECS instance, you can either unbind the EIP after downloading the required packages or use a security group that allows only the relevant traffic.
**What is Python?** Python is a powerful, general-purpose, high-level programming language known for being easy for humans to interpret: the code is easy to read and easy to understand. In addition, its huge support community and vast collection of libraries make it a popular choice among data scientists, big data developers, and machine learning practitioners. From statistics to deep learning, there is a Python library for it.
After logging in successfully, the following screen will be displayed.
yum install gcc openssl-devel bzip2-devel libffi-devel
**Note**: To avoid typing such a long statement at the prompt, you can copy the command above and paste it into your ECS instance using the Alibaba Cloud console. At the top right there is a button called "Enter Copy Commands", which pastes the text copied on your machine into your ECS instance.
Download the Python package with the `wget` command. In this guide, the directory for downloading the Python package is changed to `/usr/src`. Specifically, execute the following commands:
cd /usr/src
wget https://www.python.org/ftp/python/3.7.2/Python-3.7.2.tgz
Then unpack the file by running the `tar xzf Python-3.7.2.tgz` command.
cd Python-3.7.2
./configure --enable-optimizations
After the configuration completes, execute the `make altinstall` command. This command builds and installs Python and its dependencies on your system. If the command runs successfully, output like the screen below is displayed; the final message will be "Successfully installed".
(Optional) Execute the `rm /usr/src/Python-3.7.2.tgz` command to remove the downloaded Python package.
Check the Python version to make sure Python was installed successfully. Execute the `python3.7 -V` command.
In the above example, both Python 2.x and Python 3.x are installed and can be launched with different commands: running `python` starts Python 2.x, and running `python3.7` starts Python 3.x.
Spark is an open source cluster computing framework, in other words a resilient distributed data processing engine. It was introduced as an improvement over Hadoop, adding features such as in-memory processing, stream processing, and low latency. Spark is written in Scala, but it also supports other languages such as Java, Python, and R. Spark's main uses include ETL and SQL over large datasets, streaming analytics, and machine learning on big data. The main components of Spark are:
1. **Spark SQL**: a Spark component for processing data using SQL syntax.
2. **Spark Streaming**: a core library for processing and handling streaming data.
3. **MLlib (Machine Learning Library)**: a library for clustering and predictive analysis of data and for applying basic machine learning and data mining algorithms.
4. **GraphX**: a library for working with networks and graphs.
PySpark is a combination of Apache Spark and Python. The integration of the two platforms allows you to take advantage of the simplicity of the Python language to work with big data while interacting with powerful Spark components (discussed in Section 2).
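As an illustration only (a minimal sketch assuming Spark and the `pyspark` package are already installed, which is covered later in this tutorial), a few lines of Python are enough to drive the Spark SQL component through a DataFrame:

```python
from pyspark.sql import SparkSession

# Start a Spark session in local mode (no cluster required for this sketch).
spark = SparkSession.builder.master("local").appName("sketch").getOrCreate()

# Build a small DataFrame and query it with SQL syntax via Spark SQL.
df = spark.createDataFrame([(1, "spark"), (2, "python")], ["id", "name"])
df.createOrReplaceTempView("tools")
spark.sql("SELECT name FROM tools WHERE id = 2").show()

# Release the resources held by the session.
spark.stop()
```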
Let's run the `java -version` command to see if Java is installed.
If Java is not installed, follow Step 2 below to install Java for Spark; otherwise, go to Step 4.
sudo yum update
sudo yum install java-1.8.0-openjdk-headless
Type `y` and press **Enter** to install.
Execute the `java -version` command to confirm that the installation was successful.
Run the `cd /opt` command to change directory, then run the following command to download the Spark binaries.
wget https://www-eu.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
If the link is broken, check the Apache Spark download site to see whether the link has been updated.
Execute the `tar -xzf spark-2.4.0-bin-hadoop2.7.tgz` command to unpack the binaries.
Enter the `cd spark-2.4.0-bin-hadoop2.7` command.
Let's go through a basic configuration of Spark.
You are now ready to set up your Spark cluster using the shell scripts located in Spark's `sbin` directory, which are based on Hadoop's deployment scripts.
- `sbin/start-master.sh`: Starts a master instance on the machine the script is executed on.
- `sbin/start-slaves.sh`: Starts a slave instance on each machine specified in the `conf/slaves` file.
- `sbin/start-slave.sh`: Starts a slave instance on the machine the script is executed on.
- `sbin/start-all.sh`: Starts both a master and a number of slaves as described above.
- `sbin/stop-master.sh`: Stops the master that was started via the `sbin/start-master.sh` script.
- `sbin/stop-slaves.sh`: Stops all slave instances on the machines specified in the `conf/slaves` file.
- `sbin/stop-all.sh`: Stops both the master and the slaves as described above.
To set the ECS node as the master, run the `sbin/start-master.sh` script from the list above. Then check the master log to find the master URL:
cat /opt/spark-2.4.0-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.master.Master-1-centos.out
The master URL is `spark://centos:7077`.
9. Now set up the slave node (you can run any number of slave nodes and connect them to the master node).
While still in the Spark directory, start a slave process (on the second node, if you are using one) with:
./sbin/start-slave.sh <master-spark-URL>
In my case:
./sbin/start-slave.sh spark://centos:7077
You can now reopen the master log to see if it is connected.
10. The worker has been registered.
Now it's time to update the PATH environment variable.
export SPARK_HOME=/opt/spark-2.4.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
11. Now let's run Spark and make sure it is installed correctly.
bin/pyspark
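Once the interactive PySpark shell starts, a quick sanity check can confirm that Spark is working. This is an illustrative example only; in the `pyspark` shell a `SparkContext` is already available as `sc`:

```python
# Inside the interactive pyspark shell, `sc` is already defined.
rdd = sc.parallelize(range(100))

# A couple of simple actions to confirm that Spark executes jobs.
print(rdd.count())  # 100
print(rdd.sum())    # 4950
```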
You can exit the shell by entering the `exit()` command.
12. This completes both the Python and Spark setup. All you need to use the Python API on Spark is the `pyspark` package, which can be downloaded and installed from the PyPI repository.
Run the `pip install pyspark` command.
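As a quick check (an illustrative sketch; the exact version string depends on the release that pip installed), you can confirm that the package is importable:

```python
# Verify that the pyspark package installed from PyPI can be imported.
import pyspark

print(pyspark.__version__)  # e.g. 2.4.x, depending on the installed release
```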
Here, we will use the pyspark library to create a basic example.py file that utilizes Spark with the Python API.
Enter the `python` command.
To see how the Python API leverages Spark, run the following commands line by line:
from pyspark import SparkContext
outFile = "file:///opt/spark-2.4.0-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.master.Master-1-centos.out"
sc = SparkContext("local", "example app")
outData = sc.textFile(outFile).cache()
numAs = outData.filter(lambda s: 'a' in s).count()
print("Lines with a: %i " % (numAs))