This post introduces Python libraries that are useful for data analysis, data processing, machine learning, and more.
R is another option for statistics and machine learning. It is a language that excels at processing, aggregating, and statistically analyzing data, and a lot can be done with the language's standard functions alone. With its rich set of machine learning libraries, it is without doubt a powerful choice. Python's advantage over R is the richness of its surrounding ecosystem, which extends beyond the field of data science. Data processed with NumPy and Pandas can also be used in full-scale web applications built with Django.
Most of the libraries listed here can be installed together through Anaconda.
NumPy
NumPy is a library for efficient numerical computation. The example here uses a one-dimensional array, but multidimensional arrays are also supported. Vector and matrix calculations can be performed at high speed.
In [1]: import numpy as np #Import NumPy
In [2]: arr = np.asarray([n for n in range(10)]) #Vector creation
In [3]: arr #output
Out[3]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [4]: arr * 10 #Data processing
Out[4]: array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])
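The one-dimensional example above extends naturally to multidimensional arrays. The following is a minimal sketch of matrix-style operations; the variable names (`mat`, `scaled`, `prod`, `col_sums`) are illustrative and do not come from the original post.

```python
import numpy as np

# A 2x3 matrix holding the values 0..5
mat = np.arange(6).reshape(2, 3)

# Element-wise arithmetic works the same way as for a vector
scaled = mat * 10

# Matrix product: (2x3) @ (3x2) gives a 2x2 result
prod = mat @ mat.T

# Reduction along an axis: sum of each column
col_sums = mat.sum(axis=0)

print(scaled)
print(prod)
print(col_sums)
```

The `@` operator requires Python 3.5 or later; `np.dot(mat, mat.T)` is the equivalent call on older versions.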
Pandas
Pandas is a library built on NumPy that provides functions indispensable for preprocessing in machine learning, such as reading data and handling missing values. Its DataFrame object makes it easy to process and merge data. It is close to R's data.frame.
In [1]: import pandas as pd #Import Pandas
In [2]: df = pd.DataFrame({ #Creating a data frame
...: 'A': [n for n in range(5)],
...: 'B': ['male', 'male', 'female', 'female', 'male'],
...: 'C': [0.3, 0.4, 1.2, 100.5, -20.0]
...: })
In [3]: df
Out[3]:
   A       B      C
0  0    male    0.3
1  1    male    0.4
2  2  female    1.2
3  3  female  100.5
4  4    male  -20.0
In [4]: df.describe() #Output of basic statistics
Out[4]:
              A           C
count  5.000000    5.000000
mean   2.000000   16.480000
std    1.581139   47.812101
min    0.000000  -20.000000
25%    1.000000    0.300000
50%    2.000000    0.400000
75%    3.000000    1.200000
max    4.000000  100.500000
In [5]: df[df['B'] == 'female'] #Subset call
Out[5]:
   A       B      C
2  2  female    1.2
3  3  female  100.5
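Handling missing values and merging are mentioned above but not shown in the session. The following is a rough sketch of both, using made-up data; the `labels` table and its `code` column are hypothetical, and filling with the column mean is only one of several possible strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one value in column 'C' is missing
df = pd.DataFrame({
    'A': [0, 1, 2, 3],
    'B': ['male', 'male', 'female', 'female'],
    'C': [0.3, np.nan, 1.2, 100.5],
})

# Count the missing values in each column
n_missing = df.isna().sum()

# Fill the missing value in 'C' with the mean of the remaining values
filled = df.fillna({'C': df['C'].mean()})

# A second, hypothetical table that is merged on column 'B'
labels = pd.DataFrame({'B': ['male', 'female'], 'code': [0, 1]})
merged = filled.merge(labels, on='B', how='left')

print(n_missing)
print(merged)
```

A left merge keeps every row of `filled` and attaches the matching `code`, which is often what is wanted when adding lookup columns during preprocessing.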
Python Data Analysis Library — pandas: Python Data Analysis Library
jupyter
Jupyter Notebook is an execution environment for Python that records code together with its output, which makes it well suited to exploratory data processing and statistical analysis. Notebooks can also be exported as reports or slides.
matplotlib
matplotlib is a graph drawing library. It supports various graphs such as bar graphs, scatter plots, and histograms.
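As a minimal sketch of the three graph types named above, the following draws a bar graph, a scatter plot, and a histogram side by side. The data is randomly generated for illustration only, and the non-interactive `Agg` backend is an assumption so that the script runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumed so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Made-up sample data
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].bar(['a', 'b', 'c'], [3, 7, 5])
axes[0].set_title('Bar graph')

axes[1].scatter(x, y, s=8)
axes[1].set_title('Scatter plot')

axes[2].hist(x, bins=20)
axes[2].set_title('Histogram')

fig.tight_layout()

# Render to an in-memory PNG; pass a file name instead to write to disk
buf = io.BytesIO()
fig.savefig(buf, format='png')
```

In a Jupyter Notebook the figure is displayed inline, so the backend selection and the `savefig` call are usually unnecessary there.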
Matplotlib: Python plotting — Matplotlib 2.0.2 documentation
plotly
plotly can draw richer and more interactive graphs than matplotlib. Graphs you create can also be shared on plot.ly.
Python Graphing Library, Plotly
Kafka-Python
Kafka-Python, as the name implies, is a Python client for Apache Kafka.
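from kafka import KafkaConsumer
import json
# Subscribe to 'topic' on a local broker
consumer = KafkaConsumer('topic', bootstrap_servers='localhost:9092')
# Iterating over the consumer blocks and yields messages as they arrive
for msg in consumer:
    data = json.loads(msg.value.decode())  # msg.value is bytes: decode, then parse JSON
    print(data)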
PySpark
Spark and Kafka have become indispensable for big data. PySpark is the Python API for Spark, which includes a machine learning library called MLlib.
Python Programming Guide - Spark 0.9.0 Documentation
scikit-learn
scikit-learn is a machine learning library. It offers not only the currently popular neural networks but many other algorithms as well. It also provides the utilities that machine learning needs in practice, such as splitting data into training and validation sets, cross-validation, and grid search, so it takes care of the details that are otherwise tedious to handle. If you want to try a machine learning library, start here.
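The three utilities mentioned above (data splitting, cross-validation, and grid search) can be sketched as follows. The choice of the bundled iris dataset, the `SVC` classifier, and the parameter grid are assumptions made for illustration; any other estimator could be substituted.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.svm import SVC

# Small dataset bundled with scikit-learn (150 samples, 4 features)
X, y = load_iris(return_X_y=True)

# Split into training data and held-out validation data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# 5-fold cross-validation on the training data with default parameters
cv_scores = cross_val_score(SVC(), X_train, y_train, cv=5)

# Grid search over a small, illustrative parameter grid
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print(cv_scores.mean())
print(search.best_params_)
print(search.score(X_test, y_test))
```

Keeping the held-out data out of both the cross-validation and the grid search means the final score is measured on samples the tuned model has never seen.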
scikit-learn: machine learning in Python — scikit-learn 0.18.2 documentation
TensorFlow
TensorFlow is the deep learning library that needs no introduction.
Keras
Keras is a wrapper for TensorFlow, CNTK, Theano and more.
A book by the author of Pandas. Learn how to use Pandas and data analysis techniques. It also covers peripheral libraries such as NumPy and matplotlib.
A book by the author of scikit-learn. You can learn how to use scikit-learn and the engineering required for machine learning.
If tweaking data in Pandas or tuning machine learning libraries is not enough for you, you will need to step outside the Python ecosystem. The world of data is deep and vast, and engineers need to cover a wider area to keep up with data scientists. Specifically, getting a handle on distributed processing infrastructure such as Hadoop, Spark, and Apex, and on fully managed data warehouses such as BigQuery and Treasure Data, will expand your field of activity.
- Count the frequency of occurrence of words in a sentence by stream processing (Apache Apex)
- Set up a fluentd container with Docker and save Rails logs in Treasure Data by IDCF