List of Python libraries for data scientists and data engineers

This article introduces Python libraries that are useful for data analysis, data processing, machine learning, and more.

Why Python

R is another option for statistics and machine learning. It is a language that excels at processing, aggregating, and statistically analyzing data, and it can do a great deal with its standard functions alone. With its rich set of machine learning libraries, it is without doubt a powerful choice. Python's advantage over R is the richness of its surrounding ecosystem, which extends well beyond data science. For example, data processed with NumPy and Pandas can be used directly in a full-scale web application built with Django.

Installation of libraries

Most of the libraries listed here can be installed in bulk with Anaconda.

Data processing

NumPy

NumPy is a library for efficient numerical computation. The example here uses a one-dimensional array, but multidimensional arrays are also supported. Vector and matrix calculations run at high speed.

In [1]: import numpy as np  # Import NumPy

In [2]: arr = np.arange(10)  # Create a vector

In [3]: arr  # Display it
Out[3]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [4]: arr * 10  # Multiply every element by 10
Out[4]: array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])
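
The same operations extend to multidimensional arrays. The following is a minimal sketch of elementwise arithmetic, a matrix-vector product, and a column-wise aggregation on a two-dimensional array. The values and variable names are illustrative and do not come from the session above.

```python
import numpy as np

# A 2x3 matrix built from a 1-D range (values chosen only for illustration)
mat = np.arange(6).reshape(2, 3)    # [[0, 1, 2], [3, 4, 5]]
vec = np.array([1, 0, -1])

scaled = mat * 10            # elementwise multiplication, as in the 1-D case
product = mat @ vec          # matrix-vector product
col_sums = mat.sum(axis=0)   # sum down each column

print(scaled)
print(product)
print(col_sums)
```

The `axis` argument selects the dimension to aggregate over, which is the main thing that changes when moving from vectors to matrices.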

NumPy — NumPy

Pandas

Pandas is a library built on top of NumPy. It provides functions that are indispensable for machine learning preprocessing, such as reading data and handling missing values. Its DataFrame object, which is similar to R's data.frame, makes it easy to process and merge data.

In [1]: import pandas as pd  # Import Pandas

In [2]: df = pd.DataFrame({  # Create a data frame
   ...:     'A': list(range(5)),
   ...:     'B': ['male', 'male', 'female', 'female', 'male'],
   ...:     'C': [0.3, 0.4, 1.2, 100.5, -20.0]
   ...: })

In [3]: df
Out[3]: 
   A       B      C
0  0    male    0.3
1  1    male    0.4
2  2  female    1.2
3  3  female  100.5
4  4    male  -20.0

In [4]: df.describe()  # Summary statistics for the numeric columns
Out[4]: 
              A           C
count  5.000000    5.000000
mean   2.000000   16.480000
std    1.581139   47.812101
min    0.000000  -20.000000
25%    1.000000    0.300000
50%    2.000000    0.400000
75%    3.000000    1.200000
max    4.000000  100.500000

In [5]: df[df['B'] == 'female']  # Select the rows where B is 'female'
Out[5]: 
   A       B      C
2  2  female    1.2
3  3  female  100.5
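
Handling missing values and merging, mentioned above, can be sketched as follows. The two small tables, their column names, and the choice of mean imputation are assumptions made for illustration, not a recommendation from the original text.

```python
import numpy as np
import pandas as pd

# Illustrative data; the column names and values are assumptions
people = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'score': [0.5, np.nan, 1.5, np.nan],
})
groups = pd.DataFrame({
    'id': [1, 2, 3],
    'group': ['a', 'b', 'a'],
})

n_missing = int(people['score'].isna().sum())               # count missing values
filled = people.fillna({'score': people['score'].mean()})   # impute with the column mean
dropped = people.dropna(subset=['score'])                   # or drop incomplete rows

# Inner join on the shared key; id 4 has no match in `groups` and is excluded
merged = filled.merge(groups, on='id', how='inner')
print(merged)
```

Whether to fill or drop missing rows depends on the data and the model, so both options are shown side by side.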

Python Data Analysis Library — pandas: Python Data Analysis Library

Reporting and visualization

Jupyter

Jupyter Notebook is a Python execution environment that records code together with its output, which makes it well suited to exploratory data processing and statistical analysis. Notebooks can also be exported as reports or slides.

Project Jupyter | Home

matplotlib

matplotlib is a graph drawing library. It supports various graphs such as bar graphs, scatter plots, and histograms.
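
As a minimal sketch of the three chart types just named, the script below draws a bar graph, a scatter plot, and a histogram side by side. The data is randomly generated, and the output file name `charts.png` is a hypothetical choice. The `Agg` backend is selected only so the script runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the figure is reproducible
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].bar(['a', 'b', 'c'], [3, 7, 5])   # bar graph
axes[0].set_title('bar')
axes[1].scatter(x, y, s=8)                # scatter plot
axes[1].set_title('scatter')
axes[2].hist(x, bins=20)                  # histogram
axes[2].set_title('histogram')
fig.tight_layout()
fig.savefig('charts.png')                 # hypothetical output file name
```

Inside a Jupyter Notebook the backend line and `savefig` call are usually unnecessary, because figures are rendered inline.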

Matplotlib: Python plotting — Matplotlib 2.0.2 documentation

plotly

plotly can draw richer, more interactive graphs than matplotlib. Graphs you create can also be shared on plot.ly.

Python Graphing Library, Plotly

Messaging, stream processing

kafka-python

kafka-python, as the name implies, is a Python client for Apache Kafka.

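import json

from kafka import KafkaConsumer

# Subscribe to a topic on a local broker
consumer = KafkaConsumer('topic', bootstrap_servers='localhost:9092')

# Iterating over the consumer blocks and yields messages as they arrive
for msg in consumer:
    data = json.loads(msg.value.decode('utf-8'))  # message values are raw bytes
    print(data)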
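
Decoding can also be delegated to the consumer itself. The sketch below relies on the `value_deserializer` argument of `KafkaConsumer`. The topic name and broker address are the same placeholders as above, and `decode_json` and `make_consumer` are hypothetical helper names introduced only for this example.

```python
import json


def decode_json(raw: bytes):
    """Turn a UTF-8 encoded JSON message body into a Python object."""
    return json.loads(raw.decode('utf-8'))


def make_consumer(topic='topic', servers='localhost:9092'):
    # Imported here so decode_json can be used and tested without a broker
    from kafka import KafkaConsumer
    return KafkaConsumer(
        topic,
        bootstrap_servers=servers,
        value_deserializer=decode_json,  # applied to every message value
    )


# Usage (requires a running broker):
#   for msg in make_consumer():
#       print(msg.value)   # already a decoded Python object
```

Keeping the decoding logic in its own function makes it possible to unit-test it with plain bytes, independently of Kafka.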

PySpark

PySpark is the Python API for Apache Spark. Spark, like Kafka, has become indispensable for big data processing, and it includes a machine learning library called MLlib.

Python Programming Guide - Spark 0.9.0 Documentation

Machine learning

scikit-learn

scikit-learn is a machine learning library. It offers not only the currently popular neural networks but many other algorithms as well. It also provides the utilities that machine learning needs, such as splitting data into training and validation sets, cross-validation, and grid search, so it takes care of details that are otherwise tedious to handle. If you want to try a machine learning library, start here.
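
The three utilities named above can be sketched together in one short script. The bundled iris dataset, the SVM classifier, and the parameter grid are assumptions chosen only because they keep the example small. Any other estimator could be substituted.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hold out 25% of the data for final validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# 5-fold cross-validation of a single model on the training split
cv_scores = cross_val_score(SVC(), X_train, y_train, cv=5)

# Grid search: try every combination in the (illustrative) parameter grid
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print(cv_scores.mean())
print(search.best_params_)
print(search.score(X_test, y_test))   # accuracy on the held-out data
```

The held-out split is used only once, at the end, so that the score it produces is not influenced by the parameter search.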

scikit-learn: machine learning in Python — scikit-learn 0.18.2 documentation

TensorFlow

TensorFlow is the deep learning library that needs little introduction.

TensorFlow

Keras

Keras is a high-level wrapper around deep learning backends such as TensorFlow, CNTK, and Theano.

Keras Documentation

Recommended books

O'Reilly Japan - Introduction to Data Analysis with Python

A book by the author of Pandas. Learn how to use Pandas and data analysis techniques. It also covers peripheral libraries such as NumPy and matplotlib.

O'Reilly Japan - Machine learning starting with Python

A book by the author of scikit-learn. You can learn how to use scikit-learn and the engineering required for machine learning.

Stepping outside Python

If tweaking data in Pandas and tuning machine learning libraries is not enough for you, you will need to step outside the Python ecosystem. The world of data is deep and vast, and engineers need to cover a wider area to support data scientists. Specifically, getting a solid grasp of distributed processing infrastructure such as Hadoop, Spark, and Apex, and of fully managed data warehouses such as BigQuery and Treasure Data, will broaden the range of work you can take on.

- Count the frequency of occurrence of words in a sentence by stream processing (Apache Apex) - Bad sentence pattern
- Set up a fluentd container with Docker and save Rails log in Treasure Data by IDCF - Bad sentence pattern