Python has a wide variety of libraries for data science.
"I want to plot!" "I want to run statistical tests!" "I want to manipulate data frames!"
This article introduces which libraries are available for basic needs like these.
** Request: If there are items you would like added, please send an edit request or let me know your recommendations. **
pandas
pandas holds data in a "DataFrame", which resembles a relation from the relational model (familiar from SQL), and provides operations on it such as filtering, mapping, and grouping. It also has a rich set of interfaces for reading and writing data.
The following is a sample that reads a CSV file and keeps only the rows whose 'Earnings' column is greater than 1000.
import pandas as pd
data = pd.read_csv("data.csv")
over_1000 = data[ data['Earnings'] > 1000 ]
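The grouping mentioned above works in a similar style. A minimal sketch, assuming the same data.csv also has a 'Region' column (a made-up column name used only for illustration):
import pandas as pd
data = pd.read_csv("data.csv")
#Group rows by the hypothetical 'Region' column and sum 'Earnings' per group
earnings_by_region = data.groupby("Region")["Earnings"].sum()
print(earnings_by_region)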
numpy
numpy provides a wide range of basic numerical processing, including linear algebra. It also includes random number generation according to various distributions.
The following sample builds a matrix and a vector from lists and takes their product.
import numpy as np
#Matrix generation from list
mat = np.matrix([[1, 2], [3, 4]])
#Vector generation from list
vec = np.array([5, 6])
#Take a matrix product
mat.dot(vec)
For example, a sequence of random numbers that follows a normal distribution can be generated as follows:
import numpy as np
mu, sigma = 2, 0.5
v = np.random.normal(mu,sigma,10000)
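As a quick sanity check, the sample mean and standard deviation of v should land close to mu and sigma:
print(v.mean())  #roughly 2
print(v.std())   #roughly 0.5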
Libraries that can be used to draw graphs
matplotlib
It provides the ability to draw a wide variety of graphs. Since it is a relatively low-level library, it is often used in combination with wrappers such as seaborn.
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(-3, 3, 0.1)
y = np.sin(x)
plt.plot(x, y)
plt.show()
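Several series can go on the same axes; a minimal sketch with labels and a legend (the label strings are arbitrary):
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(-3, 3, 0.1)
#Draw two curves on the same axes and add a legend
plt.plot(x, np.sin(x), label="sin")
plt.plot(x, np.cos(x), label="cos")
plt.legend()
plt.show()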
seaborn
Seaborn is a library that wraps matplotlib and makes it easier to draw clean, good-looking graphs. It provides the ability to draw heatmaps, for example.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#Applying the seaborn theme makes matplotlib graphs look like clean seaborn-style graphs
sns.set_theme()
x = np.random.normal(size=100)
#distplot is deprecated in recent seaborn versions; histplot with a KDE is the modern equivalent
sns.histplot(x, kde=True)
plt.show()
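Since heatmaps were mentioned above, here is a minimal sketch that draws one from the correlation matrix of random data (the data and column names are made up for illustration):
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#Correlation matrix of random data, drawn as an annotated heatmap
df = pd.DataFrame(np.random.normal(size=(100, 4)), columns=list("ABCD"))
sns.heatmap(df.corr(), annot=True)
plt.show()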
scipy
scipy is a library that provides the routines needed for scientific and technical computing. It offers a fairly wide range of features, so much of what you want to do may already be covered here.
A t-test can be performed as follows.
import numpy as np
from scipy import stats
a = np.random.normal(0, 1, size=100)
b = np.random.normal(1, 1, size=10)
stats.ttest_ind(a, b)
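ttest_ind returns the t statistic and the p-value. When the two samples cannot be assumed to have equal variances, Welch's t-test can be requested with equal_var=False; a minimal sketch:
import numpy as np
from scipy import stats
a = np.random.normal(0, 1, size=100)
b = np.random.normal(1, 1, size=10)
#Welch's t-test, which does not assume equal variances
t, p = stats.ttest_ind(a, b, equal_var=False)
print(t, p)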
sympy
A library for symbolic algebra (computer algebra). In other words, it is a library you can hand all kinds of expression transformations to. (As an aside, if anyone knows: is this implemented as a term rewriting system?)
Here, symbolic differentiation is shown as an example application.
import sympy as sym
#Prepare variables
x = sym.symbols("x")
#Define a polynomial
f = x**3 + 2*x**2 - x + 5
#Differentiate
df_dx = sym.diff(f, x)
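The result is itself a sympy expression, so it can be printed, evaluated at a point, or passed to other sympy functions such as solve. A small follow-up sketch:
print(df_dx)                #3*x**2 + 4*x - 1
print(df_dx.subs(x, 1))     #value of the derivative at x = 1, i.e. 6
print(sym.solve(df_dx, x))  #critical points of f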
statsmodels
A convenient library for building statistical models.
The following is an example of fitting an ordinary least squares model with the formula API and inspecting its summary statistics (AIC and so on will appear); generalized linear models are also supported.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
df = pd.read_csv("data.csv")
formula = 'Sales ~ AccessCount + MailSendedCount'
mod = smf.ols(formula=formula, data=df)
res = mod.fit()
res.summary()
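Since generalized linear models were mentioned, here is a minimal sketch of a logistic regression with smf.glm, assuming the data has a 0/1 column named 'Converted' (a made-up column name used only for illustration):
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
df = pd.read_csv("data.csv")
#Logistic regression is a GLM with a binomial family
mod = smf.glm("Converted ~ AccessCount + MailSendedCount", data=df,
              family=sm.families.Binomial())
res = mod.fit()
print(res.summary())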
scikit-learn
(Content to be added incrementally.)