Python has a wide variety of libraries for data science.
"I want to plot!" "I want to run statistical tests!" "I want to manipulate data frames!"
This article introduces which libraries are available for basic needs like these.
** Request: If there are items you would like added, please send an edit request or let me know your recommendations. **
pandas
pandas holds data in a "DataFrame", which resembles a relation from the relational model (familiar from SQL), and provides operations on it such as filtering, mapping, and grouping. It also has a rich set of interfaces for reading and writing data.
The following is a sample that reads a CSV file and keeps only the rows whose 'Earnings' column is greater than 1000.
import pandas as pd
data = pd.read_csv("data.csv")
over_1000 = data[ data['Earnings'] > 1000 ]
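The grouping mentioned above works in a similar style. A minimal sketch, assuming the same data.csv also has a 'Region' column (a made-up column name used only for illustration):
import pandas as pd
data = pd.read_csv("data.csv")
#Group rows by the hypothetical 'Region' column and sum 'Earnings' per group
earnings_by_region = data.groupby("Region")["Earnings"].sum()
print(earnings_by_region)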
numpy
numpy provides a wide range of basic numerical processing, including linear algebra. It also includes random number generation according to various distributions.
The following sample builds a matrix and a vector from lists and takes their product.
import numpy as np
#Matrix generation from list
mat = np.matrix([[1, 2], [3, 4]])
#Vector generation from list
vec = np.array([5, 6])
#Take a matrix product
mat.dot(vec)
For example, a sequence of random numbers that follows a normal distribution can be generated as follows:
import numpy as np
mu, sigma = 2, 0.5
v = np.random.normal(mu,sigma,10000)
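As a quick sanity check, the sample mean and standard deviation of v should land close to mu and sigma:
print(v.mean())  #roughly 2
print(v.std())   #roughly 0.5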
Libraries that can be used to draw graphs
matplotlib
It provides the ability to draw a wide variety of graphs. Since it is a relatively low-level library, it is often used in combination with wrappers such as seaborn.
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(-3, 3, 0.1)
y = np.sin(x)
plt.plot(x, y)
plt.show()
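Several series can go on the same axes; a minimal sketch with labels and a legend (the label strings are arbitrary):
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(-3, 3, 0.1)
#Draw two curves on the same axes and add a legend
plt.plot(x, np.sin(x), label="sin")
plt.plot(x, np.cos(x), label="cos")
plt.legend()
plt.show()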
seaborn
Seaborn is a library that wraps matplotlib and makes it easier to draw clean, good-looking graphs. It provides the ability to draw heatmaps, for example.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#Applying the seaborn theme makes matplotlib graphs look like clean seaborn-style graphs
sns.set_theme()
x = np.random.normal(size=100)
#distplot is deprecated in recent seaborn versions; histplot with a KDE is the modern equivalent
sns.histplot(x, kde=True)
plt.show()
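Since heatmaps were mentioned above, here is a minimal sketch that draws one from the correlation matrix of random data (the data and column names are made up for illustration):
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#Correlation matrix of random data, drawn as an annotated heatmap
df = pd.DataFrame(np.random.normal(size=(100, 4)), columns=list("ABCD"))
sns.heatmap(df.corr(), annot=True)
plt.show()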
scipy
scipy is a library that provides the routines needed for scientific and technical computing. It offers a fairly wide range of features, so much of what you want to do may already be covered here.
A t-test can be performed as follows.
import numpy as np
from scipy import stats
a = np.random.normal(0, 1, size=100)
b = np.random.normal(1, 1, size=10)
stats.ttest_ind(a, b)
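ttest_ind returns the t statistic and the p-value. When the two samples cannot be assumed to have equal variances, Welch's t-test can be requested with equal_var=False; a minimal sketch:
import numpy as np
from scipy import stats
a = np.random.normal(0, 1, size=100)
b = np.random.normal(1, 1, size=10)
#Welch's t-test, which does not assume equal variances
t, p = stats.ttest_ind(a, b, equal_var=False)
print(t, p)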
sympy
A library for symbolic algebra (computer algebra). In other words, it is a library you can hand all kinds of expression transformations to. (As an aside, if anyone knows: is this implemented as a term rewriting system?)
Here, symbolic differentiation is shown as an example application.
import sympy as sym
#Prepare variables
x = sym.symbols("x")
#Define a polynomial
f = x**3 + 2*x**2 - x + 5
#Differentiate
df_dx = sym.diff(f, x)
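The result is itself a sympy expression, so it can be printed, evaluated at a point, or passed to other sympy functions such as solve. A small follow-up sketch:
print(df_dx)                #3*x**2 + 4*x - 1
print(df_dx.subs(x, 1))     #value of the derivative at x = 1, i.e. 6
print(sym.solve(df_dx, x))  #critical points of f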
statsmodels
A convenient library for building statistical models.
The following is an example of fitting an ordinary least squares model with the formula API and inspecting its summary statistics (AIC and so on will appear); generalized linear models are also supported.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
df = pd.read_csv("data.csv")
formula = 'Sales ~ AccessCount + MailSendedCount'
mod = smf.ols(formula=formula, data=df)
res = mod.fit()
res.summary()
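Since generalized linear models were mentioned, here is a minimal sketch of a logistic regression with smf.glm, assuming the data has a 0/1 column named 'Converted' (a made-up column name used only for illustration):
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
df = pd.read_csv("data.csv")
#Logistic regression is a GLM with a binomial family
mod = smf.glm("Converted ~ AccessCount + MailSendedCount", data=df,
              family=sm.families.Binomial())
res = mod.fit()
print(res.summary())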
scikit-learn
(Content to be added incrementally.)