This article is a hand-typed copy of, and commentary on, the official pandas tutorial "10 minutes to pandas".
It is based on the following page: https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html
import numpy as np
import pandas as pd
np
pd
If each module is displayed, the imports worked.
If you get **ModuleNotFoundError: No module named 'pandas'**, install pandas first.
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-1-59ab05e21164> in <module>
1 import numpy as np
----> 2 import pandas as pd
ModuleNotFoundError: No module named 'pandas'
command
python -m pip install pandas
You can easily create one-dimensional data by passing a list to the Series class.
# A simple one-dimensional series
s = pd.Series(data=[1, 3, 5, np.nan, 6, 8])
s
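As a small check of what the Series above contains, here is a minimal sketch. The printed properties are ones I chose to look at; they are not part of the original tutorial.

```python
import numpy as np
import pandas as pd

# Same data as above; np.nan marks a missing value.
s = pd.Series(data=[1, 3, 5, np.nan, 6, 8])

# Because of the NaN, the integers are stored as floats.
print(s.dtype)        # float64
# When no index is given, a default integer index 0..5 is created.
print(list(s.index))  # [0, 1, 2, 3, 4, 5]
# count() ignores missing values.
print(s.count())      # 5
```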
You can use date_range() to create a sequence of dates covering a specific period.
# Six days of dates starting from January 1, 2020
dates = pd.date_range("20200101", periods=6)
dates
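To see what date_range() actually produced, here is a minimal sketch. The second call with explicit start, end and freq is my own addition for comparison, assuming a daily frequency.

```python
import pandas as pd

dates = pd.date_range("20200101", periods=6)
print(dates[0])    # 2020-01-01 00:00:00
print(dates[-1])   # 2020-01-06 00:00:00
print(len(dates))  # 6

# The same range written with an explicit end date instead of periods.
same = pd.date_range(start="2020-01-01", end="2020-01-06", freq="D")
print(dates.equals(same))  # True
```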
By specifying the **index** argument of the pandas [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas-dataframe) class, you can set the row index.
# Use the dates starting from January 1, 2020 as the row index
# Fill each cell with a random number
df = pd.DataFrame(np.random.randn(6, 4), index=dates)
df
You can also set the column names by specifying the **columns** argument of the DataFrame class.
# Name the columns A, B, C and D
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df
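Because randn() returns different numbers on every run, the output above changes each time. Here is a minimal sketch of a reproducible version, assuming NumPy's default_rng with a fixed seed; the seed value 0 is an arbitrary choice of mine.

```python
import numpy as np
import pandas as pd

# Fixed seed (arbitrary) so that the same random numbers come out every run.
rng = np.random.default_rng(seed=0)
dates = pd.date_range("20200101", periods=6)
df = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

print(df.shape)          # (6, 4)
print(list(df.columns))  # ['A', 'B', 'C', 'D']
```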
When you pass a dictionary to the DataFrame class, the dictionary keys become the column names.
df2 = pd.DataFrame(
{
"A": 1.,
"B": pd.Timestamp("20200101"),
"C": pd.Series(1, index=list(range(4)), dtype="float32"),
"D": np.array([3] * 4, dtype="int32"),
"E": pd.Categorical(["test", "train", "test", "train"]),
"F": "foo",
}
)
df2
You can see the data type of each column by referring to the **dtypes** attribute.
df2.dtypes
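Here is a minimal sketch that checks a few of those dtypes in code. I only look at columns A, C, D and E, because the exact dtype shown for the date and string columns may differ between pandas versions.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20200101"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)

dtypes = df2.dtypes
print(dtypes["A"])  # float64
print(dtypes["C"])  # float32
print(dtypes["D"])  # int32
print(dtypes["E"])  # category
# Scalars such as 1.0 and "foo" are repeated for all four rows.
print(len(df2))     # 4
```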
If you are using Jupyter Notebook or JupyterLab, the column names are offered by tab completion.
df2.<TAB>
By using the [head() method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html#pandas.DataFrame.head) of the DataFrame class, you can display the beginning of the data.
df.head(2)
Similarly, by using tail() of the DataFrame class, you can display the end of the data.
df.tail(2)
By referring to the **index** attribute of the DataFrame class, you can display the row index of the data.
df.index
df2.index
By using to_numpy() of the DataFrame class, you can convert the data into a NumPy array that is easy to work with in NumPy.
df.to_numpy()
df2.to_numpy()
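The two calls above behave differently depending on the column types. Here is a minimal sketch with two small frames I made up for illustration: one all-float, one mixing floats and strings.

```python
import pandas as pd

all_float = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})
mixed = pd.DataFrame({"x": [1.0, 2.0], "label": ["a", "b"]})

# All columns share one dtype, so the array keeps it.
arr = all_float.to_numpy()
print(arr.dtype, arr.shape)  # float64 (2, 2)

# With mixed column types, the array falls back to the generic object dtype.
obj = mixed.to_numpy()
print(obj.dtype)             # object
```

Note that the index and column labels are not part of the resulting array.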
By using describe() of the DataFrame class, you can get quick summary statistics for each column of the data.
df2.describe()
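Here is a minimal sketch of what describe() returns, using a tiny numeric frame of my own so that the numbers are easy to verify by hand. By default only numeric columns are summarized.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0], "B": [10.0, 20.0, 30.0, 40.0]})
stats = df.describe()

# One row per statistic, one column per numeric column.
print(list(stats.index))       # ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(stats.loc["mean", "A"])  # 2.5
print(stats.loc["max", "B"])   # 40.0
```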
By referring to the T attribute of the DataFrame class, you can access the data with rows and columns swapped.
df.T
You can also get the transposed data with transpose() of the DataFrame class.
df.transpose()
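Here is a minimal sketch confirming that T and transpose() give the same result. The frame uses consecutive integers instead of random numbers, which is my own simplification.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20200101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

print(df.shape)                    # (6, 4)
print(df.T.shape)                  # (4, 6)
# The old column names become the row index.
print(list(df.T.index))            # ['A', 'B', 'C', 'D']
print(df.T.equals(df.transpose())) # True
```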
By using [sort_index()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_index.html#pandas-dataframe-sort-index) of the DataFrame class, you can sort the data by its row or column labels.
df.sort_index()
Set the **axis** argument to 0 or "index" to sort by row labels, or to 1 or "columns" to sort by column labels (default 0). If you pass False for the **ascending** argument, the sort order is descending (default True).
df.sort_index(axis=0, ascending=False)
df.sort_index(axis=1, ascending=False)
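Here is a minimal sketch of the axis and ascending arguments, on a small frame with deliberately unsorted labels. The labels and values are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2), index=["b", "c", "a"], columns=["X", "Y"])

rows_asc = df.sort_index()                          # axis=0, ascending=True by default
rows_desc = df.sort_index(axis=0, ascending=False)  # row labels, descending
cols_desc = df.sort_index(axis=1, ascending=False)  # column labels, descending

print(list(rows_asc.index))     # ['a', 'b', 'c']
print(list(rows_desc.index))    # ['c', 'b', 'a']
print(list(cols_desc.columns))  # ['Y', 'X']
# sort_index() returns a new object; the original order is unchanged.
print(list(df.index))           # ['b', 'c', 'a']
```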
By using [sort_values()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html#pandas-dataframe-sort-values) of the DataFrame class, you can sort rows or columns by their values.
df.sort_values(by="B")
df.sort_values(by="2020-01-01", axis=1)
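Here is a minimal sketch of both directions, with made-up row labels r1 to r3 so the result is easy to follow. With axis=1, the by argument names a row, and the columns are reordered by the values in that row.

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 1, 2], "B": [1, 5, 4]}, index=["r1", "r2", "r3"])

# Reorder the rows by the values in column "A".
by_column = df.sort_values(by="A")
print(list(by_column.index))  # ['r2', 'r3', 'r1']

# Reorder the columns by the values in row "r1" (A=3, B=1).
by_row = df.sort_values(by="r1", axis=1)
print(list(by_row.columns))   # ['B', 'A']
```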
(Added on 2020-03-07)
You can get a single column by writing **df["A"]** or **df.A**.
df["A"]
df.A
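The two forms return the same Series, but attribute access has a catch. Here is a minimal sketch, assuming a frame with a column named "count" (my own example), where the name collides with an existing DataFrame method.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "count": [3, 4]})

# For an ordinary column name, both forms give the same Series.
print(df["A"].equals(df.A))  # True

# "count" is also a DataFrame method, so df.count is the method, not the column.
print(callable(df.count))    # True
# Bracket access always refers to the column.
print(list(df["count"]))     # [3, 4]
```

For this reason, bracket access is the safer choice when column names are not known in advance.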
If you pass a Python slice inside **[]**, you can select rows.
# Display the first 3 rows
df[0:3]
You can also slice by a range of index labels.
# Display January 2, 2020 through January 4, 2020
df['20200102':'20200104']
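One difference between the two slices is easy to miss: a positional slice excludes its end, while a label slice includes it. Here is a minimal sketch, using consecutive integers instead of random values as my own simplification.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20200101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Positional slice: rows 0, 1, 2 (the end position 3 is excluded).
by_position = df[0:3]
print(len(by_position))      # 3

# Label slice: January 2 through January 4, with the end label included.
by_label = df["20200102":"20200104"]
print(len(by_label))         # 3
print(by_label.index[-1])    # 2020-01-04 00:00:00
```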
By passing index labels (dates in this case) to loc[] of the DataFrame class, you can select rows by label.
df.loc[dates]
df.loc[dates[0]]
You can select multiple columns by using loc[].
df.loc[:, ["A", "B"]]
Without the leading colon, the labels are looked up in the row index instead of the columns, so an error (KeyError) occurs.
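Here is a minimal sketch of that error, assuming a small frame with string row labels r1 to r3 that I made up for illustration. The first position in loc[] always refers to rows, so column names passed there are not found.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["r1", "r2", "r3"])

# With the colon: all rows, columns A and B.
selected = df.loc[:, ["A", "B"]]
print(selected.shape)  # (3, 2)

# Without the colon: "A" and "B" are searched for among the row labels.
raised = False
try:
    df.loc[["A", "B"]]
except KeyError:
    raised = True
print(raised)  # True
```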
By combining loc[] with a slice, you can select multiple rows and multiple columns at once.
df.loc['20200102':'20200104', ['A', 'B']]
By passing a row label and a column label to loc[], you can get a single value.
df.loc[dates[0], 'A']
By using at[], you can get a single value faster.
df.at[dates[0], 'A']
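Here is a minimal sketch showing that at[] and loc[] return the same value for a single cell. The integer data is my own stand-in for the random numbers, and the sketch makes no attempt to measure the speed difference.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20200101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

via_loc = df.loc[dates[0], "A"]
via_at = df.at[dates[0], "A"]
print(via_loc, via_at)    # 0 0
print(via_loc == via_at)  # True
```

Unlike loc[], at[] only accepts a single row label and a single column label.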
By using iloc[] of the DataFrame class, you can select data by integer position.
df.iloc[3]
df.iloc[3:5, 0:2]
df.iloc[[1, 2, 4], [0, 2]]
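Here is a minimal sketch of the three iloc[] calls above on data where each cell holds a known integer, so the selected positions are visible in the output. Filling the frame with 0..23 is my own choice.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20200101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# A single integer selects one row (the fourth row) as a Series.
row = df.iloc[3]
print(row.tolist())               # [12, 13, 14, 15]

# Slices work like Python slices: rows 3-4, columns 0-1.
block = df.iloc[3:5, 0:2]
print(block.to_numpy().tolist())  # [[12, 13], [16, 17]]

# Lists pick arbitrary positions: rows 1, 2, 4 and columns 0, 2.
picked = df.iloc[[1, 2, 4], [0, 2]]
print(picked.to_numpy().tolist()) # [[4, 6], [8, 10], [16, 18]]
```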
By passing a bare slice (:) with the start and end positions omitted to iloc[] of the DataFrame class, you can get all columns for specific rows, or all rows for specific columns.
df.iloc[1:3, :]
df.iloc[:, 1:3]
By passing only integers to iloc[] of the DataFrame class, you can select a single value.
df.iloc[1, 1]
As with at[], you can get a single value faster by using [iat](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iat.html).
df.iat[1, 1]
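Here is a minimal sketch comparing the position-based accessors iloc[] and iat[] with the label-based at[] for the same cell. The integer data is my own stand-in for the random values.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20200101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Second row, second column, addressed by position.
via_iloc = df.iloc[1, 1]
via_iat = df.iat[1, 1]
# The same cell addressed by label.
via_at = df.at[dates[1], "B"]

print(via_iloc, via_iat, via_at)  # 5 5 5
```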
(I ran out of energy here. As for the rest ... there is still a lot, isn't there? Is this really 10 minutes? :thinking:)
4. Missing data
5. Operations
6. Merge
7. Grouping
8. Reshaping
9. Time series
10. Categoricals
11. Plotting
12. Getting data in/out
13. Gotchas