When analyzing data with Python, it is common to use a module called pandas.
In pandas, data is stored in types called Series and DataFrame. A Series stores one-dimensional data and a DataFrame stores two-dimensional data. They are like high-performance one-dimensional and two-dimensional arrays, respectively. Here, high-performance means that each row and column can be named and that many methods are available.
| | Hachiman | Yukino | Yui |
|---|---|---|---|
| Math | 8 | 90 | 10 |
| National language | 88 | 100 | 50 |
| English | 38 | 95 | 35 |
When you express this as a two-dimensional array, it is difficult to handle labels such as "Hachiman", "Yukino", "Yui", "Math", "National language", and "English". In a DataFrame, these can be represented by columns and index.
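As a rough sketch of what that looks like, the table above can be built as follows. The variable name `scores` and the lowercase row labels are my own choices for illustration; DataFrame itself is left for a later article.

```python
import pandas as pd

# One column per student, one row per subject (values from the table above).
scores = pd.DataFrame(
    {
        "Hachiman": [8, 88, 38],
        "Yukino": [90, 100, 95],
        "Yui": [10, 50, 35],
    },
    index=["math", "japanese", "english"],  # row labels (assumed spelling)
)

print(scores)

# Look up one cell by row label and column label.
yukino_japanese = scores.loc["japanese", "Yukino"]
print(yukino_japanese)
```

The point is only that the names live in `columns` and `index`, separate from the numbers themselves.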
However, these types have various troublesome specifications, and I stumbled right from the beginning. This is a super rudimentary pandas operation manual that I, a pandas super beginner, made for myself. The Python version is 3.5.2 (I'm using Anaconda 4.2.0 instead of standard Python). The version of pandas is 0.18.1. The code assumes it is run on IPython 5.1.0.
I installed everything at once with Anaconda. (Anaconda is like Python plus popular libraries, including NumPy and IPython.) Besides that, you can install pandas with pip or the like.
Since pandas is a module, it must be imported.
In[1]: import pandas
However, whichever reference site you look at, pandas seems to be imported under the name pd, so I will follow that here as well.
In[2]: import pandas as pd
| | Hachiman |
|---|---|
| Math | 8 |
| National language | 88 |
| English | 38 |
For example, suppose you are given one-dimensional data like this. The first thing that comes to mind when you see it is to create a list.
In[3]: HachimanList = [8, 88, 38]
It's easy to access the elements of this list.
If you want the national language score,
In[4]: HachimanList[1]
Out[4]: 88
is returned.
The problem with this is that the information about which subject each score belongs to, such as "math", is lost.
Of course, you can create dictionaries, objects, and named tuples, but none of them are suitable for large-scale data processing.
The solution to this is the pandas Series.
In[5]: HachimanSeries = pd.Series(HachimanList, index = ["math", "japanese", "english"])
In this way, a Series can be created with
variable = pd.Series(array of data, index=array of label names)
When I output this,
In[6]: HachimanSeries
Out[6]:
math         8
japanese    88
english     38
dtype: int64
You can see that each item is given a name and output.
Note that `dtype` is the data type of the entire array. (NumPy has several different integer types, and int64 is one of them.)
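To see the three parts of a Series separately (labels, values, and dtype), a small sketch like the following may help. The name `hachiman` and the `mixed` example are my own additions for illustration.

```python
import pandas as pd

hachiman = pd.Series([8, 88, 38], index=["math", "japanese", "english"])

labels = hachiman.index.tolist()   # the names given through index=
values = hachiman.tolist()         # the data as a plain Python list
print(labels)
print(values)
print(hachiman.dtype)              # one dtype for the whole Series

# A single float among the values changes the dtype of the whole Series.
mixed = pd.Series([8, 88.5, 38], index=hachiman.index)
print(mixed.dtype)
```

Because the dtype belongs to the Series as a whole, one float is enough to turn every element into a float.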
What if you didn't specify an index? Will an error be returned?
In[7]: YukinoSeries = pd.Series([90, 100, 95])
In[8]: YukinoSeries
Out[8]:
0     90
1    100
2     95
dtype: int64
Apparently, `index` has a default value that counts up from 0.
Note that `index` can be set later.
In[9]: YukinoSeries.index = ["math", "japanese", "english"]
In[10]: YukinoSeries
Out[10]:
math         90
japanese    100
english      95
dtype: int64
There is also a method using a dictionary.
In[11]: YuiSeries = pd.Series({"math":10, "japanese":50, "english":35})
In[12]: YuiSeries
Out[12]:
english     35
japanese    50
math        10
dtype: int64
In this case, with this version of pandas, the items do not come out in the order they were written (here the labels are sorted).
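If the order matters, one way around it is to state the order explicitly. The sketch below passes `index=` together with the dictionary, or calls `reindex` afterwards; the names `scores`, `subjects`, `yui`, and `yui_reordered` are my own. As far as I know, newer pandas releases on recent Python keep the dictionary's insertion order anyway, so treat this as a belt-and-braces approach.

```python
import pandas as pd

scores = {"math": 10, "japanese": 50, "english": 35}
subjects = ["math", "japanese", "english"]

# With a dict, index= picks the entries by label in the given order.
yui = pd.Series(scores, index=subjects)
print(yui)

# An existing Series can also be reordered by label afterwards.
yui_reordered = pd.Series(scores).reindex(subjects)
print(yui_reordered)
```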
## Extract the value of Series
Retrieving values from a Series is almost the same as with a normal array.
#### Element specification
The nth element of an array is taken out as `array[n-1]`.
Similarly, the nth element of a Series is
In[13]: HachimanSeries[2]
Out[13]: 38
You can also pass this to a variable for calculation.
In[15]: HachimanMath = HachimanSeries[0]
In[16]: 40 <= HachimanMath
Out[16]: False
However, the type of HachimanMath is `numpy.int64` instead of the usual `int`.
In[17]: type(HachimanMath)
Out[17]: numpy.int64
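The session above takes elements by position with plain `[]`, which works on the pandas version used here. My understanding is that newer pandas releases discourage integer keys in `[]` when the index consists of strings, so the following sketch uses `.iloc`, which always means "by position". The variable names are my own.

```python
import numpy as np
import pandas as pd

hachiman = pd.Series([8, 88, 38], index=["math", "japanese", "english"])

# .iloc always means "by position", whatever the index labels are.
math_score = hachiman.iloc[0]
print(type(math_score))                     # a NumPy integer type
print(isinstance(math_score, np.integer))

# Comparisons work just like with a built-in int.
passed = bool(40 <= math_score)
print(passed)

# Convert explicitly when a built-in int is required.
as_int = int(math_score)
print(type(as_int))

# Negative positions count from the end, as with a list.
last_score = hachiman.iloc[-1]
print(last_score)
```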
Negative indices such as `array[-1]` also work.
It's easy to retrieve multiple elements, too.
In[18]: HachimanSeries[0:2]
Out[18]: [8, 38]
Readers who ran the sample code in their own environment instead of taking my results for granted will have noticed that I finally showed my true colors and wrote a lie here.
The code `HachimanSeries[0:2]` actually returns a result of the Series type, that is, `pandas.core.series.Series`.
In[18]: HachimanSeries[0:2]
Out[18]:
math         8
japanese    88
dtype: int64
In[19]: type(HachimanSeries[0:2])
Out[19]: pandas.core.series.Series
Summary
- If you specify a single element, you get an `int`-like result of type `numpy.int64`.
- If you specify a range of elements, you get a `Series`, whose type is `pandas.core.series.Series`.
Some people may find this a little unpleasant, but if you think about it carefully, there is a simple correspondence:
- `int` -> `numpy.int64`
- `list` -> `pandas.core.series.Series`

With that in mind, it behaves the same as a normal array operation.
So, of course, a `Series` with a single element can be retrieved in the same way you slice a single-element list out of a list.
In[20]: HachimanSeries[1:1+1]
Out[20]:
japanese    88
dtype: int64
In[21]: type(HachimanSeries[1:1+1])
Out[21]: pandas.core.series.Series
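The difference between "one element" and "a range containing one element" can be checked in a few lines. This is only a sketch with my own variable names, again using `.iloc` for positions.

```python
import pandas as pd

hachiman = pd.Series([8, 88, 38], index=["math", "japanese", "english"])

single = hachiman.iloc[1]        # one position: a scalar
window = hachiman.iloc[1:2]      # a range of length 1: still a Series
first_two = hachiman.iloc[0:2]   # a range of length 2: a Series

print(type(single).__name__)
print(type(window).__name__)

# If a plain list really is wanted, convert explicitly.
first_two_list = first_two.tolist()
print(first_two_list)
```

So the list I "lied" about earlier is available, but only if you ask for it with `tolist()`.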
To specify an element, you can also use its `index` name, as with a dictionary.
In[22]: HachimanSeries["math"]
Out[22]: 8
By the way, slicing with `[::-1]`, as in `array[::-1]`, returns the elements in reverse order.
Surprisingly, you can even slice a range using `index` names.
In[23]: HachimanSeries["math":"english"]
Out[23]:
math         8
japanese    88
english     38
dtype: int64
This is something that neither ordinary dictionaries nor collections.OrderedDict can do, which shows how capable Series-tan is.
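One thing worth noticing in the output above is that the end label is included, unlike a positional slice, where the end position is excluded. The following sketch compares the two using `.iloc` and `.loc`; the variable names are my own.

```python
import pandas as pd

hachiman = pd.Series([8, 88, 38], index=["math", "japanese", "english"])

by_position = hachiman.iloc[0:2]             # stops BEFORE position 2
by_label = hachiman.loc["math":"japanese"]   # stops AT the label "japanese"

print(by_position.index.tolist())
print(by_label.index.tolist())

# Reversing works on a Series just as on a list.
reversed_scores = hachiman.iloc[::-1]
print(reversed_scores.index.tolist())
```

Both slices select the same two subjects here, even though one is written with `2` and the other with `"japanese"`.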
If you want to access the `index` names, treat `Series.index` like an array.
In[24]: HachimanSeries.index
Out[24]: Index(['math', 'japanese', 'english'], dtype='object')
In[25]: HachimanSeries.index[1]
Out[25]: 'japanese'
In[26]: HachimanSeries.index[1:2]
Out[26]: Index(['japanese'], dtype='object')
So far, we have explained that Series can retrieve elements like ordinary arrays and dictionaries.
#### Get the Series you want
What if you want to pick out only some of the data in a `Series`?
In other words, you want only `math` and `japanese`, or only `math` and `english`.
Or maybe you want `math` twice.
(Would you ever actually need that...?)
For `math` and `japanese`, `HachimanSeries[0:2]` will do the trick. However, when it comes to `math` and `english`, it's quite annoying.
So here is what I came up with.
In[27]: HachimanSeries["math"] + HachimanSeries["english"]
How about this!!
Out[27]: 46
Reality is ruthless, but this output makes sense. After all, the result of `HachimanSeries["math"]` is a `numpy.int64`, so the two values are simply added as numbers.
In that case, let's try
In[28]: HachimanSeries[0:0 + 1] + HachimanSeries[2:2 + 1]
instead.
Out[28]:
english   NaN
math      NaN
dtype: float64
As you can see, it spewed out industrial waste.
This is probably because addition between `Series` means "adding the elements that have the same index".
Labels that appear in only one of the two are filled with `NaN` for the time being.
In fact,
In[29]: HachimanSeries[0:0 + 2] + YukinoSeries[1:1 + 2]
Out[29]:
english       NaN
japanese    188.0
math          NaN
dtype: float64
is the result.
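The same alignment behaviour as in In[29] can be reproduced in a short sketch, along with two ways to avoid the `NaN`: `Series.add` with `fill_value`, and adding the underlying arrays by position. The names `left`, `right`, `aligned`, `filled`, and `positional` are my own.

```python
import pandas as pd

hachiman = pd.Series([8, 88, 38], index=["math", "japanese", "english"])
yukino = pd.Series([90, 100, 95], index=["math", "japanese", "english"])

left = hachiman.iloc[0:2]    # math, japanese
right = yukino.iloc[1:3]     # japanese, english

# "+" matches elements by label; labels found on one side only give NaN.
aligned = left + right
print(aligned)

# Series.add with fill_value treats the missing side as 0 instead.
filled = left.add(right, fill_value=0)
print(filled)

# To ignore labels and add by position, use the underlying arrays.
positional = (left.to_numpy() + right.to_numpy()).tolist()
print(positional)
```

Note that the result becomes `float64` as soon as a `NaN` is involved, which is why 188 is displayed as 188.0.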
So how do you pick out only `math` and `english` from the same Series?
The answer is to write a double `[]`.
In[30]: HachimanSeries[[0, 2]]
Out[30]:
math        8
english    38
dtype: int64
This `[[]]` probably has nothing to do with the notation for two-dimensional arrays. It is simply a list written inside the indexing brackets.
(`HachimanSeries[(0,2)]` does not work, so not just any iterable is accepted, but `HachimanSeries[list((0,2))]` works, so the argument seems to be handled as a list.)
If you just want to emphasize your math score,
In[31]: HachimanSeries[["math","math","math","math","math"]]
Out[31]:
math    8
math    8
math    8
math    8
math    8
dtype: int64
you can do that, too. (Here, I specified the `index` name directly.)
The same is true for `index`.
In[32]: HachimanSeries.index[[1,2]]
Out[32]: Index(['japanese', 'english'], dtype='object')
So far, I've learned how to use `[[]]` to create a new Series that extracts only the desired `index` entries.
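Putting the pieces together, the original goal (only `math` and `english`, and their total) can be sketched like this. I use `.iloc` for positions and `.loc` for labels to make the intent explicit; the variable names are my own.

```python
import pandas as pd

hachiman = pd.Series([8, 88, 38], index=["math", "japanese", "english"])

# A list of positions or a list of labels selects exactly those elements.
picked_by_position = hachiman.iloc[[0, 2]]
picked_by_label = hachiman.loc[["math", "english"]]
print(picked_by_label)

# The same label may appear more than once.
repeated = hachiman.loc[["math", "math"]]
print(repeated)

# Summing the selection gives the total from In[27] without manual "+".
total = picked_by_label.sum()
print(total)
```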
### Rewriting elements of Series
You may later find that the contents of a `Series` or its `index` names were incorrect.
You could build a corrected `Series` and assign it to the same name, but in fact the elements can be changed as easily as those of an array.
First, the code to rewrite a single element.
In[33]: HachimanSeries[1]
Out[33]: 88
In[34]: HachimanSeries[1] = 98
In[35]: HachimanSeries[1]
Out[35]: 98
Next, rewriting a specified range.
In[36]: HachimanSeries
Out[36]:
math         8
japanese    98
english     38
dtype: int64
In[37]: HachimanSeries[1:1+2] = [89,33]
In[38]: HachimanSeries
Out[38]:
math         8
japanese    89
english     33
dtype: int64
Here, pandas complains if the number of elements on the left and right sides does not match.
(traceback omitted)
ValueError: cannot set using a slice indexer with a different length than the value
However, a single value can be assigned to every element in the range.
In[40]: HachimanSeries[0:0+3] = 0
In[41]: HachimanSeries
Out[41]:
math        0
japanese    0
english     0
dtype: int64
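The three kinds of assignment (one element, a range, and one value spread over a range) can be collected into one sketch. It uses `.loc` and `.iloc` rather than plain `[]`, and the snapshot variables `after_single` and `after_range` are my own additions so that each step can be inspected.

```python
import pandas as pd

hachiman = pd.Series([8, 88, 38], index=["math", "japanese", "english"])

# Rewrite a single element, here by label.
hachiman.loc["japanese"] = 98
after_single = hachiman.tolist()

# Rewrite a range; the right-hand side needs the same length.
hachiman.iloc[1:3] = [89, 33]
after_range = hachiman.tolist()

# A length mismatch raises ValueError and leaves the Series unchanged.
try:
    hachiman.iloc[1:3] = [1, 2, 3]
except ValueError as exc:
    print("ValueError:", exc)

# A single value is broadcast to every element in the range.
hachiman.iloc[0:3] = 0
print(hachiman)
```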
Finally, rewriting `index`.
In[42]: HachimanSeries.index[1] = "Japanese"
   ...: HachimanSeries
Out[42]:
math        0
Japanese    0
english     0
dtype: int64
Actually, this is not what happens.
TypeError: Index does not support mutable operations
As you can see, `index` seems to be immutable. (You get a similar error if you try the same thing with a string.)
So there is no choice but to overwrite it.
In[43]: HachimanSeries.index = ["Math","Japanese","English"]
In[44]: HachimanSeries
Out[44]:
Math        0
Japanese    0
English     0
dtype: int64
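Overwriting the whole `index` works, but when only one label is wrong it may be more convenient to use `Series.rename`, which is a method I did not cover above. It returns a new Series with the labels changed and leaves the original alone. A small sketch, with variable names of my own:

```python
import pandas as pd

hachiman = pd.Series([0, 0, 0], index=["math", "japanese", "english"])

# Assigning to one position of the Index is not allowed.
try:
    hachiman.index[1] = "Japanese"
except TypeError as exc:
    print("TypeError:", exc)

# rename with a dict changes only the listed labels and returns a new Series.
renamed = hachiman.rename({"japanese": "Japanese"})
print(renamed.index.tolist())

# rename also accepts a function that is applied to every label.
capitalized = hachiman.rename(str.capitalize)
print(capitalized.index.tolist())
```

Since `rename` returns a new object, assign the result back to the same name if you want to keep the change.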
Now, as a review, let's reset everything.
In[45]: HachimanSeries[0:0+3] = [8,88,38]
In[46]: HachimanSeries.index = ["math", "japanese", "english"]
In[47]: HachimanSeries
Out[47]:
math         8
japanese    88
english     38
dtype: int64
The above is the basic operation of Series.
It got longer than I expected, so I'll talk about DataFrame and the Series methods in a subsequent article.