[Memo] Small story of pandas, numpy

A memo of small tips that seem like they would come up often but rarely do, and that I immediately forget because there are so many other ways to do the same thing.

1. Count the number of samples per day

Suppose you have a DataFrame with a timestamp index.

timestamp  score
2013-06-29 00:52:28+00:00        -0.420070
2013-06-29 00:51:53+00:00        -0.445720
2013-06-28 16:40:43+00:00         0.508161
2013-06-28 15:10:30+00:00         0.921474
2013-06-28 15:10:17+00:00         0.876710

I want to check the number of samples per day like this.

            score
2013-06-28      3
2013-06-29      2

Rather than preparing an empty array, grouping by day, counting the samples, and appending in a for loop, all of that tedious work fits in one line.

df.groupby(df.index.date).count()

You can process it easily by passing the dates of the DataFrame's index to pd.DataFrame.groupby().

example1.py


import io
import pandas as pd
content = '''\
timestamp  score
2013-06-29 00:52:28+00:00        -0.420070
2013-06-29 00:51:53+00:00        -0.445720
2013-06-28 16:40:43+00:00         0.508161
2013-06-28 15:10:30+00:00         0.921474
2013-06-28 15:10:17+00:00         0.876710
'''

# content is a str, so wrap it in StringIO; a regex sep requires engine='python'
df = pd.read_csv(io.StringIO(content), sep=r'\s{2,}', engine='python', parse_dates=[0], index_col=[0])

print(df)

print(df.groupby(df.index.date).count())

groupby() tends to be avoided for heavy processing, but it pairs well with datetime indexes.

2. I want to search the columns of pd.DataFrame for multiple parts that match the target character string.

Suppose you have a DataFrame with columns like this:

print(df.columns)

Index(['A_0', 'A_1', 'A_2', 'B_0', 'B_1', 'B_2', 'C_0', 'C_1', 'C_2', 'C_3'], dtype='object')

For example, I want to filter only the 'B_' columns out of this. There are various ways to match strings, but you can use find:

df.iloc[:, np.where(df.columns.str.find('B') == 0)[0]]

str.find() returns the match position (or -1), so comparing it to 0 marks the columns whose names start with 'B'; np.where() turns that boolean mask into positional indices for iloc (the old .ix accessor has been removed from pandas).
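A minimal runnable sketch of both selections, on a made-up frame (the column names here are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# A hypothetical frame with prefixed column names
df = pd.DataFrame(np.arange(12).reshape(2, 6),
                  columns=['A_0', 'A_1', 'B_0', 'B_1', 'C_0', 'C_1'])

# Positional selection via find() (.ix is gone, so use .iloc)
b_cols = df.iloc[:, np.where(df.columns.str.find('B') == 0)[0]]
print(b_cols.columns.tolist())  # ['B_0', 'B_1']

# Equivalent and more readable: a boolean mask with startswith
b_cols2 = df.loc[:, df.columns.str.startswith('B_')]
```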

3. Extract rows containing any of several specific values from the data frame, using a list as the filtering condition

df[df.A > 0]

I would like to use a list as the condition in a common filter expression like this. isin does it immediately.

df[df['A'].isin([0, 1, 2, 3])]

The same operation can be done with numpy.

a = np.array([0,1,2,3])
a[np.isin(a,[1,2,5,7])]
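Both one-liners in a self-contained sketch (the data here is made up):

```python
import numpy as np
import pandas as pd

# Hypothetical data to filter
df = pd.DataFrame({'A': [0, 1, 4, 9, 2]})

# Keep only the rows whose A value appears in the list
filtered = df[df['A'].isin([0, 1, 2, 3])]
print(filtered['A'].tolist())  # [0, 1, 2]

# Same idea in numpy
a = np.array([0, 1, 2, 3])
print(a[np.isin(a, [1, 2, 5, 7])])  # [1 2]
```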

4. Downsample data with sparse sample spacing

Use pandas.DatetimeIndex and resample. Suppose you have data with the following timestamps, where the sample interval is not constant.

['2016-11-11 01:41:29', '2016-11-11 01:41:31',
'2016-11-11 01:41:32', '2016-11-11 01:41:33',
'2016-11-11 01:41:34', '2016-11-11 01:41:35',
'2016-11-11 01:41:36', '2016-11-11 01:41:37',
'2016-11-11 01:41:38', '2016-11-11 01:41:38',...

Downsampling with groupby takes too long here, so downsample more cheaply.

x['date'] = pd.to_datetime(x.date)
X = x.copy().set_index('date')

Convert date to datetime with to_datetime and set it as the index, giving the frame a DatetimeIndex.

X.resample('5S').mean().dropna()

A DatetimeIndex can be downsampled easily by resampling with a freq string like this; note that resample needs an aggregation such as mean() before dropna() can be applied.
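A runnable sketch on made-up timestamps (the 5-second bins and the mean() aggregation are choices for illustration):

```python
import pandas as pd

# Hypothetical frame with irregular sample spacing
x = pd.DataFrame({
    'date': ['2016-11-11 01:41:29', '2016-11-11 01:41:31',
             '2016-11-11 01:41:32', '2016-11-11 01:41:38'],
    'value': [1.0, 2.0, 3.0, 4.0],
})
x['date'] = pd.to_datetime(x.date)
X = x.set_index('date')

# Aggregate into 5-second bins, then drop any empty bins
down = X.resample('5S').mean().dropna()
print(down['value'].tolist())  # [1.0, 2.5, 4.0]
```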

5. Convert the triangular matrix to an array

Suppose you have such an upper triangular matrix at hand.

A = np.array([[ 4,  0,  3],
              [ 0,  4, -2],
              [ 0,  0,  7]])

I want to get a list of values like this.

[4, 0, 3, 4, -2, 7]

Solved with numpy.triu_indices().

>>> A[np.triu_indices(3)]
array([ 4,  0,  3,  4, -2,  7])

When you want to drop the diagonal, as is common for heat maps and the like, you can adjust by specifying the offset k.

>>> A[np.triu_indices(3, k=1)]
array([ 0,  3, -2])

6. Simplify element counting

To count the data (create a histogram)

labels = pd.Series(['A', 'B', 'C', 'A', 'C', 'A'])
0    A
1    B
2    C
3    A
4    C
5    A

Use defaultdict to do this.

from collections import defaultdict
counts = defaultdict(int)
for l in labels:
    counts[l] += 1

The output looks like this. defaultdict(<class 'int'>, {'A': 3, 'B': 1, 'C': 2})

With pandas you could do this more easily with groupby, but when counting nested tags like the ones below, just adding one more for loop does the extraction.

labels = pd.Series([['A', 'B'], ['C', 'A'],[ 'C', 'A', 'D'], ['D', 'A', 'B']])

from collections import defaultdict
counts = defaultdict(int)
for l in labels:
    for l2 in l:
        counts[l2] += 1

Of course, the key & value of the return value can be listed as follows.

list(counts.keys())
list(counts.values())
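Incidentally, collections.Counter does the same tally in one call, both for flat labels and (after flattening) for nested ones:

```python
from collections import Counter

import pandas as pd

labels = pd.Series(['A', 'B', 'C', 'A', 'C', 'A'])
print(Counter(labels))  # Counter({'A': 3, 'C': 2, 'B': 1})

# For the nested case, flatten first
nested = pd.Series([['A', 'B'], ['C', 'A'], ['C', 'A', 'D'], ['D', 'A', 'B']])
flat_counts = Counter(tag for tags in nested for tag in tags)
print(flat_counts)
```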

7. Nested list comprehension

To flatten a nested list, nest the for clauses as follows.

[x for inner_list in [[1, 3], [5], [7, 9]] for x in inner_list]
>>> [1, 3, 5, 7, 9]

By the way, you can combine this with "6. Simplify element counting" to plot the histogram of nested tags: flatten the lists first, then group by the values.

numbers = [tag for tags in labels for tag in tags]
pd.DataFrame({'numbers': numbers}).groupby('numbers').size().plot.bar()

8. In Pandas DataFrame, enter data while giving column names

a = [1,2,3,4,5]
pd.DataFrame({'hello': a})
>>>
 	hello
0	1
1	2
2	3
3	4
4	5

9. Create Json from multiple lists

Create json from the list.

a=["USA","France","Italy"]
b=["10","5","6"]

From this list, I want to create the following json.

[{'country': 'USA', 'wins': '10'},
 {'country': 'France', 'wins': '5'},
 {'country': 'Italy', 'wins': '6'}]

There are two methods. The first is a list comprehension.

>>> [{'country': country, 'wins': wins} for country, wins in zip(a, b)]
[{'country': 'USA', 'wins': '10'},
 {'country': 'France', 'wins': '5'},
 {'country': 'Italy', 'wins': '6'}]

The second is json.dumps (after import json).

>>> json.dumps(
...     [{'country': country, 'wins': wins} for country, wins in zip(a, b)]
... )
'[{"country": "USA", "wins": "10"}, {"country": "France", "wins": "5"}, {"country": "Italy", "wins": "6"}]'

10. Load jsonl with DataFrame

I want to read JSON Lines whose lines have different numbers of fields, without going through a dict first.

import io
import pandas as pd

# JSON Lines: one JSON object per line (not comma-separated)
json_string = '''{"first_name": "Guido", "last_name": "Rossum"}
{"first_name": "Guido", "last_name": "Terry", "test_name": "test-man"}'''
parsed_json = pd.read_json(io.StringIO(json_string), lines=True)
print(parsed_json)

  first_name last_name test_name
0      Guido    Rossum       NaN
1      Guido     Terry  test-man

With pd.read_json(lines=True), JSON Lines are read in neatly.

11. Create a layered json from the list

I want to create the following Json.

{'filename': 'filename',
 'data': [{'keyword': 'bomber',
   'term_freq': 5,
   'lists': [{'occurrance': 1},
    {'occurrance': 2},
    {'occurrance': 3},
    {'occurrance': 4},
    {'occurrance': 5}]},
  {'keyword': 'irritation',
   'term_freq': 5,
   'lists': [{'occurrance': 1},
    {'occurrance': 3},
    {'occurrance': 5},
    {'occurrance': 7},
    {'occurrance': 8}]}]}

Use dict.

import json

dic = {
        "bomber": [1, 2, 3, 4, 5],
        "irritation": [1, 3, 5, 7, 8]
      }


json_dict = {}
data = []

#data creation
for k, v in dic.items():
    tmp_dict = {}
    
    #Add keyword
    tmp_dict["keyword"] = k
    
    #Add list length
    tmp_dict["term_freq"] = len(v)
    
    #Add keyword to the contents of the list
    tmp_dict["lists"] = [{"occurrance": i} for i in v]
    
    data.append(tmp_dict)

json_dict["filename"] = 'filename'
json_dict["data"] = data

print(json.dumps(json_dict))

12. Store multiple initial values as new columns

pd.DataFrame.assign()

Instead of writing

df['a'], df['b'] = value_a, value_b

use assign (note that it returns a new DataFrame):

df = df.assign(a=value_a, b=value_b)
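A minimal runnable sketch (the column names and values are made up):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})

# assign returns a new DataFrame; the original is untouched
df2 = df.assign(a=0, b=df['x'] * 10)
print(df2.columns.tolist())  # ['x', 'a', 'b']
```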

13. I want to quickly retrieve values from a DataFrame

pd.DataFrame.loc[0, 0] and pd.DataFrame.values work, but they go through the general indexing machinery (and .values materializes the whole array first).

For a single scalar, use pd.DataFrame.iat[0, 0].
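A tiny sketch; at is the label-based counterpart of iat:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20], 'b': [30, 40]})

# iat takes integer positions and returns a plain scalar
print(df.iat[0, 1])  # 30

# at looks up by row label and column name instead
print(df.at[0, 'b'])  # 30
```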

14. Split heavy output into equal chunks and save each

Split the DataFrame's rows into near-equal chunks and to_csv() each one. The method for splitting x items into n near-equal parts is as follows.

x = 17
n = 3
l = [(x + i) // n for i in range(n)]
l
>>> [5, 6, 6]

Applied to a DataFrame, it looks like this; call to_csv() on each chunk.

# Create a suitable DF for testing
df = pd.DataFrame(data={
    'a': range(10),
    'b': [i / 5 for i in range(10)]
})

x, n = len(df), 3
l = [(x + i) // n for i in range(n)]  # chunk sizes, summing to len(df)

chunks = []
for num in range(n):
    begin = sum(l[:num])
    end = sum(l[:num + 1])
    chunks.append(df.iloc[begin:end])
    # chunks[-1].to_csv(f'part_{num}.csv')
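Alternatively, np.array_split does the near-equal row splitting in one call, and it accepts divisions that don't come out even:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(10)})

# Split into 3 near-equal row chunks
chunks = np.array_split(df, 3)
print([len(c) for c in chunks])  # [4, 3, 3]
```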

15. I want to exclude multiple elements from a list

>>> a = range(1, 10)
>>> [x for x in a if x not in [2, 3, 7]]
[1, 4, 5, 6, 8, 9]

The same can also be done with sets:

x = set(range(10))
y = x - set([2, 3, 7])
# y = set([0, 1, 4, 5, 6, 8, 9])
16. Set a list into a DataFrame as a value

A list may not be storable in the DF as-is.
df = pd.DataFrame(data={
    'a': [1, 2],
    'b': [2, 3]
})
df['c'] = [[3]]

The above raises an error (the length-1 list does not match the 2-row frame), so execute as follows.

df = pd.DataFrame(data={
    'a': [1, 2],
    'b': [2, 3]
})
df['c'] = [[3]] * len(df)

17. Arrange the column order

def df_sort(df):
    return df.reindex(sorted(df.columns), axis=1)
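A quick usage sketch (redefining the helper so the block is self-contained):

```python
import pandas as pd

def df_sort(df):
    # Reorder columns alphabetically without touching the data
    return df.reindex(sorted(df.columns), axis=1)

df = pd.DataFrame({'b': [1], 'c': [2], 'a': [3]})
print(df_sort(df).columns.tolist())  # ['a', 'b', 'c']
```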

18. String search from column name

df.columns.str.contains('required string')
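For example, to pull out the matching column names (the names here are made up):

```python
import pandas as pd

df = pd.DataFrame(columns=['user_id', 'user_name', 'score'])

# Boolean mask over the column names
mask = df.columns.str.contains('user')
print(df.columns[mask].tolist())  # ['user_id', 'user_name']
```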

19. Quick reference for the orient parameter of pd.DataFrame.to_json()

The accepted orient values are 'split', 'records', 'index', 'columns' (the default for a DataFrame), 'values', and 'table'; each produces a differently shaped JSON string.

20. Count missing values in all columns at once

df.apply(lambda x: x.isnull().sum(), axis=0)
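A runnable check on a small frame with missing values; df.isnull().sum() is the equivalent shorthand:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with some missing values
df = pd.DataFrame({'a': [1, np.nan, 3], 'b': [np.nan, np.nan, 6]})

counts = df.apply(lambda x: x.isnull().sum(), axis=0)
print(counts.tolist())  # [1, 2]

# Equivalent shorthand
print(df.isnull().sum().tolist())  # [1, 2]
```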

21. I want to use OneHotEncoder

Use pd.get_dummies()
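For example (the dummy column dtype is bool in recent pandas, 0/1 integers in older versions):

```python
import pandas as pd

s = pd.Series(['cat', 'dog', 'cat'])

# One indicator column per category
dummies = pd.get_dummies(s)
print(dummies.columns.tolist())  # ['cat', 'dog']
```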
