A memo of small tricks that seem like they would come up often but rarely do, and that are immediately forgotten because there are so many other ways to do the same thing.
Suppose you have a DataFrame indexed by timestamps.
timestamp score
2013-06-29 00:52:28+00:00 -0.420070
2013-06-29 00:51:53+00:00 -0.445720
2013-06-28 16:40:43+00:00 0.508161
2013-06-28 15:10:30+00:00 0.921474
2013-06-28 15:10:17+00:00 0.876710
I want to check the number of samples per day like this.
score
2013-06-28 3
2013-06-29 2
Instead of preparing an empty array, grouping by day, counting the samples with count, and appending in a for loop, that tedious work can be done in one line.
df.groupby(df.index.date).count()
You can process it easily by passing the DataFrame's index to pd.DataFrame.groupby().
example1.py
import io
import pandas as pd

content = '''\
timestamp  score
2013-06-29 00:52:28+00:00  -0.420070
2013-06-29 00:51:53+00:00  -0.445720
2013-06-28 16:40:43+00:00  0.508161
2013-06-28 15:10:30+00:00  0.921474
2013-06-28 15:10:17+00:00  0.876710
'''
# content is a str, so wrap it in StringIO (BytesIO expects bytes);
# a regex separator needs a raw string and the python engine
df = pd.read_table(io.StringIO(content), sep=r'\s{2,}', engine='python',
                   parse_dates=[0], index_col=[0])
print(df)
print(df.groupby(df.index.date).count())
We tend to avoid groupby() when the processing is heavy, but it may be a good fit when working with datetimes.
Suppose you have a DataFrame whose columns look like this:
print(df.columns)
Index([u'A_0', u'A_1', u'A_2', u'B_0', u'B_1', u'B_2', u'C_0', u'C_1', u'C_2', u'C_3'], dtype='object', length=121)
For example, I want to filter down to only the columns starting with 'B_'. There are various ways to do string matching, but you can use find:
df.iloc[:, np.where(df.columns.str.find('B') == 0)[0]]
str.find() returns the position of the first match (0 when the column name starts with 'B'), so np.where() turns that into integer positions and iloc picks them out. (ix has been removed from modern pandas, so use iloc here.)
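As a minimal self-contained sketch (the small frame and its column names below are made up for illustration), the same selection can also be written with str.startswith, which reads more directly:

import numpy as np
import pandas as pd

# toy DataFrame with made-up column names
df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  columns=['A_0', 'A_1', 'B_0', 'B_1'])

# find()-based selection, as in the text
print(df.iloc[:, np.where(df.columns.str.find('B') == 0)[0]])

# equivalent boolean-mask version
print(df.loc[:, df.columns.str.startswith('B')])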
df[df.A > 0]
I would like to use this kind of common conditional expression with a list of values as the condition. isin makes it trivial.
df[df['A'].isin([0, 1, 2, 3])]
The same operation can be done with numpy.
a = np.array([0, 1, 2, 3])
a[np.isin(a, [1, 2, 5, 7])]
Use pandas.DatetimeIndex and resample. Suppose you have data with the following timestamps, where the sampling interval is not constant.
['2016-11-11 01:41:29', '2016-11-11 01:41:31',
'2016-11-11 01:41:32', '2016-11-11 01:41:33',
'2016-11-11 01:41:34', '2016-11-11 01:41:35',
'2016-11-11 01:41:36', '2016-11-11 01:41:37',
'2016-11-11 01:41:38', '2016-11-11 01:41:38',...
Downsampling with groupby takes too long, so downsample more cheaply.
x['date'] = pd.to_datetime(x.date)
X = x.copy().set_index('date')
Convert date to a DatetimeIndex with pd.to_datetime, and set it as the index.
X.resample('5S').mean().dropna()
A DatetimeIndex can be downsampled easily like this by resampling with a frequency string.
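A minimal runnable sketch of the whole flow (the column names date and value and the toy data are assumptions for illustration):

import pandas as pd

# toy data with an irregular sampling interval
x = pd.DataFrame({
    'date': ['2016-11-11 01:41:29', '2016-11-11 01:41:31',
             '2016-11-11 01:41:34', '2016-11-11 01:41:38'],
    'value': [1.0, 2.0, 3.0, 4.0],
})
x['date'] = pd.to_datetime(x.date)
X = x.set_index('date')

# mean per 5-second bin; dropna() drops bins with no samples
print(X.resample('5S').mean().dropna())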
Suppose you have such an upper triangular matrix at hand.
A = np.array([[4, 0,  3],
              [0, 4, -2],
              [0, 0,  7]])
I want to get a list of values like this.
[4, 0, 3, 4, -2, 7]
Solve it with numpy.triu_indices().
>>> A[np.triu_indices(3)]
array([ 4, 0, 3, 4, -2, 7])
When removing the diagonal values, as is common for heat maps and the like, you can adjust by specifying k.
>>> A[np.triu_indices(3, k=1)]
array([ 0, 3, -2])
To count the data (create a histogram)
labels = pd.Series(['A', 'B', 'C', 'A', 'C', 'A'])
0 A
1 B
2 C
3 A
4 C
5 A
Use defaultdict to do this.
from collections import defaultdict

counts = defaultdict(int)
for l in labels:
    counts[l] += 1
The output looks like this. defaultdict(<class 'int'>, {'A': 3, 'B': 1, 'C': 2})
With pandas you can do the same more easily with groupby, but for nested tags like the ones below, you can extract the counts by just adding one more for loop.
labels = pd.Series([['A', 'B'], ['C', 'A'], ['C', 'A', 'D'], ['D', 'A', 'B']])

from collections import defaultdict

counts = defaultdict(int)
for l in labels:
    for l2 in l:
        counts[l2] += 1
Of course, the keys and values of the result can be turned into lists as follows.
list(counts.keys())
list(counts.values())
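As an aside, collections.Counter does the same counting in one line; a small sketch using the same nested labels as above:

from collections import Counter

labels = [['A', 'B'], ['C', 'A'], ['C', 'A', 'D'], ['D', 'A', 'B']]

# flatten the nested tags, then count
counts = Counter(tag for inner in labels for tag in inner)
print(counts)                 # Counter({'A': 4, 'B': 2, 'C': 2, 'D': 2})
print(counts.most_common(2))  # the two most frequent tags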
To flatten a nested list with a single for comprehension, write as follows.
[x for inner_list in [[1, 3], [5], [7, 9]] for x in inner_list]
>>> [1, 3, 5, 7, 9]
By the way, you can combine this with the tag counting above to plot a histogram of nested labels:
numbers = [x for inner_list in labels for x in inner_list]
pd.DataFrame({'numbers': numbers}).groupby('numbers').size().plot.bar()
A plain list becomes a DataFrame by wrapping it in a dict:
a = [1, 2, 3, 4, 5]
pd.DataFrame({'hello': a})
>>>
hello
0 1
1 2
2 3
3 4
4 5
Create JSON from lists.
a=["USA","France","Italy"]
b=["10","5","6"]
From this list, I want to create the following json.
[{'country': 'USA', 'wins': '10'},
{'country': 'France', 'wins': '5'},
{'country': 'Italy', 'wins': '6'}]
There are two methods. The first is a list comprehension:
>>> [{'country': country, 'wins': wins} for country, wins in zip(a, b)]
[{'country': 'USA', 'wins': '10'},
{'country': 'France', 'wins': '5'},
{'country': 'Italy', 'wins': '6'}]
The second is json.dumps (import json first):
>>> json.dumps(
... [{'country': country, 'wins': wins} for country, wins in zip(a, b)]
... )
'[{"country": "USA", "wins": "10"}, {"country": "France", "wins": "5"}, {"country": "Italy", "wins": "6"}]'
I want to read JSON Lines where each line has a different number of fields, without going through a dict first. Note that lines=True expects newline-delimited records:
json_string = '{"first_name": "Guido", "last_name": "Rossum"}\n{"first_name": "Guido", "last_name": "Terry", "test_name": "test-man"}'
parsed_json = pd.read_json(json_string, lines=True)
print(parsed_json)
  first_name last_name test_name
0      Guido    Rossum       NaN
1      Guido     Terry  test-man
With pd.read_json(lines=True), JSON Lines can be read in neatly.
I want to create the following JSON.
{'filename': 'filename',
'data': [{'keyword': 'bomber',
'term_freq': 5,
'lists': [{'occurrance': 1},
{'occurrance': 2},
{'occurrance': 3},
{'occurrance': 4},
{'occurrance': 5}]},
{'keyword': 'irritation',
'term_freq': 5,
'lists': [{'occurrance': 1},
{'occurrance': 3},
{'occurrance': 5},
{'occurrance': 7},
{'occurrance': 8}]}]}
Build it up with plain dicts.
import json

dic = {
    "bomber": [1, 2, 3, 4, 5],
    "irritation": [1, 3, 5, 7, 8]
}

json_dict = {}
data = []

# build the "data" entries
for k, v in dic.items():
    tmp_dict = {}
    # add the keyword
    tmp_dict["keyword"] = k
    # add the list length
    tmp_dict["term_freq"] = len(v)
    # wrap each element of the list
    tmp_dict["lists"] = [{"occurrance": i} for i in v]
    data.append(tmp_dict)

json_dict["filename"] = 'filename'
json_dict["data"] = data
print(json.dumps(json_dict))
pd.DataFrame.assign()
Instead of
df['a'], df['b'] = value_a, value_b
use assign:
df.assign(a=value_a, b=value_b)
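A minimal sketch (the frame and values below are made up): assign returns a new DataFrame rather than mutating in place, so it chains nicely.

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})

# returns a new DataFrame; df itself is unchanged
out = df.assign(a=df.x * 2, b='constant')
print(out)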
With pd.DataFrame.loc[0, 0] or pd.DataFrame.values you go through array types along the way. To pull out a single scalar, use pd.DataFrame.iat[0, 0].
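A small sketch of the difference (the frame below is an assumption for illustration); iat is the direct positional accessor for one value:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

print(df.values[0, 0])  # goes through the materialized numpy array
print(df.iat[0, 0])     # direct positional access to a single scalar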
Split the rows of a DataFrame into roughly equal chunks and call to_csv(). Equal division works as follows:
x = 17
n = 3
l = [(x + i) // n for i in range(n)]
l
>>> [5, 6, 6]
Rewrite this for the DataFrame and call to_csv on each chunk (a full sketch follows below):
# create a suitable DataFrame for testing
df = pd.DataFrame(data={
    'a': range(10),
    'b': list(map(lambda x: x / 5, list(range(10))))
})

test = []
for num in range(n):
    begin = sum(l[:num])
    end = sum(l[:(num + 1)])
    test.append(df.iloc[begin:end])
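Putting it together as a runnable sketch (the output name pattern chunk_{num}.csv is an assumption; x is taken from the actual row count so the chunk sizes add up):

import pandas as pd

df = pd.DataFrame({'a': range(10), 'b': [i / 5 for i in range(10)]})

n = 3
x = len(df)                           # 10 rows
l = [(x + i) // n for i in range(n)]  # [3, 3, 4], sums to 10

for num in range(n):
    begin = sum(l[:num])
    end = sum(l[:num + 1])
    # hypothetical file name; adjust as needed
    df.iloc[begin:end].to_csv(f'chunk_{num}.csv', index=False)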
Exclude specific values from a list:
>>> a = range(1, 10)
>>> [x for x in a if x not in [2, 3, 7]]
[1, 4, 5, 6, 8, 9]
The same can also be done with sets (note that sets do not preserve order):
x = set(range(10))
y = x - set([2, 3, 7])
# y = set([0, 1, 4, 5, 6, 8, 9])
df = pd.DataFrame(data={
    'a': [1, 2],
    'b': [2, 3]
})
df['c'] = [[3]]
The assignment above raises an error because the length does not match the index, so execute it like this instead.
df = pd.DataFrame(data={
    'a': [1, 2],
    'b': [2, 3]
})
df['c'] = [[3]] * len(df)
Sort the columns alphabetically:
def df_sort(df):
    return df.reindex(sorted(df.columns), axis=1)
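A quick usage sketch (toy columns assumed):

import pandas as pd

df = pd.DataFrame({'c': [1], 'a': [2], 'b': [3]})

def df_sort(df):
    return df.reindex(sorted(df.columns), axis=1)

print(df_sort(df).columns.tolist())  # ['a', 'b', 'c']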
Check which column names contain a given string:
df.columns.str.contains('required string')
Count the missing values in each column:
df.apply(lambda x: x.isnull().sum(), axis=0)
For one-hot encoding, use pd.get_dummies().
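A minimal sketch of get_dummies on a toy Series (the labels are made up):

import pandas as pd

s = pd.Series(['A', 'B', 'C', 'A'])
print(pd.get_dummies(s))  # one column per distinct label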