Introduction

In Python, pandas.DataFrame handles two-dimensional table data. It makes table data easier to work with than list and tuple (the rough equivalents of arrays in other languages), and for operations on whole columns it is generally faster as well. Let's look at the strengths and weaknesses of DataFrame by actually running some code and timing it.

Execution environment / conditions

Execution environment
・ Windows10 Home 64bit

Conditions
・ Data used: CSV data of 10000 lines
・ Columns of data used: index, student ID, scores for 5 subjects A-E (0-10000)
* Modeled on a list of grades of 10,000 students who took a certain exam

DataFrame strengths

The strengths of DataFrame are that the processing is easy to write and that it runs fast.

When calculating the average value

Let's add a column at the end of each row and store the mean of the five subject scores in it. With a list, the processing looks like this.

for row in list:
	# Columns 2-6 hold the five subject scores; append their mean as a new column 7.
	row.append(str((float(row[2]) + float(row[3]) + float(row[4]) + float(row[5]) + float(row[6])) / 5.0))

This reads much like the equivalent loop in other languages. That works, but with a DataFrame the same thing takes just the one line below.

df['average'] = (df['subjectA'] + df['subjectB'] + df['subjectC'] + df['subjectD'] + df['subjectE'])/5

Because you write it as a single operation on whole columns, there is less room for coding mistakes.
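As a small self-contained check, the sketch below applies the same one-line expression to a three-row DataFrame. The scores are made up for illustration and are not the article's CSV data. It also shows `DataFrame.mean(axis=1)` as an equivalent spelling, which is an addition here and not something the measurements in this article used.

```python
import pandas as pd

subjects = ['subjectA', 'subjectB', 'subjectC', 'subjectD', 'subjectE']

# Made-up scores for illustration only (not the article's CSV data).
df = pd.DataFrame(
    [[10, 20, 30, 40, 50],
     [60, 70, 80, 90, 100],
     [0, 0, 0, 0, 5]],
    columns=subjects,
)

# Column-wise arithmetic: each + and / is applied to every row at once.
df['average'] = (df['subjectA'] + df['subjectB'] + df['subjectC']
                 + df['subjectD'] + df['subjectE']) / 5

# Equivalent spelling: row-wise mean over the subject columns.
average_alt = df[subjects].mean(axis=1)

print(df)
```

Either form avoids indexing individual cells, which is where mistakes such as an off-by-one column number tend to creep in.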

When sorting

To sort in descending order of the average calculated above, the list version looks like this.

# The average is stored as a string, so convert it to float to sort numerically.
list = sorted(list, key=lambda x: float(x[7]), reverse=True)

Even with a list, sorting fits on one line. With a DataFrame it is also one line.

# sort_values returns a new DataFrame, so assign the result.
df = df.sort_values('average', ascending=False)
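One detail worth checking is that `sort_values` returns a new DataFrame by default and leaves the original untouched. The sketch below demonstrates this on made-up data; the `student_id` column name and the values are assumptions for illustration.

```python
import pandas as pd

# Made-up data; 'student_id' is a hypothetical column name.
df = pd.DataFrame({
    'student_id': [1, 2, 3],
    'average': [42.0, 77.5, 61.0],
})

# Sort by average, highest first. The result is a new DataFrame.
df_sorted = df.sort_values('average', ascending=False)

# Optionally renumber the rows 0..n-1 after sorting.
df_ranked = df_sorted.reset_index(drop=True)

print(df_sorted)
print(df)  # still in the original order
```

If you want the sorted order to persist, assign the result back to a variable as above.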

When narrowing down by conditions

To keep only the rows with an average score of 50 points or more, the list version looks like this.

list2 = []
for row in list:
	if 50 <= float(row[7]):
		list2.append(row)

This uses a for loop again, as in the average calculation. With a DataFrame, as you might expect, it fits on one line.

df2 = df[df['average'] >= 50]
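The one-line filter works because comparing a column with a number produces a boolean Series, and indexing the DataFrame with that Series keeps only the rows marked True. The sketch below splits the two steps apart on made-up data (`student_id` and the values are assumptions for illustration).

```python
import pandas as pd

# Made-up data; 'student_id' is a hypothetical column name.
df = pd.DataFrame({
    'student_id': [1, 2, 3, 4],
    'average': [49.9, 50.0, 72.4, 12.0],
})

# Step 1: the comparison yields one True/False value per row.
mask = df['average'] >= 50

# Step 2: indexing with the mask keeps only the True rows.
df2 = df[mask]

print(mask.tolist())
print(df2)
```

Note that `>=` includes a score of exactly 50, which matches "50 points or more".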

The speed of each process is ...

Each of the processes above was run 10 times and the elapsed times were averaged. The results are as follows.

| Process name | list, average of 10 runs [sec] | DataFrame, average of 10 runs [sec] |
|---|---|---|
| Average calculation | 0.764768385887146 | 0.01179955005645752 |
| Sort | 0.030899477005004884 | 0.011399650573730468 |
| Narrow down | 0.04529948234558105 | 0.006699275970458984 |

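Not only is the DataFrame code simpler to write, it is also faster: roughly 65 times faster for the average calculation, about 2.7 times for sorting, and about 6.8 times for narrowing down.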
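The article does not show its timing code, so the following is only a sketch of how such a 10-run average could be measured. The helper name `average_elapsed` and the use of `time.perf_counter` are assumptions, not the author's method.

```python
import time


def average_elapsed(func, repeats=10):
    """Return the mean wall-clock time of func() over `repeats` runs, in seconds."""
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        total += time.perf_counter() - start
    return total / repeats


# Usage example with a stand-in workload; replace the lambda with the
# list-based or DataFrame-based processing you want to time.
elapsed = average_elapsed(lambda: sum(range(100_000)))
print(f'{elapsed:.6f} sec')
```

Absolute numbers depend on the machine and the pandas version, so only the ratio between two measurements taken the same way is meaningful.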


DataFrame weaknesses

The weakness of DataFrame is the for loop. If you compute the mean value row by row in a for loop, the code looks like this.

for idx in range(len(df)):
	# Sum columns 1-5 first, then divide the whole sum by 5.
	df.iat[idx, 6] = str((float(df.iat[idx, 1]) + float(df.iat[idx, 2])
		+ float(df.iat[idx, 3]) + float(df.iat[idx, 4]) + float(df.iat[idx, 5])) / 5.0)

Averaged over 10 runs, this takes 2.33 [s], which is slower than even the list version. Therefore, when working with a DataFrame, it is best to avoid for loops as much as possible.
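To see the gap on your own machine, a sketch like the following runs the row-by-row version and the column-wise version on the same randomly generated data and checks that they agree. The data size, the score range, and the `average_loop` / `average_vec` column names are assumptions chosen for the example, so the timings will not match the figures quoted in this article.

```python
import time

import numpy as np
import pandas as pd

subjects = ['subjectA', 'subjectB', 'subjectC', 'subjectD', 'subjectE']

# Randomly generated stand-in data (assumed size and range, fixed seed).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 101, size=(1000, 5)), columns=subjects)

# Row-by-row version: one cell access per score, one cell write per row.
df['average_loop'] = 0.0
target = df.columns.get_loc('average_loop')
start = time.perf_counter()
for idx in range(len(df)):
    total = sum(df.iat[idx, j] for j in range(5))
    df.iat[idx, target] = total / 5.0
loop_time = time.perf_counter() - start

# Column-wise version: a single expression over whole columns.
start = time.perf_counter()
df['average_vec'] = df[subjects].sum(axis=1) / 5.0
vec_time = time.perf_counter() - start

print(f'loop: {loop_time:.4f} sec, column-wise: {vec_time:.4f} sec')
```

Both columns hold the same values; only the way they are computed differs.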

If you really want to use for in DataFrame

Still, there are situations where a for loop over a DataFrame is unavoidable. In such cases, you can speed things up by converting only the part that needs the loop into a list or ndarray, or by using a technique like the one in the post below.

[[Python3 / pandas] Speed improvement measures when you really want to process DataFrame line by line](https://qiita.com/siruku6/items/0633db690283a0f525ad)
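As a minimal sketch of the "convert only the looped part to ndarray" idea, the example below pulls the score columns out with `DataFrame.to_numpy()`, loops over the plain array, and writes the results back in one assignment. The data is made up, and this is one possible approach, not necessarily the one described in the linked post.

```python
import numpy as np
import pandas as pd

subjects = ['subjectA', 'subjectB', 'subjectC', 'subjectD', 'subjectE']

# Made-up scores for illustration only.
df = pd.DataFrame(
    [[10, 20, 30, 40, 50],
     [60, 70, 80, 90, 100]],
    columns=subjects,
)

# Extract just the columns the loop needs as a plain ndarray.
scores = df[subjects].to_numpy()

# Loop over the ndarray instead of the DataFrame; any per-row logic goes here.
averages = np.empty(len(scores))
for i, row in enumerate(scores):
    averages[i] = row.sum() / 5.0

# Write all results back to the DataFrame in a single assignment.
df['average'] = averages

print(df)
```

The loop never touches the DataFrame itself, so the per-cell access overhead is paid only twice: once when extracting the array and once when assigning the new column.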

About sample data creation

The sample data used this time was created with the code at the following URL.

https://github.com/HagiAyato/PythonTests/blob/main/make10000data.py
