I heard someone say that "malloc is slow, so it's faster to keep memory around and reuse it." But if you really want to do that, you need locking, don't you? While mulling it over, I figured that if it's true, it would be worth writing a wrapper. Armchair theorizing won't settle this kind of question, so let's actually measure it!
This time, **we will run repeated calloc/free tests on the following four variants and measure them**. I chose calloc because I personally use it more often; in hindsight, malloc would have been a purer benchmark.
The wrapper libraries simply fix in advance the number and size of the memory blocks to pool. So, as a prediction before measuring: within the pooled count and size, 2 and 3 should be fast; outside that range, 1 should hold its own and 4 should not. Let's see what actually happens.
The wrapper has an init function called first, which builds a list of memory blocks. Each list node carries a used/unused flag; calloc hands out an unused block and moves its node to the tail of the list. That way the head side of the list is always unused memory and the tail side is in-use memory.
Conversely, on free, the in-use blocks are searched from the tail for one matching the pointer being freed. A match is not actually freed; it is flagged unused and moved back to the head.
Since the head and tail positions are tracked, the search on free worries me, but everything else can hand memory out immediately. Looks promising, doesn't it!?
2018/08/05: Library published. It's mempool.
The weak point of the wrapper above is still the behavior on free: when almost all blocks are in use, you end up searching almost the entire list. I was wondering whether I'd have to rethink the list structure, but I found a good trick. On this site, my approach is even introduced as a bad answer, lol. If you adopt the final answer's precondition, "**all memory blocks are the same size, and that size is a power of two bytes**", the block's position can be computed immediately with bit operations. Wonderful!
Based on the above, I reworked the wrapper to remove the bottleneck.
First, at init time, the list nodes and the memory blocks themselves are laid out contiguously, and the block size is restricted to a power of two bytes. The blocks then sit at multiples of 2^x, so on free you can find a block's position just by shifting the offset right x bits.
Now, what happens on free? First, whether the pointer being freed is under our management can be told from the address range of the block area created at init, so the full search the first wrapper did on every free is no longer needed. If the pointer is in range, a bit shift gives the position, and the corresponding list node is taken out and moved to the head. The search process is gone entirely. That should be faster, right?
For computing the shift amount x, I referred to the bit-position-finding code here. It only runs during init, so it doesn't affect this measurement, but it's impressive that you don't need to grind through the shifts one by one.
I figured this was about as far as I could take it, so let's measure.
The measurement tool and source code are located below. https://github.com/developer-kikikaikai/speedtest
After building, you can run the measurements by executing run.rb in sample/mallocspeed.
The wrapper libraries are set up to pool 1024 blocks of 1024 bytes. Under these conditions, each of the following measurements is run 10 times at 50, 100, 500, 1024, and 10000 iterations, and the total time is output.
Timing is done inside the speedtest library. (The timing log is buffered in memory and dumped at the end, so I don't think the measurement overhead amounts to much.)
Environment: Linux Ubuntu 18.04 (on a VM), 8 GByte memory, 4 CPUs
Description | Meaning |
---|---|
Action column | which measurement the row represents |
Total time column | total time in seconds; `e-3` means x 1/10^3 (so `0.15806e-3` is 0.15806 x 10^-3 s) |
XXX calloc (calloc/free under initial size) | 1024-byte calloc/free measurement |
XXX calloc (calloc/free initial size + over size) | mixed 2048-byte (1024 x 2) and 1024-byte calloc/free measurement |
XXX = normal | ordinary calloc/free (written as "normal" below) |
XXX = wrap | the first wrapper |
XXX = hhs_calloc | the bottleneck-elimination version (written as "hhs_calloc" below) |
XXX = tc_calloc | tc_calloc/tc_free (written as "tc_calloc" below) |
** 50 times x 10 measurements **
Action | Total time |
---|---|
normal calloc(calloc/free under initial size) | 0.15806e-3 |
wrap calloc code(calloc/free under initial size) | 0.75032e-4 |
hhs_calloc(calloc/free under initial size) | 0.135938e-3 |
tc_calloc(calloc/free under initial size) | 0.186022e-3 |
normal calloc(calloc/free initial size+over size) | 0.711725e-3 |
wrap calloc(calloc/free initial size+over size) | 0.487427e-3 |
hhs_calloc(calloc/free initial size+over size) | 0.443236e-3 |
tc_calloc(calloc/free initial size+over size) | 0.702283e-3 |
Both wrappers are fast while the iteration count is small. The first wrapper is especially strong when everything stays within range, and hhs_calloc already leads once oversized requests are mixed in. Both look good, don't they?
(Legend items 1 and 2 are omitted below.)
** 100 times x 10 **
Action | Total time |
---|---|
normal calloc(calloc/free under initial size) | 0.283713e-3 |
wrap calloc code(calloc/free under initial size) | 0.153599e-3 |
hhs_calloc(calloc/free under initial size) | 0.272328e-3 |
tc_calloc(calloc/free under initial size) | 0.293742e-3 |
normal calloc(calloc/free initial size+over size) | 0.1394836e-2 |
wrap calloc(calloc/free initial size+over size) | 0.1395083e-2 |
hhs_calloc(calloc/free initial size+over size) | 0.1244906e-2 |
tc_calloc(calloc/free initial size+over size) | 0.1337618e-2 |
This is still only about a tenth of the maximum pooled count, but the first wrapper's speed has already dropped off sharply. Around here the ranking changes on every run.
** 500 times x 10 **
Action | Total time |
---|---|
normal calloc(calloc/free under initial size) | 0.1453858e-2 |
wrap calloc code(calloc/free under initial size) | 0.301273e-2 |
hhs_calloc(calloc/free under initial size) | 0.1291004e-2 |
tc_calloc(calloc/free under initial size) | 0.13424e-2 |
normal calloc(calloc/free initial size+over size) | 0.7863012e-2 |
wrap calloc(calloc/free initial size+over size) | 0.44953204e-1 |
hhs_calloc(calloc/free initial size+over size) | 0.6339118e-2 |
tc_calloc(calloc/free initial size+over size) | 0.6493924e-2 |
At about half the maximum pooled count, the first wrapper was by far the slowest. It fell off faster than I expected (-_-;). tc_calloc's strength also starts to stand out from around here.
** 1024 times x 10 **
Action | Total time |
---|---|
normal calloc(calloc/free under initial size) | 0.2824874e-2 |
wrap calloc code(calloc/free under initial size) | 0.20755398e-1 |
hhs_calloc(calloc/free under initial size) | 0.2560653e-2 |
tc_calloc(calloc/free under initial size) | 0.2751811e-2 |
normal calloc(calloc/free initial size+over size) | 0.16774762e-1 |
wrap calloc(calloc/free initial size+over size) | 0.54582233e-1 |
hhs_calloc(calloc/free initial size+over size) | 0.13983322e-1 |
tc_calloc(calloc/free initial size+over size) | 0.14046191e-1 |
We have just reached the maximum pooled count. hhs_calloc holds up well even in the case where out-of-range allocations are unexpectedly mixed in.
** 10000 times x 10 **
Action | Total time |
---|---|
normal calloc(calloc/free under initial size) | 0.30325826e-1 |
wrap calloc code(calloc/free under initial size) | 0.5094687e-1 |
hhs_calloc(calloc/free under initial size) | 0.30124209e-1 |
tc_calloc(calloc/free under initial size) | 0.26917778e-1 |
normal calloc(calloc/free initial size+over size) | 0.176232715e0 |
wrap calloc(calloc/free initial size+over size) | 0.221244686e0 |
hhs_calloc(calloc/free initial size+over size) | 0.176962985e0 |
tc_calloc(calloc/free initial size+over size) | 0.146961605e0 |
tc_calloc finally takes the top spot; it seems the larger the count, the stronger it gets. At this point hhs_calloc is no different from normal calloc, so its lead only barely survives up to 1024 iterations.
The following result was interesting, so I'll post it as well. It is the same experiment run as a tc_malloc measurement: since every calloc/free is replaced by tc_malloc's versions, each result comes out differently from the first round.
** 50 times x 10 measurements **
Action | Total time |
---|---|
normal calloc(calloc/free under initial size) | 0.14662e-3 |
wrap calloc code(calloc/free under initial size) | 0.7501e-4 |
hhs_calloc(calloc/free under initial size) | 0.14337e-3 |
tc_calloc(calloc/free under initial size) | 0.22985e-4 |
normal calloc(calloc/free initial size+over size) | 0.803501e-3 |
wrap calloc(calloc/free initial size+over size) | 0.28677e-3 |
hhs_calloc(calloc/free initial size+over size) | 0.271435e-3 |
tc_calloc(calloc/free initial size+over size) | 0.117837e-3 |
Here tc_calloc is already far ahead. I assumed that, with calloc/free completely replaced, it would perform about the same as plain calloc/free, so this surprised me.
** 100 times x 10 **
Action | Total time |
---|---|
normal calloc(calloc/free under initial size) | 0.27454e-3 |
wrap calloc code(calloc/free under initial size) | 0.147085e-3 |
hhs_calloc(calloc/free under initial size) | 0.270945e-3 |
tc_calloc(calloc/free under initial size) | 0.43941e-4 |
normal calloc(calloc/free initial size+over size) | 0.1559897e-2 |
wrap calloc(calloc/free initial size+over size) | 0.759443e-3 |
hhs_calloc(calloc/free initial size+over size) | 0.489991e-3 |
tc_calloc(calloc/free initial size+over size) | 0.255653e-3 |
tc_calloc's speed is undiminished. Otherwise little changes, although the out-of-range cases are affected by the calloc/free override and the wrappers come out faster there.
** 500 times x 10 **
Action | Total time |
---|---|
normal calloc(calloc/free under initial size) | 0.1373838e-2 |
wrap calloc code(calloc/free under initial size) | 0.1687707e-2 |
hhs_calloc(calloc/free under initial size) | 0.1296464e-2 |
tc_calloc(calloc/free under initial size) | 0.262718e-3 |
normal calloc(calloc/free initial size+over size) | 0.7076653e-2 |
wrap calloc(calloc/free initial size+over size) | 0.13166146e-1 |
hhs_calloc(calloc/free initial size+over size) | 0.2556278e-2 |
tc_calloc(calloc/free initial size+over size) | 0.1392622e-2 |
tc_calloc goes without saying. Considering the first round's results, hhs_calloc is also holding up well.
** 1024 times x 10 **
Action | Total time |
---|---|
normal calloc(calloc/free under initial size) | 0.2763511e-2 |
wrap calloc code(calloc/free under initial size) | 0.6518494e-2 |
hhs_calloc(calloc/free under initial size) | 0.2707665e-2 |
tc_calloc(calloc/free under initial size) | 0.561575e-3 |
normal calloc(calloc/free initial size+over size) | 0.14081652e-1 |
wrap calloc(calloc/free initial size+over size) | 0.16251563e-1 |
hhs_calloc(calloc/free initial size+over size) | 0.4127922e-2 |
tc_calloc(calloc/free initial size+over size) | 0.3438721e-2 |
Even in the range the wrappers should be best at, the gap keeps widening. tc_calloc clearly reuses memory well, too.
** 10000 times x 10 **
Action | Total time |
---|---|
normal calloc(calloc/free under initial size) | 0.27661637e-1 |
wrap calloc code(calloc/free under initial size) | 0.18210164e-1 |
hhs_calloc(calloc/free under initial size) | 0.12308628e-1 |
tc_calloc(calloc/free under initial size) | 0.9218671e-2 |
normal calloc(calloc/free initial size+over size) | 0.148160442e0 |
wrap calloc(calloc/free initial size+over size) | 0.84871154e-1 |
hhs_calloc(calloc/free initial size+over size) | 0.68569389e-1 |
tc_calloc(calloc/free initial size+over size) | 0.65872532e-1 |
So there are libraries that beat mine even on the home ground I set up for it. A sad day for the old guy who wrote it, lol.
The comparison above was simplistic, so this time let's build a scenario based on the use case of the application I have in mind.
The target application is a "multithreaded lighttpd", so we will measure under that assumption.

- Measure requests from all connecting clients (target: 100,000)
- Rough memory needed for one request/response: about 4096 bytes x 256 (1 MByte)
- Rough number of connections lighttpd handles (holds) at one time: 10
So
Element | Conditions |
---|---|
Memory pool | 10 clients ⇒ 4096 bytes x 256 x 10 pooled |
Test case | 10 of the clients are connected at a time. After allocating that much memory, it is released repeatedly. Since clients get their responses back at different times, the release order is varied as appropriate. |
Assumed total connecting clients | 2000 (10,000 looked like it would take too long, so I stopped there) |
Under these conditions, each measurement is performed 5 times and the total time is output.
Call 5 times, input is 512000
Action | Total time |
---|---|
normal calloc(calloc/free under initial size) | 0.2694234765e1 |
wrap calloc code(calloc/free under initial size) | 0.27786253728e2 |
hhs_calloc(calloc/free under initial size) | 0.673090478e0 |
tc_calloc(calloc/free under initial size) | 0.617789121e0 |
hhs_calloc> = tc_calloc> Normal> First wrapper
The top two are close enough that their order changes with execution timing.
Call 5 times, input is 512000 (with calloc/free replaced by tc_malloc)
Action | Total time |
---|---|
normal calloc(calloc/free under initial size) | 0.593853921e0 |
wrap calloc code(calloc/free under initial size) | 0.4174470792e1 |
hhs_calloc(calloc/free under initial size) | 0.6199698e0 |
tc_calloc(calloc/free under initial size) | 0.592267608e0 |
tc_calloc> = normal> = hhs_calloc> first wrapper
In the previous round, the top three swapped places on every run. In this result, however, the calloc that tc_malloc replaces also affects hhs_calloc, so let's compare again, taking only the tc_calloc figure from the tc_malloc build.
Action | Total time |
---|---|
normal calloc(calloc/free under initial size) | 0.2694234765e1 |
wrap calloc code(calloc/free under initial size) | 0.27786253728e2 |
hhs_calloc(calloc/free under initial size) | 0.673090478e0 |
tc_calloc(calloc/free under initial size) | 0.592267608e0 |
hhs_calloc> tc_calloc> Normal> First wrapper
Looking at it this way, hhs_calloc may be interesting if tuned well.
2018/06/10: A comment made me curious, so I also tried hhs_calloc with the lock removed. Here it is side by side with the tc_calloc result.
Action | Total time |
---|---|
hhs_calloc(calloc/free under initial size) | 0.589852316e0 |
tc_calloc(calloc/free under initial size) | 0.592267608e0 |
Oh! The winner changes every time I measure, but without the lock, hhs_calloc can beat tc_calloc. This could get interesting with careful thread usage!
- It's true that **malloc/free can be made faster if you pin down how it will be used and wrap it to fit the situation**. But if you don't think about speed seriously, a wrapper can just as easily make things slower.
- **If you want speed, tc_malloc is fast.** When adopting it, it is faster to call the tc_-prefixed functions explicitly than to rely on the transparent override.
By the way, I haven't looked at memory usage at all. If tc_malloc's memory footprint turns out to be huge, writing your own wrapper may still be worthwhile.
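For completeness, the two usual ways to pull in tcmalloc from gperftools: preloading it over the standard allocator, or linking it and calling the tc_-prefixed functions directly. The library path and program name below are assumptions; check your distro:

```shell
# Option 1: transparent override, no recompile (library path is distro-dependent)
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so ./your_app

# Option 2: link tcmalloc and call tc_calloc/tc_free explicitly in the code
gcc -O2 -o your_app your_app.c -ltcmalloc
```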
Bottleneck-elimination reference: How to realize a large number of high-speed and memory-efficient sets using a large memory area of alignment
Bit-shift reference: [C#] "Amazing" code to find the rightmost set bit position
Measurement (for studying Ruby processing and floating point when aggregating results): About decimals, Float and BigDecimal in Ruby (for beginners)