Why parallelize? To speed things up. When I introduced MPI parallelization earlier, I was asked whether there is a simpler way to parallelize, so this time I will introduce a kind of parallelism you can use easily as long as you have multiple cores. Try it on a machine whose CPU has two or more cores. The examples target Linux, but feel free to try them on other operating systems as well.
First, prepare a target program. If you already have a program that uses arrays, use that; an array size of around 1000 is about right. If not, I have prepared the following program, so copy it and try it.
#include <stdio.h>
#include <time.h>

int main()
{
    int n, m, i, j, k;
    double a[5000][600];          /* large arrays on the stack: roughly 48 MB in total */
    double b[600][5000];
    double us, t0, t1, t2, sum;

    t0 = clock();
    us = 10.0;
    n = 5000;
    m = 600;

    /* fill b with some values */
    for (j = 0; j < m; j++) {
        for (i = 0; i < n; i++) {
            b[j][i] = (j + i + 2) / 10.0;
        }
    }

    /* the heavy part: a transposed copy with an offset, repeated 200 times */
    for (k = 0; k < 200; k++) {
        for (j = 0; j < m; j++) {
            for (i = 0; i < n; i++) {
                a[i][j] = b[j][i] + us;
            }
        }
        us = us + 1.2;
    }

    /* sum everything up so the result can be checked */
    sum = 0.0;
    for (j = 0; j < m; j++) {
        for (i = 0; i < n; i++) {
            sum = sum + a[i][j];
        }
    }

    t1 = clock();
    t2 = (t1 - t0) * 1.0e-6;      /* assumes CLOCKS_PER_SEC == 1000000, which holds on Linux */
    printf(" TIME=%f data= %f\n", t2, sum);
}
This program has no physical meaning; it is just a practice program that fills an array with values and sums them up.
Let's use it as our base. Call the file sample2.c (any name ending in .c will do).
[]$ gcc sample2.c
[]$ ./a.out
Segmentation fault
[]$
Ah, that was careless of me: arrays this big blow past the default stack size limit. Let's remove the restriction.
[]$ ulimit -s unlimited
[]$ ./a.out
TIME=3.898616 data= 1586700000.000000
[]$
Let's add an optimization option.
[]$ gcc -O2 sample2.c
[]$ ./a.out
TIME=3.840020 data= 1586700000.000000
[]$
The optimization option doesn't seem to help much here. Let's do a little unrolling (the unrolled loop itself is shown near the end of this article).
[]$ vi sample2a.c (I modified sample2.c and saved it as sample2a.c)
[]$ gcc sample2a.c
[]$ ./a.out
TIME=1.300404 data= 1586700000.000000
[]$
That helped. Now let's add the optimization option as well.
[]$ gcc -O2 sample2a.c
[]$ ./a.out
TIME=1.192950 data= 1586700000.000000
[]$
It works well. Since the theme this time is SMP parallelism, I will leave further tuning for speed to another occasion.
Now let's run in parallel.
We insert a parallelization directive into the plain program. The heaviest part is the following loop, so insert an OpenMP parallelization directive as shown below.
[]$ cp sample2.c sample2b.c
[]$ vi sample2b.c
for (k = 0; k < 200; k++) {
    #pragma omp parallel for private(i, j)
    for (j = 0; j < m; j++) {
        for (i = 0; i < n; i++) {
            a[i][j] = b[j][i] + us;
        }
    }
    us = us + 1.2;
}
This is a triple loop (k, j, i) and the heaviest part of the program, so also try moving the directive to a different loop level and compare. There are various other clauses and directives as well, so look them up and experiment for yourself (one such variant is sketched after the next code fragment). You can also put directives on the other loops in the program, but the effect is limited because their granularity is small; still, give it a try. For the sum loop, add reduction(+:sum), like this:
sum = 0.0;
#pragma omp parallel for private(i, j) reduction(+:sum)
for (j = 0; j < m; j++) {
    for (i = 0; i < n; i++) {
        sum += a[i][j];
    }
}
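As one example of those other clauses, the two parallelizable loops of the heavy part can also be merged with collapse(2), which can help when the outer loop alone does not provide enough work per thread. This variant is my own addition and was not measured in this article; collapse requires OpenMP 3.0 or later (roughly gcc 4.4 and newer).
for (k = 0; k < 200; k++) {
    /* collapse(2) fuses the j and i loops into one iteration space
       before distributing it to the threads */
    #pragma omp parallel for private(i, j) collapse(2)
    for (j = 0; j < m; j++) {
        for (i = 0; i < n; i++) {
            a[i][j] = b[j][i] + us;
        }
    }
    us = us + 1.2;
}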
The directives entered here are only the most basic form. Now let's compile; add "-fopenmp" to the options.
[]$ gcc -O2 -fopenmp sample2b.c
This is OK.
Let's run it.
[]$ ./a.out
TIME=3.932937 data= 1586700000.000000
[]$
It should run normally. Next, let's finally run it in parallel. First, the magic incantation:
[]$ export OMP_NUM_THREADS=2
[]$ ./a.out
TIME=3.945268 data= 1586700000.000000
[]$
Hmm, it's slower...?? Oh, I forgot: in C, clock() returns the CPU time added up over all the threads, so it cannot show a wall-clock speedup. Use the time command for measurement; the line labeled "real" is the actual elapsed execution time.
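Incidentally, if you prefer to measure wall-clock time inside the program itself instead of using the time command, OpenMP provides omp_get_wtime(), which returns elapsed seconds and is not summed over threads. Here is a minimal, self-contained sketch of my own, separate from the measurements that follow:
#include <stdio.h>
#include <omp.h>

int main(void)
{
    double w0, w1, s = 0.0;
    int i;

    w0 = omp_get_wtime();                  /* wall-clock start, in seconds */
    #pragma omp parallel for reduction(+:s)
    for (i = 0; i < 100000000; i++) {
        s += i * 1.0e-8;
    }
    w1 = omp_get_wtime();                  /* wall-clock end */

    printf(" WALL=%f data= %f\n", w1 - w0, s);
    return 0;
}
Compile it with gcc -O2 -fopenmp as before. Now, back to the runs measured with time.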
[]$ export OMP_NUM_THREADS=1
[]$ time ./a.out
TIME=3.826071 data= 1586700000.000000
real 0m3.836s
user 0m3.816s
sys 0m0.012s
[]$ export OMP_NUM_THREADS=2
[]$ time ./a.out
TIME=3.908159 data= 1586700000.000000
real 0m1.960s
user 0m3.895s
sys 0m0.016s
[]$ export OMP_NUM_THREADS=4
[]$ time ./a.out
TIME=5.014193 data= 1586700000.000000
real 0m1.284s
user 0m4.974s
sys 0m0.044s
[]$
How was it? Did it get faster? If your machine has more cores, try up to the number of cores. The machine I used this time has 4 cores, so it only got somewhat faster, but with enough cores it should scale nicely. Also try making the arrays a little bigger. If the data value changes, it means you put the directive on a part that cannot be parallelized as-is; even then it can often be parallelized correctly with some care, so give it a try. If it does not run in parallel at all, you may have made a mistake in the directive, or your gcc may be too old to support OpenMP. When you check the version, you will usually see a newer number than the example below.
[]$ gcc --version
gcc (GCC) 4.9.3
Copyright (C) 2015 Free Software Foundation, Inc.
If you have a lot of processing to do and there are enough CPU cores within the node, consider parallelizing with OpenMP (introduced this time). If the machine environment allows it, also consider MPI (introduced previously). If the amount of data is not that large, plain unrolling alone may be faster. Choose the speed-up approach that fits your amount of data.
If you are wondering about the unrolling, refer to the following. It is an example of 8-step unrolling: it raises the computation density inside the loop so that more work is done per iteration. The optimal number of steps differs depending on the calculation, so experiment. Note that the array dimension must divide evenly by the number of steps; otherwise you have to handle the remainder properly, or you will get a segmentation fault or the results will change (a remainder-loop sketch follows the unrolled code below).
for (k = 0; k < 200; k++) {
    for (j = 0; j < m; j += 8) {
        for (i = 0; i < n; i++) {
            a[i][j]   = b[j][i]   + us;
            a[i][j+1] = b[j+1][i] + us;
            a[i][j+2] = b[j+2][i] + us;
            a[i][j+3] = b[j+3][i] + us;
            a[i][j+4] = b[j+4][i] + us;
            a[i][j+5] = b[j+5][i] + us;
            a[i][j+6] = b[j+6][i] + us;
            a[i][j+7] = b[j+7][i] + us;
        }
    }
    us = us + 1.2;
}
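If m were not a multiple of 8, one way to avoid running past the end of the arrays is to unroll only up to the largest multiple of 8 and finish the leftover columns with a plain remainder loop. This is just a sketch of mine, not part of the measured code; the variable m8 is an addition of my own.
for (k = 0; k < 200; k++) {
    int m8 = m - (m % 8);               /* largest multiple of 8 not exceeding m */
    for (j = 0; j < m8; j += 8) {
        for (i = 0; i < n; i++) {
            a[i][j]   = b[j][i]   + us;
            a[i][j+1] = b[j+1][i] + us;
            a[i][j+2] = b[j+2][i] + us;
            a[i][j+3] = b[j+3][i] + us;
            a[i][j+4] = b[j+4][i] + us;
            a[i][j+5] = b[j+5][i] + us;
            a[i][j+6] = b[j+6][i] + us;
            a[i][j+7] = b[j+7][i] + us;
        }
    }
    /* remainder loop: handles the last m % 8 columns one at a time */
    for (j = m8; j < m; j++) {
        for (i = 0; i < n; i++) {
            a[i][j] = b[j][i] + us;
        }
    }
    us = us + 1.2;
}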
The idea is to keep the innermost loop accessing memory contiguously so that pipelined processing works, and to stride over the outer loop. The example deliberately uses arrays whose subscripts are transposed; aligning the subscripts is faster, and cache-blocking tuning also seems to be effective (a rough blocking sketch follows). Try it. Since parallelism is the theme here, let's check the parallel behavior after this tuning.
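For reference, a rough cache-blocking sketch of the heavy loop might look like the following. The block size bs and the extra loop variables ii, jj, imax, jmax are my own choices for illustration and would need tuning for your cache; this variant was not measured in this article.
int ii, jj, imax, jmax;                 /* extra variables for blocking (declare with the others) */
int bs = 64;                            /* block size: a guess, tune for your cache */

for (k = 0; k < 200; k++) {
    for (jj = 0; jj < m; jj += bs) {
        jmax = (jj + bs < m) ? jj + bs : m;
        for (ii = 0; ii < n; ii += bs) {
            imax = (ii + bs < n) ? ii + bs : n;
            /* work on one bs x bs tile so the touched cache lines of a stay resident */
            for (j = jj; j < jmax; j++) {
                for (i = ii; i < imax; i++) {
                    a[i][j] = b[j][i] + us;
                }
            }
        }
    }
    us = us + 1.2;
}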
[]$ gcc -fopenmp sample2a.c
[]$ export OMP_NUM_THREADS=2
[]$ time ./a.out
TIME=2.066413 data= 1586700000.000000
real 0m1.038s
user 0m2.053s
sys 0m0.016s
[]$ export OMP_NUM_THREADS=4
[]$ time ./a.out
TIME=5.264482 data= 1586700000.000000
real 0m1.344s
user 0m5.206s
sys 0m0.062s
[]$
It became slower with 4 threads. This is probably because the processing granularity became smaller, so the work done before and after each parallel loop (fork and join) outweighed the gain from parallelization (a sketch for reducing that overhead follows below). Where this limit lies depends on the processing content, the amount of work, and the machine's performance, so check it in your own environment. This time I used a 4-core i5 machine + VirtualBox + Vine Linux.
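For what it is worth, one common way to reduce this fork/join overhead is to create the thread team once, outside the k loop, and reuse it with a worksharing for inside. The sketch below is my own and was not measured here; the single construct lets one thread update us, and its implicit barrier (together with the one at the end of the for) keeps the threads in step between k iterations.
#pragma omp parallel private(i, j, k)
{
    for (k = 0; k < 200; k++) {
        /* worksharing loop: reuses the existing thread team, no new fork */
        #pragma omp for
        for (j = 0; j < m; j++) {
            for (i = 0; i < n; i++) {
                a[i][j] = b[j][i] + us;
            }
        }
        /* only one thread updates the shared us; an implicit barrier follows */
        #pragma omp single
        us = us + 1.2;
    }
}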
That's all.