A memo from when I was involved in a task at work that required deduplicating rows by comparing only specific columns. The job was to remove tens of millions of rows that were duplicated on specific columns from a file of roughly 300 million rows and write out the result. The version I wrote with awk and shell commands took a very long time, but rewriting it in Java achieved high performance with little effort.
The source code is in a Gist linked at the end of this page.
To state the conclusion first: Java was the fastest, and the process I wrote that made heavy use of awk was the slowest.
As mentioned above, the job reads a file of about 300 million rows and removes rows that are duplicated on the first and second columns; there are about 20 columns in total. The file is already sorted, and when duplicates exist it does not matter which of them survives, so the only requirement is that the combination of the first and second columns is unique in the output. Incidentally, there are about 40 million duplicate rows, and the number of duplicates per key is not necessarily one: several rows can share the same first and second columns.
Under these conditions, the unique option of the sort command and the uniq command cannot deduplicate rows that have the same values in the first and second columns but different values in some other column (it may be possible with sort -k 1,2 -u, but I did not verify that), so I decided to extract the key columns with awk and detect duplicates by comparing each row's key with the previous row's value.
The version I wrote first does the if inside awk: distinct.sh
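The actual script is in the Gist; the following is only a simplified sketch of the shape, with a per-line shell loop and the duplicate check done by an if inside awk. The file names and the whitespace-delimited columns are placeholders.

```sh
#!/bin/bash
# Simplified sketch (not the actual distinct.sh): a shell loop that invokes
# awk for every single line, with the duplicate check done by an if inside awk.
# input.txt / output.txt and whitespace-separated columns are placeholders.
prev=""
: > output.txt
while read -r line; do
  # awk prints the row only if its key (columns 1 and 2) differs from the previous key
  echo "$line" | awk -v prev="$prev" '{ if (($1 FS $2) != prev) print }' >> output.txt
  # remember the current key for the next iteration
  prev=$(echo "$line" | awk '{ print $1, $2 }')
done < input.txt
```

Spawning awk for every row and appending to the output file one line at a time is exactly the kind of fine-grained work that keeps the throughput this low.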
This processed only about 120-150 lines per second. At that rate it would have taken dozens of days no matter how long I let it run, so after watching it for a few minutes I started thinking about how to make it more efficient.
As a result, I limited awk to extracting the specific columns, as follows: distinct_hs.sh
This version tries to speed up the comparison and the output by not going through awk for those parts. It was in fact a little faster, reaching around 170-200 lines per second.
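Again, this is only a simplified sketch of the shape and not the actual distinct_hs.sh (which is in the Gist); the file names and delimiter are placeholders. awk is reduced to extracting the key, and the comparison and the write happen in the shell:

```sh
#!/bin/bash
# Simplified sketch (not the actual distinct_hs.sh): awk is used only to pull
# out the key columns, while the duplicate check and the write are done with
# shell built-ins. File names and whitespace-separated columns are placeholders.
prev=""
: > output.txt
while read -r line; do
  # awk only extracts the key (columns 1 and 2)
  key=$(echo "$line" | awk '{ print $1, $2 }')
  # the if and the output no longer go through awk
  if [ "$key" != "$prev" ]; then
    echo "$line" >> output.txt
  fi
  prev="$key"
done < input.txt
```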
Even so, the math still works out to about 20 days, so another option would have been to split the file, process the pieces in parallel, and handle the duplicates at the start and end of each split file separately.
But thinking about why it was slow in the first place, I judged that the problem was the fine-grained I/O, and that the same processing would be easy to write in Java with buffering. So I wrote it: FilePrinter.java
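The real FilePrinter.java is in the Gist; what follows is a minimal sketch of the same idea, reading and writing through buffers and keeping only the first row for each key. The file names and the assumption of whitespace-delimited columns are placeholders.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Minimal sketch of the buffered approach (not the actual FilePrinter.java).
// input.txt / output.txt and whitespace-delimited columns are placeholders.
public class FilePrinter {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(
                 Paths.get("input.txt"), StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(
                 Paths.get("output.txt"), StandardCharsets.UTF_8)) {
            String prevKey = null;
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split("\\s+", 3);
                if (cols.length < 2) {
                    continue; // skip blank/malformed lines in this sketch
                }
                // key = first two columns; the file is already sorted on them,
                // so comparing against the previous key is enough
                String key = cols[0] + "\t" + cols[1];
                if (!key.equals(prevKey)) {
                    writer.write(line);
                    writer.newLine();
                    prevKey = key;
                }
            }
        }
    }
}
```

With buffered reads and writes the program touches the disk in large chunks instead of once per line, which is where the difference comes from.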
The result was clear: 185k-190k lines per second, and the whole job finished in less than 30 minutes.
To line up the speed of each approach: about 120-150 lines/s for distinct.sh, about 170-200 lines/s for distinct_hs.sh, and about 185k-190k lines/s for FilePrinter.java. (I just wanted to line them up; the gap is too overwhelming to make a meaningful comparison chart.)
Shell scripts (and awk) are good for simple, small-scale processing, but this was a good opportunity to learn that when dealing with a large amount of data, you need to adapt how you write the processing to the data, or reconsider the tool you use in the first place.
Click here for the Gist: distinct rows with column 1 and 2 · GitHub