How to use multicore CPUs to speed up your Linux commands with GNU Parallel

  • 2020-12-26 06:04:24
  • OfStack

Have you ever had to process a very large data file (a few hundred GB)? Or search inside it, or run some other operation that crawls along because it cannot use more than one core? Data experts, I'm talking to you. You may have a CPU with four or more cores, but the usual tools, such as grep, bzip2, wc, awk, and sed, are single-threaded and can only use one CPU core.

To borrow the words of the cartoon character Cartman, "How do I use all these cores?"

To get Linux commands to use all the CPU cores, we turn to the GNU Parallel command, which lets all the cores on a single machine take part in a magic map-reduce operation. It relies on the rarely used --pipe option (formerly known as --spreadstdin), which spreads standard input across the parallel jobs. That way, your load really is spread evenly across the CPU cores.
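To see what --pipe actually does, here is a minimal sketch (the input is generated just for illustration): parallel splits standard input into blocks, roughly 1 MB each by default, and feeds each block to its own wc -l, so you get one partial count per block.


seq 10000000 | parallel --pipe wc -l 

Each number printed is the line count of one block; summing them gives the total.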

BZIP2

bzip2 compresses better than gzip, but it's slow! Don't worry, we have a way around this problem.

Previous practice:


cat bigfile.bin | bzip2 --best > compressedfile.bz2 

Here it is:


cat bigfile.bin | parallel --pipe --recend '' -k bzip2 --best > compressedfile.bz2 

For bzip2 in particular, GNU Parallel is dramatically faster on a multicore CPU. Before you know it, the compression is finished.
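In this command, --recend '' tells parallel to treat the input as a raw byte stream with no record boundaries, and -k keeps the compressed blocks in input order so the resulting .bz2 file is valid. If you want to leave some cores free for other work, here is a sketch using the -j option (the job count of 4 is just an example, not part of the original command):


cat bigfile.bin | parallel --pipe --recend '' -k -j4 bzip2 --best > compressedfile.bz2 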

GREP

If you had a very large text file, you might have done this before:


grep pattern bigfile.txt 

Now you can do this:


cat bigfile.txt | parallel --pipe grep 'pattern' 

Or this:


cat bigfile.txt | parallel --block 10M --pipe grep 'pattern' 

The second version uses the --block 10M option, which hands each grep invocation a 10 MB chunk of the input; you can tune this value to control how much data each CPU core handles at a time.
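If all you need is the number of matching lines rather than the lines themselves, here is a variation (a sketch, not from the original article): each grep counts matches within its own block using -c, and a final awk adds up the partial counts.


cat bigfile.txt | parallel --pipe grep -c 'pattern' | awk '{s+=$1} END {print s}' 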

AWK

Here is an example of calculating a very large data file with the awk command.

General usage:


cat rands20M.txt | awk '{s+=$1} END {print s}' 

Here it is:


cat rands20M.txt | parallel --pipe awk \'{s+=\$1} END {print s}\' | awk '{s+=$1} END {print s}' 

This is a bit more involved: the --pipe option in the parallel command divides the cat output into blocks and dispatches each block to its own awk invocation, producing a set of partial sums. These partial sums travel through the second pipe into a final awk, which adds them up and prints the grand total. The first awk needs the three backslashes; they escape the quotes and the $ so the shell passes them through intact to the awk command that GNU Parallel launches.
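If the escaping bothers you, one way to sidestep it entirely (a sketch; sum.awk is a hypothetical file name) is to put the awk program in a file and reference it with -f in both stages:


echo '{s+=$1} END {print s}' > sum.awk 

cat rands20M.txt | parallel --pipe awk -f sum.awk | awk -f sum.awk 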

WC

Want the fastest way to count the lines in a huge file?

Traditional practice:


wc -l bigfile.txt 

Here's what you should do:


cat bigfile.txt | parallel --pipe wc -l | awk '{s+=$1} END {print s}' 

Very clever: the parallel command 'maps' the input onto a large number of wc -l invocations, each producing a partial count, and the pipeline finally sends these to awk for the grand total.
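As a quick sanity check (a sketch; the test file is generated just for illustration), the traditional command and the parallel pipeline should report the same line count:


seq 10000000 > bigfile.txt 

wc -l bigfile.txt 

cat bigfile.txt | parallel --pipe wc -l | awk '{s+=$1} END {print s}' 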

SED

Want to use the sed command to do a lot of substitutions in a huge file?

General practice:


sed s^old^new^g bigfile.txt 

Now you can:


cat bigfile.txt | parallel --pipe sed s^old^new^g 

You can then use a redirect to store the output in the file of your choice, as sketched below.
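Since line order usually matters in the edited file, here is a sketch that adds -k to keep the output blocks in input order and redirects the result (newfile.txt is just an example name):


cat bigfile.txt | parallel --pipe -k sed s^old^new^g > newfile.txt 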

