Processes Paralleling to Speed up Computing and Tasks Execution in Linux

*nix

Almost all PCs, released during the last few years, have had at least a dual core processor. So reader, if your PC isn’t extremely old or the bottom of the barrel budget brand, then most likely you are the owner of a multiple-processor system. Also if you like to play games, you should know you are utilizing potentially hundreds of GPU cores. But during the lion’s share of time all of these cores just gather dust. Let’s try to fix that.

Introduction

We rarely involve all the power of several processors (or cores) in order to solve every day problems. The high-end graphic engines are used for computer games only. I don’t like that. If I am working, why should my processors be resting? We should take on board all facilities and beauties of multi processors (multi cores) and parallelize everything possible. We should also use a heavy-duty video cards with hundreds of cores on board, that are useful for the limited group of tasks, but will essentially speed up the computing…

Linux is really strong on that score. First of all, a number of useful tools for parallelizing execution are available in most distributions right out of the box. Secondly, there is plenty of software written and developed considering multicore systems. I guess there’s no need to talk about flexibility. The only thing we may have problems are videocard drivers, but there’s nothing we can’t do about it.

At first, let’s pin down parallelizing methods. There are two. The first one is by means of application itself, which starts several parallel threads for task execution (multithreading). The second one lies in running several application copies, each of which will process a definite data chunk. In this case the operating system will allocate processes among cores or processors (multitasking).

Paralleling Right at the Terminal

We’ll begin with parallel processes starting right from the terminal window. It’s no problem to start parallel processes, which will operate for a long time, at the terminal. But what if we need two such processes? “It’s no problem either, — you’ll say, — we’ll just start the second process in another terminal window.” But what if we need to start ten or more processes? Hmm… The first thing coming to mind is to use xargs utility. If we define –max-procs=n option for it, the software will execute n processes concurrently and that’s exactly what we need. The official manual recommends to use –max-procs argument grouping (-n option) with the option. Problems with parallel start may arise without doing so. As an example, let’s suppose that we need to archive a larger number of big (or small) files:

$ cd folder/with/files
$ ls | xargs -n1 --max-procs=4 gzip

How much time have we gained? Is it worth bothering? That’s where it’s necessary to state figures. On my four-core processor usual archiving of five files, each of which was around 400 MB, took 1 minute 40 sec. When using xargs –max-procs=4 the spent time shortened almost 4 times: 34 sec! I think the solution is obvious.

Let’s try something more interesting. For example, reconvert, WAV-files to MP3 with the help of lame:

$ ls *.wav | xargs -n1 --max-procs=4 -I {} lame {} -o {}.mp3

Looks lubberly? I agree. But parallel process execution isn’t the main task for xargs, it’s just one of its facilities. Besides, xargs isn’t good at transferring special symbols, such as space or quotes. That’s when a great utility called GNU Parallel comes to our rescue. The software is available in standard repositories, but I wouldn’t recommend you to install it from there. For example, in the main Ubuntu repositories I came across a two-year old version. You’d better compile a new version from the source code:

$ wget ftp.gnu.org/gnu/parallel/parallel-latest.tar.bz2
$ tar xjf parallel-latest.tar.bz2
$ cd parallel-20130822
$ ./configure && make
# make install

The utility name speaks for its narrow specialization. Really, Parallel is much more handy for paralleling and its usage seems more logical. The example above accomplished with Parallel usage rather than xargs looks like the following:

$ ls *.wav | parallel lame {} -o {}.mp3

By the way, if you you’re using Ubuntu or Kubuntu, the example will not operate, it will throw some errors. It can be fixed by adding ‘–gnu’ key (the same for the next example). Find more details about the problem here.

Why don’t we define the quantity of concurrently executed processes? Because Parallel will do it for us. It will define the cores quantity and will start one process per core. Of course, we can define this number manually or with the help of –j option. By the way, if you need to start the task on different machines, it’s good to define this option as j +2, in order to improve portability. Which in this case means “to start concurrently two processes more than there are units in the system”.

If we make Parallel and Python friends, we will get a powerful tool for parallel task execution. For example, web pages loading and their further processing can look like the following:

$ python makelist.py | parallel -j+2 'wget "{}" -O - | python parse.py'

But this utility has plenty of other capabilities even without Python. You should certainly read the man pages – there are heaps of interesting examples. Besides Parallel and xargs, there are lots of other utilities with similar functionality, but they can’t operate like the first ones.

We’re done with this part. Let’s move on.

Parallel Compilation

Compiling something from source code is a usual task for any linuxoid. It often involves the compilation of a small group of files. Nobody thinks in terms of compilation time for such projects. But sometimes there are bigger projects and you have to wait eternally for the setup to finish. For example, in order to setup Android (AOSP) from source code (into one thread) lasts about five hours! All the cores should be used for such projects.

I think it’s obvious that compilation itself (GCC, for example) doesn’t parallelize. But big projects mostly consist of a big amount of independent modules, libraries and other things that can and must be compiled concurrently. We don’t have to think about how to parallelize the compilation as make will take care of it, but only if interactions will be written in makefile. Otherwise make won’t know in what order and what should or shouldn’t be gathered concurrently. As makefile is a developer’s tool, all we have to do is run

$ make -jN

command. After that make will begin the project setup, starting concurrently up to N tasks.

As for value choice for –j parameter, they often recommend in the Web to use 1.5 * . But it’s not always right. For example the quickest way to setup a project of about 250 MB on my four-core was with –j parameter value, which is four.

In order to gain more time, you can add ‘-pipe’ key to GCC. With this key data transfer between different compilation stages is performed via pipes, not temporary files, which fastens the process a bit (a very little bit).

Besides make, you can also try pmake — a system of concurrent setup, which is written in Python. Its application doesn’t differ from make for users. But it can be really interesting for developers as it offers more extensive facilities than the standard instrument.

Parallel Rsync

If you have ever used Rsync to synchronize with the remote server with a large number of small files, you might have noticed a delay at receiving file list stage. Can this stage be fastened with the help of parallelization? Of course, it can. A lot of time is wasted on delays in the network. In order to minimize these time losses we will run a few Rsync copies. In order not to let it copy the same files we will set each copy to a separate catalog. We’ll use –include and –exclude parameters combination for it, for example:

$ rsync -av --include="/a*" --exclude="/*" -P login@server:remote /localdir/
$ rsync -av --include="/b*" --exclude="/*" -P login@server:remote /localdir/

We can manually run few copies in deferent terminals, but Parallel can be connected up for it: $ cat directory_list.txt | parallel rsync -av --include="/{}*" --exclude="/*" ...

Turbo-Jet SSH Copying

As a rule, in order to synchronize directories between two hosts Rsync is run over SSH. Having hastened SSH-connection, we will speed up Rsync operation as well. SSH can be hastened by using the OpenSSH HPN patch set, which will fix bottlenecks in the mechanism of server and client part SSH buffering. A multithreaded version of AES-CTR algorithm is used in HPN, which increases the speed of files enciphering (activated by the flag oCipher=aes[128|192|256]-ctr). In order to check whether OpenSSH HPN is installed on your PC, type the following in the terminal:

$ ssh –V and find HPN substring beginning. If you have a standard OpenSSH, you can install HPN version like this: 
$ sudo add-apt-repository ppa:w-rouesnel/openssh-hpn
$ sudo apt-get update -y
$ sudo apt-get install openssh-server

Then add strings in /etc/ssh/sshd_config:

HPNDisabled no
TcpRcvBufPoll yes
HPNBufferSize 8192
NoneEnabled yes

Then restart SSH service. Now create Rsync/SSH/SCP-connection and estimate your gain.

Connect HPN support.

File Compression

All the speedups we performed before are based on concurrent start of several copies of the very same process. The OS Process scheduler allocated these processes between the cores (processors) of our machine. That’s how we got the speedup. And now let’s get back to the example, in which we compressed several files. But what if we need to compress one huge file with the help of the low bzip2? Fortunately, file compression can be parallelized. The file is divided into blocks that are compressed independently. But standard utilities, such as gzip and bzip2 don’t offer such functionalities. But there are plenty of external products that can do that. We’ll consider two of them only: a parallel analog of gzip — pigz and an analog of bzip2 — pbzip2. These two utilities are available at standard repositories of Ubuntu.

Pigz application doesn’t differ from operation with gzip, except for the facility to define threads quantity and the block size. In most cases, the block size can be kept default, and as amount of threads should be indicated a number which is equal (adding 1-2 more) to the quantity of system processors (cores).

$ pigz -c -p5 backup.tar > pigz-backup.tar.gz

This command execution for backup.tar file of approx 620 MB took 12.8 sec. The output file was 252.2 MB. Processing of the same file with the help of gzip:

$ gzip -c backup.tar > gzip-backup.tar.gz

It took 43 sec. The output file was about 100 kB less than the previous one of 252.1 MB. And again we got almost four-time speedup!

Pigz can parallelize compression only, but not decompression, while pbzip2 can do both. Utility use is similar to its nonparallel variant:

$ pbzip2 -c -p5 backup.tar > pbzip-backup.tar.bz2

The same file backup.tar processing took 38.8 sec, the output file size is 232.8 MB. Compression using standard bzip2 took 1 min 53 sec, with a file size of 232.7 MB.

As I have already said, decompression can also be sped up via pbzip2. But you should remember that you can only concurrently decompress those files which were compressed concurrently. I.e. files created using pbzip2. Some figures:

  • Standard decompression — 40.1 sec;
  • Five-thread decompression— 1.,3 sec.

I can only add that archives that were created using pigz and pbzip2 are fully compatible with the archives created using unparallel analogs.

Enciphering

eCryptfs is used by default for home directory enciphering in Ubuntu and all derivative distributives. At the time of writing this article eCryptfs doesn’t support multipath. And this is especially notable in folders with large quantities of small files. So if you have a multi core, you should be using something other than eCryptfs. The best replacement will be using dm-crypt or Truecrypt systems. They can cipher the entire chapters or containers, but they do support multipath.

INFO

NPF packer filter from NetBSD 6.0 allows you to achieve maximum efficiency in multi core systems by means of parallel multi path packet processing while minimizing the amount of blockings.

If you want to cipher just one definite directory, not the entire disk, try EncFS. It’s very similar to eCryptfs, but operates using FUSE, which makes it potentially slower than eCryptfs. But it supports multipath, which performs better in multicore systems. Besides it’s very simple to setup (available at most standard repositories). You just need to execute

$ encfs ~/.crypt-raw ~/crypt

Enter the passphrase and that’s it: in .crypt-raw will be located ciphered versions of the files, and in crypt – the ones not in cipher. In order to dismount EncFS execute:

$ fusermount -u ~/name

Of course, all of that can be automated. How can you do it? Read here.

Monitoring

We already can fully load the processor, but we should also monitor its operation. Almost every Linux distribution has decent facilities for monitoring the processor use, including information about each core or processor. For example, in Kubuntu KSysGuard reflects current core load (screenshot “KSysGuard in four-core system”).

KSysGuard in four-core system.

There are other interesting utilities enabling processor monitoring. The fans of console solutions will like htop – a more colorful and interactive analog of top. Also pay attention to Conky — a high end and easily adjusted system monitor. It can be easily adjusted to monitor each core load and entire processor as well. A separate chart can be made for each core. You can look at my variant of utility configuration at the screenshot.

That’s how Conky looks like

Htop in action

I put here the content of the appropriate configuration file. There are plenty of interesting configs online, which you can take as a basis and adjust for your needs.

But these utilities can give information about load only, which isn’t always enough. Mpstat from sysstat set provides more interesting information. For example, the dead time of each core, time spent on input/output waiting or the time spent on interrupts processing.

Mpstat utility output.

GPU is Not for Games Only

It’s no secret that modern GPU has great computing capacities. But due to special GPU architecture and limited instruction set, it fits limited group of tasks only. Formerly only Guru’s of shaders could perform computations on GPU. Today video card developers do their best to simplify life for enthusiasts and developers who want to involve graphic processors capacities in their projects: CUDA от NVIDIA, AMD FireStream, open standard OpenCL. Year by year computing via the GPU is becoming more and more available.

Hash Calculation

I guess the most popular task among other run on GPU is hash calculation. And it’s because of Bitcoin-mining that lies in calculating the hashes. Most Bitcoin-miners are available for Linux. If you want to mine Bitcoins and if your graphic engine supports OpenCL (if it supports CUDA, it supports OpenCL as well), I recommend you to pay attention to bfgminer: it’s fast, handy and functional. But it’s not so easy in adjustment.

Snort Speedup by Means of GPU

A very interesting concept named Gnort has been developed by researches from the Greek Institute FORTH (Foundation for Research and Technology — Hellas). They offer to increase Snort attack detection efficiency by means of shifting to GPU the code that is in charge of regular expressions check. According to charts, provided in an official research (PDF), they achieved at least a two-fold increase of Snort carrying capacity.

But Bitcoin isn’t the only one. We can also use GPU capacities for brute force hashing(in order to learn the forgotten password ;]). In solving this task oclHashcat-plus utility has proved itself. It can collate MD5 hashes with salt and without it, SHA-1, NTLM, cached domain passwords, MySQL database passwords, GRUB passwords and that’s not even a half of the list.

Enciphering on GPU

Weibin Sun and Xing Lin students from Utah have introduced, as part of KGPU project, a very interesting application of graphic engines capacities. The project’s essence lies in shifting some Linux core parts execution to CUDA-compatible GPU. AES algorithm was the first to be shifted to GPU. Unfortunately, the project development stopped at that point, though they promised to continue their work. But this idea can be used for AES-ciphering speedup in eCryptfs and dm-crypt. But unfortunately cores with version number 3.0 and are not supported.

GPU Performance Monitoring

Why not? Of course, we won’t be able to know the load of each core, but we will get at least some information about GPU. CUDA-Z program (a sort of Windows program GPU-Z analog), besides different static information about GPU, can also obtain dynamic info: the current data exchange rate between the graphic engine and the machine, and also general performance of all GPU cores in flops.

CUDA-Z inlay with GPU performance information

Summary

Multicore or multiprocessor workstations entered our everyday life quite a long time ago. So our approach when working with them should be changed. Tasks parallelized in such systems gives us a huge gain of time and we made sure of that.

Useful Links - JVM tuning for Java EE applications performance increase in a multicore system: bit.ly/JavaTuning - Multicore CPU performance measurement by means of MPI tests: bit.ly/MPIbenchmark - If you have nothing to load your computer with, give computing capacities to the study of global warming or other interesting scientific researches: https://boinc.berkeley.edu

[original source]

Comments

  1. Well written, loads of examples. Nice read during coffee break! Just discovered this website, keep on the good work! (god do I sound like a bot xD)
940

Ropes — Fast Strings

Most of us work with strings one way or another. There’s no way to avoid them — when writing code, you’re doomed to concatinate strings every day, split them into parts and access certain characters by index. We are used to the fact that strings are fixed-length arrays of characters, which leads to certain limitations when working with them. For instance, we cannot quickly concatenate two strings. To do this, we will at first need to allocate the required amount of memory, and then copy there the data from the concatenated strings.