I have Python code that processes a large number of Apache logs (decompressing, parsing, number crunching, regex matching, etc.). A single parent process takes a list of files (up to a few million) and hands batches of them to workers through a multiprocessing pool.

Are there any guidelines, benchmarks, or advice that could help me estimate the ideal number of child processes? For example, is one process per core better than launching a few hundred of them?

Currently about three quarters of the script's run time is spent reading and decompressing files. In terms of resources, the CPU is 100% loaded, while memory and I/O are fine. So I assume a lot can be gained from proper multiprocessing settings. The script will run on different machines and operating systems, so OS-specific hints are welcome too.

Also, is there any benefit in using threads rather than multiprocessing?

Answer 1

You can never know in advance.

There are too many degrees of freedom.

You can only discover it empirically by running experiments until you get the level of performance you desire.

Threads don't help much. Multiple threads doing I/O will be locked up waiting while the process (as a whole) waits for the O/S to finish the I/O request.

Your operating system does a very, very good job of scheduling processes. When you have I/O intensive operations, you really want multiple processes.

Answer 2

Multiple cores do not provide better performance if the program is I/O bound. The performance might even become worse if the disk is serving two or more masters.

Answer 3

I'm not sure if current OSes do this, but it used to be that I/O buffers were allocated per-process, so dividing one process' buffer among multiple threads would lead to buffer thrashing. You're far better off using multiple processes for I/O-heavy tasks.

Answer 4

I'll address the last question first. In CPython, it is next to impossible to make sizeable performance gains by distributing CPU-bound load across threads. This is due to the Global Interpreter Lock. In that respect multiprocessing is a better bet.
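To see the effect on your own machine, a minimal sketch along these lines can time the same CPU-bound work under a thread pool and a process pool. The names `cpu_bound` and `timed_map` are made up for illustration, and the pure-Python loop is only a stand-in for the real parsing work.

```python
import multiprocessing
import time
from multiprocessing.pool import ThreadPool


def cpu_bound(n):
    """Pure-Python arithmetic loop; a stand-in for CPU-heavy parsing."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed_map(pool_factory, workers, jobs):
    """Run cpu_bound over jobs with the given pool type.

    Returns (results, elapsed wall-clock seconds).
    """
    start = time.perf_counter()
    with pool_factory(workers) as pool:
        results = pool.map(cpu_bound, jobs)
    return results, time.perf_counter() - start


if __name__ == "__main__":
    jobs = [2_000_000] * 8
    workers = multiprocessing.cpu_count()
    for name, factory in (("threads", ThreadPool),
                          ("processes", multiprocessing.Pool)):
        _, elapsed = timed_map(factory, workers, jobs)
        print(f"{name:9s}: {elapsed:.2f}s")
```

On a multi-core machine running standard CPython, I would expect the thread timing to stay close to the single-threaded time while the process timing drops, but the actual numbers depend on your hardware and interpreter build.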

As to estimating the ideal number of workers, here is my advice: run some experiments with your code, your data, your hardware and a varying number of workers, and see what you can glean from that in terms of speedups, bottlenecks etc.
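Such an experiment could be sketched as follows. This assumes gzip-compressed logs, and `parse_file` is a hypothetical placeholder (it only counts lines) for your real per-file work; `time_pool` and `benchmark` are likewise names invented for this example.

```python
import gzip
import multiprocessing
import time


def parse_file(path):
    """Hypothetical stand-in for the real work: decompress and count lines."""
    with gzip.open(path, "rt", errors="replace") as handle:
        return sum(1 for _ in handle)


def time_pool(paths, workers, pool_factory=multiprocessing.Pool, chunksize=1):
    """Return wall-clock seconds needed to process all paths with `workers` workers."""
    start = time.perf_counter()
    with pool_factory(workers) as pool:
        # imap_unordered hands out new work as soon as a worker is free.
        for _ in pool.imap_unordered(parse_file, paths, chunksize):
            pass
    return time.perf_counter() - start


def benchmark(paths, worker_counts, pool_factory=multiprocessing.Pool):
    """Map each candidate worker count to its elapsed time in seconds."""
    return {n: time_pool(paths, n, pool_factory) for n in worker_counts}
```

Run `benchmark(sample_paths, [1, 2, 4, 8, 16])` on a representative sample of your files and pick the worker count where the time stops improving. On platforms that start workers by spawning a fresh interpreter (Windows, for instance), make the call from under an `if __name__ == "__main__":` guard. Passing `multiprocessing.pool.ThreadPool` as `pool_factory` lets you compare threads against processes with the same harness.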

