Defining appropriate number of processes

I have a python code treating a lot of apache logs (decompress, parse, crunching numbers, regexping etc). One parent process which takes a list of files (up to few millions), and sends a list of files to parse to workers, using multiprocess pool.

I wonder, if there is any guidelines / benchmarks / advices which can help me to estimate ideal number of child process ? Ie. having one process per core is better than launching few hundreds of them?

Currently 3/4 time of script execution is reading files and decompressing them, and in terms of resources, its CPU which is 100% loaded, memory and I/O being ok. So I assume there is a lot which can be done with proper multiprocessing settings. Script will be running on different machines / os, so os-specific hints are welcome, too.

Also, is there any benefit in using threads rather than multiprocess?

Asked by: Ada559 | Posted: 30-11-2021

Answer 1

I wonder, if there is any guidelines / benchmarks / advices which can help me to estimate ideal number of child process ?


having one process per core is better than launching few hundreds of them?

You can never know in advance.

There are too many degrees of freedom.

You can only discover it empirically by running experiments until you get the level of performance you desire.

Also, is there any benefit in using threads rather than multiprocess?


Threads don't help much. Multiple threads doing I/O will be locked up waiting while the process (as a whole) waits for the O/S to finish the I/O request.

Your operating system does a very, very good job of scheduling processes. When you have I/O intensive operations, you really want multiple processes.

Answered by: John433 | Posted: 01-01-2022

Answer 2

Multiple cores do not provide better performance if the program is I/O bound. The performance might even become worse if the disk is serving two or more masters.

Answered by: Catherine472 | Posted: 01-01-2022

Answer 3

I'm not sure if current OSes do this, but it used to be that I/O buffers were allocated per-process, so dividing one process' buffer among multiple threads would lead to buffer thrashing. You're far better off using multiple processes for I/O-heavy tasks.

Answered by: Aida516 | Posted: 01-01-2022

Answer 4

I'll address the last question first. In CPython, it is next to impossible to make sizeable performance gains by distributing CPU-bound load across threads. This is due to the Global Interpreter Lock. In that respect multiprocessing is a better bet.

As to estimating the ideal number of workers, here is my advice: run some experiments with your code, your data, your hardware and a varying number of workers, and see what you can glean from that in terms of speedups, bottlenecks etc.

Answered by: Chloe836 | Posted: 01-01-2022

Similar questions

python - Which of these scripting languages is more appropriate for pen-testing?

Closed. This question is opinion-based. It is not c...

Help me find an appropriate ruby/python parser generator

The first parser generator I've worked with was Parse::RecDescent, and the guides/tutorials available for it were great, but the most useful feature it has was it's debugging tools, specifically the tracing capabilities ( activated by setting $RD_TRACE to 1 ). I am looking for a parser generator that can help you debug it's rules. The thing is, it has to be written in python or in ruby, and have a verbose mo...

Would python be an appropriate choice for a video library for home use software

I am thinking of creating a video library software which keep track of all my videos and keep track of videos that I already haven't watched and stats like this. The stats will be specific to each user using the software. My question is, is python appropriate to create this software or do I need something like c++.

Is Python appropriate for algorithms focused on scientific computing?

python - How do I determine the appropriate check interval?

I'm just starting to work on a tornado application that is having some CPU issues. The CPU time will monotonically grow as time goes by, maxing out the CPU at 100%. The system is currently designed to not block the main thread. If it needs to do something that blocks and asynchronous drivers aren't available, it will spawn another thread to do the blocking operation. Thus we have the main thread being almost tot...

arrays - Most appropriate data structure (Python)

I'm new to Python and have what is probably a very basic question about the 'best' way to store data in my code. Any advice much appreciated! I have a long .csv file in the following format: Scenario,Year,Month,Value 1,1961,1,0.5 1,1961,2,0.7 1,1961,3,0.2 etc. My scenario values run from 1 to 100, year goes from 1961 to 1990 and month goes from 1 to 12. My file therefore has 100*29...

python - Numpy time based vector operations where state of preceding elements matters - are for loops appropriate?

What do numpy arrays provide when performing time based calculations where state matters. In other words, where what has occurred in earlier or later in a sequence is important. Consider the following time based vectors, TIME = np.array([0., 10., 20., 30., 40., 50., 60., 70., 80., 90.]) FLOW = np.array([100., 75., 60., 20.0, 60.0, 50.0, 20.0, 30.0, 20.0, 10.0]) TEMP = np.array([300., 310...

python extend or append a list when appropriate

Is there a simple way to append a list if X is a string, but extend it if X is a list? I know I can simply test if an object is a string or list, but I was wondering if there is a quicker way than this?

python - Finding appropriate cut-off values

I try to implement Hampel tanh estimators to normalize highly asymmetric data. In order to do this, I need to perform the following calculation: Given x - a sorted list of numbers and...

python - What is an appropriate way to datamine the total number of results of a keyword search?

newbie programmer and lurker here, hoping for some sensible advice. :) Using a combination of Python, BeautifulSoup, and the Bing API, I was able to find what I wanted with the following code: import urllib2 from BeautifulSoup import BeautifulStoneSoup Appid = #My Appid query = #My query soup = BeautifulStoneSoup(urllib2.urlopen("" + Appid + "&query=" ...

Still can't find your answer? Check out these communities...

PySlackers | Full Stack Python | NHS Python | Pythonist Cafe | Hacker Earth | Discord Python