Multiprocessing in Python
Notes on multiprocessing: this post introduces the Process and Pool classes and compares them at the end.
What exactly is a multiprocessing system? We have the following possibilities:
- A multiprocessor: a computer with more than one central processor
- A multi-core processor: a single computing component with more than one independent processing unit (core)
In either case, the CPU can execute multiple tasks at once by assigning a processor to each task.
As CPU manufacturers keep adding more and more cores to their processors, creating parallel code is a great way to improve performance. Python provides the multiprocessing module to let us write parallel code. Because CPUs are now multi-core, it is natural to use multiple processes for parallel computation.
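As a minimal sketch (the variable name is illustrative), you can check how many cores are available before deciding how much work to run in parallel:

```python
import multiprocessing

# Number of CPU cores visible to Python; a rough upper bound on how
# many processes can truly run in parallel.
n_cores = multiprocessing.cpu_count()
print(f"Available cores: {n_cores}")
```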
Depending on the application, two common approaches in parallel programming are to run code via threads or via multiple processes. If we submit "jobs" to different threads, those jobs can be pictured as "sub-tasks" of a single process, and those threads will usually have access to the same memory areas (i.e., shared memory). This approach can easily lead to conflicts in case of improper synchronization, for example, if two threads write to the same memory location at the same time.
Multiple threads share the resources of a single process, so they can easily end up reading and writing the same memory area at the same time; multiprocessing, by contrast, sidesteps this kind of conflict because each process has its own memory.
The Process class
Process is used when function-based parallelism is required: we can define different functions, each taking its own parameters, and run them in parallel even though they perform completely different kinds of computation.
Two important methods belong to the Process class: start() and join().
If we create a process object, nothing will happen until we tell it to start processing via the start() method. Then the process will run and return its result. After that we tell the process to complete via the join() method. In short, calling start() launches the process, and calling join() waits for it to finish.
Without a join() call, the process will remain idle and won't terminate. So if you create many processes and never join them, you may face a scarcity of resources and have to kill them manually. (Always calling join() is therefore a good habit.)
- .start() launches a process, and does so asynchronously.
- .join() on a Process does block until that process has finished, but because we call .start() on both p1 and p2 before joining (see the sketch below), both processes run concurrently. The interpreter will, however, wait until p1 finishes before attempting to wait for p2 to finish.
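Here is a minimal sketch of the start()/join() pattern with two processes p1 and p2; the worker function and its printout are illustrative stand-ins:

```python
from multiprocessing import Process

def worker(name):
    # A stand-in task; real code would do actual computation here.
    print(f"Running task {name}")

if __name__ == "__main__":
    p1 = Process(target=worker, args=("p1",))
    p2 = Process(target=worker, args=("p2",))
    p1.start()  # both children start running asynchronously
    p2.start()
    p1.join()   # wait for p1 first...
    p2.join()   # ...then for p2
    print("Both processes finished")
```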
To share data between different processes, you can use the Queue class from multiprocessing.
Python's multiprocessing module provides a Queue class that is exactly a first-in-first-out data structure. It can store any picklable Python object (though simple ones are best) and is extremely useful for sharing data between processes. Queues are especially useful when passed as a parameter to a Process's target function, enabling the process to consume or produce data. Using put() we insert data into the queue, and using get() we take items out, in first-in-first-out order. See the following code for a quick example.
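A quick sketch, using a hypothetical square task, of passing a Queue into a Process so the parent can consume the results:

```python
from multiprocessing import Process, Queue

def square(numbers, queue):
    # The child puts each result on the shared queue.
    for n in numbers:
        queue.put(n * n)

if __name__ == "__main__":
    queue = Queue()
    p = Process(target=square, args=([1, 2, 3, 4], queue))
    p.start()
    p.join()
    # Items come back in first-in-first-out order: 1, 4, 9, 16.
    while not queue.empty():
        print(queue.get())
```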
Process objects also have some other commonly used members: pid, name, and the is_alive() method, for getting the process ID, getting the process name, and checking whether the process is still alive.
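A small sketch showing these members; the name "demo-worker" and the sleeping task are illustrative:

```python
from multiprocessing import Process
import time

def work():
    time.sleep(1)  # stand-in for real work

if __name__ == "__main__":
    p = Process(target=work, name="demo-worker")
    p.start()
    print(p.pid)         # OS-level process ID
    print(p.name)        # "demo-worker"
    print(p.is_alive())  # True while the child is still running
    p.join()
    print(p.is_alive())  # False once the child has finished
```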
The Lock mechanism is not used all that often, but it is worth writing down.
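A minimal sketch of a Lock guarding a shared counter; multiprocessing.Value is one common way to share a plain integer, and the worker function here is illustrative:

```python
from multiprocessing import Process, Lock, Value

def add(counter, lock):
    for _ in range(10000):
        with lock:               # only one process updates at a time
            counter.value += 1

if __name__ == "__main__":
    lock = Lock()
    counter = Value("i", 0)      # an integer in shared memory
    workers = [Process(target=add, args=(counter, lock)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(counter.value)         # 40000; without the lock it could be less
```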
How to retrieve results in a particular order? (You can include each input's index in the result, so the original order can be reconstructed afterwards; see the sketch below.)
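One way to do this, sketched here with hypothetical data: each worker tags its result with the input index, and the parent sorts by that index afterwards:

```python
from multiprocessing import Process, Queue

def work(index, text, queue):
    # Tag each result with its input index so order can be restored.
    queue.put((index, text.upper()))

if __name__ == "__main__":
    data = ["alpha", "bravo", "charlie"]
    queue = Queue()
    workers = [Process(target=work, args=(i, s, queue))
               for i, s in enumerate(data)]
    for w in workers:
        w.start()
    results = [queue.get() for _ in workers]  # arrival order is arbitrary
    for w in workers:
        w.join()
    results.sort(key=lambda pair: pair[0])    # restore the input order
    print([text for _, text in results])      # ['ALPHA', 'BRAVO', 'CHARLIE']
```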
The Pool class
The Pool class distributes tasks to the available processors using FIFO scheduling. It works like a map-reduce architecture: it maps the input across the different processors and collects the output from all of them. After the code has executed, it returns the output in the form of a list or array, waiting for all tasks to finish first. Processes in execution are kept in memory, while other, non-executing processes are kept out of memory. You can think of Pool as a small map-reduce; note that with the async variants, results may not come back in the original input order.
When initializing Pool(), the processes argument defaults to the CPU count: processes is the number of worker processes to use, and if processes is None, the number returned by os.cpu_count() is used.
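A minimal sketch of creating a Pool; passing processes=os.cpu_count() is spelled out here, although leaving it as None has the same effect (the square function is illustrative):

```python
from multiprocessing import Pool
import os

def square(n):
    return n * n

if __name__ == "__main__":
    # processes=None (the default) also falls back to os.cpu_count().
    with Pool(processes=os.cpu_count()) as pool:
        print(pool.map(square, range(10)))  # [0, 1, 4, ..., 81]
```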
Pool provides data-based parallelism: it offers a convenient means of parallelizing the execution of a function across multiple input values, distributing the input data across processes. Its main entry points are:
- apply: calls func with arguments args and blocks until the result is ready.
- apply_async: a non-blocking variant of apply, better suited for performing work in parallel.
- map: a parallel equivalent of the built-in map() function (it supports only one iterable argument; for multiple iterables see starmap()). But it blocks until all results are ready.
- map_async: a non-blocking variant of map, likewise better suited for performing work in parallel.
Pool.map and Pool.apply will lock the main program until all processes are finished, which is quite useful if we want to obtain results in a particular order for certain applications. In contrast, the async variants submit all processes at once and retrieve the results as soon as they are finished. One more difference is that we need to call the get() method after apply_async() in order to obtain the return values of the finished processes.
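A short sketch contrasting the blocking map with apply_async plus get(); the cube function is illustrative:

```python
from multiprocessing import Pool

def cube(n):
    return n ** 3

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # map blocks and returns results in input order.
        print(pool.map(cube, range(5)))    # [0, 1, 8, 27, 64]

        # apply_async returns immediately; get() fetches each result.
        handles = [pool.apply_async(cube, (n,)) for n in range(5)]
        print([h.get() for h in handles])  # [0, 1, 8, 27, 64]
```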
Comparison
Comparing Pool and Process:
- Task count
As we have seen, Pool keeps only the executing processes in memory, while Process allocates every task in memory. So when the number of tasks is small we can use the Process class, and when it is large we can use Pool. Creating a Pool incurs overhead, so with a small number of tasks, performance suffers when Pool is used.
- I/O operations
Pool distributes the processes among the available cores in FIFO manner, and on each core the allocated process executes serially. So if there is a long I/O operation, Pool waits until the I/O operation is completed and does not schedule another process. The Process class, by contrast, suspends the process executing the I/O operation and schedules another process. So, in the case of long I/O operations, it is advisable to use the Process class.
Pool: when you have a chunk of data, use the Pool class. Only the processes under execution are kept in memory. For I/O: it waits until the I/O operation is completed and does not schedule another process, which may increase execution time. Uses a FIFO scheduler.
Process: when you have small data or functions and less repetitive tasks to do. It puts all processes in memory, so for larger workloads it may run out of memory. For I/O: the Process class suspends the process executing the I/O operation and schedules another process in parallel. Uses a FIFO scheduler.
References
- Python Multiprocessing Example
- https://sebastianraschka.com/Articles/2014_multiprocessing.html
Author: jijeng
Last updated: 2021-01-10