Effects of Thread-Local Task Trees in the CHT-MPI C++ Programming Library for the Chunks and Tasks Programming Model
Abstract: Programming efficiently in environments with extremely high thread counts can be both challenging and tedious for the modern programmer. Task-based programming models abstract away low-level system details by allowing a program to be expressed as a set of tasks to be executed by the system. However, these abstractions often carry a performance overhead that can cut into the benefits of running with many threads. The Chunks and Tasks programming model is one such task-based abstraction. The CHT-MPI C++ library, developed at Uppsala University, implements this programming model. CHT-MPI works by first spawning a set of worker processes. Tasks are then assigned to the workers dynamically, so that workers which complete their tasks faster than others are automatically assigned additional tasks; this helps balance the load across the system. Workers may also steal tasks from each other. Each worker process may have many threads. In this thesis, the CHT-MPI library is modified to reduce inter-thread communication within individual worker processes. Currently, each worker process keeps a single master list of all tasks it is to execute, shared between all of that worker's threads. This sharing can create a performance bottleneck when tasks are relatively short and the list is modified often. Adding support for thread-local task trees reduces the number of times that threads must synchronize with the master list. With this new feature, a worker process with twenty threads assigned a set of highly granular tasks can see a twofold performance improvement compared to the old version of the library. This level of improvement was observed for dense matrix-matrix multiplication and integration benchmarks.
This is an important result: effectively scaling programs of a fixed problem size across an ever-increasing number of threads requires being able to execute them efficiently in smaller and smaller parts.