Detecting time inefficiencies in service-oriented systems using distributed tracing

University essay from Umeå universitet/Institutionen för datavetenskap

Author: Josefin Ekenstedt; [2023]


Abstract: Stragglers, tasks that run significantly slower than other tasks in a system, are a major problem in distributed systems. A system may contain relatively few stragglers, yet these can have a great impact on overall system performance. For example, a study of a large data center showed that as few as 3.48 % of the tasks constituting various jobs were stragglers, and that these had a negative performance impact on almost 50 % of all jobs. The purpose of this study is to use distributed tracing to detect stragglers in a service-oriented, distributed system. Distributed tracing is a tool that tracks requests across system boundaries and offers observability into which services a request has interacted with, and in which order. It also measures the duration of each service interaction, which can serve as a measure for defining stragglers. In this project, distributed tracing was used to track a request in a case-study system consisting of four nodes, in order to find services with straggling behavior. A program was developed specifically for this project to measure the CPU, memory, disk bandwidth and network bandwidth used by a process. This program was applied to the services that a request interacted with, as well as to other services co-allocated on the same node. The metrics obtained were used as a basis for evaluating the causes of the observed straggling behavior. It was concluded that consuming certain resources together, for example CPU and memory, led to straggling behavior. It was also shown that making conscious choices about how to co-allocate processes with respect to these results could improve the round-trip time of a request by up to 60 %. However, although time was insufficient to test this theory, it is believed that these results are highly system and application dependent, and that the stragglers experienced in this project might not emerge in other systems. Therefore, it is believed that these experiments must be performed on each system of interest to get accurate results for that particular system. Nevertheless, these results demonstrate how severe the performance impact of stragglers caused by resource contention can be.
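
To illustrate how span durations from a trace could be used to flag stragglers, the following is a minimal sketch and not the method used in the essay. The function name `find_stragglers`, the service names, the durations, and the threshold of 1.5 times the median duration are all assumptions made for this example; the abstract does not state which straggler definition was applied.

```python
from statistics import median


def find_stragglers(spans, factor=1.5):
    """Return the names of services whose span duration exceeds
    factor times the median duration of all spans in the trace.

    spans: list of (service_name, duration_ms) tuples.
    factor: hypothetical threshold multiplier (assumption, not from the essay).
    """
    if not spans:
        return []
    durations = [duration for _, duration in spans]
    threshold = factor * median(durations)
    return [name for name, duration in spans if duration > threshold]


# Hypothetical trace of one request through four services (made-up numbers).
spans = [
    ("gateway", 12.0),
    ("auth", 10.0),
    ("orders", 11.0),
    ("inventory", 48.0),
]

print(find_stragglers(spans))
```

With these made-up durations the median is 11.5 ms, so the threshold is 17.25 ms and only the hypothetical `inventory` service is flagged. A real deployment would read the durations from the tracing backend instead of a hard-coded list.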
