Protractor: Leveraging distributed tracing in service meshes for application profiling at scale

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: Large-scale Internet services are increasingly implemented as distributed systems in order to achieve fault tolerance, availability, and scalability. When requests traverse multiple services, end-to-end metrics no longer give a clear picture. Distributed tracing emerged to break down end-to-end latency on a per-service basis, but it only answers where a problem occurs, not why. From user research, we found that root-cause analysis of performance problems is often still done by manually correlating information from logs, stack traces, and monitoring tools. Profilers provide fine-grained information, but we found they are rarely used in production systems because of the required changes to existing applications, the substantial storage requirements they introduce, and the difficulty of correlating profiling data with information from other sources. The proliferation of modern low-overhead profilers opens up possibilities for online, always-on profiling in production environments. We propose Protractor as the missing link that exploits these possibilities to provide distributed profiling. It features a novel approach that leverages service meshes for application-level transparency, and uses anomaly detection to selectively store relevant profiling information. Profiling information is correlated with distributed traces to provide contextual information for root-cause analysis. Protractor supports different profilers, and experimental work shows that its impact on end-to-end request latency is less than 3%. The utility of Protractor is further substantiated by a survey showing that the majority of participants would use it frequently.
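
The idea of storing profiling data selectively and keying it to a trace can be illustrated with a minimal sketch. It is not Protractor's implementation: the abstract does not say which anomaly detector is used, so a simple z-score over a rolling window of latencies stands in for it here, and the class name, parameters, and profile format are all hypothetical.

```python
from collections import deque
from statistics import mean, stdev


class SelectiveProfileStore:
    """Hypothetical store that keeps a profile only for anomalous requests."""

    def __init__(self, window=100, threshold=3.0, min_samples=10):
        self.latencies = deque(maxlen=window)  # recent "normal" latencies (ms)
        self.threshold = threshold             # assumed z-score cut-off
        self.min_samples = min_samples         # baseline size before judging
        self.kept = {}                         # trace_id -> retained profile

    def is_anomalous(self, latency_ms):
        # Without enough history there is no baseline to compare against.
        if len(self.latencies) < self.min_samples:
            return False
        mu = mean(self.latencies)
        sigma = stdev(self.latencies)
        if sigma == 0:
            return latency_ms > mu
        return (latency_ms - mu) / sigma > self.threshold

    def observe(self, trace_id, latency_ms, profile):
        """Record one request; retain its profile only if it is anomalous."""
        anomalous = self.is_anomalous(latency_ms)
        if anomalous:
            # Keyed by trace ID so the profile can be looked up from the trace.
            self.kept[trace_id] = profile
        else:
            # Only normal samples update the baseline.
            self.latencies.append(latency_ms)
        return anomalous


store = SelectiveProfileStore(window=50, threshold=3.0, min_samples=10)
for i in range(20):
    store.observe(f"trace-{i}", 10 if i % 2 == 0 else 12, {"frames": []})

slow = store.observe("trace-slow", 100, {"frames": ["handler", "db_query"]})
normal = store.observe("trace-ok", 11, {"frames": ["handler"]})
```

In this sketch only the slow request's profile is retained, which is the storage saving the abstract alludes to; a production system would need a more robust detector and a real profile format.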
