Fault-Tolerant Cloud Services
Abstract: Because cloud computing systems are convenient to deploy, easy to scale and cost-effective, they are now widely used by industrial, commercial and individual users. However, the sheer size of these systems leads to a high failure rate, so fault tolerance remains an important topic. This thesis presents the implementation of a fault-tolerant mechanism for cloud computing systems called the "supervision system". We first propose a supervisor-worker relation: a supervisor node is responsible for monitoring its children (workers or other supervisors), and a worker node, which does the actual work, periodically resets a timer in its supervisor. If the timer for a child overflows, the supervisor marks that child as failed and tries to restore it or to start a new instance of it. The system also supports a multi-watchdog mode, which uses more fine-grained watchdogs that group the threads in a worker and apply a different strategy to each group. Besides the local system, we also implemented a remote supervision system to protect local root supervisors: a local root supervisor periodically saves its running state and uploads the image files to its remote supervisor. If an overflow occurs, the remote supervisor remotely calls the restore function on the local machine. The restore function then fetches the most recent image files from the remote supervisor and uses them to restore the local root supervisor. In addition to the implementation details of the system, we designed several test cases and measured the speed of each part of the system. The results indicate that the system works as expected.
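The supervisor-worker relation described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the names `Watchdog`, `Supervisor`, `register` and `check`, the injected `clock` parameter, and the choice to reset the timer after a restart are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, List


class Watchdog:
    """Timer held by a supervisor; the child must reset it before it overflows."""

    def __init__(self, timeout: float, clock: Callable[[], float] = time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_reset = clock()

    def reset(self) -> None:
        # Called periodically by the worker to signal that it is still alive.
        self.last_reset = self.clock()

    def overflowed(self) -> bool:
        # True once more than `timeout` seconds have passed since the last reset.
        return self.clock() - self.last_reset > self.timeout


class Supervisor:
    """Monitors its children and restarts any child whose watchdog overflows."""

    def __init__(self, restart: Callable[[str], None],
                 clock: Callable[[], float] = time.monotonic):
        self.restart = restart  # hypothetical callback that restores or restarts a child
        self.clock = clock
        self.watchdogs: Dict[str, Watchdog] = {}

    def register(self, child_id: str, timeout: float) -> Watchdog:
        # Create a watchdog for a child and hand it back so the child can reset it.
        watchdog = Watchdog(timeout, self.clock)
        self.watchdogs[child_id] = watchdog
        return watchdog

    def check(self) -> List[str]:
        # Mark every child with an overflowed timer as failed and restart it.
        failed = [cid for cid, wd in self.watchdogs.items() if wd.overflowed()]
        for cid in failed:
            self.restart(cid)
            # Assumption: the restarted instance gets a fresh timeout window.
            self.watchdogs[cid].reset()
        return failed
```

In this sketch a worker would call `reset()` on its watchdog from its main loop, while the supervisor calls `check()` on a fixed interval. A supervisor could itself hold a watchdog registered with a parent supervisor, which gives the tree structure the abstract mentions.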