Analysis and Development of Error-Job Mapping and Scheduling for Network-on-Chips with Homogeneous Processors

University essay from Linköpings universitet/Institutionen för datavetenskap

Abstract: Today's computer systems are increasingly complex, and the recent semiconductor technologies in which they are manufactured are more susceptible to soft errors (non-permanent errors). It is therefore inherently difficult to ensure that these systems are and will remain error-free. Depending on the application, a soft error can have serious consequences for the system. It is therefore important to detect soft errors as early as possible, recover from the erroneous state, and maintain correct operation. An entire research area, known as fault tolerance, is devoted to proposing, implementing and analyzing techniques that can detect and recover from these errors. The drawback of fault tolerance is that it usually introduces some overhead. This overhead may be, for instance, redundant hardware, which increases the cost of the system, or a time overhead, which degrades system performance. A main concern when applying fault tolerance is thus to minimize the imposed overhead while the system still delivers correct, error-free operation. In this thesis we have analyzed one well-known fault-tolerant technique, Rollback-Recovery with Checkpointing (RRC). This technique detects and recovers from errors by taking and storing checkpoints during the execution of a job. A job can therefore be viewed as divided into a number of execution segments, with a checkpoint taken after each segment. The technique requires the job to be executed concurrently on two processors. At each checkpoint, the two processors exchange data containing enough information to capture the job's state. The exchanged data are then compared. If the data differ, an error has been detected in the previous execution segment, and that segment is re-executed.
If the exchanged data are the same, no error has been detected, and the data are stored as a safe point from which the job can be restarted later. Exchanging data between processors therefore introduces a time overhead, which increases the average execution time of a job, i.e. the average time required for a given job to complete. The overhead depends on the number of links that have to be traversed for the data exchange after each execution segment and on the number of execution segments needed for the given job. The number of links that have to be traversed after each execution segment is twice the distance between the two processors that execute the same job concurrently. However, this holds only if all links are fully functional. A link failure can result in a longer communication route between the processors. Even when all links are fully functional, the number of execution segments still depends on the error-free probabilities of the processors, and these probabilities can vary between processors. This implies that the choice of processors affects the total number of links the communication has to traverse. Choosing two processors with higher error-free probabilities that lie farther from each other increases the distance but decreases the number of execution segments, which can result in a lower overhead. By carefully determining the mapping for a given job, one can decrease the overhead and hence the average execution time. Since it is very common to have more jobs than available resources, decreasing the average execution time for a whole system requires not only a good mapping but also a good order of execution for a given set of jobs (scheduling of the jobs).
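The checkpoint-compare-rollback loop described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function name `run_with_rrc`, the per-segment error-free probabilities `p_ok_a` and `p_ok_b`, and the assumption that any error always produces a detectable mismatch are all hypothetical choices made for illustration.

```python
import random


def run_with_rrc(num_segments, p_ok_a, p_ok_b, rng):
    """Simulate one job under RRC on two processors.

    Returns the total number of segment executions, counting re-executions.
    Assumption (not from the source): each processor executes a segment
    without error independently with probability p_ok_a / p_ok_b, and any
    error makes the exchanged checkpoint data differ.
    """
    if p_ok_a <= 0 or p_ok_b <= 0:
        raise ValueError("error-free probabilities must be positive")

    executions = 0
    completed = 0
    while completed < num_segments:
        executions += 1
        a_ok = rng.random() < p_ok_a
        b_ok = rng.random() < p_ok_b
        if a_ok and b_ok:
            # Exchanged data match: store the checkpoint as a safe point.
            completed += 1
        # Otherwise the data differ: roll back and re-execute this segment.
    return executions


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    print(run_with_rrc(num_segments=5, p_ok_a=0.9, p_ok_b=0.8, rng=rng))
```

With both probabilities equal to 1.0 the job completes in exactly `num_segments` executions; lower probabilities add re-executions, each of which also repeats the data exchange.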
We propose in this thesis several mapping and scheduling algorithms that aim to reduce the average execution time in a fault-tolerant multiprocessor System-on-Chip, which uses Network-on-Chip as an underlying interconnect architecture, so that the fault-tolerant technique (RRC) can perform efficiently.
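The trade-off between processor distance and number of execution segments described in the abstract can be made concrete with a small Python sketch. It assumes a 2D mesh in which the distance between two processors is the Manhattan distance between their coordinates and all links are functional. The mesh topology, the function names, and the segment counts in the usage example are illustrative assumptions, not values from the thesis, and re-executions after detected errors are not counted.

```python
def mesh_distance(proc_a, proc_b):
    """Hop count between two (x, y) positions on a fault-free 2D mesh (assumed)."""
    return abs(proc_a[0] - proc_b[0]) + abs(proc_a[1] - proc_b[1])


def links_traversed(proc_a, proc_b, num_segments):
    """Links traversed for checkpoint exchanges over a whole error-free run.

    After each execution segment the data cross twice the distance between
    the two processors, as stated in the abstract.
    """
    per_checkpoint = 2 * mesh_distance(proc_a, proc_b)
    return per_checkpoint * num_segments


if __name__ == "__main__":
    # Hypothetical mapping 1: adjacent processors, but 8 segments are needed.
    print(links_traversed((0, 0), (0, 1), num_segments=8))  # 16
    # Hypothetical mapping 2: processors 3 hops apart, only 2 segments needed.
    print(links_traversed((0, 0), (2, 1), num_segments=2))  # 12
```

In this made-up example the more distant pair traverses fewer links in total, which is the kind of case a mapping algorithm would look for.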
