Detection of Soft Errors via Time-Based Double Execution with Hardware Assistance

The progress made in semiconductor technology has pushed transistor dimensions to smaller geometries and higher densities. One of the disadvantages of this progress is that electronic devices have become more sensitive to the effects of radiation-induced soft errors. As current CMOS technology approaches its final practical limits, soft errors are no longer an exclusive problem of space and mission-critical applications; they also affect many ground-level consumer and commercial applications such as wearables, medical, aviation, automotive, home, and emerging Internet-of-Things (IoT) applications, which must continue to operate reliably in the presence of higher soft-error rates. Over the last decades, researchers have developed techniques to mitigate the effects of soft errors, but as semiconductor technology continues to mature, soft-error mitigation research has gradually redirected its focus from space and mission-critical to terrestrial consumer and commercial applications. The challenge that new applications must confront is to guarantee adequate reliability and performance while at the same time satisfying all production market constraints of area, yield, power, and cost. Most of the techniques to detect, mitigate, and correct soft errors incorporate redundancy in the form of space (hardware), time, or a combination of both. Generally, there is no single perfect solution to the soft-error problem, and designers must continuously weigh the cost of hardware redundancy against the performance degradation of time redundancy when selecting a solution. The objective of this research is to develop and evaluate a new hybrid hardware/software technique to detect soft errors. Our technique is based on a time-redundancy approach that duplicates execution on the same hardware, with the goal of saving area and software-development effort while limiting the impact on performance. The proposed technique attains execution duplication with the assistance of limited hardware and software overhead, emulating a virtual duplex system similar to that of a double modular redundancy hardware solution. A prototype of the hybrid system was implemented on a custom model of a basic 32-bit RISC processor. The hybrid implementation emulates virtual system duplication by generating small signatures of the processor execution at separate times and detects soft errors when it encounters differences in the execution signatures. The hardware assistance consists of three components. The first is a state-signature generation module that compresses the processor execution information. The second is a signature processing module that detects soft errors when it encounters differences between execution signatures. The third consists of enhancements to the instruction set that are incorporated into the program to help synchronize the assisting hardware. We then present the results of our implementation of the soft-error detection system and discuss its capabilities and drawbacks as well as possible future enhancements. Finally, we discuss other potential applications of the architecture in approximate computing and IoT.
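The core detection idea in this thesis, duplicating execution in time and comparing compressed signatures of the two runs, can be illustrated with a small software sketch. The signature folding scheme and the sample workload below are illustrative assumptions; the actual design generates state signatures in hardware from processor execution information.

```c
/* Minimal sketch: run the same task twice on the same hardware and compare
 * compressed "execution signatures".  Here the signature is a 32-bit value
 * folded from the task's intermediate results; the thesis instead compresses
 * processor execution state in a hardware signature-generation module. */
#include <stdint.h>
#include <stdio.h>

typedef uint32_t signature_t;

/* Fold a value into the running signature (stand-in for the hardware module). */
static signature_t sig_update(signature_t sig, uint32_t value)
{
    sig ^= value;
    return (sig << 5) | (sig >> 27);   /* cheap mixing, illustration only */
}

/* Hypothetical workload: every intermediate result feeds the signature. */
static signature_t run_task(const uint32_t *data, int n)
{
    signature_t sig = 0;
    uint32_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc += data[i] * 3u;           /* the computation being protected */
        sig = sig_update(sig, acc);
    }
    return sig;
}

int main(void)
{
    uint32_t input[4] = {10, 20, 30, 40};

    signature_t primary   = run_task(input, 4);  /* first (primary) execution    */
    signature_t redundant = run_task(input, 4);  /* second (redundant) execution */

    if (primary != redundant)
        printf("soft error detected: execution signatures differ\n");
    else
        printf("signatures match: no error detected\n");
    return 0;
}
```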
In the past, general-purpose processors (GPPs) have been able to increase speed without sacrificing reliable operation. Future processor reliability is threatened by a combination of shrinking transistor size, higher clock rates, reduced supply voltages, and other factors. It is predicted that the occurrence of arbitrary transient faults, or soft errors, will dramatically increase as these trends continue. This thesis proposes and implements a fault-tolerant microprocessor architecture that detects soft errors in its own data pipeline. The goal of this architecture is to accomplish soft error detection without requiring extra program execution time. Similar architectures have been proposed in the past. However, these approaches have not addressed ways of reducing the extra time necessary to implement fault tolerance. The approach in this thesis meets the demands for soft-error detection by using idle capacity that is inherent in the microprocessor pipeline. In our approach, every instruction is executed twice. The first execution is the primary execution, and the second is the redundant execution. After both are done, the two results are compared, and soft errors can be detected. Our approach, called REESE (REdundant Execution using Spare Elements), improves on past methods and, when necessary, adds a minimal amount of hardware to the processor. We add hardware only to minimize the increased execution time due to the redundant instructions.
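A minimal functional sketch of the per-instruction flavor of this idea follows, assuming a toy instruction format and an execute() helper rather than the REESE microarchitecture: each instruction's result is recomputed in what stands in for a spare issue slot, and the two results are compared before commit.

```c
/* Functional sketch of instruction-level redundant execution: every result is
 * recomputed and compared before it is allowed to commit.  The instruction
 * format and execute() helper are illustrative, not the thesis pipeline. */
#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

typedef enum { OP_ADD, OP_SUB, OP_MUL } opcode_t;

typedef struct {
    opcode_t op;
    uint32_t a, b;
} instr_t;

static uint32_t execute(const instr_t *i)
{
    switch (i->op) {
    case OP_ADD: return i->a + i->b;
    case OP_SUB: return i->a - i->b;
    case OP_MUL: return i->a * i->b;
    }
    return 0;
}

/* Commit only if the primary and redundant executions agree. */
static bool commit_checked(const instr_t *i, uint32_t *result)
{
    uint32_t primary   = execute(i);   /* normal issue slot      */
    uint32_t redundant = execute(i);   /* spare/idle issue slot  */
    if (primary != redundant)
        return false;                  /* soft error detected    */
    *result = primary;
    return true;
}

int main(void)
{
    instr_t i = { OP_MUL, 6, 7 };
    uint32_t r;
    if (commit_checked(&i, &r))
        printf("committed result %u\n", r);
    else
        printf("soft error detected before commit\n");
    return 0;
}
```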
Due to the rapid decrease in Mean Time Between Failures (MTBF) in High Performance Computing, fault tolerance has emerged as a critical topic for improving overall performance in the HPC community. In recent decades, along with the shrinking size of hardware and the extensive use of near-threshold computation for energy saving, the community is facing more frequent soft errors than ever. In particular, because soft errors are difficult to detect, there is an urgent need for a general solution to handle them. Our work provides efficient and effective solutions to handle soft and hard errors for parallel systems. We start by addressing the write bottleneck of traditional checkpoint and restart. We exploit the communication structure to find locally finalized data, as well as each process's contribution to globally finalized data. We allow each node to take independent checkpoints using this information and therefore achieve uncoordinated checkpointing. We checkpoint asynchronously by overlapping the checkpoint workload with computation, so that the system avoids write congestion. We discovered that the impact of soft errors on the output of convergent iterative applications follows a pattern. We developed signature-analysis-based detection with checkpointing-based recovery, driven by the observation that high-order bit flips can very negatively impact execution but can also be easily detected. Specifically, we have developed signatures for this class of applications. For non-monotonically convergent applications, we observed that the signature of silent data corruption is specific to an application but independent of the size of the application's input dataset. Based on this observation, we explored an approach that applies machine-learning techniques to detect soft errors. We use an off-line machine-learning training framework, construct classifiers with representative inputs, and periodically invoke the classifiers during execution to verify the execution status. Our work not only optimizes existing fault-tolerance solutions to handle general classes of faults, but also explores new algorithms that detect and recover from soft errors. We proposed an algorithm-level fault-tolerance solution for molecular dynamics applications to detect soft errors and recover from them. We also developed an algorithm-level recovery strategy, so that applications do not need traditional checkpoints to back up the computation state. Finally, we supported the in-situ analysis paradigm with fault resilience. We explored a MapReduce-like platform for in-situ analysis and discovered the possibility of capturing runtime execution state by utilizing the redundant properties of reduction objects during computation. With this state stored in shared locations among the nodes, we can maintain a checkpoint/restart-like mechanism, and the system can restart from any previous backup if a node fails. We were able to apply the approach both time-wise and space-wise for Smart with reasonable extra overhead.
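The signature-based detection idea for convergent iterative applications can be sketched in a few lines: since the residual of such a solver is expected to shrink from iteration to iteration, a sudden large jump, which is the typical footprint of a high-order bit flip, can be flagged and the computation rolled back to a recent checkpoint. The solver, the jump threshold, and the checkpoint scheme below are illustrative assumptions, not the dissertation's implementation.

```c
/* Sketch: watch the residual of a convergent iterative update; a residual that
 * suddenly grows by a large factor is treated as silent data corruption and
 * the state is restored from the last lightweight checkpoint. */
#include <math.h>
#include <stdio.h>
#include <string.h>

#define N 4
#define JUMP_FACTOR 10.0   /* assumed threshold: 10x residual growth is suspicious */

static double residual(const double x[N], const double target[N])
{
    double r = 0.0;
    for (int i = 0; i < N; i++)
        r += (x[i] - target[i]) * (x[i] - target[i]);
    return sqrt(r);
}

int main(void)
{
    double target[N] = {1, 2, 3, 4};
    double x[N] = {0, 0, 0, 0};
    double checkpoint[N];
    double prev_res = residual(x, target);

    for (int iter = 0; iter < 50; iter++) {
        memcpy(checkpoint, x, sizeof x);        /* lightweight checkpoint      */

        for (int i = 0; i < N; i++)             /* simple convergent update    */
            x[i] += 0.5 * (target[i] - x[i]);

        if (iter == 20)                         /* inject a high-order bit flip */
            x[2] *= 1024.0;

        double res = residual(x, target);
        if (res > JUMP_FACTOR * prev_res) {     /* signature check             */
            printf("iter %d: abnormal residual jump, restoring checkpoint\n", iter);
            memcpy(x, checkpoint, sizeof x);
            continue;                           /* redo from the saved state   */
        }
        prev_res = res;
    }
    printf("final residual: %g\n", residual(x, target));
    return 0;
}
```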
This book addresses reliability and energy efficiency of on-chip networks using cooperative error control. It describes an efficient way to construct an adaptive error control codec capable of tracking noise conditions and adjusting the error correction strength at runtime. Methods are also presented to tackle joint transient and permanent error correction, exploiting the redundant resources already available on-chip. A parallel and flexible network simulator is also introduced, which facilitates examining the impact of various error control methods on network-on-chip performance.
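As a rough illustration of runtime-adaptive error control, the sketch below switches a link between a cheap detecting mode and a stronger correcting mode based on the error count observed in the previous monitoring window. The threshold and the use of bit-wise triplication with majority voting as the "strong" code are assumptions standing in for the book's adaptive codec.

```c
/* Sketch of runtime-adaptive error control: a link monitor counts errors per
 * window and picks the protection level for the next window.  Triplication
 * with bit-wise majority voting stands in for a real correcting code. */
#include <stdint.h>
#include <stdio.h>

typedef enum { MODE_DETECT, MODE_CORRECT } ec_mode_t;

#define ERR_THRESHOLD 3   /* errors per window that trigger the stronger mode */

/* Bit-wise majority vote over three copies of a flit. */
static uint32_t tmr_vote(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (a & c) | (b & c);
}

/* Noise monitor: choose the coding mode for the next window. */
static ec_mode_t choose_mode(int errors_in_window)
{
    return (errors_in_window >= ERR_THRESHOLD) ? MODE_CORRECT : MODE_DETECT;
}

int main(void)
{
    /* Suppose the last monitoring window saw 5 detected errors. */
    ec_mode_t mode = choose_mode(5);
    printf("next window uses %s protection\n",
           mode == MODE_CORRECT ? "correcting (TMR)" : "detecting (parity)");

    /* In the correcting mode a single corrupted copy is outvoted. */
    uint32_t sent = 0xCAFEBABEu;
    uint32_t recv = tmr_vote(sent, sent ^ 0x4u, sent);  /* one copy takes a bit flip */
    printf("voted word matches original: %s\n", recv == sent ? "yes" : "no");
    return 0;
}
```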
The sustained drive to downsize transistors has reached a point where device sensitivity to transient faults caused by neutron and alpha-particle strikes, a.k.a. soft errors, has moved to the forefront of concerns for next-generation designs. Following Moore's law, the exponential growth in the number of transistors per chip has brought tremendous progress in the performance and functionality of processors. However, incorporating billions of transistors into a chip makes it more likely to encounter soft errors. Moreover, aggressive voltage scaling and process variations make processors even more vulnerable to soft errors. Also, the number of cores per chip is growing exponentially, fueling the multicore revolution. With increased core counts and larger memory arrays, the total failure-in-time (FIT) rate per chip (or package) increases. Our studies concluded that the shrinking technology required to match the power and performance demands of servers and future exa- and tera-scale systems impacts the FIT budget. New soft-error mitigation techniques that allow meeting the failure-rate target are important to keep harnessing the benefits of Moore's law. Traditionally, reliability research has focused on providing circuit, microarchitecture, and architectural solutions, which include device hardening, redundant execution, lock-stepping, error-correcting codes, modular redundancy, etc. In general, all these techniques are very effective in handling soft errors but expensive in terms of performance, power, and area overheads. Traditional solutions fail to scale in providing the required degree of reliability with increasing failure rates while maintaining low area, power, and performance cost. Moreover, this family of solutions has hit the point of diminishing returns, and simply achieving a 2X improvement in the soft-error rate may be impractical. Instead of relying on some kind of redundancy, a new direction attracting growing interest in the research community is to detect the actual particle strike rather than its consequence. The proposed idea consists of deploying a set of detectors on silicon that are in charge of perceiving the particle strikes that can potentially create a soft error. Upon detection, a hardware or software mechanism triggers the appropriate recovery action. This work proposes a lightweight and scalable soft-error mitigation solution. As part of our soft-error mitigation technique, we show how to use acoustic wave detectors for detecting and locating particle strikes. We use them to protect both the logic and the memory arrays, acting as a unified error detection mechanism. We architect an error containment mechanism and a unique recovery mechanism based on checkpointing that works with the acoustic wave detectors to effectively recover from soft errors. Our results show that the proposed mechanism protects the whole processor (logic, flip-flops, latches, and memory arrays) while incurring minimal overheads.
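The containment-and-recovery flow that pairs strike detectors with checkpointing can be sketched as a simple control loop: state is checkpointed periodically, and when the detector fires the computation rolls back to the last checkpoint and replays. The checkpoint interval and the simulated detector below are illustrative assumptions, not the work's hardware mechanism.

```c
/* Sketch: periodic checkpointing plus rollback on a (simulated) particle-strike
 * detection.  A real design would latch the detector signal in hardware and
 * restore architectural state; here the "state" is a single running sum. */
#include <stdio.h>
#include <stdbool.h>

#define CKPT_INTERVAL 10   /* iterations between checkpoints (assumed) */

/* Simulated detector: fires exactly once, at iteration 23. */
static bool detector_fired(int iter)
{
    static bool fired = false;
    if (!fired && iter == 23) {
        fired = true;
        return true;
    }
    return false;
}

int main(void)
{
    long sum = 0, ckpt_sum = 0;
    int ckpt_iter = 0;

    for (int iter = 0; iter < 50; iter++) {
        if (iter % CKPT_INTERVAL == 0) {   /* containment point: save state */
            ckpt_sum = sum;
            ckpt_iter = iter;
        }

        sum += iter;                        /* protected computation */

        if (detector_fired(iter)) {         /* detector signals a strike */
            printf("strike detected at iter %d, rolling back to iter %d\n",
                   iter, ckpt_iter);
            sum = ckpt_sum;
            iter = ckpt_iter - 1;           /* replay from the checkpoint */
        }
    }
    printf("sum = %ld\n", sum);
    return 0;
}
```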
This book constitutes the thoroughly refereed post-conference proceedings of the workshops of the 19th International Conference on Parallel Computing, Euro-Par 2013, held in Aachen, Germany, in August 2013. The 99 papers presented were carefully reviewed and selected from 145 submissions. The papers include seven workshops that have been co-located with Euro-Par in previous years:
- BigDataCloud (Second Workshop on Big Data Management in Clouds)
- HeteroPar (11th Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms)
- HiBB (Fourth Workshop on High Performance Bioinformatics and Biomedicine)
- OMHI (Second Workshop on On-chip Memory Hierarchies and Interconnects)
- PROPER (Sixth Workshop on Productivity and Performance)
- Resilience (Sixth Workshop on Resiliency in High Performance Computing with Clusters, Clouds, and Grids)
- UCHPC (Sixth Workshop on UnConventional High Performance Computing)
as well as six newcomers:
- DIHC (First Workshop on Dependability and Interoperability in Heterogeneous Clouds)
- FedICI (First Workshop on Federative and Interoperable Cloud Infrastructures)
- LSDVE (First Workshop on Large Scale Distributed Virtual Environments on Clouds and P2P)
- MHPC (Workshop on Middleware for HPC and Big Data Systems)
- PADABS (First Workshop on Parallel and Distributed Agent Based Simulations)
- ROME (First Workshop on Runtime and Operating Systems for the Many-core Era)
All these workshops focus on promoting and advancing all aspects of parallel and distributed computing.
This Open Access book introduces readers to many new techniques for enhancing and optimizing reliability in embedded systems, which have emerged particularly within the last five years. It introduces the most prominent reliability concerns from today's point of view and roughly recapitulates the progress in the community so far. Unlike other books that focus on a single abstraction level, such as the circuit level or system level alone, the focus of this book is to deal with the different reliability challenges across different levels, starting from the physical level all the way to the system level (cross-layer approaches). The book aims at demonstrating how new hardware/software co-design solutions can be proposed to effectively mitigate reliability degradation such as transistor aging, process variation, temperature effects, soft errors, etc. Provides readers with the latest insights into novel, cross-layer methods and models with respect to the dependability of embedded systems; describes cross-layer approaches that can leverage reliability through techniques that are proactively designed with respect to techniques at other layers; explains run-time adaptation and concepts/means of self-organization, in order to achieve error resiliency in complex, future many-core systems.
This book presents the theory behind software-implemented hardware fault tolerance, as well as the practical aspects needed to put it to work on real examples. By evaluating accurately the advantages and disadvantages of the already available approaches, the book provides a guide to developers willing to adopt software-implemented hardware fault tolerance in their applications. Moreover, the book identifies open issues for researchers willing to improve the already available techniques.
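One classic family of techniques covered in this literature duplicates critical variables and the operations on them in software, then compares the two copies before each use. The helper names below are illustrative; this is a minimal sketch that detects a mismatch by terminating the program, rather than one of the recovery strategies such a book would also discuss.

```c
/* Sketch of software-implemented detection through variable duplication:
 * every critical variable keeps a shadow copy, every operation is performed
 * on both copies, and the copies are compared before the value is consumed. */
#include <stdio.h>
#include <stdlib.h>

/* Duplicated variable with its shadow copy. */
typedef struct { int v0, v1; } dup_int;

static void dup_set(dup_int *x, int value) { x->v0 = value; x->v1 = value; }

static void dup_add(dup_int *dst, const dup_int *a, const dup_int *b)
{
    dst->v0 = a->v0 + b->v0;          /* primary computation    */
    dst->v1 = a->v1 + b->v1;          /* duplicated computation */
}

/* Consistency check before the value is used (e.g., stored or branched on). */
static int dup_check(const dup_int *x)
{
    if (x->v0 != x->v1) {
        fprintf(stderr, "mismatch detected: possible soft error\n");
        exit(EXIT_FAILURE);           /* or trigger application-level recovery */
    }
    return x->v0;
}

int main(void)
{
    dup_int a, b, c;
    dup_set(&a, 41);
    dup_set(&b, 1);
    dup_add(&c, &a, &b);
    printf("checked result: %d\n", dup_check(&c));
    return 0;
}
```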
This book constitutes the thoroughly refereed post-proceedings of the Second International Workshop on Radical Agent Concepts, WRAC 2005, held in Greenbelt, MD, USA in September 2005. The 27 full papers presented are fully revised to incorporate reviewers' comments and discussions at the workshop. Topics addressed are social aspects of agents, agent architectures, autonomic systems, agent communities, and agent intelligence.
This book constitutes the thoroughly refereed post-conference proceedings of the 28th International Workshop on Languages and Compilers for Parallel Computing, LCPC 2015, held in Raleigh, NC, USA, in September 2015. The 19 revised full papers were carefully reviewed and selected from 44 submissions. The papers are organized in topical sections on programming models, optimizing framework, parallelizing compiler, communication and locality, parallel applications and data structures, and correctness and reliability.