Towards Efficient Inference and Improved Training Efficiency of Deep Neural Networks

In recent years, deep neural networks have surpassed human performance on image classification and speech recognition tasks. While current models can reach state-of-the-art performance on stand-alone benchmarks, deploying them on embedded systems with real-time latency deadlines either causes them to miss those deadlines or severely degrades their accuracy in order to meet the stated specifications. This calls for intelligent design of the network architecture to minimize accuracy degradation when deployed on the edge. Similarly, deep learning often has a long turnaround time due to the volume of experiments over different hyperparameters, consuming both time and resources. This motivates the development of training strategies that allow researchers without access to large computational resources to train large models without waiting for exorbitant training cycles to complete. This dissertation addresses these concerns through data-dependent pruning of deep learning computation. First, regarding inference, we propose an integration of two conditional execution strategies, which we call FBS-pruned CondConv, based on the observation that using input-specific filters instead of standard convolutional filters lets us prune aggressively at higher rates while mitigating accuracy degradation, for significant computation savings. Then, regarding long training times, we introduce a dynamic data pruning framework that draws on ideas from active learning and reinforcement learning to dynamically select subsets of data on which to train the model. Finally, as opposed to pruning data but in the same spirit of reducing training time, we investigate the vision transformer and introduce a unique training method called PatchDrop (originally designed for robustness to occlusions in transformers [1]), which uses the self-supervised DINO [2] model to identify the salient patches in an image and trains on those salient subsets.
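The input-specific pruning idea behind FBS-pruned CondConv can be sketched in miniature: a lightweight gate scores each output channel for the current input, and only the top-scoring fraction is ever computed. The numpy snippet below is an illustrative sketch of FBS-style channel gating under assumed shapes, not the dissertation's implementation; `w_gate`, `keep_ratio`, and all dimensions are hypothetical.

```python
import numpy as np

def channel_saliency(x, w_gate):
    # Global-average-pool the input feature map, then score every output
    # channel's relevance for this particular input with a tiny linear layer.
    pooled = x.mean(axis=(1, 2))                 # (C_in,)
    return np.maximum(w_gate @ pooled, 0.0)      # (C_out,) ReLU'd scores

def gate_channels(x, w_gate, keep_ratio=0.25):
    # Keep only the top-k scoring output channels for this input; the rest
    # are pruned, so their convolutions never need to be computed.
    scores = channel_saliency(x, w_gate)
    k = max(1, int(keep_ratio * scores.size))
    keep = np.argsort(scores)[-k:]
    mask = np.zeros_like(scores)
    mask[keep] = scores[keep]                    # kept channels are also scaled
    return mask

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))             # hypothetical (C_in, H, W) input
w_gate = rng.standard_normal((32, 8))            # hypothetical gating weights
mask = gate_channels(x, w_gate)                  # at most 8 of 32 channels survive
```

Because the gate depends on the input, a different subset of filters survives for each image, which is what allows the aggressive pruning rates described above.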
These strategies and training methods take a step toward making models more accessible for efficient inference on edge devices and lower the barrier for independent researchers to train deep learning models that would otherwise require immense computational resources, pushing towards the democratization of machine learning.
This book provides a structured treatment of the key principles and techniques for enabling efficient processing of deep neural networks (DNNs). DNNs are currently widely used for many artificial intelligence (AI) applications, including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, this accuracy comes at the cost of high computational complexity. Therefore, techniques that enable efficient processing of deep neural networks to improve key metrics, such as energy efficiency, throughput, and latency, without sacrificing accuracy or increasing hardware costs are critical to enabling the wide deployment of DNNs in AI systems. The book includes background on DNN processing; a description and taxonomy of hardware architectural approaches for designing DNN accelerators; key metrics for evaluating and comparing different designs; features of DNN processing that are amenable to hardware/algorithm co-design to improve energy efficiency and throughput; and opportunities for applying new technologies. Readers will find a structured introduction to the field as well as a formalization and organization of key concepts from contemporary work that provides insights that may spark new ideas.
Explains current co-design and co-optimization methodologies for building hardware neural networks and algorithms for machine learning applications. This book focuses on how to build energy-efficient hardware for neural networks with learning capabilities, and provides co-design and co-optimization methodologies for building hardware neural networks that can learn. Presenting a complete picture from high-level algorithm to low-level implementation details, Learning in Energy-Efficient Neuromorphic Computing: Algorithm and Architecture Co-Design also covers many fundamentals and essentials of neural networks (e.g., deep learning), as well as hardware implementations of neural networks. The book begins with an overview of neural networks. It then discusses algorithms for utilizing and training rate-based artificial neural networks. Next comes an introduction to various options for executing neural networks, ranging from general-purpose processors to specialized hardware, and from digital to analog accelerators. A design example on building an energy-efficient accelerator for adaptive dynamic programming with neural networks is also presented. An examination of fundamental concepts and popular learning algorithms for spiking neural networks follows, along with a look at hardware for spiking neural networks. Then comes a chapter offering readers three design examples (two based on conventional CMOS and one on emerging nanotechnology) that implement the learning algorithm found in the previous chapter. The book concludes with an outlook on the future of neural network hardware.
Includes a cross-layer survey of hardware accelerators for neuromorphic algorithms. Covers the co-design of architecture and algorithms with emerging devices for much-improved computing efficiency. Focuses on the co-design of algorithms and hardware, which is especially critical for using emerging devices, such as traditional or diffusive memristors, for neuromorphic computing. Learning in Energy-Efficient Neuromorphic Computing: Algorithm and Architecture Co-Design is an ideal resource for researchers, scientists, software engineers, and hardware engineers dealing with ever-increasing requirements on power consumption and response time. It is also excellent for teaching and training undergraduate and graduate students about the latest generation of neural networks with powerful learning capabilities.
Modern machine learning often relies on deep neural networks that are prohibitively expensive in terms of memory and computational footprint. This in turn significantly limits the range of potential applications in which we face non-negligible resource constraints, e.g., real-time data processing, embedded devices, and robotics. In this thesis, we develop theoretically grounded algorithms to reduce the size and inference cost of modern, large-scale neural networks. By taking a theoretical approach from first principles, we aim to understand and analytically describe the performance-size trade-offs of deep networks, i.e., their generalization properties. We then leverage these insights to devise practical algorithms for obtaining more efficient neural networks via pruning or compression. Beyond theoretical aspects and the inference-time efficiency of neural networks, we study how compression can yield novel insights into the design and training of neural networks. We investigate the practical aspects of the generalization properties of pruned neural networks beyond simple metrics such as test accuracy. Finally, we show how, in certain applications, pruning neural networks can improve training and hence generalization performance.
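As a concrete reference point for the pruning algorithms discussed, the simplest baseline that such methods build on, one-shot magnitude pruning, can be sketched in a few lines of numpy. The shapes and names below are hypothetical and illustrative, not the thesis's code.

```python
import numpy as np

def magnitude_prune(w, sparsity):
    # Zero out the smallest-magnitude fraction of weights; survivors keep
    # their values, leaving a sparse layer with proportionally fewer MACs.
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) > threshold, w, 0.0)

rng = np.random.default_rng(1)
w = rng.standard_normal((64, 64))                # hypothetical dense layer
pruned = magnitude_prune(w, sparsity=0.9)        # ~90% of entries become zero
```

Theoretically grounded approaches of the kind described above go beyond this magnitude heuristic, e.g., by choosing which weights to remove based on their effect on generalization rather than raw size.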
Energy efficiency is critical for running computer vision on battery-powered systems, such as mobile phones or UAVs (unmanned aerial vehicles, or drones). This book collects the methods that have won the annual IEEE Low-Power Computer Vision Challenges since 2015. The winners share their solutions and provide insight on how to improve the efficiency of machine learning systems.
My thesis is the result of a confluence of several trends that have emerged in recent years. First, the rapid proliferation of deep learning across the application and hardware landscapes is creating an immense demand for computing power. Second, the waning of Moore's Law is paving the way for domain-specific acceleration as a means of delivering performance improvements. Third, deep learning's inherent error tolerance is reviving long-forgotten approximate computing paradigms. Fourth, latency, energy, and privacy considerations are increasingly pushing deep learning towards edge inference, with its stringent deployment constraints. All of the above have created a unique, once-in-a-generation opportunity for accelerated widespread adoption of new classes of hardware and algorithms, provided they can deliver fast, efficient, and accurate deep learning inference within a tight area and energy envelope. One approach towards efficient machine learning acceleration that I have explored attempts to push neural network model size to its absolute minimum. 3PXNet (Pruned, Permuted, Packed XNOR Networks) combines two widely used model compression techniques, binarization and sparsity, to deliver usable models with sizes down to single kilobytes. It uses an innovative combination of weight permutation and packing to create structured sparsity that can be implemented efficiently in both software and hardware. 3PXNet has been deployed as an open-source library targeting microcontroller-class devices, with various software optimizations further improving runtime and storage requirements. The second line of work I have pursued is the application of stochastic computing (SC), an approximate, stream-based computing paradigm enabling extremely area-efficient implementations of basic arithmetic operations such as multiplication and addition. SC has been enjoying a renaissance over the past few years due to its unique synergy with deep learning.
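The XNOR kernel that 3PXNet prunes, permutes, and packs can be illustrated with a toy bit-packed dot product. This is a sketch of the standard XNOR-popcount trick used by binarized networks, not the 3PXNet library itself; the helper names and the 8-element width are hypothetical.

```python
def pack_bits(vals):
    # Pack +/-1 values into an integer bitmask (+1 -> bit set, -1 -> clear).
    word = 0
    for i, v in enumerate(vals):
        if v > 0:
            word |= 1 << i
    return word

def xnor_dot(wa, wb, n):
    # XNOR marks the positions where the signs agree; popcount counts them.
    # dot = (#matches) - (#mismatches) = 2 * matches - n
    matches = bin(~(wa ^ wb) & ((1 << n) - 1)).count("1")
    return 2 * matches - n

a = [1, -1, 1, 1, -1, -1, 1, -1]                 # toy binarized activations
b = [1, 1, -1, 1, -1, 1, 1, 1]                   # toy binarized weights
dot = xnor_dot(pack_bits(a), pack_bits(b), len(a))
```

Packing turns a 32-way multiply-accumulate into one XNOR and one popcount on a single machine word, which is why kilobyte-scale models become practical on microcontrollers.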
On the one hand, SC makes it possible to implement an extremely dense multiply-accumulate (MAC) computational fabric well suited to computing large linear algebra kernels, which are the bread and butter of deep neural networks. On the other hand, those neural networks exhibit immense approximation tolerance, making SC a viable implementation candidate. However, several issues need to be solved to make SC acceleration of neural networks feasible. The area efficiency comes at the cost of long stream-processing latency. The conversion cost between fixed-point and stochastic representations can cancel out the gains from computation efficiency if not managed correctly. These issues raise the question of how to design an accelerator architecture that best exploits SC's benefits while minimizing its shortcomings. To address this, I proposed the ACOUSTIC (Accelerating Convolutional Neural Networks through Or-Unipolar Skipped Stochastic Computing) architecture and its extension, GEO (Generation and Execution Optimized Stochastic Computing Accelerator for Neural Networks). ACOUSTIC tries to maximize SC's compute density to amortize conversion costs and memory accesses, delivering system-level reductions in inference energy and latency. It has been taped out and demonstrated in silicon using a 14nm fabrication process. GEO addresses some of the shortcomings of ACOUSTIC: the introduction of a near-memory computation fabric enables a more flexible selection of dataflows, and a novel progressive buffering scheme unique to SC lowers the reliance on high memory bandwidth. Overall, my work approaches accelerator design from the systems perspective, which sets it apart from most recent SC publications that target point improvements in the computation itself. As an extension of this line of work, I have explored combining SC with sparsity to apply it to new classes of applications and enable further benefits.
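The unique synergy mentioned above rests on how cheap SC arithmetic is: in unipolar SC, a single AND gate multiplies two values encoded as random bitstreams. A minimal software model of that primitive follows; it is illustrative only, with hypothetical names and stream lengths.

```python
import random

def to_stream(p, length, rng):
    # Unipolar SC encoding: each bit is 1 with probability p.
    return [1 if rng.random() < p else 0 for _ in range(length)]

def sc_multiply(sa, sb):
    # For independent unipolar streams, a bitwise AND multiplies the
    # encoded values: P(a & b) = P(a) * P(b). One AND gate per product.
    anded = [x & y for x, y in zip(sa, sb)]
    return sum(anded) / len(anded)

rng = random.Random(42)
n = 1 << 14                    # longer streams trade latency for accuracy
est = sc_multiply(to_stream(0.5, n, rng), to_stream(0.4, n, rng))
# est approximates 0.5 * 0.4 = 0.2 up to stochastic fluctuation
```

The long stream-processing latency discussed above is visible here directly: halving the estimation error requires roughly quadrupling the stream length, which is why amortizing conversion costs matters so much at the architecture level.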
I have proposed the first SC accelerator that supports weight sparsity, SASCHA (Sparsity-Aware Stochastic Computing Hardware Architecture for Neural Network Acceleration), which improves performance on pruned neural networks while maintaining throughput when processing dense ones. SASCHA solves a series of unique, non-trivial challenges in combining SC with sparsity. I have also designed an architecture for accelerating event-based camera object tracking, SCIMITAR. Event-based cameras are relatively new imaging devices that transmit information only about pixels that have changed in brightness, resulting in very high input sparsity. SCIMITAR combines SC with computing-in-memory (CIM) and, through a series of architectural optimizations, takes advantage of this new data format to deliver low-latency object detection for tracking applications.
The breakthrough of deep learning (DL) has greatly promoted the development of machine learning across numerous academic disciplines and industries in recent years. A subsequent concern, frequently raised by multidisciplinary researchers, software developers, and machine learning end users, is the inefficiency of DL methods: intolerable training and inference times, exhausted computing resources, and unsustainable power consumption. To tackle these inefficiency issues, numerous DL efficiency methods have been proposed to improve efficiency without sacrificing the prediction accuracy of a specified application such as image classification or visual object detection. However, we argue that traditional DL efficiency methods are not sufficiently flexible or adaptive to meet the requirements of practical usage scenarios, based on two observations. First, most traditional methods adopt the objective "no accuracy loss for a specified application", yet this objective fails to cover a considerable range of scenarios. For example, to meet diverse user needs, a public cloud platform should provide an efficient and multipurpose DL method instead of focusing on a single application. Second, most traditional methods adopt model compression and quantization as efficiency-enhancement strategies, yet these two strategies degrade severely in a number of scenarios. For example, for embedded deep neural networks (DNNs), significant architecture changes and quantization may severely weaken customized hardware accelerators designed for predefined DNN operators and precision. In this dissertation, we investigate three popular usage scenarios and correspondingly propose our DL efficiency methods: versatile model efficiency, robust model efficiency, and processing-step efficiency. The first scenario requires a DL method to achieve both model efficiency and versatility.
Model efficiency means designing a compact deep neural network, while versatility means achieving satisfactory prediction accuracy on multiple applications. We propose a compact DNN that integrates shape information into a newly designed module, Conv-M, to address the issue that previous compact DNNs cannot achieve a matched level of accuracy on image classification and unsupervised domain adaptation. Our method can benefit software developers, who can directly replace an original single-purpose DNN with our versatile one in their programs. The second scenario requires a DL method to achieve model efficiency and robustness. Robustness means achieving satisfactory prediction accuracy for certain categories of samples; these samples are critical but often wrongly predicted by previous methods. We propose a fast training method based on simultaneous adaptive filter reuse (dynamic compression) and neuron-level robustness enhancement to improve accuracy on self-driving motion prediction, especially on night-driving samples. Our method can benefit algorithm researchers who are proficient in mathematically exploring loss functions but not skilled in empirically constructing efficient sub-modules of DNNs, since our dynamic compression does not require expertise in those sub-modules. The third scenario requires the inference speed of a DL method to be fast without significantly changing the DNN architecture or adopting quantization. We propose a fast photorealistic style transfer method that removes the time-consuming smoothing step during inference and introduces a spatially coherent content-style preserving loss during training. For computer vision engineers who struggle to combine DL efficiency approaches, our method provides a candidate efficiency method different from popular architecture tailoring and quantization.