Efficient neural network processing via model compression and low-power functional units
Abstract
We present a framework that advances neural network optimization through novel methods in pruning, quantization, and arithmetic unit design, targeting platforms ranging from resource-constrained devices to datacenters. The first component is a pruning method that employs an importance metric to measure and selectively eliminate less critical neurons and weights, achieving compression rates of up to 99.9% without significant loss of accuracy. This idea is further improved by a novel pruning schedule that optimizes the balance between compression and the model's generalization capability. Next, we introduce a quantization method that, combined with pruning, improves hardware compatibility of the floating-point format, offering efficient model compression, fast computation, and general usability. Finally, we propose a logarithmic arithmetic unit designed as an energy-efficient alternative to conventional floating-point operations, providing precise and configurable processing without relying on bulky lookup tables. Extensive evaluations across different datasets, CUDA-based simulations, and Verilog-based hardware designs indicate that our approaches outperform existing methods, making the framework a powerful solution for deploying artificial intelligence models more efficiently.
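For concreteness, the sketch below shows what importance-based weight pruning looks like in its simplest form. The magnitude-based importance score and the prune_by_importance helper are illustrative assumptions of ours, not the metric or schedule proposed in the work.

    # Minimal sketch of importance-based pruning, using absolute magnitude
    # as a stand-in importance score; the real metric and schedule differ.
    import numpy as np

    def prune_by_importance(weights: np.ndarray, sparsity: float) -> np.ndarray:
        """Zero out the `sparsity` fraction of weights with the lowest importance."""
        importance = np.abs(weights)                   # stand-in importance metric
        threshold = np.quantile(importance, sparsity)  # cutoff below which weights are removed
        return np.where(importance > threshold, weights, 0.0)

    rng = np.random.default_rng(0)
    w = rng.normal(size=(256, 256))
    w_pruned = prune_by_importance(w, sparsity=0.999)  # 99.9% sparsity, as in the abstract
    print(f"nonzero weights kept: {np.count_nonzero(w_pruned)} of {w.size}")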
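The appeal of logarithmic arithmetic is that multiplication reduces to addition in the log domain, letting hardware replace a costly multiplier with a cheap adder. The software sketch below illustrates only this general principle under our own encoding assumptions (zero and special values are ignored for simplicity); it does not reflect the proposed unit's actual encoding or precision scheme.

    # Illustration of the log-domain principle: a product becomes an
    # exponent addition. Encoding and helper names are assumptions.
    import math

    def to_log(x: float) -> tuple[float, float]:
        """Encode a nonzero x as (log2|x|, sign)."""
        return math.log2(abs(x)), math.copysign(1.0, x)

    def log_mul(a: tuple[float, float], b: tuple[float, float]) -> tuple[float, float]:
        """Multiply two log-encoded values: exponents add, signs multiply."""
        return a[0] + b[0], a[1] * b[1]

    def from_log(v: tuple[float, float]) -> float:
        """Decode a (log2 magnitude, sign) pair back to a linear value."""
        return v[1] * 2.0 ** v[0]

    x, y = 3.5, -2.0
    print(from_log(log_mul(to_log(x), to_log(y))))  # -7.0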