operands

Low precision matrix multiplication for efficient deep learning in NVIDIA Carmel processors

Journal: The Journal of Supercomputing 2021

We introduce a high performance, multi-threaded realization of the gemm kernel for the ARMv8.2 architecture that operates with 16-bit (half precision) floating point operands. Our code is especially designed for efficient machine learning inference (and, to a certain extent, also training) with deep neural networks. The results on...


A Hybrid Radix-4 and Approximate Logarithmic Multiplier for Energy Efficient Image Processing

Journal: Electronics 2021

Multiplication is an essential image processing operation commonly implemented in hardware DSP cores. To improve the cores’ area, speed, or energy efficiency, we can approximate multiplication. We present a multiplier that generates two partial products using a hybrid radix-4 and logarithmic encoding of the input operands. It uses exact encoding to generate a partial product from the three most significant bits and approximation...
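
To illustrate what a logarithmic approximation of multiplication looks like, here is a minimal sketch of the classic Mitchell method for non-negative integers. It is not the hybrid radix-4 design of the paper above; the function name `mitchell_multiply` is an assumption made for this example.

```python
def mitchell_multiply(a: int, b: int) -> int:
    """Approximate a * b with Mitchell's logarithmic method (illustrative only)."""
    if a == 0 or b == 0:
        return 0
    ka = a.bit_length() - 1          # floor(log2(a)): position of the leading one
    kb = b.bit_length() - 1
    ra = a - (1 << ka)               # residue below the leading one
    rb = b - (1 << kb)
    base = 1 << (ka + kb)            # 2^(ka+kb)
    # s equals (xa + xb) * 2^(ka+kb), where xa = ra / 2^ka and xb = rb / 2^kb
    s = (ra << kb) + (rb << ka)
    if s < base:                     # xa + xb < 1: result is 2^(ka+kb) * (1 + xa + xb)
        return base + s
    return 2 * s                     # xa + xb >= 1: result is 2^(ka+kb+1) * (xa + xb)


print(mitchell_multiply(8, 4))   # exact for powers of two: 32
print(mitchell_multiply(3, 3))   # 8, while the exact product is 9
```

The approximation never exceeds the exact product, and its relative error stays below about 11.1%, which is why hardware designs typically combine it with some exact or corrective term.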


Overlap-free Karatsuba-Ofman polynomial multiplication algorithms

Journal: IET Information Security 2007

Haining Fan Jia-Guang Sun Ming Gu Kwok-Yan Lam

We describe how a simple way to split input operands allows for fast VLSI implementations of subquadratic GF(2)[x] Karatsuba-Ofman multipliers. The theoretical XOR gate delay of the resulting multipliers is reduced significantly. For example, it is reduced by about 33% and 25% for n = 2^t and n = 3^t (t > 1), respectively. To the best of our knowledge, this parameter has never been improved since ...
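
For readers unfamiliar with Karatsuba-Ofman multiplication over GF(2)[x], the following is a minimal software sketch. It uses the conventional low-half/high-half operand split, not the overlap-free split proposed in the paper above. Polynomials are stored as Python integers (bit i is the coefficient of x^i); the function names and the base-case threshold of 8 are assumptions for illustration.

```python
def clmul_schoolbook(a: int, b: int) -> int:
    """Carry-less (GF(2)[x]) product by shift-and-XOR."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def karatsuba_gf2(a: int, b: int, n: int) -> int:
    """Multiply two GF(2)[x] polynomials of degree < n with three half-size products."""
    if n <= 8:                         # small operands: fall back to schoolbook
        return clmul_schoolbook(a, b)
    h = n // 2                         # size of the low half
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h          # a = a0 + a1 * x^h
    b0, b1 = b & mask, b >> h
    low = karatsuba_gf2(a0, b0, h)
    high = karatsuba_gf2(a1, b1, n - h)
    # In GF(2) addition and subtraction are both XOR.
    mid = karatsuba_gf2(a0 ^ a1, b0 ^ b1, n - h) ^ low ^ high
    return low ^ (mid << h) ^ (high << (2 * h))


# (x^3 + x + 1) * (x^2 + x) = x^5 + x^4 + x^3 + x
print(bin(karatsuba_gf2(0b1011, 0b110, 4)))  # 0b111010
```

In this split the three partial results overlap when they are recombined, which costs extra XOR levels in hardware; that recombination delay is the quantity the abstract says the paper reduces.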


SIFTAL: A Typed Assembly Language for Secure Information Flow Analysis Technical Report Draft - Not for distribution

2004

Eduardo Bonelli Adriana Compagnoni Ricardo Medel

2 SIFTAL; 2.1 Syntax of SIFTAL; 2.2 Type System; 2.2.1 Typing Basic Blocks; 2.2.2 Typing Operands, Word Values and Heap Values ...


Instruction Selector Generation from Architecture Description

2010

Miloslav Trmac Adam Husár Jan Hranac Tomás Hruska Karel Masarík

We describe an automated way to generate data for a practical LLVM instruction selector based on a machine-generated description of the target architecture at register transfer level. The generated instruction selector can handle arbitrarily complex machine instructions with no internal control flow, and can automatically find and take advantage of arithmetic properties of instructions, specia...


Automatic Data Distribution Optimisation in a Lazy, Self-optimising Parallel Matrix Library (extended Abstract)

1996

Olav Beckmann Paul H J Kelly

This short paper describes a matrix-vector library implementation running on the Fujitsu AP1000. The library optimises data distribution at run-time, taking advantage of information about how operands and results are used by delaying evaluation where possible. The work extends our earlier paper on the subject [5] by giving a general methodology for representing data distributions, which is then ...


Improvements to Linear Scan register allocation

2004

Alkis Evlogimenos

Linear scan register allocation is a fast global register allocation first presented in [PS99] as an alternative to the more widely used graph coloring approach. In this paper, I apply the linear scan register allocation algorithm in a system with SSA form and show how to improve the algorithm by taking advantage of lifetime holes and memory operands, and also eliminate the need for reserving r...
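
As background, the basic linear scan algorithm that the abstract cites as [PS99] can be sketched as below. This is a simplified illustration and not the improved algorithm of the paper above: it ignores lifetime holes, memory operands, and SSA form. The interval format `(name, start, end)` and the function name are assumptions for this example.

```python
def linear_scan(intervals, num_regs):
    """Assign each live interval a register index or the string "spill"."""
    if num_regs < 1:
        raise ValueError("need at least one register")
    free = list(range(num_regs))
    active = []                        # (end, name) pairs, kept sorted by end point
    assignment = {}
    for name, start, end in sorted(intervals, key=lambda iv: iv[1]):
        # Expire intervals that ended before this one starts.
        while active and active[0][0] < start:
            _, expired = active.pop(0)
            free.append(assignment[expired])
        if free:
            assignment[name] = free.pop(0)
            active.append((end, name))
            active.sort()
        else:
            last_end, last = active[-1]
            if last_end > end:
                # Spill the active interval that ends last and reuse its register.
                assignment[name] = assignment[last]
                assignment[last] = "spill"
                active.pop()
                active.append((end, name))
                active.sort()
            else:
                assignment[name] = "spill"
    return assignment


intervals = [("a", 0, 10), ("b", 1, 3), ("c", 2, 8), ("d", 4, 6)]
print(linear_scan(intervals, 2))
# {'a': 'spill', 'b': 1, 'c': 0, 'd': 1}
```

Each interval is visited once in order of its start point, which is what makes the approach fast compared with building and coloring an interference graph.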


Reducing Access Count to Register-Files through Operand Reuse

2003

Hiroshi Takamura Koji Inoue Vasily G. Moshnyaga

This paper proposes an approach for reducing access count to register-files based on operand data reuse. The key idea is to compare source and destination operands of the current and previous instructions and if they are the same, omit the corresponding register file activation during operand fetch, thus saving energy consumption. Simulations show that using this technique we can decrease the t...
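
The key idea can be illustrated with a toy model that counts how many source-operand reads could be skipped in an instruction stream. This is a sketch under strong assumptions, not the paper's simulator: instructions are hypothetical `(dst, sources)` tuples, only the immediately preceding instruction is compared, and pipeline details are ignored.

```python
from typing import List, Optional, Tuple

Instr = Tuple[Optional[str], Tuple[str, ...]]   # (destination, source registers)


def count_register_reads(program: List[Instr]) -> Tuple[int, int]:
    """Return (total source reads, reads skippable by reusing the previous operands)."""
    total = 0
    skipped = 0
    prev_regs = set()                  # sources and destination of the previous instruction
    for dst, srcs in program:
        for src in srcs:
            total += 1
            if src in prev_regs:       # same operand as in the previous instruction
                skipped += 1
        prev_regs = set(srcs)
        if dst is not None:
            prev_regs.add(dst)
    return total, skipped


program = [
    ("r1", ("r2", "r3")),
    ("r4", ("r1", "r2")),   # r1 was just written, r2 was just read
    ("r5", ("r6", "r7")),
    ("r5", ("r5", "r6")),   # r5 was just written, r6 was just read
]
print(count_register_reads(program))  # (8, 4)
```

In this toy stream half of the source reads match an operand of the preceding instruction, so the corresponding register file activations could be omitted.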


A Prolog Specification of Giant Number Arithmetic

Journal: CoRR 2013

Paul Tarau

The tree based representation described in this paper, hereditarily binary numbers, applies recursively a run-length compression mechanism that enables computations limited by the structural complexity of their operands rather than by their bitsizes. While within constant factors from their traditional counterparts for their worst case behavior, our arithmetic operations open the doors for inte...


Efficient Prefix Computation on Faulty Hypercubes

Journal: J. Inf. Sci. Eng. 2001

Yu-Wei Chen Kuo-Liang Chung

Consider an n-dimensional SIMD hypercube H_n with 3n/2 − 1 faulty nodes. Given 2^n operands, this paper presents an efficient algorithm for prefix computation on the faulty H_n. Employing the newly proposed delay-update technique and the subcube partition scheme, the proposed algorithm takes n + 5 log n + 7 steps, and it tolerates n/2 more faulty nodes than does Raghavendra and Sridhar’s algorithm [4...
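
For context, the standard prefix computation on a fault-free hypercube takes n exchange steps, one per dimension. The sketch below simulates that baseline sequentially; it does not implement the paper's delay-update technique or any fault tolerance, and the function name is an assumption for this example.

```python
import operator


def hypercube_prefix(values, op=operator.add):
    """Simulate prefix computation on a fault-free hypercube with len(values) = 2^n nodes."""
    p = len(values)
    n = p.bit_length() - 1
    if p == 0 or (1 << n) != p:
        raise ValueError("number of operands must be a power of two")
    prefix = list(values)              # prefix[i]: running result held by node i
    total = list(values)               # total[i]: combination over node i's current subcube
    for d in range(n):                 # one communication step per dimension
        # Every node receives its neighbour's subcube total along dimension d.
        received = [total[i ^ (1 << d)] for i in range(p)]
        for i in range(p):
            if i & (1 << d):           # neighbour's subcube holds lower-indexed nodes
                prefix[i] = op(received[i], prefix[i])
                total[i] = op(received[i], total[i])
            else:
                total[i] = op(total[i], received[i])
    return prefix


print(hypercube_prefix([1, 2, 3, 4, 5, 6, 7, 8]))
# [1, 3, 6, 10, 15, 21, 28, 36]
```

The operand order in each `op` call is kept consistent with node indices, so the sketch also works for associative but non-commutative operators such as string concatenation.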
