We consider the problem of building high-performance implementations of sparse matrix-vector multiply (SpM×V), or y = y + A·x, which is an important and ubiquitous computational kernel. Prior work indicates that cache blocking of SpM×V is extremely important for some matrix and machine combinations, yielding speedups as high as 3×. In this paper we present a new, more compact data structure for cach...