
Towards General Purpose Acceleration by Exploiting Common Data-Dependence Forms

Dadu, Vidushi ; Weng, Jian ; Liu, Sihao ; Nowatzki, Tony

Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019, p.924-939

New York, NY, USA: ACM

Full text available

  • Title:
    Towards General Purpose Acceleration by Exploiting Common Data-Dependence Forms
  • Author: Dadu, Vidushi ; Weng, Jian ; Liu, Sihao ; Nowatzki, Tony
  • Subjects: Computer systems organization -- Architectures -- Other architectures -- Data flow architectures ; Computer systems organization -- Architectures -- Other architectures -- Heterogeneous (hybrid) systems ; Computer systems organization -- Architectures -- Other architectures -- Reconfigurable computing
  • Is part of: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019, p.924-939
  • Description: With slowing technology scaling, specialized accelerators are increasingly attractive solutions to continue expected generational scaling of performance. However, in order to accelerate more advanced algorithms or those from challenging domains, supporting data-dependence becomes necessary. This manifests as either data-dependent control (e.g., joining two sparse lists) or data-dependent memory accesses (e.g., hash-table access). These forms of data-dependence inherently couple compute with memory and also preclude efficient vectorization, defeating the traditional mechanisms of programmable accelerators (e.g., GPUs). Our goal is to develop an accelerator which is broadly applicable across algorithms with and without data-dependence. To this end, we first identify forms of data-dependence which are both common and possible to exploit with specialized hardware: specifically, stream-join and alias-free indirection. Then, we create an accelerator with an interface to support these, called the Sparse Processing Unit (SPU). SPU supports alias-free indirection with a compute-enabled scratchpad and aggressive stream reordering, and stream-join with a novel dataflow control model for a reconfigurable systolic compute-fabric. Finally, we add robustness across datatypes by adding decomposability across the compute and memory pipelines. SPU achieves speedups of 16.5×, 10.3×, and 14.2× over a 24-core SKL CPU on ML, database, and graph algorithms respectively. SPU achieves similar performance to domain-specific accelerators. For ML, SPU achieves 1.8-7× speedup against a similarly provisioned GPGPU, with much less area and power.
  • Publisher: New York, NY, USA: ACM
  • Language: English
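
As an illustration of the two forms of data-dependence named in the description, the following minimal Python sketch shows data-dependent control (joining two sorted sparse lists to compute a dot product) and data-dependent memory access (an indirect update of the form counts[keys[i]]). The function names sparse_dot and histogram and the list-based data layout are assumptions made for this example; they are not taken from the paper and do not represent the SPU interface.

```python
def sparse_dot(idx_a, val_a, idx_b, val_b):
    """Dot product of two sparse vectors given as sorted index/value lists.

    Data-dependent control: which pointer advances on each step depends
    on the index values themselves, so the loop cannot be vectorized as-is.
    """
    i = j = 0
    total = 0.0
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:
            # Matching indices: multiply and advance both lists.
            total += val_a[i] * val_b[j]
            i += 1
            j += 1
        elif idx_a[i] < idx_b[j]:
            i += 1  # Index only present in a; skip it.
        else:
            j += 1  # Index only present in b; skip it.
    return total


def histogram(keys, num_bins):
    """Count occurrences of each key.

    Data-dependent memory access: the address updated on each step
    (counts[k]) is determined by the loaded data value k.
    """
    counts = [0] * num_bins
    for k in keys:
        counts[k] += 1
    return counts
```

For example, sparse_dot([0, 2, 5], [1.0, 2.0, 3.0], [2, 3, 5], [4.0, 1.0, 2.0]) matches indices 2 and 5 and returns 14.0, and histogram([1, 3, 1, 0], 4) returns [1, 2, 0, 1].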
