DaDianNao: A Machine-Learning Supercomputer

  • Title:
    DaDianNao: A Machine-Learning Supercomputer
  • Authors: Chen, Yunji ; Luo, Tao ; Liu, Shaoli ; Zhang, Shijin ; He, Liqiang ; Wang, Jia ; Li, Ling ; Chen, Tianshi ; Xu, Zhiwei ; Sun, Ninghui ; Temam, Olivier
  • Subjects: accelerator ; Bandwidth ; Biological neural networks ; Computer architecture ; Graphics processing units ; Hardware ; Kernel ; machine learning ; neural network ; Neurons
  • Is part of: 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014, p.609-622
  • Description: Many companies are deploying services, for consumers or industry, that are largely based on machine-learning algorithms for sophisticated processing of large amounts of data. The state-of-the-art and most popular such machine-learning algorithms are Convolutional and Deep Neural Networks (CNNs and DNNs), which are known to be both computationally and memory intensive. A number of recently proposed neural network accelerators offer a high ratio of computational capacity to area, but remain hampered by memory accesses. However, unlike the memory wall faced by processors on general-purpose workloads, the memory footprint of CNNs and DNNs, while large, is not beyond the capability of the on-chip storage of a multi-chip system. This property, combined with the algorithmic characteristics of CNNs/DNNs, can lead to high internal bandwidth and low external communication, which in turn enables a high degree of parallelism at a reasonable area cost (see the footprint sketch after this record). In this article, we introduce a custom multi-chip machine-learning architecture along those lines. We show that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 450.65x over a GPU and reduce energy by 150.31x on average for a 64-chip system. We implement the node down to place and route at 28nm; it combines custom storage and computational units with industry-grade interconnects.
  • Publisher: IEEE Computer Society
  • Language: English
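
The abstract's central argument is quantitative: large layer weights, kept resident in distributed on-chip storage, remove most off-chip traffic. A minimal back-of-the-envelope sketch in Python, assuming 36 MB of on-chip eDRAM per node and 16-bit fixed-point weights (figures the paper reports for its node design) and a hypothetical 4096x4096 fully connected layer:

    # Rough check of the footprint claim: a large layer's weights fit
    # comfortably in the aggregate on-chip storage of a 64-chip system.
    NODES = 64                # chips in the system
    EDRAM_PER_NODE_MB = 36    # assumed per-node eDRAM capacity
    BYTES_PER_WEIGHT = 2      # 16-bit fixed-point synapses

    def layer_weights_mb(n_inputs: int, n_outputs: int) -> float:
        """Weight storage of a fully connected layer, in MB."""
        return n_inputs * n_outputs * BYTES_PER_WEIGHT / 2**20

    layer_mb = layer_weights_mb(4096, 4096)        # 32.0 MB
    total_on_chip_mb = NODES * EDRAM_PER_NODE_MB   # 2304 MB

    print(f"layer weights:   {layer_mb:.1f} MB")
    print(f"on-chip storage: {total_on_chip_mb} MB")
    # With weights pinned on-chip, only layer inputs/outputs cross chip
    # boundaries: high internal bandwidth, low external communication.

Under these assumptions, even a few dozen such layers stay within the machine's combined eDRAM, which is what lets the design trade external memory bandwidth for on-chip capacity.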
