The Design and Performance of Batched BLAS on Modern High-Performance Computing Systems
Dongarra, Jack ; Hammarling, Sven ; Higham, Nicholas J ; Relton, Samuel D ; Valero-Lara, Pedro ; Zounon, Mawussi
Elsevier 2017
Title:
The Design and Performance of Batched BLAS on Modern High-Performance Computing Systems
Author:
Dongarra, Jack; Hammarling, Sven; Higham, Nicholas J; Relton, Samuel D; Valero-Lara, Pedro; Zounon, Mawussi
Subjects:
Batched BLAS; BLAS; Memory devices; Electrical engineering; Memory management (Computer science); High performance computing; Memory management; Computers; Parallel processing; Scientific computing; Supercomputers; UPC subject areas
Description:
A current trend in high-performance computing is to decompose a large linear algebra problem into batches containing thousands of smaller problems that can be solved independently, before collating the results. To standardize the interface to these routines, the community is developing an extension to the BLAS standard (the batched BLAS), enabling users to perform thousands of small BLAS operations in parallel whilst making efficient use of their hardware. We discuss the benefits and drawbacks of the current batched BLAS proposals and perform a number of experiments, focusing on a general matrix-matrix multiplication (GEMM), to explore their effect on performance. In particular we analyze the effect of novel data layouts which, for example, interleave the matrices in memory to aid vectorization and prefetching of data. Utilizing these modifications, our code outperforms both MKL and cuBLAS by up to 6 times on the self-hosted Intel KNL (codenamed Knights Landing) and Kepler GPU architectures, for large numbers of double precision GEMM operations using matrices of size 2 × 2 to 20 × 20. The authors would like to thank The University of Tennessee for the use of their computational resources. This research was funded in part from the European Union's Horizon 2020 research and innovation programme under the NLAFET grant agreement No. 671633. Peer Reviewed
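The abstract's two key ideas — performing many independent small GEMMs as one batch, and interleaving the matrices in memory so a single arithmetic operation spans the whole batch — can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration of the general technique, not the paper's actual implementation (which targets MKL, cuBLAS, KNL, and Kepler GPUs); all array names and layouts here are hypothetical.

```python
import numpy as np

# Hypothetical sketch of the batched-GEMM idea described above.
# "Block" layout: matrices stored one after another, shape (batch, n, n).
# "Interleaved" layout: element (i, j) of every matrix stored contiguously,
# shape (n, n, batch), so each multiply-add vectorizes across the batch.

rng = np.random.default_rng(0)
batch, n = 1000, 4  # many small matrices, as in the paper's 2x2..20x20 range

A = rng.standard_normal((batch, n, n))
B = rng.standard_normal((batch, n, n))

# Reference: a batch of independent small GEMMs (block layout).
# np.matmul broadcasts over the leading batch dimension.
C_block = A @ B

# Interleaved layout: move the batch index innermost.
Ai = np.ascontiguousarray(np.moveaxis(A, 0, -1))  # shape (n, n, batch)
Bi = np.ascontiguousarray(np.moveaxis(B, 0, -1))

# The triple loop over (i, j, k) now operates on length-`batch`
# contiguous vectors -- the vectorization- and prefetch-friendly
# access pattern the abstract attributes to interleaved layouts.
Ci = np.zeros((n, n, batch))
for i in range(n):
    for j in range(n):
        for k in range(n):
            Ci[i, j] += Ai[i, k] * Bi[k, j]

# Both layouts compute the same batch of products.
assert np.allclose(np.moveaxis(Ci, -1, 0), C_block)
```

In a compiled implementation the inner length-`batch` updates map directly onto SIMD lanes, which is why the layout matters far more for tiny matrices than for large ones.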
Publisher:
Elsevier
Date of creation/publication:
2017
Language:
English
Links
View record in Consorci de Serveis Universitaris de Catalunya (CSUC)