Heterogeneous Acceleration Pipeline for Recommendation System Training

Muhammad Adnan ; Yassaman Ebrahimzadeh Maboud ; Mahajan, Divya ; Nair, Prashant J

arXiv.org, 2024-04

Ithaca: Cornell University Library, arXiv.org

Full text available

  • Title:
    Heterogeneous Acceleration Pipeline for Recommendation System Training
  • Author: Muhammad Adnan ; Yassaman Ebrahimzadeh Maboud ; Mahajan, Divya ; Nair, Prashant J
  • Subjects: Central processing units ; Computer Science - Artificial Intelligence ; Computer Science - Hardware Architecture ; Computer Science - Learning ; CPUs ; Deep learning ; Embedding ; Hybrid modes ; Pipelines ; Pipelining (computers) ; Recommender systems ; Stitching ; Telephone hotlines ; Training
  • Is part of: arXiv.org, 2024-04
  • Description: Recommendation models rely on deep learning networks and large embedding tables, resulting in computationally and memory-intensive processes. These models are typically trained using hybrid CPU-GPU or GPU-only configurations. The hybrid mode combines the GPU's neural network acceleration with the CPUs' memory storage and supply for embedding tables but may incur significant CPU-to-GPU transfer time. In contrast, the GPU-only mode utilizes High Bandwidth Memory (HBM) across multiple GPUs for storing embedding tables. However, this approach is expensive and presents scaling concerns. This paper introduces Hotline, a heterogeneous acceleration pipeline that addresses these concerns. Hotline develops a data-aware and model-aware scheduling pipeline by leveraging the insight that only a few embedding entries are frequently accessed (popular). This approach utilizes CPU main memory for non-popular embeddings and GPUs' HBM for popular embeddings. To achieve this, the Hotline accelerator fragments a mini-batch into popular and non-popular micro-batches. It gathers the necessary working parameters for non-popular micro-batches from the CPU, while GPUs execute popular micro-batches. The hardware accelerator dynamically coordinates the execution of popular embeddings on GPUs and non-popular embeddings from the CPU's main memory. Real-world datasets and models confirm Hotline's effectiveness, reducing average end-to-end training time by 2.2x compared to an Intel-optimized CPU-GPU DLRM baseline.
  • Publisher: Ithaca: Cornell University Library, arXiv.org
  • Language: English
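
The mini-batch fragmentation mentioned in the description can be illustrated with a small software sketch. This is not the paper's hardware accelerator or its scheduling logic. It only shows the general idea under stated assumptions: each sample is taken to be a list of embedding indices, "popular" is taken to mean the most frequently accessed fraction of indices in a profiling pass, and a sample is placed in the popular micro-batch only if every index it touches is popular. The function names `find_popular_ids` and `split_mini_batch` and the `top_fraction` parameter are hypothetical.

```python
from collections import Counter


def find_popular_ids(batches, top_fraction=0.1):
    """Return the most frequently accessed embedding ids.

    Assumption: popularity is estimated by counting accesses over
    some profiling batches and keeping the top `top_fraction` of ids.
    """
    counts = Counter(
        idx for batch in batches for sample in batch for idx in sample
    )
    keep = max(1, int(len(counts) * top_fraction))
    return {idx for idx, _ in counts.most_common(keep)}


def split_mini_batch(mini_batch, popular_ids):
    """Split a mini-batch into popular and non-popular micro-batches.

    Assumption: a sample is "popular" only if all of its embedding
    lookups hit popular ids, so it could be served from GPU memory
    alone. Every other sample needs parameters from CPU main memory.
    """
    popular, non_popular = [], []
    for sample in mini_batch:
        if all(idx in popular_ids for idx in sample):
            popular.append(sample)
        else:
            non_popular.append(sample)
    return popular, non_popular
```

In such a scheme the popular micro-batch could start on the GPUs immediately, while the embeddings needed by the non-popular micro-batch are gathered from CPU memory in the background.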
