University of Warwick

    Computer Science Repository


    Performance Analysis of a Hybrid MPI/CUDA Implementation of the NAS-LU Benchmark

    Pennycook, S.J., Hammond, S.D., Mudalige, G.R. and Jarvis, S.A. (2010) Performance Analysis of a Hybrid MPI/CUDA Implementation of the NAS-LU Benchmark. In: 1st International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS 10), held in conjunction with IEEE/ACM Supercomputing 2010 (SC'10), New Orleans, LA, USA.

    Full text: PDF, published version (396Kb)

      Abstract

      The emergence of Graphics Processing Units (GPUs) as a potential alternative to conventional general-purpose processors has led to significant interest in these architectures from both the academic community and the High Performance Computing (HPC) industry. While GPUs look likely to deliver unparalleled levels of performance, published studies claiming performance improvements in excess of 30,000x are misleading.

      Significant on-node performance improvements have been demonstrated for code kernels and algorithms amenable to GPU acceleration, but studies demonstrating comparable results for full scientific applications that require multiple-GPU architectures are rare.

      In this paper we present an analysis of a port of the NAS LU benchmark to NVIDIA's Compute Unified Device Architecture (CUDA), the most stable GPU programming model currently available. We also extend our solution to multiple nodes and multiple GPU devices.
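NAS LU's SSOR solver performs lower- and upper-triangular sweeps in which each cell depends on neighbours that were updated earlier in the same sweep, which is what makes a GPU port non-trivial. One common way to expose parallelism in such a sweep is to process the grid hyperplane by hyperplane (all cells with i + j + k = h), because every cell on one hyperplane depends only on the previous one. The Python sketch below illustrates that idea under stated assumptions and is not the authors' implementation: the scalar recurrence and the coefficient `a` are hypothetical stand-ins for the benchmark's 5x5 block updates, and all function names are invented for illustration. It checks that hyperplane ordering reproduces the serial result exactly.

```python
from itertools import product


def cell_update(v, rhs, a, i, j, k):
    # Hypothetical scalar stand-in for LU's 5x5 block update: cell (i, j, k)
    # needs its three "upwind" neighbours; off-grid neighbours count as 0.
    return rhs[(i, j, k)] + a * (
        v.get((i - 1, j, k), 0.0)
        + v.get((i, j - 1, k), 0.0)
        + v.get((i, j, k - 1), 0.0)
    )


def sweep_serial(rhs, n, a):
    """Reference lower-triangular sweep in natural loop order (k outermost)."""
    v = {}
    for k, j, i in product(range(n), repeat=3):
        v[(i, j, k)] = cell_update(v, rhs, a, i, j, k)
    return v


def hyperplanes(n):
    """Group the cells of an n x n x n grid by hyperplane index h = i + j + k."""
    planes = [[] for _ in range(3 * (n - 1) + 1)]
    for i, j, k in product(range(n), repeat=3):
        planes[i + j + k].append((i, j, k))
    return planes


def sweep_hyperplane(rhs, n, a):
    """Same sweep, one hyperplane at a time.

    Every cell on plane h reads only cells on plane h - 1, so all updates
    within a plane are independent; on a GPU each plane could become one
    kernel launch with one thread per cell. Here the plane's updates are
    computed from the old state first and then applied together to mimic that.
    """
    v = {}
    for plane in hyperplanes(n):
        new = {c: cell_update(v, rhs, a, *c) for c in plane}
        v.update(new)
    return v


if __name__ == "__main__":
    n, a = 8, 0.1
    rhs = {c: float(sum(c) + 1) for c in product(range(n), repeat=3)}
    assert sweep_hyperplane(rhs, n, a) == sweep_serial(rhs, n, a)
    sizes = [len(p) for p in hyperplanes(n)]
    print(f"{len(sizes)} hyperplanes, largest holds {max(sizes)} cells")
```

The first and last hyperplanes each contain a single cell, so the earliest and latest launches in a sweep expose very little parallelism; this is one reason such dependent sweeps are harder to accelerate than fully data-parallel kernels.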

      Runtime performance on several GPUs is presented, ranging from low-end, consumer-grade cards such as the 8400GS to NVIDIA's flagship Fermi HPC processor found in the recently released C2050. We compare the runtimes on these devices with those on several CPUs, including processors from Intel, AMD and IBM.

      In addition, we utilise a recently developed performance model of LU to predict its runtime on large-scale distributed GPU clusters, which are expected to become commonplace in future high-end HPC architectures.
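To give a flavour of how an analytic model can predict runtime at scale, the sketch below implements a deliberately simplified pipelined-wavefront model. It is not the performance model used in the paper. It assumes a 2D px x py decomposition of an nx x ny x nz grid, tiles of kb z-planes, two sweeps per iteration, one face message to each downstream neighbour per tile, and it ignores all non-sweep work. The machine parameters (`t_cell`, `latency`, `bandwidth`) and the 40 bytes per face cell (five doubles) are hypothetical inputs that would normally come from benchmarking a real device and interconnect.

```python
import math
from dataclasses import dataclass


@dataclass
class Machine:
    t_cell: float     # seconds to update one grid cell on one device (hypothetical)
    latency: float    # per-message latency in seconds (hypothetical)
    bandwidth: float  # link bandwidth in bytes per second (hypothetical)


def sweep_time(nx, ny, nz, px, py, kb, m, bytes_per_cell=40):
    """Estimated time for one pipelined wavefront sweep on a px x py rank grid."""
    cells_x = math.ceil(nx / px)           # local sub-domain width per rank
    cells_y = math.ceil(ny / py)           # local sub-domain depth per rank
    tiles = math.ceil(nz / kb)             # tiles of kb z-planes per sweep
    w = cells_x * cells_y * kb * m.t_cell  # compute time for one tile

    def msg(face_cells):
        # Latency plus transfer time for one boundary face of a tile.
        return m.latency + face_cells * kb * bytes_per_cell / m.bandwidth

    # Assume the east and south sends are serialised; skip a direction
    # when there is only one rank along it.
    c = (msg(cells_y) if px > 1 else 0.0) + (msg(cells_x) if py > 1 else 0.0)
    stages = (px - 1) + (py - 1) + tiles   # pipeline fill plus steady state
    return stages * (w + c)


def lu_runtime(nx, ny, nz, px, py, kb, m, iterations):
    """Two sweeps (lower and upper) per iteration; all other work is ignored."""
    return iterations * 2 * sweep_time(nx, ny, nz, px, py, kb, m)


if __name__ == "__main__":
    gpu = Machine(t_cell=1e-9, latency=5e-6, bandwidth=1e9)  # made-up numbers
    for px, py in [(1, 1), (2, 2), (4, 4)]:
        for kb in (1, 4, 16):
            t = lu_runtime(162, 162, 162, px, py, kb, gpu, iterations=250)
            print(f"{px}x{py} ranks, kb={kb}: {t:.3f} s (model estimate)")
```

The tile height kb captures the usual wavefront trade-off: smaller tiles shorten the pipeline fill (each of the px + py - 2 fill stages costs less compute) but send more, smaller messages, so the latency term grows.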

      Item Type: Conference or Workshop Item (Paper)
      Uncontrolled Keywords: pcav hpsg gpu lu cuda performance modelling modeling pmbs bluegene
      Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
      Q Science > QA Mathematics > QA76 Computer software
      Q Science > QA Mathematics > QA76.73 Computer algorithms. Data structures.
      Divisions: Faculty of Science > Computer Science
      Depositing User: Simon Hammond
      Date Deposited: 04 Nov 2010 07:41
      Last Modified: 23 Feb 2012 09:07
      URI: http://eprints.dcs.warwick.ac.uk/id/eprint/270
