Fall Papers 2009

Presentation 1 (2009/09/08)

Paper 1 Title: End-to-End Performance Forecasting: Finding Bottlenecks Before They Happen
Paper 1 Authors: Ali G. Saidi, Nathan L. Binkert, Steven K. Reinhardt, Trevor Mudge
Paper 1 Contact Information: The University of Michigan, Hewlett-Packard, Advanced Micro Devices
Link to Download Paper 1

Paper 2 Title: Rigel: an architecture and scalable programming interface for a 1000-core accelerator
Paper 2 Authors: John H. Kelm, Daniel R. Johnson, Matthew R. Johnson, Neal C. Crago, William Tuohy, Aqeel Mahesri, Steven S. Lumetta, Matthew I. Frank, Sanjay J. Patel
Paper 2 Contact Information: University of Illinois, Urbana
Link to Download Paper 2

Presented By: Andy Schmidt
Email: andrewgschmidt@gmail.com
Presentation 1

Presentation 2 (2009/09/15)

Paper 1 Title: Efficient hardware code generation for FPGAs
Authors: Zhi Guo, Walid Najjar, and Betul Buyukkurt
Link to Paper 1

Paper 2 Title: Performance and power of cache-based reconfigurable computing
Authors: Andrew Putnam, Susan Eggers, Dave Bennett, Eric Dellinger, Jeff Mason, Henry Styles, Prasanna Sundararajan, Ralph Wittig
Link to Paper 2

Presented By: Rahul S

Presentation 3 (2009/09/22)

Paper 1 Title: Introduction to Programmable Active Memories
Authors: Patrice Bertin, Didier Roncin, and Jean Vuillemin
Link to download paper 1

Paper 2 Title: PAM-Blox: High Performance FPGA Design for Adaptive Computing
Authors: Oskar Mencer, Martin Morf, and Michael J. Flynn
Link to download paper 2

Presented By: Scott Buscemi
Email: abuscemi@uncc.edu


Presentation X

Presentation 4 (2009/09/29)

Paper 1 Title: Maximizing MPI Point-to-Point Communication Performance on RDMA-enabled Clusters with Customized Protocols
Paper 1 Authors: Matthew Small, Xin Yuan
Paper 1 Contact Information: Florida State University
Link to download paper 1


Paper 2 Title: Efficient High Performance Collective Communication for the Cell Blade
Paper 2 Authors: Qasim Ali, Samuel P. Midkiff, Vijay S. Pai
Paper 2 Contact Information: Purdue University
Link to download paper 2


Presented By: Shanyuan Gao
Presentation 4

Presentation 5 (2009/10/06)

Paper 1 Title: Amdahl's Law in the Multicore Era
Paper 1 Authors: Mark D. Hill and Michael R. Marty
Link to Paper 1


Paper 2 Title: Producing Wrong Data Without Doing Anything Obviously Wrong!
Paper 2 Authors: Todd Mytkowicz, Amer Diwan, Matthias Hauswirth and Peter F. Sweeney
Link to Paper 2


Presented By: Siddhartha Datta
Presentation 5

Presentation 6 (2009/10/20)

Paper 1 Title: Gordon: Using Flash Memory to Build Fast, Power-efficient Clusters for Data-intensive Applications
Authors: Adrian M. Caulfield, Laura M. Grupp and Steven Swanson
Contact Information: University of California, San Diego
Link to download paper

Paper 2 Title: The Performance of PC Solid-State Disks (SSDs) as a Function of Bandwidth, Concurrency, Device Architecture and System Organization
Authors: Cagdas Dirik and Bruce Jacob
Contact Information: University of Maryland, College Park
Link to download paper

Presented By: Ashwin Mendon
Presentation 6

Presentation 7 (2009/10/27)

Title: I/O Performance Challenges at Leadership Scale
Authors: Kevin Harms, William Allcock, Samuel Lang, Philip Carns, Robert Latham, and Robert Ross
Abstract: In this paper we present a case study of the I/O challenges to performance and scalability on Intrepid, the IBM Blue Gene/P system at the Argonne Leadership Computing Facility.
Contact Information: Argonne National Laboratory
Conference: SC ’09
Link to download paper

Title: Scalable Performance of the Panasas Parallel File System
Authors: Brent Welch, Marc Unangst, Zainul Abbasi, Garth Gibson, Brian Mueller, Jason Small, Jim Zelenka, Bin Zhou
Abstract: This paper presents performance measures of I/O, metadata, and recovery operations for storage clusters that range in size from 10 to 120 storage nodes, 1 to 12 metadata nodes, and with file system client counts ranging from 1 to 100 compute nodes. Production installations are as large as 500 storage nodes, 50 metadata managers, and 5000 clients.
Contact Information: Panasas, Inc.; Carnegie Mellon
Conference: FAST ’08
Link to download paper

Presented By: Bin Huang
Presentation X

Presentation 8 (2009/11/03)

Title:
FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs
Authors:
Alexandros Papakonstantinou (1), Karthik Gururaj (2), John A. Stratton (1), Deming Chen (1), Jason Cong (2), Wen-Mei W. Hwu (1)
(1) Electrical & Computer Engineering Dept., University of Illinois, Urbana-Champaign, IL, USA
(2) Computer Science Dept., University of California, Los Angeles, CA, USA
Contact Information:
{apapako2, stratton, dchen, hwu} @ Illinois.edu
Abstract:
As growing power dissipation and thermal effects disrupted the rising clock frequency trend and threatened to annul Moore’s law, the computing industry has switched its route to higher performance through parallel processing. The rise of multi-core systems in all domains of computing has opened the door to heterogeneous multi-processors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs and FPGAs are becoming very popular in PC-based heterogeneous systems for speeding up compute intensive kernels of scientific, imaging and simulation applications. GPUs can execute hundreds of concurrent threads, while FPGAs provide customized concurrency for highly parallel kernels. However, exploiting the parallelism available in these applications is often not a push-button task. Often the programmer has to expose the application’s fine and coarse grained parallelism by using special APIs. CUDA is such a parallel-computing API that is driven by the GPGPU industry and is gaining significant popularity. In this work, we adapt the CUDA programming model into a new FPGA design flow called FCUDA, which efficiently maps the coarse and fine grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SPMD CUDA thread blocks into parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multi-core accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.
Presented By: Liu Hu
Conference: SASP'09
Link to download paper
Presentation X

Presentation 9 (2009/11/10)

Title: ParalleX: A Study of A New Parallel Computation Model
Authors: Guang R. Gao, Thomas Sterling, Rick Stevens, Mark Hereld, and Weirong Zhu
Abstract: This paper proposes the study of a new computation model that attempts to address the underlying sources of performance degradation (e.g. latency, overhead, and starvation) and the difficulties of programmer productivity (e.g. explicit locality management and scheduling, performance tuning, fragmented memory, and synchronous global barriers) to dramatically enhance the broad effectiveness of parallel processing for high end computing. In this paper, we present the progress of our research on a parallel programming and execution model - mainly, ParalleX. We describe the functional elements of ParalleX, one such model being explored as part of this project. We also report our progress on the development and study of a subset of ParalleX (the LITL-X) at University of Delaware. We then present a novel architecture model - Gilgamesh II - as a ParalleX processing architecture. A design point study of Gilgamesh II and the architecture concept strategy are presented.
Paper 1


Title: An Executable Analytical Performance Evaluation Approach for Early Performance Prediction
Authors: A. Jacquet, V. Janot, C. Leung, G. R. Gao, R. Govindarajan, and T. L. Sterling
Abstract: Percolation has recently been proposed as a key component of an advanced program execution model for future generation high-end machines featuring adaptive data/code transformation and movement for effective latency tolerance. An early evaluation of the performance effect of percolation is very important in the design space exploration of future generations of supercomputers. In this paper, we develop an executable analytical performance model of a high performance multithreaded architecture that supports percolation. A novel feature of our approach is modeling interactions between software (program) and hardware (architecture) components. We solve the analytical model using a queuing simulation tool enriched with synchronization. The proposed approach is effective and facilitates obtaining performance trends quickly. Our results indicate that percolation brings in significant performance gains (by a factor of 2.7 to 11). Further, our results reveal that percolation and multithreading can complement each other.
Paper 2
Presented By: Will Kritikos
Presentation 9

Presentation 10 (2009/11/17)

Paper 1

Title: PowerNap: Eliminating Server Idle Power
Authors: David Meisner, Brian T. Gold, and Thomas F. Wenisch
Abstract: Data center power consumption is growing to unprecedented levels: the EPA estimates U.S. data centers will consume 100 billion kilowatt hours annually by 2011. Much of this energy is wasted in idle systems: in typical deployments, server utilization is below 30%, but idle servers still consume 60% of their peak power draw. Typical idle periods, though frequent, last seconds or less, confounding simple energy-conservation approaches. In this paper, we propose PowerNap, an energy-conservation approach where the entire system transitions rapidly between a high-performance active state and a near-zero-power idle state in response to instantaneous load. Rather than requiring fine-grained power-performance states and complex load-proportional operation from each system component, PowerNap instead calls for minimizing idle power and transition time, which are simpler optimization goals. Based on the PowerNap concept, we develop requirements and outline mechanisms to eliminate idle power waste in enterprise blade servers. Because PowerNap operates in low-efficiency regions of current blade center power supplies, we introduce the Redundant Array for Inexpensive Load Sharing (RAILS), a power provisioning approach that provides high conversion efficiency across the entire range of PowerNap’s power demands. Using utilization traces collected from enterprise-scale commercial deployments, we demonstrate that, together, PowerNap and RAILS reduce average server power consumption by 74%.

Conference: ASPLOS '09 (Architectural Support for Programming Languages and Operating Systems), Washington, DC, USA

Link to download paper


Paper 2

Title: Interconnect Agnostic Checkpoint/Restart in Open MPI
Authors: Joshua Hursey, Timothy I. Mattox, and Andrew Lumsdaine
Abstract: Long running High Performance Computing (HPC) applications at scale must be able to tolerate inevitable faults if they are to harness current and future HPC systems. Message Passing Interface (MPI) level transparent checkpoint/restart fault tolerance is an appealing option to HPC application developers that do not wish to restructure their code. Historically, MPI implementations that provided this option have struggled to provide a full range of interconnect support, especially shared memory support. This paper presents a new approach for implementing checkpoint/restart coordination algorithms that allows the MPI implementation of checkpoint/restart to be interconnect agnostic. This approach allows an application to be checkpointed on one set of interconnects (e.g., InfiniBand and shared memory) and be restarted with a different set of interconnects (e.g., Myrinet and shared memory or Ethernet). By separating the network interconnect details from the checkpoint/restart coordination algorithm we allow the HPC application to respond to changes in the cluster environment such as interconnect unavailability due to switch failure, re-load balance on an existing machine, or migrate to a different machine with a different set of interconnects. We present results characterizing the performance impact of this approach on HPC applications.

Conference: HPDC '09 (High Performance Distributed Computing), Garching, Germany

Link to download paper

Presented By: Robin Jacob
Email: rpottath@uncc.edu
Presentation 10

Presentation 11 (2009/11/24)

Title:
Authors:
Contact Information:
Abstract:
Presented By: Yamuna Rajasekar
Email:
Link to download paper
Presentation X

Presentation 12 (2009/12/01)

Title:
Authors:
Contact Information:
Abstract:
Presented By: Shweta Jain
Email:
Link to download paper
Presentation X
