Spring Papers 2008

Presentation 1 (2008/02/06)

Titles:

  • Understanding and Optimizing High-Speed Serial Memory System Protocols
  • Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads and Scaling


Authors: Jacob Leverich, Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz, Christos Kozyrakis
Contact Information: Stanford University
Abstract:
There are two basic models for the on-chip memory in CMP systems: hardware-managed coherent caches and software-managed streaming memory. This paper performs a direct comparison of the two models under the same set of assumptions about technology, area, and computational capabilities. The goal is to quantify how and when they differ in terms of performance, energy consumption, bandwidth requirements, and latency tolerance for general-purpose CMPs. We demonstrate that for data-parallel applications, the cache-based and streaming models perform and scale equally well. For certain applications with little data reuse, streaming scales better due to better bandwidth use and macroscopic software prefetching. However, the introduction of techniques such as hardware prefetching and non-allocating stores to the cache-based model eliminates the streaming advantage. Overall, our results indicate that there is not sufficient advantage in building streaming memory systems where all on-chip memory structures are explicitly managed. On the other hand, we show that streaming at the programming model level is particularly beneficial, even with the cache-based model, as it enhances locality and creates opportunities for bandwidth optimizations. Moreover, we observe that stream programming is actually easier with the cache-based model because the hardware guarantees correct, best-effort execution even when the programmer cannot fully regularize an application's code.

Presented By: Andy Schmidt
Email: andrewgschmidt@gmail.com
Presentation Date: February 6th, 2008

Link to Download Paper 1
Link to Download Paper 2
Presentation 1

Presentation 2 (2008/02/20)

Title: MPI Example & Demo
Authors: Michael Quinn
Contact Information:
Abstract: NONE (book chapters)
Presented By: Ron Sass
Email: rsass@uncc.edu
Presentation Date: February 20th, 2008

Chapter 4
Chapter 5
Presentation 2

Presentation 3 (2008/02/27)

Title: Benchmarking
Authors:
Contact Information:
Abstract:



Presented By: Shanyuan Gao
Email: sgao1@uncc.edu
Presentation Date: February 27th, 2008

Link to Download Paper 3
Presentation 3

Presentation 4 (2008/03/12)

Title: Interfaces (Part 1)
Authors:
Contact Information:
Abstract:



Presented By: Siddhartha Datta
Email:
Presentation Date: March 12th, 2008

Link to Download Paper 4
Presentation 4

Presentation 5 (2008/03/19)

Title:
Authors:
Contact Information:
Abstract:



Presented By:
Email:
Presentation Date: March 19th, 2008

Link to Download Paper 5
Presentation 5

Presentation 6 (2008/03/26)

Title: A Performance Analysis of the Berkeley UPC Compiler
Authors: W. Chen, D. Bonachea, J. Duell, P. Husbands, C. Iancu, K. Yelick
Contact Information: UC Berkeley
Abstract:
Unified Parallel C (UPC) is a parallel language that uses a Single Program Multiple Data (SPMD) model of parallelism within a global address space. The global address space is used to simplify programming, especially on applications with irregular data structures that lead to fine-grained sharing between threads. Recent results have shown that the performance of UPC using a commercial compiler is comparable to that of MPI [7]. In this paper we describe a portable open source compiler for UPC. Our goal is to achieve a similar performance while enabling easy porting of the compiler and runtime, and also provide a framework that allows for extensive optimizations. We identify some of the challenges in compiling UPC and use a combination of microbenchmarks and application kernels to show that our compiler has low overhead for basic operations on shared data and is competitive with, and sometimes faster than, the commercial HP compiler. We also investigate several communication optimizations, and show significant benefits by hand-optimizing the generated code.

Presented By: Reshmi Mitra
Email: resh.mitra@gmail.com
Presentation Date: March 26th, 2008

Link to Download Paper 6
Presentation 6

Presentation 7 (2008/04/02)

Title: Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams
Authors: Michael Bedford Taylor, Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota, Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal
Contact Information:
Abstract: This paper evaluates the Raw microprocessor. Raw addresses the challenge of building a general-purpose architecture that performs well on a larger class of stream and embedded computing applications than existing microprocessors, while still running existing ILP-based sequential programs with reasonable performance in the face of increasing wire delays. Raw approaches this challenge by implementing plenty of on-chip resources - including logic, wires, and pins - in a tiled arrangement, and exposing them through a new ISA, so that the software can take advantage of these resources for parallel applications. Raw supports both ILP and streams by routing operands between architecturally-exposed functional units over a point-to-point scalar operand network. This network offers low latency for scalar data transport. Raw manages the effect of wire delays by exposing the interconnect and using software to orchestrate both scalar and stream data transport. We have implemented a prototype Raw microprocessor in IBM's 180 nm, 6-layer copper, CMOS 7SF standard-cell ASIC process. We have also implemented ILP and stream compilers. Our evaluation attempts to determine the extent to which Raw succeeds in meeting its goal of serving as a more versatile, general-purpose processor. Central to achieving this goal is Raw's ability to exploit all forms of parallelism, including ILP, DLP, TLP, and Stream parallelism. Specifically, we evaluate the performance of Raw on a diverse set of codes including traditional sequential programs, streaming applications, server workloads and bit-level embedded computation. Our experimental methodology makes use of a cycle-accurate simulator validated against our real hardware.
Compared to a 180 nm Pentium-III, using commodity PC memory system components, Raw performs within a factor of 2× for sequential applications with a very low degree of ILP, about 2× to 9× better for higher levels of ILP, and 10×-100× better when highly parallel applications are coded in a stream language or optimized by hand. The paper also proposes a new versatility metric and uses it to discuss the generality of Raw.



Presented By: Will Kritikos
Email: will.kritikos@gmail.com
Presentation Date: April 2nd, 2008

Link to Download Paper 7
Presentation 7

Presentation 8 (2008/04/09)

Topic: Architecture Modeling
Title: Automated Design of Application Specific Superscalar Processors: An Analytical Approach
Authors: Tejas S. Karkhanis and James E. Smith
Contact Information: ECE Dept. University of Wisconsin - Madison
Abstract:
Analytical modeling is applied to the automated design of application-specific superscalar processors. Using an analytical method bridges the gap between the size of the design space and the time required for detailed cycle-accurate simulations. The proposed design framework takes as inputs the design targets (upper bounds on execution time, area, and energy), design alternatives, and one or more application programs. The output is the set of out-of-order superscalar processors that are Pareto-optimal with respect to performance-energy-area. The core of the new design framework is made up of analytical performance and energy activity models, and an analytical model-based design optimization process. For a set of benchmark programs and a design space of 2000 designs, the design framework arrives at all performance-energy-area Pareto-optimal design points within 16 minutes on a 2 GHz Pentium-4. In contrast, it is estimated that a naïve cycle-accurate simulation-based exhaustive search would require at least two months to arrive at the Pareto-optimal design points for the same design space.

Presented By: Kushal Datta
Email: Kushal@UNCC
Presentation Date: April 9th, 2008

Link to Download Paper 8
Presentation 8

Presentation 9 (2008/04/30)

Title: Parallel Nonequilibrium Molecular Dynamics Simulations
Authors: S. Srinivasan and R. S. Miller
Contact Information: Department of Mechanical Engineering, Clemson University, Clemson, South Carolina, USA
Abstract: Parallelization strategies for nonequilibrium molecular dynamics (NEMD) simulations of heat conduction in heterogeneous materials are presented. In particular, a previously published algorithm involving the pair decomposition of three-body potentials is extended for heterogeneous materials. In addition, a novel and linear scaling scheme, also based on pair decomposition of three-body terms, is introduced for the calculation of the heat flux. The distributed-computing-based implementation of this algorithm is outlined and its speed-up characteristics are demonstrated to be close to ideal. Example NEMD simulations using the new algorithm are performed for the Si/Ge superlattice based on the three-body Stillinger-Weber potential.

Presented By: Robin Jacob
Email: rpottath@uncc.edu
Presentation Date: April 30th, 2008

Link to Download Paper 9
Presentation 9

Presentation 10 (2008/05/02)

Title: High Performance Support of Parallel Virtual File System (PVFS2) over Quadrics
Authors: Weikuan Yu, Shuang Liang and Dhabaleswar K. Panda
Contact Information: Network-Based Computing Laboratory, Dept. of Computer Sci. & Engineering, The Ohio State University
Abstract: Parallel I/O needs to keep pace with the demand of high performance computing applications on systems with ever-increasing speed. Exploiting high-end interconnect technologies to reduce the network access cost and scale the aggregated bandwidth is one of the ways to increase the performance of storage systems. In this paper, we explore the challenges of supporting parallel file system with modern features of Quadrics, including user-level communication and RDMA operations. We design and implement a Quadrics-capable version of a parallel file system (PVFS2). Our design overcomes the challenges imposed by Quadrics static communication model to dynamic client/server architectures. Quadrics QDMA and RDMA mechanisms are integrated and optimized for high performance data communication. Zero-copy PVFS2 list IO is achieved with a Single Event Associated MUltiple RDMA (SEAMUR) mechanism. Experimental results indicate that the performance of PVFS2, with Quadrics user-level protocols and RDMA operations, is significantly improved in terms of both data transfer and management operations. With four IO server nodes, our implementation improves PVFS2 aggregated read bandwidth by up to 140% compared to PVFS2 over TCP on top of Quadrics IP implementation. Moreover, it delivers significant performance improvement to application benchmarks such as mpi-tile-io and BTIO. To the best of our knowledge, this is the first work in the literature to report the design of a high performance parallel file system over Quadrics user-level communication protocols.



Presented By: Vikram Karwal
Email: vkarwal@uncc.edu
Presentation Date: May 2nd, 2008

Download Paper 10
Presentation 10
