Spring Papers 2008
Presentation 1 (2008/02/06)
Titles:
- Understanding and Optimizing High-Speed Serial Memory System Protocols
- Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads and Scaling
Authors: Jacob Leverich, Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz, Christos Kozyrakis
Contact Information:
Stanford University
Abstract:
There are two basic models for the on-chip memory in CMP systems: hardware-managed coherent caches and software-managed streaming memory. This paper performs a direct comparison of the two models under the same set of assumptions about technology, area, and computational capabilities. The goal is to quantify how and when they differ in terms of performance, energy consumption, bandwidth requirements, and latency tolerance for general-purpose CMPs. We demonstrate that for data-parallel applications, the cache-based and streaming models perform and scale equally well. For certain applications with little data reuse, streaming scales better due to better bandwidth use and macroscopic software prefetching. However, the introduction of techniques such as hardware prefetching and non-allocating stores to the cache-based model eliminates the streaming advantage. Overall, our results indicate that there is not sufficient advantage in building streaming memory systems where all on-chip memory structures are explicitly managed. On the other hand, we show that streaming at the programming model level is particularly beneficial, even with the cache-based model, as it enhances locality and creates opportunities for bandwidth optimizations. Moreover, we observe that stream programming is actually easier with the cache-based model because the hardware guarantees correct, best-effort execution even when the programmer cannot fully regularize an application's code.
Presented By: Andy Schmidt
Email: andrewgschmidt@gmail.com
Presentation Date: February 6th, 2008
Link to Download Paper 1
Link to Download Paper 2
Presentation 1
Presentation 2 (2008/02/20)
Title: MPI Example & Demo
Authors: Michael Quinn
Contact Information:
Abstract: NONE (book chapters)
Presented By: Ron Sass
Email: rsass@uncc.edu
Presentation Date: February 20th, 2008
Chapter 4
Chapter 5
Presentation 2
Presentation 3 (2008/02/27)
Title: Benchmarking
Authors:
Contact Information:
Abstract:
Presented By: Shanyuan Gao
Email: sgao1@uncc.edu
Presentation Date: February 27th, 2008
Link to Download Paper 3
Presentation 3
Presentation 4 (2008/03/12)
Title: Interfaces (Part 1)
Authors:
Contact Information:
Abstract:
Presented By: Siddhartha Datta
Email:
Presentation Date: March 12th, 2008
Link to Download Paper 4
Presentation 4
Presentation 5 (2008/03/19)
Title:
Authors:
Contact Information:
Abstract:
Presented By:
Email:
Presentation Date: March 19th, 2008
Link to Download Paper 5
Presentation 5
Presentation 6 (2008/03/26)
Title: A Performance Analysis of the Berkeley UPC Compiler
Authors: W. Chen, D. Bonachea, J. Duell, P. Husbands, C. Iancu, K. Yelick
Contact Information: UC Berkeley
Abstract:
Unified Parallel C (UPC) is a parallel language that uses a Single
Program Multiple Data (SPMD) model of parallelism within
a global address space. The global address space is used to simplify
programming, especially on applications with irregular data
structures that lead to fine-grained sharing between threads. Recent
results have shown that the performance of UPC using a commercial
compiler is comparable to that of MPI [7]. In this paper we describe
a portable open source compiler for UPC. Our
goal is to achieve a similar performance while enabling easy porting of the compiler and runtime, and also provide a framework
that allows for extensive optimizations. We identify some of the
challenges in compiling UPC and use a combination of microbenchmarks and application kernels to show that our compiler has
low overhead for basic operations on shared data and is competitive, and sometimes faster than, the commercial HP compiler. We
also investigate several communication optimizations, and show
significant benefits by hand-optimizing the generated code.
Presented By: Reshmi Mitra
Email: resh.mitra@gmail.com
Presentation Date: March 26th, 2008
Link to Download Paper 6
Presentation 6
Presentation 7 (2008/04/02)
Title: Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams
Authors: Michael Bedford Taylor, Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota, Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal
Contact Information:
Abstract: This paper evaluates the Raw microprocessor. Raw addresses the challenge of building a general-purpose architecture that performs well on a larger class of stream and embedded computing applications than existing microprocessors, while still running existing ILP-based sequential programs with reasonable performance in the face of increasing wire delays. Raw approaches this challenge by implementing plenty of on-chip resources - including logic, wires, and pins - in a tiled arrangement, and exposing them through a new ISA, so that the software can take advantage of these resources for parallel applications. Raw supports both ILP and streams by routing operands between architecturally-exposed functional units over a point-to-point scalar operand network. This network offers low latency for scalar data transport. Raw manages the effect of wire delays by exposing the interconnect and using software to orchestrate both scalar and stream data transport. We have implemented a prototype Raw microprocessor in IBM's 180 nm, 6-layer copper, CMOS 7SF standard-cell ASIC process. We have also implemented ILP and stream compilers. Our evaluation attempts to determine the extent to which Raw succeeds in meeting its goal of serving as a more versatile, general-purpose processor. Central to achieving this goal is Raw's ability to exploit all forms of parallelism, including ILP, DLP, TLP, and Stream parallelism. Specifically, we evaluate the performance of Raw on a diverse set of codes including traditional sequential programs, streaming applications, server workloads and bit-level embedded computation. Our experimental methodology makes use of a cycle-accurate simulator validated against our real hardware.
Compared to a 180 nm Pentium-III, using commodity PC memory system components, Raw performs within a factor of 2× for sequential applications with a very low degree of ILP, about 2× to 9× better for higher levels of ILP, and 10×-100× better when highly parallel applications are coded in a stream language or optimized by hand. The paper also proposes a new versatility metric and uses it to discuss the generality of Raw.
Presented By: Will Kritikos
Email: will.kritikos@gmail.com
Presentation Date: April 2nd, 2008
Link to Download Paper 7
Presentation 7
Presentation 8 (2008/04/09)
Topic: Architecture Modeling
Title: Automated Design of Application Specific Superscalar Processors: An Analytical Approach
Authors: Tejas S. Karkhanis and James E. Smith
Contact Information: ECE Dept. University of Wisconsin - Madison
Abstract:
Analytical modeling is applied to the automated design of
application-specific superscalar processors. Using an analytical
method bridges the gap between the size of the design space
and the time required for detailed cycle-accurate simulations.
The proposed design framework takes as inputs the design
targets (upper bounds on execution time, area, and energy),
design alternatives, and one or more application programs. The
output is the set of out-of-order superscalar processors that are
Pareto-optimal with respect to performance-energy-area. The
core of the new design framework is made up of analytical
performance and energy activity models, and an analytical
model-based design optimization process.
For a set of benchmark programs and a design space of
2000 designs, the design framework arrives at all performance-energy-area
Pareto-optimal design points within 16 minutes on
a 2 GHz Pentium-4. In contrast, it is estimated that a naïve
cycle-accurate simulation-based exhaustive search would require
at least two months to arrive at the Pareto-optimal design
points for the same design space.
Presented By: Kushal Datta
Email: Kushal@UNCC
Presentation Date: April 9th, 2008
Link to Download Paper 8
Presentation 8
Presentation 9 (2008/04/30)
Title: Parallel Nonequilibrium Molecular Dynamics Simulations
Authors: S. Srinivasan and R. S. Miller
Contact Information: Department of Mechanical Engineering, Clemson University, Clemson, South
Carolina, USA
Abstract: Parallelization strategies for nonequilibrium molecular dynamics (NEMD) simulations of
heat conduction in heterogeneous materials are presented. In particular, a previously published
algorithm involving the pair decomposition of three-body potentials is extended for
heterogeneous materials. In addition, a novel and linear scaling scheme, also based on pair
decomposition of three-body terms, is introduced for the calculation of the heat flux. The
distributed-computing-based implementation of this algorithm is outlined and its speed-up
characteristics are demonstrated to be close to ideal. Example NEMD simulations using
the new algorithm are performed for the Si/Ge superlattice based on the three-body
Stillinger-Weber potential.
Presented By: Robin Jacob
Email: rpottath@uncc.edu
Presentation Date: April 30th, 2008
Link to Download Paper 9
Presentation 9
Presentation 10 (2008/05/02)
Title: High Performance Support of Parallel Virtual File System (PVFS2) over Quadrics
Authors: Weikuan Yu, Shuang Liang and Dhabaleswar K. Panda
Contact Information: Network-Based Computing Laboratory,
Dept. of Computer Sci. & Engineering,
The Ohio State University
Abstract: Parallel I/O needs to keep pace with the demand of high performance
computing applications on systems with ever-increasing
speed. Exploiting high-end interconnect technologies to reduce the
network access cost and scale the aggregated bandwidth is one of
the ways to increase the performance of storage systems. In this
paper, we explore the challenges of supporting parallel file system
with modern features of Quadrics, including user-level communication
and RDMA operations. We design and implement a
Quadrics-capable version of a parallel file system (PVFS2). Our
design overcomes the challenges imposed by Quadrics static communication
model to dynamic client/server architectures. Quadrics
QDMA and RDMA mechanisms are integrated and optimized for
high performance data communication. Zero-copy PVFS2 list IO is
achieved with a Single Event Associated MUltiple RDMA (SEAMUR)
mechanism. Experimental results indicate that the performance
of PVFS2, with Quadrics user-level protocols and RDMA
operations, is significantly improved in terms of both data transfer
and management operations. With four IO server nodes, our implementation
improves PVFS2 aggregated read bandwidth by up to
140% compared to PVFS2 over TCP on top of Quadrics IP implementation.
Moreover, it delivers significant performance improvement
to application benchmarks such as mpi-tile-io and BTIO.
To the best of our knowledge, this is the first work in the
literature to report the design of a high performance parallel file
system over Quadrics user-level communication protocols.
Presented By: Vikram Karwal
Email: vkarwal@uncc.edu
Presentation Date: May 2nd, 2008
