Fall Papers 2007
From RCSWiki
Presentation 1 (2007/09/11)
Title: Performance Analysis of k-ary n-cube Interconnection Networks
Authors: William J. Dally
Contact Information:
Stanford University
Abstract:
This paper analyzes communication
networks of varying dimension under the assumption of
constant wire bisection. Expressions for the latency, average case
throughput, and hot-spot throughput of k-ary n-cube networks
with constant bisection are derived that agree closely with experimental
measurements. It is shown that low-dimensional networks
(e.g., tori) have lower latency and higher hot-spot throughput
than high-dimensional networks (e.g., binary n-cubes) with the
same bisection width.
Presented By: Will Kritikos
Email: will.kritikos@gmail.com
Presentation Date: Sept 11, 2007
Download Paper
Citeseer - BibTex
Presentation 1
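The latency claim in the Presentation 1 abstract can be illustrated with a small back-of-the-envelope model. This is a sketch under stated assumptions, not the paper's derivation: rings are unidirectional with wraparound, the channel width is normalized to k/2 so that every 256-node network has the same bisection as a binary n-cube with width-1 channels, and latency is counted in cycles as hops plus message length divided by channel width. The function names and the 128-bit message length are made up for the example.

```python
def avg_hops_unidirectional(k: int, n: int) -> float:
    """Average hop count in a k-ary n-cube with unidirectional wraparound rings.

    In each dimension the offsets 0..k-1 are equally likely, so the mean
    per-dimension distance is (k - 1) / 2.
    """
    per_dimension = sum(range(k)) / k
    return n * per_dimension


def zero_load_latency(k: int, n: int, message_bits: int) -> float:
    """Toy zero-load latency in cycles: hops plus serialization time.

    Assumption: channel width is k / 2 bits, which keeps the bisection
    equal to that of a binary n-cube with 1-bit channels.
    """
    channel_width = k / 2
    return avg_hops_unidirectional(k, n) + message_bits / channel_width


if __name__ == "__main__":
    # Three ways to build a 256-node network with the same bisection.
    for k, n in [(2, 8), (4, 4), (16, 2)]:
        assert k ** n == 256
        print(f"k={k:2d} n={n}: latency = {zero_load_latency(k, n, 128)} cycles")
```

Under these assumptions the wider channels of the low-dimensional network more than offset its longer paths, which matches the direction of the result stated in the abstract.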
Presentation 2 (2007/09/25)
Title: The Turn Model for Adaptive Routing
Authors: Christopher J. Glass and Lionel M. Ni
Contact Information:
Michigan State University
Abstract:
We present a model for designing wormhole routing algorithms that are
deadlock free, livelock free, minimal or nonminimal, and maximally
adaptive. A unique feature of this model is that it is not based on
adding physical or virtual channels to network topologies (though it
can be applied to networks with extra channels). Instead, the model
is based on analyzing the directions in which packets can turn in a
network and the cycles that the turns can form. Prohibiting just
enough turns to break all of the cycles produces routing algorithms
that are deadlock free, livelock free, minimal or nonminimal, and
maximally adaptive for the network. In this paper, we focus on the
two most common network topologies for wormhole routing,
n-dimensional meshes and k-ary n-cubes, without extra channels. In an
n-dimensional mesh, just a quarter of the turns must be prohibited to
prevent deadlock. The remaining three quarters of the turns permit
partial adaptiveness in routing. Partially adaptive routing ...
Presented By: Siddhartha Datta
Email: skdatta@uncc.edu
Presentation Date: Sept 25, 2007
Download Paper
Citeseer - BibTex
Presentation 2
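To make the idea of prohibiting turns concrete, here is a minimal sketch of west-first routing on a 2D mesh, one of the partially adaptive algorithms associated with the turn model. A 2D mesh has eight 90-degree turns; this sketch prohibits the two turns into the west direction (a quarter of them), so any westward hops must be taken first. The coordinate convention, the function names, and the restriction to minimal paths are assumptions made for the example.

```python
# Unit steps for each direction on a 2D mesh (x grows east, y grows north).
STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}

# The two turns that west-first routing forbids: (from direction, to direction).
PROHIBITED = {("N", "W"), ("S", "W")}


def west_first_moves(cur, dst):
    """Return the minimal hops allowed from cur toward dst under west-first."""
    (x, y), (dx, dy) = cur, dst
    if dx < x:
        return ["W"]  # westward hops come first, with no adaptivity
    moves = []
    if dx > x:
        moves.append("E")
    if dy > y:
        moves.append("N")
    if dy < y:
        moves.append("S")
    return moves  # any of these may be chosen adaptively


def all_paths(cur, dst):
    """Enumerate every minimal path that west-first routing permits."""
    if cur == dst:
        return [[]]
    paths = []
    for move in west_first_moves(cur, dst):
        nxt = (cur[0] + STEP[move][0], cur[1] + STEP[move][1])
        paths += [[move] + rest for rest in all_paths(nxt, dst)]
    return paths


def uses_prohibited_turn(path):
    """True if any two consecutive hops form a prohibited turn."""
    return any(pair in PROHIBITED for pair in zip(path, path[1:]))


if __name__ == "__main__":
    print(all_paths((2, 0), (0, 1)))  # westbound: a single deterministic path
    print(all_paths((0, 0), (2, 1)))  # eastbound: several adaptive choices
```

The enumeration shows the partial adaptiveness: destinations to the west get exactly one path, while destinations to the east can be reached over several.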
Presentation 3 (2007/10/02)
Title: Overview of the Blue Gene/L system architecture
Authors: A. Gara, M. A. Blumrich, D. Chen
Contact Information: IBM
Abstract:
A great gap has existed between the cost/performance ratios of existing supercomputers and those of dedicated application-specific machines. The Blue Gene/L (BG/L) supercomputer was designed to address that gap. The objective was to retain the exceptional cost/performance levels achieved by application-specific machines, while generalizing the massively parallel architecture enough to enable a relatively broad class of applications. The goal of excellent cost/performance meshes nicely with the additional goals of achieving exceptional performance/power and performance/volume ratios.
Title: Optimization of MPI Collective Communication on BlueGene/L Systems
Authors: George Almási, Philip Heidelberger,
Contact Information: IBM T.J. Watson Research Center, Yorktown Heights, NY
Abstract:
BlueGene/L is currently the world's fastest supercomputer. It consists of a large number of low power dual-processor compute nodes interconnected by high speed torus and collective networks. Because compute nodes do not have shared memory, MPI is the natural programming model for this machine. The BlueGene/L MPI library is a port of MPICH2. In this paper we discuss the implementation of MPI collectives on BlueGene/L. The MPICH2 implementation of MPI collectives is based on point-to-point communication primitives. This turns out to be suboptimal for a number of reasons. Machine-optimized MPI collectives are necessary to harness the performance of BlueGene/L. We discuss these optimized MPI collectives, describing the algorithms and presenting performance results measured with targeted micro-benchmarks on real BlueGene/L hardware with up to 4096 compute nodes.
Presented By: Robin Jacob
Email: robinjacob@mail.com
Presentation Date: Oct 2, 2007
Link to Paper 1
Link to Paper 2
Presentation 3
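The second Presentation 3 abstract notes that generic MPI collectives are built from point-to-point messages. As a rough illustration of what that means, the sketch below computes the send schedule of a binomial-tree broadcast, a common point-to-point broadcast pattern. It is a plain Python simulation with no MPI calls, the root is assumed to be rank 0, and the function name is hypothetical. The abstract does not say which algorithms the library uses, so this should not be read as the BlueGene/L or MPICH2 implementation.

```python
def binomial_broadcast_schedule(num_ranks: int):
    """Return a list of rounds; each round is a list of (sender, receiver) pairs.

    Rank 0 starts with the data. In every round, each rank that already
    holds the data sends it to one rank that does not, so the number of
    holders doubles until all ranks are covered.
    """
    rounds = []
    holders = 1  # ranks 0 .. holders-1 currently hold the data
    while holders < num_ranks:
        sends = [
            (src, src + holders)
            for src in range(holders)
            if src + holders < num_ranks
        ]
        rounds.append(sends)
        holders *= 2
    return rounds


if __name__ == "__main__":
    for round_index, sends in enumerate(binomial_broadcast_schedule(8)):
        print(f"round {round_index}: {sends}")
```

Each round here is a set of independent point-to-point sends, so the broadcast takes about log2(P) message steps for P ranks regardless of any collective hardware the machine may offer.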
Presentation 4 (2007/10/23)
Title: LogTM: Log-Based Transactional Memory
Authors: Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill & David A. Wood
Contact Information:
Abstract:
Transactional memory (TM) simplifies parallel programming by guaranteeing that transactions appear to execute atomically and in isolation. Implementing these properties includes providing data version management for the simultaneous storage of both new (visible if the transaction commits) and old (retained if the transaction aborts) values. Most (hardware) TM systems leave old values "in place" (the target memory address) and buffer new values elsewhere until commit. This makes aborts fast, but penalizes (the much more frequent) commits. In this paper, we present a new implementation of transactional memory, log-based transactional memory (LogTM), that makes commits fast by storing old values to a per-thread log in cacheable virtual memory and storing new values in place. LogTM makes two additional contributions. First, LogTM extends a MOESI directory protocol to enable both fast conflict detection on evicted blocks and fast commit (using lazy cleanup). Second, LogTM handles aborts in (library) software with little performance penalty. Evaluations running micro- and SPLASH-2 benchmarks on a 32-way multiprocessor support our decision to optimize for commit by showing that only 1-2% of transactions abort.
Presented By: Kushal Datta
Email:
Presentation Date: Oct 23, 2007
Download Paper
Presentation 4
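The version-management idea in the Presentation 4 abstract (new values written in place, old values saved to a per-thread log) can be sketched in a few lines. This is a toy software model only: the class and method names are hypothetical, a Python dict stands in for memory, and conflict detection, the coherence protocol, and everything hardware-related are left out.

```python
_MISSING = object()  # marks an address that had no value before the write


class UndoLogTransaction:
    """Toy model of log-based (eager) version management."""

    def __init__(self, memory: dict):
        self.memory = memory
        self.log = []  # (address, old value) pairs, oldest first

    def write(self, addr, value):
        # Save the old value to the log, then update memory in place.
        self.log.append((addr, self.memory.get(addr, _MISSING)))
        self.memory[addr] = value

    def commit(self):
        # New values are already in place, so commit only discards the log.
        self.log.clear()

    def abort(self):
        # Walk the log newest-first and restore every old value.
        while self.log:
            addr, old = self.log.pop()
            if old is _MISSING:
                self.memory.pop(addr, None)
            else:
                self.memory[addr] = old


if __name__ == "__main__":
    memory = {"x": 1}
    txn = UndoLogTransaction(memory)
    txn.write("x", 2)
    txn.write("y", 9)
    txn.abort()
    print(memory)  # the aborted writes have been rolled back
```

The sketch shows why this design favors commits: commit does no copying at all, while abort does work proportional to the number of logged writes.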
Presentation 5 (2007/10/30)
Title: Transactional Memory: Architectural Support for Lock-Free Data Structures
Authors: Maurice Herlihy and J. Eliot B. Moss
Contact Information: Digital Equipment Corporation & Dept. of Computer Science
Abstract:
A shared data structure is lock-free if its operations do not
require mutual exclusion. If one process is interrupted in
the middle of an operation, other processes will not be
prevented from operating on that object. In highly concurrent
systems, lock-free data structures avoid common
problems associated with conventional locking techniques,
including priority inversion, convoying, and difficulty of
avoiding deadlock. This paper introduces transactional
memory, a new multiprocessor architecture intended to
make lock-free synchronization as efficient (and easy to
use) as conventional techniques based on mutual exclusion.
Transactional memory allows programmers to define
customized read-modify-write operations that apply
to multiple, independently-chosen words of memory. It
is implemented by straightforward extensions to any multiprocessor
cache-coherence protocol. Simulation results
show that transactional memory matches or outperforms
the best known locking techniques for simple bench...
Presented By: Vikram Karwal
Email: vkarwal@uncc.edu
Presentation Date: Oct 30, 2007
Download Paper
Citeseer - BibTex
Presentation 5
Presentation 6 (2007/11/12)
Title: Configurable Transactional Memory
Authors: Kachris, Christoforos and Kulkarni, Chidamber
Contact Information: Delft University of Technology, The Netherlands; Xilinx Inc., USA;
Abstract:
Programming efficiency of heterogeneous concurrent systems is limited by the use of lock-based synchronization mechanisms. Transactional memories can greatly improve the programming efficiency of such systems. In field-programmable computing machines, a conventional fixed transactional memory becomes an inefficient use of the silicon. We propose configurable transactional memory (CTM) as a mechanism to implement application-specific synchronization that utilizes the field-programmability of such devices to match the requirements of an application. The proposed configurable transactional memory is targeted at embedded applications and is area efficient compared to conventional schemes that are implemented with cache-coherent protocols. In particular, the CTM is designed to be incorporated into the compilation and synthesis paths of high-level languages or into the system creation process using tools such as Xilinx EDK. We study the impact of deploying a CTM in a packet metering and statistics application and two micro-benchmarks as compared to a lock-based synchronization scheme. We have implemented this application in a Xilinx Virtex4 device and found that the CTM was 0-73% better than a fine-grained lock-based scheme.
Presented By: Andy Schmidt
Email: andrewgschmidt@gmail.com
Presentation Date: Nov 12, 2007
Download Paper
Presentation 6
Presentation 7 (2007/11/19)
Title: Implications of Application Usage Characteristics for Collective Communication Offload
Authors: Ron Brightwell, Sue P. Goudy, Arun Rodrigues, Keith D. Underwood
Contact Information: Sandia National Laboratories
Abstract:
The global, synchronous nature of some collective operations implies that they will become the bottleneck when scaling to hundreds of thousands of nodes. One approach improves collective performance using a programmable network interface to directly implement collectives. While these implementations improve micro-benchmark performance, accelerating applications will require deeper understanding of application behaviour. We describe several characteristics of applications that impact collective communication performance. We analyse network resource usage data to guide the design of collective offload engines and their associated programming interfaces. In particular, we provide an analysis of the potential benefit of non-blocking collective communication operations for MPI.
Title: A Preliminary Analysis of the InfiniPath and XD1 Network Interfaces
Authors: Ron Brightwell, Doug Doerfler, Keith Underwood
Contact Information: Sandia National Laboratories
Abstract:
Two recently delivered systems have begun a new trend in cluster interconnects. Both the InfiniPath network from PathScale, Inc., and the RapidArray fabric in the XD1 system from Cray, Inc., leverage commodity network fabrics while customizing the network interface in an attempt to add value specifically for the high performance computing (HPC) cluster market. Both network interfaces are compatible with standard InfiniBand (IB) switches, but neither uses the traditional programming interfaces to support MPI. Another fundamental difference between these networks and other modern network adapters is that much of the processing needed for the network protocol stack is performed on the host processor(s) rather than by the network interface itself. This approach stands in stark contrast to the current direction of most high-performance networking activities, which is to offload as much protocol processing as possible to the network interface. In this paper, we provide an initial performance comparison of the two partially custom networks (PathScale's InfiniPath and Cray's XD1) with a more commodity network (standard IB) and a more custom network (Quadrics Elan4). Our evaluation includes several micro-benchmark results as well as some initial application performance data.
Presented By: Shanyuan Gao
Presentation Date: Nov 19, 2007
Download Paper 1
Download Paper 2
Presentation 7
