Spring Papers 2009




Presentation 1 (2009/02/17)

Title: FastForward for Efficient Pipeline Parallelism - A Cache-Optimized Concurrent Lock-Free Queue
Authors: John Giacomoni, Tipp Moseley and Manish Vachharajani
Contact Information: University of Colorado at Boulder
Abstract: Low overhead core-to-core communication is critical for efficient pipeline-parallel software applications. This paper presents FastForward, a cache-optimized single-producer/single-consumer concurrent lock-free queue for pipeline parallelism on multicore architectures, with weak to strongly ordered consistency models. Enqueue and dequeue times on a 2.66 GHz Opteron 2218 based system are as low as 28.5 ns, up to 5x faster than the next best solution. FastForward’s effectiveness is demonstrated for real applications by applying it to line-rate soft network processing on Gigabit Ethernet with general purpose commodity hardware.
Presented By: Siddhartha Datta
Email: sidd3000@yahoo.com
Presentation Date: 02/17/2009

Link to download paper
Presentation X
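
As a rough illustration of the kind of queue the FastForward abstract describes, the sketch below shows a single-producer/single-consumer ring buffer in which an empty slot is marked by None, so the producer and the consumer each keep a private index and never read the other's. This is a teaching sketch under stated assumptions, not the authors' code: the class name SlotQueue and its methods are hypothetical, and Python cannot reproduce the cache-line layout or memory-ordering behavior that the paper is concerned with.

```python
from typing import Any, List, Optional


class SlotQueue:
    """Hypothetical SPSC ring buffer; None marks an empty slot."""

    def __init__(self, capacity: int) -> None:
        self.slots: List[Optional[Any]] = [None] * capacity
        self.head = 0  # next slot to write; used only by the producer
        self.tail = 0  # next slot to read; used only by the consumer

    def enqueue(self, item: Any) -> bool:
        """Producer side: return False if the queue is full."""
        if item is None:
            raise ValueError("None is reserved as the empty-slot marker")
        if self.slots[self.head] is not None:
            return False  # slot not yet drained by the consumer
        self.slots[self.head] = item
        self.head = (self.head + 1) % len(self.slots)
        return True

    def dequeue(self) -> Optional[Any]:
        """Consumer side: return None if the queue is empty."""
        item = self.slots[self.tail]
        if item is None:
            return None  # nothing published in this slot yet
        self.slots[self.tail] = None  # hand the slot back to the producer
        self.tail = (self.tail + 1) % len(self.slots)
        return item
```

Because full and empty are both detected from the slot contents, neither side needs to read the other side's index; a real implementation would still need the appropriate memory fences or atomics for the target architecture.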

Presentation 2 (2009/02/24)

Title: An adaptive and fault tolerant wormhole routing strategy for k-ary n-cubes
Authors: Linder, D.H. and Harden, J.C.
Contact Information: Dept. of Electr. Eng., Mississippi State Univ., MS
Abstract: The concept of virtual channels is extended to multiple virtual communication systems that provide adaptability and fault tolerance in addition to being deadlock-free. A channel dependency graph is taken as the definition of what connections are possible, and any routing function must use only those connections defined by it. Virtual interconnection networks allowing adaptive, deadlock-free routing are examined for three k-ary n-cube topologies: unidirectional, torus-connected bidirectional, and mesh-connected bidirectional.
Presented By: Will
Email: will.kritikos@gmail.com
Presentation Date: Feb. 24, 2009

Link to Download Paper
Presentation

Presentation 3 (2009/03/03)

Title: Online architectures: A theoretical formulation and experimental prototype
Authors: Ron Sass, Brian Greskamp, Brian Leonard, Jeff Young, and Srinivas Beeravolu
Contact Information: Dept. of Elec. and Comp. Engineering, University of North Carolina at Charlotte
Abstract: This article describes a class of reconfigurable computing system called online architectures. These architectures use an online algorithm to make run-time reconfiguration decisions that continually adapt the underlying architecture to match the application’s current computational demand. Online architectures have several potential advantages, including better resource utilization (reduced cost), faster execution, and reduced (static) power consumption. However, to realize these benefits, online architectures must balance the overhead (reconfiguration, profiling, and decision costs) against expected gains of reconfiguration. In this article, the basic foundation of online architecture is formulated, core challenges enumerated, and results reported based on a simple prototype and trace-driven simulations. These results suggest that the overhead is manageable and that a more comprehensive investigation is worthwhile.
Presented By: Yamuna Rajasekhar
Email: yrajasek@uncc.edu
Presentation Date: March 3rd, 2009

Link to Download Paper

Presentation 4 (2009/03/17)

Title: Implementation of NAMD molecular dynamics non-bonded force-field on the Cell Broadband Engine processor
Authors: Guochun Shi, Volodymyr Kindratenko
Contact Information: Nat. Center for Supercomput. Applic., Univ. of Illinois at Urbana-Champaign, Urbana-Champaign, IL
Abstract: We present results of porting an important kernel of a production molecular dynamics simulation program, NAMD, to the Cell/B.E. processor. The non-bonded force-field kernel, as implemented in the NAMD SPEC 2006 CPU benchmark, has been implemented. Both single-precision and double-precision floating-point kernel variations are considered, and performance results obtained on the Cell/B.E., as well as several other platforms, are reported. Our results obtained on a 3.2 GHz Cell/B.E. blade show linear speedups when using multiple synergistic processing elements.
Presented By: Robin Pottathuparambil
Email: rpottath@uncc.edu
Presentation Date: March 17th, 2009

Link to Download Paper
Presentation 4

Presentation 5 (2009/03/24)

Title: Should Disks be Speed Demons or Brainiacs?
Author: Sudhanva Gurumurthi
Contact Information: Department of Computer Science, Univ. of Virginia, Charlottesville.
Abstract: Disk drives play a critical role in the performance of I/O intensive applications. Over the years, disk drive performance has grown as a result of advances in magnetic recording density and faster rotational speeds. In essence, the performance driver in disks has been the data rate. In this paper, we show that data rate is going to be increasingly difficult to optimize, due to power/thermal constraints. We argue that disk drive designers should instead focus their efforts on providing more computational capabilities that data intensive applications could leverage in order to boost performance. We also discuss the scope for provisioning powerful processors inside disk drives to provide these computational capabilities.
Presented By: Ashwin Mendon
Email: aamendon@uncc.edu
Presentation Date: March 24th, 2009

Link to Download Paper

Presentation 6 (2009/03/31)

Title: Characterizing Application Sensitivity to OS Interference Using Kernel-Level Noise Injection
Abstract: Operating system noise has been shown to be a key limiter of application scalability in high-end systems. In this paper, we examine the sensitivity of real-world, large-scale applications to a range of OS noise patterns using a kernel-based noise injection mechanism implemented in the Catamount lightweight kernel. Our results demonstrate the importance of how noise is generated, in terms of frequency and duration, and how this impact changes with application scale. We also discuss how the characteristics of the applications we studied, for example computation/communication ratios, collective communication sizes, and other characteristics, relate to their tendency to amplify or absorb noise. Finally, we discuss the implications of our findings on the design of new operating systems, middleware, and other system services for high-end parallel systems.
Presented By: Bin Huang
From: Proceedings of SC2008
Presentation Date: Mar. 31, 2009
Link to Download Paper
Link to Bin's PPT

Presentation 7 (2009/04/14)

Title: Interprocedural Optimization for Dynamic Hardware Configurations
Authors: Elena Moscu Panainte, Koen Bertels and Stamatis Vassiliadis
Contact Information: TU Delft
Abstract: Little research in compiler optimizations has been undertaken to eliminate or diminish the negative influence on performance of the huge reconfiguration latency of the available FPGA platforms. In this paper, we propose an interprocedural optimization that minimizes the number of executed hardware configuration instructions, taking into account constraints such as the "FPGA-area placement conflicts" between the available hardware configurations. The proposed algorithm allows the anticipation of hardware configuration instructions up to the application's main procedure. The presented results show that our optimization produces a reduction of 3 to 5 orders of magnitude in the number of executed hardware configuration instructions.
Presented By: Shweta Jain
Presentation Date: Apr. 14, 2009
Link to Download Paper

Presentation 8 (2009/04/21)

Title: Performance Analysis of MPI Collective Operations
Authors: Jelena Pješivac-Grbović, Thara Angskun, George Bosilca, Graham E. Fagg, Edgar Gabriel, Jack J. Dongarra
Abstract: Previous studies of application usage show that the performance of collective communications is critical for high-performance computing. Despite active research in the field, a general and feasible solution to the problem of optimizing collective communication is still missing.

In this paper, we analyze and attempt to improve intra-cluster collective communication in the context of the widely deployed MPI programming paradigm by extending accepted models of point-to-point communication, such as Hockney, LogP/LogGP, and PLogP, to collective operations. We compare the predictions from the models against the experimentally gathered data and, using these results, construct an optimal decision function for the broadcast collective. We quantitatively compare the quality of the model-based decision functions to the experimentally optimal one. Additionally, in this work, we introduce a new form of an optimized tree-based broadcast algorithm, splitted-binary.

Our results show that all of the models can provide useful insights into various aspects of the different algorithms as well as their relative performance. Still, based on our findings, we believe that the complete reliance on models would not yield optimal results. In addition, our experimental results have identified the gap parameter as being the most critical for accurate modeling of both the classical point-to-point-based pipeline and our extensions to fan-out topologies.
Presented By: Shanyuan Gao
Presentation Date: Apr. 21, 2009
Link to Download Paper
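
To make the modeling idea in the abstract above more concrete, here is a minimal sketch that uses the Hockney point-to-point model (time = alpha + beta * message size) to estimate broadcast cost for two textbook algorithms and pick the cheaper one. The cost formulas, function names, and parameter values are assumptions chosen for illustration; they are not taken from the paper, and the paper's splitted-binary algorithm is not modeled here.

```python
def hockney(m_bytes: int, alpha: float, beta: float) -> float:
    """Hockney model: latency alpha (s) plus beta (s/byte) per byte sent."""
    return alpha + beta * m_bytes


def binomial_bcast(p: int, m_bytes: int, alpha: float, beta: float) -> float:
    """Binomial tree: ceil(log2(p)) rounds, each sending the full message."""
    rounds = (p - 1).bit_length()  # integer ceil(log2(p)) for p >= 1
    return rounds * hockney(m_bytes, alpha, beta)


def pipeline_bcast(p: int, m_bytes: int, alpha: float, beta: float,
                   seg_bytes: int) -> float:
    """Chain of p processes forwarding the message in segments.

    Assumes m_bytes >= 1 and treats every segment as full size,
    so this is a slight overestimate when the last segment is short.
    """
    seg = min(m_bytes, seg_bytes)
    n_seg = -(-m_bytes // seg)  # ceiling division
    return (p + n_seg - 2) * hockney(seg, alpha, beta)


def choose_bcast(p: int, m_bytes: int, alpha: float, beta: float,
                 seg_bytes: int) -> str:
    """Model-based decision function: name of the cheaper algorithm."""
    costs = {
        "binomial": binomial_bcast(p, m_bytes, alpha, beta),
        "pipeline": pipeline_bcast(p, m_bytes, alpha, beta, seg_bytes),
    }
    return min(costs, key=costs.get)
```

With illustrative values such as alpha = 1e-5 s, beta = 1e-9 s/byte, 16 processes and 8 KiB segments, this toy model selects the binomial tree for a 1 KiB message and the pipeline for a 16 MiB message. In practice such predictions would need to be checked against measurements.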
