Deep Neural Networks (DNNs) have emerged as an important class of machine learning algorithms, providing accurate solutions to a broad range of applications. Sparsity in activation maps in DNN training presents an opportunity to reduce computations. However, exploiting activation sparsity presents two major challenges: i) profiling activation sparsity during training comes with significant overhead due to computing the degree of sparsity and data movement; ii) the dynamic nature of activation maps requires dynamic dense-to-sparse conversion during training, leading to significant overhead. In this paper, we present Spartan, a lightweight hardware/software framework to accelerate DNN training on a GPU. Spartan provides a cost-effective and programmer-transparent microarchitectural solution to exploit activation sparsity detected during training. Spartan provides an efficient sparsity monitor, a tile-based sparse GEMM algorithm, and a novel compaction engine designed for GPU workloads. Spartan can reduce sparsity profiling overhead by 52.5X on average. For the most compute-intensive layers, i.e., convolutional layers, we can speed up AlexNet by 3.4X, VGGNet-16 by 2.14X, and ResNet-18 by 2.02X when training on the ImageNet dataset.
Shi Dong; Yifan Sun; Nicolas Bohm Agostini; Elmira Karimi; Daniel Lowell; Jing Zhou; Jose Cano; Jose L. Abellan; David R. Kaeli. Spartan: A Sparsity-Adaptive Framework to Accelerate Deep Neural Network Training on GPUs. IEEE Transactions on Parallel and Distributed Systems 2021, PP (99), 1-1.
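The idea of profiling activation sparsity and skipping zero work in a tile-based GEMM, as described in the Spartan abstract above, can be illustrated with a small software sketch. This is not Spartan's hardware mechanism or its actual algorithm: the function names, the tile size, and the choice to skip only tiles that are entirely zero are assumptions made for illustration, and plain Python lists stand in for GPU tensors.

```python
def sparsity(a):
    """Fraction of zero elements in matrix `a` (a list of equal-length rows)."""
    total = len(a) * len(a[0])
    zeros = sum(1 for row in a for v in row if v == 0)
    return zeros / total


def tile_is_zero(a, r0, c0, tile):
    """True if every element of the tile of `a` starting at (r0, c0) is zero."""
    rows = range(r0, min(r0 + tile, len(a)))
    cols = range(c0, min(c0 + tile, len(a[0])))
    return all(a[i][j] == 0 for i in rows for j in cols)


def tiled_sparse_matmul(a, b, tile=2):
    """Compute a @ b while skipping all-zero tiles of the activation matrix `a`.

    Returns (product, skipped_tiles, total_tiles).
    Hypothetical illustration only; `tile` is an arbitrary choice.
    """
    m, k, n = len(a), len(a[0]), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    skipped = total = 0
    for r0 in range(0, m, tile):
        for k0 in range(0, k, tile):
            total += 1
            if tile_is_zero(a, r0, k0, tile):
                skipped += 1  # no multiply-accumulate work for this tile
                continue
            for i in range(r0, min(r0 + tile, m)):
                for kk in range(k0, min(k0 + tile, k)):
                    for j in range(n):
                        c[i][j] += a[i][kk] * b[kk][j]
    return c, skipped, total


def dense_matmul(a, b):
    """Reference dense product used to check the tiled version."""
    return [[sum(a[i][kk] * b[kk][j] for kk in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


# Made-up activation map (for example, after a ReLU) and weight matrix.
acts = [
    [0, 0, 1, 2],
    [0, 0, 0, 3],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
weights = [[1, 2], [3, 4], [5, 6], [7, 8]]

product, skipped, total = tiled_sparse_matmul(acts, weights, tile=2)
print(f"sparsity={sparsity(acts):.4f}, skipped {skipped} of {total} tiles")
```

In this toy input three of the four activation tiles are all zero, so only one tile contributes multiply-accumulate work. Note that the zero check itself scans the data, which is the kind of profiling cost the abstract identifies as a challenge.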
Recently there has been a rapidly growing demand for faster machine learning (ML) processing in data centers and migration of ML inference applications to edge devices. These developments have prompted both industry and academia to explore custom accelerators to optimize ML executions for performance and power. However, identifying which accelerator is best equipped for performing a particular ML task is challenging, especially given the growing range of ML tasks, the number of target environments, and the limited number of integrated modeling tools. To tackle this issue, it is of paramount importance to provide the computer architecture research community with a common framework capable of performing a comprehensive, uniform, and fair comparison across different accelerator designs targeting a particular ML task. To this aim, we propose a new framework named TFLITE-SOC (System On Chip) that integrates a lightweight system modeling library (SystemC) for fast design space exploration of custom ML accelerators into the build/execution environment of Tensorflow Lite (TFLite), a highly popular ML framework for ML inference. Using this approach, we are able to model and evaluate new accelerators developed in SystemC by leveraging the language's hierarchical design capabilities, resulting in faster design prototyping. Furthermore, any accelerator designed using TFLITE-SOC can be benchmarked for inference with any DNN model compatible with TFLite, which enables end-to-end DNN processing and detailed (i.e., per DNN layer) performance analysis. In addition to providing rapid prototyping, integrated benchmarking, and a range of platform configurations, TFLITE-SOC offers comprehensive performance analysis of accelerator occupancy and execution time breakdown as well as a rich set of modules that can be used by new accelerators to implement scaling up studies and optimized memory transfer protocols. 
We present our framework and demonstrate its utility by considering the design space of a TPU-like systolic array and describing possible directions for optimization. Using a compression technique, we implement an optimization that reduces the memory traffic between DRAM and on-device buffers. Compared to the baseline accelerator, our optimized design shows up to 1.26x speedup on accelerated operations and up to 1.19x speedup on end-to-end DNN execution.
Nicolas Bohm Agostini; Shi Dong; Elmira Karimi; Marti Torrents Lapuerta; Jose Cano; Jose L. Abellan; David Kaeli. Design Space Exploration of Accelerators and End-to-End DNN Evaluation with TFLITE-SOC. 2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) 2020, 10-19.
Photonic network-on-chip (PNoC) is a promising candidate to replace traditional electrical NoC in manycore systems that require substantial bandwidths. The photonic links in the PNoC comprise laser sources, optical ring resonators, passive waveguides, and photodetectors. Reliable link operation requires laser sources and ring resonators to have matching optical frequencies. However, inherent thermal sensitivity of photonic devices and manufacturing process variations can lead to a frequency mismatch. To avoid this mismatch, micro-heaters are used for thermal trimming and tuning, which can dissipate a significant amount of power. This paper proposes a novel FreqAlign workload allocation policy, accompanying an adaptive frequency tuning (AFT) policy, that is capable of reducing thermal tuning power of PNoC. FreqAlign uses thread allocation and thread migration to control temperature for matching the optical frequencies of ring resonators in each photonic link. The AFT policy reduces the remaining optical frequency difference among ring resonators and corresponding on-chip laser sources by hardware tuning methods. We use a full modeling stack of a PNoC that includes a performance simulator, a power simulator, and a thermal simulator with a temperature-dependent laser source power model to design and evaluate our proposed policies. Our experimental results demonstrate that FreqAlign reduces the resonant frequency gradient between ring resonators by 50%-60% when compared to existing workload allocation policies. Coupled with AFT, FreqAlign reduces localized thermal tuning power by 19.28 W on average, and is capable of saving up to 34.57 W when running realistic loads in a 256-core system without any performance degradation.
Jose L. Abellan; Ayse K. Coskun; Jose Luis Abellan Miguel; Warren Jin; Ajay Joshi; Andrew B. Kahng; Jonathan Klamkin; Cristian Morales; John Recchio; Vaishnav Srinivas; Tiansheng Zhang. Adaptive Tuning of Photonic Devices in a Photonic NoC Through Dynamic Workload Allocation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 2016, 36 (5), 801-814.
Transactional memory (TM) is a compelling alternative to simplify multithreaded programming that traditionally relies on error-prone lock-based synchronization for implementing cooperative tasks. Lazy-Lazy hardware TM is one of the most efficient schemes in today’s hardware TM systems. Nonetheless, the commit protocol in these systems has a severe impact on performance and energy. The SEQ in Scalable-TCC implementation (STCC-SEQ) is the most popular and efficient commit protocol to date. In this paper, we propose GCommit, a cost-effective hardware-based STCC-SEQ protocol. GCommit employs a G-Arbiter microarchitecture to achieve minimal-latency and highly efficient commits. We implement G-Arbiter with a standard 45 nm cell library. For a target 16-core CMP, a G-Arbiter represents just 0.07% of the whole on-chip area and requires marginal energy consumption. Full-system simulations of the target system with the STAMP benchmarks show that GCommit achieves average reductions of 15.7% and 13.7% in execution time and energy, respectively, when compared with STCC-SEQ.
Epifanio Gaona; José L. Abellán; Manuel E. Acacio. Fast and efficient commits for Lazy-Lazy hardware transactional memory. The Journal of Supercomputing 2015, 71 (12), 4305-4326.
In manycore systems, the silicon-photonic link technology is projected to replace electrical link technology for global communication in network-on-chip (NoC) as it can provide as much as an order of magnitude higher bandwidth density and lower data-dependent power. However, a large amount of fixed power is dissipated in the laser sources required to drive these silicon-photonic links, which negates any bandwidth density advantages. This large laser power dissipation depends on the number of on-chip silicon-photonic links, the bandwidth of each link, and the photonic losses along each link. In this paper, we propose to reduce the laser power dissipation at runtime by dynamically activating/deactivating L2 cache banks and switching ON/OFF the corresponding silicon-photonic links in the NoC. This method effectively throttles the total on-chip NoC bandwidth at runtime according to the memory access features of the applications running on the manycore system. Full-system simulations using the Princeton Application Repository for Shared-Memory Computers (PARSEC) and Stanford Parallel Applications for Shared-Memory-2 (SPLASH-2) parallel benchmarks reveal that our proposed technique achieves on average 23.8% (peak value 74.3%) savings in laser power and 9.2% (peak value 26.9%) lower energy-delay product for the whole system, at the cost of a 0.65% loss (peak value 2.6%) in instructions per cycle on average, when compared to the case where all L2 cache banks are always active.
Chao Chen; Jose L. Abellan; Ajay Joshi. Managing Laser Power in Silicon-Photonic NoC Through Cache and NoC Reconfiguration. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 2015, 34 (6), 972-985.
Virtual Screening (VS) methods can considerably aid clinical research by predicting how ligands interact with drug targets. Most VS methods assume a unique binding site for the target, usually derived from the interpretation of the protein crystal structure. However, it has been demonstrated that in many cases diverse ligands interact with unrelated parts of the target, and many VS methods do not take this relevant fact into account. Accounting for it, however, increases the computational complexity exponentially. In this work we enhance the parallelization of the non-bonded interactions kernel for VS methods on Nvidia GPU architectures. We show several parallelization strategies that lead to a speedup factor of 15x compared to previous GPU implementations.
Baldomero Imbernón; Antonio Llanes; Jorge Peña-García; José L. Abellán; Horacio Pérez-Sánchez; José M. Cecilia. Enhancing the Parallelization of Non-bonded Interactions Kernel for Virtual Screening on GPUs. Transactions on Petri Nets and Other Models of Concurrency XV 2015, 9044, 620-626.
In the search for new paradigms to simplify multithreaded programming, Transactional Memory (TM) is currently being advocated as a promising alternative to lock-based synchronization. Among the two most important alternatives proposed for conflict detection and data versioning in today’s Hardware Transactional Memory systems (HTMs), the Lazy-Lazy one allows increased concurrency, potentially bringing higher performance levels in most cases. Unfortunately, the implementation of the commit protocol in Lazy-Lazy systems results in increased complexity and has a severe impact on performance and energy consumption. In this work, we propose GCommit, an efficient and low-cost hardware implementation of the SEQ commit protocol based on the use of hardware locks. Specifically, GCommit deploys hardware locks to ensure exclusive access to shared data at commit time. Implementing this functionality using dedicated hardware brings important benefits in terms of execution time as well as energy consumption with respect to traditional commit protocols that use the general-purpose interconnection network. Additionally, our proposal has negligible requirements in terms of area. Results for a 16-core CMP show that the GCommit protocol obtains average reductions of 15.7% and 13.7% in terms of execution time and energy consumption, respectively, compared with a traditional implementation of Scalable TCC with SEQ, a high-performance commit protocol proposed in the literature.
Epifanio Gaona; José L. Abellán; Manuel E. Acacio; Juan Fernández. Deploying Hardware Locks to Improve Performance and Energy Efficiency of Hardware Transactional Memory. Computer Vision 2013, 7767, 220-231.
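The role of locks at commit time in a Lazy-Lazy scheme, as described in the GCommit abstract above, can be sketched with a small software analogy. This is not the GCommit hardware or the STCC-SEQ protocol itself. It assumes a memory split into banks with one lock each, buffers writes until commit (lazy versioning), checks conflicts only at commit (lazy conflict detection), and acquires the bank locks in ascending order to avoid deadlock. The classes `Memory` and `Transaction` are hypothetical, and the self-validation step is a simplification: in TCC-style hardware the committing transaction typically causes conflicting transactions to abort instead.

```python
import threading


class Memory:
    """Shared memory split into banks, each guarded by a commit-time lock."""

    def __init__(self, num_banks=4):
        self.num_banks = num_banks
        self.data = {}      # addr -> committed value
        self.version = {}   # addr -> number of commits that wrote addr
        self.bank_locks = [threading.Lock() for _ in range(num_banks)]

    def bank_of(self, addr):
        return addr % self.num_banks


class Transaction:
    def __init__(self, mem):
        self.mem = mem
        self.read_versions = {}  # addr -> version observed at first read
        self.write_buffer = {}   # addr -> speculative value (lazy versioning)

    def read(self, addr):
        if addr in self.write_buffer:
            return self.write_buffer[addr]
        # Record the version before reading the value, so a concurrent
        # commit in between is caught by validation at commit time.
        self.read_versions.setdefault(addr, self.mem.version.get(addr, 0))
        return self.mem.data.get(addr, 0)

    def write(self, addr, value):
        self.write_buffer[addr] = value  # nothing is visible until commit

    def commit(self):
        """Return True if committed, False if a conflict forces an abort."""
        touched = list(self.read_versions) + list(self.write_buffer)
        banks = sorted({self.mem.bank_of(a) for a in touched})
        for b in banks:  # ascending order avoids deadlock
            self.mem.bank_locks[b].acquire()
        try:
            # Lazy conflict detection: abort if anything we read has changed.
            for addr, seen in self.read_versions.items():
                if self.mem.version.get(addr, 0) != seen:
                    return False
            for addr, value in self.write_buffer.items():
                self.mem.data[addr] = value
                self.mem.version[addr] = self.mem.version.get(addr, 0) + 1
            return True
        finally:
            for b in reversed(banks):
                self.mem.bank_locks[b].release()


def atomic_increment(mem, addr):
    """Retry the transaction until it commits."""
    while True:
        tx = Transaction(mem)
        tx.write(addr, tx.read(addr) + 1)
        if tx.commit():
            return


def worker(mem, addr, times):
    for _ in range(times):
        atomic_increment(mem, addr)


mem = Memory()
threads = [threading.Thread(target=worker, args=(mem, 7, 200)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(mem.data[7])
```

Every commit in this sketch serializes on shared locks, which hints at why the latency of the commit arbitration step matters for the execution time and energy figures the abstract reports.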
José L. Abellán; Alberto Ros; Juan Fernández; Manuel E. Acacio. Efficient Dir0B Cache Coherency for Many-core CMPs. Procedia Computer Science 2013, 18, 2545-2548.
Traditional software-based barrier implementations for shared memory parallel machines tend to produce hotspots in terms of memory and network contention as the number of processors increases. This could limit their applicability to future many-core CMPs in which possibly several dozens of cores would need to be synchronized efficiently. In this work, we develop GBarrier, a hardware-based barrier mechanism especially aimed at providing efficient barriers in future many-core CMPs. Our proposal deploys a dedicated G-line-based network to allow for fast and efficient signaling of barrier arrival and departure. Since GBarrier does not have any influence on the memory system, we avoid all coherence activity and barrier-related network traffic that traditional approaches introduce and that restrict scalability. Through detailed simulations of a 32-core CMP, we compare GBarrier against one of the most efficient software-based barrier implementations for a set of kernels and scientific applications. Evaluation results show average reductions of 54 and 21 percent in execution time, 53 and 18 percent in network traffic, and also 76 and 31 percent in the energy-delay² product metric for the full CMP when the kernels and scientific applications, respectively, are considered.
Jose L. Abellan; Juan Fernandez; Manuel E. Acacio. Efficient Hardware Barrier Synchronization in Many-Core CMPs. IEEE Transactions on Parallel and Distributed Systems 2011, 23 (8), 1453-1466.
Barrier synchronization in shared memory parallel machines has been widely implemented through busy-waiting on shared variables. However, typical implementations of barrier synchronization tend to produce hot-spots in terms of memory and network contention, thus creating performance bottlenecks that become markedly more pronounced as the number of cores or processors increases. To overcome such limitations, we present a novel hardware-based barrier mechanism in the context of many-core CMPs. Our proposal is based on global interconnection lines (G-lines) and the S-CSMA technique, which have been recently used to enhance a flow control mechanism (EVC) in the context of networks-on-chip. Based on this technology, we have designed a simple and scalable G-line-based network that operates independently of the main data network, and that is aimed at carrying out barrier synchronizations efficiently. In the ideal case, our design takes only 4 cycles to perform a barrier synchronization once all cores or threads have arrived at the barrier. As a proof of concept, we examine the benefits of our proposal by comparing it with one of the best software approaches (a binary combining-tree barrier). To do so, we run several kernels and scientific applications on top of the Sim-PowerCMP performance simulator that models a 32-core CMP with a 2D-mesh network configuration. Our proposal entails average reductions in terms of execution time of 68% and 21% for kernels and scientific applications, respectively. Additionally, network traffic is also lowered by 74% and 18%, respectively.
Jose L. Abellan; Juan Fernandez; Manuel E. Acacio. A G-Line-Based Network for Fast and Efficient Barrier Synchronization in Many-Core CMPs. 2010 39th International Conference on Parallel Processing 2010, 267-276.
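For context on the software baseline named in the abstract above, a binary combining-tree barrier can be sketched as follows. This is a generic textbook-style version, not the implementation the authors evaluated: the class names are hypothetical, and a `threading.Condition` replaces busy-waiting on a shared flag so that the example behaves well under Python threads. Threads arrive in pairs at leaf nodes, the last arriver at each node climbs to the parent, and the thread that completes the root releases everyone.

```python
import threading


class Node:
    """One tree node; `expected` arrivals are needed before climbing."""

    def __init__(self, expected):
        self.expected = expected
        self.count = 0
        self.parent = None
        self.lock = threading.Lock()


class CombiningTreeBarrier:
    def __init__(self, num_threads):
        self.generation = 0                  # bumped once per barrier episode
        self.released = threading.Condition()
        self.leaves = self._build(num_threads)

    @staticmethod
    def _build(n):
        # Leaf level: threads 2i and 2i+1 share leaf i (the last leaf may get one).
        level = [Node(min(2, n - 2 * i)) for i in range((n + 1) // 2)]
        leaves = level
        while len(level) > 1:
            parents = [Node(min(2, len(level) - 2 * i))
                       for i in range((len(level) + 1) // 2)]
            for i, node in enumerate(level):
                node.parent = parents[i // 2]
            level = parents
        return leaves

    def wait(self, thread_id):
        with self.released:
            my_generation = self.generation
        node = self.leaves[thread_id // 2]
        while node is not None:
            with node.lock:
                node.count += 1
                last = node.count == node.expected
                if last:
                    node.count = 0           # reset for the next episode
            if not last:
                break                        # someone else will climb from here
            node = node.parent
        else:
            # Climbed past the root: every thread has arrived, release them.
            with self.released:
                self.generation += 1
                self.released.notify_all()
            return
        with self.released:
            while self.generation == my_generation:
                self.released.wait()


def run_demo(num_threads=5, rounds=20):
    """Return a list of (thread, round) pairs that passed the barrier too early."""
    barrier = CombiningTreeBarrier(num_threads)
    progress = [0] * num_threads
    violations = []

    def worker(tid):
        for r in range(1, rounds + 1):
            progress[tid] = r
            barrier.wait(tid)
            if min(progress) < r:            # someone had not reached round r
                violations.append((tid, r))

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return violations


print(run_demo())
```

Each arrival updates a shared counter and each release touches a shared flag. On a real CMP those accesses become coherence and network traffic, which is the overhead the abstract says a dedicated G-line network avoids.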
José L. Abellán; Juan Fernández; Manuel E. Acacio. Efficient and scalable barrier synchronization for many-core CMPs. Proceedings of the 7th ACM international conference on Web search and data mining 2010, 1.
The Cell Broadband Engine (Cell BE) is a heterogeneous chip-multiprocessor (CMP) architecture designed to offer very high performance, especially on game and multimedia applications. The singularity of its architecture, nine cores of two different types, along with the variety of synchronization and communication primitives offered to programmers, makes the task of developing efficient applications very challenging. This situation gets even worse when dual Cell-based blade platforms are considered, where two separate Cells can be linked together through a dedicated high-speed interface. In this work, we present a characterization of the main synchronization and communication primitives provided to programmers in the context of a dual Cell-based blade under varying workloads through our CellStats tool. In particular, we focus on the DMA transfer mechanism, the mailboxes, the signals, the read-modify-write atomic operations, and the time taken by thread creation. Our performance results expose the bottlenecks and asymmetries of these platforms, which programmers must take into account when choosing the most adequate primitives to improve the efficiency of their applications.
José L. Abellán; Juan Fernández; Manuel E. Acacio. Characterizing the basic synchronization and communication operations in Dual Cell-based Blades through CellStats. The Journal of Supercomputing 2009, 53 (2), 247-268.
The Cell Broadband Engine (Cell BE) is a recent heterogeneous chip-multiprocessor (CMP) architecture jointly developed by IBM, Sony and Toshiba to offer very high performance, especially on game and multimedia applications. The significant number of processor cores that it contains (nine in its first generation), along with their heterogeneity (they are of two different types) and the variety of synchronization and communication primitives offered to programmers, makes the task of developing efficient applications for the Cell BE very challenging. In this work, we present CellStats, a tool aimed at characterizing the performance of the main synchronization and communication primitives provided by the Cell BE under varying workloads. In particular, the current implementation of CellStats can evaluate the DMA transfer mechanism, the read-modify-write atomic operations, the mailboxes, the signals and the time taken by thread creation. As an example application of CellStats, we present a characterization of the Cell BE incorporated into the PlayStation 3. From this characterization, we extract some recommendations that can help programmers identify the most appropriate primitive under different assumptions.
Jose L. Abellan; Juan Fernandez; Manuel E. Acacio. CellStats: A Tool to Evaluate the Basic Synchronization and Communication Operations of the Cell BE. 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP 2008) 2008, 261-268.
The Cell Broadband Engine (Cell BE) is a heterogeneous chip-multiprocessor (CMP) architecture designed to offer very high performance, especially on game and multimedia applications. The singularity of its architecture, nine cores of two different types, along with the variety of synchronization and communication primitives offered to programmers, makes the task of developing efficient applications very challenging. This situation gets even worse when we consider dual Cell-based blade architectures, where two separate Cells can be linked together through a dedicated high-speed interface. In this work, we present a characterization of the main synchronization and communication primitives provided by dual Cell-based blades under varying workloads. In particular, we focus on the DMA transfer mechanism, the mailboxes, the signals, the read-modify-write atomic operations, and the time taken by thread creation. Our performance results expose the bottlenecks and asymmetries of these platforms, which programmers must take into account to improve the efficiency of their applications.
José L. Abellán; Juan Fernández; Manuel E. Acacio. Characterizing the Basic Synchronization and Communication Operations in Dual Cell-Based Blades. Computer Vision 2008, 5101, 456-465.