
Dr. Jose L. Abellan
Computer Science and Engineering Department, Universidad Católica de Murcia (UCAM), Murcia 30107, Spain

Research Keywords & Expertise

Computer Architecture
Machine Learning
Interconnection networks
Memory hierarchy
Domain-specific architecture

Fingerprints

Machine Learning
Computer Architecture

Feed

Journal article
Published: 22 March 2021 in IEEE Transactions on Parallel and Distributed Systems

Deep Neural Networks (DNNs) have emerged as an important class of machine learning algorithms, providing accurate solutions to a broad range of applications. Sparsity in activation maps in DNN training presents an opportunity to reduce computations. However, exploiting activation sparsity presents two major challenges: i) profiling activation sparsity during training comes with significant overhead due to computing the degree of sparsity and data movement; ii) the dynamic nature of activation maps requires dynamic dense-to-sparse conversion during training, leading to significant overhead. In this paper, we present Spartan, a lightweight hardware/software framework to accelerate DNN training on a GPU. Spartan provides a cost-effective and programmer-transparent microarchitectural solution to exploit activation sparsity detected during training. Spartan provides an efficient sparsity monitor, a tile-based sparse GEMM algorithm, and a novel compaction engine designed for GPU workloads. Spartan can reduce sparsity profiling overhead by 52.5X on average. For the most compute-intensive layers, i.e., convolutional layers, we can speed up AlexNet by 3.4X, VGGNet-16 by 2.14X, and ResNet-18 by 2.02X, when training on the ImageNet dataset.
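
To make the terms "degree of sparsity" and "tile-based" concrete, here is a minimal software sketch. It is not Spartan's hardware sparsity monitor or its GEMM algorithm; the function names, the 2x2 tile size, and the sample activation values are all hypothetical and chosen only for illustration.

```python
from typing import List

Matrix = List[List[float]]


def relu(m: Matrix) -> Matrix:
    """Apply ReLU element-wise; non-positive activations become exact zeros."""
    return [[x if x > 0.0 else 0.0 for x in row] for row in m]


def sparsity(m: Matrix) -> float:
    """Fraction of entries that are exactly zero."""
    total = sum(len(row) for row in m)
    zeros = sum(1 for row in m for x in row if x == 0.0)
    return zeros / total if total else 0.0


def nonzero_tiles(m: Matrix, tile: int) -> int:
    """Count tile x tile blocks that contain at least one non-zero entry.

    A tile-based sparse GEMM would only need to process these blocks.
    """
    rows, cols = len(m), len(m[0])
    count = 0
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            block = (m[r][c]
                     for r in range(r0, min(r0 + tile, rows))
                     for c in range(c0, min(c0 + tile, cols)))
            if any(x != 0.0 for x in block):
                count += 1
    return count


# Hypothetical 4x4 pre-activation map (illustrative values only).
pre_activation = [
    [-1.0, -2.0, 0.5, 1.5],
    [-0.3, -0.7, 2.0, -1.0],
    [-4.0, -1.0, -2.0, -3.0],
    [-0.1, -0.2, -0.5, 0.25],
]
act = relu(pre_activation)
print(sparsity(act))          # fraction of zeros after ReLU
print(nonzero_tiles(act, 2))  # 2x2 tiles that still need computation
```

In this toy input, whole tiles become zero after ReLU, which is the situation a tile-granular scheme can exploit. Computing `sparsity` requires a full pass over the data, which hints at why profiling in software carries overhead.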

ACS Style

Shi Dong; Yifan Sun; Nicolas Bohm Agostini; Elmira Karimi; Daniel Lowell; Jing Zhou; Jose Cano; Jose L. Abellan; David R. Kaeli. Spartan: A Sparsity-Adaptive Framework to Accelerate Deep Neural Network Training on GPUs. IEEE Transactions on Parallel and Distributed Systems 2021, PP, 1-1.

AMA Style

Shi Dong, Yifan Sun, Nicolas Bohm Agostini, Elmira Karimi, Daniel Lowell, Jing Zhou, Jose Cano, Jose L. Abellan, David R. Kaeli. Spartan: A Sparsity-Adaptive Framework to Accelerate Deep Neural Network Training on GPUs. IEEE Transactions on Parallel and Distributed Systems. 2021; PP (99):1-1.

Chicago/Turabian Style

Shi Dong; Yifan Sun; Nicolas Bohm Agostini; Elmira Karimi; Daniel Lowell; Jing Zhou; Jose Cano; Jose L. Abellan; David R. Kaeli. 2021. "Spartan: A Sparsity-Adaptive Framework to Accelerate Deep Neural Network Training on GPUs." IEEE Transactions on Parallel and Distributed Systems PP, no. 99: 1-1.

Conference paper
Published: 01 September 2020 in 2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)

Recently there has been a rapidly growing demand for faster machine learning (ML) processing in data centers and migration of ML inference applications to edge devices. These developments have prompted both industry and academia to explore custom accelerators to optimize ML executions for performance and power. However, identifying which accelerator is best equipped for performing a particular ML task is challenging, especially given the growing range of ML tasks, the number of target environments, and the limited number of integrated modeling tools. To tackle this issue, it is of paramount importance to provide the computer architecture research community with a common framework capable of performing a comprehensive, uniform, and fair comparison across different accelerator designs targeting a particular ML task. To this aim, we propose a new framework named TFLITE-SOC (System On Chip) that integrates a lightweight system modeling library (SystemC) for fast design space exploration of custom ML accelerators into the build/execution environment of Tensorflow Lite (TFLite), a highly popular ML framework for ML inference. Using this approach, we are able to model and evaluate new accelerators developed in SystemC by leveraging the language's hierarchical design capabilities, resulting in faster design prototyping. Furthermore, any accelerator designed using TFLITE-SOC can be benchmarked for inference with any DNN model compatible with TFLite, which enables end-to-end DNN processing and detailed (i.e., per DNN layer) performance analysis. In addition to providing rapid prototyping, integrated benchmarking, and a range of platform configurations, TFLITE-SOC offers comprehensive performance analysis of accelerator occupancy and execution time breakdown as well as a rich set of modules that can be used by new accelerators to implement scaling up studies and optimized memory transfer protocols. 
We present our framework and demonstrate its utility by considering the design space of a TPU-like systolic array and describing possible directions for optimization. Using a compression technique, we implement an optimization that reduces the memory traffic between DRAM and on-device buffers. Compared to the baseline accelerator, our optimized design shows up to 1.26x speedup on accelerated operations and up to 1.19x speedup on end-to-end DNN execution.

ACS Style

Nicolas Bohm Agostini; Shi Dong; Elmira Karimi; Marti Torrents Lapuerta; Jose Cano; Jose L. Abellan; David Kaeli. Design Space Exploration of Accelerators and End-to-End DNN Evaluation with TFLITE-SOC. 2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) 2020, 10-19.

AMA Style

Nicolas Bohm Agostini, Shi Dong, Elmira Karimi, Marti Torrents Lapuerta, Jose Cano, Jose L. Abellan, David Kaeli. Design Space Exploration of Accelerators and End-to-End DNN Evaluation with TFLITE-SOC. 2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). 2020:10-19.

Chicago/Turabian Style

Nicolas Bohm Agostini; Shi Dong; Elmira Karimi; Marti Torrents Lapuerta; Jose Cano; Jose L. Abellan; David Kaeli. 2020. "Design Space Exploration of Accelerators and End-to-End DNN Evaluation with TFLITE-SOC." 2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD): 10-19.

Journal article
Published: 12 August 2016 in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

Photonic network-on-chip (PNoC) is a promising candidate to replace traditional electrical NoC in manycore systems that require substantial bandwidths. The photonic links in the PNoC comprise laser sources, optical ring resonators, passive waveguides, and photodetectors. Reliable link operation requires laser sources and ring resonators to have matching optical frequencies. However, inherent thermal sensitivity of photonic devices and manufacturing process variations can lead to a frequency mismatch. To avoid this mismatch, micro-heaters are used for thermal trimming and tuning, which can dissipate a significant amount of power. This paper proposes a novel FreqAlign workload allocation policy, accompanying an adaptive frequency tuning (AFT) policy, that is capable of reducing thermal tuning power of PNoC. FreqAlign uses thread allocation and thread migration to control temperature for matching the optical frequencies of ring resonators in each photonic link. The AFT policy reduces the remaining optical frequency difference among ring resonators and corresponding on-chip laser sources by hardware tuning methods. We use a full modeling stack of a PNoC that includes a performance simulator, a power simulator, and a thermal simulator with a temperature-dependent laser source power model to design and evaluate our proposed policies. Our experimental results demonstrate that FreqAlign reduces the resonant frequency gradient between ring resonators by 50%-60% when compared to existing workload allocation policies. Coupled with AFT, FreqAlign reduces localized thermal tuning power by 19.28 W on average, and is capable of saving up to 34.57 W when running realistic loads in a 256-core system without any performance degradation.

ACS Style

Jose L. Abellan; Ayse K. Coskun; Jose Luis Abellan Miguel; Warren Jin; Ajay Joshi; Andrew B. Kahng; Jonathan Klamkin; Cristian Morales; John Recchio; Vaishnav Srinivas; Tiansheng Zhang. Adaptive Tuning of Photonic Devices in a Photonic NoC Through Dynamic Workload Allocation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 2016, 36, 801-814.

AMA Style

Jose L. Abellan, Ayse K. Coskun, Jose Luis Abellan Miguel, Warren Jin, Ajay Joshi, Andrew B. Kahng, Jonathan Klamkin, Cristian Morales, John Recchio, Vaishnav Srinivas, Tiansheng Zhang. Adaptive Tuning of Photonic Devices in a Photonic NoC Through Dynamic Workload Allocation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 2016; 36 (5):801-814.

Chicago/Turabian Style

Jose L. Abellan; Ayse K. Coskun; Jose Luis Abellan Miguel; Warren Jin; Ajay Joshi; Andrew B. Kahng; Jonathan Klamkin; Cristian Morales; John Recchio; Vaishnav Srinivas; Tiansheng Zhang. 2016. "Adaptive Tuning of Photonic Devices in a Photonic NoC Through Dynamic Workload Allocation." IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 36, no. 5: 801-814.

Journal article
Published: 21 September 2015 in The Journal of Supercomputing

Transactional memory (TM) is a compelling alternative to simplify multithreaded programming that traditionally relies on error-prone lock-based synchronization for implementing cooperative tasks. Lazy-Lazy hardware TM is one of the most efficient schemes in today’s hardware TM systems. Nonetheless, the commit protocol in these systems has a severe impact on performance and energy. The SEQ protocol in the Scalable-TCC implementation (STCC-SEQ) is the most popular and efficient commit protocol to date. In this paper, we propose GCommit, a cost-effective hardware-based STCC-SEQ protocol. GCommit employs a G-Arbiter microarchitecture to achieve minimal-latency, highly efficient commits. We implement G-Arbiter with a standard 45 nm cell library. For a target 16-core CMP, a G-Arbiter represents just 0.07% of the whole on-chip area and requires marginal energy consumption. Full-system simulations of the target system with the STAMP benchmarks show that GCommit achieves average reductions of 15.7% and 13.7% in execution time and energy, respectively, when compared with STCC-SEQ.

ACS Style

Epifanio Gaona; José L. Abellán; Manuel E. Acacio. Fast and efficient commits for Lazy-Lazy hardware transactional memory. The Journal of Supercomputing 2015, 71, 4305-4326.

AMA Style

Epifanio Gaona, José L. Abellán, Manuel E. Acacio. Fast and efficient commits for Lazy-Lazy hardware transactional memory. The Journal of Supercomputing. 2015; 71 (12):4305-4326.

Chicago/Turabian Style

Epifanio Gaona; José L. Abellán; Manuel E. Acacio. 2015. "Fast and efficient commits for Lazy-Lazy hardware transactional memory." The Journal of Supercomputing 71, no. 12: 4305-4326.

Journal article
Published: 10 February 2015 in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

In manycore systems, the silicon-photonic link technology is projected to replace electrical link technology for global communication in network-on-chip (NoC) as it can provide as much as an order of magnitude higher bandwidth density and lower data-dependent power. However, a large amount of fixed power is dissipated in the laser sources required to drive these silicon-photonic links, which negates any bandwidth density advantages. This large laser power dissipation depends on the number of on-chip silicon-photonic links, the bandwidth of each link, and the photonic losses along each link. In this paper, we propose to reduce the laser power dissipation at runtime by dynamically activating/deactivating L2 cache banks and switching ON/OFF the corresponding silicon-photonic links in the NoC. This method effectively throttles the total on-chip NoC bandwidth at runtime according to the memory access features of the applications running on the manycore system. Full-system simulations utilizing the Princeton application repository for shared-memory computers and the Stanford parallel applications for shared-memory-2 parallel benchmarks reveal that our proposed technique achieves on average 23.8% (peak value 74.3%) savings in laser power and a 9.2% (peak value 26.9%) lower energy-delay product for the whole system, at the cost of a 0.65% loss (peak value 2.6%) in instructions per cycle on average, when compared to the cases where all L2 cache banks are always active.

ACS Style

Chao Chen; Jose L. Abellan; Ajay Joshi. Managing Laser Power in Silicon-Photonic NoC Through Cache and NoC Reconfiguration. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 2015, 34, 972-985.

AMA Style

Chao Chen, Jose L. Abellan, Ajay Joshi. Managing Laser Power in Silicon-Photonic NoC Through Cache and NoC Reconfiguration. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 2015; 34 (6):972-985.

Chicago/Turabian Style

Chao Chen; Jose L. Abellan; Ajay Joshi. 2015. "Managing Laser Power in Silicon-Photonic NoC Through Cache and NoC Reconfiguration." IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 34, no. 6: 972-985.

Book chapter
Published: 01 January 2015 in Transactions on Petri Nets and Other Models of Concurrency XV

Virtual Screening (VS) methods can considerably aid clinical research, predicting how ligands interact with drug targets. Most VS methods suppose a unique binding site for the target, usually derived from the interpretation of the protein crystal structure. But it has been demonstrated that in many cases, diverse ligands interact with unrelated parts of the target, and many VS methods do not take this relevant fact into account. However, accounting for it increases the computational complexity exponentially. In this work, we enhance the parallelization of the non-bonded interactions kernel for VS methods on Nvidia GPU architectures. We show several parallelization strategies that lead to a speedup factor of 15x compared to previous GPU implementations.

ACS Style

Baldomero Imbernón; Antonio Llanes; Jorge Peña-García; José L. Abellán; Horacio Pérez-Sánchez; José M. Cecilia. Enhancing the Parallelization of Non-bonded Interactions Kernel for Virtual Screening on GPUs. Transactions on Petri Nets and Other Models of Concurrency XV 2015, 9044, 620-626.

AMA Style

Baldomero Imbernón, Antonio Llanes, Jorge Peña-García, José L. Abellán, Horacio Pérez-Sánchez, José M. Cecilia. Enhancing the Parallelization of Non-bonded Interactions Kernel for Virtual Screening on GPUs. Transactions on Petri Nets and Other Models of Concurrency XV. 2015; 9044:620-626.

Chicago/Turabian Style

Baldomero Imbernón; Antonio Llanes; Jorge Peña-García; José L. Abellán; Horacio Pérez-Sánchez; José M. Cecilia. 2015. "Enhancing the Parallelization of Non-bonded Interactions Kernel for Virtual Screening on GPUs." Transactions on Petri Nets and Other Models of Concurrency XV 9044: 620-626.

Conference paper
Published: 01 January 2013 in Computer Vision

In the search for new paradigms to simplify multithreaded programming, Transactional Memory (TM) is currently being advocated as a promising alternative to lock-based synchronization. Among the two most important alternatives proposed for conflict detection and data versioning in today’s Hardware Transactional Memory systems (HTMs), the Lazy-Lazy one allows increased concurrency, potentially bringing higher performance levels in most cases. Unfortunately, the implementation of the commit protocol in Lazy-Lazy systems results in increased complexity and has a severe impact on performance and energy consumption. In this work, we propose GCommit, an efficient and low-cost hardware implementation of the SEQ commit protocol based on the use of hardware locks. Specifically, GCommit deploys hardware locks to ensure exclusive access to shared data at commit time. Implementing this functionality using dedicated hardware brings important benefits in terms of execution time as well as energy consumption with respect to traditional commit protocols that use the general-purpose interconnection network. Additionally, our proposal has negligible requirements in terms of area. Results for a 16-core CMP show that the GCommit protocol obtains average reductions of 15.7% and 13.7% in terms of execution time and energy consumption, respectively, compared with a traditional implementation of Scalable TCC with SEQ, a high-performance commit protocol proposed in the literature.

ACS Style

Epifanio Gaona; José L. Abellán; Manuel E. Acacio; Juan Fernández. Deploying Hardware Locks to Improve Performance and Energy Efficiency of Hardware Transactional Memory. Computer Vision 2013, 7767, 220-231.

AMA Style

Epifanio Gaona, José L. Abellán, Manuel E. Acacio, Juan Fernández. Deploying Hardware Locks to Improve Performance and Energy Efficiency of Hardware Transactional Memory. Computer Vision. 2013; 7767:220-231.

Chicago/Turabian Style

Epifanio Gaona; José L. Abellán; Manuel E. Acacio; Juan Fernández. 2013. "Deploying Hardware Locks to Improve Performance and Energy Efficiency of Hardware Transactional Memory." Computer Vision 7767: 220-231.

Journal article
Published: 01 January 2013 in Procedia Computer Science
ACS Style

José L. Abellán; Alberto Ros; Juan Fernández; Manuel E. Acacio. Efficient Dir0B Cache Coherency for Many-core CMPs. Procedia Computer Science 2013, 18, 2545-2548.

AMA Style

José L. Abellán, Alberto Ros, Juan Fernández, Manuel E. Acacio. Efficient Dir0B Cache Coherency for Many-core CMPs. Procedia Computer Science. 2013; 18:2545-2548.

Chicago/Turabian Style

José L. Abellán; Alberto Ros; Juan Fernández; Manuel E. Acacio. 2013. "Efficient Dir0B Cache Coherency for Many-core CMPs." Procedia Computer Science 18: 2545-2548.

Journal article
Published: 13 December 2011 in IEEE Transactions on Parallel and Distributed Systems

Traditional software-based barrier implementations for shared memory parallel machines tend to produce hotspots in terms of memory and network contention as the number of processors increases. This could limit their applicability to future many-core CMPs in which possibly several dozens of cores would need to be synchronized efficiently. In this work, we develop GBarrier, a hardware-based barrier mechanism especially aimed at providing efficient barriers in future many-core CMPs. Our proposal deploys a dedicated G-line-based network to allow for fast and efficient signaling of barrier arrival and departure. Since GBarrier does not have any influence on the memory system, we avoid all coherence activity and barrier-related network traffic that traditional approaches introduce and that restrict scalability. Through detailed simulations of a 32-core CMP, we compare GBarrier against one of the most efficient software-based barrier implementations for a set of kernels and scientific applications. Evaluation results show average reductions of 54 and 21 percent in execution time, 53 and 18 percent in network traffic, and also 76 and 31 percent in the energy-delay² product metric for the full CMP when the kernels and scientific applications, respectively, are considered.

ACS Style

Jose L. Abellan; Juan Fernandez; Manuel E. Acacio. Efficient Hardware Barrier Synchronization in Many-Core CMPs. IEEE Transactions on Parallel and Distributed Systems 2011, 23, 1453-1466.

AMA Style

Jose L. Abellan, Juan Fernandez, Manuel E. Acacio. Efficient Hardware Barrier Synchronization in Many-Core CMPs. IEEE Transactions on Parallel and Distributed Systems. 2011; 23 (8):1453-1466.

Chicago/Turabian Style

Jose L. Abellan; Juan Fernandez; Manuel E. Acacio. 2011. "Efficient Hardware Barrier Synchronization in Many-Core CMPs." IEEE Transactions on Parallel and Distributed Systems 23, no. 8: 1453-1466.

Conference paper
Published: 01 September 2010 in 2010 39th International Conference on Parallel Processing

Barrier synchronization in shared memory parallel machines has been widely implemented through busy-waiting on shared variables. However, typical implementations of barrier synchronization tend to produce hot-spots in terms of memory and network contention, thus creating performance bottlenecks that become markedly more pronounced as the number of cores or processors increases. To overcome such limitations, we present a novel hardware-based barrier mechanism in the context of many-core CMPs. Our proposal is based on global interconnection lines (G-lines) and the S-CSMA technique, which have been recently used to enhance a flow control mechanism (EVC) in the context of networks-on-chip. Based on this technology, we have designed a simple and scalable G-line-based network that operates independently of the main data network, and that is aimed at carrying out barrier synchronizations efficiently. In the ideal case, our design takes only 4 cycles to perform a barrier synchronization once all cores or threads have arrived at the barrier. As a proof of concept, we examine the benefits of our proposal by comparing it with one of the best software approaches (a binary combining-tree barrier). To do so, we run several kernels and scientific applications on top of the Sim-PowerCMP performance simulator that models a 32-core CMP with a 2D-mesh network configuration. Our proposal entails average reductions in terms of execution time of 68% and 21% for kernels and scientific applications, respectively. Additionally, network traffic is also lowered by 74% and 18%, respectively.
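
For readers unfamiliar with busy-waiting barriers, the following is a minimal sketch of a centralized sense-reversing barrier in which every waiting thread spins on one shared flag. It is a generic textbook baseline, not the binary combining-tree barrier evaluated in the paper and not the G-line hardware mechanism; the names `SpinBarrier` and `run_demo` are hypothetical.

```python
import threading
import time


class SpinBarrier:
    """Centralized sense-reversing barrier that busy-waits on a shared flag."""

    def __init__(self, parties: int) -> None:
        self.parties = parties
        self.count = parties           # threads still expected in this episode
        self.sense = False             # shared flag every waiter spins on
        self.lock = threading.Lock()   # protects count
        self._local = threading.local()

    def wait(self) -> None:
        # Each thread flips its private sense on every barrier episode.
        my_sense = not getattr(self._local, "sense", False)
        self._local.sense = my_sense
        with self.lock:
            self.count -= 1
            last = self.count == 0
            if last:
                self.count = self.parties  # reset for the next episode
                self.sense = my_sense      # release all spinning threads
        if not last:
            while self.sense != my_sense:  # busy-wait on the shared variable
                time.sleep(0)              # yield so other threads can run


def run_demo(num_threads: int = 4, phases: int = 3) -> list:
    """Return the phases in which a thread passed the barrier too early."""
    barrier = SpinBarrier(num_threads)
    arrivals = [0] * phases
    violations = []
    lock = threading.Lock()

    def worker() -> None:
        for p in range(phases):
            with lock:
                arrivals[p] += 1
            barrier.wait()
            # After the barrier, every thread must have arrived at phase p.
            if arrivals[p] != num_threads:
                violations.append(p)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return violations


print(run_demo())  # an empty list means no thread left a barrier early
```

All waiters read the same `sense` variable and all arrivals update the same `count`. On a real shared-memory machine, those accesses concentrate on one memory location, which is the kind of hot-spot the abstract describes.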

ACS Style

Jose L. Abellan; Juan Fernandez; Manuel E. Acacio. A G-Line-Based Network for Fast and Efficient Barrier Synchronization in Many-Core CMPs. 2010 39th International Conference on Parallel Processing 2010, 267-276.

AMA Style

Jose L. Abellan, Juan Fernandez, Manuel E. Acacio. A G-Line-Based Network for Fast and Efficient Barrier Synchronization in Many-Core CMPs. 2010 39th International Conference on Parallel Processing. 2010:267-276.

Chicago/Turabian Style

Jose L. Abellan; Juan Fernandez; Manuel E. Acacio. 2010. "A G-Line-Based Network for Fast and Efficient Barrier Synchronization in Many-Core CMPs." 2010 39th International Conference on Parallel Processing: 267-276.

Conference paper
Published: 01 January 2010 in Proceedings of the 7th ACM international conference on Web search and data mining
ACS Style

José L. Abellán; Juan Fernández; Manuel E. Acacio. Efficient and scalable barrier synchronization for many-core CMPs. Proceedings of the 7th ACM international conference on Web search and data mining 2010, 1.

AMA Style

José L. Abellán, Juan Fernández, Manuel E. Acacio. Efficient and scalable barrier synchronization for many-core CMPs. Proceedings of the 7th ACM international conference on Web search and data mining. 2010:1.

Chicago/Turabian Style

José L. Abellán; Juan Fernández; Manuel E. Acacio. 2010. "Efficient and scalable barrier synchronization for many-core CMPs." Proceedings of the 7th ACM international conference on Web search and data mining: 1.

Journal article
Published: 01 April 2009 in The Journal of Supercomputing

The Cell Broadband Engine (Cell BE) is a heterogeneous chip-multiprocessor (CMP) architecture designed to offer very high performance, especially on game and multimedia applications. The singularity of its architecture, nine cores of two different types, along with the variety of synchronization and communication primitives offered to programmers, makes the task of developing efficient applications very challenging. This situation gets even worse when dual Cell-based blade platforms are considered, where two separate Cells can be linked together through a dedicated high-speed interface. In this work, we present a characterization of the main synchronization and communication primitives provided to programmers in the context of a dual Cell-based blade under varying workloads through our CellStats tool. In particular, we focus on the DMA transfer mechanism, the mailboxes, the signals, the read-modify-write atomic operations, and the time taken by thread creation. Our performance results expose the bottlenecks and asymmetries of these platforms, which programmers must take into account when choosing the most adequate primitives to improve the efficiency of their applications.

ACS Style

José L. Abellán; Juan Fernández; Manuel E. Acacio. Characterizing the basic synchronization and communication operations in Dual Cell-based Blades through CellStats. The Journal of Supercomputing 2009, 53, 247-268.

AMA Style

José L. Abellán, Juan Fernández, Manuel E. Acacio. Characterizing the basic synchronization and communication operations in Dual Cell-based Blades through CellStats. The Journal of Supercomputing. 2009; 53 (2):247-268.

Chicago/Turabian Style

José L. Abellán; Juan Fernández; Manuel E. Acacio. 2009. "Characterizing the basic synchronization and communication operations in Dual Cell-based Blades through CellStats." The Journal of Supercomputing 53, no. 2: 247-268.

Conference paper
Published: 01 February 2008 in 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP 2008)

The Cell Broadband Engine (Cell BE) is a recent heterogeneous chip-multiprocessor (CMP) architecture jointly developed by IBM, Sony and Toshiba to offer very high performance, especially on game and multimedia applications. The significant number of processor cores that it contains (nine in its first generation), along with their heterogeneity (they are of two different types) and the variety of synchronization and communication primitives offered to programmers, makes the task of developing efficient applications for the Cell BE very challenging. In this work, we present CellStats, a tool aimed at characterizing the performance of the main synchronization and communication primitives provided by the Cell BE under varying workloads. In particular, the current implementation of CellStats can evaluate the DMA transfer mechanism, the read-modify-write atomic operations, the mailboxes, the signals, and the time taken by thread creation. As an example application of CellStats, we present a characterization of the Cell BE incorporated into the PlayStation 3. From this characterization, we extract some recommendations that can help programmers identify the most appropriate primitive under different assumptions.

ACS Style

Jose L. Abellan; Juan Fernandez; Manuel E. Acacio. CellStats: A Tool to Evaluate the Basic Synchronization and Communication Operations of the Cell BE. 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP 2008) 2008, 261-268.

AMA Style

Jose L. Abellan, Juan Fernandez, Manuel E. Acacio. CellStats: A Tool to Evaluate the Basic Synchronization and Communication Operations of the Cell BE. 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP 2008). 2008:261-268.

Chicago/Turabian Style

Jose L. Abellan; Juan Fernandez; Manuel E. Acacio. 2008. "CellStats: A Tool to Evaluate the Basic Synchronization and Communication Operations of the Cell BE." 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP 2008): 261-268.

Conference paper
Published: 01 January 2008 in Computer Vision

The Cell Broadband Engine (Cell BE) is a heterogeneous chip-multiprocessor (CMP) architecture designed to offer very high performance, especially on game and multimedia applications. The singularity of its architecture, nine cores of two different types, along with the variety of synchronization and communication primitives offered to programmers, makes the task of developing efficient applications very challenging. This situation gets even worse when we consider Dual Cell-Based Blade architectures, where two separate Cells can be linked together through a dedicated high-speed interface. In this work, we present a characterization of the main synchronization and communication primitives provided by dual Cell-based blades under varying workloads. In particular, we focus on the DMA transfer mechanism, the mailboxes, the signals, the read-modify-write atomic operations, and the time taken by thread creation. Our performance results expose the bottlenecks and asymmetries of these platforms, which programmers must take into account to improve the efficiency of their applications.

ACS Style

José L. Abellán; Juan Fernández; Manuel E. Acacio. Characterizing the Basic Synchronization and Communication Operations in Dual Cell-Based Blades. Computer Vision 2008, 5101, 456-465.

AMA Style

José L. Abellán, Juan Fernández, Manuel E. Acacio. Characterizing the Basic Synchronization and Communication Operations in Dual Cell-Based Blades. Computer Vision. 2008; 5101:456-465.

Chicago/Turabian Style

José L. Abellán; Juan Fernández; Manuel E. Acacio. 2008. "Characterizing the Basic Synchronization and Communication Operations in Dual Cell-Based Blades." Computer Vision 5101: 456-465.