Diogo Pratas
Department of Virology, University of Helsinki, 00014 Helsinki, Finland


Feed

Journal article
Published: 26 April 2021 in Entropy

Recently, the scientific community has witnessed a substantial increase in the generation of protein sequence data, triggering emergent challenges of increasing importance, namely efficient storage and improved data analysis. For both applications, data compression is a straightforward solution. However, in the literature, the number of specific protein sequence compressors is relatively low. Moreover, these specialized compressors marginally improve the compression ratio over the best general-purpose compressors. In this paper, we present AC2, a new lossless data compressor for protein (or amino acid) sequences. AC2 uses a neural network to mix experts with a stacked generalization approach and individual cache-hash memory models to the highest-context orders. Compared to the previous compressor (AC), we show gains of 2–9% and 6–7% in reference-free and reference-based modes, respectively. These gains come at the cost of three times slower computations. AC2 also improves memory usage against AC, with requirements about seven times lower, without being affected by the sequences’ input size. As an analysis application, we use AC2 to measure the similarity between each SARS-CoV-2 protein sequence with each viral protein sequence from the whole UniProt database. The results consistently show higher similarity to the pangolin coronavirus, followed by the bat and human coronaviruses, contributing with critical results to a current controversial subject. AC2 is available for free download under GPLv3 license.

ACS Style

Milton Silva; Diogo Pratas; Armando Pinho. AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models. Entropy 2021, 23, 530 .

AMA Style

Milton Silva, Diogo Pratas, Armando Pinho. AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models. Entropy. 2021; 23 (5):530.

Chicago/Turabian Style

Milton Silva; Diogo Pratas; Armando Pinho. 2021. "AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models." Entropy 23, no. 5: 530.

Journal article
Published: 18 December 2020 in Computers in Biology and Medicine

Privacy issues, often raised by the high dimensionality and sensitivity of the data together with access restrictions and policies, limit the analysis and cross-exploration of most distributed and private biobanks. These characteristics prevent collaboration between entities, constituting a barrier to emergent personalized and public health challenges, namely the discovery of new druggable targets, identification of disease-causing genetic variants, or the study of rare diseases. In this paper, we propose a semi-automatic methodology for the analysis of distributed and private biobanks. The strategies involved in the proposed methodology efficiently enable the creation and execution of unified genomic studies using distributed repositories, without compromising the information present in the datasets. We apply the methodology to a case study in the current COVID-19 pandemic, ensuring the combination of the diagnostics from multiple entities while maintaining privacy through a completely identical procedure. Moreover, we show that the methodology follows a simple, intuitive, and practical scheme.

ACS Style

João Rafael Almeida; Diogo Pratas; José Luís Oliveira. A semi-automatic methodology for analysing distributed and private biobanks. Computers in Biology and Medicine 2020, 130, 104180 .

AMA Style

João Rafael Almeida, Diogo Pratas, José Luís Oliveira. A semi-automatic methodology for analysing distributed and private biobanks. Computers in Biology and Medicine. 2020; 130:104180.

Chicago/Turabian Style

João Rafael Almeida; Diogo Pratas; José Luís Oliveira. 2020. "A semi-automatic methodology for analysing distributed and private biobanks." Computers in Biology and Medicine 130: 104180.

Journal article
Published: 11 November 2020 in GigaScience

Background: The increasing production of genomic data has led to an intensified need for models that can cope efficiently with the lossless compression of DNA sequences. Important applications include long-term storage and compression-based data analysis. In the literature, only a few recent articles propose the use of neural networks for DNA sequence compression. However, they fall short when compared with specific DNA compression tools, such as GeCo2. This limitation is due to the absence of models specifically designed for DNA sequences. In this work, we combine the power of neural networks with specific DNA models. For this purpose, we created GeCo3, a new genomic sequence compressor that uses neural networks for mixing multiple context and substitution-tolerant context models. Findings: We benchmark GeCo3 as a reference-free DNA compressor in 5 datasets, including a balanced and comprehensive dataset of DNA sequences, the Y-chromosome and human mitogenome, 2 compilations of archaeal and virus genomes, 4 whole genomes, and 2 collections of FASTQ data of a human virome and ancient DNA. GeCo3 achieves a solid improvement in compression over the previous version (GeCo2) of 2.4%, 7.1%, 6.1%, 5.8%, and 6.0%, respectively. To test its performance as a reference-based DNA compressor, we benchmark GeCo3 in 4 datasets constituted by the pairwise compression of the chromosomes of the genomes of several primates. GeCo3 improves the compression by 12.4%, 11.7%, 10.8%, and 10.1% over the state of the art. The cost of this compression improvement is some additional computational time (1.7–3 times slower than GeCo2). The RAM use is constant, and the tool scales efficiently, independently of the sequence size. Overall, these values outperform the state of the art. Conclusions: GeCo3 is a genomic sequence compressor with a neural network mixing approach that provides additional gains over top specific genomic compressors. The proposed mixing method is portable, requiring only the probabilities of the models as inputs, providing easy adaptation to other data compressors or compression-based data analysis tools. GeCo3 is released under GPLv3 and is available for free download at https://github.com/cobilab/geco3.
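The idea that a mixer needs only the probabilities of the individual models as inputs can be illustrated with a small sketch. The code below is not GeCo3's network: it is a hypothetical single-layer softmax mixer trained online with gradient descent on the coding cost, and the names `SoftmaxMixer`, `repeat_model`, `uniform_model`, and `mixed_code_length` are assumptions made for illustration only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

class SoftmaxMixer:
    """Hypothetical single-layer mixer; inputs are the stacked model probabilities."""

    def __init__(self, n_models, n_symbols, learning_rate=0.1):
        self.lr = learning_rate
        n_inputs = n_models * n_symbols
        # Zero weights: the mixer starts from a uniform prediction.
        self.weights = [[0.0] * n_inputs for _ in range(n_symbols)]

    def mix(self, model_probs):
        inputs = [p for probs in model_probs for p in probs]
        logits = [sum(w * x for w, x in zip(row, inputs)) for row in self.weights]
        return inputs, softmax(logits)

    def update(self, inputs, mixed, true_index):
        # Gradient of the cross-entropy loss with respect to each weight.
        for j, row in enumerate(self.weights):
            error = mixed[j] - (1.0 if j == true_index else 0.0)
            for i, x in enumerate(inputs):
                row[i] -= self.lr * error * x

def uniform_model(sequence, pos, alphabet="ACGT"):
    """Toy model that knows nothing: uniform over the alphabet."""
    return [1.0 / len(alphabet)] * len(alphabet)

def repeat_model(sequence, pos, alphabet="ACGT", lag=4, confidence=0.97):
    """Toy model that bets the symbol seen `lag` positions earlier repeats."""
    if pos < lag:
        return uniform_model(sequence, pos, alphabet)
    guess = alphabet.index(sequence[pos - lag])
    rest = (1.0 - confidence) / (len(alphabet) - 1)
    return [confidence if i == guess else rest for i in range(len(alphabet))]

def mixed_code_length(sequence, alphabet, models, mixer):
    """Bits an ideal arithmetic coder would spend using the mixed predictions."""
    bits = 0.0
    for pos, symbol in enumerate(sequence):
        inputs, mixed = mixer.mix([m(sequence, pos) for m in models])
        true_index = alphabet.index(symbol)
        bits -= math.log2(mixed[true_index])      # cost is paid before learning
        mixer.update(inputs, mixed, true_index)   # then the mixer adapts
    return bits
```

Because the mixer only sees probability vectors, any model that outputs a distribution over the alphabet could be plugged in without changing the mixer itself, which is the portability property the abstract describes.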

ACS Style

Milton Silva; Diogo Pratas; Armando J Pinho. Efficient DNA sequence compression with neural networks. GigaScience 2020, 9, 1 .

AMA Style

Milton Silva, Diogo Pratas, Armando J Pinho. Efficient DNA sequence compression with neural networks. GigaScience. 2020; 9 (11):1.

Chicago/Turabian Style

Milton Silva; Diogo Pratas; Armando J Pinho. 2020. "Efficient DNA sequence compression with neural networks." GigaScience 9, no. 11: 1.

Journal article
Published: 01 August 2020 in GigaScience

Background: Advances in sequencing technologies have enabled the characterization of multiple microbial and host genomes, opening new frontiers of knowledge while kindling novel applications and research perspectives. Among these is the investigation of the viral communities residing in the human body and their impact on health and disease. To this end, the study of samples from multiple tissues is critical, yet the complexity of such analysis calls for a dedicated pipeline. We provide an automatic and efficient pipeline for identification, assembly, and analysis of viral genomes that combines the DNA sequence data from multiple organs. TRACESPipe relies on cooperation among 3 modalities: compression-based prediction, sequence alignment, and de novo assembly. The pipeline is ultra-fast and provides, additionally, secure transmission and storage of sensitive data. Findings: TRACESPipe performed outstandingly when tested on synthetic and ex vivo datasets, identifying and reconstructing all the viral genomes, including those with high levels of single-nucleotide polymorphisms. It also detected minimal levels of genomic variation between different organs. Conclusions: TRACESPipe’s unique ability to simultaneously process and analyze samples from different sources enables the evaluation of within-host variability. This opens up the possibility to investigate viral tissue tropism, evolution, fitness, and disease associations. Moreover, additional features such as DNA damage estimation and mitochondrial DNA reconstruction and analysis, as well as exogenous-source controls, expand the utility of this pipeline to other fields such as forensics and ancient DNA studies. TRACESPipe is released under GPLv3 and is available for free download at https://github.com/viromelab/tracespipe.

ACS Style

Diogo Pratas; Mari Toppinen; Lari Pyöriä; Klaus Hedman; Antti Sajantila; Maria F Perdomo. A hybrid pipeline for reconstruction and analysis of viral genomes at multi-organ level. GigaScience 2020, 9, 1 .

AMA Style

Diogo Pratas, Mari Toppinen, Lari Pyöriä, Klaus Hedman, Antti Sajantila, Maria F Perdomo. A hybrid pipeline for reconstruction and analysis of viral genomes at multi-organ level. GigaScience. 2020; 9 (8):1.

Chicago/Turabian Style

Diogo Pratas; Mari Toppinen; Lari Pyöriä; Klaus Hedman; Antti Sajantila; Maria F Perdomo. 2020. "A hybrid pipeline for reconstruction and analysis of viral genomes at multi-organ level." GigaScience 9, no. 8: 1.

Journal article
Published: 30 July 2020 in Bioinformatics

Motivation: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused more than 14 million cases and more than half a million deaths. Given the absence of implemented therapies, new analysis, diagnosis and therapeutics are of great importance. Results: Analysis of SARS-CoV-2 genomes from the current outbreak reveals the presence of short persistent DNA/RNA sequences that are absent from the human genome and transcriptome (PmRAWs). For the PmRAWs with length 12, only four exist at the same location in all SARS-CoV-2. At the gene level, we found one PmRAW of size 13 at the Spike glycoprotein coding sequence. This protein is fundamental for binding to human ACE2, which is then used as an entry receptor to invade target cells. Applying protein structural prediction, we localized this PmRAW at the surface of the Spike protein, providing a potential targeted vector for diagnostics and therapeutics. In addition, we show a new pattern of relative absent words (RAWs), characterized by the progressive increase of GC content (Guanine and Cytosine) according to the decrease of RAWs length, contrary to the virus and host genome distributions. New analysis shows the same property during the Ebola virus outbreak. At a computational level, we improved the alignment-free method to identify pathogen-specific signatures in balance with GC measures and removed previous size limitations. Availability and implementation: https://github.com/cobilab/eagle. Supplementary information: Supplementary data are available at Bioinformatics online.
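A simplified reading of relative absent words can be sketched with plain sets. The code below is not the EAGLE implementation: it assumes a RAW is a word of length k that occurs in a target (pathogen) sequence but nowhere in a reference (host) sequence, treats "persistent" as "shared by every target genome", and ignores both genomic position and reverse complements. All function names are hypothetical.

```python
def kmers(sequence, k):
    """Set of all words of length k occurring in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def relative_absent_words(target, reference, k):
    """Words of length k present in the target but absent from the reference."""
    return kmers(target, k) - kmers(reference, k)

def persistent_raws(targets, reference, k):
    """RAWs shared by every target genome (position is not checked here)."""
    reference_words = kmers(reference, k)
    shared = None
    for genome in targets:
        words = kmers(genome, k) - reference_words
        shared = words if shared is None else shared & words
    return shared or set()

def gc_content(word):
    """Fraction of G and C bases in a word."""
    return sum(base in "GC" for base in word) / len(word)
```

For real genomes the host k-mer set would not fit comfortably in a Python set for larger k, so a dedicated tool with compact data structures would be needed; the sketch only shows the set difference at the core of the idea.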

ACS Style

Diogo Pratas; Jorge M Silva. Persistent minimal sequences of SARS-CoV-2. Bioinformatics 2020, 36, 5129 -5132.

AMA Style

Diogo Pratas, Jorge M Silva. Persistent minimal sequences of SARS-CoV-2. Bioinformatics. 2020; 36 (21):5129-5132.

Chicago/Turabian Style

Diogo Pratas; Jorge M Silva. 2020. "Persistent minimal sequences of SARS-CoV-2." Bioinformatics 36, no. 21: 5129-5132.

Journal article
Published: 07 July 2020 in Forensic Science International: Genetics

The imprints left by persistent DNA viruses in the tissues can testify to the changes driving virus evolution as well as provide clues on the provenance of modern and ancient humans. However, the history hidden in skeletal remains is practically unknown, as only parvovirus B19 and hepatitis B virus DNA have been detected in hard tissues so far. Here, we investigated the DNA prevalences of 38 viruses in femoral bone of recently deceased individuals. To this end, we used quantitative PCRs and a custom viral targeted enrichment followed by next-generation sequencing. The data was analyzed with a tailor-made bioinformatics pipeline. Our findings revealed bone to be a much richer source of persistent DNA viruses than earlier perceived, discovering ten additional ones, including several members of the herpes- and polyomavirus families, as well as human papillomavirus 31 and torque teno virus. Remarkably, many of the viruses found have oncogenic potential and/or are likely to reactivate in the elderly and immunosuppressed individuals. Thus, their persistence warrants careful evaluation of their clinical significance and impact on bone biology. Our findings open new frontiers for the study of virus evolution from ancient relics as well as provide new tools for the investigation of human skeletal remains in forensic and archeological contexts.

ACS Style

Mari Toppinen; Diogo Pratas; Elina Väisänen; Maria Söderlund-Venermo; Klaus Hedman; Maria F. Perdomo; Antti Sajantila. The landscape of persistent human DNA viruses in femoral bone. Forensic Science International: Genetics 2020, 48, 102353 .

AMA Style

Mari Toppinen, Diogo Pratas, Elina Väisänen, Maria Söderlund-Venermo, Klaus Hedman, Maria F. Perdomo, Antti Sajantila. The landscape of persistent human DNA viruses in femoral bone. Forensic Science International: Genetics. 2020; 48:102353.

Chicago/Turabian Style

Mari Toppinen; Diogo Pratas; Elina Väisänen; Maria Söderlund-Venermo; Klaus Hedman; Maria F. Perdomo; Antti Sajantila. 2020. "The landscape of persistent human DNA viruses in femoral bone." Forensic Science International: Genetics 48: 102353.

Original software publication
Published: 20 June 2020 in SoftwareX

Next-generation sequencing triggered the production of a massive volume of publicly available data and the development of new specialised tools. These tools are dispersed over different frameworks, making the management and analyses of the data a challenging task. Additionally, new targeted tools are needed, given the dynamics and specificities of the field. We present GTO, a comprehensive toolkit designed to unify pipelines in genomic and proteomic research, which combines specialised tools for analysis, simulation, compression, development, visualisation, and transformation of the data. This toolkit combines novel tools with a modular architecture, being an excellent platform for experimental scientists, as well as a useful resource for teaching bioinformatics enquiry to students in life sciences. GTO is implemented in C language and is available, under the MIT license, at https://bioinformatics.ua.pt/gto.

ACS Style

João R. Almeida; Armando J. Pinho; Olga Margarida Fajarda Oliveira; Olga Fajarda; Diogo Pratas. GTO: A toolkit to unify pipelines in genomic and proteomic research. SoftwareX 2020, 12, 100535 .

AMA Style

João R. Almeida, Armando J. Pinho, Olga Margarida Fajarda Oliveira, Olga Fajarda, Diogo Pratas. GTO: A toolkit to unify pipelines in genomic and proteomic research. SoftwareX. 2020; 12:100535.

Chicago/Turabian Style

João R. Almeida; Armando J. Pinho; Olga Margarida Fajarda Oliveira; Olga Fajarda; Diogo Pratas. 2020. "GTO: A toolkit to unify pipelines in genomic and proteomic research." SoftwareX 12: 100535.

Journal article
Published: 01 May 2020 in GigaScience

Background: The development of high-throughput sequencing technologies and, as its result, the production of huge volumes of genomic data, has accelerated biological and medical research and discovery. Study on genomic rearrangements is crucial owing to their role in chromosomal evolution, genetic disorders, and cancer. Results: We present Smash++, an alignment-free and memory-efficient tool to find and visualize small- and large-scale genomic rearrangements between 2 DNA sequences. This computational solution extracts information contents of the 2 sequences, exploiting a data compression technique to find rearrangements. We also present Smash++ visualizer, a tool that allows the visualization of the detected rearrangements along with their self- and relative complexity, by generating an SVG (Scalable Vector Graphics) image. Conclusions: Tested on several synthetic and real DNA sequences from bacteria, fungi, Aves, and Mammalia, the proposed tool was able to accurately find genomic rearrangements. The detected regions were in accordance with previous studies, which took alignment-based approaches or performed FISH (fluorescence in situ hybridization) analysis. The maximum peak memory usage among all experiments was ∼1 GB, which makes Smash++ feasible to run on present-day standard computers.

ACS Style

Morteza Hosseini; Diogo Pratas; Burkhard Morgenstern; Armando J Pinho. Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements. GigaScience 2020, 9, 1 .

AMA Style

Morteza Hosseini, Diogo Pratas, Burkhard Morgenstern, Armando J Pinho. Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements. GigaScience. 2020; 9 (5):1.

Chicago/Turabian Style

Morteza Hosseini; Diogo Pratas; Burkhard Morgenstern; Armando J Pinho. 2020. "Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements." GigaScience 9, no. 5: 1.

Journal article
Published: 16 January 2020 in Entropy

Sources that generate symbolic sequences with algorithmic nature may differ in statistical complexity because they create structures that follow algorithmic schemes, rather than generating symbols from a probabilistic function assuming independence. In the case of Turing machines, this means that machines with the same algorithmic complexity can create tapes with different statistical complexity. In this paper, we use a compression-based approach to measure global and local statistical complexity of specific Turing machine tapes with the same number of states and alphabet. Both measures are estimated using the best-order Markov model. For the global measure, we use the Normalized Compression (NC), while, for the local measures, we define and use normal and dynamic complexity profiles to quantify and localize lower and higher regions of statistical complexity. We assessed the validity of our methodology on synthetic and real genomic data, showing that it is tolerant to increasing rates of edits and block permutations. Regarding the analysis of the tapes, we localize patterns of higher statistical complexity in two regions, for a different number of machine states. We show that these patterns are generated by a decrease of the tape’s amplitude, given the setting of small rule cycles. Additionally, we performed a comparison with a measure that uses both algorithmic and statistical approaches (BDM) for analysis of the tapes. Naturally, BDM is efficient given the algorithmic nature of the tapes. However, for a higher number of states, BDM is progressively approximated by our methodology. Finally, we provide a simple algorithm to increase the statistical complexity of a Turing machine tape while retaining the same algorithmic complexity. We supply a publicly available implementation of the algorithm in C++ under the GPLv3 license. All results can be reproduced in full with scripts provided at the repository.
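The global measure can be illustrated with a short sketch. It assumes the common form NC(x) = C(x) / (|x| log2 |A|), where C(x) is the number of bits needed to code x and |A| is the alphabet size, and it estimates C(x) with an adaptive order-k Markov model, keeping the best order tried. The function names, the additive smoothing parameter `alpha`, and the `max_order` bound are assumptions of this sketch, not details taken from the paper.

```python
import math
from collections import defaultdict

def markov_code_length(sequence, order, alphabet_size, alpha=1.0):
    """Bits needed by an adaptive order-k Markov model with additive smoothing."""
    counts = defaultdict(lambda: defaultdict(int))
    bits = 0.0
    for i, symbol in enumerate(sequence):
        context = sequence[max(0, i - order):i]
        seen = counts[context]
        total = sum(seen.values())
        prob = (seen[symbol] + alpha) / (total + alpha * alphabet_size)
        bits -= math.log2(prob)
        seen[symbol] += 1  # learn only after the symbol has been coded
    return bits

def normalized_compression(sequence, alphabet_size, max_order=6):
    """NC using the best (lowest-cost) Markov order from 0 to max_order."""
    best = min(markov_code_length(sequence, k, alphabet_size)
               for k in range(max_order + 1))
    return best / (len(sequence) * math.log2(alphabet_size))
```

Values close to 0 indicate a highly predictable sequence, while values near 1 (or slightly above, because of the learning overhead) indicate a sequence that the model cannot compress.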

ACS Style

Jorge M. Silva; Eduardo Pinho; Sérgio Matos; Diogo Pratas. Statistical Complexity Analysis of Turing Machine tapes with Fixed Algorithmic Complexity Using the Best-Order Markov Model. Entropy 2020, 22, 105 .

AMA Style

Jorge M. Silva, Eduardo Pinho, Sérgio Matos, Diogo Pratas. Statistical Complexity Analysis of Turing Machine tapes with Fixed Algorithmic Complexity Using the Best-Order Markov Model. Entropy. 2020; 22 (1):105.

Chicago/Turabian Style

Jorge M. Silva; Eduardo Pinho; Sérgio Matos; Diogo Pratas. 2020. "Statistical Complexity Analysis of Turing Machine tapes with Fixed Algorithmic Complexity Using the Best-Order Markov Model." Entropy 22, no. 1: 105.

Preprint content
Published: 07 January 2020

Summary: Next-generation sequencing triggered the production of a massive volume of publicly available data and the development of new specialised tools. These tools are dispersed over different frameworks, making the management and analyses of the data a challenging task. Additionally, new targeted tools are needed, given the dynamics and specificities of the field. We present GTO, a comprehensive toolkit designed to unify pipelines in genomic and proteomic research, which combines specialised tools for analysis, simulation, compression, development, visualisation, and transformation of the data. This toolkit combines novel tools with a modular architecture, being an excellent platform for experimental scientists, as well as a useful resource for teaching bioinformatics inquiry to students in life sciences. Availability and implementation: GTO is implemented in C language and it is available, under the MIT license, at http://bioinformatics.ua.pt/[email protected] Supplementary information: Supplementary data are available at publisher’s Web site.

ACS Style

Joao Rafael Almeida; Armando J. Pinho; José Luis Oliveira; Olga Fajarda; Diogo Pratas. GTO: a toolkit to unify pipelines in genomic and proteomic research. 2020, 1 .

AMA Style

Joao Rafael Almeida, Armando J. Pinho, José Luis Oliveira, Olga Fajarda, Diogo Pratas. GTO: a toolkit to unify pipelines in genomic and proteomic research. 2020; 1.

Chicago/Turabian Style

Joao Rafael Almeida; Armando J. Pinho; José Luis Oliveira; Olga Fajarda; Diogo Pratas. 2020. "GTO: a toolkit to unify pipelines in genomic and proteomic research." 1.

Preprint content
Published: 25 December 2019

Background: The development of high-throughput sequencing technologies and, as its result, the production of huge volumes of genomic data, has accelerated biological and medical research and discovery. Study on genomic rearrangements is crucial due to their role in chromosomal evolution, genetic disorders and cancer. Results: We present Smash++, an alignment-free and memory-efficient tool to find and visualize small- and large-scale genomic rearrangements between two DNA sequences. This computational solution extracts information contents of the two sequences, exploiting a data compression technique to find rearrangements. We also present Smash++ visualizer, a tool that allows the visualization of the detected rearrangements along with their self- and relative complexity, by generating an SVG (Scalable Vector Graphics) image. Conclusions: Tested on several synthetic and real DNA sequences from bacteria, fungi, Aves and Mammalia, the proposed tool was able to accurately find genomic rearrangements. The detected regions agreed with previous studies which took alignment-based approaches or performed FISH (fluorescence in situ hybridization) analysis. The maximum peak memory usage among all experiments was ~1 GB, which makes Smash++ feasible to run on present-day standard computers.

ACS Style

Morteza Hosseini; Diogo Pratas; Burkhard Morgenstern; Armando J. Pinho. Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements. 2019, 1 .

AMA Style

Morteza Hosseini, Diogo Pratas, Burkhard Morgenstern, Armando J. Pinho. Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements. 2019; 1.

Chicago/Turabian Style

Morteza Hosseini; Diogo Pratas; Burkhard Morgenstern; Armando J. Pinho. 2019. "Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements." 1.

Journal article
Published: 02 November 2019 in Entropy

The development of efficient data compressors for DNA sequences is crucial not only for reducing the storage and the bandwidth for transmission, but also for analysis purposes. In particular, the development of improved compression models directly influences the outcome of anthropological and biomedical compression-based methods. In this paper, we describe a new lossless compressor with improved compression capabilities for DNA sequences representing different domains and kingdoms. The reference-free method uses a competitive prediction model to estimate, for each symbol, the best class of models to be used before applying arithmetic encoding. There are two classes of models: weighted context models (including substitutional tolerant context models) and weighted stochastic repeat models. Both classes of models use specific sub-programs to handle inverted repeats efficiently. The results show that the proposed method attains a higher compression ratio than state-of-the-art approaches, on a balanced and diverse benchmark, using a competitive level of computational resources. An efficient implementation of the method is publicly available, under the GPLv3 license.

ACS Style

Diogo Pratas; Morteza Hosseini; Jorge M. Silva; Armando J. Pinho. A Reference-Free Lossless Compression Algorithm for DNA Sequences Using a Competitive Prediction of Two Classes of Weighted Models. Entropy 2019, 21, 1074 .

AMA Style

Diogo Pratas, Morteza Hosseini, Jorge M. Silva, Armando J. Pinho. A Reference-Free Lossless Compression Algorithm for DNA Sequences Using a Competitive Prediction of Two Classes of Weighted Models. Entropy. 2019; 21 (11):1074.

Chicago/Turabian Style

Diogo Pratas; Morteza Hosseini; Jorge M. Silva; Armando J. Pinho. 2019. "A Reference-Free Lossless Compression Algorithm for DNA Sequences Using a Competitive Prediction of Two Classes of Weighted Models." Entropy 21, no. 11: 1074.

Conference paper
Published: 22 June 2019 in Advances in Intelligent Systems and Computing

Primer and adapter sequences are synthetic DNA or RNA oligonucleotides used in the process of amplification and sequencing. In theory, while similar primer sequences can be present in assembled genomes, adapter sequences should be trimmed (filtered) and, hence, absent from assembled genomes. However, given ambiguity problems, inefficient parameterization of trimming tools, and other factors, they can occasionally be found in assembled genomes, in exact or approximate form. In this paper, we investigate the occurrence of exact and approximate primer-adapter subsequences in assembled genomes and, specifically, in the whole archaeal genomes of the NCBI database. We present a new method that combines data compression with custom signal processing operations, namely filtering and segmentation, to localize and visualize these regions given a defined similarity threshold. The program is freely available, under GPLv3 license, at https://github.com/pratas/maple.

ACS Style

Diogo Pratas; Morteza Hosseini; Armando J. Pinho. Visualization of Similar Primer and Adapter Sequences in Assembled Archaeal Genomes. Advances in Intelligent Systems and Computing 2019, 129 -136.

AMA Style

Diogo Pratas, Morteza Hosseini, Armando J. Pinho. Visualization of Similar Primer and Adapter Sequences in Assembled Archaeal Genomes. Advances in Intelligent Systems and Computing. 2019; 129-136.

Chicago/Turabian Style

Diogo Pratas; Morteza Hosseini; Armando J. Pinho. 2019. "Visualization of Similar Primer and Adapter Sequences in Assembled Archaeal Genomes." Advances in Intelligent Systems and Computing: 129-136.

Conference paper
Published: 25 September 2018 in Computer Vision

In this paper, we address handwritten digit classification as a special problem of data compression modeling. The creation of the models—usually known as training—is just a process of counting. Moreover, the model associated to each class can be trained independently of all the other class models. Also, they can be updated later with new examples, even if the old ones are not available anymore. Under this framework, we show that it is possible to attain a classification accuracy consistently above 99.3% on the MNIST dataset, using classifiers trained in less than one hour on a common laptop.
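The "training is just counting" idea can be sketched on toy symbol sequences rather than MNIST images. The code below is an illustration under stated assumptions, not the authors' classifier: each class gets its own hypothetical `CountingModel` (an order-k context model with additive smoothing), and a sample is assigned to the class whose model would code it in the fewest bits.

```python
import math
from collections import defaultdict

class CountingModel:
    """Hypothetical per-class context model; training only increments counters."""

    def __init__(self, order, alphabet_size, alpha=1.0):
        self.order = order
        self.alphabet_size = alphabet_size
        self.alpha = alpha
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, sequence):
        # Can be called again later with new examples; old ones are not needed.
        for i, symbol in enumerate(sequence):
            context = sequence[max(0, i - self.order):i]
            self.counts[context][symbol] += 1

    def code_length(self, sequence):
        """Bits needed to code the sequence with the frozen counts."""
        bits = 0.0
        for i, symbol in enumerate(sequence):
            seen = self.counts.get(sequence[max(0, i - self.order):i], {})
            total = sum(seen.values())
            prob = (seen.get(symbol, 0) + self.alpha) / (
                total + self.alpha * self.alphabet_size)
            bits -= math.log2(prob)
        return bits

def classify(models, sequence):
    """Return the label whose model gives the shortest code for the sequence."""
    return min(models, key=lambda label: models[label].code_length(sequence))
```

Each model is trained independently of the others, so adding a new class or new examples for one class leaves the remaining models untouched, matching the properties listed in the abstract.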

ACS Style

Armando J. Pinho; Diogo Pratas. An Application of Data Compression Models to Handwritten Digit Classification. Computer Vision 2018, 487 -495.

AMA Style

Armando J. Pinho, Diogo Pratas. An Application of Data Compression Models to Handwritten Digit Classification. Computer Vision. 2018; 487-495.

Chicago/Turabian Style

Armando J. Pinho; Diogo Pratas. 2018. "An Application of Data Compression Models to Handwritten Digit Classification." Computer Vision: 487-495.

Journal article
Published: 06 September 2018 in Genes

The sequencing of ancient DNA samples provides a novel way to find, characterize, and distinguish exogenous genomes of endogenous targets. After sequencing, computational composition analysis enables filtering of undesired sources in the focal organism, with the purpose of improving the quality of assemblies and subsequent data analysis. More importantly, such analysis allows extinct and extant species to be identified without requiring a specific or new sequencing run. However, the identification of exogenous organisms is a complex task, given the nature and degradation of the samples, and the evident necessity of using efficient computational tools, which rely on algorithms that are both fast and highly sensitive. In this work, we relied on a fast and highly sensitive tool, FALCON-meta, which measures similarity against whole-genome reference databases, to analyse the metagenomic composition of an ancient polar bear (Ursus maritimus) jawbone fossil. The fossil was collected in Svalbard, Norway, and has an estimated age of 110,000 to 130,000 years. The FASTQ samples contained 349 GB of nonamplified shotgun sequencing data. We identified and localized, relative to the FASTQ samples, the genomes with significant similarities to reference microbial genomes, including those of viruses, bacteria, and archaea, and to fungal, mitochondrial, and plastidial sequences. Among other striking features, we found significant similarities between modern-human, some bacterial and viral sequences (contamination) and the organelle sequences of wild carrot and tomato relative to the whole samples. For each exogenous candidate, we ran a damage pattern analysis, which in addition to revealing shallow levels of damage in the plant candidates, identified the source as contamination.

ACS Style

Diogo Pratas; Morteza Hosseini; Gonçalo Grilo; Armando J. Pinho; Raquel M. Silva; Tânia Caetano; João Carneiro; Filipe Pereira. Metagenomic Composition Analysis of an Ancient Sequenced Polar Bear Jawbone from Svalbard. Genes 2018, 9, 445 .

AMA Style

Diogo Pratas, Morteza Hosseini, Gonçalo Grilo, Armando J. Pinho, Raquel M. Silva, Tânia Caetano, João Carneiro, Filipe Pereira. Metagenomic Composition Analysis of an Ancient Sequenced Polar Bear Jawbone from Svalbard. Genes. 2018; 9 (9):445.

Chicago/Turabian Style

Diogo Pratas; Morteza Hosseini; Gonçalo Grilo; Armando J. Pinho; Raquel M. Silva; Tânia Caetano; João Carneiro; Filipe Pereira. 2018. "Metagenomic Composition Analysis of an Ancient Sequenced Polar Bear Jawbone from Svalbard." Genes 9, no. 9: 445.

Conference paper
Published: 01 September 2018 in 2018 26th European Signal Processing Conference (EUSIPCO)
ACS Style

Diogo Pratas; Armando J. Pinho. Metagenomic composition analysis of sedimentary ancient DNA from the Isle of Wight. 2018 26th European Signal Processing Conference (EUSIPCO) 2018, 1 .

AMA Style

Diogo Pratas, Armando J. Pinho. Metagenomic composition analysis of sedimentary ancient DNA from the Isle of Wight. 2018 26th European Signal Processing Conference (EUSIPCO). 2018; 1.

Chicago/Turabian Style

Diogo Pratas; Armando J. Pinho. 2018. "Metagenomic composition analysis of sedimentary ancient DNA from the Isle of Wight." 2018 26th European Signal Processing Conference (EUSIPCO): 1.

Journal article
Published: 01 September 2018 in Pattern Recognition Letters

The Normalized Relative Compression (NRC) is a recent dissimilarity measure related to the Kolmogorov complexity. It has been successfully used in different applications, such as DNA sequences, images, and even ECG (electrocardiographic) signals. It uses a compressor that compresses a target string using exclusively the information contained in a reference string. One possible approach is to use finite-context models (FCMs) to represent the strings. A finite-context model calculates the probability distribution of the next symbol, given the previous k symbols. In this paper, we introduce a generalization of the FCMs, called extended-alphabet finite-context models (xaFCMs), that calculates the probability of occurrence of the next d symbols, given the previous k symbols. We perform experiments on two different sample applications using the xaFCMs and the NRC measure: ECG biometric identification, using a publicly available database, and estimation of the similarity between DNA sequences of two different, but related, species, chromosome by chromosome. In both applications, we compare the results against those obtained by the FCMs. The results show that the xaFCMs use less memory and computational time to achieve the same or, in some cases, even more accurate results.
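
As a rough illustration of the idea in this abstract, the sketch below builds an order-k finite-context model from a reference string and uses it, frozen, to estimate how many bits a target string would cost. The result is then normalized by the cost of an uninformed encoding, which is one common way of writing the NRC. The function names, the smoothing parameter `alpha`, and the choice to skip the first k symbols of the target are assumptions made for this example, not details taken from the paper. The xaFCM extension (predicting d symbols at once) is not shown.

```python
import math
from collections import defaultdict


def build_fcm(reference, k):
    """Count how often each symbol follows each length-k context in the reference."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(k, len(reference)):
        counts[reference[i - k:i]][reference[i]] += 1
    return counts


def relative_bits(target, counts, k, alphabet, alpha=1.0):
    """Estimate the bits needed to encode target using only the reference counts.

    The model is frozen: nothing from the target is learned. `alpha` is an
    additive smoothing constant (an assumption of this sketch) that keeps
    unseen symbols from getting probability zero.
    """
    total = 0.0
    for i in range(k, len(target)):
        ctx = counts.get(target[i - k:i], {})
        n_ctx = sum(ctx.values())
        p = (ctx.get(target[i], 0) + alpha) / (n_ctx + alpha * len(alphabet))
        total -= math.log2(p)
    return total


def nrc(target, reference, k, alphabet):
    """Bits under the reference model divided by the bits of a uniform encoding."""
    counts = build_fcm(reference, k)
    n = len(target) - k  # the first k symbols have no full context and are skipped
    return relative_bits(target, counts, k, alphabet) / (n * math.log2(len(alphabet)))


if __name__ == "__main__":
    dna = "ACGT"
    print(nrc("ACGT" * 10, "ACGT" * 50, 3, dna))  # related strings: close to 0
    print(nrc("ACGT" * 10, "AAAA" * 50, 3, dna))  # unrelated strings: close to 1
```

Values near 0 suggest the reference describes the target well; values near 1 suggest it contributes nothing beyond a uniform guess.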

ACS Style

João M. Carvalho; Susana Brás; Diogo Pratas; Jacqueline Ferreira; Sandra C. Soares; Armando J. Pinho. Extended-alphabet finite-context models. Pattern Recognition Letters 2018, 112, 49-55.

AMA Style

João M. Carvalho, Susana Brás, Diogo Pratas, Jacqueline Ferreira, Sandra C. Soares, Armando J. Pinho. Extended-alphabet finite-context models. Pattern Recognition Letters. 2018; 112:49-55.

Chicago/Turabian Style

João M. Carvalho; Susana Brás; Diogo Pratas; Jacqueline Ferreira; Sandra C. Soares; Armando J. Pinho. 2018. "Extended-alphabet finite-context models." Pattern Recognition Letters 112: 49-55.

Conference paper
Published: 17 August 2018 in Advances in Intelligent Systems and Computing

The progress in sequencing technologies and the increasing availability of DNA sequences from extant and extinct organisms are shaping our knowledge about species origin and development, as well as driving improvements in the computational methods for storage and analysis. Given the large volume of DNA sequences, computational models that efficiently represent diverse DNA sequences using low computational resources are very welcome. Currently, there is no standard corpus for benchmarking compression algorithms that enables a wide and fair comparison. Such a corpus should reflect the main domains and kingdoms, without being excessive in size and number of sequences. In this paper, we provide such a DNA sequence corpus, overviewing its elements and furnishing a comparison of some of the algorithms for DNA sequence compression. The corpus is available at https://tinyurl.com/DNAcorpus.

ACS Style

Diogo Pratas; Armando J. Pinho. A DNA Sequence Corpus for Compression Benchmark. Advances in Intelligent Systems and Computing 2018, 208-215.

AMA Style

Diogo Pratas, Armando J. Pinho. A DNA Sequence Corpus for Compression Benchmark. Advances in Intelligent Systems and Computing. 2018:208-215.

Chicago/Turabian Style

Diogo Pratas; Armando J. Pinho. 2018. "A DNA Sequence Corpus for Compression Benchmark." Advances in Intelligent Systems and Computing: 208-215.

Conference paper
Published: 17 August 2018 in Advances in Intelligent Systems and Computing

Amino acid sequences are known to be very hard to compress. In this paper, we propose AC, a lossless compressor for efficient compression of amino acid sequences. The compressor uses a cooperation between multiple context models and substitution-tolerant context models. The cooperation between models is balanced with weights that benefit the models with better performance, according to a forgetting function specific to each model. We show consistently better compression results than other approaches, using low computational resources. The compressor implementation is freely available, under the GPLv3 license, at https://github.com/pratas/ac.
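
The weighted cooperation between models described in this abstract can be pictured with a small sketch. Each model proposes a probability distribution for the next symbol, the distributions are averaged using the current weights, and after the symbol is seen each weight is raised to a per-model forgetting exponent and multiplied by the probability that model gave to the symbol. This particular update rule, the names `mix` and `update_weights`, and the exponent values are assumptions chosen for illustration; the abstract does not spell out the exact formula used in AC.

```python
def mix(probabilities, weights):
    """Combine per-model distributions into one, weighting each model's vote."""
    symbols = probabilities[0].keys()
    return {s: sum(w * p[s] for w, p in zip(weights, probabilities)) for s in symbols}


def update_weights(weights, probabilities, symbol, gammas):
    """Reward models that assigned high probability to the symbol that occurred.

    gammas holds one forgetting exponent per model, each in (0, 1]: values
    below 1 shrink the influence of past performance. All probabilities are
    assumed strictly positive so the normalizing total is never zero.
    """
    raw = [(w ** g) * p[symbol] for w, p, g in zip(weights, probabilities, gammas)]
    total = sum(raw)
    return [r / total for r in raw]


if __name__ == "__main__":
    # Two hypothetical models over a toy two-symbol alphabet.
    model_a = {"A": 0.9, "L": 0.1}
    model_b = {"A": 0.1, "L": 0.9}
    weights = [0.5, 0.5]
    print(mix([model_a, model_b], weights))
    weights = update_weights(weights, [model_a, model_b], "A", gammas=[0.9, 0.9])
    print(weights)  # the model that predicted "A" well now has the larger weight
```

A smaller exponent makes a model's weight react faster to recent symbols, which is the role the abstract attributes to the per-model forgetting function.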

ACS Style

Diogo Pratas; Morteza Hosseini; Armando J. Pinho. Compression of Amino Acid Sequences. Advances in Intelligent Systems and Computing 2018, 105-113.

AMA Style

Diogo Pratas, Morteza Hosseini, Armando J. Pinho. Compression of Amino Acid Sequences. Advances in Intelligent Systems and Computing. 2018:105-113.

Chicago/Turabian Style

Diogo Pratas; Morteza Hosseini; Armando J. Pinho. 2018. "Compression of Amino Acid Sequences." Advances in Intelligent Systems and Computing: 105-113.

Conference paper
Published: 17 August 2018 in Advances in Intelligent Systems and Computing

The great increase in the amount of sequenced DNA has created a problem: the storage of the sequences. As such, the development of data compression techniques designed specifically for genetic information is an important area of research. Likewise, the ability to search for DNA sequences similar to parts of a larger sequence, such as a chromosome, plays an important role in the study of organisms and of possible connections between different species. This paper proposes NET-ASAR, a tool for DNA sequence search based on data compression, specifically on finite-context models, which obtains a measure of similarity between a reference and a target. The method uses finite-context models to create a statistical model of the reference sequence and then estimates the number of bits necessary to encode the target sequence using the reference model. NET-ASAR is freely available, under the GPLv3 license, at https://github.com/manuelgaspar/NET-ASAR.
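
To make the search idea concrete, the sketch below trains a finite-context model on a reference, computes the estimated cost in bits of every position of a longer target, and slides a window over that profile to find the region that the reference explains best. This is only an illustration of compression-based search under assumed details: the function names, the additive smoothing, the window scan, and which sequence plays the role of reference are choices made here and are not taken from NET-ASAR.

```python
import math
from collections import Counter, defaultdict


def train(reference, k):
    """Count which symbol follows each length-k context in the reference."""
    model = defaultdict(Counter)
    for i in range(k, len(reference)):
        model[reference[i - k:i]][reference[i]] += 1
    return model


def bit_profile(target, model, k, alphabet_size=4, alpha=1.0):
    """Estimated bits per target position (from index k on), model kept frozen."""
    profile = []
    for i in range(k, len(target)):
        ctx = model.get(target[i - k:i], Counter())
        p = (ctx[target[i]] + alpha) / (sum(ctx.values()) + alpha * alphabet_size)
        profile.append(-math.log2(p))
    return profile


def best_window(profile, width):
    """Return (start, mean_bits) of the window with the lowest average cost."""
    window_sum = sum(profile[:width])
    best = (0, window_sum / width)
    for start in range(1, len(profile) - width + 1):
        # Slide the window by one: add the entering value, drop the leaving one.
        window_sum += profile[start + width - 1] - profile[start - 1]
        if window_sum / width < best[1]:
            best = (start, window_sum / width)
    return best


if __name__ == "__main__":
    k = 4
    reference = "ACGTTGCA" * 20
    target = "A" * 50 + "ACGTTGCA" * 5 + "A" * 50  # a similar region is embedded
    start, mean_bits = best_window(bit_profile(target, train(reference, k), k), 30)
    print(start + k, mean_bits)  # start + k maps back to a position in target
```

Regions where the average cost falls well below log2 of the alphabet size (2 bits for DNA) are candidates for similarity to the reference.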

ACS Style

Manuel Gaspar; Diogo Pratas; Armando J. Pinho. NET-ASAR: A Tool for DNA Sequence Search Based on Data Compression. Advances in Intelligent Systems and Computing 2018, 114-122.

AMA Style

Manuel Gaspar, Diogo Pratas, Armando J. Pinho. NET-ASAR: A Tool for DNA Sequence Search Based on Data Compression. Advances in Intelligent Systems and Computing. 2018:114-122.

Chicago/Turabian Style

Manuel Gaspar; Diogo Pratas; Armando J. Pinho. 2018. "NET-ASAR: A Tool for DNA Sequence Search Based on Data Compression." Advances in Intelligent Systems and Computing: 114-122.