Computational Methods and Machine Learning for Crosslinking Mass Spectrometry Data Analysis


Computational Methods for Mass Spectrometry Proteomics

Author : Ingvar Eidhammer
Publisher : John Wiley & Sons
Page : 296 pages
File Size : 48,73 MB
Release : 2008-02-28
Category : Medical
ISBN : 9780470724293


Proteomics is the study of the subsets of proteins present in different parts of an organism and how they change with time and varying conditions. Mass spectrometry is the leading technology used in proteomics, and the field relies heavily on bioinformatics to process and analyze the acquired data. Since recent years have seen tremendous developments in instrumentation and proteomics-related bioinformatics, there is clearly a need for a solid introduction to the crossroads where proteomics and bioinformatics meet. Computational Methods for Mass Spectrometry Proteomics describes the different instruments and methodologies used in proteomics in a unified manner. The authors put an emphasis on the computational methods for the different phases of a proteomics analysis, but the underlying principles in protein chemistry and instrument technology are also described. The book is illustrated by a number of figures and examples, and contains exercises for the reader. Written in an accessible yet rigorous style, it is a valuable reference for both informaticians and biologists. Computational Methods for Mass Spectrometry Proteomics is suited for advanced undergraduate and graduate students of bioinformatics and molecular biology with an interest in proteomics. It also provides a good introduction and reference source for researchers new to proteomics, and for people who come into more peripheral contact with the field.

High-Performance Algorithms for Mass Spectrometry-Based Omics

Author : Fahad Saeed
Publisher : Springer Nature
Page : 146 pages
File Size : 36,69 MB
Release : 2022-09-02
Category : Science
ISBN : 3031019601


To date, processing of high-throughput Mass Spectrometry (MS) data is accomplished using serial algorithms. Developing new methods to process MS data is an active area of research, but there is no single strategy that focuses on the scalability of MS-based methods. Mass spectrometry is a diverse and versatile technology for high-throughput functional characterization of proteins, small molecules and metabolites in complex biological mixtures. In recent years the technology has rapidly evolved and is now capable of generating increasingly large (multiple terabytes per experiment) and complex (multiple species/microbiome/high-dimensional) data sets. This rapid advance in MS instrumentation must be matched by an equally rapid evolution of scalable methods for the analysis of these complex data sets. Ideally, the new methods should leverage the rich, heterogeneous computational resources now ubiquitously available in the form of multicore, manycore, CPU-GPU, CPU-FPGA, and Intel Phi architectures. The absence of such high-performance computing algorithms now hinders scientific advancement in mass spectrometry research. In this book we illustrate the need for high-performance computing algorithms for MS-based proteomics and proteogenomics, and showcase our progress in developing these high-performance algorithms.

Data Analysis in Proteomics: Novel Computational Strategies for Modeling and Interpreting Complex Mass Spectrometry Data

Author :
Publisher :
Page : pages
File Size : 41,59 MB
Release : 2008
Category :
ISBN :


Contemporary proteomics studies require computational approaches to deal with both the complexity and the volume of the data generated. The amalgamation of mass spectrometry -- the analytical tool of choice in proteomics -- with the computational and statistical sciences is still recent, and several avenues of exploratory data analysis and statistical methodology remain relatively unexplored. The current study focuses on three broad analytical domains and develops novel exploratory approaches and practical tools in each. Data transform approaches are explored first. These methods re-frame data, allowing for the visualization and exploitation of features and trends that are not immediately evident. An exploratory approach based on the correlation transform is developed and used to identify mass-shift signals in mass spectra. This approach is used to identify and map post-translational modifications on individual peptides, and to identify SILAC modification-containing spectra in a full-scale proteomic analysis. Secondly, matrix decomposition and projection approaches are explored; these use an eigen-decomposition to extract general trends from groups of related spectra. A data visualization approach based on these techniques is demonstrated, capable of visualizing trends in large numbers of complex spectra, and a data compression and feature extraction technique suitable for use in spectral modeling is developed. Finally, a general machine learning approach is developed based on conditional random fields (CRFs). These models can handle arbitrary sequence modeling tasks, similar to hidden Markov models (HMMs), but are far more robust to interdependent observational features and do not require limiting independence assumptions to remain tractable. The theory behind this approach is developed, and a simple machine learning fragmentation model is built to test the hypothesis that reproducible sequence-specific intensity patterns exist in peptide fragmentation spectra.
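The correlation-transform idea lends itself to a simple illustration: correlating a binned spectrum with shifted copies of itself highlights recurring mass offsets such as those introduced by post-translational modifications or SILAC labels. The sketch below is a minimal, hypothetical Python version of that idea, not the thesis implementation; the function name, bin width, and shift range are illustrative assumptions.

```python
import numpy as np

def mass_shift_correlation(mz, intensity, max_shift_da=100.0, bin_width=0.01):
    """Correlate a binned spectrum with shifted copies of itself to surface
    recurring mass offsets (e.g. ~80 Da for phosphorylation, ~4/~8 Da for
    SILAC labels). Hypothetical helper: names, bin width and shift range
    are illustrative, not the author's actual method."""
    mz = np.asarray(mz, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    # Bin the spectrum onto a uniform m/z grid so a mass shift becomes an index offset.
    grid = np.arange(mz.min(), mz.max() + bin_width, bin_width)
    binned = np.zeros(grid.size)
    idx = np.clip(np.searchsorted(grid, mz), 0, grid.size - 1)
    np.add.at(binned, idx, intensity)

    max_lag = min(int(max_shift_da / bin_width), grid.size - 2)
    shifts_da = np.arange(1, max_lag + 1) * bin_width
    profile = np.zeros(max_lag)
    for i in range(1, max_lag + 1):
        a, b = binned[:-i], binned[i:]
        if a.std() > 0 and b.std() > 0:
            # Pearson correlation between the spectrum and its shifted copy.
            profile[i - 1] = np.corrcoef(a, b)[0, 1]
    return shifts_da, profile
```

Peaks in the returned correlation profile at characteristic mass differences hint at modification-related mass-shift signals worth inspecting further.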

Novel Data Analysis Approaches for Cross-linking Mass Spectrometry Proteomics and Glycoproteomics

Author : Lei Lu
Publisher :
Page : pages
File Size : 46,65 MB
Release : 2021
Category :
ISBN :


Bottom-up proteomics has emerged as a powerful technology for biological studies. The technique is used for a myriad of purposes, including, among others, protein identification, post-translational modification identification, protein-protein interaction analysis, protein quantification, and protein structure analysis. The data analysis approaches of bottom-up proteomics have evolved over the past two decades, and many different algorithms and software programs have been developed for these varied purposes. In this thesis, I have focused on improving the database search strategies for two important special applications of bottom-up proteomics: cross-linking mass spectrometry proteomics and O-glycoproteomics. In cross-linking mass spectrometry proteomics, a sample of proteins is treated with a chemical cross-linking reagent. This causes peptides within the proteins to be cross-linked to one another, forming peptide doublets that are released by treatment of the sample with a protease such as trypsin. The data analysis tools are designed to identify the cross-linked peptides. In O-glycoproteomics, the peptides released by protease digestion of the protein sample can be modified with any of, or even multiple, distinct O-glycans, and the data analysis tools should be able to identify all of the glycans and the modification sites at which they are located. In both cases, traditional database searching strategies, which try to match the experimental spectra to all potential theoretical spectra, are not practical due to the large increase in search space. Researchers have lacked efficient data analysis tools for these two applications. Here we devised new search algorithms to address these problems and implemented them as two new software modules in our laboratory's bottom-up software engine MetaMorpheus (cross-linking data analysis via MetaMorpheusXL and O-glycoproteomics data analysis via O-Pair Search). The new search strategies used in the software are both based on ion-indexed open search (illustrated in the sketch after this description), which was first developed for large-scale proteomic studies in the programs MSFragger and Open-pFind. In this study the ion-indexed open search was optimized for cross-linking mass spectrometry proteomics and O-glycoproteomics and combined with other algorithms. In O-glycoproteomics, a graph-based algorithm is used to speed up the identification and localization of O-glycans. Other useful features have been added to the software, such as support for both cleavable and non-cleavable cross-links in the cross-link search module, and calculation of localization probabilities in the O-glyco search module. Further optimizations, including machine learning methods for false discovery rate (FDR) analysis, retention time prediction, and spectral prediction, could further improve the current best search approaches for cross-link proteomics and O-glycoproteomics data analysis. Chapter 1 provides an overview of bottom-up proteomics data analysis methods and outlines how ion-indexed open search can be useful for special bottom-up proteomics studies. Chapter 2 describes the development of a cross-linking mass spectrometry proteomics search module, resulting in efficiency improvements for both cleavable and non-cleavable cross-link proteomics data analysis.
Chapter 3 describes the development of an O-glycoproteomics search module; by combining the ion-indexed open search algorithm with the graph-based localization algorithm, O-Pair Search is more than 2000 times faster than the widely used software program Byonic. In Chapter 4, a novel top-down data acquisition method is described. Chapter 5 provides conclusions and future directions.
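As a rough illustration of the ion-indexed open-search idea described above (and not the actual MetaMorpheusXL or O-Pair Search code), the hypothetical Python sketch below builds an inverted index from binned theoretical fragment m/z values to peptide ids and scores spectra by counting matched fragment peaks without constraining the precursor mass. The bin width and data structures are assumptions chosen for clarity.

```python
from collections import defaultdict

FRAG_BIN = 0.02  # Da; illustrative fragment-mass bin width

def build_fragment_index(peptides):
    """Map binned theoretical fragment m/z values to peptide ids.

    `peptides` is a hypothetical dict {peptide_id: [fragment_mz, ...]}.
    """
    index = defaultdict(set)
    for pid, frags in peptides.items():
        for mz in frags:
            index[round(mz / FRAG_BIN)].add(pid)
    return index

def open_search(spectrum_peaks, index):
    """Score every indexed peptide by its number of matched fragment peaks,
    ignoring the precursor mass (the 'open' part of an open search)."""
    counts = defaultdict(int)
    for mz in spectrum_peaks:
        key = round(mz / FRAG_BIN)
        matched = set()
        # Check neighbouring bins to tolerate bin-edge effects.
        for neighbor in (key - 1, key, key + 1):
            matched |= index.get(neighbor, set())
        for pid in matched:
            counts[pid] += 1
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
```

Because only the fragment bins touched by the spectrum are visited, the cost per spectrum scales with the number of peaks rather than the number of candidate peptides, which is what makes the enlarged cross-link and glycan search spaces tractable.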

Novel Computational Methods for Mass Spectrometry Based Protein Identification

Author : Rachana Jain
Publisher :
Page : 129 pages
File Size : 43,86 MB
Release : 2010
Category :
ISBN :


Mass spectrometry (MS) is used routinely to identify proteins in biological samples. Peptide Mass Fingerprinting (PMF) uses peptide masses and a pre-specified search database to identify proteins. It is often used as a complementary method along with Peptide Fragment Fingerprinting (PFF) or de novo sequencing to increase the confidence and coverage of protein identification during mass spectrometric analysis. At the core of a PMF database search algorithm lies a similarity measure or quality statistic that is used to gauge how well an experimentally obtained peaklist agrees with a list of theoretically observable mass-to-charge ratios for a protein in a database. In this dissertation, we use publicly available gold-standard data sets to show that the selection of search criteria such as mass tolerance and missed cleavages significantly affects the identification results. We propose, implement and evaluate a statistical (Kolmogorov-Smirnov-based) test that is computed for a large mass error threshold, thus sparing the user the choice of an appropriate mass tolerance. We use the mass tolerance identified by the Kolmogorov-Smirnov test for computing other quality measures. The results from our careful and extensive benchmarks suggest that the new method of computing the quality statistics without requiring the end user to select a mass tolerance is competitive. We investigate the similarity measures in terms of their information content and conclude that they are complementary and can be combined into a scoring function to possibly improve the overall accuracy of PMF-based identification methods. We describe a new database search tool, PRIMAL, for protein identification using PMF. The novelty behind PRIMAL is two-fold. First, we comprehensively analyze methods for measuring the degree of similarity between experimental and theoretical peaklists. Second, we employ machine learning as a means of combining the individual similarity measures into a scoring function. Finally, we systematically test the efficacy of PRIMAL in identifying proteins using highly curated, publicly available data. Our results suggest that PRIMAL is competitive with, if not better than, some of the tools extensively used by the mass spectrometry community. A web server with an implementation of the scoring function is available at http://bmi.cchmc.org/primal. We also note that the methodology is directly extensible to the MS/MS-based protein identification problem, and we detail how to extend our approaches to the more complex MS/MS data.
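To make the Kolmogorov-Smirnov idea concrete, the hypothetical Python sketch below compares the distribution of peptide mass errors collected within a deliberately wide window against the roughly uniform distribution expected from random matches. The helper name, window size, and uniform-background assumption are illustrative and do not reproduce PRIMAL's actual scoring function.

```python
import numpy as np
from scipy import stats

def ks_match_statistic(observed_mz, theoretical_mz, window_da=2.0):
    """KS-based match quality without a user-chosen mass tolerance.

    Mass errors of true matches concentrate near zero, while random matches
    are roughly uniform over the wide window; a small p-value therefore
    suggests a better-than-chance agreement between peaklist and protein.
    Hypothetical helper, not PRIMAL's implementation.
    """
    theoretical_mz = np.sort(np.asarray(theoretical_mz, dtype=float))
    errors = []
    for mz in observed_mz:
        j = np.searchsorted(theoretical_mz, mz)
        candidates = theoretical_mz[max(j - 1, 0):j + 1]
        if candidates.size:
            err = mz - candidates[np.argmin(np.abs(candidates - mz))]
            if abs(err) <= window_da:
                errors.append(err)
    if len(errors) < 2:
        return 1.0
    # Null model: errors uniform on [-window_da, +window_da].
    null_cdf = stats.uniform(loc=-window_da, scale=2 * window_da).cdf
    return stats.kstest(errors, null_cdf).pvalue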

Machine Learning Methods for the Analysis of Liquid Chromatography-Mass Spectrometry Datasets in Metabolomics

Author : Francesc Fernández Albert
Publisher :
Page : 216 pages
File Size : 20,2 MB
Release : 2014
Category :
ISBN :


Liquid Chromatography-Mass Spectrometry (LC/MS) instruments are widely used in metabolomics. To analyse their output, computational tools and algorithms are needed to extract meaningful biological information. The main goal of this thesis is to provide new computational methods and tools to process and analyse LC/MS datasets in a metabolomic context. A total of four tools and methods were developed in this thesis. First, a new method was developed to correct possible non-linear drift effects in the retention times of LC/MS metabolomics data, coded as an R package called HCor. This method takes advantage of the retention time drift correlation found in typical LC/MS data, in which some chromatographic regions drift consistently differently from others. Our method hypothesises that this correlation structure is monotonic in retention time and fits a non-linear model to remove the unwanted drift from the dataset. The method was found to perform especially well on datasets suffering from large drift effects when compared to other state-of-the-art algorithms. Second, a new method was developed and implemented to solve known issues of peak intensity drift in metabolomics datasets. It is based on a two-step approach in which possible intensity drift effects are first corrected by modelling the drift, and the data are then normalised using the median of the resulting dataset. The drift was modelled using a Common Principal Components Analysis decomposition on the Quality Control classes, taking one, two or three Common Principal Components to model the drift space. This method was compared to four other drift correction and normalisation methods and was shown to remove intensity drift better than all of them. All the tested methods, including the two-step method, were coded as a publicly available R package called intCor. Third, a new processing step in the LC/MS data analysis workflow was proposed. In general, when LC/MS instruments are used in a metabolomic context, a metabolite may give rise to a set of peaks. The usual approach, however, is to treat each peak as a variable in the machine learning algorithms and statistical tests, despite the strong correlation structure among peaks coming from the same source metabolite. A strategy called peak aggregation was developed; these techniques extract a single measure for each metabolite from the intensity values of the peaks derived from that metabolite across the samples under study (see the sketch after this description). Applying a peak aggregation technique to each metabolite yields a transformed dataset in which the variables are no longer the peaks but the metabolites. Four different peak aggregation techniques were defined, and a repeated random sub-sampling cross-validation stage showed that the predictive power of the data improved when peak aggregation was used, regardless of the technique chosen. Fourth, a computational tool for end-to-end analysis called MAIT was developed and coded in the R environment. The MAIT package is highly modular and programmable, which makes it easy to replace existing modules with user-created ones and allows users to build personalised LC/MS data analysis workflows.
By default, MAIT takes the raw output files from an LC/MS instrument as input and, by applying a set of functions, produces a metabolite identification table as a result. It also produces a set of figures and tables that allow a detailed analysis of the metabolomic data. MAIT also accepts external peak data as input: the user can supply a peak table obtained with any other available tool, and MAIT can still apply all its other capabilities to this dataset, such as classification or mining the Human Metabolome Dataset, which is included in the package.
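As a minimal illustration of the peak aggregation step described above, the hypothetical Python sketch below collapses a samples-by-peaks intensity matrix into one variable per source metabolite using the first principal component of each metabolite's peak group. This is just one plausible aggregation, not necessarily any of the four techniques defined in the thesis, and the function and argument names are assumptions.

```python
import numpy as np

def aggregate_peaks(peak_matrix, peak_to_metabolite):
    """Collapse a (samples x peaks) intensity matrix into a
    (samples x metabolites) matrix, one aggregated variable per metabolite.

    `peak_to_metabolite` maps each peak (column index) to a metabolite name;
    both inputs are hypothetical, chosen only to illustrate the idea.
    """
    metabolites = sorted(set(peak_to_metabolite.values()))
    out = np.zeros((peak_matrix.shape[0], len(metabolites)))
    for k, met in enumerate(metabolites):
        cols = [j for j, m in peak_to_metabolite.items() if m == met]
        block = peak_matrix[:, cols]
        centered = block - block.mean(axis=0)
        # The first left singular vector (scaled by its singular value)
        # summarises the correlated peaks of this metabolite across samples.
        u, s, _ = np.linalg.svd(centered, full_matrices=False)
        out[:, k] = u[:, 0] * s[0]
    return out, metabolites
```

Downstream statistical tests or classifiers then operate on one variable per metabolite rather than on many correlated peak variables, which is the gain the thesis reports from peak aggregation.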