Author : Bo Song
Release : 2020
Category : Data mining
The amount of available data has grown significantly as a result of technological advances in this era of Big Data. The biomedical domain in particular is an exemplary field in which the number and scale of data sources have increased exponentially over the last decade. They are expected to grow even more rapidly and soon reach the level of zettabytes per year. While data obtained from advanced biotechnologies such as high-throughput sequencing encode valuable information, their volume is becoming overwhelming, and discovering knowledge from them for biological and medical research remains challenging with existing approaches. In this dissertation we study how to utilize these large-scale data, drawn from different numbers and types of sources, effectively and efficiently for biomedical knowledge discovery. Raw data from biological organisms, such as microbiome data, usually have an intrinsically high-dimensional feature space, which inevitably and exponentially raises the computational complexity of existing algorithms. We proposed a new approach that uses a spectral interpolation technique to represent high-dimensional data in a low-dimensional space; it not only greatly improves the efficiency of computing on large-scale data but also preserves as much information as possible from the original data. The improved outcomes for clustering and visualization tasks help reveal patterns and insights about microbial communities. We further studied how to enhance knowledge discovery when more than one data source is available. Large-scale relational data such as protein-protein interactions (PPI) can be represented as networks, which offer a more system-wide perspective than traditional mechanistic approaches for interpreting complex biological processes and functions.
While biological experiments are laborious and costly, with two or more networks from different data sources we can apply computational comparative analysis such as Network Alignment to bridge the knowledge between well-studied and under-examined species. We proposed new methods to globally align multiple large-scale biological networks from different species at the same time. We utilized both topological and biological features of PPI networks and searched heuristically for the best results. Representation learning for networks is also integrated into our proposed framework, providing a new way to quantify the structural features of a node, together with its surrounding topology, as a node embedding. Experiments on real data showed promising results in finding homologous proteins as well as conserved protein complexes in poorly studied species, enabling knowledge transfer from well-studied species. Besides utilizing homogeneous data from one or more sources of a single type, we explored the possibility of harnessing sources of different types to take advantage of the relational knowledge underlying heterogeneous data and to capture complex biomedical associations. The heterogeneous disease information networks we formulated in one study include sources of three types: diseases, pathways, and chemicals. They are filtered and scored using the Dynamic Time Warping (DTW) algorithm and the meta-path method to obtain topological and semantic scores, which lead to an effective measurement of disease similarity. In another study, we proposed a novel framework based on a Graph Convolutional Network to identify and predict disease-RNA associations, to better support the discovery of relational knowledge at the molecular level for medical applications such as disease diagnosis, therapy, and monitoring.
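As a brief illustration of the Dynamic Time Warping algorithm named above, the following is a minimal sketch of textbook DTW on two numeric sequences. The abstract does not say how the dissertation encodes disease information as sequences or how DTW scores are combined with meta-path scores, so the function name `dtw_distance`, the absolute-difference local cost, and the toy inputs are all assumptions made for illustration only.

```python
from math import inf


def dtw_distance(a, b):
    """Classic dynamic-programming DTW between two numeric sequences.

    Illustrative only: the local cost |x - y| and the plain-list inputs
    are assumptions, not the dissertation's actual scoring setup.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] matched to an already-used b element
                cost[i][j - 1],      # b[j-1] matched to an already-used a element
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


if __name__ == "__main__":
    # A repeated value in the second sequence is absorbed by the warping.
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

A smaller distance indicates that two sequences have a more similar shape once local stretching is allowed, which is the property that makes DTW usable as one ingredient of a similarity score.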