Statement of Research Interests
Xinghua Lu
My research interests center on applying statistical data mining and machine learning techniques to systems biology. I am especially interested in developing and applying statistical learning algorithms to identify patterns in large amounts of high-dimensional data that reflect the states of the signal transduction system. As a pharmacologist, I have always been intrigued by cellular signal transduction pathways and the complexity of the system. Before my transition to computational biology two years ago, my research as a pharmacologist concentrated mainly on individual pathways or protein molecules. It often occurred to me that the biomedical research of the last few decades had accumulated a wealth of knowledge at the molecular level, and that it was time to take a step back and view the cellular signal transduction system as a full-fledged forest with most of its leaves colorfully painted. Advances in biological techniques, such as DNA microarrays and high-throughput screening, have produced large amounts of data on many aspects of the cell. These data offer opportunities to study the cellular system, but they also pose challenges for conventional biologists. The transition from experimental to computational biologist was quite natural for me because of my long-standing interest and experience in scientific computing. Winning the National Library of Medicine training grant award gave me a great opportunity to extend my research abilities in this direction. My study and research have benefited greatly from the excellent artificial intelligence and statistics community in the Pittsburgh area.
My current research in computational biology falls into two major areas, described below. The first is to develop a latent variable generative model, the variational Bayesian cooperative vector quantizer (VBCVQ) model, to analyze DNA microarray data and model gene transcription regulation pathways. I have finished the mathematical derivation and implementation of the model. In addition to its potential biological applications, the model can be used in a wide range of other applications, e.g. image processing, image compression and content-based image retrieval. The model closely simulates the gene expression regulation system. It can overcome some drawbacks of commonly used existing techniques and address questions that other models fail to address. In general, the model has the following advantages: (1) Reduction of data dimension. (2) Identification of the key components of gene expression regulation pathways. (3) The capability to infer the states of the key components when given new microarray data. Such information can be useful for further exploring the mechanisms of disease, drug effect or toxicity, and for constructing diagnostic tools. Full Bayesian learning of the model allows us to address questions like ``what is the most efficient way to encode the information controlling gene transcription?'' or ``what are the key signal transduction components that control gene expression in a given kind of cell?'' Currently, I am testing the model on image encoding and mixed image separation. Once this stage is finished, I will apply the model to microarray analysis.
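To make the generative idea concrete, the following is a minimal Python sketch of sampling from a plain cooperative vector quantizer, in which several quantizers each select one code vector and the observation is their sum plus Gaussian noise. It is only an illustration under stated assumptions, not the VBCVQ implementation described above: the function name `sample_cvq`, the uniform prior over states, the isotropic noise and the toy dimensions are all hypothetical, and no variational Bayesian learning is shown.

```python
import numpy as np

def sample_cvq(codebooks, n_samples, noise_std=0.1, rng=None):
    """Draw samples from a cooperative vector quantizer (illustrative only).

    codebooks: array of shape (K, J, D), i.e. K quantizers, each holding
               J code vectors of dimension D.
    Returns the hidden states (n_samples, K) and the data (n_samples, D).
    """
    rng = np.random.default_rng(rng)
    K, J, D = codebooks.shape
    # One discrete hidden state per quantizer and sample.
    # A uniform prior over the J states is an assumption of this sketch.
    states = rng.integers(0, J, size=(n_samples, K))
    # Each quantizer contributes its selected code vector; the
    # contributions are summed, which is the "cooperative" part.
    means = codebooks[np.arange(K), states].sum(axis=1)
    # Isotropic Gaussian observation noise (also an assumption).
    data = means + noise_std * rng.standard_normal((n_samples, D))
    return states, data

# Toy setting: 3 quantizers, 4 code vectors each, 10-dimensional data.
toy_codebooks = np.random.default_rng(0).standard_normal((3, 4, 10))
toy_states, toy_data = sample_cvq(toy_codebooks, n_samples=5, rng=1)
```

In the microarray reading of such a model, each quantizer would loosely correspond to one regulatory component and each row of `toy_data` to one expression profile; inferring `toy_states` from new data is the task referred to in advantage (3).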
The second area I am working on is to identify and predict the function of protein motifs using data mining approaches. The Gene Ontology is a set of annotations that describe the biological system in a hierarchical fashion. The current Gene Ontology database can also serve as a knowledge base to facilitate biological discovery because it contains a large amount of information regarding the molecular function, biological process and cellular location of proteins. To make effective use of such a knowledge base, a biologist would like to query it in the following fashion: ``what is the protein motif that encodes a given molecular function?'' or ``what is the potential function of a conserved motif we identified?'' However, the current Gene Ontology database cannot answer such queries, because of the way the information is stored and the potential ambiguity caused by a conventional database query, even though the information is actually available. Working with collaborators at the University of Pittsburgh and Carnegie Mellon University, I have developed a general method to address the issue using data mining approaches. We extracted a set of features that help to disambiguate the association between protein motifs and Gene Ontology terms. We then trained a statistical classifier to determine whether a Gene Ontology term should be assigned to a protein motif, using probability to reflect the confidence or uncertainty. The method performs well when tested on known protein motifs from PROSITE. I will extend the work in two directions: (1) To develop a system based on the method and make it available to the scientific community for data mining. (2) To study the evolution of protein sequence motifs by further exploiting the knowledge in Gene Ontology with hierarchical aspect models. These studies will help identify the key residues within the motifs, and allow us to address questions like ``what amino acid plays the key role in proteins that act as kinase or reductase/oxidase?''
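The classification step described above, deciding whether a Gene Ontology term should be assigned to a motif and attaching a probability to that decision, can be sketched with a simple logistic regression. This is a hedged illustration only: the statement does not say which classifier or features were used, so logistic regression, the two feature definitions, the function names and the toy numbers below are all assumptions, and the data are synthetic rather than taken from Gene Ontology or PROSITE.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, n_iter=2000):
    """Fit logistic regression by batch gradient descent on the mean log-loss."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend an intercept column
    w = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X1 @ w)
        w -= lr * X1.T @ (p - y) / len(y)  # gradient of the mean log-loss
    return w

def predict_proba(w, X):
    """Probability that each (motif, GO term) pair is a true association."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    return sigmoid(X1 @ w)

# Synthetic, made-up feature rows for (motif, GO term) pairs.
# Hypothetical features:
#   column 0: fraction of motif-containing proteins annotated with the term
#   column 1: fraction of term-annotated proteins that contain the motif
X = np.array([[0.9, 0.8], [0.8, 0.6], [0.7, 0.9],
              [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]])
y = np.array([1, 1, 1, 0, 0, 0])  # 1 = term should be assigned to the motif

w = fit_logistic(X, y)
probs = predict_proba(w, X)  # one confidence value per candidate pair
```

The output of `predict_proba` is what lets a query return a ranked list of candidate terms with associated confidence, rather than the ambiguous yes/no answer of a conventional database lookup.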
Overall, my training in both experimental and computational biology enables me to combine the knowledge of both fields without any communication gap. I foresee that my research will follow both directions: computational method development and biological discovery. As a computational biologist, I will collaborate extensively with both experimental biologists and computer scientists to solve interesting biological problems. My short-term goal is to extend my current research as described above. In the long run, I will continue to learn, identify, develop and apply computational methods in the fields of drug discovery, drug toxicity prediction and the development of diagnostic tools based on biological data.