Molecular biology is rapidly evolving from an experimental science into a computational discipline. This transformation is fueled by simultaneous advances in modern computing and an explosion of global molecular profiling methods. A typical molecular profile contains tens of thousands of data points, and its interpretation relies on a relational database storing formalized knowledge about molecular interactions. The development of computerized knowledgebases for pathway and network analysis started at the beginning of this century in response to advances in DNA hybridization microarray technology, which allowed simultaneous measurement of mRNA expression for all genes in a biological sample. In 2003 Ariadne Genomics pioneered MedScan information extraction technology to find statements about molecular interactions in the scientific literature and to automatically populate the knowledgebase with the extracted information. MedScan is a highly accurate technology that reliably converts the enormous amount of literature accumulated during more than 60 years of research into a knowledgebase suitable for computational analysis. MedScan makes Pathway Studio a unique software product that provides tools for navigating the most comprehensive knowledgebase in molecular biology.
Scientific literature has proved to be an extremely rich source of molecular interaction data, but one that suffers from a large number of errors and omissions. Biological data is intrinsically ambiguous, not only because of technical noise from the experimental setup but also because of natural genetic variability and genetic linkage in biological samples. Genetic variability makes the response to the same environmental change unique in every biological sample. Genetic linkage causes every response to include non-specific components that are functionally irrelevant to the response. The high level of noise in the knowledgebase is exacerbated by the noise in the high-throughput molecular profiling data that is analyzed using the knowledgebase. Hence, the necessity to sift through the knowledgebase led to the development of statistical algorithms capable of finding the key regulatory events relevant to the biological response or cell process in focus. I worked at Ariadne Genomics on developing sub-network enrichment analysis (SNEA), which has become the major tool for interpreting raw molecular profiling data. I am happy to see how extensively SNEA is used throughout this book for making inferences from gene expression microarray data, providing the foundation for building mechanistic models.
Building predictive mechanistic models in biology requires multiple expert skills: a thorough understanding of the context and experimental approaches used for measuring the interactions in the knowledgebase, a thorough understanding of the limitations of high-throughput molecular profiling technologies, an expert understanding of the cellular processes involved in a disease or biological response, and a thorough understanding of the statistical algorithms that enable knowledge inference. Very few people in the world possess this combination of skills. Therefore I am not surprised that the book is written almost entirely by the Ariadne team, who also took advantage of the powerful graphical interface for pathway visualization and construction available in Pathway Studio. This book provides readers with deep insights into how raw biological data can be converted into predictive in silico models.
Andrey Sivachenko, Ph.D.
The chapters in this book describe building mechanistic models for various human diseases and conditions. While each chapter provides novel insights into the disease mechanism and should be of interest to any expert in that disease, we note that the authors in this book have never published articles about the disease described in their chapter and have never performed any experiments to study it. All authors have learned and advanced the understanding of the disease mechanism either through analysis of knowledge networks or through analysis of publicly available gene expression datasets using knowledge networks. All chapters also have in common the use of Pathway Studio software from Ariadne Genomics. Pathway Studio provides access to the biological knowledge networks and tools for their navigation and analysis. Most knowledge in the Pathway Studio database is extracted automatically from the scientific literature using MedScan information extraction technology. While MedScan is thoroughly described in publications from Ariadne Genomics, the goal of this book is to show how to use the extracted information for knowledge inference and for building mechanistic models, and how to use the model and knowledge networks to make more informed predictions about disease targets and biomarkers. We emphasize that while every model in this book required MedScan-extracted knowledge networks, Pathway Studio also allows import and navigation of additional knowledge from other sources and databases. Some examples of additional knowledge - a protein homology network and a network of physical interactions imported from the public PINA database - are described in the chapters about the cholestasis and gastric cancer models.
So what are “knowledge networks”? There are a couple of ways to answer this question. The analogy with the computer science term “Semantic Web” is the first that comes to mind. For readers with a biological background, “molecular biological knowledge networks” can also be defined as a compressed, formalized representation of the knowledge about biological molecular interactions described in the scientific literature. Statements about molecular interactions, molecular function, and the roles of molecules in disease and other phenotypes are scattered among millions of articles published by the scientific community in the last 60 years. MedScan converts such statements into semantic triplets, e.g., “A regulates B” or “C binds B”, so that they can be imported into a relational database. The Pathway Studio database generated by MedScan 5.0 technology contains more than 2.5 million unique relationships described in more than 18 million molecular biological articles. Knowledge networks stored in the Pathway Studio relational database provide instantaneous access to the knowledge generated by the whole of molecular biological research, which has been supported by trillions of dollars of investment.
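To make the idea of semantic triplets in a relational database concrete, here is a minimal sketch in Python using the standard sqlite3 module. It is not the Pathway Studio schema: the table name `relation`, its columns, the helper names, and the placeholder entities A, B, C, and D are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical triplets for illustration only; A, B, C, D are placeholders,
# not real entries from any knowledgebase.
triplets = [
    ("A", "regulates", "B"),
    ("C", "binds", "B"),
    ("A", "regulates", "D"),
    ("A", "regulates", "B"),  # the same statement found in a second article
]

def build_db(rows):
    """Load (source, relation type, target) triplets into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE relation ("
        "source TEXT, rel_type TEXT, target TEXT, "
        "UNIQUE (source, rel_type, target))"
    )
    # INSERT OR IGNORE keeps one row per unique relationship.
    conn.executemany("INSERT OR IGNORE INTO relation VALUES (?, ?, ?)", rows)
    return conn

def relations_of(conn, entity):
    """Return every stored triplet in which the entity takes part."""
    cursor = conn.execute(
        "SELECT source, rel_type, target FROM relation "
        "WHERE source = ? OR target = ? "
        "ORDER BY source, rel_type, target",
        (entity, entity),
    )
    return cursor.fetchall()

conn = build_db(triplets)
for source, rel_type, target in relations_of(conn, "B"):
    print(source, rel_type, target)
```

Each row is one edge of the network, so a single query returns all neighbors of an entity, which is the basic step that network navigation builds on.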
The compression of quintessential molecular biological knowledge into semantic triplets allows both a quick overview for users and rapid traversal by network navigation algorithms. By bringing together in one database information extracted from disparate knowledge domains, Pathway Studio enables individual domain experts to make analytical connections that were previously unnoticed. It allows statistically sounder conclusions that are based on all published observations rather than on the limited set of papers familiar to only one expert. There are three major domains in biomedical knowledge: physical and regulatory molecular interactions measured in basic academic molecular biological research; pharmacological effects and drug interactions published by medicinal chemists from the pharmaceutical industry and by pharmacology and translational medicine departments in academia; and disease-related molecular changes published by clinicians and medical doctors. Medical doctors rarely know molecular biology, and basic scientists usually do not know much about pharmaceutical research. Bringing together molecular interactions and clinical observations is essential, however, for building a molecular mechanism of a disease. Knowledge about drug mechanisms is necessary for finding new drugs based on the mechanistic disease models.
Any given drug or disease may affect the activity of dozens and often hundreds of biological molecules. While contemporary high-throughput molecular profiling technologies, such as gene expression microarrays, can measure the global molecular response, the interpretation of an observed profile requires an overview of thousands of publications describing individual interactions between the genes and proteins in the profile.
Such intermolecular dependencies are often measured in individual academic labs independently of clinical or drug research. Another example of separation in biomedical knowledge is the context specificity of observed molecular interactions and functions. Due to the high cost of molecular biological experiments, individual molecular interactions are usually measured only in the context of one tissue, organism, or condition. Most of these context-specific interactions can be used for building a model for another disease or for explaining a molecular profile measured in a different tissue or organism. While borrowing interactions from another organism or tissue is a common practice for building biological models, the search for such interactions through the biomedical literature would be a daunting task without Pathway Studio and its knowledge networks.
This book is written by Pathway Studio experts to show how one can leverage the information integrated into the knowledge networks for building mechanistic models. While the knowledge networks constitute a global compendium of the molecular interactions observed by the whole of molecular biological research, the mechanistic model of a disease, phenotype, or trait contains only a subset of such interactions. This subset must be sufficient to explain all or a majority of the molecular observations about the condition. The first step in building a model is collecting all observations from various scientific publications and enriching them with the results obtained by a global molecular profiling experiment. For many complex diseases, such as cancer, this effort leads to the collection of several thousand proteins affected by the disease state. The process of model building can be described as complexity reduction of the observed molecular profile for a given disease or condition. You will learn from the book chapters that even changes in thousands of genes and metabolites affected by a disease can be explained by the activity change in only a few biological pathways.
Three chapters in the book use public gene expression datasets that profile the disease state and compare it to a healthy control (“normal”) state. The principal technique for reducing the complexity of a molecular profile is called sub-network enrichment analysis (SNEA). In the case of gene expression, SNEA uses the expression regulatory knowledge network to find the transcription factors and other regulators responsible for the biggest changes observed in the experiment. You will see throughout the book that SNEA regulators can often be mapped onto one or several canonical pathways, indicating that the pathway changes its activity in the disease state. Due to the small number of pathways known for the human organism, it is not always possible to map the significant expression regulators identified by SNEA. Therefore, the last two chapters suggest other techniques - regulator clustering and pathway reconstruction - to classify expression regulators into a smaller number of functional communities in order to further reduce the complexity of the molecular profile.
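The idea behind SNEA can be illustrated with a short sketch in Python. This is not the Pathway Studio implementation. It assumes that each regulator is scored by comparing the absolute expression changes of its downstream targets with those of all other measured genes, and it uses a Mann-Whitney rank test with a normal approximation as an assumed scoring statistic, since the text above does not specify one. The function names, the regulators TF1 and TF2, the genes g1 to g8, and the log ratios are hypothetical.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def rank_test_p(sample, background):
    """Two-sided Mann-Whitney p-value (normal approximation, no tie correction)."""
    n1, n2 = len(sample), len(background)
    ranks = average_ranks(list(sample) + list(background))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2))

def rank_regulators(network, log_ratios, min_targets=3):
    """Score each regulator's sub-network against the rest of the profile."""
    scored = []
    for regulator, targets in network.items():
        measured = {gene for gene in targets if gene in log_ratios}
        rest = [abs(v) for gene, v in log_ratios.items() if gene not in measured]
        if len(measured) < min_targets or not rest:
            continue  # too few measured targets, or nothing to compare against
        inside = [abs(log_ratios[gene]) for gene in measured]
        scored.append((regulator, rank_test_p(inside, rest)))
    return sorted(scored, key=lambda pair: pair[1])

# Hypothetical regulator -> targets network and disease-vs-normal log ratios.
network = {"TF1": ["g1", "g2", "g3"], "TF2": ["g4", "g5", "g6"]}
log_ratios = {"g1": 2.5, "g2": -3.0, "g3": 2.8, "g4": 0.1,
              "g5": -0.2, "g6": 0.05, "g7": 0.3, "g8": -0.15}

ranked = rank_regulators(network, log_ratios)
for regulator, p_value in ranked:
    print(regulator, round(p_value, 4))
```

Because the test is two-sided, a regulator whose targets change less than the rest of the profile also receives a small p-value. The normal approximation is crude for sub-networks this small, so a real analysis would need larger target sets, an exact or permutation test, and correction for testing many regulators.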
We hope that the examples in this book will allow readers to start building models for their own disease or phenotype of interest. The book starts with simpler chapters that use knowledge networks to review the state of the art in a disease field. The last chapters describe more complicated applications of knowledge networks for building disease models by analyzing public gene expression datasets. Some chapters go beyond model building. Once the disease model is built, it can be used for more accurate prediction of biomarkers, repositioning of existing drugs, target selection for future drugs, and design of personalized therapy using the same knowledge networks available in Pathway Studio.
Senior Director of Application Science at Ariadne Genomics
Chief Scientific Officer at Ariadne Genomics
List of Contributors
Anton Yuryev, Ph.D.
Senior Director of Application Science at Ariadne Genomics
Nikolai Daraselia, Ph.D.
Chief Scientific Officer at Ariadne Genomics