Manuela Helmer-Citterich - BITS Lecture
Manuela Helmer-Citterich is full professor of Bioinformatics and Molecular Biology at the University of Rome Tor Vergata, where she also coordinates the Master's Degree in Bioinformatics. Her research interests focus on bioinformatics, computational biology, systems biology, genome annotation, RNA analysis, structural bioinformatics, molecular interactions, and the development of new computational tools. She was president of the Italian Society of Bioinformatics for several years and is now its honorary president.
Dominik Heider (born 1982) is Professor of Bioinformatics at the Department of Mathematics and Computer Science of the University of Marburg, Germany. He studied Computer Science at the University of Muenster from 2002 until 2006 and subsequently started his PhD studies at the Department of Experimental Tumorbiology and the Department of Computer Science at the University of Muenster. After receiving his PhD in 2008, he worked as a postdoc at the Department of Bioinformatics at the University of Duisburg-Essen, where he finished his habilitation thesis in 2012. He then became Associate Director and Head of Clinical and Diagnostic Bioinformatics at QIAGEN, and in 2014 he accepted a professorship in Bioinformatics at the Straubing Center of Science and an adjunct professorship at the Technical University of Muenchen, before joining the University of Marburg in 2016. His main research focus is the development of bioinformatics solutions for next-generation sequencing (NGS) data, e.g., machine learning algorithms for predicting drug resistance of pathogens or for modeling diseases. In another main part of his research he aims to develop new methods and algorithms for analyzing (meta-)genomic and (meta-)transcriptomic data of microorganisms, as well as for genome assembly and functional annotation. Since NGS technologies have great potential in biomedical research but data processing is still limited by computational power, he also investigates techniques based on high-performance computing. Dominik Heider is an Associate Editor of the international journals BMC Bioinformatics and BioData Mining. Moreover, he is a member of the programme committee of the German Conference on Bioinformatics, of the German Society for Computer Science, and of FaBI.
Alexander Kel (1,2) received his Ph.D. in Bioinformatics, Molecular Biology and Genetics in 1990. He studied biology and mathematics at Novosibirsk State University and obtained his M.S. in biology with a special focus on mathematical biology in 1985. He worked for 15 years at the Institute of Cytology and Genetics, Russia (ICG), holding positions as a programmer, scientist, senior scientist and Vice-Head of the Laboratory of Theoretical Molecular Genetics. In 1995, he won the Academician Belaev Award. In 1999 he received independent funding from the Volkswagen Foundation and organized a Bioinformatics group at ICG. From 2000 to 2010, he was Senior Vice President of Research & Development at BIOBASE GmbH, Wolfenbüttel, Germany. The scientific career of Dr. Kel includes numerous research stays in the USA (e.g. 1993: Supercomputer Center, Tallahassee; 1997: University of Pennsylvania, Philadelphia; 1999, 2000: Cold Spring Harbor, NY), in Italy (1991, 1992: ITBA, Milan), and in Germany (1994, 1995, 1996, 1997-1998: GBF; 1997: MPI of Molecular Biology, Berlin). Dr. Kel's research experience in bioinformatics totals more than 20 years. During his career, he has worked in almost all branches of current bioinformatics, including theoretical models of molecular genetic information systems, sequence analysis, gene recognition, promoter analysis and prediction, analysis of protein secondary structure, prediction of RNA secondary structure, the theory of mutation and recombination processes, molecular evolution, databases, and gene expression studies. Alexander Kel is the author of more than 90 scientific publications. He is also an author of several book chapters on bioinformatics, tutorials and educational materials.
1) GeneXplain GmbH, Am Exer 10B, D-38302 Wolfenbüttel, Germany
2) Institute of Chemical Biology and Fundamental Medicine, Novosibirsk, Russia
Pietro Liò is a Reader in Computational Biology at the Computer Laboratory, the department of Computer Science of the University of Cambridge, and a member of the Computer Laboratory's Artificial Intelligence group.
He holds an MA from Cambridge, a PhD in Complex Systems and Non-Linear Dynamics (School of Informatics, Department of Engineering, University of Firenze, Italy), and a PhD in (Theoretical) Genetics (University of Pavia, Italy).
Andrew C.R. Martin - Preparata Lecture
Andrew Martin studied Biochemistry at the University of Oxford where he stayed for his D.Phil. After working at the National Institute for Medical Research in London and a few years as a self-employed scientific software developer, he joined the group of Professor Dame Janet Thornton, FRS at University College London. He moved from there to Inpharmatica, a UCL spin-out company, and then to the University of Reading as a Lecturer in Bioinformatics. He returned to UCL in 2004 where he is now a Reader in Bioinformatics and Computational Biology. His research focuses on two main areas: the sequence and structure of antibodies and the effects of mutations on protein structure and function. He has published over 80 papers and reviews, six book chapters and a co-authored book on moonlighting proteins. As well as widely used web-based software, he has developed software that has been downloaded over 8,000 times. He has consulted for a number of companies and acted as an expert witness in several patent disputes related both to antibodies and to general bioinformatics. He is also an advisor to the WHO International Nonproprietary Names (INN) committee on the naming of antibody-based drugs.
In recent years, data have been accumulating on long non-coding RNA (lncRNA) genes, which more than double the Ensembl human gene set. Such RNAs are actively transcribed and processed in cells, but most of them have not yet been associated with a known function. New tools for RNA analysis are badly needed: sequence comparison is a fundamental tool for functional annotation, but it often fails when comparing RNA sequences sharing less than 60% sequence identity. It is possible that important functional information is encoded in RNA structure, but experimental data on RNA 3D structure are very sparse. RNA secondary structure has pros and cons: it can be predicted by computational methods, and can therefore be calculated for all RNA molecules of interest, but the reliability of the prediction is not yet entirely satisfactory. We defined an alphabet for the description of RNA secondary structure, so that a secondary structure can be translated into a sequence of characters. Thanks to this alphabet, we could compute a substitution matrix of secondary-structure elements, capturing the rates of variation of structural elements in functionally related RNAs. The alphabet and the matrix were used to build tools for the global and local comparison of RNAs at very low computational cost.
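As a minimal illustration of the idea, and not the authors' actual alphabet or substitution matrix, a dot-bracket secondary structure can be encoded as a string over a toy two-letter alphabet (stem vs. loop) and two encodings compared with standard dynamic programming:

```python
# Toy sketch: encode a dot-bracket structure as a character string,
# then score a global alignment of two encodings with a substitution
# matrix. Real alphabets and matrices are far richer than this.

def encode_structure(dot_bracket):
    # 'S' = paired position (stem), 'L' = unpaired position (loop);
    # a realistic alphabet would also distinguish loop types and lengths
    return "".join("S" if c in "()" else "L" for c in dot_bracket)

# Hypothetical substitution scores between structural characters
SUBST = {("S", "S"): 2, ("L", "L"): 1, ("S", "L"): -1, ("L", "S"): -1}
GAP = -2

def global_align_score(a, b):
    # Needleman-Wunsch global alignment score on encoded strings,
    # keeping only two DP rows for low memory use
    n, m = len(a), len(b)
    prev = [j * GAP for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * GAP] + [0] * m
        for j in range(1, m + 1):
            curr[j] = max(prev[j - 1] + SUBST[(a[i - 1], b[j - 1])],
                          prev[j] + GAP,      # gap in b
                          curr[j - 1] + GAP)  # gap in a
        prev = curr
    return prev[m]
```

Because the comparison runs on plain character strings, any existing fast string-alignment machinery can be reused, which is what makes the approach computationally cheap.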
The development of computational approaches for predictive modeling of diseases or drug resistance has opened a new era in precision medicine. Clinical decision-support systems have been designed to assist in molecular diagnostics (MDx) or companion diagnostics (CDx) and thus enhance therapeutic success. These systems are typically based on statistical or machine learning models that were built on clinical data. The main pitfall of computational models for precision medicine developed in academia, however, is that most of the software is developed by individuals on a one-person-one-project basis. These researchers develop software in a prototype-centered manner, aiming at quick publishable results, and care neither about regulatory aspects of software development processes nor about the documentation and maintenance required for MDx or CDx software. One important aspect of producing reliable computational models is, however, evaluation in clinical trials, which is a necessary but not sufficient condition for application in MDx or CDx scenarios.
Main aspects of the talk:
- Statistical and machine learning models for precision medicine
- Clinical evaluation and clinical trials
- Regulatory aspects of software for MDx and CDx applications
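A first step towards the evaluation discipline discussed above is rigorous cross-validation of any clinical model. The sketch below is a generic, hypothetical illustration (a toy nearest-centroid classifier standing in for a real clinical model), not the speaker's actual software:

```python
import random

def k_fold_indices(n, k, seed=0):
    # shuffle sample indices reproducibly and split them into k folds
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nearest_centroid_predict(train_X, train_y, x):
    # toy stand-in for a real model (e.g. a random forest):
    # assign x to the class whose training centroid is closest
    centroids = {}
    for label in set(train_y):
        pts = [train_X[i] for i in range(len(train_y)) if train_y[i] == label]
        centroids[label] = [sum(c) / len(pts) for c in zip(*pts)]
    return min(centroids,
               key=lambda l: sum((a - b) ** 2 for a, b in zip(centroids[l], x)))

def cross_validated_accuracy(X, y, k=5):
    # every sample is predicted exactly once, by a model that never saw it
    folds = k_fold_indices(len(y), k)
    correct = 0
    for test_idx in folds:
        train_idx = [i for i in range(len(y)) if i not in test_idx]
        tX = [X[i] for i in train_idx]
        ty = [y[i] for i in train_idx]
        correct += sum(nearest_centroid_predict(tX, ty, X[i]) == y[i]
                       for i in test_idx)
    return correct / len(y)
```

Cross-validation of this kind only estimates internal validity; as the abstract stresses, prospective clinical trials and regulatory-grade development processes are still needed before MDx/CDx deployment.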
If our genes are so similar, what really makes a human different from E. coli? The answer lies in the differences in gene regulation. In this presentation I will discuss the evolutionary advantages of the high plasticity of gene regulatory networks that is characteristic of multicellular eukaryotic organisms. At the same time, these advantages come at a price: terrible diseases such as cancer. Non-reversible structural changes of the regulatory networks, due to an epigenetic "evolution" of genome regulatory regions, provide the basis for the realization of normal development programs. On the other hand, they may cause transformations that switch the normal state to a disease state. We call such structural network changes "walking pathways". The analysis of this phenomenon helps us to understand the mechanisms of molecular switches (e.g. between programs of cell death and programs of cell survival) and to identify prospective drug targets for cancer. Such structural plasticity of regulatory networks, observed in the genomes of higher eukaryotes, is in my view the result of an evolutionary "aromorphosis" towards the emergence of a completely new mechanism of evolution of multicellular organisms on the basis of a combinatorial regulatory code.
The aim of the talk is to describe the design and implementation of different types of neural network architectures for performing inference on gene expression and methylation data. The models were chosen such that each of them explores different properties of the epigenetic data, in order to overcome the problem of sparse and imbalanced datasets. All of these models were subsequently compared in order to assess their pattern recognition abilities.
The talk will include tutorial elements, so beginners who wish to practice deep learning can also benefit by bringing their laptops.
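One standard way to handle the class imbalance mentioned above is to weight each sample's loss inversely to its class frequency. The sketch below illustrates this on a minimal one-neuron "network" (logistic regression trained by SGD); it is a generic illustration of the technique, not the speaker's actual architectures:

```python
import math

def class_weights(labels):
    # inverse-frequency weights: rare classes contribute more to the loss
    n = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return {c: n / (len(counts) * counts[c]) for c in counts}

def train_weighted_logreg(X, y, lr=0.1, epochs=500):
    # minimal single-neuron model with weighted cross-entropy loss;
    # the same weighting applies unchanged to deeper networks
    w = [0.0] * len(X[0])
    b = 0.0
    cw = class_weights(y)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = cw[yi] * (p - yi)  # class-weighted gradient of cross-entropy
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b
```

Without the weights, a model trained on data with, say, a 9:1 class ratio can reach 90% accuracy by always predicting the majority class; the weighting removes that degenerate optimum.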
High-throughput sequencing platforms are increasingly used to screen patients with genetic disease for pathogenic mutations, but prediction of the effects of mutations remains challenging.
We have developed SAAPdap (Single Amino Acid Polymorphism Data Analysis Pipeline), which uses a set of rule-based analyses for predicting the likely local effects of mutations. These analyses are then used by SAAPpred (Single Amino Acid Polymorphism Predictor), which uses a random forest to predict whether a mutation is likely to be pathogenic. The method gives a fully cross-validated MCC=0.692 and a partially cross-validated MCC=0.944 (where the same protein is allowed in training and test sets, but not the same mutation). This considerably outperforms well-known methods such as MutationAssessor, SIFT and PolyPhen2 (MCC between 0.452 and 0.572).
We have also extended the method to create SAAPpred-myh7, which is able to distinguish between the two major clinical phenotypes (hypertrophic cardiomyopathy, HCM, and dilated cardiomyopathy, DCM) associated with mutations in the beta-myosin heavy chain (MYH7) gene product (Myosin-7). Despite having a small and unbalanced dataset, we achieve MCC=0.53, and post hoc removal of machine learning models that performed particularly badly further increased the performance (MCC=0.61). Thus our method for the difficult task of differential phenotype prediction is competitive with other methods that perform only pathogenicity prediction.
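For reference, the Matthews correlation coefficient (MCC) quoted above is computed from the four confusion-matrix counts; it ranges from -1 to 1 and, unlike accuracy, remains informative on unbalanced datasets such as those described here:

```python
import math

def mcc(tp, tn, fp, fn):
    # Matthews correlation coefficient from confusion-matrix counts:
    # 1 = perfect prediction, 0 = random, -1 = total disagreement
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0
```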
Preserving the currency of genomics outcomes over time through selective re-computation: model and initial findings
Complex and computationally expensive processes are common in many areas of bioinformatics, e.g. in genomics and metagenomics.
The outcomes of such processes are time-sensitive, as they depend on algorithms, tools, and reference databases which all evolve over time, often independently of one another. This suggests that some of the processes may need to be repeatedly re-computed in response to these changes. However, these computations can be expensive, consuming tens of CPU hours each, and not all past cases will be affected by all changes.
In the ReComp project (http://recomp.org.uk), we have started to investigate methods for optimising re-computation of common genomics processes in response to changes in the underlying reference data. We have chosen a metadata analytics approach: for each execution, e.g. of a variant calling pipeline, we record the provenance of its outcome, detailed cost (time, cloud resources), and details of the process structure, for instance a workflow, into an ever-growing history database. When changes are detected, this meta-database is analysed to determine the expected impact of a change on a population of past cases, as well as the minimal sub-workflow that needs to be re-enacted on each of them. In this talk I will present the ReComp model and our initial findings using a simple variant interpretation process, implemented on the eScience Central workflow manager, as a testbed.
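The selection step described above can be sketched as follows. The data layout here is hypothetical (it is not ReComp's actual schema or API): each past execution records which entries of which reference resources it consulted, so that when a resource changes, only executions that actually touched a changed entry are flagged for re-computation:

```python
# Hypothetical sketch of selective re-computation candidate detection.
# Each history record maps resource names to the set of entries consulted.

def affected_executions(history, changed_resource, changed_items):
    """Return ids of past executions whose outcome may be stale.

    An execution is a re-computation candidate only if it consulted
    at least one of the entries that changed in the updated resource.
    """
    hits = []
    for run in history:
        used = run["inputs_used"].get(changed_resource, set())
        if used & changed_items:  # intersection non-empty -> possibly affected
            hits.append(run["id"])
    return hits
```

Filtering on recorded provenance in this way avoids blindly re-running every pipeline, tens of CPU hours each, after every reference-database release; cost metadata in the history can then be used to prioritise the surviving candidates.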