
Postdoc Position at the Laboratory of Computational and Quantitative Biology - Paris - France

Université Pierre et Marie Curie, Paris, France

We are seeking a postdoc with interests in protein functional annotation and/or machine learning. The position is open for 3 years at the Laboratory of Computational and Quantitative Biology at the Université Pierre et Marie Curie, located in the heart of Paris. Candidates with experience in comparative genomics, phylogenetic reconstruction and/or machine learning are strongly encouraged to apply.

Theme: Large-scale protein annotation exploiting a multitude of diversified probabilistic models

Keywords: Protein function annotation, multiple probabilistic models, domain architectures, machine learning, combinatorial optimization, computer grid, metagenomics.

Content: Precise genome annotations are a gold mine for biologists, who use them to identify proteins involved in biological processes. Databases of protein domains and functional sites are vital resources for the functional analysis of new proteins. Most databases describe known domains with probabilistic models representing the consensus among all domain sequences, while only a few associate each protein domain family with several probabilistic models, built from a sample of diversified homologous sequences. In an attempt to unify the annotation process and provide a more accurate tool, integrative approaches combine different types of protein signatures from multiple databases into a single searchable resource. However, the increasing number of proteins with no annotation, present in highly divergent genomes, and the large number of erroneous annotations produced by current tools call for the development of innovative solutions.

We propose a novel integrated approach for large-scale protein annotation that exploits an unprecedented amount of genomic data as well as sophisticated machine learning techniques and combinatorial optimization approaches, taking advantage of High Performance Computing (HPC) environments. The idea is to uncover, as far as possible, the evolutionary processes that took place throughout the whole tree of life and that shaped the evolution of a protein family. We have already demonstrated in previous work that the problem of functional annotation is intrinsically tied to the ability to uncover such paths. Now, we shall extend this approach to large-scale genome annotation by considering 11 different protein databases, comprising about 10^9 protein sequences, and by producing a large pool of diversified probabilistic models coding for about 10^7 evolutionary protein pathways. Such models will be used to search for specific domains in genomes to be annotated. Our previous methodology needs to be fundamentally improved to deal with this large amount of biological data. In this project, we shall work on algorithms to reduce the space of models and the search complexity, and we shall implement some important algorithmic changes towards the realization of a powerful integrated annotation tool.

We expect such a refined annotation to yield many insights into molecular evolution.

Important direct applications to metagenomic data are also envisaged within the framework of an ongoing lab project on prokaryotic and eukaryotic marine environments.

Where: This project is run at the Laboratory of Computational and Quantitative Biology (LCQB), UMR7238 CNRS-UPMC – Analytical Genomics team, headed by A. Carbone. It is co-advised with Pierre-Henri Wuillemin, Laboratoire d’Informatique de Paris 6 – Equipe DECISION. The LCQB offers a very dynamic environment where 7 teams work at the frontier of computational biology. Information on the lab and the teams can be found at http://www.lcqb.upmc.fr.

Period: The position lasts 3 years and is available immediately.