Details .:. BITS Bioinformatics Italian Society

Post-doc position at UPMC - Paris - France

We wish to recruit a postdoc with interests in protein functional annotation and/or machine learning. The position is open for 3.5 years at the Laboratory of Computational and Quantitative Biology at the Université Pierre et Marie Curie, located in the heart of Paris. Candidates with experience in comparative genomics, phylogenetic reconstruction and/or machine learning are strongly encouraged to apply.

Theme: Large-scale domain annotation, functional and evolutionary characterization of metagenomic sequences based on a multitude of diversified probabilistic models

Keywords: Metagenomics, domain annotation, protein functional annotation, probabilistic models, domain arrangements, machine learning, combinatorial optimization, computer grid.

Content: A precise domain annotation of genomes and metagenomes is a gold mine for biologists, who use it to identify proteins involved in biological processes. Databases of protein domains and functional sites are vital resources for the functional analysis of these proteins. Most databases describe known domains with probabilistic models representing a consensus among all domain sequences, while only a few associate each protein domain family with several probabilistic models built from a sample of diversified homologous sequences. In an attempt to unify the annotation process and provide a more accurate tool, integrative approaches combine different types of protein signatures from multiple databases into a single searchable resource. However, the increasing number of proteins with no annotation, found in highly divergent genomes, and the large number of erroneous annotations call for the development of innovative solutions. At the laboratory, we recently developed a novel integrated approach for large-scale protein annotation that exploits an unprecedented amount of genomic data and combines sophisticated machine learning techniques with combinatorial optimization, taking advantage of High Performance Computing (HPC) environments.

In this project we shall build on this recent methodological effort and try to uncover, as far as possible, the evolutionary processes of protein sequences that took place throughout the whole tree of life and that affected the evolution of a protein family. The main tasks of the project are:

1. The extension of our methodological approach to large-scale (meta)genome annotation by considering 11 different protein databases, comprising about 10⁹ protein sequences, and by producing a large pool of diversified probabilistic models coding for about 10⁷ evolutionary protein pathways.

2. The construction of classes of evolutionarily “similar” probabilistic models. This step demands the definition of appropriate measures of similarity between probabilistic models and the development of an appropriate clustering approach.

3. The use of probabilistic models to search for specific domains in the metagenomic samples to be annotated. To this end, we want to integrate into our large-scale search new “evolutionary units”, similar in spirit to the notion of domain but adapted to the handling of metagenomic reads.

In this project, we shall work on algorithms that reduce the space of models and the search complexity, and we shall implement some important algorithmic changes towards the realization of a powerful integrated annotation tool.

We expect to draw many consequences for molecular evolution from such a refined annotation. Important direct applications to metagenomics data are also envisaged within the framework of projects on prokaryotic and eukaryotic marine environments that are ongoing in the lab in collaboration with the University of East Anglia in Norwich (UK) and the Ecole Normale Supérieure in Paris.

Bibliographical references:

- J.Bernardes, F.R.J.Vieira, G.Zaverucha, A.Carbone, “A multi-objective optimisation approach accurately resolves protein domain architectures”, 2015. Submitted.

- J.Bernardes, G.Zaverucha, C.Vaquero, A.Carbone, “High performance domain identification in proteins explores a multitude of diversified profiles with grid computing”, 2015. To be submitted.

- A.Ugarte, J.Bernardes, A.Carbone, “Meta-clade: a highly precise annotation method for metagenomic samples”, 2015 (in preparation).

- A.Ugarte, T.Mock, A.Falciatore, A.Carbone, “A new approach to the functional annotation of metagenomic samples”, 2015 (in preparation).

Where: This project is run at the Laboratory of Computational and Quantitative Biology (LCQB) - UMR7238 CNRS-UPMC – Analytical Genomics team, headed by A. Carbone. The LCQB offers a very dynamic environment where 7 teams work at the frontier of computational biology. Information on the lab and the teams can be found at http://www.lcqb.upmc.fr. For the team, see http://www.lcqb.upmc.fr/AnalGenom/ and http://www.lcqb.upmc.fr/AnalGenom/projects.html

Period: The position lasts 3.5 years and is available immediately.