7, 8 and 9 September
Journées Ouvertes en Biologie, Informatique et Mathématiques
Table of Contents
JOBIM 2010 Poster Accepted Papers with Abstracts
T-REKS and PRDB: new tools for large-scale analysis of protein tandem repeats.
Abstract: In recent years, evidence has accumulated for a high incidence of tandem repeats in proteins that carry fundamental functions or are related to a number of human diseases. Protein repeats are strongly degenerated during evolution and therefore cannot be easily identified. We have developed a program called T-REKS for ab initio identification of tandem repeats. It is based on clustering, with a K-means algorithm, of the lengths between identical short strings. Benchmarks on several sequence datasets show that T-REKS detects tandem repeats in protein sequences better and faster than the other tested programs (1). Using T-REKS, we have populated a Protein Repeat DataBase (PRDB) with repeats identified in large databanks such as SwissProt and the Non-Redundant databank of NCBI. T-REKS and PRDB are available via our webpage at http://bioinfo.montp.cnrs.fr.
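The core idea (collect the distances between successive occurrences of identical short strings, then cluster them with K-means so the dominant cluster reveals the repeat length) can be sketched as follows. This is a minimal illustration, not the T-REKS implementation; the toy sequence and starting centers are made up for the example:

```python
from collections import defaultdict

def candidate_repeat_lengths(seq, k=3):
    """Collect distances between successive occurrences of identical k-mers."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    distances = []
    for pos in positions.values():
        distances.extend(b - a for a, b in zip(pos, pos[1:]))
    return distances

def kmeans_1d(values, centers, iterations=20):
    """Tiny 1-D K-means: cluster the distances around putative repeat lengths."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            idx = min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            clusters[idx].append(v)
        # recompute each center; an empty cluster keeps its old center
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers

seq = "HEATREPHEATREPHEATREPHEATREP"   # toy tandem repeat of unit length 7
dists = candidate_repeat_lengths(seq)
print(kmeans_1d(dists, centers=[5, 10]))  # dominant cluster converges to 7.0
```

Degenerate repeats shift some distances away from the true period, which is why clustering (rather than exact counting) is needed.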
In this poster we also present some results of analyses performed with these tools. For example, we analysed the structural properties of protein regions containing perfect and nearly perfect tandem repeats and found that the more perfect they are, the less structured they are (2). Naturally occurring protein domains with perfect repeats are absent from the PDB. The abundance of natural structured proteins with tandem repeats is inversely correlated with repeat perfection: the chance of finding natural structured proteins in the PDB increases as the level of repeat perfection decreases. Prediction of intrinsic disorder within the tandem repeats of SwissProt proteins supports the conclusion that the level of repeat perfection correlates with a tendency to be unstructured. This correlation holds across species and subcellular localizations, although the proportion of disordered tandem repeats varies significantly between these datasets. Our study supports the hypothesis that, in general, repeat perfection is a sign of recent evolutionary events rather than of exceptional structural and/or functional importance of the repeat residues.
(1) Jorda, J. and Kajava, A.V. (2009) T-REKS: identification of Tandem REpeats in sequences with a K-meanS based algorithm. Bioinformatics 25(20):2632-2638.
(2) Jorda, J., Xue, B., Uversky, V.N. and Kajava, A.V. (2010) Protein tandem repeats: the more perfect, the less structured. FEBS J. 277:2673-2682.
TriAnnot V2.0: a user-friendly web interface for the automatic annotation of monocot genomic sequences
Abstract: Annotation is one of the most difficult tasks in genome sequencing projects, yet it is essential for connecting genome sequence to biology (Elsik et al. 2006 Genome Research 16:1329). Structural and functional annotation consists of determining the position and structure of genes, as well as of other features such as transposable elements and non-coding RNAs, and inferring their putative functions in the genome. This requires a complex, sequential combination (pipeline or workflow) of software, algorithms and methods. Automating such a pipeline is necessary to manage the large amounts of data released by genome sequencing projects (Gouret et al. 2005 BMC Bioinformatics 198:1). To achieve a systematic and comprehensive annotation of the wheat genome sequence (17 Gb), a pipeline called TriAnnot V2.0 (http://urgi.versailles.inra.fr/index.php/urgi/Tools/Triannot-Pipeline) has been developed by INRA Clermont-Ferrand (GDEC) and Versailles (URGI) in partnership with NIAS, under the umbrella of the IWGSC (International Wheat Genome Sequencing Consortium - http://www.wheatgenome.org). The objective of TriAnnot is to provide the international scientific community with an online, user-friendly, fast and as complete as possible annotation tool in view of the sequencing of the wheat genome. As is the case for every workflow, the TriAnnot pipeline should minimize manual expertise, which is slow and labor-intensive, and maximize relevant automatic annotation, a relatively rapid process that allows frequent updates to accommodate new data (Curwen et al. 2004 Genome Research 14:942-950).
TriAnnot is customizable through a user-friendly web interface; it is written in object-oriented Perl; its architecture is modular; it was designed for wheat but can be used for other monocot plants. It is organized in three main panels:
The first panel comprises modules for masking and annotating repeats and Transposable Elements (TEs). The programs used are RepeatMasker, Tandem Repeats Finder and TEannot (REPET package), developed at URGI (Quesneville et al. 2005 PLoS Computational Biology 1:166-175).
The second panel is dedicated to the structural and functional annotation of protein-coding genes. The whole process is based on ab initio gene prediction programs (FGeneSH, GeneMarkHMM, GeneID) and similarity searches (BLASTn + GMAP) against transcript databanks. The confidence of each gene model is assessed through a grade, visualized in GBrowse with a color code. TriAnnot thus enables a clear distinction between annotations based on biological data and those based on ab initio predictions. Gene models based on biological data (full-length cDNA mRNAs and complete CDS) are built by the NIAS-search module (BLASTn, est2genome, BLASTx), whereas gene models based on ab initio data are built with EuGene, which combines FGeneSH-predicted genes with plant transcript hits. The functional annotation of proteins derived from gene model CDS follows the IWGSC annotation guidelines and provides five categories (Known and Putative Protein; Domain-Containing Protein; Expressed Gene; Conserved Hypothetical Gene; Hypothetical Gene).
The third panel is focused on the search for biological evidence (BLASTn and BLASTx against public databanks), as well as for other biological targets such as tRNAs, after masking the TEs from Panel 1 and the gene models from Panel 2, thereby facilitating the identification of unknown biological features based on comparative genomics.
Annotations can be viewed through an online genome browser, and GFF and EMBL output files can be downloaded for manual curation using annotation editing software such as Artemis or Apollo.
To evaluate the performance of TriAnnot V2.0, a set of manually annotated BACs (Choulet et al. 2010 Plant Cell, in press) has been annotated with several international pipelines (MIPS - Klaus Mayer, RiceGAAS - http://ricegaas.dna.affrc.go.jp/, FPGP - http://fpgp.dna.affrc.go.jp/, DNA Subway - http://dnasubway.iplantcollaborative.org/) and with TriAnnot. The results are currently being compared using Eval (Keibler & Brent 2003 BMC Bioinformatics 4:50), and the final output will be presented and discussed.
A Novel Approach for Comparative Genomics & Annotation Transfer
Abstract: With the rapid development of sequencing techniques, the situation where a newly sequenced genome needs to be annotated using available genomes from closely related species should become more prevalent in the future. However, because of the cost of genome finishing, we may have to handle incomplete or not fully assembled genomes. Undoubtedly, the need for comparative annotation will increase, but the genomics community still lacks computational solutions that are both efficient and sensitive under various conditions. Present approaches are mainly based on sequence similarity detected at the gene or protein level, and these similarities are mostly analysed independently of each other, despite the dependency implied by the genome.
Hence, we propose a novel approach to genome comparison and use it to develop a system that transfers annotations between the compared genomes. Besides sequence similarity between features, it accounts for the synteny it detects across multiple genomes. The approach stays simple because it avoids solving the complex questions that make other approaches computationally hard.
The underlying idea is to partition a focus genome according to its pairwise similarities with the other compared genomes. The question is formulated as searching for the intervals that are shared across all genomes under consideration and maximal in length (i.e., not extendible). If a genomic region is covered by at least one interval, it is conserved across all genomes, and the number of such intervals tells how many possibilities exist for aligning it with different regions of the other genomes. Hence, our algorithm partitions the genome into regions following two criteria: (1) being shared or unshared across all genomes; (2) offering a unique alignment possibility or several. The annotation transfer procedure crosses the focus genome's annotations with these regions and automatically derives the possible alignments for each feature. All features falling entirely within a region offering only one alignment possibility are declared potentially transferable, and the user may interactively select among them according to various criteria: the alignment's percent identity, the feature class, etc.
We implemented these procedures in an efficient and flexible tool, named QOD, equipped with a user-friendly graphical interface. Graphical and textual representations of the results make it possible both to grasp the overall genome similarity at a glance and to browse the conserved and unshared features in various ways. This enables the investigation of, for instance, genome-specific genes, rearrangements and copy number variations. Because it does not require the genome sequence to be completely assembled, our approach can compare and pre-annotate unfinished genomes, as well as assemblies of Next Generation Sequencing data.
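The two partitioning criteria above can be sketched as a per-position coverage classification: 0 covering intervals means unshared, 1 means a unique alignment possibility, and more than 1 means ambiguous. This is a minimal illustration assuming half-open alignment intervals, not the QOD implementation:

```python
def partition_by_coverage(genome_length, intervals):
    """Partition a focus genome into maximal runs of positions having the
    same alignment status, given the alignment intervals covering it."""
    cov = [0] * genome_length
    for start, end in intervals:          # half-open [start, end)
        for i in range(start, end):
            cov[i] += 1
    classify = lambda c: "unshared" if c == 0 else "unique" if c == 1 else "ambiguous"
    regions, run_start = [], 0
    for i in range(1, genome_length + 1):
        # close the current run at the end of the genome or on a class change
        if i == genome_length or classify(cov[i]) != classify(cov[run_start]):
            regions.append((run_start, i, classify(cov[run_start])))
            run_start = i
    return regions

# two alignments overlap on [40, 60): that region is ambiguous
print(partition_by_coverage(100, [(10, 60), (40, 90)]))
```

A feature lying entirely inside a "unique" run would then be flagged as potentially transferable.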
Towards the unbiased prioritization of Huntington's disease targets using network-based analysis of genome-wide datasets
Abstract: Background: The identification and validation of neuroprotective targets is of primary importance for developing Huntington's disease (HD) therapeutics. This inherited neurodegenerative disease is extensively studied thanks to well-characterized models developed in several species (invertebrates, mammals) that recapitulate complementary components of HD pathogenesis. Genome-wide analyses in these models have generated a large amount of data (dysregulated genes, modifier genes) with high potential for target and marker selection. The comprehensive and unbiased integration of 'omics data' on HD may allow better decisions to be reached in candidate target selection. To this end, the network-based analysis of large datasets is anticipated to be highly instructive.
Results: We have designed a network-based procedure for integrating data from different models of HD pathogenesis. We aimed to preserve useful information from individual screens and to allow tests of the probabilistic interdependencies of different datasets/variables. The core method is the spectral analysis of the data using large, integrated networks such as WormNet to gradually remove unreliable information. Our procedure thus extracts gene clusters that are highly interconnected, enriched in HD data and automatically annotated for their biological role and biomedical potential.
Conclusion: Preliminary results will be shown to illustrate how our data analysis procedure is able to identify biological processes, pathways and genes of high interest in HD. Further work will aim at developing the analyses and making the resulting information publicly available on-line.
This work was done in the Core Working Group 'Biological modifiers' of the European Huntington's Disease Network.
OrthoInspector: comprehensive orthology analysis and visual exploration
Abstract: The accurate determination of orthology and paralogy relationships is essential for comparative sequence analysis, functional annotation and evolutionary studies. Various methods have been developed, based either on simple BLAST all-vs-all pairwise comparisons or on time-consuming phylogenetic tree analyses. We have developed OrthoInspector, a new software system incorporating an original algorithm for the rapid detection of orthology and in-paralogy relations between different species. In comparisons with existing methods, OrthoInspector improves detection sensitivity with a minimal loss of specificity. Moreover, different visualization tools have been integrated to allow rapid and easy comparative studies based on these predictions. The system has been used to study the orthology/in-paralogy relationships for a large set of 950,000 protein sequences from 60 different eukaryotic species.
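OrthoInspector's algorithm is original and not detailed in this abstract, but the classical baseline for deriving orthology candidates from BLAST all-vs-all results can be illustrated with reciprocal best hits; the protein names below are made up:

```python
def reciprocal_best_hits(hits_ab, hits_ba):
    """hits_ab maps each protein of species A to its best BLAST hit in
    species B (and hits_ba the reverse).  A classical orthology candidate
    is a pair that are each other's best hit."""
    return sorted((a, b) for a, b in hits_ab.items()
                  if hits_ba.get(b) == a)

# hypothetical best-hit tables for two species
hits_ab = {"pA1": "pB1", "pA2": "pB3"}
hits_ba = {"pB1": "pA1", "pB3": "pA9"}   # pB3's best hit is not pA2
print(reciprocal_best_hits(hits_ab, hits_ba))  # → [('pA1', 'pB1')]
```

In-paralog detection, as performed by OrthoInspector, additionally groups recent within-species duplicates around such seed pairs.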
Estimating the size of the S. cerevisiae interactome
Abstract: As protein interactions mediate most cellular mechanisms, protein-protein interaction networks are essential to the study of cellular processes. Consequently, many large-scale interactome mapping projects have been undertaken, and protein-protein interactions are being distilled into databases through literature curation; yet protein-protein interaction data are still far from comprehensive, even in the model organism S. cerevisiae. Estimating the interactome size is important for evaluating the completeness of current datasets, in order to measure the remaining effort required.
Several estimates of the size of the S. cerevisiae interactome have been proposed, but none of them directly takes into account information from both literature-curated and high-throughput data. We propose here a simple and reliable method for estimating the size of an interactome that combines these two data sources. Our method yields an estimate of at least ~35,000 direct physical protein-protein interactions in S. cerevisiae.
Combining literature-curated and high-throughput data leads to higher estimates of the S. cerevisiae interactome size. This confirms the complementarity of these two data sources, and provides a sobering view of the coverage of current interactomes: extensive efforts and/or new methods are sorely needed.
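The abstract does not detail the estimator, but a classical way to combine two partially overlapping samples of a population is a capture-recapture (Lincoln-Petersen) estimate, which may convey the intuition; the counts below are invented for illustration and are not the authors' data or method:

```python
def lincoln_petersen(n1, n2, overlap):
    """Capture-recapture estimate of total population size from two
    independent samples of sizes n1 and n2 sharing `overlap` items.
    The smaller the overlap, the larger the inferred hidden population."""
    if overlap == 0:
        raise ValueError("samples must overlap to estimate the total")
    return n1 * n2 / overlap

# hypothetical interaction counts: literature-curated set, high-throughput
# set, and their shared interactions
print(lincoln_petersen(n1=5000, n2=3500, overlap=500))  # → 35000.0
```

The key assumption, independence of the two sampling processes, is what makes combining literature-curated and high-throughput sources attractive: their detection biases differ.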
RNAspace: a web application for ncRNA identification
Abstract: RNAspace is a web application that supports biologists in annotating non-protein-coding RNAs (ncRNAs) in genomes. The platform is an integrated environment for running a variety of ncRNA gene finders (sequence homology search, structure comparison, comparative analysis, etc.). It allows users to explore predictions with dedicated tools for comparing, visualizing and editing candidate ncRNAs, and to export results in various formats.
The application runs on the web server http://rnaspace.org, so no software needs to be installed locally. Furthermore, the code is open source and configurable, so it may also be used to create custom web sites.
Bio++: Object-oriented libraries for sequence analysis, population genetics, molecular evolution and phylogenetics
Abstract: The large amount of molecular data now available to biologists requires dedicated pipelines to automate data analysis. Such pipelines use existing "bricks" (programs, methods, algorithms) and combine them. These bricks are widely used methods, but their combination is often project-specific. A powerful way to quickly assemble dedicated pipelines is to use code libraries and object-oriented programming, which provide ready-to-use tools for a given programming language (see for instance the BioJava, BioPerl or BioPython projects). The development of new models and methods for biological data analysis is also a very active field, empowered by the increasing availability of large data sets. In most cases, new methods build on existing ground and benefit from available, efficient and extensible implementations.
The Bio++ libraries offer an efficient and extensible implementation of a wide range of methods for sequence analysis, population genetics, molecular evolution and phylogenetics. They are written in C++, fully object-oriented by design, and thoroughly documented. Source code (open source) and packages for several operating systems can be downloaded from the website http://biopp.univ-montp2.fr. A suite of fully documented executables based on the Bio++ libraries is also available for many common data analyses.
MGCA: a flexible tool for phylogenomic analysis of prokaryotic genomes
Abstract: The introduction of next-generation sequencing approaches has caused a rapid increase in the number of completely sequenced genomes and of unfinished genomes available as contigs. Consequently, our ability to annotate genomes (identifying protein-coding, rRNA and tRNA genes, and assigning functions to the genes) and our methods for genome data mining must adapt to these new volumes of data.
We are developing an automatic workflow, MGCA (Multi-Genome Cluster Analysor), to integrate into a single resource methods for functional annotation and genome comparison, together with phylogenetic tools, to help researchers explore genome information. Ultimately, MGCA should ease the identification of genes involved in pathogen adaptation and in many other processes of biological interest.
MGCA has been devised to be flexible enough to take as input contigs, complete genomes or proteomes in different file formats. For a draft genome still in contigs, we call upon the RAST server, an automatic gene calling and annotation service, which predicts RNA and CDS genes and provides their respective DNA and protein sequences. In addition, the high-quality, up-to-date annotation that is indispensable for understanding a genome, its components and its protein products is obtained by running RPS-BLAST against the CDD database. A multi-genome clustering of homologs (orthologs and in-paralogs) using the InParanoid and MultiParanoid algorithms is then performed on a subset of genomes selected by the user. Finally, genes are classified as core genes, dispensable genes (present in some but not all compared genomes) and orphan genes (specific to one genome).
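The final core/dispensable/orphan classification follows directly from the presence of each homolog cluster across the selected genomes, and can be sketched as below; the family and genome names are made up for the example:

```python
def classify_gene_families(clusters, genomes):
    """clusters: {family_id: set of genomes with at least one member}.
    Core = present in all genomes; orphan = present in exactly one;
    dispensable = anything in between."""
    all_genomes = set(genomes)
    out = {}
    for fam, present in clusters.items():
        if present == all_genomes:
            out[fam] = "core"
        elif len(present) == 1:
            out[fam] = "orphan"
        else:
            out[fam] = "dispensable"
    return out

genomes = ["g1", "g2", "g3"]
clusters = {"famA": {"g1", "g2", "g3"},   # in every genome
            "famB": {"g1", "g2"},         # in some, not all
            "famC": {"g3"}}               # genome-specific
print(classify_gene_families(clusters, genomes))
```

In practice the presence sets would come from the InParanoid/MultiParanoid clustering step mentioned above.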
At present, MGCA simplifies the process of obtaining new biological insights into the differential gene content of related genomes. Ultimately, this tool will also allow the user to carry out more in-depth phylogenetic analyses, such as the detection of gene transfers or the measurement of selection pressure within sets of homologs of interest.
As a result of this development, it is now feasible to analyze large groups of related genomes in a phylogenomic approach. MGCA offers the opportunity to compare closely related strains or species, as well as bacteria from different phyla sharing the same lifestyle and/or other features. It helps to predict genes conferring important phenotypes, present or absent depending on the strain, and retrieves a short list of proteins for laboratory experiments.
To demonstrate a specific application, we analyzed ten genomes of Chlamydiaceae, including the human pathogens C. trachomatis and C. pneumoniae, through a comparison with genomes available in the PVC superphylum (Planctomycetes - Verrucomicrobia - Chlamydiae). The aim of comparing the Chlamydiae, obligate intracellular bacteria infecting animals, humans and protozoa (amoebae), with the Planctomycetes and Verrucomicrobia, which live free in various environmental niches, is to identify proteins involved in intracellular survival mechanisms.
Alexeyenko, A. et al. (2006) Automatic clustering of orthologs and inparalogs shared by multiple proteomes. Bioinformatics 22(14):e9-15.
Altschul, S. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.
Aziz, R. et al. (2008) The RAST Server: Rapid Annotations using Subsystems Technology. BMC Genomics 9:75.
Marchler-Bauer, A. et al. (2009) CDD: specific functional annotation with the Conserved Domain Database. Nucleic Acids Res. 37(Database issue):D205-10.
Remm, M. et al. (2001) Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol 314(5):1041-1052.
Orylink: a personalized integrated system for plant functional genomic analysis
Abstract: The availability of genome sequences for rice and Arabidopsis thaliana, the model species for monocot and dicot plants respectively, has allowed plant science to enter a new era of functional genomics. Plant functional genomics is now a major driving force in scientific research, and the next challenge will be identifying each gene's function and comparing gene evolution in the monocot and dicot lineages. There are several ways to identify gene function; however, random insertional mutagenesis by either T-DNA or transposable elements has been the most successful strategy for identifying plant gene functions on a genome-wide scale. Moreover, through phylogenetic and phylogenomic studies, the identification of orthologous genes makes it possible to create structural and functional links between species. These links allow the transfer of knowledge from one species to another (e.g. the identification of syntenic blocks or of cascades of gene regulation). In this context, it is necessary to organize this information so that scientists with diverse backgrounds and research interests can easily access, retrieve and integrate various types of data. The well-known colinearity of genomes among grass species makes it meaningful to view rice genomics data in relation to other members of the Poaceae. Unfortunately, browsing through the large variety of data types (e.g. omics data, genetic data and field observations) available in numerous distributed databases is a time-consuming task for geneticists and molecular biologists. They need flexible systems that automatically execute their actions and synthesize the results: applications that retrieve data from remote databases, store queries and results, and re-run a given search in order to take advantage of database updates.
The aims of Orylink are to (a) increase interoperability between the major plant databases and resources used by the plant community, (b) merge data of different origins (different species, different types: nucleotide sequences, mutant libraries, etc.) to enhance the transfer of knowledge within the Poaceae family, and (c) assist life science researchers in the discovery of gene function.
Polymorfind: an automatic pipeline for detecting SNPs and indels in sequences of PCR products from heterozygous species
Abstract: While nucleotide polymorphism detection increasingly relies on NGS approaches, there is still a lot of Sanger data to analyze, mainly from small targeted genome regions. Mostly for its ease and speed, detecting SNPs and indels in such small, well-defined regions is done by direct sequencing of PCR products. Many tools work well with homozygous or haploid organisms, whereas heterozygous ones are often poorly supported. Our objective is to develop a pipeline to detect SNPs and indels in PCR products from heterozygous species. To study diversity in Rosa, a genus with heterozygous and often polyploid species, we developed a pipeline, Polymorfind, which combines efficient, well-known tools with a few new methods to provide fully automated polymorphism detection software. It is simple enough to be used by non-informaticians, with fully functional default parameters, while keeping the fine tuning of every module available to advanced users. Polymorfind drastically reduces the time needed to manually curate chromatograms, letting biologists focus on special cases. It has been validated against two manually curated datasets from Rosa and grapevine. Polymorfind minimizes the discovery of false polymorphisms and limits the number of missed signals.
Mixed-formalism hierarchical modeling and simulation with BioRica
Abstract: BACKGROUND. A recurring challenge for in silico modeling of cell behavior is that experimentally validated models are so focused in scope that it is difficult to repurpose them. Hierarchical modeling is one way of combining specific models into networks. Effective use of hierarchical models requires both formal definition of the semantics of such composition, and efficient simulation tools for exploring the large space of complex behaviors.
OBJECTIVES. BioRica (Soueidan et al., 2007) is a high-level hierarchical modeling framework integrating discrete and continuous multi-scale dynamics within the same semantic domain. It is an adaptation of the AltaRica formalism (Arnold et al., 2000). It explicitly addresses model reusability, repurposing and other engineering best practices that are necessary for the sustainable, incremental development of comprehensive models incorporating individually validated components. The goal of the present work was to make the BioRica framework accessible to a wider audience.
METHODS. The BioRica approach expresses each existing model (in SBML) as a BioRica node; nodes are hierarchically composed to build a BioRica system. Individual nodes can be of two types. Discrete nodes are composed of states and of transitions described by constrained events, which can be non-deterministic. This captures a range of existing discrete formalisms (Petri nets, finite automata, etc.). Stochastic behavior can be added by associating a likelihood that an event fires when activated; Markov chains or Markov decision processes can thus be concisely described. Timed behavior is added by defining the delay between an event's activation and the moment its transition occurs. Continuous nodes are described by ODE systems, potentially hybrid systems whose internal state flows continuously while undergoing discrete jumps.
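As a toy illustration of the discrete-node semantics (a state, plus guarded events with a stochastic choice among the enabled ones), one might write something like the following. This is a hand-rolled sketch, not BioRica's syntax or implementation, and the birth-death model is invented:

```python
import random

class DiscreteNode:
    """Minimal sketch of a discrete node: a state and a list of events,
    each given as (guard, weight, transition).  At each step, one event is
    chosen stochastically among those whose guard holds."""
    def __init__(self, state, events):
        self.state = state
        self.events = events

    def step(self, rng):
        enabled = [(w, t) for g, w, t in self.events if g(self.state)]
        if not enabled:
            return False                       # deadlock: nothing can fire
        _, transition = rng.choices(enabled, weights=[w for w, _ in enabled])[0]
        self.state = transition(self.state)
        return True

# toy birth-death process on a molecule count
node = DiscreteNode(
    state=5,
    events=[
        (lambda s: True,  1.0, lambda s: s + 1),   # birth, always enabled
        (lambda s: s > 0, 1.0, lambda s: s - 1),   # death, guarded
    ],
)
rng = random.Random(0)
for _ in range(100):
    node.step(rng)
print(node.state)
```

With event weights interpreted as rates and delays attached to firings, the same skeleton extends to the Markov-chain and timed behaviors described above.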
RESULTS. The system has been implemented as a distributable software package. The BioRica model compiler and associated tools are available from INRIA (address to be provided).
DISCUSSION. By providing a reliable, functional software tool backed by an understandable and mathematically rigorous semantics, we hope to advance real adoption of hierarchical modeling by the systems biology community. This will make it easier for practicing scientists to build practical, functional models of the systems they study, and to concentrate their efforts on the system rather than on the tool.
IMGT/mAb-DB: the IMGT® database for therapeutic monoclonal antibodies.
Abstract: IMGT/mAb-DB is the monoclonal antibody database of IMGT®, the international ImMunoGeneTics information system® (http://www.imgt.org), which is acknowledged as the global reference in immunogenetics and immunoinformatics. IMGT/mAb-DB provides a unique expert resource on immunoglobulins (IG) or monoclonal antibodies (mAb) with clinical indications, and on fusion proteins for immune applications (FPIA). IMGT/mAb-DB is a relational database using the open source MySQL (http://www.mysql.com) database management system. The IMGT/mAb-DB query web interface allows requests on: (i) IMGT/mAb-DB ID, INN name and number, INN proposed and recommended lists, common and proprietary names; (ii) receptor type, species, radiolabelled, conjugated, isotype or fusion protein format, and entries with links to IMGT/2Dstructure-DB and IMGT/3Dstructure-DB; (iii) origin clone species and name, specificity (target) and origin; (iv) clinical indication, development status, regulatory agency such as the Food and Drug Administration (FDA) and the European Medicines Agency (EMA), company, and expression system; and (v) application (diagnostic, therapeutic) and clinical domain. IMGT/mAb-DB entries correspond to about two hundred clinical indications belonging to ten clinical domains (oncology, immunology, hematology, infectiology, etc.). The user can choose among twenty-six fields for the results display. By providing links to IMGT/2Dstructure-DB (amino acid sequences and IMGT Colliers de Perles) and IMGT/3Dstructure-DB (3D structures) for entries available in these databases, IMGT/mAb-DB facilitates comparative studies of antigen receptors and FPIA, and of their constitutive chains, even when 3D structures are not yet available. Since 2008, amino acid sequences of mAb (suffix -mab) and of FPIA (suffix -cept) from the World Health Organization/International Nonproprietary Name (WHO/INN) Programme have been entered in IMGT®.
In June 2010, IMGT/mAb-DB contains 343 entries (176 -mab, 15 -cept), of which 213 have an INN. IMGT/mAb-DB is freely available for academics at http://www.imgt.org/.
S-MART: how to handle your RNA-Seq data?
Abstract: Whereas several tools are now available to map high-throughput sequencing data on a genome, few can extract biological knowledge from the mapped reads. As a consequence, many labs develop their own short-lived scripts to analyze their data.
We have developed a toolbox called S-MART which handles mapped RNA-Seq data. S-MART is an intuitive and lightweight tool which performs many tasks usually required for the analysis of mapped RNA-Seq reads, including selecting or excluding the reads overlapping a reference set, clustering the data, and converting from a mapper format (such as BWA, SOAP2 or BLAT) to another format (such as GFF, or the GBrowse format). S-MART can also output high-quality graphics, such as the read sizes, the nucleotide composition, the density along the genome and the distance of the reads with respect to another set. S-MART can also compute the differential expression between wild-type and mutant experiments using Fisher's exact test and several possible normalizations.
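The Fisher's exact test step can be illustrated on a 2x2 contingency table of read counts (reads on a gene vs. elsewhere, in wild type vs. mutant). This stdlib-only sketch is not S-MART's code, and the counts are invented:

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]],
    summing the probabilities of all tables (with the same margins) that
    are at most as likely as the observed one."""
    row1, col1, n = a + b, a + c, a + b + c + d
    def hyper(x):   # hypergeometric P(top-left cell = x), margins fixed
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)
    p_obs = hyper(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    return sum(hyper(x) for x in range(lo, hi + 1) if hyper(x) <= p_obs + 1e-12)

# hypothetical gene: 12 of 100 reads in wild type vs. 3 of 100 in mutant
print(fisher_exact_two_sided(12, 88, 3, 97))
```

In practice the counts would first be normalized for library size, which is why the tool offers several normalization schemes.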
S-MART has been successfully applied to our own Illumina/Solexa and Roche/454 data and can reproduce many analyses carried out in recent publications without a single line of code.
S-MART does not require any computer science background and can thus be used by the whole biology community through a graphical interface. S-MART can run on any personal computer, yielding results within an hour for most queries, even for gigabytes of data. S-MART is freely available from the URGI web site at http://urgi.versailles.inra.fr/index.php/urgi/Tools/S-MART under the CeCILL license, which is compatible with the GNU GPL license. S-MART can be used on Windows, Mac and Linux platforms.
This work has been partly funded through the ANR TransNet project.
Assessing bioinformatics tools for metalloprotein identification: the iron-sulphur proteins case study
Abstract: Metalloproteins are of major importance in all three domains of life. However, a dedicated approach to identifying new members of this large and ubiquitous protein family is still lacking. This work aims at evaluating the current state of the art of bioinformatics tools based on different biological concepts (e.g. domain recognition, evolutionary profiles or structural conservation) by assessing their respective predictive power for iron-sulphur protein inference. Their effectiveness and their potential combination into a suitable strategy will be discussed.
The EvolScope project: an extension of the MicroScope platform to study the evolution of bacterial polymorphism from high-throughput sequencing data
Abstract: Since their very recent appearance, next-generation sequencing (NGS) techniques have literally revolutionized genomic research by democratizing access to sequence data and by enabling the investigation of genetic and biological fields hitherto unexplored. Among the broad range of applications offered by these new technologies, detection of both single nucleotide polymorphisms (SNPs) and insertion/deletion events (indels) via (re-)sequencing and comparative analysis of several highly related organisms (or clones of the same species at different generation times) constitutes a key source of information for researchers interested in microbial evolution.
Taking advantage of the local production of high-throughput sequencing data, different tools have thus been developed and integrated into our MicroScope platform (1,2) to address this specific issue. We first designed a bioinformatic pipeline, called SNiPer (unpublished method), which detects small variations between the reads of newly sequenced clones and a reference sequence generally already integrated into MicroScope. Based on the SSAHA2/Pileup package (3), SNiPer provides a crude list of mutational events, which is almost unexploitable as such by biologists. Consequently, new relational tables have been created in our Prokaryotic Genome Database (PkGDB) to gather the SNiPer results and to easily link them to the annotation data already stored. In addition, a set of web-based query interfaces has been implemented to ease exploration and enable in-depth analysis of these mutations through the MaGe interface: summaries, event types, sorting of mutations and of the genes they affect according to their various statuses, graphical representation of mutations along the reference genome, etc. Several use cases will be illustrated on the poster using data from two evolution projects on which we are currently working.
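SNiPer itself is unpublished, so the following only sketches the generic idea of calling substitutions from pileup-style base counts against a reference; the data layout, thresholds and positions are all made up for illustration:

```python
from collections import Counter

def call_snps(pileup, min_depth=10, min_alt_frac=0.8):
    """pileup: {position: {'ref': reference base, 'bases': read bases}}.
    Call a substitution where coverage is sufficient and a single
    non-reference base dominates the column."""
    calls = []
    for pos, col in sorted(pileup.items()):
        counts = Counter(col["bases"])
        depth = sum(counts.values())
        if depth < min_depth:
            continue                      # too shallow to call
        base, n = counts.most_common(1)[0]
        if base != col["ref"] and n / depth >= min_alt_frac:
            calls.append((pos, col["ref"], base))
    return calls

pileup = {
    101: {"ref": "A", "bases": "GGGGGGGGGGGA"},  # strong A->G variant
    102: {"ref": "C", "bases": "CCCCCCCCCCCT"},  # reference-like column
    103: {"ref": "T", "bases": "AAAA"},          # insufficient depth
}
print(call_snps(pileup))  # → [(101, 'A', 'G')]
```

The calls would then be cross-referenced with stored gene annotations, as the PkGDB tables described above do, to report which genes each mutation affects.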
On the whole, these tools can help biologists to answer fundamental questions such as: what are the mechanisms and the dynamics underlying adaptive evolution? Are the genes involved mostly global pleiotropic regulators, or do some mutations also affect narrow-spectrum genes? Do mutations affect gene stability, active sites, or mainly regulatory regions? Furthermore, these new developments will be the starting point for the use of our platform in the context of other NGS applications, such as RNA-seq and ChIP-seq analysis.
(1) Vallenet, D. et al. MaGe: a microbial genome annotation system supported by synteny results. Nucl. Acids Res. 34, 53-65 (2006).
(2) Vallenet, D. et al. MicroScope: a platform for microbial genome annotation and comparative genomics. Database 2009: bap021 (2009).
(3) Ning, Z., Cox, A.J. & Mullikin, J.C. SSAHA: a fast search method for large DNA databases. Genome Res. 11, 1725-9 (2001).
Gene-rich domains of the mammalian chromatin display oscillatory contact frequencies
Abstract: In interphase cells, the mammalian genome, packaged into chromatin, is spatially restrained to specific chromosomal territories. However, beyond the simple nucleosomal array, very little is known about the organization and the dynamics of chromatin within chromosomal territories. Although it is widely accepted that one essential determinant is chromatin looping, in relation to gene expression and other chromosomal activities, the basic structural landscape of chromatin remains largely unknown in vivo at the supranucleosomal level. Indeed, at that scale (~10-500 kb), the organization of chromatin and its dynamics in living cells are difficult to access by cell imaging techniques, due to intrinsic limitations of light microscopes. The Chromosome Conformation Capture (3C) assay, and derived technologies, represent a real technological advance: such techniques allow direct quantification of the average interaction frequency between two distant genomic regions, at the supranucleosomal scale, in their native genomic context.
To study the basic folding of chromatin, we used the 3C technique to determine the random collision frequencies occurring between sites separated by increasing genomic distances in the mouse. Moving-average analyses reveal that, for five distinct gene-rich chromosomal domains, random collision frequencies oscillate with a periodicity of ~90-100 kb. This result suggests that, in gene-rich domains, the dynamics of long-range locus-specific interactions may be influenced by this unexpected fundamental oscillation.
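The moving-average idea can be sketched as follows; this is an illustration, not the authors' analysis code. Given contact frequencies binned along genomic distance, the profile is smoothed and the dominant period is read off the highest autocorrelation peak (so a ~90-100 kb period would correspond to a lag of ~9-10 bins of 10 kb).

```python
# Illustrative sketch: smoothing a binned contact-frequency profile and
# estimating its dominant period from the autocorrelation function.

def moving_average(xs, window):
    """Centered moving average with edge truncation."""
    half = window // 2
    out = []
    for i in range(len(xs)):
        chunk = xs[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def dominant_period(xs, min_lag=1):
    """Lag (in bins) of the highest autocorrelation value beyond min_lag."""
    n = len(xs)
    mean = sum(xs) / n
    dev = [x - mean for x in xs]
    var = sum(d * d for d in dev)
    best_lag, best_r = None, -1.0
    for lag in range(min_lag, n // 2):
        r = sum(dev[i] * dev[i + lag] for i in range(n - lag)) / var
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag
```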
To assess this hypothesis, we used bioinformatics analyses and showed that conserved sequences in clusters of co-expressed genes in the mouse are highly overrepresented at genomic distances corresponding to one and two modulations relative to transcription start sites. Furthermore, using Hi-C data published by Lieberman-Aiden et al. (2009 - Science 326, 289-293), we found that long-range interactions in the human genome occur more frequently at distances around 90 kb in gene-rich than in gene-poor domains. These data strongly suggest that oscillatory contact frequencies influence long-range interactions in gene-rich domains of mammalian genomes.
Finally, we show that such oscillations can be described by polymer models as if gene-rich domains of the mammalian chromatin were shaped into a statistical helix.
Our work provides new insights into mammalian genome organization through the discovery that gene-rich domains of the mammalian chromatin display oscillatory contact frequencies. Importantly, this fundamental oscillation appears to influence long-range locus-specific chromatin interactions and, consequently, has a significant impact on genome evolution in mammals.
MobyleNet: user-centered large spectrum service integration over distributed sites.
Abstract: The recent years have seen an important growth in the number of bioinformatics tools available to run analyses, covering an increasing range of domains. Existing web-based platforms which publish these tools as services to the community have to anticipate and address the challenges raised by this growth.
A traditional approach is to set up centralized resource centers aggregating all tools and data. However, such a strategy does not scale well with the growing diversity of the methods to operate, which can exceed the skills of the staff attached to such centers.
We propose here an alternative organization that promotes a network of smaller platforms with specific skills. The services are published on a framework that favors interoperability between the sites, enabling the seamless integration of resources to run cross-domain protocols.
MobyleNet is a federation of web portals running the open-source Mobyle framework. Its goals are:
* to integrate into a single framework (Mobyle) the services located on different sites;
* to initiate a large application spectrum framework, covering complementary aspects of bioinformatics;
* to guarantee the interoperability of the services;
* to set up a confidence network, spanning the specific skills of each platform, by enabling quality management.
The MobyleNet project is funded by the IBISA initiative (French National coordination of the Infrastructures in Biology, Health and Agronomy). It currently involves nine sites, as providers of services and/or developers of the core MobyleNet framework. These platforms are distributed over France; their skills range from genomics, microarrays, sequence analysis, phylogeny and structural bioinformatics to chemoinformatics, with diverse focuses such as microorganisms, plants, pharmacology or cancer. This network is also open to international contributions. More information about the MobyleNet project can be found on
The PSICQUIC Interface – a portal into the world of the Interactome
Abstract: Molecular interactions are key to our understanding of biological systems, and there are many data resources, such as the IntAct molecular interaction database (www.ebi.ac.uk/intact), dedicated to capturing such experimental data. This increasing number of resources, however, presents the researcher interested in a specific biological domain with a significant data access challenge. To instantaneously access multiple interaction data resources, a common interface for computational access is required, allowing software clients to interact with multiple sources in the same way. This has led to the development of PSICQUIC, the PSI Common Query InterfaCe, a community standard for computational access to molecular interaction resources. PSICQUIC has been jointly developed by major interaction data and tool providers in the context of the HUPO Proteomics Standards Initiative, using the standard formats developed by the PSI Molecular Interaction workgroup. This initiative is supported by an increasing number of major interaction databases, and a number of practical applications are demonstrated.
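PSICQUIC services return interaction records in the PSI-MI TAB (MITAB 2.5) format, a tab-delimited table of 15 columns with fields such as "uniprotkb:P12345". A minimal client-side reader might look like the following sketch; the particular fields extracted here are an illustrative selection, not a prescribed API.

```python
# Minimal reader for PSI-MI TAB 2.5 rows, the tab-delimited format that
# PSICQUIC services return (15 columns per interaction record).

def parse_mitab_line(line):
    """Return a dict with selected fields of one MITAB 2.5 row."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 15:
        raise ValueError("MITAB 2.5 rows have 15 columns, got %d" % len(cols))
    return {
        "id_a": cols[0],                  # unique identifier, interactor A
        "id_b": cols[1],                  # unique identifier, interactor B
        "detection_method": cols[6],      # interaction detection method
        "publication": cols[8],           # publication identifier
        "taxid_a": cols[9],
        "taxid_b": cols[10],
        "confidence": cols[14],
    }
```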
TuberGAS: annotation and visualization of the Black Truffle genome
Abstract: The Périgord Black Truffle (Tuber melanosporum) is a symbiotic fungus that belongs to the phylum Ascomycota. The genome of this ectomycorrhizal fungus, sequenced at Genoscope, is characterized by gene-rich islands in an ocean of transposable elements. At 120 Mb, it is the largest fungal genome sequenced so far.
To annotate this genome, we developed an integrated environment for browsing annotations and predictions, with dedicated tools for the comparison, visualization and editing of genes of interest. The genomic annotation system for Tuber (TuberGAS) is composed of three elements: an in-house database (TuberDB) for browsing annotated sequences, coupled with two generic tools provided by the Generic Model Organism Database project: Tuber GBrowse for graphical visualization and Tuber Apollo for genome annotation. TuberGAS integrates sequencing data, automatic and manual annotations, and expression data.
Automatic annotation of sequences was performed with BLASTX against generic databanks (GenBank nr, UniProt Swiss-Prot, KEGG genes). Based on the best BLAST hit against nr, we extract Gene Ontology (GO) classifications and Pfam motifs with a Python script. We also predict, for each gene model, cellular localization, transmembrane helices and signal peptides with TargetP, TMHMM and SignalP.
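The annotation script itself is not shown in the abstract; the following hypothetical sketch only illustrates the general idea of the step described above: keep the best-scoring BLAST hit per query (tabular output, bit score in the last column) and transfer GO terms through an assumed hit-to-GO mapping.

```python
# Hypothetical sketch of best-hit-based GO transfer (not the authors' script).

def best_hits(blast_tabular_lines):
    """Map each query to its highest-scoring subject (bit score = last column)."""
    best = {}
    for line in blast_tabular_lines:
        cols = line.rstrip("\n").split("\t")
        query, subject, bitscore = cols[0], cols[1], float(cols[-1])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {q: s for q, (s, _) in best.items()}

def transfer_go(best, subject2go):
    """Annotate each query with the GO terms of its best hit (empty if unknown)."""
    return {q: subject2go.get(s, []) for q, s in best.items()}
```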
TuberGAS is accessible through a dedicated web interface: http://mycor.nancy.inra.fr/IMGC/TuberGenome.
This work is supported by INRA, Région Lorraine and EvolTree Network of Excellence.
 F. Martin et al., Périgord black truffle genome uncovers evolutionary origins and mechanisms of symbiosis. Nature, 464, 1033-1038, 2010.
 O. Emanuelsson, S. Brunak, G. von Heijne, H. Nielsen. Locating proteins in the cell using TargetP, SignalP, and related Tools. Nature Protocols 2, 953-971 (2007).
 A. Krogh, B. Larsson, G. von Heijne, E.L. Sonnhammer. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol., 305, 567-580, 2001.
Mining sequence similarity and microsynteny for functional inference
Abstract: With the advent of next generation sequencing technology, thousands of complete prokaryotic genomes are soon to be made available to the scientific community. Such an abundance of fully sequenced genomes opens up unprecedented perspectives for comparative genomics and evolutionary studies, which at the same time underscores the need for reliable automatic functional annotation and pathway reconstruction methods. In this context, one of the most crucial and determining steps for subsequent analyses is the identification of genes and systems fulfilling the same function across genomes. Current methods for the analysis of protein and domain families rely either on multiple sequence alignments or on BBH clusters, which generally yields profiles that are specific to a family or a sub-family. These approaches gave rise to valuable general resources such as the Pfam (Finn et al., 2008) and SMART (Letunic et al., 2008) databanks on protein and domain families, or the COG database (Tatusov et al., 1997) on clusters of orthologous groups of proteins. Unfortunately, applying these profiles to multigenic families still yields high paralogy rates. To circumvent this, dedicated projects have emerged, each focusing on a specific super-family, to achieve a better resolution, i.e. subfamily and in some cases sub-sub-family, based on phylogenetic tree reconstruction (Fichant et al., 2006). Yet, these projects face the challenge of automating the analyses. Indeed, the quality of these resources relies on the multiple sequence alignment of thousands of sequences, which generally requires manual intervention and expert validation.
In the present work, we propose an original approach that achieves higher specificity in the automatic prediction of orthology relationships by mining sequence similarity and microsynteny simultaneously. It is based on a refinement of the orthology relationship, named isorthology. The key principle is to use microsynteny as an indication of evolutionary and functional relatedness: for a given gene, its genomic neighbors are also taken into account. All the homologs in that region are retrieved by sequence similarity searches in complete prokaryotic genomes, and analyzed in parallel by biclustering these two complementary relationships: homology across genomes (putative isorthology) and genomic neighborhood conservation within genomes (microsynteny).
To infer isorthology, we assign this relationship to pairs of sequences for which no duplication event occurred after speciation. This is motivated by the fact that such constraints maximize the chances that the gene functions have not evolved and specialized too drastically since the speciation event. Once the isorthology links have been inferred between all pairs of genes in all pairs of genomes, the expected result is a graph whose connected components correspond to sets of genes playing the same role in their respective organisms. Unfortunately, due to complex evolutionary histories involving rounds of gene duplications and losses, the isorthology function, though very restrictive, still identifies false links in real biological data. To detect and remove such false isorthology links, the data are represented as an unweighted undirected graph in which nodes represent genes and edges correspond to isorthology links. First, in order to strengthen the evolutionary signal, genes (vertices) that do not belong to at least one triangle (cycle of length 3) are removed. Second, a partitioning method, MCL (van Dongen, 2000), is applied to the filtered graphs in order to isolate sets of true isorthologs by cutting false-positive links. Since, for this algorithm, the number of clusters obtained depends on the inflation factor parameter (IF), a parameter sweep over IF is performed. To select the best partition, we use an external criterion, the modularity (Newman and Girvan, 2004), which quantifies the quality of a partition of a graph into clusters by favoring within-cluster connections and penalizing between-cluster connections.
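The triangle-filtering step can be sketched as follows (an illustrative reimplementation, not the authors' code): a vertex is kept only if one of its edges closes a 3-cycle, i.e. the two endpoints of the edge share a common neighbour.

```python
# Illustrative triangle filter on an undirected, unweighted isorthology graph:
# keep only genes (vertices) that belong to at least one cycle of length 3.

def triangle_filter(edges):
    """edges: iterable of vertex pairs; returns the set of kept vertices."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    keep = set()
    for u, v in edges:
        if adj[u] & adj[v]:        # a common neighbour closes a triangle
            keep.add(u)
            keep.add(v)
    return keep
```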
Conserved genomic neighborhood, i.e. microsynteny, can further reinforce the confidence of what annotation should be transferred to what genes in an isorthologs cluster or can provide clues to gene associations. Here, we use the microsynteny relationship in parallel with the isorthology relationship to identify true strong links between genes from different genomes. The principle is as follows. Starting from a signature of interest, for example a conserved domain, all the genes matching the signature are retrieved from all the complete genomes. Then, for each of these genes, its neighbors on the genome are added to the set. The next step consists in reconstructing the isortholog graphs by using the gene–gene isorthology relationships inferred from the sequence comparisons. After filtering and partitioning, as described previously, one obtains clusters of isorthologous genes. The microsynteny and the isorthology relationships can be represented simultaneously in a matrix in which a cell corresponds to a gene, a column corresponds to a set of genes that are neighbors in one genome, and a row corresponds to a cluster of isorthologs. At this stage, we make use of a biclustering method, the iterative signature algorithm (Bergmann et al., 2003) to identify sets of genes (biclusters) in which their members are both linked (i) by isorthology accross genomes and (ii) by microsynteny within a genome.
By applying this method, we show that it correctly separates the essential ssb gene, involved in DNA replication, recombination and repair and present in all prokaryotic genomes, from the ssb gene involved in the competence physiological state found in some genomes. At the same time, this approach allows us to highlight a conserved genomic context specific to a set of genomes. In a second experiment, we were able to identify sets of genes within the chaperone-usher (CU) pathway that correspond to previously described sub-families of systems (Nuccio and Bäumler, 2007). The CU pathway is used for fimbrial assembly at the surface of Gram-negative bacteria, facilitating the attachment of bacteria to the host tissue and promoting biofilm formation, thereby contributing to bacterial pathogenesis (Giraud et al., 2009). Genes involved in the biogenesis of fimbriae belonging to the CU assembly class are almost invariably clustered into operons. These operons encode, at minimum, three different proteins: an outer membrane protein called the usher, a periplasmic chaperone and at least one fimbrial subunit.
This approach opens a variety of possibilities in comparative genomics and strengthens approaches relying on ortholog identification. It is applicable to a single gene, such as ssb, as well as to protein assemblies and pathways, such as the CU pathway.
Bergmann, S., Ihmels, J. and Barkai, N. (2003) Iterative signature algorithm for the analysis of large-scale gene expression data. Phys Rev E Stat Nonlin Soft Matter Phys., 67, 031902-1:031902-18.
van Dongen, S. (2000) Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, Utrecht, The Netherlands.
Fichant, G., Basse, M.-J. and Quentin, Y. (2006) ABCdb: an online resource for ABC transporter repertories from sequenced archaeal and bacterial genomes. FEMS Microbiology Letters, 256(7), 333-339.
Finn, R.D., Tate, J., Mistry, J., Coggill, P.C., Sammut, J.S., Hotz, H.R., Ceric, G., Forslund, K., Eddy, S.R., Sonnhammer, E.L. and Bateman, A. (2008) The Pfam protein families database. Nucleic Acids Research Database Issue, 36, D281-D288.
Giraud, C., Bernard, C., Ruer, S. and de Bentzmann, S. (2009) Biological 'glue' and 'Velcro': molecular tools for adhesion and biofilm formation in the hairy and gluey bug Pseudomonas aeruginosa. Environmental Microbiology Reports.
Letunic, I., Doerks, T. and Bork, P. (2008) SMART 6: recent updates and new developments. Nucleic Acids Research Database Issue, 37, D229-D232.
Newman, M.E.J. and Girvan, M. (2004) Finding and evaluating community structure in networks. Physical Review E., 69(2), 026113.
Nuccio, S. and Bäumler, A. (2007) Evolution of the Chaperone/Usher Assembly Pathway: Fimbrial Classification Goes Greek. Microbiol. Mol. Biol. Rev., 71: 551-575.
Tatusov, R.L., Koonin, E.V., Lipman, D.J. (1997) A Genomic Perspective on Protein Families. Science, 278(5338), 631-637.
PROTIC workshop: a bioinformatics environment for proteomics data analysis, validation and integration
Abstract: Proteomic approaches generate massive and heterogeneous data that need appropriate databases and tools to be managed, processed and distributed efficiently by scientists. PROTIC workshop is a set of tools dedicated to qualitative and functional annotation, statistical analysis and data distribution. This new environment is integrated with PROTICdb2 databases. It enables scientists to annotate data, such as mass spectrometry samples or protein identifications, and to launch R scripts on sets of selected data. Thanks to the PROTICport module, data may be distributed through web services and integrated within World-2DPAGE.
EuPathDomains: The Divergent Domain Database for Eukaryotic Pathogens
Abstract: Eukaryotic pathogens (e.g. apicomplexans, trypanosomes, etc.) are a major source of morbidity and mortality worldwide. In Africa, one of the most affected continents, they cause millions of deaths and impose an immense economic burden. While the genome sequences of several of these organisms are now available, the biological functions of more than half of their proteins are still unknown. This is a serious obstacle to bringing to the foreground the expected new therapeutic targets. In this context, the identification of protein domains is a key step to improve the functional annotation of the proteins.
Hidden Markov Models (HMMs) have proved to be a powerful tool for protein domain identification. Notably, the Pfam database provides a large collection of HMMs covering 73% of UniProt proteins. Each Pfam HMM is a probabilistic model characterizing a given domain. When analyzing a new protein sequence, a score is computed to measure the similarity between the sequence and the domain. This score is compared to a stringent Pfam threshold above which the presence of the domain in the protein is asserted. However, when applied to eukaryotic pathogens, numerous domains are missed by this procedure. For example, within Plasmodium falciparum (the main causal agent of malaria) and Leishmania major (the agent of leishmaniasis), a domain is discovered in only 50% of the proteins, and only 1,400 and 1,600 different domain families are detected, respectively. In comparison, 2,400 different domain families are known in yeast, covering 76% of its proteins. Although this observation could be explained by the existence of domains that are unique to a pathogen lifestyle, it is likely further exacerbated by the high phylogenetic distance between these organisms and the classical eukaryotic models used to learn the HMMs, which makes homology detection particularly difficult. We recently proposed a new method, the Co-Occurring Domain Detection algorithm (CODD), that improves the sensitivity of HMM domain detection by exploiting the tendency of domains to appear preferentially with a few other favorite domains in a protein (article available at http://hal-lirmm.ccsd.cnrs.fr/lirmm-00431171/). This property enables CODD to certify the presence of a divergent domain on the basis of the presence of another domain in the same protein. The CODD algorithm comes with a statistical procedure that associates confidence measurements to the newly detected domains.
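The co-occurrence principle behind CODD can be caricatured in a few lines; the real algorithm additionally computes the confidence statistics mentioned above, and the domain accessions and co-occurrence table below are invented for illustration only.

```python
# Caricature of co-occurrence-based certification (not the CODD implementation):
# a hit below the stringent Pfam threshold is kept if the same protein carries
# a confidently detected domain known to co-occur with it.

def certify(confident_domains, candidate_hits, cooccur):
    """Return the candidate domains certified by a co-occurring partner.

    confident_domains: domains detected above the stringent threshold.
    candidate_hits: sub-threshold (divergent) candidate domains.
    cooccur: dict mapping a domain to the set of its favorite partners.
    """
    certified = []
    for candidate in candidate_hits:
        partners = cooccur.get(candidate, set())
        if partners & set(confident_domains):
            certified.append(candidate)
    return certified
```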
Here, we present EuPathDomains (accessible at http://www.atgc-montpellier.fr/EuPathDomains/), an extended database of divergent Pfam domains detected by applying CODD to several eukaryotic pathogens from EuPathDB. Currently, ten organisms are included in EuPathDomains: Giardia lamblia, Trypanosoma brucei, three Leishmania species, and five apicomplexan species including three Plasmodium species, Toxoplasma gondii and Cryptosporidium parvum.
EuPathDomains gathers all newly detected domains and their GO annotations, along with the associated confidence measurements. It significantly extends the domain coverage of all selected genomes, by proposing new domain occurrences and by detecting domain families that were not known in these organisms. For example, with a false discovery rate lower than 20%, it increases the number of domain occurrences by 13% in the T. gondii genome and by up to 28% in C. parvum. Similarly, the total number of domain families rises by 10% in G. lamblia and by up to 16% in C. parvum. The EuPathDomains database can be queried by keywords, proteins, domains, GO terms or InterPro entries, and by varying the confidence threshold associated with the new domains. EuPathDomains should become a valuable new resource to help decipher the functions of the proteins of eukaryotic pathogens.
 R. Durbin, S. Eddy, A. Krogh and G. Mitchison, Biological sequence analysis: Probabilistic models of proteins and nucleic acids, Cambridge University Press, New York, 1998.
 R.D. Finn, J. Tate, J. Mistry, P.C. Coggill, S.J. Sammut, H.R. Hotz, G. Ceric, K. Forslund, S.R. Eddy, E.L.L. Sonnhammer and A. Bateman, The Pfam Protein Families Database. Nucleic Acids Res., 36(Database issue):D281-D288, 2008.
 The UniProt Consortium, The Universal Protein Resource (UniProt). Nucleic Acids Res., 36(Database issue):D190-D195, 2008.
 N. Terrapon, O. Gascuel, E. Maréchal and L. Bréhélin, Detection of new protein domains using co-occurrence: application to Plasmodium falciparum. Bioinformatics, 25(23):3077-3083, 2009.
 C. Aurrecoechea, J. Brestelli, B.P. Brunk, S. Fischer, B. Gajria et al., EuPathDB: a portal to eukaryotic pathogen databases. Nucleic Acids Res., 38(Database issue):D415-D419, 2009.
 The Gene Ontology Consortium, The Gene Ontology (GO) project in 2006. Nucleic Acids Res., 34(Database issue):D322-D326, 2006.
MeRy-B: a web knowledgebase for the storage, visualization, analysis and annotation of plant metabolomics profiles obtained from NMR
Abstract: Thanks to improvements in metabolomics analytical techniques and the growing interest in metabolomics approaches, more and more metabolic profiles are generated. This large quantity of high-throughput data needs to be stored and structured according to accepted standards (MSI). To exploit these data, scientists need tools to store data, identify metabolites and disseminate results. To meet this need, different tools already exist, each specific to a species, an analytical technique or a given usage: reference spectra databases (BMRB), profile management databases (GMD, PlantMetabolomics.org), metabolite databases (KEGG) or knowledgebases (HMDB). Each of these tools addresses one or more of the above needs. However, the management of NMR plant metabolomics profiles remains poorly addressed.
To fill this gap, we have developed MeRy-B, a plant metabolomics platform allowing the storage and display of NMR plant metabolomics profiles. We will present the MeRy-B knowledgebase with a description of its functionalities:
- Data capture from a comprehensive metabolomics experiment: metadata (experimental and analytical), spectral data, peak lists, and detected analytes. MeRy-B follows MSI requirements for metadata description as well as suitable OBO ontologies. The application supports the PDF format for protocols and JCAMP-DX for NMR spectra outputs.
- Data visualisation: a spectrum viewer (one spectrum), spectra overlay (all spectra from a project or experiment) and statistical tools for synthetic representation.
- Analyte identification support, thanks to the MeRy-B knowledgebase, which is fed with new identifications.
- Data export for exploitation by other statistical analysis tools that may contribute to biomarker discovery.
Currently, MeRy-B contains more than one hundred different plant metabolites and unknown compounds, with information about experimental conditions and metabolite concentrations for several plant species, compiled from more than one thousand annotated NMR profiles of various organs or tissues.
MeRy-B is a web-based application with either public or private access. URL: http://www.cbib.u-bordeaux2.fr/MERYB/index.php
IMGT/LIGMotif: a tool for immunoglobulin and T cell receptor gene identification and description in large genomic sequences
Abstract: The antigen receptors, immunoglobulins (IG) and T cell receptors (TR), are specific molecular components of the adaptive immune response of vertebrates. Their genes are organized in the genome in several loci that comprise different gene types: variable (V), diversity (D), joining (J) and constant (C) genes. Synthesis of the IG and TR proteins requires rearrangements of V and J, or V, D and J genes at the DNA level, followed by splicing, at the RNA level, of the rearranged V-J and V-D-J genes to C genes. Owing to the particularities of IG and TR gene structures related to these molecular mechanisms, conventional bioinformatics software and tools are not adapted to the identification and description of IG and TR genes in large genomic sequences. To answer this need, IMGT®, the international ImMunoGeneTics information system®, has developed IMGT/LIGMotif, a tool for IG and TR gene annotation. This tool is based on standardized rules defined in IMGT-ONTOLOGY, the first ontology in immunogenetics and immunoinformatics [2, 3].
IG and TR V, D and J genes belong to multigene families. IG and TR loci contain several hundred genes, including many pseudogenes. Many of these pseudogenes are degenerate and/or partial genes and are therefore difficult to annotate. Moreover, D and J genes have very small coding regions (8-37 base pairs (bp) and 37-69 bp, respectively), making their identification difficult. Interestingly, functional IG and TR genes have characteristics that, if present, allow their unambiguous identification. Thus, for example, the IG and TR V, D and J genes have recombination signals (RS) which allow them to rearrange at the DNA level in B cells (for the IG) and T cells (for the TR) and which constitute one of the major differences with conventional genes. RS are localized 3’ of the V genes, 5’ of the J genes and on both sides of the D genes [4, 5]. They consist of conserved heptamers and nonamers separated by less conserved spacers of 12 ±1 or 23 ±1 bp, which vary between loci and species (IMGT Repertoire, http://www.imgt.org/). In IMGT®, the IG and TR gene characteristics used for gene identification are defined by concepts generated from the IMGT-ONTOLOGY ‘IDENTIFICATION’ axiom [2, 3]. Three concept instances of the ‘Molecule_EntityType’ concept of identification are used in the IMGT/LIGMotif model: V-gene, D-gene and J-gene. These concept instances are defined by the gene type (V, D, J), the molecule type (gDNA) and the configuration type (germline).
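As an illustration of such spacer-constrained RS patterns (not the actual IMGT/LIGMotif motif databases, which are far richer than a single consensus), the canonical heptamer CACAGTG and nonamer ACAAAAACC of the V(D)J literature can be searched with the 12 ±1 / 23 ±1 bp spacer rule stated above:

```python
import re

# Illustrative single-strand search for recombination-signal (RS) patterns,
# using the canonical consensus heptamer and nonamer; real RS are degenerate,
# so a production tool would use motif databases, not one exact consensus.

RS_12 = re.compile(r"CACAGTG.{11,13}ACAAAAACC")   # 12 +/- 1 bp spacer
RS_23 = re.compile(r"CACAGTG.{22,24}ACAAAAACC")   # 23 +/- 1 bp spacer

def find_rs(seq):
    """Return (start, spacer_type) for each consensus RS found in seq."""
    hits = [(m.start(), "12bp") for m in RS_12.finditer(seq)]
    hits += [(m.start(), "23bp") for m in RS_23.finditer(seq)]
    return sorted(hits)
```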
The IG and TR gene features are described, in IMGT®, according to the standardized concepts of description generated from the IMGT-ONTOLOGY ‘DESCRIPTION’ axiom [2, 3]. Thus, the V-gene, D-gene and J-gene are described by three concept instances of the ‘Molecule_EntityPrototype’ concept of description: V-GENE, D-GENE and J-GENE, respectively. Among the 242 IMGT® labels defined for nucleotide sequences, 47 are used in the IMGT/LIGMotif model, of which 43 are specific to one prototype (23 for a V-GENE, 11 for a D-GENE and 9 for a J-GENE). For the purpose of gene description, IMGT/LIGMotif uses the ‘gene unit’ labels. Indeed, these labels, in contrast to the ‘gene’ labels (V-GENE, D-GENE and J-GENE), have the advantage of being precisely delimited in 5’ and 3’, respectively, by the 5’ end and 3’ end of constitutive labels. Moreover, the part of the prototype they encompass can be defined by conserved motifs that constitute a pattern. In a pattern, the conserved motifs are separated from each other by a distance in bp comprised between a minimal and a maximal length. Motifs are ordered from 5’ to 3’ with a rank that corresponds to their relative localization in the pattern, the 3’-most motif having a rank equal to the number of motifs in the pattern. Gene functionality can only be assigned to precisely described IG and TR genes. In IMGT-ONTOLOGY, an unrearranged genomic V, D or J gene can be functional (F), open reading frame (ORF) or pseudogene (P) [2, 3].
The algorithm is composed of four modules. The ‘Gene identification’ module identifies potential V, D and J genes along the genomic sequence to analyse. First, a heuristic search for local alignments is performed against the IMGT/LIGMotif reference motif databases. These databases comprise nucleotide sequences that correspond to IG and TR gene unit labelled motifs. The alignments obtained in this first step provide labelled high-scoring segment pairs (HSPs) on both DNA strands of the sequence to analyse. The labelled HSPs are then selected and grouped according to their topology and their gene type (V, D or J). Thus, the ‘Gene identification’ module provides the potential V, D and J genes, identified as grouped and labelled HSPs along the sequence to analyse. The ‘Gene description’ module provides the description of each potential gene identified in the first module. It comprises a search for conserved motifs based on prototypes and patterns. Because codons of conserved amino acids in the patterns are very frequent in sequences, the codons of conserved amino acids of the V genes are identified by the software IMGT/V-QUEST. The expected outputs of the ‘Gene description’ module are described gene units, although partially described and undescribed outputs can also be obtained. The ‘Functionality identification’ module includes the control of the features needed for functionality assignment and yields annotated gene units. In this final module, genes (V-GENE, D-GENE and J-GENE) are delimited and, if the analysed genomic sequence contains several genes, assembled into a cluster. Two genes are delimited by splitting the distance between them equally. The final outcome of IMGT/LIGMotif is the annotated genomic sequence.
The IMGT/LIGMotif algorithm is implemented in Java. A web application of IMGT/LIGMotif runs on a Tomcat server and is available at http://www.imgt.org/ligmotif/. The genomic sequence to analyse can be copied/pasted by the biocurators or uploaded in FASTA/EMBL format. Reference motif databases can be selected by gene type, locus, functionality and organism. The execution time depends on the gene types and the number of genes in the sequence: the analysis of a single gene takes a few seconds, whereas a complete locus containing more than 100 genes takes between 30 minutes and 1 hour using standard parameters.
IMGT/LIGMotif is currently used by the IMGT® biocurators to annotate, as a first step, IG and TR genomic sequences of human and mouse in new haplotypes, and those of closely related species, nonhuman primates and rat, respectively. More distant species will still require manual expertise to control the annotations. However, it is expected that the progressive enrichment of the IMGT/LIGMotif reference motif databases with IG and TR data annotated by IMGT® will save a considerable amount of time in the genomic annotation of vertebrate antigen receptor loci.
M.-P. Lefranc, V. Giudicelli, C. Ginestoux, J. Jabado-Michaloud, G. Folch, F. Bellahcene, Y. Wu, E. Gemrot, X. Brochet, J. Lane, L. Regnier, F. Ehrenmann, G. Lefranc and P. Duroux, IMGT®, the international ImMunoGeneTics information system®. Nucleic Acids Res., 37:1006-1012, 2009.
V. Giudicelli and M.-P. Lefranc, Ontology for Immunogenetics: IMGT-ONTOLOGY. Bioinformatics, 15:1047-1054, 1999.
P. Duroux, Q. Kaas, X. Brochet, J. Lane, C. Ginestoux, M.-P. Lefranc and V. Giudicelli, IMGT-Kaleidoscope, the Formal IMGT-ONTOLOGY paradigm. Biochimie, 90:570-583, 2008.
M.-P. Lefranc and G. Lefranc, The Immunoglobulin FactsBook. Academic Press, 2001, pp. 1-458.
M.-P. Lefranc and G. Lefranc, The T Cell Receptor FactsBook. Academic Press, 2001, pp. 1-398.
J. Lane, P. Duroux and M.-P. Lefranc, From IMGT-ONTOLOGY to IMGT/LIGMotif: the IMGT® standardized approach for immunoglobulin and T cell receptor gene identification and description in large genomic sequences. BMC Bioinformatics, 11:223, 2010.
X. Brochet, M.-P. Lefranc and V. Giudicelli, IMGT/V-QUEST: the highly customized and integrated system for IG and TR standardized V-J and V-D-J sequence analysis. Nucleic Acids Res., 36:W503-W508, 2008.
Transposition detection using NGS approaches in Asian Rice
Abstract: The advent of next-generation sequencing and the sharp decrease in the cost per sequenced base allow us to analyse precisely the genomic variability within a plant variety, and possibly within an individual.
The Asian rice Nipponbare was the first grass genome to be sequenced and assembled, in 2000. It has been widely analysed for its physiological and genomic features, including its transposable element content. It has been demonstrated that callus tissue culture of this variety (and the subsequent regeneration of plants) triggers the reactivation of at least the Tos17 and Lullaby LTR retrotransposons, the Ping/Pong MITE system, and the Karma LINE. All these elements were identified through positional cloning of induced mutations or through the screening of candidates produced by a transcriptomic approach.
We have used a novel resequencing approach to tackle the identification of actively transposing elements. We re-sequenced a specific regenerated individual from the Nipponbare variety, using 36-mer paired-end Solexa/Illumina technology at a low coverage of 4.2x. Using two different approaches for the detection of transpositions (one based on the identification of movements of specific TEs, using MAQ, and one based on the 'anonymous' detection of incongruous mappings, using Bowtie), we were able to identify 11 new families of elements active in Asian rice, which were validated at the wet bench. Here we present the two approaches comparatively, their advantages and drawbacks, as well as the limits of the analysis itself.
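The 'anonymous' approach can be illustrated by a minimal sketch of discordant-pair detection: paired-end reads whose mates map abnormally far apart, or to different chromosomes, flag candidate transposition sites. The function name, thresholds and coordinates below are illustrative, not those of the actual MAQ/Bowtie pipeline:

```python
def incongruous_pairs(pairs, expected_insert=300, tolerance=100):
    """Return read pairs whose mapped distance deviates from the
    expected insert size, or whose mates map to different chromosomes.

    Each pair is (chrom1, pos1, chrom2, pos2).
    """
    hits = []
    for chrom1, pos1, chrom2, pos2 in pairs:
        if chrom1 != chrom2 or abs(pos2 - pos1) > expected_insert + tolerance:
            hits.append((chrom1, pos1, chrom2, pos2))
    return hits

pairs = [
    ("chr1", 1000, "chr1", 1290),  # concordant pair
    ("chr1", 5000, "chr1", 9000),  # stretched: possible insertion between mates
    ("chr1", 2000, "chr7", 800),   # inter-chromosomal: possible transposition
]
candidates = incongruous_pairs(pairs)
```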
CSPD: a database and search engine for carbonylated proteins
Abstract: Protein carbonylation is an irreversible oxidative modification that often leads to a loss of protein function, and is a widespread indicator of severe oxidative damage and disease-derived protein dysfunctions. We present CSPD (Carbonylated Site and Protein Detection), a web-based and freely accessible resource developed with the aim of predicting carbonylated proteins. The current release contains more than 709,000 putative carbonylated proteins (CPs) (~23.5% of the analyzed proteins) from complete microbial genomes. The CSPD database provides a complete overview of carbonylation information for each of these proteomes, as well as a detailed annotation of each CP with its predicted HSC (Hot Spot of Carbonylation) motifs. The CSPD database can be queried with specific keywords or gene names, and is accompanied by a search engine that allows the user to predict HSC motifs in a protein sequence. With this ongoing project, we aim to provide carbonylation data that will facilitate the understanding of the carbonylation process.
Development of a knowledge-based system for analysing the effects of single nucleotide polymorphisms on protein function
Abstract: Motivation: Understanding the cascade of events leading to human genetic diseases is a major goal of biomedical research, in order to support the development of diagnostics and effective therapeutic solutions. A key issue in this research area is the ability to understand and predict the effects of genetic variation on the phenotype of an individual. Here, we use Inductive Logic Programming (ILP) to characterize and predict deleterious/neutral mutations in the context of SM2PH-db (“from Structural Mutation to Pathology Phenotypes in Human” database).
Results: After using ILP for learning, we obtain classification rules that can be interpreted by human experts and help provide a better understanding of the relationships between genotypic and phenotypic features. The results also show that the proposed method can be applied to predict the impact of single amino acid replacements on the function of a protein with high sensitivity and specificity.
Availability: The rules and the estimated effects of human non-synonymous polymorphisms on protein function are available at http://decrypthon.igbmc.fr/sm2ph/cgi-bin/prediction. The data set and Prolog code can be downloaded from http://decrypthon.igbmc.fr/sm2ph/cgi-bin/mutation_ilp.zip.
SNiPlay, a web application for SNP analysis
Abstract: SNiPlay is a web-based tool dedicated to SNP discovery and polymorphism analysis.
It integrates a pipeline, freely accessible through the Internet, combining existing software to perform different kinds of analyses. From allelic data, alignments or sequencing traces given as input (notably using the Polymorfind program), it detects SNPs and insertion/deletion events. It then sends sequences and allelic data into an integrative pipeline able to carry out successive steps:
- Mapping and annotation (genomic position, intron/exon, synonymous/non-synonymous)
- Haplotype reconstruction (using Phase, Gevalt or ShapeIT)
- Haplotype network (Haplophyle)
- Linkage Disequilibrium
- Diversity analysis (SeqLib library)
SNiPlay is flexible enough to easily incorporate, in the future, new modules able to manage other kinds of analyses (kinship, association studies, population structure…).
It also includes a database to store polymorphism and genotyping data produced by projects conducted on grape. A private access allows collaborators to explore the data: SNP retrieval using various filters, comparison of SNPs between populations or with other external information on accessions, export of genotyping data in different formats…
This database can collect and combine genotyping data coming from sequencing projects as well as data obtained with Illumina chips, and provides a section facilitating comparison between experiments.
Towards a multi-scale and formalized representation of protein sequence-structure-function relationships – the nsLTP family as a case study
Abstract: Understanding complex biological processes requires managing many complex data sets. In such a context, data integration appears as a key for modern biology. We are currently developing a new information system that will propose a multi-scale and formalized representation of sequence-structure-function relationships. Our implementation will be generic enough to be used for any protein family involved in any given biological process. The nsLTP (non-specific lipid transfer protein) family is, however, a particularly interesting first use case for the validation of our methodology. In the context of this work, we first studied the primary, secondary and tertiary structures of the nsLTP family. The nsLTP study set is composed of plant proteins from 6.5 to 10.5 kDa containing eight cysteine residues arranged in the so-called “8CM” pattern. 606 distinct mature amino acid sequences of nsLTPs belonging to about 100 plant species were retained for analysis. Using multiple alignment and phylogenetic tools, the amino acid sequences were aligned and clustered into 9 different types, such that within each type the sequences share at least 30% identity with each other. Using a comparative modeling method, we developed a modeling pipeline that allowed us to calculate reliable three-dimensional models for each nsLTP, with and without ligand. The nsLTPs were clustered according to their 1D, 2D (considering disulfide bridges and helices) and 3D structures respectively, and the different classifications were compared with one another. Our results show correlations that will allow us to infer new hypotheses about the lipid-binding characteristics of the nsLTPs.
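The grouping into types sharing at least 30% pairwise identity can be sketched as a greedy single-linkage clustering. The toy identity function below assumes pre-aligned, equal-length sequences and only illustrates the criterion, not the phylogenetic tools actually used:

```python
def percent_identity(a, b):
    """Toy identity for pre-aligned, equal-length sequences."""
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)

def cluster_by_identity(seqs, threshold=30.0):
    """Greedy single-linkage clustering: a sequence joins the first
    cluster containing a member with >= threshold % identity."""
    clusters = []
    for s in seqs:
        for cluster in clusters:
            if any(percent_identity(s, m) >= threshold for m in cluster):
                cluster.append(s)
                break
        else:
            clusters.append([s])
    return clusters

clusters = cluster_by_identity(["AAAA", "AAAT", "GGGG"])
```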
Topological characteristics of the functionalization process for duplicated genes in PPI networks of Arabidopsis thaliana
Abstract: Gene duplication is readily accepted as a primary mechanism for generating organismal complexity. However, the mechanisms responsible for the maintenance of duplicated genes at the genome scale are still poorly understood. Analysis of biological networks can help us better understand which evolutionary forces act on duplicated genes, because their interacting context is taken into account. Our purpose is to tackle the following questions: How does a biological network evolve after duplication of all or part of its constituents? When a network is duplicated, which part is “neo-functionalized” or “sub-functionalized”? Can the maintenance of a pattern in a genome after duplication be explained by the duplication mechanism involved or by the biological function category? How much can duplicated networks differ in topology, and does this depend on the duplication mechanism or on the function?
Taking advantage of knock-out data available in the literature, we have topologically analysed a protein-protein interaction network in Arabidopsis thaliana. We found that in this network the evolutionary process of functionalization can be characterized.
A strategy for the identification and annotation of new non-ribosomal synthetases from bacterial genomes.
Abstract: Non-ribosomal peptide synthetases (NRPSs) are large multi-enzymatic complexes responsible for the synthesis of peptides called non-ribosomal peptides, because they are not produced by the classical pathway (transcription of DNA into RNA, then translation of that RNA into protein). Synthetases are organized into modules, each module being responsible for the incorporation of one amino acid into the final peptide. Modules are themselves subdivided into domains, which carry the enzymatic activities required for the incorporation or modification of the amino acids. A detailed study of the active site of the domains, together with their organization within the modules, makes it possible to predict the peptide produced by the synthetase. A few tools dedicated to synthetases allow this prediction. To our knowledge, six tools have been developed, four of which can be queried through a web interface (PKS/NRPS Analysis Web-site, Bachmann and Ravel, Methods Enzymol. 2009; NRPS-PKS, Ansari et al., NAR 2004; NRPSpredictor, Rausch et al., NAR 2005; and NP.searcher, Li et al., BMC Bioinformatics 2009), one is a client downloadable on request (ClustScan, Starcevic et al., NAR 2008) and the last is a downloadable set of Perl programs (Clusean, Weber et al., J. Biotechnol. 2009). There are as many different synthetases as there are non-ribosomal peptides. Norine, the reference database for these peptides, lists more than 1,000 of them.
A systematic search for genes encoding non-ribosomal synthetases in the genomic sequences available in sequence databanks was carried out on several bacterial genera and species. This work highlights the poor quality of automatically generated annotations.
To identify proteins likely to be synthetases, three strategies were followed in parallel: a keyword search within the annotations available in the databanks, an analysis of large proteins, and a BLASTP search using the protein sequence of a synthetase as the query. The proteins found carry various annotations. Some are vague, such as "putative NRPS", "amino-acid adenylation domain" or "AMP-dependent synthetase and ligase". When the annotation remains vague, it is correct, even though it is unfortunately not standardized. By contrast, when the annotation is more precise, it is generally wrong. For example, one protein is annotated as "dimodular NRPS" although it does not even contain one complete module. Some NRPSs are also found as "putative protein" or "hypothetical protein". This observation holds for all the species tested. Moreover, UniProtKB/TrEMBL, the automatically annotated protein databank, contains many proteins annotated as producing a given peptide when their product is in fact quite different.
In this work, an approach is therefore proposed to limit these sources of error, and its application to the sequenced genomes of the Erwiniae is presented. The sequences of the potential synthetases were systematically analysed with the dedicated tools NRPS-PKS analysis website, NRPS-PKS and NRPSpredictor. This work made it possible to reconstruct the modules that compose the synthetases, to predict the amino acids potentially incorporated by each module, and to predict the activity of the synthetase product using Norine. Finally, analysis of the genomic context of the NRPSs allows the reconstruction of clusters involved in the biosynthesis of a non-ribosomal peptide already known in another species.
In conclusion, the growth in the amount of data available in the databanks can come with a decrease in the quality of the information provided. This observation is particularly striking for genes encoding modular enzymes of the NRPS type. A finer analysis leading to their annotation is necessary, and a strategy for it is proposed in this work.
ProteoScan-DB: an open-source pipeline for automatic validation of phosphopeptides from CID MS spectra
Abstract: Phosphorylation is one of the post-translational modifications (PTMs) most involved in cell signaling, and phosphoproteomics has become the tool of choice for such investigations. Combined with the power of mass spectrometry, the size of the data sets generated for protein phosphorylation analysis is continuously increasing. However, the identification of most phosphoproteins often depends on a low number of phosphopeptides, often even on a single one. In addition, the majority of phosphopeptides in LC-MS/MS CID fragmentation data sets remain unassigned because of the low quality of the MS/MS spectra, which contain a reduced number of sequence-specific ions. After identification against databases, all the peptides therefore need to be validated to avoid false positives, usually by a manual in-depth analysis. As such an approach is very time-consuming and operator-dependent (scientist experience), the process cannot be performed with confidence for large sets of MS/MS spectra. In silico analysis, using portable software that allows both peptide validation and assignment of phosphorylation site localization, thus becomes a logical alternative. Although the field was initiated several years ago, presently available software is tied to specific platforms or application fields and requires specific input data formats.
Our purpose was to develop a versatile phosphopeptide validation tool based on a scheme that closely mimics the approach used when manually interpreting MS/MS spectra. Such a validation scheme, using a weighted scoring of each step that follows expert criteria to validate phosphopeptides, was proposed by Schlosser et al. (Schlosser et al. 2007, Anal Chem, 79, 7439-7449), but was restricted to pure phosphoproteins. We used the same rationale to write a new scoring Perl script adapted to large-scale and complex proteomic data sets. After querying with the Mascot search engine against both target and decoy databases to provide a false-positive rate, the software uses the Mascot identification URL as input data (typically at less than 1% false-positive rate). For peptides validated both on the FDR basis and on the implemented expert checking, a final step addresses the location of phosphorylation sites within the peptides. For this purpose, we used the probability-based Ascore algorithm, made available for any platform through the PhosCalc software (McLean et al. 2008, BMC, 1, 1-9). This integrated data-handling pipeline, which uses flexible and platform-independent parameters, was named ProteoScan. Coupled to a MySQL database, where the output results (available as a csv file) can be sent for further retrieval, the ProteoScan-DB tool was assessed using more than 600 manually validated phosphopeptide sites from the Arabidopsis thaliana plasma membrane, and was shown to allow automatic analysis of large-scale phosphoproteomes.
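The target/decoy false-positive rate mentioned above is conventionally estimated as the ratio of decoy to target hits passing a score threshold; a minimal sketch with illustrative scores:

```python
def decoy_fdr(target_scores, decoy_scores, threshold):
    """Estimate the false discovery rate at a score threshold as
    (# decoy hits above threshold) / (# target hits above threshold)."""
    targets = sum(1 for s in target_scores if s >= threshold)
    decoys = sum(1 for s in decoy_scores if s >= threshold)
    return decoys / targets if targets else 0.0

# One decoy hit (65) and three target hits (60, 70, 80) pass the threshold.
fdr = decoy_fdr([50, 60, 70, 80], [20, 65], threshold=60)
```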
Anatomy of druggable pockets and associated ligands.
Abstract: Concepts and knowledge about ligand-binding sites have evolved significantly over the last 40 years. Today, pockets can be predicted to some extent, and improved investigations of "druggable" pockets are essential not only to assist the hit identification and optimization steps but also to predict binding to anti-targets and off-targets.
In this context, we decided to revisit pocket-ligand pairs with a recent, high-resolution set of protein pockets in complex with drug-like ligands. We analyzed 56 high-resolution structures and, to ensure that our analysis could be extended to other proteins, we also investigated 564 protein-ligand complexes. Statistical analysis sheds light on pocket properties, ligand properties and the correlations between the two spaces.
VectorBase, a home for invertebrate vectors of human pathogens
Abstract: Tropical diseases such as malaria, filariasis or trypanosomiasis are transmitted by vector species directly through blood meals. Many of these vector species are attracting more interest, having become more frequent in Northern-hemisphere countries, and several of them have now been sequenced and annotated.
VectorBase [http://www.vectorbase.org] is a Bioinformatics Resource Centre responsible for the storage, organisation and updating of these data, mainly for the mosquitoes, the tick and the body louse.
Three types of data are served: (i) genomic data, including gene predictions, DNA/protein similarities, expression and comparative data; (ii) controlled vocabularies and ontologies, encompassing anatomy, physiological processes and insecticide resistance; and (iii) population genomic data.
Data can be visualised via a genome browser, queried on a large scale using BioMart, or simply downloaded as flat files. Tools have been developed to help users mine the data: the BLAST program allows users to compare their sequences against DNA or protein sequences, and the HMMER suite has been installed, which, given a sequence alignment, builds an HMM that can then be used to query protein data sets.
VectorBase maintains a strong link with its community. Users are highly encouraged to participate in the generation and interpretation of data, or in a better representation of data within the public domain. A Community Annotation Pipeline has been developed to help users submit their annotations, which are then incorporated into VectorBase. Such community input has proven to greatly enhance our data sets.
Assembly of bacterial genomes sequenced by NGS: comparison of tools and choice of parameters.
Abstract: Next-Generation Sequencing (NGS) methods make it possible to obtain, quickly and at low cost, genomic sequences of new strains or new individuals. A range of assembly tools taking into account the specificities of this type of data (very numerous short reads) has been developed over recent years. The proliferation of tools and their evolution, together with the constant changes in sequencing techniques, make it essential to compare the various assembly strategies on a common data set.
Using a reference sequence makes it possible, through mapping, to quickly assemble and annotate newly sequenced closely related strains. However, this method only provides information on the regions shared between the strain of interest and the reference. Reconstructing the regions specific to the new strain (isolated genes and genomic islands) requires a complementary de novo assembly approach. This is the task addressed here, and in particular we wish to compare two strategies:
- performing a de novo assembly of all the reads, thus reconstructing contigs representing the whole sequenced genome (including the specific regions);
- performing a de novo assembly of only the reads rejected during the reference-based assembly step, thus reconstructing the specific regions of interest as contigs.
All the analyses use six data sets from Solexa/Illumina sequencing of six strains of the bacterium Flavobacterium psychrophilum. Each consists of 10 million reads of 100 base pairs, ensuring a theoretical maximum coverage of more than 200x (FLAVOPHYLOGENOMICS project, coordinator Eric Duchaud). The reference-based assembly relies on the availability of a complete annotated sequence for two of the strains studied. The evaluation of the different strategies uses a manually validated comparison of the gene repertoires of these two strains.
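The theoretical coverage quoted above follows from the standard calculation coverage = (number of reads × read length) / genome size; the ~2.9 Mb genome size used below is an approximation for F. psychrophilum, not a figure from the abstract:

```python
def theoretical_coverage(n_reads, read_length, genome_size):
    """Expected per-base coverage: total sequenced bases / genome size."""
    return n_reads * read_length / genome_size

# 10 million reads of 100 bp over an ~2.9 Mb genome gives well over
# the 200x quoted above.
cov = theoretical_coverage(10_000_000, 100, 2_900_000)
```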
For this study, we consider the software suites Velvet for de novo assembly and MAQ for reference-based assembly.
Beyond the comparison of the two strategies for assembling the specific regions mentioned above, we also evaluated different read-cleaning methods, a step made necessary when the de novo assembler (here Velvet) does not take into account the quality information of the NGS data (contained in the Fastq sequence files).
In addition, an analysis is proposed to determine the minimum coverage of an Illumina/Solexa sequencing run required for a de novo assembly aimed at detecting new genes.
The next step in this work will be to compare different tools (SOAPdenovo, Mira, etc.) for the assembly of bacterial genomes.
Development and optimization of metagenomic analyses
Abstract: Metagenomic studies by high-throughput sequencing generate a substantial amount of data to analyse. The sequences have to be compared to a reference data bank for identification, and the resulting alignments then have to be interpreted. We improved this analysis methodology by studying sequence filtering, building new data banks and optimizing software configuration, in order to extract as much information as possible from the initial raw data.
The UniProt Knowledgebase and cross-references.
Abstract: The UniProt Knowledgebase (UniProtKB) is the central access point for extensively curated protein information. UniProtKB is a protein-centric, non-redundant database aiming to provide everything that is known about a protein. UniProtKB provides an integrated and uniform presentation of disparate data, including annotations such as protein name and function, taxonomy, enzyme-specific information (catalytic activity, cofactors, metabolic pathways, regulatory mechanisms), domains and sites, post-translational modifications, subcellular locations, tissue- or developmentally-specific expression, interactions, splice isoforms, diseases, and sequence conflicts. Most experimental evidence comes from literature citations.
The UniProtKB contains two sections. UniProtKB/Swiss-Prot contains records with full manual annotation (or computer-assisted, manually-verified annotation) performed by biologists and based on published literature and sequence analysis. UniProtKB/TrEMBL contains computationally generated records enriched with automatic classification and annotation.
Entries in UniProtKB are connected to various external data collections divided into 14 categories, such as the underlying DNA sequence entries, protein structure databases, protein domain and protein family databases, and species-specific and function/feature-specific data collections. As a result, UniProtKB acts as a central hub connecting biomolecular information archived in 126 cross-referenced databases (release 2010_07, 13th June 2010). A document describing all of the databases that we cross-reference is available on the UniProt FTP site (1). Statistics about cross-references (Xrefs) are computed at each release for both sections and can be found on the release statistics summary web page (2).
About a third of the databases are updated at each release: external databases provide UniProtKB with a mapping file linking their stable identifiers to UniProtKB primary accessions, and as soon as a new file is provided, it is integrated into UniProtKB.
Some Xrefs are very specific and involve only a small number of entries; for example, ArachnoServer links 368 TrEMBL and 456 Swiss-Prot entries. On the other hand, some databases have a huge number of Xrefs; for example, GO links 6,865,647 TrEMBL and 484,004 Swiss-Prot entries, out of a total of 11,109,684 entries for TrEMBL and 517,802 for Swiss-Prot (release 2010_07, 13th June 2010). Overall, there is an average of 11.25 Xrefs per entry for TrEMBL and 25.50 per entry for Swiss-Prot.
Cross-references are a vitally important way to improve the information we have about proteins without duplicating the work done by different databases. Data about a specific cross-reference can be retrieved directly using the search form on the UniProt web site. The ID-mapping web page allows you to retrieve a mapping file of UniProtKB accessions and external database identifiers, for a list of UniProtKB accessions or of external identifiers. Programmatic access to the database mapping service is also available using Perl, Python, Ruby or Java (3).
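The ID-mapping files mentioned above are plain tab-delimited text; a minimal parser sketch, assuming a three-column layout (UniProtKB accession, database name, external identifier) with a hypothetical function name:

```python
import io

def parse_idmapping(handle):
    """Parse a tab-delimited ID-mapping stream into
    {uniprot_accession: [(database, external_id), ...]}.
    Assumes three tab-separated columns per line."""
    mapping = {}
    for line in handle:
        acc, db, ext_id = line.rstrip("\n").split("\t")
        mapping.setdefault(acc, []).append((db, ext_id))
    return mapping

# Illustrative two-line mapping for a single accession.
sample = "P12345\tGO\tGO:0005515\nP12345\tPDB\t1ABC\n"
mapping = parse_idmapping(io.StringIO(sample))
```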
(2.1) http://www.expasy.org/sprot/relnotes/relstat.html for SwissProt division
(2.2) http://www.ebi.ac.uk/uniprot/TrEMBLstats/ for TrEMBL division
Chado Controller: a supervisor for annotation confidentiality, quality and tracking
Abstract: In the context of genome sequencing, various annotation systems have been developed, such as the one proposed by the international Generic Model Organism Database (GMOD) project. The GNPAnnot project aims to develop a community annotation system (CAS) for plants, fungi and insects based on GMOD components. The Chado Controller is one of the contributions of the GNPAnnot project; it complements the core of the GMOD CAS: the Chado database, the GBrowse genome browser and the Apollo or Artemis editors.
The Chado Controller, which manages confidentiality, improves the quality of manual annotations and tracks them, comprises three modules: restriction of user access rights to features, the annotation inspector, and the history of manual annotations. The access-rights module not only restricts a user's access to all or part of the data but also sets the access level (read-only or write). To improve annotation quality, the inspector relies on controlled vocabulary and eases manual annotation by checking the curator's work. It reports annotation inconsistencies with respect to a set of rules (e.g. incorrect structure, missing or ill-defined property) and automatically updates certain fields when the annotator's changes are saved (e.g. the annotator's name). Finally, the history module keeps track of all changes made to annotations, whatever the type of modified feature (e.g. gene, transposable element) and whatever editor was used. The annotation history of a feature can be displayed on a web page.
The Chado Controller, intended for Chado (PostgreSQL) databases, mainly consists of SQL scripts embedded in the database, thus guaranteeing overall control of the data. Secondary parts of the Chado Controller have been integrated into GBrowse 1.70 (login window, history page) and Artemis (use of the annotation inspector). Integrating the inspector into a software tool such as Apollo is as simple as issuing SQL queries from it. Finally, the Chado Controller is backward-compatible with existing tools, and its various modules can be enabled or disabled independently.
The Chado Controller of the GNPAnnot project is already in use on six CASs installed on the South Green platform (a bioinformatics platform in Montpellier): sorghum, banana, sugarcane, palm, cocoa and coffee. It has already enabled quality manual annotation, confidentially via the Web, of 964 genes and 479 transposable elements (TEs), which can easily be exported in various formats using an extractor.
A dynamic synteny viewer for microbial genomes
Abstract: GenoList (http://genolist.pasteur.fr/GenoList) is an integrated environment for the representation and analysis of microbial genomes, relying on a relational database and operated through a web interface. The current version integrates 750 bacterial genomes, most of them from Genome Reviews (EBI), and enriched annotations for a few selected organisms. Comparative genomics data are pre-computed and associated with the primary annotations.
The Synteny Viewer is a Flash application (or flashlet) written in AS3 that allows navigation through the genomic and comparative data and displays local or global synteny. Synteny information is computed from the correspondence between proteins of different organisms (bidirectional best hits) and from the conservation of the order of the corresponding genes on the respective genomes.
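The bidirectional best hit criterion underlying this synteny computation can be sketched as follows: a pair of proteins is retained only when each is the other's best-scoring match. The gene names and best-hit tables below are illustrative:

```python
def bidirectional_best_hits(best_a_to_b, best_b_to_a):
    """Return pairs (a, b) such that b is a's best hit in proteome B
    and a is b's best hit in proteome A."""
    return [(a, b) for a, b in best_a_to_b.items()
            if best_b_to_a.get(b) == a]

best_a_to_b = {"geneA1": "geneB1", "geneA2": "geneB2"}
best_b_to_a = {"geneB1": "geneA1", "geneB2": "geneA3"}  # geneB2 prefers geneA3
pairs = bidirectional_best_hits(best_a_to_b, best_b_to_a)
```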
Two access modes to the Synteny Viewer are offered. The user can choose a pivot genome and browse it with a Genome Viewer integrated into the module (genome navigation, direct access to positions, search by gene name, zoom, etc.), then add maps of genomes to compare. The Synteny Viewer can also be accessed from the information page of a particular gene in GenoList, which then becomes the pivot gene for the comparison with the other genomes selected in GenoList.
The module offers a set of features for dynamically visualising inter-genome relationships. Besides moving along the pivot genome, it is notably possible to select any gene as the pivot for the comparison, to swap the positions of the comparative maps, to adjust the horizontal and vertical scales of the maps, and to obtain information on orthology links, with a link back to GenoList and direct access to the sequence alignment.
Sub-maps can be attached to the genome maps, such as GC% tracks or SNPs obtained by high-throughput sequencing.
Dynamic global views are also available, such as dot plots, line plots and phylogenetic profiles, in which proteomes are compared pairwise. In the dot plot (one genome on the x-axis, one on the y-axis), the user can graphically select orthologs and export them as a tab-delimited file.
The main appeal of the Synteny Viewer module lies in the interactivity afforded by the Flash technology, while remaining tightly integrated with the general features of GenoList. A stand-alone AIR version, not tied to the database, is also available and lets users work on their own genomes of interest. Future developments will target the handling of incomplete genomes.
PALMapper: Fast and Accurate Spliced Alignments of RNA-seq Reads
Abstract: Genome and transcriptome sequencing are undergoing a profound renewal with the advent of Next Generation Sequencing (NGS) technologies. In particular, the short mRNA sequences produced by RNA-Seq enhance transcriptome analysis and promise great opportunities for the discovery of new genes and the identification of alternative transcripts. One way to analyze these data is to align the reads against a reference genome. However, the sheer amount of NGS data requires highly efficient methods for accurate spliced alignment, a task further challenged by the size and quality of the sequenced reads.
We propose a combination of the spliced alignment method QPALMA [1] with the short read alignment tool GenomeMapper [2]. The resulting method, called PALMapper, efficiently computes both spliced and unspliced alignments at high accuracy, taking advantage of base quality information and (optionally) splice site predictions. QPALMA relies on a machine learning strategy and is highly sensitive and adaptive to the error characteristics of the reads. It suffers, however, from the time consumed by its Smith-Waterman alignment step. To speed this up and thus improve efficiency, we combined it with GenomeMapper, which quickly carries out an initial read mapping. The resulting partial alignments, information from previously mapped reads, and (optionally) computational splice site predictions then guide a very efficient banded glocal algorithm that allows for long gaps corresponding to introns. PALMapper considerably reduces time consumption without decreasing accuracy compared to QPALMA: it runs around 50 times faster and aligns around 7 million reads per hour on a single AMD CPU core (a speed similar to, for instance, TopHat [3]). PALMapper only requires a few matching seeds to trigger an alignment, which may then contain arbitrarily many mismatches or gaps. It is therefore also well suited for reads from other technologies (e.g., Pacific Biosciences) that exhibit many deletions, on which many other mapping approaches are bound to fail. We illustrate that PALMapper indeed benefits from computational splice site predictions. Nevertheless, PALMapper can also be used without any splice site information and thus allows the discovery of novel introns.
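The spliced placement of a read across an intron can be illustrated by a deliberately naive sketch: exact matching of a read split around one canonical GT..AG intron. PALMapper's actual banded, mismatch-tolerant algorithm is far more elaborate; the function and parameters here are illustrative only:

```python
def naive_spliced_align(read, genome, min_intron=20, max_intron=1000):
    """Naively place a read across one intron.

    For every split point in the read, look for an exact occurrence of
    the prefix, then an exact occurrence of the suffix downstream,
    separated by a gap flanked by canonical GT..AG splice sites.
    Returns (prefix_start, donor, acceptor) triples.
    """
    hits = []
    for i in range(1, len(read)):
        prefix, suffix = read[:i], read[i:]
        start = genome.find(prefix)
        while start != -1:
            donor = start + len(prefix)          # first intronic base
            if genome[donor:donor + 2] == "GT":  # canonical donor site
                for intron_len in range(min_intron, max_intron + 1):
                    acc = donor + intron_len     # first base after the intron
                    if (genome[acc - 2:acc] == "AG"   # canonical acceptor site
                            and genome.startswith(suffix, acc)):
                        hits.append((start, donor, acc))
            start = genome.find(prefix, start + 1)
    return hits
```

On a toy genome with a 20 bp GT..AG intron between two exons, the concatenated exonic read is placed with the correct donor and acceptor positions. Replacing the exact prefix/suffix matches with banded, quality-aware alignments is what makes the real method both fast and tolerant to mismatches.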
We performed an extensive comparison with other spliced alignment
methods. Our study shows that PALMapper predicts introns with very
high sensitivity (72%) and specificity (82%) when using the annotation
as ground truth. PALMapper is considerably more sensitive than, for
instance, TopHat (47% and 81%, respectively).
PALMapper is open source and available from http://fml.mpg.de/raetsch/suppl/palmapper. Moreover, it can be used in the Galaxy instance available at http://galaxy.fml.mpg.de, in combination with other tools for transcriptome reconstruction.
[1] F. De Bona, S. Ossowski, K. Schneeberger, and G. Rätsch. Optimal spliced alignments of short sequence reads. Bioinformatics, 24(16):i174–180, August 2008.
[2] K. Schneeberger, J. Hagmann, S. Ossowski, N. Warthmann, S. Gesing, O. Kohlbacher, and D. Weigel. Simultaneous alignment of short reads against multiple genomes. Genome Biology, 10(9):R98, 2009.
[3] C. Trapnell, L. Pachter, and S.L. Salzberg. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics, 25(9):1105–11, 2009.
ARPAS: software for managing collections of biological resources
Abstract: ARPAS (ARiane PASteur) is an intranet Web application designed for preserving and managing a wide variety of data on the numerous strains maintained in the various Pasteur collections gathered within the Biological Resource Center of the Institut Pasteur (CRBIP). The collections concerned include bacteria, cyanobacteria, fungi and viruses.
Initially developed by an IT services company, ARPAS is maintained and continuously improved by a bioinformatician working full time for the CRBIP. While meeting uniformity and traceability requirements, the software centralizes and supports the successive steps of biological resource management: acquisition, characterization, preservation, production, stock localization and sale.
Reflecting these functions, ARPAS is divided into five modules:
-administration of user access rights, creation and management of collections, parameter modification, data export,
-management of the biological data of the microorganisms (catalogue data, characteristics, identification galleries, test reports, antibiograms/antifungigrams, storage, product localization, sales history, profiles, sequences, etc.),
-management of media and reagents,
-monitoring of the production cycle (planning, production, quality control, supply, labelling, etc.),
-order processing (in liaison with the Accounting and Import/Export departments of the Institut Pasteur).
Thanks to its modular architecture, the application can also be extended as needs evolve. Moreover, since its users are spread across different laboratories of the institute, ARPAS was developed as a Web application in a Java/Tomcat environment coupled to a relational database (managed with the Sybase relational database management system). The same database can be queried to browse the catalogue through the CRBIP Web site, www.crbip.pasteur.fr/.
Future developments of ARPAS should include portability, interoperability and the addition of new modules.
Control of lipase enantioselectivity by engineering the substrate binding site: an investigation using mixed molecular modelling and robotics-based path planning approaches
Abstract: The use of enzymes as biocatalysts is of great industrial interest for the preparation of chiral building blocks, especially by kinetic resolution of racemic mixtures. In this field, lipases are among the most commonly employed enzymes. Widely distributed in nature, these enzymes catalyze the hydrolysis and synthesis of a wide range of soluble and insoluble organic compounds, making them potential catalysts for many applications in the chemical, pharmaceutical and food industries. The interest of using lipases lies in their enantioselectivity, which has been shown to be modulated by reaction conditions such as the temperature or the solvent employed. Nonetheless, the structural determinants and molecular motions controlling lipase activity and selectivity are not yet fully understood.
In this context, the main objective of our work is the development of an engineering procedure for the design of enantioselective mutants. Our recent work suggested that enantioselectivity could be influenced by substrate accessibility to the buried active site of lipases such as Burkholderia cepacia lipase (BCL). To study this hypothesis further, a novel computational approach [2,3], based on motion-planning algorithms originally used in robotics, was developed in collaboration with the LAAS-CNRS and applied to investigate BCL molecular motions. Compared to classical molecular modelling techniques, this approach yields a performance gain of several orders of magnitude (hours instead of weeks) when computing accessibility pathways of substrates from the protein surface to a buried active site, as well as continuous large-amplitude protein motions in a solvent environment.
Computational simulations were used to drive the construction of small libraries of BCL variants, which led to the fast isolation of variants with a remarkable 10-fold enhancement in enantioselectivity and a 15-fold enhancement in specific activity compared to the parental wild-type BCL.
This study demonstrates the efficiency of the semi-rational engineering strategy. More generally, this fast technique could be used as a pre-filtering procedure to select a catalyst, or to accelerate the engineering of a given catalyst for a given racemate resolution.
Automated validation of phosphorylation sites by score comparison and integration of protein annotations on LC-MS/MS data.
Abstract: Phosphorylation is a mechanism involved in numerous biological processes (cell cycle, differentiation and metabolism, among others). Determining the sites and abundance of these post-translational modifications in a sample has become possible thanks to technological advances in mass spectrometry.
Mass spectrometry coupled to liquid chromatography (LC-MS/MS) on samples enriched in phosphopeptides (IMAC, TiO2) allows these phosphorylations to be identified on a large scale. With such high-throughput analyses, the overall process, which successively involves identifying the phosphorylated peptides and then validating the anchoring site, becomes tedious or even impossible to carry out manually. Several programs based on statistical methods have been developed to assign a significance score to the phosphorylation site on the peptide. The method presented in this poster compares the peptide identification scores obtained by three search engines (Mascot, Inspect, Sequest), the site assignment scores (PLS, PhosphoScore, DeltaScore) and the annotations available in databases (SwissProt) to consolidate the automatic validation of the peptide and of the position of the phosphorylation on a particular residue. Decision rules based partly on the opinions of mass spectrometry experts and partly on score values strengthen the validation, improving specificity and reducing the false positive rate.
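A toy version of such score-combination decision rules is sketched below; the thresholds and the two-out-of-three agreement rule are purely illustrative, not those tuned by the authors:

```python
def validate_site(peptide_scores, site_scores, annotated, thresholds=None):
    """Toy decision rule combining search-engine and site-localization scores.

    `peptide_scores`: dict engine name -> peptide identification score.
    `site_scores`: dict tool name -> site localization score.
    `annotated`: True if the site is already annotated in SwissProt.
    Thresholds and the agreement rule are illustrative only.
    """
    thresholds = thresholds or {"peptide": 25.0, "site": 0.75}
    engines_ok = sum(s >= thresholds["peptide"] for s in peptide_scores.values())
    sites_ok = sum(s >= thresholds["site"] for s in site_scores.values())
    # Require agreement of at least two engines and two site scorers;
    # a known SwissProt annotation relaxes the site requirement.
    return engines_ok >= 2 and (sites_ok >= 2 or (annotated and sites_ok >= 1))
```

Real rules would also weigh expert-derived criteria, but the structure (thresholds plus cross-tool agreement, softened by database annotations) is the same.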
A high throughput multi-technological Research Information Management System for the Joomla CMS: DJEEN.
Abstract: The current growth of high-throughput experimentation in biology research poses a huge challenge for storing and sharing the generated data and their annotations. Manipulating and integrating heterogeneous data types in a multidisciplinary project is an even higher hurdle, for which proper management systems must be deployed. Data annotations need to be recorded and homogenized for proper data integration, while respecting Minimum Information standards (MIAME, MIFlowCyt).
Several databases and LIMS have been proposed previously, usually dedicated to managing data from a single technology [DNA microarrays (BASE, EzArray, Longhorn Array Database), proteomics (ms_lims)]. Minimum Information standards are typically hardcoded, and users have to be retrained on new laboratory practices. This hinders interdisciplinary and translational collaborations and the transfer of knowledge among laboratories generating heterogeneous information that needs to be integrated. On the bioinformatics development side, using these databases implies understanding complex and mostly unsupported APIs and carrying out difficult administrative tasks, which are serious constraints for laboratory deployment.
In contrast with these solutions, we developed the Database for Joomla’s Extensible Engine (DJEEN), a multi-technological Research Information Management System (RIMS). It contains a complete pipeline to manage heterogeneous projects and organize experiments in a hierarchy, allows user and group rights management, features a template manager to create and manage standards (for instance Minimum Information standards), records experimental parameters, and manages multiple types of laboratory files with a single, simple system. DJEEN is also capable of managing experimental and quality-control parameters as well as clinical information. Focus has been placed on user-centric needs, such as the rapid and coherent annotation of large sets of files and data sharing with collaborators.
This tool was built as a Joomla Content Management System (CMS) component. This speeds up development by directly reusing major Joomla features, including user and rights management and the web interface. DJEEN can be deployed quickly on a web server running a database and Joomla, which meets administrators' needs for quick deployment.
DJEEN is publicly available from http://bioinformatique.marseille.inserm.fr/djeen .
PEPOP, or the design of peptides targeting discontinuous epitopes
Abstract: PEPOP is dedicated to the prediction of peptides representative of a protein (Moreau, Fleury et al. 2008). The tool has evolved and now offers 34 different methods to predict the sequences of peptides targeting discontinuous epitopes on proteins.
The peptides predicted by PEPOP can be antigenic or immunogenic, depending on their intended use. Antigenic peptides are used to replace the protein in an existing Ag-Ab interaction. Predicting immunogenic peptides makes it possible to target a particular region of the protein and to obtain antibodies directed against it after injection into a host organism.
To overcome the difficulty of measuring the performance of the proposed methods experimentally, they were evaluated theoretically. Sensitivity and positive predictive value were computed for each method by assessing the ability of the peptides to match a known discontinuous epitope. Peptides were thus predicted with the 34 methods for the 75 antigens for which the 3D structure of the complex with the antibody is available. These performances were also compared with chance, that is, with a method that would predict peptides at random among the amino acids of the protein, in order to confirm that the proposed methods perform better than random. These analyses help guide experimentalists in obtaining one or more peptides representative of a protein according to their own problem.
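The sensitivity and positive predictive value used in this evaluation can be computed from residue sets as follows (a generic sketch, not PEPOP's code):

```python
def sensitivity_ppv(predicted_residues, epitope_residues):
    """Sensitivity and positive predictive value of a predicted peptide
    against a known discontinuous epitope, both given as residue sets.

    Sensitivity = fraction of epitope residues covered by the peptide;
    PPV = fraction of peptide residues that belong to the epitope.
    """
    tp = len(predicted_residues & epitope_residues)  # true positives
    sens = tp / len(epitope_residues) if epitope_residues else 0.0
    ppv = tp / len(predicted_residues) if predicted_residues else 0.0
    return sens, ppv
```

A random baseline is obtained by drawing residue sets of the same size uniformly from the protein and computing the same two quantities.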
Bovine promoter annotation platform for the identification of transcription factor binding sites in genes involved in early pregnancy.
Abstract: Deciphering the mechanisms of early implantation of the embryo in cattle is of economic and fundamental interest, as these findings could help reduce implantation failure and also give clues to solving human sterility issues.
One possible approach is to identify precisely the transcription factors (TF) responsible for gene expression in the uterus-conceptus dialogue. TFs are proteins that interact directly with the DNA molecule by recognizing short, slightly degenerate nucleotide motifs (transcription factor binding sites, TFBS) in the promoter regions of genes.
The first issue is the accurate definition of promoter regions. In mammals, it is well known that the transcription start site (TSS) of a gene can be far from the translation start point, and that regulatory motifs can be found as far as several kb from the regulated gene. Regulatory annotation and comparison with orthologs can help identify putative transcription start regions in cases where biological evidence (ESTs, mRNAs) is not available for cow.
In the present work, we propose a combination of several bioinformatic approaches to identify regulatory motifs in cattle promoters. Now that the complete Bos taurus genome is available, we compare results from phylogenetic footprinting (footprinter) with orthologous genes from the human, mouse and rat genomes, from a genetic algorithm (GALF-P), and from statistics-based motif search (Weeder).
This pipeline will eventually be integrated into a platform allowing the analysis of datasets from high-throughput transcriptomic analyses such as microarrays and ChIP-Seq, and is currently being tested with a previously annotated promoter dataset from milk regulatory genes.
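The motif search step can be illustrated by a minimal position-weight-matrix scan over a promoter sequence. This is a generic sketch; Weeder, GALF-P and footprinter implement far more sophisticated strategies, and the matrix here is hypothetical:

```python
def pwm_scan(seq, pwm, threshold):
    """Slide a log-odds position weight matrix over a promoter sequence
    and report hits scoring above `threshold`.

    `pwm` is a list of dicts, one per motif position, mapping
    base -> log-odds score; unlisted bases get a strong penalty.
    Returns (position, matched subsequence, score) triples.
    """
    hits = []
    w = len(pwm)
    for i in range(len(seq) - w + 1):
        score = sum(pwm[j].get(seq[i + j], -10.0) for j in range(w))
        if score >= threshold:
            hits.append((i, seq[i:i + w], score))
    return hits
```

In a real pipeline the log-odds values come from aligned binding sites (e.g. a motif reported by one of the tools above) against a background base composition.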
Hierarchical classification of helical distortions related to proline
Abstract: The presence of proline in a helix can be accommodated by several types of distortions. In this study, we present a hierarchical method to classify these distortions into a limited number of canonical structures. The first level of classification is based on DSSP and differentiates “typical” from “non-typical” distortions (ratio of 0.65:0.35). These correspond to proline in contiguous helices and in helical HXnH motifs (two helices joined by n residues in a generous alpha-helical conformation), respectively. “Non-typical” distortions are further classified as a function of the dihedral angles of the linker residues. This second level of classification is equivalent to a classification based on the main chain – main chain H-bonding pattern and differentiates bulges from tight turns. The third level of classification corresponds to the sub-division of bulges and turns. It is based on the number of linker residues and on the position of proline in the second helix. Using these parameters, about 85% of “non-typical” proline distortions are described by only five canonical structures. This robust method can be implemented on a large scale and will help develop predictive tools for molecular modeling.
MGX – Montpellier GenomiX: a genomics service platform
Abstract: The MGX platform (Montpellier GenomiX) offers the academic and industrial communities a range of genomics services: production and use of microarrays, next-generation sequencing (Illumina GAIIx and HiSeq2000), as well as quality control and statistical analysis of the microarray and sequencing data generated on the platform. Each application is handled by a specific pipeline including analysis tools of different levels: data processing and analysis, functional annotation, meta-analyses, etc. Each project is carried out within a continuous improvement framework and is driven by a quality management system.
A statistical analysis workflow for biomarker discovery in medical diagnostics
Abstract: Many problems in human health come with the generation of ever larger amounts of data, which need to be exploited and turned into actionable information. This step, which requires mastery of several disciplines, is decisive in the search for diagnostic biomarkers.
Our team has developed a methodological pipeline for the statistical analysis of such data, providing support for the discovery of new biological markers. It combines data quality control, modelling and statistical learning. Crossing the different approaches ensures the robustness and performance of the selected combination of biomarkers.
We will illustrate this statistical analysis workflow on the diagnostic problem of a complex pathology.
PhyloWeb : A dynamic web viewer for microbial population genetics
Abstract: Genotyping methods such as MultiLocus Sequence Typing (MLST) or MultiLocus VNTR Analysis (MLVA) provide a common language for bacterial strain typing.
A numerical profile is associated with each strain: this standardized format allows the exchange of knowledge on the geographical and temporal distribution of strain types for evolution and epidemiology studies. This data sharing is facilitated by databases and web interfaces (e.g. http://www.pasteur.fr/mlst, http://www.pasteur.fr/mlva).
We developed an interactive web application to visualize and analyze the relationships between profiles and the cognate strains, using the Minimum Spanning Tree method (MST).
Each profile is represented by a vertex in the tree and contains one or more strains. The edges of the tree are the relationships between two profiles, i.e. the number of differences between those profiles.
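A minimal version of this construction uses Prim's algorithm on the complete graph of profiles, with the number of differing loci as the edge weight (an illustrative sketch, not PhyloWeb's ActionScript implementation):

```python
def hamming(p, q):
    """Number of loci at which two allelic profiles differ."""
    return sum(a != b for a, b in zip(p, q))

def minimum_spanning_tree(profiles):
    """Prim's algorithm on the complete graph of profiles, weighted by
    the number of differing loci. Returns a list of (i, j, dist) edges.
    """
    n = len(profiles)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        best = None
        for i in in_tree:
            for j in range(n):
                if j not in in_tree:
                    d = hamming(profiles[i], profiles[j])
                    if best is None or d < best[2]:
                        best = (i, j, d)
        in_tree.add(best[1])
        edges.append(best)
    return edges
```

For MLST data each profile is the tuple of allele numbers at the typed loci; clonal complexes then correspond to connected groups of edges with distance 1.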
PhyloWeb allows the user to explore several features of the population:
- Clusters of profiles within a given distance (1 by default; a parameter chosen by the user), which form a clonal complex, can be highlighted with a specific color.
- The size of the vertices is weighted according to the number of strains in the profile.
- Different features related to each strain (e.g. phenotypes, environmental or clinical origin, etc.) can be visualized using colored pie charts. For instance, if one wants to analyze the distribution of strain hosts, vertex wedges can be colored according to this feature. In that way, it is easy to spot a prevailing host origin in one profile containing many strains.
- A selection tool is available, allowing the user to select a subset of the data. That selection can be made graphically by drawing a rectangle. It can also be performed using a multi-criteria filter. In both cases, a textual list is created with all the resulting strains. That list can be exported.
- An image of the generated tree can be exported in png format.
PhyloWeb is a Flash application written in ActionScript 3. This enables a great level of interactivity for the end-user. PhyloWeb was developed in a generic way and can be used to visualize MLST and MLVA data from any source. PhyloWeb is now evolving to enable the visualization of microbial diversity based on SNP (Single Nucleotide Polymorphisms), with the integration of a parsimony algorithm and genotyping data obtained by high-throughput sequencing techniques (e.g. Illumina).
Reliable identification of hundreds of proteins without peptide fragmentation.
Abstract: One of the most common approaches for large-scale protein identification is High Performance Liquid Chromatography followed by Mass Spectrometry (HPLC-MS). If more than a few proteins have to be identified, the additional fragmentation of individual peptides has been considered essential. Here we present evidence that, by combining high-precision mass measurement and modern retention time prediction algorithms (Krokhin et al. 2006) with a robust scoring scheme, hundreds of proteins can be identified without having to rely on peptide fragmentation.
Material and Methods
For each candidate protein, taken from a relevant protein sequence database, we predict the peptides resulting from the digestion of the protein and sort them according to their computed HPLC retention times. The corresponding peaks are then searched for, in the same order, in the series of spectra from HPLC-MS. For this purpose, a set of peaks with masses corresponding to those of the predicted peptides is located by a dynamic programming algorithm maximizing an alignment score. Using quantile regression, this alignment score is compared to those obtained for proteins from a suitable decoy database.
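The order-preserving matching of predicted peptides to observed peaks can be sketched as an LCS-style dynamic program. The scoring below (one point per mass match within a fixed tolerance) is a simplification of the authors' scheme:

```python
def align_predicted_to_observed(pred_masses, obs_masses, tol=0.01):
    """Order-preserving alignment of predicted peptide masses, sorted by
    predicted retention time, against observed peak masses sorted by
    measured retention time.

    Classic longest-common-subsequence dynamic programming, where two
    entries "match" if their masses agree within `tol` Da. Returns the
    alignment score, i.e. the number of matched peptides.
    """
    n, m = len(pred_masses), len(obs_masses)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = abs(pred_masses[i - 1] - obs_masses[j - 1]) <= tol
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1],
                           dp[i - 1][j - 1] + (1 if match else 0))
    return dp[n][m]
```

Because the alignment is order-preserving, a peak with the right mass but the wrong elution order contributes nothing, which is what makes predicted retention times informative.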
The method was tested on HPLC-MS data obtained from the pathogenic bacterium Francisella tularensis, for which HPLC-MS/MS fragmentation spectra were also available. Out of 1719 possible proteins in F. tularensis, we were able to detect 257 proteins at an FDR of 5.7% (the FDR was estimated from the score distribution of the proteins in the decoy database). This is 59% of the number of proteins detected by applying the Mascot tool to the fragmentation spectra from the same sample. In addition, 31 proteins were seen that had not been detected in the fragmentation data by Mascot. Results obtained with other samples will also be presented.
Previous work described the identification of proteins from MS and measured retention times (Strittmatter et al. 2003) or used the predicted retention times of peptides to filter false identifications from MS/MS (Pfeifer et al. 2009). Relying only on MS and predicted retention times, we detect many more proteins than Palmblad et al. (2004) using the same inputs. By exploiting the high accuracy now available in MS and eliminating the need to measure peptide retention times, this method could be an alternative to peptide fragmentation for protein identification in complex samples.
Wheat and barley data in GnpIS, the URGI information system
Abstract: URGI is an INRA bioinformatics unit dedicated to plant and pest genomics. We develop and maintain an information system called GnpIS for plants of agronomic interest, especially wheat and barley. GnpIS is composed of multiple modules containing wheat and barley data. Each module is specific to one type of data: GnpMap for genetic mapping data, GnpSNP for polymorphism data, GnpSeq for EST data, Siregal for genetic resources data, and GnpGenome for physical maps and some genome annotation data.
Keywords: Wheat, barley, data, tools, genomics.
As part of the TriticeaeGenome project (http://urgi.versailles.inra.fr/index.php/urgi/Projects/TriticeaeGenome), we have focused on integrating wheat and barley data into the URGI information system, GnpIS (http://urgi.versailles.inra.fr/).
To simplify data submission, we have set up a new submission format based on Excel, using a single workbook with several linked sheets. In this way there is only one file, in which the data are already linked together through Excel functions, and biologists no longer have to navigate between several files (http://urgi.versailles.inra.fr/index.php/urgi/Data/Map/Data-submission).
Several data visualization tools are available, either freely on our public site (http://urgi.versailles.inra.fr/) or with restricted access on our private site (https://gpi.versailles.inra.fr/), which is reached with a user account (http://urgi.versailles.inra.fr/index.php/urgi/Register).
The freely accessible tools cover the physical map of wheat chromosome 3B, genetic maps, QTLs and markers, genetic resources, ESTs and SNPs. The same tools exist with restricted access, with data enriched according to the user.
For example, for Triticum aestivum (wheat):
- Physical map: 1
- Genetic maps: 19
- QTLs: 504
- Markers: 9525
- Genetic resources: 2093
- ESTs: 599013
- SNPs: 53160
For Hordeum vulgare (barley):
- Genetic maps: 3
- Markers: 8918
- SNPs: 73016
From the data we have received, we can link wheat and barley information. In particular, we can connect genetic maps through shared markers, notably a wheat map and a barley map sharing a common marker, which can be visualized with our genetic map comparison tool.
This work is supported by the TRITICEAEGENOME Project.
Using R for data management in ecophysiology Information Systems
Abstract: In the Laboratory of the Ecophysiology of Plants under Environmental Stresses LEPSE (1) at INRA, Montpellier (France) three experimental set-ups allow the study of the effect of genotype x environment interactions on plant ecophysiological traits: (i) a field network for maize populations, (ii) the PHENODYN semi-automated platform for maize phenotyping (high-throughput) including two environments, a greenhouse and a growth chamber, and (iii) the PHENOPSIS automated platform for Arabidopsis thaliana phenotyping (high-throughput).
As these experimental devices generate a large amount of data of different types (numeric, images) and natures (phenotypic, environmental, genetic), information systems were developed around each device for the collection of data and metadata, their storage and organization in MySQL databases, and their extraction, visualization and analysis via Web interfaces developed in PHP and HTML (Cincalli DB (2), Phenodyn DB (3) and Phenopsis DB (4)).
We have used R and the RODBC package for data management at the different levels of the information systems. R scripts were developed to: (i) automatically insert online data issued from the platforms (growth measurements, pot weights, environmental data and irrigation data), (ii) manually insert offline datasets (such as phenotypic data measured on plants) via Web interfaces, (iii) transform datasets extracted from the databases in order to display them and make them available as downloadable files via Web interfaces, (iv) provide online tools for data visualization (environmental kinetics, growth curves) to support experiment monitoring and data exploration, and (v) perform data analyses (such as growth modelling) and computation of derived data. R scripts are either run automatically for data insertion, or called from the PHP programs of the Web interfaces for data extraction, transformation, visualization and analysis. Some of them are available for download on the Web sites.
The poster presents the three experimental set-ups, the organization of their information systems and how we have used R at the different levels of these information systems.
BioInformatic analyses of sex-determination in Tilapia (Oreochromis spp)
Abstract: Tilapias (Oreochromis spp.) are the second most important fish group in aquaculture and a primary source of animal protein for millions of people in developing countries. Indeed, tilapias have most of the qualities required in aquaculture, such as a good growth rate, large plasticity, easy domestication and resistance to diseases. Nevertheless, their early and constant reproduction, coupled with their parental behavior, leads to tank overpopulation and dwarfism of individuals. To overcome this and to benefit from the fast growth rate of males, tilapia farming relies on all-male progeny production. These all-male progenies are mostly produced by hormone-induced sex inversion. However, this method raises consumer safety and environmental concerns. New sex-controlling methods (genetic and temperature-based) are being studied, but a better understanding of sex determination in tilapia is needed. Sex determination in tilapia is complex, since sex is influenced by major genetic factors (XX/XY), minor genetic factors (on an autosome) and temperature. Past studies have positioned these genetic factors and identified the linkage groups (LG1, LG3, LG23) they are associated with. However, we lack information on which genes are involved, and we need to add markers to these linkage groups. Over the past years, a great effort has been made to expand the genomic tools available for tilapia by obtaining BAC End Sequences (BES), Expressed Sequence Tags (EST), a physical map, an RH map, and more. These tools, together with genomic data from other fish (fully sequenced genomes, genetic maps, physical maps, etc.), allow us to use a comparative genomics approach to find markers linked to sex in tilapia. Consequently, and in order to discover these genomic sex markers systematically, we are experimenting with a modeling approach based on a standard design notation, namely UML.
In this way, both genomic knowledge and in silico analysis results are expressed through sequence and use-case diagrams. These dynamic diagrams help to define a more precise and valid approach for data confrontation. Use-case-driven modelling also helps to find and chain different computational services in order to provide relevant results underpinning the study of sex determination. Data were cleaned and then underwent BLAST and BLAT analyses for annotation. In order to find more genes potentially linked to sex, using the sequences and maps available for other species, we performed a comparative genomic analysis of the tilapia BESs against the full genomes of stickleback, tetraodon and medaka, and the results were stored and visualised via GBrowse. This analysis allowed us to create a first list of genes and BESs located on regions linked to sex, and therefore potential candidate genes for sex determination. A comparison of the genetic maps of tilapia and of other fishes with the tilapia RH map was performed using the cMap software. It helped us to visualise the maps and in the future will help to fill their gaps. The next step will be to compare the various results obtained with the different analyses above against each other, to create more tools and sources of information, such as linking the physical map with the RH map, which will be done by comparing the BESs with the sequences of the RH-map markers. These comparisons will lead, first, to a better coverage of the linkage groups involved in sex determination; second, to a better understanding of which genes, known in other fish, are potentially involved in these processes; and third, to finding new genes which could improve our overall understanding of sex determination in fish.
MetExplore: a web server to link metabolomic experiments and genome-scale metabolic networks
Abstract: MetExplore is a web server that offers the possibility to place the metabolites identified in untargeted metabolomics experiments within the context of genome-scale reconstructed metabolic networks. The analysis pipeline comprises mapping metabolomics data (from masses or identifiers) onto the specific metabolic network of an organism, then applying graph-based methods and advanced visualisation tools to enhance data analysis. MetExplore stores metabolic networks and information about metabolites from about 60 organisms in a relational database. Various filters can be applied in MetExplore to restrict the scope of the study, for example by selecting only particular pathways or by restricting the network to the small-molecule metabolism.
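The mass-mapping step can be pictured with a small sketch: given the observed masses of an untargeted experiment and the monoisotopic masses of the metabolites in a network, candidates are retained within a ppm tolerance. The function names and data below are illustrative assumptions, not MetExplore's actual API.

```python
# Hypothetical sketch of the metabolomics mapping step: match observed
# masses against the monoisotopic masses of network metabolites within
# a ppm tolerance. Names and values are illustrative, not MetExplore's.

def map_masses(observed, network_metabolites, tolerance_ppm=10.0):
    """Return, for each observed mass, the candidate metabolites whose
    monoisotopic mass falls within the given ppm tolerance."""
    hits = {}
    for mass in observed:
        window = mass * tolerance_ppm / 1e6
        hits[mass] = [name for name, m in network_metabolites.items()
                      if abs(m - mass) <= window]
    return hits

# Toy network: metabolite name -> monoisotopic mass (Da)
network = {"glucose": 180.0634, "fructose": 180.0634, "citrate": 192.0270}
print(map_masses([180.0640, 250.0], network, tolerance_ppm=10.0))
```

Isomers such as glucose and fructose share a mass, so a mass query naturally returns several candidates; this ambiguity is precisely why network context helps narrow the identification.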
ABGD, Automatic Barcode Gap Discovery
Abstract: Homologous DNA sequences can be used to assign individuals to species, much as barcodes are used to identify items in a shop. Accordingly, a set of homologous sequences from many (thousands of?) individuals can be referred to as a DNA barcode dataset.
Such datasets can also be used to delineate species in organisms where the species boundaries are unknown. In that case, all sequences from the input dataset are assigned to groups that correspond to hypothetical species. One feature that can be used to delineate species is a so-called barcode gap in the distribution of pairwise differences. Indeed, when the intra-specific pairwise differences are much lower than the inter-specific ones, such a gap is observed.
Here, we present a method that automatically detects this barcode gap. All in all, the method takes a set of homologous sequences as input and classifies them into hypothetical species. The significance of such a partition is assessed from the predictions of population-genetics models. We have tested the performance of the method on 5 published barcode datasets where the species were "known". We have also assessed its theoretical limits through simulations of explicit population-genetics and speciation scenarios.
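The barcode-gap idea can be sketched in a few lines (a toy illustration under simplifying assumptions, not the authors' ABGD implementation): sort the pairwise distances, place a threshold in the middle of the largest gap, and group sequences by single linkage below that threshold.

```python
# Minimal sketch of barcode-gap detection: locate the largest jump in
# the sorted pairwise distances above a minimal intraspecific
# divergence, then partition sequences by single linkage below it.
from itertools import combinations

def pdist(a, b):
    """Uncorrected p-distance between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def barcode_gap_partition(seqs, p_min=0.01):
    dists = {(i, j): pdist(seqs[i], seqs[j])
             for i, j in combinations(range(len(seqs)), 2)}
    values = sorted(d for d in dists.values() if d > p_min) or [p_min]
    # threshold = midpoint of the largest gap in the sorted distances
    gaps = [(values[k + 1] - values[k], k) for k in range(len(values) - 1)]
    if gaps:
        width, k = max(gaps)
        threshold = (values[k] + values[k + 1]) / 2
    else:
        threshold = p_min
    # single-linkage grouping below the threshold (union-find)
    parent = list(range(len(seqs)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for (i, j), d in dists.items():
        if d < threshold:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

seqs = ["AAAAAAAAAA", "AAAAAAAAAT",   # hypothetical species 1
        "TTTTTTTTTT", "TTTTTTTTTA"]   # hypothetical species 2
print(barcode_gap_partition(seqs))
```

Here the intra-specific distances (0.1) are far below the inter-specific ones (0.9-1.0), so the gap-derived threshold cleanly splits the four sequences into two hypothetical species.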
Using ontologies for R function management
Abstract: The research of a group of laboratories studying plants requires developing many R functions to manipulate and analyse experimental data. Every year, dozens of functions are produced, concerning various fields such as genetic analyses, high-throughput phenotyping data or environmental interactions, and interacting with several databases. It is not pertinent to organise these functions in packages, and the associated documentation is heterogeneous. Furthermore, there is an important turnover of function authors and users, which generates various problems with re-use, sharing, understanding, etc. To group, share and promote these functions we have proposed a new kind of repository.
Our idea was to develop an ontology to formalise the concepts, properties and relations of our R functions. This ontology provides a controlled and structured vocabulary and offers the possibility to infer and reason. We decided to use semantic Web technologies such as the RDF (Resource Description Framework) language and RDFS (RDF Schema), which allow expressive annotations of resources. We query these annotations with SPARQL (an RDF query language) requests interpreted by Corese.
Our ontology contains two kinds of concepts and properties: the first is dedicated to function description, with concepts like "Author", "Intention", "Argument", "Value"; the second concerns relations between functions, such as "hasForRCoreCall", "canBeUsedAfter", "lookLikes", "isANewVersionOf". Properties of these relations, such as transitivity or symmetry, are then defined and used to infer wider relations between functions.
Our repository provides an environment for:
- storage and annotation: a prototype Web user interface allows authors to upload a function (one function per file) and to annotate it in a few minutes. These annotations are stored in RDF files.
- powerful searches: users can find and get functions with a global and accurate understanding. Users can also access suggestions to support their approach. We can use this environment to search, for example, for functions:
- having one author and/or producing distribution graphics.
- having for intention to perform multidimensional exploration.
- calling the ”lm” R core function, or a specific function of the repository. More generally, it is relevant to generate the call graph of one function to understand it.
- adapted from another one. This property makes maintenance operations easier.
- to be used after or before another one. These properties help to construct chainings of treatments. We can extend this result to similar functions by applying the following rule: if B can be used after / before A and if B and C are similar, then C can be used after / before A.
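The chaining rule above can be illustrated with a toy forward-chaining loop over RDF-style triples. The predicate names come from the abstract, but the code is a hypothetical stand-in for what Corese would infer, not the authors' implementation.

```python
# Toy illustration of the inference rule from the abstract, over
# RDF-style (subject, predicate, object) triples. Function names are
# invented examples; this mimics the derivation, not Corese itself.

triples = {
    ("plotDist", "canBeUsedAfter", "loadData"),
    ("plotDist", "lookLikes", "plotDensity"),
}

def infer_chaining(triples):
    """If B canBeUsedAfter A and B lookLikes C, then C canBeUsedAfter A."""
    derived = set(triples)
    changed = True
    while changed:
        changed = False
        similar = {}
        for s, p, o in derived:
            if p == "lookLikes":          # treat similarity as symmetric
                similar.setdefault(s, set()).add(o)
                similar.setdefault(o, set()).add(s)
        for s, p, o in list(derived):
            if p == "canBeUsedAfter":
                for c in similar.get(s, ()):
                    t = (c, "canBeUsedAfter", o)
                    if t not in derived:
                        derived.add(t)
                        changed = True
    return derived

inferred = infer_chaining(triples)
print(("plotDensity", "canBeUsedAfter", "loadData") in inferred)  # True
```

In the real repository the same derivation would be obtained declaratively, by marking "lookLikes" as symmetric in the ontology and letting the SPARQL engine apply the rule.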
This repository of annotated R functions centralises and shares R programs (functions or scripts) within a wide community, preserving know-how which can often be lost or become unusable because of a lack of documentation and description. The annotation interface could be automatically generated from the ontology and improved with semi-automated acquisition. A very user-friendly Web interface is necessary to promote use of the repository. We are convinced that this kind of repository could be very widely extended to R function authors and users, and that this work could be adapted to other languages.
MolliGen 3.0, evolution of a database dedicated to the comparative genomics of mollicutes
Abstract: Bacteria belonging to the class Mollicutes were among the first to be selected for complete genome sequencing because of the minimal size of their genomes and their pathogenicity for humans and for a broad range of animals and plants. An ever-growing number of genome sequences becomes available and constitutes a precious basis for a wide diversity of applied and fundamental research, including development of typing markers, vaccine design, metabolic network reconstruction, evolution and "omics" studies and, recently, synthetic biology. In order to provide the scientific community with tools for the comparative genomics of mollicutes, we have developed a web-accessible database named MolliGen. First released in 2004 with 6 genomes, MolliGen 3.0 now includes 26 genomes from 22 species. This new release of the database offers the possibility to compare public genomes with unpublished ones that can be loaded into private sections. As keeping genome annotation up to date is a constant necessity in genome databases, a re-annotation module was added that allows authorized users to modify gene annotation and to add or delete genetic elements. Search tools were improved to allow the formulation of complex queries in natural language. Results of queries can be saved as lists of CDSs across web sessions, combined with other lists and downloaded with chosen associated features. Homology relationships are available as pre-computed data or can be explored using BLAST queries, multiple alignments and motif search. Whole-genome alignments and specific regions can be visualized using dedicated browsers. Many other tools were added or improved for the analysis and comparison of genomes. The MolliGen 3.0 database is regularly updated and is available at http://www.molligen.org.
IMGT/HighV-QUEST: A High-Throughput System and Web Portal for the Analysis of Rearranged Nucleotide Sequences of Antigen Receptors - High-Throughput Version of IMGT/V-QUEST
Abstract: IMGT/HighV-QUEST has been developed by IMGT®, the international ImMunoGeneTics Information System®, to address the analysis of antigen receptor data from Next-Generation Sequencing (NGS). The analysis of the expressed repertoires of antigen receptors - immunoglobulins (IG) or antibodies and T cell receptors (TR) - represents a crucial challenge for the study of the adaptive immune response in normal and disease-related situations. Currently the standardized analysis of IG and TR nucleotide sequences, based on the IMGT-ONTOLOGY concepts [2,3], is performed by IMGT/V-QUEST, the integrated IMGT® tool available online for up to 50 IG and TR rearranged sequences. IMGT/HighV-QUEST is the high-throughput version of IMGT/V-QUEST that allows users to analyse batches of more than 100,000 rearranged sequences of antigen receptors, IG or TR, in one run. IMGT/HighV-QUEST is a secure system and web portal. It requires user identification, and all data transactions are performed over a secure web connection provided by HTTPS. It provides a simple user interface, accessed via a classical web browser, which is familiar to IMGT/V-QUEST users. In particular, the functionalities are identical to those provided by the 'Detailed view' and 'Excel files' of the online IMGT/V-QUEST tool. The results comprise a set of text files which include:
1) Eleven files equivalent to the eleven sheets of the 'Excel files', whose content is detailed in the IMGT/V-QUEST documentation: (i) the 'Summary' file provides the synthesis of the analysis (the sequence functionality, the names of the closest variable (V), diversity (D) and joining (J) genes and alleles with identity percentages, framework (FR) and complementarity determining region (CDR) lengths, the amino acid (AA) V-D-J or V-J JUNCTION, the description of insertions and deletions if any...), (ii) the 'IMGT-gapped-nt-sequences' file includes the nucleotide (nt) sequences of labels that have been gapped according to the IMGT unique numbering, (iii) the 'nt-sequences' file includes the nt sequences of all described labels, (iv) the 'IMGT-gapped-AA-sequences' file includes the AA sequences that have been gapped according to the IMGT unique numbering, (v) the 'AA-sequences' file includes the AA sequences of labels without IMGT gaps, (vi) the 'Junction' file includes the results of IMGT/JunctionAnalysis, (vii) the 'V-REGION-mutation-table' file includes the list of mutations (nt mutation, AA change, AA class identity or change) per FR and CDR, (viii) the 'V-REGION-nt-mutation-statistics' file includes the number of positions including IMGT gaps, the number of nt, the number of identical nt, the total number of mutations, the number of silent mutations, the number of nonsilent mutations, the number of transitions and the number of transversions per FR and CDR, (ix) the 'V-REGION-AA-mutation-statistics' file includes the AA positions including IMGT gaps, the number of AA, the number of identical AA, the total number of AA changes, the number of AA changes according to AAclassChangeType, and the number of AA class changes according to AAclassSimilarityDegree per FR and CDR, (x) the 'V-REGION-mutation-hot-spots' file indicates the localization of the hot-spot motifs detected in the closest germline V-REGION with positions in FR and CDR, and (xi) the 'Parameters' file includes the date of the analysis, the IMGT/V-QUEST version, and the parameters used for the analysis.
2) The 'Detailed view' for each analysed sequence, which allows visualizing the individual results: (i) the result summary presents the main characteristics of the analysed sequence, with the names of the closest V and J genes and alleles with their alignment scores and percentages of identity, the name of the closest D gene and allele determined by IMGT/JunctionAnalysis with the D-REGION reading frame, the FR and CDR lengths and the AA JUNCTION sequence, and, if selected, (ii) the alignments for the V, D and J genes and alleles, (iii) the detailed analysis of the JUNCTION by IMGT/JunctionAnalysis, (iv) different displays of the V-REGION, (v) the analysis of the mutations and AA changes, (vi) the localization of the mutation hot spots, and (vii) the annotation by IMGT/Automat.
One of the challenges of IMGT/HighV-QUEST is the management of many analyses of thousands of sequences and of their results. The analyses are managed locally and the corresponding jobs are dispatched to computational servers provided they have available resources. This requires a stateful system which keeps track of the status of the analyses even when users are not connected. Another strategy adopted to strengthen data security is the encryption of the links sent by e-mail to users to let them handle the analysis results. This encryption, using the RSA algorithm, and the requirement of a user session to follow the analysis status or to handle the results, reinforce the security of IMGT/HighV-QUEST. The result files are saved on file systems that are protected by firewalls, and interactions with these file systems are only possible through SSH (Secure Shell) connections. The user results are kept for a limited period of time on the file systems (15 days) and are fully deleted afterwards. The current release of IMGT/HighV-QUEST is being tested on internal IMGT® servers. It will then be accessible from the IMGT® Home page (http://www.imgt.org).
[1] M.-P. Lefranc, V. Giudicelli, C. Ginestoux, J. Jabado-Michaloud, G. Folch, F. Bellahcene, Y. Wu, E. Gemrot, X. Brochet, J. Lane, L. Regnier, F. Ehrenmann, G. Lefranc and P. Duroux, IMGT®, the international ImMunoGeneTics information system®. Nucleic Acids Res., 37:1006-1012, 2009.
[2] V. Giudicelli and M.-P. Lefranc, Ontology for Immunogenetics: IMGT-ONTOLOGY. Bioinformatics, 15:1047-1054, 1999.
[3] P. Duroux, Q. Kaas, X. Brochet, J. Lane, C. Ginestoux, M.-P. Lefranc and V. Giudicelli, IMGT-Kaleidoscope, the Formal IMGT-ONTOLOGY paradigm. Biochimie, 90:570-583, 2008.
[4] X. Brochet, M.-P. Lefranc and V. Giudicelli, IMGT/V-QUEST: the highly customized and integrated system for IG and TR standardized V-J and V-D-J sequence analysis. Nucleic Acids Res., 36:W503-508, 2008.
[5] M.-P. Lefranc, C. Pommié, R. Ruiz, V. Giudicelli, E. Foulquier, L. Truong, V. Thouvenin-Contet and G. Lefranc, IMGT unique numbering for immunoglobulin and T cell receptor variable domains and Ig superfamily V-like domains. Dev. Comp. Immunol., 27:55-77, 2003.
[6] M. Yousfi Monod, V. Giudicelli, D. Chaume and M.-P. Lefranc, IMGT/JunctionAnalysis: the first tool for the analysis of the immunoglobulin and T cell receptor complex V-J and V-D-J JUNCTIONs. Bioinformatics, 20:379-385, 2004.
[7] V. Giudicelli, D. Chaume, J. Jabado-Michaloud and M.-P. Lefranc, Immunogenetics sequence annotation: the strategy of IMGT based on IMGT-ONTOLOGY. Stud. Health Technol. Inform., 116:3-8, 2005.
Investigating genome structure and gene regulation: a novel approach to identify co-expression among and between groups of nearby genes
Abstract: New theories underline the importance of understanding the impact of the spatial component on the regulation of gene expression. Recent studies of gene expression in several eukaryotic genomes (worm, fruit fly, mouse, human...) have revealed positional clustering of co-expressed genes. Contrary to prokaryotic genomes, these clusters often do not involve genes of the same biological pathways. This clustering is nevertheless preserved between species, which suggests that genome architecture is involved in higher-level mechanisms of regulation. In addition, new theories have proposed that the chromosome conformation in the cell plays a role in gene regulation. Indeed, chromatin loops cause long-range interactions between chromosomal regions. This phenomenon may allow regulatory elements to interact, or distant genes to be co-expressed within a transcription factory. Numerous examples in the functional genomics literature have led to the establishment of recent theories that describe how genome structure may determine gene expression.
To our knowledge, no global computational method exists to identify clusters of co-expressed genes taking into account both their chromosomal locations and the potential structural interactions that may impact their expression. Positional clusters of co-expressed genes are usually identified by analysing the cluster distribution, whereas interacting genomic regions are identified with SNPs rather than with gene expression profiles. We thus developed a novel approach to investigate co-expression across genomes using transcriptomic assays.
We propose here an approach based on Principal Component Analysis (PCA) to identify clusters of co-expressed genes with respect to their location from transcriptomic datasets. A PCA is applied within a genomic window, which is slid over the whole chromosome. The correlated genes that contribute most to the first identified components may be interpreted as co-expressed genes. Several estimators and graphical representations are built from the computed eigenvalues and the gene locations (such as the level of co-expression, the number of clusters within a region and the gene distribution). In addition, the co-expression between these clusters is assessed with Multiple Factor Analysis, in order to identify co-expressed clusters in different regions.
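The window-wise PCA can be sketched as follows, assuming the share of variance carried by the first principal component is used as the co-expression score of a window. The pure-Python power iteration below is an illustrative stand-in for a real PCA routine, not the authors' code.

```python
# Minimal sketch of the windowed PCA idea: within each window of
# contiguous genes, the fraction of variance on the first principal
# component serves as a co-expression score. Power iteration on the
# covariance matrix stands in for a full PCA; illustrative only.

def first_pc_variance_share(X):
    """X: list of samples, each a list of gene expression values.
    Returns the fraction of total variance carried by the first PC."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    C = [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in X) / (n - 1)
          for j in range(p)] for i in range(p)]
    v = [1.0] * p                      # power iteration for top eigenvector
    for _ in range(200):
        w = [sum(C[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in w) ** 0.5 or 1.0
        v = [x / norm for x in w]
    lam = sum(v[i] * sum(C[i][j] * v[j] for j in range(p)) for i in range(p))
    total = sum(C[i][i] for i in range(p))
    return lam / total

def sliding_scores(X, window=3):
    """Slide the window over the gene order and score each position."""
    p = len(X[0])
    return [first_pc_variance_share([row[s:s + window] for row in X])
            for s in range(p - window + 1)]

# Toy data: genes 0-2 perfectly co-expressed, gene 3 unrelated noise.
X = [[1, 2, 1, 5], [2, 4, 2, 1], [3, 6, 3, 4], [4, 8, 4, 2], [5, 10, 5, 9]]
scores = sliding_scores(X, window=3)
print(scores[0] > scores[-1])  # the first window is the co-expressed one
```

A peak in this score along the chromosome flags a window of positionally clustered, co-expressed genes; the loadings of the first component then indicate which genes in the window drive the cluster.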
This method detected such co-expression in both simulated and real datasets. For example, the method highlighted groups of co-expressed genes between putative interacting regions of the chicken genome previously identified with a genetical genomics approach. First, we will perform a comparative analysis of these preliminary results against the results of the genetical genomics approach. Eventually, the conclusions of this comparative analysis will be used to analyse the global results of our method at the whole-genome level.
An R package will soon be available.
[1] L. I. Elizondo, P. Jafar-Nejad, J. M. Clewing and C. F. Boerkoel, Gene Clusters, Molecular Evolution and Disease: A Speculation. Current Genomics, 10(1):64-75, 2010.
[2] M. Ouedraogo, F. Lecerf and S. Lê, Understanding co-expression of co-located genes using a PCA approach. COMPSTAT 2010, 19th International Conference on Computational Statistics, Paris, France, Aug. 22-27, 2010.
[3] B. Escofier and J. Pagès, Multiple Factor Analysis (AFMULT package). Computational Statistics & Data Analysis, 18(1):121-140, 1994.
Génolevures: knowledge bases and annotation of hemiascomycetous yeast genomes
Abstract: The technological evolution of DNA sequencers has profoundly changed the way sequencing projects for comparative genomics are designed. With acquisition costs and times reduced by several orders of magnitude, a comparative genomics project now consists of sequencing several complete genomes related to a reference genome. To cope with this influx of data, automatic methods ensuring good annotation quality are needed. Through the exploration of the broad evolutionary range of hemiascomycetous yeasts, the Génolevures consortium has acquired solid expertise in genome annotation and in the methods used in comparative genomics.
Since 1999, the Génolevures consortium has sequenced, annotated and analysed genomes from the Hemiascomycetes branch and performed, on these data and on data from external genomes, numerous in silico and experimental comparative studies. The first work on partial genomic sequences [1,2] revealed several types of relations between chromosomal elements: synteny, redundancy and functional classification. In 2004, the complete sequences of 4 genomes brought new insights and developments [3,4]: the design of an original algorithm for classifying proteins into families, the identification of several mechanisms involved in genome evolution (tandem gene repetition, segmental chromosome duplication, whole-genome duplication) and the inference of the major evolutionary events in the hemiascomycetous yeast clade. Since then, 6 new complete genomes have been analysed and published [6,7] or are in the process of publication. Genome annotation was carried out following the collaborative annotation model over the Internet, using methods and tools developed to meet the annotators' needs. All the data (sequences, chromosomal elements and their relations, classifications, ...) can be consulted and downloaded in the Génolevures knowledge base (http://cbi.labri.fr/Genolevures/).
The know-how accumulated during the annotation of the different genomes notably includes taking into account synteny relations and similarity to protein families. This allowed us to devise a method for the parallel annotation of related genomes. On this basis, we developed the MAGUS genome annotation system, which integrates genome sequences, the chromosomal elements that compose them, in silico analyses and views of external data through an interface requiring only a web browser. MAGUS allows either the simultaneous annotation of several genomes or the annotation of one genome at a time, the annotations entered by one annotator being available to the others in real time. MAGUS implements the annotation workflows developed for Génolevures and enforces curation standards so as to guarantee data integrity and consistency. The system provides a workflow for the simultaneous annotation of genomes through the use of the protein families identified by the in silico analyses.
Comparative genomics projects have changed scale, moving in a few years from comparing the genomes of model species to comparing sets of genomes of related species. Thus, for example, when establishing genotype-phenotype relations over groups of species, both the similarities and the differences between genomes must be known. To carry out such projects, genomic sequences must be annotated quickly and efficiently. The expertise accumulated by the Génolevures consortium and its implementation in the MAGUS system provide a solution for small eukaryotic genomes such as those of yeasts.
This work is co-funded by the CNRS (GDR 2354, INS2I, INSB), the ANR (ANR-05-BLAN-0331: GENARISE), the Aquitaine region ('Pôle Recherche en Informatique') (2005-1306001AB, partial, 2010) and by ACI IMPBIO (IMPB114, 'Génolevures En Ligne').
[1] Souciet JL, Génolevures Consortium. Special issue: Génolevures. FEBS Lett., 487:1-149, 2000.
[2] Sherman DJ, Durrens P, Beyne E, Nikolski M, Souciet JL. Genolevures: comparative genomics and molecular evolution of hemiascomycetous yeasts. Nucleic Acids Res., 32:D315-D318, 2004.
[3] Dujon B, Sherman DJ, Fischer G, Durrens P, Casaregola S, Lafontaine I, De Montigny J, Marck C, Neuvéglise C, Talla E, et al. Genome evolution in yeasts. Nature, 430:35-44, 2004.
[4] Sherman D, Durrens P, Iragne F, Beyne E, Nikolski M, Souciet JL. Genolevures complete genomes provide data and tools for comparative genomics of hemiascomycetous yeasts. Nucleic Acids Res., 34:D432-D435, 2006.
[5] Sherman DJ, Nikolski M. Family relationships: should consensus reign? Consensus clustering for protein families. BMC Bioinformatics, 23, 2007.
[6] The Genolevures Consortium. Comparative genomics of protoploid Saccharomycetaceae. Genome Research, 19:1696-1709, 2009.
[7] Sherman DJ, Martin T, Nikolski M, Cayla C, Souciet JL, Durrens P. Génolevures: protein families and synteny among complete hemiascomycetous yeast proteomes and genomes. Nucleic Acids Res., 37(Database issue):D550-D554, 2009.
[8] Sherman DJ, Martin T, Durrens P. MAGUS, http://magus.gforge.inria.fr
GWAS-AS: assistance for a thorough evaluation of advanced algorithms dedicated to genome-wide association studies
Abstract: Advances in genotyping technologies nowadays allow the completion of genome-wide scans involving a million or more genetic markers (SNPs) across the human genome. Relying on two cohorts - controls and cases - sampled from a population of unrelated individuals, genome-wide association studies (GWASs) attempt to identify which marker variants significantly accumulate in the case cohort. Straightforward standard approaches test the association between each marker and the disease. Focusing on the single-marker approach, we identified that there was still room for improvement.
Exploiting the existence of statistical dependencies between neighbouring markers, also called linkage disequilibrium (LD), is a key to improving association studies. Notably, LD allows reducing data dimensionality in GWASs. Relying on this feature, various approaches have been proposed to achieve dimensionality reduction. In this line, besides former works based on Bayesian networks (BNs), we have proposed CFHLC, an algorithm that finely models dependencies between SNPs using hierarchical networks with latent variables (LVs). Not only does a thorough evaluation of CFHLC require comparing our algorithm with standard association tests on real data made available by the Wellcome Trust Case Control Consortium (WTCCC); this evaluation process also entails conducting extensive simulation studies under a wide variety of simulated conditions. Moreover, a noteworthy characteristic of CFHLC is that the LVs' cardinalities are not restricted to 3, as the SNPs' are. Software able to manage a genome-scale association study with n-ary outputs was still to be written. We filled this gap.
Finally, since any researcher proposing a novel GWAS approach will be confronted with our needs, at least partially, we specifically designed a software suite to alleviate this evaluation task. Together with a tool box, this suite provides integrated and flexible pipelines to run the whole process from simulated benchmark generation or WTCCC data handling, through data filtering, to GWAS.
Basically, in the CFHLC algorithm, LVs capture the information borne by the underlying markers. In their turn, LVs are clustered into groups and, if relevant, such groups are subsequently subsumed by additional LVs. Iterating this process yields a hierarchical structure. First, the great advantage for GWASs is that further statistical analyses can chiefly be performed on LVs; thus, a reduced number of variables will be examined. Second, a model based on a hierarchical structure provides a flexible data-mining tool. The hierarchical structure is meant to efficiently conduct refined association testing: zooming in through narrower and narrower regions in search of stronger association with the disease ends up pointing out the potential markers of interest. To sum up, the proposed framework, a Forest of Hierarchical Latent Class models, brings together several advantages over its few competing proposals: BN topology extended to n-ary trees organized in a forest; control to limit information decay as the level of the LV increases in the topology; flexible LV cardinality; scalability (the current implementation of CFHLC has been shown to be tractable on benchmarks describing 100,000 variables for 2,000 individuals).
The GWAS-AS suite plugs WTCCC data or Hapgen outputs into a framework interfacing Python pipelines and R scripts with the PLINK software. Data generation is performed with Hapgen, a well-known software package which allows prescribing the genetic model for the simulated causal SNP (additive, dominant, recessive), together with various disease severities. PLINK, one of the most popular statistical software platforms dedicated to GWASs, met our purpose of an easy implementation of standard association tests.
In a GWAS, any evidence of population substructure or outliers has to be taken into account to discard spurious associations. When dealing with real data (including WTCCC data), GWAS-AS automatically chains the various steps of a dedicated pipeline, for easy and safe non-expert usage.
When run as a whole, the quality-control stage involves: checking that missing genotype rates (per SNP, per individual) are lower than specified thresholds; filtering on minor allele frequency cut-offs; and detecting Hardy-Weinberg equilibrium failure. Then, the filtered data undergo several types of association analyses: the chi-squared test, the Cochran-Armitage trend test and logistic regression. The latter can integrate covariates such as sex and population stratification. Association significance is assessed empirically, through permutations.
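As an illustration of the testing step, here is a minimal stand-alone version of the Cochran-Armitage trend test for one marker (a sketch using the usual asymptotic variance, not the PLINK or GWAS-AS implementation; the genotype counts are invented):

```python
# Illustrative single-marker association test: the Cochran-Armitage
# trend test on a 2 x 3 table of genotype counts (cases/controls x
# AA/Aa/aa), with an additive score (0, 1, 2). Sketch only.
import math

def cochran_armitage_trend(cases, controls, scores=(0, 1, 2)):
    """cases, controls: genotype counts per class. Returns (chi2, p)."""
    n_cases, n_ctrl = sum(cases), sum(controls)
    n = n_cases + n_ctrl
    totals = [c + d for c, d in zip(cases, controls)]
    # observed minus expected case counts, weighted by genotype scores
    t = sum(s * (c - n_cases * tot / n)
            for s, c, tot in zip(scores, cases, totals))
    # asymptotic variance of the trend statistic under the null
    mean_s = sum(s * tot for s, tot in zip(scores, totals)) / n
    var = (n_cases * n_ctrl / n) * (
        sum(s * s * tot for s, tot in zip(scores, totals)) / n - mean_s ** 2)
    chi2 = t * t / var                   # 1 degree of freedom
    p = math.erfc(math.sqrt(chi2 / 2))   # chi-square(1) survival function
    return chi2, p

chi2, p = cochran_armitage_trend(cases=(10, 40, 50), controls=(40, 40, 20))
print(round(chi2, 2), p < 1e-6)
```

With one degree of freedom, the p-value can be written with the stdlib's `math.erfc`, which keeps the sketch dependency-free; a production pipeline would of course delegate this to PLINK or R.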
To our knowledge, no genome-wide-oriented package dedicated to genetic association can handle n-ary variables (with n > 3), including multiallelic markers. GWAS-AS offers this functionality through a dedicated R script running the chi-squared test and logistic regression. Besides, since latent variables are organized in layers within the hierarchical structure, each layer must be assigned a relevant significance threshold for the association rejection decision. The GWAS-AS suite automatically sets those cut-off values through permutations, relying on a quantile. Furthermore, testing many independent null association hypotheses in the same experiment entails multiple "opportunities" for chance false rejection. In our case, permutation results have already been calculated, which allows empirical correction for multiple testing within each layer.
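The permutation-based threshold setting can be sketched as follows (hypothetical function names and a toy statistic; the real suite drives PLINK and R, not this code): shuffle the phenotype labels, record the maximum statistic over all markers for each shuffle, and take an upper quantile of these maxima as the layer's cut-off.

```python
# Sketch of a permutation-derived significance threshold: the maximum
# statistic over all markers per label shuffle builds a null
# distribution whose upper quantile corrects for multiple testing.
import random

def permutation_threshold(genotypes, labels, stat, n_perm=200,
                          quantile=0.95, seed=0):
    """genotypes: one genotype vector per marker; labels: 0/1 per
    individual; stat(marker, labels) -> test statistic."""
    rng = random.Random(seed)
    maxima = []
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)          # break genotype-phenotype link
        maxima.append(max(stat(m, shuffled) for m in genotypes))
    maxima.sort()
    return maxima[int(quantile * (len(maxima) - 1))]

# Toy statistic: squared difference of mean allele counts between groups.
def mean_diff_sq(marker, labels):
    g1 = [g for g, l in zip(marker, labels) if l == 1]
    g0 = [g for g, l in zip(marker, labels) if l == 0]
    return (sum(g1) / len(g1) - sum(g0) / len(g0)) ** 2

rng = random.Random(1)
markers = [[rng.choice([0, 1, 2]) for _ in range(40)] for _ in range(20)]
labels = [1] * 20 + [0] * 20
print(permutation_threshold(markers, labels, mean_diff_sq) >= 0.0)
```

Taking the maximum over markers inside each permutation is what provides family-wise control within a layer; a per-marker quantile would not.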
We are currently in the process of evaluating CFHLC under 3 genetic models combined with 4 disease severities.
After data generation for human chromosome 1 (between 95.5k and 96k SNPs after quality control), the simulated causal variant was dismissed and two GWASs were carried out relying on the GWAS-AS tool: one on the observed SNPs and one on the CFHLC outputs (inferred data corresponding to the latent variables). The first GWASs performed show that data reduction does not greatly decrease the power of the analysis process: in all the cases examined, the causative region is identified by both approaches. As expected, the detection is more accurate for high disease severities. However, CFHLC still enables detection at low relative risks.
Designed for a specific usage, the GWAS-AS software suite can be extremely helpful to other scientists involved in the design of innovative approaches for genome-scale association analysis. In the near future, this software will be made available to this community.
Food-Microbiome, an eco-genomics study applied to cheese ecosystems
Abstract: Food-Microbiome is an ANR-funded programme aiming to develop new approaches for studying ecosystems, using traditional cheeses as a model. These products are made with a complex flora, initially composed of lactic acid bacteria and later of various bacteria, yeasts and fungi. Some microorganisms are deliberately inoculated; others come from the environment. A better knowledge of these floras is needed to preserve the possibility of using some of these microorganisms in the future and to ensure the safety of the products to which they might be added. The programme comprises two scientific parts, one on quantitative metagenomics and the other on the genomics of filamentous fungi. We present the first part here.
Quantitative metagenomics is an approach based on very-high-throughput sequencers to count the nucleic acids present in an ecosystem. Applied to DNA, it quantifies genes, sometimes their variants and, more globally, the genomes of the microorganisms present. Within the Food-Microbiome programme, we analysed 40 cheese samples representing mostly different varieties of traditional French cheeses (farmhouse and AOC), as well as a few more industrial cheeses.
The DNA of these samples was sequenced on a SOLiD sequencer to a depth of 30 million 50-base-pair reads. The reads were analysed with METEOR, the automated processing pipeline developed in our laboratory. They were mapped against several reference catalogues suited to these ecosystems: (i) a catalogue of 80 genomes of bacteria likely to colonize foodstuffs, (ii) a catalogue of 23 fungal and yeast genomes, and (iii) a catalogue of marker sequences for 700 fungal species. This analysis reconstructs the floral composition of the cheese samples and allows the different ecological compartments within a single cheese, such as the core and the surface, to be compared. We will present some illustrations of the potential and the limits of this method for characterizing ecosystems.
A joint experimental and simulation study of aging and protein aggregation in E. coli
Abstract: Aging is a fundamental characteristic of all living organisms, down to bacteria, and can be defined as a decrease in reproduction rate with age. Recent research by our lab has shown that E. coli does not enjoy eternal youth.
The bacterium E. coli grows as a rod that reproduces by dividing in the middle. Each division creates one new pole per daughter cell. Thus one end of each cell has just been created during division (the new pole), while the other pre-exists from a previous division (the old pole). Old poles can persist for many divisions, and if cells are followed over time, an age in divisions can be assigned to each pole, and hence to each cell. Each division of a single cell therefore yields two daughters: one inherits the "older old pole" and is called the "old pole cell", while the other inherits the "younger old pole" and is called the "new pole cell" (Stewart, E. J., R. Madden, et al. (2005) PLoS Biol 3(2): e45). The old pole cell grows more slowly than the new pole cell produced in the same division; the new pole cell is slightly larger on average and marginally more likely to divide sooner.
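The pole-age bookkeeping described above can be sketched in a few lines. The `Cell` class and its fields below are a hypothetical illustration, not the authors' analysis code:

```python
from dataclasses import dataclass

@dataclass
class Cell:
    pole_age: int   # number of divisions the cell's old pole has survived
    history: str    # 'O'/'N' record of old-/new-pole inheritance

def divide(cell):
    """One division: the daughter keeping the old pole sees its pole age
    increase by one; the other daughter's oldest pole is the septum just
    created, so its age restarts at 1."""
    return (Cell(cell.pole_age + 1, cell.history + "O"),   # old pole cell
            Cell(1, cell.history + "N"))                   # new pole cell

def grow_lineage(generations):
    """Follow a founder cell through synchronous divisions."""
    cells = [Cell(pole_age=1, history="")]
    for _ in range(generations):
        cells = [d for c in cells for d in divide(c)]
    return cells

cells = grow_lineage(4)   # 16 cells after four divisions
```

After four divisions, exactly one cell (history "OOOO") has carried the founder's pole throughout, while half the population has a brand-new old pole of age 1.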
Moreover, this aging phenomenon was shown to be related to asymmetric protein aggregation: chaperone-containing protein aggregates accumulate in the old poles of E. coli under native conditions (Lindner, A. B., R. Madden, et al. (2008) Proc Natl Acad Sci U S A 105(8): 3076-81). As the bacteria divide, these defective proteins are believed to contribute to the growth slowdown of aging bacteria. Protein aggregation is therefore responsible for a substantial part of the growth defect associated with replicative aging in E. coli. Taken together, these results suggest that bacteria could be an excellent model organism for studying the relation between protein aggregation and aging.
While the results above were obtained in non-stressing conditions, related work has examined the same system under conditions where protein aggregation is triggered by a heat shock. Rokney and coworkers (Rokney, A., M. Shagan, et al. (2009) J Mol Biol 392(3): 589-601) have shown that the formation of large protein deposits under such heat-shock conditions proceeds in two steps: the first is the formation of multiple small aggregates with random cellular distribution, and the second depends on two energy sources, the proton motive force and ATP. Conversely, Winkler and collaborators (Winkler, J., A. Seybert, et al. (2010) EMBO J 29(5): 910-23) found evidence that protein aggregation at the pole upon heat shock is mainly driven by the large molecular crowding expected in the nucleoid region. Abolishing this occlusion affected the aggregation process by decreasing the number of aggregates to one per cell and increasing the proportion of aggregates located at mid-cell. In their view, the aggregation process needs no active mechanism: the cytoskeleton is not involved and ATP is not required for aggregation.
Hence, the above results suggest that protein aggregation in the old pole of the cells is a progressive process eventually ending in the observed asymmetric localization of large aggregates at the pole. However, the available experimental results are not conclusive with respect to several important questions, such as: is the observed protein aggregation an active phenomenon (based e.g. on ATP-dependent mechanisms) or a purely passive one (resulting from the large molecular crowding in the nucleoids or from interactions with the cell membranes)? Can the observed differences between native and heat-shock conditions, e.g. in the number of aggregates per cell, be related to simple properties such as the total number of aggregating proteins or their aggregation propensity?
To answer these questions, we developed an approach based on the synergy between innovative experimental techniques (microfluidics, synthetic biology, time-lapse fluorescence microscopy) and 3D individual-based modeling. On the experimental side, we monitored aggregates of fluorescently tagged chaperone proteins within single bacterial cells growing in a non-stressed, favorable environment. Using automated image analysis of the microscopy movies, we quantified the spatial distribution of the protein aggregates and the properties of their trajectories/movements within the cells. The plausibility of the first interpretations of these experimental results was then tested in a 3D individual-based model of protein diffusion-aggregation in E. coli. In particular, we wanted to determine whether the observed localization and trajectories of the aggregates within the cell could be accounted for by purely passive mechanisms such as molecular crowding.
Our preliminary simulation results confirm that the experimentally observed spatial distribution of the aggregates at first appearance can be reproduced by this simple aggregation-diffusion model, provided that the large molecular crowding expected in the nucleoids is taken into account. The simulated spatial distribution of aggregates within the cell matches the experimental observation whatever the initial number of molecules: protein aggregates localize at one of the poles or at the center of the cell.
Moreover, our preliminary results indicate that some of the observed differences between native and heat shock conditions could be simply explained by an increased aggregation propensity in the heat shock case.
To simulate the heat shock, we increased the initial number of molecules from 100 to 10,000. With one hundred molecules, the probability of observing one large aggregate (defined as containing more than 25% of the initial number of molecules) is high, while it is much smaller for two aggregates and vanishes for three. Conversely, with 10,000 initial molecules, the probability of observing three aggregates is higher.
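As an illustration of the passive scenario, here is a toy one-dimensional diffusion-aggregation model, far simpler than the 3D individual-based model used in the study; the lattice size, step count and size-dependent mobility rule are arbitrary assumptions made only for the sketch:

```python
import random

def simulate(n_molecules, length=50, steps=2000, seed=1):
    """Toy 1D diffusion-aggregation: particles random-walk on a lattice
    and merge irreversibly when they share a site. Larger aggregates
    diffuse more slowly (a crude stand-in for crowding). Entirely
    passive - no ATP-like mechanism."""
    rng = random.Random(seed)
    aggs = [[rng.randrange(length), 1] for _ in range(n_molecules)]  # [position, size]
    for _ in range(steps):
        for a in aggs:
            if rng.random() < 1.0 / a[1]:        # mobility falls with size
                a[0] = min(length - 1, max(0, a[0] + rng.choice((-1, 1))))
        merged, index = [], {}
        for pos, size in aggs:                   # co-localized -> merge
            if pos in index:
                merged[index[pos]][1] += size
            else:
                index[pos] = len(merged)
                merged.append([pos, size])
        aggs = merged
    return aggs

aggregates = simulate(100)
# 'large' aggregate as in the text: more than 25% of the initial molecules
big = [a for a in aggregates if a[1] >= 0.25 * 100]
```

Running the same sketch with 100 versus 10,000 initial molecules lets one compare final aggregate counts, in the spirit of the heat-shock comparison described above.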
Taken together, our results provide strong evidence in favor of the hypothesis of a purely passive origin for the aggregation process and for the asymmetry of aggregate location at the early stage of aggregation. Our future work will try to identify molecular links between protein aggregation, growth rate and aging. As a first step, we will study the roles played by the different chaperones in the observed aggregation.
3D Printing Service @ RPBS - MTi
Abstract: A new 3D printing service is now open at University Paris Diderot, located at the MTi laboratory (INSERM U973), RPBS Platform. We specialize mostly in molecular model printing, so the service is fully open to the structural bioinformatics community, but also to other domains such as architecture.
- Can we print anything else than molecules?
Yes! We are able to print any CAD object; the only constraint is to send us a supported file format (e.g. ZBD, STL, BLD, PLY, ZCP, SFX, 3DS, ZEC, ZPR and VRML).
- I know a bit of molecular viewing, can I submit my model via a PyMol session file?
Sure, we support the three main viewers, namely PyMol, VMD and Chimera. Your session file can be uploaded as a CAD object in the submission form (pricing section).
- I don't have any structure for my protein, can you help?
Yes! The RPBS platform can bring its extensive experience in protein structure prediction to provide a printable model of your favorite polypeptide. In this case, please contact us via our contact form to describe your system.
To see some examples of models printed in house, take a look at our gallery: http://3dprinter.rpbs.univ-paris-diderot.fr/models/gallery/
Development of a workflow for SNP detection with Galaxy.
The URGI (http://urgi.versailles.inra.fr/) has developed an SNP-detection pipeline, integrated into the workflow manager Galaxy (http://main.g2.bx.psu.edu/) [1, 2].
Through a web page, Galaxy makes it very simple and fast to link different tools together. Countless workflows can thus be built and shared with the entire community, across a wide range of scientific fields.
From a reference genome and a set of short reads (single-end or paired-end), our workflow can predict SNPs and indels according to various filters, such as genome coverage, allele frequency and p-value.
For this, we use several tools, including Bwa [3, 4], SAMtools, Tablet and VarScan.
However, nothing is fixed: each tool can be modified or replaced quickly and intuitively.
As part of several ANR projects (PlanteReseq, GrapeReseq and Muscares), we have set up a pipeline for detecting SNPs and indels. It had to be efficient, quick to build, extensible and, above all, reusable across projects.
Galaxy is a workflow manager presented as a web page. It provides a varied set of tools, allows further tools to be integrated, and gathers them all on a single page. From these tools, workflows can be built very simply.
From a developer's point of view, integrating a tool into Galaxy amounts to writing an XML file, editing a configuration file and, possibly, writing a Perl or Python script (depending on the complexity of the program). Any tool can be integrated into Galaxy, including our in-house scripts or those relying on databases, clusters, etc.
From a user's point of view, Galaxy is quite simple to use. To run a tool, click on its link and fill in the parameters. Likewise, to create a workflow, click on a tool to make it appear as a box, then connect the boxes to one another.
Our workflow breaks down into three main steps. The first aligns all the short reads (single-end or paired-end) on the reference genome. This step is built around the Bwa aligner, which requires a fasta file for the reference genome and one (single-end) or two (paired-end) fastq files.
The second step predicts SNPs and indels. It is mainly handled by SAMtools, which takes the Bwa result file (a sam file) and produces a pileup file gathering, for each base of the reference sequence covered by at least one read, all the useful information. These files are then fed to VarScan which, given filters set by the user, predicts a set of SNPs and indels as well as the consensus sequence. The last step is graphical inspection of the results, for which we mainly use Tablet.
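The three steps of the workflow can be sketched as command-line construction. The file names and filter values below are placeholders, and the exact flags should be checked against the documentation of the bwa, SAMtools and VarScan versions in use:

```python
def snp_workflow_cmds(ref="ref.fa", fq1="reads_1.fastq", fq2=None, out="sample"):
    """Builds illustrative command lines for the three steps described
    above: alignment (bwa), pileup (samtools), variant calling (VarScan).
    All paths and thresholds are placeholder assumptions."""
    cmds = [f"bwa index {ref}",
            f"bwa aln {ref} {fq1} > {out}_1.sai"]
    if fq2 is None:   # single-end
        cmds.append(f"bwa samse {ref} {out}_1.sai {fq1} > {out}.sam")
    else:             # paired-end
        cmds.append(f"bwa aln {ref} {fq2} > {out}_2.sai")
        cmds.append(f"bwa sampe {ref} {out}_1.sai {out}_2.sai {fq1} {fq2} > {out}.sam")
    cmds += [
        f"samtools view -bS {out}.sam > {out}.bam",
        f"samtools sort {out}.bam {out}.sorted",
        # per-base pileup over the reference, input to VarScan
        f"samtools pileup -f {ref} {out}.sorted.bam > {out}.pileup",
        # user-set filters, e.g. minimum coverage and variant allele frequency
        f"java -jar VarScan.jar pileup2snp {out}.pileup"
        f" --min-coverage 8 --min-var-freq 0.2 > {out}.snp",
    ]
    return cmds

single_end = snp_workflow_cmds()
paired_end = snp_workflow_cmds(fq2="reads_2.fastq")
```

In Galaxy, each of these commands would correspond to one tool box in the workflow graph, which is what makes the individual steps easy to swap.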
Any tool in the workflow can be substituted for another. For instance, Bwa can be replaced by Maq or Bowtie, and Tablet by GenomeView (which can also display annotation files).
1. Taylor J, Schenck I, Blankenberg D and Nekrutenko A (2007). Using Galaxy to perform large-scale interactive data analyses. Curr Protoc Bioinformatics, Chapter 10: Unit 10.5.
2. Blankenberg D, Taylor J, Schenck I, He J, Zhang Y, Ghent M, Veeraraghavan N, Albert I, Miller W, Makova KD, Hardison RC and Nekrutenko A (2007). A framework for collaborative analysis of ENCODE data: making large-scale analyses biologist-friendly. Genome Res, 17(6):960-964.
3. Li H and Durbin R (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25:1754-1760. [PMID: 19451168]
4. Li H and Durbin R (2010). Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. [PMID: 20080505]
5. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R and 1000 Genome Project Data Processing Subgroup (2009). The Sequence Alignment/Map (SAM) format and SAMtools. Bioinformatics, 25:2078-2079. [PMID: 19505943]
6. Milne I, Bayer M, Cardle L, Shaw P, Stephen G, Wright F and Marshall D (2010). Tablet: next generation sequence assembly visualization. Bioinformatics, 26(3):401-402.
7. Koboldt DC, Chen K, Wylie T, Larson DE, McLellan MD, Mardis ER, Weinstock GM, Wilson RK and Ding L (2009). VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics, 25(17):2283-2285. [PMID: 19542151]
8. Langmead B, Trapnell C, Pop M and Salzberg SL (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10:R25.
Conformational Rearrangements of Lipases Investigated by Molecular Dynamics Simulations
Abstract: The ‘interfacial activation’ of most lipases at an oil-water interface or in non-polar solvents is thought to be related to large conformational changes of a so-called lid subdomain, which switches the enzyme from a closed (inactive) to an open (active) state. These rearrangements of the lid are generally assumed to be induced by adsorption of the enzyme to a hydrophobic interface, by the binding of a substrate, or by a combination of both mechanisms. Although the crystallographic resolution of a number of lipase structures in closed and/or open conformations has enhanced our knowledge of the mobile elements of lipases, a detailed description of the lid conformational transitions and of the molecular mechanism by which the lid is triggered is still lacking.
To investigate this phenomenon further, we carried out molecular dynamics (MD) simulations under different conditions on two lipases, from Burkholderia cepacia (BCL) and from Yarrowia lipolytica (LIP2) [1,2]. Our simulations reveal the conformational rearrangements occurring under the influence of an explicit solvent environment (water, octane or a water/octane interface) or upon substrate binding, and the effect of these rearrangements on substrate access to the enzyme active site.
The full closed <-> open conformational transitions were successfully obtained by MD simulation for both enzymes, although the opening/closing mechanisms appeared clearly distinct. For BCL, the effect of the solvent alone (aqueous or apolar) was enough to induce the complete interconversions spontaneously. Conversely, adsorption of LIP2 to a hydrophobic phase only led to a partially open conformation, which was nevertheless wide enough to let the substrate reach the catalytic site. MD simulations showed that only upon substrate binding was the lipase able to adopt a fully open conformation, suggesting a two-step mechanism for the activation of this enzyme.
Altogether, these results provide us with a comprehensive understanding of the molecular determinants triggering the conformational transition undergone by BCL and LIP2 during their activation. This information is being used in our laboratory for engineering lipase mutants with improved properties for biotechnological purposes.
1- Barbe S., Lafaquière V., Guieysse D., Monsan P., Remaud-Siméon M. and André I. Insights into lid movements of Burkholderia cepacia lipase inferred from molecular dynamics simulations. Proteins. 2009; 77(3):509-523.
2- Bordes F., Barbe S., Escalier P., Mourey L., André I., Marty A. and Tranier S. Exploring the conformational states and rearrangements of Yarrowia lipolytica Lipase. Biophys. J. (in press).
Discriminating between spurious and significant matches
Abstract: Word matches are widely used to compare DNA sequences, especially when the compared sequences are too long to be aligned with classical methods. Thus, for example, complete genome alignment methods often rely on the use of matches for building the alignments and various alignment-free approaches that characterize similarities between large sequences are based on word matches.
Among the matches retrieved between two genomic sequences, some may be spurious matches (SMs), i.e. matches arising by chance rather than from a homologous relationship. The number of SMs depends on the minimal match length (l) set in the algorithm. If l is too small, many matches are recovered but most of them are SMs. Conversely, if l is too large, fewer matches are retrieved and many shorter significant matches are probably missed. Finally, the subsequent analysis of the obtained matches is significantly impaired when the number of SMs is high.
To date, the choice of l mostly depends on empirical threshold values rather than robust statistical methods. To overcome this problem, we propose a statistical approach based on the use of a mixture model of geometric laws to characterize the length distribution of matches obtained from the comparison of two genomic sequences. In this work, the basic principles of our approach are presented. Its strengths and weaknesses are then discussed through examples drawn from bacterial genome comparisons.
Definition of 3D patches and relational mining for the characterization and prediction of protein-protein interaction sites
Abstract: Protein-protein interactions (PPIs) play a major role in living systems, so understanding PPIs is important for exploring complex cellular processes. Many experimental and computational methods are currently available to identify or predict PPIs. However, prediction accuracy remains insufficient because of two limitations. First, the known three-dimensional (3D) structures of proteins and protein complexes are still under-exploited. Second, the characterization of PPI sites is not explicit. In this context, we propose a new method with a twofold objective: (i) exploit the growing number of 3D protein structures for PPI prediction and (ii) go beyond the limits of the most common "black box" approaches.
We propose a relational representation of 3D protein patches. These patches provide both positive and negative examples of PPI sites. A relational data-mining method is then applied to induce a general definition of the PPI-site concept. Finally, the induced definition is used to predict PPI sites from a new 3D protein structure.
As a first validation of our approach, we applied it to a specific kind of PPI: phosphorylation sites. The results confirm the interest of the approach, both for the explicit form of the rules produced and for the prediction accuracy, which reaches 90% in cross-validation. This accuracy is fully comparable to that of the many existing prediction systems, which exploit much larger training sets.
Exploring the transcriptional response of Arabidopsis under stress conditions by a graph-mining approach highlights new insights into key metabolic pathways
Abstract: Plants respond to metal exposure by displaying a wide range of cellular mechanisms potentially involved in the detoxification of heavy metals (e.g. cadmium). Previous studies in the model plant Arabidopsis thaliana have shown altered expression of different sets of genes that are mainly regulated at the transcriptional level, yet the underlying control mechanisms are still poorly understood (Herbette, 2006). Transcriptional coordination, i.e. co-expression patterns of genes across many microarray data sets, may reveal groups of functionally related genes that are likely to be controlled by the same set of transcription factors. While graph-based algorithms for module detection are generally applied to data from a single microarray experiment, integrating a large collection of microarray experiments enhances the predictive power of co-expression analysis. To investigate the transcriptional network elicited by these stress conditions, we used a graph-based data-mining approach to identify genes in Arabidopsis thaliana that are co-expressed with known cadmium-responsive components. Unlike most co-expression studies in Arabidopsis, which rely on classic approaches calculating expression correlation (for review see Usadel et al., 2009), our approach differs in how gene co-expression is determined and transcriptional modules are selected (Boyer et al., 2009).
Co-expression graphs were built from the ArrayExpress two-color Arabidopsis thaliana collection, comprising 38 microarray datasets with more than 600 hybridization experiments (data downloaded in 2008). The resulting compendium was then used to explore part of the transcriptional network invoked upon plant exposure to stress, with known cadmium-responsive genes as guide genes. Given the importance of combinatorial control for gaining new insights into the underlying regulatory network, we subsequently derived all significantly over-represented motif combinations for transcriptional module assessment, as previously described (Lindlöf et al., 2009). This strategy, enriched by a careful gene-by-gene review based on manual literature searches, revealed two main modules composed of genes related to key metabolic pathways. The first module is partly related to the sulfur/nitrate assimilation pathways and to the biosynthesis of glutathione amino-acid precursors, while the second is connected to the glucosinolate biosynthesis pathway. In the latter, in addition to genes known to be co-regulated, we also found new genes that could play implicit roles in glucosinolate biosynthesis, as suggested by their co-expression relationship with specific combinations of cis-regulatory motifs. Interestingly, some of these newly identified genes have very recently been demonstrated to actually play a role in this pathway, supporting the relevance of our findings.
Taken together, these results suggest that our approach is suitable for gene prioritization and can also be of interest for the identification of candidates for functional analysis in order to dissect more precisely the underlying transcriptional network elicited by stress conditions.
Boyer F et al. A graph-based approach for extracting transcriptional regulatory modules from a large microarray compendium: an application to the transcriptional response of Arabidopsis thaliana under stress condition. Proceedings of the 6th International Workshop on Computational Systems Biology 2009, Aarhus, Denmark, TICSP series 48 pp 23-27, 2009.
Herbette S et al. Genome-wide transcriptome profiling of the early cadmium response of Arabidopsis roots and shoots. Biochimie 88:1751-1765. 2006.
Lindlöf A. et al. In silico analysis of promoter regions from cold-induced genes in rice (Oryza sativa L.) and Arabidopsis thaliana reveals the importance of combinatorial control. Bioinformatics, 25: 1345-1348. 2009
Usadel B et al. Co-expression tools for plant biology: opportunities for hypothesis generation and caveats. Plant Cell Environ. 32:1633-51. 2009.
Comparison of mapping software for next-generation sequencing data
Abstract: Recent DNA sequencers, usually called "next generation", produce reads that are shorter and far more numerous than those of previous sequencers. New alignment tools have been developed for this new type of read. Our study evaluates the efficiency, strengths and weaknesses of these tools.
We have identified about 40 software tools currently used to map reads produced by next-generation sequencers (NGS) onto known genomes. Our study focuses on reads produced by Illumina sequencers, but also considers the specificities of SOLiD reads (color code).
Methodology: We simulate two sets of reads of length 40 bp, drawn uniformly from a dataset. To reflect the diversity of genomic data, we use two kinds of datasets: the human genome and a concatenation of 1000 bacterial genomes. Each set contains 10M reads, close to the actual amount produced by NGS machines. In the first set, reads contain no errors; in the second, three mismatches are added at random positions. We use 11 of the most widely used tools (BWA, Novoalign, Bowtie, MOSAIK, MOM, Probematch, SOAP2, Bfast, SHRiMP, MAQ and ZOOM) to align the simulated reads on the genome. We monitor several performance indicators for each tool: CPU time, memory, whether the read matches at its original position, the number of match positions found for a given read, and the number of match positions present in the original genome. We also take into account the usability, flexibility, output format and documentation of the tools.
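A read simulation of this kind can be sketched as follows; this is a hypothetical re-implementation of the described set-up, not the authors' benchmark code:

```python
import random

def simulate_reads(genome, n_reads, read_len=40, n_mismatch=0, seed=42):
    """Draw reads uniformly from a reference and inject a fixed number of
    mismatches at random positions, keeping the true origin of each read
    so that a mapper's answer can be checked against it later."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = list(genome[start:start + read_len])
        for pos in rng.sample(range(read_len), n_mismatch):
            # substitute with a base guaranteed to differ from the original
            read[pos] = rng.choice([b for b in "ACGT" if b != read[pos]])
        reads.append((start, "".join(read)))   # (true position, read sequence)
    return reads
```

Evaluating a mapper then amounts to checking, for each read, whether the reported position equals the stored true `start`, which is the "matches at its original position" indicator above.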
Functional prediction in the scope of large-scale multi-class learning
Abstract: With the continuously increasing amounts of biological data, the need for automated and rapid classification (or prediction) has become all the more urgent. This need is especially challenging when the number of classes is very large (beyond hundreds). A concrete case study is predicting the functional groups of biological sequences. The inefficiency of alignment-based solutions in many cases raises the question of whether machine learning can address this task. We show that the number of classes is a very significant constraint when using machine learning. We briefly present some approaches we have experimented with, discussing the advantages and disadvantages of each. Finally, we recommend a two-phase approach coupling hidden Markov models and standard classifiers.
Oasys: a tool dedicated to the visualization and exploration of microarray data
Abstract: Oasys is a free cross-platform application developed in Java and dedicated to genomic visualization and microarray data exploration. The tool uses generic data types in order to handle many kinds of microarrays (DNA, RNA, SNP, miRNA, methylome, ...). It can visualize numerical data such as normalized data matrices or the associated statistical tests.
Besides microarray experiment data, the user can easily import and manage annotations of the markers present on the arrays (mRNA, clone, SNP, etc.) as well as the clinical and biological annotations of the samples concerned. Oasys can also handle genome-positioned information from public databases (Ensembl, NCBI, UCSC, ...). Users can manage their own lists (genes, clones, probesets, samples, etc.) and import their documents in Word, Excel or other formats.
All data positioned at the chromosomal level can be included in an Oasys plot. An Oasys visualization can be composed of multiple plots aligned on a shared horizontal axis representing genomic position.
Predefined visualization templates let the user create, in a few clicks, a wide variety of visualizations ready to be exported into scientific articles. Using a template starts with a sample selection through queries on the available clinical and biological annotations. Different groups of samples can then easily be compared, for instance the frequency of chromosomal alterations in p53-mutated samples against that of a control group.
Oasys offers several exploration features, such as highlighting points above a given threshold, displaying information when the mouse hovers over visualized elements, zooming, displaying gene and sample information, gene search, exporting a mouse-selected set of points in text format, and links to the visualizations of the same genomic regions on the Ensembl and UCSC sites.
A website dedicated to the tool is available, with complete user documentation, many screenshots and video tutorials covering the key stages of use. Creating an account on the site is required to download the tool (http://cit.ligue-cancer.net/oasys).
Oasys was developed and is used within the "Cartes d'Identité des Tumeurs (CIT)" programme (http://cit.ligue-cancer.net), which aims to characterize multiple tumour types through coupled analyses of gene expression, chromosomal alterations and polymorphisms. The programme is funded and steered by the Ligue Nationale Contre le Cancer (http://www.ligue-cancer.fr) and maintains a database of more than 10,000 microarray experiments performed on 8,000 samples.
 Almagro-Garcia J. et al. (2009) SnoopCGH: software for visualizing comparative genomic hybridization data. Bioinformatics, 25(20), 2732-2733.
 Chen W. et al. (2005) CGHPRO -- a comprehensive data analysis tool for array CGH. BMC Bioinformatics, 6, 85.
 Chi B. et al. (2004) SeeGH - A software tool for visualization of whole genome array comparative genomic hybridization data. BMC Bioinformatics, 5, 13.
 Kasprzyk A. et al. (2004) EnsMart: a generic system for fast and flexible access to biological data. Genome Res., 14, 160-169.
 Kent W.J. et al. (2002) The human genome browser at UCSC. Genome Res., 12, 996-1006.
 La Rosa P. et al. (2006) VAMP: visualization and analysis of array-CGH, transcriptome and other molecular profiles. Bioinformatics, 22(17), 2066-2073.
 Lewis S. E. et al. (2002) Apollo: a sequence annotation editor. Genome Biol., 3, 12.
 Lingjaerde O.C. et al. (2005) CGH-Explorer: a program for analysis of array-CGH data. Bioinformatics, 21(6), 821-822.
 Martin O. et al. (2009) AssociationViewer: a scalable and integrated software tool for visualization of large-scale variation data in genomic context. Bioinformatics, 25(5), 662-663.
 Rutherford K. et al. (2000) Artemis: Sequence visualization and annotation. Bioinformatics, 16(10), 944-945.
 Wang P. et al. (2004) A method for calling gains and losses in array CGH data. Biostatistics, 6(1), 45-58.
Logical modelling of the regulatory network controlling the formation of egg appendages in Drosophila
The formation of dorsal appendages (DA) during Drosophila oogenesis provides a powerful model to study epithelial morphogenesis. In contrast with most dipteran eggs, which display a smooth surface, Drosophila eggs present specific structures, the most conspicuous of which are the DA at the anterior pole of the egg, thought to play a respiratory role. The Drosophila eggshell is a chorionic structure secreted by the follicle cell (FC) epithelium of the egg chamber. Experimental studies have identified Gurken (Grk) and Decapentaplegic (Dpp) as the two main regulatory signals that trigger the pathways responsible for patterning the follicle cell epithelium that will secrete the eggshell in late oogenesis. Moreover, the expression pattern of many downstream targets operating in the different eggshell regions is dynamic in both space and time. Indeed, the complexity of the underlying regulatory network has complicated the analysis of a good deal of experimental results. We use a modelling approach with the aim of improving our understanding of this biological system.
Material and Methods:
The logical formalism offers a suitable framework to study regulatory networks. Briefly, regulatory components are represented by variables that can take a finite number of discrete values representing their levels of activity with respect to given thresholds. Logical rules determine the target level of each variable depending on those of its regulators. Beyond topological properties of the wiring diagram, the logical formalism offers tools to study dynamic properties, to simulate experiments in silico and to make new predictions.
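These logical update rules can be sketched in a few lines; the toy two-component circuit below (with a ternary component B) is invented for illustration and is not part of the DA model.

```python
# Minimal sketch of synchronous updating in a logical (discrete) regulatory
# model, in the spirit of the formalism used by GINsim. The circuit, rules
# and thresholds are illustrative only.

def step(state, rules):
    """Next state: each component moves one level toward the target level
    assigned by its logical rule (levels change one unit at a time)."""
    nxt = {}
    for comp, rule in rules.items():
        target = rule(state)
        cur = state[comp]
        nxt[comp] = cur + (target > cur) - (target < cur)
    return nxt

# Toy circuit: A activates B (B is ternary, levels 0..2); B represses A.
rules = {
    "A": lambda s: 0 if s["B"] >= 1 else 1,   # A is on unless B represses it
    "B": lambda s: 2 * s["A"],                # B tends to level 2 when A is on
}

state = {"A": 1, "B": 0}
trajectory = [state]
for _ in range(4):
    state = step(state, rules)
    trajectory.append(state)
```

Iterating `step` from an initial state traces the synchronous dynamics; GINsim additionally supports asynchronous updating and attractor analysis.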
We review the literature to extract a map of the regulatory network controlling dorsal appendage formation in Drosophila melanogaster. Using the logical modelling software GINsim, we further simplify this map to build a dynamic model of the intracellular pathway, with Grk and Dpp as inputs, and Rhomboid (Rho), Broad (Br) and Fasciclin III (Fas3) as readouts: Br marks the roof cells, Rho and Fas3 together mark the floor cells, and Fas3 alone marks the operculum. Jordan et al. (2005) showed that the transcription factor Pangolin has a negative effect on Fas3, but whether it acts directly or through Br remains unclear. Our preliminary results support the hypothesis that Br does inhibit Fas3 in the roof cells.
In spite of important modelling efforts (see for example [6,7]), the precise wiring of the regulatory network remains elusive. Our map synthesises in an intuitive format the current knowledge of the control of dorsal appendage patterning in Drosophila melanogaster. We have yet to experimentally confirm our modelling predictions. In the near future, we also plan to use our cellular model as a module to build a multicellular model of the antero-dorsal follicle cell epithelium.
This work is supported by the Fundação para a Ciência e a Tecnologia, Portugal (project PTDC/EIACCO/099229/2008).
 S. Horne-Badovinac and D. Bilder, Mass transit: Epithelial morphogenesis in the Drosophila egg chamber. Developmental Dynamics 232:559-574, 2005.
 F. Peri and S. Roth, Combined activities of Gurken and Decapentaplegic specify dorsal chorion structures of the Drosophila egg. Development 127:841-850, 2000.
 R. Thomas and R. D’Ari, Biological Feedback, CRC Press, 1990.
 A. Naldi, D. Berenguier, A. Fauré, F. Lopez, D. Thieffry, and C. Chaouiya, Logical modelling of regulatory networks with GINsim 2.3. BioSystems 97:134-9, 2009.
 K. C. Jordan, S. D. Hatfield, M. Tworoger, E. J. Ward, K. A. Fischer, S. Bowers and H. Ruohola-Baker, Genome wide analysis of transcript levels after perturbation of the EGFR pathway in the Drosophila ovary. Dev Dyn. 232:709-24, 2005.
 S. Y. Shvartsman, C. B. Muratov and D. A. Lauffenburger, Modeling and computational analysis of EGF receptor-mediated cell communication in Drosophila oogenesis. Development 129:2577-89, 2002.
 J. Lembong, N. Yakoby, and S. Y. Shvartsman, Pattern formation by dynamically interacting network motifs. Proc Natl Acad Sci U S A 106:3213-8, 2009.
Pipeline for the pre-processing of Illumina reads
Abstract: Next-generation sequencing technologies produce a huge number of short sequences, called reads, that need to be processed to build the sequenced genomes. Reads are either assembled into contigs thanks to overlapping regions (de novo assembly) or mapped to a reference genome (mapping). The quality of de novo assembly is highly dependent on the quality of the reads and increases with the number of error-free reads. Moreover, next-generation sequencing technologies are commonly used for comparative studies of a large number of organisms, strains and/or individuals and for the identification of variations such as Single Nucleotide Polymorphisms (SNPs) and insertions/deletions (indels). Since SNP discovery relies on the comparison of reads, it strongly depends on their quality. Even though the high coverage provided by new technologies can be used to circumvent the presence of sequencing errors, the duplication of some reads can be artifactual, thus leading to the discovery of false-positive SNPs. Removing both artifactual and low-quality reads prior to analysis is thus essential.
We developed a pipeline able to describe and clean any data set of Illumina reads generated in single-end or paired-end mode. The description of the reads provided by the pipeline includes the identification of reads and/or pairs of reads occurring many times, the calculation of the average quality at each position across all reads, and the count of reads in which a single nucleotide (A, C, G, T or N) accounts for more than half of their length. The pre-processing includes the removal of reads similar to the oligonucleotides used to generate the data, reads of low quality, reads consisting of a single nucleotide or containing too many Ns, and, optionally, the removal of duplicated reads.
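The filters described above can be sketched as follows; function names and thresholds are illustrative assumptions, not the pipeline's actual parameters, and the quality- and adapter-based filters are omitted.

```python
# Toy versions of the read filters: low-complexity reads (one nucleotide
# over half the length), reads with too many Ns, and duplicate removal.

from collections import Counter

def low_complexity(read, max_frac=0.5):
    """True if a single nucleotide accounts for more than half the read."""
    most = Counter(read).most_common(1)[0][1]
    return most > len(read) * max_frac

def too_many_ns(read, max_n=2):
    return read.count("N") > max_n

def clean(reads, dedup=True):
    seen, kept = set(), []
    for r in reads:
        if low_complexity(r) or too_many_ns(r):
            continue
        if dedup:
            if r in seen:
                continue
            seen.add(r)
        kept.append(r)
    return kept

reads = ["ACGTACGT", "AAAAAAAT", "ACGTNNNN", "ACGTACGT", "ACGTTGCA"]
kept = clean(reads)
```

Here the poly-A read, the N-rich read and the duplicate are discarded, leaving the two clean reads.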
This pipeline has been successfully used for several data sets of Illumina reads of 34 and 36 nucleotides. We are planning to test it with longer Illumina reads and to allow the trimming of low quality reads instead of their removal. Because some laboratories use a control lane on their flow cell, we are planning to add an option allowing the removal of reads contaminated by the control sequence.
An exhaustive mapping method for NGS short reads reveals deficiency of heuristic approaches
Abstract: The ability of Next Generation Sequencing (NGS) platforms such as the ABI SOLiD System 3 and Illumina GAIIx to sequence several billions of base pairs (Gb) in a few days has made large-scale metagenomic projects possible. One of the main objectives of our metagenomics platform is to perform quantitative metagenomics, which consists in extracting accurate information about the relative abundance of genes and genomes and the natural diversity of species present in various complex ecosystems, such as human feces, grass or cheeses. One of the first analysis steps consists in counting mapped reads at each position of a catalog of reference sequences. In order to process huge amounts of data in a reasonable computing time, most mapping tools are based on heuristic approaches, which usually satisfy a mathematical criterion but remain suboptimal solutions to the mapping problem. This can become problematic, as rare events can be missed in the quantitative metagenomic context. We have addressed the problem of testing the empirical criterion of such heuristic mapping methods by designing a partly parallelized exact mapping algorithm called GRIMA (Gpu Read Index MApper), in order to give a clear view of their potential deficiencies. By fully exploiting GPGPU technology, this algorithm runs in a reasonable time. We ran three heuristics (SHRiMP2, STORM and ABI corona-lite) and our algorithm under the same conditions, in order to map two distinct SOLiD read datasets obtained from B. subtilis samples in our lab onto the B. subtilis strain 168 reference genome, and we compared their outputs. After describing the main features of the algorithm, preliminary results will be presented. The questions this raises about the reliability of heuristic methods will be discussed.
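The counting step can be illustrated with a naive exact-matching scan; GRIMA itself indexes SOLiD color-space reads on a GPU, so this toy version only shows what "counting mapped reads at each position of a reference" means.

```python
# Naive exhaustive mapper: for every read, find every exact occurrence in
# the reference and increment a per-position coverage counter.

def map_counts(reference, reads):
    counts = [0] * len(reference)
    for read in reads:
        start = 0
        while True:
            pos = reference.find(read, start)
            if pos == -1:
                break
            for i in range(pos, pos + len(read)):
                counts[i] += 1
            start = pos + 1          # allow overlapping occurrences
    return counts

ref = "ACGTACGTAC"
counts = map_counts(ref, ["ACGT", "GTAC"])
```

Each read maps twice here, so interior positions are covered by two reads while the reference ends are covered by one.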
Exploring the biodiversity of the world's largest ecosystem: BioMarKs project first results and bioinformatics challenges
Abstract: BioMarKs integrates 8 European research institutes and 30 experts in eukaryotic microbial taxonomy and evolution, marine biology and ecology, genomics and molecular biology, bioinformatics, as well as marine economy and policy, to assess the taxonomic depth, environmental significance, and human health and economic implications of the least explored biodiversity compartment in the biosphere: the unicellular eukaryotes or protists. Marine protists are microbial organisms that may build complex (in)organic skeletal structures, and have a profound impact on biogeochemical cycles and climate. They have complex genomes with thousands of genes producing molecules that influence marine ecosystem functioning, human health and economy, and which represent outstanding potential for future green energies, pharmaceutics and chemical industries.
BioMarKs assesses protist biodiversity at 3 depths (subsurface, deep-chlorophyll maximum, surface sediment) in 9 European coastal water sites, from Spitzbergen to the Black Sea, using massive rDNA sequencing (454 sequencing technologies). We use both rDNA and reverse transcribed rRNA general eukaryote and group-specific markers, in order to analyze the diversity and the abundance/activity of marine protists at different taxonomic levels. A suite of physical, chemical, and biological metadata from the same samples allows statistical analyses of the ecological forces shaping marine protist biodiversity.
This general strategy aims to (i) establish a baseline of protist biodiversity in European coastal waters, (ii) measure biodiversity change in marine protist communities facing ocean acidification, (iii) evaluate the impact of ballast water and pollution on marine protist biodiversity.
In addition, BioMarKs will provide baseline data and new methods for future surveys of marine biodiversity change and for the evaluation of its ecological and economic cost. The data retrieved in the framework of BioMarKs will become the world's largest community resource on marine unicellular eukaryotic biodiversity, providing a reference platform for current and future projects dealing with this important biodiversity compartment, and bringing the European community to the forefront of marine eukaryote microbial ecology.
A platform for real-time control of gene expression
Abstract: Biology is undergoing a historical revolution with the development of systems and synthetic biology. To cope with the complexity of biological systems, quantitative models are increasingly needed. Obtaining a quantitative description of biological processes necessitates the capability to observe the response of biological systems subjected to large numbers of different perturbations. However, because of the difficulty of perturbing a biological system precisely in a dynamical manner, most studies simply use step-response, static perturbations. In contrast, time-varying perturbations have the capacity to provide much richer information on the system dynamics. To improve our capacity to express in vivo a protein of interest in a chosen time-dependent way, and thus perturb biological systems, we propose to develop a platform for the real-time control of gene expression.
The platform integrates a microfluidic device that allows modulating gene expression by changing the extracellular environment, a fluorescence microscope that allows monitoring gene expression, and computational approaches that compute in real time the inputs to apply to the system to obtain the desired outputs. In this work, we present preliminary results on the control of the nuclear localization of the osmoresponsive transcription factor Hog1 in yeast using our platform equipped with a simple PID controller, and on the development of more elaborate, model-based control approaches. To the best of our knowledge, this is the first application of control theory to the actual control of gene expression at the single-cell level.
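A discrete PID loop of the kind mentioned above can be sketched as follows; the first-order "plant" standing in for gene expression and all gains are toy assumptions, not the actual Hog1 system.

```python
# Minimal discrete PID controller driving a toy first-order plant.

def make_pid(kp, ki, kd, dt):
    state = {"i": 0.0, "prev": None}
    def pid(error):
        state["i"] += error * dt
        d = 0.0 if state["prev"] is None else (error - state["prev"]) / dt
        state["prev"] = error
        return kp * error + ki * state["i"] + kd * d
    return pid

def simulate(setpoint, steps=400, dt=0.1):
    pid = make_pid(kp=2.0, ki=0.5, kd=0.0, dt=dt)
    x = 0.0                      # measured output (e.g. a fluorescence level)
    for _ in range(steps):
        u = pid(setpoint - x)
        u = max(0.0, u)          # physical inputs (e.g. osmolarity) are >= 0
        x += dt * (u - x)        # toy first-order response of the plant
    return x

final = simulate(1.0)
```

With integral action the controlled variable settles at the setpoint despite the plant's own dynamics, which is the property exploited for set-point tracking of expression levels.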
RENABI GRISBI - Grande Infrastructure pour la Bioinformatique
Abstract: The GRISBI platform (Grande Infrastructure pour la Bioinformatique) is a joint initiative of six national bioinformatics platforms, members of the RENABI network: PRABI Lyon, GenOuest Rennes and Roscoff, CBiB Bordeaux, BIPS Strasbourg, CIB Lille, MIGALE Jouy-en-Josas. The goal is to build a distributed bioinformatics infrastructure serving the national scientific community, within the French RENABI network (Réseau National des plateformes de Bioinformatique), which coordinates the thirteen national bioinformatics centers. The GRISBI initiative is funded by the GIS IBISA (national coordination of infrastructures in Biology, Health and Agronomy), which accredited the GRISBI platform in 2008. The purpose of the GRISBI platform is to enable experiments on large-scale biological systems in domains such as comparative genomics, genome annotation, systems biology, protein function prediction, or molecular interactions such as protein/protein or protein/DNA interactions. The six initial GRISBI centers are working to share and link their national resources dedicated to bioinformatics using grid software components: storage and computing resources proper, but also their databases and software repositories.
Reconstruction and validation of the genome-scale metabolic model of Yarrowia lipolytica iNL750
Abstract: For many fungal genomes of biotechnological interest, the combination of large-scale sequencing projects and in-depth experimental studies has made it feasible to undertake metabolic network reconstruction. An excellent representative of this new class of organisms is Yarrowia lipolytica, an oleaginous yeast studied experimentally for its role as a food contaminant and its use in bioremediation and cell factory applications. As one of the hemiascomycetous yeasts completely sequenced in the Génolevures program, it enjoys a high-quality manual annotation by a network of experts.
We have developed a method of semi-automatic reconstruction of metabolic models, based on the prediction of conservation of enzymatic activities between two species. Following our reconstruction protocol, we extrapolated a Y. lipolytica genome-scale metabolic model (called iNL705) from an existing S. cerevisiae model (iIN800). This draft model was curated by a group of experts in Y. lipolytica metabolism, and iteratively improved and validated through comparison with experimental data by flux balance analysis.
In order to design better automatic methods, we formalized the steps of expert manual curation as an algebra of edit operations on metabolic models. The cycle of iterative refinements is represented as transformations of the automatically produced draft model, with an evaluation of improvement in accuracy after every step.
This study underscores the particular challenges of metabolic model reconstruction for eukaryotes. The experience acquired with our methods and formalizations should prove useful for similar reconstruction efforts.
TFM-Explorer: mining cis-regulatory regions in genomes
Abstract: DNA-binding transcription factors (TFs) play a central role in transcription regulation, and computational approaches that help in elucidating the complex mechanisms governing this basic biological process are of great use. In this perspective, we present the TFM-Explorer web server, a toolbox to identify putative TF binding sites within a set of upstream regulatory sequences of genes sharing some regulatory mechanisms. TFM-Explorer finds local regions showing overrepresentation of binding sites. Accepted organisms are human, mouse, rat, chicken and drosophila. The server offers a number of features to help users analyze their data: visualization of selected binding sites on genomic sequences, and selection of cis-regulatory modules. TFM-Explorer is available at http://bioinfo.lifl.fr/TFM.
RNA locally optimal secondary structures
Abstract: The folding space of an RNA provides rich knowledge about the structures and function of the RNA molecule. Partition functions or samples of optimal and suboptimal structures give some insight into this folding space. An alternative description of this space can also be achieved by studying so-called locally optimal secondary structures (Clote 2005). Locally optimal secondary structures cannot be extended without creating a pseudoknot or a base triplet. They give a very good picture of the RNA folding space, because they contain all potential secondary structures. We propose an efficient algorithm that computes all locally optimal structures for an RNA sequence. The algorithm is implemented in a software package called regliss. It is available for download and runs on a publicly accessible web server: http://bioinfo.lifl.fr/RNA/regliss.
Role of geography and languages in shaping population genetic structure
Abstract: In human population genetics, there is a long-standing interest in the relationships between genes, geography and languages. In this study, we investigate to what extent geography and languages predict the genetic structure of human populations, with a particular focus on Native American populations. In studies of Native American genetic variation, the use of linguistic data is complicated by the fact that two major different classifications, Greenberg's and Campbell's, have been proposed. Our approach is based on a Bayesian genetic clustering algorithm in which geographic and linguistic data are included. We devise statistical procedures to evaluate the capacity of spatial and linguistic variables to predict population structure as obtained from a large collection of microsatellite loci of the Human Genome Diversity Panel. Although geography is a good determinant of Native American genetic structure, we find that adding linguistic information provides a significant improvement. We further compare the ability of the different linguistic classifications to predict genetic structure in the Americas. Although some linguistic families in Campbell's classification, and especially the Tupi, are poorly genetically characterized, we find that Campbell's classification provides the best characterization of Native American genetic structure.
RNA-seq data analysis provides evidence for a new molecular mechanism generating antisense transcripts in human cells
Abstract: Recent large-scale projects of massive transcriptome profiling have revealed an unsuspected complexity in the repertoire of RNA molecules produced by eukaryotic cells [1,2]. The plethora of (protein) non-coding transcripts generated by pervasive transcription suggests that hidden layers of regulation remain to be identified in the process of gene expression [3].
Next-Generation Sequencing [4], or more generally High Throughput Sequencing [5], has proven to be a powerful tool to investigate this transcriptional dark matter [6,7], not only by characterizing new transcripts [8], but also by unveiling new molecular pathways occurring in vivo [3]. For instance, deep sequencing of human RNA suggested the existence of an unknown recapping mechanism operating on cleaved transcripts [9] by means of a biochemical component that was later identified [10].
Here we report the investigation of the human small RNA complement (sRNA, <200nt) by the application of a computational analysis pipeline to short reads from true single-molecule sequencing experiments [11], followed by further experimental validations. The genomic distribution of the sequence reads around annotated gene features exhibits a clear accumulation pattern for the reads that map at gene termini, within and antisense to the last 50bp of the annotated 3'UTRs, and reveals a new class of sRNAs that we call aTASRs, for antisense termini-associated short RNAs. Surprisingly, as these aTASRs have non-genomically encoded 5' polyU tails, they do not appear to be generated by antisense transcription but rather by a hitherto unknown RNA copying mechanism.
We explain here why such aTASRs could not be identified by previous studies, present experimental confirmations of their existence, and discuss the prevalence, impact and function of a novel RNA-dependent RNA polymerase activity in human cells.
ISsaga, a platform for identification and semi-automatic annotation of prokaryotic insertion sequences.
Abstract: The in-depth analysis of the growing number of completely sequenced prokaryotic genomes is providing important contributions to our understanding of the biological world. Large-scale sequencing of prokaryotic genomes demands automation of certain annotation tasks, such as the prediction of coding sequences and promoters. But automatic processes are severely limited and can lead to poor-quality annotation at the DNA level. A crucial example of this is the mobile genetic elements (MGEs), which have played a major role in genome evolution and contribute massively to horizontal gene transfer. A detailed and accurate analysis of MGE content and distribution would provide an important picture of the evolution of their host genomes. Unfortunately, a majority of the bacterial and archaeal genome sequences deposited in the public databases are seriously compromised in their MGE annotation, particularly for insertion sequences (ISs), one of the simplest autonomous MGEs found. In particular, fragments of incomplete ISs, which represent the scars of previous recombination events, are rarely annotated but provide important information on genome evolution. As judged by the poor quality of annotations in the public databases, it is clear that no accurate tools are available for IS identification and annotation. To facilitate the annotation of these elements, we have developed ISsaga (http://issaga.biotoul.fr), a high-quality semi-automatic annotation system directed to the accurate identification and annotation of ISs.
POTChIPS: a new method for ChIP-chip data analysis
Abstract: Chromatin ImmunoPrecipitation on chip (ChIP on chip or ChIP-chip) technology is used to detect the binding sites of proteins (generally transcription factors) on DNA. The statistical analysis consists in looking for significant peak regions, in order to find binding sites. The POTChIPS method presented here is inspired by the POT (Peaks Over Thresholds) model, stemming from extreme value theory. It models distribution tails by retaining only the values exceeding a given threshold; both the intensities of the excesses over the threshold and the occurrences of these excesses are modeled. This method makes it possible to determine a threshold beyond which peaks can be considered significant.
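The peaks-over-threshold idea can be sketched as follows, using an exponential tail (the simplest special case of the generalized Pareto distribution); the quantile, significance level and tail model are illustrative assumptions, not those of POTChIPS.

```python
# Keep intensities above a high empirical quantile, fit an exponential tail
# to the excesses, and call a probe significant when its exceedance
# probability under the fitted tail falls below alpha.

import math
import random

def pot_calls(intensities, q=0.95, alpha=0.01):
    xs = sorted(intensities)
    u = xs[int(q * (len(xs) - 1))]           # threshold = empirical quantile
    excesses = [x - u for x in intensities if x > u]
    scale = sum(excesses) / len(excesses)    # MLE of the exponential scale
    calls = []
    for i, x in enumerate(intensities):
        if x > u:
            # P(X > x) ~ P(X > u) * P(excess > x - u | X > u)
            p = (1 - q) * math.exp(-(x - u) / scale)
            if p < alpha:
                calls.append(i)
    return u, calls

random.seed(0)
data = [random.gauss(0, 1) for _ in range(2000)]
data[42] = 8.0                               # one artificial strong peak
u, calls = pot_calls(data)
```

The injected peak at index 42 stands far above the fitted tail and is flagged, while most above-threshold background values are not extreme enough to pass the significance cutoff.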
Comparing graph-based representations of protein for mining purposes
Abstract: Recently, the principles of graph theory have been adopted to address investigations of molecular and chemical structures, such as 3D protein structure prediction and spatial motif discovery. Proteins have been parsed into graphs according to several approaches and methods and then studied using graph theory concepts and data mining tools. In this paper we briefly survey the most widely used graph-based representations, and we propose a naïve method to help build protein graphs, since a key step of a valuable protein structure mining process is to build concise and correct graphs holding reliable information. We also show that some existing and widespread methods present remarkable weaknesses and do not faithfully reflect the real protein conformation.
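A common starting point among such representations is the contact-map graph, sketched here; the C-alpha distance cutoff and the sequence-separation filter are typical but arbitrary choices, and the coordinates are toy values.

```python
# Naive contact-map graph: residues are nodes, and an edge links two
# residues whose C-alpha atoms lie within a distance cutoff, skipping
# near-neighbours along the chain to ignore trivial backbone contacts.

import math

def contact_graph(ca_coords, cutoff=8.0, min_seq_sep=2):
    """Return adjacency as a dict: residue index -> set of neighbours."""
    n = len(ca_coords)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + min_seq_sep, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                adj[i].add(j)
                adj[j].add(i)
    return adj

# Four residues on a line, 4 Å apart: contacts reach two positions away.
coords = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (8.0, 0.0, 0.0), (12.0, 0.0, 0.0)]
g = contact_graph(coords)
```

The resulting adjacency structure is what graph-mining tools then consume, which is why the cutoff and separation choices directly shape any downstream motif discovery.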
Protein secondary structure prediction: an optimized implementation of the cascade architecture
Abstract: During protein folding, local periodic structural elements form, connected to one another by aperiodic elements. These elements constitute the secondary structure of proteins. Predicting this structure consists in assigning to each residue of a protein sequence its conformational state: alpha helix, beta strand or aperiodic. Statistical methods currently provide the best predictions. The basic architecture on which they rely is almost always the same: two classifiers are used in cascade. The first predicts the structure from the sequence (or from a multiple alignment); the second smooths the initial prediction by exploiting the correlation between the conformational states of consecutive residues. This basic building block is incorporated into combinations of various forms (for example the combination of several hundred multi-layer perceptrons), which raises model selection issues.
We propose an optimized version of the cascade architecture. Its performance is close to the state of the art, for a sample complexity lower by at least one order of magnitude. The outputs are estimates of the posterior probabilities of the categories, which allows their post-processing by generative models. The first level consists of the four main multi-class support vector machines (M-SVMs) proposed in the literature; the second is a multivariate linear combination of the outputs of these machines.
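The second level of this cascade can be sketched as follows; the per-residue posteriors below stand in for the outputs of the four M-SVMs, and the equal weights are illustrative rather than the optimized combination.

```python
# Level two of the cascade: a linear combination of per-residue class
# posteriors (helix H, strand E, coil C) produced by several machines.

def combine(posteriors_per_machine, weights):
    """Multivariate linear combination of machine outputs, per residue."""
    n_res = len(posteriors_per_machine[0])
    combined = []
    for r in range(n_res):
        scores = [0.0, 0.0, 0.0]              # H, E, C
        for w, machine in zip(weights, posteriors_per_machine):
            for k in range(3):
                scores[k] += w * machine[r][k]
        combined.append(scores)
    return combined

def predict(combined, classes="HEC"):
    return "".join(classes[max(range(3), key=lambda k: s[k])] for s in combined)

# Two toy "machines" over a 3-residue sequence, equal weights.
m1 = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1], [0.1, 0.2, 0.7]]
m2 = [[0.6, 0.3, 0.1], [0.2, 0.6, 0.2], [0.2, 0.1, 0.7]]
pred = predict(combine([m1, m2], [0.5, 0.5]))
```

Because the combined scores remain estimates of class posteriors, they can be handed to a generative post-processing model, as the abstract describes.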
DroPNet: Bioinformatics web platform for functional and proteomics data analysis
Abstract: DroPNet is a bioinformatics platform allowing biologists to visualise interaction networks from functional and proteomic data. Indeed, data coming from large-scale reverse genetic screens are not easy to handle. Confronting a group of proteins selected for a particular biological reason with protein-protein interaction data can bring out meaningful interactions. The objective of this user-friendly web platform is to greatly facilitate the manipulation of such large amounts of data for its users.
In this way, experimental data can be combined with information available on the web in various public databases of protein-protein interactions. After searching for direct and indirect interactions, up to a chosen depth, the user can visualise, filter, manipulate, browse, personalise and save the resulting networks. The platform provides an innovative approach to understanding biological mechanisms at the molecular level, allowing users to efficiently browse the interactions referenced in proteomic databases, which contain a lot of "noise", and offering an overview of the studied mechanisms, with Drosophila as our first model organism.
The result is the highlighting of links between interactions and functional properties of proteins, facilitating the selection of targets for further biological studies. This project leads to a multipurpose and adaptable tool for studies with biomedical content.
Can we link metagenome gene content and iron supply in the ocean ?
Abstract: Iron is a rare resource in many oceanic areas, and consequently its bioavailability often limits the growth of marine microorganisms. To investigate the link between iron availability and sequence prevalence in the environment, we performed a gene-centric meta-analysis of available oceanic metagenomic data.
We listed 145 genes involved in iron metabolism from the literature to build a non-redundant database of 2357 sequences (limited to one per genus) corresponding to 8 Iron Metabolism Pathways (IMP): Inorganic Iron Uptake (Fe2+ or Fe3+), Heme-Fe metabolism, Siderophore Synthesis and Uptake, Iron Storage, Regulation, and Oxidative Stress.
We used protein sequences from all available species (both from GenBank and from the Moore Foundation microbial isolates) to span the large phylogenetic diversity of microorganisms. We performed a BLAST-based sequence similarity search to identify the genomic basis of iron uptake strategies in 65 metagenomic samples. Iron concentrations and other environmental parameters of the sampling sites were provided by the CAMERA database and estimated from the ocean general circulation and biogeochemistry model NEMO-PISCES. We used a Generalized Linear Model to relate the amount and the prevalence of IMP to environmental variables.
We found a strong effect of habitat on 3 iron metabolic pathways: Fe(II) and Fe(III) uptake and the response to oxidative stress. We further identified strong positive and negative correlations of Fe(II) and Fe(III) uptake, respectively, with iron concentration, possibly reflecting iron speciation in coastal areas and the open ocean. Iron storage and siderophore uptake did not show any significant habitat effect, but a strong increase with predicted iron availability. We propose 3 new biomarkers of iron bioavailability that are strongly correlated with predicted iron concentration: bacterioferritin (storage, rho=0.4, p=0.006), FeoB (Fe(II) uptake, rho=0.33, p=0.02) and FecA (siderophore uptake, rho=-0.45, p=0.002). Taking the habitat effect into account, we moreover found that IMP are better correlated with environmental variables, and in particular iron concentration, than taxonomy. This suggests that environmental variables have more effect on IMP genomic strategies than on taxonomic diversity.
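The reported rho values are rank correlations; a minimal Spearman implementation (no tie handling) on toy data illustrates the computation, where the gene abundances and iron concentrations are invented numbers, not the study's data.

```python
# Spearman rank correlation from scratch (assumes no tied values).

def rank(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = pos + 1
    return r

def spearman(xs, ys):
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Toy example: gene abundance increasing monotonically with iron.
iron = [0.1, 0.3, 0.2, 0.8, 0.5]
abundance = [1.0, 2.0, 1.5, 9.0, 4.0]
rho = spearman(iron, abundance)
```

Any monotonically increasing relationship yields rho = 1, which is why rank correlation suits noisy abundance data better than a linear fit.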
BioDesc: managing a multi-format repository of bioinformatics resource descriptions
Abstract: In this article, we describe BioDesc, a software tool for describing bioinformatics resources and managing these descriptions over time. Its plug-in architecture makes it possible to extend BioDesc with new formats.
Logical modelling of MAPK pathways
Abstract: Mammalian Mitogen-Activated Protein Kinase (MAPK) pathways can be activated by a wide variety of stimuli, including growth factors, inflammatory cytokines belonging to the TNF family, as well as environmental stresses. This activation affects diverse cellular activities, including gene expression, cell cycle, survival, apoptosis, and cell differentiation. To date, six distinct groups of MAPK pathways have been characterised. Three of them have been more extensively studied and serve as a reference for the present study: extracellular regulated kinases (ERK1/2), Jun NH2 terminal kinases (JNK1/2/3), and p38 kinases (p38 alpha/beta/gamma/delta).
A recurrent feature of MAPK pathways is the presence of a central three-tiered core signalling module, consisting of a set of sequentially acting kinases: a MAPK, a MAPK kinase (MAPKK) and a MAPK kinase kinase (MAPKKK). The MAPKKKs are activated by phosphorylation or by interaction with GTP-binding proteins of the Ras/Rho family in response to extracellular stimuli. MAPKKK activation leads to the phosphorylation and activation of downstream MAPKKs, which in turn phosphorylate MAPKs. Once activated, MAPKs phosphorylate their target substrates, which can be transcription factors, other kinases or other proteins. At each level of a MAPK cascade, protein phosphorylation is regulated by the opposing actions of phosphatases. Since the physiological outcome of MAPK signalling depends on the magnitude and duration of kinase activation, this regulation by phosphatases plays an important role.
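This three-tiered kinase/phosphatase balance can be rendered as a toy ODE system; the mass-action form and rate constants are illustrative assumptions, not a calibrated MAPK model.

```python
# Toy three-tier cascade: each tier is activated by the tier above (or the
# stimulus) acting on its inactive fraction, and deactivated by a
# phosphatase term. Euler integration of the active fractions in [0, 1].

def cascade(stimulus, steps=2000, dt=0.01, k_act=2.0, k_phos=1.0):
    mapkkk = mapkk = mapk = 0.0
    for _ in range(steps):
        mapkkk += dt * (k_act * stimulus * (1 - mapkkk) - k_phos * mapkkk)
        mapkk  += dt * (k_act * mapkkk  * (1 - mapkk)  - k_phos * mapkk)
        mapk   += dt * (k_act * mapkk   * (1 - mapk)   - k_phos * mapk)
    return mapk

low, high = cascade(0.1), cascade(1.0)
```

The steady-state active MAPK fraction increases with the stimulus, reflecting the balance between activation from the tier above and dephosphorylation that the paragraph describes.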
Given the wide spectrum of activating stimuli and the large number of processes regulated, a fundamental and debated question is how signalling specificity is achieved. In this respect, at least five inter-related mechanisms have been proposed: distinct duration and strength of the signal; presence of multiple components with different roles at each level of the cascade; interaction with scaffold proteins that direct each component to distinct upstream regulators and downstream targets; cross-talks among signalling cascades that are activated simultaneously; and distinct sub-cellular localisations of cascade components and their targets.
To address this question, using the CellDesigner software, we are currently encoding all the components and interactions involved, to generate an annotated regulatory chart for each of the three reference MAPK pathways (including links to relevant scientific articles and database entries). At present, a regulatory map describing the ERK signalling pathway has been completed and the remaining two maps are under construction. The resulting regulatory charts will then be merged to obtain a comprehensive description of the MAPK pathways and to analyse their cross-talks. Using the logical modelling software GINsim, the regulatory chart will be translated into a predictive, qualitative dynamical model recapitulating the properties of MAPK pathways in reference cell types.
The resulting model will be merged with a generic model of cell fate recently proposed by Calzone et al. to analyse the impact of varying MAPK signalling on the selection of alternative cell death or survival modalities. Finally, this model will be used to predict cell fate in normal or pathological situations.
This work is financed by the EU FP7 APO-SYS project, the ANR SYSCOMM CALAMAR project (ANR-08-SYSC-003), and the Belgian Science Policy Office (IAP BioMaGNet).
 E. Zehorai, Z. Yao, A. Plotnikov and R. Seger. The subcellular localization of MEK and ERK - a novel nuclear translocation signal (NTS) paves a way to the nucleus. Mol. Cell. Endocrinol., 314: 213-220, 2010.
 A. Funahashi, Y. Matsuoka, A. Jouraku, M. Morohashi, N. Kikuchi and H. Kitano. CellDesigner 3.5: A Versatile Modeling Tool for Biochemical Networks. Proc. IEEE, 96: 1254-1265, 2008.
 A. Naldi, D. Berenguier, A. Fauré, F. Lopez, D. Thieffry and C. Chaouiya. Logical modelling of regulatory networks with GINsim 2.3. Biosystems, 97: 134-139, 2009.
 L. Calzone, L. Tournier, S. Fourquet, D. Thieffry, B. Zhivotovsky, E. Barillot and A. Zinovyev. Mathematical modelling of cell-fate decision in response to death receptor engagement. PLoS Comput. Biol., 6: e1000702, 2010.
Logical modelling of drosophila signalling pathways
Abstract: A limited number of signalling pathways are involved in the specification of cell fate during animal development. Several of these pathways were originally identified using the Drosophila model.
Focusing on mesoderm specification during Drosophila development, we are currently building a dynamical model to integrate all published data and predict the behaviour of the system for known or novel perturbations. As the available data are mostly qualitative, we use a logical framework enabling a flexible representation of regulatory components in terms of binary or multi-level variables [1-4]. Our current model involves a dozen transcription factors under the control of six major signalling pathways, namely the Notch (N), Wingless (Wg), Decapentaplegic (Dpp), Hedgehog (Hh), Epidermal Growth Factor (EGF), and Fibroblast Growth Factor (FGF) pathways [5-12].
To clarify their roles and possible cross-talks, we have built a separate logical model of each of these pathways. In each case, we consider the different ligands, receptors, signal transducers, modulators, and transcription factors reported in the literature. Extensively annotated, the resulting models qualitatively reproduce the main characteristics of the corresponding pathways, in the wild-type situation as well as for various mutants.
In the context of mesoderm specification, these signalling pathways determine the fate of each cell by providing the signals necessary to activate the genes required for differentiation. Altogether, these pathways define different territories (e.g., the territory receiving both Dpp and Wg gives rise to the presumptive territory of the heart). In contrast, FGF is required for the migration of the mesodermal cells, enabling them to receive effective signals from the ectoderm and pursue their differentiation. The Notch and RTK (EGF, FGF) pathways are involved in the control of cell division, as well as in the selection of muscle and heart cell progenitors. By and large, a deeper understanding of the functioning of these pathways is necessary to model muscle and cardiac differentiation, which involves both intra- and inter-cellular interactions.
A comparative analysis of these pathway models should enable the delineation of functional similarities and differences, e.g. regarding the activating or inhibitory activities of pathway effectors, be they transcription factors or kinases. The same proteins are sometimes involved in different pathways (e.g., CBP, Groucho, Sprouty). For example, Slimb orchestrates the proteolysis of different components downstream of Wg and Hh. All six pathways include a negative feedback and various inhibitors. In this respect, our models enable the simulation of various inhibition modes (partial versus total inhibition, protein-protein interactions versus transcriptional inhibition).
Although the current pathway models will likely require improvements as further biological data are gathered, they can be considered as bricks or modules with which to build more comprehensive models of complex developmental processes such as mesoderm specification. In this respect, the recent development of rigorous methods for the reduction of logical models enables the compaction of the resulting models when they become too large for efficient simulation.
These models highlight different characteristics of signalling pathways in the course of their activation: the formation of gradients (e.g. Dpp), the actions of various inhibitors, the requirement for different receptors or ligands, and phenomena such as dimerization or cooperation between different factors.
We are currently simulating all known mutants for each pathway to validate our models and check consistency with in vivo data.
Furthermore, these pathway models are very useful for refining our comprehensive model of the specification of mesodermal and cardiac cells. Indeed, the corresponding signalling pathways constitute major inputs controlling these processes, and it is thus essential to model each pathway and their cross-talks adequately.
Finally, and more prospectively, these Drosophila pathway models could serve as scaffolds to build more sophisticated models of orthologous mammalian pathways.
Abibatou Mbodj and Duncan Berenguier are supported by PhD grants from the French Ministry of Research and Technology. Our research is further supported by the Belgian Science Policy Office (IAP BioMaGNet).
A. Naldi, F. Lopez, D. Thieffry and C. Chaouiya. Qualitative modelling and analysis of biological regulatory networks with GINsim 2.3. Biosystems, 97: 134-139, 2009.
A. Naldi, E. Remy, D. Thieffry and C. Chaouiya. Dynamically consistent reduction of logical regulatory graphs. Theor. Comp. Sci., in press, 2010.
A.G. González, C. Chaouiya and D. Thieffry. Qualitative dynamical modelling of the formation of the anterior-posterior compartment boundary in the Drosophila wing imaginal disc. Bioinformatics, 24: i234-i240, 2009.
L. Sánchez, C. Chaouiya and D. Thieffry. Segmenting the fly embryo: logical analysis of the role of the Segment Polarity cross-regulatory module. Int. J. Dev. Biol., 52: 1059-1075, 2009.
A. Tapanes-Castillo and M.K. Baylies. Notch signaling patterns Drosophila mesodermal segments by regulating the bHLH transcription factor twist. Development, 131: 2359–2372, 2004.
E.S. Seto and H.J. Bellen. The ins and outs of wingless signaling. Trends Cell Biol., 14: 45-53, 2004.
E.L. Ferguson and K.V. Anderson . Decapentaplegic acts as a morphogen to organize dorsal-ventral pattern in the Drosophila embryo. Cell, 71: 451-61, 1992.
K. Arora, M.S. Levine and M.B. O'Connor. The screw gene encodes a ubiquitously expressed member of the TGF-beta family required for specification of dorsal cell fates in the Drosophila embryo. Genes Dev., 8: 2588-2601, 1994.
P.W. Ingham. Transducing hedgehog: the story so far. EMBO J., 17: 3505-3511, 1998.
L. Lum, C. Zhang, R.K. Oh, R.K. Mann, D.P. von Kessler, J. Taipale, F. Weis-Garcia, R. Gong, B. Wang and P.A. Beachy. Hedgehog signal transduction via smoothened association with a cytoplasmic complex scaffolded by the atypical kinesin, costal-2. Mol. Cell, 12: 1261-1274, 2003.
Y. Yarden and B.Z. Shilo. SnapShot: EGFR signaling pathway. Cell, 131: 1018, 2007.
R. Wilson, E. Vogelsang and M. Leptin. FGF signalling and the mechanism of mesoderm spreading in Drosophila embryos. Development, 132: 491–501, 2005.
Identification of cis-regulatory elements and functional associations from clusters of genes co-expressed during Drosophila zygotic activation.
Abstract: During the three hours following fertilization, the Drosophila embryo undergoes drastic changes at both the regulatory and morphological levels. During the first hour after fertilization, only nuclei divide, through fast mitotic cycles, and no transcription occurs; the control of development is ensured by maternal mRNAs loaded into the egg during oogenesis. From one to three hours after fertilization, nuclei continue to divide and migrate to the peripheral membrane to form the syncytial blastoderm. Meanwhile, a set of about 60 genes is transcribed; this is called the minor wave of zygotic genome activation (ZGA). At three hours, the 14th mitotic cycle pauses at interphase, and the plasma membrane invaginates and surrounds the nuclei. This process individualizes the cells, thereby forming the cellular blastoderm. At the same time, more than 300 genes are transcribed; this is the major wave of ZGA. We aim to decipher the mechanisms governing zygotic activation, such as the interaction pathways between transcription factors and their target genes, as well as the biological processes in which these genes are involved.
Several studies provide transcriptome data from the early embryo. Using a time-course analysis and RNAi experiments, Pilot et al. (2006) sought to identify zygotic genes involved in the nuclear shape changes occurring during cellularization. Using a time-course analysis and chromosomal deletions, De Renzis et al. (2007) classified genes according to their origin (maternal versus zygotic contribution) and their transcriptional timing. Moreover, they performed motif discovery on the upstream regions of purely zygotic genes and on the 3' UTRs of maternally degraded transcripts. They highlighted TAGteam motifs, later identified as the binding motif of Zelda, a factor directly implicated in the first wave of ZGA. They also found motifs putatively corresponding to binding sites of the Pumilio family, as well as AU-rich elements involved in mRNA degradation. Finally, Lu et al. (2009) studied the influence of the nucleo-cytoplasmic ratio on zygotic activation, on the basis of a comparison between the transcription profiles of wild-type versus haploid mutant embryos. In each case, the authors identified several clusters of co-expressed genes.
In a first step, we retrieved the raw transcriptome data from the Pilot and Lu studies and re-normalized these data sets using the RMA method, which appears to be the most robust. Next, we computed expression profiles relative to the whole-chip median signal, as well as differential profiles (log-ratios between consecutive time points). Clustering was then performed on the differential profiles rather than on the expression profiles, because the former directly represent the variations of expression at successive developmental stages. We tested different parameters for hierarchical clustering (agglomeration rule, similarity metric), but the resulting clusters were poorly robust. In order to obtain consistent and directly interpretable clusters, we applied a naive method based on a discretization of the differential profiles into ternary vectors (up, down, stable), using multi-testing corrected p-values estimated from chip-wise z-scores. For the De Renzis data, we decided to analyze only the published clusters because of the complexity of the data.
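The ternary discretization step can be sketched as follows. The threshold, data values, and function names are illustrative assumptions, not the study's actual code:

```python
# Sketch: discretise differential profiles into ternary vectors (up/down/stable).
def discretise(log_ratios, p_values, alpha=0.05):
    """Map each (log-ratio, corrected p-value) pair to +1 / -1 / 0."""
    profile = []
    for lr, p in zip(log_ratios, p_values):
        if p < alpha:
            profile.append(1 if lr > 0 else -1)  # significant change: up or down
        else:
            profile.append(0)                    # change not significant: stable
    return tuple(profile)                        # hashable: usable as a cluster key

# Genes sharing the same ternary vector fall into the same cluster:
genes = {
    "geneA": discretise([1.8, -0.2, -2.1], [0.001, 0.40, 0.003]),
    "geneB": discretise([2.0, 0.1, -1.7], [0.002, 0.80, 0.010]),
}
print(genes["geneA"] == genes["geneB"])  # True: same (up, stable, down) pattern
```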
The resulting clusters were analyzed using compare-classes to identify over-represented Gene Ontology annotation terms. To discover cis-regulatory elements in their regulatory regions (upstream regions, introns, UTRs), we used the tools oligo-analysis and dyad-analysis from the RSAT software suite. Over-represented motifs were assembled and converted into position-weight matrices (PWMs), which were then compared with motifs annotated in several databases (TRANSFAC, FlyReg, JASPAR, Tiffin) using compare-matrices.
For several clusters, we detected significant associations between over-represented motifs and GO term enrichment. For example, for a cluster grouping mRNAs that are maternally provided and rapidly degraded before cellularization, we found a highly significant motif corresponding to the binding site of Dref (DNA replication-related element factor), which is implicated in spindle organization. Consistently, this cluster is enriched for genes involved in this biological process, with a very low E-value. In some clusters we found significant unknown motifs; we are now looking for significant associations between these motifs and for clusters of them within restricted regions. For clusters grouping genes over-expressed during ZGA, we re-discovered the TAGteam motifs, which validates our protocol, along with other unknown motifs of high significance; in the same way, we will look for associations and groupings of these motifs.
We are currently completing this analysis by analyzing the enrichment of gene clusters in known Drosophila PWMs, and by searching for over-representation of combinations of motifs associated with the gene clusters specifically induced around the MZT.
In addition, we plan to perform comparative genomic analyses to evaluate the conservation of the resulting cis-regulatory motifs, starting with the other available Drosophila genomes (12 species spanning ca. 50 million years of divergence).
Finally, the most promising motifs will be tested for their involvement in the regulation of transcriptional zygotic activation.
This work is supported by the ANR (project NeMo) and by the Belgian Science Policy Office (IAP BioMaGNet).
 F. Pilot, J-M. Philippe, C. Lemmers, J-P. Chauvin and T. Lecuit. Developmental control of nuclear morphogenesis and anchoring by charleston, identified in a functional genomic screen of Drosophila cellularisation. Development, 133:711-723, 2006.
 S. De Renzis, O. Elemento, S. Tavazoie and E.F. Wieschaus. Unmasking activation of the zygotic genome using chromosomal deletions in the Drosophila embryo. PLoS Biol., 5:e117, 2007.
 X. Lu, J. M. Li, O. Elemento, S. Tavazoie and E.F. Wieschaus. Coupling of zygotic transcription to mitotic control at the Drosophila mid-blastula transition. Development, 136:2101-2110, 2009.
 R.A. Irizarry, B.M. Bolstad, F. Collin, L.M. Cope, B. Hobbs and T.P. Speed. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res., 31:e15, 2003.
 M. Defrance, R. Janky, O. Sand and J. van Helden. Using RSAT oligo-analysis and dyad-analysis tools to discover regulatory signals in nucleic sequences. Nat. Protoc., 3:1589-603, 2008.
Using Hoare logic for constraining parameters of discrete models of gene networks
Abstract: DNA chip technology allows the quantification of the expression levels of thousands of genes in the same organism at the same time. Nevertheless, the analysis of DNA chip data requires the development of adapted methods.
We propose a path language that allows the description, in an abstract way, of concentration level variations from temporal data such as temporal profiles of gene expression. Once concentration level variations have been expressed as a program of the path language, it becomes possible to apply methods from computer science, such as Hoare logic.
Hoare logic is a system composed of axioms and rules. It permits one to prove whether a program is correct with respect to its specification, which is described through assertions, that is, logical formulas defining properties of the program. The precondition specifies the initial state before the execution of the program, and the postcondition specifies the final state after its execution. A program is said to be (partially) correct if it can be proved that, from an initial state satisfying the precondition, the program leads (if it terminates) to a final state satisfying the postcondition.
In modelling gene regulatory networks, the main difficulty remains the parameter identification problem, that is, the search for parameter values that lead to a model behaviour consistent with the known or hypothesised properties of the system. We therefore apply a Hoare-like logic designed for the defined path language. Its axioms and rules are adapted to gene networks and permit us to prove that the path described by the program exists in the dynamics. Given a path program and a postcondition, we can apply the calculus of the weakest precondition based on this Hoare-like logic; thanks to the defined axioms and rules, this calculus constrains the parameters associated with the program and the postcondition. Although Hoare logic is well known, its application to constraining the parameter values of gene networks appears to be new and helpful for selecting consistent models. Moreover, expressing DNA chip profiles as programs offers another way to process such data.
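As a toy illustration of the backward weakest-precondition calculus on a path language: the real logic handles richer assertions and network parameters, so everything below, including the move encoding, is a simplified assumption.

```python
# Toy weakest-precondition sketch. A path is a sequence of atomic moves
# ("gene rises", "gene falls") changing one gene's discrete level by one.
# The wp rule for a move rewrites a postcondition on the gene's level
# into the condition that must hold just before the move.

def wp_step(post, move):
    """post maps gene -> required level; move is (gene, +1) or (gene, -1)."""
    gene, delta = move
    pre = dict(post)
    if gene in pre:
        pre[gene] = pre[gene] - delta  # undo the move's effect on the level
    return pre

def wp_path(post, path):
    """Propagate the postcondition backwards through the whole path."""
    pre = post
    for move in reversed(path):
        pre = wp_step(pre, move)
    return pre

# Path: gene a rises twice, gene b falls once; postcondition: a == 2 and b == 0.
path = [("a", +1), ("a", +1), ("b", -1)]
print(wp_path({"a": 2, "b": 0}, path))  # {'a': 0, 'b': 1}
```

The computed precondition says the path is realisable exactly from states where a starts at 0 and b at 1; in the full method, analogous constraints bear on the network's dynamical parameters.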
Digital gene expression data, cross-species conservation and noncoding RNA
Abstract: Recently developed sequencing technologies offer massively parallel production of short reads and have become the technology of choice for a variety of sequencing-based assays, including gene expression. Among them, digital gene expression (DGE) analysis, which combines the generation of short tag signatures for cellular transcripts with massively parallel sequencing, offers a large dynamic range for transcript detection and is limited only by sequencing depth. As recently described (Philippe et al., 2009), tag signatures can easily be mapped to a reference genome and used to perform gene discovery. This procedure distinguishes between transcripts originating from either DNA strand and categorizes tags as corresponding to protein-coding genes (CDS and 3'UTR), antisense, intronic or intergenic transcribed regions. Here, we have applied an integrated bioinformatics approach to investigate tags' properties, including cross-species conservation, and their ability to reveal novel transcripts located outside the boundaries of known protein or RNA coding genes. We mapped the tags from a human DGE library obtained with Solexa sequencing against the human, chimpanzee, and mouse genomic sequences. We considered the subset of tags uniquely mapped in the human genome and, given their genomic location, determined according to Ensembl whether they fall within a region annotated as a gene (CDS, UTR and intron) or an intergenic region. We found that 76.4% of the tags located in human also matched the chimpanzee genome. The level of conservation between human and chimpanzee varied among annotation categories: 85% of tags were conserved in CDSs, 81% in UTRs, and 76% and 73% in intronic and intergenic regions, respectively. With the same procedure applied to human and mouse, we obtained 11% of conserved tags in CDSs, 7% in UTRs, and 1% and 3% in intronic and intergenic regions, respectively.
We analysed in depth the CDS and UTR tags common to human and mouse for their functional relevance: 90% of them correspond to orthologous genes with a common HUGO symbol. We used the DAVID database to extract biological features; the gene clustering revealed specific molecular functions related to transcription cofactor and regulator activity, nucleotide binding, ligase and protein kinase activity, hormone receptors, histone methyltransferase or GTPase activity, as well as important signaling pathways such as the WNT pathway. Indeed, intergenic transcription mainly comprises new, non-protein-coding RNAs (npcRNAs), which could represent an important class of regulatory molecules. By also integrating SAGE Genie and RNA-seq expression data, we selected intergenic tags conserved across species and experimentally assayed the npcRNA transcriptome by Q-PCR. We validated 80% of the 32 tested biological cases. These results demonstrate that considering tag conservation helps to identify conserved genes and functions, which is of great relevance when investigating expressed tags located in intergenic regions.
BYKdb: a database of bacterial tyrosine kinases
Abstract: In bacteria, four types of protein phosphorylation systems are known. The most recently described system uses a special type of tyrosine kinases, the BY-kinases, found only in bacteria. This system turns out to represent a promising regulatory device of bacterial physiology. BY-kinases comprise two domains: a two-pass transmembrane activator domain (TAD) with a large extracellular part, partially matched by the Pfam profile PF02706, and an intracellular catalytic domain (CD). These two domains can belong to the same protein or to two distinct proteins encoded by two adjacent genes. In a previous work, we developed a pipeline and defined a family signature that identifies BY-kinases with high specificity. Using this pipeline, 1,318 sequences were found, including 90 sequences found only by our pipeline. Here we describe BYKdb, a BY-kinase database that contains a collection of computer-annotated sequences. The database can be consulted through a web interface at http://bykdb.ibcp.fr/, and the sequences can be extracted and analysed using the NPS@ server tools.
HBVdb: A knowledge database for the Hepatitis B Virus
Abstract: The hepatitis B virus (HBV) is a major health problem worldwide, with 350 million people being chronic carriers. Chronic infection is correlated with a strongly increased risk of developing severe liver diseases, including liver cirrhosis and hepatocellular carcinoma (HCC). A database allowing researchers to investigate HBV genomic characteristics, sequence variability and resistance to treatment is thus an essential tool. We present HBVdb, a Hepatitis B Virus database that contains a collection of computer-annotated sequences, based on manually annotated reference genomes and integrated with analysis tools. The automatic annotation procedure ensures a standardized nomenclature for all HBV entries across the database and builds a description of the HBV genomic regions and proteins included in each entry. Biologists can extract information from HBVdb thanks to our query system, QueBio, and sequences can be analyzed with the NPS@ web server (Network Protein Sequence Analysis). HBVdb is accessible at http://hbvdb.ibcp.fr and contains 32,178 entries, of which 3,452 are complete genomes.
Sequence analysis of the proteins involved in CaCO3 biomineralization
Abstract: Biomineralization is the process by which living cells and organisms produce mineralized structures. Biomineralizations consist of organo-mineral assemblages in which the dominant mineral is closely associated with a minor organic matrix. The organic matrix is essential for mineral formation, and proteins are the key components of this process. In metazoans, the protein sequences associated with calcium carbonate biomineralization show particular characteristics, since they seem to be organized in distinct modules. Some of these modules are of low complexity or seem to be specific to each organism, suggesting that they have evolved independently multiple times. However, calcium carbonate proteins also show some remarkable sequence similarities and conserved domains in distantly related organisms, which suggests a Precambrian origin of the genes. Protein-mineral interactions are still at an early stage of comprehension, and the elucidation of this process relies on a large-scale analysis of the numerous protein sequences.
Our work consists of developing an automated pipeline for the identification and analysis of the protein sequences associated with calcium carbonate mineralization in metazoans. Our pipeline was able to group these proteins by homologous regions. Preliminary analysis of the results indicates that known motifs involved in the biomineralization process are identified by the pipeline. A comprehensive analysis of the results will allow the characterization of new motifs and domains and the establishment of new functional relations. These data will be integrated into a specialized database, CaBiominDB, available through a web interface.
Extension of SEGM web server to stochastic evolution of tetranucleotides and pentanucleotides
Abstract: An extension of the SEGM web server (Stochastic Evolution of Genetic Motifs) allows the study of the stochastic evolution of tetranucleotides and pentanucleotides. This new web application thus allows the computation of the occurrence probabilities of genetic motifs of size 1 to 5 (nucleotides, dinucleotides, trinucleotides, tetranucleotides, pentanucleotides) at evolution time t, as a function of the motifs' initial occurrence probabilities and of substitution parameters.
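As a hedged illustration of the kind of computation involved: SEGM's actual substitution model and parameterization may differ; the Jukes-Cantor matrix and the site-independence assumption below are ours.

```python
# Evolve per-site nucleotide probabilities under a Jukes-Cantor substitution
# model, then multiply over sites to get a motif's occurrence probability at
# time t (assuming independent sites -- an illustrative simplification).
import math

def jc_transition(t, mu=1.0):
    """Jukes-Cantor P(t): probability that base i becomes j after time t,
    keyed by whether i == j."""
    same = 0.25 + 0.75 * math.exp(-4 * mu * t / 3)
    diff = 0.25 - 0.25 * math.exp(-4 * mu * t / 3)
    return {True: same, False: diff}

def motif_probability(motif, t, start_probs):
    """start_probs: dict base -> initial occurrence probability per site."""
    p_t = jc_transition(t)
    prob = 1.0
    for target in motif:
        # Probability that this site shows `target` at time t.
        site = sum(start_probs[b] * p_t[b == target] for b in "ACGT")
        prob *= site
    return prob

uniform = {b: 0.25 for b in "ACGT"}
# A uniform start stays uniform at any t, so P(ACGTA) = 0.25**5:
print(round(motif_probability("ACGTA", 2.0, uniform), 6))  # 0.000977
```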
"GenoVA", a new approach to assess intraspecies genetic variability in complex genomic mixes.
Abstract: In natural ecosystems, such as the gut, each microbial species comprises different populations whose genetic backgrounds vary around a common theme. Assessing the genetic diversity within each species is therefore important for improving our studies and for understanding phenomena such as dysbiosis and gut ecosystem functioning. High-throughput sequencing technologies such as SOLiD or Solexa can provide amounts of data from an ecosystem that allow an assessment of intraspecies population diversity, and therefore a first insight into this field. We developed a new software tool, "GenoVA" (Genome Variability Analyzer), enabling the analysis of intraspecies genetic diversity from complex genomic mixes sequenced with SOLiD. In a first step, "GenoVA" proposes a definition of the species core genome based on the mean coverage count of each gene compared to a reference genome. Second, it identifies and quantifies nucleotide polymorphism using information theory. This two-step approach was validated by the study of a biological mix sample of 132 Streptococcus thermophilus genomes.
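A minimal sketch of the information-theoretic second step, assuming Shannon entropy as the polymorphism measure (the exact measure used by GenoVA is not detailed in the abstract, and the base counts below are invented):

```python
# Quantify nucleotide polymorphism at one genomic position from the bases
# observed in the aligned reads, using Shannon entropy:
# 0 bits = monomorphic position, up to 2 bits = all four bases equally mixed.
import math
from collections import Counter

def position_entropy(bases):
    """Shannon entropy (bits) of the base distribution at one position."""
    counts = Counter(bases)
    n = len(bases)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

print(position_entropy("AAAAAAAA"))            # 0.0 -> no polymorphism
print(round(position_entropy("AAAATTTT"), 2))  # 1.0 -> two alleles at 50/50
```

Applied along a reference genome, positions with entropy well above the sequencing-error background would flag intraspecies polymorphism.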