------------------------------------------------------------------------------------------------------------------------
MT3.5v3 Genome Annotation Download
------------------------------------------------------------------------------------------------------------------------

* The MT3.5 genome sequences and assemblies are available from the JCVI FTP site.
* This page provides access to the latest IMGAG annotations and datasets based on the Medicago MT3.5 genome build.

FILES IN THIS RELEASE

Mt3.5_TIGRXMLs_20100825.tgz : TIGRXML files of Chr01 - Chr08 plus all extra BACs
Mt3.5v3_annotation_fasta_files.tar.gz : Mt3.5v3 annotation FASTA files
  -> Mt3.5_GenesProteinSeq_20100825.fa : Protein sequences of all gene calls
  -> Mt3.5_GenesCDSSeq_20100825.fa : CDS sequences of all gene calls
  -> Mt3.5_GenesTranscriptSeq_20100825.fa : Unspliced transcript sequence (incl. introns + UTR)
  -> Mt3.5_GenesSplicedSeq_20100825.fa : Spliced transcript sequence (CDS + UTR, excl. introns)
Mt3.5_genes_20100825_MIPSGFF.gff : MIPS GFF formatted file for GENES on Chr01 - Chr08 plus all extra BACs
Mt3.5_transposons_20100825_MIPSGFF.gff : MIPS GFF formatted file for TRANSPOSABLE ELEMENTS on Chr01 - Chr08 plus all extra BACs
Mt3.5_tRNA_20100825_MIPSGFF.gff : MIPS GFF formatted file for tRNA on Chr01 - Chr08 plus all extra BACs

RELATED FILES

Mt3.5_name_mapping_Genes_20100825.txt : MT3.5 gene names mapping file (mapping old/EUGENE identifiers to new/IMGAG identifiers)
Mt3.5_name_mapping_TEs_20100825.txt : MT3.5 TE names mapping file (mapping old/EUGENE identifiers to new/IMGAG identifiers)
__________________________________________________________________________

Guidelines for use of unique gene IDs for the Medicago MT3.5 release (as proposed by MIPS and TAIR for Arabidopsis)

* Format of chromosome-based nomenclature (these identifiers are assigned to all gene calls AND transposable elements on the pseudochromosome sequences; BAC-based identifiers are described later)
    o Medtr (Medicago truncatula,
      two-letter code such as 'MT' deprecated for GenBank submission)
    o 1,2,3,4,5,6,7,8 (chromosome number) or M for mitochondrial or C for chloroplast
    o 'g' for gene and 'te' for transposable elements
    o 123000 (six-digit code, numbered from top/north to bottom/south of chromosome)
    o .[digit] for each splice variant gene model in a locus. If there is only one model, this is '.1'.
    o The splice variant with the longest translated amino acid sequence gets '.1'. If two or more splice variants have the same aa sequence length, the one with the longest gene size gets '.1'.
* Chromosome-based locus identifiers are/will be assigned to
    o protein-coding genes
    o RNA-coding genes (snRNAs, rRNAs, tRNAs)
    o pseudogenes
    o transposable elements
* Usage
    o The first locus identifier release uses locus identifiers ending in zero, e.g. 010010, 010020, 010030 and so on, so that intervening numbers can be used for newly discovered genes.
    o Where there are gaps in the sequence, this release skipped a variable number of codes for each gap, depending on the gap size estimates based on optical map data.
    o At the beginning of a chromosome sequence, this release skipped 500 codes and started with e.g. Medtr1g005000.1.
* BAC sequence based locus identifiers
    o For the additional BAC sequences that could not be assigned or anchored to pseudochromosomes, BAC-based identifiers were used.
    o The identifiers are based on the accession of the BAC. Apart from that, the same rules apply as for chromosome-based identifiers.
    o Example: AC235009_60.1
    o Note that there is no information about the version of the BAC sequence in this identifier type. That information is in the IMGAG fasta line or in the MT3.5 sequence file.
___________________________________________________________________________

Important notes from MIPS made initially for Arabidopsis

Most people assume that if they sort the identifiers by ascending numbers they get a list of genes that represents the order along the chromosome.
This was true originally, but no longer: some BACs needed to be flipped, i.e. their orientation reversed, as new data on overlaps was generated. So all genes on these BACs now number the wrong way round. At MIPS, we decided it is more important to conserve the identifiers than the order, as the order can also be obtained by sorting on coordinates. Generally, the identifier still gives a good idea of the location on the chromosomes; only local reversals are expected.

Bottom line: when in doubt, the "be user-friendly" rule should be considered superior to the "be conservative" rule :-)

------------------------------------------------------------------------------------------------------------------------
Download file format documentation:
------------------------------------------------------------------------------------------------------------------------

The fasta line headers in all annotation multi-fasta files are formatted following this convention:

[>]
Unique Namespace: gene predictions obtained through the IMGAG pipeline are named 'IMGA'
[pipe]
Gene ID: see rules above for details
[space]
Free text description following FASTA conventions
[space]
Start-stop coordinates, where start is the first nucleotide of the translation start codon ATG and stop is the last nucleotide of the translation stop, e.g. TGA. Separated by a dash/minus (-), no padding zeros, no whitespace. Coordinate 1 is always the first nucleotide of the sequence as retrieved using the seqversion from EMBL/GENBANK/DDBJ (reversing the sequence to achieve forward orientation relative to the chromosome is not allowed). Gene calls on Crick/reverse/- strand have stop > start
[space]
Evidence: a single-letter code for the level of evidence that underlies the gene call. These codes are:
  -> F full coverage/FL-cDNA: The complete gene model from translation start to translation stop is covered by expressed Medicago sequence, e.g. FL-cDNA or EST alignments across the full length of the coding sequence.
  -> E expressed/EST matches: Expression of the gene is supported by Medicago EST sequence that matches the gene call (partially).
  -> H homology/heterologous: The gene call is supported by similarity to Medicago or other ESTs, protein, FL-cDNA, genomic or other sequences with partial or full-length alignments.
  -> I intrinsic/ab initio/inferred/hypothetical: The gene call is based only on intrinsic prediction tools such as FGENESH, Genscan or Eugene, and no significant alignments to other sequences are available.
  The classification is done top-down: any gene call that does not fall under F is tested for E; if it does not satisfy the requirements of E it is tested for H; and all gene calls that do not fulfill H are called I.
  -> L low probability: Very small gene calls with fewer than 100 aa, regardless of other evidence.
[space]
Method abbreviation that shows the method used to generate the gene call. See http://medicago.toulouse.inra.fr/imgag_egn/cgi-bin/egn_getinfo.cgi?page=EGN_Mt041209 for details.
[space]
Date on which the gene call was made or last modified, in yyyymmdd format
[newline]
Nucleotide/Protein Sequence
___________________________________________________________________________
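
As an illustration of the header convention and the identifier rules described above, the following Python sketch splits a header line into its fields and decomposes a gene ID. It is not part of the release and makes several assumptions: the function names (parse_header, parse_gene_id) are hypothetical, the description is taken to be everything between the gene ID and the last four space-separated fields, BAC accessions are assumed to be capital letters followed by digits (as in AC235009), and the splice variant suffix is treated as optional. The sample header in the usage note is made up for illustration and was not copied from the release files.

```python
import re
from typing import NamedTuple

# Evidence codes as listed in this document.
EVIDENCE_CODES = {
    "F": "full coverage/FL-cDNA",
    "E": "expressed/EST matches",
    "H": "homology/heterologous",
    "I": "intrinsic/ab initio",
    "L": "low probability",
}

# Chromosome-based ID: Medtr + chromosome (1-8, M, C) + g/te + six digits + .variant
CHR_ID = re.compile(
    r"^Medtr(?P<chrom>[1-8MC])(?P<kind>g|te)(?P<number>\d{6})(?:\.(?P<variant>\d+))?$"
)
# BAC-based ID, e.g. AC235009_60.1 (accession pattern is an assumption)
BAC_ID = re.compile(
    r"^(?P<accession>[A-Z]+\d+)_(?P<number>\d+)(?:\.(?P<variant>\d+))?$"
)


class Header(NamedTuple):
    namespace: str
    gene_id: str
    description: str
    start: int
    stop: int
    evidence: str
    method: str
    date: str


def parse_gene_id(gene_id: str) -> dict:
    """Decompose a chromosome-based or BAC-based identifier into its parts."""
    match = CHR_ID.match(gene_id)
    if match:
        return {"type": "chromosome", **match.groupdict()}
    match = BAC_ID.match(gene_id)
    if match:
        return {"type": "bac", **match.groupdict()}
    raise ValueError(f"unrecognized identifier: {gene_id!r}")


def parse_header(line: str) -> Header:
    """Split one FASTA header line into the fields described above."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    namespace, pipe, rest = line[1:].strip().partition("|")
    if not pipe:
        raise ValueError("missing '|' between namespace and gene ID")
    fields = rest.split()
    if len(fields) < 5:
        raise ValueError("expected gene ID plus coordinates, evidence, method, date")
    # The last four fields are fixed; everything in between is free text.
    coords, evidence, method, date = fields[-4:]
    description = " ".join(fields[1:-4])
    start_text, dash, stop_text = coords.partition("-")
    if not dash:
        raise ValueError(f"malformed coordinates: {coords!r}")
    if evidence not in EVIDENCE_CODES:
        raise ValueError(f"unknown evidence code: {evidence!r}")
    return Header(namespace, fields[0], description,
                  int(start_text), int(stop_text), evidence, method, date)
```

A possible use, with a made-up header line in which the method token METHOD is a placeholder:

```python
header = parse_header(">IMGA|Medtr1g005000.1 hypothetical protein 1200-3400 I METHOD 20100825")
print(header.gene_id, header.start, header.stop, EVIDENCE_CODES[header.evidence])
print(parse_gene_id(header.gene_id))
```

Parsing from the right-hand end keeps the free text description intact even when it contains spaces. Because identifier order no longer guarantees chromosome order, sort on the parsed coordinates rather than on the identifier when position matters.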