DATA SUBMISSION, APPROPRIATE NOMENCLATURE, AND ADDITIONAL RESOURCES

RNA assumes that authors will act in good faith to ensure that all data and supporting data sets from a publication are made available to the broader community from the date of publication. This should be done through public databases, when available, or through the RNA website. When using public databases, the entry name/ID or accession number must be cited in the Materials and Methods section of the paper.

The following list of public databases and resources serves as an introductory guide to data submission and appropriate nomenclature for authors contributing to RNA. This list should not be considered comprehensive, however. If there is an additional database or resource not listed here that would be of use to authors, please contact rnajournal{at}case.edu.


STRUCTURAL DATA

Authors of papers describing biological structures should deposit all related data (atomic coordinates, structure factor amplitudes/intensities, and/or NMR restraints) at a member site of the Worldwide Protein Data Bank (see list below). Electron microscopy-derived density maps must be deposited into the EMDB through one of the partner sites (http://www.ebi.ac.uk/msd-srv/docs/emdb/ or https://www.ebi.ac.uk/emdb/). For NMR structures, deposited data should include resonance assignments, all restraints used in structure determination, and the derived atomic coordinates. Details on deposition IDs and data sets should be included in the Materials and Methods section of the paper.

RCSB PDB (www.pdb.org)
PDBe (www.ebi.ac.uk/pdbe)
PDBj (www.pdbj.org)
BMRB (www.bmrb.wisc.edu)
Questions relating to depositions should be sent to deposit{at}wwpdb.org.


SEQUENCE DATA

All new sequence data should be submitted to, and assigned an accession number(s) by, a member of the International Nucleotide Sequence Database Collaboration (GenBank, EMBL-Bank, or DDBJ) prior to publication.

GenBank, the NIH genetic sequence database, is an annotated collection of all publicly available DNA sequences. Instructions for sequence data submission.

The EMBL Nucleotide Sequence Database (EMBL-Bank) obtains DNA and RNA sequences from direct submissions by individual researchers, genome sequencing projects and patent applications. Instructions for sequence data submission.

The DNA Data Bank of Japan (DDBJ) collects DNA sequences from researchers and issues internationally recognized accession numbers to data submitters. Instructions for sequence data submission.

The NCBI Sequence Read Archive (SRA), in collaboration with Ensembl, archives short read data from next-generation sequencing technologies (e.g., 454 Life Sciences [Roche], Illumina, ABI SOLiD, Helicos). Instructions for data submission.

dbEST, a division of GenBank, contains sequence data and other information on "single-pass" cDNA sequences, or Expressed Sequence Tags (ESTs), from a number of organisms. Instructions for data submission.

miRBase collects microRNA (miRNA) data, containing all published miRNA sequences, genomic locations and associated annotation. The miRBase Registry section provides a confidential service assigning official names for novel miRNA genes prior to publication of their discovery.


MICROARRAY AND CHIP-CHIP, CHIP-SEQ DATA

The Gene Expression Omnibus (GEO) is a gene expression/molecular abundance repository and curated resource supporting MIAME-compliant data submissions, including microarray-based experiments that measure gene expression, detect genomic gains and losses (arrayCGH), detect SNPs, identify protein-binding genomic regions (ChIP-chip, ChIP-seq), or locate transcribed regions. GEO also accepts non-array-based high-throughput data, including SAGE, MPSS, and some peptide profiling techniques such as MS/MS. Instructions for data submission.

ArrayExpress is a repository for MIAME-compliant microarray data available for browsing and querying. The ArrayExpress Data Warehouse stores gene-indexed expression profiles from a curated subset of experiments in the repository. Instructions for data submission.



GENOTYPE/PHENOTYPE AND GENOMIC VARIATION DATA

As the study of structural variation in the genome (e.g., indels, duplications, copy number variations, inversions, and translocations) has outpaced the development of standards for the collection of data, authors are currently advised to review the structural variation data guidelines recommended by Scherer et al., (2007) Nat Genet. 39 (7 Suppl):S7-15. Sequence variations and small indels up to 10,000 bp are typically submitted to dbSNP. Nomenclature for the description of human gene variants is available from the Human Genome Variation Society.

The NCBI Database of Single Nucleotide Polymorphisms (dbSNP) includes data on genetic variation such as single nucleotide polymorphisms (SNPs), small-scale insertion/deletions, polymorphic repetitive elements, and microsatellite variation in humans and other organisms. Instructions for SNP data submission.

The NCBI Database of Genotype and Phenotype (dbGaP) archives and distributes the results of studies investigating the interaction of genotype and phenotype, including genome-wide association studies, medical sequencing, molecular diagnostic assays, as well as association between genotype and non-clinical traits. Instructions for data submission.

The EMBL-EBI European Genome-phenome Archive (EGA) is a repository for genotype experiments, including case control, population, and family studies. SNP and CNV genotypes from array-based methods and genotyping done with re-sequencing methods are accepted. Data may be either publicly available or require authorized access, depending on the study design. Instructions for data submission.

The EMBL-EBI Database of Genomic Variants archive (DGVa) is a public catalog of the large-scale insertions, deletions, duplications, and rearrangements found in the genomes of individuals within a species. Instructions for data submission.

The NCBI Database of Genomic Structural Variation (dbVar) is a database of large-scale genomic variants, including events such as insertions, deletions, and inversions. Instructions for data submission.

The Database of Genomic Variants (DGV) provides a comprehensive summary of structural variation in the human genome and serves as a catalog of control data for studies aiming to correlate genomic variation with phenotypic data. The DGV presents detailed information on a few selected studies, while databases such as DGVa and dbVar provide a comprehensive archive of publicly available structural variation data.

The Human Structural Variation Database catalogues human genomic polymorphisms ascertained by experimental and computational analyses, including large-scale structural variation (LSV), copy number polymorphisms (CNPs) and intermediate-sized structural variation (ISV).


PROTEOMICS AND MOLECULAR INTERACTIONS

The International Molecular Exchange Consortium (IMEx), a group of major public interaction data providers, has established standards for the collection and curation of molecular interaction data. The IMEx site provides instructions for submitting interaction data to any of the partner databases (DIP, IntAct, HPRD, MINT, MPact, BioGRID, BOND).

The Proteome Commons Tranche repository is a distributed file storage system to upload and download proteomic data sets.

The Database of Interacting Proteins (DIP) catalogs experimentally determined protein interactions from a variety of sources to create a single set of protein-protein interactions.

IntAct is an open-source database system and analysis tool for freely available protein interaction data derived from literature curation or direct user submissions.

The Protein Data Bank (PDB) contains information about experimentally-determined structures of proteins, nucleic acids, and complex assemblies, curating and annotating data according to community standards.


GENE AND GENE PRODUCT NOMENCLATURE

Nomenclature for genes and proteins should be in the appropriate format (including appropriate italics and/or capitalization as it applies for each organism's standard nomenclature format) in text and figures, and where available, submitted and approved by the appropriate nomenclature committees. Specific nomenclature guidelines for commonly studied organisms are listed below.

Human nomenclature guidelines from the Human Genome Organisation (HUGO) Gene Nomenclature Committee. Search for current and approved gene names/symbols.

Chicken nomenclature guidelines from the Poultry Species Committee of the National Animal Genome Research Program (NAGRP).

Rat nomenclature guidelines from the Rat Genome Nomenclature Committee (RGNC). Search for current and approved gene names/symbols.

Mouse nomenclature guidelines from the Mouse Genomic Nomenclature Committee (MGNC). Search for current and approved gene names/symbols.

Zebrafish nomenclature guidelines from the Zebrafish Nomenclature Committee (ZNC). Search for current and approved gene names/symbols.

Drosophila nomenclature guidelines adopted by FlyBase. Search for current and approved gene names/symbols.

Arabidopsis nomenclature guidelines adopted by The Arabidopsis Information Resource (TAIR). Search for current and approved gene names/symbols.

C. elegans nomenclature guidelines from WormBase and the Caenorhabditis Genetics Center (CGC). Search for current and approved gene names/symbols.

S. cerevisiae nomenclature guidelines adopted by the Saccharomyces Genome Database (SGD). Search for current and approved gene names/symbols.

Bacterial nomenclature should follow the guidelines established by Demerec et al., (1966) Genetics 54:61-76.


ADDITIONAL RESOURCES

The Gene Ontology (GO) project provides a controlled vocabulary to describe gene and gene product attributes in any organism.

The ENCyclopedia Of DNA Elements (ENCODE) project aims to identify all functional elements in the sequence of the human genome. The recently completed pilot project phase tested and compared existing methods to rigorously analyze a defined portion of the human genome sequence.

The modENCODE Project will attempt to identify all of the sequence-based functional elements in the Caenorhabditis elegans and Drosophila melanogaster genomes. modENCODE is operated as a Research Network and data is publicly available, with some restrictions on its use for nine months following publication.

The 1000 Genomes Project aims to find most genetic variants that have frequencies of at least 1% in the populations studied. Data from the 1000 Genomes Project will be made available rapidly to the scientific community through freely accessible public databases.

The Cancer Genome Atlas (TCGA) is a comprehensive effort to accelerate the understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing.

The International HapMap Project, a partnership of scientists and funding agencies from Canada, China, Japan, Nigeria, the United Kingdom, and the United States, developed a haplotype map resource to describe the common patterns of human DNA sequence variation to help researchers find genes associated with human disease and response to pharmaceuticals.

SeattleSNPs focuses on identifying, genotyping, and modeling the associations between SNPs in candidate genes and pathways underlying the human inflammatory response.

The H-Invitational Database (H-InvDB) is an integrated database of human genes and transcripts, containing curated annotations of human genes and transcripts that include gene structures, alternative splicing isoforms, non-coding functional RNAs, genetic polymorphisms (SNPs, indels and microsatellite repeats), relation with diseases, gene expression profiling, molecular evolutionary features, protein-protein interactions (PPIs) and gene families/groups.

The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a database of biological systems, consisting of genes and proteins, endogenous and exogenous chemicals, interaction and reaction networks, and hierarchies and relationships of various biological objects.

The Human Protein Reference Database (HPRD) is a centralized platform to depict and integrate information manually extracted from the literature regarding domain architecture, post-translational modifications, interaction networks, and disease association for each protein in the human proteome.

The Biomolecular Object Network Databank (BOND), formerly BIND, combines sequence, interaction, and related interactome data and content, containing GenBank and BIND data, as well as related tools and information.

The Reactome project is a curated resource of core pathways and reactions in human biology, as well as electronically inferred orthologous events in 22 non-human species including mouse, rat, chicken, puffer fish, C. elegans, Drosophila, yeast, two plants, and E. coli.

The Clusters of Orthologous Groups (COGs) resource was constructed by comparing protein sequences encoded in complete genomes, representing major phylogenetic lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain.

Online Mendelian Inheritance in Man (OMIM), a phenotypic companion to the human genome project, is a catalog of human genes and genetic disorders, focusing primarily on heritable genetic diseases.

Pseudogene.org provides a comprehensive database of identified pseudogenes, utilities to identify pseudogenes, various publication data sets, and a pseudogene knowledgebase.

Repbase Update (RU) is a database of prototypic sequences representing repetitive DNA from a number of eukaryotic species, with instructions for the submission of sequence data.

