Methods

Datasets and Annotation

To investigate the potential functional roles of chimeras, all publicly reported chimeric RNAs (7,829 transcripts) were included in the analysis. In particular, the chimeric ESTs for human, mouse, fruit fly and yeast (200 transcripts each for human, mouse and fruit fly, and five transcripts for yeast) reported by Li et al. [1] were used together with all chimeric ESTs (6,178 sequences) and mRNAs (1,046 sequences) from ChimerDB [2]. All chimeric RNAs have well-defined junction sites (at least six nucleotides on each side of the junction). However, only a few chimeric sequences have canonical splice-junction sites [1]. For the dataset of Li et al. [1], the UCSC BLAT search was used to find sequence similarity between the chimeric RNA transcripts and human genomic regions, in order to annotate the genes participating in each chimera. Then, all aligned exons, introns or untranslated regions in the chimeras were identified using known transcript information from ENSEMBL. BLAST was applied to identify the corresponding protein domains for every exon of the chimeric mRNAs. Finally, WU BLAST was employed when short or "strange" genomic regions were found, in order to establish their identity more precisely, because WU BLAST was shown to be most efficient when the transcript composition is unknown (see "Supplementary").

Mapping the Chimeric Transcripts with Paired-End RNA-Seq Reads

To provide a large-scale mapping of the chimeric transcripts by paired-end RNA-seq reads, the following procedure was used. First, we mapped the RNA-seq reads to the human genome and to annotated exon junctions. Then we selected the reads that had not mapped at the previous stages and mapped them to the chimeric transcripts. Finally, we selected only the reads that mapped precisely on the junction of the chimera, with a minimum of 6 nucleotides (nt), or 5 nt for the short paired-end reads (50 nt), mapping on each side of the junction. This protocol is stringent, as it ensures that if a read maps both to a known transcript and to a chimeric transcript, it will be assigned to the known transcript. All mappings were performed using GEM [3], allowing a maximum of 3 mismatches. The same procedure was applied to chimeric transcripts from mouse and fruit fly [1].
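
The selection of junction-supporting reads described above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the function names, the tuple layout of an alignment, and the 0-based coordinate convention are assumptions made for the example, and treating every read of 50 nt or less as "short" is likewise an assumption. Mapping itself (performed with GEM in this study) is not reproduced here; the sketch starts from alignments that have already been computed.

```python
def spans_junction(read_start, read_length, junction_pos, min_overhang=None):
    """Check that a read covers the chimeric junction with enough overhang.

    read_start   -- 0-based start of the alignment on the chimeric transcript
    read_length  -- aligned read length in nt
    junction_pos -- 0-based index of the first nt of the second partner gene
    min_overhang -- required nt on each side; if None, 5 nt for short reads
                    (<= 50 nt, an assumed cutoff) and 6 nt otherwise
    """
    if min_overhang is None:
        min_overhang = 5 if read_length <= 50 else 6
    read_end = read_start + read_length      # exclusive end coordinate
    left = junction_pos - read_start         # nt upstream of the junction
    right = read_end - junction_pos          # nt downstream of the junction
    return left >= min_overhang and right >= min_overhang


def junction_supporting_reads(alignments, already_mapped, junctions):
    """Keep reads that support a chimeric junction.

    alignments     -- iterable of (read_id, chimera_id, start, length)
                      (hypothetical layout of chimera alignments)
    already_mapped -- set of read ids that mapped to the genome or to
                      annotated exon junctions; these are discarded first,
                      so known transcripts always take priority
    junctions      -- dict mapping chimera_id to its junction position
    """
    supported = []
    for read_id, chimera_id, start, length in alignments:
        if read_id in already_mapped:
            continue
        if spans_junction(start, length, junctions[chimera_id]):
            supported.append(read_id)
    return supported
```

Discarding reads that already mapped to the genome or to annotated junctions before inspecting the chimera alignments is what makes the selection conservative: a read with two possible origins never counts as evidence for a chimera.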

Visualization of Chimeras by SpliceGraphs

An additional feature of our CHIMERA DATABASE is the visualization of chimeric transcripts and their genomic context, including the junction site. These figures were produced using the SpliceGrapher package, which was designed for the analysis and visualization of RNA-Seq data [4]. The figures highlight the genes on either side of a chimeric junction, making it possible to visualize the potential transcripts that could arise from each chimera.

Identification of Chimeric Proteins by Mass-Spectrometry Experiments

To detect chimeras at the protein level, peptide mass spectra from human proteomics experiments were obtained from two publicly available proteomics databases. The GPM set consisted of 5,809 mzXML-format spectra files and the PeptideAtlas set of 52,019 mzXML-format spectra files. Unique peptides were identified by searching against the GENCODE annotation of the human genome [5]. The GENCODE annotation is not yet complete for all human genes; therefore we can only distinguish peptides that map to the GENCODE annotations. To evaluate the identified peptides statistically, the overall False Discovery Rate (FDR) was estimated. The target/decoy strategy accomplishes this by means of a random synthetic protein database (a decoy database) that preserves the general composition of the target database but does not overlap with it. Peptides matching the decoy database were used to estimate the FDR, since they do not correspond to real peptides. A decoy database was produced for the 62,943 unique transcripts from the 22,027 unique genes of GENCODE. The threshold sensitivity (the fraction of true-positive identifications, together with the E-value) was used to estimate the significance of the unique peptides found. Finally, chimeric transcripts whose junction site was confirmed by one or more peptides with a combined E-value of less than 10^-4 were considered true positives.
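
The target/decoy idea can be illustrated with a short sketch. It rests on assumptions that go beyond the text: residue shuffling is only one common way to build a decoy that preserves composition (the construction used for the GENCODE decoy database is not specified here), the FDR is approximated as the ratio of decoy to target matches passing an E-value threshold, and all function names are hypothetical.

```python
import random


def make_decoy(sequence, rng):
    """Build a decoy by shuffling residues (assumed construction).

    The decoy keeps the amino-acid composition of the target sequence
    but scrambles the order, so real peptides should not match it.
    """
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def estimate_fdr(target_evalues, decoy_evalues, threshold):
    """Estimate the FDR at a given E-value threshold.

    target_evalues -- E-values of peptide matches against the target database
    decoy_evalues  -- E-values of peptide matches against the decoy database
    threshold      -- matches with an E-value below this value are accepted

    Decoy matches are false by construction, so their count approximates
    the number of false matches among the accepted target matches.
    """
    accepted_targets = sum(1 for e in target_evalues if e < threshold)
    accepted_decoys = sum(1 for e in decoy_evalues if e < threshold)
    if accepted_targets == 0:
        return 0.0
    return accepted_decoys / accepted_targets


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the decoy is reproducible
    print(make_decoy("MKTAYIAKQR", rng))
    print(estimate_fdr([1e-6, 1e-5, 1e-3, 1e-7], [1e-5, 1e-2], 1e-4))
```

The simple ratio assumes that the target and decoy databases are of comparable size and are searched separately; other set-ups need a correspondingly adjusted estimator.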

  1. Short homologous sequences are strongly associated with the generation of chimeric RNAs in eukaryotes. Li et al.
  2. ChimerDB 2.0-a knowledgebase for fusion genes updated. Kim P et al.
  3. GEM Tool and Library. Guido Lab.
  4. SpliceGraphs. Rogers MF et al.
  5. Proteomics studies confirm the presence of alternative protein isoforms on a large scale. Tress ML et al. Genome Biology 2008.