Introduction
nuORFdb is a database of novel or unannotated open reading frames (nuORFs) with evidence of translation detected by ribosome profiling (Riboseq)
from 29 primary healthy and cancer samples as well as cell lines. nuORFdb was created to be a resource for identification of peptides
in immunopeptidomic mass spectrometry datasets.
The Riboseq data from all samples was combined via our hierarchical ORF prediction pipeline,
where ORFs were predicted at multiple nodes, consisting of each sample (leaf), tissue (clade) and across all samples combined (root). This
approach aggregated signal across our Riboseq dataset to predict lowly translated ORFs, while maintaining sensitivity for tissue-specific
nuORFs and for overlapping ORFs translated across tissues.
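The leaf/clade/root idea can be illustrated with a toy sketch. This is not the nuORFdb pipeline: the actual predictions were made with RibORF on pooled Riboseq reads, whereas the sketch below pools hypothetical per-ORF read counts and applies a made-up threshold (`min_reads`). The function names, sample names, ORF names and counts are all assumptions chosen for illustration.

```python
from collections import defaultdict


def build_nodes(sample_counts, sample_tissue):
    """Pool per-ORF read counts into leaf, clade and root nodes.

    sample_counts: {sample: {orf_id: read_count}}   (hypothetical input)
    sample_tissue: {sample: tissue}                 (hypothetical input)
    """
    nodes = {}
    clades = defaultdict(lambda: defaultdict(int))
    root = defaultdict(int)
    for sample, counts in sample_counts.items():
        nodes[("leaf", sample)] = dict(counts)      # one node per sample
        tissue = sample_tissue[sample]
        for orf, n in counts.items():
            clades[tissue][orf] += n                # pooled per tissue
            root[orf] += n                          # pooled over everything
    for tissue, counts in clades.items():
        nodes[("clade", tissue)] = dict(counts)
    nodes[("root", "all")] = dict(root)
    return nodes


def call_orfs(nodes, min_reads=10):
    """Toy stand-in for ORF prediction: list the nodes where an ORF
    reaches an arbitrary read-count threshold."""
    called = defaultdict(list)
    for node, counts in nodes.items():
        for orf, n in counts.items():
            if n >= min_reads:
                called[orf].append(node)
    return dict(called)


# Made-up counts: orfA is weak in every sample, orfB is strong in one sample.
sample_counts = {
    "MEL_a": {"orfA": 4, "orfB": 12},
    "MEL_b": {"orfA": 3},
    "CLL_a": {"orfA": 4},
}
sample_tissue = {"MEL_a": "MEL", "MEL_b": "MEL", "CLL_a": "CLL"}

called = call_orfs(build_nodes(sample_counts, sample_tissue))
for orf, where in sorted(called.items()):
    print(orf, where)
```

In this toy input, `orfA` only reaches the threshold at the root, where signal from all samples is pooled, while `orfB` is already detectable at the leaf of the single sample that expresses it.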
nuORFdb v1.2 - retains only ORFs with a minimum length of 8 AA (229,251 ORFs; 7,292 removed) June 2023
nuORFdb v1.1 - hg38 liftover (236,543 ORFs; 884 removed) June 2023
235,851 nuORFs passed liftover from hg19 to hg38 to yield the exact same protein sequences.
692 additional passing nuORFs were rescued as they have the same protein length and a limited number of SNV/SAAVs.
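The two passing categories can be expressed as a simple comparison of the hg19 and hg38 protein sequences. The sketch below is only an illustration: the source does not state how many SNV/SAAVs count as "limited", so `max_saavs` is a hypothetical parameter, and the function name and return labels are invented here.

```python
def classify_liftover(protein_hg19, protein_hg38, max_saavs=2):
    """Classify one ORF after liftover by comparing protein sequences.

    max_saavs is a hypothetical cutoff; the actual limit used for
    nuORFdb v1.1 is not given in the text.
    """
    if protein_hg19 == protein_hg38:
        return "passed_identical"            # exact same protein sequence
    if len(protein_hg19) != len(protein_hg38):
        return "failed_length_differs"
    # Same length: count single amino acid variants position by position.
    saavs = sum(a != b for a, b in zip(protein_hg19, protein_hg38))
    if saavs <= max_saavs:
        return "passed_rescued"              # same length, few SAAVs
    return "failed_excess_saavs"


print(classify_liftover("MKTLLA", "MKTLLA"))   # identical sequences
print(classify_liftover("MKTLLA", "MKTVLA"))   # one substitution
print(classify_liftover("MKTLLA", "MKT"))      # truncated protein
```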
nuORFdb ORF_IDs combine a Gencode v26lift37 transcript ID with genome coordinates. In v1.0, ORF_IDs contain hg19 genome coordinates; in v1.1, they contain hg38 genome coordinates.
The Annotations table contains ORF_ID_hg19 to ORF_ID_hg38 cross-references. An update of the transcript IDs and ORF biotypes is intended soon.
The 884 nuORFs failing liftover failed for a variety of reasons:
- 112 ORFs derived from hg19 unlocalized chromosomes chrGL* were excluded because their inclusion caused the liftover to crash. https://genome.ucsc.edu/cgi-bin/hgLiftOver
- 140 ORFs failed to lift, yielding liftover errors of the form "Partially deleted in new" or "Boundary problem"
- 33 ORFs were on standard hg19 chromosomes but on hg38 unlocalized chrKI* chromosomes that were not available in the local reference genome used for sequence retrieval
- 151 ORFs had aberrant block sizes of 0 for certain exons in the v1.0 .bed file, which yielded transcripts of very different length in hg38
- 162 ORFs had SNVs which introduced a stop codon and yielded proteins of very different length in hg38
- 19 ORFs had the same protein length but excess SAAVs
- 7 ORFs had the same transcript length but different protein length by +/- 1 AA in hg38
- 260 ORFs yielded proteins of very different length in hg38 for undetermined reasons
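As a quick consistency check, the counts quoted in these release notes can be reconciled with a few lines of arithmetic. All numbers are taken from this page; only the dictionary keys are shorthand labels introduced here.

```python
# Liftover failure counts from the list above (v1.0 -> v1.1).
liftover_failures = {
    "hg19 chrGL* unlocalized chromosomes": 112,
    "liftover errors": 140,
    "hg38 chrKI* not in local reference": 33,
    "aberrant .bed block sizes of 0": 151,
    "SNV-introduced stop codon": 162,
    "same length, excess SAAVs": 19,
    "protein length differs by +/- 1 AA": 7,
    "undetermined reason": 260,
}

v1_0 = 237_427                 # hg19 release
passed_exact = 235_851         # identical protein after liftover
rescued = 692                  # same length, limited SNV/SAAVs
removed_short = 7_292          # removed in v1.2

failed = sum(liftover_failures.values())
v1_1 = passed_exact + rescued
v1_2 = v1_1 - removed_short

assert failed == 884
assert v1_0 - failed == v1_1 == 236_543
assert v1_2 == 229_251
print(failed, v1_1, v1_2)
```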
nuORFdb v1.0 - hg19 (237,427 ORFs) December 2019
Publications
- Ouspenskaia T, Law T, Clauser KR, Klaeger S, Sarkizova S, Aguet F, Li B, Christian E, Knisbacher BA, Le PM, Hartigan CR, Keshishian H, Apffel A, Oliveira G, Zhang W, Chen S, Chow YT, Ji Z, Jungreis I, Shukla SA, Justesen S, Bachireddy P, Kellis M, Getz G, Hacohen N, Keskin DB, Carr SA, Wu CJ, Regev A.
Unannotated proteins expand the MHC-I-restricted immunopeptidome in cancer.
Nature Biotechnology 40, 209 - 217 (2022)
PMID: 34663921.
Data Availability (Riboseq and RNAseq)
- NCBI GEO (GSE143263)
  - Raw Ribo-seq data (fastq.gz), offset-corrected BAM files used for translated ORF identification by RibORF and BigWig file generation, BigWig files for Ribo-seq data visualization in genome browsers, and Ribo-seq translation levels (TPM) for established cell lines (B721.221, A375 and HCT116) and for primary melanocytes (Thermo C0025C).
  - GTEx, TCGA, CLL and healthy B cell samples RNA-seq transcription quantification of transcript isoforms.
  - Ribo-seq translation levels (TPM) of primary GBM and melanoma samples.
- NCBI GEO (GSE131267)
  - B721.221 RNA-seq data for HLA-C (C*04:01, C*07:01).
- dbGAP: raw data pertaining to primary patient samples.
  - phs001998: CLL1-5 Ribo-seq and CLL4, CLL5 RNA-seq data.
  - phs001451.v1.p1: Ribo-seq data for MEL2, MEL11 and GBM7 and matching RNA-seq data for MEL11. Melanoma RNA-seq data.
  - phs001519.v1.p1: Glioblastoma bulk RNA-seq data.