Using Personalized Protein Sequence Databases in Spectrum Mill




Introduction

Support for personalized sequence databases in Spectrum Mill was motivated by proteomics studies of cancer patient cohorts in which the data are generated from multiplexed samples analyzed using iTRAQ and TMT isobaric tagging reagents. Such studies often contain samples from on the order of 100 individual patients, distributed across multiple TMT10 plexes, with each plex containing a single channel devoted to a common control sample that is itself a mixture of samples from all or most of the patients. Consequently, any acquired MS/MS spectrum could represent detection of a personalized sequence derived from any of the individuals in the study cohort. The sequence database used for searching MS/MS spectra should therefore contain the personalized sequences from all the individuals. Personalized protein sequence databases are supported in Spectrum Mill in two fundamental areas.
  1. SM Protein Databases tool - not only are .fasta formatted databases derived from multiple individuals that contribute to a study concatenated with the reference sequence database, but also summary tables are generated that enable tracking information about each proteogenomic feature (single amino acid variant, splice junction) and which individual they were detected in at the DNA/RNA level.
  2. SM Protein/Peptide Summary tool - after searches of MS/MS spectra against a personalized database, a special reporting mode (Protein - Prot Genom Site Comparison) will produce a table that aggregates all data for each proteogenomic feature in a single row, in a manner similar to the aggregation of data for each phosphosite in a Protein - Var Mod Site Comparison mode report. In the PG site table, the columns will be for directories/reporter ion channels, rows will be for PG sites, and cells will contain the quantitation of median TMT ratios for all PSMs associated with the row/column.
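
The row/column aggregation described in item 2 can be illustrated with a minimal sketch. It assumes PSM-level records supplied as (PG site id, channel, TMT ratio) tuples; the record layout, the function name pg_site_table, and the sample values are hypothetical and do not reflect Spectrum Mill's internal data structures or export format.

```python
from collections import defaultdict
from statistics import median

def pg_site_table(psms):
    """Group PSM ratios by (PG site, channel) and take the median of each group."""
    grouped = defaultdict(list)
    for site, channel, ratio in psms:
        grouped[(site, channel)].append(ratio)
    # Rows are PG sites, columns are channels, cells are median ratios.
    table = defaultdict(dict)
    for (site, channel), ratios in grouped.items():
        table[site][channel] = median(ratios)
    return {site: dict(cols) for site, cols in table.items()}

# Hypothetical PSM records: (PG site id, reporter ion channel, TMT ratio)
psms = [
    ("NP_000537_R273C_268_280", "126", 0.8),
    ("NP_000537_R273C_268_280", "126", 1.2),
    ("NP_000537_R273C_268_280", "126", 1.0),
    ("NP_000537_R273C_268_280", "127N", 2.0),
    ("NP_053733_483_477_485", "126", 0.5),
]
table = pg_site_table(psms)
```

Each PG site ends up as one row keyed by its id, with one cell per channel in which at least one PSM was observed.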

Sources and Formats of sequence databases

Single amino acid variants are typically detected in an individual's DNA from whole exome sequencing (WES) or whole genome sequencing (WGS). Somatic variants are annotated when sequence derived from a tumor differs from sequence derived from a corresponding germline sample, typically peripheral blood mononuclear cells (PBMCs). Germline variants are found when protein sequence derived from a germline sample differs from the sequence in a reference proteome database, typically Ensembl, RefSeq, or UniProt.

Splice junctions are typically detected in an individual's RNA after aligning RNA-seq reads to a reference transcriptome. An RNA-seq experiment is typically performed by sequencing the whole transcriptome of a complementary DNA (cDNA) library prepared from RNA isolated from a tissue sample.

There are several different software packages for generating personalized protein sequence databases based on WES/WGS and RNA-seq data. Spectrum Mill development of support for personalized sequence databases was initiated using databases produced by QUILTS.

Using personalized sequence databases generated by QUILTS

QUILTS v3 is supported. ( http://openslice.fenyolab.org/cgi-bin/pyquilts_cgi.pl )

QUILTS can produce multiple .fasta files per individual with each containing different types of personalized sequence.

Suffixes:

Notes: indels and nonsense mutations (stop codon introduced or removed) are in the frameshift file. Gene-fusions are in the -other.fasta file

Or a single .fasta file that is a combination of the above

Suffix:

Spectrum Mill is intended to be used with the _variant_proteome.fasta type files as input.

The prefix of a file typically corresponds to the individual patient/sample identifier used for each individual in the sample cohort.
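
As an illustration of this naming convention, the following minimal sketch recovers the patient/sample identifier from a filename by stripping the _variant_proteome.fasta suffix. The function name sample_id and the example filename are hypothetical; this is not part of Spectrum Mill or QUILTS.

```python
from pathlib import Path

SUFFIX = "_variant_proteome.fasta"

def sample_id(filename):
    """Return the filename prefix that precedes the _variant_proteome.fasta suffix."""
    name = Path(filename).name
    if not name.endswith(SUFFIX):
        raise ValueError(f"{name} does not end with {SUFFIX}")
    return name[: -len(SUFFIX)]

# Hypothetical file for one individual in the cohort.
patient = sample_id("seqdb/PGexperimentalcohort/PATIENT01_variant_proteome.fasta")
```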

Using personalized sequence databases from other sources

If you don't use QUILTS to generate your personalized sequence databases, then in order to use them with the Spectrum Mill infrastructure you should match the .fasta header formats described on the QUILTS site: pyQUILTS/OUTPUT FASTA HEADER KEY.txt. Also use a filename suffix of: _variant_proteome.fasta.


Preparing QUILTS-format sequence databases for MS/MS searches in Spectrum Mill

Use the Spectrum Mill Protein Database Utility - Concatenate FASTA files to perform 2 core functions:

Steps

  1. Place all the .fasta files for the proteomics experimental cohort in a subdirectory in the Spectrum Mill file system:
    1. seqdb/PGexperimentalcohort/*_variant_proteome.fasta
  2. Run the Concatenate FASTA files tool on the Spectrum Mill Protein Database Utilities page. Be sure to check the box for Make Proteo Genomic Summary Table and select the appropriate choice on the menus for Proteogenomic Database Source and Reference Database to include. Output will include a combined sequence .fasta file and 2 summary tables that will, after MS/MS searches, enable generation of protein/peptide summary reports where the proteogenomic (PG) site, variant or splice junction, is the primary organizing feature.

    In the following example from a breast cancer proteome study of 122 patients (Krug et al Cell 2020) using a reference proteome file:
    RefSeq.20160914_Human_ucsc_hg19_customProDBnr_mito_150contams_553smORFS.fasta
    The output included the 3 key files

  3. Run the Index new database utility on the resulting combined FASTA file.
     Note: If you later want to concatenate additional sequences of interest to the .fasta file, you must maintain the filename relationship between the 2 summary tables and the final sequence database, so that the SM Protein/Peptide Summary tool can locate the summary files when creating PG-site specific reports.
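
Before running the steps above, it can be useful to confirm that the cohort subdirectory contains the expected files. The following is a minimal pre-flight sketch, not a replacement for the Concatenate FASTA files tool. The function name check_cohort_dir, the patient identifiers, and the placeholder sequence headers are hypothetical; the headers are not in QUILTS format.

```python
import tempfile
from pathlib import Path

SUFFIX = "_variant_proteome.fasta"

def check_cohort_dir(cohort_dir):
    """Return {sample_id: number_of_fasta_entries} for each matching file."""
    counts = {}
    for path in sorted(Path(cohort_dir).glob("*" + SUFFIX)):
        sample = path.name[: -len(SUFFIX)]
        with path.open() as handle:
            # Each FASTA entry starts with a header line beginning with ">".
            n_entries = sum(1 for line in handle if line.startswith(">"))
        if n_entries == 0:
            raise ValueError(f"{path.name} contains no FASTA header lines")
        counts[sample] = n_entries
    return counts

# Demo on a throwaway directory with two hypothetical patient files.
with tempfile.TemporaryDirectory() as tmp:
    cohort = Path(tmp) / "PGexperimentalcohort"
    cohort.mkdir()
    (cohort / ("PATIENT01" + SUFFIX)).write_text(">seq1\nMKT\n>seq2\nMAA\n")
    (cohort / ("PATIENT02" + SUFFIX)).write_text(">seq1\nMKT\n")
    counts = check_cohort_dir(cohort)
```

The returned dictionary gives one entry per individual, which makes a missing or empty file easy to spot before concatenation.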

Generating Protein/Peptide Summary reports based on search results specific to detection of Variants and Spliceforms

Protein/Peptide Summary Modes:

Protein - Prot Genom Site Comparison
This mode will produce a table that aggregates all data for each proteogenomic feature in a single row, in a manner similar to the aggregation of data for each phosphosite in a Protein - Var Mod Site Comparison mode report. In the PG site table, the columns will be for directories/reporter ion channels, rows will be for PG sites, and cells will contain the quantitation of median TMT ratios for all PSMs associated with the row/column.
Peptide - Spectrum Match and Peptide - Distinct
Limit the rows displayed to only peptides containing PG features using the menu Filter by Proteogenomic Features.
Note: for all modes, use the menu Filter by Proteogenomic Features to select which subset to report. Typically this will involve choosing one of the following:

Check the Review Field checkbox Proteogenomic features to enable PG-feature specific columns (Variants and Spliceforms)

Supported Protein/Peptide Summary Report Modes:
Protein - Prot Genom Site Comparison
Peptide - Spectrum Match
Peptide - Distinct

Table below updated for SM v7.05
The most reliably current documentation can usually be found at: Report Column Descriptions (.xlsx)

Column header items specific to the Variants table:

id: ex: NP_000537_R273C_268_280. RefSeq identifier (NP_000537), Variant (R273C; see description below), and the amino acid start (268) and end (280) positions of the variant's representative peptide in the protein sequence.
G#: Number of tumors in all samples with the variant called as germline from DNA exome sequence (1-n, where n is # of samples).
S#: Number of tumors in all samples with the variant called as somatic from DNA exome sequence (1-n, where n is # of samples).
CC#: Number of tumors in the common control sample with DNA exome sequence evidence for the variant (1-c, where c is # of samples contributing to common control).
AS#: Number of tumors in all samples with DNA exome sequence evidence for the variant (1-n, where n is # of samples).
Variant: ex: R342P. The amino acid in the reference sequence (R), the position of the variant in the protein sequence (342), and the variant amino acid (P).

Column header items specific to the Splice Isoform table:

id: ex: NP_053733_483_477_485. RefSeq identifier (NP_053733), Frameshift Start AA (483) or Splice AA (see descriptions below), and the amino acid start (477) and end (485) positions of the isoform's representative peptide in the protein sequence.
CC#: Number of tumors in the common control sample with RNA-seq evidence for the proteogenomic feature (1-c, where c is # of samples contributing to common control).
AS#: Number of tumors in all samples with RNA-seq evidence for the proteogenomic feature (1-n, where n is # of samples).
Reads Max: Maximum number of RNA-seq reads supporting the proteogenomic feature in one of the n total tumors.
IF,FS: In-frame or frameshift status of the proteogenomic feature.
Frameshift Type: Primary splice types designated in QUILTS (full exon: A, partial exon: AN1, AN2, stop-introduced).
Frameshift Sub Type: More detailed types designated in Spectrum Mill (frameshift-truncation, frameshift-extension, frameshift-substitution, insertion, deletion, insertion-short3, deletion-short3, stop-truncation). Truncation, extension, and substitution refer to the change in length of the isoform sequence relative to the reference sequence: shorter, longer, or equal, respectively.
Strand: The strand for the reference gene (+/-).
Frameshift Start AA: The position of the first amino acid in the isoform protein sequence that differs from the reference sequence. If there are no differences, then the position of the first amino acid on the C-terminal side of the splice junction or the position of the last amino acid in the protein.
Frameshift End AA: The position of the last amino acid in the isoform protein sequence that differs from the reference sequence. If the length of differences is < 2, then the same as the Frameshift Start AA.
Splice AA: The position of the first amino acid on the C-terminal side of the splice junction for full exon, in-frame splice isoforms.
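
The id formats described above can be split into their components with a short sketch. It assumes that the last three underscore-separated fields are always the feature, start, and end, and that a variant is a single-residue substitution written as reference letter, position, variant letter; the function name parse_pg_id and the returned field names are hypothetical.

```python
import re

# Single amino acid substitution such as R273C (assumed form).
VARIANT_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")

def parse_pg_id(pg_id):
    """Split a PG id into accession, feature, and representative peptide range."""
    # Split from the right because RefSeq accessions themselves contain "_".
    accession, feature, start, end = pg_id.rsplit("_", 3)
    record = {"accession": accession, "start": int(start), "end": int(end)}
    match = VARIANT_RE.match(feature)
    if match:
        record.update(
            kind="variant",
            ref_aa=match.group(1),
            position=int(match.group(2)),
            alt_aa=match.group(3),
        )
    else:
        # Numeric feature: Frameshift Start AA or Splice AA position.
        record.update(kind="splice", position=int(feature))
    return record

variant = parse_pg_id("NP_000537_R273C_268_280")
isoform = parse_pg_id("NP_053733_483_477_485")
```

Splitting from the right keeps the accession intact regardless of how many underscores it contains.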