Process Report


Introduction

When Process Report is run from the SM web page (SpectrumMill\millhtml\processReport.htm), a Perl script (SpectrumMill\millscript\processReport.pl) acts as a simple pipeline: it conveys the input parameters and executes the user-selected R scripts. All input and output files reside in the SM server's filesystem.

These SM report processing R scripts were architected with the following in mind:

  1. Scripts could be run in multiple ways.
  2. Individuals would modify a script for related uses.
  3. Scripts would be used in pipelines other than Spectrum Mill.
  4. Scripts would be run on either Windows or Unix.
  5. Project-specific features would be added.


Input variables shared across all scripts


parseSMreport.r

Parses .ssv files generated by Spectrum Mill Protein/Peptide Summary to selectively extract the highest information value columns.

Input Options

Output Options

The default output format is .txt. There is no UI option to select the output format; instead, it is controlled by the presence of a user-provided auxiliary file placed in the Data Directory before running the script.
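The file-presence rule can be sketched in a few lines of Python. This is an illustration only, not the logic of parseSMreport.r itself: it assumes that a file whose name ends in sample-annotation.csv (described later in this document) selects GCT output and that anything else leaves the .txt default in place. The function name choose_output_format is hypothetical.

```python
from pathlib import Path


def choose_output_format(data_dir):
    """Return 'gct' if a *sample-annotation.csv file is present, else 'txt'.

    Assumption: only the sample-annotation file switches the format;
    the real script may apply additional rules.
    """
    data_dir = Path(data_dir)
    # The annotation file may carry any user-chosen prefix; only the suffix is fixed.
    has_annotation = any(
        p.is_file() and p.name.endswith("sample-annotation.csv")
        for p in data_dir.iterdir()
    )
    return "gct" if has_annotation else "txt"
```

Calling choose_output_format on a Data Directory before launching the script gives a quick check of which output to expect.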


Create a reporter_sample_template.txt file (horizontal layout)

When the parseSMreport script is run to process a P/P Summary report, it can update the column headers in the report to include meta information, such as a more specific sample name for each reporter ion. These names are also propagated through to the ratio distribution plots and to the reports that follow normalization.

In order to provide the script with information on a sample name for each reporter ion you must create a tab-delimited text file in the SM directory that contains the initial report that is parsed. The text file must be named:
     reporter_sample_template.txt
And be organized like the following example:
This example is for an experiment involving 18 samples that were run in 2 separate TMT-10 plexes, each with 9 samples and a common control (a mix of all 18).
  1. line 1: reporter ion. The parser looks only for the reporter ion (126, 127N, etc.) as a substring of each cell's text. The A and B prefixes are optional and added only for user convenience.
  2. line 2: sample name, free text.
  3. line 3: directory name, i.e. the directory the data was in when the report was generated. This is necessary to match each name to its reporter ion when more than 1 directory contributed to the report.

Special considerations:

A126	A127N	A127C	A128N	A128C	A129N	A129C	A130N	A130C	A131	B126	B127N	B127C	B128N	B128C	B129N	B129C	B130N	B130C	B131
MD-8214C long survival, no chemo	MD-8226C long survival, unknown chemo	NP-8932N	PC-8592T	PI-8592N2	PI-8762N	PD-8832T long survival, palliative chemo	PD-8715 short survival, adj chemo	WD-8996C long survival, adj chemo	CC-all18	MD-7800T, long survival, unknown chemo	MD-1044C short survival, adj chemo	NP-8926N0308	PC-1254T	PI-8926C	PD-8216C long survival, adj chemo	PD-8216T short survival, adj chemo	WD-8721C long survival, adj chemo	WD-6563T long survival, unknown chemo	CC-all18
TMT10A_bRP	TMT10A_bRP	TMT10A_bRP	TMT10A_bR	TMT10A_bRP	TMT10A_bRP	TMT10A_bRP	TMT10A_bRP	TMT10A_bRP	TMT10A_bRP	TMT10B_bRP	TMT10B_bRP	TMT10B_bRP	TMT10B_bRP	TMT10B_bRP	TMT10B_bRP	TMT10B_bRP	TMT10B_bRP	TMT10B_bRP	TMT10B_bRP
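To make the three-line layout concrete, here is a minimal Python sketch of how such a template could be read. It is not the parser used by parseSMreport.r: the function name parse_template and the regular expression are assumptions, chosen to mirror the rule that only the reporter ion substring of each line-1 cell matters. The example uses four columns taken from the template above.

```python
import re

# Matches a reporter ion label such as 126, 127N, or 130C inside a cell.
# Assumption: ions are three digits optionally followed by N or C.
ION_PATTERN = re.compile(r"\d{3}[NC]?")


def parse_template(text):
    """Map (directory, reporter ion) -> sample name from a 3-line template."""
    ion_line, name_line, dir_line = [
        line.split("\t") for line in text.splitlines() if line.strip()
    ][:3]
    mapping = {}
    for cell, name, directory in zip(ion_line, name_line, dir_line):
        match = ION_PATTERN.search(cell)
        if match is None:
            raise ValueError(f"no reporter ion found in cell {cell!r}")
        # Prefixes such as A or B are ignored; the directory tells plexes apart.
        mapping[(directory.strip(), match.group())] = name.strip()
    return mapping


example = (
    "A126\tA127N\tB126\tB127N\n"
    "MD-8214C long survival, no chemo\tMD-8226C long survival, unknown chemo\t"
    "MD-7800T, long survival, unknown chemo\tMD-1044C short survival, adj chemo\n"
    "TMT10A_bRP\tTMT10A_bRP\tTMT10B_bRP\tTMT10B_bRP\n"
)
mapping = parse_template(example)
print(mapping[("TMT10B_bRP", "127N")])  # MD-1044C short survival, adj chemo
```

Keying on the directory as well as the ion is what lets the same ion (for example 126) carry a different sample name in each plex.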

MeanMulti


If the control ion was designated as MeanMulti when generating the SM report, the mean intensity of multiple reporter ions is used as the denominator in the ratios for each of the TMT10 mass labels. Consequently, there will be 1 additional ratio for each directory in the report. In the corresponding reporter_sample_template, indicate which ions were used as the denominator by joining the ions with a dot character (keep them in alphabetic order, to match the column headers in the .ssv file). The dot delimiter tells the parser not to expect this channel to also be used as a numerator.

126	127N	127C	128N	128C	129N	129C	130N	130C	131	126.127N.128C
Control_205	Control_206	LM2.3(9.1)	empty	Control_207	SNED1_KD_208	231.1	231.2	SNED1_KD_210	SNED1_KD_211	Control_3_mean
bRPfrxns_0813	bRPfrxns_0813	bRPfrxns_0813	bRPfrxns_0813	bRPfrxns_0813	bRPfrxns_0813	bRPfrxns_0813	bRPfrxns_0813	bRPfrxns_0813	bRPfrxns_0813	bRPfrxns_0813
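The dot convention can be illustrated with a short Python sketch. The helper names denominator_label and split_channels are hypothetical, and how the actual parser treats these cells is not documented here; the sketch only shows how a dot-joined, alphabetically ordered label can be built and then told apart from ordinary channels.

```python
def denominator_label(ions):
    """Join the denominator ions with dots, in alphabetic order."""
    return ".".join(sorted(ions))


def split_channels(ion_cells):
    """Separate plain numerator channels from dot-joined denominator cells."""
    numerators = [c for c in ion_cells if "." not in c]
    denominators = [c.split(".") for c in ion_cells if "." in c]
    return numerators, denominators


cells = ["126", "127N", "127C", "128N", "128C", "129N", "129C",
         "130N", "130C", "131", denominator_label(["128C", "126", "127N"])]
numerators, denominators = split_channels(cells)
print(cells[-1])      # 126.127N.128C
print(denominators)   # [['126', '127N', '128C']]
```

Sorting before joining keeps the label identical no matter what order the ions were typed in, which is what matching against the .ssv column headers requires.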


Create a sample-annotation.csv file (vertical layout)

The requirements described below are, by design, harmonized between tools developed in the Broad Institute Proteomics Platform group including: Spectrum Mill and the downstream tools Protigy and Panoply.

In order for the parseSMreport script to correlate each reporter ion label with a sample name and metadata to produce a GCT v1.3 format output file you must provide a comma-delimited .csv file in the SM directory that contains the report that is parsed. While the filename may be prefixed at the user's discretion, the text file must be named with the suffix
     sample-annotation.csv
And be organized like the example below:

The following considerations apply:

Sorting of the sample-annotation.csv file:

Sort as you wish. Users preparing a sample-annotation file often end up with rows in a different order than the data columns in the SM report to which they will be matched, for example sorted by Sample.ID, or with the Channel column sorted alphabetically instead of by mass. So to reduce excess "quick questions" back to the developer, the script has been written to tolerate differences in sort order. However, keep in mind that the output .GCT file maintains the column order of the input .ssv file. The metadata columns in the sample-annotation file, which become rows in the .GCT file, are expected to enable easy re-sorting of the data columns during downstream analyses.

Example sample-annotation.csv file

Download a template suited to your labeling reagent: TMT6, TMT10, TMT11

TMT6-multimedian-2004-RedSox-reverse-the-curse-sample-annotation.csv

A hypothetical TMT6 experimental design contemplated at the ice cream social held across the river from Fenway Park in Cambridge, MA at the Broad Institute in late October of 2004. The design has 3 TMT6 plexes, with each of 9 samples run in duplicate, and the P/P Summary report was generated with MedianMulti as the control ion. Thus ratios were calculated using a denominator composed of the median of all 6 channels.

Sample.ID,Participant,Experiment,Channel,Type
Manny.1,Ramirez,1,126,MVP 2004
Manny.2,Ramirez,1,127,MVP 2004
David.1,Ortiz,1,128,MVP 2013
David.2,Ortiz,1,129,MVP 2013
Dave.1,Roberts,1,130,Slide
Dave.2,Roberts,1,131,Slide
Johnny.1,Damon,2,126,Yankee traitor
Johnny.2,Damon,2,127,Yankee traitor
Pedro.1,Martinez,2,128,Hall of Fame 2015
Pedro.2,Martinez,2,129,Hall of Fame 2015
Jason.1,Varitek,2,130,Captain
Jason.2,Varitek,2,131,Captain
Kevin.1,Millar,3,126,Cowboy Up
Kevin.2,Millar,3,127,Cowboy Up
Curt.1,Schilling,3,128,Bloody Sock
Curt.2,Schilling,3,129,Bloody Sock
Terry.1,Francona,3,130,Rookie Skipper
Terry.2,Francona,3,131,Rookie Skipper
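The tolerance for differently sorted annotation rows can be pictured with a small Python sketch. It is an assumption, not the script's documented behavior, that rows are matched to report columns by the pair (Experiment, Channel); the function names are hypothetical, and the three rows are taken from the example above in deliberately shuffled order.

```python
import csv
import io


def read_annotation(csv_text):
    """Index annotation rows by (Experiment, Channel)."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return {(r["Experiment"], r["Channel"]): r for r in rows}


def align_to_report(annotation, report_columns):
    """Return annotation rows in the column order of the report."""
    return [annotation[key] for key in report_columns]


# Rows deliberately out of order relative to the report columns.
csv_text = """Sample.ID,Participant,Experiment,Channel,Type
Dave.2,Roberts,1,131,Slide
Manny.1,Ramirez,1,126,MVP 2004
Dave.1,Roberts,1,130,Slide
"""
report_columns = [("1", "126"), ("1", "130"), ("1", "131")]
ordered = align_to_report(read_annotation(csv_text), report_columns)
print([row["Sample.ID"] for row in ordered])  # ['Manny.1', 'Dave.1', 'Dave.2']
```

Because the lookup is keyed rather than positional, the output follows the report's column order regardless of how the annotation file was sorted.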


How to read/write a .GCT v1.3 file using R or Python

GCT - (Gene Cluster Text) v1.3 is a tab-delimited text file format that is convenient for analysis of matrix-compatible datasets as it allows metadata about an experiment to be stored alongside the data from the experiment.

Now perhaps you are thinking:
     "Hmmm, this .GCT format is kind of cool in that the data and metadata are combined, but how am I going to read it into my script so I can do my own analyses?"
Probably just install the GCT package, right? Ummm, well kinda sorta, but not really, at least not at the moment anyway. However, currently there are the following options:

Python

Install cmapPy
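If installing cmapPy is not an option, the v1.3 layout is simple enough to read with the Python standard library. The sketch below is a stand-in, not the cmapPy API. It assumes the usual GCT v1.3 layout: a #1.3 version line, a line with the number of data rows, data columns, row-metadata fields and column-metadata fields, a header line, the column-metadata rows, and then the data rows. The function name read_gct13 and the example identifiers (geneSymbol, P1, GENE1) are hypothetical, and values are left as strings.

```python
def read_gct13(text):
    """Parse GCT v1.3 text into data, row metadata, and column metadata."""
    lines = text.rstrip("\n").split("\n")
    if lines[0].strip() != "#1.3":
        raise ValueError("not a GCT v1.3 file")
    n_rows, n_cols, n_rmeta, n_cmeta = (int(x) for x in lines[1].split("\t")[:4])
    header = lines[2].split("\t")
    row_meta_names = header[1:1 + n_rmeta]
    col_ids = header[1 + n_rmeta:1 + n_rmeta + n_cols]

    # Column-metadata rows: field name, placeholders under the row-metadata
    # columns, then one value per data column.
    col_meta = {cid: {} for cid in col_ids}
    for line in lines[3:3 + n_cmeta]:
        fields = line.split("\t")
        for cid, value in zip(col_ids, fields[1 + n_rmeta:]):
            col_meta[cid][fields[0]] = value

    # Data rows: row id, row-metadata values, then one value per data column.
    data, row_meta = {}, {}
    for line in lines[3 + n_cmeta:3 + n_cmeta + n_rows]:
        fields = line.split("\t")
        rid = fields[0]
        row_meta[rid] = dict(zip(row_meta_names, fields[1:1 + n_rmeta]))
        data[rid] = dict(zip(col_ids, fields[1 + n_rmeta:]))
    return data, row_meta, col_meta


example = "\n".join([
    "#1.3",
    "2\t2\t1\t1",
    "id\tgeneSymbol\tManny.1\tManny.2",
    "Type\tna\tMVP 2004\tMVP 2004",
    "P1\tGENE1\t0.5\t-0.2",
    "P2\tGENE2\t1.1\tNA",
])
data, row_meta, col_meta = read_gct13(example)
print(data["P1"]["Manny.2"])        # -0.2
print(col_meta["Manny.1"]["Type"])  # MVP 2004
```

Keeping the values as strings avoids guessing how missing values are encoded; convert to numbers once you know what your files contain.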

R

Install cmapR and its 37 dependency packages directly or through Bioconductor and use in R-scripts via:
library(cmapR)

or

Grab just the necessary cmapR package from the Spectrum Mill server that produced your .GCT files or from the cmapR GitHub repository. The cmapR installation used in SM was obtained from the cmapR GitHub repository by "cloning the repo": from the GitHub page, click the green Code button, then press Download zip. Unzip the contents of cmapR-master.zip to SpectrumMill/millr/cmapR, which involves renaming the top-level directory from cmapR-master to cmapR. Within SM the package is then used in two R scripts via:
source("cmapR/R/io.R")         #millR/cmapR/R/io.R
source("cmapR/R/GCT.R")     #millR/cmapR/R/GCT.R

to use the cmapR functions: write_gct(), parse_gctx(), and subset_gct()

Full cmapR documentation is available at rdrr.io/github/cmap/cmapR/man/


plotRatioDistributions.r

Plots histograms of iTRAQ/TMT ratio distributions in a dataset at the protein or VM-site level.


normalizeReporterRatios.r

Normalizes distributions of reporter ion ratios.


Running individual scripts from the command line

If you wish to run any of the individual R scripts from an MS-DOS command prompt do the following:

  1. Install R on your local machine. Download it from: https://www.r-project.org/
  2. Add the directory containing Rscript.exe to the Windows system path on your local machine (Start menu, then Control Panel/System/Advanced/Environment Variables/System variables/Path), for example:
    c:\Program Files\R\R-3.2.2\bin\x64
  3. Open an MS-DOS Command Window. (From the Windows Start menu, select All programs then Accessories then Command Prompt)
  4. Change to the volume where the millR directory on your Spectrum Mill server is mounted.
  5. Type lines like the following which run each script along with required parameters


Customizing individual scripts

The individual scripts are installed on each Spectrum Mill server in the directory:
example: \\Cibola\SpectrumMill\millR