Process Report

Introduction
Input variables shared across all scripts
parseSMreport.r

Create a reporter_sample_template.txt file (horizontal layout)
Create a sample-annotation.csv file (vertical layout)
How to read/write a .GCT v1.3 file using R or Python

plotRatioDistributions.r
normalizeReporterRatios.r
Running individual scripts from the command line
Customizing individual scripts

Introduction

When run from the SM web page (SpectrumMill\millhtml\processReport.htm), a Perl script (SpectrumMill\millscript\processReport.pl) serves as a simple pipeline: it conveys the input parameters and executes the user-selected R scripts, with all input and output files residing in the SM server's filesystem.

These SM report processing R scripts were architected with the following in mind:

Scripts could be run in multiple ways:

Via a web browser that navigates the Spectrum Mill filesystem.
From the command line.
Called from other R scripts.

Individuals would modify a script for related uses.
Scripts would be used in pipelines other than Spectrum Mill.
Scripts would be run on either Windows or Unix.
Project-specific features would be added.

Input variables shared across all scripts

Data Directories: Choose the directories which contain SM reports, not all of the directories which contributed to a report.
Project: choices other than Generic are used to trigger customized features associated with a particular project:
- CPTAC2 TCGA: specific means of parsing LC-MS/MS run names to extract sample reporter ion associations.
- Matrisome: specific subsets of proteins are selected for separate handling in distribution plotting (ECM, Collagen, Fibrillar collagens) which are recognized by extra columns included in the report (Division, Category, Notes).
- GO Categories Present: Additional columns present in the report are extracted and passed through by the parser.
SM report type: Corresponds to the mode selected when generating a report from Protein/Peptide Summary. All reports are expected to have been generated with the run specific option enabled. The report type is used by the parser not only to determine what file name to retrieve from the data directory, but also to indicate what format to expect and trigger appropriate handling/presentation of the report's content.

parseSMreport.r

Parses .ssv files generated by Spectrum Mill Protein/Peptide Summary to selectively extract the highest information value columns.

Input Options

Reporter Ion label type: The type selected tells the parser how many data columns to expect for each directory (the number of ratios should be the number of reporter ions - 1). However, TMT10 mean multi indicates that 10 ratios are expected: when the SM report was generated the control ion was designated as MeanMulti, so the mean intensity of multiple reporter ions was used as the denominator in ratios for each of the TMT10 mass labels.
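The counting rule above can be written down as a small sketch. The function name and arguments are hypothetical and are not taken from parseSMreport.r; it only restates the arithmetic described in this section.

```python
def expected_ratio_count(reporter_ions: int, multi_ion_control: bool = False) -> int:
    """Number of ratio columns expected per data directory (illustrative only)."""
    if reporter_ions < 2:
        raise ValueError("need at least two reporter ions to form a ratio")
    # Single control channel: that channel is only a denominator,
    # so every other ion contributes one ratio.
    # MeanMulti control: the denominator is a mean over several ions,
    # so every ion still appears as a numerator.
    return reporter_ions if multi_ion_control else reporter_ions - 1
```

For TMT10 this gives 9 ratios with a single control ion and 10 ratios with MeanMulti.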

Output Options

The default is .txt. There is no UI option to select output format. Instead the output format is controlled by the presence of a user-provided auxiliary file placed in the Data Directory before running the script.

.TXT - Simple tab-delimited output using either the original data column headers from P/P Summary or revised headers based on the content of a user provided reporter_sample_template.txt file (see below), if present in the data directory being processed.
.GCT - (Gene Cluster Text) v1.3 is a tab-delimited text file format that is convenient for analysis of matrix-compatible datasets as it allows metadata about an experiment to be stored alongside the data from the experiment. In order to produce this output format the sample identifiers and metadata must be correlated with the reporter ion labels used in the experiment via a user provided sample-annotation.csv file (see below), present in the data directory being processed.
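Since the output format depends on which auxiliary files are present, it can help to check a data directory before running the script. The following is a minimal sketch in Python, not the logic used by parseSMreport.r; the function name is hypothetical, and it only reports which of the two files described here were found.

```python
def detect_auxiliary_files(filenames):
    """Report which user-provided auxiliary files appear in a directory listing."""
    return {
        # Must be named exactly reporter_sample_template.txt
        "reporter_template": any(
            name == "reporter_sample_template.txt" for name in filenames
        ),
        # Any prefix is allowed, but the name must end with this suffix
        "sample_annotation": any(
            name.endswith("sample-annotation.csv") for name in filenames
        ),
    }

# Typical use (data_dir is a placeholder for your own path):
#   import os
#   detect_auxiliary_files(os.listdir(data_dir))
```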

Create a reporter_sample_template.txt file (horizontal layout)

When the parseSMreport script is run to process a P/P Summary report it can update the column headers in the report to include meta information, like more specific sample names attached to each reporter ion. These names will also be propagated through to the ratio distribution plots, and reports following normalization.

In order to provide the script with information on a sample name for each reporter ion you must create a tab-delimited text file in the SM directory that contains the initial report that is parsed. The text file must be named:
reporter_sample_template.txt
And be organized like the following example:
This example is for an experiment involving 18 samples that were run in 2 separate TMT-10 plexes, with 9 samples and a common control (mix of all 18).

line 1: reporter ion. The parser looks only for a substring in each cell's text for the reporter ion (126, 127N, etc). The A and B are optional, and added only for user convenience.
line 2: sample name, free text.
line 3: directory name, directory the data was present in when the report was generated. This is necessary to match up the name and reporter ion, when more than 1 directory contributed to the report.

Special considerations:

Do not use a colon character : in any cells of the report, because colon will later be inserted by the parser to denote the numerator : denominator involved in a ratio.
The examples have the columns in mass order. However, you can put the columns in any order. The parsed report will come out with the same column order as the reporter_sample_template.txt.
When saving reporter_sample_template.txt in Excel
Save as type:
Text (Tab delimited) (*.txt)
not
Unicode Text (*.txt)
You can diagnose problems caused by saving as Unicode Text by the funky characters in the script output pane in the browser in the reporterIons field.
numerator: 126 denominator: 127C reporterIons: ÿþ1

A126	A127N	A127C	A128N	A128C	A129N	A129C	A130N	A130C	A131	B126	B127N	B127C	B128N	B128C	B129N	B129C	B130N	B130C	B131
MD-8214C long survival, no chemo	MD-8226C long survival, unknown chemo	NP-8932N	PC-8592T	PI-8592N2	PI-8762N	PD-8832T long survival, palliative chemo	PD-8715 short survival, adj chemo	WD-8996C long survival, adj chemo	CC-all18	MD-7800T, long survival, unknown chemo	MD-1044C short survival, adj chemo	NP-8926N0308	PC-1254T	PI-8926C	PD-8216C long survival, adj chemo	PD-8216T short survival, adj chemo	WD-8721C long survival, adj chemo	WD-6563T long survival, unknown chemo	CC-all18
TMT10A_bRP	TMT10A_bR	TMT10A_bRP	TMT10B_bRP
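The layout rules and special considerations above can be checked before running the parser. Here is a minimal sketch in Python; it is not part of the SM scripts, the function name is hypothetical, and it checks only what this section states: three tab-delimited lines, the same number of cells on each line, no colon in any cell, and a file that was not saved as Unicode Text (which begins with the byte-order mark that shows up as ÿþ).

```python
import codecs

def check_template(raw: bytes) -> list:
    """Return a list of problems found in reporter_sample_template.txt contents."""
    problems = []
    # Excel's "Unicode Text" writes UTF-16 with a byte-order mark (the "ÿþ" debris).
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        problems.append("saved as Unicode Text (UTF-16); re-save as Text (Tab delimited)")
        return problems
    text = raw.decode("latin-1")
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) != 3:
        problems.append(f"expected 3 lines, found {len(lines)}")
        return problems
    rows = [line.split("\t") for line in lines]
    if len({len(row) for row in rows}) != 1:
        counts = ", ".join(str(len(row)) for row in rows)
        problems.append(f"lines have different numbers of cells: {counts}")
    for line_number, row in enumerate(rows, start=1):
        for cell in row:
            if ":" in cell:
                problems.append(f"line {line_number}: colon in cell {cell!r}")
    return problems
```

Read the file in binary mode, for example open(path, "rb").read(), so that the byte-order mark is still visible to the check.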

MeanMulti

If the control ion was designated as MeanMulti when generating the SM report, the mean intensity of multiple reporter ions is used as the denominator in ratios for each of the TMT10 mass labels. Consequently, there will be 1 additional ratio for each directory in the report. In the corresponding reporter_sample_template, indicate which ions were used as the denominator by joining the ions with a dot character (and keep them in alphabetic order, to match the column headers in the .ssv file). The dot delimiter tells the parser not to expect this channel to also be used as a numerator.

126	127N	127C	128N	128C	129N	129C	130N	130C	131	126.127N.128C
Control_205	Control_206	LM2.3(9.1)	empty	Control_207	SNED1_KD_208	231.1	231.2	SNED1_KD_210	SNED1_KD_211	Control_3_mean
bRPfrxns_0813	bRPfrxns_0813	bRPfrxns_0813	bRPfrxns_0813	bRPfrxns_0813	bRPfrxns_0813	bRPfrxns_0813	bRPfrxns_0813	bRPfrxns_0813	bRPfrxns_0813	bRPfrxns_0813
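To illustrate how line 1 cells such as A127N or 126.127N.128C could be interpreted, here is a minimal Python sketch. It is an assumption about one way to do the substring matching and is not the code in parseSMreport.r; the regular expression and function name are hypothetical.

```python
import re

# Assumed pattern for reporter ion names such as 126, 127N, 130C, 131.
REPORTER_ION = re.compile(r"1[23]\d[NC]?")

def parse_reporter_cell(cell: str):
    """Return (ions, is_multi_ion_denominator) for one cell of template line 1."""
    ions = REPORTER_ION.findall(cell)
    if not ions:
        raise ValueError(f"no reporter ion found in {cell!r}")
    # A dot-joined cell names the ions averaged into the denominator.
    is_multi = "." in cell
    if is_multi and ions != sorted(ions):
        raise ValueError(f"ions in {cell!r} are not in alphabetic order")
    return ions, is_multi
```

With this sketch, a cell like A127N yields the single ion 127N (the optional A prefix is ignored), while 126.127N.128C yields three ions and is flagged as a denominator-only column.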

Create a sample-annotation.csv file (vertical layout)

The requirements described below are, by design, harmonized between tools developed in the Broad Institute Proteomics Platform group including: Spectrum Mill and the downstream tools Protigy and Panoply.

In order for the parseSMreport script to correlate each reporter ion label with a sample name and metadata to produce a GCT v1.3 format output file you must provide a comma-delimited .csv file in the SM directory that contains the report that is parsed. While the filename may be prefixed at the user's discretion, the text file must be named with the suffix
sample-annotation.csv
And be organized like the example below:

The following considerations apply:

The only required columns are: Sample.ID, Experiment, and Channel.
Though not required for SM, the column: Type (Tumor, NAT, etc) is required downstream for Panoply use.
Note that the sample-annotation file must not contain a row for the sample(s) used as a control ion (denominator in ratios). However, users are encouraged to include the isobaric label name and control ion channel in the filename of the sample-annotation file.
The headers of metadata columns must all be R-compatible: no spaces, no special characters such as ( ), and they cannot begin with a number.
Replicates: Sample.IDs must be unique and cannot have duplicates. If replicate samples are present, they must have unique Sample.IDs, but can be identified as replicates by using an additional column (named "Participant" for downstream Panoply use) with identical ids.
The column Experiment (integer values) must be numbered in the exact same order as the data directories appear in the report (or Sample.ID and metadata will be mis-applied), which is the same order as they were selected when creating the report in P/P Summary. Note that the SM data directory need not be, but could be, a column in the sample-annotation file.
The column Channel should include the reporter ion label used in SM reports. Please note the following nuance: TMT6 and TMT10 SM reports use 126 and 131, while TMT11, TMT16, and TMT18 SM reports use 126C and 131N.
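Several of the considerations above lend themselves to an automated check before the file is handed to the parser. The following Python sketch is not part of Spectrum Mill; the function name is hypothetical, and the header pattern is a simplified assumption of "R-compatible" (starts with a letter, then letters, digits, dots, or underscores).

```python
import csv
import io
import re

REQUIRED_COLUMNS = ("Sample.ID", "Experiment", "Channel")
# Simplified assumption of an R-compatible column name.
R_COMPATIBLE = re.compile(r"^[A-Za-z][A-Za-z0-9._]*$")

def check_sample_annotation(csv_text: str) -> list:
    """Return a list of problems found in sample-annotation.csv contents."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = reader.fieldnames or []
    for column in REQUIRED_COLUMNS:
        if column not in headers:
            problems.append(f"missing required column {column}")
    for header in headers:
        if not R_COMPATIBLE.match(header):
            problems.append(f"header {header!r} is not R-compatible")
    if problems:
        return problems
    seen_ids = set()
    for line_number, row in enumerate(reader, start=2):
        sample_id = row["Sample.ID"]
        if sample_id in seen_ids:
            problems.append(f"line {line_number}: duplicate Sample.ID {sample_id!r}")
        seen_ids.add(sample_id)
        if not (row["Experiment"] or "").strip().isdigit():
            problems.append(f"line {line_number}: Experiment must be an integer")
    return problems
```

An empty list means none of the checked rules were violated; it does not confirm that the Experiment numbering matches the directory order in the report.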

Sorting of the sample-annotation.csv file:

Sort as you wish. It has been expected that when some users prepare a sample-annotation file they might frequently end up with rows sorted in a different order than the data columns in the SM report to which they will be mated, for reasons like sorted by Sample.IDs, or the Channel column sorted alphabetically instead of by mass. So to reduce excess "quick questions" back to the developer, the script has been written to tolerate differences in sort. However, keep in mind that the output .GCT file will maintain the order of the columns in the input .ssv file. It is expected that the metadata columns in the sample-annotation file which become rows in the .GCT file will enable facile re-sorting of the data columns during downstream analyses.
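One way to picture this sort tolerance: annotation rows are looked up by their (Experiment, Channel) pair rather than by position. The Python sketch below shows the idea under that assumption; it is not the code used in parseSMreport.r, and the function name is hypothetical.

```python
def order_like_report(report_columns, annotation_rows):
    """Reorder annotation rows to follow the report's column order.

    report_columns: list of (experiment, channel) pairs in .ssv column order.
    annotation_rows: list of dicts with at least Experiment and Channel keys,
    in any order.
    """
    by_key = {
        (int(row["Experiment"]), row["Channel"]): row for row in annotation_rows
    }
    ordered = []
    for key in report_columns:
        if key not in by_key:
            raise KeyError(f"no annotation row for experiment/channel {key}")
        ordered.append(by_key[key])
    return ordered
```

Because the lookup is keyed, sorting the annotation file by Sample.ID or by Channel changes nothing in the result; the output always follows the report.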

Example sample-annotation.csv file

Download a template suited to your labeling reagent: TMT6 TMT10 TMT11

TMT6-multimedian-2004-RedSox-reverse-the-curse-sample-annotation.csv

A hypothetical TMT6 experimental design contemplated at the ice cream social held across the river from Fenway Park in Cambridge, MA at the Broad Institute in late October of 2004. 3 TMT6 plexes with each of 9 samples run in duplicate for a P/P Summary report generated with MedianMulti as the control ion. Thus ratios were calculated using a denominator composed of the median of all 6 channels.

Sample.ID	Participant	Experiment	Channel	Type
Manny.1	Ramirez	1	126	MVP 2004
Manny.2	Ramirez	1	127	MVP 2004
David.1	Ortiz	1	128	MVP 2013
David.2	Ortiz	1	129	MVP 2013
Dave.1	Roberts	1	130	Slide
Dave.2	Roberts	1	131	Slide
Johnny.1	Damon	2	126	Yankee traitor
Johnny.2	Damon	2	127	Yankee traitor
Pedro.1	Martinez	2	128	Hall of Fame 2015
Pedro.2	Martinez	2	129	Hall of Fame 2015
Jason.1	Varitek	2	130	Captain
Jason.2	Varitek	2	131	Captain
Kevin.1	Millar	3	126	Cowboy Up
Kevin.2	Millar	3	127	Cowboy Up
Curt.1	Schilling	3	128	Bloody Sock
Curt.2	Schilling	3	129	Bloody Sock
Terry.1	Francona	3	130	Rookie Skipper
Terry.2	Francona	3	131	Rookie Skipper

How to read/write a .GCT v1.3 file using R or Python

GCT - (Gene Cluster Text) v1.3 is a tab-delimited text file format that is convenient for analysis of matrix-compatible datasets as it allows metadata about an experiment to be stored alongside the data from the experiment.

Now perhaps you are thinking:
"Hmmm, this .GCT format is kind of cool in that the data and metadata are combined, but how am I going to read it in to my script so I can do my own analyses?"
Probably just install the GCT package, right? Ummm, well kinda sorta, but not really, at least not at the moment anyway. However, currently there are the following options:

Python

Install cmapPy
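If you only need to peek inside a .GCT v1.3 file and do not want to install anything, the layout is simple enough to read with the Python standard library. The sketch below assumes the usual v1.3 layout (version line, a line of four counts, a header line, column-metadata rows, then data rows) and keeps every value as text; the function name is hypothetical, and for real analyses cmapPy remains the better-tested route.

```python
import csv
import io

def read_gct13(text: str) -> dict:
    """Parse GCT v1.3 text into plain dictionaries (all values kept as strings)."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows or rows[0][0].strip() != "#1.3":
        raise ValueError("not a GCT v1.3 file")
    # Line 2: data rows, data columns, row-metadata fields, column-metadata fields
    n_rows, n_cols, n_row_meta, n_col_meta = (int(x) for x in rows[1][:4])
    header = rows[2]
    first_data = 1 + n_row_meta
    row_meta_names = header[1:first_data]
    col_ids = header[first_data:first_data + n_cols]
    # Column-metadata rows: field name, filler cells, then one value per column
    col_meta = {
        line[0]: dict(zip(col_ids, line[first_data:]))
        for line in rows[3:3 + n_col_meta]
    }
    data, row_meta = {}, {}
    for line in rows[3 + n_col_meta:3 + n_col_meta + n_rows]:
        row_id = line[0]
        row_meta[row_id] = dict(zip(row_meta_names, line[1:first_data]))
        data[row_id] = dict(zip(col_ids, line[first_data:]))
    return {"col_ids": col_ids, "col_meta": col_meta,
            "row_meta": row_meta, "data": data}
```

The sample-annotation columns end up under col_meta, keyed first by metadata field and then by column id, which is what makes re-sorting the data columns by metadata straightforward.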

R

Install cmapR and its 37 dependency packages directly or through Bioconductor and use in R-scripts via:
library(cmapR)

Grab just the necessary cmapR package from the Spectrum Mill server that produced your .GCT files or the cmapR Github repository. The cmapR installation used in SM was obtained from the cmapR Github repository by "cloning the repo". From the Github page, click the green Code button, then press Download zip. From cmapR-master.zip, unzip the contents to SpectrumMill/millr/cmapR. This involved renaming the top-level directory cmapR-master to cmapR. Within SM the package is then used in two R scripts:

SpectrumMill/millR/parseSMreport.r
SpectrumMill/millR/normalizeReporterRatios.r

via:
source("cmapR/R/io.R") #millR/cmapR/R/io.R
source("cmapR/R/GCT.R") #millR/cmapR/R/GCT.R

to use the cmapR functions: write_gct(), parse_gctx(), and subset_gct()

Full cmapR documentation is available at rdrr.io/github/cmap/cmapR/man/

plotRatioDistributions.r

Plot histograms of iTRAQ/TMT ratio distributions in a dataset at the protein or VM-site level.

normalizeReporterRatios.r

Normalize distributions of reporter ion ratios.

Running individual scripts from the command line

If you wish to run any of the individual R scripts from an MS-DOS command prompt do the following:

Install R on your local machine, download from: https://www.r-project.org/
Add Rscript.exe to the Windows system path on your local machine. Startup menu – control panels/system/advanced/environment variables/system variables/path
c:\Program Files\R\R-3.2.2\bin\x64
Open an MS-DOS Command Window. (From the Windows Start menu, select All programs then Accessories then Command Prompt)
Change to the volume where the millR directory on your Spectrum Mill server is mounted.
- If you have administrator privileges on the SM server and you have mounted Y: as E$(\\dunlop)
  Type 'Y:' then 'cd spectrumMill\millr' and the command prompt will read
  Y:\spectrumMill\millr>
- If you do not have administrator privileges, but the millR directory has been shared, then map a network drive:
  millR(\\dunlop)Y:
  Type 'Y:' and the command prompt will read
  Y:\>
  Type 'dir' to verify that *.r files are present
Type lines like the following, which run each script along with its required parameters:

Rscript.exe parseSMreport.r proteinProteinCentricColumnsExport.1.ssv proteome TMT10 ..\msdataSM\Karl\test\
Rscript.exe plotRatioDistributions.r proteinProteinCentricColumnsExport.1-ratio.txt 3 ..\msdataSM\Karl\test\
Rscript.exe normalizeReporterRatios.r proteinProteinCentricColumnsExport.1-ratio.txt median ..\msdataSM\Karl\test\
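Because the scripts are meant to be usable in other pipelines, the same command lines can also be assembled and launched from a driver script. Here is a minimal Python sketch; the helper names are hypothetical, the arguments are simply passed through in the order shown in the examples above, and it assumes Rscript is on the system path and that the working directory is millR.

```python
import shutil
import subprocess

def build_rscript_command(script, *args, rscript="Rscript"):
    """Assemble the argument list for one script run, mirroring the lines above."""
    if not script.lower().endswith(".r"):
        raise ValueError(f"{script!r} does not look like an R script")
    return [rscript, script, *(str(arg) for arg in args)]

def run_script(script, *args, cwd=None):
    """Run one script and raise if Rscript is missing or the run fails."""
    executable = shutil.which("Rscript")
    if executable is None:
        raise FileNotFoundError("Rscript was not found on the system path")
    command = build_rscript_command(script, *args, rscript=executable)
    return subprocess.run(command, cwd=cwd, check=True)

# Example (paths are the placeholders used in this section):
#   run_script("normalizeReporterRatios.r",
#              "proteinProteinCentricColumnsExport.1-ratio.txt",
#              "median", "..\\msdataSM\\Karl\\test\\",
#              cwd="Y:\\spectrumMill\\millr")
```

Passing the arguments as a list avoids shell quoting problems with backslashes and spaces in Windows paths.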

Customizing individual scripts

The individual scripts are installed on each Spectrum Mill server in the directory:
example: \\Cibola\SpectrumMill\millR