When run from the SM web page (SpectrumMill\millhtml\processReport.htm), a Perl script (SpectrumMill\millscript\processReport.pl)
serves as a simple pipeline: it conveys the input parameters and executes the user-selected R scripts, with all input/output files residing in
the SM server's filesystem.
These SM report processing R scripts were architected with the following in mind:
- Scripts could be run in multiple ways:
  - Via a web browser that navigates the Spectrum Mill filesystem.
  - From the command line.
  - Called from other R scripts.
- Individuals would modify a script for related uses.
- Scripts would be used in pipelines other than Spectrum Mill.
- Scripts would be run on either Windows or Unix.
- Project-specific features would be added.
Input variables shared across all scripts
- Data Directories: Choose the directories that contain SM reports, not all of the
directories that contributed to a report.
- Project: choices other than Generic are used to trigger customized features associated
with a particular project:
- CPTAC2 TCGA: specific means of parsing LC-MS/MS run names to extract sample
reporter ion associations.
- Matrisome: specific subsets of proteins are selected for separate handling
in distribution plotting (ECM, Collagen, Fibrillar collagens) which are recognized
by extra columns included in the report (Division, Category, Notes).
- GO Categories Present: Additional columns present in the report are extracted and
passed through by the parser.
- SM report type: Corresponds to the mode selected when generating a report from Protein/Peptide Summary.
All reports are expected to have been generated with the run specific option enabled. The report type
is used by the parser not only to determine what file name to retrieve from the data directory, but also
to indicate what format to expect and trigger appropriate handling/presentation of the report's content.
Parses .ssv files generated by Spectrum Mill Protein/Peptide Summary to selectively extract the highest information value columns.
- Reporter Ion label type: The type selected is used by the parser to specify how many data columns
are expected to be present for each directory (the number of ratios should be the number
of reporter ions - 1). However, TMT10 mean multi indicates that 10 ratios are expected: when generating the SM
report the control ion was designated as MeanMulti, so the mean intensity of multiple
reporter ions was used as the denominator in ratios for each of the TMT10 mass labels.
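The ratio-count rule above can be sketched as a small lookup. This is only an illustration in Python, not the parser's actual R code; the function name `expected_ratio_count` and the `multi_denominator` flag are hypothetical, and the ion counts simply follow from the reagent names.

```python
# Number of reporter ions per isobaric label type (follows from the reagent names).
REPORTER_ION_COUNTS = {"TMT6": 6, "TMT10": 10, "TMT11": 11, "TMT16": 16, "TMT18": 18}


def expected_ratio_count(label_type, multi_denominator=False):
    """Return how many ratio columns to expect per data directory.

    With a single control ion there is one ratio per remaining ion (n - 1).
    With a MeanMulti-style denominator every ion is also a numerator (n).
    """
    n_ions = REPORTER_ION_COUNTS[label_type]
    return n_ions if multi_denominator else n_ions - 1
```

For example, `expected_ratio_count("TMT10")` gives 9, while `expected_ratio_count("TMT10", multi_denominator=True)` gives 10.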
The default output format is .txt. There is no UI option to select the output format. Instead, the output format is controlled by the presence of a user-provided auxiliary file placed in the Data Directory before running the script.
- .TXT - Simple tab-delimited output using either the original data column headers from P/P Summary or revised
headers based on the content of a user provided reporter_sample_template.txt file
(see below), if present in the data directory being processed.
- .GCT - (Gene Cluster Text) v1.3 is a tab-delimited text file format
that is convenient for analysis of matrix-compatible datasets as it allows metadata about an experiment to be stored alongside
the data from the experiment. In order to produce this output format the sample identifiers and metadata must be correlated with the reporter
ion labels used in the experiment via a user provided sample-annotation.csv file
(see below), present in the data directory being processed.
Create a reporter_sample_template.txt file (horizontal layout)
When the parseSMreport script is run to process a P/P Summary report it can update the column headers in the report to include meta information,
like more specific sample names attached to each reporter ion. These names will also be propagated through to
the ratio distribution plots, and reports following normalization.
In order to provide the script with information on a sample name for each reporter ion you must create a tab-delimited text
file in the SM directory that contains the initial report that is parsed. The text file must be named reporter_sample_template.txt
and be organized like the following example:
This example is for an experiment involving 18 samples that were run in 2 separate TMT-10 plexes, each with 9 samples and a common control (mix of all 18).
- line 1: reporter ion. The parser looks only for a reporter ion substring (126, 127N, etc) in each cell's text. The A and B are optional, and added only for user convenience.
- line 2: sample name, free text.
- line 3: directory name, directory the data was present in when the report was generated. This is necessary to match up the name and reporter ion, when more than 1 directory contributed to the report.
- Do not use a colon character : in any cells of the report, because colon will later be inserted by the parser to denote the numerator : denominator involved in a ratio.
- The examples have the columns in mass order. However, you can put the columns in any order. The parsed report will come out with the same column order as the reporter_sample_template.txt.
- When saving reporter_sample_template.txt in Excel, choose Save as type: Text (Tab delimited) (*.txt). Do not choose Unicode Text (*.txt).
You can diagnose problems caused by saving as Unicode Text by the funky characters shown in the reporterIons field of the script output pane in the browser, for example:
numerator: 126 denominator: 127C reporterIons: ÿþ1
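The characters ÿþ are how the two bytes FF FE, the UTF-16 little-endian byte-order mark, appear when read as single-byte text. A quick way to check a file before running the script might look like the following Python sketch; the function name `looks_like_unicode_text` is hypothetical and is not part of Spectrum Mill.

```python
import codecs


def looks_like_unicode_text(path):
    """Return True if the file starts with a UTF-16 byte-order mark.

    Excel's "Unicode Text" option writes UTF-16, which begins with such a mark;
    "Text (Tab delimited)" does not.
    """
    with open(path, "rb") as handle:
        head = handle.read(2)
    return head in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
```

If this returns True for reporter_sample_template.txt, re-save the file as Text (Tab delimited).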
|MD-8214C long survival, no chemo||MD-8226C long survival, unknown chemo||NP-8932N||PC-8592T||PI-8592N2||PI-8762N||PD-8832T long survival, palliative chemo||PD-8715 short survival, adj chemo||WD-8996C long survival, adj chemo||CC-all18||MD-7800T, long survival, unknown chemo||MD-1044C short survival, adj chemo||NP-8926N0308||PC-1254T||PI-8926C||PD-8216C long survival, adj chemo||PD-8216T short survival, adj chemo||WD-8721C long survival, adj chemo||WD-6563T long survival, unknown chemo||CC-all18|
If the control ion was designated as MeanMulti when generating the SM report,
the mean intensity of multiple reporter ions is used as the denominator in ratios for each of the TMT10 mass labels.
Consequently, there will be 1 additional ratio for each directory in the report.
In the corresponding reporter_sample_template, indicate which ions were used as the denominator by joining the ions with a dot character (and keep them in alphabetic order, to match the column headers in the .ssv file).
The dot delimiter prevents the parser from expecting this channel to also be used as a numerator.
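To make the template rules concrete, here is a minimal Python sketch of how such a file could be read and checked. It is not the parseSMreport implementation: the function name, the returned dictionary keys, and the regular expression for reporter ion labels are all assumptions made for illustration.

```python
import re

# Assumed pattern for reporter ion labels such as 126, 127N, 131C.
ION_PATTERN = re.compile(r"1[23]\d[NC]?")


def parse_template(text):
    """Parse the three tab-delimited lines: reporter ion, sample name, directory."""
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) < 3:
        raise ValueError("expected 3 lines: reporter ion, sample name, directory")
    ion_cells, names, directories = (line.split("\t") for line in lines[:3])
    if not len(ion_cells) == len(names) == len(directories):
        raise ValueError("all three lines must have the same number of cells")
    columns = []
    for ion_cell, name, directory in zip(ion_cells, names, directories):
        # The colon is reserved for numerator:denominator in the parsed report.
        if any(":" in cell for cell in (ion_cell, name, directory)):
            raise ValueError("colon characters are not allowed in any cell")
        # Only the ion substring matters; a trailing "A" or "B" is ignored.
        ions = ION_PATTERN.findall(ion_cell)
        columns.append({
            "ions": ions,
            "sample": name.strip(),
            "directory": directory.strip(),
            # Dot-joined ions mark a multi-ion denominator, never a numerator.
            "denominator_only": "." in ion_cell and len(ions) > 1,
        })
    return columns
```

Columns are returned in the order they appear in the file, mirroring the statement that the parsed report keeps the template's column order.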
Create a sample-annotation.csv file (vertical layout)
The requirements described below are, by design, harmonized between tools developed in the Broad Institute Proteomics Platform group including: Spectrum Mill
and the downstream tools Protigy and Panoply.
In order for the parseSMreport script to correlate each reporter ion label with a sample name and metadata to produce a
GCT v1.3 format output file you must provide a comma-delimited .csv
file in the SM directory that contains the report that is parsed. While the filename may be prefixed at the user's discretion,
the file name must end with the suffix sample-annotation.csv
and be organized like the example below.
The following considerations apply:
- The only required columns are Sample.ID, Experiment, and Channel.
- Though not required for SM, the column: Type (Tumor, NAT, etc) is required downstream for Panoply use.
- Note that the sample-annotation file must not contain a row for the sample(s) used as a control ion (denominator in ratios). However, users are encouraged to include
the isobaric label name and control ion channel in the filename of the sample-annotation file.
- The headers of metadata columns must all be R-compatible: no spaces, no special characters such as parentheses, and they must not begin with a number.
- Replicates: Sample.IDs must be unique and cannot have duplicates. If replicate samples are present, they must have unique Sample.IDs,
but can be identified as replicates by using an additional column (named "Participant" for downstream Panoply use) with identical ids.
- The column Experiment (integer values) must be numbered in the exact same order as the data directories appear in the report (or Sample.ID and metadata will be mis-applied),
which is the same order as they were selected when creating the report in P/P Summary. Note that the SM data directory need not be, but could be, a column in the sample-annotation file.
- The column Channel should include the reporter ion label used in SM reports. Please note the following nuances: for TMT6 and TMT10 SM reports use 126, 131, while for TMT11, TMT16, and TMT18 SM reports use 126C, 131N.
The use of N and C on every channel was implemented for clarity, and I regret not implementing the convention for the earliest reagents, in deference to the manufacturer's nomenclature,
though I considered it at the time. Now, to ensure backwards compatibility, the difference is what it is.
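Several of the considerations above can be checked mechanically before running the parser. The following Python sketch is only an illustration: `validate_annotation` is a hypothetical helper, and the header pattern is a simplified approximation of R's rules for syntactic names.

```python
import csv
import io
import re

REQUIRED_COLUMNS = ("Sample.ID", "Experiment", "Channel")
# Simplified approximation of an R-compatible name: starts with a letter
# (or a dot not followed by a digit), then letters, digits, dots, underscores.
R_NAME = re.compile(r"^(?:[A-Za-z]|\.(?![0-9]))[A-Za-z0-9._]*$")


def validate_annotation(csv_text):
    """Return a list of human-readable problems found in the annotation text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = reader.fieldnames or []
    problems = []
    for column in REQUIRED_COLUMNS:
        if column not in headers:
            problems.append(f"missing required column: {column}")
    for header in headers:
        if not R_NAME.match(header):
            problems.append(f"header not R-compatible: {header!r}")
    if "Sample.ID" in headers:
        seen = set()
        for row in reader:
            sample_id = row["Sample.ID"]
            if sample_id in seen:
                problems.append(f"duplicate Sample.ID: {sample_id}")
            seen.add(sample_id)
    return problems
```

An empty list means none of these particular checks failed; it does not guarantee that the Experiment numbering matches the directory order in the report.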
Sorting of the sample-annotation.csv file:
Sort as you wish. It has been expected that when some users prepare a sample-annotation file they might frequently end up with rows sorted
in a different order than the data columns in the SM report to which they will be mated, for reasons like sorted by Sample.IDs, or the Channel column sorted alphabetically instead of by mass.
So to reduce excess "quick questions" back to the developer, the script has been written to tolerate differences in sort. However, keep in mind that the output .GCT file will maintain
the order of the columns in the input .ssv file. It is expected that the metadata columns in the sample-annotation file which become rows in the .GCT file will enable
facile re-sorting of the data columns during downstream analyses.
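One way to tolerate an arbitrary row order is to look up each annotation row by its Experiment and Channel rather than by position. The Python sketch below illustrates that idea only; it is not the script's actual logic, and both the function name and the shape of `report_columns` are assumptions.

```python
def annotations_in_report_order(report_columns, annotation_rows):
    """Return annotation rows reordered to follow the report's data columns.

    report_columns: (experiment, channel) pairs in .ssv column order
                    (a hypothetical representation of the report header).
    annotation_rows: dicts with at least Sample.ID, Experiment and Channel.
    Raises KeyError if a report column has no matching annotation row.
    """
    by_key = {(int(row["Experiment"]), row["Channel"]): row for row in annotation_rows}
    return [by_key[(experiment, channel)] for experiment, channel in report_columns]
```

Because the lookup is keyed on values, sorting the .csv by Sample.ID or alphabetically by Channel gives the same result.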
Example sample-annotation.csv file
A hypothetical TMT6 experimental design contemplated at the ice cream social held across the river from Fenway Park in Cambridge, MA at the Broad Institute in late October of 2004.
3 TMT6 plexes with each of 9 samples run in duplicate for a P/P Summary report generated with MedianMulti as the control ion. Thus
ratios were calculated using a denominator composed of the median of all 6 channels.
|Manny.1 ||Ramirez ||1 ||126||MVP 2004|
|Manny.2 ||Ramirez ||1 ||127||MVP 2004|
|David.1 ||Ortiz ||1 ||128||MVP 2013|
|David.2 ||Ortiz ||1 ||129||MVP 2013|
|Dave.1 ||Roberts ||1 ||130||Slide|
|Dave.2 ||Roberts ||1 ||131||Slide|
|Johnny.1 ||Damon ||2 ||126||Yankee traitor|
|Johnny.2 ||Damon ||2 ||127||Yankee traitor|
|Pedro.1 ||Martinez ||2 ||128||Hall of Fame 2015|
|Pedro.2 ||Martinez ||2 ||129||Hall of Fame 2015|
|Jason.1 ||Varitek ||2 ||130||Captain|
|Jason.2 ||Varitek ||2 ||131||Captain|
|Kevin.1 ||Millar ||3 ||126||Cowboy Up|
|Kevin.2 ||Millar ||3 ||127||Cowboy Up|
|Curt.1 ||Schilling ||3 ||128||Bloody Sock|
|Curt.2 ||Schilling ||3 ||129||Bloody Sock|
|Terry.1 ||Francona ||3 ||130||Rookie Skipper|
|Terry.2 ||Francona ||3 ||131||Rookie Skipper|
How to read/write a .GCT v1.3 file using R or Python
GCT - (Gene Cluster Text) v1.3 is a tab-delimited text file format
that is convenient for analysis of matrix-compatible datasets as it allows metadata about an experiment to be stored alongside
the data from the experiment.
Now perhaps you are thinking:
"Hmmm, this .GCT format is kind of cool in that the data and metadata are combined, but how am I going to read it in to my script so I can do my own analyses?"
Probably just install the GCT package, right? Ummm, well kinda sorta, but not really, at least not at the moment anyway. However, currently there are the following options:
Install cmapR and its 37 dependency packages, directly or through Bioconductor,
and load it in your R scripts.
Grab just the necessary cmapR package from the Spectrum Mill server that produced your .GCT files or the cmapR Github repository.
The cmapR installation used in SM was obtained from the cmapR Github repository by "cloning the repo". From the Github page, click the green Code button, then press Download zip.
From cmapR-master.zip, unzip the contents to SpectrumMill/millr/cmapR. This involved renaming the top-level directory cmapR-master to cmapR. Within SM the package is then used in two R scripts:
to use the cmapR functions: write_gct(), parse_gctx(), and subset_gct()
Full cmapR documentation is available at rdrr.io/github/cmap/cmapR/man/
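For Python, a small file can also be read with the standard library alone. The sketch below assumes the GCT v1.3 layout as commonly described: a `#1.3` version line, a line with four counts (data rows, data columns, row metadata fields, column metadata fields), a header line, then the column metadata rows, then the data rows. The function name `read_gct13` and the returned dictionary structure are hypothetical, so verify against your own files before relying on it.

```python
def _to_number(cell):
    """Convert a cell to float, or None when it is not numeric (e.g. NA)."""
    try:
        return float(cell)
    except ValueError:
        return None


def read_gct13(text):
    """Parse GCT v1.3 text into column metadata, row metadata, and data."""
    lines = text.splitlines()
    if lines[0].strip() != "#1.3":
        raise ValueError("not a GCT v1.3 file")
    n_rows, n_cols, n_rmeta, n_cmeta = (int(x) for x in lines[1].split("\t")[:4])
    header = lines[2].split("\t")
    row_meta_names = header[1:1 + n_rmeta]
    col_ids = header[1 + n_rmeta:1 + n_rmeta + n_cols]

    # Column metadata: one line per field, values aligned with the data columns.
    col_meta = {}
    for line in lines[3:3 + n_cmeta]:
        cells = line.split("\t")
        col_meta[cells[0]] = dict(zip(col_ids, cells[1 + n_rmeta:]))

    # Data rows: row id, row metadata values, then one value per data column.
    row_meta, data = {}, {}
    for line in lines[3 + n_cmeta:3 + n_cmeta + n_rows]:
        cells = line.split("\t")
        row_id = cells[0]
        row_meta[row_id] = dict(zip(row_meta_names, cells[1:1 + n_rmeta]))
        values = (_to_number(cell) for cell in cells[1 + n_rmeta:])
        data[row_id] = dict(zip(col_ids, values))
    return {"col_ids": col_ids, "col_meta": col_meta, "row_meta": row_meta, "data": data}
```

With the result in hand, `result["col_meta"]["Experiment"]` would map each sample column to its plex, which is the kind of lookup that makes re-sorting data columns straightforward.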
Plot histograms of iTRAQ/TMT ratio distributions in a dataset at the protein or VM-site level
Normalize distributions of reporter ion ratios.
Running individual scripts from the command line
If you wish to run any of the individual R scripts from an MS-DOS command prompt do the following:
- Install R on your local machine, download from: https://www.r-project.org/
- Add Rscript.exe to the Windows system path on your local machine.
Startup menu – control panels/system/advanced/environment variables/system variables/path
- Open an MS-DOS Command Window. (From the Windows Start menu, select All programs then
Accessories then Command Prompt)
- Change to the volume where the millR directory on your Spectrum Mill server is mounted.
- If you have administrator privileges on the SM server and you have mounted Y: as E$(\\dunlop)
Type 'Y:' then 'cd spectrumMill\millr' and the command prompt will read
- If you do not have administrator privileges, but the millR directory has been shared, then map a network drive:
Type 'Y:' and the command prompt will read
Type 'dir' to verify that *.r files are present
- Type lines like the following which run each script along with required parameters
- Rscript.exe parseSMreport.r proteinProteinCentricColumnsExport.1.ssv proteome TMT10 ..\msdataSM\Karl\test\
- Rscript.exe plotRatioDistributions.r proteinProteinCentricColumnsExport.1-ratio.txt 3 ..\msdataSM\Karl\test\
- Rscript.exe normalizeReporterRatios.r proteinProteinCentricColumnsExport.1-ratio.txt median ..\msdataSM\Karl\test\
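If you would rather drive these command lines from another program, the same argument order can be assembled programmatically. This Python sketch only mirrors the positional order seen in the examples above (script, report file, script-specific options, data directory); the helper names are hypothetical, and the meaning of each option is defined by the individual R script, not by this code.

```python
import subprocess


def build_command(script, report_file, options, data_directory, rscript="Rscript.exe"):
    """Assemble the argument list in the same order as the example command lines."""
    return [rscript, script, report_file, *options, data_directory]


def run_script(script, report_file, options, data_directory, working_directory=None):
    """Run one R script and raise CalledProcessError if it exits with an error."""
    command = build_command(script, report_file, options, data_directory)
    return subprocess.run(command, cwd=working_directory, check=True)
```

For instance, `build_command("parseSMreport.r", "proteinProteinCentricColumnsExport.1.ssv", ["proteome", "TMT10"], "..\\msdataSM\\Karl\\test\\")` reproduces the first example line as a list of arguments.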
Customizing individual scripts
The individual scripts are installed on each Spectrum Mill server in the directory: