Spectrum Mill Basics

Mass spectrometry has become a core technology for proteomics research, but without modern tools, there are often bottlenecks in data interpretation and review. The Agilent Spectrum Mill MS Proteomics Workbench is a comprehensive suite of software tools designed to facilitate high-throughput proteomics experiments using mass spectrometry. Key features of the Spectrum Mill include:

Intelligent spectral extraction

The Spectrum Mill data extractors preprocess data to extract high-quality spectra for database searches. Data extractors identify and exclude noise spectra and poor quality spectra, to increase the speed of database searches and to reduce the number of false positives.

The data extractors for raw data files preprocess MS/MS spectra from Agilent and Thermo Fisher Scientific instruments. MS-only spectra can be searched using peak list files or by pasting a mass list into the Manual PMF form. These extractors produce files that contain mass - intensity lists suitable for use with Spectrum Mill search programs.

An optional Spectrum Mill Data Extractor for Generic Peak List Files enables use of the Spectrum Mill with peak list files, such as those exported from Micromass Q-Tof using the ProteinLynx package. This extractor handles individual *.pkl and *.dta spectral files, or appended *.pkl files that contain multiple spectra. It also processes *.mgf files. The Spectrum Mill Generic Data Extractor prepares the peak list files for further Spectrum Mill processing.

Multiple search options

The Spectrum Mill provides multiple options for protein identification and characterization. You can search MS/MS spectra using MS/MS Search, or MS-only spectra using Manual Peptide Mass Fingerprinting (PMF) Search. Both searches include optimized scoring schemes that speed downstream data review.

MS/MS Search automates the search of large volumes of processed MS/MS spectra against protein databases. The MS/MS Search algorithm uses intelligent parallelization to provide extremely fast searches. It can operate in identity mode to find unmodified peptides or in variable modifications or homology modes to look for mutations, post-translational modifications, and chemical modifications.

Manual PMF Search performs searches of spectral peak lists that you enter into the Manual PMF Search form.

Automatic and manual match validation for MS/MS Search results

The Spectrum Mill offers both automatic and manual match validation of MS/MS Search results. Autovalidation quickly segregates those spectra that have matched well in the database search. Manual validation (in Protein/Peptide Summary) provides tools for fast, easy interactive data review and validation.

The Spectrum Mill segregates validated and unvalidated matches, and keeps a cumulative history of validated results. Spectra from remaining unvalidated matches can be re-searched using alternate parameters or databases. Each iterative search involves fewer and fewer spectra, making the searches even faster.

Fast, comprehensive result summaries

The Protein/Peptide Summary capability within the Spectrum Mill workbench allows you to summarize and correlate search results for MS/MS data. Protein/Peptide Summary includes tools to review entire directories of search results, and summaries can range from single samples to complex studies. The wide choice of summary modes makes the results accessible to biologists and biochemists, as well as mass spectrometrists.

Protein/Peptide Summary provides both qualitative and quantitative information. Qualitative results (validated search matches) are accompanied by either approximate quantitation (based on mean peak intensities of component peptides) or quantitation based on stable isotope or similar studies.

Advanced de novo spectral interpretation

For proteins not identified by database searching, the Spectrum Mill workbench also offers advanced de novo sequencing based on the Sherenga algorithm. The algorithm uses graph theory to generate a list of potential peptide sequences and to discard unrealistic solutions.

Workflow automation

The Spectrum Mill allows you to automate a typical data analysis workflow for MS/MS data files from protein digests:

Spectral extraction
MS/MS Search
Autovalidation
Quality Metrics
Protein/Peptide Summary
Archive Data

File system

Before running MS/MS Search or PMF Search with the Spectrum Mill workbench, the spectral files must be placed in the appropriate directory underneath the web root on the server running the Spectrum Mill workbench. Because communication between the instrument-control computer and the mass spectrometer is demanding during spectral acquisition, the Spectrum Mill server is expected to be a separate computer from the one that controls the instrument, with file transfer occurring over a network.

Location of Spectral Files

After you configure your file system with data root directories, you can create directories to place spectra as shown below:

Directory structure

msdataSM
- blankDirectory
  - cpick_in
  - fit_batch_in
- mySampleDirectory
  - myAgilentDirectory.d (place Agilent *.d files at this level)
  - myLCQfile.raw or myLTQfile.raw (place Thermo Fisher Scientific *.raw files at this level)
  - myQTOFmultiFile.pkl (place Micromass appended *.pkl files, i.e. each file contains multiple spectra, at this level)
  - cpick_in
    - spectrum1.dta (Place *.dta files exported from Micromass Q-Tof instruments at this level)
    - spectrum1.0047.8.2.pkl (Place *.pkl files representing individual spectra exported from Micromass Q-Tof instruments at this level)
  - fit_batch_in
    - spectrum1.2mi (Place files exported from Applied Biosystems MALDI instruments at this level)

Note that you may have up to ten directory levels between msdataSM and mySampleDirectory. However, we recommend shorter paths to reduce memory usage, especially for large data sets.

How Spectrum Mill locates data files

The Spectrum Mill recognizes the bottom of the directory hierarchy (the location of data files) when it finds one of the following:

A file with a recognized raw data file suffix
A *.pkl file
One of the following folders: cpick_in, fit_batch_in (containing peak list file)

To ensure that the Spectrum Mill finds all your data files:

Do not copy a processed data folder into a higher level folder.
Keep your data files in subfolders that are at equivalent levels in the Spectrum Mill file system. Remember that the Spectrum Mill workbench can find only the highest level of data files in a given subfolder. For example, given these two data files,
- E:\SpectrumMill\msdataSM\study1\mydir\datafile1.d
- E:\SpectrumMill\msdataSM\study1\datafile2.d
The Spectrum Mill will recognize datafile2.d, but not datafile1.d.
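
The lookup rule above can be illustrated with a short Python sketch. This is not Spectrum Mill code: the function name find_data_levels, the suffix list, and the choice to stop descending once a data level is found are assumptions made only to mirror the behavior described in this section.

```python
import os
import tempfile

# Names taken from the text above. Treating ".d" and ".raw" as the recognized
# raw data suffixes is an assumption of this sketch, not the program's real list.
RAW_SUFFIXES = (".d", ".raw", ".pkl")
PEAK_LIST_FOLDERS = ("cpick_in", "fit_batch_in")


def find_data_levels(root):
    """Return the directories under root that would count as the bottom of the hierarchy."""
    levels = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Agilent *.d data are directories, so check folder names as well as files.
        has_data = any(n.lower().endswith(RAW_SUFFIXES) for n in dirnames + filenames)
        # A cpick_in or fit_batch_in folder counts only if it contains something.
        has_peak_folder = any(
            n in PEAK_LIST_FOLDERS and os.listdir(os.path.join(dirpath, n))
            for n in dirnames
        )
        if has_data or has_peak_folder:
            levels.append(dirpath)
            dirnames[:] = []  # stop descending: deeper data files are not found
    return levels


# Recreate the two-file example from the text in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "study1", "mydir", "datafile1.d"))
    os.makedirs(os.path.join(root, "study1", "datafile2.d"))
    found = [os.path.relpath(p, root) for p in find_data_levels(root)]

print(found)
```

In this sketch only study1 is reported as a data level, so datafile1.d inside study1\mydir is never reached, which matches the example above.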

Naming of files and folders

Do not use spaces or parentheses in folder or file names. The following characters are also not permitted: | , ; % < > ? . +
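
A simple pre-flight check for these naming rules might look like the following Python sketch. The function name invalid_characters is hypothetical, and the sketch assumes that the period introducing the final extension (as in datafile1.d) is acceptable, so only the part of the name before that extension is checked.

```python
import os

# Characters listed in the text: space, parentheses, and | , ; % < > ? . +
FORBIDDEN = set(" ()|,;%<>?.+")


def invalid_characters(name):
    """Return the sorted forbidden characters found in a file or folder name.

    Assumption: the period that introduces the final extension is allowed,
    so only the part of the name before it is checked.
    """
    stem, _extension = os.path.splitext(name)
    return sorted(set(stem) & FORBIDDEN)


print(invalid_characters("datafile1.d"))        # no forbidden characters
print(invalid_characters("my sample (1).raw"))  # space and parentheses are reported
```

Running such a check on the instrument computer before transfer avoids renaming folders after they are already inside the Spectrum Mill file system.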

Overview for MS/MS Interactive Processing

In an automated LC-MS/MS experiment, one can separate peptides by reversed-phase HPLC and acquire an MS/MS spectrum approximately every second on whatever happens to be eluting from the column at that particular instant. Hence in about a half hour, one can be awash in about 1000 spectra. The Spectrum Mill provides tools to extract information from that morass of data in a manner that attempts to minimize data-overload frustration. The figure below illustrates the overall process. Note that failure to perform any of the steps properly is likely to diminish the usefulness of the final output.

Experiment Scheme

Getting Started for Agilent Q-TOF and Other MS/MS Data

Acquire some mass spectra.
Export spectral files.
- For Agilent Q-TOF or ion trap data, transfer *.d files to the Spectrum Mill computer in a data directory within the Spectrum Mill file system.
- For Thermo Fisher Scientific LCQ or LTQ data, transfer *.raw files to the Spectrum Mill computer in a data directory within the Spectrum Mill file system.
- For Micromass Q-Tof data, use the MassLynx data system to export *.pkl files, then transfer the files to the Spectrum Mill computer: into a cpick_in data directory within the Spectrum Mill file system if they are individual *.pkl files (one per spectrum), or up one directory level if they are appended *.pkl files.
From the Spectrum Mill homepage, go to the Data Extractor page. Preprocess the spectral files. The Data Extractor program recognizes the data type and automatically uses the correct extractor:
- Agilent and Thermo Fisher Scientific data use the raw file Data Extractor
- Micromass Q-Tof data and *.mgf files use the generic Data Extractor
From the Spectrum Mill homepage, go to the MS/MS Search page.
Set the appropriate MS/MS Search parameters and run the searches.
Validate results in the Autovalidation page or manually in the Protein/Peptide Summary page.
Review the data from the Protein/Peptide Summary page.

For more details on the MS/MS Search page, see the MS/MS Search Help.

Spectral Preprocessing for MS/MS Data

Data Extractor

The Spectrum Mill Data Extractor preprocesses raw data files from Agilent and Thermo Fisher Scientific instruments, to extract high-quality spectra for database searches. The Data Extractor automatically detects which type of raw file (specific instrument vendor or generic format) you have submitted and then invokes the appropriate extraction program (provided that it has been purchased and installed on your server). The MS/MS raw file data extractors extract and merge nearby MS/MS spectra from the same precursor ion. They optionally apply MS/MS similarity criteria prior to merging scans, to avoid merging closely eluting or co-eluting isobaric peptides. For Agilent *.d ion trap and Thermo Fisher Scientific *.raw ion trap data, the extractors optionally merge MS² and MS³ scans from the same precursor. The extractors assign precursor charges where possible, centroid the MS/MS spectra, calculate spectral features, filter MS/MS spectra by quality, extract reporter ion intensities (iTRAQ and TMT), and calculate extracted ion chromatograms (EICs) for the intervening MS precursor scans. The intensities are later used for quantitation by subsequent Spectrum Mill programs.

Note: As of Spectrum Mill B.05.00, XtractorFinnigan uses the Thermo (Xcalibur or MSFileReader) code rather than Spectrum Mill code to do centroiding. Xcalibur or MSFileReader centroiding does a better job of using appropriately narrow windows across the entire mass range (particularly important for the barely resolved TMT-10 peaks). It also requires half the extraction time. Because the intensities are scaled differently (10-100-fold), you should not mix Spectrum Mill centroiding and Xcalibur centroiding across multiple directories that will later be used for a combined report.

The functionality has been split into multiple programs:

XtractorAgilent invoked for Agilent Q-TOF *.d data directories
XtractorAgilentTrap invoked for Agilent ion trap *.d data directories that contain a .yep file
XtractorFinnigan invoked for *.raw files
In previous versions it required the Active-X component that is present with the Thermo Fisher Scientific Xcalibur data system.

extractorGeneric invoked for generic *.pkl, *.mic, *.dta, *.mgf files

For specifics on third-party software requirements, see the Installation Guide you received with your software. In general, Agilent Q-TOF and Agilent Trap (including ETD) do not require installation of offline software. Thermo data (*.raw) requires the offline software be installed on the server, and the version must be equal to or later than the version that was used to acquire the data.

Output from Data Extractor consists of three types of files.

mzXML files containing all quality-filtered, centroided individual MS/MS spectra for an LC-MS/MS run, for Agilent Q-TOF .d and Thermo Fisher Scientific .raw data (Spectrum Mill B.04.01 and later). With Spectrum Mill B.06, the Generic Extractor extracts *.pkl files to mzXML as well. Spectra from other instruments are extracted to individual *.pkl files.
A summary file: SpecFeatures.1.tsv, containing spectral characteristics such as Max. Sequence Tag length, MS/MS reporter ion intensities, precursor ion intensity, retention time, and chromatographic peak width from the MS/MS scans that are used in the MS/MS Search, Quality Metrics, Sherenga de novo Sequencing, Protein/Peptide Summary, and Spectrum Summary scripts.
Log files that describe reasons for rejecting particular MS/MS spectra and the means by which the precursor charge was determined.

If your input into the Spectrum Mill consists of peak list files (for example, from Micromass Q-Tof), see also Data Extractor for Generic (Peak List) Files.

Spectral Extraction

Merge scans with same precursor m/z - Using the user-designated time window and precursor m/z tolerance, duplicate MS/MS scans are merged into a single spectrum.

For scans with the same precursor m/z, the MS/MS scans are compared to ensure that they correspond to the same peptide. You can adjust settings in instrument.txt to control the comparison and merging.
For Agilent *.d ion trap and Thermo Fisher Scientific *.raw ion trap data, the extractors optionally merge MS² and MS³ scans from the same precursor.
For Agilent data files that contain spectra that alternate between CID and ETD, the software merges the ETD spectra the same way as the corresponding CID spectra. For example, if the CID spectra were merged from scan 2 thru 12, then the ETD spectra are merged from scan 3 thru 13. In no case are ETD spectra merged with CID spectra.
For the Thermo Fisher Scientific LTQ Orbitrap and LTQ FT, the software ignores the m/z tolerance for merging stated on the form and uses the instrument tolerance instead. This is also now the case for Agilent Q-TOF data.

Peak Merging - When spectra are merged, many of the corresponding peaks in the spectra will not have identical masses. Peaks that fall within a tolerance of +/- 0.25 Da (for ion trap data) are therefore merged by summing their intensities and retaining the mass of the most intense peak; no re-centroiding is attempted. This attempts to correct artifacts resulting from prior centroiding of the individual spectra.
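
The two merging steps described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the extractor's implementation: the function names are hypothetical, the time window is measured from the most recent scan in a group, peaks are grouped greedily in order of decreasing intensity, and the MS/MS similarity comparison controlled by instrument.txt is omitted.

```python
def group_scans(scans, mz_tol, time_window_s):
    """Group (retention_time_s, precursor_mz) scans that look like duplicates.

    Assumption: a scan joins a group when it is within mz_tol of, and no more
    than time_window_s after, the most recent scan already in that group.
    """
    groups = []
    for rt, mz in sorted(scans):
        for group in groups:
            last_rt, last_mz = group[-1]
            if abs(mz - last_mz) <= mz_tol and rt - last_rt <= time_window_s:
                group.append((rt, mz))
                break
        else:
            groups.append([(rt, mz)])
    return groups


def merge_peaks(spectra, tol=0.25):
    """Merge (mz, intensity) peaks from several spectra into one peak list.

    Peaks within +/- tol of a more intense peak are summed into it, and the
    m/z of the most intense peak is kept (no re-centroiding).
    """
    pooled = sorted((p for s in spectra for p in s), key=lambda p: p[1], reverse=True)
    merged = []  # each entry is [mz_of_most_intense_peak, summed_intensity]
    for mz, intensity in pooled:
        for entry in merged:
            if abs(entry[0] - mz) <= tol:
                entry[1] += intensity
                break
        else:
            merged.append([mz, intensity])
    return sorted((mz, intensity) for mz, intensity in merged)


scans = [(10.0, 500.2), (12.0, 500.3), (13.0, 650.7), (40.0, 500.25)]
groups = group_scans(scans, mz_tol=0.5, time_window_s=15.0)
print(len(groups))  # the scan at 40.0 s is too late to join the first group

merged = merge_peaks([
    [(500.10, 100.0), (600.00, 50.0)],
    [(500.25, 40.0), (600.40, 10.0)],
])
print(merged)
```

In the example, the peaks at 500.10 and 500.25 are combined at m/z 500.10 with summed intensity, while 600.00 and 600.40 stay separate because they differ by more than 0.25 Da.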

Peak Detection

The Data Extractor performs the peak detection steps described below prior to precursor charge assignment, spectral quality filtering, and spectral feature calculation. However, the peak detection does not persist. The extracted files (*.mzXML) retain all centroided peaks, and peak detection is repeated when necessary in MS/MS Search, Spectrum Matcher, and Sherenga de novo Sequencing. Thus, the MS/MS spectrum viewer can visualize interpretation results on the full spectrum, rather than just the processed peak list.

Signal/Noise Calculation - a noise level is calculated across an entire spectrum. To minimize the extent to which signal contributes to the determination of the noise level, the mean noise level is calculated as follows. Start by considering all peaks below a default noise threshold: 3% of the base peak for MS scans, or 3% of the third largest peak for MS/MS spectra (which allows for spectra dominated by a single fragment ion and its major isotope). If these peaks represent > 90% of the peaks (MS scans) or > 65% of the peaks (MS/MS scans), calculate the noise mean and standard deviation; if not, double the noise threshold and try again. The signal/noise calculation then becomes a standard RMS (root mean square) calculation, where the actual threshold in a particular spectrum is determined by multiplying the user-supplied signal/noise ratio by the standard deviation of the noise and offsetting from zero by the noise mean.
Strip Isotopes by Looking Left - uses a "look-left" (towards lower m/z) approach to merge the intensities of peaks in an isotope cluster into the left-most member of the cluster. A cluster is defined as a set of peaks where the peak immediately to the left of another peak is at least 0.5 times the height of the peak to its right. (0.5 is a hard-coded constant representing minimum relative isotopic intensity.) For high-resolution data such as that from time-of-flight instruments, the charge of the fragment ion is assigned from the isotope spacing as the isotope cluster is merged. Note that Strip Isotopes by Looking Left is used only to calculate spectral features; the isotope peaks remain in the extracted spectral file (the *.pkl or *.mzXML file).
Strip Precursor Minus Neutrals - for MS/MS spectra, peaks are removed in a window below the precursor m/z value of width (20 Da / precursor charge) with an additional allowance of 2.5 m/z above the precursor m/z for precursor isotopes. Peaks are also removed in a 1.5 m/z window about the m/z value of (precursor m/z - 2H₂O / precursor charge) as well as all peaks above the mass of precursor MH⁺ - CO₂. For ETD spectra, this function also removes corresponding peaks for the charge-reduced forms of the precursor ion.
Filter By Max Num Intense Peaks (Max. # Peaks Retained) - retains no more than the specified number of peaks having the greatest intensity remaining after the above steps.
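
The iterative noise estimate in the Signal/Noise Calculation step can be sketched in Python as follows. The function name estimate_noise is hypothetical, and using the population standard deviation is an assumption; the 3% starting threshold, the 90% and 65% fractions, and the doubling rule come from the description above.

```python
import statistics


def estimate_noise(intensities, is_msms=True):
    """Return (noise_mean, noise_sd) using the threshold-doubling scheme."""
    ranked = sorted(intensities, reverse=True)
    # MS/MS spectra use the third largest peak; MS scans use the base peak.
    reference = ranked[2] if is_msms and len(ranked) >= 3 else ranked[0]
    if reference <= 0:
        raise ValueError("reference peak intensity must be positive")
    threshold = 0.03 * reference
    required_fraction = 0.65 if is_msms else 0.90
    while True:
        noise = [i for i in intensities if i < threshold]
        if len(noise) > required_fraction * len(intensities):
            # Assumption: population standard deviation of the noise peaks.
            return statistics.mean(noise), statistics.pstdev(noise)
        threshold *= 2.0  # too few peaks below the threshold: double it and retry


spectrum = [1000.0, 400.0, 200.0, 4.0, 5.0, 6.0, 5.0, 4.0, 6.0, 5.0]
noise_mean, noise_sd = estimate_noise(spectrum, is_msms=True)

# Detection threshold: user-supplied signal/noise ratio times the noise
# standard deviation, offset from zero by the noise mean.
signal_to_noise = 3.0  # example value only, not a documented default
cutoff = noise_mean + signal_to_noise * noise_sd
print(noise_mean, round(noise_sd, 4), round(cutoff, 4))
```

In this example the initial threshold (3% of 200) captures too few peaks, so it is doubled once before the seven low-intensity peaks are accepted as noise.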

Spectral Features

A variety of spectral characteristics are pre-calculated for possible later use in the MS/MS Search, Quality Metrics, Sherenga de novo Sequencing, Protein/Peptide Summary, and Spectrum Summary scripts. maxSequenceTagLength and totalIntensity are the most noteworthy. The extractors store the spectral features in the file SpecFeatures.1.tsv, under the variable names listed below. Only the more important features are listed here; the extractors calculate additional features, depending on the amino acid modifications, etc.

precursorAveragineChiSquared 1 - Chi2 measure of the precursor ion isotope cluster shape (combined from the two MS1 scans immediately before and after the MS2 scan) as compared to the theoretical isotope cluster shape of averagine. (0.85 to 1.0 is good.)
precursorIsolationPurityPercent - intensity of the precursor ion and its isotopes divided by the total peak intensity in the precursor isolation window used for the MS2 scan (combined from the two MS1 scans immediately before and after the MS2 scan). A value <50% indicates reporter quantitation was not used because of expected contamination by co-fragmented peptides.
precursorIsolationIntensity - denominator used in the precursorIsolationPurity metric
ratioReporterIonToPrecursor - sum of the reporter ion intensities (iTRAQ_114+iTRAQ_115+iTRAQ_116+iTRAQ_117) divided by precursorIsolationIntensity
chromatographicPeakWidthSec - width of precursor ion chromatographic peak, 0 indicates no more than one MS1 scan had a satisfactory precursor isotope cluster shape
reporter ions - the abundance of each reporter ion for isobaric modifications (iTRAQ, TMT)
retention time - the retention time for the MS1 precursor as determined by the EIC
peak width - the peak width for the MS1 precursor as determined by the EIC and averagine cluster match over the retention time
precursor ion purity - a measure of how "pure" the precursor isolation was when fragmented. Co-eluting isobaric peptides will result in a lower purity.
maxSequenceTagLength - a powerful spectral quality metric, calculated after peak detection and fragment charge assignment, that represents the length of the longest continuous string of amino acids that can be created by following a path from high mass to low mass that links peaks separated by the mass of an amino acid. For low-resolution MS/MS spectra with a precursor charge > 1, this path may be formed assuming the ions are either all singly charged or all doubly charged.
maxSequenceTag - the string of amino acids found above. Since this makes no allowance for fragment ion types this should NOT be viewed as a de novo interpretation.
totalIntensity - extracted ion chromatogram (EIC) of the precursor ion, used for peptide quantitation. The EIC is calculated as the sum of precursor m/z abundance in the MS scans (~ chromatographic peak area), and is dependent upon the user-designated scan tolerance (chromatographic time in seconds), the putative precursor m/z (as adjusted by user designation of Find precursor ¹²C), and the user-designated mass tolerance for merging scans with the same precursor m/z.
- For Agilent Q-TOF data, when the charge state is determined (which is the typical case with this high-resolution data), the calculation of totalIntensity is more accurate. The software sums the intensities of the monoisotopic peak with all other peaks in the isotopic cluster. For this calculation, it uses a +/-50 ppm window for each peak. With the least-squares curve-fitting used to determine the charge state for Agilent Q-TOF data, the masses of the isotopic peaks are well-defined, so the software is able to exclude interferences that occur within the m/z range of the isotopic cluster.
- For Thermo Fisher Scientific Orbitrap data with high resolution MS1 scans the extracted ion chromatogram (XIC) of each precursor ion is calculated in the intervening high-resolution MS1 scans of the LC-MS/MS runs using narrow windows around each individual member of the isotope cluster. Peak boundaries in both the time and m/z domains are dynamically determined based on MS scan resolution, precursor charge and m/z, subject to Chi2 quality metrics on the relative distribution of the peaks in the isotope cluster vs theoretical (averagine-based).
- For instruments that require the generic Data Extractor (because of lack of software access to MS scan peak tables) this value is instead the same as totalOriginalIntensity.
- For .pkl files from the Micromass Q-Tof, this value represents the intensity from the precursor m/z from the single MS scan preceding the MS/MS scan.
totalOriginalIntensity - total intensity of all peaks in the MS/MS spectrum before peak detection
noiseMean - the mean noise calculated as described in the Peak Detection section.
noiseStandardDeviation - the standard deviation of the noise calculated as described in the Peak Detection section.
parentSignalNoise - the precursor signal/noise ratio in the preceding MS scan calculated as described in the Peak Detection section.
numPeaks - number of peaks remaining after peak detection.
numOriginalPeaks - number of peaks before peak detection.
selected_parent_m_over_z - unadjusted precursor m/z designated at acquisition time.
parent_m_over_z - the final adjusted monoisotopic precursor m/z
parent_m_over_z_centroid - the adjusted average precursor m/z
parent_M_plus_H - MH⁺ calculated from parent_m_over_z and parent_charge
parent_charge - the assigned precursor charge.
numScansAfterParent - number of scans taken between the MS scan and the particular MS/MS scan.
maxIntensity - after peak detection, the intensity of the tallest peak in the MS/MS spectrum
totalOriginalIntensity - after peak detection, the total intensity of all peaks in the MS/MS spectrum.
MaxToTotalIntensityRatio - little used measure; maxIntensity/totalOriginalIntensity.
BYpairs - the number of b/y pairs as described in the Precursor Charge Assignment.
dissociation_method - the fragmentation mode, either collision-induced dissociation (CID) or electron transfer dissociation (ETD)
phosphoProductIonsScore (PPIS) - Phosphopeptides, primarily Ser/Thr phosphopeptides, typically exhibit a strong neutral loss of phosphate from the precursor ion during CID/HCD dissociation. This yields a characteristic ion of -98 m/z from the precursor ion. Presence of the ion can be used to flag an MS/MS spectrum and craft a subset of spectra as candidates for faster searching with Phospho –STY variable mods enabled.
PPIS = 100 * phospho neutral loss ion Intensity / base peak intensity

Note: The PPIS is calculated when the SM Data Extractor is run. The values are stored in the file SpecFeatures.1.tsv.
Anticipate a future rev where PPIS becomes more like GPIS, and includes the p-Tyr 216 immonium ion. Additional phospho spectral features are calculated and stored in the specFeatures file but have not yet been reduced to a filterable score. These include:
- numH3po4LossesZ1 - number of 98 Da spaced peak pairs
- h3po4LossesZ1fractionalIntensity - Σ -98 intensities / Σ +98 intensities
contaminantProductIonsScore (CPIS) - The name contaminant product ion score was intended in 2014 as a generic name that would evolve into a UI selection for various sets of ions. As of June 2021 the only one implemented continues to be the hardcoded Glycosylation signature set (see GPIS).
glycoProductIonsScore (GPIS) - Uses the 9-ion glycosylation-signature set: 126, 138, 144, 168, 186, 204, 274, 292, 366. Numerically, GPIS is a 2-part score. The integer portion is simply a count of the signature ions observed in the MS/MS spectrum. The portion to the right of the decimal point is an intensity ratio metric: most abundant signature ion peak intensity / base peak intensity. The base peak is determined after peak detection and removal of the residual precursor and its water loss. The maximum allowed value of the ratio is 0.99, which applies when the signature ion is the base peak. The metric is designed so that a single threshold value numerically enforces a dual criterion: 1) some, but not all, of the signature ions are present and 2) at least one of them is quite intense. Thus a fixed threshold value of this metric > 4.5 is used in the SM Quality Metrics module to calculate the metrics PSMs_Containing_Glyco_Product_Ions_num and All_spectra_Containing_Glyco_Product_Ions_num. That same > 4.5 threshold is the default for GPIS spectral quality filtering in MS/MS search and for GPIS spectral feature filtering in Spectrum Summary.
Note: The GPIS is calculated when the SM Data Extractor is run. The values are stored in the file SpecFeatures.1.tsv in the column contaminantProductIonsScore, for the historical reasons described above for CPIS. For greater clarity, when the values are later reported as GPIS in Protein/Peptide Summary and in Spectrum Summary, the column header used for Excel export is glycoProductIonsScore.
percentDissociatedIntensity - 100 * (total peak intensity in the MS/MS spectrum - intensity of residual precursor and its neutral losses of water and ammonia) / the total peak intensity in the MS/MS spectrum. For ETD and ETHCD spectra, charge-reduced precursor-related ions would also be subtracted.
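
The two-part structure of the glycoProductIonsScore described above can be illustrated with a minimal Python sketch. It is not the Data Extractor's code: the function name and the 0.5 m/z matching tolerance are assumptions, the signature masses are the nominal values quoted above, and the input peak list is assumed to have already been through peak detection and removal of the residual precursor and its water loss.

```python
# Nominal signature ion m/z values as listed in the GPIS description.
GLYCO_SIGNATURE_MZ = (126, 138, 144, 168, 186, 204, 274, 292, 366)


def glyco_product_ions_score(peaks, tol=0.5):
    """Compute a GPIS-style score from a list of (mz, intensity) peaks.

    Integer part: number of signature ions observed.
    Fractional part: most abundant signature ion / base peak, capped at 0.99.
    The matching tolerance tol is an assumption of this sketch.
    """
    base_peak = max(intensity for _, intensity in peaks)
    matched = []
    for signature in GLYCO_SIGNATURE_MZ:
        hits = [i for mz, i in peaks if abs(mz - signature) <= tol]
        if hits:
            matched.append(max(hits))
    if not matched:
        return 0.0
    ratio = min(max(matched) / base_peak, 0.99)
    return len(matched) + ratio


peaks = [
    (126.05, 10.0), (138.05, 50.0), (186.08, 20.0),
    (204.09, 200.0), (366.14, 80.0), (500.30, 400.0),
]
score = glyco_product_ions_score(peaks)
print(score, score > 4.5)

# When a signature ion is itself the base peak, the ratio is capped at 0.99.
capped = glyco_product_ions_score([(204.09, 100.0)])
print(capped)
```

With five signature ions present and the strongest at half the base peak intensity, the example spectrum scores 5.5 and would pass the > 4.5 filter; a spectrum with four signature ions can pass only if its strongest signature ion exceeds half the base peak.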

MS/MS Spectral Quality Filtering

Although the Data Extractor filters out very poor quality spectra, certain spectral features (see features described above) can be used to craft a smaller subset of high quality spectra to limit input to MS/MS search, Spectrum Matcher, and Spectrum Summary. The same filters control the Identifiability Metrics calculated by Quality Metrics.

Sequence tag length - longer tags are better. Lengths > 3 should be identifiable by database search, and poor MS/MS with a tag length < 1 are usually removed by the Data Extractor.
Precursor Ion Purity - 100% would be a perfect value; <50% indicates additional peptides likely contribute to the MS/MS spectrum.
Precursor isotope quality XIC's (Chi2 vs. Averagine) - a good shape is > 0.85; < 0.5 is poor and suggests misassigned monoisotope, low abundance, or non-peptidic.
Glyco Product Ions Score - a value > 4.5 is very likely to indicate the presence of a glycopeptide bearing a HexNAc. Use this filter with MS/MS search to restrict a search to only glycopeptide spectra. Also enable the OHexNAc (*-termS,*-termT) fixed modification, which triggers mass calculations for a modified precursor ion paired with unmodified product ions, typical of the prompt neutral loss of an OHexNAc moiety in CID and HCD spectra.

Multicore (Maximize CPUs) Data Extraction

Spectrum Mill B.05.00 now supports the ability to select Maximize CPUs when you extract data. Prior revisions only supported Maximize CPUs for MS/MS Search. Because data extraction can require much more memory than searches, Spectrum Mill implements a “memory governor” that prevents multiple extractions from running at the same time if available free memory becomes too low. When all physical memory is used, Windows will swap memory to disk, which significantly degrades performance. It is better to limit the number of parallel extractions than to have Windows go into swap file mode.

Configuring Service Request Manager Settings

The Spectrum Mill Service Request Manager (SRM) must be stopped for configuration changes to apply. See To Start and Stop the Spectrum Mill Workflow Manager Service for details. You must perform the following procedures from an elevated command window (cmd.exe, Run As Administrator).

The Spectrum flow configuration file (millsrm\smsrm.config) provides several parameters that configure how memory is governed:

<provider> section

maxConcurrentTasks	This attribute is set by default during installation to be one less than the number of (multicore) CPU cores detected.
minRequiredTaskMemoryGb	This attribute defaults to 2 Gb. If there is less than that amount available, no tasks that have been submitted to the workflow queue will be allowed to run. When currently running tasks complete, memory will be freed up and queued tasks will then run.

<provider> <supportedTasks> section

The <task> definitions for “xtractorAgilent” and “xtractorFinnigan” support multicore processing. These definitions have a “memFactor” attribute. Because it is not possible to predict how much memory an extraction will require, the memFactor is used to estimate it based on the data file size. For Agilent data, this factor defaults to 1.25 times the size of the file. This factor applies to both centroid and profile data. For Thermo .raw data, it is not possible for the request manager to determine whether the data is profile or centroid data. The memFactor of 2.7 assumes the data is centroid. If your lab typically generates only profile data, the memFactor for the xtractorFinnigan task should be set to 1.0 instead.
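
How these settings interact can be illustrated with a small Python sketch. This is not the Service Request Manager's actual logic: the function names are hypothetical, and the rule that an extraction starts only when its estimated memory fits in the currently free memory is an assumption made for illustration. The memFactor values and the 2 Gb minimum come from the text above.

```python
def estimated_memory_gb(file_size_gb, mem_factor):
    """Estimate the memory an extraction needs from the data file size."""
    return mem_factor * file_size_gb


def can_start_extraction(file_size_gb, mem_factor, free_memory_gb,
                         running_tasks, max_concurrent_tasks,
                         min_required_task_memory_gb=2.0):
    """Decide whether one more extraction may start (illustrative rule only)."""
    if running_tasks >= max_concurrent_tasks:
        return False  # maxConcurrentTasks already reached
    if free_memory_gb < min_required_task_memory_gb:
        return False  # below minRequiredTaskMemoryGb: queued tasks wait
    # Assumption: the estimate must fit within the free memory.
    return estimated_memory_gb(file_size_gb, mem_factor) <= free_memory_gb


# A 4 Gb Agilent file (memFactor 1.25) versus a 4 Gb Thermo .raw file
# (memFactor 2.7), with 8 Gb free and one of three task slots in use.
agilent_ok = can_start_extraction(4.0, 1.25, 8.0, running_tasks=1, max_concurrent_tasks=3)
thermo_ok = can_start_extraction(4.0, 2.7, 8.0, running_tasks=1, max_concurrent_tasks=3)
low_memory_ok = can_start_extraction(0.5, 1.25, 1.5, running_tasks=0, max_concurrent_tasks=3)
print(agilent_ok, thermo_ok, low_memory_ok)
```

Under these assumptions, a larger memFactor makes each extraction look more expensive, so fewer run in parallel, and a smaller memFactor has the opposite effect.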

When to Change the memFactor Settings

Use Windows Task Manager to monitor memory usage while multiple parallel extractions are running. You can also look at the Processes tab to see how many xtractorAgilent.cgi or xtractorFinnigan.cgi processes are running at once.

If you find that available memory falls to near 0, consider increasing the memFactor setting. This reduces the number of extractions that can run in parallel.

If you see few extractor processes running at the same time, yet there appears to be enough available memory (for example, 4 Gb or more), consider reducing the memFactor value. In general, Spectrum Mill should allow the number of CPUs minus 1 to run in parallel (if no other searches are running).

Note that reducing the MS/MS Search Batch Size setting can also reduce the amount of memory used in searches.

When to Select "Maximize CPUs"

Select Maximize CPUs in the Data Extractor when you are extracting a single data folder that contains multiple data files. However, if you are extracting multiple data folders (where the number selected is near or greater than the number of CPUs on the server), you will generally get better performance if you do not select Maximize CPUs for the Data Extractor, because the data folders will all be extracted in parallel.

To Use the Data Extractor Form (MS/MS)

The following topics describe options available on the Data Extractor form. In general, you should retain the default settings, except for the options highlighted in red text on the form. For more details, see Spectral Preprocessing for MS/MS Data. Note that the options change depending upon the vendor data type to be extracted.

Important note: If you wish to redo a data extraction, mark the check box for Remove all prior results.

Extraction

Extract - Click to place the task in the queue for execution. The program will execute the task to extract spectra from raw data files based on the time the command entered the queue, its capacity to process tasks in parallel, and dependencies. Click this button after you have either loaded a parameter file or manually set the parameters. The name of the current parameter file appears in red at the top of the form. Once you have saved a parameter file, you may start the extraction from a workflow rather than manually with the Extract button.
Save As - Click to save current data extraction settings in a parameter file.
Load - Click to load a parameter file that contains settings for data extraction. For default values, select a parameter file from the Defaults folder.
Remove all prior results - Mark this check box to remove prior extractions, searches and data summaries for this dataset.
Maximize CPUs - Mark this check box if you want this extraction to take advantage of all available CPUs (as opposed to using only a single CPU so that the other CPUs are available for other processes/users). If you mark this check box for a workflow, the request queue will show two requests -- the initial one to create the batch (of files) and the other to show the progress and extractor results. Mark this check box only if your data folder contains multiple data files, and if you have only selected a few data folders to extract.
Delete data files after extraction - If you are sure your extraction settings are good, AND you have archived your data elsewhere, mark this check box to remove the data files and save disk space. A placeholder file will be created so that you can continue with other processing. If you need to re-extract, you will need to copy the data files back to your Spectrum Mill server.
Generate spectral features file only - Mark this check box to generate the file SpecFeatures.#.tsv, without actually generating the extracted spectra. This option appears when you select a directory that contains peak list files but no raw data file. When you have *.dta files, or *.pkl files that represent individual spectra, you put your files in the cpick_in folder, and then you must mark this check box. (When you have appended *.pkl files, i.e. each file contains multiple spectra, then you put your file in the root sample directory and you do not mark the check box.)
Instrument: Select the instrument you used to collect the data. This option appears only when you select a peak list file rather than a raw data file.

Data Directories

Click the Select... button to select a data directory or data directories. See Selecting Data Directories.

Modifications

Click the Choose... button to select modifications appropriate for your sample. See Choosing Modifications.

MS/MS Spectral Feature Filtering

MH⁺ - Set the mass range of precursor ions. Spectra with precursor ions outside of this range are rejected.
Scan time range: Set the range of scan times you wish to extract from the raw data files. Use this to avoid processing regions of the chromatogram that are not of interest -- for example, the beginnings and ends of runs. Keep the default (1 to 300) to extract all scans.
Disable quality filtering (sequence tag length = -1, no merging, attempt to assign charge +1 only) - Mark this check box if you wish to compare results with those from other database search engines. CAUTION: Because this mode disables signal-to-noise and spectral quality filtering, some of the spectra you submit for the search will be poorer quality and you will generate significantly more false positives! See Disable quality filtering mode/disable match filtering modes. Note that the check box for Disable quality filtering is available only if it is configured in SMglobals.js. See the server administration help for details.
Sequence tag length - The minimum sequence tag length is the length of the longest path of amino acids that is represented in the spectrum. You use this parameter to reject extracted spectra that are noisy or that do not represent peptides. For most applications, it is best to retain the default of > 0 so you are sure to extract all possible good spectra. You can set higher thresholds for spectral quality later in the data processing. For MALDI MS/MS spectra, set the value to -1 so that no filtering is performed. See MS/MS Spectral Quality Filtering.
Ignore spectra with dissociation mode: Mark check boxes for any spectra that you do not wish to extract. Note that the software displays different dissociation modes depending on the type of file you select.

Merge nearby MSⁿ scans with same precursor m/z:

Replicate MS/MS scans that were acquired nearby in time and have the same precursor m/z are merged into a single spectrum using the constraints below.

Retention time & m/z tolerance: Set time and mass window for merging scans, and for calculating chromatographic peak areas of precursor ions. See Spectral Extraction.

For Agilent Q-TOF data, keep the default mass window of +/- 1.4 m/z. The software uses this value to merge scans, but generally does not use the value to calculate chromatographic peak areas. When the software can determine a charge state, it uses a more accurate method to calculate the intensities. For those few spectra where it cannot determine a charge state, the software does use the +/- 1.4 m/z window to calculate the intensities of the extracted ion chromatograms.
For MALDI, change the time window from the default of 60 to 1000 (or the total run time in seconds). Since MALDI is not chromatographic data, you want all instances of the same precursor merged.
For Thermo Fisher Scientific data the m/z tolerance on the form is used when MS1 scans were collected at low resolution in an ion trap. If there are high-resolution MS1 scans collected in an Orbitrap, the software ignores the m/z tolerance on the form and instead dynamically determines the m/z tolerance based on the MS1 resolution.
If you are attempting differential expression quantitation, and your labels differ in mass by only a few Da, see Quantitation for labels with small mass differences.

General MS/MS Merging Constraints: The MS/MS scans can be compared to ensure that they correspond to the same peptide. For Agilent data you can adjust settings in instrument.txt to control the comparison and merging. For direct control, select from the following list of options:

No merging (tolerate protein quantitation multi-counting) - For single proteins select this option to improve coverage and detect more low-level peptides.
Retention time & precursor m/z tolerance only - Select to merge scans based only on the values of the RT and precursor m/z tolerance entered above.
Spectral Similarity & RT & m/z - Select to merge scans based on similarity and on RT and m/z values. For more information, see the discussion under Settings in instrument.txt.
Precursor Selection Purity & RT & m/z - Select to merge scans based on RT and m/z values, as well as Precursor Selection Purity. The software automatically calculates this value as the proportion of ion current in the isolation window of a high-resolution MS1 scan that is represented by the isotope cluster of the precursor ion assigned to the resulting MS/MS scan. If the value is <75%, the MS/MS scan is ineligible for merging.
Precursor Selection Purity & Spectral Similarity & RT & m/z - Select to merge scans based on all the possibilities for merging.
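The merging constraints above can be pictured as a series of pairwise tests. The following Python sketch is illustrative only and is not the extractor's code: the Scan class, the function name, and the assumption that both tolerances are applied as symmetric +/- windows are all hypothetical. The 60-second and 1.4 m/z defaults and the 75% purity limit are the values mentioned in this topic.

```python
from dataclasses import dataclass


@dataclass
class Scan:
    rt_s: float          # retention time in seconds
    precursor_mz: float  # precursor m/z
    purity_pct: float    # precursor selection purity, 0-100 (hypothetical field)


def eligible_to_merge(a: Scan, b: Scan, rt_tol_s: float = 60.0,
                      mz_tol: float = 1.4, min_purity_pct: float = 75.0,
                      use_purity: bool = True) -> bool:
    """Return True if two MS/MS scans pass the RT, m/z and optional purity tests."""
    if abs(a.rt_s - b.rt_s) > rt_tol_s:
        return False  # too far apart in time
    if abs(a.precursor_mz - b.precursor_mz) > mz_tol:
        return False  # precursor m/z values differ too much
    if use_purity and min(a.purity_pct, b.purity_pct) < min_purity_pct:
        return False  # a scan below the purity limit is ineligible
    return True
```

Setting use_purity to False corresponds to the Retention time & precursor m/z tolerance only option. The spectral similarity test is not included in this sketch.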

Specialty MS/MS Merging Options - For Agilent ion trap data files that contain spectra that alternate between CID and ETD, the software merges the ETD spectra the same way as the corresponding CID spectra. For example, if the CID spectra were merged from scan 2 through 12, then the ETD spectra are merged from scan 3 through 13. In no case are ETD spectra merged with CID spectra. For Thermo Fisher Scientific data, the merged MS/MS spectra must also be acquired with the same dissociation method (CID, HCD, PQD, or ETD) and the same resolution, unless otherwise specified by the following specialty merging options.

Same resolution

Merge CID & HCD MSn
Merge CID & PQD MSn

Different resolution

Merge ion trap CID & HCD MSn immonium ion region. Data can be acquired to generate 2 separate spectra with iTRAQ/TMT reporter ions at high collision energy using HCD, and sequence ions at lower collision energy using CID. When merging is done, the reporter ion intensities are stored in the specFeatures files associated with the CID MS/MS spectrum for later use in quantitation. The peaks are also inserted into the CID spectrum (replacing any preexisting CID peaks at those masses). The inserted peaks are scaled to be less intense than the base CID peak to prevent interfering with subsequent identification and to facilitate spectral viewing. The unscaled intensities are stored in the specFeatures file and used later for quantitation.

Merge MS² and MS³ spectra from same precursor: This option appears only when you select *.d or *.raw data files. If the data does not contain MS³ scans (for example, Q-TOF), the setting is ignored. Select from the following list of options:

Merge - merge the MS² and MS³ data from the same precursor ion
Merge 5x MS³ intensity - multiply the intensities of the MS³ peaks by 5 (to make them more comparable to the MS² intensities) and then merge the MS² and MS³ data from the same precursor ion
Create separate extracted files for MS³ spectra - save the MS³ spectra separately for searching
Ignore MS³ spectra - ignore the MS³ data and extract only the MS² data
Ignore MS² spectra - ignore the MS² data and extract only the MS³ data

Peak Merging - When spectra are merged, many of the corresponding peaks in the spectra will not have identical mass; hence using a tolerance of +/- 0.25 Da (for ion trap data) the peaks are merged by summing the intensity and retaining the mass of the most intense peak (does not try to centroid). This attempts to correct artifacts resulting from prior centroiding of the individual spectra.
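As a rough illustration of the peak merging described above, the Python sketch below pools the peaks from the spectra being merged, sums the intensities of peaks that fall within the tolerance, and keeps the mass of the most intense contributor without centroiding. The function names and the greedy left-to-right grouping are assumptions made for this example. The 0.25 Da tolerance and the 5x MS³ scaling are the values given in this topic.

```python
def scale_ms3(ms3_peaks, factor=5.0):
    """Multiply MS3 intensities so they are more comparable to MS2 intensities."""
    return [(mz, inten * factor) for mz, inten in ms3_peaks]


def merge_peaks(peaks, tol_da=0.25):
    """Merge pooled (mz, intensity) peaks from several spectra.

    Peaks within tol_da of the current group's retained mass are summed;
    the group keeps the m/z of its most intense single peak.
    """
    groups = []  # each group: [retained_mz, summed_intensity, max_single_intensity]
    for mz, inten in sorted(peaks):
        if groups and abs(mz - groups[-1][0]) <= tol_da:
            group = groups[-1]
            group[1] += inten
            if inten > group[2]:
                group[0], group[2] = mz, inten
        else:
            groups.append([mz, inten, inten])
    return [(mz, total) for mz, total, _ in groups]
```

For example, pooling (500.10, 10.0) from one spectrum with (500.20, 30.0) from another yields a single peak at 500.20 with intensity 40.0.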

Merge settings for Agilent instruments in instrument.txt

The Agilent extractor merges MS/MS spectra only if they are similar. This avoids merging closely eluting or co-eluting isobaric peptides. The parameters that control the merging are set in E:\SpectrumMill\msparams_mill\instrument.txt:

merge_num_peaks	For similarity merging of MS/MS spectra, the number of peaks that match between the two spectra must be greater than or equal to merge_num_peaks, which is a number between 0 and 50. The similarity merging takes the top 50 peaks from both spectra and compares them.
merge_SPI	For similarity merging of MS/MS spectra, the percentage of the total intensity of the top 50 spectral peaks that is matched from spectrum A to spectrum B and from spectrum B to spectrum A must be greater than or equal to merge_SPI, which is a number between 0 and 100.

With the exception of the Agilent Q-TOF, all Agilent instruments that generate MS/MS data use the defaults of merge_SPI = 70 and merge_num_peaks = 25, but if you add an entry to instrument.txt, that overrides the defaults. The Agilent Q-TOF uses merge_SPI = 50 and merge_num_peaks = 5, and the software merges only fragment ions that are within a 0.05 m/z mass tolerance.
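The following Python sketch shows one way to read the two similarity parameters. It is not the Agilent extractor's implementation: the function names, the simple nearest-match rule, and the 0.25 m/z fragment tolerance are assumptions made for illustration (the text above gives 0.05 m/z for the Agilent Q-TOF).

```python
def top_peaks(peaks, n=50):
    """Return the n most intense (mz, intensity) peaks."""
    return sorted(peaks, key=lambda p: p[1], reverse=True)[:n]


def matched_fraction(src, dst, tol):
    """Return (count of src peaks with a partner in dst, % of src intensity matched)."""
    matched = [(mz, i) for mz, i in src if any(abs(mz - m2) <= tol for m2, _ in dst)]
    total = sum(i for _, i in src)
    pct = 100.0 * sum(i for _, i in matched) / total if total else 0.0
    return len(matched), pct


def similar_enough(a, b, merge_num_peaks=25, merge_spi=70.0, tol=0.25):
    """Apply the merge_num_peaks and merge_SPI tests in both directions."""
    ta, tb = top_peaks(a), top_peaks(b)
    n_ab, pct_ab = matched_fraction(ta, tb, tol)
    n_ba, pct_ba = matched_fraction(tb, ta, tol)
    return (min(n_ab, n_ba) >= merge_num_peaks
            and pct_ab >= merge_spi and pct_ba >= merge_spi)
```

Lowering merge_spi or merge_num_peaks in this sketch makes more pairs of spectra qualify for merging, which mirrors the effect of lowering the corresponding values in instrument.txt.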

If a significant number of peptides appear twice in the summary report, and the peptides do not have different charge states or different labels (for example, D₀ and D₈), then it is possible you need to modify the settings in instrument.txt. Before you do so, first increase the windows for Merge scans with same precursor m/z in the Data Extractor form. If changing the extractor settings does not produce satisfactory results, then modify instrument.txt to set merge_SPI to a lower value. Try a small change first, for example, change from merge_SPI = 70 to merge_SPI = 65. The format in instrument.txt is merge_SPI, followed by a tab, followed by the value.

You can also try setting merge_num_peaks to a lower value (down to 20 or 15). This may be useful for some MALDI MS/MS spectra where sequence coverage is low and there are only a few large peaks in the spectrum.

For more information about modifying instrument.txt, see the help topic on modifying instrument.txt.

To customize merging, see this Help section for the Data Extractor form.

Precursor m/z & Charge Assignment

Note: These options are not available when you mark the check box Show only MS (PMF) parameters.

Default/Find/Force - See Precursor Charge Assignment for MS/MS Scans.
If you choose Find, the following options are available:

Maximum (z): See Precursor Charge Assignment for MS/MS Scans.
Minimum MS S/N: Sets the minimum MS signal-to-noise for determining charge state. See Peak Detection.
Find ¹²C - Mark this check box to compensate for the fact that the mass spec control software may not have selected the monoisotopic peak for MS/MS. See Peak Detection. Also mark this check box if you want the software to use results from centroiding to further improve the mass accuracy for the monoisotopic peak in Agilent Q-TOF data. See the information above on precursor charge assignment for Agilent Q-TOF data.

If you choose Default, the following option is available:

Find ¹²C - Mark this check box to compensate for the fact that the mass spec control software may have selected the ¹³C peak for MS/MS. See Peak Detection.

If you choose Force, the following option is available:

Force (z): Forces the charge state to the specified value or range of values.

MS Noise Threshold - Applies only to Agilent Q-TOF data. The default value of 100 counts is fine for most data. For data acquired with the Agilent 6550 Q-TOF, a higher value might provide better results. The increased sensitivity also can increase non-peptidic background signals. If you observe that the overall background is much higher than 100 counts, specify a value that filters out much of the background.

Precursor Charge Assignment for MS/MS scans

Default mode - if instrument does not assign charge, the charge is assigned as 0 (ambiguous charge) unless it can be determined to be +1 as described in Find mode.

Force Mode - charge assigned as designated by the user.

Find Mode - fixed charge assigned if it can be determined as described below, otherwise 0 (ambiguous charge) assigned.

For Agilent Q-TOF data: The software examines the MS spectra for the precursor ions and calculates the theoretical isotopic distribution for all charge states from +1 up to Maximum (z), which is set in the Data Extractor form. It then uses a least squares fit to determine which is the best match for the monoisotopic peak and isotopic distribution in the experimental spectrum. The software performs a least squares calculation for each spectrum across the elution profile of the chromatographic peak and then centroids. If the check box for Find ¹²C is marked, then it replaces the original monoisotopic mass with the centroided mass, to provide better mass accuracy.

For Agilent Q-TOF data, the software performs the charge assignment prior to peak merging, which is the opposite of the order for low-resolution data.

For ion trap (low-resolution) CID data: Tests below are performed in the order listed.

+1 If No Peaks Above Precursor - if after peak detection as described above, there are no remaining peaks in the MS/MS spectrum above the precursor m/z value with an additional allowance of 2.5 m/z for precursor isotopes, then the precursor charge is assigned as +1.
+2 from b/y pairs in MS/MS scan - if after peak detection as described above, there are at least 3 b/y pairs (pairs of peaks which add up to the mass of putative precursor MH⁺ + hydrogen), then the precursor charge is assigned as +2. Note that this calculation is dependent upon the putative precursor m/z (as adjusted by user designation of Find precursor ¹²C) and the user-designated tolerance allowed for merging scans with the same precursor m/z.
+2 to Max z by checking MS scan for precursor charge distribution - the MS scan preceding the MS/MS scan is examined for peaks corresponding to additional charge states of the peptide's precursor m/z. Peaks corresponding to possible additional charge states in the MS scan are subject to a signal/noise calculation as described in the Peak Detection section and the user-designated mass tolerance allowed for merging scans with the same precursor m/z. After finding possible alternate charge states, the following further restrictions must be met before assigning the precursor charge:
- disregard possible higher charge states found below m/z = 500 (chemical noise present).
- to assign z > 3, 2 additional charge states must be found.
- to assign z = 3, an additional +2, or +4 and +5 must be found.
- to assign z = 2, an additional +1 or +3 must be found.
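The three ion trap tests above can be sketched in Python as follows. This is an illustration under stated assumptions, not the extractor's code: the function names are hypothetical, the b/y test relies on the fact that for a +2 precursor the quantity MH⁺ + hydrogen equals twice the precursor m/z, and the z = 3 rule is read as "+2 alone, or both +4 and +5". The m/z 500 restriction is not modeled here.

```python
def no_peaks_above_precursor(fragment_mzs, precursor_mz, allowance=2.5):
    """Test 1: +1 if no fragment peak lies above precursor m/z plus the isotope allowance."""
    return all(mz <= precursor_mz + allowance for mz in fragment_mzs)


def count_by_pairs(fragment_mzs, precursor_mz, tol):
    """Test 2: count peak pairs whose m/z values sum to 2 * precursor m/z (z = 2 assumed)."""
    target = 2.0 * precursor_mz
    mzs = sorted(fragment_mzs)
    lo, hi, pairs = 0, len(mzs) - 1, 0
    while lo < hi:
        total = mzs[lo] + mzs[hi]
        if abs(total - target) <= tol:
            pairs += 1      # each peak is used in at most one pair
            lo += 1
            hi -= 1
        elif total < target:
            lo += 1
        else:
            hi -= 1
    return pairs


def ms_scan_supports(z, other_states):
    """Test 3 restrictions: other_states are additional charge states found in the MS scan."""
    others = set(other_states) - {z}
    if z > 3:
        return len(others) >= 2
    if z == 3:
        return 2 in others or {4, 5} <= others
    if z == 2:
        return bool(others & {1, 3})
    return False
```

In this sketch, a spectrum would be assigned +2 by the second test when count_by_pairs returns 3 or more.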

For Agilent ion trap ETD data: The software examines the MS/MS spectra for a pattern of peaks with reduced charge states, finds the pattern that is most complete, and uses that information to assign the charge state to the precursor ion. It tests all possible precursor charges from +1 up to Maximum (z), which is set in the Data Extractor form.

For example, to test z = 4 the software looks in the MS/MS spectrum for peaks that correspond to reduced charges of +3, +2, and +1. To test z = 5, it looks for peaks that correspond to reduced charges of +4, +3, +2, and +1. The charge state that produces the most complete pattern is the one that is picked.
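The pattern search in this example can be sketched in Python. The sketch assumes that a charge-reduced species keeps the precursor's mass, so its m/z is approximately the precursor m/z multiplied by z and divided by the reduced charge (electron mass is ignored). How the software actually scores "most complete" is not documented here, so the ranking below (most expected peaks found, then highest fraction found) and all function names are assumptions.

```python
def reduced_charge_mzs(precursor_mz, z):
    """Approximate m/z of charge-reduced species at charges z-1 down to 1."""
    return [precursor_mz * z / zr for zr in range(z - 1, 0, -1)]


def pick_charge(precursor_mz, peak_mzs, max_z, tol=1.0):
    """Return the charge whose reduced-charge pattern is most complete, or 0 if none is found."""
    best_z, best_key = 0, (0, 0.0)
    for z in range(1, max_z + 1):
        expected = reduced_charge_mzs(precursor_mz, z)
        found = sum(1 for e in expected
                    if any(abs(p - e) <= tol for p in peak_mzs))
        key = (found, found / len(expected) if expected else 0.0)
        if key > best_key:
            best_z, best_key = z, key
    return best_z
```

For a precursor at m/z 400, peaks near 533.3, 800, and 1600 match the reduced charges +3, +2, and +1 of a +4 precursor, so this sketch returns 4.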

For Thermo Fisher Scientific ETD data: Charge assignment uses four different tests. If any of the four tests provides a charge, the software assigns that charge unless there is a conflict. If none of the four tests provides a charge, the software creates a .0 pkl file. The four tests are:

Precursor isotope spacing in the MS survey scan (only if the scan used enhanced scan rate resolution or higher)
Additional precursor charge states in the MS survey scan
Additional reduced precursor charge states in the ETD MS/MS scan
Complementary c/z ion pairs in the ETD MS/MS scan

Data Extractor for Generic (Peak List) Files

The generic Data Extractor serves two basic functions for MS/MS spectra: spectral quality filtering and spectral feature calculation. The generic Data Extractor is automatically invoked for files that contain peak lists. It handles only spectra with peaks that have all already been centroided. The generic Data Extractor also processes *.mgf files that contain centroided spectra.

The generic Data Extractor performs many of the functions that the raw file Data Extractor does, but since it cannot read the raw mass spectral files, neither chromatographic time information nor MS scan data is available. Like the raw file Data Extractor, the generic Data Extractor creates the SpecFeatures.1.tsv file that contains Spectral Features such as total intensity and Maximum Sequence Tag Length. These features are used in the MS/MS Search, Sherenga de novo Sequencing, Protein/Peptide Summary, and Spectrum Summary scripts.

Settings in instrument.txt

By default, this extractor expects files that contain data that has been centroided only - not signal-to-noise processed or de-isotoped. For generic data, it is best to let the Spectrum Mill do the signal-to-noise processing and de-isotoping/charge-assignment. If your instrument software performs these functions, then add the following to the section of E:\SpectrumMill\msparams_mill\instrument.txt that applies to your instrument:

bypassSignalNoiseThresholding 1
bypassDeisotoping 1

If you want your instrument software to do signal-to-noise thresholding but not de-isotoping/charge-assignment, then add the following to the section of instrument.txt that applies to your instrument:

bypassSignalNoiseThresholding 1
bypassDeisotoping 0

For more information about modifying instrument.txt, see the help topic on modifying instrument.txt.

Files generated

When you process appended *.pkl files, the software generates individual spectral files with the following naming conventions:

prefix.pkl - The starting file containing multiple spectra
prefix.scanNumber.0.parentCharge.pkl - A resulting file containing an individual spectrum

scanNumber: the consecutive order of the spectrum in the starting file
0: placeholder where function number would be if created by ProteinLynx
parentCharge: charge of the precursor ion for the spectrum
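If you need to work with these file names in your own scripts, the naming convention can be parsed with a short regular expression. The Python sketch below is a convenience example, not part of the Spectrum Mill software; the class and function names are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class SpectrumFileName(NamedTuple):
    prefix: str
    scan_number: int
    parent_charge: int


# prefix.scanNumber.0.parentCharge.pkl, where "0" is the function-number placeholder
_PATTERN = re.compile(r"^(?P<prefix>.+)\.(?P<scan>\d+)\.0\.(?P<charge>\d+)\.pkl$")


def parse_spectrum_filename(name: str) -> Optional[SpectrumFileName]:
    """Split an individual-spectrum file name into its parts, or return None."""
    m = _PATTERN.match(name)
    if m is None:
        return None
    return SpectrumFileName(m["prefix"], int(m["scan"]), int(m["charge"]))
```

For example, parse_spectrum_filename("sample1.12.0.2.pkl") returns the prefix sample1, scan number 12, and parent charge 2, while the starting file sample1.pkl returns None.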

MS/MS Spectral Quality Filtering and peak detection are performed as with raw file Data Extractor.

*.mgf file support

The Generic Data Extractor can parse most *.mgf files. To get the best results, make sure that the PEPMASS lines contain both mass and intensity values, and that the CHARGE line is reported.
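Before you extract, you can check an *.mgf file for the two recommendations above. The Python sketch below is a hypothetical pre-check, not a Spectrum Mill tool. It assumes the common Mascot Generic Format layout, in which each spectrum sits between BEGIN IONS and END IONS lines and PEPMASS carries the mass followed by an optional intensity.

```python
def check_mgf(text: str):
    """Return (spectrum_index, problem) tuples for spectra missing recommended fields."""
    problems = []
    index = 0
    in_block = has_intensity = has_charge = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "BEGIN IONS":
            in_block, has_intensity, has_charge = True, False, False
            index += 1
        elif line == "END IONS" and in_block:
            if not has_intensity:
                problems.append((index, "PEPMASS lacks an intensity value"))
            if not has_charge:
                problems.append((index, "CHARGE line missing"))
            in_block = False
        elif in_block and line.startswith("PEPMASS="):
            # PEPMASS=<mass> [<intensity>]
            has_intensity = len(line.split("=", 1)[1].split()) >= 2
        elif in_block and line.startswith("CHARGE="):
            has_charge = True
    return problems
```

An empty list means that every spectrum in the file reports both a precursor intensity and a charge.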

To optimize results, you may need to change settings for your instrument or define a new instrument type in E:\SpectrumMill\msparams_mill\instrument.txt. The instrument.txt setting for MALDI-TOF-TOF is configured for *.mgf files where the data has been centroided, signal-to-noise filtered, and de-isotoped. With the hiEnergyCID setting of 1 in instrument.txt, the search score is not penalized for unassigned peaks.

If your spectra contain many noise peaks, when you search the spectra, reduce the value for Minimum scored peak intensity. Likewise, when you validate and summarize data, reduce the % SPI and Score filters.

MS/MS Search

Filters for excluding files from MS/MS searches are described here. MS/MS Search itself is described in the MS/MS Search Help.

Search Filters

Features for excluding files from a group of MS/MS searches are covered here.

Data Directories - Designates the base sample directory where a directory of spectral input files can be found.
Validation filter - allows searches to be restricted to those files that have or have not been assigned a validation state using Protein/Peptide Summary or Spectrum Summary.
Sequence tag length - allows data set files to be skipped that have a low number of ions constituting an ion series separated by amino acid masses.
Minimum detected peaks - allows files in the data set that have a low number of peaks remaining after peak detection to be skipped.
Spectrum files - Designates the particular spectral input files; note that wildcards can be used to specify a set of filenames.
Fragmentation mode - Filters searches based on CID, ETD, HCD, and/or PQD fragmentation modes. This option is located near Data Files on the form.
Precursor mass tolerance - can be specified in either Da or ppm.

MS/MS Autovalidation

The MS/MS Autovalidation page permits automatic validation of results meeting user-set score thresholds. Two major differences exist between the validation done with this page and the validation done with the Protein/Peptide Summary page. The first difference is that with Autovalidation, the validation occurs in a single step; the validation states are immediately written to file. The second difference is that Autovalidation permits validation using charge-state-dependent score thresholds.

Note that when you validate files via either autovalidation or manual validation (Protein/Peptide Summary page), the software lists validated hits and spectra. These are cumulative and include both the new hits and spectra you just validated, as well as those you validated previously.

False Discovery Rate

With any protein database search, you get some top hits that are correct and some that are not. In the Spectrum Mill workbench, you (or the autovalidation software) can judge which hits are more likely to be correct, based on database search score and %SPI (the percentage of the extracted spectrum that is explained by the database search result). To further ensure the quality of results, the Spectrum Mill allows you to autovalidate database search results based on false discovery rate (FDR) – a percent FDR that you set and that provides an independent measure of the likelihood that the results are correct.

To calculate the FDR, the software needs the results of the search of a decoy database. It gets these results when you mark the check box (in MS/MS Search) for Calculate reversed database scores. To calculate %FDR, it compares the number of top database hits from the reversed database search to the total number of top hits. It multiplies the decoy top hits by 2, under the assumption that for each incorrect top hit in the decoy (internally reversed) database, there exists an incorrect hit in the forward database (SwissProt, or whatever database you searched).

Note: To publish the calculated %FDR, use the calculations available under Quality Metrics & FDR.

Strategies/Modes

To use false discovery rate calculations most effectively for your situation, Agilent has provided a number of options for autovalidating the matches and estimating the false discovery rate. You can choose from among three Autovalidation strategies:

Fixed thresholds: Run Autovalidation first in Protein details mode, where you set fixed thresholds for different scores, above which the protein is valid, and then in Peptide mode, again where you set fixed thresholds for different scores, above which the peptide is valid. In both modes you can calculate an FDR using reversed hits. This FDR is the global FDR at the spectral level.
Auto thresholds: Run Autovalidation first in Peptide mode, where the score and R1-R2 score thresholds are automatically optimized until a target % FDR, which you enter, is reached, and then in Protein Polishing mode.
In Protein Polishing mode, the intention is to achieve a target protein FDR and to increase the sequence coverage of validated proteins. The first objective is achieved by unvalidating previously validated peptides. This capability allows you to autovalidate marginal peptides during peptide autovalidation; yet the protein FDR is kept under control by unvalidating the marginal peptides that cause trouble at the protein level. The second objective is achieved by recalculating the peptide FDR only on the subset of peptides from validated proteins. This generally results in increased sequence coverage of the validated proteins.
Auto thresholds - discriminant: Run Autovalidation first in Peptide mode, where either a global FDR or a local FDR is set (see Global versus Local FDR below) and the discriminant score thresholds are automatically optimized until the FDR you entered is reached. Then run Autovalidation in a Protein Polishing mode (see description above). You must have searched with Discriminant scoring set to something other than Off.

You can use all of these strategies and modes with Workflow Automation, but only certain combinations in recursive workflows. A recursive workflow involves successive searches and validations; for example, identity search, followed by autovalidation, followed by a variable modification search on a smaller database, followed by autovalidation. The recursive workflow is incompatible with the global FDR, which is calculated by the Optimize score and R1-R2 ... option in the Auto thresholds/Peptide strategy/mode and by the Global FDR option in the Auto thresholds - discriminant/Peptide strategy/mode. The recursive workflow leads to subsets, each of which can have different characteristics, while the global FDR calculates a single FDR value over all matches under the assumption that all the matches have uniform characteristics on average. Therefore, in recursive workflows you can use only the Fixed thresholds strategy/modes and the Auto thresholds - discriminant/Peptide/Local FDR option.

Global versus Local FDR

With the Auto thresholds - discriminant strategy in Peptide mode, you can autovalidate by either Global FDR or Local FDR. The Global FDR gives an overall error rate for validated peptides in the entire data set. You choose a cutoff (for example, 1% FDR) for which you accept results. That means in the overall data set, 1% of the identifications are likely to be wrong. However, an individual validated peptide may have a much higher chance of being wrong, which is especially true for the lower-scoring results. If that is a concern, you can use the Local FDR.

To calculate Global FDR, the program orders the identifications from best (highest discriminant score, or highest score if discriminant score is disabled) to worst (lowest score), then sums the total number of hits to the reversed database (D) and the total number of hits to both forward and reversed databases (N). Then it calculates FDR as:

FDR_global = 2D/N
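The global FDR calculation can be illustrated with a short Python sketch. It orders the hits from best to worst, keeps running totals of D and N, and reports the largest set of top hits whose FDR stays at or below the chosen cutoff. The function name and the (score, is_reversed) input layout are assumptions made for this example and do not describe the program's internal data structures.

```python
def global_fdr_cutoff(hits, target_pct):
    """hits: list of (score, is_reversed). Returns (n_accepted, fdr_pct).

    FDR (%) = 100 * 2D / N, where D counts reversed-database hits and
    N counts all hits (forward and reversed) among the top-N by score.
    """
    ordered = sorted(hits, key=lambda h: h[0], reverse=True)
    best_n, best_fdr, d = 0, 0.0, 0
    for n, (_, is_reversed) in enumerate(ordered, start=1):
        d += bool(is_reversed)
        fdr = 100.0 * 2 * d / n
        if fdr <= target_pct:
            best_n, best_fdr = n, fdr
    return best_n, best_fdr
```

For example, with 10 hits of which the 8th and 10th best are reversed hits, a 10% cutoff accepts the top 7 hits, because including the 8th hit raises the FDR to 2 x 1 / 8 = 25%.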

The Local FDR measures the quality of each individual peptide identification. It answers the question, "If I accept this hit as a correct answer, how much does that increase my false positive rate?" As with the global FDR, you choose a cutoff (for example, 1% FDR) for which you accept results. The local FDR calculation uses the equation:

FDR_local = 2 dD/dN

In other words, it plots D on the y-axis versus N on the x-axis, and takes the derivative at each (D, N) pair. (See example graphs below.) This plot is not smooth, which causes local variations in the derivative. To get more reliable results, the program first fits a function to the plot, then takes the derivative of the function at each point.
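To make the derivative concrete, the Python sketch below estimates 2 dD/dN at each hit. Note the substitution: instead of fitting a function to the D versus N plot as the program does, this example smooths with a simple centered finite difference over a small window. The function name and window size are assumptions.

```python
def local_fdr(is_reversed_sorted, half_window=2):
    """Estimate local FDR (%) for hits already ordered from best to worst.

    is_reversed_sorted: one boolean per hit, True for a reversed-database hit.
    Uses a centered finite difference of D over N, not a fitted function.
    """
    d_cum, d = [0], 0
    for flag in is_reversed_sorted:
        d += bool(flag)
        d_cum.append(d)  # d_cum[n] is D after the top n hits
    total = len(is_reversed_sorted)
    result = []
    for i in range(1, total + 1):
        lo = max(0, i - half_window)
        hi = min(total, i + half_window)
        result.append(100.0 * 2 * (d_cum[hi] - d_cum[lo]) / (hi - lo))
    return result
```

Because the estimate looks only at the neighborhood of each hit, it rises sharply in the low-scoring region where reversed hits cluster, even when the global FDR for the whole list is still low.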

[Figure: Local FDR example 1] [Figure: Local FDR example 2]

As shown below, the local FDR is generally a more stringent measure of quality, so it usually gives fewer validated hits than global FDR.

[Figure: Global/local FDR comparison]

For more information, see:

Tang, W. H.; Shilov, I. V.; and Seymour, S. L. "Nonlinear Fitting Method for Determining Local False Discovery Rates from Decoy Database Searches;" J. Proteome Res.; 2008; 7; 3661-67; DOI: 10.1021/pr070492f.

FDR at the PSM, Peptide, and Protein Levels

FDRs can be calculated at different levels: peptide spectrum match (PSM), peptide, and protein. The Autovalidation form in the Spectrum Mill calculates FDR at the PSM and protein levels, while the Quality Metrics module calculates FDR at all levels. The difference between the PSM level and the peptide level is that the PSM level may include multiple spectra for the same peptide, while the peptide level uses only the highest-scoring spectrum for each peptide. Therefore, the peptide level is a more stringent calculation.
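The difference between the PSM level and the peptide level can be shown with a small Python sketch. The function names and the (peptide, score, is_reversed) tuple layout are hypothetical, and the FDR formula is the 2D/N calculation described earlier in this topic.

```python
def best_psm_per_peptide(psms):
    """Keep only the highest-scoring spectrum match for each peptide sequence."""
    best = {}
    for seq, score, is_reversed in psms:
        if seq not in best or score > best[seq][1]:
            best[seq] = (seq, score, is_reversed)
    return list(best.values())


def fdr_pct(entries):
    """FDR (%) = 100 * 2D / N for a list of (peptide, score, is_reversed)."""
    n = len(entries)
    d = sum(1 for _, _, is_reversed in entries if is_reversed)
    return 100.0 * 2 * d / n if n else 0.0
```

Suppose a data set has three spectra matched to one forward peptide, one spectrum matched to a second forward peptide, and one reversed hit. At the PSM level, N is 5; at the peptide level, N drops to 3 while D stays at 1, so the same reversed hit produces a higher FDR.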

MS/MS Autovalidation and Workflows

Autovalidation strategies in Spectrum Mill

There are three Autovalidation “strategies” in the Spectrum Mill, and each provides both a peptide-level and a protein-level Autovalidation mode, but there are some differences. In general, the Auto thresholds strategy is recommended, but there are cases where the other strategies should be used. This is discussed in the Suggested Workflows section.

FDR

Determination of a false discovery rate (FDR) requires the data be searched with Calculate reversed database scores enabled. When enabled, Spectrum Mill reverses the sequence of amino acids in the peptide that are between the termini. For example, “SAMPLER” is also searched as “SELPMAR”. This allows for the search to use the same peptide mass, and it is faster than searching a decoy database. FDR calculations require a sufficiently large database so that false positives can be determined. This has implications for searching single protein or small species subsets, and when searching saved results.
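The reversal described above can be reproduced in a few lines of Python. This sketch only illustrates the "SAMPLER" to "SELPMAR" example; the function name is hypothetical and the code is not taken from Spectrum Mill.

```python
def reverse_between_termini(peptide: str) -> str:
    """Reverse the residues between the first and last amino acids."""
    if len(peptide) <= 2:
        return peptide  # nothing between the termini to reverse
    return peptide[0] + peptide[1:-1][::-1] + peptide[-1]
```

Because the reversed sequence contains the same residues, it has the same peptide mass as the forward sequence, which is why the same precursor can be scored against both.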

The actual FDR obtained can be determined in the Quality Metrics & FDR page.

Auto Thresholds

The Auto thresholds strategy is available in B.04.00 and later, and is the default. With this strategy, the Peptide mode is done first and optimizes the score and Rank1-Rank2 score thresholds to reach a specified maximum FDR. This mode allows for various peptide filtering settings which are applied prior to validation. The Protein polishing mode can then be used to remove one-hit wonders and increase coverage of valid proteins. Note that Peptide followed by Protein polishing is the reverse of the order used in the Fixed thresholds strategy.

Auto Thresholds

The Auto thresholds strategy is the recommended strategy to use in most cases. Note that you first perform Peptide mode, then optionally use Protein polishing.

Peptide mode

For each precursor charge state, the matrix of score and Rank1-Rank2 values are examined to find the values that yield the maximum number of peptide spectrum matches below the designated FDR threshold. For datasets or charge states that have small numbers of peptides, you should choose to optimize across an entire directory rather than across each LC-MS/MS run. In peptide mode, when you use the Auto thresholds strategy multiple times on the same directory, each time it only optimizes using the not-yet-valid peptide spectrum matches. The results of each round are appended to the pool of previously valid spectra. Use the Quality Metrics & FDR tool to calculate the final combined FDR.
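The threshold optimization can be pictured as a grid search over candidate score and Rank1-Rank2 cutoffs. The Python sketch below is a simplified illustration for a single precursor charge state. The function name, the exhaustive search over observed values, and the use of the 2D/N formula for the FDR are assumptions, and the actual optimization in the software may differ.

```python
def optimize_thresholds(psms, max_fdr_pct):
    """psms: list of (score, delta_r1_r2, is_reversed) for one charge state.

    Returns (score_threshold, delta_threshold, n_passing) for the pair of
    cutoffs that passes the most matches with FDR <= max_fdr_pct, or None.
    """
    best = None
    score_cuts = sorted({p[0] for p in psms})
    delta_cuts = sorted({p[1] for p in psms})
    for s_cut in score_cuts:
        for d_cut in delta_cuts:
            passing = [p for p in psms if p[0] >= s_cut and p[1] >= d_cut]
            n = len(passing)
            d = sum(1 for p in passing if p[2])
            if n and 100.0 * 2 * d / n <= max_fdr_pct:
                if best is None or n > best[2]:
                    best = (s_cut, d_cut, n)
    return best
```

To mirror the per-charge behavior described above, you would group the matches by precursor charge and call the function once per group. With few matches in a group, the grid becomes coarse, which is one way to see why optimizing across an entire directory is suggested for small data sets.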

Protein polishing mode

The Protein polishing mode has two goals: (1) achieving a target protein FDR, and (2) increasing the sequence coverage of validated proteins. Before using this mode, you must use the Peptide mode.

Both goals are achieved by unvalidating previously validated peptides. This unvalidation capability enables you to autovalidate marginal peptides during peptide autovalidation, yet the protein FDR is kept under control with subsequent protein polishing by unvalidating the marginal peptides that belong to marginal proteins.

Fixed Thresholds

The Fixed thresholds strategy is similar to the “classic” (A.03.03 and prior) Autovalidation, but now provides the option to calculate an FDR. New peptide filtering options are also available. In this strategy, validation is done first with Protein details mode, and then can optionally be followed with Peptide mode. The Quality Metrics & FDR page can be used to determine the FDR that was obtained.

Enhancements over the “classic” Autovalidation include:

Ability to calculate FDR using reversed hits. Note that if the FDR calculation is enabled, the reversed hits cannot also be used for threshold filtering – that is, the Fwd-Rev Score Threshold filter cannot be selected. The FDR calculated is the global FDR at the spectral level.
Ability to optimize score and R1-R2 score thresholds for each run with max FDR using reversed hits
Filtering on precursor mass error
Multiple filtering options that are variable for each run or fixed range for all runs
Ability to require or disallow certain amino acids (AAs)

Auto Thresholds - Discriminant

Discriminant Scoring allows additional factors (%SPI, Backbone Cleavage Score, Number of Complementary Fragments, Matched Sequence Tag Length, Peak Match%, Charge, Rank1-Rank2 Delta) to contribute to the scoring used in the Autovalidation.

To use this strategy, Discriminant Scoring must be enabled in the search. Effective use of discriminant scoring requires the careful curation and validation (using one of the other Autovalidation modes and manual validation) of a representative data set. The Tool Belt Calculate discriminant scoring coefficients tool is then used to create the coefficients. Several precalculated sets are provided for evaluation. Note that selection of Score in the MS/MS Search defeats the purpose of the discriminant mode, and is there for backwards compatibility only.

The FDR target may be applied to either Local or Global levels.

Peptide mode - Global FDR

In this mode, the program calculates the global peptide FDR at the spectral level. The global FDR is the percentage of all the peptide identifications that are likely to be false. It is a calculation for a collection of peptides across the data set you are validating. The program adjusts the validation thresholds for peptide score (or discriminant score) until it meets the %FDR that you typed. This mode does not support recursive workflows with successive validations and searches.

Peptide mode - Local FDR

In this mode, the local FDR measures the error rate for individual peptides at the spectral level. While the global FDR focuses on a collection of peptides, the local FDR answers the question, "Does this peptide identification increase the FDR? If I validate this identification, how many additional false positives am I likely to get?" This mode supports recursive workflows with successive validations and searches.

Compared to the global FDR calculation, the local FDR calculation requires an additional curve fitting step and is thus less robust from a computational standpoint than the global FDR calculation. The larger the data set, the more reliable the curve fitting becomes and hence the more reliable the calculated local FDR value. You should review the curve fitting, which you can see by clicking on an entry in the FDR search # column and looking at the graph titled “Fit quality for computing local false discovery rate.”

Recursive Workflows

Note: Prior to Spectrum Mill B.04.00, the recommendation for variable modification searches was to always search first with Identity mode, validate, then search in Variable mode. Because of both search performance improvements and the ability to Autovalidate to an FDR, the initial search should now include the expected variable modifications.

In recursive workflows, an initial search is done with the expected variable modifications. The results are then Autovalidated. Additional searches are then run with Search previous hits selected. This restricts the search to only those proteins that were identified and validated in the initial search. Typical uses of a recursive search are to search with a different variable modification (usually an alternative to a modification that was applied during the initial search), or a different enzyme. Setting the Validation filter to spectrum-not-marked-sequence-not-validated reduces the search space to those spectra that were not validated after an earlier search.

It may be the case that changing the modifications and enzyme selections will result in completely different proteins being found during the MS/MS Search. You can combine these additional results with your previously found results by clearing the check boxes for both Remove all prior MS/MS Search results and Search previous hits.

Autovalidation Strategies and Recursive Searches

When you do recursive searches, only the following Autovalidation strategies should be used to Autovalidate after each recursive search:

Fixed thresholds (Protein Details, followed by optional Peptide)
Auto Thresholds – Discriminant, with Local FDR

The Auto thresholds strategy (either Peptide or Protein polishing mode), and the Auto thresholds – Discriminant strategy with Global FDR mode should not be used. While it might be tempting to Clear All prior validations prior to Autovalidating after recursive searches, this will not provide an accurate FDR, because the size of the search space is different for each round and thus the delta R1-R2 scores are not comparable.

Suggested Workflows

Auto thresholds Strategy

This workflow begins with the Peptide mode. It can then be followed by the Protein Polishing mode. Use of the latter may remove previously validated peptides to meet the protein FDR% target.

This Autovalidation workflow should not be used with recursive search workflows. The implication is that Variable modifications searches must be done in the initial search step. Additional (recursive) searches should be followed by one of the Autovalidation strategies that support recursive searches.

Fixed threshold Strategy

When using this strategy, first do Protein Details validation, then optionally follow with Peptide validation. Do not clear the validations between searches.

Auto thresholds - Discriminant Strategy

This workflow begins with either the Peptide Global or the Peptide Local Autovalidation. (Do not do both.) Either mode can then be followed by the Protein Polishing mode.

Only the Peptide Local Autovalidation workflow can be used in the recursive search workflows.

Which Workflow to Use?

The Auto thresholds strategy automatically validates for a target FDR%, where it uses both the Score and the Rank1-Rank2 score to optimize thresholds. It provides various filtering options, and is the recommended strategy to use. The disadvantage is that it does not support the recursive search workflow, but it can be used to validate the initial search results.

The “classic” approach using the Fixed threshold strategy still works and can be used as a reference point for evaluating the other approaches. The resulting FDR can be calculated and shown. To change the FDR target, though, you must change the various Rules settings and redo the Autovalidation.

The Auto thresholds – discriminant strategy is the simplest Autovalidation approach for FDR, but only the Peptide Local mode can be used in a recursive search workflow. The disadvantage is that Discriminant Scoring must be enabled during the search, and requires that a training set be carefully validated, although several default sets are provided for evaluation. Typically, you would use the Fixed Thresholds or the Auto Thresholds approaches, along with some manual validation, to prepare the data set. The use of Discriminant Scoring allows additional factors (%SPI, Backbone Cleavage Score, Number of Complementary Fragments, Matched Sequence Tag Length, Peak Match%, Charge, Rank1-Rank2 Delta) to contribute to the scoring used in the Autovalidation. For small data sets, the local FDR calculation may be unreliable and it is wise to use the global FDR.

Quality Metrics & FDR

All of the peptide Autovalidation modes calculate the spectra level FDR. The Protein polishing Autovalidation calculates the protein level FDR. The only place the distinct peptide level FDR is calculated is in the Quality Metrics & FDR page. The FDR may be reported at the spectra level, distinct peptide level, and protein level.

To Use the Autovalidation Form

The following topics describe options available on the MS/MS Autovalidation form. In general, you should retain the default settings, except for the options highlighted in red text on the form. For more details, see MS/MS Autovalidation.

Automatic Validation

Validate Files - Click to validate search results and spectra. Click this button after you have either loaded a parameter file or manually set the parameters. The name of the current parameter file appears in red at the top of the form. Once you have saved a parameter file, you may start the autovalidation from a workflow rather than manually with the Validate Files button. Whether you use the workflow or not, you usually validate twice, first in Protein Details mode and second in Peptide mode.
Queue request - Mark this check box if you want the autovalidation to occur after a queued MS/MS search has completed for the selected data directories. That is, mark the check box if you want to do interactive automation. If you want to validate immediately, clear the check box.
Undo Last - Click to remove results of the last autovalidation you performed for the data set(s) you selected.
Clear All - Click to remove results of all autovalidations for the data set(s) you selected.
Save As - Click to save current autovalidation settings in a parameter file.
Load - Click to load a parameter file that contains settings for autovalidation. For default values, select a parameter file from the Defaults folder.

Data Directories

Click the Select ... button to select a data directory or data directories. See Selecting Data Directories.
Fragmentation mode: Select the mode whose data you intend to use for autovalidation, thus filtering out data from other fragmentation modes. This lets you set different score thresholds for different fragmentation modes to enable more convenient integrated processing of data with a mixture of fragmentation modes in the same directory. Agilent no longer supports the "MIX" Instrument selections because their purpose is now met with the Fragmentation mode capability.

All - default selection; do not change if you do not intend to differentiate scoring based on fragmentation modes; use for Agilent Q-TOF and other instruments that only acquire CID.
CID only - Agilent Q-TOF and ion trap
ETD only - Agilent ion trap
HCD only - ThermoFinnigan
PQD only - ThermoFinnigan

Search result files: Modify this list if you want to summarize only a subset of the files in the data directory. Wildcards (*) are supported. To see the names of your search result files, look in the results_mstag subdirectory under the directory where you placed your raw files. This list now includes *.spo files.

Validation Strategy/Mode

For an introductory explanation of the strategy/mode selections, see Strategies/Modes. Select from one of three strategies for autovalidating proteins and peptides in the search results and then select a mode associated with the strategy:

Fixed thresholds - Select if you intend to autovalidate using the fixed Score Threshold, %SPI Threshold and the Rank 1-2 score Threshold in the Protein/Peptide Rules table. If you choose this option you can also choose to calculate a False Discovery Rate (FDR) for the autovalidation for either of the two available modes. Modes available for this strategy are Protein details and Peptide. You can set up workflow automation with parameter files for Autovalidation - Protein details and Autovalidation - Peptide. Select Protein details first and save the parameter file; then select Peptide and save the parameter file.

Protein details - In this mode, the program summarizes results by protein, and considers all the peptides that belong to a given protein. Using the default scoring, individual peptides must have scores greater than 6 to 12 (depending on charge state), and the cumulative protein score must be greater than 20. By default, the %SPI, a measure of how much of your extracted spectrum is explained by the database result, should be greater than 60 to 90, depending on charge state and score. A lower value may produce more false positives, but they can be reviewed in Protein/Peptide Summary.
Peptide - In this mode, the program summarizes results by peptide. Even if it finds only a single peptide corresponding to a protein, it will validate the corresponding search results provided that the peptide score is high enough. Using the default scoring, individual peptides must have scores greater than 11 to 15 (depending on charge state), with %SPI greater than 60 to 70 (also depending on charge state). This score threshold is higher than in the Protein details mode, where you have the additional assurance of knowing you have identified more than one peptide per protein. The chance of false positives increases at higher charge states, so it is a good idea to set higher score requirements for higher charge states.

Auto thresholds - Select if you intend to autovalidate by optimizing the score and delta R1-R2 thresholds to reach a specified target FDR. This selection also automatically calculates an FDR. Modes available for this strategy are Peptide, Protein Polishing, and VM site Polishing.

Peptide - For the Auto thresholds selection, this mode summarizes results by peptide but instead of using a rule set with fixed thresholds, automatically optimizes the thresholds until the target FDR specified is reached. Validate using this mode first, then select the Protein Polishing mode. Note that the default value is 1.2%. This means that the final calculated FDR with all of the charge states is closer to 1% when you set the target FDR for each charge state to 1.2%.
Protein Polishing and VM site Polishing - These modes polish in the sense that they only consider PSMs that have already been marked as valid via previous rounds of automated or manual validation. The modes were specifically built with the intent of being used after peptide mode has been applied to optimize score thresholds. Protein Polishing mode allows for a specified FDR to be reached at the protein level. A false protein is considered to be one composed entirely of distinct peptides with delta Fwd-Rev score <= 0. It enables you to be aggressive with peptide level FDR thresholds and then come back and remove protein 1-2 hit wonders. The final FDR levels at the protein and peptide level can always be calculated using the Tool Belt search statistics page.
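
The false-protein definition above can be sketched in Python. The definition itself (every distinct peptide has delta Fwd-Rev score <= 0) comes from the text; the function names, the input layout, and the use of false proteins divided by all proteins as the FDR estimate are assumptions made for illustration.

```python
def is_false_protein(fwd_rev_deltas):
    """True when every distinct peptide has delta Fwd-Rev score <= 0."""
    return len(fwd_rev_deltas) > 0 and all(d <= 0 for d in fwd_rev_deltas)


def protein_fdr_percent(proteins):
    """Estimate protein-level FDR as a percentage (assumed estimator).

    proteins maps a protein name to the delta Fwd-Rev scores of its
    distinct validated peptides.
    """
    if not proteins:
        return 0.0
    false_count = sum(is_false_protein(d) for d in proteins.values())
    return 100.0 * false_count / len(proteins)


# Made-up example: P2 and P4 consist only of peptides with delta <= 0.
example_proteins = {
    "P1": [3.2, 1.1],
    "P2": [-0.5],
    "P3": [2.0, -1.0],
    "P4": [0.0, -2.0],
}
print(protein_fdr_percent(example_proteins))  # 50.0
```

Note that P3 is not counted as false, because one of its two distinct peptides has a positive delta Fwd-Rev score.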

Auto thresholds - discriminant - Select if you intend to autovalidate by using discriminant scores to reach a specified target FDR. Be sure to enable discriminant scoring in the MS/MS Search before using this strategy. You begin with the Peptide mode, then use the Protein Polishing mode.

Peptide - For the Auto thresholds - discriminant selection, this mode summarizes results by peptide by calculating the global peptide FDR or local peptide FDR. The program adjusts the validation thresholds for peptide discriminant score until it meets the %FDR you specify.
Protein Polishing - See explanation above.

Validation Parameters: Fixed Thresholds

Protein details mode

Minimum protein score: Set the cumulative protein score (adds scores entered in Protein Rules section) that must be met for automatic validation
Group proteins across all directories - In the Protein details mode, allows peptides from multiple directories of data files to contribute to protein score. Mark this check box if you placed your data files from a given sample into multiple subdirectories.
Minimum number of directories a protein group is observed in - In the Protein details mode, specifies the lowest number of directories in which a protein group must be observed for a validation to occur. The greater the number, the better confidence you have in the identification.
Minimum protein score - In the Protein details mode, specifies the lowest score required for a validation to occur.
Calculate FDR using reversed hits - Mark this check box if you marked the check box for Calculate reversed database scores in the MS/MS Search form, and now you want to use the reversed database scores to calculate a false discovery rate (FDR). See Reversed Database Search.

If you mark this check box, you cannot mark the check box for Fwd - Rev Score Threshold (under Protein Rules) and vice versa. You must mark this check box if you want to calculate a false discovery rate in the Tool Belt.

Min Sequence Length - Specify the minimum length of the sequence for which a validation will occur. Longer amino acid sequences provide better confidence.

Filtering

Choose one of these two options:

None - Click this radio button to turn off filtering
Fixed Range for all runs

Filter precursor mass error - Click this radio button to exclude from validation peptides whose precursor mass errors are below or above the range of values you enter. Then type the Low and High mass error.

Protein Rules

These rules permit validation of proteins that match specified criteria.

Precursor Charge - establishes the charge state for which the rule applies
Score Threshold - lowest score for which peptides are validated
% SPI Threshold - lowest Scored Peak Intensity (SPI) for which peptides are validated. SPI is a measure of how much of your extracted spectrum is explained by the database match.
Fwd - Rev Score Threshold - minimum difference between forward and reversed search scores for which peptides are validated. You cannot mark this check box if the check box for Calculate FDR using reversed hits is marked.
Rank 1-2 Score Threshold - minimum difference between the scores of the top and second highest scoring database hit for which peptides are validated

Peptide mode

Calculate FDR using reversed hits - Mark this check box if you marked the check box for Calculate reversed database scores in the MS/MS Search form, and now you want to use the reversed database scores to calculate a false discovery rate (FDR). See Reversed Database Search.

If you mark this check box, you cannot mark the check box for Fwd - Rev Score Threshold (under Peptide Rules) and vice versa. You must mark this check box if you want to calculate a false discovery rate in the Tool Belt.

Min Sequence Length - Specify the minimum length of the sequence for which a validation will occur. Longer amino acid sequences provide better confidence.
Required AAs: Validates peptides only if they contain the required amino acid(s). To disable, select any. See Amino Acid Filtering.
Disallowed AAs: Peptides are not validated if they contain disallowed amino acid(s). To disable, select none. See Amino Acid Filtering.

Filtering

Use the settings for Automatic variable range for each run when your runs contain peptides with very different values for these parameters. The program calculates a range of expected values based on the amino acid sequences of the peptides, and filters out those peptides whose parameter values are above or below the set percentile range. Use the settings for Fixed range for all runs when your runs contain peptides most of whose parameter values lie within a similar range, or when you have only one run.

Precursor mass error filter - You can make only one choice from the options below:

None (ppm) - Click to turn off the following two filters:
Auto precursor mass error - Click to exclude from validation any peptides whose precursor mass errors are estimated to be below or above a set percentile range of values.
Fixed precursor mass error - Click to exclude from validation peptides whose precursor mass errors are below or above the range of values you enter. Then type the Low and High mass error.

Solution Charge/peptide pI filters - You can make only one choice from the options below:

None (SC/pI) - Click to turn off the following four filters:
Auto SCX Solution Charge, pH3 - Click to exclude from validation any peptides whose Strong Cation Exchange charges at pH3 are estimated to be below or above a set percentile range of values (calculated from the amino acid sequence).
Auto OGE/IEF peptide pI - Click to remove peptides whose Off-Gel Electrophoresis/IsoElectric Focusing isoelectric points are estimated to be below or above a set percentile range of values.
Fixed Solution Charge - Click to exclude from validation any peptides whose Strong Cation Exchange charges at pH3 are below or above the range of values you enter. Then type values for the Low and High solution charge.
Fixed peptide pI - Click to exclude peptides whose Off-Gel Electrophoresis/IsoElectric Focusing isoelectric points are below or above the range of values you enter. Then type values for the Low and High peptide pI.

Peptide Rules

These rules permit validation of peptides that match specified criteria. Note that there are only five rules, whereas Protein Rules have six.
The score requirements are more stringent in peptide mode, and for peptides of higher charge states.

Precursor Charge - establishes the charge state for which the rule applies
Score Threshold - lowest score for which peptides are validated
% SPI Threshold - lowest Scored Peak Intensity (SPI) for which peptides are validated. SPI is a measure of how much of your extracted spectrum is explained by the database match.
Fwd - Rev Score Threshold - minimum difference between forward and reversed search scores for which peptides are validated. You cannot mark this check box if the check box for Calculate FDR using reversed hits is marked.
Rank 1-2 Score Threshold - minimum difference between the scores of the top and second highest scoring database hit for which peptides are validated
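
A per-charge rule check of this kind can be sketched as follows. The threshold values below are hypothetical and are not the shipped defaults; the inclusive comparisons and the choice to reject charge states that have no rule are also assumptions. The Fwd - Rev Score Threshold is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    score: float       # Score Threshold
    spi: float         # % SPI Threshold
    rank_delta: float  # Rank 1-2 Score Threshold


# Hypothetical values for illustration only; not the shipped defaults.
PEPTIDE_RULES = {
    1: Rule(score=11.0, spi=60.0, rank_delta=2.0),
    2: Rule(score=11.0, spi=60.0, rank_delta=2.0),
    3: Rule(score=13.0, spi=70.0, rank_delta=2.0),
}


def passes_rule(charge, score, spi, rank_delta, rules=PEPTIDE_RULES):
    """Return True when a PSM meets every threshold for its charge state."""
    rule = rules.get(charge)
    if rule is None:
        return False  # assumption: no rule for this charge, do not validate
    return (
        score >= rule.score
        and spi >= rule.spi
        and rank_delta >= rule.rank_delta
    )


print(passes_rule(2, score=12.0, spi=65.0, rank_delta=2.5))  # True
print(passes_rule(3, score=12.0, spi=75.0, rank_delta=2.5))  # False
```

The second call fails only because the hypothetical rule for charge 3 asks for a higher score, mirroring the advice to be more stringent at higher charge states.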

Validation Parameters: Auto Thresholds

Peptide mode

Optimize Score & R1-R2 score thresholds with max FDR - Type the %FDR value that you do not want to exceed; it is the target for optimizing the score and R1-R2 score thresholds. As a starting point, the score and R1-R2 score are each used separately to determine maximum thresholds. Combinations of the two are then explored to maximize the number of peptide spectrum matches while meeting the FDR threshold.
Select whether to optimize across each: LC run or Directory. The threshold optimization is done separately for each precursor charge state and done after applying all the filters described below.
Precursor charge range - Type the range of precursor charges the program will run through for optimization. This is helpful for setting different parameters for different precursor charge state ranges.
Min & Max Sequence Length - Select the minimum and maximum sequence length for valid peptides. Short peptides are often not unique in the proteome and can occur in multiple unrelated proteins. A typical minimum length filter for a human proteomics experiment is 7. When working in xenograft systems (human tumor grown in mouse) or other systems with a larger search space, one should increase the filter to 8. A max length filter is only intended for systems where one is focused on peptides of similar length and might want to set different parameters for different length ranges.
Min Backbone Cleavage score (BCS) - Select the minimum BCS for valid peptides. This enables enforcing uniformly higher minimum sequence coverage for each PSM, and will have the effect of validating low scoring peptides with reasonable fragmentation and excluding ones with higher scores from multiple ion types at only a few peptide backbone positions. Implementation of this filter was motivated by HLA antigens, which are peptides of length 8-12 AAs that are searched in No enzyme mode and thus have a very large search space.
Required AAs: Validates peptides only if they contain the required amino acid(s). To disable, select any. See Amino Acid Filtering.
Disallowed AAs: Peptides are not validated if they contain disallowed amino acid(s). To disable, select none. See Amino Acid Filtering.

Filtering

Use the settings for Automatic variable range for each run when each run can be expected to contain different medians or ranges for instrument performance or peptide properties. Use the settings for Fixed range for all runs when all the runs contain values within a similar range and you have foreknowledge of what that range should be.

Precursor mass error filter - You can make only one choice from the options below:

None (ppm) - Click to turn off the following two filters:
Auto precursor mass error - Click to exclude from validation any peptides whose precursor mass errors are estimated to be above or below 4 standard deviations from the median.
Fixed precursor mass error - Click to exclude from validation peptides whose precursor mass errors are below or above the range of values you enter. Then type the Low and High mass error.
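
The Auto precursor mass error option can be pictured with a short sketch that keeps only values within 4 standard deviations of the median. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say how the standard deviation is computed.

```python
import statistics


def auto_mass_error_window(errors_ppm, n_sd=4.0):
    """Return (low, high) limits centered on the median mass error."""
    median = statistics.median(errors_ppm)
    sd = statistics.pstdev(errors_ppm)  # assumption: population SD
    return median - n_sd * sd, median + n_sd * sd


def filter_by_mass_error(errors_ppm, n_sd=4.0):
    """Keep the mass errors that fall inside the automatic window."""
    low, high = auto_mass_error_window(errors_ppm, n_sd)
    return [e for e in errors_ppm if low <= e <= high]


# Made-up run: thirty well-calibrated matches and one gross outlier.
errors = [0.0] * 30 + [100.0]
kept = filter_by_mass_error(errors)
print(len(errors), len(kept))  # 31 30
```

Centering on the median rather than the mean keeps the window from being dragged toward the outlier it is meant to exclude.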

Solution Charge/peptide pI filters - You can make only one choice from the options below:

None (SC/pI) - Click to turn off the following four filters:
Auto SCX Solution Charge, pH3 - Click to exclude from validation any peptides whose theoretical Strong Cation Exchange charges at pH3 are above or below thresholds. The thresholds correspond to 2 standard deviations from the median, with integer rounding after applying the 2 std deviations, max: up to the next integer charge value, or min: down to the next integer charge value.
Auto OGE/IEF peptide pI - Click to remove peptides whose Off-Gel Electrophoresis/IsoElectric Focusing theoretical isoelectric points are above or below thresholds. The thresholds correspond to 2 standard deviations from the median, with integer rounding after applying the 2 std deviations, max: up to the next integer pI value, or min: down to the next integer pI value.
Fixed Solution Charge - Click to exclude from validation any peptides whose Strong Cation Exchange charges at pH3 are below or above the range of values you enter. Then type values for the Low and High solution charge.
Fixed peptide pI - Click to exclude peptides whose Off-Gel Electrophoresis/IsoElectric Focusing isoelectric points are below or above the range of values you enter. Then type values for the Low and High peptide pI.
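
The integer rounding described for the Auto SCX Solution Charge and Auto OGE/IEF peptide pI options can be sketched as follows: the upper limit is rounded up and the lower limit is rounded down after the 2 standard deviations are applied. The function name is hypothetical and the population standard deviation is an assumption.

```python
import math
import statistics


def auto_integer_window(values, n_sd=2.0):
    """Return (low, high) integer limits around the median.

    The lower limit is rounded down and the upper limit is rounded up
    after n_sd standard deviations have been applied.
    """
    median = statistics.median(values)
    sd = statistics.pstdev(values)  # assumption: population SD
    low = math.floor(median - n_sd * sd)
    high = math.ceil(median + n_sd * sd)
    return low, high


# Made-up theoretical solution charges for the peptides in one run.
print(auto_integer_window([1.0, 2.0, 2.0, 3.0]))  # (0, 4)
```

In this example the unrounded limits are roughly 0.59 and 3.41, so rounding outward widens the accepted range to whole charge values.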

Protein Polishing mode

The Protein Polishing mode can only be used after validating in Peptide mode.

In Protein Polishing mode the intention is to reach a target protein FDR and eliminate unreliable protein-level identifications, particularly low scoring proteins that are detected either by single peptides (so-called one-hit wonders) or proteins infrequently detected when multiple experiments are being combined across multiple data directories. These goals are achieved by unvalidating PSMs previously validated in a peptide mode autovalidation step. This allows one to autovalidate marginal PSMs during peptide-level autovalidation, yet keep the protein FDR under control by subsequently unvalidating the marginal PSMs that cause trouble at the protein level. Removal of low quality PSMs should also result in reducing the peptide-level FDR that will be recalculated via Quality Metrics after all autovalidation steps are complete. Consequently, autovalidation using a 2-step approach of peptide mode followed by protein polishing mode generally results in increased sequence coverage of the validated proteins as compared to a 1-step approach of peptide-level autovalidation with a target FDR threshold lowered to be equivalent to what is reached after a combined two-step approach.

Minimum protein score: Set the cumulative protein score that must be met for automatic validation
Group proteins across all directories - Allows peptides from multiple directories of data files to contribute to protein score. Mark this check box if you placed your data files from a given sample into multiple subdirectories.
Protein grouping method: (More detailed description on protein grouping is available)
The selected method determines whether the thresholds for minimum number of directories and protein score are applied at the level of protein group (unexpand subgroups method), or at the level of protein subgroup (expand subgroups, top uses shared). The latter choice will tend to remove isoforms/family members when the distinct peptide support for an isoform/family member is weak (1 low scoring peptide, non-recurrent in multiple experiments).
Method for applying combined thresholds of Protein Score and Minimum number of directories
- Retain proteins above both thresholds. This option is more strict and will eliminate not only one peptide/protein observations, but also one experiment/protein observations.
- Retain proteins above either threshold. This option is less strict and intended to retain one peptide/protein observations if they are recurrent (observed in multiple experiments). The primary value of this option is when being applied to multiple data directories at once.
Minimum number of directories a protein group is observed in: Set the minimum number of directories required for a protein to be identified in order to be considered valid. When multiple experiments are being combined across multiple data directories this feature allows exclusion of low scoring non-recurrently observed proteins, which can be expected to be more likely to be false-positive identifications.
Automatically raise minimum protein score to yield maximum protein FDR: ___% - Type the % FDR you do not want the program to exceed as it automatically raises the minimum protein score.
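
The two ways of combining the Protein Score and Minimum number of directories thresholds can be sketched as a simple predicate. The function and parameter names are hypothetical, and the inclusive comparisons are an assumption.

```python
def retain_protein(score, n_dirs, min_score, min_dirs, mode="both"):
    """Decide whether a protein survives polishing.

    mode="both"   keeps proteins that meet the score AND directory thresholds.
    mode="either" keeps proteins that meet the score OR directory threshold.
    """
    score_ok = score >= min_score
    dirs_ok = n_dirs >= min_dirs
    if mode == "both":
        return score_ok and dirs_ok
    if mode == "either":
        return score_ok or dirs_ok
    raise ValueError("mode must be 'both' or 'either'")


# A low scoring protein seen in three directories (made-up numbers).
print(retain_protein(10.0, 3, min_score=20.0, min_dirs=2, mode="both"))    # False
print(retain_protein(10.0, 3, min_score=20.0, min_dirs=2, mode="either"))  # True
```

The example shows the case the "either" option is meant for: a one-peptide protein with a low score is kept because it recurs across experiments.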

VM site polishing mode

The VM site polishing mode can only be used after validating in Peptide mode.

In VM site polishing mode the intention is to eliminate unreliable VM site-level identifications, particularly low scoring VM sites that are only detected as low scoring peptides that are infrequently detected when multiple experiments are being combined across multiple data directories. This goal is achieved by unvalidating PSMs previously validated in a peptide mode autovalidation step. This allows one to autovalidate marginal PSMs during peptide-level autovalidation with the potential to increase sensitivity and diminish the number of missing values for VM site level quantitation across multiple experiments. Subsequent VM site polishing will then unvalidate marginal PSMs that are non-recurrent. Removal of low quality PSMs should also result in reducing the peptide-level FDR that will be recalculated via Quality Metrics after all autovalidation steps are complete. Consequently, autovalidation using a 2-step approach of peptide mode followed by VM site polishing mode generally results in fewer missing values across multiple experiments as compared to a 1-step approach of peptide-level autovalidation with a target FDR threshold lowered to be equivalent to what is reached after a combined two-step approach.

Group proteins across all directories - Allows peptides from multiple directories of data files to contribute to protein score. Mark this check box if you placed your data files from a given sample into multiple subdirectories.
Protein grouping method: (More detailed description on protein grouping is available)
The selected method determines whether the thresholds for minimum number of directories and protein score are applied at the level of protein group (unexpand subgroups method), or at the level of protein subgroup (expand subgroups, top uses shared). The latter choice will tend to remove isoforms/family members when the distinct peptide support for an isoform/family member is weak (1 low scoring peptide, non-recurrent in multiple experiments).
Method for applying combined thresholds of VM site score and Minimum number of directories
- Retain proteins above both thresholds. This option will eliminate not only low scoring VM sites, but also observations seen in only one experiment. This option is perhaps overly strict and is expected to be infrequently used. Removing it from the UI was contemplated, but it was left in to maintain consistency with protein polishing mode.
- Retain proteins above either thresholds. This option is less strict and is intended to retain low scoring VM site observations if they are recurrent (observed in multiple experiments). This method is expected to be the default, typically used option.
Minimum number of directories a VM site is observed in: Set the minimum number of directories required for a VM site to be identified in order to be considered valid. When multiple experiments are being combined across multiple data directories this feature allows exclusion of low scoring non-recurrently observed VM sites, which can be expected to be more likely to be false-positive identifications.
Minimum VM site score: Set the minimum VM site score (peptide id score for the representative peptide amongst all the PSM's that contain the VM site) that must be met for automatic validation.
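
The difference between the two methods for applying the combined thresholds can be sketched in a few lines of Python. This is only an illustration of the logic described above, not the program's actual code: the function name, its parameters, and the use of "greater than or equal to" for meeting a threshold are assumptions.

```python
def retain_vm_site(site_score, n_directories, min_score, min_directories, method="either"):
    """Decide whether a VM site survives polishing (hypothetical helper).

    site_score      -- VM site score of the representative peptide
    n_directories   -- number of data directories the site was observed in
    method          -- "both" (strict) or "either" (less strict)
    """
    passes_score = site_score >= min_score
    passes_dirs = n_directories >= min_directories
    if method == "both":
        return passes_score and passes_dirs
    if method == "either":
        return passes_score or passes_dirs
    raise ValueError("method must be 'both' or 'either'")


# A low scoring but recurrent site: kept by "either", removed by "both".
low_but_recurrent = dict(site_score=5.0, n_directories=3, min_score=8.0, min_directories=2)
kept_either = retain_vm_site(method="either", **low_but_recurrent)
kept_both = retain_vm_site(method="both", **low_but_recurrent)
```

With the "either" method, a site only needs to be high scoring or recurrent, which is why it retains low scoring sites that are seen in multiple experiments.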

Validation Parameters: Auto Thresholds - Discriminant

These parameter fields change depending on the strategy and mode you select. See the explanations above for each strategy and its associated modes. Below are the parameter fields for the Auto thresholds - discriminant strategy.
This strategy uses discriminant scores to validate the peptides found in the MS/MS search. See Discriminant Scoring for details.

Peptide mode

Whether you choose the Global FDR mode or the Local FDR mode, first make sure that you did the MS/MS Search with the check box marked for Calculate reversed database scores. The FDR calculations use the results from these calculations. You must also make sure that the results of any previous autovalidations or manual validations are deleted.

Global FDR - Type a number for the %FDR that is acceptable for your study. In this mode, the software calculates a global peptide FDR, which is the percentage of all the peptide identifications that are likely to be false. It is a calculation for a collection of peptides across the data set you are validating. The program looks at only distinct peptides, so if multiple spectra give the same peptide identification, the program uses only the one with the highest discriminant score. The program adjusts the validation thresholds for discriminant score until it meets the %FDR that you typed.
Local FDR - Type a number for the %FDR that is acceptable for your study. In this mode, the software calculates a local peptide FDR, which it obtains from a curve that it fits to the data. While the global FDR focuses on a collection of peptides, the local FDR answers the question, "Does this peptide identification increase the FDR? If I validate this identification, how many additional false positives am I likely to get?" To meet the %FDR that you typed, the program reduces the number of peptides that it validates, by removing those where the peptide identification is less certain.
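
The idea behind the Global FDR mode can be illustrated with a minimal Python sketch. It assumes a simple target-decoy estimate (reversed-database hits divided by forward hits), which is a common convention but is not confirmed here as the formula Spectrum Mill uses; the function name and the input layout are also hypothetical. The sketch first collapses the PSMs to distinct peptides, keeping only the highest discriminant score for each, and then finds the lowest score threshold at which the estimated FDR still meets the target.

```python
def global_fdr_threshold(psms, max_fdr_percent):
    """Return the lowest score threshold whose estimated global FDR meets the target.

    psms -- iterable of (peptide, discriminant_score, is_reversed_hit)
    Assumption: FDR (%) = 100 * reversed hits / forward hits above the threshold.
    Returns None if no threshold meets the target.
    """
    # Keep only the highest scoring PSM for each distinct peptide.
    best = {}
    for peptide, score, is_reversed in psms:
        if peptide not in best or score > best[peptide][0]:
            best[peptide] = (score, is_reversed)

    threshold = None
    forward = reverse = 0
    # Walk down the score-sorted list, tracking the running FDR estimate.
    for score, is_reversed in sorted(best.values(), key=lambda x: x[0], reverse=True):
        if is_reversed:
            reverse += 1
        else:
            forward += 1
        if forward and 100.0 * reverse / forward <= max_fdr_percent:
            threshold = score
    return threshold


example_psms = [
    ("AAA", 10.0, False), ("AAA", 8.0, False), ("BBB", 9.0, False),
    ("CCC", 7.0, True), ("DDD", 6.0, False), ("EEE", 5.0, False),
]
```

In this made-up example, a 10% target stops the threshold at a score of 9.0 (above the single reversed hit), while a 30% target lets it drop to 5.0.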

Protein Polishing mode

In Protein Polishing mode the intention is to achieve a target protein FDR and increase the sequence coverage of validated proteins. The first objective is achieved by unvalidating previously validated peptides. This capability allows you to autovalidate marginal peptides during peptide autovalidation; yet the protein FDR is kept under control by unvalidating the marginal peptides that cause trouble at the protein level. The second intention is achieved by recalculating the peptide FDR only on the subset of peptides from validated proteins. This generally results in increased sequence coverage of the validated proteins.

Minimum protein score: Set the cumulative protein discriminant score that must be met for automatic validation.
Group proteins across all directories - Allows peptides from multiple directories of data files to contribute to protein score. Mark this check box if you placed your data files from a given sample into multiple subdirectories.
Minimum number of directories a protein group is observed in: Set the minimum number of directories required for a protein to be identified in order to be considered valid.
Automatically raise minimum protein score to yield maximum protein FDR: ___% - Type the % FDR you do not want the program to exceed as it automatically raises the minimum protein discriminant score.
Peptide FDR for validated proteins - Mark this check box and type a percentage acceptable for your study if you want peptides to be validated based on a global FDR for only the valid proteins. It is analogous to the Protein details approach for Fixed Thresholds, but based on FDR.

To Report Quality Metrics and FDR

This utility enables two functions:

Calculation of the final FDR after all rounds of manual and auto validation have been performed

False discovery rate (FDR) is an important measure of the validity of results and is a requirement for publication in some journals.
Note: For any of the FDR calculations to be functional, searches must have been performed in MS/MS Search, with the check box for Calculate reversed database scores enabled. (This is the default setting.)

Reporting of metrics related to the quality of peptide separation, chromatography, and mass spectrometry associated with each of the underlying LC-MS/MS experiments. To learn more about these metrics, refer to Rudnick PA, Clauser KR, Kilpatrick LE, et al., "Performance metrics for liquid chromatography-tandem mass spectrometry systems in proteomics analyses", Mol Cell Proteomics. 2010 Feb;9(2):225-41 http://www.ncbi.nlm.nih.gov/pubmed/19837981

To use these capabilities:

On the Spectrum Mill home page, under Result Summary Tools, click Quality Metrics & FDR.
Mark the check box(es) to give the results you need.
Select the Data Directories for which you want to report FDR and search statistics. You may select one or more data directories. They must have sequential numbers at the end. For example, the names could be Pfu-OGE-01.d, Pfu-OGE-02.d, ... Pfu-OGE-12.d.
Click the Report button.
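
If you want to check a set of directory names before selecting them, the requirement for sequential numbers at the end can be sketched as follows. This is a rough illustration only: the exact rule the program enforces is not documented here, so the pattern (a common prefix, a run of digits, and an optional .d extension) and the function name are assumptions.

```python
import re

# Assumed naming pattern: <prefix><digits> with an optional ".d" extension.
_NAME_PATTERN = re.compile(r"^(?P<prefix>.*?)(?P<num>\d+)(?P<ext>\.d)?$")


def has_sequential_suffixes(names):
    """Return True if all names share one prefix and end in consecutive numbers."""
    if not names:
        return False
    prefixes = set()
    numbers = []
    for name in names:
        match = _NAME_PATTERN.match(name)
        if match is None:
            return False
        prefixes.add(match.group("prefix"))
        numbers.append(int(match.group("num")))
    numbers.sort()
    expected = list(range(numbers[0], numbers[0] + len(numbers)))
    return len(prefixes) == 1 and numbers == expected


ok = has_sequential_suffixes(["Pfu-OGE-01.d", "Pfu-OGE-02.d", "Pfu-OGE-03.d"])
gap = has_sequential_suffixes(["Pfu-OGE-01.d", "Pfu-OGE-03.d"])
```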

Checking the Excel Export Checkbox will cause the reports to be written to the first directory selected. The report for file-level (LC-MS/MS run) metrics will be written to a file called qualityMetricsExportFile.1.ssv. Directory-level metrics will be written to a file called qualityMetricsExportDir.1.ssv.

Checking the box for Update Log file (2 directories up) with file level metrics will cause file-level metrics to be appended to a pre-existing file located 2 directory levels up from the first selected directory. This feature was created to keep an ongoing log of quality metrics for a particular instrument. The file to be appended to should be called qualityMetricsExportFile.Cady.ssv (alter the Cady portion of the filename to match the relevant instrument name). If the checkbox is not visible on the form, it can be enabled for a website via the switch variable (enableUpdateLogFileCheckbox=true) in millhtml/SM_js/SMcustomFlags.js.

The following describes the results you can show:

Yields (spectra collected, filtered, validated)

MS/MS spectra collected: Number of MS/MS spectra in the raw data file.
MS/MS spectra merged: Number of MS/MS spectra that result from merging by the Data Extractor
MS/MS spectra filtered: Number of MS/MS spectra exported by the Data Extractor program after filtering by spectral quality.
MS/MS spectra valid: Number of MS/MS spectra for which MS/MS Search results were validated.
Collection Yield V/C (%): Number of MS/MS spectra interpreted and validated divided by number of MS/MS spectra collected, expressed as a percentage
Validation Yield V/F (%): Number of MS/MS spectra interpreted and validated divided by number of MS/MS spectra filtered, expressed as a percentage

It is typical that not all spectra will be interpreted and validated. If your Collection Yield seems particularly low, there may have been an unusually high number of noisy spectra in your analysis. Perhaps you used a low threshold for data acquisition, or maybe there was a high instrument background. In these cases, the relative number of spectra that are picked by the Data Extractor will be low.

Both the Collection Yield and the Validation Yield will reflect to some degree how much time you spent processing the data, via homology searches, broader databases, etc. In general, processing is complete when sufficient information has been extracted from the data to meet the experimental goals.
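
As a worked illustration of the two yields defined above, the following Python sketch computes them from the spectrum counts. The function name and the example counts are made up for illustration.

```python
def spectral_yields(collected, filtered, valid):
    """Return (Collection Yield V/C, Validation Yield V/F) as percentages."""
    collection_yield = 100.0 * valid / collected
    validation_yield = 100.0 * valid / filtered
    return collection_yield, validation_yield


# Hypothetical run: 20,000 spectra collected, 12,000 passed filtering, 6,000 validated.
v_over_c, v_over_f = spectral_yields(collected=20000, filtered=12000, valid=6000)
```

Here the Collection Yield is 30% and the Validation Yield is 50%; a large gap between the two indicates that many collected spectra were removed by the Data Extractor.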

FDR Metrics (spectra, peptide, protein)

FDR at the peptide & spectra level (from valid hits)
FDR at the protein level
- Group proteins across all directories - When calculating the FDR, the software allows peptides from multiple directories of data files to contribute to the FDR for the protein. Mark this check box if you placed your data files from a given sample into multiple subdirectories.
- Grouping method: Determines how proteins are grouped for the FDR calculations.
  - 1 shared peptide - When at least one peptide sequence >8 residues long is contained in multiple protein entries in the sequence database, the software groups the proteins together and then reports the highest-scoring one and its accession number.
  - 1 shared, expand subgroups - The software initially groups the proteins as described for 1 shared peptide. In some cases when the protein sequences are grouped in this manner, there are distinct peptides that uniquely represent a lower-scoring member of the group (isoforms and family members). When you choose 1 shared peptide, expand subgroups, more than one member of the group is reported and counted towards the total number of proteins.

Precursor Ion Metrics

Precursor mass error mean (ppm) - Gives the precursor mass error (in ppm), mean and standard deviation for validated spectra. These values are useful for tracking the stability of mass calibration across a set of LC-MS/MS experiments
Precursor charge count (from valid spectra) - Gives the number and percent of validated spectra for each precursor charge. These values can be useful for troubleshooting unexpected variance in digestion completeness, peptide fractionation steps employed prior to LC-MS/MS runs, data dependent acquisition settings, autovalidation settings, or ion source performance.
Precursor Isolation Purity & Averagine Chi2 - A measure of whether only a single precursor was isolated. Poor quality is defined as less than 0.85 Chi-squared versus averagine. Chi-squared is a measure of similarity and averagine is the mass distribution you get if you assume that the peptide is made up of "average" amino acids. The elemental composition of averagine is:
C 4.9384 H 7.7583 N 1.3577 O 1.4773 S 0.0417
(Senko et al, J Am Soc MS 1995 pp. 229-233)
Precursor Acquisition Uncertainty: m/z and z - Reports the number of MS/MS spectra acquired without being assigned a precursor charge by the acquisition control software, and the number of spectra for which the precursor m/z was adjusted post-acquisition by the Spectrum Mill extractor by more than +/- 0.2 m/z.
Precursor Ion Fragmentation Table - Reports several metrics about the extent of fragmentation of PSMs in a dataset. This report was developed with the primary intent of helping to optimize the collision energy setting on Thermo Fisher Orbitrap instruments for TMT and iTRAQ labeled datasets. The following metrics are calculated separately for each precursor charge state and number of labels/peptide:
- Number of PSMs
- Median PSM identification score
- Median backbone cleavage score (BCS)
- Median max sequence tag length
- Median Dissociated Intensity (%)
- Median ratio of intensities of base reporter ion (RI) to base fragment ion (FI) - after peak detection (which includes removal of residual precursor ion related peaks), the most intense reporter ion / the most intense fragment ion in the MS/MS spectrum
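
To make the averagine composition listed above more concrete, the sketch below scales it to a peptide of a given mass. It is an illustration only: the atomic weights are rounded standard values, the simple rounding of atom counts is an assumption, and it does not show how Spectrum Mill derives the isotope distribution or the Chi-squared value.

```python
# Averagine composition per "average" residue, as listed above (Senko et al.).
AVERAGINE = {"C": 4.9384, "H": 7.7583, "N": 1.3577, "O": 1.4773, "S": 0.0417}

# Rounded standard average atomic weights (assumed values for this sketch).
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "S": 32.06}


def averagine_unit_mass():
    """Average mass of one averagine unit (about 111.125 Da)."""
    return sum(AVERAGINE[el] * ATOMIC_WEIGHT[el] for el in AVERAGINE)


def averagine_composition(peptide_mass):
    """Approximate elemental composition of a peptide of the given average mass."""
    units = peptide_mass / averagine_unit_mass()
    return {el: round(count * units) for el, count in AVERAGINE.items()}


# A hypothetical peptide of about 10 averagine units.
composition = averagine_composition(1111.25)
```

The resulting composition is what a theoretical isotope pattern would be computed from before it is compared with the observed precursor isotope cluster.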

MS/MS Interpretation Metrics

Identification Scores - Reports the Median ID Score and the Median SPI(%).
Fragmentation Mode – Gives the percentage of validated MS/MS spectra resulting from each of the fragmentation modes that may have been employed in the LC-MS/MS run (CID, ETD, HCD, etc.)
Variable modification site localization - Reports metrics associated with variable modification site localization.
Select the type of modifications from the list.
Identifiable Spectra, Max tag length - Reports all identifiable spectra with a Maximum tag length (MTL) greater than the indicated value.

MS/MS Spectral Identifiability Metrics

With thresholds for MS/MS Spectral Quality Filtering several subsets of spectra are created and used to calculate several metrics to help understand the identifiability of a dataset. The metrics attempt to measure what portion of the dataset was good quality spectra, what portion of those good spectra became valid identifications, what portion of those good spectra remain to be interpreted, and the relative distribution of spectral quality in each of those portions. The spectral quality thresholds allow the user to craft the definition of "good".

If lots of good spectra are unidentified then one should consider possible causes like problems with cysteine alkylation chemistry, contaminant proteins present that are not in the database, non-specific proteolysis in the sample prior to digestion, and significant presence of unanticipated modifications.

Metrics reported:

Premium Identifiable, I - The total number of filtered spectra, F, passing the spectral quality thresholds.
Valid Premium Identifiable
- VI - The total number of valid spectra passing the spectral quality thresholds.
- VI/V(%) - VI as a percentage of the total number of valid spectra.
Not Valid Premium Identifiable
- NVI - The total number of not valid spectra passing the spectral quality thresholds.
- NVI/I(%) - NVI as a percentage of the total number of identifiable spectra.
Sequence Tag Length (STL) Histograms for the various sets of spectra NVI, VI, I and NV, V, F. For each set of spectra a histogram is constructed with bins for sequence tag length, and each bin counts the number of spectra with that sequence tag length. The histograms are then converted to simple numerical representations by normalizing the counts in each bin to the highest bin in the corresponding primary histogram, I for (NVI, VI, I), and F for (NV, V, F). The highest count is given a value of 9, and all other bins are scaled proportionally from 0 to 9 and rounded to the nearest integer. These normalized values then constitute a number that, when read left to right, is in descending order of sequence tag length. Consequently, these numerical representations of histograms can be put in a tabular display, and when the numerical histograms for multiple subsets of spectra are stacked it is convenient to see the relative distributions of spectral quality between the subsets. When reading a normalized numerical histogram, the bins in sequence tag length order are 54321.0.
Example 1: A dataset that is thoroughly identified.
The quality metrics sequence tag length threshold was > 3. The Data Extractor STL filter was >0.

Set	STL Histogram	Set	STL Histogram
NVI	111000.0	NV	123331.0
VI	11223568000.0	V	1122456630.0
I	11224579000.0	F	1123478962.0
- 4 is most common sequence tag length in the identifiable spectra set, I. 4th position to the left of the decimal point in the I histogram has a value of 9.
- 8/9 of the identifiable spectra with STL 4 were validated. 4th position in the VI histogram has a value of 8.
- 1/9 of the identifiable spectra with STL 4 were not validated. 4th position in the NVI histogram has a value of 1.
- 3 is most common sequence tag length in the filtered spectra set, F. 3rd position in the F histogram has a value of 9.
- 6/9 of the filtered spectra with STL 3 were validated. 3rd position in the V histogram has a value of 6.
- 3/9 of the filtered spectra with STL 3 were not validated. 3rd position in the NV histogram has a value of 3.
- Nearly all the spectra with STL > 7 were validated. Same values in positions 8 to 11 of VI and I histograms, and positions 8-10 of V and F histograms.
Example 2: A dataset with lots of high quality unidentified spectra.
The quality metrics sequence tag length threshold was > 3. The Data Extractor STL filter was >0.

Set	STL Histogram	Set	STL Histogram
NVI	1123576000.0	NV	123668742.0
VI	1123322000.0	V	112221100.0
I	11246898000.0	F	1235789842.0
- 5 is most common sequence tag length in the identifiable spectra set, I. 5th position to the left of the decimal point in the I histogram has a value of 9.
- 2/9 of the identifiable spectra with STL 5 were validated. 5th position in the VI histogram has a value of 2.
- 7/9 of the identifiable spectra with STL 5 were not validated. 5th position in the NVI histogram has a value of 7.
- 4 is most common sequence tag length in the filtered spectra set, F. 4th position in the F histogram has a value of 9.
- 1/9 of the filtered spectra with STL 4 were validated. 4th position in the V histogram has a value of 1.
- 8/9 of the filtered spectra with STL 4 were not validated. 4th position in the NV histogram has a value of 8.
- 3/6 of the identifiable spectra with STL 7 were validated. Ratio of values in position 7 of VI and I histograms.
- 2/5 of the filtered spectra with STL 7 were validated. Ratio of values in position 7 of V and F histograms.
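
The conversion of an STL histogram into its numerical representation, as described above, can be sketched in Python. This is a hedged illustration: the function name is hypothetical, the raw counts are invented so that the output reproduces the I and VI histograms of Example 1, and rounding ties upward and dropping leading zeros are assumptions about details the description leaves open.

```python
def stl_histogram_code(counts, reference_max):
    """Encode bin counts as a digit string such as '11224579000.0'.

    counts        -- counts[n] is the number of spectra with sequence tag length n
                     (index 0 is the STL 0 bin, shown after the decimal point)
    reference_max -- highest bin count in the primary histogram (I or F)
    """
    if reference_max <= 0:
        raise ValueError("reference_max must be positive")
    # Scale each bin to 0-9 relative to the primary histogram's highest bin.
    digits = [min(9, int(9 * c / reference_max + 0.5)) for c in counts]
    # Read left to right in descending order of sequence tag length.
    left = "".join(str(d) for d in reversed(digits[1:])).lstrip("0") or "0"
    return f"{left}.{digits[0]}"


# Invented counts for STL 0..11; the I set is the primary histogram.
i_counts = [0, 0, 0, 0, 90, 70, 50, 40, 20, 20, 10, 10]
vi_counts = [0, 0, 0, 0, 80, 60, 50, 30, 20, 20, 10, 10]
i_code = stl_histogram_code(i_counts, max(i_counts))
vi_code = stl_histogram_code(vi_counts, max(i_counts))
```

Note that the VI histogram is normalized to the highest bin of I rather than to its own highest bin, which is what makes the stacked digit strings directly comparable position by position.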

Peptide Separation Metrics

Chromatography metrics for each run – Gives several metrics for measuring the quality of the chromatography and associated MS data collection. For highest utility, the metrics should be calculated to encompass the continuous middle retention time portion of the elution profile. A value of 80% helps to exclude discontinuous bursts of peptides that elute at the beginning of the run because they are unretained on the column.

Practical Uses:

The reported values for start time, end time and span of the middle portion of the gradient help measure the overall efficiency of the method and dead volume incorporated into the column plumbing.
The gradient shapes help measure the distribution of peptide abundances across the gradient and can be used to troubleshoot gradient delivery by the LC pumps, autosampler sample loop filling/washing, and recovery of peptides from sample handling manipulations.
Peak width in seconds helps measure the chromatographic resolution of the column packing material and acetonitrile gradient.
Median MS1 intensity Trigger Apex helps optimize the acquisition method so that MS/MS spectra are selected closer to the peak apex on average, leading to shorter acquisition times for MS/MS and more peptides identified during the run.
The median and max fill time metrics help optimize data acquisition methods and measure mass spectrometer sensitivity.

Metrics reported:

Start time mid xx% matched spectra in run (min) - mid xx% means the percentage of spectra in the middle portion of the chromatographic range; for example, if 10,000 MS/MS spectra gave IDs in the run, the mid 80% of matched spectra are those between #1001 and #9000 after sorting the spectra by retention time; for this metric Spectrum Mill reports the retention time for #1001. We use this example for each of the metrics described below.
End time mid xx% matched spectra in run (min) - the retention time for #9000 in our example
Time span mid xx% matched spectra in run (min) - the time range between the retention time for #1001 and that for #9000
Gradient Shape mid xx% filtered spectra in run - To measure the distribution of peptide abundances across the gradient, this metric attempts to provide a numeric representation of the shape of the MS1 Total Ion Chromatogram using the XICs of all precursor ions which yielded MS/MS spectra passing the spectral quality filtering done with the Data Extractor. To do this, a histogram is constructed by splitting the time span of the mid xx% matched spectra into 7 equal time bins. The precursor intensity in each bin is summed. The histogram is then converted to a simple numerical representation by normalizing the intensity in each bin to the highest bin. The highest intensity is given a value of 9, and all other bins are scaled proportionally from 0 to 9 and rounded to the nearest integer. These normalized values then constitute a number that, when read left to right, is in retention time order.
- Example 1: 9999999 ideal gradient, peptides evenly distributed
- Example 2: 8999751 diminished late-eluting hydrophobic peptides
- Example 3: 1359988 diminished early-eluting hydrophilic peptides
Gradient Shape mid xx% matched spectra in run - same as above, except the histogram is constructed using only the precursor ion XICs of valid MS/MS spectra.
Median MS1 peak width mid xx% matched spectra (sec) - average chromatographic peak width of the precursor ion chromatograms that gave rise to the subset of the middle 8,000 matches with a precursor ion Chi2 metric > 0.85
Total precursor XIC mid xx% matched spectra in run - total abundance for 8000 precursor XICs
Median MS1 intensity Trigger Apex mid xx% matched spectra (%) - On average for the 8000 identified peptides, the ratio (in percent) of the precursor ion abundance in the MS1 spectrum which triggered acquisition of its MS/MS spectrum to the abundance of the precursor ion in the MS1 spectrum at the ion's chromatographic apex.
Median MS2 fill time mid xx% matched spectra(msec) - the median ion fill time of the valid MS/MS spectra.
Max MS2 fill time mid xx% matched spectra(msec) - the maximum ion fill time of the valid MS/MS spectra.
Spectra Reaching max MS2 fill time mid xx% matched spectra(%) - proportion of the valid MS/MS which had a maximum ion fill time.
Spectra Reaching max MS2 fill time mid xx% filtered spectra(%) - proportion of the filtered MS/MS which had a maximum ion fill time.
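
The mid xx% window and the Gradient Shape digits described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the program's implementation: the function names are hypothetical, the percentage is taken as a whole number, ties are rounded upward, and the example intensities are invented so that the output matches Example 3.

```python
def mid_window_ranks(n_spectra, mid_percent):
    """1-based ranks (after sorting by retention time) bounding the mid xx% window."""
    skipped_each_side = n_spectra * (100 - mid_percent) // 200
    return skipped_each_side + 1, n_spectra - skipped_each_side


def gradient_shape(times, intensities, t_start, t_end, n_bins=7):
    """Digit string (one digit per time bin) describing summed precursor intensity."""
    if t_end <= t_start:
        raise ValueError("t_end must be greater than t_start")
    width = (t_end - t_start) / n_bins
    sums = [0.0] * n_bins
    for t, intensity in zip(times, intensities):
        if t_start <= t <= t_end:
            # A spectrum exactly at t_end falls in the last bin.
            index = min(int((t - t_start) / width), n_bins - 1)
            sums[index] += intensity
    top = max(sums)
    if top <= 0:
        return "0" * n_bins
    return "".join(str(int(9 * s / top + 0.5)) for s in sums)


first_rank, last_rank = mid_window_ranks(10000, 80)

# One invented precursor per bin over a 0-7 minute window.
shape = gradient_shape(
    times=[0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5],
    intensities=[1, 3, 5, 9, 9, 8, 8],
    t_start=0.0,
    t_end=7.0,
)
```

For 10,000 matched spectra and a mid 80% window, the ranks are 1001 and 9000, matching the example in the text, and the invented intensities give a shape that indicates diminished early-eluting peptides.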

Peptide pI median for each run – Gives the calculated isoelectric point median and standard deviation for the reported number of distinct peptides in the validated spectra for each LC-MS/MS experiment. These values are useful for measuring the effectiveness of OFFGEL electrophoresis (OGE) or isoelectric focusing (IEF) separations that may have been performed prior to the LC-MS/MS runs.
RT scatter plot & peptide subset reports for each directory (seqdb/peptideQMlists/*.txt) - This checkbox triggers two actions.
1. A distinct peptide report will be created in each directory for the subset of validated peptides that are observed in the data for that directory that are also contained on 1 of the lists stored in the files seqdb/peptideQMlists/*.txt.
2. A file called scatter.html will also be generated in each directory that contains an interactive plot comparing the retention times of the observed subset of peptides to a gold standard report (goldStandardDir/Selected_peptides_all_sequences_peptideExport.1.ssv). The string goldStandardDir is currently hardcoded in the file millpy/SM_Select_Peptide_QM_ScatterPlot.py. A future revision of this feature should allow for a user-specified comparator source. The plot is generated from a python script that uses the Bokeh interactive visualization JavaScript library, https://docs.bokeh.org/en/latest/.
If the checkbox is not visible on the form, it can be enabled for a website via the switch variable (enableRTscatterPlotPeptideSubsetsCheckbox=true) in millhtml/SM_js/SMcustomFlags.js.

Sample Handling Metrics

Isobaric label incorporation for each run - metrics associated with iTRAQ and TMT experiments. The metrics are intended to measure several characteristics related to quantifiability, labeling completeness, mixing balance, and reporter ion sensitivity. Select the type of isobaric label and the control ion used. The metrics include:
- Metrics for quantifiability of PSMs and the completeness of labeling as percentages of either the number of MS/MS PSMs(spectra) or MS1 precursor intensity of all PSMs include the following:
- Metrics for reporter ion ratios vs retention time as charts and tables are created if Chromatography metrics are also enabled. These are at the level of individual LC-MS/MS run or aggregated for all runs in a directory. The 4 resulting files are charts (.PDF) or tables (.tsv).
- Metrics for mixing balance amongst the various reporter ions is measured based on 3 sets of metrics:
  - Reporter ion intensity (1 column / channel) - a column contains the sum of the intensity for a single reporter ion across all PSMs in a run or in a directory.
  - % of reporter ion / base reporter ion using the summed intensities of each reporter ion across all PSMs in a run or in a directory. These metrics provide a range of the mixing balance of the combined samples relative to the most abundant component.
  - Ratio reporter ion / control ion using the summed intensities of each reporter ion across all PSMs in a run or in a directory. These metrics provide a range of the mixing balance of the combined samples relative to the denominator intended to be used for quantitative ratios.
- Metrics for reporter ion sensitivity include the following:
  - All Reporters Detected Spectra (%) - percentage of PSMs detected which contain all reporter ions for the selected labeling chemistry.
  - Control Ion Detected Spectra (%) - percentage of PSMs with the control ion detected.
  - Median S/N All Reporters - the median signal/noise ratio of the peaks in the reporter ion region is calculated for each PSM, followed by a median calculated across all PSMs in a run or in a directory.
Digestion completeness - Reports metrics associated with enzymatic digestion during sample preparation.
Observed modifications by - Reports metrics associated with modifications. Marking this check box enables the Distinct peptides and Peptide Spectrum Matches selections.
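
The two mixing balance calculations described under Isobaric label incorporation can be sketched in Python. This is only an illustration of the arithmetic: the function name, the channel labels, and the intensities are invented, and the base reporter ion is assumed to be the channel with the highest summed intensity.

```python
def mixing_balance(summed_intensity, control_channel):
    """Return (% of base reporter ion, ratio to control ion) per channel.

    summed_intensity -- dict of channel label -> reporter ion intensity summed
                        across all PSMs in a run or directory
    control_channel  -- label of the channel used as the ratio denominator
    """
    base = max(summed_intensity.values())
    control = summed_intensity[control_channel]
    percent_of_base = {ch: 100.0 * v / base for ch, v in summed_intensity.items()}
    ratio_to_control = {ch: v / control for ch, v in summed_intensity.items()}
    return percent_of_base, ratio_to_control


# Invented three-channel example with channel "127" as the control.
percent_of_base, ratio_to_control = mixing_balance(
    {"126": 200.0, "127": 100.0, "128": 50.0}, control_channel="127"
)
```

A well balanced mixture would give percentages close to 100 and ratios close to 1 for every channel; the spread seen here would suggest uneven sample mixing.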

Peptide Fraction Overlap

Distinct Peptide Fraction Overlap Table
- Select the Distinct peptide comparison method:
  For fraction overlap, sample handling, or pI, choose Case Sensitive(CS) or Case Insensitive(CI). FDR calculations always use CI. Filtering to distinct peptides retains each highest scoring representative after CS or CI string comparison of sequences. Variable modifications are lowercase.
  - Case Sensitive(CS)
  - Case Insensitive(CI)
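
The effect of the two comparison methods can be illustrated with a short Python sketch. It assumes, as stated above, that variable modifications appear as lowercase letters in the sequence string; the function name, the peptides, and the scores are invented for illustration.

```python
def distinct_peptides(psms, case_sensitive):
    """Keep the highest scoring PSM for each distinct peptide sequence.

    psms           -- iterable of (sequence, score); lowercase letters mark
                      variable modifications
    case_sensitive -- True for CS (modified and unmodified forms stay separate),
                      False for CI (they collapse into one peptide)
    """
    best = {}
    for sequence, score in psms:
        key = sequence if case_sensitive else sequence.upper()
        if key not in best or score > best[key][1]:
            best[key] = (sequence, score)
    return sorted(best.values(), key=lambda item: item[1], reverse=True)


example = [("AAsPK", 12.0), ("AASPK", 15.0), ("AAsPK", 9.0), ("LLTK", 7.0)]
cs_result = distinct_peptides(example, case_sensitive=True)
ci_result = distinct_peptides(example, case_sensitive=False)
```

With CS the phosphorylated and unmodified forms are counted as two distinct peptides, while with CI they collapse into one entry represented by the higher scoring form.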

Protein/Peptide Review of MS/MS Search Results

The Spectrum Mill provides a means for summarizing the results to answer questions like:

What peptides are in my sample?
What phosphosites are in my sample?
What proteins are in my sample?
How well was my mixture of peptides/proteins fractionated in my offline separation scheme?
What trends in protein presence/abundance are there across several LC-MS/MS runs?
What are the quantitative differences in proteins and phosphosites across the cohort of patients in my TMT data set?
What single amino acid variant or spliceform containing peptides were observed across the cohort of patients in my TMT data set?

Summary Modes:

See Chapter 2 of the Application Guide for detailed descriptions of the current Protein/Peptide Summary displays.

Mode	Description	Manual Validation State Assignment Available	Example Applications
Peptide - Spectrum Match	Peptides listed for each spectrum with links to data.	yes	List of PSMs present in the data.
Peptide - Distinct	Peptide is the primary organizing feature. PSMs for the same peptide are collapsed into a single row. The menu Filter to distinct peptides enables refining the notion of sameness to suit one's need (modified or not, different precursor charge, different LC-MS/MS run).	no	List of distinct peptides present in the data. Primary reporting mode for immunopeptidome experiments.
Protein Summary Details	Protein is the primary organizing feature. Peptides listed for each protein with links to spectra.	yes	Sequencing of simple mixtures of proteins, where coverage inspection is valuable.
Protein - Protein Comparison	Protein is the primary organizing feature. Each protein is listed once. Columns then show distribution of that protein among samples (one LC-MS/MS file per column, or a directory full of LC-MS/MS files treated as one column).	no	Primary reporting mode for quantitative whole proteome experiments. One or many LC-MS/MS files analyzed in a single directory. Directory corresponds to a sample.
Protein - Peptide Comparison	Peptide is the primary organizing feature. PSMs for the same peptide are collapsed into a single row. Peptides that belong to same protein group are clustered then listed in rows below each protein. The protein grouping method should be set to unexpand subgroups to prevent a peptide from being repeated for each protein subgroup in which it is a member. Columns then show distribution of each peptide among samples (one LC-MS/MS file or sample directory per column).	no	Evaluation of fractionation scheme.
Protein - Var Mod Site Comparison	VM site is the primary organizing feature. PSMs for the same variable modification site are collapsed into a single row. The type of VM site (phospho, acetyl, ubiquityl) is controlled by setting the corresponding value on the required AAs menu (s|t|y, k, k). The protein grouping method should be set to unexpand subgroups to prevent a VM site from being repeated for each protein subgroup in which it is a member. Columns then show distribution of each VM site among samples (one LC-MS/MS file or sample directory per column).	no	Primary reporting mode for quantitative phosphoproteome, acetylome, ubiquitylome experiments.
Protein - Prot Genom Site Comparison	PG site is the primary organizing feature. PSMs for the same proteogenomic site are collapsed into a single row. The type of PG site (variant or splice junction) is controlled by setting the corresponding value on the Filter by Proteogenomic Features menu. The protein grouping method should be set to unexpand subgroups to prevent a PG site from being repeated for each subgroup in which it is a member. Columns then show distribution of each PG site among samples (one LC-MS/MS file or sample directory per column). This mode is critically dependent on the prior creation of summary tables for personalized sequence databases used for the MS/MS searches.	no	Primary reporting mode for focusing on personalized proteogenomic features observed within a whole proteome experiment.

Protein Grouping in Protein Modes

The mechanism consists of the following steps:

Extract peptides - From each search result, extract all of the rank 1 hits (may be multiple instances of the same peptide sequence matched to proteins with different accession numbers).
Form proteins - Assemble all the peptides belonging to a single accession number.
Eliminate peptide redundancy - Redundancy has several sources:
- Spectra acquired on multiple charge states of the same peptide
- Multiple spectra acquired from a single precursor m/z
- Multiple homology matches to the same peptide in a single protein (i.e. the peptide sequence can be ambiguously interpreted by different AA substitutions)
The protein score and the number of distinct peptides are calculated so that only the instance of a particular peptide with the highest MS/MS Search score is counted (i.e. each peptide counted once, NOT multiple spectra, NOT multiple charge states, NOT multiple substitutions). The protein score is the sum of the identification scores of the distinct peptides from that protein. However, the total intensity is summed so that each observation of a peptide counts towards the total intensity for the protein (i.e. each spectrum counted once).
Eliminate protein redundancy - Proteins are grouped by peptide roll-up. All proteins are sorted in descending order of number of distinct peptides. Then, starting from the bottom of the list, each protein is checked: is at least one of its observed peptides present in a protein higher on the list? If so, the proteins are grouped together, provided that the shared peptide sequence is longer than 8 residues and is contained in multiple protein entries in the sequence database.
In some cases when the protein sequences are grouped in this manner, there are distinct peptides that uniquely represent a lower-scoring member of the group (isoforms and family members). Each of these instances spawns a subgroup. Multiple subgroups are reported and counted towards the total number of proteins, and given related protein subgroup numbers (e.g. 3.1 and 3.2 for group 3, subgroups 1 and 2). See also the information about multiple sequence alignment. In the Protein Summary Modes, the highest-scoring member of each protein group and subgroup become the basis for further calculations. All subgroups are reported in Protein/Peptide Summary, unless the protein grouping method is set to Unexpand subgroups.
Expand subgroups and shared peptides - When reporting the protein score, summed precursor intensity, and quantitative ratios, there are multiple possible ways of handling the peptides that are shared by more than one subgroup in a protein group. Four options are provided:
1. unexpand subgroups
  All peptides are used and protein group level values are reported without expanding into subgroups. For certain modes that display peptide level results (Protein - Peptide, Protein - Var Mod Site, Protein - Prot Genom Site), this method is valuable to prevent peptides, VM sites, and PG sites from being reported multiple times, i.e. once for each subgroup they are members of. When doing so, the highest scoring protein subgroup they are members of will be reported.
2. expand subgroups, all use shared
  Shared peptides are used in each subgroup in which they are observed. This is the default approach.
3. expand subgroups, top uses shared, SGT
  Shared peptides are used only in the top scoring subgroup. They are excluded from other subgroups. For isoforms and family members, this method is valuable for having quantitation based solely on the peptides which are distinct to that subgroup. The report filename will contain a .SGT. designation intended to mean SubGroup Top.
4. expand subgroups, ignore shared, SGS
  Shared peptides are ignored for all subgroups. Only the subgroup specific peptides are used toward each subgroup’s count of distinct peptides and protein level quantitation. This method is particularly suited for xenograft experiments (a human tumor grown in a mouse). If evidence for BOTH human and mouse peptides from an orthologous protein were observed, then peptides that cannot distinguish the two (shared) are ignored. However, the peptides shared between species are retained if there was specific evidence for only one of the species, thus yielding a single subgroup attributed to only the single species consistent with the specific peptides. Furthermore, if all peptides observed for a protein group are shared between species, thus yielding a single subgroup composed of indistinguishable species, then all peptides are retained. The report filename will contain a .SGS. designation intended to mean SubGroup Specific.
In some applications it is helpful to consider more than one method of handling the shared peptides. Consequently, so that a user does not have to generate multiple reports (and wait for the protein grouping to be repeated), when either the SGS or SGT option is selected and the Excel export option is used to produce output, a second report for the all use shared option is also generated.
Sort protein groups and subgroups - Protein groups are sorted in descending order of protein score. Subgroups within a group are sorted in descending order of protein score that includes the peptides that are shared with other subgroups.
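
The first four steps above can be illustrated with a minimal Python sketch. It is not the Spectrum Mill implementation. The tuple layout of the hits, the function names, and the tie-breaking order are assumptions made for illustration, and the grouping pass is simplified: it walks the sorted list from the top and attaches each protein to the first group with which it shares a peptide longer than 8 residues, without forming subgroups.

```python
from collections import defaultdict


def build_proteins(hits):
    """Form proteins from rank 1 hits.

    hits: iterable of (accession, peptide, score, intensity) tuples,
    one per matched spectrum (hypothetical layout).
    """
    best = defaultdict(dict)        # accession -> {peptide: best score}
    intensity = defaultdict(float)  # accession -> summed intensity
    for accession, peptide, score, inten in hits:
        # Count each distinct peptide once, at its highest score.
        best[accession][peptide] = max(score, best[accession].get(peptide, score))
        # Every spectrum contributes to the total intensity.
        intensity[accession] += inten
    return {
        acc: {
            "peptides": set(peps),
            "score": sum(peps.values()),
            "total_intensity": intensity[acc],
        }
        for acc, peps in best.items()
    }


def group_proteins(proteins, min_shared_length=9):
    """Group proteins that share a peptide longer than 8 residues."""
    # Descending number of distinct peptides; ties broken by score, then
    # accession (the tie-breaking rule is an assumption).
    ordered = sorted(
        proteins,
        key=lambda a: (-len(proteins[a]["peptides"]), -proteins[a]["score"], a),
    )
    groups = []
    for acc in ordered:
        long_peptides = {
            p for p in proteins[acc]["peptides"] if len(p) >= min_shared_length
        }
        for group in groups:
            if any(long_peptides & proteins[member]["peptides"] for member in group):
                group.append(acc)
                break
        else:
            groups.append([acc])
    return groups
```

In this sketch a protein that shares only a peptide of 8 residues or fewer with a higher-ranked protein starts its own group, which mirrors the length condition described in the roll-up step.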

Notes:

The modes Protein - Var Mod Site Comparison and Protein - Peptide Comparison should be used with the protein grouping method set to Unexpand subgroups to prevent VM sites and peptides from being reported multiple times, i.e. once for each subgroup they are members of. When doing so, the highest scoring protein subgroup they are members of will be reported.
In Protein Summary Details mode - When you use manual validation with 1 shared peptide, expand subgroups, the top portion of the report that lists the proteins shows the individual subgroups. The lower peptide portion of the report shows all the peptides that belong to the group; subgroup information is not given at the peptide level. Because a peptide can belong to more than one subgroup, this prevents you from assigning conflicting validation states to a single peptide that is listed multiple times in different subgroups.
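
The four options for handling shared peptides described under Expand subgroups and shared peptides can be compared with a small Python sketch. The function name, the mode strings, and the data layout are hypothetical. The sketch assumes that a subgroup's score is the sum of its peptide scores including shared peptides, and it takes the SGT rule literally: only the single top-scoring subgroup keeps shared peptides.

```python
def peptides_used(subgroups, scores, mode):
    """Return the peptides that contribute to each reported unit.

    subgroups: dict mapping subgroup id (e.g. "3.1") to a set of peptides
    scores:    dict mapping peptide to its identification score
    mode:      "unexpand", "all_use_shared", "SGT", or "SGS" (hypothetical names)
    """
    all_peptides = set().union(*subgroups.values())
    if mode == "unexpand":
        # One row per protein group, using every peptide.
        return {"group": all_peptides}

    # A peptide is shared if it occurs in more than one subgroup.
    shared = {
        p for p in all_peptides
        if sum(p in peps for peps in subgroups.values()) > 1
    }
    if mode == "all_use_shared":
        return {sg: set(peps) for sg, peps in subgroups.items()}
    if mode == "SGS":
        # Shared peptides are ignored for all subgroups.
        return {sg: peps - shared for sg, peps in subgroups.items()}
    if mode == "SGT":
        # Only the top-scoring subgroup keeps the shared peptides.
        top = max(subgroups, key=lambda sg: sum(scores[p] for p in subgroups[sg]))
        return {
            sg: (set(peps) if sg == top else peps - shared)
            for sg, peps in subgroups.items()
        }
    raise ValueError(f"unknown mode: {mode}")
```

When a protein group has only one subgroup, no peptide counts as shared in this sketch, so all peptides are retained under every option, consistent with the single-subgroup behavior described for SGS.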

For a discussion of the principles of protein grouping, see:

Nesvizhskii, A. I.; Aebersold, R. "Interpretation of Shotgun Proteomic Data: The Protein Inference Problem." Mol. Cell. Proteomics 2005, 4(10), 1419-1440. DOI: 10.1074/mcp.R500012-MCP200

Peptide Validation

The Spectrum Mill provides a means for segregating search results that contain a valid interpretation of an MS/MS spectrum from those which do not. The segregated groups can then be subjected to subsequent rounds of searches (against other databases or in homology mode for example) or to produce a summarized list of only those peptides or proteins found in a sample from confidently-interpreted spectra. An interpretation which is not valid can result from several causes:

Sequence not in database
Marginal spectral quality
Incorrect precursor charge designation (most likely resulting from inadequate instrument resolution on the precursor ion)
Incorrect search parameter settings (mass accuracy, fragment ion types, enzyme, cysteine modification, etc.)
Search algorithm or peak selection in need of improvement

To segregate the search results, the software must keep track of both the spectrum and its interpretation in a coordinated way. The software must simultaneously keep track of spectra separately from search results, since spectra can be segregated according to quality without regard to their interpretations. The validation state of a particular spectrum or a search result can be designated with certain programs. After toggling the validation state for each search result or spectrum and clicking the perform validation button, two files are created in the appropriate data directory (hitTable.tsv, and spectrumTable.tsv). The tables record the appropriate state of search result or spectrum file according to the chart below. Files whose state is not designated are not recorded in the tables. When additional validation events are performed, the table files cumulatively record the validation states of spectra and search results for the particular data directory. Subsequent operations using different programs can thus be done using only the members of the group corresponding to combinations of states. Subsequent MS/MS searches will overwrite the results of earlier searches.

Validation Filter	Program Using Filter	Possible Spectrum States	Possible Interpretation (Hit) States	Program Capable of Assigning Spectrum States	Program Capable of Assigning Interpretation (Hit) States
spectrum-not-marked-sequence-not-validated	MS/MS Search, de novo Sequencing, Spectrum Summary	none	none	Protein/Peptide Summary, Spectrum Summary	Protein/Peptide Summary
sequence-not-validated	Protein/Peptide Summary, MRM Selector	none, good, bad	none	Protein/Peptide Summary, Spectrum Summary	Protein/Peptide Summary
valid	MS/MS Search, Protein/Peptide Summary, MRM Selector	valid	valid	Protein/Peptide Summary, Autovalidation	Protein/Peptide Summary, Autovalidation
good-spectrum-sequence-not-validated	MS/MS Search, de novo Sequencing, Protein/Peptide Summary, Spectrum Summary, MRM Selector	good	none	Spectrum Summary	Protein/Peptide Summary
good-spectrum	Spectrum Summary	good	none, valid	Spectrum Summary	Protein/Peptide Summary, Autovalidation
bad-spectrum	Spectrum Summary	bad	none, valid	Spectrum Summary	Protein/Peptide Summary, Autovalidation
all	Protein/Peptide Summary, MRM Selector	none, valid, good, bad	none, valid	Protein/Peptide Summary, Autovalidation, Spectrum Summary	Protein/Peptide Summary, Autovalidation
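
The relationship between each validation filter and the spectrum and hit states it admits can be expressed as a lookup table. The Python sketch below transcribes the states from the chart above. The record layout (a dictionary with spectrum_state and hit_state keys) is a hypothetical stand-in, because the actual column layout of hitTable.tsv and spectrumTable.tsv is not described here.

```python
# Filter name -> (allowed spectrum states, allowed hit states), per the chart.
VALIDATION_FILTERS = {
    "spectrum-not-marked-sequence-not-validated": ({"none"}, {"none"}),
    "sequence-not-validated": ({"none", "good", "bad"}, {"none"}),
    "valid": ({"valid"}, {"valid"}),
    "good-spectrum-sequence-not-validated": ({"good"}, {"none"}),
    "good-spectrum": ({"good"}, {"none", "valid"}),
    "bad-spectrum": ({"bad"}, {"none", "valid"}),
    "all": ({"none", "valid", "good", "bad"}, {"none", "valid"}),
}


def apply_validation_filter(records, filter_name):
    """Keep records whose states are admitted by the named filter.

    records: list of dicts with 'spectrum_state' and 'hit_state' keys
    (hypothetical layout, for illustration only).
    """
    spectrum_states, hit_states = VALIDATION_FILTERS[filter_name]
    return [
        r for r in records
        if r["spectrum_state"] in spectrum_states and r["hit_state"] in hit_states
    ]
```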

The Spectrum Viewer is a convenient tool for reviewing results.

To Use the Protein/Peptide Summary Form

The following topics describe options available on the Protein/Peptide Summary form. Note that the options under Validation and Sorting and Review Fields change depending upon which Mode you select. This section describes all possible options; you may see only a subset of these on your form.

If during data review you wish to display the Protein/Peptide Summary form again, click the Summary Settings button at the top of the page.

For more details, see Protein/Peptide Review of MS/MS Search Results.

Summarize Results for Review

Summarize - Click to summarize results. Click this button after you have either loaded the desired parameter file or manually set the parameters. The name of the current parameter file appears in red at the top of the form. Once you have saved a parameter file from this form, you may do the summary from a workflow rather than manually with the Summarize button.
Save As - Click to save current summary settings in a parameter file.
Load - Click to load a parameter file that contains summary settings. For default values, select a parameter file from the Defaults folder.
Queue request - Mark this check box if you want the data summary to occur after completion of a queued MS/MS search and a queued autovalidation for the selected data directories. That is, mark the check box if you want to do interactive automation. If you want to see summary results immediately, clear the check box. You also mark this check box if you want to preserve the output in HTML format for later access.
Note: When you view Protein/Peptide Summary results from the Completion Log, some links do not work as they would if they were viewed within the Protein/Peptide Summary page. For example, you cannot click the Row# link to view and review spectra. Most links do work, but they display their output in a separate window.
Excel export - Mark or select to export results to Excel or to upload to LIMS. For the latter, first make sure your system administrator has configured the upload. See Exporting to Excel or Uploading to LIMS. This setting appears only for some of the display modes.
MPP Generic export - Select to export results to MPP (Mass Profiler Professional) generic import format. The MPP Generic export is only available in the Protein-Protein Comparison mode. If you want all proteins reported without grouping, use the Protein-Protein Comparison mode with 1 shared, expand subgroups selected as the Protein grouping method.
MPP APR export - Select to export Agilent Proteomics Results (APR) to Agilent Mass Profiler Professional (MPP) 14.0. If you have not updated to MPP 14.0, continue to use the MPP Generic export (which also supports non-Agilent data). The APR format provides both Protein and Peptide results that you can import into MPP’s “Proteomics” experiment type. The format organizes results by proteins with their corresponding peptides. (See the MPP documentation for details). When you select MPP APR export, the program exports all the necessary protein and peptide information, whether or not the review fields are selected. The exception is labeled quantitation. For DEQ/SILAC, select the DEQ ratios and the Invert setting if applicable. For iTRAQ/TMT, select the Reporter Ratios, the type of modification, and the Control ion. Note that the abundance values are exported for each labeled modification rather than the ratios, but enabling the ratio calculation allows the labeled abundances to be determined and the controls to be specified. Peptides that do not have the labeled modification are not exported. For duplicate peptide hits (same sequence, modification, and charge), only the most abundant peptide is exported for that protein. All other protein and peptide filtering options and the Protein Quantitation Options are applied, so you can filter and limit what is exported.
AMRT export - Mark to export results to a CSV file that you can search directly or import into an existing Agilent MassHunter accurate mass retention time (AMRT) database. (See the MassHunter Personal Compound Database and Library Quick Start Guide for import instructions.)

You can then search this database from MassHunter Qualitative Analysis or MassHunter ID Browser, to map features to identifications from Spectrum Mill.
You can then import the results from the AMRT database search into MassHunter Mass Profiler Professional, which transfers the identifications into Mass Profiler and annotates the features. Mass Profiler and Mass Profiler Professional can then make use of ID Browser to search the AMRT database to provide annotations for the features in these programs.
AMRT export also exports neutral mass formulas for use with Find by Formula in MassHunter Qualitative Analysis.
The AMRT CSV export setting appears only for the peptide display mode, and you can mark the box only if you have not marked Excel export.
If more than one peptide has the same sequence, this function exports only the most abundant one. Intensity is exported in the "Area" column.
The CSV file is named peptideExport.#.amrt.csv, and the program stores it in the data file folder. If you generate a CSV file from multiple data folders, the program stores the file in the first data folder that you selected.

Export inclusion list for top peptides/protein - Mark this check box to create an inclusion list for Agilent Q-TOF instruments. Enter a value for the maximum number of peptides to target per protein. (This feature is only available if Agilent Q-TOF data has been selected.)
Mode - Select a summary mode. For more details, see Summary Modes.
Filter to distinct peptides - To report only the instance of a particular peptide with the highest MS/MS Search score, select one of the following:
- Off -- Disables the filtering.
- Case insensitive -- When collapsing to "distinct", a case-insensitive string compare is used, thus peptides with variable modifications (lowercase AA's) and unmodified peptides are combined.
- Case sensitive -- When collapsing to "distinct", a case-sensitive string compare is used, thus peptides with variable modifications (lowercase AA's), different localizations of those variable modifications, and unmodified peptides are kept separate.
- Charge file CS -- When collapsing to "distinct", a case-sensitive string compare is applied to both the sequence and spectrum filename prefix, thus peptides from different LC-MS/MS runs and those with different precursor charges are kept separate.
This option is available only in Peptide - Distinct mode.
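
The difference between the distinct-peptide settings comes down to the key used to collapse PSMs. The following Python sketch is an illustration under stated assumptions, not the program's code: the record fields and mode strings are hypothetical, and for the Charge file CS setting it assumes the spectrum filename follows the Data_File_Name.aaaa.bbbb.c pattern, so that the data file name and the charge can be split off from the right.

```python
def filter_distinct(psms, mode="case_sensitive"):
    """Keep the highest-scoring PSM per distinct peptide.

    psms: list of dicts with 'sequence', 'score', and 'filename' keys
    mode: "off", "case_insensitive", "case_sensitive", or "charge_file_cs"
    """
    if mode == "off":
        return list(psms)

    def key(psm):
        if mode == "case_insensitive":
            # Lowercase (modified) residues collapse with unmodified ones.
            return psm["sequence"].upper()
        if mode == "case_sensitive":
            return psm["sequence"]
        if mode == "charge_file_cs":
            # Assumed filename pattern: Data_File_Name.aaaa.bbbb.c
            data_file, _first, _last, charge = psm["filename"].rsplit(".", 3)
            return (psm["sequence"], data_file, charge)
        raise ValueError(f"unknown mode: {mode}")

    best = {}
    for psm in psms:
        k = key(psm)
        if k not in best or psm["score"] > best[k]["score"]:
            best[k] = psm
    return list(best.values())
```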
Group results by: Select File to display results by file or Directory to display results by directory. The latter is useful if you want to compare multiple samples, and each sample is located in a separate directory on the Spectrum Mill server. Note that this option is available only in certain display modes.
The Protein-Peptide Comparison Columns mode allows rows to be grouped by Sequence or Var mod site.
Data directories - Click the Select ... button to select a data directory or data directories. See Selecting Data Directories.
Search result files: Modify this list if you want to summarize only a subset of the files in the data directory. Wildcards (*) are supported. To see the names of your search result files, look in the results_mstag subdirectory under the directory where you placed your raw files.
Search result files exclude: Modify this list if you want to exclude certain files from the summary. Wildcards (*) are supported.

Validation and Sorting

Filter results by: See Peptide Validation.
Validation preset: Used during results review, and determines whether results are initially classified as status, valid, reset, or none.

Choose none if you want to summarize results rather than review and validate results.
Choose valid if you want to validate results and you set filters to select results with relatively high protein and/or peptide scores.
Choose reset if you want to validate results and you set filters to select results with medium protein and/or peptide scores. You can change to valid when you find acceptable results as you manually review data. The validation preset classifications are not yet written to file and can easily be changed as data are reviewed.
Choose status if you want to see all peptides that belong to a protein, regardless of validation state. To generate such a display, set Filter results by: to all and set Mode to one of the display modes that show both proteins and peptides. When the data are displayed, look under Validation category to see if a particular peptide was validated (V) or not validated (R, for reset). You may also change the validation state, but before you exit the form, be sure to click the Perform Validation button to save the new validation state.

Protein grouping method: Options for how proteins are grouped based on shared/distinct peptides and which peptides contribute to protein-level quantitation.
Sort proteins by: Determines how proteins are sorted in the results summary.
Filter by protein score: Permits display of only proteins matching specified score criteria. Note that protein scores of 25 and greater are almost certain to represent valid results.
Sort peptides by: Determines how peptides are sorted in the results summary. Select the appropriate Review Field from the list. Note that when you sort by accession number, the sort is alphabetical rather than numerical. This is because some databases do not have strictly numeric accession numbers.
Filter peptides by: Permits display of only peptides that match specified criteria.

Score: Filters by database search score. Note that peptide scores of 15 or greater, in combination with % SPI of 70 or greater, are almost certain to represent valid results. Peptide scores less than 6 seldom represent valid interpretations unless the spectra originated with an instrument capable of accurate mass measurements (e.g., Agilent Q-TOF). For Agilent Q-TOF data, you search with a narrower mass tolerance, so there is a better chance that lower-scoring results are valid. It is not unusual for a score of 5 to represent a valid result, but only if the peptide is short or in low abundance.
% SPI: Filters by percent scored peak intensity. This is the percentage of the spectral peak-detected ion current explained by the search interpretation.
Required AAs: Filters search results so that peptides are shown only if they contain the required amino acid(s). To disable, select any. See Amino Acid Filtering.
Disallowed AAs: Filters search results so that peptides are not shown if they contain disallowed amino acid(s). To disable, select none. See Amino Acid Filtering.
Peptide pI: Filters search results by peptide pI. Fill in a range, or mark the check box for All. The software displays this filter only when you mark the check box for Peptide pI under Review Fields. If you wish to use the pI filter for modified peptides, ask your server administrator to first verify that the pK of the modified amino acid is specified in smconfig.std.xml or smconfig.custom.xml. Spectrum Mill server administrators may set the pK values for modifications when they define modifications (only necessary if the pK values are different from those of the unmodified amino acid).
Accession #'s: Filters search results by accession numbers. You can type or paste a list of accession numbers in various formats (space-separated, separated by ‘|’, comma-separated, etc).
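
As an illustration of how these peptide filters combine, the Python sketch below applies score, % SPI, amino acid, and accession criteria to a single peptide record. The record fields and function names are hypothetical. The default thresholds of 15 and 70 are taken from the guidance above only as example values, and the sketch assumes that every listed required amino acid must be present (see Amino Acid Filtering for the actual behavior).

```python
import re


def parse_accessions(text):
    """Split a pasted list on whitespace, commas, or '|' characters."""
    return {token for token in re.split(r"[\s,|]+", text.strip()) if token}


def passes_filters(peptide, min_score=15.0, min_spi=70.0,
                   required="", disallowed="", accessions=None):
    """Return True if the peptide record satisfies all active filters.

    peptide: dict with 'sequence', 'score', 'spi', and 'accession' keys
    required / disallowed: strings of one-letter amino acid codes;
    an empty string disables the corresponding filter.
    """
    sequence = peptide["sequence"].upper()
    if peptide["score"] < min_score or peptide["spi"] < min_spi:
        return False
    # Assumption: all required residues must occur in the sequence.
    if not all(aa in sequence for aa in required.upper()):
        return False
    if any(aa in sequence for aa in disallowed.upper()):
        return False
    if accessions is not None and peptide["accession"] not in accessions:
        return False
    return True
```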

Review Fields

Filename - Spectral file name, in the format Data_File_Name.aaaa.bbbb.c, where aaaa = first merged scan, bbbb = last merged scan, and c = assigned precursor charge (0 means charge was ambiguous)
Score - Database search score. Depending on display mode, this shows either the individual peptide score or the summed peptide scores for the protein.
FDR (Discriminant) - False Discovery Rate - Displays both the global FDR and local FDR values, as well as an FDR Search #; independent of the autovalidation strategy used
Fwd-Rev score - Difference between scores for top hits from forward and reversed database searches.
Rank 1-2 score - Difference between rank 1 and rank 2 database search scores
SPI (%) - Scored peak intensity. This is the percentage of the spectral peak-detected ion current explained by the search interpretation.
Glyco Product Ions Score - Spectral feature based on the 9-ion glycosylation-signature set: 126, 138, 144, 168, 186, 204, 274, 292, 366. Numerically, GPIS is a 2-part score. See CPIS & GPIS for more detail.
Backbone Cleavage Score (BCS) - Based on the search results, the number of cleavages of the amino acid backbone that are represented in the spectrum.
Unmatched ions - Number of ions in the peak detected spectrum that did not match the theoretical ions predicted from the top database search result. This is displayed in the format: # unmatched ions/ # total ions after peak detection.
Var mod sites - Lists number and sites of variable modifications and amino acid substitutions, primarily phosphorylation sites but others are available, too.
VML score - Displays the VML (Variable Modification Localization) score of the modification selected, which is the difference in score between equivalent identified sequences with different variable modification localizations. A VML score of >1.1 indicates confident localization. 1 implies there is a distinguishing ion of b or y ion type. 0.1 means that when unassigned, the peak is 10% the intensity of the base peak.
Solution charge - Displays the predicted charge of the peptide in solution, which can be useful for reviewing results after charge-based fractionation. The later fractions are expected to have a higher solution charge.
Ion mobility - Reports DT and CCS values, if present.
Start AA position - The numerical position of the peptide's first amino acid in the sequence of the protein
Proteogenomic feature - Enables reporting some extra columns about variant-containing or spliceform-containing peptides. This feature is critically dependent on the prior creation of summary tables for personalized sequence databases used for the MS/MS searches.
Run specific - This setting appears only in two display modes: Protein-Protein Comparison Columns and Protein-Protein Comparison Redundant. It allows you to add information to the summary report. For each colored cell in a protein comparison columns report, the results in the colored cell are specific to that column (could be one LC-MS/MS run, or one folder or sample); that is, peptides shared across the columns are excluded. When you have multiple runs (folders or files), each of the first N columns (where N is the number of runs/files/folders/samples) reports values that are specific to those runs. Further to the right are values that are summed across all the files/folders (samples).
When you mark the check box for Run Specific, you can then choose to display any combination of the following five settings in the first column (one run) or multiple columns (multiple runs):

% Coverage - percent of the protein sequence covered by the identified peptides
Distinct peptides - number of unique peptides identified from the spectra associated with the protein. Multiple spectra may match the same peptide; this is a count of unique peptides rather than matched spectra.
Distinct peptide forms/mods - number of unique peptides, where modified peptides are also counted as unique
DEQ ratio - displays ratios for differential expression quantitation, such as light/heavy ratios for ICAT reagents or other reagents that use isotopic labels. See SILAC and Other Differential Expression Quantitation. Also displays the number of light/heavy pairs that contribute to each ratio.
Reporter Ratios: iTRAQn or TMTn - displays ratios for iTRAQ or TMT quantitation. "n" indicates the number of reporter ions in the experiment. To see the iTRAQn or TMTn in this field, select either iTRAQ4, iTRAQ8, TMT2 or TMT6 from the dropdown list in the right column of the Review Fields. Also select the mass for the denominator of the ratio from the Control dropdown list in the right column. This field reports all the required ions.

Sequence - Amino acid sequence of matched peptide from database search
b/y map - displays the amino acid sequence of the matched peptide from the database search, annotated with the following:
- Red forward-slashes for locations of y-ions
- Blue backslashes for locations of b-ions
- Magenta pipes (vertical lines) for locations of both b- and y-ions
This functionality allows you to assess b- and y-ion coverage without inspecting the spectrum, and for homology mode results aids in identifying the site of a PTM by highlighting existence of surrounding b/y ions.
Rev Sequence - Displays (in magenta) the score and sequence for the top hit from the reversed database search
Rank 2 - Displays (in green) the score and sequence for the number 2 database hit. The SPI is also displayed, but it retains the blue color to indicate that it is a link.
VML sequence - Displays the actual sequence of the amino acids surrounding the variable modification site(s)
Prec Av Chi² - Displays the Precursor Averagine Chi² value.
Isol Pur - Displays the Precursor Isolation Purity.
Ret time, width - Time (min) from the start of the LC gradient to the chromatographic apex of the precursor ion. When multiple spectra are merged, the retention time is that for the first of the merged scans. Width is the width of the precursor ion chromatographic peak (sec); 0 means no more than one MS1 scan had a satisfactory precursor isotope cluster shape.
Precursor m/z - Measured m/z of the precursor ion
MH⁺ - Measured precursor ion MH⁺
Delta mass - Difference between measured precursor MH⁺ (calculated from measured precursor m/z and charge state) and precursor ion MH⁺ from top database search hit
Pep pI - Calculated isoelectric point (pH at which the net charge on the peptide is zero) of peptide that corresponds to top database hit. When you mark this check box, the report includes peptide pI, and you enable the capability to filter by peptide pI.
Protein MW - Molecular weight of protein representing top database hit
Prot pI - Isoelectric point (pH at which the net charge on the protein is zero) of protein representing top database hit.
Species - Species for protein representing top database hit
Accession # - Database accession number
Protein name - Protein name for top database hit
Intensity - In peptide summary modes, this is the peak intensity calculated from the extracted ion chromatogram of each peptide precursor. In protein display modes, this is the mean intensity, total intensity, or both (depends on user selection) of the peptides that make up the protein. For more details, see Color-Coded Quantitation Results and the totalIntensity topic under Spectral Features.
DEQ ratios - Mark to display ratios for differential expression quantitation, such as light/heavy ratios for ICAT reagents or other reagents that use isotopic labels. See SILAC and Other Differential Expression Quantitation. When you mark this check box, and you have selected a protein summary mode that supports quantitation, the program displays two additional settings:

Invert - Mark to display the reciprocals of the ratios that are normally calculated, for example, to change light/heavy (L/H) to heavy/light (H/L)
Selection for Median, Mean, or Both

Reporter Ratios - Mark the check box to calculate the ratios of isobaric tag masses used in quantitation.

Intensities - Mark to display intensities for marker ion masses for peptide reports.
Dropdown list of Reporter Ratios: Select one of these options to change the list of Reporter Ratios available.

iTRAQ4 - Select this quantitation option for 4 samples labeled with this tag.
iTRAQ8 - Select this quantitation option for 8 samples labeled with this tag.
TMT2 - Select this quantitation option for 2 samples labeled with this Tandem Mass Tag.
TMT6 - Select this quantitation option for 6 samples labeled with this Tandem Mass Tag.

Control - Select the isobaric tag mass you wish to use in the denominator for ratio calculations.

Modification names - Lists modifications. Lists the site of the modification first, followed by the type of modification.
N-term - N-terminal modifications
C-term - C-terminal modifications
Cysteines - Cysteine modifications
Fragmentation mode - The MS/MS fragmentation mode for the spectrum, either CID for collision-induced dissociation or ETD for electron transfer dissociation
Max tag length - Maximum sequence tag length, defined as the length of the longest path of amino acids that is represented in the spectrum. This is a useful measure of spectral information content.
Longest tag - Amino acid sequence corresponding to maximum sequence tag length
# b/y pairs - Number of b/y pairs represented in the spectrum.
Category: User-defined protein category. Categories must be defined by your system administrator and entered into an msparams_mill\categories.#.tsv file. When you mark the Category check box, you can then select a category.
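
Two of the review fields above lend themselves to a short worked example: the Filename format (Data_File_Name.aaaa.bbbb.c) and the Delta mass calculation, which derives the measured MH⁺ from the measured precursor m/z and charge state. The Python sketch below is illustrative only. The function names are hypothetical, and the MH⁺ conversion uses the standard relationship MH⁺ = m/z × z − (z − 1) × proton mass, which is assumed rather than taken from the Spectrum Mill source.

```python
PROTON_MASS = 1.007276  # mass of a proton in Da


def parse_spectrum_filename(name):
    """Split Data_File_Name.aaaa.bbbb.c into its four parts."""
    # Split from the right so dots inside the data file name are preserved.
    data_file, first_scan, last_scan, charge = name.rsplit(".", 3)
    return {
        "data_file": data_file,
        "first_scan": int(first_scan),
        "last_scan": int(last_scan),
        "charge": int(charge),  # 0 means the charge was ambiguous
    }


def mh_plus(precursor_mz, charge):
    """Convert a measured precursor m/z and charge to singly protonated mass."""
    if charge < 1:
        raise ValueError("charge must be known (>= 1) to compute MH+")
    return precursor_mz * charge - (charge - 1) * PROTON_MASS


def delta_mass(precursor_mz, charge, database_mh):
    """Measured MH+ minus the MH+ of the top database search hit."""
    return mh_plus(precursor_mz, charge) - database_mh
```

For example, a doubly charged precursor measured at m/z 500.0 corresponds to an MH⁺ of 1000.0 minus one proton mass.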

Protein Quantitation Options

These options are available only in certain protein modes.

Exclude poor isotope quality Precursor XIC's - Mark this check box if you do not want quantitation to include peptides whose isotope ratios show poor quality. Poor quality is defined as less than 0.85 Chi-squared versus averagine. Chi-squared is a measure of similarity and averagine is the mass distribution you get if you assume that the peptide is made up of "average" amino acids. The elemental composition of averagine is C 4.9384 H 7.7583 N 1.3577 O 1.4773 S 0.0417. (See Senko MW, Beu SC, McLafferty FW, "Determination of monoisotopic masses and ion populations for large biomolecules from resolved isotopic distributions," J Am Soc Mass Spectrom 1995, 6:229-233.) This setting applies only for Agilent Q-TOF and Thermo data and for protein summary modes that support quantitation.
Exclude poor Precursor Isolation Purity < [75%] - Mark this check box if you do not want quantitation to include peptides whose precursor isolation purity is less than 75%. Precursor isolation purity is the proportion of the ion current in the isolation window of a high resolution MS1 scan that is represented by the isotope cluster of the precursor ion assigned to the resulting MS/MS scan.
Exclude outlier DEQ Ratios - Mark this check box if you do not want quantitation to include DEQ ratios that are more than two standard deviations from the mean. The program displays this setting only in certain modes that display proteins, and then only when you mark the check box for DEQ ratios.
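
The three exclusion options above can be sketched as successive filters on the peptides that feed a protein's quantitation. This Python example is a rough illustration, not the program's logic. The record fields are hypothetical, the outlier test uses the sample standard deviation of the raw ratios (the manual does not say whether ratios are log-transformed first), and the outlier step is skipped when fewer than three ratios remain, which is also an assumption.

```python
from statistics import mean, stdev


def filter_for_quantitation(peptides, min_chi2=0.85, min_purity=75.0, n_sd=2.0):
    """Apply the isotope quality, isolation purity, and outlier exclusions.

    peptides: list of dicts with 'chi2' (similarity versus averagine),
    'purity' (precursor isolation purity, %), and 'deq_ratio' keys.
    """
    # Exclude poor isotope quality and poor isolation purity first.
    kept = [
        p for p in peptides
        if p["chi2"] >= min_chi2 and p["purity"] >= min_purity
    ]
    if len(kept) < 3:
        return kept  # too few values for a meaningful outlier test

    ratios = [p["deq_ratio"] for p in kept]
    center, spread = mean(ratios), stdev(ratios)
    # Exclude ratios more than n_sd standard deviations from the mean.
    return [p for p in kept if abs(p["deq_ratio"] - center) <= n_sd * spread]
```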

Spectrum Grouping Options

These options are available only in the Protein-Peptide Comparison Columns mode.
Each precursor ion intensity reported contains the summed value from all of the peptide spectrum matches (PSM's) that were grouped together.

Group missed cleavages containing VM site(s) - Mark this check box to combine PSM's of the same variable modification site with missed cleavages if they contain the same number of modifications. That is, different missed cleavage forms of peptides containing the same modification site (AA position in the sequence) will be collapsed into a single row. For example, for s|t|y modifications, a row in the table combines PSM's of the same s|t|y site with missed cleavages allowed so long as they all contain the same number of s|t|y modifications. The displayed representative is the one having the highest VML score.
Show all grouped spectra - Mark this check box to see all the PSM's that were combined. A row in the table combines peptide spectrum matches (PSM's) of the same peptide containing the same number of the variable modifications. For example, for s|t|y modifications, a row in the table combines PSM's of the same peptide containing the same number of s|t|y modifications. This allows one to inspect the collapsing behavior by reporting all the individual PSM's that are collapsed to an individual sequence or VM site. Because this results in a nested table with multiple rows in individual cells, Excel Export is not supported for this feature.
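
The grouping behavior described above can be sketched in Python as follows. This is a simplified illustration with a hypothetical record layout: each PSM carries the protein accession, the peptide sequence, a tuple of modified positions in the protein sequence, a precursor intensity, and a VML score. With missed-cleavage grouping enabled, PSMs are keyed by their set of modified sites, so different missed cleavage forms covering the same site collapse into one row.

```python
from collections import defaultdict


def group_psms(psms, group_missed_cleavages=True):
    """Collapse PSMs into rows and sum their precursor intensities.

    psms: list of dicts with 'accession', 'sequence', 'vm_sites'
    (tuple of protein positions), 'intensity', and 'vml_score' keys.
    """
    groups = defaultdict(list)
    for psm in psms:
        if group_missed_cleavages:
            # Same modified site(s) implies the same number of modifications.
            key = (psm["accession"], tuple(sorted(psm["vm_sites"])))
        else:
            # Same peptide with the same number of variable modifications.
            key = (psm["accession"], psm["sequence"].upper(), len(psm["vm_sites"]))
        groups[key].append(psm)

    rows = []
    for members in groups.values():
        # The displayed representative has the highest VML score.
        representative = max(members, key=lambda m: m["vml_score"])
        rows.append({
            "representative": representative["sequence"],
            "intensity": sum(m["intensity"] for m in members),
            "members": members,  # what "Show all grouped spectra" would list
        })
    return rows
```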

Variable Modification Localization within Protein/Peptide Summary

Variable modification localization is a unique Spectrum Mill feature that assigns modifications to specific location(s) in a sequence when you have two or more possibilities. In addition, it provides a confidence indicator, which is the difference in score between equivalent identified sequences with different variable modification localizations. A VML score:

Greater than 1.1 indicates confident localization.
1 implies there is a distinguishing b or y ion.
0.1 means that when unassigned, the peak is less than 10% of the intensity of the base peak.
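
As a rough illustration of how a score difference can serve as a localization confidence indicator, the Python sketch below ranks alternative localizations of the same sequence and reports the gap between the two best. The function names, the site labels, and the use of the top two candidates are assumptions for illustration; only the 1.1 threshold comes from the text above.

```python
def localize(localization_scores):
    """Return (best_site, score_gap) for alternative localizations.

    localization_scores maps a hypothetical site label (e.g. "S3") to the
    score of the sequence with the modification placed at that site.
    """
    ranked = sorted(localization_scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        # Only one possible site: nothing to compare against.
        return ranked[0][0], None
    (best_site, best_score), (_, second_score) = ranked[0], ranked[1]
    return best_site, best_score - second_score

def is_confident(score_gap, threshold=1.1):
    """Apply the 'greater than 1.1' rule described above."""
    return score_gap is not None and score_gap > threshold

site, gap = localize({"S3": 12.4, "T5": 10.9})
```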

This tool saves time because you can determine modification sites without the need to inspect the spectra. For example, with this tool, you can compare and visualize phosphosite differences across samples. The sequence map shows the cleavage location for the observed ions, which provides additional information on the scoring.

Ion Mobility Workflow

The Spectrum Mill B.06.00 release provides support for Agilent IM-Q-TOF data using concatenated peak list (PKL) files generated by the Agilent MassHunter IM-MS Browser (B.07.02 or later). The PKL files contain the retention time (RT), drift time (DT), and collision cross section (CCS) values. The CCS values are written only if the data has been calibrated for the CCS calibration factors, and if the charge state of the precursor can be determined.

The PKL file is extracted using the Generic Extractor, which writes the RT, DT, and CCS values to the mzXML file that is generated. The IM values are propagated into tagSummary during a search. The Protein/Peptide Summary modes that include peptide results have an Ion mobility review field. If marked, the summary report includes the DT and CCS values. If CCS is not available, its value is reported as 0.0. Spectrum Summary also provides an Ion mobility field to report these values. The MPP APR Export (Protein-Protein Comparison summary mode) supports export of the ion mobility values if they are present in the data.

Workflow for processing IM-Q-TOF data

The workflow described here is current as of the Spectrum Mill B.06.00 release. Contact Agilent for possible updates to the recommended workflow.

To report CCS values, the data must be calibrated for calculating the CCS values. The calibration involves acquiring a tune mix that contains at least three ions with known CCS values. The calibration is done using the IM-MS Browser, and can be done on the acquisition system where the factors are applied to future acquired data, or to selected data files on the analysis system. Refer to the IM-MS Browser documentation for details.

To process IM-Q-TOF data:

In IM-MS Browser:

If the data has not been calibrated for CCS, open the tune file in IM-MS Browser, and apply CCS calibration to the data that is to be processed.
Open the data file.
Method -> Find Features (IMFE). Select Peptides as the Isotope model, and set Limit charge state to the value expected for the peptides. An Ion intensity setting of >= 100 is a reasonable default.
Method -> Filter Features. These settings may require some experimentation, depending upon the data. Select Max ion volume. Typical values to use are:

Quality score from 50 to 100
Charge state from 2 to 7
Maximum feature count 2500
Leave other filters unmarked.

Method -> Extract Fragmentation Spectra… The default values (+/- 3 seconds for RT and +/- 0.3 milliseconds for DT) are suitable.
Method -> Find Peaks in Mass Spectrum… Enable only the Maximum peak count setting and set it to 200 peaks. Do not mark Charge state assignment.
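
To make the effect of the typical Filter Features values concrete, the following Python sketch applies the same limits to a list of features. It is not IM-MS Browser code: the dictionary keys are hypothetical, and ranking by maximum ion volume before applying the feature count limit is an assumption about how the Max ion volume selection is used.

```python
def filter_features(features, quality=(50, 100), charge=(2, 7), max_count=2500):
    """Keep features inside the quality and charge ranges, then cap the count.

    Each feature is a dict with hypothetical keys "quality", "z" and
    "max_ion_volume".
    """
    kept = [
        f for f in features
        if quality[0] <= f["quality"] <= quality[1]
        and charge[0] <= f["z"] <= charge[1]
    ]
    # Assumption: the largest features are retained when the cap applies.
    kept.sort(key=lambda f: f["max_ion_volume"], reverse=True)
    return kept[:max_count]

features = [
    {"quality": 90, "z": 2, "max_ion_volume": 500.0},
    {"quality": 40, "z": 2, "max_ion_volume": 900.0},  # quality too low
    {"quality": 80, "z": 1, "max_ion_volume": 800.0},  # charge too low
    {"quality": 60, "z": 3, "max_ion_volume": 700.0},
]
selected = filter_features(features)
```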

In the Spectrum Mill:

Copy the PKL file generated by the IM-MS Browser to a new folder under msdataSM on the Spectrum Mill server. Do not place it in a cpick_in subfolder.
In the Data Extractor, select the folder with the PKL file. It will show the Generic Extractor parameters. Select the Instrument type to be Agilent ESI Q-TOF. Set the MS/MS Spectral Feature Finding parameters to correspond to your data.
In MS/MS Search, select the instrument to be Agilent ESI Q-TOF and set other parameters according to how the data is to be searched.
In Protein/Peptide Summary, the Peptide modes have an Ion mobility review field. Mark it to report the DT and CCS values. If the data was not calibrated for CCS, it reports “0.0” for CCS values. The AMRT export in the Peptide – Spectrum Match and Peptide – Distinct modes will include the DT and CCS values if the Ion mobility review field is marked.
To export ion mobility results to Agilent Mass Profiler Professional (MPP) 14.8 or later, in the Protein-Protein Comparison mode of Protein/Peptide Summary, select MPP APR Export.

Color-Coded Quantitation Results

When you mark the Intensity check box under Review Fields on the Protein/Peptide Summary form, then the results include color-coded information to make it easy to visualize relative concentrations and differences in protein abundances between samples. The color code indicates relative peptide or protein concentrations, where darker colors (e.g., red) indicate larger relative concentrations and lighter colors (e.g., yellow) indicate smaller relative concentrations. Colors in between (e.g., orange) represent intermediate concentrations. The colors make it easier to compare samples and quickly pick out sample differences.

Depending on the display mode you select, the color-coded results appear in either one or two columns of the results table.

In peptide display modes, you see Spectrum Intensity, which is the peak intensity calculated from the extracted ion chromatogram of each peptide precursor.

In protein display modes, you see one or more of the following:

Mean Peptide Spectral Intensity - mean intensity of all peptides assigned to that protein. Peptide intensities are calculated from extracted ion chromatograms from the precursor ions.
mean intensity - same as Mean Peptide Spectral Intensity
Total Protein Spectral Intensity - total intensity of all peptides assigned to that protein. Peptide intensities are calculated from extracted ion chromatograms from the precursor ions.
total intensity - same as Total Protein Spectral Intensity
Distinct Peptides (#) - number of distinct peptides detected for each protein
# spectra - total number of spectra, including those for redundant peptides
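
The protein-level values listed above can be illustrated with a short Python sketch. The input layout and function name are hypothetical, and averaging over all spectra (rather than over distinct peptides) is an assumption made for simplicity.

```python
from collections import defaultdict

def summarize_proteins(psms):
    """Summarize PSMs given as (accession, peptide_sequence, intensity) tuples."""
    by_protein = defaultdict(list)
    for accession, sequence, intensity in psms:
        by_protein[accession].append((sequence, intensity))

    summary = {}
    for accession, rows in by_protein.items():
        intensities = [intensity for _, intensity in rows]
        summary[accession] = {
            "total_intensity": sum(intensities),
            "mean_intensity": sum(intensities) / len(intensities),
            "distinct_peptides": len({sequence for sequence, _ in rows}),
            "num_spectra": len(rows),  # includes redundant peptides
        }
    return summary

result = summarize_proteins([
    ("P1", "AAK", 100.0),
    ("P1", "AAK", 300.0),
    ("P1", "CCR", 200.0),
])
```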

Multiple Sequence Alignment Tool within Protein/Peptide Summary

Introduction

The Multiple Sequence Alignment Tool within the Spectrum Mill software enables alignment and comparison of the amino acid sequences of proteins within a protein group. The software accomplishes the alignment via a transparent interface to Clustal W, a program that is available from the European Bioinformatics Institute (EBI). Agilent licenses the Clustal W program, and the Spectrum Mill installation copies it to the millbin folder on the Spectrum Mill server. Once the program is located within millbin, you access the alignment capability of Clustal W directly via links in the Protein/Peptide Summary report in the Spectrum Mill. You can also access multiple sequence alignment from a stand-alone utility – the Multiple Sequence Aligner. For more information, please see the help for that form.

Reference:

Chenna, R.; Sugawara, H.; Koike, T.; Lopez, R.; Gibson, T.; Higgins, D.G.; Thompson, J. D. “Multiple Sequence Alignment with the Clustal Series of Programs”, Nucl. Acids Res. 2003, vol. 31, no. 13, 3497–500, PubMedID: 12824352.

Note: If the database is too large (> 4.2 Gb), the alignment does not work properly. In that case, create a subset database before you do the alignment.

To Align Sequences

To access the alignment capability:

Generate a Protein/Peptide Summary report from one of the following summary modes:

Protein-Protein Comparison Columns

Protein-Protein Comparison Redundant

Click a Group # or Subgroup # link in the report.
Wait a short while for the report to display.

Report Description

The top of the report provides information about the proteins in the group, starting with the longest protein first. For each protein, the report lists:

ID – the accession number
Subgroup # – number of a protein subgroup. The presence of subgroups indicates that distinct peptides were detected for isoforms or protein family members.
Length – number of amino acids in the protein
Identical AA’s – in the aligned sequences, the number of amino acids that are identical to those in the longest protein
%ID – the percent of matching amino acids, as given by Identical AA’s divided by Length, expressed as a percentage
Species – the species for the protein from the database
Protein Name – name from protein database
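
The Identical AA's and %ID columns can be reproduced from a pair of aligned sequences, as in the Python sketch below. It assumes that both sequences come from the same alignment (equal length, with "-" marking gaps) and that Length counts only the amino acids of the protein being compared; the function name is hypothetical.

```python
def percent_identity(aligned_longest, aligned_protein, gap="-"):
    """Return (identical_aas, length, percent_id) for one aligned protein."""
    if len(aligned_longest) != len(aligned_protein):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(
        1 for a, b in zip(aligned_longest, aligned_protein)
        if a == b and a != gap
    )
    length = sum(1 for b in aligned_protein if b != gap)
    if length == 0:
        raise ValueError("protein sequence is empty")
    return identical, length, 100.0 * identical / length

identical, length, pct = percent_identity("MKTAYIAK", "MKT-YLAK")
```

In this example, six of the seven amino acids in the shorter protein match the longest protein.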

The bottom of the report aligns the amino acid sequences from the various proteins. The left column lists the protein accession numbers. To view a protein name (as given in the protein database), rest the mouse pointer on the accession number. The right side of the report displays the aligned amino acid sequences. The sequences typically occupy more than one line of text; scroll down to view subsequent lines. Blank lines indicate the start of the next section of the sequence. Colored highlights show the locations of supporting peptides for each protein identification.

For more information about Clustal W and a description of the calculations that Clustal W uses to perform the alignment, access the online help at EBI, or see the reference above.

Note: If you want to both align sequences and generate a phylogenetic tree, then use the Web form at EBI. The phylogenetic tree is not available when you use Clustal W within the Spectrum Mill.

Peptide Table

To view a table that lists the detected peptides that are present in the amino acid sequence of each protein, do one of the following:

If the mode is Protein-Protein Comparison Columns, click a 2X link.
If the mode is Protein-Protein Comparison Redundant, mark the check boxes for the proteins you wish to display, then click the button (at the top of the results) labeled Display Peptides by Protein SubGroup.

In either case, the table shows which proteins contain each detected peptide. To limit further the number of accession numbers in the table, mark the check boxes for the accession numbers you wish to display and click the Peptide Compare button.

Spectrum Summary

The Spectrum Summary tool was created as a means to sort spectra according to some measure of spectral quality. One obvious use is to find novel peptides by process of elimination, i.e. good quality spectra which remain uninterpreted after all appropriate databases have been searched. While several spectral features are available as criteria for sorting, the one which seems to do the best job of putting high quality spectra to the top of the list is the Maximum Sequence Tag Length - the longest path through a series of ions separated by amino acid masses. This is not intended as a de novo interpretation, but rather a very crude calculation which makes no attempt to consider the various possible fragment ion types nor choose which of the possible paths is actually correct. Note that high scores by this measure will represent spectra which fragment at each consecutive amino acid along the peptide backbone (most likely doubly charged spectra in electrospray MS/MS).
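
The idea of a longest path through ions separated by amino acid masses can be sketched in Python as shown below. This is a minimal illustration and not the Spectrum Mill calculation: it assumes singly charged fragment peaks, standard monoisotopic residue masses, and a fixed mass tolerance, and the function name is hypothetical.

```python
# Monoisotopic residue masses (Da); L and I share one entry.
RESIDUE_MASSES = [
    57.02146, 71.03711, 87.03203, 97.05276, 99.06841, 101.04768,
    103.00919, 113.08406, 114.04293, 115.02694, 128.05858, 128.09496,
    129.04259, 131.04049, 137.05891, 147.06841, 156.10111, 163.06333,
    186.07931,
]

def max_sequence_tag_length(peak_mzs, tol=0.02):
    """Longest chain of peaks whose consecutive spacings match a residue mass.

    Returns the number of residues in the chain (peaks in chain minus one).
    """
    peaks = sorted(peak_mzs)
    best = [0] * len(peaks)  # best[i]: longest chain ending at peak i
    for i in range(len(peaks)):
        for j in range(i):
            diff = peaks[i] - peaks[j]
            if any(abs(diff - m) <= tol for m in RESIDUE_MASSES):
                best[i] = max(best[i], best[j] + 1)
    return max(best, default=0)

# Four peaks spaced by A, G and L, plus one unrelated peak.
tag_length = max_sequence_tag_length([175.119, 246.15611, 303.17757, 416.26163, 500.0])
```

As in the description above, the sketch does not decide which ion type each peak belongs to or which path is correct; it only measures how long a consistent ladder can be.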

Spectrum Summary has also become the primary means of reporting results from Sherenga de novo Sequencing and Spectrum Matcher through its Data Integration Modes.

Spectrum Summary also allows spectra to be segregated according to quality, as a means for creating groups of spectra that can be selectively subjected to interpretation by MS/MS Search or Sherenga de novo Sequencing. See the validation state section for further details.

The Spectrum Viewer is a convenient tool for reviewing results.

To Use the Spectrum Summary Form

The following topics describe options available on the Spectrum Summary form.

Summarize Results for Review

Summarize - Click to summarize results. Click this button after you have set all parameters. This button also saves your Spectrum Summary settings so that they are retained as you navigate to other Spectrum Mill pages during your current web browser session. Once you click the Summarize button, the button is disabled until the results appear. If you need to re-enable the button, click the Summary Settings button at the top of the page to reload the Spectrum Summary form.
Save As - Click to save current Spectrum Summary settings in a parameter file.
Load - Click to load a parameter file that contains summary settings. For default values, select a parameter file from the Defaults folder. The parameter files for Spectrum Summary are there to help with the page settings, but they cannot be used in a workflow.
Excel Export - Mark to export results to Excel or to upload to LIMS. For the latter, first make sure your system administrator has configured the upload. See Exporting to Excel or Uploading to LIMS.
Spectrum Files - Modify this list if you want to process only a subset of the files in the data directory. Wildcards (*) are supported. To see the names of your spectrum files, look in the cpick_in subdirectory under the directory where you placed your raw files.

Sorting

Sort by: Determines how the spectra are sorted in the results summary

Data Directory

Click the Select ... button to select a data directory. See Selecting Data Directories.

Spectral Quality Filtering

Certain spectral features are calculated by the SM Data Extractor and can be used with multiple downstream SM modules to craft a smaller subset of high value spectra. For more details see Spectral Quality Filtering.

Filter by: Use this to filter by one additional feature. See Spectral Features.

Spectral Type/Status Filtering

Fragmentation mode - choose the mode from the drop-down list.
Spectrum validation filter: Use this to filter and list only the spectra having a particular validation setting. See Peptide Validation.
Validation preset: Used during spectral review and manual validation, and determines whether spectra are initially classified as good-spectrum, bad-spectrum, reset, or none. Select good-spectrum if you set the filter in this section to select spectra with relatively high probability of being good. Otherwise, set to bad-spectrum and then change to good-spectrum when you find good spectra as you manually review the spectra. The validation preset classifications are not yet a permanent part of the data record and can easily be changed as data are reviewed.

Data Integration Modes

DB search Result - Reports not only core features of database search results (score, sequence, sequence map, accession_number, backbone_cleavage_score, fragmentation category), but also sequence coverage metrics (recall, num covered AAs, #cuts - N,C,I).
de novo Result Sherenga - Reports features of the Sherenga de novo results. These include: score, the vertex score string (for each peptide backbone bond cleavage), the original top scoring sequence from Sherenga, sequence tag representations of the top scoring result that replace low confidence AA's with mass (using vertex score thresholds of 0 and 2), and sequence coverage metrics including recall and accuracy relative to the DB search result, based on the thresholded sequence tag representation of the top scoring result.
When Excel Export output is generated and R is installed on the SM server, plots are generated for the recall and accuracy performance of SM DB search, Sherenga, and PEAKS/Novor.
de novo Result PEAKS/Novor
- For integration with results derived from PEAKS using the same dataset the following file must be present in a subdirectory of the selected SM data Directory: cpick_in/all de novo candidates.csv
- For integration with results derived from Novor using the same dataset the following files must be present in a subdirectory of the selected SM data Directory: cpick_in/*.mgf.csv, where * corresponds to the prefixes of the *.RAW files processed in Spectrum Mill.
Reports features of the de novo results. For PEAKS these include: (average local confidence score - ALC, minimum LC score - MLC, the local confidence score string - for each AA, the original top scoring sequence from PEAKS, sequence tag representations of the top scoring result that replaces low confidence AA's with mass - using an LC threshold of 60 and 80, sequence coverage metrics including recall and accuracy relative to the DB search result based on the thresholded sequence tag representation of the top scoring result).
For Novor, similar metrics to PEAKS are reported. Where appropriate, differences are accounted for: aaScore instead of LC score, aaScore thresholds of 25, 30, 35.

Spectrum Matcher Result required -

Spectral Features

Longest sequence tag - Amino acid sequence corresponding to maximum sequence tag length
Precursor charge - Charge state of precursor ion
Fragmentation mode - Fragmentation mode used to acquire the spectrum
Precursor MH⁺ - Measured precursor ion MH⁺
m/z - Measured m/z of the precursor ion, as determined by Data Extractor
Acquired precursor m/z - Measured m/z of the precursor ion, as determined by the mass spectrometer software. This value may actually be a ¹³C isotope, which is why it may differ from Precursor m/z.
Collision energy - Available only for certain Applied Biosystems/MDS Sciex data
RT - Retention time associated with each spectrum
# b/y pairs - Number of b/y pairs represented in the spectrum
Glyco Product Ions Score - Based on the 9-ion glycosylation-signature set: 126, 138, 144, 168, 186, 204, 274, 292, 366. Numerically, GPIS is a 2-part score. See CPIS & GPIS for more detail.
Ion mobility - reports drift time (DT) and collision cross-section (CCS) values, if present. If CCS values are not calculated, a value of 0.0 is reported.
MS precursor EIC intensity - Peak intensity calculated for extracted ion chromatogram of each peptide precursor. See the totalIntensity topic under Spectral Features.
MS L/H EIC intensity - Displays intensities of all parallel light/heavy EICs calculated during data extraction. See SILAC and Other Differential Expression Quantitation. Since you typically use the Spectrum Summary page to display spectra that have not been interpreted, you do not see the light/heavy ratios calculated in the results table, nor should you attempt to calculate them from these data. Instead, use the calculations from the Protein/Peptide Summary page. The values are reported as 0’s for “metabolic” modifications, such as SILAC and ¹⁴N/¹⁵N-mixes.
Reporter ion intensity - Displays the reporter ion fragment intensities for iTRAQ and TMT experiments
MS/MS TIC intensity - After peak detection, the total intensity of all peaks in the MS/MS spectrum
# peaks Detected - Number of peaks remaining after the MS/MS Search peak detection is applied
# peaks Centroid - Number of peaks after centroiding
Profile - Number of profile peaks (Applied Biosystems/MDS Sciex data only)
Noise Mean - Mean noise calculated during Spectrum Mill data extraction. See Peak Detection.
Std dev - Noise standard deviation calculated during Spectrum Mill data extraction. See Peak Detection.
Base peak intensity - After peak detection, the intensity of the most intense peak in the MS/MS spectrum
Base peak / TIC ratio - Base peak intensity / MS/MS TIC intensity
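
The last three intensity features above are related by simple arithmetic, which the following Python sketch illustrates for a list of peak intensities remaining after peak detection. The function name is hypothetical.

```python
def base_peak_tic_ratio(intensities):
    """Return (tic, base_peak, ratio) for the detected peak intensities."""
    tic = sum(intensities)
    if tic <= 0:
        # No signal: report zeros rather than dividing by zero.
        return 0.0, 0.0, 0.0
    base_peak = max(intensities)
    return tic, base_peak, base_peak / tic

tic, base_peak, ratio = base_peak_tic_ratio([50.0, 150.0, 300.0])
```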

SILAC and Other Differential Expression Quantitation

The Spectrum Mill supports precursor ion intensity based quantitation with a wide range of labels that are used for differential expression quantitation (DEQ). A number of labels are pre-programmed in the software, but you can add your own modifications and use them for quantitation. At installation, the software supports many common modifications, including:

SILAC 2 (Arg 0-6Da, Lys 0-8Da)-mix
SILAC 3 (Arg 0-6-10Da)-mix
N-terminal propionyl-D₀, propionyl-D₅, and propionyl-mix
C-terminal methyl ester-D₀, methyl ester-D₃, and methyl ester-mix (also modifies D and E)
C-terminal ¹⁶O/¹⁸O
ICAT (D₀/D₈)
Cleavable ICAT (¹²C/¹³C)

The discussion in this section applies to the above modifications. The following modifications are also supported, but the software handles them differently:

Quantitation with iTRAQ and TMT
Quantitation for ¹⁵N and ¹⁴N/¹⁵N mix

If you have labels that exhibit small mass differences between the light and the heavy versions (~4 Da), see also Quantitation for labels with small mass differences.

For the isotopic labels other than iTRAQ, TMT and ¹⁴N/¹⁵N, regardless of whether the DEQ modification is pre-programmed or added later, the following requirements must be met:

The instrument must have a Spectrum Mill Data Extractor program that reads the instrument vendor's data file directly.
The data file must be from a data-dependent analysis that acquired both MS and MS/MS spectra.

This section describes how to display the results, how the light/heavy ratios are calculated for each peptide, and how the peptide ratios are combined to calculate a ratio for the corresponding protein. In this section, the term "SILAC" refers generically to reagents that are used for differential expression quantitation based on precursor ion intensity.

Displaying Results for SILAC and other Isotopic Labels

On the Protein/Peptide Summary page:

Under Review Fields, mark the check box for DEQ ratios (differential expression quantitation ratios).
If you wish to display the light/heavy pairs together, set Sort peptides by to Sequence.

Calculating Light/Heavy Ratios for Each Peptide

The Spectrum Mill allows a SILAC ratio to be calculated even if only one member of a heavy/light pair has been subjected to MS/MS.

As described in the Spectral Features section, for each precursor mass subjected to MS/MS, Data Extractor calculates an EIC (extracted ion chromatogram) in the intervening MS scans of an LC-MS/MS run, resulting in a chromatographic peak area for the precursor mass. In each Spectrum Mill data directory in a file called SpecFeatures.tsv these peak areas are stored in the column called totalIntensity. When you review database search results in Protein/Peptide Summary, these peak areas are retrieved for display.

When Data Extractor is run and the modification is set to one of the -mix varieties, Data Extractor calculates a parallel EIC in the intervening MS scans, depending on the m/z shift associated with the SILAC label, to yield a chromatographic peak area for the other member of the SILAC pair. Actually, multiple parallel EIC's are calculated for each precursor mass because at the time of running Data Extractor, the MS/MS spectrum has not yet been interpreted, so it is not known whether the precursor subjected to MS/MS was from a label-containing peptide at all, from a light or heavy labeled peptide, nor how many labeled residues are present in the peptide. Furthermore, on low resolution instruments, the precursor charge may not yet be known; thus the m/z shift is uncertain as well. Since Data Extractor will calculate and store all the possibilities, Protein/Peptide Summary can later retrieve the appropriate one after interpretation has been completed.

Consequently, the SILAC ratio for a particular peptide is the result of the EIC for the selected precursor mass and the result of the appropriate parallel EIC associated with the mass shift of the SILAC label. This means that a ratio can be calculated when only one member of a pair has been subjected to MS/MS.
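
To show why several parallel EICs are needed before the spectrum is interpreted, the Python sketch below lists the candidate partner m/z values for one precursor. It is an illustration under stated assumptions, not Data Extractor code: the function name is hypothetical, and the label mass difference, the charges considered, and the maximum number of labeled residues are supplied by the caller.

```python
def partner_mz_candidates(precursor_mz, label_delta, charges=(2, 3), max_sites=2):
    """List (charge, n_labels, partner_kind, partner_mz) possibilities.

    partner_kind is "heavy" if the selected precursor is assumed to be light,
    and "light" if the selected precursor is assumed to be heavy.
    """
    candidates = []
    for z in charges:
        for n in range(1, max_sites + 1):
            shift = n * label_delta / z  # m/z shift for n labels at charge z
            candidates.append((z, n, "heavy", precursor_mz + shift))
            candidates.append((z, n, "light", precursor_mz - shift))
    return candidates

# Example with a nominal 8 Da label difference (as for Lys 0-8Da).
for candidate in partner_mz_candidates(500.0, 8.0):
    print(candidate)
```

Once the search has assigned a sequence, charge, and label state, only one of these candidates applies, which is the one retrieved for the ratio.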

In the cases where both members of a SILAC pair have been subjected to MS/MS, the ratios shown for the two members will most likely be close but not identical. That is because the parallel EIC calculations are performed in the time domain based upon the particular precursor selected for MS/MS. The two calculations can differ because the two labels (for example, K₀ and K₈) may not quite co-elute, or because the chromatographic peak detection of the MS/MS-triggering precursors may have different sensitivity. The time tolerance (+/- seconds) set in Data Extractor should allow for the difference in retention times. You will not see this discrepancy in the protein mode, provided that both the K₈- and K₀-labeled precursor ions were subjected to MS/MS and that these results were of sufficient quality to be interpreted and included in the final results summary. When the peptide ratios are combined to calculate a ratio for the protein, the ratio for the pair is recalculated directly using only the EICs of each precursor, not the parallel EICs obtained using the calculated m/z shift from the precursor.

Calculating a Light/Heavy Ratio for the Corresponding Protein

After the interpreted spectra for peptides have been grouped together because they correspond to a single protein, a SILAC ratio for the protein is calculated as approximately the median of the values for the PSMs. The median, standard deviation, and number of values contributing to the median are reported in the Protein modes in Protein/Peptide Summary.

Some details associated with error and redundancy in the calculation of the median are described here.

Since the ratios of lesser abundant proteins will have poorer ion statistics, the standard deviation on the ratios will be larger and thus the ratios less trustworthy. Hence it is valuable to report standard deviations as well as ratios.
Poorer ion statistics may occur even when counting ratios from peptides toward the median for a particular protein. Some examples are peptides derived from non-specific or missed cleavages and partially oxidized methionines, or any peptide that ionizes poorly.
If multiple precursor charge states for a particular peptide are measured, all charge states contribute.
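
A simplified version of the protein-level calculation is sketched below in Python. The text above says the ratio is approximately the median, so a plain median is only an approximation of the actual behavior; the function name, the use of the sample standard deviation, and the skipping of missing or non-positive values are assumptions.

```python
from statistics import median, stdev

def protein_ratio(psm_ratios):
    """Combine PSM-level light/heavy ratios into protein-level statistics."""
    # Assumption: missing and non-positive ratios do not contribute.
    values = [r for r in psm_ratios if r is not None and r > 0]
    if not values:
        return None
    return {
        "median": median(values),
        "sd": stdev(values) if len(values) > 1 else 0.0,
        "n": len(values),
    }

stats = protein_ratio([1.0, 2.0, 4.0, -3.0, None])
```

Reporting the standard deviation and the number of contributing values alongside the median makes it easier to judge how trustworthy a protein ratio is.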

Filtering out PSMs with poor quality ratios

Protein/Peptide Summary modes that incorporate protein-level information have Protein Quantitation options relevant to precursor ion-based quantitation, including:

Exclude poor isotope quality Precursor XIC's: < 0.70 Chi2 vs. Averagine -
Exclude outlier DEQ Ratios (> 2 std dev from mean) -

Why are some ratios negative?

In Peptide and PSM level reports, some ratios (not log2 transformed) may be listed as negative. This is done to indicate that the ratio was designated as not meeting a quality control threshold. Nonetheless, the magnitude is provided and represents the actual ratio of the measured intensities, which allows you to override the quality control designation. The primary source of this negative designation is a poor-quality averagine Chi2 ratio for the partner precursor ion of the one selected for MS/MS. See the p/i/q/p code in the table below. When the parallel EICs to the selected precursor described above are being calculated in the Data Extractor, an averagine Chi2 ratio for each is calculated, but not exported to the specFeatures.1.tsv file (because there are many of these for each MS/MS spectrum). Instead, a hardcoded threshold of xx is applied and if the value is below it, the EIC intensity is simply marked as negative when written to the SpecFeatures file.

Any ratio containing a negative intensity value can be excluded from contributing to median protein and VM-site level ratios. To change this behavior, open the file millscripts/lsmDEQ.pl and toggle the variable $UNDO_QUALITY_CONTROLLED_LH_RATIO_NEGATION near the top of the file. A value of 0 is expected to exclude these ratios and a value of 1 to retain them; confirm this in the script before relying on it.
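
If you post-process exported ratios yourself, the sign convention described above can be handled as in the following Python sketch. The function and its include_flagged parameter are hypothetical and are not part of the Spectrum Mill scripts.

```python
def usable_ratios(reported_ratios, include_flagged=False):
    """Return ratio magnitudes, skipping QC-flagged (negative) ratios by default.

    A negative reported ratio means the ratio failed a quality control
    threshold; its absolute value is still the measured ratio.
    """
    usable = []
    for ratio in reported_ratios:
        if ratio < 0 and not include_flagged:
            continue  # respect the quality control designation
        usable.append(abs(ratio))
    return usable

strict = usable_ratios([1.2, -0.8, 2.0])
relaxed = usable_ratios([1.2, -0.8, 2.0], include_flagged=True)
```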

When ratios are not calculated

If the Data Extractor cannot determine a charge for a peptide (the extracted file ends in 0.pkl), it assumes a charge of +2 for determining the mass shifts for quantitation, and it looks for up to two modification sites in the peptide (e.g., two cysteines at most). When the actual charge is not +2, or when there are more than two modification sites in the peptide, the ratio is not calculated, and is reported as n/c.

Ratios are also reported as n/c when the peptide does not contain the amino acid that reacts with the labeling reagent.

The following codes in PSM/Peptide level reports may be present to indicate why a ratio was not reported:

Code	Meaning
n/c	Not calculated (see above)
d/d/z	Do not divide by zero (the denominator was zero)
o/e	Outlier excluded
r/e	Replicate excluded- the precursor ions of both the numerator and denominator labels are present as PSMs. Only the ratio for one of those PSMs is reported and counted toward the protein or VM-site level quantitation, the other PSM is designated as r/e
p/i/q	The averagine Chi2 ratio of the precursor selected for MS/MS was poor quality
p/i/q/p	The averagine Chi2 ratio of the partner precursor ion to the one selected for MS/MS was poor quality
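
When reading exported PSM or peptide reports in a script, the codes in the table can be separated from numeric ratios as in the Python sketch below. The function name and the assumption that each ratio cell holds either a number or exactly one of these codes are illustrative.

```python
REASON_CODES = {
    "n/c": "Not calculated",
    "d/d/z": "Do not divide by zero (the denominator was zero)",
    "o/e": "Outlier excluded",
    "r/e": "Replicate excluded",
    "p/i/q": "Poor averagine Chi2 ratio for the precursor selected for MS/MS",
    "p/i/q/p": "Poor averagine Chi2 ratio for the partner precursor ion",
}

def parse_ratio_field(text):
    """Return (ratio, reason): one of the two is always None."""
    text = text.strip()
    if text in REASON_CODES:
        return None, REASON_CODES[text]
    return float(text), None

ratio, reason = parse_ratio_field("o/e")
```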

In Protein Summaries, the Single Label (L,M,H Only) column is new with B.04.01. A single label protein will have all peptide ratios <= 0, which indicates that all of the peptides for the protein had ratios which were found to be one of the codes in the above table.

Quantitation for iTRAQ and TMT

The Spectrum Mill supports quantitation with iTRAQ and TMT labels. The iTRAQ (isobaric tag for relative and absolute quantitation) reagents modify the N-terminus and K, and they allow simultaneous quantitation of up to eight different cell states based on low-mass MS/MS signature ions. The processing and quantitation for iTRAQ-modified peptides is different from that described under SILAC and Other Differential Expression Quantitation.

The Spectrum Mill supports iTRAQ and TMT quantitation for Agilent Q-TOF and ion trap data, generic peak list data (requires the generic Data Extractor), and Thermo Fisher Scientific LCQ and LTQ *.raw data (requires the Thermo Fisher Scientific Data Extractor and requires that during extraction, the software merges MS² and MS³ scans from the same precursor).

Starting with version A.03.03, the Spectrum Mill supports iTRAQ in two forms:

iTRAQ, which assumes complete labeling of the N-termini and lysines, and behaves the same as iTRAQ-mix did in version A.03.02
iTRAQ Partial-mix, which assumes incomplete labeling and searches in four cycles:

No label
Lysine-only label
N-terminal-only label
Complete label (both lysines and N-terminus)

Starting with version B.04.00, the Spectrum Mill workbench supports iTRAQ4 and iTRAQ8, TMT2 and TMT6:

iTRAQ4 - select from 4 isobaric tags with masses 114 to 117
iTRAQ8 - select from 8 isobaric tags with masses 113 to 121
TMT2 - select from 2 isobaric tags with masses 126 to 127
TMT6 - select from 6 isobaric tags with masses 126 to 131

Starting with version B.05.00, TMT10 quantification is supported.

TMT10 - select from 10 isobaric tags

Data Extractor

The iTRAQ and TMT intensity calculations do not require extracted ion chromatograms from the MS data. The abundances of the iTRAQ and TMT masses are calculated from the MS/MS data. This is significantly different behavior than for the SILAC-like modifications.

MS/MS Search

With the isotopic labels used for differential expression quantitation, if you select one of the variations that ends in mix, each spectrum is searched multiple times—once for each possible label. The results are merged as a single output. For iTRAQ or TMT, only a single search is necessary. Since the tags are isobaric, all versions of the iTRAQ or TMT reagent are simultaneously fragmented during MS/MS. Further, all iTRAQ and TMT labels produce the same MS/MS fragments for a given parent peptide. Therefore, the iTRAQ or TMT labels do not have to be searched as a mix. However, each set of tags produces different reporter ions in its mass range, and the abundances of these reporter ions are used by the Spectrum Mill for relative quantitation.

Protein/Peptide Summary

To display iTRAQ and TMT results using the Protein/Peptide Summary page:

Under Review Fields, mark the intensities check box next to the iTRAQ/TMT selection list.
From the iTRAQ/TMT selection list, select either iTRAQ4, iTRAQ8, TMT2 or TMT6.
Mark the check box for Ratios control, and select the iTRAQ or TMT mass you wish to use in the denominator for ratio calculations.
If you want to see the iTRAQ or TMT modification in a report that shows peptides, mark check boxes for both N-terminus and Modifications, since the reagents react at both the N-terminus and lysines.
If you want to export your data to Excel so you can apply the correction factors that you received in your certificate of analysis for the iTRAQ reagents, mark the check box for Excel export.
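
The ratio calculation selected with the Ratios control can be pictured with the minimal sketch below. It assumes reporter ion intensities are already available per channel and simply divides each by the chosen denominator channel. The function name and the intensity values are hypothetical, and the sketch does not apply the correction factors from the certificate of analysis.

```python
def reporter_ratios(intensities, denominator):
    """Divide every reporter ion intensity by the denominator channel.

    intensities: dict mapping reporter mass (e.g. 114) to intensity.
    denominator: reporter mass to use in the denominator.
    """
    denom = intensities[denominator]
    if denom <= 0:
        raise ValueError("denominator channel has no intensity")
    return {channel: value / denom for channel, value in intensities.items()}


if __name__ == "__main__":
    # Hypothetical iTRAQ4 reporter intensities for one MS/MS spectrum.
    spectrum = {114: 1000.0, 115: 2000.0, 116: 500.0, 117: 1000.0}
    print(reporter_ratios(spectrum, denominator=114))
```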

Quantitation for ¹⁵N and ¹⁴N/¹⁵N mix

Quantitation for the metabolic isotopic labels ¹⁵N and ¹⁴N/¹⁵N mix is different than for the modifications discussed under SILAC and Other Differential Expression Quantitation. For ¹⁴N/¹⁵N mix, the quantitation begins at the Protein/Peptide Summary level rather than at the Data Extractor level. The quantitation is based on finding matching peptides with the two labels. Both the ¹⁴N and the ¹⁵N peptides must have been subjected to MS/MS, and the MS/MS Search results must indicate the same sequence and charge. ¹⁴N/¹⁵N mix and iTRAQ/TMT are the only modifications where differential expression quantitation can begin with the generic Data Extractor. The ¹⁴N/¹⁵N calculations assume 100% incorporation of ¹⁵N.
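
The pairing requirement described above (same sequence and same charge, identified with both labels) can be sketched as follows. This is an illustration only: the record layout, the function name, and the choice to report the ratio as ¹⁵N over ¹⁴N are assumptions, not details taken from the Spectrum Mill software.

```python
from collections import defaultdict


def pair_light_heavy(psms):
    """Pair 14N and 15N identifications that share sequence and charge.

    psms: iterable of dicts with keys 'sequence', 'charge',
          'label' ('14N' or '15N') and 'intensity'.
    Returns {(sequence, charge): heavy/light intensity ratio} for pairs
    where both labels were identified.
    """
    groups = defaultdict(dict)
    for psm in psms:
        key = (psm["sequence"], psm["charge"])
        by_label = groups[key]
        by_label[psm["label"]] = by_label.get(psm["label"], 0.0) + psm["intensity"]

    ratios = {}
    for key, by_label in groups.items():
        light = by_label.get("14N", 0.0)
        heavy = by_label.get("15N")
        if heavy is not None and light > 0:
            ratios[key] = heavy / light
    return ratios


if __name__ == "__main__":
    hits = [
        {"sequence": "LVNELTEFAK", "charge": 2, "label": "14N", "intensity": 4.0e5},
        {"sequence": "LVNELTEFAK", "charge": 2, "label": "15N", "intensity": 2.0e5},
        {"sequence": "YLYEIAR", "charge": 2, "label": "14N", "intensity": 1.0e5},
    ]
    print(pair_light_heavy(hits))
```

A peptide identified with only one of the two labels, such as the second sequence in the example, produces no ratio.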

Quantitation for labels with small mass differences

If you are attempting differential expression quantitation with labels that have relatively small mass differences between the light and the heavy versions (~4 Da), you need to change the Data Extractor setting for Merge scans with same precursor m/z from the default value. Change from the default window of +/-1.4 m/z to a window of +/-1.0 m/z or lower.

When there are small mass differences between labels, a 2+ peptide with both versions of such a label will show two isotopic distributions that are 2 m/z apart. With the default extractor window of +/-1.4 m/z, it is likely that when the software calculates the intensity for a given precursor m/z, some of the isotopic peaks from the precursor's light or heavy counterpart will be contained within the m/z window, which will significantly skew the DEQ results. To avoid errors in the intensity measurement, reduce the window to +/-1.0 m/z or even lower.

When you reduce the window, some MS/MS spectra may not merge, so multiple identifications of the same peptide within the merge time period may occur. However, this is preferable to inaccurate DEQ results.
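
The arithmetic behind this recommendation can be checked with a small sketch. It computes the m/z spacing between the light and heavy forms and lists which isotope peaks of the light form would fall inside a window centered on the heavy monoisotopic precursor. The function names and the isotope spacing of about 1.00335 Da are assumptions for illustration. How the software actually integrates intensity within the window is not described here.

```python
ISOTOPE_SPACING_DA = 1.00335  # approximate 13C - 12C mass difference (assumed)


def pair_spacing_mz(mass_difference_da, charge):
    """m/z distance between the light and heavy monoisotopic peaks."""
    return mass_difference_da / charge


def partner_peaks_in_window(mass_difference_da, charge, half_window_mz, n_isotopes=3):
    """Offsets (m/z, relative to the heavy monoisotopic precursor) of the
    light form's first n isotope peaks that lie inside +/- half_window_mz."""
    inside = []
    for k in range(n_isotopes):
        offset = (k * ISOTOPE_SPACING_DA - mass_difference_da) / charge
        if abs(offset) <= half_window_mz:
            inside.append(offset)
    return inside


if __name__ == "__main__":
    print("spacing:", pair_spacing_mz(4.0, 2), "m/z")
    for window in (1.4, 1.0, 0.9):
        peaks = partner_peaks_in_window(4.0, 2, window)
        print(f"+/-{window} m/z window contains {len(peaks)} light isotope peak(s)")
```

Under these assumptions, a 4 Da label on a 2+ peptide places the third isotope peak of the light form roughly 1 m/z below the heavy precursor, inside the default +/-1.4 window and close to the edge of a +/-1.0 window.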

Spectrum Matcher

Spectrum Matcher provides a means of matching one set of spectra against another in a way that is integrated into the Spectrum Mill file system, thus allowing one to define the sets of spectra according to directory location and validation state. You can also use Spectrum Matcher to compare spectra acquired with different acquisition methods to evaluate any improvements, and to evaluate the quality of spectra using the spectral quality filters.

Thus Spectrum Matcher is a tool for answering the following types of questions:

Identity mode - Are any of the spectra in my query set the same as any in the library set?

Precursor mass shift mode - Are any of the spectra in my query set related to any in the library set?

When seeking to match related spectra, the most common application is to select the same directory for both Query Set and Library Set, with the Library Set being those spectra already identified (Validation State: valid) and the Query Set being unidentified spectra (Validation State: spectrum-not-marked-sequence-not-validated).

Scoring of Matches

The score in Spectrum Matcher is very similar to that in MS/MS Search. Following peak detection, the Spectrum Matcher algorithm attempts to match every peak present in a query set MS/MS spectrum to every peak present in a library set MS/MS spectrum. The scoring system is based on the following general principles:

Peak Intensity - If a peak is "real" and explainable, intensity doesn’t matter. Very intense unexplained peaks suggest a poor match.
Precursor mass shift mode - If two spectra are from similar peptides (one modification or amino acid substitution), then the fragment ion masses may be shifted by the mass difference between the precursor masses. In calculating the mass shift, the charge states of the precursor and fragment ions are taken into account.

Spectrum Matcher has two particular scoring attributes:

Score
Bonus points for each matched peak. Bonus values are always one point per peak regardless of peak height.
Penalty points for each unmatched peak. The penalty value is the height of the peak relative to the tallest peak, that is, (peak height / height of tallest peak). For example, if an unassigned peak is 50% of the height of the tallest peak in the peak-detected spectrum, then its penalty value is 0.5, while an unassigned peak that is 10% of the height of the tallest peak has a penalty value of only 0.1. Spectrum Matcher requires a minimum score of 3 to report a match. Using the default value of 10 for Peaks (most intense ¹²C), the maximum score would be 10.
SPI - Scored Peak Intensity
From peaks remaining after peak detection, this is the percentage of total intensity in the query set spectrum that is matched to peaks in the library spectrum. Scored Peak Intensities lower than 50% suggest a poor match, or presence of non-corresponding fragment ion types in the query set spectrum. Adjust the value of Minimum matched peak intensity to something less than 50% (default value) to enable reporting of poorer quality matches.
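
The two scoring attributes can be illustrated with the sketch below. It assumes the score is the sum of bonus points minus the sum of penalty points, and that a query peak counts as matched when any library peak lies within a fixed m/z tolerance. Both are simplifying assumptions. The function name and peak values are hypothetical, and the real algorithm may differ.

```python
def score_match(query_peaks, library_mzs, tolerance=0.5):
    """Return (score, SPI percent) for one query/library spectrum pair.

    query_peaks: list of (m/z, intensity) after peak detection.
    library_mzs: list of library peak m/z values.
    """
    if not query_peaks:
        raise ValueError("query spectrum has no peaks")
    tallest = max(intensity for _, intensity in query_peaks)
    total = sum(intensity for _, intensity in query_peaks)

    score = 0.0
    matched_intensity = 0.0
    for mz, intensity in query_peaks:
        if any(abs(mz - lib_mz) <= tolerance for lib_mz in library_mzs):
            score += 1.0                    # bonus: one point per matched peak
            matched_intensity += intensity
        else:
            score -= intensity / tallest    # penalty: relative peak height
    spi = 100.0 * matched_intensity / total
    return score, spi


if __name__ == "__main__":
    query = [(100.0, 100.0), (200.0, 50.0), (300.0, 10.0), (400.0, 40.0)]
    library = [100.1, 200.0, 400.2]
    score, spi = score_match(query, library)
    print(f"score={score:.2f} SPI={spi:.1f}%")
```

In this example three peaks match and one small peak (10% of the tallest) does not, so the score falls just short of the minimum of 3 even though the SPI is high.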

Precursor Mass Shift

Spectrum Matcher compares MS/MS spectra if their precursor masses fall within the precursor m/z tolerance filter. In Precursor mass shift mode, this filter is a combination of the Precursor mass shift and Precursor m/z tolerance. You should NOT attempt to accomplish this by using a wider precursor m/z tolerance. Use a Precursor m/z tolerance consistent with the accuracy to which the precursor mass is measured. The default value for the Precursor mass shift of +/- 81 allows for the largest possible precursor mass shift associated with a mutation among the 20 standard amino acids and phosphorylation. The shift can be set in four different forms, all of which show only homologous matches, thus excluding identity mode matches:

+/- (wide range) - allows matching of a query spectrum to all library spectra spanning the range of the Precursor mass shift
=/+/- (specified shift up or down) - allows a query spectrum to match a library spectrum only if the query spectrum's precursor MH⁺ is shifted either higher or lower by the specified mass. (The program automatically takes into account precursor charge.)
-/= (specified shift down) - allows a query spectrum to match a library spectrum only if the query spectrum's precursor MH⁺ is shifted lower by the specified mass. (The program automatically takes into account precursor charge.)

Note that the +/- (wide range) form compares many more spectra, so it takes longer to run; the run time is proportional to the magnitude of the Precursor mass shift.
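
The difference between the shift forms can be sketched as a simple filter on precursor MH⁺ values. The mode names ("range", "up_or_down", "down"), the function names, and the use of a tolerance expressed directly in MH⁺ units are all assumptions made for this illustration. The conversion from m/z to MH⁺ shows one way precursor charge can be taken into account.

```python
PROTON = 1.007276  # mass of a proton in Da


def mh_plus(mz, charge):
    """Convert a precursor m/z at a given charge to singly protonated MH+."""
    return mz * charge - (charge - 1) * PROTON


def shift_allowed(query_mh, library_mh, shift, mode, tol=0.1):
    """Decide whether a query/library pair passes the precursor filter.

    mode: 'range'      - any shift up to +/- shift
          'up_or_down' - shift of exactly +shift or -shift
          'down'       - query exactly `shift` lower than library
    Pairs whose precursors agree within tol are treated as identity
    matches and rejected, since only homologous matches are wanted.
    """
    delta = query_mh - library_mh
    if abs(delta) <= tol:
        return False
    if mode == "range":
        return abs(delta) <= shift + tol
    if mode == "up_or_down":
        return abs(abs(delta) - shift) <= tol
    if mode == "down":
        return abs(delta + shift) <= tol
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    library = mh_plus(500.75, 2)
    query = library + 79.966  # e.g. a phosphorylated form of the same peptide
    for mode, shift in (("range", 81.0), ("up_or_down", 79.966), ("down", 79.966)):
        print(mode, shift_allowed(query, library, shift, mode))
```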

To Use the Spectrum Matcher Form

The following options are available on the Spectrum Matcher form. For more details, see Spectrum Matcher.

If during data review you wish to display the Spectrum Matcher form again, click the Match Settings button at the top of the page.

Match Spectra

Match - Click to search one set of spectra against another. Click this button after you have set all parameters. This button also saves your Spectrum Matcher settings so that they are retained as you navigate to other Spectrum Mill pages during your current web browser session.
Save As - Click to save current Spectrum Matcher settings in a parameter file.
Load - Click to load a parameter file that contains settings for Spectrum Matcher. For default values, select a parameter file from the Defaults folder. The parameter files for Spectrum Matcher are there to help with the page settings, but they cannot be used in a workflow.
Instrument - Select the instrument used to acquire the data. Unlike in the MS/MS Search form, changing the instrument selection does not change the Matching Tolerances. The instrument selection is only used to obtain the peak picking parameters (as set in E:\SpectrumMill\msparams_mill\instrument.txt).

Search Criteria

The following topics discuss the Search Criteria options.

Search Mode

Search mode: Select Identity or Precursor mass shift.
Precursor mass shift: See Precursor Mass Shift. This option applies only in Precursor mass shift mode.
Mass shift histogram from last search - Click this button to generate a histogram after you search in Precursor mass shift mode. This option applies only in Precursor mass shift mode.

Matching Tolerances

Minimum matched peak intensity: See Scoring of Matches.
Precursor m/z: Set to a value consistent with the mass accuracy of the instrument. See Mass Tolerances.
Product m/z: Set to a value consistent with the mass accuracy of the instrument. Units are the same as for Precursor m/z. See Mass Tolerances.

Spectral Quality Filtering (instrument-specific peak detection used in Extractor)

MS/MS Peak Detection (over-ride instrument-specific peak detection for matching)

Minimum S/N - Minimum signal-to-noise of spectral peaks retained for searching. See Peak Detection.
< Peaks (most intense ¹²C) - Set maximum number of peaks you want to use for each search. Extracted peaks will be the most intense ¹²C ions.
Minimum # of peaks - Restrict searches based on the minimum number of peaks detected during data extraction.

Data Sets

There are two key Data Sets concepts when using the Spectrum Matcher:

Query Set

Click the Select ... button to select a data directory of spectra you wish to search. See Selecting Data Directories.
Validation state - Use this to search only the spectra having a particular validation setting. See Peptide Validation.
Search result files: Modify this list if you want to process only a subset of the spectrum files in the data directory. The wildcard character (*) can be used to include only spectra from particular LC-MS/MS runs, or spectra with particular precursor charge states.

Library Set

Click the Select ... button to select a data directory of library spectra you wish to search against. See Selecting Data Directories.
Validation state - Use this to search against only spectra having a particular validation setting. See Peptide Validation.
Search result files: Modify this list if you want to search against only a subset of the spectrum files in the data directory. The wildcard character (*) can be used to include only spectra from particular LC-MS/MS runs, or spectra with particular precursor charge states.
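
The effect of a wildcard in the Search result files list can be pictured with shell-style pattern matching, as in the sketch below. The file names are hypothetical and do not reflect the actual Spectrum Mill naming scheme, and it is an assumption that the (*) wildcard behaves like the glob-style matching in Python's fnmatch module.

```python
from fnmatch import fnmatchcase


def select_files(names, patterns):
    """Keep the file names that match at least one wildcard pattern."""
    return [name for name in names if any(fnmatchcase(name, p) for p in patterns)]


if __name__ == "__main__":
    # Hypothetical names: <run>.<scan>.<charge>.pkl
    files = ["runA.0001.2.pkl", "runA.0002.3.pkl", "runB.0001.2.pkl"]
    print(select_files(files, ["runA.*"]))    # spectra from one LC-MS/MS run
    print(select_files(files, ["*.2.pkl"]))   # spectra with precursor charge 2
```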

Overview for MS Interactive Processing

From a MALDI-MS experiment that takes less than one minute, one can measure the peptide mass fingerprint of a particular protein by spotting a target with an aliquot of the proteolytic digest of the protein. The technique requires that the peptides detected are all derived from a single protein (perhaps a mixture of up to three proteins).

To search Agilent Q-TOF or TOF .d data, first use MassHunter Qualitative Analysis with Molecular Feature Extraction (MFE) to create a peak list of possible peptides, and paste that list into the Manual PMF search page. Alternatively, the peak list can be a differentially expressed list from Mass Profiler Professional. The Spectrum Mill workbench provides tools to run high-throughput PMF searches, and to review and summarize the results. The figure below illustrates the overall process.

MALDI Spectra Preprocessing

MALDI spectra must be supplied as peak list files. Depending on the instrument type, spectral preprocessing steps (centroiding, charge assignment, de-isotoping, etc.) may be done either within the instrument data system or within the Spectrum Mill. Settings in instrument.txt ensure that preprocessing steps are not duplicated between the two. The Spectrum Mill then provides tools to run high-throughput PMF searches, and to review and summarize the results. The figure below illustrates the overall process.

Experiment Scheme

Getting Started for Applied Biosystems MALDI

Acquire some mass spectra.
Calibrate and centroid the spectra, then export peak lists using the instrument data system.
Transfer the exported spectral files to the Spectrum Mill computer in a fit_batch_in directory within the Spectrum Mill file system.
From the Spectrum Mill homepage, go to the PMF Search page.
Set the appropriate parameters and run the searches.
Review the data from the PMF Search page or the PMF Summary page.

For more details on the PMF Search and PMF Summary pages, see MS PMF Search/Summary Help.

To Use the Data Extractor Form (MS)

Important note: If you wish to redo a data extraction, mark the check box for Remove all prior results.

Extraction

Extract - Click to place the task in the queue for execution. The program will execute the task to extract spectra from raw data files based on the time the command entered the queue, its capacity to process tasks in parallel, and dependencies. Click this button after you have either loaded a parameter file or manually set the parameters. The name of the current parameter file appears in red at the top of the form. Once you have saved a parameter file, you may start the extraction from a workflow rather than manually with the Extract button.
Save As - Click to save current data extraction settings in a parameter file.
Load - Click to load a parameter file that contains settings for data extraction. For default values, select a parameter file from the Defaults folder.
Remove all prior results - Mark this check box to remove prior extractions, searches and data summaries for this dataset.
Show only MS (PMF) parameters - Mark this check box to extract MS-only data, such as from the Agilent TOF. This simplifies the form to show only the parameters related to MS-only data extraction.

Data Directories

Click the Select... button to select a data directory or data directories. See Selecting Data Directories.

Modifications

Click the Choose... button to select modifications appropriate for your sample. See Choosing Modifications.

MS (PMF) Spectral Features

Note: These options are only available when you mark the check box Show only MS (PMF) parameters.

MH⁺ - Set the mass range to extract.
Extraction time range: Set the range of scan times you wish to extract from the raw data files. Use this to avoid processing regions of the chromatogram that are not of interest. Keep the default (1 to 300) to extract all scans.

MS PMF Search/Summary

Peptide mass fingerprinting (PMF) is a very popular technique for protein identification. The method encompasses digestion of the protein with site-specific proteases, measurement of the peptide masses by mass spectrometry (MS), and protein identification via a database search. The PMF Search capability within the Spectrum Mill is an advanced, automated database search program for MS-only spectra.
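
The PMF idea (digest, compute peptide masses, compare with measured masses) can be sketched in a few lines. This is a generic illustration, not the Spectrum Mill PMF Search algorithm. It assumes a simple trypsin rule (cleave after K or R, but not before P), no missed cleavages, no modifications, and standard monoisotopic residue masses. The function names and the example sequence are hypothetical.

```python
import re

# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def tryptic_peptides(sequence):
    """Split after K or R unless the next residue is P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def mh_plus_mass(peptide):
    """Monoisotopic MH+ of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER + PROTON


def match_peaks(observed, theoretical, ppm=20.0):
    """Pair each observed mass with theoretical masses within a ppm tolerance."""
    pairs = []
    for obs in observed:
        for theo in theoretical:
            if abs(obs - theo) / theo * 1e6 <= ppm:
                pairs.append((obs, theo))
    return pairs


if __name__ == "__main__":
    peptides = tryptic_peptides("MKRPAGKLLR")
    masses = [mh_plus_mass(p) for p in peptides]
    for pep, mass in zip(peptides, masses):
        print(f"{pep:8s} {mass:.4f}")
    print(match_peaks([401.2875, 500.0], masses))
```

A database search repeats this comparison for every protein in the database and ranks the proteins by how well their theoretical peptide masses explain the measured peak list.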

With PMF Search, the certainty of the identification is primarily a function of the level of mass accuracy. The Agilent TOF delivers low-ppm mass accuracy and can be used with both electrospray and atmospheric pressure MALDI sources, making it an ideal instrument for confident identifications. For Agilent TOF and Q-TOF .d data, you must use MassHunter Qual with MFE to create a peak list to paste into Manual PMF Search.

After using PMF Search, you can summarize and review results with the PMF Summary page.

For more details on the PMF Search and Summary pages, see MS PMF Search/Summary Help.

Spectrum Mill Basics

Table of Contents

Introduction

File system

Location of Spectral Files

Directory structure

How Spectrum Mill locates data files

Naming of files and folders

Overview for MS/MS Interactive Processing

Getting Started for Agilent Q-TOF and Other MS/MS Data

Spectral Preprocessing for MS/MS Data

Data Extractor

Spectral Extraction

Peak Detection

Spectral Features

MS/MS Spectral Quality Filtering

Multicore (Maximize CPUs) Data Extraction

Configuring Service Request Manager Settings

When to Change the memFactor Settings

When to Select "Maximize CPUs"

To Use the Data Extractor Form (MS/MS)

Extraction

Data Directories

Modifications

MS/MS Spectral Feature Filtering

Merge nearby MSn scans with same precursor m/z:

Merge settings for Agilent instruments in instrument.txt

Precursor m/z & Charge Assignment

Precursor Charge Assignment for MS/MS scans

Data Extractor for Generic (Peak List) Files

Settings in instrument.txt

Files generated

*.mgf file support

MS/MS Search

Search Filters

MS/MS Autovalidation

False Discovery Rate

Strategies/Modes

Global versus Local FDR

FDR at the PSM, Peptide, and Protein Levels

MS/MS Autovalidation and Workflows

Autovalidation strategies in Spectrum Mill

Auto Thresholds

Fixed Thresholds

Auto Thresholds - Discriminant

Recursive Workflows

Autovalidation Strategies and Recursive Searches

Suggested Workflows

Which Workflow to Use?

Quality Metrics & FDR

To Use the Autovalidation Form

Automatic Validation

Data Directories

Validation Strategy/Mode

Validation Parameters: Fixed Thresholds

Protein details mode

Filtering

Protein Rules

Peptide mode

Filtering

Peptide Rules

Validation Parameters: Auto Thresholds

Peptide mode

Filtering

Protein Polishing mode

VM site polishing mode

Validation Parameters: Auto Thresholds - Discriminant

Peptide mode

Protein Polishing mode

To Report Quality Metrics and FDR

Yields (spectra collected, filtered, validated)

FDR Metrics (spectra, peptide, protein)

Precursor Ion Metrics

MS/MS Interpretation Metrics

MS/MS Spectral Identifiability Metrics

Metrics reported:

Peptide Separation Metrics

Metrics reported:

Sample Handling Metrics

Peptide Fraction Overlap

Merge nearby MSⁿ scans with same precursor m/z:

Quantitation for ¹⁵N and ¹⁴N/¹⁵N mix