Spectrum Mill Basics



Introduction

Mass spectrometry has become a core technology for proteomics research, but without modern tools, there are often bottlenecks in data interpretation and review. The Agilent Spectrum Mill MS Proteomics Workbench is a comprehensive suite of software tools designed to facilitate high-throughput proteomics experiments using mass spectrometry. Key features of the Spectrum Mill include:

Intelligent spectral extraction

The Spectrum Mill data extractors preprocess data to extract high-quality spectra for database searches. Data extractors identify and exclude noise spectra and poor quality spectra, to increase the speed of database searches and to reduce the number of false positives.

The data extractors for raw data files preprocess MS/MS spectra from Agilent and Thermo Fisher Scientific instruments. MS-only spectra can be searched using peak list files or by pasting a mass list into the Manual PMF form. These extractors produce files that contain mass-intensity lists suitable for use with Spectrum Mill search programs.

An optional Spectrum Mill Data Extractor for Generic Peak List Files enables use of the Spectrum Mill with peak list files, such as those exported from Micromass Q-Tof using the ProteinLynx package. This extractor handles individual *.pkl and *.dta spectral files, or appended *.pkl files that contain multiple spectra. It also processes *.mgf files. The Spectrum Mill Generic Data Extractor prepares the peak list files for further Spectrum Mill processing.

Multiple search options

The Spectrum Mill provides multiple options for protein identification and characterization. You can search MS/MS spectra using MS/MS Search, or MS-only spectra using Manual Peptide Mass Fingerprinting (PMF) Search. Both searches include optimized scoring schemes that speed downstream data review.

MS/MS Search automates the search of large volumes of processed MS/MS spectra against protein databases. The MS/MS Search algorithm uses intelligent parallelization to provide extremely fast searches. It can operate in identity mode to find unmodified peptides or in variable modifications or homology modes to look for mutations, post-translational modifications, and chemical modifications.

Manual PMF Search performs searches of spectral peak lists that you enter into the Manual PMF Search form.

Automatic and manual match validation for MS/MS Search results

The Spectrum Mill offers both automatic and manual match validation of MS/MS Search results. Autovalidation quickly segregates those spectra that have matched well in the database search. Manual validation (in Protein/Peptide Summary) provides tools for fast, easy interactive data review and validation.

The Spectrum Mill segregates validated and unvalidated matches, and keeps a cumulative history of validated results. Spectra from remaining unvalidated matches can be re-searched using alternate parameters or databases. Each iterative search involves fewer and fewer spectra, making the searches even faster.

Fast, comprehensive result summaries

The Protein/Peptide Summary capability within the Spectrum Mill workbench allows you to summarize and correlate search results for MS/MS data. Protein/Peptide Summary includes tools to review entire directories of search results, and summaries can range from single samples to complex studies. The wide choice of summary modes makes the results accessible to biologists and biochemists, as well as mass spectrometrists.

Protein/Peptide Summary provides both qualitative and quantitative information. Qualitative results (validated search matches) are accompanied by either approximate quantitation (based on mean peak intensities of component peptides) or quantitation based on stable isotope or similar studies.

Advanced de novo spectral interpretation

For proteins not identified by database searching, the Spectrum Mill workbench also offers advanced de novo sequencing based on the Sherenga algorithm. The algorithm uses graph theory to generate a list of potential peptide sequences and to discard unrealistic solutions.

Workflow automation

The Spectrum Mill allows you to automate a typical data analysis workflow for MS/MS data files from protein digests:


File system

Before you run MS/MS Search or PMF Search with the Spectrum Mill workbench, you must place the spectral files in the appropriate directory underneath the web root on the server running the Spectrum Mill workbench. Because spectral acquisition places heavy communication demands on the link between the mass spectrometer and its control computer, this server is expected to be a separate computer from the one that controls the instrument, with file transfer occurring over a network.

Location of Spectral Files

After you configure your file system with data root directories, you can create directories to place spectra as shown below:

Directory structure

Note that you may have up to ten directory levels between msdataSM and mySampleDirectory. However, we recommend shorter paths to reduce memory usage, especially for large data sets.

How Spectrum Mill locates data files

The Spectrum Mill recognizes the bottom of the directory hierarchy (the location of data files) when it finds one of the following:

To ensure that the Spectrum Mill finds all your data files:

  1. Do not copy a processed data folder into a higher level folder.
  2. Keep your data files in subfolders that are at equivalent levels in the Spectrum Mill file system. Remember that the Spectrum Mill workbench can find only the highest level of data files in a given subfolder. For example, given these two data files,

    The Spectrum Mill will recognize datafile2.d, but not datafile1.d.

Naming of files and folders

Do not use spaces or parentheses in folder or file names. The following characters are also not permitted: | , ; % < > ? . +.
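A quick check can catch problem names before files are copied to the server. The following is a minimal Python sketch and is not part of the Spectrum Mill. The function name is hypothetical, and the sketch assumes that the check applies to the name without its final extension, so that the period in a name such as datafile1.d is not flagged.

```python
import os

# Space, parentheses, and the other characters listed above as not permitted.
FORBIDDEN = set(" ()|,;%<>?.+")

def forbidden_chars(name):
    """Return the sorted forbidden characters found in a file or folder name.

    Assumption: the final extension (for example ".d") is ignored.
    """
    stem, _ext = os.path.splitext(name)
    return sorted(set(stem) & FORBIDDEN)

print(forbidden_chars("my sample (run1).d"))  # [' ', '(', ')']
print(forbidden_chars("mySample_run1.d"))     # []
```

An empty list means the name passed this particular check; it does not guarantee that the Spectrum Mill accepts the name.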


Overview for MS/MS Interactive Processing

In an automated LC-MS/MS experiment, one can separate peptides by reversed-phase HPLC and acquire an MS/MS spectrum approximately every second on whatever happens to be eluting from the column at that particular instant. Hence in about a half hour, one can be awash in about 1000 spectra. The Spectrum Mill provides tools to extract information from that morass of data in a manner that minimizes the frustration of data overload. The figure below illustrates the overall process. Note that failure to perform any of the items properly is likely to diminish the usefulness of the final output.


Experiment Scheme

Getting Started for Agilent Q-TOF and Other MS/MS Data

  1. Acquire some mass spectra.
  2. Export spectral files.
  3. From the Spectrum Mill homepage, go to the Data Extractor page. Preprocess the spectral files. The Data Extractor program recognizes the data type and automatically uses the correct extractor:
  4. From the Spectrum Mill homepage, go to the MS/MS Search page.
  5. Set the appropriate MS/MS Search parameters and run the searches.
  6. Validate results in the Autovalidation page or manually in the Protein/Peptide Summary page.
  7. Review the data from the Protein/Peptide Summary page.
For more details on the MS/MS Search page, see the MS/MS Search Help.


Spectral Preprocessing for MS/MS Data

Data Extractor

The Spectrum Mill Data Extractor preprocesses raw data files from Agilent and Thermo Fisher Scientific instruments, to extract high-quality spectra for database searches. The Data Extractor automatically detects which type of raw file (specific instrument vendor or generic format) you have submitted and then invokes the appropriate extraction program (provided that it has been purchased and installed on your server). The MS/MS raw file data extractors extract and merge nearby MS/MS spectra from the same precursor ion. They optionally apply MS/MS similarity criteria prior to merging scans, to avoid merging closely eluting or co-eluting isobaric peptides. For Agilent *.d ion trap and Thermo Fisher Scientific *.raw ion trap data, the extractors optionally merge MS2 and MS3 scans from the same precursor. The extractors assign precursor charges where possible, centroid the MS/MS spectra, calculate spectral features, filter MS/MS spectra by quality, extract reporter ion intensities (iTRAQ and TMT), and calculate extracted ion chromatograms (EICs) for the intervening MS precursor scans. The intensities are later used for quantitation by subsequent Spectrum Mill programs.

Note: As of Spectrum Mill B.05.00, XtractorFinnigan uses the Thermo (Xcalibur or MSFileReader) code rather than Spectrum Mill code to do centroiding. Xcalibur or MSFileReader centroiding does a better job of using appropriately narrow windows across the entire mass range (particularly important for the barely resolved TMT-10 peaks). It also requires half the extraction time. Because the intensities are scaled differently (10-100-fold), you should not mix Spectrum Mill centroiding and Xcalibur centroiding across multiple directories that will later be used for a combined report.

The functionality has been split into multiple programs:

For specifics on third-party software requirements, see the Installation Guide you received with your software. In general, Agilent Q-TOF and Agilent Trap (including ETD) do not require installation of offline software.  Thermo data (*.raw) requires the offline software be installed on the server, and the version must be equal to or later than the version that was used to acquire the data.

Output from Data Extractor consists of three types of files.

  1. mzXML files containing all quality-filtered, centroided individual MS/MS spectra for an LC-MS/MS run, for Agilent Q-TOF .d and Thermo Fisher Scientific .raw data (Spectrum Mill B.04.01 and later). With Spectrum Mill B.06, the Generic Extractor extracts *.pkl files to mzXML as well. Spectra from other instruments are extracted to individual *.pkl files.
  2. A summary file: SpecFeatures.1.tsv, containing spectral characteristics such as Max. Sequence Tag length, MS/MS reporter ion intensities, precursor ion intensity, retention time, and chromatographic peak width from the MS/MS scans that are used in the MS/MS Search, Quality Metrics, Sherenga de novo Sequencing, Protein/Peptide Summary, and Spectrum Summary scripts.
  3. Log files that describe reasons for rejecting particular MS/MS spectra and the means by which the precursor charge was determined.

If your input into the Spectrum Mill consists of peak list files (for example, from Micromass Q-Tof), see also Data Extractor for Generic (Peak List) Files.

Spectral Extraction

Peak Detection

The Data Extractor performs the peak detection steps described below prior to precursor charge assignment, spectral quality filtering, and spectral feature calculation. However, the peak detection does not persist. The extracted files (*.mzXML) retain all centroided peaks, and peak detection is repeated when necessary in MS/MS Search, Spectrum Matcher, and Sherenga de novo Sequencing. Thus, the MS/MS spectrum viewer can visualize interpretation results on the full spectrum, rather than just the processed peak list.

Spectral Features

A variety of spectral characteristics are pre-calculated for possible later use in the MS/MS Search, Quality Metrics, Sherenga de novo Sequencing, Protein/Peptide Summary, and Spectrum Summary scripts. MaxSequenceTagLength and totalIntensity are the most noteworthy. The extractors calculate additional features, depending on the amino acid modifications and other settings. The extractors store the spectral features in the file specFeatures.tsv, with the variable names listed below. Only the more important fields are listed here; they are a subset of the fields that are reported.

MS/MS Spectral Quality Filtering

Although the Data Extractor filters out very poor quality spectra, certain spectral features (see features described above) can be used to craft a smaller subset of high quality spectra to limit input to MS/MS search, Spectrum Matcher, and Spectrum Summary. The same filters control the Identifiability Metrics calculated by Quality Metrics.

Multicore (Maximize CPUs) Data Extraction

Spectrum Mill B.05.00 now supports the ability to select Maximize CPUs when you extract data. Prior revisions only supported Maximize CPUs for MS/MS Search. Because data extraction can require much more memory than searches, Spectrum Mill implements a “memory governor” that prevents multiple extractions from running at the same time if available free memory becomes too low. When all physical memory is used, Windows will swap memory to disk, which significantly degrades performance. It is better to limit the number of parallel extractions than to have Windows go into swap file mode.

Configuring Service Request Manager Settings

The Spectrum Mill Service Request Manager (SRM) must be stopped for configuration changes to apply. See To Start and Stop the Spectrum Mill Workflow Manager Service for details. You must perform the following procedures from an elevated command window (cmd.exe, Run As Administrator).

The Spectrum Mill workflow configuration file (millsrm\smsrm.config) provides several parameters that configure how memory is governed:

<provider> section

<provider hostname="localhost" available="true" maxConcurrentTasks="2" minRequiredTaskMemoryGb="2">

maxConcurrentTasks This attribute is set by default during installation to be one less than the number of (multicore) CPU cores detected.
minRequiredTaskMemoryGb This attribute defaults to 2 GB. If less than that amount is available, no tasks that have been submitted to the workflow queue are allowed to run. When currently running tasks complete, memory is freed and the queued tasks then run.

<provider> <supportedTasks> section

The <task> definitions for “xtractorAgilent” and “xtractorFinnigan” support multicore processing. Each has a “memFactor” attribute. Because it is not possible to predict how much memory an extraction will require, the memFactor is used to estimate it from the data file size. For Agilent data, this factor defaults to 1.5 times the size of the file, and it applies to both centroid and profile data. For Thermo .raw data, the request manager cannot determine whether the data is profile or centroid. The memFactor of 2.7 assumes that the data is centroid. If your lab typically generates only profile data, set the memFactor for the xtractorFinnigan task to 1.0 instead.

<task type="xtractorAgilent" memFactor="1.5" />

<task type="xtractorFinnigan" memFactor="2.7" />

When to Change the memFactor Settings

Use Windows Task Manager to monitor memory usage while multiple parallel extractions are running. You can also look at the Processes tab to see how many xtractorAgilent.cgi or xtractorFinnigan.cgi processes are running at once.

If you find that available memory falls to near 0, then consider increasing the memFactor setting. This will reduce the number of parallel extractions that can be run.

If you find that you do not see very many extractor processes running at the same time, yet there appears to be enough available memory (for example, 4 GB or more), then consider reducing the memFactor value. In general, Spectrum Mill should allow the number of CPUs minus 1 to run in parallel (if no other searches are running).
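To get a feel for how a memFactor change affects parallelism, the estimate can be reproduced outside the Spectrum Mill. The following Python sketch is illustrative only. The sample XML is assembled from the fragments shown earlier, and its nesting is an assumption. The admission rule in may_start is hypothetical as well; the actual logic of the memory governor is not documented here.

```python
import xml.etree.ElementTree as ET

# Sample assembled from the fragments shown earlier; the exact nesting in
# smsrm.config is an assumption made for this illustration.
SAMPLE = """
<provider hostname="localhost" available="true"
          maxConcurrentTasks="2" minRequiredTaskMemoryGb="2">
  <supportedTasks>
    <task type="xtractorAgilent" memFactor="1.5" />
    <task type="xtractorFinnigan" memFactor="2.7" />
  </supportedTasks>
</provider>
"""

def estimated_memory_gb(provider, task_type, file_size_gb):
    """Estimate extraction memory as memFactor times the data file size."""
    for task in provider.iter("task"):
        if task.get("type") == task_type:
            return float(task.get("memFactor")) * file_size_gb
    raise KeyError(task_type)

def may_start(provider, task_type, file_size_gb, free_gb, running):
    """Hypothetical admission rule: respect the task limit, the minimum
    free memory, and the estimated need of the new extraction."""
    if running >= int(provider.get("maxConcurrentTasks")):
        return False
    if free_gb < float(provider.get("minRequiredTaskMemoryGb")):
        return False
    return estimated_memory_gb(provider, task_type, file_size_gb) <= free_gb

provider = ET.fromstring(SAMPLE)
print(estimated_memory_gb(provider, "xtractorFinnigan", 2.0))
print(may_start(provider, "xtractorAgilent", 2.0, free_gb=8.0, running=1))
```

Under these assumptions, a larger memFactor raises the estimate for every file, so fewer extractions are admitted at a given amount of free memory.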

Note that reducing the MS/MS Search Batch Size setting can also reduce the amount of memory used in searches.

When to Select "Maximize CPUs"

Select Maximize CPUs in the Data Extractor when you are extracting a single data folder that contains multiple data files. However, if you are extracting multiple data folders (where the number selected is near or greater than the number of CPUs on the server), you will generally get better performance if you do not select Maximize CPUs for the Data Extractor, because the data folders are all extracted in parallel.


To Use the Data Extractor Form (MS/MS)

The following topics describe options available on the Data Extractor form.   In general, you should retain the default settings, except for the options highlighted in red text on the form.  For more details, see Spectral Preprocessing for MS/MS Data.  Note that the options change depending upon the vendor data type to be extracted.

Important note: If you wish to redo a data extraction, mark the check box for Remove all prior results.

Extraction

Data Directories

Modifications

MS/MS Spectral Feature Filtering

Merge nearby MSn scans with same precursor m/z: 

Replicate MS/MS scans that were acquired nearby in time and have the same precursor m/z are merged into a single spectrum using the constraints below.

Merge settings for Agilent instruments in instrument.txt

The Agilent extractor merges MS/MS spectra only if they are similar. This avoids merging closely eluting or co-eluting isobaric peptides. The parameters that control the merging are set in  E:\SpectrumMill\msparams_mill\instrument.txt:

merge_num_peaks For similarity merging of MS/MS spectra, the number of peaks that match between the two spectra  must be greater than or equal to merge_num_peaks, which is a number between 0 and 50. The similarity merging takes the top 50 peaks from both spectra and compares them.
merge_SPI For similarity merging of MS/MS spectra, the percentage of the total intensity of the top 50 spectral peaks that is matched from spectrum A to spectrum B and from spectrum B to spectrum A must be greater than or equal to merge_SPI, which is a number between 0 and 100.

With the exception of the Agilent Q-TOF, all Agilent instruments that generate MS/MS data use the defaults of merge_SPI = 70 and merge_num_peaks = 25, but if you add an entry to instrument.txt, that overrides the defaults. The Agilent Q-TOF uses merge_SPI = 50 and merge_num_peaks = 5, and the software merges only fragment ions that are within a 0.05 m/z mass tolerance.
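The two similarity criteria can be illustrated with a short Python sketch. It is a simplified reading of the description above, not the extractor's actual code. Peaks are (m/z, intensity) tuples, the function names are hypothetical, and matching peaks by a fixed m/z tolerance is an assumption. The defaults shown are the Agilent Q-TOF values quoted above.

```python
def top_peaks(spectrum, n=50):
    """Return the n most intense (mz, intensity) peaks."""
    return sorted(spectrum, key=lambda p: p[1], reverse=True)[:n]

def matched_fraction(source, target, tol):
    """Return (matched peak count, percent of source intensity matched)."""
    total = sum(i for _, i in source)
    matched = [(mz, i) for mz, i in source
               if any(abs(mz - t) <= tol for t, _ in target)]
    percent = 100.0 * sum(i for _, i in matched) / total if total else 0.0
    return len(matched), percent

def similar_enough(spec_a, spec_b, merge_spi=50, merge_num_peaks=5, tol=0.05):
    """Apply merge_num_peaks and merge_SPI in both directions (A to B, B to A)."""
    a, b = top_peaks(spec_a), top_peaks(spec_b)
    n_ab, spi_ab = matched_fraction(a, b, tol)
    n_ba, spi_ba = matched_fraction(b, a, tol)
    return (min(n_ab, n_ba) >= merge_num_peaks
            and spi_ab >= merge_spi and spi_ba >= merge_spi)

spec_a = [(100.0 + 10 * k, 100.0) for k in range(10)]
spec_b = [(mz + 0.01, i) for mz, i in spec_a]   # nearly identical spectrum
spec_c = [(mz + 5.0, i) for mz, i in spec_a]    # unrelated fragment masses
print(similar_enough(spec_a, spec_b))  # True
print(similar_enough(spec_a, spec_c))  # False
```

Lowering merge_spi or merge_num_peaks in this sketch makes more spectrum pairs pass, which mirrors the effect of lowering the corresponding values in instrument.txt.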

If a significant number of peptides appear twice in the summary report, and the peptides do not have different charge states or different labels (for example, D0 and D8), then it is possible you need to modify the settings in instrument.txt. Before you do so, first increase the windows for Merge scans with same precursor m/z in the Data Extractor form. If changing the extractor settings does not produce satisfactory results, then modify instrument.txt to set merge_SPI to a lower value. Try a small change first, for example, change from merge_SPI = 70 to merge_SPI = 65. The format in instrument.txt is merge_SPI, followed by a tab, followed by the value.

You can also try setting merge_num_peaks  to a lower value (down to 20 or 15). This may be useful for some MALDI MS/MS spectra where sequence coverage is low and there are only a few large peaks in the spectrum.

For more information about modifying instrument.txt, click here.

 To customize merging, see this Help section for the Data Extractor form.

Precursor m/z & Charge Assignment

Note:  These options are not available when you mark the check box Show only MS (PMF) parameters.

Precursor Charge Assignment for MS/MS scans

Default mode - if instrument does not assign charge, the charge is assigned as 0 (ambiguous charge) unless it can be determined to be +1 as described in Find mode.

Force Mode - charge assigned as designated by the user.

Find Mode - fixed charge assigned if it can be determined as described below, otherwise 0 (ambiguous charge) assigned.

For Agilent Q-TOF data: The software examines the MS spectra for the precursor ions and calculates the theoretical isotopic distribution for all charge states from +1 up to Maximum (z), which is set in the Data Extractor form. It then uses a least squares fit to determine which is the best match for the monoisotopic peak and isotopic distribution in the experimental spectrum. The software performs a least squares calculation for each spectrum across the elution profile of the chromatographic peak and then centroids. If the check box for Find 12C is marked, then it replaces the original monoisotopic mass with the centroided mass, to provide better mass accuracy.

For Agilent Q-TOF data, the software performs the charge assignment prior to peak merging, which is the opposite of the order for low-resolution data.

For ion trap (low-resolution) CID data: Tests below are performed in the order listed.

  1. +1 If No Peaks Above Precursor - if after peak detection as described above, there are no remaining peaks in the MS/MS spectrum above the precursor m/z value with an additional allowance of 2.5 m/z for precursor isotopes, then the precursor charge is assigned as +1.
  2. +2 from b/y pairs in MS/MS scan - if after peak detection as described above, there are at least 3 b/y pairs (pairs of peaks that add up to the mass of the putative precursor MH+ plus one hydrogen), then the precursor charge is assigned as +2. Note that this calculation depends upon the putative precursor m/z (as adjusted by user designation of Find precursor 12C) and the user-designated tolerance allowed for merging scans with the same precursor m/z.
  3. +2 to Max z by checking MS scan for precursor charge distribution - the MS scan preceding the MS/MS scan is examined for peaks corresponding to additional charge states of the peptide's precursor m/z. Peaks corresponding to possible additional charge states in the MS scan are subject to a signal/noise calculation as described in the Peak Detection section and the user-designated mass tolerance allowed for merging scans with the same precursor m/z. After finding possible alternate charge states, the following further restrictions must be met before assigning the precursor charge:
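The first two tests in the list above can be sketched in Python as follows. This is an illustration under stated assumptions, not the extractor's code: the function names are hypothetical, the default tolerance of 0.7 m/z is a placeholder for the user-designated tolerance, the proton mass is used for the added hydrogen, and the input is assumed to be the list of m/z values that remain after peak detection. The third test requires MS scan data and is not covered.

```python
PROTON = 1.00728  # approximate proton mass in Da

def no_peaks_above_precursor(peaks_mz, precursor_mz, allowance=2.5):
    """Test 1: no peaks above the precursor m/z plus the isotope allowance."""
    return all(mz <= precursor_mz + allowance for mz in peaks_mz)

def count_by_pairs(peaks_mz, precursor_mz, tol):
    """Test 2: count disjoint peak pairs whose m/z values sum to MH+ plus one
    hydrogen, treating the precursor as +2 (MH+ = 2 * precursor_mz - PROTON)."""
    target = (2 * precursor_mz - PROTON) + PROTON
    ordered = sorted(peaks_mz)
    pairs, lo, hi = 0, 0, len(ordered) - 1
    while lo < hi:
        total = ordered[lo] + ordered[hi]
        if abs(total - target) <= tol:
            pairs += 1
            lo += 1
            hi -= 1
        elif total < target:
            lo += 1
        else:
            hi -= 1
    return pairs

def assign_charge(peaks_mz, precursor_mz, tol=0.7):
    """Return 1, 2, or 0 (ambiguous; the later tests would then apply)."""
    if no_peaks_above_precursor(peaks_mz, precursor_mz):
        return 1
    if count_by_pairs(peaks_mz, precursor_mz, tol) >= 3:
        return 2
    return 0

print(assign_charge([150.0, 300.0, 420.0], 450.0))
print(assign_charge([200.1, 799.9, 300.2, 699.8, 450.0, 550.0, 620.0], 500.0))
```

In the second call, three pairs of peaks sum to about 1000, which is twice the precursor m/z of 500, so the sketch reports a charge of +2.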

For Agilent ion trap ETD data: The software examines the MS/MS spectra for a pattern of peaks with reduced charge states, finds the pattern that is most complete, and uses that information to assign the charge state to the precursor ion. It tests all possible precursor charges from +1 up to Maximum (z), which is set in the Data Extractor form.

For example, to test z = 4 the software looks in the MS/MS spectrum for peaks that correspond to reduced charges of +3, +2, and +1. To test z = 5, it looks for peaks that correspond to reduced charges of +4, +3, +2, and +1. The charge state that produces the most complete pattern is the one that is picked.

For Thermo Fisher Scientific ETD data: Charge assignment uses four different tests. If any of the four tests provides a charge, the software assigns the charge unless there is a conflict. If none of the four tests provides a charge, the software creates a .0 pkl file. The four tests are:



Data Extractor for Generic (Peak List) Files

The generic Data Extractor serves two basic functions for MS/MS spectra: spectral quality filtering and spectral feature calculation. The generic Data Extractor is automatically invoked for files that contain peak lists. It handles only spectra whose peaks have already been centroided. The generic Data Extractor also processes *.mgf files that contain centroided spectra.

The generic Data Extractor performs many of the functions that the raw file Data Extractor does, but because it cannot read the raw mass spectral files, neither chromatographic time information nor MS scan data is available. Like the raw file Data Extractor, the generic Data Extractor creates the SpecFeatures.1.tsv file that contains Spectral Features such as total intensity and Maximum Sequence Tag Length. These features are used in the MS/MS Search, Sherenga de novo Sequencing, Protein/Peptide Summary, and Spectrum Summary scripts.

Settings in instrument.txt

By default, this extractor expects files that contain data that has been centroided only - not signal-to-noise processed or de-isotoped. For generic data, it is best to let the Spectrum Mill do the signal-to-noise processing and de-isotoping/charge-assignment. If your instrument software performs these functions, then add the following to the section of E:\SpectrumMill\msparams_mill\instrument.txt that applies to your instrument:

bypassSignalNoiseThresholding 1
bypassDeisotoping 1


If you want your instrument software to do signal-to-noise thresholding but not de-isotoping/charge-assignment, then add the following to the section of instrument.txt that applies to your instrument:

bypassSignalNoiseThresholding 1
bypassDeisotoping 0

For more information about modifying instrument.txt, click here.

Files generated

When you process appended *.pkl files, the software generates individual spectral files with the following naming conventions:

prefix.pkl - The starting file containing multiple spectra
prefix.scanNumber.0.parentCharge.pkl - A resulting file containing an individual spectrum

scanNumber: the consecutive order of the spectrum in the starting file
0: placeholder where function number would be if created by ProteinLynx
parentCharge: charge of the precursor ion for the spectrum
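When you post-process these files in your own scripts, the naming convention above can be split into its parts. The following Python sketch is a hypothetical helper, not a Spectrum Mill function; it assumes that the scan number, function number, and charge are always integers.

```python
def parse_pkl_name(filename):
    """Split prefix.scanNumber.0.parentCharge.pkl into its parts."""
    parts = filename.rsplit(".", 4)
    if len(parts) != 5 or parts[4] != "pkl":
        raise ValueError(f"unexpected name: {filename}")
    prefix, scan, function, charge, _ext = parts
    return {"prefix": prefix, "scanNumber": int(scan),
            "function": int(function), "parentCharge": int(charge)}

print(parse_pkl_name("sample1.12.0.2.pkl"))
```

A starting file such as sample1.pkl does not match the pattern, so the helper raises ValueError for it rather than guessing.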

MS/MS Spectral Quality Filtering and peak detection are performed as with raw file Data Extractor.

*.mgf file support

The Generic Data Extractor can parse most *.mgf files. To get the best results, make sure that the PEPMASS lines contain both mass and intensity values, and that the CHARGE line is reported.
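Before you submit an *.mgf file, you can check it for the two recommendations above. The following Python sketch is illustrative and is not part of the Spectrum Mill. It assumes the common MGF layout in which each spectrum is enclosed by BEGIN IONS and END IONS lines and header fields are written as KEY=value; the function name is hypothetical.

```python
def check_mgf(text):
    """Return (title, problem) tuples for spectra missing recommended fields."""
    problems = []
    fields = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "BEGIN IONS":
            fields = {}
        elif line == "END IONS" and fields is not None:
            title = fields.get("TITLE", "(untitled)")
            if len(fields.get("PEPMASS", "").split()) < 2:
                problems.append((title, "PEPMASS missing or lacks an intensity value"))
            if "CHARGE" not in fields:
                problems.append((title, "CHARGE line missing"))
            fields = None
        elif fields is not None and "=" in line:
            key, value = line.split("=", 1)
            fields[key.upper()] = value
    return problems

SAMPLE_MGF = """BEGIN IONS
TITLE=scan 1
PEPMASS=500.25 12345.0
CHARGE=2+
100.1 50
200.2 80
END IONS
BEGIN IONS
TITLE=scan 2
PEPMASS=622.31
150.1 30
END IONS
"""

for title, problem in check_mgf(SAMPLE_MGF):
    print(title, "-", problem)
```

In this sample, only the second spectrum is reported, once for the missing intensity value and once for the missing CHARGE line.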

To optimize results, you may need to change settings for your instrument or define a new instrument type in E:\SpectrumMill\msparams_mill\instrument.txt. The instrument.txt setting for MALDI-TOF-TOF is configured for *.mgf files where the data has been centroided, signal-to-noise filtered, and de-isotoped. With the hiEnergyCID setting of 1 in instrument.txt, the search score is not penalized for unassigned peaks.

If your spectra contain many noise peaks, when you search the spectra, reduce the value for Minimum scored peak intensity. Likewise, when you validate and summarize data, reduce the % SPI and Score filters.
 

 

MS/MS Search

Filters for excluding files from MS/MS searches are described here. MS/MS Search itself is described in the MS/MS Search Help.

Search Filters

Features for excluding files from a group of MS/MS searches are covered here. 


MS/MS Autovalidation

The MS/MS Autovalidation page permits automatic validation of results meeting user-set score thresholds.  Two major differences exist between the validation done with this page and the validation done with the Protein/Peptide Summary page.  The first difference is that with Autovalidation, the validation occurs in a single step; the validation states are immediately written to file.  The second difference is that Autovalidation permits validation using charge-state-dependent score thresholds.

Note that when you validate files via either autovalidation or manual validation (Protein/Peptide Summary page), the software lists validated hits and spectra. These are cumulative and include both the new hits and spectra you just validated, as well as those you validated previously.

False Discovery Rate

With any protein database search, you get some top hits that are correct and some that are not. In the Spectrum Mill workbench, you (or the autovalidation software) can judge which hits are more likely to be correct, based on database search score and %SPI (the percentage of the extracted spectrum that is explained by the database search result). To further ensure the quality of results, the Spectrum Mill allows you to autovalidate database search results based on false discovery rate (FDR). You set a percent FDR, which provides an independent measure of the likelihood that the results are correct.

To calculate the FDR, the software needs the results of the search of a decoy database. It gets these results when you mark the check box (in MS/MS Search) for Calculate reversed database scores. To calculate %FDR, it compares the number of top database hits from the reversed database search to the total number of top hits. It multiplies the decoy top hits by 2, under the assumption that for each incorrect top hit in the decoy (internally reversed) database, there exists an incorrect hit in the forward database (SwissProt, or whatever database you searched).

Note: To publish the calculated %FDR, use the calculations available under Quality Metrics & FDR.

Strategies/Modes

To use false discovery rate calculations most effectively for your situation, Agilent has provided a number of options for autovalidating the matches and estimating the false discovery rate.  You can choose from among three Autovalidation strategies: 

You can use all of these strategies and modes with Workflow Automation, but only certain sets in recursive workflows.  A recursive workflow involves successive searches and validations; for example, identity search, followed by autovalidation, followed by a variable modification search on a smaller database, followed by autovalidation.  The recursive workflow is incompatible with the global FDR, calculated by the Optimize score and R1-R2 ...option in the Auto thresholds/Peptide strategy/mode and by the Global FDR option in the Auto thresholds-determinant/Peptide strategy/mode. The recursive workflow leads to subsets, each of which can have different characteristics, while the global FDR calculates a single FDR value over all matches under the assumption that all the matches have uniform characteristics on average.  Therefore, you can use only the Fixed threshold strategy/modes and the Auto threshold-determinant/Peptide/Local FDR option in recursive workflows.

Global versus Local FDR

With the Auto threshold-determinant strategy Peptide mode, you can autovalidate by either Global FDR or Local FDR. The Global FDR gives an overall error rate for validated peptides in the entire data set. You choose a cutoff (for example, 1% FDR) for which you accept results. That means in the overall data set, 1% of the identifications are likely to be wrong. However, an individual validated peptide may have a much higher chance of being wrong, which is especially true for the lower-scoring results. If that is a concern, you can use the Local FDR.

To calculate Global FDR, the program orders the identifications from best (highest discriminant score, or highest score if discriminant score is disabled) to worst (lowest score), then sums the total number of hits to the reversed database (D) and the total number of hits to both forward and reversed databases (N).  Then it calculates FDR as:

FDR_global = 2D/N

The Local FDR measures the quality of each individual peptide identification. It answers the question, "If I accept this hit as a correct answer, how much does that increase my false positive rate?" As with the global FDR, you choose a cutoff (for example, 1% FDR) for which you accept results. The local FDR calculation uses the equation:

FDR_local = 2 dD/dN

In other words, it plots D on the y-axis versus N on the x-axis, and takes the derivative at each (D, N) pair. (See example graphs below.) This plot is not smooth, which causes local variations in the derivative. To get more reliable results, the program first fits a function to the plot, then takes the derivative of the function at each point.

Local FDR example 1

Local FDR example 2

As shown below, the local FDR is generally a more stringent measure of quality, so it usually gives fewer validated hits than global FDR.

Global/local FDR comparison
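The difference between the two measures can be reproduced on a small ranked list. The following Python sketch is illustrative only. The input is a list of booleans ordered from best to worst score, where True marks a hit to the reversed database. The global value follows the 2D/N formula above. For the local value, the program described above fits a function before taking the derivative; this sketch swaps in a plain finite difference over a sliding window, purely to keep the example short. The function names and the window size are hypothetical.

```python
def global_fdr(is_decoy):
    """Return 2D/N at each rank, for hits ordered best score first."""
    out, d = [], 0
    for n, decoy in enumerate(is_decoy, start=1):
        d += decoy
        out.append(2.0 * d / n)
    return out

def local_fdr(is_decoy, half_window=2):
    """Crude local estimate: 2 * dD/dN from a finite difference over a window."""
    cum, d = [0], 0
    for decoy in is_decoy:
        d += decoy
        cum.append(d)
    n_total = len(is_decoy)
    out = []
    for n in range(1, n_total + 1):
        lo = max(0, n - half_window)
        hi = min(n_total, n + half_window)
        out.append(2.0 * (cum[hi] - cum[lo]) / (hi - lo))
    return out

# Eight forward hits, then one reversed hit, then one forward hit.
hits = [False] * 8 + [True, False]
print(global_fdr(hits)[-1])  # 0.2
print(local_fdr(hits)[-1])   # 1.0
```

At the last rank the global value is still 0.2, while the local estimate is much higher because a decoy hit sits right next to it. This is why a local FDR cutoff usually validates fewer hits.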

For more information, see:

Tang, W. H.; Shilov, I. V.; and Seymour, S. L. "Nonlinear Fitting Method for Determining Local False Discovery Rates from Decoy Database Searches;" J. Proteome Res.; 2008; 7; 3661-67; DOI: 10.1021/pr070492f.

FDR at the PSM, Peptide, and Protein Levels

FDRs can be calculated at different levels: peptide spectrum match (PSM), peptide, and protein. The Autovalidation form in the Spectrum Mill calculates FDR at the PSM and protein levels, while the Quality Metrics module calculates FDR at all levels. The difference between the PSM level and the peptide level is that the PSM level may include multiple spectra for the same peptide, while the peptide level uses only the highest-scoring spectrum for each peptide. Therefore, the peptide level is a more stringent calculation.
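The step from the PSM level to the peptide level amounts to keeping one spectrum per peptide. A minimal Python sketch, assuming each PSM is a (peptide, score, is_decoy) tuple; the function name and the tuple layout are hypothetical:

```python
def best_psm_per_peptide(psms):
    """Keep only the highest-scoring PSM for each peptide sequence."""
    best = {}
    for peptide, score, is_decoy in psms:
        if peptide not in best or score > best[peptide][1]:
            best[peptide] = (peptide, score, is_decoy)
    return list(best.values())

psms = [("SAMPLER", 12.1, False),
        ("SAMPLER", 15.3, False),
        ("PEPTIDEK", 9.8, False)]
print(best_psm_per_peptide(psms))
```

Here three PSMs collapse to two peptide-level entries, and the SAMPLER entry keeps the score of 15.3. A peptide-level FDR would then be calculated on the collapsed list.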


MS/MS Autovalidation and Workflows

Autovalidation strategies in Spectrum Mill

There are three Autovalidation “strategies” in the Spectrum Mill, and each provides both a peptide-level and a protein-level Autovalidation mode, but there are some differences. In general, the Auto thresholds strategy is recommended, but there are cases where the other strategies should be used. This is discussed in the Suggested Workflows section. 

FDR

Determination of a false discovery rate (FDR) requires the data be searched with Calculate reversed database scores enabled. When enabled, Spectrum Mill reverses the sequence of amino acids in the peptide that are between the termini. For example, “SAMPLER” is also searched as “SELPMAR”. This allows for the search to use the same peptide mass, and it is faster than searching a decoy database. FDR calculations require a sufficiently large database so that false positives can be determined. This has implications for searching single protein or small species subsets, and when searching saved results.
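The reversal described above can be sketched as follows. This is a minimal illustration based only on the "SAMPLER" example; the function name is hypothetical, and how the software treats very short peptides is an assumption.

```python
def reverse_internal(peptide):
    """Reverse the residues between the two termini, keeping the termini in place."""
    if len(peptide) < 3:
        # Nothing lies between the termini (assumed behavior for short input).
        return peptide
    return peptide[0] + peptide[1:-1][::-1] + peptide[-1]
```

Because the reversed sequence contains the same residues, it has the same peptide mass as the forward sequence, which is why both can be scored against the same spectrum.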

The actual FDR obtained can be determined in the Quality Metrics & FDR page.

Auto Thresholds

The Auto thresholds strategy is available in B.04.00 and later, and is the default. With this strategy, the Peptide mode is done first and optimizes the score and Rank1-Rank2 score thresholds to reach a specified maximum FDR. This mode allows for various peptide filtering settings which are applied prior to validation. The Protein polishing mode can then be used to remove one-hit wonders and increase coverage of valid proteins. Note that Peptide followed by Protein polishing is the reverse of the order used in the Fixed thresholds strategy.

Auto Thresholds

The Auto thresholds strategy is the recommended strategy in most cases. Note that you first perform Peptide mode, then optionally use Protein polishing.

Peptide mode

For each precursor charge state, the matrix of score and Rank1-Rank2 values is examined to find the threshold values that yield the maximum number of peptide spectrum matches below the designated FDR threshold. For datasets or charge states that have small numbers of peptides, you should choose to optimize across an entire directory rather than across each LC-MS/MS run. In peptide mode, when you use the Auto thresholds strategy multiple times on the same directory, each time it only optimizes using the not-yet-valid peptide spectrum matches. The results of each round are appended to the pool of previously valid spectra. Use the Quality Metrics & FDR tool to calculate the final combined FDR.
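The threshold search for a single charge state can be pictured as a grid search, as in the sketch below. It is an illustration only: it assumes the 2D/N estimate is used for the FDR, tries every observed score and Rank1-Rank2 value as a candidate threshold, and uses a hypothetical (score, delta, is_decoy) tuple layout. The actual optimization in the program may differ.

```python
def optimize_thresholds(psms, max_fdr):
    """psms: list of (score, delta, is_decoy) tuples for one charge state.

    Returns (score_threshold, delta_threshold, n_accepted) for the pair of
    thresholds that accepts the most PSMs while keeping 2D/N <= max_fdr,
    or None if no pair meets the target.
    """
    score_cuts = sorted({psm[0] for psm in psms})
    delta_cuts = sorted({psm[1] for psm in psms})
    best = None
    for score_cut in score_cuts:
        for delta_cut in delta_cuts:
            accepted = [
                psm for psm in psms
                if psm[0] >= score_cut and psm[1] >= delta_cut
            ]
            n = len(accepted)
            if n == 0:
                continue
            decoys = sum(1 for psm in accepted if psm[2])
            if 2.0 * decoys / n <= max_fdr and (best is None or n > best[2]):
                best = (score_cut, delta_cut, n)
    return best
```

With few PSMs, the candidate grid is coarse and the FDR estimate changes in large steps, which is one way to see why optimizing across an entire directory is advised for small charge-state populations.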

Protein polishing mode

The Protein polishing mode has two goals: (1) achieving a target protein FDR, and (2) increasing the sequence coverage of validated proteins. Before using this mode, you must use the Peptide mode.

Both goals are achieved by unvalidating previously validated peptides. This unvalidation capability enables you to autovalidate marginal peptides during peptide autovalidation, yet the protein FDR is kept under control with subsequent protein polishing by unvalidating the marginal peptides that belong to marginal proteins.

Fixed Thresholds

The Fixed thresholds strategy is similar to the “classic” (A.03.03 and prior) Autovalidation, but now provides the option to calculate an FDR. New peptide filtering options are also available. In this strategy, validation is done first with Protein details mode, and then can optionally be followed with Peptide mode. The Quality Metrics & FDR page can be used to determine the FDR that was obtained.

Fixed Thresholds

Enhancements over the “classic” Autovalidation include:

Auto Thresholds - Discriminant

Discriminant Scoring allows additional factors (%SPI, Backbone Cleavage Score, Number of Complementary Fragments, Matched Sequence Tag Length, Peak Match%, Charge, Rank1-Rank2 Delta) to contribute to the scoring used in the Autovalidation.

To use this strategy, Discriminant Scoring must be enabled in the search. Effective use of discriminant scoring requires the careful curation and validation (using one of the other Autovalidation modes and manual validation) of a representative data set. The Tool Belt Calculate discriminant scoring coefficients tool is then used to create the coefficients. Several precalculated sets are provided for evaluation. Note that selection of Score in the MS/MS Search defeats the purpose of the discriminant mode, and is there for backwards compatibility only.

Auto Thresholds - Discriminant

The FDR target may be applied to either Local or Global levels.

Peptide mode - Global FDR

In this mode, the program calculates the global peptide FDR at the spectral level. The global FDR is the percentage of all the peptide identifications that are likely to be false. It is a calculation for a collection of peptides across the data set you are validating. The program adjusts the validation thresholds for peptide score (or discriminant score) until it meets the %FDR that you typed. This mode does not support recursive workflows with successive validations and searches.  

Peptide mode - Local FDR

In this mode, the local FDR measures the error rate for individual peptides at the spectral level. While the global FDR focuses on a collection of peptides, the local FDR answers the question, "Does this peptide identification increase the FDR? If I validate this identification, how many additional false positives am I likely to get?" This mode supports recursive workflows with successive validations and searches.

Compared to the global FDR calculation, the local FDR calculation requires an additional curve fitting step and is thus less robust from a computational standpoint than the global FDR calculation. The larger the data set, the more reliable the curve fitting becomes and hence the more reliable the calculated local FDR value. You should review the curve fitting, which you can see by clicking on an entry in the FDR search # column and looking at the graph titled “Fit quality for computing local false discovery rate.”

Recursive Workflows

Note: Prior to Spectrum Mill B.04.00, the recommendation for variable modification searches was to always search first with Identity mode, validate, then search in Variable mode. Because of both search performance improvements and the ability to Autovalidate to an FDR, the initial search should now include the expected variable modifications.

In recursive workflows, an initial search is done with the expected variable modifications. The results are then Autovalidated. Additional searches are then run with Search previous hits selected. This restricts the search to only those proteins that were identified and validated in the initial search. Typical uses of a recursive search are to search with a different variable modification (usually a different one for a modification that was applied during the initial search), or a different enzyme. Setting the Validation filter to spectrum-not-marked-sequence-not-validated reduces the search space to those spectra that were not validated after an earlier search.

It may be the case that changing the modifications and enzyme selections will result in completely different proteins being found during the MS/MS Search. You can combine these additional results with your previously found results by clearing the check boxes for both Remove all prior MS/MS Search results and Search previous hits.

Autovalidation Strategies and Recursive Searches

When you do recursive searches, only the following Autovalidation strategies should be used to Autovalidate after each recursive search:

The Auto thresholds strategy (either Peptide or Protein polishing mode) and the Auto thresholds – Discriminant strategy with Global FDR mode should not be used. While it might be tempting to Clear All prior validations before Autovalidating after recursive searches, this will not provide an accurate FDR, because the size of the search space is different for each round and thus the delta R1-R2 scores are not comparable.

Suggested Workflows

Auto thresholds Strategy

This workflow begins with the Peptide mode. It can then be followed by the Protein Polishing mode. Use of the latter may remove previously validated peptides to meet the protein FDR% target.

This Autovalidation workflow should not be used with recursive search workflows.  The implication is that Variable modifications searches must be done in the initial search step. Additional (recursive) searches should be followed by one of the Autovalidation strategies that support recursive searches.

Fixed threshold Strategy

When using this strategy, first do Protein Details validation, then optionally follow with Peptide validation. Do not clear the validations between searches.

Auto thresholds - Discriminant Strategy

This workflow begins with either the Peptide Global or the Peptide Local Autovalidation. (Do not do both.) Either mode can then be followed by the Protein Polishing mode.

Only the Peptide Local Autovalidation workflow can be used in the recursive search workflows.

Which Workflow to Use?

The Auto thresholds strategy automatically validates for a target FDR%, where it uses both the Score and the Rank1-Rank2 score to optimize thresholds. It provides various filtering options, and is the recommended strategy to use. The disadvantage is that it does not support the recursive search workflow, but it can be used to validate the initial search results. 

The “classic” approach using the Fixed threshold strategy still works and can be used as a reference point for evaluating the other approaches. The resulting FDR can be calculated and shown. To change the FDR target, though, you must change the various Rules settings and redo the Autovalidation.

The Auto thresholds – discriminant strategy is the simplest Autovalidation approach for FDR, but only the Peptide Local mode can be used in a recursive search workflow. The disadvantage is that Discriminant Scoring must be enabled during the search, and requires that a training set be carefully validated, although several default sets are provided for evaluation. Typically, you would use the Fixed Thresholds or the Auto Thresholds approaches, along with some manual validation, to prepare the data set. The use of Discriminant Scoring allows additional factors (%SPI, Backbone Cleavage Score, Number of Complementary Fragments, Matched Sequence Tag Length, Peak Match%, Charge, Rank1-Rank2 Delta) to contribute to the scoring used in the Autovalidation. For small data sets, the local FDR calculation may be unreliable and it is wise to use the global FDR.

Quality Metrics & FDR

All of the peptide Autovalidation modes calculate the spectra level FDR. The Protein polishing Autovalidation calculates the protein level FDR. The only place the distinct peptide level FDR is calculated is in the Quality Metrics & FDR page. The FDR may be reported at the spectra level, distinct peptide level, and protein level. 

FDR


To Use the Autovalidation Form

The following topics describe options available on the MS/MS Autovalidation form.   In general, you should retain the default settings, except for the options highlighted in red text on the form.  For more details, see MS/MS Autovalidation.

Automatic Validation

Data Directories

Validation Strategy/Mode

For an introductory explanation of the strategy/mode selections, see Strategies/Modes. Select one of three strategies for autovalidating proteins and peptides in the search results, and then select a mode associated with that strategy:

Validation Parameters: Fixed Thresholds

These parameter fields change depending on the strategy and mode you select.  See the explanations above for each strategy and its associated modes. Below are the parameter fields for the Fixed thresholds strategy.

Protein details mode

If you mark this check box, you cannot mark the check box for Fwd - Rev Score Threshold (under Protein Rules) and vice versa. You must mark this check box if you want to calculate a false discovery rate in the Tool Belt.

Filtering

You choose from one of these two options:

Protein Rules

These rules permit validation of proteins that match specified criteria.   

Peptide mode

If you mark this check box, you cannot mark the check box for Fwd - Rev Score Threshold (under Peptide Rules) and vice versa. You must mark this check box if you want to calculate a false discovery rate in the Tool Belt.

Filtering

Use the settings for Automatic variable range for each run when your runs contain peptides with very different values for these parameters.  The program calculates a range of expected values based on the amino acid sequences of the peptides, and filters those peptides from the list whose parameter values are above or below the set percentile range (25-75 percentile?).  Use the settings for Fixed range for all runs when most peptides in your runs have parameter values within a similar range, or when you have only one run.

Peptide Rules

These rules permit validation of peptides that match specified criteria.  Note that there are only five rules, whereas Protein Rules have six.
The score requirements are more stringent in peptide mode, and for peptides of higher charge states.

Validation Parameters: Auto Thresholds

These parameter fields change depending on the strategy and mode you select.  See the explanations above for each strategy and its associated modes. Below are the parameter fields for the Auto thresholds strategy.

Peptide mode

Filtering

Use the settings for Automatic variable range for each run when each run can be expected to contain different medians or ranges for instrument performance or peptide properties. Use the settings for Fixed range for all runs when all the runs contain values within a similar range and you have foreknowledge of what that range should be.

Protein Polishing mode

The Protein Polishing mode can only be used after validating in Peptide mode.

In Protein Polishing mode, the intention is to reach a target protein FDR and to eliminate unreliable protein-level identifications, particularly low-scoring proteins that are detected either by single peptides (so-called one-hit wonders) or only infrequently when multiple experiments are combined across multiple data directories. These goals are achieved by unvalidating PSMs previously validated in a peptide mode autovalidation step. This allows you to autovalidate marginal PSMs during peptide-level autovalidation, yet keep the protein FDR under control by subsequently unvalidating the marginal PSMs that cause trouble at the protein level. Removal of low-quality PSMs should also reduce the peptide-level FDR, which is recalculated via Quality Metrics after all autovalidation steps are complete. Consequently, a 2-step approach of peptide mode followed by protein polishing mode generally gives greater sequence coverage of the validated proteins than a 1-step peptide-level autovalidation whose target FDR threshold is lowered to match what the combined two-step approach reaches.

VM site polishing mode

The VM site polishing mode can only be used after validating in Peptide mode.

In VM site polishing mode, the intention is to eliminate unreliable VM site-level identifications, particularly low-scoring VM sites that are only detected as low-scoring peptides that are infrequently detected when multiple experiments are being combined across multiple data directories. This goal is achieved by unvalidating PSMs previously validated in a peptide mode autovalidation step. This allows one to autovalidate marginal PSMs during peptide-level autovalidation with the potential to increase sensitivity and diminish the number of missing values for VM site level quantitation across multiple experiments. Subsequent VM site polishing will then unvalidate marginal PSMs that are non-recurrent. Removal of low-quality PSMs should also result in reducing the peptide-level FDR that will be recalculated via Quality Metrics after all autovalidation steps are complete. Consequently, autovalidation using a 2-step approach of peptide mode followed by VM site polishing mode generally results in fewer missing values across multiple experiments as compared to a 1-step approach of peptide-level autovalidation with a target FDR threshold lowered to be equivalent to what is reached after a combined two-step approach.

Validation Parameters: Auto Thresholds - Discriminant

These parameter fields change depending on the strategy and mode you select.  See the explanations above for each strategy and its associated modes. Below are the parameter fields for the Auto thresholds - discriminant strategy.
This strategy uses discriminant scores to validate the peptides found in the MS/MS search.  See Discriminant Scoring for details.

Peptide mode

Whether you choose the Global FDR mode or the Local FDR mode, first make sure that you did the MS/MS Search with the check box marked for Calculate reversed database scores. The FDR calculations use the results from these calculations. And you must also make sure that results of any previous autovalidations or manual validations are deleted.

Protein Polishing mode

In Protein Polishing mode, the intention is to achieve a target protein FDR and to increase the sequence coverage of validated proteins.  The first objective is achieved by unvalidating previously validated peptides.  This capability allows you to autovalidate marginal peptides during peptide autovalidation, yet the protein FDR is kept under control by unvalidating the marginal peptides that cause trouble at the protein level.  The second objective is achieved by recalculating the peptide FDR only on the subset of peptides from validated proteins.  This generally results in increased sequence coverage of the validated proteins.



To Report Quality Metrics and FDR

This utility enables two functions:

To use these capabilities:

  1. On the Spectrum Mill home page, under Result Summary Tools, click Quality Metrics & FDR.
  2. Mark the check box(es) to give the results you need.
  3. Select the Data Directories for which you want to report FDR and search statistics. You may select one or more data directories. They must have sequential numbers at the end. For example, the names could be Pfu-OGE-01.d, Pfu-OGE-02.d, ... Pfu-OGE-12.d.
  4. Click the Report button.
Checking the Excel Export Checkbox will cause the reports to be written to the first directory selected. The report for file-level (LC-MS/MS run) metrics will be written to a file called qualityMetricsExportFile.1.ssv. Directory-level metrics will be written to a file called qualityMetricsExportDir.1.ssv.
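If you want to process an exported report outside the Spectrum Mill, a sketch like the following can load it. It assumes that the .ssv export is semicolon-separated text with a header row; verify this against your own export file before relying on it. The function name is hypothetical, and no particular column names are assumed.

```python
import csv


def read_ssv(path):
    """Load an .ssv report as a list of dictionaries keyed by column header.

    Assumes semicolon-separated values with one header row.
    """
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter=";"))
```

Each returned row maps the header names found in the file to the values for one LC-MS/MS run (file-level report) or one directory (directory-level report).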

Checking the box for Update Log file (2 directories up) with file level metrics will cause file-level metrics to be appended to a pre-existing file located 2 directory levels up from the first selected directory. This feature is intended for keeping an ongoing log of quality metrics for a particular instrument. The file to be appended to should be called qualityMetricsExportFile.Cady.ssv (alter the Cady portion of the filename to match the relevant instrument name). If the checkbox is not visible on the form, it can be enabled for a website via the switch variable (enableUpdateLogFileCheckbox=true) in millhtml/SM_js/SMcustomFlags.js.

The following describes the results you can show:

Yields (spectra collected, filtered, validated)

It is typical that not all spectra will be interpreted and validated. If your Collection Yield seems particularly low, there may have been an unusually high number of noisy spectra in your analysis.  Perhaps you used a low threshold for data acquisition, or maybe there was a high instrument background.  In these cases, the relative number of spectra that are picked by the Data Extractor will be low.

Both the Collection Yield and the Validation Yield will reflect to some degree how much time you spent processing the data, via homology searches, broader databases, etc.  In general, processing is complete when sufficient information has been extracted from the data to meet the experimental goals.

FDR Metrics (spectra, peptide, protein)

Precursor Ion Metrics

MS/MS Interpretation Metrics

MS/MS Spectral Identifiability Metrics

With thresholds for MS/MS Spectral Quality Filtering, several subsets of spectra are created and used to calculate metrics that help you understand the identifiability of a dataset. The metrics attempt to measure what portion of the dataset was good quality spectra, what portion of those good spectra became valid identifications, what portion of those good spectra remain to be interpreted, and the relative distribution of spectral quality in each of those portions. The spectral quality thresholds allow the user to craft the definition of "good".

If lots of good spectra are unidentified then one should consider possible causes like problems with cysteine alkylation chemistry, contaminant proteins present that are not in the database, non-specific proteolysis in the sample prior to digestion, and significant presence of unanticipated modifications.

Metrics reported:

Peptide Separation Metrics

Sample Handling Metrics

Peptide Fraction Overlap


Protein/Peptide Review of MS/MS Search Results

The Spectrum Mill provides a means for summarizing the results to answer questions like:

Summary Modes:

See Chapter 2 of the Application Guide for detailed descriptions of the current Protein/Peptide Summary displays.

 

Each mode is described by four fields: Mode, Description, Manual Validation State Assignment Available, and Example Applications.

Mode: Peptide - Spectrum Match
  Description: Peptides listed for each spectrum with links to data.
  Manual Validation State Assignment Available: yes
  Example Applications: List of PSMs present in the data.

Mode: Peptide - Distinct
  Description: Peptide is the primary organizing feature. PSMs for the same peptide are collapsed into a single row. The menu Filter to distinct peptides enables refining the notion of sameness to suit one's need (modified or not, different precursor charge, different LC-MS/MS run).
  Manual Validation State Assignment Available: no
  Example Applications: List of distinct peptides present in the data. Primary reporting mode for immunopeptidome experiments.

Mode: Protein Summary Details
  Description: Protein is the primary organizing feature. Peptides listed for each protein with links to spectra.
  Manual Validation State Assignment Available: yes
  Example Applications: Sequencing of simple mixtures of proteins, where coverage inspection is valuable.

Mode: Protein - Protein Comparison
  Description: Protein is the primary organizing feature. Each protein is listed once. Columns then show distribution of that protein among samples (one LC-MS/MS file per column, or a directory full of LC-MS/MS files treated as one column).
  Manual Validation State Assignment Available: no
  Example Applications: Primary reporting mode for quantitative whole proteome experiments. One or many LC-MS/MS files analyzed in a single directory. Directory corresponds to a sample.

Mode: Protein - Peptide Comparison
  Description: Peptide is the primary organizing feature. PSMs for the same peptide are collapsed into a single row. Peptides that belong to the same protein group are clustered, then listed in rows below each protein. The protein grouping method should be set to unexpand subgroups to prevent a peptide from being repeated for each protein subgroup in which it is a member. Columns then show distribution of each peptide among samples (one LC-MS/MS file or sample directory per column).
  Manual Validation State Assignment Available: no
  Example Applications: Evaluation of fractionation scheme.

Mode: Protein - Var Mod Site Comparison
  Description: VM site is the primary organizing feature. PSMs for the same variable modification site are collapsed into a single row. The type of VM site (phospho, acetyl, ubiquityl) is controlled by setting the corresponding value on the required AAs menu (s|t|y, k, k). The protein grouping method should be set to unexpand subgroups to prevent a VM site from being repeated for each protein subgroup in which it is a member. Columns then show distribution of each VM site among samples (one LC-MS/MS file or sample directory per column).
  Manual Validation State Assignment Available: no
  Example Applications: Primary reporting mode for quantitative phosphoproteome, acetylome, ubiquitylome experiments.

Mode: Protein - Prot Genom Site Comparison
  Description: PG site is the primary organizing feature. PSMs for the same proteogenomic site are collapsed into a single row. The type of PG site (variant or splice junction) is controlled by setting the corresponding value on the Filter by Proteogenomic Features menu. The protein grouping method should be set to unexpand subgroups to prevent a PG site from being repeated for each subgroup in which it is a member. Columns then show distribution of each PG site among samples (one LC-MS/MS file or sample directory per column). This mode is critically dependent on the prior creation of summary tables for personalized sequence databases used for the MS/MS searches.
  Manual Validation State Assignment Available: no
  Example Applications: Primary reporting mode for focusing on personalized proteogenomic features observed within a whole proteome experiment.

Protein Grouping in Protein Modes

The mechanism consists of the following steps:

  1. Extract peptides - From each search result, extract all of the rank 1 hits (may be multiple instances of the same peptide sequence matched to proteins with different accession numbers).
  2. Form proteins - Assemble all the peptides belonging to a single accession number.
  3. Eliminate peptide redundancy - Redundancy has several sources: The protein score and the number of distinct peptides are calculated so that only the instance of a particular peptide with the highest MS/MS Search score is counted (i.e. each peptide counted once, NOT multiple spectra, NOT multiple charge states, NOT multiple substitutions). The protein score is the sum of the identification scores of the distinct peptides from that protein. However, the total intensity is summed so that each observation of a peptide counts towards the total intensity for the protein (i.e. each spectrum counted once).
  4. Eliminate protein redundancy - Proteins are grouped by peptide roll-up. All proteins are sorted in descending order of number of distinct peptides. Then starting from the bottom protein, the question is asked: for this protein, is at least one of the observed peptides present in a protein higher on the list? If so, the proteins are grouped together when a peptide sequence of >8 residues is contained in multiple protein entries in the sequence database.
    In some cases when the protein sequences are grouped in this manner, there are distinct peptides that uniquely represent a lower-scoring member of the group (isoforms and family members). Each of these instances spawns a subgroup. Multiple subgroups are reported and counted towards the total number of proteins, and given related protein subgroup numbers (e.g. 3.1 and 3.2 for group 3, subgroups 1 and 2). See also the information about multiple sequence alignment. In the Protein Summary Modes, the highest-scoring member of each protein group and subgroup become the basis for further calculations. All subgroups are reported in Protein/Peptide Summary, unless the protein grouping method is set to Unexpand subgroups.
  5. Expand subgroups and shared peptides - When reporting the protein score, summed precursor intensity and quantitative ratios there are multiple possible ways of handling the peptides which are shared by more than 1 subgroup in a protein group. 4 options are provided:
    1. unexpand subgroups
      all peptides are used and protein group level values are reported without expanding into subgroups. For certain modes which display peptide level results (Protein - Peptide, Protein - Var Mod Site, Protein - Prot Genom Site) this method is valuable to prevent peptides, VM sites, and PG sites from being reported multiple times i.e. for each subgroup they are members of. When doing so, the highest scoring protein subgroup they are members of will be reported.
    2. expand subgroups, all use shared
      Shared peptides are used in each subgroup in which they are observed. This is the default approach.
    3. expand subgroups, top uses shared, SGT
      Shared peptides are used only in the top scoring subgroup. They are excluded from other subgroups. For isoforms and family members, this method is valuable for having quantitation based solely on the peptides which are distinct to that subgroup. The report filename will contain a .SGT. designation intended to mean SubGroup Top.
    4. expand subgroups, ignore shared, SGS
      Shared peptides are ignored for all subgroups. Only the subgroup specific peptides are used toward each subgroup’s count of distinct peptides and protein level quantitation. This method is particularly suited for xenograft experiments (a human tumor grown in a mouse). If evidence for BOTH human and mouse peptides from an orthologous protein were observed, then peptides that cannot distinguish the two (shared) are ignored. However, the peptides shared between species are retained if there was specific evidence for only one of the species, thus yielding a single subgroup attributed to only the single species consistent with the specific peptides. Furthermore, if all peptides observed for a protein group are shared between species, thus yielding a single subgroup composed of indistinguishable species, then all peptides are retained. The report filename will contain a .SGS. designation intended to mean SubGroup Specific.
    In some applications it is helpful to consider more than one method of handling the shared peptides. Consequently, instead of a user having to generate multiple reports (and wait for the protein grouping to be repeated), when either the SGS or SGT option is selected a second report for the all use shared option is generated when the excel export option is used for producing output.
  6. Sort protein groups and subgroups - Protein groups are sorted in descending order of protein score. Subgroups within a group are sorted in descending order of protein score that includes the peptides that are shared with other subgroups.
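Steps 2 through 4 above can be sketched as follows. This is a simplified illustration, not the Spectrum Mill algorithm: the (accession, peptide, score, intensity) tuple layout and the function names are hypothetical, the roll-up is done in a single top-down pass, and subgroup creation (step 4, second paragraph) and shared-peptide handling (step 5) are omitted.

```python
from collections import defaultdict


def summarize_proteins(psms):
    """psms: list of (accession, peptide, score, intensity) tuples.

    The protein score counts each distinct peptide once, using its highest
    score. The intensity counts every spectrum.
    """
    best_score = defaultdict(dict)
    total_intensity = defaultdict(float)
    for accession, peptide, score, intensity in psms:
        previous = best_score[accession].get(peptide)
        if previous is None or score > previous:
            best_score[accession][peptide] = score
        total_intensity[accession] += intensity
    return {
        accession: {
            "peptides": set(peptides),
            "score": sum(peptides.values()),
            "intensity": total_intensity[accession],
        }
        for accession, peptides in best_score.items()
    }


def group_proteins(summary, min_shared_length=9):
    """Roll proteins up into groups that share a peptide of more than 8 residues.

    Proteins are visited in descending order of distinct peptide count, and
    each one joins the first existing group with which it shares a peptide.
    """
    ranked = sorted(
        summary, key=lambda acc: (-len(summary[acc]["peptides"]), acc)
    )
    groups = []
    for accession in ranked:
        long_peptides = {
            pep for pep in summary[accession]["peptides"]
            if len(pep) >= min_shared_length
        }
        for group in groups:
            if any(long_peptides & summary[member]["peptides"] for member in group):
                group.append(accession)
                break
        else:
            groups.append([accession])
    return groups
```

Keeping the score and intensity bookkeeping separate reflects the distinction in step 3: repeated observations of a peptide do not raise the protein score, but they do add to the total intensity.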

Notes:

  1. The modes Protein - Var Mod Site Comparison and Protein - Peptide Comparison should be used with the protein grouping method set to Unexpand subgroups to prevent VM sites and peptides from being reported multiple times, i.e. once for each subgroup they are members of. When doing so, the highest scoring protein subgroup they are members of will be reported.
  2. In Protein Summary Details mode - When you use manual validation with 1 shared peptide, expand subgroups, the top portion of the report that lists the proteins shows the individual subgroups. The lower peptide portion of the report shows all the peptides that belong to the group; subgroup information is not given at the peptide level. Because a peptide can belong to more than one subgroup, this prevents you from assigning conflicting validation states to a single peptide that is listed multiple times in different subgroups.

For a discussion of  the principles of protein grouping, see:

Nesvizhskii, A. I.; Aebersold, R. "Interpretation of Shotgun Proteomic Data: The Protein Inference Problem;" Mol. Cell. Proteomics; 2005; 4(10); 1419-40; DOI: 10.1074/mcp.R500012-MCP200.

Peptide Validation

The Spectrum Mill provides a means for segregating search results that contain a valid interpretation of an MS/MS spectrum from those which do not. The segregated groups can then be subjected to subsequent rounds of searches (against other databases or in homology mode for example) or to produce a summarized list of only those peptides or proteins found in a sample from confidently-interpreted spectra. An interpretation which is not valid can result from several causes:

To segregate the search results, the software must keep track of both the spectrum and its interpretation in a coordinated way. The software must simultaneously keep track of spectra separately from search results, since spectra can be segregated according to quality without regard to their interpretations. The validation state of a particular spectrum or a search result can be designated with certain programs. After toggling the validation state for each search result or spectrum and clicking the perform validation button, two files are created in the appropriate data directory (hitTable.tsv, and spectrumTable.tsv). The tables record the appropriate state of search result or spectrum file according to the chart below. Files whose state is not designated are not recorded in the tables. When additional validation events are performed, the table files cumulatively record the validation states of spectra and search results for the particular data directory. Subsequent operations using different programs can thus be done using only the members of the group corresponding to combinations of states. Subsequent MS/MS searches will overwrite the results of earlier searches.

 

Validation Filter | Program Using Filter | Possible Spectrum States | Possible Interpretation (Hit) States | Program Capable of Assigning Spectrum States | Program Capable of Assigning Interpretation (Hit) States
spectrum-not-marked-sequence-not-validated | MS/MS Search; de novo Sequencing; Spectrum Summary | none | none | Protein/Peptide Summary; Spectrum Summary | Protein/Peptide Summary
sequence-not-validated | Protein/Peptide Summary; MRM Selector | none; good; bad | none | Protein/Peptide Summary; Spectrum Summary | Protein/Peptide Summary
valid | MS/MS Search; Protein/Peptide Summary; MRM Selector | valid | valid | Protein/Peptide Summary; Autovalidation | Protein/Peptide Summary; Autovalidation
good-spectrum-sequence-not-validated | MS/MS Search; de novo Sequencing; Protein/Peptide Summary; Spectrum Summary; MRM Selector | good | none | Spectrum Summary | Protein/Peptide Summary
good-spectrum | Spectrum Summary | good | none; valid | Spectrum Summary | Protein/Peptide Summary; Autovalidation
bad-spectrum | Spectrum Summary | bad | none; valid | Spectrum Summary | Protein/Peptide Summary; Autovalidation
all | Protein/Peptide Summary; MRM Selector | none; valid; good; bad | none; valid | Protein/Peptide Summary; Autovalidation; Spectrum Summary | Protein/Peptide Summary; Autovalidation
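
The chart above can be read as a lookup from each validation filter to the spectrum states and interpretation (hit) states it admits. The following minimal Python sketch encodes that reading. The dictionary name, the function name, and the use of the string "none" for an undesignated state are assumptions made for illustration; they do not describe the contents of hitTable.tsv or spectrumTable.tsv.

```python
# Illustrative lookup built from the validation-state chart.
# Each filter maps to (allowed spectrum states, allowed hit states).
# All names here are assumptions for this sketch, not Spectrum Mill internals.
VALIDATION_FILTERS = {
    "spectrum-not-marked-sequence-not-validated": ({"none"}, {"none"}),
    "sequence-not-validated": ({"none", "good", "bad"}, {"none"}),
    "valid": ({"valid"}, {"valid"}),
    "good-spectrum-sequence-not-validated": ({"good"}, {"none"}),
    "good-spectrum": ({"good"}, {"none", "valid"}),
    "bad-spectrum": ({"bad"}, {"none", "valid"}),
    "all": ({"none", "valid", "good", "bad"}, {"none", "valid"}),
}


def passes_filter(filter_name, spectrum_state, hit_state):
    """Return True if the state pair is admitted by the named filter."""
    spectrum_states, hit_states = VALIDATION_FILTERS[filter_name]
    return spectrum_state in spectrum_states and hit_state in hit_states
```

For example, passes_filter("good-spectrum", "good", "valid") is True, while passes_filter("valid", "good", "none") is False.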

The Spectrum Viewer is a convenient tool for reviewing results.


To Use the Protein/Peptide Summary Form

The following topics describe options available on the Protein/Peptide Summary form. Note that the options under Validation and Sorting and Review Fields change depending upon which Mode you select. This section describes all possible options; you may see only a subset of these on your form.

If during data review you wish to display the Protein/Peptide Summary form again, click the Summary Settings button at the top of the page.

For more details, see Protein/Peptide Review of MS/MS Search Results.

Summarize Results for Review

Validation and Sorting

Review Fields

Protein Quantitation Options

These options are available only in certain protein modes.

Spectrum Grouping Options

These options are available only in the Protein-Peptide Comparison Columns mode.
Each precursor ion intensity reported contains the summed value from all of the peptide spectrum matches (PSMs) that were grouped together.


Variable Modification Localization within Protein/Peptide Summary

Variable modification localization is a unique Spectrum Mill feature that assigns modifications to specific location(s) in a sequence when you have two or more possibilities. In addition, it provides a confidence indicator, which is the difference in score between equivalent identified sequences with different variable modification localizations. A VML score:

This tool saves time because you can determine modification sites without the need to inspect the spectra. For example, with this tool, you can compare and visualize phosphosite differences across samples. The sequence map shows the cleavage location for the observed ions, which provides additional information on the scoring.
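
As a rough illustration of the confidence indicator described above, the sketch below takes the scores obtained for each candidate localization of the same sequence and reports the gap between the best and the runner-up. The function name, the input layout, and the choice of "best minus second best" are assumptions for this sketch; the actual VML calculation in the Spectrum Mill may differ.

```python
def vml_score(localization_scores):
    """Return (best_site_label, score_gap) for one peptide.

    localization_scores maps a hypothetical localization label (for
    example "S3") to the search score obtained with the modification
    placed at that site. The gap is the best score minus the runner-up.
    """
    if len(localization_scores) < 2:
        raise ValueError("need at least two candidate localizations")
    ranked = sorted(localization_scores.items(),
                    key=lambda item: item[1], reverse=True)
    (best_site, best), (_, runner_up) = ranked[0], ranked[1]
    return best_site, best - runner_up
```

A larger gap suggests that the spectrum distinguishes the reported site from the alternatives; a gap near zero suggests the site is ambiguous.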


Ion Mobility Workflow

The Spectrum Mill B.06.00 release provides support for Agilent IM-Q-TOF data using concatenated peak list (PKL) files generated by the Agilent MassHunter IM-MS Browser (B.07.02 or later). The PKL files contain the retention time (RT), drift time (DT), and collision cross section (CCS) values. The CCS values are written only if the data has been calibrated for CCS and the charge state of the precursor can be determined.

The PKL file is extracted using the Generic Extractor, which writes the RT, DT, and CCS values to the mzXML file that is generated. The IM values are propagated into tagSummary during a search. The Protein/Peptide Summary modes that include peptide results have an Ion mobility review field. If marked, the summary report includes the DT and CCS values. If CCS is not available, its value is reported as 0.0. Spectrum Summary also provides an Ion mobility field to report these values. The MPP APR Export (Protein-Protein Comparison summary mode) supports export of the ion mobility values if they are present in the data.

Workflow for processing IM-Q-TOF data

The workflow described here is current as of the Spectrum Mill B.06.00 release. Contact Agilent for possible updates to the recommended workflow.

To report CCS values, the data must first be calibrated for CCS. The calibration involves acquiring a tune mix that contains at least three ions with known CCS values. The calibration is done using the IM-MS Browser, either on the acquisition system, where the factors are applied to subsequently acquired data, or on the analysis system, where they are applied to selected data files. Refer to the IM-MS Browser documentation for details.

To process IM-Q-TOF data:

In IM-MS Browser:

  1. If the data has not been calibrated for CCS, open the tune file in IM-MS Browser, and apply CCS calibration to the data that is to be processed.
  2. Open the data file.
  3. Method -> Find Features (IMFE). Select Peptides as the Isotope model, and set Limit charge state to the value expected for the peptides. An Ion intensity setting of >= 100 is a reasonable default.
  4. Method -> Filter Features. These settings may require some experimentation, depending upon the data. Select Max ion volume. Typical values to use are:
  5. Method -> Extract Fragmentation Spectra… The default values (+/- 3 seconds for RT and +/- 0.3 milliseconds for DT) are suitable.
  6. Method -> Find Peaks in Mass Spectrum… Enable only the Maximum peak count option and set it to 200 peaks; do not mark Charge state assignment.

In the Spectrum Mill:

  1. Copy the PKL file generated by the IM-MS Browser to a new folder under msdataSM on the Spectrum Mill server. Do not place it in a cpick_in subfolder.
  2. In the Data Extractor, select the folder with the PKL file. It will show the Generic Extractor parameters. Select the Instrument type to be Agilent ESI Q-TOF. Set the MS/MS Spectral Feature Finding parameters to correspond to your data.
  3. In MS/MS Search, select the instrument to be Agilent ESI Q-TOF and set other parameters according to how the data is to be searched.
  4. In Protein/Peptide Summary, the Peptide modes have an Ion mobility review field. Mark it to report the DT and CCS values. If the data was not calibrated for CCS, it reports “0.0” for CCS values. The AMRT export in the Peptide – Spectrum Match and Peptide – Distinct modes will include the DT and CCS values if the Ion mobility review field is marked.
  5. To export ion mobility results to Agilent Mass Profiler Professional (MPP) 14.8 or later, in the Protein-Protein Comparison mode of Protein/Peptide Summary, select MPP APR Export.
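
When post-processing an exported report outside the Spectrum Mill, it can help to treat the 0.0 placeholder as a missing CCS value rather than as a measurement. A minimal sketch, assuming the CCS value has been read from the report as text or as a number; the function name is hypothetical.

```python
def normalize_ccs(ccs_value):
    """Map the 0.0 placeholder reported for uncalibrated data to None."""
    value = float(ccs_value)
    return None if value == 0.0 else value
```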

Color-Coded Quantitation Results

When you mark the Intensity check box under Review Fields on the Protein/Peptide Summary form, the results include color-coded information to make it easy to visualize relative concentrations and differences in protein abundances between samples. The color code indicates relative peptide or protein concentrations, where darker colors (e.g., red) indicate larger relative concentrations and lighter colors (e.g., yellow) indicate smaller relative concentrations. Colors in between (e.g., orange) represent intermediate concentrations. The colors make it easier to compare samples and quickly pick out sample differences.

Depending on the display mode you select, the color-coded results appear in either one or two columns of the results table. 

In peptide display modes, you see Spectrum Intensity, which is the peak intensity calculated from the extracted ion chromatogram of each peptide precursor. 

In protein display modes, you see one or more of the following:


Multiple Sequence Alignment Tool within Protein/Peptide Summary

Introduction

The Multiple Sequence Alignment Tool within the Spectrum Mill software enables alignment and comparison of the amino acid sequences of proteins within a protein group.  The software accomplishes the alignment via a transparent interface to Clustal W, a program that is available from the European Bioinformatics Institute (EBI). Agilent licenses the Clustal W program, and the Spectrum Mill installation copies it to the millbin folder on the Spectrum Mill server. Once the program is located within millbin, you access the alignment capability of Clustal W directly via links in the Protein/Peptide Summary report in the Spectrum Mill. You can also access multiple sequence alignment from a stand-alone utility – the Multiple Sequence Aligner. For more information, please see the help for that form.

Reference:

Chenna, R.; Sugawara, H.; Koike,T.; Lopez, R.; Gibson, T.; Higgins, D.G.; Thompson, J. D. “Multiple Sequence Alignment with the Clustal Series of Programs”, Nucl. Acids Res. 2003, vol. 31, no. 13, 3497–500, PubMedID: 12824352.

Note: If the database is too large (> 4.2 Gb), the alignment does not work properly. In that case, create a subset database before you do the alignment.

To Align Sequences

To access the alignment capability:

  1. Generate a Protein/Peptide Summary report from one of the following summary modes:
  2. Click a Group # or Subgroup # link in the report.
  3. Wait a short while for the report to display.

Report Description

The top of the report provides information about the proteins in the group, starting with the longest protein first. For each protein, the report lists:

The bottom of the report aligns the amino acid sequences from the various proteins. The left column lists the protein accession numbers. To view a protein name (as given in the protein database), rest the mouse pointer on the accession number. The right side of the report displays the aligned amino acid sequences. The sequences typically occupy more than one line of text; scroll down to view subsequent lines. Blank lines indicate the start of the next section of the sequence. Colored highlights show the locations of supporting peptides for each protein identification.

For more information about Clustal W and a description of the calculations that Clustal W uses to perform the alignment, access the online help at EBI, or see the reference above.

Note: If you want to both align sequences and generate a phylogenetic tree, then use the Web form at EBI. The phylogenetic tree is not available when you use Clustal W within the Spectrum Mill.

Peptide Table

To view a table that lists the detected peptides that are present in the amino acid sequence of each protein, do one of the following:

In either case, the table shows which proteins contain each detected peptide. To limit further the number of accession numbers in the table, mark the check boxes for the accession numbers you wish to display and click the Peptide Compare button.


Spectrum Summary

The Spectrum Summary tool was created as a means to sort spectra according to some measure of spectral quality. One obvious use is to find novel peptides by process of elimination, i.e. good quality spectra which remain uninterpreted after all appropriate databases have been searched. While several spectral features are available as criteria for sorting, the one which seems to do the best job of putting high quality spectra to the top of the list is the Maximum Sequence Tag Length - the longest path through a series of ions separated by amino acid masses. This is not intended as a de novo interpretation, but rather a very crude calculation which makes no attempt to consider the various possible fragment ion types nor choose which of the possible paths is actually correct. Note that high scores by this measure will represent spectra which fragment at each consecutive amino acid along the peptide backbone (most likely doubly charged spectra in electrospray MS/MS).
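
The idea of a "longest path through a series of ions separated by amino acid masses" can be sketched as a small dynamic program over the sorted peak list. This is only an illustration of the concept, not the Spectrum Mill implementation: the function name, the 0.02 tolerance, and the assumption of singly charged fragment peaks are all choices made for this sketch, and like the measure described above it ignores ion types.

```python
# Monoisotopic amino acid residue masses in Da (Leu and Ile share a mass).
RESIDUE_MASSES = [
    57.02146, 71.03711, 87.03203, 97.05276, 99.06841, 101.04768,
    103.00919, 113.08406, 114.04293, 115.02694, 128.05858, 128.09496,
    129.04259, 131.04049, 137.05891, 147.06841, 156.10111, 163.06333,
    186.07931,
]


def max_sequence_tag_length(peaks_mz, tolerance=0.02):
    """Count residue-sized steps in the longest chain of peaks.

    Two peaks are linked when their m/z difference matches any residue
    mass within the tolerance. Assumes singly charged fragment peaks.
    """
    mz = sorted(peaks_mz)
    steps = [0] * len(mz)  # steps[i]: longest chain ending at peak i
    for i in range(len(mz)):
        for j in range(i):
            gap = mz[i] - mz[j]
            if any(abs(gap - mass) <= tolerance for mass in RESIDUE_MASSES):
                steps[i] = max(steps[i], steps[j] + 1)
    return max(steps, default=0)
```

Because every pair of peaks is compared, the run time grows with the square of the peak count, which is acceptable for peak lists limited to a few hundred peaks.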

Spectrum Summary has also become the primary means of reporting results from Sherenga de novo Sequencing and Spectrum Matcher through its Data Integration Modes.

Spectrum Summary also allows spectra to be segregated according to quality, as a means for creating groups of spectra that can be selectively subjected to interpretation by MS/MS Search or Sherenga de novo Sequencing. See the validation state section for further details.

The Spectrum Viewer is a convenient tool for reviewing results. 


To Use the Spectrum Summary Form

The following topics describe options available on the Spectrum Summary form. 

Summarize Results for Review

Sorting

Data Directory

Spectral Quality Filtering

Certain Spectral Features are calculated by the SM Data Extractor and can be used with multiple downstream SM modules to craft a smaller subset of high-value spectra. For more details, see Spectral Quality Filtering.

Filter by: Use this to filter by one additional feature. See Spectral Features.

Spectral Type/Status Filtering

Data Integration Modes

Spectral Features


SILAC and Other Differential Expression Quantitation

The Spectrum Mill supports precursor ion intensity based quantitation with a wide range of labels that are used for differential expression quantitation (DEQ). A number of labels are pre-programmed in the software, but you can add your own modifications and use them for quantitation. At installation, the software supports many common modifications, including:

The discussion in this section applies to the above modifications. The following modifications are also supported, but the software handles them differently:

If you have labels that exhibit small mass differences between the light and the heavy versions (~4 Da), see also Quantitation for labels with small mass differences.

For the isotopic labels other than iTRAQ, TMT and 14N/15N, regardless of whether the DEQ modification is pre-programmed or added later, the following requirements must be met:

This section describes how to display the results, how the light/heavy ratios are calculated for each peptide, and how the peptide ratios are combined to calculate a ratio for the corresponding protein. In this section, the term "SILAC" refers generically to reagents that are used for differential expression quantitation based on precursor ion intensity.

Displaying Results for SILAC and other Isotopic Labels

On the Protein/Peptide Summary page:

  1. Under Review Fields, mark the check box for DEQ ratios (differential expression quantitation ratios).
  2. If you wish to display the light/heavy pairs together, set Sort peptides by to Sequence.

Calculating Light/Heavy Ratios for Each Peptide

The Spectrum Mill allows a SILAC ratio to be calculated even if only one member of a heavy/light pair has been subjected to MS/MS.

As described in the Spectral Features section, for each precursor mass subjected to MS/MS, Data Extractor calculates an EIC (extracted ion chromatogram) in the intervening MS scans of an LC-MS/MS run, resulting in a chromatographic peak area for the precursor mass. In each Spectrum Mill data directory in a file called SpecFeatures.tsv these peak areas are stored in the column called totalIntensity. When you review database search results in Protein/Peptide Summary, these peak areas are retrieved for display.

When Data Extractor is run and the modification is set to one of the -mix varieties, Data Extractor calculates a parallel EIC in the intervening MS scans, depending on the m/z shift associated with the SILAC label, to yield a chromatographic peak area for the other member of the SILAC pair. In practice, multiple parallel EICs are calculated for each precursor mass. At the time Data Extractor runs, the MS/MS spectrum has not yet been interpreted, so it is not known whether the precursor subjected to MS/MS came from a label-containing peptide at all, whether it came from a light or a heavy labeled peptide, or how many labeled residues are present in the peptide. Furthermore, on low resolution instruments, the precursor charge may not yet be known; thus the m/z shift is uncertain as well. Since Data Extractor calculates and stores all the possibilities, Protein/Peptide Summary can later retrieve the appropriate one after interpretation has been completed.

Consequently, the SILAC ratio for a particular peptide is the result of the EIC for the selected precursor mass and the result of the appropriate parallel EIC associated with the mass shift of the SILAC label. This means that a ratio can be calculated when only one member of a pair has been subjected to MS/MS.

In cases where both members of a SILAC pair have been subjected to MS/MS, the ratios shown for the two members will most likely be close but not identical. That is because the parallel EIC calculations are performed in the time domain based upon the particular precursor selected for MS/MS. The difference between the two calculations arises because the two labels (for example, K0 and K8) may not quite co-elute, or because the chromatographic peak detection of the MS/MS-triggering precursors may have different sensitivity. The time tolerance (+/- seconds) set in Data Extractor should allow for the difference in retention times. You will not see this discrepancy in the protein mode, provided that both the K8- and K0-labeled precursor ions were subjected to MS/MS and that these results were of sufficient quality to be interpreted and included in the final results summary. When the peptide ratios are combined to calculate a ratio for the protein, the ratio for the pair is recalculated directly using only the EICs of each precursor, not the parallel EICs obtained using the calculated m/z shift from the precursor.
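
The reasoning above can be sketched in two steps: enumerate the possible partner m/z values for a precursor whose charge, label count, and light/heavy identity are not yet known, then form the light/heavy ratio once the interpretation selects one of them. This is a minimal sketch under stated assumptions, not the Data Extractor algorithm; the function names, parameters, and the default of at most two labels are illustrative.

```python
def candidate_partner_mz(precursor_mz, label_delta_da, charges=(2,), max_labels=2):
    """Enumerate partner m/z values keyed by (charge, label count, direction).

    Direction +1 assumes the selected precursor is light (partner is
    heavier); -1 assumes it is heavy (partner is lighter).
    """
    candidates = {}
    for z in charges:
        for n in range(1, max_labels + 1):
            shift = n * label_delta_da / z  # mass difference seen as m/z
            candidates[(z, n, +1)] = precursor_mz + shift
            candidates[(z, n, -1)] = precursor_mz - shift
    return candidates


def light_heavy_ratio(selected_area, partner_area, selected_is_light):
    """Light/heavy ratio from the selected and parallel EIC peak areas."""
    if selected_is_light:
        light, heavy = selected_area, partner_area
    else:
        light, heavy = partner_area, selected_area
    if heavy == 0:
        return None  # undefined ratio; left to the caller to report
    return light / heavy
```

For a lysine label with a mass difference of about 8.0142 Da, candidate_partner_mz(650.3, 8.0142) yields four candidates for a 2+ precursor: one or two labels, in either direction.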

Calculating a Light/Heavy Ratio for the Corresponding Protein

After the interpreted spectra for peptides have been grouped together because they correspond to a single protein, a SILAC ratio for the protein is calculated by approximately taking the median of the values for the PSMs. The median, standard deviation, and number of values contributing to the median are reported in the Protein modes in Protein/Peptide Summary.

Some details associated with error and redundancy in the calculation of the median are described here.

  1. Since the ratios of lesser abundant proteins will have poorer ion statistics, the standard deviation on the ratios will be larger and thus the ratios less trustworthy. Hence it is valuable to report standard deviations as well as ratios.
  2. Poorer ion statistics may occur even when counting ratios from peptides toward the median for a particular protein. Some examples are peptides derived from non-specific or missed cleavages and partially oxidized methionines, or any peptide that ionizes poorly.
  3. If multiple precursor charge states for a particular peptide are measured, all charge states contribute.
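
The three reported quantities can be illustrated with the Python standard library. The text says the protein ratio is "approximately" the median, so the plain median used here is a simplifying assumption, as are the function name and the choice of the sample standard deviation.

```python
import statistics


def protein_ratio_summary(psm_ratios):
    """Return (median, standard deviation, count) of PSM-level ratios.

    Returns None when no ratios are supplied. A single ratio has no
    spread, so its standard deviation is reported as 0.0.
    """
    values = list(psm_ratios)
    if not values:
        return None
    spread = statistics.stdev(values) if len(values) > 1 else 0.0
    return statistics.median(values), spread, len(values)
```

All charge states of a peptide would simply appear as separate entries in psm_ratios, consistent with item 3 above.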

Filtering out PSMs with poor quality ratios

Protein/Peptide Summary modes that incorporate protein-level information have Protein Quantitation options relevant to precursor ion-based quantitation, including:

Why are some ratios negative?

In Peptide and PSM level reports, some ratios (not log2 transformed) may be listed as negative. This is done to indicate that the ratio was designated as not meeting a quality control threshold. Nonetheless, the magnitude is provided and represents the actual ratio of the measured intensities, to allow you to override the quality control designation. The primary source of this negative designation is an averagine Chi2 ratio of poor quality for the partner precursor ion to the one selected for MS/MS. See the p/i/q/p code in the table below. When the parallel EICs to the selected precursor described above are being calculated in the Data Extractor, an averagine Chi2 ratio for each is calculated, but not exported to the specFeatures.1.tsv file (because there are many of these for each MS/MS spectrum). Instead, a hardcoded threshold of xx is applied and if the value is below it, the EIC intensity is simply marked as negative when written to the SpecFeatures file.

Any ratio containing a negative intensity value can be excluded from contributing to median protein and VM-site level ratios. To change this behavior, open the file millscripts/lsmDEQ.pl and toggle the variable $UNDO_QUALITY_CONTROLLED_LH_RATIO_NEGATION near the top of the file. A value of 0 is understood to exclude these ratios and a value of 1 to retain them; confirm this before relying on it.
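
When working with exported PSM-level ratios, the sign convention above can be unpacked as follows. This is a minimal sketch for downstream handling only; the function names and the keep_flagged parameter are hypothetical and do not correspond to settings in the Spectrum Mill scripts.

```python
def split_flagged_ratio(reported_ratio):
    """Return (magnitude, passed_qc) for a non-log ratio from a report.

    A negative value means the ratio was flagged by quality control;
    its magnitude is still the measured intensity ratio.
    """
    return abs(reported_ratio), reported_ratio >= 0


def ratios_for_median(reported_ratios, keep_flagged=False):
    """Select ratio magnitudes to pass on to a protein-level median."""
    selected = []
    for ratio in reported_ratios:
        magnitude, passed_qc = split_flagged_ratio(ratio)
        if passed_qc or keep_flagged:
            selected.append(magnitude)
    return selected
```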

When ratios are not calculated

If the Data Extractor cannot determine a charge for a peptide (the extracted file ends in 0.pkl), it assumes a charge of +2 for determining the mass shifts for quantitation, and it looks for up to two modification sites in the peptide (e.g., two cysteines at most). When the actual charge is not +2, or when there are more than two modification sites in the peptide, the ratio is not calculated, and is reported as n/c.

Ratios are also reported as n/c when the peptide does not contain the amino acid that reacts with the labeling reagent.

The following codes in PSM/Peptide level reports may be present to indicate why a ratio was not reported:

Code | Meaning
n/c | Not calculated (see above)
d/d/z | Do not divide by zero (the denominator was zero)
o/e | Outlier excluded
r/e | Replicate excluded: the precursor ions of both the numerator and denominator labels are present as PSMs. Only the ratio for one of those PSMs is reported and counted toward the protein or VM-site level quantitation; the other PSM is designated as r/e
p/i/q | The averagine Chi2 ratio of the precursor selected for MS/MS was poor quality
p/i/q/p | The averagine Chi2 ratio of the partner precursor ion to the one selected for MS/MS was poor quality
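
If you parse exported PSM or peptide reports in a script, the codes in the table can be separated from numeric ratios with a small lookup. The sketch below assumes each ratio cell arrives as a text string containing either a number or one of the codes; the names and the shortened meanings are illustrative.

```python
# Codes from the table above, with shortened meanings.
RATIO_CODES = {
    "n/c": "Not calculated",
    "d/d/z": "Do not divide by zero (the denominator was zero)",
    "o/e": "Outlier excluded",
    "r/e": "Replicate excluded",
    "p/i/q": "Selected precursor had a poor-quality averagine Chi2 ratio",
    "p/i/q/p": "Partner precursor had a poor-quality averagine Chi2 ratio",
}


def parse_ratio_cell(cell):
    """Return ("code", meaning) or ("ratio", value) for one report cell."""
    text = cell.strip()
    if text in RATIO_CODES:
        return "code", RATIO_CODES[text]
    return "ratio", float(text)
```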

In Protein Summaries, the Single Label (L,M,H Only) column is new with B.04.01. A single label protein will have all peptide ratios <= 0, which indicates that all of the peptides for the protein had ratios which were found to be one of the codes in the above table.


Quantitation for iTRAQ and TMT

The Spectrum Mill supports quantitation with iTRAQ and TMT labels. The iTRAQ (isobaric tag for relative and absolute quantitation) reagents modify the N-terminus and K, and they allow simultaneous quantitation of up to eight different cell states based on low-mass MS/MS signature ions. The processing and quantitation for iTRAQ-modified peptides is different from that described under SILAC and Other Differential Expression Quantitation.

The Spectrum Mill supports iTRAQ and TMT quantitation for Agilent Q-TOF and ion trap data, generic peak list data (requires the generic Data Extractor), and Thermo Fisher Scientific LCQ and LTQ *.raw data (requires the Thermo Fisher Scientific Data Extractor and requires that during extraction, the software merges MS2 and MS3 scans from the same precursor).

Starting with version A.03.03, the Spectrum Mill supports iTRAQ in two forms:

Starting with version B.04.00, the Spectrum Mill workbench supports iTRAQ4 and iTRAQ8, TMT2 and TMT6:

Starting with version B.05.00, TMT10 quantification is supported.

Data Extractor

The iTRAQ and TMT intensity calculations do not require extracted ion chromatograms from the MS data. The abundances of the iTRAQ and TMT masses are calculated from the MS/MS data. This differs significantly from the behavior for the SILAC-like modifications.

MS/MS Search

With the isotopic labels used for differential expression quantitation, if you select one of the variations that ends in mix, each spectrum is searched multiple times—once for each possible label. The results are merged as a single output. For iTRAQ or TMT, only a single search is necessary. Since the tags are isobaric, all versions of the iTRAQ or TMT reagent are simultaneously fragmented during MS/MS. Further, all iTRAQ and TMT labels produce the same MS/MS fragments for a given parent peptide. Therefore, the iTRAQ or TMT labels do not have to be searched as a mix. However, each set of tags produces different reporter ions in its mass range, and the abundances of these reporter ions are used by the Spectrum Mill for relative quantitation.

Protein/Peptide Summary

To display iTRAQ and TMT results using the Protein/Peptide Summary page:

  1. Under Review Fields, mark the intensities check box next to the iTRAQ/TMT selection list.
  2. From the iTRAQ/TMT selection list, select either iTRAQ4, iTRAQ8, TMT2 or TMT6.
  3. Mark the check box for Ratios control, and select the iTRAQ or TMT mass you wish to use in the denominator for ratio calculations.
  4. If you want to see the iTRAQ or TMT modification in a report that shows peptides, mark check boxes for both N-terminus and Modifications, since the reagents react at both the N-terminus and lysines.
  5. If you want to export your data to Excel so you can apply the correction factors that you received in your certificate of analysis for the iTRAQ reagents, mark the check box for Excel export.
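
The ratio calculation selected in step 3 amounts to dividing each reporter ion intensity by the intensity of the chosen denominator channel. A minimal sketch, assuming the reporter intensities for one MS/MS spectrum are available as a dictionary keyed by nominal reporter mass; the function name and data layout are hypothetical, and no reagent correction factors are applied.

```python
def reporter_ratios(intensities, denominator):
    """Divide each reporter-ion intensity by the denominator channel.

    intensities maps a reporter label (for example "114") to its
    MS/MS intensity. Every ratio is None when the denominator is zero.
    """
    base = intensities[denominator]
    if base == 0:
        return {label: None for label in intensities}
    return {label: value / base for label, value in intensities.items()}
```

For a four-channel iTRAQ spectrum, reporter_ratios({"114": 100.0, "115": 200.0, "116": 50.0, "117": 100.0}, "114") expresses every channel relative to the 114 reporter.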


Quantitation for 15N and 14N/15N mix

Quantitation for the metabolic isotopic labels 15N and 14N/15N mix is different than for the modifications discussed under SILAC and Other Differential Expression Quantitation. For 14N/15N mix, the quantitation begins at the Protein/Peptide Summary level rather than at the Data Extractor level. The quantitation is based on finding matching peptides with the two labels. Both the 14N and the 15N peptides must have been subjected to MS/MS, and the MS/MS Search results must indicate the same sequence and charge. 14N/15N mix and iTRAQ/TMT are the only modifications where differential expression quantitation can begin with the generic Data Extractor. The 14N/15N calculations assume 100% incorporation of 15N.
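
The matching requirement described above, same sequence and same charge identified under both labels, can be sketched as a grouping step. The record layout (dictionaries with "sequence", "charge", "label", and "intensity" keys) and the function name are assumptions for illustration; the sketch is not the Protein/Peptide Summary implementation.

```python
from collections import defaultdict


def pair_n14_n15(psms):
    """Pair PSMs that share sequence and charge across the two labels.

    Returns {(sequence, charge): 14N intensity / 15N intensity}.
    Peptides identified with only one label are left out. If a label
    occurs more than once for a key, the last PSM seen is used.
    """
    grouped = defaultdict(dict)
    for psm in psms:
        key = (psm["sequence"], psm["charge"])
        grouped[key][psm["label"]] = psm["intensity"]
    ratios = {}
    for key, by_label in grouped.items():
        if "14N" in by_label and "15N" in by_label and by_label["15N"] != 0:
            ratios[key] = by_label["14N"] / by_label["15N"]
    return ratios
```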


Quantitation for labels with small mass differences

If you are attempting differential expression quantitation with labels that have relatively small mass differences between the light and the heavy versions (~4 Da), you need to change the Data Extractor setting for Merge scans with same precursor m/z from the default value. Change from the default window of +/-1.4 m/z to a window of +/-1.0 m/z or lower.

When there are small mass differences between labels, a 2+ peptide with both versions of such a label will show two isotopic distributions that are 2 m/z apart (a 4 Da difference divided by a charge of 2). With the default extractor window of +/-1.4 m/z, it is likely that when the software calculates the intensity for a given precursor m/z, some of the isotopic peaks from the precursor's light or heavy counterpart will be contained within the m/z window, which will significantly skew the DEQ results. To avoid errors in the intensity measurement, reduce the window to +/-1.0 m/z or even lower.

When you reduce the window, some MS/MS spectra may not merge, so multiple identifications of the same peptide within the merge time period may occur. However, this is preferable to inaccurate DEQ results.


Spectrum Matcher

Spectrum Matcher provides a means of matching one set of spectra against another in a way that is integrated into the Spectrum Mill file system, thus allowing one to define the sets of spectra according to directory location and validation state. You can also use Spectrum Matcher to compare spectra acquired with different acquisition methods to evaluate any improvements, and to evaluate the quality of spectra using the spectral quality filters.

Thus Spectrum Matcher is a tool for answering the following types of questions:

Identity mode - Are any of the spectra in my query set the same as any in the library set?

Precursor mass shift mode - Are any of the spectra in my query set related to any in the library set?

When seeking to match related spectra, the most common application is to select the same directory for both Query Set and Library Set, with the Library Set being those spectra already identified (Validation State: valid) and the Query Set being unidentified spectra (Validation State: spectrum-not-marked-sequence-not-validated).

Scoring of Matches

The score in Spectrum Matcher is very similar to that in MS/MS Search. Following peak detection, the Spectrum Matcher algorithm attempts to match every peak present in a query set MS/MS spectrum to every peak present in a library set MS/MS spectrum. The scoring system is based on the following general principles:

Spectrum Matcher has two particular scoring attributes:

Precursor Mass Shift

Spectrum Matcher compares MS/MS spectra if their precursor masses fall within the precursor m/z tolerance filter. In Precursor mass shift mode, this filter is a combination of the Precursor mass shift and Precursor m/z tolerance. Do NOT attempt to find mass-shifted matches by using a wider Precursor m/z tolerance. Use a Precursor m/z tolerance consistent with the accuracy to which the precursor mass is measured. The default value for the Precursor mass shift of +/- 81 allows for the largest possible precursor mass shift associated with a mutation among the 20 standard amino acids and phosphorylation. The shift can be set in four different forms, all of which show only homologous matches, thus excluding identity mode matches:

Note that the +/- form compares many more spectra, so it takes longer to run, and the run time is proportional to the magnitude of the Precursor mass shift.
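
The way the tolerance and the shift combine can be sketched as a simple pre-filter on precursor mass. This is an interpretation under stated assumptions, not the Spectrum Matcher algorithm: it models only identity mode and the +/- form of the shift, it treats both quantities as being in the same mass units, and the function and parameter names are hypothetical.

```python
def should_compare(query_mass, library_mass, tolerance, max_shift=None):
    """Decide whether two spectra are compared, based on precursor mass.

    max_shift=None models identity mode: the masses must agree within
    the tolerance. Otherwise the +/- form of precursor mass shift mode
    is modeled: the difference may be as large as max_shift (plus the
    tolerance), and pairs within the identity range are excluded.
    """
    delta = abs(query_mass - library_mass)
    if max_shift is None:
        return delta <= tolerance
    return tolerance < delta <= max_shift + tolerance
```

Keeping the tolerance narrow and expressing the allowed difference through max_shift mirrors the advice above: the tolerance reflects measurement accuracy, while the shift reflects the chemistry being sought.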


To Use the Spectrum Matcher Form

The following options are available on the Spectrum Matcher form. For more details, see Spectrum Matcher.

If during data review you wish to display the Spectrum Matcher form again, click the Match Settings button at the top of the page.

Match Spectra

Search Criteria

The following topics discuss the Search Criteria options.

Search Mode

Matching Tolerances

Spectral Quality Filtering (instrument-specific peak detection used in Extractor)

Certain Spectral Features are calculated by the SM Data Extractor and can be used with multiple downstream SM modules to craft a smaller subset of high-value spectra. For more details, see Spectral Quality Filtering.

MS/MS Peak Detection (over-ride instrument-specific peak detection for matching)

Data Sets

There are two key Data Sets concepts when using the Spectrum Matcher:

Query Set

Library Set


Overview for MS Interactive Processing

From a MALDI-MS experiment that takes less than one minute, one can measure the peptide mass fingerprint of a particular protein by spotting a target with an aliquot of the proteolytic digest of the protein. The technique requires that the peptides detected are all derived from a single protein (perhaps a mixture of up to three proteins).

To search Agilent Q-TOF or TOF .d data, first use MassHunter Qualitative Analysis with Molecular Feature Extraction (MFE) to create a peak list of possible peptides, and paste that list into the Manual PMF search page. Alternatively, the peak list can be a differentially expressed list from Mass Profiler Professional.

MALDI Spectra Preprocessing

MALDI spectra must be supplied as peak list files. Depending on the instrument type, spectral preprocessing steps (centroiding, charge assignment, de-isotoping, etc.) may be done either within the instrument data system or within the Spectrum Mill. Settings in instrument.txt ensure that preprocessing steps are not duplicated between the two. The Spectrum Mill then provides tools to run high-throughput PMF searches, and to review and summarize the results. The figure below illustrates the overall process.

[Figure: Experiment Scheme]

Getting Started for Applied Biosystems MALDI

  1. Acquire some mass spectra.
  2. Calibrate and centroid the spectra, then export peak lists using the instrument data system.
  3. Transfer the exported spectral files to the Spectrum Mill computer in a fit_batch_in directory within the Spectrum Mill file system.
  4. From the Spectrum Mill homepage, go to the PMF Search page.
  5. Set the appropriate parameters and run the searches.
  6. Review the data from the PMF Search page or the PMF Summary page.
For more details on the PMF Search and PMF Summary pages, see MS PMF Search/Summary Help.



To Use the Data Extractor Form (MS)

The following topics describe options available on the Data Extractor form. In general, you should retain the default settings, except for the options highlighted in red text on the form. For more details, see Spectral Preprocessing for MS/MS Data. Note that the options change depending upon the vendor data type to be extracted.

Important note: If you wish to redo a data extraction, mark the check box for Remove all prior results.

Extraction

Data Directories

Modifications

MS (PMF) Spectral Features

Note:  These options are only available when you mark the check box Show only MS (PMF) parameters.


MS PMF Search/Summary

Peptide mass fingerprinting (PMF) is a very popular technique for protein identification. The method encompasses digestion of the protein with site-specific proteases, measurement of the peptide masses by mass spectrometry (MS), and protein identification via a database search. The PMF Search capability within the Spectrum Mill is an advanced, automated database search program for MS-only spectra. 

With PMF Search, the certainty of the identification is primarily a function of the level of mass accuracy. The Agilent TOF delivers low-ppm mass accuracy and can be used with both electrospray and atmospheric pressure MALDI sources, making it an ideal instrument for confident identifications.  For Agilent TOF and Q-TOF .d data, you must use MassHunter Qual with MFE to create a peak list to paste into Manual PMF Search.
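
The principle behind peptide mass fingerprinting can be illustrated with an in-silico tryptic digest and a ppm-based mass comparison. The sketch below is a teaching example only, not the PMF Search algorithm or its scoring: it uses a simplified trypsin rule with no missed cleavages or modifications, and the function names and the 10 ppm default are assumptions. It requires Python 3.7 or later, where re.split accepts patterns that match an empty string.

```python
import re

# Monoisotopic amino acid residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the terminal H and OH


def tryptic_peptides(sequence):
    """Cleave after K or R, except when the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide):
    """Monoisotopic neutral mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_matches(observed_masses, sequence, ppm=10.0):
    """Count observed masses within ppm of any theoretical peptide mass."""
    theoretical = [peptide_mass(p) for p in tryptic_peptides(sequence)]
    return sum(
        any(abs(obs - t) / t * 1e6 <= ppm for t in theoretical)
        for obs in observed_masses
    )
```

Tightening the ppm argument reduces the chance that an observed mass matches a peptide by coincidence, which is why mass accuracy drives the certainty of the identification.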

After using PMF Search, you can summarize and review results with the PMF Summary page.

For more details on the PMF Search and Summary pages, see MS PMF Search/Summary Help.