Agilent Spectrum Mill MS Proteomics Workbench
MS PMF Search/Summary
Introduction
Peptide mass fingerprinting (PMF) is a popular technique for protein identification. The method encompasses
digestion of the protein with site-specific proteases, measurement of the peptide masses by mass spectrometry
(MS), and protein identification via a database search. The PMF Search capability within the Spectrum Mill workbench
is an advanced, automated database search program for MS-only spectra.
PMF Search is limited to the analysis of digests from simple protein mixtures (usually three to five proteins).
When there are peptides from too many different proteins in your spectrum, you will not achieve statistically
significant scores because the more complex the protein mixture, the more non-matching (noise) peptides for any
given protein. Use of the Spectrum Mill workbench's Mixture scoring option helps to overcome this
limitation.
With PMF Search, the certainty of the identification is primarily a function of the level of mass accuracy.
PMF Search is an automated program that searches one or more mass-intensity files. The Spectrum Mill workbench
also has a Manual PMF Search where the mass list can be typed or copied manually into the search form. You
access Manual PMF via the PMF Search page or the Spectrum Mill home page. Use the Manual PMF Search form to
paste data from:
- Molecular Feature Extraction of Agilent Q-TOF and TOF .d files with MassHunter Qualitative
Analysis.
- Mass Profiler and Mass Profiler Professional.
To Use the PMF Search Form
The following topics describe options available on the PMF Search form.
Search
- Start Search - Click to place the task in the queue for execution. The program determines the order in
which it will execute the task to do a PMF search based on the time the task entered the queue, its capacity
to execute tasks in parallel, and dependencies. Click this button after you have either loaded the desired
parameter file or manually set the parameters. The name of the current parameter file appears in red at
the top of the form. Once you have saved a parameter file from this form, you may start the search from a
workflow rather than manually with the Start Search button.
- Save As - Click to save current search settings in a parameter file.
- Load - Click to load a parameter file that contains settings for PMF Search.
For default values, select a parameter file from the Defaults
folder.
- Remove all prior PMF results - Mark this check box to remove prior PMF searches and data summaries
for this dataset.
- Mixture scoring - Mark this check box to have the software assign probability scores to potential
mixtures and to color-code the mass spectrum to show peaks from each component. See
Ranking / Scoring of Results.
Data Directory
Search Parameters
- Database: Select a database. See Databases.
- Species: Choose a species if you want to narrow the search possibilities
and to accelerate searches. Please see the list of species
definitions that ship with the software, as some definitions do not encompass all possible members.
Retain the default of All to search the entire database. Be aware that because of inconsistencies
in the way species information is organized in different databases, the Spectrum Mill workbench cannot
read about 10% of the species information in NCBInr, and cannot read any of the species information in
trEMBL. See Species Filtering.
- MW of protein: Type the molecular weight range for your protein, or mark the All
check box to search the entire database. See Intact Protein MW Filtering.
- Digest: Select the enzyme used for the proteolytic digestion. See
Enzyme Specificity / Missed Cleavages.
- Maximum # missed cleavages: Set the maximum number of missed enzymatic cleavages.
See Enzyme Specificity / Missed Cleavages.
- Protein pI: Type the pI range for your protein, or mark the All check box to
search the entire database. See Intact Protein pI Filtering.
Modifications
Search Criteria
The next several topics describe options available in the Search Criteria section of the PMF Search
form.
Matching Tolerances
- Instrument: Select the instrument on which you acquired the data. Note that the
software automatically sets a Peptide mass tolerance appropriate for the instrument you select.
You can manually change this setting.
- Minimum matched peptides: Set the minimum threshold for a match. For a particular protein
in the database to generate a hit, it must match at least a minimum number of masses from the input data.
This minimum number is set by Minimum matched peptides. The Minimum matched peptides setting
is most useful when the total number of peaks to match is less than 30 peaks. The value should be set
in relation to one's expectation of the quality of noise removal and selection of 12C isotope
peaks representing peptide masses.
- Peptide mass tolerance: Type a new value if you wish to manually set peptide mass tolerance
and override the automatic software setting for your instrument. See
Mass Tolerance.
- Masses are: See Mass Type.
Data Files
- Mass list files - Modify this list if you want to process only a subset of the files in the
data directory. Wildcards (*) are supported. To see the names of your mass list files, look in the
fit_batch_in subdirectory where you placed these files.
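As a rough illustration of how a wildcard pattern narrows a file list, the sketch below uses Python's standard fnmatch module. The file names are hypothetical, and the workbench's own wildcard matching may differ in details such as case sensitivity.

```python
from fnmatch import fnmatchcase

def select_files(names, patterns):
    """Return the names that match at least one wildcard pattern, in input order."""
    return [n for n in names if any(fnmatchcase(n, p) for p in patterns)]

# Hypothetical mass list file names, for illustration only.
files = ["plateA_01.pkl", "plateA_02.pkl", "plateB_01.pkl"]
selected = select_files(files, ["plateA_*"])
print(selected)
```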
Spectral Features
- Override instrument defaults - Mark this check box to use Max Peaks and Min S/N
that are different than the instrument defaults. When you mark the check
box, the next two options are displayed:
- Max Peaks: Set the maximum number of peaks to search from each mass list file. Note
that these peaks will be searched only if they meet the signal-to-noise criteria set with the next option.
The Max Peaks defaults are 100 for Agilent AP-MALDI TOF and 500 for Agilent ESI TOF. For
more complex samples, or for better sequence coverage, set Max peaks to a larger number.
- Min S/N: Set the minimum signal-to-noise for which a peak should be searched.
Peaks of lower signal-to-noise will be ignored. The default is 2 for Agilent AP-MALDI TOF and Agilent
ESI TOF. If you have low-level data, you can decrease this number to have additional spectral peaks
searched. The tradeoff is that some of the added peaks may be matrix ions.
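The interaction between Max Peaks and Min S/N can be pictured with the minimal sketch below. It assumes that peaks failing the signal-to-noise threshold are dropped first and that the most intense survivors are then kept up to the peak limit. The ranking by intensity is an assumption made for illustration; the software's actual selection rule is not described here.

```python
def select_peaks(peaks, min_sn=2.0, max_peaks=100):
    """Select peaks to search.

    peaks: list of (mz, intensity, signal_to_noise) tuples.
    Peaks below min_sn are ignored; at most max_peaks of the remaining
    peaks are kept (most intense first, an assumption), returned in m/z order.
    """
    kept = [p for p in peaks if p[2] >= min_sn]
    kept.sort(key=lambda p: p[1], reverse=True)
    return sorted(kept[:max_peaks], key=lambda p: p[0])

# Hypothetical peak list, for illustration only.
peaks = [
    (842.51, 1200.0, 25.0),
    (1045.56, 300.0, 1.5),   # below Min S/N of 2, ignored
    (1479.80, 800.0, 12.0),
    (2211.10, 150.0, 3.0),
]
searched = select_peaks(peaks, min_sn=2.0, max_peaks=2)
```

Lowering min_sn admits more low-level peaks, and raising max_peaks lets more of them reach the search, which matches the tradeoff described above.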
Contaminant Masses
- File: Select the file that includes your contaminant masses. To set up new contaminant
mass list files, see To Add/Change Contaminant Mass List Files.
If you wish to retain the contaminants to use them to recalibrate the mass axis, select None.
Recalibrate Data
- Force data recalibration - Mark this check box to recalibrate mass data using a slope and
intercept that you have determined external to the Spectrum Mill workbench, or from PMF Summary. See
Data Recalibration. This check box is not applicable for data from the
Agilent TOF. With the Agilent TOF, the data is automatically recalibrated before it
is written to disk, so there is no need to recalibrate it again. When you mark the check box, the
next two options are displayed:
- New slope: Type a slope (from PMF Summary) to recalibrate the mass data.
- New intercept: Type an intercept (from PMF Summary) to recalibrate the mass data.
Report Details
- Max reported hits: Sets the maximum number of hits that will be returned from the search.
- Detailed hits: Sets the number of hits for which detailed results will be returned.
If you wish to print the results, it is sometimes useful to set Max reported hits to a larger number (the
number of hits you want on a short list of results) and Detailed hits to a smaller number.
- Report MOWSE scores - Mark this check box if you want the MOWSE scores to be reported.
See Ranking / Scoring of Results.
- Pfactor: Type a value for the contribution to the MOWSE score of peptides with missed
cleavages. Peptides with no missed cleavages contribute 1.0. If in doubt, retain the default of 0.4.
See Ranking / Scoring of Results.
To Use the PMF Summary Form
The following topics describe options available on the PMF Summary form.
Summarize Results for Review
- Summarize - Click to summarize results. Click this button after you have either loaded
the desired parameter file or manually set the parameters. The name of the current parameter file appears
in red at the top of the form. Once you have saved a parameter file from this form, you may do the summary
from a workflow rather than manually with the Summarize
button.
- Save As - Click to save current summary settings in a parameter file.
- Load - Click to load a parameter file that contains settings for PMF Summary.
- Queue request - Mark this check box if you want the data summary to occur after completion
of a queued PMF Search for the selected data directories. That is, mark the check box if you want to
do interactive automation. If you want to see summary
results immediately, clear the check box. You also mark this check box if you want to preserve the output
in HTML format for later access.
Data Directory
- Select ... - Click this button to select a data directory. See
Selecting Data Directories.
- Search result files - Modify this list if you want to summarize only a subset of the files
in the data directory. Wildcards (*) are supported. To see the names of your search result files,
look in the results_msfit subdirectory that shares the same parent data directory as the fit_batch_in
subdirectory where you placed your peak list files.
Sorting
- Filter hits by score: Sets a filter to display only hits meeting certain probability
score criteria. Note that for MS PMF scores, smaller numbers correlate with smaller probabilities
of chance occurrence, and greater likelihood of valid results. This is the opposite of MS/MS Search
scores, where larger numbers represent better scores.
- Sort by: Determines how results are sorted. Score refers to probability score rather
than MOWSE score.
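Because smaller probability scores are better, a score filter keeps hits at or below a cutoff and a sort by score lists the smallest values first. The sketch below illustrates that direction only. The hit records and field names are hypothetical.

```python
def filter_and_sort(hits, max_score):
    """Keep hits whose probability score is at or below max_score, best (smallest) first."""
    kept = [h for h in hits if h["score"] <= max_score]
    return sorted(kept, key=lambda h: h["score"])

# Hypothetical hits, for illustration only.
hits = [
    {"name": "protein A", "score": 1e-3},
    {"name": "protein B", "score": 1e-8},
    {"name": "protein C", "score": 0.5},
]
best = filter_and_sort(hits, max_score=1e-2)
```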
Review Fields
- Filename - Spectral file name
- Score - Probability score. See Ranking / Scoring of Results.
- MOWSE score - See Ranking / Scoring of Results.
- Mass error - Mark this check box to report the mean and standard deviation of the mass error.
- Recalibration - Mark this check box to calculate a slope and intercept for the mass
recalibration. See Data Recalibration.
- Excel export - Mark to export results to Excel or to upload to a LIMS system. For the
latter, first make sure your system administrator has configured the upload. See
Exporting to Excel or Uploading to LIMS.
- Protein MW - Molecular weight of protein representing top database hit
- Protein pI - Isoelectric point (pH at which the net charge of the protein is zero) of protein
representing top database hit
- Species - Species for protein representing top database hit
- Accession # - Database accession number
- Protein name - Protein name for top database hit
To Use the Manual PMF Search Form
The following topics describe options available on the Manual PMF Search form. Use the Manual PMF Search form to
paste data from a Molecular Feature Extraction of Q-TOF and TOF .d
files with MassHunter Qualitative Analysis. See the Familiarization Guide and the
Application Guide for instructions on how to use MassHunter Qualitative Analysis and Manual PMF Search to process MS-only Q-TOF and TOF .d files.
To return to default settings on the Manual PMF Search page, click the Spectrum Mill button to
go to the Spectrum Mill home page. Then click the link on the home page to go back to the Manual PMF
Search page.
Search
- Start Search - Click to initiate search. Click this button after you have set all parameters.
This button also saves your search settings until you close your web browser.
- Sample ID: Type your sample name or other identifier.
- Maximum reported hits: Set to the maximum number of hits you want for each search.
- MOWSE scores - Mark this check box if you want the MOWSE scores to be reported. See
Ranking / Scoring of Results.
- Pfactor: Type a value for the contribution to the MOWSE score of peptides with missed cleavages.
Peptides with no missed cleavages contribute 1.0. If in doubt, retain the default of 0.4. See
Ranking / Scoring of Results.
- Mixture scoring - Mark this check box to have the software assign probability scores to potential
mixtures and to color-code the mass spectrum to show peaks from each component. See
Ranking / Scoring of Results.
Search Parameters
- Min. # peptides required to match: Set the minimum threshold for a match. For a particular
protein in the database to generate a hit, it must match at least a minimum number of masses from the
input data. This minimum number is set by Min. # peptides required to match. The Min. # peptides
required to match setting is most useful when the total number of peaks to match is less than 30
peaks. The value should be set in relation to one's expectation of the quality of noise removal and selection
of 12C isotope peaks representing peptide masses.
- Database: Select a database. See Databases.
- DNA frame translation: See Frame Translation in DNA databases.
Note that this setting appears only if you select a DNA database.
- Species: Choose a species if you want to narrow the search possibilities and to accelerate
searches. Please see the list of species definitions
that ship with the software, as some definitions do not encompass all possible members. Retain
the default of All to search the entire database. Be aware that because of inconsistencies
in the way species information is organized in different databases, the Spectrum Mill workbench cannot
read about 10% of the species information in NCBInr, and cannot read any of the species information in
trEMBL. See Species Filtering.
- MW of protein: Type the molecular weight range for your protein, or mark the All
check box to search the entire database. See Intact Protein MW Filtering.
- Protein pI: Type the pI for your protein, or mark the All check box to search
the entire database. See Intact Protein pI Filtering.
- Digest: Select the enzyme used for the proteolytic digestion. See
Enzyme Specificity / Missed Cleavages.
- Maximum # of missed cleavages: Set the maximum number of missed enzymatic cleavages.
See Enzyme Specificity / Missed Cleavages.
Modifications
Peptide Masses
- Mass tolerance: Select this option to manually input peptide mass tolerance. If
you set a tolerance greater than 100 ppm, your scores will be zero since results cannot be guaranteed
with that level of mass accuracy. See Mass Tolerance.
- Masses are: See Mass Type.
- MH+/M - If your masses represent protonated species, select MH+.
If your masses represent neutral species, select M.
- Mass (m/z) - Type an m/z value. If you have protonated species, you do not need
to subtract protons (H+). Charging agents other than H+ are not
allowed.
- Charge (z) - Type a charge if necessary. If you selected MH+ above, then
the software assumes a charge of +1. If you selected M above, the software assumes a charge of
0. If you type a charge into the box, it overrides the MH+/M settings.
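The relationship between the m/z value, the charge, and the MH+/M setting can be sketched as follows, assuming protons are the only charging agent (as required above). The function names are hypothetical, and the sketch shows only the standard mass arithmetic, not the workbench's internal handling.

```python
PROTON_MASS = 1.007276  # mass of a proton in Da

def neutral_mass(mz, z):
    """Neutral mass M for an ion observed at m/z with z protons attached.

    A charge of 0 means the value is already a neutral mass (the M setting).
    """
    if z == 0:
        return mz
    return z * mz - z * PROTON_MASS

def mh_plus(mz, z):
    """Singly protonated mass (MH+) for the same species."""
    return neutral_mass(mz, z) + PROTON_MASS

# A doubly charged ion at m/z 501.0 corresponds to a neutral mass near 999.9854 Da.
m = neutral_mass(501.0, 2)
```

With a charge of 1 the MH+ value equals the typed m/z, which is why no proton subtraction is needed when MH+ is selected.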
Mass Tolerance
The mass tolerances should be set to be consistent with the mass accuracy of the instrument used to
generate the data. For TOF instruments, it is generally a better idea to use units of ppm or % rather than Da,
as these mass spectrometers typically have an error associated with mass measurement that is mass-dependent
and thus cannot be uniformly expressed in Da. For ion trap instruments, it is better to use units of Da.
If you set the mass tolerance too tight, you may miss peptides, but if you set it too loose, you may generate
false positives.
Measuring masses as accurately as possible is the single most important thing one can do to achieve the highest
certainty of protein identification in a peptide mass fingerprinting experiment.
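To see why a relative unit suits a mass-dependent error, the minimal sketch below converts a tolerance given in ppm, %, or Da into an absolute window in Da at a given mass. The function names are hypothetical and the matching rule (absolute difference within the window) is an assumption for illustration.

```python
def tolerance_da(mass, tol, unit):
    """Convert a tolerance to an absolute window in Da at the given mass."""
    if unit == "ppm":
        return mass * tol * 1e-6
    if unit == "%":
        return mass * tol / 100.0
    if unit == "Da":
        return tol
    raise ValueError(f"unknown tolerance unit: {unit}")

def matches(observed, theoretical, tol, unit):
    """True if the observed mass lies within the tolerance of the theoretical mass."""
    return abs(observed - theoretical) <= tolerance_da(theoretical, tol, unit)

# A 50 ppm tolerance is about 0.05 Da at 1000 Da but about 0.15 Da at 3000 Da.
window_low = tolerance_da(1000.0, 50, "ppm")
window_high = tolerance_da(3000.0, 50, "ppm")
```

A fixed Da window, by contrast, is the same width at every mass, which fits instruments whose error does not scale with mass.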
Instrument
For MS-only data, when you select an instrument, you trigger the software to configure extraction and search
parameters that are designed particularly for the instrument type. You can edit the instrument parameters or add
new instruments by editing the files: msparams_mill/instrument.txt and millhtml/SM_js/instrument.js.
E:\SpectrumMill\msparams_mill\instrument.txt
E:\SpectrumMill\millhtml\SM_js\instrument.js
If you add an instrument, be sure to set the parameters in instrument.txt in a way that is appropriate
for the data you export from that instrument. For example, if deisotoping is accomplished by the instrument data
system, set bypassDeIsotoping = 1 in instrument.txt, to avoid repetition of deisotoping
in the Spectrum Mill workbench.
Examples of supported MS-only instrument configurations are shown below. Users should ordinarily NOT
change these values. For additional supported instruments, see E:\SpectrumMill\msparams_mill\instrument.txt.
Feature | Description | MALDI-TOF | MALDI-TOF-AGILENT | MALDI-ION-TRAP | MALDI-QSTAR | ESI-TOF-AGILENT
instrument charges certain | see below* | yes | if determined | if determined | if determined | yes
minSignalNoiseRatio | threshold for peak detection for MS/MS data | 30 | 8 | 5 | 8 | 15
minSignalNoiseRatioPMF | threshold for peak detection for MS (PMF) data | | 2 | 15 | 15 | 15
peakLimitCount | max # of detected peaks to use for interpretation | 25 | 100 | 25 | 25 | 500
peakBinningTolerance | used for centroiding in Data Extractor - expected peak width in amu | 0.6 | 0.2 | | 0.6 | 0.2
bypassDeIsotoping | skip de-isotoping | yes | no | no | no | no
bypassSignalNoiseThresholding | skip S/N thresholding | yes | no | no | no | no
*instrument_charges_certain:
- no - 0 (default) - for 'raw' data - SM will centroid, merge peaks, and do charge assignment
- yes - 1 - for centroided data - isotope peaks are required to assign charge
- if determined - 2 - as (1), but isotope peaks are NOT required to assign charge
Ranking / Scoring of Results
Probability Scoring
A probability score represents the likelihood that the protein match occurred by chance. Thus a score of 0.5 means
that match has a 50% chance of occurring randomly. A score of 1e-6 means that match has a one-in-a-million chance
of occurring randomly. The probability distribution is calculated after counting the occurrences in the database
of each mass submitted within the specified mass tolerance. Consequently, the score for the same set of masses
submitted will change if the mass tolerance, the enzyme, the number of missed cleavages, or the database is changed.
Also note that modified amino acids such as met-sulfoxide do not contribute to the score.
There are two types of probability scores. In the PMF Summary report, the column labeled Static Probability
Score lists the probability score calculated based on the Peptide mass tolerance chosen in the PMF
Search form. The column labeled Dynamic Probability Score lists the probability score calculated
based on the actual peptide mass deviations determined from the data. Thus, if the actual data is more accurate
than the mass tolerance set in PMF Search, then the Dynamic Probability Score will be better (smaller number)
than the Static Probability Score.
Mixture scoring
When you invoke mixture scoring within PMF Search, the software assigns probability scores to potential mixtures
and color-codes the mass spectrum to show peaks from each component. If the sample represents a mixture and this
check box is marked, then the scoring method is optimized for mixtures. Here is an example:
Say you have a three-component mixture with a total of 100 mass spectral peaks. Component A matches 30
peaks, component B matches 30 peaks, component C matches 30 peaks, and 10 peaks are noise. If you do not
mark the check box for Mixture scoring, then the score for component A is penalized for the fact that it
represents only 30 peaks out of the 100 total. When you do mark the check box for Mixture scoring,
then the score for component A is penalized only by the 10 noise peaks, because it now represents 30 peaks out
of the 40 peaks remaining after the software subtracts the peaks attributed to components B and C. In the
results, the scores for each individual protein in the mixture are the same as without mixture scoring,
but the score for the overall mixture does take into account the scenario described above.
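The peak-count arithmetic in this example can be written out as a short sketch. It reproduces only the fractions quoted in the example (30 of 100 versus 30 of 40). It is not the probability score calculation itself, which is not described here, and the function name is hypothetical.

```python
def matched_fraction(own_peaks, total_peaks, peaks_from_other_components=0):
    """Fraction of the considered peaks that a component explains.

    With mixture scoring, peaks attributed to the other components are
    subtracted from the total before the component is judged.
    """
    return own_peaks / (total_peaks - peaks_from_other_components)

# Component A in the three-component example: 100 peaks total, 30 each for A, B, and C.
without_mixture = matched_fraction(30, 100)       # 30 of 100 peaks
with_mixture = matched_fraction(30, 100, 30 + 30) # 30 of the 40 peaks left after B and C
```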
Another advantage of mixture scoring is that there are bonus points for the peaks being mutually exclusive
(e.g., no overlap of peaks among components).
The mixture scoring feature is especially useful if you have a mixture where one protein dominates the spectrum,
because it avoids the situation where the top hits are the dominant protein and various precursors of the dominant
protein. When you enable mixture scoring, you are more likely to identify the less abundant protein
components in a mixture.
If you mark this box and the sample is actually a single component, the search will take a very long time.
In this case, you may want to stop the search, clear the Mixture scoring
check box, and restart the search.
If you invoke mixture scoring, it is more convenient to review the results directly from the PMF Search page
than from the PMF Summary page. The links from the results section of the PMF Search page take you directly
to the mixture results without requiring additional clicks.
The default parameters for mixture scoring (e.g., the total number of components permitted in the mixture)
are set in msparams_mill\mixParamsMsfit.txt.
MOWSE Score
The MOWSE score reported by PMF Search is based on the scoring system described in Pappin et al, Current Biology,
1993, Vol 3, No 6, pp 327-. As PMF Search offers several options not available in the initial version of MOWSE,
several modifications have had to be made.
After the species and molecular weight pre-searches, the remaining proteins undergo theoretical digestion.
The resulting peptides are then placed in bins based on their molecular weight and the intact molecular weight
of undigested protein they originated from. There are eleven intact molecular weight bins. Under 100000 Da, there
are 10 bins of width 10000 Da. The other bin contains all the proteins over 100000 Da. There are thirty
peptide molecular weight bins of width 100 amu between 0-3000 Da. Peptides above 3000 Da are not binned.
Peptides with no missed cleavages contribute 1.0 to the bin total, whereas peptides containing missed cleavages
contribute pfactor (a user supplied parameter).
Bin frequency values are then calculated by dividing the bin totals by the sum of the bin totals for each 10000
Da protein interval. The bin frequency values are then normalized to the largest bin frequency value to yield
frequency values between 0 and 1.
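The binning and normalization described above can be sketched as follows. This is a minimal illustration, not the software's implementation. It assumes that the normalization to the largest bin frequency is done within each intact-protein interval, and the function names and input tuples are hypothetical.

```python
from collections import defaultdict

def protein_bin(protein_mw):
    """Intact-protein bin: ten 10000 Da bins below 100000 Da, plus one bin for the rest."""
    return min(int(protein_mw // 10000), 10)

def peptide_bin(peptide_mw):
    """Peptide bin of width 100 between 0 and 3000 Da; None means not binned."""
    if not 0 <= peptide_mw < 3000:
        return None
    return int(peptide_mw // 100)

def frequency_table(peptides, pfactor=0.4):
    """peptides: iterable of (protein_mw, peptide_mw, missed_cleavages)."""
    totals = defaultdict(float)
    for protein_mw, peptide_mw, missed in peptides:
        pb = peptide_bin(peptide_mw)
        if pb is None:
            continue
        # Peptides with no missed cleavages count 1.0; the others count pfactor.
        totals[(protein_bin(protein_mw), pb)] += 1.0 if missed == 0 else pfactor

    # Divide each bin total by the sum of totals in its protein interval.
    interval_sum = defaultdict(float)
    for (prot, _), value in totals.items():
        interval_sum[prot] += value
    freq = {key: value / interval_sum[key[0]] for key, value in totals.items()}

    # Normalize to the largest frequency in the interval (assumption: per interval).
    interval_max = defaultdict(float)
    for (prot, _), value in freq.items():
        interval_max[prot] = max(interval_max[prot], value)
    return {key: value / interval_max[key[0]] for key, value in freq.items()}

# Hypothetical theoretical peptides, for illustration only.
table = frequency_table([
    (25000, 850.4, 0),
    (25000, 860.1, 0),
    (25000, 1520.7, 1),
    (120000, 850.4, 0),
    (25000, 3200.0, 0),  # above 3000 Da, not binned
])
```

The resulting values lie between 0 and 1, with the most populated peptide bin in each protein interval equal to 1.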
Masses in the theoretical digestion which match masses in the data set are divided into scoring matches and
non-scoring matches. Scoring matches include unmodified peptides and acrylamide-modified Cys and N-terminal Gln
to pyroGlu and oxidation of Met in the presence of the unmodified peptide. Non-scoring matches include pyroGlu
and oxidation of Met in the absence of the unmodified peptide, acetylated N-termini, phosphorylation of S, T and
Y, and single amino acid substitutions. Unmatched masses are ignored. The score for each matching mass is assigned
as the appropriate normalized distribution frequency value. In the case of multiple matching masses, the scores
are multiplied together. The final product score is inverted and normalized to an average protein molecular weight
of 50 kD.
For databases with < 1000 entries (not enough entries to generate valid scoring statistics)
PMF Search scoring systems are turned off and a simple ranking system is used. The results are sorted so that
if multiple database entries are matched, more likely sequences are listed higher in the list. All database entries
matching the input data and parameters are ranked on the following basis:
- Database entries with the highest number of matched masses are ranked higher.
- Among equivalent matches (those with the same rank) the results are sorted in order of increasing index
number.
Note that the last sort does NOT imply a BETTER ranking, even though one match will be listed
higher than another, but is merely intended to provide some organization to the listing.
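The simple ranking used for small databases amounts to a two-key sort, sketched below: more matched masses first, then increasing index number among ties. The hit records and field names are hypothetical.

```python
def rank_hits(hits):
    """Sort by number of matched masses (descending), then by index number (ascending)."""
    return sorted(hits, key=lambda h: (-h["matched"], h["index"]))

# Hypothetical database hits, for illustration only.
hits = [
    {"index": 412, "matched": 7},
    {"index": 35, "matched": 9},
    {"index": 120, "matched": 7},
]
ranked = rank_hits(hits)
```

The two entries with seven matched masses have the same rank; index 120 is listed before index 412 only to keep the listing orderly.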
Data Recalibration
The data recalibration feature of PMF Search/PMF Summary is useful if you have data files that were acquired
without the instrument having been properly calibrated. This feature recalibrates the experimental mass
data based on the peptide masses of the top-scoring database match. To use this feature:
- Run PMF Search.
  - Under Matching Tolerances, set a wide Peptide mass tolerance (one that is larger
    than the calibration error).
  - If you need to use contaminant masses to recalibrate the data, then for Contaminant Masses,
    set File to None.
- Run PMF Summary.
  - Under Review Fields, click the check box for Recalibration.
  - In the results, check that each sample has a Fitted Slope and a Fitted Intercept.
    These represent a linear fit of the experimental data to the masses of the top-scoring database
    match.
  - Record representative values for the Fitted Slope and Fitted Intercept.
- Rerun PMF Search.
  - Under Search, mark the check box to Remove all prior PMF results.
  - Under Recalibrate Data, mark the check box to Force data recalibration.
  - Type the representative slope and intercept.
- Examine the new results in PMF Search.
  - Check the results to confirm that the probability scores are now better (lower numbers).
    The mean mass error should also be lower, although the standard deviation of the mass error may
    not be significantly changed.
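The linear fit behind the Fitted Slope and Fitted Intercept can be illustrated with an ordinary least-squares sketch. It assumes the correction takes the form corrected mass = slope x observed mass + intercept. That direction of the fit, the function names, and the example masses are assumptions for illustration, not a description of the software's internals.

```python
def fit_line(observed, theoretical):
    """Least-squares slope and intercept mapping observed masses onto theoretical masses."""
    n = len(observed)
    mean_x = sum(observed) / n
    mean_y = sum(theoretical) / n
    sxx = sum((x - mean_x) ** 2 for x in observed)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(observed, theoretical))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def recalibrate(masses, slope, intercept):
    """Apply the linear correction to a list of masses."""
    return [slope * m + intercept for m in masses]

# Hypothetical matched peptides: the observed masses carry a linear calibration error.
theoretical = [1000.0, 1500.0, 2000.0, 2500.0]
observed = [(t - 0.1) / 1.0002 for t in theoretical]

slope, intercept = fit_line(observed, theoretical)
corrected = recalibrate(observed, slope, intercept)
```

Because the fit relies on the matched peptides of the top-scoring hit, a wrong top hit yields a wrong slope and intercept.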
Caveats:
In PMF Summary, the slope and intercept are determined for each sample. However, when you type these
into PMF Search, they apply to the entire sample set. So, unless you plan to process one sample at a time,
this feature corrects for instrument calibration problems, but not for sample-specific calibration issues (e.g.,
MALDI plate surface irregularities).
If the top-scoring match (used to calculate the slope and intercept) is wrong, then the new calibration is
wrong. For a set of samples, it is worthwhile to examine a number of results to ensure that you select a
valid slope and intercept to type into PMF Search.