Agilent Spectrum Mill MS Proteomics Workbench

MS PMF Search/Summary

Introduction

Peptide mass fingerprinting (PMF) is a popular technique for protein identification. The method encompasses digestion of the protein with site-specific proteases, measurement of the peptide masses by mass spectrometry (MS), and protein identification via a database search. The PMF Search capability within the Spectrum Mill workbench is an advanced, automated database search program for MS-only spectra.

PMF Search is limited to the analysis of digests from simple protein mixtures (usually three to five proteins). When there are peptides from too many different proteins in your spectrum, you will not achieve statistically significant scores because the more complex the protein mixture, the more non-matching (noise) peptides for any given protein. Use of the Spectrum Mill workbench's Mixture scoring option helps to overcome this limitation.

With PMF Search, the certainty of the identification is primarily a function of the level of mass accuracy.

PMF Search is an automated program that searches one or more mass-intensity files. The Spectrum Mill workbench also has a Manual PMF Search where the mass list can be typed or copied manually into the search form. You access Manual PMF via the PMF Search page or the Spectrum Mill home page. Use the Manual PMF Search form to paste data from:

Molecular Feature Extraction of Agilent Q-TOF and TOF .d files with MassHunter Qualitative Analysis.
Mass Profiler and Mass Profiler Professional.

To Use the PMF Search Form

The following topics describe options available on the PMF Search form.

Search

Start Search - Click to place the task in the queue for execution. The program determines the order in which it will execute the task to do a PMF search based on the time the task entered the queue, its capacity to execute tasks in parallel, and dependencies. Click this button after you have either loaded the desired parameter file or manually set the parameters. The name of the current parameter file appears in red at the top of the form. Once you have saved a parameter file from this form, you may start the search from a workflow rather than manually with the Start Search button.
Save As - Click to save current search settings in a parameter file.
Load - Click to load a parameter file that contains settings for PMF Search. For default values, select a parameter file from the Defaults folder.
Remove all prior PMF results - Mark this check box to remove prior PMF searches and data summaries for this dataset.
Mixture scoring - Mark this check box to have the software assign probability scores to potential mixtures and to color-code the mass spectrum to show peaks from each component. See Ranking / Scoring of Results.

Data Directory

Click the Select ... button to select a data directory. See Selecting Data Directories.

Search Parameters

Database: Select a database. See Databases.
Species: Choose a species if you want to narrow the search possibilities and to accelerate searches. Please see the list of species definitions that ship with the software, as some definitions do not encompass all possible members. Retain the default of All to search the entire database. Be aware that because of inconsistencies in the way species information is organized in different databases, the Spectrum Mill workbench cannot read about 10% of the species information in NCBInr, and cannot read any of the species information in trEMBL. See Species Filtering.
MW of protein: Type the molecular weight range for your protein, or mark the All check box to search the entire database. See Intact Protein MW Filtering.
Digest: Select the enzyme used for the proteolytic digestion. See Enzyme Specificity / Missed Cleavages.
Maximum # missed cleavages: Set the maximum number of missed enzymatic cleavages. See Enzyme Specificity / Missed Cleavages.
Protein pI: Type the pI range for your protein, or mark the All check box to search the entire database. See Intact Protein pI Filtering.
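
The effect of the Digest and Maximum # missed cleavages settings can be illustrated with a minimal sketch. It assumes a trypsin-like rule (cleave after K or R, but not before P); this rule and the function name digest are illustrative assumptions, not the Spectrum Mill implementation.

```python
import re


def digest(sequence, max_missed_cleavages=0):
    """Return peptides from a trypsin-like digest (assumed rule: cut after K/R, not before P)."""
    # Cut positions sit just after each K or R that is not followed by P.
    cuts = [m.end() for m in re.finditer(r"[KR](?!P)", sequence)]
    bounds = [0] + [c for c in cuts if c < len(sequence)] + [len(sequence)]
    peptides = []
    for i in range(len(bounds) - 1):
        # Join up to (max_missed_cleavages + 1) consecutive fragments.
        last = min(i + 2 + max_missed_cleavages, len(bounds))
        for j in range(i + 1, last):
            peptides.append(sequence[bounds[i]:bounds[j]])
    return peptides


print(digest("AAKGGRCC", 0))
print(digest("AAKGGRCC", 1))
```

Each additional permitted missed cleavage adds longer peptides to the theoretical list, which increases the number of candidate masses and therefore the chance of random matches.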

Modifications

Click the Choose... button to select modifications appropriate for your sample. See Choosing Modifications.

Search Criteria

The next several topics describe options available in the Search Criteria section of the PMF Search form.

Matching Tolerances

Instrument: Select the instrument on which you acquired the data. Note that the software automatically sets a Peptide mass tolerance appropriate for the instrument you select. You can manually change this setting.
Minimum matched peptides: Set the minimum threshold for a match. For a particular protein in the database to generate a hit, it must match at least a minimum number of masses from the input data. This minimum number is set by Minimum matched peptides. The Minimum matched peptides setting is most useful when the total number of peaks to match is less than 30 peaks. The value should be set in relation to one's expectation of the quality of noise removal and selection of ¹²C isotope peaks representing peptide masses.
Peptide mass tolerance: Type a new value if you wish to manually set peptide mass tolerance and override the automatic software setting for your instrument. See Mass Tolerance.
Masses are: See Mass Type.
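
As a rough sketch of how Peptide mass tolerance and Minimum matched peptides interact, the following counts the input masses that fall within a ppm tolerance of a protein's theoretical peptide masses. The function names and the sample masses are hypothetical; the actual matching code in PMF Search is not described here.

```python
def count_matches(observed, theoretical, tol_ppm):
    """Count observed masses within tol_ppm of any theoretical peptide mass."""
    matched = 0
    for obs in observed:
        if any(abs(obs - theo) / theo * 1e6 <= tol_ppm for theo in theoretical):
            matched += 1
    return matched


def is_hit(observed, theoretical, tol_ppm, min_matched):
    """A protein is reported only if it matches at least min_matched masses."""
    return count_matches(observed, theoretical, tol_ppm) >= min_matched


observed = [1000.01, 1500.20, 2000.00]      # hypothetical input masses
theoretical = [1000.00, 1500.00, 2000.05]   # hypothetical digest masses
print(count_matches(observed, theoretical, tol_ppm=50))
```

In this example two of the three masses match at 50 ppm, so the protein would be reported with a minimum of 2 but not with a minimum of 3.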

Data Files

Mass list files - Modify this list if you want to process only a subset of the files in the data directory. Wildcards (*) are supported. To see the names of your mass list files, look in the fit_batch_in subdirectory where you placed these files.
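
To preview which files a wildcard pattern would select, a sketch such as the following can help. It assumes shell-style wildcard semantics as implemented by Python's fnmatch module; the file names are hypothetical, and the workbench's own pattern matching may differ in detail.

```python
from fnmatch import fnmatchcase


def select_files(filenames, patterns):
    """Return the file names that match at least one wildcard pattern."""
    return [name for name in filenames
            if any(fnmatchcase(name, pattern) for pattern in patterns)]


# Hypothetical contents of a fit_batch_in subdirectory.
files = ["spotA1.txt", "spotA2.txt", "spotB1.txt"]
print(select_files(files, ["spotA*"]))
```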

Spectral Features

Override instrument defaults - Mark this check box to use Max Peaks and Min S/N that are different than the instrument defaults. When you mark the check box, the next two options are displayed:
Max Peaks: Set the maximum number of peaks to search from each mass list file. Note that these peaks will be searched only if they meet the signal-to-noise criteria set with the next option. The Max Peaks defaults are 100 for Agilent AP-MALDI TOF and 500 for Agilent ESI TOF. For more complex samples, or for better sequence coverage, set Max peaks to a larger number.
Min S/N: Set the minimum signal-to-noise for which a peak should be searched. Peaks of lower signal-to-noise will be ignored. The default is 2 for Agilent AP-MALDI TOF and Agilent ESI TOF. If you have low-level data, you can decrease this number to have additional spectral peaks searched. The tradeoff is that some of the added peaks may be matrix ions.
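
The combined effect of Max Peaks and Min S/N can be sketched as a two-step filter. The assumption here, not stated in this help topic, is that peaks passing the signal-to-noise threshold are then limited to the most intense ones; the function name and peak values are hypothetical.

```python
def select_peaks(peaks, max_peaks, min_sn):
    """peaks: list of (mass, intensity, signal_to_noise) tuples."""
    # Step 1: ignore peaks below the signal-to-noise threshold.
    kept = [p for p in peaks if p[2] >= min_sn]
    # Step 2 (assumption): keep the most intense peaks, up to max_peaks.
    kept.sort(key=lambda p: p[1], reverse=True)
    # Return the survivors in order of increasing mass.
    return sorted(kept[:max_peaks])


peaks = [(1000.5, 50, 1.5), (1200.6, 400, 12), (1500.7, 900, 30), (1800.8, 120, 3)]
print(select_peaks(peaks, max_peaks=2, min_sn=2))
```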

Contaminant Masses

File: Select the file that includes your contaminant masses. To set up new contaminant mass list files, see To Add/Change Contaminant Mass List Files. If you wish to retain the contaminants to use them to recalibrate the mass axis, select None.

Recalibrate Data

Force data recalibration - Mark this check box to recalibrate mass data using a slope and intercept that you have determined external to the Spectrum Mill workbench, or from PMF Summary. See Data Recalibration. This check box is not applicable for data from the Agilent TOF. With the Agilent TOF, the data is automatically recalibrated before it is written to disk, so there is no need to recalibrate it again. When you mark the check box, the next two options are displayed:
New slope: Type a slope (from PMF Summary) to recalibrate the mass data.
New intercept: Type an intercept (from PMF Summary) to recalibrate the mass data.

Report Details

Max reported hits: Sets the maximum number of hits that will be returned from the search.
Detailed hits: Sets the number of hits for which detailed results will be returned. If you wish to print the results, it is sometimes useful to set Max reported hits to a larger number (the number of hits you want on a short list of results) and Detailed hits to a smaller number.
Report MOWSE scores - Mark this check box if you want the MOWSE scores to be reported. See Ranking / Scoring of Results.
Pfactor: Type a value for the contribution to the MOWSE score of peptides with missed cleavages. Peptides with no missed cleavages contribute 1.0. If in doubt, retain the default of 0.4. See Ranking / Scoring of Results.

To Use the PMF Summary Form

The following topics describe options available on the PMF Summary form.

Summarize Results for Review

Summarize - Click to summarize results. Click this button after you have either loaded the desired parameter file or manually set the parameters. The name of the current parameter file appears in red at the top of the form. Once you have saved a parameter file from this form, you may do the summary from a workflow rather than manually with the Summarize button.
Save As - Click to save current summary settings in a parameter file.
Load - Click to load a parameter file that contains settings for PMF Summary.
Queue request - Mark this check box if you want the data summary to occur after completion of a queued PMF Search for the selected data directories. That is, mark the check box if you want to do interactive automation. If you want to see summary results immediately, clear the check box. You also mark this check box if you want to preserve the output in HTML format for later access.

Data Directory

Select ... - Click this button to select a data directory. See Selecting Data Directories.
Search result files - Modify this list if you want to summarize only a subset of the files in the data directory. Wildcards (*) are supported. To see the names of your search result files, look in the results_msfit subdirectory that shares the same parent data directory as the fit_batch_in subdirectory where you placed your peak list files.

Sorting

Filter hits by score: Sets a filter to display only hits meeting certain probability score criteria. Note that for MS PMF scores, smaller numbers correlate with smaller probabilities of chance occurrence, and greater likelihood of valid results. This is the opposite of MS/MS Search scores, where larger numbers represent better scores.
Sort by: Determines how results are sorted. Score refers to probability score rather than MOWSE score.
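
Because smaller PMF probability scores are better, a score filter keeps hits at or below a cutoff and a sort by score lists the smallest values first. A minimal sketch, with hypothetical field names and values:

```python
def filter_and_sort(hits, max_score):
    """Keep hits whose probability score is at or below max_score, best (smallest) first."""
    kept = [h for h in hits if h["score"] <= max_score]
    return sorted(kept, key=lambda h: h["score"])


hits = [
    {"file": "spotA1", "score": 1e-3},
    {"file": "spotA2", "score": 1e-8},
    {"file": "spotB1", "score": 0.5},
]
print(filter_and_sort(hits, max_score=1e-2))
```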

Review Fields

Filename - Spectral file name
Score - Probability score. See Ranking / Scoring of Results.
MOWSE score - See Ranking / Scoring of Results.
Mass error - Mark this check box to report the mean and standard deviation of the mass error.
Recalibration - Mark this check box to calculate a slope and intercept for the mass recalibration. See Data Recalibration.
Excel export - Mark to export results to Excel or to upload to a LIMS system. For the latter, first make sure your system administrator has configured the upload. See Exporting to Excel or Uploading to LIMS.
Protein MW - Molecular weight of protein representing top database hit
Protein pI - Isoelectric point (pH at which the net charge of the protein is zero) of protein representing top database hit
Species - Species for protein representing top database hit
Accession # - Database accession number
Protein name - Protein name for top database hit

To Use the Manual PMF Search Form

The following topics describe options available on the Manual PMF Search form. Use the Manual PMF Search form to paste data from a Molecular Feature Extraction of Q-TOF and TOF .d files with MassHunter Qualitative Analysis. See the Familiarization Guide and the Application Guide for instructions on how to use MassHunter Qualitative Analysis and Manual PMF Search to process MS-only Q-TOF and TOF .d files.

To return to default settings on the Manual PMF Search page, click the Spectrum Mill button to go to the Spectrum Mill home page. Then click the link on the home page to go back to the Manual PMF Search page.

Search

Start Search - Click to initiate search. Click this button after you have set all parameters. This button also saves your search settings until you close your web browser.
Sample ID: Type your sample name or other identifier.
Maximum reported hits: Set to the maximum number of hits you want for each search.
MOWSE scores - Mark this check box if you want the MOWSE scores to be reported. See Ranking / Scoring of Results.
Pfactor: Type a value for the contribution to the MOWSE score of peptides with missed cleavages. Peptides with no missed cleavages contribute 1.0. If in doubt, retain the default of 0.4. See Ranking / Scoring of Results.
Mixture scoring - Mark this check box to have the software assign probability scores to potential mixtures and to color-code the mass spectrum to show peaks from each component. See Ranking / Scoring of Results.

Search Parameters

Min. # peptides required to match: Set the minimum threshold for a match. For a particular protein in the database to generate a hit, it must match at least a minimum number of masses from the input data. This minimum number is set by Min. # peptides required to match. The Min. # peptides required to match setting is most useful when the total number of peaks to match is less than 30 peaks. The value should be set in relation to one's expectation of the quality of noise removal and selection of ¹²C isotope peaks representing peptide masses.
Database: Select a database. See Databases.
DNA frame translation: See Frame Translation in DNA databases. Note that this setting appears only if you select a DNA database.
Species: Choose a species if you want to narrow the search possibilities and to accelerate searches. Please see the list of species definitions that ship with the software, as some definitions do not encompass all possible members. Retain the default of All to search the entire database. Be aware that because of inconsistencies in the way species information is organized in different databases, the Spectrum Mill workbench cannot read about 10% of the species information in NCBInr, and cannot read any of the species information in trEMBL. See Species Filtering.
MW of protein: Type the molecular weight range for your protein, or mark the All check box to search the entire database. See Intact Protein MW Filtering.
Protein pI: Type the pI for your protein, or mark the All check box to search the entire database. See Intact Protein pI Filtering.
Digest: Select the enzyme used for the proteolytic digestion. See Enzyme Specificity / Missed Cleavages.
Maximum # of missed cleavages: Set the maximum number of missed enzymatic cleavages. See Enzyme Specificity / Missed Cleavages.

Modifications

Click the Choose... button to select modifications appropriate for your sample. See Choosing Modifications.

Peptide Masses

Mass tolerance: Select this option to manually input peptide mass tolerance. If you set a tolerance greater than 100 ppm, your scores will be zero since results cannot be guaranteed with that level of mass accuracy. See Mass Tolerance.
Masses are: See Mass Type.
MH⁺/M - If your masses represent protonated species, select MH⁺. If your masses represent neutral species, select M.
Mass (m/z) - Type an m/z value. If you have protonated species, you do not need to subtract protons (H⁺). Charging agents other than H⁺ are not allowed.
Charge (z) - Type a charge if necessary. If you selected MH⁺ above, then the software assumes a charge of +1. If you selected M above, the software assumes a charge of 0. If you type a charge into the box, it overrides the MH⁺/M settings.
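
The relationship between the MH⁺/M selection, the typed m/z value, and an optional charge can be sketched as follows. The conversion uses the standard relationship M = z × (m/z) − z × (proton mass); the function name and the way the options are passed are assumptions for illustration only.

```python
PROTON_MASS = 1.007276  # Da, mass of H+


def neutral_mass(mz, mass_type="MH+", charge=None):
    """Convert a typed m/z value to a neutral mass M."""
    if charge is None:
        # No charge typed: MH+ implies +1, M implies 0.
        charge = 1 if mass_type == "MH+" else 0
    if charge == 0:
        return mz  # already a neutral mass
    # A typed charge overrides the MH+/M setting.
    return mz * charge - charge * PROTON_MASS


print(neutral_mass(1001.007276))            # singly protonated
print(neutral_mass(501.007276, charge=2))   # doubly protonated
print(neutral_mass(1000.0, mass_type="M"))  # neutral
```

All three inputs in this example describe the same neutral mass of approximately 1000 Da.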

Mass Tolerance

The mass tolerances should be set to be consistent with the mass accuracy of the instrument used to generate the data. For TOF instruments, it is generally a better idea to use units of ppm or % rather than Da, as these mass spectrometers typically have an error associated with mass measurement that is mass-dependent and thus cannot be uniformly expressed in Da. For ion trap instruments, it is better to use units of Da.
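
To see why a relative unit suits a mass-dependent error, it helps to convert between ppm and Da at different masses. The helper names below are illustrative.

```python
def ppm_to_da(mass, tol_ppm):
    """Absolute tolerance window (Da) for a relative tolerance in ppm."""
    return mass * tol_ppm / 1e6


def da_to_ppm(mass, tol_da):
    """Relative tolerance (ppm) for an absolute tolerance in Da."""
    return tol_da / mass * 1e6


for mass in (1000.0, 2000.0, 3000.0):
    print(mass, ppm_to_da(mass, 50), da_to_ppm(mass, 0.1))
```

A fixed 50 ppm tolerance corresponds to 0.05 Da at 1000 Da but 0.15 Da at 3000 Da, whereas a fixed 0.1 Da tolerance is relatively much looser at low mass than at high mass.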

If you set the mass tolerance too tight, you may miss peptides, but if you set it too loose, you may generate false positives.

Measuring masses as accurately as possible is the single most important thing one can do to achieve the highest certainty of protein identification in a peptide mass fingerprinting experiment.

Instrument

For MS-only data, when you select an instrument, you trigger the software to configure extraction and search parameters that are designed particularly for the instrument type. You can edit the instrument parameters or add new instruments by editing the files: msparams_mill/instrument.txt and millhtml/SM_js/instrument.js.

E:\SpectrumMill\msparams_mill\instrument.txt
E:\SpectrumMill\millhtml\SM_js\instrument.js

If you add an instrument, be sure to set the parameters in instrument.txt in a way that is appropriate for the data you export from that instrument. For example, if deisotoping is accomplished by the instrument data system, set bypassDeIsotoping = 1 in instrument.txt, to avoid repetition of deisotoping in the Spectrum Mill workbench.

Examples of supported MS-only instrument configurations are shown below. Users should ordinarily NOT change these values. For additional supported instruments, see E:\SpectrumMill\msparams_mill\instrument.txt.

Feature	Description	MALDI-TOF	MALDI-TOF-AGILENT	MALDI-ION-TRAP	MALDI-QSTAR	ESI-TOF-AGILENT
instrument_charges_certain	see below*	yes	if determined	if determined	if determined	yes
minSignalNoiseRatio	threshold for peak detection for MS/MS data	30	8	5	8	15
minSignalNoiseRatioPMF	threshold for peak detection for MS (PMF) data		2	15	15	15
peakLimitCount	max # of detected peaks to use for interpretation	25	100	25	25	500
peakBinningTolerance	used for centroiding in Data Extractor - expected peak width in amu	0.6	0.2		0.6	0.2
bypassDeIsotoping	skip de-isotoping	yes	no	no	no	no
bypassSignalNoiseThresholding	skip S/N thresholding	yes	no	no	no	no

*instrument_charges_certain:

no - 0 (default) - for 'raw' data - SM will centroid, merge peaks, and do charge assignment
yes - 1 - for centroided data - isotope peaks are required to assign charge
if determined - 2 - as (1), but isotope peaks are NOT required to assign charge

Ranking / Scoring of Results

Probability Scoring

The probability score represents the probability that the protein match occurs by chance. Thus a score of 0.5 means that the match has a 50% chance of occurring randomly. A score of 1e-6 means that the match has a one-in-a-million chance of occurring randomly. The probability distribution is calculated after counting the occurrences in the database of each mass submitted within the specified mass tolerance. Consequently, the score for the same set of masses submitted will change if the mass tolerance, the enzyme, the number of missed cleavages, or the database is changed. Also note that modified amino acids such as met-sulfoxide do not contribute to the score.

There are two types of probability scores. In the PMF Summary report, the column labeled Static Probability Score lists the probability score calculated based on the Peptide mass tolerance chosen in the PMF Search form. The column labeled Dynamic Probability Score lists the probability score calculated based on the actual peptide mass deviations determined from the data. Thus, if the actual data is more accurate than the mass tolerance set in PMF Search, then the Dynamic Probability Score will be better (smaller number) than the Static Probability Score.

Mixture scoring

When you invoke mixture scoring within PMF Search, the software assigns probability scores to potential mixtures and color-codes the mass spectrum to show peaks from each component. If the sample represents a mixture and this check box is marked, then the scoring method is optimized for mixtures. Here is an example:

Say you have a three-component mixture with a total of 100 mass spectral peaks. Component A matches 30 peaks, component B matches 30 peaks, component C matches 30 peaks, and 10 peaks are noise. If you do not mark the check box for Mixture scoring, then the score for component A is penalized for the fact that it represents only 30 peaks out of the 100 total. When you do mark the check box for Mixture scoring, then the score for component A is penalized only by the 10 noise peaks, because it now represents 30 peaks out of the 40 peaks remaining after the software subtracts the peaks attributed to components B and C. In the results, the scores for each individual protein in the mixture are the same as without mixture scoring, but the score for the overall mixture does take into account the scenario described above.
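
The peak counts in this example can be reproduced with a few lines of arithmetic. This only illustrates the peak bookkeeping; it does not compute a probability score, and the function name is hypothetical.

```python
total_peaks = 100
matched = {"A": 30, "B": 30, "C": 30}


def peak_pool(component, mixture_scoring):
    """Return (peaks the component is judged against, peaks it leaves unexplained)."""
    if mixture_scoring:
        # Peaks attributed to the other components are subtracted first.
        others = sum(n for name, n in matched.items() if name != component)
        pool = total_peaks - others
    else:
        pool = total_peaks
    return pool, pool - matched[component]


print(peak_pool("A", mixture_scoring=False))
print(peak_pool("A", mixture_scoring=True))
```

Without mixture scoring, component A is judged against all 100 peaks and leaves 70 unexplained. With mixture scoring, it is judged against 40 peaks and leaves only the 10 noise peaks unexplained.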

Another advantage of mixture scoring is that there are bonus points for the peaks being mutually exclusive (e.g., no overlap of peaks among components).

The mixture scoring feature is especially useful if you have a mixture where one protein dominates the spectrum, because it avoids the situation where the top hits are the dominant protein and various precursors of the dominant protein. When you enable mixture scoring, you are more likely to identify the less abundant protein components in a mixture.

If you mark this box and the sample is actually a single component, the search will take a very long time. In this case, you may want to stop the search, clear the Mixture scoring check box, and restart the search.

If you invoke mixture scoring, it is more convenient to review the results directly from the PMF Search page than from the PMF Summary page. The links from the results section of the PMF Search page take you directly to the mixture results without requiring additional clicks.

The default parameters for mixture scoring (e.g., the total number of components permitted in the mixture) are set in msparams_mill\mixParamsMsfit.txt.

MOWSE Score

The MOWSE score reported by PMF Search is based on the scoring system described in Pappin et al, Current Biology, 1993, Vol 3, No 6, pp 327-. As PMF Search offers several options not available in the initial version of MOWSE, several modifications have had to be made.

After the species and molecular weight pre-searches, the remaining proteins undergo theoretical digestion. The resulting peptides are then placed in bins based on their molecular weight and the intact molecular weight of undigested protein they originated from. There are eleven intact molecular weight bins. Under 100000 Da, there are 10 bins of width 10000 Da. The other bin contains all the proteins over 100000 Da. There are thirty peptide molecular weight bins of width 100 amu between 0-3000 Da. Peptides above 3000 Da are not binned. Peptides with no missed cleavages contribute 1.0 to the bin total, whereas peptides containing missed cleavages contribute pfactor (a user supplied parameter).
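
The binning described above can be sketched as follows. The handling of values that fall exactly on a bin boundary (100000 Da or 3000 Da) is an assumption, as are the function names; the bin counts, widths, and pfactor weighting follow the description.

```python
def protein_bin(protein_mw):
    """Index 0-9 for the 10000 Da intervals under 100000 Da, 10 for larger proteins."""
    return min(int(protein_mw // 10000), 10)


def peptide_bin(peptide_mw):
    """Index 0-29 for the 100 Da intervals up to 3000 Da; None means not binned."""
    if peptide_mw >= 3000:
        return None
    return int(peptide_mw // 100)


def bin_totals(peptides, pfactor=0.4):
    """peptides: iterable of (protein_mw, peptide_mw, missed_cleavages)."""
    totals = [[0.0] * 30 for _ in range(11)]
    for protein_mw, peptide_mw, missed in peptides:
        col = peptide_bin(peptide_mw)
        if col is None:
            continue
        # Fully cleaved peptides count 1.0; missed-cleavage peptides count pfactor.
        totals[protein_bin(protein_mw)][col] += 1.0 if missed == 0 else pfactor
    return totals


totals = bin_totals([(45000, 1234.5, 0), (45000, 1250.0, 1), (45000, 3500.0, 0)])
print(totals[4][12])
```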

Bin frequency values are then calculated by dividing the bin totals by the sum of the bin totals for each 10000 Da protein interval. The bin frequency values are then normalized to the largest bin frequency value to yield frequency values between 0 and 1.
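
A minimal sketch of this normalization for one 10000 Da protein interval is shown below. It assumes that the largest value is taken within the same protein interval; the text does not say whether the maximum is per interval or global, so treat that choice as an assumption.

```python
def normalized_frequencies(interval_totals):
    """Convert the bin totals of one protein interval to frequencies scaled to a maximum of 1."""
    interval_sum = sum(interval_totals)
    if interval_sum == 0:
        return [0.0] * len(interval_totals)
    freqs = [t / interval_sum for t in interval_totals]
    largest = max(freqs)
    return [f / largest for f in freqs]


print(normalized_frequencies([2.0, 4.0, 2.0]))
```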

Masses in the theoretical digestion which match masses in the data set are divided into scoring matches and non-scoring matches. Scoring matches include unmodified peptides and acrylamide-modified Cys and N-terminal Gln to pyroGlu and oxidation of Met in the presence of the unmodified peptide. Non-scoring matches include pyroGlu and oxidation of Met in the absence of the unmodified peptide, acetylated N-termini, phosphorylation of S, T and Y, and single amino acid substitutions. Unmatched masses are ignored. The score for each matching mass is assigned as the appropriate normalized distribution frequency value. In the case of multiple matching masses, the scores are multiplied together. The final product score is inverted and normalized to an average protein molecular weight of 50 kD.
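
The final step can be sketched with the formula from the original MOWSE publication, score = 50000 / (product of matched frequencies × protein MW). Whether PMF Search applies exactly this form after its modifications is an assumption; the function name is illustrative.

```python
def mowse_score(matched_frequencies, protein_mw):
    """Invert the product of normalized frequencies and scale to a 50 kD protein."""
    product = 1.0
    for freq in matched_frequencies:
        product *= freq
    return 50000.0 / (product * protein_mw)


# Two scoring matches with normalized frequencies 0.5 and 0.25 for a 50 kD protein.
print(mowse_score([0.5, 0.25], 50000.0))
```

Matches in rarely populated bins have small frequency values, so they shrink the product and raise the score more than matches in heavily populated bins.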

For databases with < 1000 entries (not enough entries to generate valid scoring statistics)

PMF Search scoring systems are turned off and a simple ranking system is used. The results are sorted so that if multiple database entries are matched, more likely sequences are listed higher in the list. All database entries matching the input data and parameters are ranked on the following basis:

Database entries with the highest number of matched masses are ranked higher.
Among equivalent matches (those with the same rank) the results are sorted in order of increasing index number.

Note that the last sort does NOT imply a BETTER ranking, even though one match will be listed higher than another, but is merely intended to provide some organization to the listing.
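
The two ranking rules amount to a sort on two keys, sketched here with hypothetical entries:

```python
def rank_entries(entries):
    """entries: list of (index_number, matched_mass_count) tuples."""
    # More matched masses first; ties listed by increasing index number.
    return sorted(entries, key=lambda e: (-e[1], e[0]))


print(rank_entries([(12, 3), (7, 5), (3, 3)]))
```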

Data Recalibration

The data recalibration feature of PMF Search/PMF Summary is useful if you have data files that were acquired without the instrument having been properly calibrated. This feature recalibrates the experimental mass data based on the peptide masses of the top-scoring database match. To use this feature:

Run PMF Search

Under Matching Tolerances, set a wide Peptide mass tolerance (one that is larger than the calibration error).
If you need to use contaminant masses to recalibrate the data, then for Contaminant Masses, set File to None.

Run PMF Summary

Under Review Fields, click the check box for Recalibration.
In the results, check that each sample has a Fitted Slope and a Fitted Intercept. These represent a linear fit of the experimental data to the masses of the top-scoring database match.
Record representative values for the Fitted Slope and Fitted Intercept.

Rerun PMF Search

Under Search, mark the check box to Remove all prior PMF results.
Under Recalibrate Data, mark the check box to Force data recalibration.
Type the representative slope and intercept.

Examine the new results in PMF Search.

Check the results to confirm that the probability scores are now better (lower numbers). The mean mass error should also be lower, although the standard deviation of the mass error may not be significantly changed.
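
The slope and intercept can be understood as an ordinary least-squares line through pairs of observed and theoretical masses. The sketch below assumes the convention corrected mass = slope × observed mass + intercept; confirm the direction of the fit against your PMF Summary output before relying on it. The function names and masses are hypothetical.

```python
def fit_line(observed, reference):
    """Least-squares slope and intercept mapping observed masses onto reference masses."""
    n = len(observed)
    mean_x = sum(observed) / n
    mean_y = sum(reference) / n
    sxx = sum((x - mean_x) ** 2 for x in observed)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(observed, reference))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def recalibrate(masses, slope, intercept):
    """Apply the linear correction to a list of masses."""
    return [slope * m + intercept for m in masses]


observed = [1000.0, 2000.0, 3000.0]        # hypothetical measured masses
reference = [1000.15, 2000.25, 3000.35]    # hypothetical matched peptide masses
slope, intercept = fit_line(observed, reference)
print(slope, intercept)
print(recalibrate(observed, slope, intercept))
```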

Caveats:

In PMF Summary, the slope and intercept are determined for each sample. However, when you type these into PMF Search, they apply to the entire sample set. So, unless you plan to process one sample at a time, this feature corrects for instrument calibration problems, but not for sample-specific calibration issues (e.g., MALDI plate surface irregularities).

If the top-scoring match (used to calculate the slope and intercept) is wrong, then the new calibration is wrong. For a set of samples, it is worthwhile to examine a number of results to ensure that you select a valid slope and intercept to type into PMF Search.
