Agilent Spectrum Mill MS Proteomics Workbench

Protein Databases



Introduction

 

Protein Databases (formerly known as FA-Index) was developed for five main reasons:

  1. To enable an internal means for the Spectrum Mill workbench programs to store an index number when a hit is recorded during a search, then later use that number to retrieve that database entry for output/report generation purposes. This cuts down the memory requirements for program execution.
  2. To provide indices which can be used to accelerate searches that are pre-filtered by intact protein MW and/or species.
  3. To aid the Spectrum Mill workbench programs in addressing some of the hindrances inherent in FASTA comment line format heterogeneity.
  4. To allow users to create subset databases based on either a Species/Protein Molecular Weight pre-filter or the results of a previous search. Searches performed on these smaller databases are often much faster than searches performed on complete databases.
  5. To allow users to create databases containing user-defined proteins.

Note! As of B.06.00, databases that contain DNA sequences can no longer be searched. You must convert the DNA sequences to protein sequences, and name the file with either a PA or PN prefix.


Updating Databases

Important: Database names must include the appropriate prefix, as described in step 2 below! The database prefix must reflect the format, which is usually dictated by the site from which the database was downloaded. For example, if you download a SwissProt database from the NCBI site, then the format is NCBI rather than SwissProt.

If you choose to append a date or revision to your database filenames, you will need to keep the older databases so that summarizing data searched with the older databases will report proper database accessions. For large databases (such as the full NCBI database), this may not be practical.

  1. Obtain FASTA-formatted sequence database files for the seqdb directory:
    D:\seqdb.

    Locations to download FASTA formatted database files via ftp:

    Note that the URLs for these databases may change over time, so you may need to search for the current URL. You may also check the Agilent Software Status Bulletin to see if there is an update for this file. To look for an update, go to the Spectrum Mill page, click the Support link, and then click Status Bulletins.

    Note that the full NCBI database is now very large, so you may want to download a species-specific database.

    Note: As of September 2016, the NCBI FASTA files (downloaded as nr.gz) are in a new format that specifies GenBank accessions instead of "gi" accession numbers. You must use either "NCBIgb" or "gb" as the filename prefix for Spectrum Mill to properly parse the FASTA header information. NCBI FASTA files in the older "gi" format must be specified with "NCBInr" or "nr" as the prefix.

  2. Uncompress and rename the database files according to the format: NCBInr, NCBIgb, SwissProt, UniProt, TrEMBL, IPI, or Genpept. The prefix is a necessary part of the name, because it allows the software to differentiate the specific dialect of the FASTA format comment line used in each database. The database names do not need the .fas or .fasta extension, so you can delete it. You can rename the IPI databases using an IPI.SPECIES format.

    Note that the database prefix must reflect the format, which is usually dictated by the site from which the database was downloaded. For example, if you download a SwissProt database from the NCBI site, then the format is NCBI, not SwissProt. 

    For use with the Spectrum Mill workbench, you should keep a stable filename for updates rather than append a different suffix for each periodic update. The database filename is stored with search results to enable subsequent retrieval of the protein sequence; hence, review of older data will be hindered by obsolete database filenames.

    You may also use the corresponding lowercase prefixes nr, gb, swp, trembl, ipi, or gen for a second database that is of the same format as the uppercase one, but for which you want to link from the accession number to a different URL for annotation display. For more details, please read the file naming section.

    RefSeq databases (protein fasta format) may be processed as NCBInr databases. Simply prefix the RefSeq database (protein fasta format) with NCBInr. The accession numbers will be the NCBI GenBank numbers, and url links will be to the NCBI site. From there, you can link to the appropriate RefSeq information.

    If you want to include common contaminants in your IPI database, you can add NCBI-type entries to IPI databases. The database NCBInr.contaminants is installed from the Example Databases folder when you install the Spectrum Mill workbench. You can append the NCBInr.contaminants database (using a text editor) to your IPI database, and re-index the updated file. (It is worthwhile to add the contaminants; for example they allow finding trypsin in IPI_HUMAN.)

  3. Create indices in the seqdb directory for each database, by running Protein Databases from the directory immediately above seqdb or by using the web browser version of Protein Databases. The indices are necessary for efficient memory mapping during searches, particularly for preliminary filtering by species and protein molecular weight. You must create new indices after each update of a database, even if the update is done by only adding new entries to the end of the original file.

    If you want to use proprietary databases or update databases regularly, fully read this manual, particularly the generic database file naming sections.

    If you wish to use the command line version of Protein Databases rather than the browser version, see the section about the command line version.

    Protein Databases creates a file with a .usp suffix (for example, NCBInr.usp) in which it writes the comment line of each FASTA entry whose species it cannot parse. Viewing this file can help troubleshoot FASTA format problems for anyone using proprietary databases.

  4. Update the database list on the HTML forms.


Background on the FASTA Format

The FASTA format for sequence databases was originally developed by Pearson for use with the FASTA program. Today it is probably the most widely-used standard format, primarily because its brevity results in the smallest possible file size for sequences.

An example of the format is shown below:

>sp|P28190|AA1R_BOVIN ADENOSINE A1 RECEPTOR.
MPPSISAFQAAYIGIEVLIALVSVPGNVLVIWAVKVNQALRDATFCFIVSLAVADVAVGA
LVIPLAILINIGPRTYFHTCLKVACPVLILTQSSILALLAMAVDRYLRVKIPLRYKTVVT
PRRAVVAITGCWILSFVVGLTPMFGWNNLSAVERDWLANGSVGEPVIECQFEKVISMEYM
VYFNFFVWVLPPLLLMVLIYMEVFYLIRKQLSKKVSASSGDPQKYYGKELKIAKSLALIL
FLFALSWLPLHILNCITLFCPSCHMPRILIYIAIFLSHGNSAMNPIVYAFRIQKFRVTFL
KIWNDHFRCQPAPPIDEDAPAERPDD

The standard format is not very specific because it says only that there is a single comment line per entry which must begin with the ">" character and all subsequent lines for an entry contain sequence. However, there are many "standards" as to the arrangement of fields and/or delimiting of fields in the comment line. Often the comment line is used to describe basic information like entry name, accession number (or other unique identifier), and the species or organism from which the sequence was obtained.
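
As a rough illustration of how little the format itself guarantees, the following Python sketch splits FASTA text into (comment line, sequence) pairs using only the two rules above. The function name read_fasta is hypothetical and is not part of the Spectrum Mill workbench; the example sequence is abbreviated to its first and last lines.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (comment_line, sequence) for each entry in FASTA-formatted text."""
    comment = None
    chunks = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new comment line closes the previous entry, if there was one.
            if comment is not None:
                yield comment, "".join(chunks)
            comment = line[1:]
            chunks = []
        elif comment is not None and line.strip():
            # Every other non-empty line belongs to the current sequence.
            chunks.append(line.strip())
    if comment is not None:
        yield comment, "".join(chunks)


# Abbreviated copy of the example entry shown above.
example = """>sp|P28190|AA1R_BOVIN ADENOSINE A1 RECEPTOR.
MPPSISAFQAAYIGIEVLIALVSVPGNVLVIWAVKVNQALRDATFCFIVSLAVADVAVGA
KIWNDHFRCQPAPPIDEDAPAERPDD
"""
entries = list(read_fasta(example.splitlines()))
```

Note that the sketch returns the comment line as one unparsed string; splitting it into accession, name, and species depends on the database dialect.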

The FASTA format was chosen for use with the Spectrum Mill workbench primarily because of its universality, its brevity, and the expected ease with which database files could be shared on the same computer with other programs for sequence analysis.

The Protein Databases program creates several indices which are much smaller files than the FASTA database file. These indices aid the Spectrum Mill workbench programs in addressing some of the hindrances inherent in the FASTA comment line format heterogeneity.
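
The general idea of an index number that is stored during a search and used later to retrieve the entry can be sketched as follows. This is only an illustration: the actual layout of the index files is not described in this document, the function names are hypothetical, and the 1-based numbering is an assumption.

```python
import io
from typing import BinaryIO, List


def build_offsets(db: BinaryIO) -> List[int]:
    """Return the byte offset of every '>' comment line, in file order."""
    offsets = []
    db.seek(0)
    while True:
        pos = db.tell()
        line = db.readline()
        if not line:
            break
        if line.startswith(b">"):
            offsets.append(pos)
    return offsets


def fetch_entry(db: BinaryIO, offsets: List[int], index_number: int) -> str:
    """Return the raw text of one entry; index numbers start at 1 (assumed)."""
    start = offsets[index_number - 1]
    end = offsets[index_number] if index_number < len(offsets) else None
    db.seek(start)
    data = db.read() if end is None else db.read(end - start)
    return data.decode("ascii")


# Tiny in-memory stand-in for a database file (made-up entries).
demo = io.BytesIO(b">1|first|Mouse\nMKV\nLLA\n>2|second|Rat\nGGH\n")
offsets = build_offsets(demo)
second = fetch_entry(demo, offsets, 2)
```

A search program that keeps only the integer 2 can later recover the full entry without holding any sequences in memory, which is the saving described in the Introduction.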


Change in NCBI FASTA Header Format

In September 2016, NCBI changed the FASTA header format to supply only the gb (GenBank) accession. The former gi accession is no longer used.

Newly downloaded databases in the new format are supported and the gb accession is used by the Spectrum Mill workbench for those databases.

For the Spectrum Mill workbench to properly recognize the format, these new databases require either an NCBIgb or gb prefix instead of the NCBInr prefix.

Existing databases (NCBInr) are still supported. GenBank accession numbers (when present) can be reported in Protein/Peptide Summary by creating a Category file for the database.
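
If you are unsure which prefix a downloaded NCBI file needs, a quick look at its first comment line is usually enough. The helper below is a hypothetical sketch, not a Spectrum Mill tool; it assumes the file really is an NCBI download and only distinguishes the older "gi" style from everything else.

```python
def suggest_ncbi_prefix(first_comment_line: str) -> str:
    """Suggest 'NCBInr' for gi-style headers and 'NCBIgb' otherwise."""
    header = first_comment_line.lstrip(">").strip()
    # Older NCBI headers begin with "gi|"; newer ones begin with the accession.
    return "NCBInr" if header.startswith("gi|") else "NCBIgb"


old_style = suggest_ncbi_prefix(">gi|304881 (L07596) alaS [Escherichia coli]")
new_style = suggest_ncbi_prefix(">CAA56020.1 B-127 protein [Saccharomyces cerevisiae]")
```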


The Spectrum Mill Workbench File Naming Conventions for Public FASTA Databases

Often the comment line in a FASTA database is used to describe basic information like entry name, accession number (or other unique identifier), and the species or organism from which the sequence was obtained. However, this information is NOT consistently organized into fields in the comment lines of different FASTA databases, though within a specific database it is usually consistent.

The way the Spectrum Mill workbench programs "know" which dialect of FASTA to "speak" with a particular database is via the filename. Acceptable filename prefixes are shown below in bold and the associated comment line format described.

NCBIgb

Here is an example: >CAA56020.1 B-127 protein [Saccharomyces cerevisiae]

The accession number (GenBank ID) is CAA56020.1, the protein name is "B-127 protein", and the species is "Saccharomyces cerevisiae".

In some cases, multiple other protein database accessions are referenced and separated by a ctrl-A character. Spectrum Mill ignores anything in the header after the first ctrl-A it encounters.
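
The NCBIgb rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the workbench's parser: the function and class names are hypothetical, the species is assumed to sit in the last pair of square brackets, and the UNREADABLE fallback is borrowed from the behavior described for the other formats.

```python
from typing import NamedTuple


class Header(NamedTuple):
    accession: str
    name: str
    species: str


def parse_ncbigb(comment_line: str) -> Header:
    # Ignore everything after the first ctrl-A (\x01) character.
    text = comment_line.lstrip(">").split("\x01", 1)[0].strip()
    accession, _, rest = text.partition(" ")
    name, species = rest, "UNREADABLE"  # fallback is an assumption
    if rest.endswith("]") and "[" in rest:
        open_bracket = rest.rindex("[")
        species = rest[open_bracket + 1:-1]
        name = rest[:open_bracket].strip()
    return Header(accession, name, species)


# The example header above, with a made-up second reference after ctrl-A.
parsed = parse_ncbigb(
    ">CAA56020.1 B-127 protein [Saccharomyces cerevisiae]"
    "\x01XYZ00001.1 made-up duplicate [Some other species]"
)
```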

NCBInr

The comment lines from this database are tricky to handle because it is a non-redundant database which collects entries from several databases; thus there are several formats present in the final database.

>gi|304881 (L07596) alaS [Escherichia coli]
>gi|132349|sp|P15394|REPA_AGRTU REPLICATING PROTEIN
>gi|282349|pir||A41961 chitinase (EC 3.2.1.14) D - Bacillus circulans
>gi|477498|pir||A49131 releasechannel homolog - fruit fly (Drosophila melanogaster) (fragment)
>gi|543687|pir||A48298 sodium channel homolog - jellyfish (Cyanea capillata)

The Spectrum Mill workbench programs designate:

  • accession number, 304881, as all consecutive digits following the first "|"
  • species
  • name

    Whenever the species cannot be found, the species is assigned as UNREADABLE, and the name is assigned as the entire comment line. All of these UNREADABLE lines are then written by Protein Databases to the file seqdb\NCBInr.usp.

    SwissProt and TrEMBL

    Sample entry SwissProt (prior to November 2006)
    >100K_RAT (Q62671) 100 KDA PROTEIN (EC 6.3.2.-).

    Sample entry TrEMBL (prior to November 2006)
    >Q46513 (Q46513) ORF 2 GENE PRODUCT (FRAGMENT). 

    The Spectrum Mill workbench programs designate:

  • species, RAT, as the string between "_" and " "
  • accession number, Q62671, the alphanumeric string between "(" and ")"
  • name, 100 KDA PROTEIN (EC 6.3.2.-)., as the string following the species.
    Whenever the species cannot be found the species is assigned as UNREADABLE. (This usually does not happen for any entries in SwissProt, but happens for all entries in TrEMBL.)  All of these UNREADABLE lines are then written by Protein Databases to the file seqdb\SwissProt.usp.

    Sample entry SwissProt and TrEMBL (after November 2006)
    >Q4U9M9|104K_THEAN 104 kDa microneme-rhoptry antigen precursor (p104) - Theileria annulata

    The Spectrum Mill workbench programs designate:

  • species, THEAN, as the string between "_" and " "
  • accession number, Q4U9M9, the alphanumeric string between ">" and "|"
  • name, 104 kDa microneme-rhoptry antigen precursor (p104) - Theileria annulata, as the string following the species.

    The Spectrum Mill workbench is compatible with both SwissProt formats described above.
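
Handling both SwissProt formats in one routine can be sketched as below. The names are hypothetical and the logic is inferred only from the sample entries and rules above, so treat it as an illustration rather than the workbench's actual parser.

```python
import re
from typing import NamedTuple


class Header(NamedTuple):
    accession: str
    name: str
    species: str


def parse_swissprot(comment_line: str) -> Header:
    text = comment_line.lstrip(">").strip()
    first_token, _, rest = text.partition(" ")
    if "|" in first_token:
        # After November 2006: accession|ENTRY_SPECIES name
        accession, _, entry = first_token.partition("|")
        name = rest
    else:
        # Before November 2006: ENTRY_SPECIES (accession) name
        entry = first_token
        match = re.match(r"\((\w+)\)\s*(.*)", rest)
        accession = match.group(1) if match else ""
        name = match.group(2) if match else rest
    # Species is the text between "_" and the first space, if present.
    species = entry.partition("_")[2] or "UNREADABLE"
    return Header(accession, name, species)


old_sp = parse_swissprot(">100K_RAT (Q62671) 100 KDA PROTEIN (EC 6.3.2.-).")
old_tr = parse_swissprot(">Q46513 (Q46513) ORF 2 GENE PRODUCT (FRAGMENT).")
new_sp = parse_swissprot(
    ">Q4U9M9|104K_THEAN 104 kDa microneme-rhoptry antigen precursor (p104) - Theileria annulata"
)
```

The old TrEMBL sample has no "_" in its first token, which is why its species comes back as UNREADABLE, matching the note above.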

    IPI

    Sample entry IPI human December 2003
    >IPI:IPI00030991.1|SWISS-PROT:P40855|REFSEQ_NP:NP_002848|TREMBL:Q8NI97|ENSEMBL:ENSP00000294784 Tax_Id=9606 Peroxisomal farnesylated protein

    Sample entry IPI mouse December 2003
    >IPI:IPI00110309.2|TREMBL:Q9CXH0|ENSEMBL:ENSMUSP00000024958 Tax_Id=10090 Ensembl_locations(Chr-bp):17-23175270 3300002N10Rik protein

    Sample entry IPI rat December 2003
    >IPI:IPI00357878.1|REFSEQ_XP:XP_224588|ENSEMBL:ENSRNOP00000019511 Tax_Id=10116 Ensembl_locations(Chr-bp):16-2595757 similar to Arhgef3 protein

    The Spectrum Mill workbench programs designate:

  • accession number, IPI00030991, IPI00110309, & IPI00357878 as the alphanumeric string following the colon and preceding the first "."
  • name, Peroxisomal farnesylated protein, Ensembl_locations(Chr-bp):17-23175270 3300002N10Rik protein , Ensembl_locations(Chr-bp):16-2595757 similar to Arhgef3 protein as the string following the taxonomy code extending to the end of the line.
  • species, HUMAN, MOUSE, & RAT, taxonomy codes Tax_Id=9606, Tax_Id=10090, & Tax_Id=10116. When IPI adds new taxonomy codes, system administrators can update \msparams_mill\ipitax_id.tsv to enable immediate Spectrum Mill support. Consult the NCBI site for NCBI Taxonomy code information. See IPI help for details concerning version number and IPI entry information.
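
The IPI rules above can be sketched with a single regular expression. This is an illustration only: the names are hypothetical, the small dictionary merely stands in for the ipitax_id.tsv file (whose real layout is not described here), and returning UNREADABLE for an unknown taxonomy code is an assumption.

```python
import re
from typing import Dict, NamedTuple


class Header(NamedTuple):
    accession: str
    name: str
    species: str


# Stand-in for the taxonomy table; only the three codes named above.
TAX_ID_TO_SPECIES: Dict[str, str] = {"9606": "HUMAN", "10090": "MOUSE", "10116": "RAT"}

IPI_PATTERN = re.compile(
    r"^>?IPI:(?P<acc>[A-Za-z0-9]+)\.\S*\s+Tax_Id=(?P<tax>\d+)\s+(?P<name>.*)$"
)


def parse_ipi(comment_line: str) -> Header:
    match = IPI_PATTERN.match(comment_line.strip())
    if match is None:
        return Header("", comment_line.lstrip(">").strip(), "UNREADABLE")
    species = TAX_ID_TO_SPECIES.get(match.group("tax"), "UNREADABLE")
    return Header(match.group("acc"), match.group("name"), species)


human = parse_ipi(
    ">IPI:IPI00030991.1|SWISS-PROT:P40855|REFSEQ_NP:NP_002848|TREMBL:Q8NI97"
    "|ENSEMBL:ENSP00000294784 Tax_Id=9606 Peroxisomal farnesylated protein"
)
mouse = parse_ipi(
    ">IPI:IPI00110309.2|TREMBL:Q9CXH0|ENSEMBL:ENSMUSP00000024958 Tax_Id=10090 "
    "Ensembl_locations(Chr-bp):17-23175270 3300002N10Rik protein"
)
```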

    Genpept

    >gi|216790 (D13314) arginine deiminase [Mycoplasma hominis]

    The Spectrum Mill workbench programs designate:

  • accession number, D13314, as the alphanumeric string in the first set of parentheses in the line
  • name, arginine deiminase, as the string between the first ")" and the last "[" in the line
  • species, Mycoplasma hominis, as the string between the last set of brackets in the line
    Whenever the species cannot be found the species is assigned as UNREADABLE, and the name is assigned as the entire comment line. All of these UNREADABLE lines are then written by Protein Databases to the file seqdb\Genpept.usp.
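
The Genpept rules, including the UNREADABLE fallback, can be sketched as follows. The function and class names are hypothetical, and the sketch follows only the rules stated above.

```python
from typing import NamedTuple


class Header(NamedTuple):
    accession: str
    name: str
    species: str


def parse_genpept(comment_line: str) -> Header:
    text = comment_line.lstrip(">").strip()
    # Accession: contents of the first set of parentheses.
    accession = ""
    open_paren = text.find("(")
    close_paren = text.find(")", open_paren + 1) if open_paren >= 0 else -1
    if close_paren > open_paren >= 0:
        accession = text[open_paren + 1:close_paren]
    # Species: contents of the last set of brackets.
    open_bracket = text.rfind("[")
    close_bracket = text.rfind("]")
    if close_paren < 0 or not (close_paren < open_bracket < close_bracket):
        # Species not found: name becomes the entire comment line.
        return Header(accession, text, "UNREADABLE")
    name = text[close_paren + 1:open_bracket].strip()
    species = text[open_bracket + 1:close_bracket]
    return Header(accession, name, species)


good = parse_genpept(">gi|216790 (D13314) arginine deiminase [Mycoplasma hominis]")
bad = parse_genpept(">gi|1 (X1) made-up entry without species")
```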

    Owl

    >10KD_VIGUN 10 KD PROTEIN PRECURSOR (CLONE PSAS10). - VIGNA UNGUICULATA (COWPEA).
    >AEOHFPA AEOHFPA NID: g141875 - A.hydrophila DNA, clone pPH4.
    >pir|Q62671|100K_RAT 100 KD PROTEIN (EC 6.3.2.-). - RATTUS NORVEGICUS (RAT).

    The Spectrum Mill workbench programs designate:

  • accession number, 10KD_VIGUN, AEOHFPA, 100K_RAT as either the string before the first space in the line or the string between the second "|" in the line and the first space in the line. The second case is activated if the letters "pir" immediately follow the ">" character.
  • species, VIGNA UNGUICULATA, A.hydrophila DNA, clone pPH4, RATTUS NORVEGICUS as the string between the last dash " - " in the line and either the character combination " (" or the period character.
  • name, 10 KD PROTEIN PRECURSOR (CLONE PSAS10)., AEOHFPA NID: g141875, 100 KD PROTEIN (EC 6.3.2.-). as the string between the first space " " and the last dash " -" in the line.
    Whenever the species cannot be found the species is assigned as UNREADABLE, and the name is assigned as the entire comment line. All of these UNREADABLE lines are then written by Protein Databases to the file seqdb\Owl.usp.


    The Spectrum Mill Workbench File Naming Conventions for Proprietary/Generic FASTA Databases

    Note! As of B.06.00, databases that contain DNA sequences can no longer be searched. You must convert the DNA sequences to protein sequences, and name the file with either a PA or PN prefix.

    You name proprietary databases with the prefixes PA or PN:

    Explanation

    Often the comment line in a FASTA database is used to describe basic information like entry name, accession number (or other unique identifier), and the species or organism from which the sequence was obtained. With well-curated databases, this information is consistently organized into fields in the comment line of a FASTA-formatted database.

    For the Spectrum Mill workbench programs the sequence field is only subject to two constraints: 1) it must be in CAPITAL letters, and 2) it must be in single letter code. (Some people express amino acids in three-letter code.)
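
Before indexing a proprietary database, it can be worth checking sequences against these two constraints. The sketch below is a hypothetical helper, not part of the workbench; it assumes that anything other than the 20 standard single-letter amino acid codes in upper case should be flagged.

```python
from typing import List, Tuple

STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def foreign_characters(sequence: str) -> List[Tuple[int, str]]:
    """Return (1-based position, character) for every non-standard character."""
    return [
        (position, char)
        for position, char in enumerate(sequence, start=1)
        if char not in STANDARD_AMINO_ACIDS
    ]


clean = foreign_characters("MKVLA")
flagged = foreign_characters("MkVX")  # lower case "k" and "X" are both flagged
```

A sequence written in three-letter code (for example "MetLysVal") would be flagged at almost every position, which makes that mistake easy to spot.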

    The way the Spectrum Mill workbench programs "know" which dialect of FASTA to "speak" with a particular database's comment line is via the filename. Generic filename prefixes are shown below in bold and the associated comment line format described. These formats are handled in a relatively robust manner, to allow for the absence of fields or the presence of additional fields. The formats basically consist of "|" delimited fields of accession number, name, and species in that order.

    PN

    The P forms indicate protein sequence.

    > 417909| Better than sliced bread growth factor beta|Mouse|pancreas|

    The Spectrum Mill workbench programs designate:

  • accession number, 417909, as the integer before the first "|"
  • name, Better than sliced bread growth factor beta, as the string between the first "|" and second "|" (or the end of the line, if no second "|")
  • species, Mouse, as the string between the second "|" and third "|" (or the end of the line, if no third "|")
    Whenever the species cannot be found, the species is assigned as UNREADABLE, and the name is assigned as the entire comment line. All of these UNREADABLE lines are then written by Protein Databases to the file seqdb\DN.usp, or seqdb\PN.usp.
    If the accession number is alphanumeric, Protein Databases will still run to completion, and all the Spectrum Mill workbench programs will function properly, except those that retrieve an entry based on the accession number. This applies only to MS Digest and MS Edman, when retrieve entry by accession number is designated. In those cases, supplying an alphanumeric accession number will result in retrieving the entry closest to the end of the file which has an alphanumeric accession number.

    PA

    The P forms indicate protein sequence.

    Note that the PA differs from the PN set only in that the accession number can be alphanumeric rather than numeric. This second set is thus more robust. However, for large, frequently-updated databases, Protein Databases can take an hour to run rather than several minutes simply because creation of the dbfilename.acc file involves the much slower process of sorting strings rather than integers.

    > SlowSort909| Better than sliced bread growth factor beta|Mouse|pancreas|

    The Spectrum Mill workbench programs designate:

  • accession number, SlowSort909, as the alphanumeric string before the first "|"
  • name, Better than sliced bread growth factor beta, as the string between the first "|" and second "|" (or the end of the line, if no second "|")
  • species, Mouse, as the string between the second "|" and third "|" (or the end of the line, if no third "|")
    Whenever the species cannot be found, the species is assigned as UNREADABLE, and the name is assigned as the entire comment line. All of these UNREADABLE lines are then written by Protein Databases to the file seqdb\DA.usp or seqdb\PA.usp.

    Any number of proprietary databases may be created with PA or PN prefixes. You must also create species alias lists and accession number links for any databases which you create.
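
The "|"-delimited PN and PA comment lines described above can be parsed with a few lines of Python. This is a sketch with hypothetical names; the suggest_prefix helper simply applies the rule that PN expects an integer accession number and PA allows an alphanumeric one.

```python
from typing import NamedTuple


class Header(NamedTuple):
    accession: str
    name: str
    species: str


def parse_generic(comment_line: str) -> Header:
    text = comment_line.lstrip(">").strip()
    fields = [field.strip() for field in text.split("|")]
    accession = fields[0]
    name = fields[1] if len(fields) > 1 else ""
    species = fields[2] if len(fields) > 2 else ""
    if not species:
        # Species not found: name becomes the entire comment line.
        return Header(accession, text, "UNREADABLE")
    return Header(accession, name, species)


def suggest_prefix(accession: str) -> str:
    """Return 'PN' for an all-digit accession number, otherwise 'PA'."""
    return "PN" if accession.isdigit() else "PA"


pn = parse_generic("> 417909| Better than sliced bread growth factor beta|Mouse|pancreas|")
pa = parse_generic("> SlowSort909| Better than sliced bread growth factor beta|Mouse|pancreas|")
```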


    Protein Databases Output Files (the Indices)

    Suffix (databasefilename.xxx)    Description

    .idx     Index assigning a number to each entry in the database. The number is simply the order in which the entries appear in the database file. When a database is updated, the number corresponding to a particular entry will change only if the order of the entries in the file changes. Users see this number in the Spectrum Mill workbench programs designated as the MS Digest index number. Internally, the programs store this number when a hit is recorded during a search; the number is then used later to retrieve the sequence for output/report generation purposes.
    .idx2    Same as .idx, but allows for databases > 4.2 GB.
    .unk     Index which keeps track of all foreign characters in the sequence field for each database entry. For protein databases, any characters other than the 20 standard amino acids are foreign characters. Note that the sequences must be in CAPITAL letters and in single-letter code. (Some people express amino acids in three-letter code.)
    .mw      Index containing the calculated protein Molecular Weight (MW) of each sequence in the database. All amino acids are treated as unmodified, the amino acid X is treated as L, the amino acid B is treated as E, and the amino acid J is treated as Q. The .mw file is used to accelerate searches that are constrained by intact MW.
    .pi      Index containing the calculated protein pI of each sequence in the database. The amino acid C is treated as unmodified, the amino acid X is treated as L, the amino acid B is treated as E, and the amino acid J is treated as Q. The .pi file is used to accelerate searches that are constrained by intact pI.
    .sp      Index containing the Species of each sequence in the database. Used to accelerate searches that are constrained by species.
    .sl      Contains a list in alphabetical order of the text strings used to denote different species. A text string has to occur at least ten times to appear in this file. This file is never used by the Spectrum Mill workbench programs. The text strings are the ones you should use in MS Edman if you have the Search Mode set to Species.
    .usp     File created to list the comment lines of each entry for which Protein Databases cannot read the species. This file is never used by the Spectrum Mill workbench programs; it is created only for use by server administrators in troubleshooting species problems.
    .acc     Index of alphanumeric accession numbers, created only for database filename prefixes: Genpept, gen, SwissProt, swp, Owl, owl, DA, PA.
    .acn     Index of integer accession numbers, created only for database filename prefixes: NCBInr, nr, or PN.

    Note: You should not manually edit any of the files in the table above.
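
The substitution rules used for the .mw index (X as L, B as E, J as Q) can be illustrated with a small calculation. This is a sketch under stated assumptions: it uses commonly tabulated average residue masses plus one water, whereas this document does not say whether the workbench uses average or monoisotopic masses, and the function name is hypothetical.

```python
# Commonly tabulated average residue masses in Da (an assumption, see above).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528

# Substitutions described for the .mw and .pi indices.
SUBSTITUTIONS = {"X": "L", "B": "E", "J": "Q"}


def protein_mw(sequence: str) -> float:
    """Sum unmodified residue masses plus one water for the intact protein."""
    total = WATER
    for char in sequence:
        # Any other foreign character raises KeyError in this sketch.
        total += RESIDUE_MASS[SUBSTITUTIONS.get(char, char)]
    return total


glycine = protein_mw("G")
```

With such a value stored per entry, a search constrained to, say, 1000 to 100000 Da can skip entries outside the range without reading their sequences.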


    To Use the Browser Version of Protein Databases

    To return to default settings on the Protein Databases page, click the Spectrum Mill button to go to the Spectrum Mill home page.  Then click the link on the home page to go back to the Protein Databases page.


    Creating Indices for a New Database

    Once you've downloaded a new database into the seqdb directory, you need to create the index files described above before you can start to use it. To do this task, navigate to the Protein Databases form and select the Create indices for new database option. Then:

    1. Type the name of the database into the Newly downloaded database box.
    2. Click the Create Indices button.
    3. See Update the database list.


    Re-indexing an Updated Database

    Once you've updated a database, you must re-index it.  To do this, navigate to the Protein Databases form and select the Re-index existing database option.  Then:

    1. For Existing database to re-index, select a database.
    2. Click the Re-index button.
    3. See Update the database list.
    4. Recreate any subset databases so that the subsets contain the latest information.


    Updating the Database List in the HTML Forms

    The list of databases used by the other forms is held in a JavaScript file. The JavaScript file is automatically updated after performing any of the operations on the Protein Databases form, with the exception of Database summary report. In some cases, the file is not refreshed in the browser. If you don't see a newly indexed database in the database list, click the Update Database List button. 

    After the automatic update, you will probably have to reload any search forms that are currently open before the new database list appears. If this doesn't work, place the cursor in the URL location box of the browser and press return. If even this doesn't work, investigate the cache settings on your browser.

     


    Creating a Species and Protein MW Subset Database with Indices

    To create a subset database which has been pre-filtered for species and molecular weight, navigate to the Protein Databases form and select the Create species subset database option.

    For example, to create a subset database of E. coli proteins between 1000 and 100000 Da from the NCBInr database:

    1. Choose a suitable suffix, such as ecoli, for the database.
    2. Select NCBInr as the existing database.
    3. Select ESCHERICHIA COLI as the species.
    4. Keep the default of 1000 to 100000 Da as the MW of the protein and deselect All.
    5. Click the Create button.
    6. See Update the database list.

    Using subset databases is a good way to dramatically decrease search times.


    Creating a Subset Database with Indices from Saved Hits

    The hits (index numbers for matching database entries) from MS Edman can be saved to a user-specified file. This file can then be used to create a subset database containing only the hit proteins from the search.

    Navigate to the Protein Databases form and select the Create subset with indices from saved hits option. Then:

    1. Choose a suitable suffix for the database. The suffix must be unique; if you use the same suffix twice, then the previously-created subset database will be overwritten.
    2. Identify the database that was used in the original search.
    3. Identify the MS Edman file containing the saved hits by typing the file name.
    4. Click the Create button.
    5. See Update the database list.
     


    Creating or Appending to a Database Containing User Supplied Protein

    It is possible to create your own FASTA-format database which can be searched by the Spectrum Mill workbench search programs. An entry for a single protein is made up of a comment line containing accession number, species, and name fields, followed by one or more lines containing the sequence.

    Navigate to the Protein Databases form and select the Create or append user database option. Then:

    1. Type the database name. There are several dialects of FASTA with the essential difference between them being the format of the comment line. You are strongly advised to use a proprietary format but it is also possible to use a public format. If you choose a database name that already exists on the disk, then subsequent proteins will be appended to the end of the file; otherwise a new database file will be created. It is possible to append entries to the end of the publicly-available databases but this is not advisable because the index files are remade after each entry, because newer versions of the database won't contain your entries, and because any errors in the information you supply when adding the entry could potentially damage the whole database. If you want to use a public database format, you should use a database name such as NCBInr.user.
    2. Type a description for the database entry. Whether you are using a proprietary format or a public format, make sure you do not use characters in the name that might give the Spectrum Mill workbench programs problems in sorting out the fields in the comment line.
    3. Type a species for the entry. This should be consistent with the information in the msparams_mill\species.txt file.
    4. Type an accession number for the entry. The accession number must be unique; the program will alert you if it is not. If your database uses numeric accession numbers, then the accession number must be numeric.
    5. Type the protein sequence using only the upper case symbols for the 20 naturally occurring amino acids or the four base pairs as appropriate. You may also use X  if the sequence is unknown at a particular point.
    6. Click the Create button.


    Database Summary Report

    The Database Summary Report option is used to list the accession numbers, species and name fields for a selected index number range of a selected database. The Database Summary Report is a good way to verify that custom databases were properly parsed and indexed.

    Navigate to the Protein Databases form and select the Database summary report option. Then:

    1. Choose a database.
    2. Identify the index range you want to summarize. A typical Index number range is 1 to 100. The summary report will then allow you to see the next 100 (or your range), and so on.
    3. If you want to hide the protein sequences, mark the check box.
    4. Click the Summarize button.


    Concatenating Databases (FASTA files)

    This option is used to combine databases. You can either select one or more databases to concatenate, or you can concatenate all databases in a folder. Concatenate files in folders is most useful for adding smaller FASTA files, such as contaminants, that would not necessarily be selected separately for searching.

    The databases you concatenate must reside under the SeqDB folder. If you concatenate all FASTA files in a folder, the folder must reside under SeqDB.

    Navigate to the Protein Databases form and select the Concatenate FASTA files option. Then:

    1. Click either Select files to concatenate or Concatenate files in folder.
    2. Select the existing databases to concatenate, or (if you clicked Concatenate files in folder) enter paths to the FASTA files.
    3. Click the Concatenate button.
    4. Confirm that the new database is created in the SeqDB folder.
    5. After concatenating, you can use the Make non-redundant tool to remove redundant entries.
    6. Create indices for the newly created FASTA file.
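
For readers who prefer to script this step outside the browser form, concatenation amounts to writing the files one after another while making sure each file ends with a line break, so that the first comment line of the next file does not get glued onto the previous sequence. The sketch below is a hypothetical stand-alone helper, not the workbench's Concatenate tool; the file names and entries in the demonstration are made up.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def concatenate_fasta(sources: Iterable[Path], destination: Path) -> int:
    """Write all sources into destination; return the number of entries."""
    count = 0
    with destination.open("w", newline="\n") as out:
        for source in sources:
            with source.open() as handle:
                for line in handle:
                    line = line.rstrip("\r\n")
                    if not line:
                        continue  # drop blank lines between entries
                    if line.startswith(">"):
                        count += 1
                    out.write(line + "\n")
    return count


# Demonstration with two tiny made-up files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    first = folder / "IPI.HUMAN"
    second = folder / "NCBInr.contaminants"
    first.write_text(">IPI:IPI00000001.1 Tax_Id=9606 demo\nMKV")  # no final newline
    second.write_text(">gi|1 demo contaminant [Sus scrofa]\nGGH\n")
    target = folder / "IPI.HUMAN.contam"
    total = concatenate_fasta([first, second], target)
    combined = target.read_text()
```

The combined file still needs to be indexed with Protein Databases before it can be searched.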


    Comparing Two Databases

    This option allows you to compare two databases to determine whether their content is different. It is useful when you need to remove redundant databases from the Spectrum Mill server. Note that comparison of large databases requires some time.

    Navigate to the Protein Databases form and select the Compare two databases option. Then:

    1. Select the first database you want to compare (Database 1).
    2. Select the second database you want to compare (Database 2).
    3. Click the Compare button.
    4. In the report, view the number of updated entries (numUpdated) and the number of deleted entries (numDeleted). If these numbers are zero, the databases are identical.


    Calculating Database Statistics

    This option allows you to calculate these statistics:

    Navigate to the Protein Databases form and select the Calculate Statistics option. Then:

    1. From the Database 1 list, select the database for which you want the program to calculate statistics.
    2. If desired, mark the check box for Generate table of number of observable tryptic peptides per protein.
    3. Click Calculate Stats.
    4. Click Update Database List.


    Removing Redundant Database Entries

    Navigate to the Protein Databases form and select the Make Non-redundant option. Then:

    1. From the Database 1 list, select the database for which you want the program to remove redundant entries.
    2. Click Make Non-redundant.
    3. Use the Create Indices button to index the newly created database. In the Newly downloaded database box, type the database name with "nr" appended.
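
This document does not define exactly what the Make Non-redundant option treats as redundant. As a rough sketch of the general idea, the following hypothetical helper assumes that two entries are redundant when their sequences are identical, and keeps the first one encountered.

```python
from typing import Iterable, List, Tuple

Entry = Tuple[str, str]  # (comment line, sequence)


def make_non_redundant(entries: Iterable[Entry]) -> List[Entry]:
    """Keep the first entry for each distinct sequence (assumed criterion)."""
    seen = set()
    unique = []
    for comment, sequence in entries:
        if sequence in seen:
            continue
        seen.add(sequence)
        unique.append((comment, sequence))
    return unique


# Made-up entries: the second and third share a sequence.
entries = [
    ("1|demo protein A|Mouse", "MKVLA"),
    ("2|demo protein B|Mouse", "GGHKR"),
    ("3|demo protein B copy|Rat", "GGHKR"),
]
unique = make_non_redundant(entries)
```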


    Making a Subset FASTA File from Accession Numbers

    This option creates a subset FASTA file from accession numbers that you type. It is useful for limiting searches to the set of proteins of particular interest.

    1. Navigate to the Protein Databases form. 
    2. Select the Make subset FASTA file from Accession Numbers option.
    3. In the Suffix for subset database field, type the name of the suffix for your database.
    4. From the Database list, choose a database.
    5. Type the accession numbers you want to include, separated by semicolons (;).
    6. Click the Make Subset button.
    7. On the Spectrum Mill server, navigate to the folder where your databases are stored (for example, D:\seqdb).
    8. Note the new file created there.
    9. Create indices for the new subset database.
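
The effect of this option can be sketched in Python as follows. The helper names are hypothetical, and the sketch assumes PA-style comment lines in which the accession number is the text before the first "|"; other dialects would need their own accession rule.

```python
from typing import Iterable, List, Set, Tuple

Entry = Tuple[str, str]  # (comment line without ">", sequence)


def parse_accession_list(typed: str) -> Set[str]:
    """Split a semicolon-separated list, ignoring spaces and empty items."""
    return {item.strip() for item in typed.split(";") if item.strip()}


def subset_by_accession(entries: Iterable[Entry], wanted: Set[str]) -> List[Entry]:
    """Keep entries whose accession (text before the first '|') is wanted."""
    return [
        (comment, sequence)
        for comment, sequence in entries
        if comment.split("|", 1)[0].strip() in wanted
    ]


# Made-up sequences attached to PA-style comment lines.
entries = [
    ("SlowSort909| Better than sliced bread growth factor beta|Mouse|pancreas|", "MKV"),
    ("Other123| Hypothetical demo protein|Rat|", "GGH"),
]
subset = subset_by_accession(entries, parse_accession_list("SlowSort909; Missing1"))
```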


    Creating a Category File

    Navigate to the Protein Databases form and select the Create category file option. Then:

    1. From the Database 1 list, choose a database.
    2. Type the accession numbers you want to include, separated by semicolons.
    3. Click Make category file.
    4. On the Spectrum Mill server, navigate to the folder where your databases are stored (for example, D:\seqdb). Note the new file created there.


    The Command Line Version of Protein Databases

    Traditionalists and those who wish to automate the process of updating databases will probably prefer to use the command line version of Protein Databases.

    Protein Databases and the Spectrum Mill Workbench Directory Structure

    The faindex (Protein Databases) program is expected to reside in the same directory as all other Spectrum Mill workbench programs. Faindex accepts a single input argument (the name of the database file). Upon execution, faindex issues an instruction to read the database file from seqdb\database_filename and write the indices to seqdb\database_filename.suffix.

    This requires careful attention to which directory to launch faindex from and the syntax of launching it.

    Basically you should launch faindex from the directory immediately above the seqdb directory, without specifying the path to the database file. Faindex inserts only seqdb\ in front of the filename.

    If the faindex program does not reside in the directory immediately above the seqdb directory, then you may need to specify the path to faindex (but not to the database).

    Running Protein Databases

    If you wish to use the command line version of Protein Databases rather than the browser version, you may run the faindex.cgi program from an MS-DOS command prompt. The faindex.cgi command must be run from the root volume where the databases are installed (D:\ by default).

     

    1. Open an MS-DOS Command Window. (From the Windows Start menu, select Run... and type cmd.exe.)

       

    2. Change to the volume where you installed the protein databases. Execute just the volume letter to change to that volume. (If necessary, replace D: with the correct volume where your protein databases are installed):

      C:\> D:

      The display changes to:

      D:\>

       

    3. Run the following command from the root of the SeqDB volume, specifying the full path to the location of the faindex.cgi program:

      D:\> E:\SpectrumMill\millbin\faindex.cgi NCBInr

      (Replace E: with the correct volume if you installed the Spectrum Mill workbench on a different volume)

      You will see a message like:
            Creating index file NCBInr
      and after a minute or so you will see an increasing count scroll across the screen as the indices are created. If not, please read the directory structure section above.
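
If you want to automate re-indexing, the steps above can be wrapped in a small script. The following Python sketch is untested against a real installation and makes several assumptions: the paths are the ones from the example above, the function names are hypothetical, and whether faindex.cgi can be launched directly this way depends on how it is installed on your server.

```python
import subprocess
from typing import List

# Paths taken from the example above; adjust them for your installation.
FAINDEX = r"E:\SpectrumMill\millbin\faindex.cgi"
SEQDB_VOLUME_ROOT = "D:\\"


def faindex_command(database_name: str, faindex: str = FAINDEX) -> List[str]:
    """Build the command line; the database is given by name only, no path."""
    if "\\" in database_name or "/" in database_name:
        raise ValueError("pass the database name only, without a path")
    return [faindex, database_name]


def reindex(database_name: str, root: str = SEQDB_VOLUME_ROOT) -> None:
    """Run faindex from the directory immediately above seqdb."""
    subprocess.run(faindex_command(database_name), cwd=root, check=True)


command = faindex_command("NCBInr")
```

Calling reindex("NCBInr") from a scheduled task would then correspond to step 3. The working directory matters more than the location of the script, because faindex looks for the database under seqdb\ relative to where it is launched.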