Spectrum Mill
Protein Sequence Database Utilities

Introduction
Background on FASTA Format
Obtaining and Updating Sequence Databases
Change in NCBI FASTA Header Format
Spectrum Mill supported FASTA Database Header Formats

for Public FASTA Databases
for Proprietary/Generic FASTA Databases

The Indices
The Utilities

Index a New Database
Re-index an Updated Database
Concatenate FASTA files
Make Non-redundant database

Create subset FASTA database from Accession Numbers
Create subset FASTA database from Species and Protein MW
Create or Append user FASTA Database
Translate nucleotide FASTA to protein FASTA
Create category file from FASTA headers
Create FASTA file from category file
Compare Two Databases
Calculate statistics about database
Database Summary Report
Updating the Database Lists in the HTML Forms

The Command Line Version of FAindex
- Protein Databases and the Spectrum Mill Directory Structure
- Running Protein Databases

Links to related topics in the general and server administration instructions:

Links to related topics in separate stand-alone documents:

Personalized Sequence Databases for proteogenomics (single amino acid variants, splice junctions)

Introduction

The SM Protein Sequence Database Utilities web page provides access to several key capabilities enabled by a few different programs and scripts:

FAindex (C++ program, faindex.cgi)

To create several indices, which are much smaller files than the FASTA sequence database file itself. These indices allow Spectrum Mill programs rapid, byte-specific access to a .fasta file based on accession number, species, and protein MW, which:
- Enable an internal means for Spectrum Mill programs to store an index number when a hit is recorded during a search, then later use that number for rapid retrieval of that database entry for output/report generation purposes, thus decreasing memory requirements for program execution.
- Accelerate searches that are pre-filtered by intact protein MW and/or species.
To create subset sequence databases based on either a Species/Protein Molecular Weight pre-filter or the results of a previous search. Searches performed on these smaller subset databases are often much faster than searches performed on complete databases.
To append, via web UI, user provided sequences (fusion constructs, contaminants, etc).
To report summarized content of a sequence database using the index numbers of each entry. This is particularly helpful after appending user provided sequences.

FastaManipulator (Perl script, fastaManipulator.pl)

To combine multiple sequence databases by concatenating .fasta files.
To create personalized sequence databases for proteogenomics (single amino acid variants, splice junctions).
To remove redundant sequences.
To create a subset sequence database from a list of accession numbers.
To create a tab-delimited category file from FASTA headers containing: accession numbers, gene symbols, protein name, species. The category file is intended for use when making reports using SM's Protein/Peptide Summary module to report extra columns of meta data about each protein. This is the typical way of providing gene symbols.
To create a FASTA file from a tab-delimited category file containing: accession numbers, protein name, species, and sequence.
To compare the content of two sequence databases in terms of sequence, accession numbers, and gene symbols. This is particularly helpful when updating to a newer release or when considering switching database sources.
To calculate statistics about a database: distinct peptide count, peptide redundancy ratio, peptide redundancy histogram, sequence length histogram, amino acid frequencies, and a table of number of observable tryptic peptides per protein.

Background on the FASTA Format

The FASTA format for sequence databases was originally developed by Pearson for use with the FASTA program. Today it is probably the most widely-used standard format, primarily because its brevity results in the smallest possible file size for sequences.

An example of the format is shown below:

>sp|P28190|AA1R_BOVIN ADENOSINE A1 RECEPTOR.
MPPSISAFQAAYIGIEVLIALVSVPGNVLVIWAVKVNQALRDATFCFIVSLAVADVAVGA
LVIPLAILINIGPRTYFHTCLKVACPVLILTQSSILALLAMAVDRYLRVKIPLRYKTVVT
PRRAVVAITGCWILSFVVGLTPMFGWNNLSAVERDWLANGSVGEPVIECQFEKVISMEYM
VYFNFFVWVLPPLLLMVLIYMEVFYLIRKQLSKKVSASSGDPQKYYGKELKIAKSLALIL
FLFALSWLPLHILNCITLFCPSCHMPRILIYIAIFLSHGNSAMNPIVYAFRIQKFRVTFL
KIWNDHFRCQPAPPIDEDAPAERPDD

The standard format is not very specific because it says only that there is a single header line per entry which must begin with the ">" character and all subsequent lines for an entry contain sequence. However, there are many "standards" as to the arrangement of fields and/or delimiting of fields in the header line. Often the header line is used to describe basic information like entry name, accession number (or other unique identifier), and the species or organism from which the sequence was obtained.
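As a minimal sketch of how little the format itself guarantees, the following Python reads entries using only the two rules stated above: a header line begins with ">", and every line up to the next header is sequence. The function name read_fasta is hypothetical and is not part of Spectrum Mill.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines.

    Hypothetical helper for illustration; not a Spectrum Mill function.
    """
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]  # drop the leading ">"
            chunks = []
        elif header is not None:
            chunks.append(line)  # a sequence may span many lines
    if header is not None:
        yield header, "".join(chunks)


example = [
    ">sp|P28190|AA1R_BOVIN ADENOSINE A1 RECEPTOR.",
    "MPPSISAFQAAYIGIEVLIALVSVPGNVLVIWAVKVNQALRDATFCFIVSLAVADVAVGA",
    "LVIPLAILINIGPRTYFHTCLKVACPVLILTQSSILALLAMAVDRYLRVKIPLRYKTVVT",
]
records = list(read_fasta(example))
```

Everything beyond this split (accession number, species, name) depends on the header dialect, which is why the filename prefix matters.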

The FASTA format was chosen for use with the Spectrum Mill primarily because of its universality, brevity, and the expected ease with which sequence database files could be shared on the same computer with other programs for sequence analysis.

Obtaining and Updating Sequence Databases

Important: Spectrum Mill sequence database filenames must have a prefix that represents the primary format, which is usually dictated by the site from which the database was downloaded, as described in step 2 below!

Obtain FASTA-formatted sequence database files for the seqdb directory:
D:\seqdb.

Locations to download public-domain FASTA-formatted database files (via http or ftp):

SM prefix	DB source	Download location .FASTA Protein Sequences	Notes
UniProt	UniProt	http://www.uniprot.org/downloads	SwissProt + TrEMBL Excellent source of reference proteomes for model organisms
SwissProt	SwissProt	http://www.uniprot.org/downloads	Sets the standard for curated functional annotation
RefSeq	RefSeq	https://ftp.ncbi.nlm.nih.gov/refseq/ https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_protein.faa.gz https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_protein.faa.gz	NCBI reference proteomes readily map to the genome. Typically requires pre-filtering to remove entries that are ill-suited to proteomics.
Ensembl	EnsEMBL	http://ftp.ensembl.org/pub http://ftp.ensembl.org/pub/current_fasta/homo_sapiens/pep/Homo_sapiens.GRCh38.pep.all.fa.gz http://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/pep/Homo_sapiens.GRCh37.pep.all.fa.gz	Ensembl reference proteomes readily map to the genome. Typically requires pre-filtering to remove entries that are ill-suited to proteomics.
Gencode	GENCODE	https://www.gencodegenes.org/human/ http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_37/gencode.v37.pc_translations.fa.gz	Default annotation of the ensembl genome assembly. Typically requires pre-filtering to remove entries that are ill-suited to proteomics.
GenPept	GenPept	ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/	All the coding regions (with their /translation qualifiers) annotated on the records in GenBank. Download `gbDDDxxx.fsa_aa.gz`, where `DDD` is a division code and `xxx` is the part number. The relevant division codes are BCT - Bacteria, ENV - Environmental sampling, INV - Invertebrate, PLN - Plant, VRL - Viral, PRI - Primate, ROD - Rodent, MAM - Other Mammalian, VRT - Other Vertebrate. Usage requires downloading and concatenating all the parts for a division.
NCBIgb*	NCBI non-redundant	ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz	Comprehensive non-redundant database which collects entries from several database sources and all species. Now too large for practical present day use.
NCBInr*	NCBI non-redundant	ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz	Obsolete, but SM remains backwards compatible
IPI	International Protein Index (IPI)	ftp://ftp.ebi.ac.uk/pub/databases/IPI/last_release/current	Obsolete, but SM remains backwards compatible
OWL			Obsolete, but SM remains backwards compatible

Note that the full NCBI database is now very large, so you are strongly encouraged to download a species-specific database instead.

*Note: As of September 2016, the NCBI FASTA files (downloaded as nr.gz) are in a new format that specifies GenBank accessions instead of "gi" accession numbers. You must use either "NCBIgb" or "gb" as the filename prefix for Spectrum Mill to properly parse the FASTA header information. NCBI FASTA files in the older "gi" format must be specified with "NCBInr" or "nr" as the prefix.

Uncompress and rename the database files with the appropriate prefix shown in the table above. The prefixes are a necessary part of the name. Adding a .fasta suffix is encouraged, but not required. The prefix helps enable SM programs to expect a specific format of the FASTA format header line used in each database, and configure links in SM reports. However, you can mix formats in a single .fasta file, so as to support common actions like appending a UniProt-derived set of common laboratory contaminants to a RefSeq-derived human proteome.
Note that the database prefix should represent the primary format, which is usually dictated by the site from which the database was downloaded. For example, if you download a SwissProt database from the NCBI site, then the format may be NCBI, not SwissProt.

When choosing a database filename, keep in mind that the filename is stored with search results to enable subsequent retrieval of the protein sequence; hence, review of older data will be hindered if you update a database but retain the prior database filename. Adding the download date to the filename is a simple, yet effective, means of handling updates and maintaining backwards compatibility with search results.
Create indices in the seqdb directory for each database, by running the utility Index a New Database (FAindex). The indices are necessary for efficient memory mapping during searches, particularly for preliminary filtering by species and protein molecular weight. You must create new indices after each update of a database, even if the update is done by only adding new entries to the end of the original file.
- If, when you create indices, you encounter the error Duplicated accession number found, then you need to remove the duplicate(s) so that you do not subsequently encounter problems when generating a Protein/Peptide Summary report.
- To fix the duplicate accession numbers:
  1. Note from the message the duplicate accession number and the corresponding database entry numbers.
  2. Open the database in a plain text editing program (e.g., Notepad or WordPad), remove the duplicate accession number, save the file, and remove any .txt extension that has been appended.
  3. Go to Protein Databases and recreate the indices.
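Before indexing a large file, it can be convenient to look for duplicates yourself. The sketch below is a hypothetical helper, not an SM utility. It assumes the accession number is the text after ">" up to the first space or "|" (the rule this manual gives for RefSeq and Ensembl headers) and that entry numbers count from 1; other header formats need a different extraction rule.

```python
from collections import defaultdict
import re


def find_duplicate_accessions(headers):
    """Return {accession: [entry numbers]} for accessions seen more than once.

    Assumes the accession is the text after ">" up to the first space
    or "|"; entry numbers are 1-based positions in the file (an assumption).
    """
    seen = defaultdict(list)
    for entry_number, header in enumerate(headers, start=1):
        accession = re.split(r"[ |]", header.lstrip(">"), maxsplit=1)[0]
        seen[accession].append(entry_number)
    return {acc: nums for acc, nums in seen.items() if len(nums) > 1}


headers = [
    ">NP_000005.3 alpha-2-macroglobulin isoform a precursor [Homo sapiens]",
    ">NP_001340694.1 ALK tyrosine kinase receptor isoform 2 GN=ALK [Homo sapiens]",
    ">NP_000005.3 alpha-2-macroglobulin isoform a precursor [Homo sapiens]",
]
duplicates = find_duplicate_accessions(headers)
```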
If you want to use proprietary databases or update databases regularly, fully read this manual, particularly the generic database file naming sections.

If you wish to use the command line version of FAindex rather than the browser version, see the section about the command line version.

FAindex creates a file with a .usp suffix (e.g., NCBInr.usp) in which it writes the header line of each FASTA entry for which the Protein Databases program cannot parse the species. Viewing this file can help troubleshoot FASTA format problems for anyone using proprietary databases.
Update the database list on the HTML forms.

Change in NCBI FASTA Header Format

In September 2016, NCBI changed the FASTA header format to supply only the gb (GenBank) accession. The former gi accession is no longer used.

Newly downloaded databases in the new format are supported and the gb accession is used by the Spectrum Mill for those databases.

For the Spectrum Mill to properly recognize the format, these new databases require either an NCBIgb or gb prefix instead of the NCBInr prefix.

Existing databases (NCBInr) are still supported. GenBank accession numbers (when present) can be reported in Protein/Peptide Summary by creating a Category file for the database.

Spectrum Mill supported FASTA Database Header Formats

Often the header line in a FASTA database is used to describe basic information like entry name, accession number (or other unique identifier), and the species or organism from which the sequence was obtained. However, this information is NOT consistently organized into fields in the header lines of different FASTA databases, though within a specific database it is usually consistent.

The way the Spectrum Mill programs "know" which dialect of FASTA to "speak" with a particular database is via the filename. Acceptable filename prefixes are shown below in bold and the associated header line format described.
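To illustrate the idea of choosing a header dialect from the filename, here is a rough sketch. The prefix list is taken from this manual, but the matching rules are assumptions: the manual does not say whether matching is case-sensitive or how overlapping prefixes are resolved, so this version ignores case and lets the longest prefix win. The function name guess_header_format is hypothetical.

```python
KNOWN_PREFIXES = [
    "UniProt", "SwissProt", "RefSeq", "Ensembl", "Gencode", "GenPept",
    "NCBIgb", "NCBInr", "IPI", "Owl", "PA", "PN", "gb", "nr",
]


def guess_header_format(filename):
    """Return the first known prefix that the filename starts with, else None.

    Assumptions (not confirmed by the manual): matching is case-insensitive
    and longer prefixes win over shorter ones.
    """
    lowered = filename.lower()
    for prefix in sorted(KNOWN_PREFIXES, key=len, reverse=True):
        if lowered.startswith(prefix.lower()):
            return prefix
    return None
```

For example, guess_header_format("NCBIgb.20210401.fasta") returns "NCBIgb". A plain starts-with test like this would also match unrelated names that happen to begin with "PA" or "nr", so treat it only as an illustration of the principle.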

UniProt (SwissProt and TrEMBL)

Examples:
>sp|Q9UM73|ALK_HUMAN ALK tyrosine kinase receptor OS=Homo sapiens OX=9606 GN=ALK PE=1 SV=3
>sp|O82803|SRPP_HEVBR Small rubber particle protein OS=Hevea brasiliensis GN=SRPP PE=1 SV=1
>tr|A0A3Q1MKW9|A0A3Q1MKW9_BOVIN Tyrosine-protein kinase receptor OS=Bos taurus OX=9913 GN=ALK PE=3 SV=1

Spectrum Mill programs designate:

accession number Everything between 1st and 2nd |
Q9UM73
O82803
A0A3Q1MKW9
species
HUMAN
HEVBR
BOVIN
name Text after first space up until OS=
ALK tyrosine kinase receptor
Small rubber particle protein
Tyrosine-protein kinase receptor
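The designations above can be sketched in a few lines of Python. This is an illustrative parser, not FAindex itself; the function name is hypothetical, and the species rule (the text after "_" in the entry name) is an assumption made by analogy with the classic SwissProt rule described in this manual.

```python
def parse_uniprot_header(header):
    """Split a UniProt header into (accession, species code, name)."""
    fields = header.lstrip(">").split("|", 2)
    accession = fields[1]  # everything between the 1st and 2nd "|"
    entry_name, _, description = fields[2].partition(" ")
    # Species code: text after "_" in the entry name (assumed rule).
    species = entry_name.partition("_")[2]
    name = description.split(" OS=")[0]  # text after first space, up to OS=
    return accession, species, name


parsed = parse_uniprot_header(
    ">sp|Q9UM73|ALK_HUMAN ALK tyrosine kinase receptor "
    "OS=Homo sapiens OX=9606 GN=ALK PE=1 SV=3"
)
```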

SwissProt

Spectrum Mill remains compatible with the classic SwissProt formats described below. As of 2021 (and several years earlier) users should expect to instead use the UniProt formats shown above.

November 2006 - ??? SwissProt and TrEMBL
>Q4U9M9|104K_THEAN 104 kDa microneme-rhoptry antigen precursor (p104) - Theileria annulata

Spectrum Mill programs designate:

species, THEAN, as the string between "_" and " "
accession number, Q4U9M9, the alphanumeric string between ">" and "|"
name, 104 kDa microneme-rhoptry antigen precursor (p104) - Theileria annulata, as the string following the species.

prior to November 2006

Sample entry SwissProt
>100K_RAT (Q62671) 100 KDA PROTEIN (EC 6.3.2.-).

Sample entry TrEMBL
>Q46513 (Q46513) ORF 2 GENE PRODUCT (FRAGMENT).

Spectrum Mill programs designate:

species, RAT, as the string between "_" and " "
accession number, Q62671, the alphanumeric string between "(" and ")"
name, 100 KDA PROTEIN (EC 6.3.2.-)., as the string following the species.

Whenever the species cannot be found the species is assigned as UNREADABLE. (This usually does not happen for any entries in SwissProt, but happens for all entries in TrEMBL.) All of these UNREADABLE lines are then written by FAindex to the file seqdb\SwissProt.usp.

RefSeq

Example:
>NP_000005.3 alpha-2-macroglobulin isoform a precursor [Homo sapiens]
>NP_001340694.1 ALK tyrosine kinase receptor isoform 2 GN=ALK [Homo sapiens]

Spectrum Mill programs designate:

accession number Everything up to the first space or |.
NP_000005.3
NP_001340694.1
species Homo sapiens
name
alpha-2-macroglobulin isoform a precursor
ALK tyrosine kinase receptor isoform 2

Ensembl and GENCODE

Examples:
>ENSBTAP00000064819|ENSBTAG00000052228|GN= TRBV24 - 1 T cell receptor beta variable 24 - 1[Source:HGNC Symbol; Acc:HGNC : 12203]
>ENSP00000373700.3|ENST00000389048.3|ENSG00000171094.11|chr2:29415640-30144432:+|GN=ALK anaplastic lymphoma receptor tyrosine kinase
>ENSP00000493203.1|ENST00000642122.1|ENSG00000171094.18|OTTHUMG00000152034.4|OTTHUMT00000493449.1|ALK-207|ALK|552
>ENSP00000493203.1 pep chromosome:GRCh38:2:29192774:29223900:-1 gene:ENSG00000171094.18 transcript:ENST00000642122.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:ALK description:ALK receptor tyrosine kinase [Source:HGNC Symbol;Acc:HGNC:427]

Spectrum Mill programs designate:

accession number Everything up to the first space or |.
ENSBTAP00000064819
ENSP00000373700.3
ENSP00000493203.1
ENSP00000493203.1
species Human, accession number prefix: ENSP - Human, ENSBTA - Bovine.
Otherwise parsed from .fasta filename (text between 1st and second dots in filename becomes the SM species designation with conversion of "_" to space)
Example: Ensembl.Gopherus_evgoodei.20210401.fasta, SM species - Gopherus evgoodei
name After the accession number, all text after the last "|" and before the first "["
GN= TRBV24 - 1 T cell receptor beta variable 24 - 1
GN=ALK anaplastic lymphoma receptor tyrosine kinase
552
ALK receptor tyrosine kinase
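Two of the rules above are simple enough to sketch directly: the accession number (everything up to the first space or "|") and the fallback species taken from the filename. The function names are hypothetical, and returning None when the filename has fewer than two dots is an assumed behavior.

```python
import re


def ensembl_accession(header):
    """Accession: everything after ">" up to the first space or "|"."""
    return re.split(r"[ |]", header.lstrip(">"), maxsplit=1)[0]


def species_from_filename(filename):
    """Species: text between the 1st and 2nd dots, with "_" turned into space."""
    parts = filename.split(".")
    if len(parts) < 3:
        return None  # no second dot, so no species field (assumed handling)
    return parts[1].replace("_", " ")


species = species_from_filename("Ensembl.Gopherus_evgoodei.20210401.fasta")
```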

NCBIgb

Example:
>CAA56020.1 B-127 protein [Saccharomyces cerevisiae]

Spectrum Mill programs designate:

accession number CAA56020.1
species Saccharomyces cerevisiae
name B-127 protein

In some cases, multiple other protein database accessions are referenced and separated by a ctrl-A character. Spectrum Mill ignores anything in the header after the first ctrl-A it encounters.
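A minimal sketch of this behavior follows. Only the ctrl-A truncation is stated in this manual; the bracket rule for the species is inferred from the example, and the UNREADABLE fallback mirrors what the manual describes for other formats, so both are assumptions. The function name and the second accession in the demo header are made up for illustration.

```python
def parse_ncbigb_header(header):
    """Return (accession, species, name) from an NCBIgb-style header."""
    # Ignore everything after the first ctrl-A (character code 1).
    header = header.lstrip(">").split("\x01", 1)[0]
    accession, _, rest = header.partition(" ")
    # Species: text inside the last pair of square brackets (assumed rule).
    open_idx, close_idx = rest.rfind("["), rest.rfind("]")
    if 0 <= open_idx < close_idx:
        species = rest[open_idx + 1:close_idx]
        name = rest[:open_idx].strip()
    else:
        species, name = "UNREADABLE", rest.strip()  # assumed fallback
    return accession, species, name


# "AAA00000.1" is a hypothetical second accession, used only for the demo.
parsed = parse_ncbigb_header(
    ">CAA56020.1 B-127 protein [Saccharomyces cerevisiae]"
    "\x01AAA00000.1 hypothetical second entry [Unknown]"
)
```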

NCBInr (databases are obsolete as of Sept 2016, but SM remains backwards compatible)

The header lines from this database are tricky to handle because it is a non-redundant database which collects entries from several databases; thus there are several formats present in the final database.

Examples:

>gi|304881 (L07596) alaS [Escherichia coli]
>gi|132349|sp|P15394|REPA_AGRTU REPLICATING PROTEIN
>gi|282349|pir||A41961 chitinase (EC 3.2.1.14) D - Bacillus circulans
>gi|477498|pir||A49131 releasechannel homolog - fruit fly (Drosophila melanogaster) (fragment)
>gi|543687|pir||A48298 sodium channel homolog - jellyfish (Cyanea capillata)

Spectrum Mill programs designate:

accession number, 304881, as all consecutive digits following the first "|"
species
- Escherichia coli, as the string inside the last set of brackets from Genpept entries
- AGRTU, as the string between "_" and " " when preceded by "s.|......|" from SwissProt entries
- UNREADABLE, the dash "-" delimiter from PIR entries conflicts with dashes in the protein name from other databases; thus all entries exclusive to PIR (~1 to 2% of NCBInr) are not readable by FAindex
name
- (L07596) alaS, as the string between the accession number and the last set of brackets from Genpept entries
- REPLICATING PROTEIN, as the string following the species from SwissProt entries

Whenever the species cannot be found, the species is assigned as UNREADABLE, and the name is assigned as the entire header line. All of these UNREADABLE lines are then written by FAindex to the file seqdb\NCBInr.usp.

Genpept

>gi|216790 (D13314) arginine deiminase [Mycoplasma hominis]

Spectrum Mill programs designate:

accession number, D13314, as the alphanumeric string in the first set of parentheses in the line
name, arginine deiminase, as the string between the first ")" and the last "[" in the line
species, Mycoplasma hominis, as the string between the last set of brackets in the line

Whenever the species cannot be found the species is assigned as UNREADABLE, and the name is assigned as the entire header line. All of these UNREADABLE lines are then written by Protein Databases to the file seqdb\Genpept.usp.
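The Genpept rules above translate almost directly into string searches. The sketch below is illustrative only; the function name is hypothetical, and returning None for a missing accession is an assumption, since the manual does not describe that case.

```python
def parse_genpept_header(header):
    """Return (accession, species, name) following the Genpept rules above."""
    open_paren = header.find("(")
    close_paren = header.find(")", open_paren + 1)
    # Accession: string inside the first set of parentheses.
    if 0 <= open_paren < close_paren:
        accession = header[open_paren + 1:close_paren]
    else:
        accession = None  # assumed handling; not described in the manual
    open_br, close_br = header.rfind("["), header.rfind("]")
    if not (0 <= open_br < close_br):
        # Species not found: the name becomes the entire header line.
        return accession, "UNREADABLE", header
    name = header[close_paren + 1:open_br].strip()  # first ")" to last "["
    species = header[open_br + 1:close_br]          # last set of brackets
    return accession, species, name


parsed = parse_genpept_header(
    ">gi|216790 (D13314) arginine deiminase [Mycoplasma hominis]"
)
```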

IPI (databases are obsolete, but SM remains backwards compatible)

Example entries IPI December 2003

IPI human
>IPI:IPI00030991.1|SWISS-PROT:P40855|REFSEQ_NP:NP_002848|TREMBL:Q8NI97|ENSEMBL:ENSP00000294784 Tax_Id=9606 Peroxisomal farnesylated protein

IPI mouse
>IPI:IPI00110309.2|TREMBL:Q9CXH0|ENSEMBL:ENSMUSP00000024958 Tax_Id=10090 Ensembl_locations(Chr-bp):17-23175270 3300002N10Rik protein

IPI rat
>IPI:IPI00357878.1|REFSEQ_XP:XP_224588|ENSEMBL:ENSRNOP00000019511 Tax_Id=10116 Ensembl_locations(Chr-bp):16-2595757 similar to Arhgef3 protein

Spectrum Mill programs designate:

accession number, IPI00030991, IPI00110309, & IPI00357878 as the alphanumeric string following the colon and preceding the first "."
name, Peroxisomal farnesylated protein, Ensembl_locations(Chr-bp):17-23175270 3300002N10Rik protein, Ensembl_locations(Chr-bp):16-2595757 similar to Arhgef3 protein, as the string following the taxonomy code extending to the end of the line.
species, HUMAN, MOUSE, & RAT, taxonomy codes Tax_Id=9606, Tax_Id=10090, & Tax_Id=10116. When IPI adds new taxonomy codes, system administrators can update \msparams_mill\ipitax_id.tsv to enable immediate Spectrum Mill support. Consult the NCBI site for NCBI Taxonomy code information. See IPI help for details concerning version number and IPI entry information.
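The IPI rules can be sketched with two regular expressions. The small dictionary below holds only the three taxonomy codes named in this manual; it stands in for the ipitax_id.tsv file, whose actual layout is not shown here. The function name and the UNREADABLE fallback for unknown codes are assumptions.

```python
import re

# Only the taxonomy codes named in this manual; a stand-in for the
# ipitax_id.tsv file, whose real layout is not documented here.
TAX_ID_TO_SPECIES = {"9606": "HUMAN", "10090": "MOUSE", "10116": "RAT"}


def parse_ipi_header(header):
    """Return (accession, species, name) from an IPI-style header."""
    # Accession: alphanumeric string after the colon, before the first ".".
    acc_match = re.match(r">IPI:([A-Za-z0-9]+)\.", header)
    accession = acc_match.group(1) if acc_match else None
    # Name: everything after the taxonomy code, to the end of the line.
    tax_match = re.search(r"Tax_Id=(\d+)\s*(.*)$", header)
    if tax_match is None:
        return accession, "UNREADABLE", header  # assumed fallback
    species = TAX_ID_TO_SPECIES.get(tax_match.group(1), "UNREADABLE")
    return accession, species, tax_match.group(2)


parsed = parse_ipi_header(
    ">IPI:IPI00030991.1|SWISS-PROT:P40855|REFSEQ_NP:NP_002848|TREMBL:Q8NI97"
    "|ENSEMBL:ENSP00000294784 Tax_Id=9606 Peroxisomal farnesylated protein"
)
```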

Owl (database is obsolete, but SM remains backwards compatible)

>10KD_VIGUN 10 KD PROTEIN PRECURSOR (CLONE PSAS10). - VIGNA UNGUICULATA (COWPEA).
>AEOHFPA AEOHFPA NID: g141875 - A.hydrophila DNA, clone pPH4.
>pir|Q62671|100K_RAT 100 KD PROTEIN (EC 6.3.2.-). - RATTUS NORVEGICUS (RAT).

Spectrum Mill programs designate:

accession number, 10KD_VIGUN, AEOHFPA, 100K_RAT as either the string before the first space in the line or the string between the second dash in the line and the first space in the line. The second case is activated if the letters "pir" immediately follow the '>' character.
species, VIGNA UNGUICULATA, A.hydrophila DNA, clone pPH4, RATTUS NORVEGICUS as the string between the last dash " - " in the line and either the character combination " (" or the period character.
name, 10 KD PROTEIN PRECURSOR (CLONE PSAS10)., AEOHFPA NID: g141875, 100 KD PROTEIN (EC 6.3.2.-). as the string between the first space " " and the last dash " -" in the line.

The Spectrum Mill File Naming Conventions for Proprietary/Generic FASTA Databases

Note! As of B.06.00, databases that contain DNA sequences can no longer be searched. You must convert the DNA sequences to protein sequences, and name the file with either a PA or PN prefix.

You name proprietary databases with the prefixes PA or PN:

P indicates a protein database
N indicates use of numeric accession numbers
A indicates use of alphanumeric accession numbers

Explanation

Often the header line in a FASTA database is used to describe basic information like entry name, accession number (or other unique identifier), and the species or organism from which the sequence was obtained. With well-curated databases, this information is consistently organized into fields in the header line of a FASTA-formatted database.

For the Spectrum Mill programs the sequence field is only subject to two constraints: 1) it must be in CAPITAL letters, and 2) it must be in single letter code. (Some people express amino acids in three-letter code.)

The way the Spectrum Mill programs "know" which dialect of FASTA to "speak" with a particular database's header line is via the filename. Generic filename prefixes are shown below in bold and the associated comment line format described. These formats are handled in a relatively robust manner, to allow for the absence of fields or the presence of additional fields. The formats basically consist of "|" delimited fields of accession number, name, and species in that order.

The P forms indicate protein sequence.

> 417909| Better than sliced bread growth factor beta|Mouse|pancreas|

Spectrum Mill programs designate:

accession number, 417909, as the integer before the first "|"

name, Better than sliced bread growth factor beta, as the string between the first "|" and second "|" (or the end of the line, if no second "|")

species, Mouse, as the string between the second "|" and third "|" (or the end of the line, if no third "|")
Whenever the species cannot be found, the species is assigned as UNREADABLE, and the name is assigned as the entire header line. All of these UNREADABLE lines are then written by Protein Databases to the file seqdb\DN.usp, or seqdb\PN.usp.
If the accession number is alphanumeric, Protein Databases will still run to completion, and all the Spectrum Mill workbench programs will function properly, except those that retrieve an entry based on the accession number. This applies only to MS Digest and MS Edman, when retrieve entry by accession number is designated. In those cases, supplying an alphanumeric accession number will result in retrieving the entry closest to the end of the file which has an alphanumeric accession number.

The P forms indicate protein sequence.

Note that the PA differs from the PN set only in that the accession number can be alphanumeric rather than numeric. This second set is thus more robust. However, for large, frequently-updated databases, Protein Databases can take an hour to run rather than several minutes simply because creation of the dbfilename.acc file involves the much slower process of sorting strings rather than integers.

> SlowSort909| Better than sliced bread growth factor beta|Mouse|pancreas|

Spectrum Mill programs designate:

accession number, SlowSort909, as the alphanumeric string before the first "|"

name, Better than sliced bread growth factor beta, as the string between the first "|" and second "|" (or the end of the line, if no second "|")
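Because the PN and PA formats are simply "|" delimited fields, a header can be split as sketched below. This is an illustration, not the Protein Databases code; the function name is hypothetical, and treating an empty species field as UNREADABLE is an assumption.

```python
def parse_generic_header(header):
    """Return (accession, species, name) from a PN/PA-style header."""
    fields = [field.strip() for field in header.lstrip(">").split("|")]
    accession = fields[0]                       # before the first "|"
    name = fields[1] if len(fields) > 1 else ""  # between 1st and 2nd "|"
    species = fields[2] if len(fields) > 2 else ""
    if not species:
        # Species not found: name becomes the entire header line.
        return accession, "UNREADABLE", header
    return accession, species, name


parsed = parse_generic_header(
    "> 417909| Better than sliced bread growth factor beta|Mouse|pancreas|"
)
```

The same function handles PA headers, since the only difference is that the first field may be alphanumeric.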

Any number of proprietary databases may be created with PA or PN prefixes. You must also create species alias lists and accession number links for any databases which you create.

The Indices

Suffix (databasefilename.xxx)	Description
.idx	Primary binary index assigning an index number to each entry in the sequence database and mapping it to the byte position in the .fasta file of the start of the entry. The index number is simply the order in which the entries appear in the database file. When a database is updated, the number corresponding to a particular entry will change only if the order of the entries in the file changes. Users see this number in the output of Spectrum Mill programs, where it is designated as the MS Digest index number. Internally, the programs store this number when a hit is recorded during a search; the number is then used later to retrieve the sequence for output/report generation purposes.
.idx2	Same as .idx, but allows for databases > 4.2 GB.
.unk	Index which keeps track of all foreign characters in the sequence field for each database entry. For protein databases any characters other than the 20 standard amino acids are foreign characters. Note that the sequences must be in CAPITAL letters, and in single letter code. (Some people express amino acids in three-letter code.)
.mw	Binary index containing the calculated protein Molecular Weight (MW) of each sequence in the database. All amino acids are treated as unmodified, the amino acid X is treated as L, the amino acid B is treated as E, the amino acid J is treated as Q. The .mw file is used to accelerate searches that are constrained by intact protein MW.
.pi	Index containing the calculated protein pI of each sequence in the database. The amino acid C is treated as unmodified, the amino acid X is treated as L, the amino acid B is treated as E, the amino acid J is treated as Q. The .pi file is used to accelerate searches that are constrained by intact pI.
.sp	Index containing the Species of each sequence in the database. Used to accelerate searches that are constrained by species.
.sl	Contains a list in alphabetical order of the text strings used to denote different species. A text string has to occur at least ten times to appear in this file. This file is never used by the Spectrum Mill programs. The text strings are the ones you should use in MS Edman if you have the Search Mode set to Species.
.usp	File created to list the header lines of each entry for which Protein Databases cannot read the species. This file is never used by the Spectrum Mill programs; it is created only for use by server administrators in troubleshooting species problems.
.acc	Index of alphanumeric accession numbers mapped to index numbers, created only for database filename prefixes: Genpept, gen, SwissProt, swp, Owl, owl, DA, PA.
.acn	Index of integer accession numbers mapped to index numbers, created only for database filename prefixes: NCBInr, nr, or PN.

Note: You should not manually edit any of the files in the table above.
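To make the idea behind the primary index concrete, the following sketch records the byte position of each entry and then uses an index number to jump straight to that entry. It does not reproduce the binary layout of the real .idx file, which is not described here; the function names are hypothetical, and numbering entries from 1 is an assumption.

```python
import io


def build_offset_index(stream):
    """Return the byte offset of each entry's ">" line, in file order.

    The stream must be opened in binary mode so offsets are true bytes.
    """
    offsets = []
    position = 0
    for line in stream:
        if line.startswith(b">"):
            offsets.append(position)
        position += len(line)
    return offsets


def fetch_entry(stream, offsets, index_number):
    """Read one entry by its 1-based index number (numbering is assumed)."""
    start = offsets[index_number - 1]
    stream.seek(start)
    if index_number < len(offsets):
        data = stream.read(offsets[index_number] - start)
    else:
        data = stream.read()  # last entry runs to the end of the file
    return data.decode("ascii")


# Tiny in-memory stand-in for a .fasta file; the entries are made up.
demo = io.BytesIO(b">A1 first\nMKT\nLLV\n>B2 second\nGGS\n")
offsets = build_offset_index(demo)
entry = fetch_entry(demo, offsets, 2)
```

This also shows why re-indexing is required after any change to the file: every offset after an edited entry shifts.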

The Utilities

Index a New Database

Once you've downloaded a new database into the seqdb directory, you need to create the index files described above before you can start to use it. To do this task, navigate to the Protein Sequence Database Utilities page, select the utility-Index new database option. Then:

Type or copy/paste the name of the database into the Newly downloaded database box.
Click the Create Indices button.
See Update the database list.

Re-index an Updated Database

Once you've updated a database, you must re-index it. To do this task, navigate to the Protein Sequence Database Utilities page, and select the Re-index existing database option. Then:

For Existing database to re-index, select a database.
Click the Re-index button.
See Update the database list.
Recreate any subset databases so that the subsets contain the latest information.

Update the Database List in the HTML Forms

The list of databases used by the other forms is held in a JavaScript file. The JavaScript file is automatically updated after performing any of the FAindex based operations on the Protein Sequence Database Utilities form, with the exception of Database summary report. In some cases, the file is not refreshed in the browser. If you don't see a newly indexed database in the database list, click the Update Database List button.

After the automatic update, you will probably have to reload any search forms that are currently open before the new database list appears. If this doesn't work, place the cursor in the URL location box of the browser and press return. If even this doesn't work, investigate the cache settings on your browser.

Create a Species and Protein MW Subset Database with Indices

To create a subset database which has been pre-filtered for species and molecular weight, navigate to the Protein Databases form and select the Create species subset database option.

For example, to create a subset database of E. coli proteins between 1000 and 100000 Da from the NCBInr database:

Choose a suitable suffix, such as ecoli for the database.
Select NCBInr as the existing database.
Select ESCHERICHIA COLI as the species.
Keep the default of 1000 to 100000 Da as the MW of the protein and deselect All.
Click the Create button.
See Update the database list.

Using subset databases is a good way to dramatically decrease search times.

Create or Append to a Database Containing User Supplied Protein

It is possible to create your own FASTA-format database which can be searched by the Spectrum Mill search programs. An entry for a single protein is made up of a header line containing accession number, species and name fields followed by one or more lines containing the sequence.

Navigate to the Protein Sequence Database Utilities page, and select the Create or append user database option. Then:

Type the database name. There are several dialects of FASTA with the essential difference between them being the format of the header line. You are strongly advised to use a proprietary format but it is also possible to use a public format. If you choose a database name that already exists on the disk, then subsequent proteins will be appended to the end of the file; otherwise a new database file will be created. It is possible to append entries to the end of the publicly-available databases but this is not advisable because the index files are remade after each entry, because newer versions of the database won't contain your entries, and because any errors in the information you supply when adding the entry could potentially damage the whole database. If you want to use a public database format, you should use a database name such as NCBInr.user.
Type a description for the database entry. Whether you are using a proprietary format or a public format, make sure you do not use characters in the name that might give the Spectrum Mill programs problems in sorting out the fields in the comment line.
Type a species for the entry. This should be consistent with the information in the msparams_mill\species.txt file.
Type an accession number for the entry. The accession number must be unique; the program will alert you if it is not. If your database uses numeric accession numbers, then the accession number must be numeric.
Type the protein sequence using only the upper case symbols for the 20 naturally occurring amino acids or the four nucleotide bases, as appropriate. You may also use X if the sequence is unknown at a particular position.
Click the Create button.
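
Before pasting a sequence into the form, it can help to check it against the allowed symbols described in the steps above. The following is a minimal sketch, not part of Spectrum Mill; the function name invalid_residues is hypothetical, and it covers only protein sequences (the 20 standard amino acids plus X).

```python
# Allowed symbols per the text above: the 20 standard amino acids plus X.
ALLOWED_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY") | {"X"}


def invalid_residues(sequence: str) -> list[tuple[int, str]]:
    """Return (1-based position, character) for every disallowed symbol.

    Whitespace (spaces, line breaks) is removed before checking, so positions
    refer to the sequence itself. Lower-case letters are reported as invalid
    because the form expects upper case.
    """
    cleaned = "".join(sequence.split())
    return [
        (position, symbol)
        for position, symbol in enumerate(cleaned, start=1)
        if symbol not in ALLOWED_RESIDUES
    ]


if __name__ == "__main__":
    print(invalid_residues("MKTAYIAKQR"))   # no problems expected
    print(invalid_residues("MKtAYBAKQR"))   # lower-case t and ambiguous B
```

An empty list means the sequence uses only the permitted symbols.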

Translate nucleotide FASTA to protein FASTA

You can extract ORFs from transcripts and translate the nucleotide sequences to create a protein FASTA-format database that the Spectrum Mill search programs can search. An entry for a single protein consists of a header line containing the nucleotide sequence accession number with a suffix indicating the reading frame and incremental ORF number (for example, F2_R46 for frame 2, ORF 46), followed by one or more lines containing the protein sequence.

Navigate to the Protein Sequence Database Utilities page, and select the Translate nucleotide FASTA to protein FASTA option. Then:

Place .fasta or .fastq files in the SeqDB directory or a subdirectory on the SM server.
In the large text entry box, enter paths to nucleotide FAST(A/Q) files (using * as wildcard) under the SeqDB folder.
Select a maximum reading frames allowed value: 3 if the strand direction is known, 1 if the frame is known; otherwise use the default value of 6.
Enter a Minimum ORF length factor. The default is 0.8. The script can extract multiple ORFs per transcript depending on this value (for example, 1.0: longest ORFs only; 0.8: also additional ORFs that are at least 80% as long; 0.0: all ORFs meeting the length threshold).
Enter a minimum protein length value suited to your application; otherwise use the default value of 6.
Click the Extract ORFs button.
Create indices for the new protein database(s).

When translating from nucleotide to protein sequence, you must decide which of the 6 possible reading frames is the proper one. Several programs that do this can be found with a web search. This SM utility is built on the expectation that the assembled nucleotide sequences are imperfect, i.e., not always full length (they may not start with Met) and possibly containing frameshifts caused by misassembly. As a result, more than one protein sequence can be translated from each nucleotide sequence.
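
To make the six reading frames concrete, here is a minimal sketch of six-frame translation using the standard genetic code. It is an illustration only and not the SM utility's own code; the frame labels (+1 to +3, -1 to -3) and function names are hypothetical and do not correspond to the F/R suffixes that Spectrum Mill writes.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate from the first base; stops become '*', unknown codons 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna: str) -> dict[str, str]:
    """Return the three forward and three reverse-complement translations."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frames("ATGGCCTAA").items():
        print(label, protein)
```

Limiting the analysis to 3 frames corresponds to using only the forward (or only the reverse) translations, and 1 frame to a single known offset.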

This SM utility tries to address the following 2 issues in a very simple-minded way:

For a transcript, the longest ORF probably corresponds to the primary gene.
But if the length difference between additional ORFs is small, it is preferable to retain additional alternative translations.

The utility looks for open reading frames in DNA sequences, translates them into protein sequences, and outputs them ordered by length. The transcript is analyzed in n frames, and the n translated frames are examined for start/stop sequences. Translated frames are retained if they are:

longer than the minimum protein length (default 6), and
not shorter than the previous one by more than the minimum ORF length factor allows (default 0.8), that is, at least factor x the length of the previous one.

Example: suppose a transcript is translated in these frames:

[1] ABCDEFGHIJKLMNOPQRSTUVWXYZ (26 aa)
[2] BCDEFGHIJKLMNOPQRZTUVWXYZA (26 aa)
[3] ABCDEFGHIJLKMNOP (16 aa)
[4] ABCDEFHIJKLMN (13 aa)

With a minimum ORF length factor of 0.8, only [1] and [2] are output, because [2] is as long as [1] but [3] is only about 62% as long as [2]. With a factor of 0.4, [1], [2], [3] and [4] are all output, because:
[3] is more than 40% as long as [2], and
[4] is more than 40% as long as [3].
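
The retention rule can be sketched in a few lines. This is an illustration under stated assumptions, not the SM script itself: it assumes ORFs are ranked longest first, that an ORF is kept when it is at least factor x the length of the previously kept ORF, and that selection stops at the first ORF that fails either test. The function name select_orfs and its parameter names are hypothetical. It is consistent with the parameter description given earlier (1.0: longest ORFs only, 0.0: all ORFs meeting the length threshold).

```python
def select_orfs(orfs, min_length=6, min_factor=0.8):
    """Return the retained ORFs, longest first.

    Assumptions (not confirmed by the documentation): ORFs shorter than
    min_length are dropped, each ORF is compared with the previously kept
    one, and selection stops at the first ORF that fails a test.
    """
    kept = []
    for orf in sorted(orfs, key=len, reverse=True):
        if len(orf) < min_length:
            break
        if kept and len(orf) < min_factor * len(kept[-1]):
            break
        kept.append(orf)
    return kept


if __name__ == "__main__":
    # Hypothetical ORFs of length 30, 27, 20 and 8 residues.
    candidates = ["A" * 30, "C" * 27, "D" * 20, "E" * 8]
    for factor in (1.0, 0.8, 0.0):
        lengths = [len(orf) for orf in select_orfs(candidates, min_factor=factor)]
        print(factor, lengths)
```

With these hypothetical lengths, a factor of 1.0 keeps only the 30-residue ORF, 0.8 keeps the 30- and 27-residue ORFs, and 0.0 keeps all four.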

Database Summary Report

The Database Summary Report option is used to list the accession numbers, species and name fields for a selected index number range of a selected database. The Database Summary Report is a good way to verify that custom databases were properly parsed and indexed.

Navigate to the Protein Sequence Database Utilities page, and select the Database summary report option. Then:

Choose a database.
Identify the index range you want to summarize. A typical index number range is 1 to 100. The summary report then lets you step to the next 100 entries (or the next block of the size you chose), and so on.
If you want to hide the protein sequences, mark the check box.
Click the Summarize button.

Concatenate Databases (FASTA files)

This option is used to combine databases. You can either select one or more databases to concatenate, or you can concatenate all databases in a folder. The Concatenate files in folder choice is most useful for adding smaller FASTA files, such as contaminants, that would not necessarily be selected separately for searching.

The databases you concatenate must reside under the SeqDB folder. If you concatenate all FASTA files in a folder, the folder must reside under SeqDB.

Navigate to the Protein Sequence Database Utilities page, and select the Concatenate FASTA files option. Then:

Click either Select files to concatenate or Concatenate files in folder.
Select the existing databases to concatenate, or (if you clicked Concatenate files in folder) enter paths to the FASTA files.
Click the Concatenate button.
Confirm that the new database is created in the SeqDB folder.
After concatenating, you can use the Make non-redundant tool to remove redundant entries.
Create indices for the newly created FASTA file.
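
For readers who prepare files outside the web form, concatenating every FASTA file in a folder can be sketched as follows. This is only an illustration of the operation, not the Spectrum Mill implementation; the function name concatenate_fasta, the *.fasta pattern, and the output filename are assumptions.

```python
from pathlib import Path


def concatenate_fasta(folder, output_path, pattern="*.fasta"):
    """Write every matching file in `folder` to `output_path`, in name order.

    A newline is added after any file that lacks a trailing one, so the next
    header line always starts on its own line. Returns the files used.
    """
    output_path = Path(output_path)
    sources = sorted(
        path for path in Path(folder).glob(pattern)
        if path.resolve() != output_path.resolve()  # never read our own output
    )
    with output_path.open("w", newline="\n") as out:
        for source in sources:
            text = source.read_text()
            if text and not text.endswith("\n"):
                text += "\n"
            out.write(text)
    return sources


if __name__ == "__main__":
    # Hypothetical folder and output names.
    used = concatenate_fasta("contaminants", "contaminants_all.fasta")
    print(f"concatenated {len(used)} files")
```

The combined file still needs to be indexed before it can be searched.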

Make Proteogenomic Summary Tables:
Selecting this option triggers the creation of summary tables for personalized sequence databases that will, after MS/MS searches, enable generation of protein/peptide summary reports where the proteogenomic (PG) site, variant or splice junction, is the primary organizing feature. This option has 2 accompanying parameters:

Proteogenomic Database Source
Reference Database

These capabilities are more fully described in a separate document: Using Personalized Protein Sequence Databases in Spectrum Mill

Compare Two Databases

This option allows you to compare two databases to determine whether their content is different. It is useful when you need to remove redundant databases from the Spectrum Mill server. Note that comparison of large databases requires some time.

Navigate to the Protein Sequence Database Utilities page, and select the Compare two databases option. Then:

Select the first database you want to compare (Database 1).
Select the second database you want to compare (Database 2).
Click the Compare button.
In the report, view the number of updated entries (numUpdated) and the number of deleted entries (numDeleted). If these numbers are zero, the databases are identical.
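
The idea behind the comparison can be sketched as below. The documentation does not define exactly how numUpdated and numDeleted are computed, so this sketch assumes the following: the accession is the first whitespace-delimited token of the header, an entry is "updated" when its accession is in both databases with different sequences, and "deleted" when it is in Database 1 only. The numAdded count is a hypothetical addition that the report described above does not mention.

```python
def read_fasta(lines):
    """Map accession (assumed: first token after '>') to its sequence."""
    entries, accession = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            accession = (line[1:].split() or [""])[0]
            entries[accession] = ""
        elif accession is not None and line:
            entries[accession] += line
    return entries


def compare_databases(db1_lines, db2_lines):
    """Count entries that differ between two FASTA databases."""
    db1, db2 = read_fasta(db1_lines), read_fasta(db2_lines)
    return {
        "numUpdated": sum(1 for a in db1 if a in db2 and db1[a] != db2[a]),
        "numDeleted": sum(1 for a in db1 if a not in db2),
        "numAdded": sum(1 for a in db2 if a not in db1),  # hypothetical extra
    }


if __name__ == "__main__":
    old = [">A one", "MKT", ">B two", "AAA", ">C", "GG"]
    new = [">A one", "MKT", ">B two", "AAC", ">D", "LL"]
    print(compare_databases(old, new))
```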

Calculate Database Statistics

This option allows you to calculate these statistics:

Number of entries with leading methionine
Number of distinct peptides of length 8-40
Number of non-distinct peptides of length 8-40
Most repeated peptide
Peptide redundancy ratio
Distinct Peptide Redundancy Histogram
Sequence Length Histogram
Amino acid frequencies

Navigate to the Protein Sequence Database Utilities page, and select the Calculate Statistics option. Then:

From the Database 1 list, select the database for which you want the program to calculate statistics.
If desired, mark the check box for Generate table of number of observable tryptic peptides per protein. This table can be used to adjust label-free quantitation intensities in an emPAI-like fashion.
Click Calculate Stats.
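
Two of the listed statistics, the number of entries with a leading methionine and the amino acid frequencies, are simple enough to sketch. This is an illustration of what those numbers mean, not the program's own code; the function name basic_stats and the returned key names are hypothetical.

```python
from collections import Counter


def basic_stats(sequences):
    """Compute entry count, leading-Met count and amino acid frequencies."""
    sequences = list(sequences)
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {
        "entries": len(sequences),
        "leading_met": sum(1 for seq in sequences if seq.startswith("M")),
        # Fraction of all residues in the database, per amino acid symbol.
        "aa_frequency": {aa: n / total for aa, n in sorted(counts.items())},
    }


if __name__ == "__main__":
    stats = basic_stats(["MKA", "AKK"])
    print(stats["entries"], stats["leading_met"])
    for aa, freq in stats["aa_frequency"].items():
        print(aa, round(freq, 3))
```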

Make Non-redundant database

Navigate to the Protein Sequence Database Utilities page, and select the Make Non-redundant database option. Then:

From the Database 1 list, select the database for which you want the program to remove redundant entries.
Click Make Non-redundant.
Use the Create Indices button to index the newly created database. In the Newly downloaded database box, type the database name with "nr" appended.
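
The documentation does not state how redundancy is defined. As a minimal sketch, assuming that two entries are redundant when their sequences are identical and that the first occurrence is the one kept, the operation looks like this (the function name make_nonredundant is hypothetical):

```python
def make_nonredundant(entries):
    """Keep the first (header, sequence) pair for each distinct sequence.

    Assumption: "redundant" means an exact duplicate sequence. The actual
    Spectrum Mill criterion may differ.
    """
    seen = set()
    unique = []
    for header, sequence in entries:
        if sequence not in seen:
            seen.add(sequence)
            unique.append((header, sequence))
    return unique


if __name__ == "__main__":
    database = [("P1 first", "MKTA"), ("P2 copy", "MKTA"), ("P3 other", "MKTL")]
    for header, sequence in make_nonredundant(database):
        print(f">{header}\n{sequence}")
```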

Create Subset FASTA File from Accession Numbers

This option creates a subset FASTA file from accession numbers that you provide. It is useful for limiting searches to the set of proteins of particular interest.

Navigate to the Protein Sequence Database Utilities page.
Select the Make subset FASTA file from Accession Numbers option.
In the Suffix for subset database field, enter the suffix that will be appended to the filename of your selected existing database when creating your new subset database.
From the Database list, choose a database.
Enter the accession numbers you want to include, separated by semicolons (;).
Click the Make Subset button.
On the Spectrum Mill server, navigate to the folder where your databases are stored (for example, D:\seqdb).
Note the new file created there.
Create indices for the new subset database.
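
The steps above amount to filtering a FASTA file by a semicolon-separated accession list. A minimal sketch of that filtering follows; it is not the Spectrum Mill code, and it assumes the accession is the first whitespace-delimited token of each header line, which may not hold for every header format. The function names are hypothetical.

```python
def parse_accessions(text):
    """Split a semicolon-separated list, ignoring blanks and extra spaces."""
    return {item.strip() for item in text.split(";") if item.strip()}


def subset_fasta(lines, accessions):
    """Return the lines of every entry whose accession is in `accessions`."""
    keep = False
    selected = []
    for line in lines:
        if line.startswith(">"):
            accession = (line[1:].split() or [""])[0]
            keep = accession in accessions
        if keep:
            selected.append(line.rstrip("\n"))
    return selected


if __name__ == "__main__":
    fasta = [">P1 kinase", "MKTA", "LLVG", ">P2 other", "GGSA", ">P3 last", "MAAA"]
    wanted = parse_accessions("P1; P3;")
    print("\n".join(subset_fasta(fasta, wanted)))
```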

Create category file from FASTA headers

Navigate to the Protein Sequence Database Utilities page, and select the Create category file from FASTA headers option. Then:

From the Database 1 list, choose a database.
Type the accession numbers you want to include, separated by a semicolon.
Click Make category file.
On the Spectrum Mill server, navigate to the folder where your databases are stored (for example, D:\seqdb). Note the new file created there.

Create FASTA file from category file

Create a Spectrum Mill format category file. For this task, the tab-delimited category file must contain 2 columns named accession_number and sequence. Optional columns named entry_name and species, if present, are also included in the FASTA header. Any other columns are ignored.

Navigate to the Protein Sequence Database Utilities page, and select the Create FASTA file from category file option. Then:

In the New category file in seqdb directory box, copy and paste or type the filename of the category file.
Type the accession numbers you want to include, separated by a semicolon.
Click Create FASTA file.
On the Spectrum Mill server, navigate to the folder where your databases are stored (for example, D:\seqdb). Note the new file created there.
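
The column requirements for the category file can be illustrated with a short sketch that reads a tab-delimited table and writes FASTA text. The column names come from the description above; the header layout produced here (fields joined by spaces) and the 60-character line width are assumptions for illustration, not the layout Spectrum Mill actually writes. The function name category_to_fasta is hypothetical.

```python
import csv
import io


def category_to_fasta(tsv_text, width=60):
    """Convert tab-delimited category text to FASTA text.

    Required columns: accession_number, sequence.
    Optional columns: entry_name, species. Other columns are ignored.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    records = []
    for row in reader:
        header_fields = [row["accession_number"]]
        for optional in ("entry_name", "species"):
            if row.get(optional):
                header_fields.append(row[optional])
        sequence = row["sequence"]
        wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
        records.append(">" + " ".join(header_fields) + "\n" + "\n".join(wrapped) + "\n")
    return "".join(records)


if __name__ == "__main__":
    table = (
        "accession_number\tsequence\tentry_name\tnotes\n"
        "P1\tMKTAYIAK\tdemo protein\tignored column\n"
    )
    print(category_to_fasta(table), end="")
```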

The Command Line Version of FAindex

Those who wish to automate the process of updating sequence databases and indexing them for use in Spectrum Mill will probably prefer to use the command line version of FAindex.

FAindex and the Spectrum Mill Directory Structure

The faindex program is expected to reside in the same directory as all other Spectrum Mill programs. Faindex accepts a single input argument (the name of the database file). Upon execution, faindex reads the database file from seqdb\database_filename and writes the indices to seqdb\database_filename.suffix.

This requires careful attention to which directory to launch faindex from and the syntax of launching it.

In practice, launch faindex from the directory immediately above the seqdb directory, without specifying the path to the database file. Faindex inserts only seqdb\ in front of the filename.

If the faindex program does not reside in the directory immediately above the seqdb directory, then you may need to specify the path to faindex (but not to the database).

Running FAindex

If you wish to use the command line version of FAindex rather than the browser version, you may run the faindex.cgi program from an MS-DOS command prompt. The faindex.cgi command must be run from the root volume where the databases are installed (D:\ by default).

Open an MS-DOS Command Window. (From the Windows Start menu, select Run... and type cmd.exe.)
Change to the volume where you installed the protein databases. Execute just the volume letter to change to that volume. (If necessary, replace D: with the correct volume where your protein databases are installed):
C:\> D:

The display changes to:

D:\>
Run the following command from the root of the SeqDB volume, specifying the full path to the location of the faindex.cgi program:
D:\> E:\SpectrumMill\millbin\faindex.cgi NCBInr

(Replace E: with the correct volume if you installed the Spectrum Mill on a different volume)

You will see a message like:
Creating index file NCBInr
and after a minute or so you will see an increasing count scroll across the screen as the indices are created. If not, please read the directory structure section above.
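
For automated updates, the same command can be launched from a script. The sketch below only wraps the invocation shown above: it passes the bare database name and sets the working directory to the folder immediately above seqdb. The paths are the examples used in this section. Whether faindex.cgi can be started this way, and whether it returns a non-zero exit code on failure, are assumptions that should be checked on your installation. The function names are hypothetical.

```python
import subprocess
from pathlib import PureWindowsPath


def faindex_command(faindex_path, database_name):
    """Build the argument list; the database must be a bare name, not a path."""
    if PureWindowsPath(database_name).name != database_name:
        raise ValueError("pass the database name only; faindex adds seqdb\\ itself")
    return [str(faindex_path), database_name]


def run_faindex(faindex_path, database_name, seqdb_parent):
    """Run faindex with the working directory set to the folder above seqdb."""
    # check=True assumes faindex signals failure with a non-zero exit code.
    return subprocess.run(
        faindex_command(faindex_path, database_name),
        cwd=seqdb_parent,
        check=True,
    )


if __name__ == "__main__":
    # Example locations from the text above; adjust to your installation.
    print(faindex_command(r"E:\SpectrumMill\millbin\faindex.cgi", "NCBInr"))
    # run_faindex(r"E:\SpectrumMill\millbin\faindex.cgi", "NCBInr", "D:\\")
```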
