Links to related topics in the general and server administration instructions:
Links to related topics in separate stand-alone documents:
The SM Protein Sequence Database Utilities web page provides access to several key capabilities enabled by a few different programs and scripts:
The FASTA format for sequence databases was originally developed by Pearson for use with the FASTA program. Today it is probably the most widely-used standard format, primarily because its brevity results in the smallest possible file size for sequences.
An example of the format is shown below:
>sp|P28190|AA1R_BOVIN ADENOSINE A1 RECEPTOR.
MPPSISAFQAAYIGIEVLIALVSVPGNVLVIWAVKVNQALRDATFCFIVSLAVADVAVGA
LVIPLAILINIGPRTYFHTCLKVACPVLILTQSSILALLAMAVDRYLRVKIPLRYKTVVT
PRRAVVAITGCWILSFVVGLTPMFGWNNLSAVERDWLANGSVGEPVIECQFEKVISMEYM
VYFNFFVWVLPPLLLMVLIYMEVFYLIRKQLSKKVSASSGDPQKYYGKELKIAKSLALIL
FLFALSWLPLHILNCITLFCPSCHMPRILIYIAIFLSHGNSAMNPIVYAFRIQKFRVTFL
KIWNDHFRCQPAPPIDEDAPAERPDD
The standard format is not very specific because it says only that there is a single header line per entry which must begin with the ">" character and all subsequent lines for an entry contain sequence. However, there are many "standards" as to the arrangement of fields and/or delimiting of fields in the header line. Often the header line is used to describe basic information like entry name, accession number (or other unique identifier), and the species or organism from which the sequence was obtained.
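The minimal rule described above (one ">" header line, then sequence lines until the next header) can be sketched in a few lines of Python. This is an illustrative reader only, not the parser used by Spectrum Mill; the function name `read_fasta` is hypothetical, and the sample sequence is truncated to two lines of the entry shown above.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; the header excludes the leading '>'."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line.strip():
            # Every non-empty line after a header is sequence.
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


example = """>sp|P28190|AA1R_BOVIN ADENOSINE A1 RECEPTOR.
MPPSISAFQAAYIGIEVLIALVSVPGNVLVIWAVKVNQALRDATFCFIVSLAVADVAVGA
KIWNDHFRCQPAPPIDEDAPAERPDD
"""
entries = list(read_fasta(example.splitlines()))
```

Note that the reader makes no attempt to interpret the header fields; that is the part that varies between databases.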
The FASTA format was chosen for use with the Spectrum Mill primarily because of its universality, its brevity, and the expected ease with which sequence database files could be shared on the same computer with other programs for sequence analysis.
Important: Spectrum Mill sequence database filenames must have a prefix that represents the primary format, which is usually dictated by the site from which the database was downloaded, as described in step 2 below!
Locations to download public domain FASTA formatted database files via ftp:
SM prefix | DB source | Download location (.fasta protein sequences) | Notes |
---|---|---|---|
UniProt | UniProt | http://www.uniprot.org/downloads | SwissProt + TrEMBL. Excellent source of reference proteomes for model organisms |
SwissProt | SwissProt | http://www.uniprot.org/downloads | Sets the standard for curated functional annotation |
RefSeq | RefSeq | https://ftp.ncbi.nlm.nih.gov/refseq/ https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_protein.faa.gz https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_protein.faa.gz | NCBI reference proteomes readily map to the genome. Typically requires pre-filtering to remove entries that are ill-suited to proteomics. |
Ensembl | EnsEMBL | http://ftp.ensembl.org/pub http://ftp.ensembl.org/pub/current_fasta/homo_sapiens/pep/Homo_sapiens.GRCh38.pep.all.fa.gz http://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/pep/Homo_sapiens.GRCh37.pep.all.fa.gz | Ensembl reference proteomes readily map to the genome. Typically requires pre-filtering to remove entries that are ill-suited to proteomics. |
Gencode | GENCODE | https://www.gencodegenes.org/human/ http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_37/gencode.v37.pc_translations.fa.gz | Default annotation of the Ensembl genome assembly. Typically requires pre-filtering to remove entries that are ill-suited to proteomics. |
GenPept | GenPept | ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/ | All the coding regions (with their /translation qualifiers) annotated on the records in GenBank. Download gbDDDxxx.fsa_aa.gz, where DDD is a division code and xxx is the part number. The relevant division codes are BCT - Bacteria, ENV - Environmental sampling, INV - Invertebrate, PLN - Plant, VRL - Viral, PRI - Primate, ROD - Rodent, MAM - Other Mammalian, VRT - Other Vertebrate. Usage requires downloading and concatenating all the parts for a division. |
NCBIgb* | NCBI non-redundant | ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz | Comprehensive non-redundant database which collects entries from several database sources and all species. Now too large for practical present-day use. |
NCBInr* | NCBI non-redundant | ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz | Obsolete, but SM remains backwards compatible |
IPI | International Protein Index (IPI) | ftp://ftp.ebi.ac.uk/pub/databases/IPI/last_release/current | Obsolete, but SM remains backwards compatible |
OWL | | | Obsolete, but SM remains backwards compatible |
Note that the full NCBI database is now very large, so you are strongly encouraged to download a species-specific database instead.
*Note: As of September 2016, the NCBI FASTA files (download as nr.gz) are in a new format that specifies GenBank accessions instead of "gi" accession numbers. You must use either "NCBIgb" or "gb" as the filename prefix for Spectrum Mill to properly parse the FASTA header information. NCBI FASTA files in the older "gi" format must be specified with "NCBInr" or "nr" as the prefix.
Note that the database prefix should represent the primary format, which is usually dictated by the site from which the database was downloaded. For example, if you download a SwissProt database from the NCBI site, then the format may be NCBI, not SwissProt.
When choosing a database filename, keep in mind that the filename is stored with search results to enable subsequent retrieval of the protein sequence. Review of older data will therefore be hindered if you update a database but retain the prior database filename. Adding the download date to the filename is a simple yet effective means of handling updates and maintaining backwards compatibility with search results.
If you want to use proprietary databases or update databases regularly, fully read this manual, particularly the generic database file naming sections.
If you wish to use the command line version of FAindex rather than the browser version, see the section about the command line version.
FAindex creates a file with a .usp suffix (e.g., NCBInr.usp) in which it writes the header line of each FASTA entry for which the Protein Databases program cannot parse out the species. Viewing this file can help troubleshoot FASTA format problems for anyone using proprietary databases.
In September 2016, NCBI changed the FASTA header format to supply only the gb (GenBank) accession. The former gi accession is no longer used.
Newly downloaded databases in the new format are supported and the gb accession is used by the Spectrum Mill for those databases.
For the Spectrum Mill to properly recognize the format, these new databases require either an NCBIgb or gb prefix instead of the NCBInr prefix.
Existing databases (NCBInr) are still supported. GenBank accession numbers (when present) can be reported in Protein/Peptide Summary by creating a Category file for the database.
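When renaming a downloaded nr file, the choice between the two prefixes can be checked from the first header line. The sketch below is a convenience check only, assuming that older files begin each header with "gi|" (as in the NCBInr examples in this document) and that anything else is the newer format; the function name and the second test header are hypothetical.

```python
def suggest_ncbi_prefix(first_header: str) -> str:
    """Return 'NCBInr' for older gi-style headers, otherwise 'NCBIgb'."""
    # Drop the FASTA '>' marker and any padding before the first field.
    text = first_header.lstrip(">").lstrip()
    return "NCBInr" if text.startswith("gi|") else "NCBIgb"


old_style = ">gi|304881 (L07596) alaS [Escherichia coli]"
# Hypothetical header in the newer, accession-only style.
new_style = ">WP_000000001.1 hypothetical protein [Escherichia coli]"
print(suggest_ncbi_prefix(old_style), suggest_ncbi_prefix(new_style))
```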
Often the header line in a FASTA database is used to describe basic information like entry name, accession number (or other unique identifier), and the species or organism from which the sequence was obtained. However, this information is NOT consistently organized into fields in the header lines of different FASTA databases, though within a specific database it is usually consistent.
The way the Spectrum Mill programs "know" which dialect of FASTA to "speak" with a particular database is via the filename. Acceptable filename prefixes are shown below in bold and the associated header line format described.
Spectrum Mill programs designate:
Spectrum Mill remains compatible with the classic SwissProt formats described below. As of 2021 (and for several years before), users should expect to use the UniProt formats shown above instead.
November 2006 - ??? SwissProt and TrEMBL
>Q4U9M9|104K_THEAN 104 kDa microneme-rhoptry antigen precursor (p104) - Theileria annulata
Spectrum Mill programs designate:
Sample entry SwissProt
>100K_RAT (Q62671) 100 KDA PROTEIN (EC 6.3.2.-).
Sample entry TrEMBL
>Q46513 (Q46513) ORF 2 GENE PRODUCT (FRAGMENT).
Spectrum Mill programs designate:
Whenever the species cannot be found the species is assigned as UNREADABLE. (This usually does not happen for any entries in SwissProt, but happens for all entries in TrEMBL.) All of these UNREADABLE lines are then written by FAindex to the file seqdb\SwissProt.usp.
Spectrum Mill programs designate:
Spectrum Mill programs designate:
Spectrum Mill programs designate:
In some cases, multiple other protein database accessions are referenced and separated by a ctrl-A character. Spectrum Mill ignores anything in the header after the first ctrl-A it encounters.
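The ctrl-A rule above amounts to keeping only the text before the first `\x01` character. A minimal sketch, with a hypothetical function name and an invented two-reference header for illustration:

```python
CTRL_A = "\x01"  # the ctrl-A separator character


def first_reference(header: str) -> str:
    """Keep only the part of the header before the first ctrl-A."""
    return header.split(CTRL_A, 1)[0]


# Invented header carrying two references joined by ctrl-A.
merged = "gi|1|sp|P1|A_B first protein" + CTRL_A + "gi|2|pir||X second protein"
print(first_reference(merged))
```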
The header lines from this database are tricky to handle because it is a non-redundant database which collects entries from several databases; thus there are several formats present in the final database.
Examples:
>gi|304881 (L07596) alaS [Escherichia coli]
>gi|132349|sp|P15394|REPA_AGRTU REPLICATING PROTEIN
>gi|282349|pir||A41961 chitinase (EC 3.2.1.14) D - Bacillus circulans
>gi|477498|pir||A49131 releasechannel homolog - fruit fly (Drosophila melanogaster) (fragment)
>gi|543687|pir||A48298 sodium channel homolog - jellyfish (Cyanea capillata)
Spectrum Mill programs designate:
Whenever the species cannot be found, the species is assigned as UNREADABLE, and the name is assigned as the entire header line. All of these UNREADABLE lines are then written by FAindex to the file seqdb\NCBInr.usp.
>gi|216790 (D13314) arginine deiminase [Mycoplasma hominis]
the Spectrum Mill programs designate:
Whenever the species cannot be found the species is assigned as UNREADABLE, and the name is assigned as the entire header line. All of these UNREADABLE lines are then written by Protein Databases to the file seqdb\Genpept.usp.
Example entries IPI December 2003
IPI human
>IPI:IPI00030991.1|SWISS-PROT:P40855|REFSEQ_NP:NP_002848|TREMBL:Q8NI97|ENSEMBL:ENSP00000294784 Tax_Id=9606 Peroxisomal farnesylated protein
IPI mouse
>IPI:IPI00110309.2|TREMBL:Q9CXH0|ENSEMBL:ENSMUSP00000024958 Tax_Id=10090 Ensembl_locations(Chr-bp):17-23175270 3300002N10Rik protein
IPI rat
>IPI:IPI00357878.1|REFSEQ_XP:XP_224588|ENSEMBL:ENSRNOP00000019511 Tax_Id=10116 Ensembl_locations(Chr-bp):16-2595757 similar to Arhgef3 protein
Spectrum Mill programs designate:
>10KD_VIGUN 10 KD PROTEIN PRECURSOR (CLONE PSAS10). - VIGNA UNGUICULATA (COWPEA).
>AEOHFPA AEOHFPA NID: g141875 - A.hydrophila DNA, clone pPH4.
>pir|Q62671|100K_RAT 100 KD PROTEIN (EC 6.3.2.-). - RATTUS NORVEGICUS (RAT).
Spectrum Mill programs designate:
Whenever the species cannot be found the species is assigned as UNREADABLE, and the name is assigned as the entire header line. All of these UNREADABLE lines are then written by Protein Databases to the file seqdb\Owl.usp.
Note! As of B.06.00, databases that contain DNA sequences can no longer be searched. You must convert the DNA sequences to protein sequences, and name the file with either a PA or PN prefix.
You name proprietary databases with the prefixes PA or PN:
Often the header line in a FASTA database is used to describe basic information like entry name, accession number (or other unique identifier), and the species or organism from which the sequence was obtained. With well-curated databases, this information is consistently organized into fields in the header line of a FASTA-formatted database.
For the Spectrum Mill programs the sequence field is only subject to two constraints: 1) it must be in CAPITAL letters, and 2) it must be in single letter code. (Some people express amino acids in three-letter code.)
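A proprietary database can be pre-checked against these two constraints before indexing. The sketch below flags every character that is not one of the 20 standard amino acid letters in upper case, which catches lowercase input and non-standard symbols; it is an illustrative check with hypothetical names, not part of Spectrum Mill. Three-letter code that happens to use valid letters will not be caught by a check of this kind.

```python
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def foreign_characters(sequence: str) -> list:
    """Return (position, character) pairs for non-standard characters (0-based)."""
    return [
        (position, char)
        for position, char in enumerate(sequence)
        if char not in STANDARD_AMINO_ACIDS
    ]


print(foreign_characters("MPPSISAFQAAYIGIEVL"))  # clean sequence
print(foreign_characters("MKXa"))  # 'X' and lowercase 'a' are flagged
```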
The way the Spectrum Mill programs "know" which dialect of FASTA to "speak" with a particular database's header line is via the filename. Generic filename prefixes are shown below in bold and the associated comment line format described. These formats are handled in a relatively robust manner, to allow for the absence of fields or the presence of additional fields. The formats basically consist of "|" delimited fields of accession number, name, and species in that order.
PN
The P forms indicate protein sequence.
> 417909| Better than sliced bread growth factor beta|Mouse|pancreas|
the Spectrum Mill programs designate:
PA
The P forms indicate protein sequence.
Note that the PA differs from the PN set only in that the accession number can be alphanumeric rather than numeric. This second set is thus more robust. However, for large, frequently-updated databases, Protein Databases can take an hour to run rather than several minutes simply because creation of the dbfilename.acc file involves the much slower process of sorting strings rather than integers.
> SlowSort909| Better than sliced bread growth factor beta|Mouse|pancreas|
the Spectrum Mill programs designate:
Any number of proprietary databases may be created with PA or PN prefixes. You must also create species alias lists and accession number links for any databases which you create.
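To check that a proprietary header follows the "|"-delimited order of accession number, name, and species before indexing, a small parser can help. This is a sketch under the field order stated above only; the exact values Spectrum Mill assigns are not reproduced here, and the names `GenericHeader` and `parse_generic_header` are hypothetical.

```python
from typing import NamedTuple


class GenericHeader(NamedTuple):
    accession: str
    name: str
    species: str


def parse_generic_header(line: str, numeric_accession: bool = False) -> GenericHeader:
    """Split a PA/PN-style header into its first three '|' delimited fields."""
    fields = [field.strip() for field in line.lstrip(">").split("|")]
    fields += [""] * (3 - len(fields))  # tolerate missing trailing fields
    accession, name, species = fields[:3]  # extra fields are ignored
    if numeric_accession and not accession.isdigit():
        # PN expects a numeric accession; PA accepts alphanumeric ones.
        raise ValueError(f"accession {accession!r} is not numeric")
    return GenericHeader(accession, name, species)


pn = parse_generic_header(
    "> 417909| Better than sliced bread growth factor beta|Mouse|pancreas|",
    numeric_accession=True,
)
pa = parse_generic_header(
    "> SlowSort909| Better than sliced bread growth factor beta|Mouse|pancreas|"
)
```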
Suffix (databasefilename.xxx) | Description |
---|---|
.idx | Primary binary index assigning an index number to each entry in the sequence database and mapping it to the byte position in the .fasta file of the start of the entry. The index number is simply the order in which the entries appear in the database file. When a database is updated, the number corresponding to a particular entry will change only if the order of the entries in the file changes. Users see this number in Spectrum Mill program output, where it is designated the MS Digest index number. Internally, the programs store this number when a hit is recorded during a search; the number is then used later to retrieve the sequence for output/report generation. |
.idx2 | Same as idx, but allows for databases > 4.2 GB. |
.unk | Index which keeps track of all foreign characters in the sequence field for each database entry. For protein databases any characters other than the 20 standard amino acids are foreign characters. Note that the sequences must be in CAPITAL letters, and in single letter code. (Some people express amino acids in three-letter code.) |
.mw | Binary index containing the calculated protein Molecular Weight (MW) of each sequence in the database. All amino acids are treated as unmodified, the amino acid X is treated as L, the amino acid B is treated as E, the amino acid J is treated as Q. The .mw file is used to accelerate searches that are constrained by intact protein MW. |
.pi | Index containing the calculated protein pI of each sequence in the database. The amino acid C is treated as unmodified, the amino acid X is treated as L, the amino acid B is treated as E, the amino acid J is treated as Q. The .pi file is used to accelerate searches that are constrained by intact pI. |
.sp | Index containing the Species of each sequence in the database. Used to accelerate searches that are constrained by species. |
.sl | Contains a list in alphabetical order of the text strings used to denote different species. A text string has to occur at least ten times to appear in this file. This file is never used by the Spectrum Mill programs. The text strings are the ones you should use in MS Edman if you have the Search Mode set to Species. |
.usp | File created to list the header lines of each entry for which Protein Databases cannot read the species. This file is never used by the Spectrum Mill programs; it is created only for use by server administrators in troubleshooting species problems. |
.acc | Index of alphanumeric accession numbers mapped to index numbers, created only for database filename prefixes: Genpept, gen, SwissProt, swp, Owl, owl, DA, PA. |
.acn | Index of integer accession numbers mapped to index numbers, created only for database filename prefixes: NCBInr, nr, or PN. |
Note: You should not manually edit any of the files in the table above.
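The idea behind the primary index (entry order mapped to the byte position where each entry starts) can be illustrated as follows. This is a conceptual sketch only: the on-disk layout of the real .idx file is not described in this document, and the sketch assumes 0-based index numbers and hypothetical function names.

```python
import io


def build_offset_index(stream) -> list:
    """Return the byte offset of each '>' line; list position = index number."""
    offsets = []
    position = 0
    for line in stream:  # stream must be opened in binary mode
        if line.startswith(b">"):
            offsets.append(position)
        position += len(line)
    return offsets


def fetch_entry(stream, offsets, index_number):
    """Seek straight to an entry and return (header, sequence)."""
    stream.seek(offsets[index_number])
    header = stream.readline().decode("ascii").rstrip("\r\n")
    chunks = []
    for line in stream:
        if line.startswith(b">"):
            break  # next entry reached
        chunks.append(line.decode("ascii").strip())
    return header, "".join(chunks)


database = io.BytesIO(b">A|one\nMK\nLL\n>B|two\nGG\n")
offsets = build_offset_index(database)
second = fetch_entry(database, offsets, 1)
```

Because the index stores positions rather than content, it becomes stale as soon as the FASTA file changes, which is why an updated database must be re-indexed.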
Once you've downloaded a new database into the seqdb directory, you need to create the index files described above before you can start to use it. To do this task, navigate to the Protein Sequence Database Utilities page, select the utility-Index new database option. Then:
Once you've updated a database, you must re-index it. To do this task, navigate to the Protein Sequence Database Utilities page, and select the Re-index existing database option. Then:
The list of databases used by the other forms is held in a JavaScript file. The JavaScript file is automatically updated after performing any of the FAindex based operations on the Protein Sequence Database Utilities form, with the exception of Database summary report. In some cases, the file is not refreshed in the browser. If you don't see a newly indexed database in the database list, click the Update Database List button.
After the automatic update, you will probably have to reload any search forms that are currently open before the new database list appears. If this doesn't work, place the cursor in the URL location box of the browser and press return. If even this doesn't work, investigate the cache settings on your browser.
To create a subset database which has been pre-filtered for species and molecular weight, navigate to the Protein Databases form and select the Create species subset database option.
For example, to create a subset database of mammalian proteins between 1000 and 100000 Da from the NCBInr database:
Using subset databases is a good way to dramatically decrease search times.
It is possible to create your own FASTA-format database which can be searched by the Spectrum Mill search programs. An entry for a single protein is made up of a header line containing accession number, species and name fields followed by one or more lines containing the sequence.
Navigate to the Protein Sequence Database Utilities page, and select the Create or append user database option. Then:
It is possible to extract ORFs from a transcript and translate nucleotide sequences to create a protein FASTA-format database which can be searched by the Spectrum Mill search programs. An entry for a single protein is made up of a header line containing the nucleotide sequence accession number with a suffix indicating the reading frame and incremental ORF (e.g., F2_R46: frame 2, ORF 46), followed by one or more lines containing the protein sequence.
Navigate to the Protein Sequence Database Utilities page, and select the Translate nucleotide FASTA to protein FASTA option. Then:
When translating from nucleotide to protein sequence, one is faced with the problem of deciding which of the 6 possible reading frames is the proper one. A web search will turn up several programs that do this. This SM utility is built expecting that the assembled nucleotide sequences are imperfect, i.e., not always full length (may not start with Met) and possibly containing frameshifts caused by misassembly. So there can be more than 1 protein sequence translated from each nucleotide sequence.
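For readers unfamiliar with the six reading frames, the sketch below translates a nucleotide sequence in all of them using the standard genetic code. It illustrates the general technique only and is not the SM utility: the frame labels ("+1" to "-3"), the use of "X" for codons containing ambiguous bases, and the function names are all assumptions of the sketch.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; '*' marks a stop, 'X' an unknown codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna: str) -> dict:
    """Return the translation of all three forward and three reverse frames."""
    dna = dna.upper()
    frames = {}
    for sign, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = translate(strand[offset:])
    return frames


frames = six_frames("ATGGCCTAA")
```

Candidate ORFs would then be the stretches between stop symbols in each frame.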
This SM utility tries to address the following 2 issues in a very simple-minded way:
The utility looks for open reading frames in DNA sequences, translates them into protein sequences, and outputs them ordered according to length. The transcript is analysed in n frames. The n translated frames are examined for start/stop sequences.
The translated frames are retained if they are:
Then, when minfactor is 0.4, only 1 and 2 are output (because [3] is 39% as long as [2], and [2] is more than 40% as long as [1]).
If instead minfactor is 0.8, then 1, 2, 3, and 4 are output, because [4] is more than 80% as long as [3], [3] is more than 80% as long as [2], etc.
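One reading of the minfactor rule, inferred from the explanation above, is: sort the translated ORFs by length, always keep the longest, and keep each following ORF only while it is more than minfactor times as long as the one before it. The sketch below implements that reading; the stop-at-first-failure behaviour and the ORF lengths are assumptions chosen to reproduce the percentages quoted above, not values taken from the utility.

```python
def filter_by_minfactor(lengths, minfactor):
    """Keep ORF lengths, longest first, until one falls below the ratio."""
    ordered = sorted(lengths, reverse=True)
    kept = ordered[:1]  # the longest ORF is always kept
    for length in ordered[1:]:
        if length <= minfactor * kept[-1]:
            break  # assumed: everything shorter is dropped as well
        kept.append(length)
    return kept


# Hypothetical lengths: the third ORF is about 39% as long as the second.
print(filter_by_minfactor([300, 150, 58, 50], 0.4))
# Hypothetical lengths where each ORF is more than 80% of the previous one.
print(filter_by_minfactor([300, 260, 220, 180], 0.8))
```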
The Database Summary Report option is used to list the accession numbers, species and name fields for a selected index number range of a selected database. The Database Summary Report is a good way to verify that custom databases were properly parsed and indexed.
Navigate to the Protein Sequence Database Utilities page, and select the Database summary report option. Then:
This option is used to combine databases. You can either select one or more databases to concatenate, or you can concatenate all databases in a folder. Concatenating all files in a folder is most useful for adding smaller FASTA files, such as contaminants, that would not necessarily be selected separately for searching.
The databases you concatenate must reside under the SeqDB folder. If you concatenate all FASTA files in a folder, the folder must reside under SeqDB.
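Outside the web form, the same kind of folder concatenation can be approximated with a short script. This is a sketch, not the utility itself: the accepted file extensions, the alphabetical ordering, and the function name are assumptions, and each file is read into memory, which is only reasonable for small files such as contaminant lists.

```python
from pathlib import Path


def concatenate_fasta(folder, output, suffixes=(".fasta", ".fa", ".faa")):
    """Join all FASTA files in a folder into one file; return the names used."""
    folder, output = Path(folder), Path(output)
    sources = sorted(
        path for path in folder.iterdir()
        if path.is_file()
        and path.suffix.lower() in suffixes
        and path.resolve() != output.resolve()  # never read the output itself
    )
    with output.open("wb") as out:
        for source in sources:
            data = source.read_bytes()
            out.write(data)
            # Guard against a header being glued onto the previous sequence.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
    return [path.name for path in sources]
```

The newline guard matters: if a file does not end with a line break, the first header of the next file would otherwise be appended to the last sequence line.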
Navigate to the Protein Sequence Database Utilities page, and select the Concatenate FASTA files option. Then:
Make Proteogenomic Summary Tables: Selecting this option triggers the creation of summary tables for personalized sequence databases that will, after MS/MS searches, enable generation of protein/peptide summary reports where the proteogenomic (PG) site, variant or splice junction, is the primary organizing feature. This option has 2 accompanying parameters:
These capabilities are more fully described in a separate document: Using Personalized Protein Sequence Databases in Spectrum Mill
This option allows you to compare two databases to determine whether their content is different. It is useful when you need to remove redundant databases from the Spectrum Mill server. Note that comparison of large databases requires some time.
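For a quick check outside the web form, two FASTA files can be compared by hashing each entry, so that differences in entry order or line wrapping are ignored. This is one possible definition of "same content" and an assumption of the sketch; the criteria used by the Compare two databases option are not described here, and the function names are hypothetical.

```python
import hashlib


def _digest(entry_lines):
    """Hash one entry: header line plus the sequence with wrapping removed."""
    header, sequence = entry_lines[0], "".join(entry_lines[1:])
    return hashlib.sha256(f"{header}\n{sequence}".encode("utf-8")).hexdigest()


def entry_digests(lines):
    """Return the set of per-entry digests for an iterable of FASTA lines."""
    digests = set()
    current = []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">") and current:
            digests.add(_digest(current))
            current = []
        if line:
            current.append(line)
    if current:
        digests.add(_digest(current))
    return digests


def same_content(a_lines, b_lines):
    return entry_digests(a_lines) == entry_digests(b_lines)


first = [">A", "MKLL", ">B", "GG"]
reordered = [">B", "GG", ">A", "MK", "LL"]  # same entries, different layout
changed = [">A", "MKLL", ">B", "GA"]
```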
Navigate to the Protein Sequence Database Utilities page, and select the Compare two databases option. Then:
This option allows you to calculate these statistics:
Navigate to the Protein Sequence Database Utilities page, and select the Calculate Statistics option. Then:
Navigate to the Protein Sequence Database Utilities page, and select the Make Non-redundant database option. Then:
This option creates a subset FASTA file from accession numbers that you provide. It is useful for limiting searches to the set of proteins of particular interest.
Navigate to the Protein Sequence Database Utilities page, and select the Create category file from FASTA headers option. Then:
Create a Spectrum Mill format category file. For this task, a tab-delimited category file must contain 2 columns named accession_number and sequence. Optional columns named entry_name and species, if present, will also be included in the FASTA header. Any other additional columns will be ignored.
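The column requirements above can be illustrated with a small converter. This sketch assumes a "|"-delimited header of accession, name, and species (the generic layout described earlier), a 60-character line width, and upper-casing of the sequence; the header layout actually written by the utility may differ, and the function name is hypothetical.

```python
import csv
import io


def category_to_fasta(tsv_text: str, line_width: int = 60) -> str:
    """Convert tab-delimited category text into FASTA text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    columns = reader.fieldnames or []
    missing = {"accession_number", "sequence"} - set(columns)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    # Optional columns are used only if the file provides them.
    optional = [name for name in ("entry_name", "species") if name in columns]
    lines = []
    for row in reader:
        fields = [row["accession_number"]] + [(row.get(n) or "") for n in optional]
        lines.append(">" + "|".join(fields))
        sequence = (row["sequence"] or "").strip().upper()
        lines.extend(
            sequence[i:i + line_width] for i in range(0, len(sequence), line_width)
        )
    return "\n".join(lines) + "\n"


table = (
    "accession_number\tentry_name\tspecies\tsequence\tnote\n"
    "P1\tTest protein\tMouse\tmkll\tignored\n"
)
fasta_text = category_to_fasta(table)
```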
Navigate to the Protein Sequence Database Utilities page, and select the Create FASTA file from category file option. Then:
Those who wish to automate the process of updating sequence databases and indexing them for use in Spectrum Mill will probably prefer to use the command line version of FAindex.
The faindex program is expected to reside in the same directory as all other Spectrum Mill programs. Faindex accepts a single input argument (the name of the database file). Upon execution, faindex issues an instruction to read the database file from seqdb\database_filename and write the indices to seqdb\database_filename.suffix.
This requires careful attention to which directory to launch faindex from and the syntax of launching it.
Basically you should launch faindex from the directory immediately above the seqdb directory, without specifying the path to the database file. Faindex inserts only seqdb\ in front of the filename.
If the faindex program does not reside in the directory immediately above the seqdb directory, then you may need to specify the path to faindex (but not to the database).
If you wish to use the command line version of FAindex rather than the browser version, you may run the faindex.cgi program from an MS-DOS command prompt. The faindex.cgi command must be run from the root volume where the databases are installed (D:\ by default).
C:\> D:
The display changes to:
D:\>
D:\> E:\SpectrumMill\millbin\faindex.cgi NCBInr
(Replace E: with the correct volume if you installed the Spectrum Mill on a different volume)
You will see a message like:
Creating index file NCBInr
and after a minute or so you will see an increasing count scroll across the screen as the indices are created.
If not, please read the directory structure section above.
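For scheduled updates, the command above can be wrapped in a script. The sketch below reuses the example paths from this section (adjust them to your installation) and launches faindex with the database root as the working directory, passing only the bare database name. It assumes that faindex.cgi can be started directly as a process and reports failure through its exit code; neither is confirmed by this document, and the function names are hypothetical.

```python
import subprocess
from pathlib import PureWindowsPath

# Example locations taken from the text above; change them as needed.
FAINDEX = r"E:\SpectrumMill\millbin\faindex.cgi"
DB_ROOT = "D:\\"  # directory immediately above seqdb


def build_faindex_command(database_name, faindex_path=FAINDEX):
    """Return the argument list; the database must be given without a path."""
    if not database_name or PureWindowsPath(database_name).name != database_name:
        raise ValueError("pass the bare database name, without a path")
    return [faindex_path, database_name]


def run_faindex(database_name, db_root=DB_ROOT, faindex_path=FAINDEX):
    """Run faindex from db_root so that it finds seqdb\\<database_name>."""
    command = build_faindex_command(database_name, faindex_path)
    # Assumption: a non-zero exit code means indexing failed.
    return subprocess.run(command, cwd=db_root, check=True)


print(build_faindex_command("NCBInr"))
```

Calling `run_faindex("NCBInr")` would correspond to the manual session shown above; it is not invoked here because it requires a Spectrum Mill installation.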