Information about the STEPdb database
Index:
1. Updates
2. Search
3. Glossary
4. Topology and orientation of IM proteins
5. Internal Connections
6. Color Code
7. Multifun Terms
8. MatureP classifier
1. Updates:
The STEPdb website has been updated! STEPdb 2.0 has been available since October 2019.
To see the changes in the E. coli K-12 proteome, please consult Loos et al 2019 and the downloads section. Additional information is available in the downloadable Supplementary Tables:
- Table S1: Proteome changes
- Table S2: New features
- Table S3: Tools used
2. Search:
In STEPdb the search function is performed locally on the displayed proteome/sub-proteome. Consequently, the search covers only the current table (e.g. a user browsing the Complexome can search only for entries within this group of proteins).
The search is not very sophisticated for the moment. Logical operators such as OR and AND are not currently supported by the search tool, and the text is searched intact (e.g. words separated by a space are not searched separately).
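The search behaviour described above can be illustrated with a minimal sketch. It is not the actual STEPdb implementation: the function name, the row format and the case-insensitive matching are assumptions made for illustration only.

```python
def search_table(rows, query):
    """Return the rows of the currently displayed table that contain the
    query as one intact string (no AND/OR handling, no word splitting).

    `rows` is a hypothetical list of dicts, one per table entry.
    Case-insensitive matching is an assumption, not documented behaviour.
    """
    needle = query.strip().lower()
    if not needle:
        return list(rows)
    return [
        row for row in rows
        if any(needle in str(value).lower() for value in row.values())
    ]


# Hypothetical entries of the table being browsed.
complexome = [
    {"gene": "secA", "description": "protein translocase subunit"},
    {"gene": "secY", "description": "protein translocase channel"},
]

# The whole phrase is matched as a single string.
print(search_table(complexome, "translocase subunit"))
# "OR" is treated as literal text, so nothing matches.
print(search_table(complexome, "secA OR secY"))
```

Because the query is never split, a user who wants entries matching either of two terms has to run two separate searches.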
3. Glossary:
Basic Proteome of E.coli K-12 (MG1655)
STEP defines the "basic" proteome of E.coli K-12 (MG1655) as the fraction of the proteome that is devoid of gene products that are likely not produced, or that result from genomic insertions enriched in defective prophages, transposons, pseudogenes, integrases and mobile elements. We defined the "basic" proteome by manual annotation, based on EcoGene (Rudd, 2000), Uniprot (Dimmer et al, 2012) and other studies (McClelland et al, 2001; Ochman et al, 2000). F1 plasmid-encoded proteins were also removed from the Uniprot data of the E.coli K-12 proteome used (release version of November 2010). According to our analysis, the essential set of protein-coding sequences devoid of these elements encompasses only 3899 proteins.
Core Proteome of E.coli K-12
The common core proteome of all E.coli strains consists of 2073 proteins. These were derived from a genome comparison of 15 E.coli strains with the Multi-Genome Homology Comparison Tool (Davidsen et al, 2010) (Papanastasiou et al, submitted).
Protein Solubility
Protein structure is encoded in the primary amino acid sequence. However, the folding of a protein into its functional form has, in some cases, to overcome misfolded states that can lead to protein inclusion bodies. Niwa et al 2009 calculated protein solubility for 3173 Escherichia coli proteins in a chaperone-free reconstituted translation system. The aggregation propensity of each protein was examined by a centrifugation assay. Solubility is defined as the index of aggregation propensity, expressed as the proportion of the supernatant fraction obtained after centrifugation of a translation mixture relative to the uncentrifuged total protein. Solubility is therefore a percentage that ranges from 0% to 100%.
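As a small worked illustration of the solubility definition above, the sketch below turns two measured protein amounts into a percentage. The function name and the input units are assumptions; only the ratio itself comes from the definition.

```python
def solubility_percent(supernatant_amount, total_amount):
    """Solubility as the share of protein left in the supernatant after
    centrifugation, relative to the uncentrifuged total, in percent.

    Both amounts must use the same (arbitrary) unit.
    """
    if total_amount <= 0:
        raise ValueError("total_amount must be positive")
    if supernatant_amount < 0:
        raise ValueError("supernatant_amount cannot be negative")
    return 100.0 * supernatant_amount / total_amount


# Hypothetical measurement: a quarter of the protein stays soluble.
print(solubility_percent(25.0, 100.0))
```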
Manual Curation and non-experimental qualifiers
We follow the same non-experimental qualifiers as Uniprot:
- Potential: There is some logical or conclusive evidence that the given annotation could apply. This non-experimental qualifier is often used to present results from protein sequence analysis software tools, which are only annotated if the result makes sense in the biological context of a given protein.
- Probable: Indicates stronger evidence than the qualifier "Potential". This qualifier implies that there must be at least some experimental evidence indicating that the information is expected to be found in the natural environment of a protein.
- By similarity: When some biological information was experimentally obtained for a given protein (or part of it), it may be transferred to other protein family members within a certain taxonomic range, dependent on the biological event or characteristic.
Sub-cellular localization special characters and formalisms
To denote multiple localizations that have been experimentally established we introduced the comma "," formalism, whereas a slash "/" denotes two or more possible sub-cellular locations that have not yet been experimentally determined.
To denote the sub-cellular location of protein complexes that span both cellular membranes we introduced the "&" formalism; the corresponding complexes (e.g. the copper / silver efflux transport system) are annotated as "B&H".
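The three formalisms above could be interpreted programmatically roughly as follows. This is a sketch only: the function name, the returned labels and the assumption that an annotation uses a single kind of separator are ours. "B&H" is taken from the text; the other example annotations are hypothetical.

```python
def parse_localization(annotation):
    """Split a sub-cellular localization annotation into its location
    codes and report which formalism was used.

    Assumes one kind of separator per annotation (an assumption).
    """
    text = annotation.replace(" ", "")
    if "&" in text:
        # Complex spanning both membranes, e.g. "B&H".
        return {"kind": "membrane-spanning complex", "locations": text.split("&")}
    if "/" in text:
        # Possible locations, not yet experimentally determined.
        return {"kind": "possible locations", "locations": text.split("/")}
    if "," in text:
        # Multiple locations, experimentally established.
        return {"kind": "established locations", "locations": text.split(",")}
    return {"kind": "single location", "locations": [text]}


print(parse_localization("B&H"))
print(parse_localization("A, F1"))   # hypothetical annotation
print(parse_localization("G/F2"))    # hypothetical annotation
```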
Exportome and subclasses (Secretome, non-classical secretion etc.)
We define as exportome those proteins that are localized within the inner membrane and beyond (e.g. lipoproteins, extra-cellular proteins). This includes the STEPdb sub-cellular classes: B, I, E, F2, F3, G, X, F4. We divide the exportome into two subclasses, the membranome and the secretome. The membranome contains proteins that are embedded in the inner membrane, whereas the secretome refers to proteins that are fully translocated across the inner membrane. Membranome and secretome can be further divided into proteins that are substrates of the Sec and Tat secretion pathways. Finally, the secretome includes a particular class of non-classical secretory proteins, which are secreted without an apparent signal peptide motif.
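For readers filtering their own protein lists, exportome membership as listed above reduces to a set lookup. The sketch below uses exactly the class codes given in the paragraph; the function name and the example proteins are hypothetical.

```python
# Sub-cellular classes listed above as belonging to the exportome.
EXPORTOME_CLASSES = frozenset({"B", "I", "E", "F2", "F3", "G", "X", "F4"})


def is_exportome(location_code):
    """True if a single STEPdb location code is one of the listed
    exportome classes."""
    return location_code.strip() in EXPORTOME_CLASSES


# Hypothetical (protein, location code) pairs.
proteins = [("protein_1", "G"), ("protein_2", "A"), ("protein_3", "B")]
exported = [name for name, code in proteins if is_exportome(code)]
print(exported)
```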
4. Topology and orientation of IM proteins
Transmembrane regions of integral membrane proteins were predicted using Phobius. In cases where Phobius failed to identify any transmembrane region, the prediction of TMHMM was used instead. The predicted orientation of the polypeptide sequences, which corresponds to the location of the C-terminus (cytoplasmic or periplasmic), was re-evaluated based on the experimentally verified C-terminus location of 734 transmembrane proteins (Daley et al, 2005).
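The selection logic described above can be sketched as follows. The code does not run Phobius or TMHMM; it assumes their results have already been parsed into a hypothetical dict format. It also assumes that an experimentally verified C-terminus location simply replaces the predicted one, which is our reading of the text.

```python
def predict_topology(phobius, tmhmm, experimental_c_terminus=None):
    """Combine pre-computed predictions for one protein.

    `phobius` and `tmhmm` are hypothetical dicts with the keys
    "tm_regions" (list of (start, end) tuples) and "c_terminus"
    ("cytoplasmic" or "periplasmic").
    """
    # Prefer Phobius; fall back to TMHMM when Phobius finds no TM region.
    use_phobius = bool(phobius["tm_regions"])
    chosen = phobius if use_phobius else tmhmm
    # Assumption: experimental evidence overrides the predicted orientation.
    c_terminus = experimental_c_terminus or chosen["c_terminus"]
    return {
        "tm_regions": chosen["tm_regions"],
        "c_terminus": c_terminus,
        "predictor": "Phobius" if use_phobius else "TMHMM",
        "c_terminus_experimental": experimental_c_terminus is not None,
    }


# Hypothetical inputs: Phobius finds nothing, TMHMM finds one helix.
phobius_result = {"tm_regions": [], "c_terminus": "cytoplasmic"}
tmhmm_result = {"tm_regions": [(10, 30)], "c_terminus": "periplasmic"}
print(predict_topology(phobius_result, tmhmm_result))
print(predict_topology(phobius_result, tmhmm_result, "cytoplasmic"))
```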
5. Internal Connections
There are internal connections between some of the tables. Proteins listed in the E.coli K-12 proteome table connect to the list of the complexes in which they participate. Furthermore, from the Complexome table the user is able to view the schematic representation of each complex and, through this schematic, to link directly to the K-12 table.
6. Color Code
STEPdb follows a specific color code to represent proteins in the various sub-cellular locations. The E.coli K-12 export systems and the Peripherome are drawn as cartoons where each protein is represented as a filled circle following the color code below. Additionally, in the "Complexome" page, each complex is drawn dynamically upon clicking the "draw" button. The protein subunits of each complex also follow STEPdb's color code.
Protein symbol colors are assigned per protein localization:
- Nucleoid (N)
- Cytoplasmic (A)
- Ribosomal (r)
- Peripherally associated with the plasma membrane facing the cytoplasm (F1)
- Inner membrane protein (B)
- Peripherally associated with the plasma membrane facing the periplasm (F2)
- Inner membrane lipoprotein (E)
- Periplasmic (G)
- Outer membrane lipoprotein (I)
- Peripherally associated with the outer membrane facing the periplasm (F3)
- Outer membrane b-barrel protein (H)
- Peripherally associated with the outer membrane facing the extra-cellular space (F4)
- Other
7. Multifun Terms
Peripheral inner membrane proteins were classified in eight major categories of cellular function mainly based on Multifun Terms (Serres & Riley, 2000). These are summarized in the table below.
Cellular Process | Multifun term | GO term | GO id
Metabolism | MultiFun:1 Metabolism | GO:metabolism | GO:0008152
DNA-related | MultiFun:2.1 DNA related | GO:DNA metabolism | GO:0006259
RNA-related | MultiFun:2.2 RNA related | GO:RNA metabolism | GO:0016070
Protein-related | MultiFun:2.3 Protein related | GO:protein biosynthesis | GO:0006412
Transport | MultiFun:4 Transport | GO:transport | GO:0006810
Cell division | MultiFun:5.1 Cell division | GO:cytokinesis | GO:0000910
Response to stress | MultiFun:5.5 Adaptation to stress | GO:response to stress | GO:0006950
Cell structure | MultiFun:6 Cell structure | GO:cellular_component | GO:0005575
8. MatureP classifier
Methods
The MatureP classifier distinguishes Sec secretory proteins from cytoplasmic ones. Two methods are provided:
1. MatureP, a classifier that accepts only the mature sequences of potential secretory or cytoplasmic proteins (i.e. known or potential signal peptide sequences must be removed).
2. SP-MatureP, a combinatorial classifier that takes into account both the MatureP and the pre-protein classifiers.
In the SP-MatureP method, the pre-protein classifier first predicts the existence of a signal peptide sequence and then the MatureP classifier tests the validity of the mature sequence. SP-MatureP decides whether a sequence is “cytoplasmic”, a mature or a secretory pre-protein sequence or, more interestingly, “non-secretory” (i.e. possessing a signal peptide but having a non-compatible mature sequence).
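One way to read the SP-MatureP decision is as a two-by-two combination of the two classifier outputs. The sketch below encodes that reading; the mapping of the "no signal peptide, secretory-like mature domain" case to a mature sequence is our interpretation of the text, and the function name and boolean inputs are hypothetical stand-ins for the real classifier outputs.

```python
def sp_maturep_label(has_signal_peptide, mature_is_secretory):
    """Combine the pre-protein classifier result (signal peptide found?)
    with the MatureP result (mature domain looks secretory?)."""
    if has_signal_peptide:
        if mature_is_secretory:
            return "secretory pre-protein"
        # Signal peptide present but mature domain not compatible.
        return "non-secretory"
    if mature_is_secretory:
        # Assumption: a secretory-like sequence without a signal peptide
        # is reported as a mature sequence.
        return "mature"
    return "cytoplasmic"


for sp in (True, False):
    for mature in (True, False):
        print(sp, mature, "->", sp_maturep_label(sp, mature))
```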
MatureP score threshold
MatureP is a linear classifier that explores a variety of features derived from the amino acid sequence, such as amino acids, di-peptides and tri-peptides, or pairwise interaction energy. MatureP assigns a classification score to each provided sequence. The final decision of the classifier depends on the selected score threshold, above which proteins are considered to be secretory. The most commonly used threshold is zero, in which case positively scored sequences are predicted as secretory, but a different score threshold can be chosen. Using the scores of the training samples we can draw the hit rate (y-axis) versus score (x-axis) curves for both the positive and the negative classes. The hit rate is the percentage of correctly predicted positive/negative samples when the selected score is used as the classification threshold. As the hit rate of the positive class increases, the hit rate of the negative class decreases. For every classifier there exists a score threshold at which the two hit rates are equal.
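The hit rate curves and the threshold at which they meet can be computed from classifier scores roughly as shown below. This is an illustrative sketch with made-up scores, not MatureP's own code; the function names are hypothetical, and with a finite sample the two rates may only come close rather than match exactly.

```python
def hit_rates(pos_scores, neg_scores, threshold):
    """Fraction of correctly predicted positives (score above the
    threshold) and negatives (score at or below it)."""
    pos_rate = sum(s > threshold for s in pos_scores) / len(pos_scores)
    neg_rate = sum(s <= threshold for s in neg_scores) / len(neg_scores)
    return pos_rate, neg_rate


def balanced_threshold(pos_scores, neg_scores):
    """Observed score at which the two hit rates are closest."""
    candidates = sorted(set(pos_scores) | set(neg_scores))

    def gap(threshold):
        pos_rate, neg_rate = hit_rates(pos_scores, neg_scores, threshold)
        return abs(pos_rate - neg_rate)

    return min(candidates, key=gap)


# Made-up scores for secretory (positive) and cytoplasmic (negative) samples.
positives = [2.0, 1.0, 0.5, -0.5]
negatives = [-2.0, -1.0, -0.2, 0.8]

print(hit_rates(positives, negatives, 0.0))      # hit rates at the default threshold
print(balanced_threshold(positives, negatives))  # score where the rates are closest
```

Raising the threshold trades positive hit rate for negative hit rate, so the choice depends on which kind of error matters more for the application.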
Datasets
Escherichia coli
505 Sec-dependent secretory and 2365 cytoplasmic sequences of the Escherichia coli K-12 proteome (STEPdb) were used during the machine learning analysis. The class of secretory proteins includes eight sub-cellular categories of STEPdb (see table below). Only proteins that utilize the Sec secretion system for their translocation from the cytoplasm to the periplasm were included; 39 proteins with a Tat signal peptide or exported by the flagellar Type III system were excluded. The cleavage site of the type I signal peptides (e.g. periplasmic proteins excluding lipoproteins) was predicted using SignalP 4.0 and Phobius. The cleavage site of the type II signal peptides (i.e. inner and outer membrane lipoproteins) was predicted with LipoP.
Sub-cellular Location | STEPdb nomenclature | # Proteins
Sec Secretory proteins
Peripheral inner membrane protein facing the periplasm | F2 | 10
Inner Membrane Lipoprotein | E | 21
Periplasmic | G | 295
Peripheral outer membrane protein facing the periplasm | F3 | 8
Outer Membrane Lipoprotein | I | 94
Outer Membrane b-barrel protein | H | 64
Peripheral outer membrane protein facing the extra-cellular space | F4 | 12
Extra-cellular | X | 1
Total | | 505
Cytoplasmic
Cytoplasmic | A | 1851
Peripheral proteins | F1 | 514
Total | | 2365
Other Gram-negative and Gram-positive bacteria
To test whether the features that MatureP selects are universal, we measured its effectiveness in predicting secretory proteins from 25 Gram- and 10 Gram+ bacteria from various phyla (7120 and 1361 secretory proteins respectively, see tables below). These were identified as Sec secretory proteins by combining SignalP 4.0, LipoP and PRED-TAT.
# | Strain (Uniprot) | Gram class | Organism ID (Uniprot) | # Proteins | # Secretory | # Proteins with Type I signal peptides | # Proteins with Type II signal peptides
Gamma
1 | Salmonella bongori N268-08 | - | 1197719 | 4751 | 376 | 265 | 111
2 | Yersinia pestis bv. Antiqua (strain Antiqua) | - | 360102 | 4136 | 331 | 246 | 85
3 | Citrobacter freundii UCI 31 | - | 1400136 | 4932 | 472 | 351 | 121
4 | Klebsiella pneumoniae (strain 342) | - | 507522 | 5739 | 518 | 383 | 135
5 | Pseudomonas fluorescens | - | 294 | 7426 | 783 | 565 | 218
6 | Acinetobacter baumannii (strain ACICU) | - | 405416 | 3746 | 359 | 219 | 140
7 | Coxiella burnetii (strain RSA 331 / Henzerling II) | - | 360115 | 1892 | 90 | 55 | 35
8 | Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 / ATCC 33152 / DSM 7513) | - | 272624 | 2950 | 225 | 167 | 58
9 | Haemophilus influenzae | - | 727 | 3568 | 349 | 257 | 92
Beta
10 | Neisseria gonorrhoeae | - | 485 | 6015 | 561 | 389 | 172
11 | Bordetella pertussis (strain CS) | - | 1017264 | 3275 | 286 | 223 | 63
12 | Ralstonia pickettii (strain 12J) | - | 402626 | 4891 | 490 | 374 | 116
Alpha
13 | Bartonella quintana JK 19 | - | 1134507 | 1308 | 27 | 0 | 27
14 | Brucella melitensis biotype 2 (strain ATCC 23457) | - | 546272 | 3125 | 216 | 173 | 43
15 | Candidatus Liberibacter asiaticus str. Ishi-1 | - | 931202 | 1068 | 31 | 10 | 21
Proteobacteria
16 | Helicobacter pylori (strain HPAG1) | - | 357544 | 1542 | 107 | 69 | 38
17 | Campylobacter lari (strain RM2100 / D67 / ATCC BAA-1060) | - | 306263 | 1545 | 96 | 58 | 38
18 | Campylobacter coli 2548 | - | 887315 | 1809 | 95 | 50 | 45
Chlamydiae
19 | Chlamydia trachomatis (strain D/UW-3/Cx) | - | 272561 | 897 | 46 | 27 | 19
20 | Chlamydophila pneumoniae | - | 83558 | 4081 | 205 | 119 | 86
Bacteroidetes
21 | Bacteroides fragilis (strain YCH46) | - | 295405 | 4598 | 856 | 385 | 471
22 | Capnocytophaga canimorsus (strain 5) | - | 860228 | 2395 | 324 | 126 | 198
Mollicutes
23 | Mycoplasma pneumoniae 309 | - | 1112856 | 708 | 50 | 5 | 45
Bacilli
24 | Streptococcus equinus (Streptococcus bovis) | - | 1335 | 1996 | 59 | 27 | 32
Spirochetes
25 | Borrelia hermsii YOR | | | 1591 | 168 | 17 | 151
Total | | | | | 7120 | 4560 | 2560

# | Strain (Uniprot) | Gram class | Organism ID (Uniprot) | # Proteins | # Secretory | # Proteins with Type I signal peptides | # Proteins with Type II signal peptides
1 | Enterococcus faecalis (strain 62) | + | 936153 | 3011 | 154 | 81 | 73
2 | Streptococcus pneumoniae (strain 70585) | + | 488221 | 2179 | 74 | 31 | 43
3 | Streptococcus uberis (strain ATCC BAA-854 / 0140J) | + | 218495 | 1761 | 59 | 25 | 34
4 | Staphylococcus aureus (strain NCTC 8325) | + | 93061 | 2892 | 112 | 54 | 58
5 | Listeria ivanovii WSLC3009 | + | 1457190 | 2773 | 160 | 83 | 77
6 | Bacillus cereus (strain ATCC 10987) | + | 222523 | 5835 | 292 | 126 | 166
Clostridia
7 | Clostridium tetani (strain Massachusetts / E88) | + | 212717 | 2416 | 117 | 47 | 70
Actinobacteria
8 | Mycobacterium tuberculosis (strain ATCC 25177 / H37Ra) | + | 419947 | 3993 | 125 | 62 | 63
9 | Mycobacterium paratuberculosis (strain ATCC BAA-968 / K-10) | + | 262316 | 4321 | 135 | 65 | 70
10 | Mycobacterium bovis | + | 1765 | 4353 | 133 | 67 | 66
Total | | | | | 1361 | 641 | 720
Non-redundant datasets
According to Nielsen et al., the training and test sets should be non-redundant, and similar (homologous) sequences should be discarded, to avoid overestimating the predictive performance of the classifiers. We performed redundancy reduction on the original dataset (above) following the procedure used by SignalP, using the algorithm of Hobohm with iterative position-specific alignments. The blast+ suite of NCBI was utilized: the makeblastdb command to convert the input fasta files into blast database files, and the psiblast command, which implements the position-specific iterative basic local alignment search of Altschul et al. This resulted in a non-redundant dataset of 1070 cytoplasmic proteins, 207 preproteins and 247 mature domain sequences.
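The redundancy reduction step can be sketched as a greedy procedure of the kind usually referred to as Hobohm algorithm 2: repeatedly discard the sequence with the most similar neighbours until no similar pair remains. The text does not state which Hobohm variant or similarity cut-off was used, so both are assumptions here. The sketch also assumes that the similar pairs have already been extracted (for instance from psiblast hits) into a hypothetical dict, and it does not call BLAST itself.

```python
def hobohm2(neighbors):
    """Greedy redundancy reduction on pre-computed similarity data.

    `neighbors` maps each sequence id to the set of ids considered
    similar to it (hypothetical input format). Returns the ids kept.
    """
    # Build a symmetric neighbour graph without self-links.
    graph = {}
    for seq_id, similar in neighbors.items():
        graph.setdefault(seq_id, set())
        for other in similar:
            if other != seq_id:
                graph[seq_id].add(other)
                graph.setdefault(other, set()).add(seq_id)

    while graph:
        # Sequence with the most remaining neighbours (ties broken by id).
        worst = max(graph, key=lambda s: (len(graph[s]), s))
        if not graph[worst]:
            break  # no similar pairs left
        for other in graph.pop(worst):
            graph[other].discard(worst)
    return set(graph)


# Hypothetical similarity data: "a" is similar to both "b" and "c".
similar_pairs = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}, "d": set()}
print(sorted(hobohm2(similar_pairs)))
```

Removing the most connected sequence first tends to keep more sequences than discarding one member of each similar pair in arbitrary order.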