Information about the STEPdb database

Index:

     1. Updates
     2. Search
     3. Glossary
     4. Topology and orientation of IM proteins
     5. Internal Connections
     6. Color Code
     7. Multifun Terms
     8. MatureP classifier
1. Updates:

The STEPdb website has been updated! The STEPdb2.0 version has been available since October 2019.
To see the changes in the E. coli K-12 proteome, please consult Loos et al., 2019 and the downloads section. Additional information is available in the Supplementary Tables:
- Table S1: Proteome changes
- Table S2: New features
- Table S3: Tools used
For more information: Loos et al., 2019. Go to www.stepdb.eu to consult the previous version.
2. Search:

In STEP the search function is performed locally on the displayed proteome or sub-proteome, so it covers only the current table (e.g. a user browsing the Complexome can search only for entries within this group of proteins).

Search is currently basic. Logical operators such as OR and AND are not supported, and the query text is matched intact (e.g. words separated by a space are not searched separately).
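
The search behaviour described above can be pictured with a minimal sketch. It assumes the current table is a list of dictionaries and that matching is case-insensitive; both are assumptions made for illustration, and the table entries are hypothetical examples rather than STEPdb data.

```python
def search_table(rows, query):
    """Return the rows of the current table in which any field
    contains the query as one intact string (no AND/OR parsing)."""
    needle = query.strip().lower()
    if not needle:
        return list(rows)
    return [
        row for row in rows
        if any(needle in str(value).lower() for value in row.values())
    ]


# Hypothetical entries standing in for the currently displayed table.
complexome = [
    {"name": "SecYEG translocon", "location": "B"},
    {"name": "Tat translocase", "location": "B"},
]

print(search_table(complexome, "SecYEG translocon"))  # matched as one string
print(search_table(complexome, "SecYEG OR Tat"))      # "OR" is not an operator
```

Because the query is matched intact, "SecYEG OR Tat" finds nothing here, and neither would the same words typed in a different order.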
3. Glossary:

Basic Proteome of E. coli K-12 (MG1655)
STEP defines the "basic" proteome of E. coli K-12 (MG1655) as the fraction of the proteome that is devoid of gene products that are likely not produced, or that result from genomic insertions enriched in defective prophages, transposons, pseudogenes, integrases and mobile elements. The "basic" proteome was defined by manual annotation based on EcoGene (Rudd, 2000), Uniprot (Dimmer et al, 2012) and other studies (McClelland et al, 2001; Ochman et al, 2000). F1 plasmid-encoded proteins were also removed from the Uniprot data of the E. coli K-12 proteome used (release version of November 2010). According to our analysis, the essential protein-coding sequences devoid of these elements encompass only 3899 proteins.

Core Proteome of E. coli K-12
The common core proteome of all E. coli strains consists of 2073 proteins. These were derived from a genome comparison of 15 E. coli strains with the Multi-Genome Homology Comparison Tool (Davidsen et al, 2010) (Papanastasiou et al, submitted).

Protein Solubility

Protein structure is encoded in the primary amino acid sequence. However, to reach its functional form a protein must in some cases avoid misfolded states that can lead to inclusion bodies. Niwa et al., 2009 determined the solubility of 3173 Escherichia coli proteins in a chaperone-free reconstituted translation system. The aggregation propensity of each protein was examined by a centrifugation assay. Solubility is defined as an index of aggregation propensity: the proportion of protein in the supernatant fraction, obtained after centrifugation of the translation mixture, relative to the uncentrifuged total protein. Solubility is therefore a percentage that ranges from 0% to 100%.
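
The solubility definition above amounts to a single ratio. The sketch below is illustrative only; the function name and the assumption that both quantities are measured in the same units are ours.

```python
def solubility_percent(supernatant, total):
    """Solubility as the supernatant fraction of the uncentrifuged
    total protein, expressed as a percentage (0 to 100)."""
    if total <= 0:
        raise ValueError("total protein must be positive")
    if not 0 <= supernatant <= total:
        raise ValueError("supernatant must lie between 0 and total")
    return 100.0 * supernatant / total


# Hypothetical measurement: a quarter of the protein stays in the supernatant.
print(solubility_percent(25.0, 100.0))
```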
Manual Curation and non-experimental qualifiers
We follow the same non-experimental qualifiers as Uniprot:
- Potential: There is some logical or conclusive evidence that the given annotation could apply. This non-experimental qualifier is often used to present results from protein sequence analysis software tools, which are only annotated if the result makes sense in the biological context of a given protein.
- Probable: Indicates stronger evidence than the qualifier "Potential". This qualifier implies that there must be at least some experimental evidence indicating that the information is expected to be found in the natural environment of a protein.
- By similarity: When some biological information was experimentally obtained for a given protein (or part of it), it may be transferred to other protein family members within a certain taxonomic range, dependent on the biological event or characteristic.
Sub-cellular localization special characters and formalisms

To denote multiple localizations that have been experimentally established we introduced the comma "," formalism, whereas a slash "/" denotes two or more possible sub-cellular locations that have not yet been experimentally determined.

To denote the sub-cellular location of protein complexes that span both cellular membranes we introduced the "&" formalism; the corresponding complexes (e.g. the copper / silver efflux transport system) are annotated as "B&H".
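
As an illustration of the three formalisms, the following sketch splits an annotation string into its location codes. It assumes that an annotation uses at most one of the separators; the function name and the labels it returns are hypothetical and not part of STEPdb.

```python
def parse_location(annotation):
    """Interpret a sub-cellular location annotation.

    ","  several locations, experimentally established
    "/"  several possible locations, not yet determined experimentally
    "&"  a complex that spans both cellular membranes
    """
    text = annotation.strip()
    if "&" in text:
        kind, separator = "spans_both_membranes", "&"
    elif "," in text:
        kind, separator = "multiple_experimental", ","
    elif "/" in text:
        kind, separator = "alternatives_undetermined", "/"
    else:
        return {"kind": "single", "codes": [text]}
    codes = [code.strip() for code in text.split(separator)]
    return {"kind": kind, "codes": codes}


print(parse_location("B&H"))
print(parse_location("A/G"))
```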

Exportome and subclasses (Secretome, non-classical secretion etc.)

We define as the exportome those proteins that are localized within the inner membrane and beyond (e.g. lipoproteins, extra-cellular proteins). This includes the STEPdb sub-cellular classes B, I, E, F2, F3, G, X and F4. We divide the exportome into two subclasses, the membranome and the secretome. The membranome contains proteins that are embedded in the inner membrane, whereas the secretome refers to proteins that are fully translocated across the inner membrane. Membranome and secretome can be further divided into proteins that are substrates of the Sec and Tat secretion pathways. Finally, the secretome includes a particular class of non-classical secretory proteins, which are secreted without an apparent signal peptide motif.
4. Topology and orientation of IM proteins

Transmembrane regions of integral membrane proteins were predicted using Phobius. In cases where Phobius failed to identify any transmembrane region, the prediction of TMHMM was used instead. The predicted orientation of each polypeptide, which is given by the location of its C-terminus (cytoplasmic or periplasmic), was reconsidered based on the experimentally verified C-terminus location of 734 transmembrane proteins (Daley et al, 2005).
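
The fallback order described above can be sketched as follows. The two predictors are passed in as plain callables and are stand-ins, not the real Phobius or TMHMM interfaces. Treating the experimental C-terminus location as a simple override is also an assumption, since the text only says the orientation was "reconsidered".

```python
def predict_topology(sequence, phobius, tmhmm, experimental_c_term=None):
    """Combine two predictors: use the first, fall back to the second
    when the first reports no transmembrane region.

    phobius, tmhmm: hypothetical callables returning
        (list_of_tm_regions, c_terminus_location)
    experimental_c_term: optional verified location, assumed to override
        the predicted one.
    """
    regions, c_term = phobius(sequence)
    source = "Phobius"
    if not regions:
        regions, c_term = tmhmm(sequence)
        source = "TMHMM"
    if experimental_c_term is not None:
        c_term = experimental_c_term
    return {"regions": regions, "c_terminus": c_term, "source": source}


# Stub predictors for illustration only.
def stub_phobius(sequence):
    return [], None


def stub_tmhmm(sequence):
    return [(5, 25)], "cytoplasmic"


print(predict_topology("M" * 40, stub_phobius, stub_tmhmm))
```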
5. Internal Connections

There are internal connections between some of the tables. Proteins listed in the E. coli K-12 proteome table link to the list of complexes in which they participate. Furthermore, from the Complexome table the user can view the schematic representation of each complex and, through this schematic, link directly to the K-12 table.

6. Color Code

STEPdb follows a specific color code to represent proteins in the various sub-cellular locations. The E. coli K-12 export systems and the Peripherome are drawn as cartoons in which each protein is represented as a filled circle following the color code below. Additionally, on the "Complexome" page, each complex is drawn dynamically upon clicking the "draw" button. The protein subunits of each complex also follow STEPdb's color code.

Protein Symbol	Protein Localization
	Nucleoid (N)
	Cytoplasmic (A)
	Ribosomal (r)
	Peripherally associated with the plasma membrane facing the cytoplasm (F1)
	Inner membrane protein (B)
	Peripherally associated with the plasma membrane facing the periplasm (F2)
	Inner membrane lipoprotein (E)
	Periplasmic (G)
	Outer membrane lipoprotein (I)
	Peripherally associated with the outer membrane facing the periplasm (F3)
	Outer membrane β-barrel protein (H)
	Peripherally associated with the outer membrane facing the extra-cellular space (F4)
	Other

7. Multifun Terms

Peripheral inner membrane proteins were classified in eight major categories of cellular function mainly based on Multifun Terms (Serres & Riley, 2000). These are summarized in the table below.

Cellular Process	Multifun term	GO term	GO id
Metabolism	MultiFun:1 Metabolism	GO:metabolism	GO:0008152
DNA-related	MultiFun:2.1 DNA related	GO:DNA metabolism	GO:0006259
RNA-related	MultiFun:2.2 RNA related	GO:RNA metabolism	GO:0016070
Protein-related	MultiFun:2.3 Protein related	GO:protein biosynthesis	GO:0006412
Transport	MultiFun:4 Transport	GO:transport	GO:0006810
Cell division	MultiFun:5.1 Cell division	GO:cytokinesis	GO:0000910
Response to stress	MultiFun:5.5 Adaptation to stress	GO:response to stress	GO:0006950
Cell structure	MultiFun:6 Cell structure	GO:cellular_component	GO:0005575

8. MatureP classifier

Methods

The MatureP classifier distinguishes Sec secretory proteins from cytoplasmic ones. Two methods are provided:
1. MatureP, a classifier that accepts only the mature sequences of potential secretory or cytoplasmic proteins (i.e. known or potential signal peptide sequences must be removed).
2. SP-MatureP, a combinatorial classifier that takes into account both the MatureP and the pre-protein classifiers.
In the SP-MatureP method, the pre-protein classifier first predicts the existence of a signal peptide sequence and then the MatureP classifier tests the validity of the mature sequence. SP-MatureP decides whether a sequence is “cytoplasmic”, a mature or a secretory pre-protein sequence or, more interestingly, “non-secretory” (i.e. possessing a signal peptide but having a non-compatible mature sequence).
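
One way to read the SP-MatureP decision is as a two-step rule, sketched below. This is our interpretation of the four outcomes named above, not the published implementation; the function name, the label strings and the default threshold of zero are assumptions.

```python
def sp_maturep_label(has_signal_peptide, mature_score, threshold=0.0):
    """Combine a signal peptide call with a MatureP-style score.

    has_signal_peptide: output of a pre-protein classifier (assumed boolean)
    mature_score: classification score of the mature sequence
    threshold: scores above it count as secretory (assumed default 0)
    """
    mature_is_secretory = mature_score > threshold
    if has_signal_peptide:
        # Signal peptide found: the mature domain must also be compatible.
        if mature_is_secretory:
            return "secretory pre-protein"
        return "non-secretory"
    # No signal peptide: either an already mature secretory sequence
    # or a cytoplasmic protein.
    if mature_is_secretory:
        return "mature secretory"
    return "cytoplasmic"


print(sp_maturep_label(True, -0.7))
```

The "non-secretory" outcome is the case a signal peptide predictor alone cannot flag, because it never inspects the mature domain.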

MatureP score threshold

MatureP is a linear classifier that uses a variety of features derived from the amino acid sequence, such as amino acids, di-peptides, tri-peptides and pairwise interaction energy. MatureP assigns a classification score to each provided sequence. The final decision of the classifier depends on the selected score threshold, above which proteins are considered to be secretory. The most commonly used threshold is zero, so that positively scored sequences are predicted as secretory, but a different threshold can be chosen. Using the scores of the training samples we can draw the hit rate (y-axis) versus score (x-axis) curves for both the positive and the negative classes. The hit rate is the percentage of correctly predicted positive/negative samples when the selected score is used as the classification threshold. When the hit rate of the positive class increases, the hit rate of the negative class decreases. For every classifier there exists a score threshold at which the two hit rates are equal.
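
The relation between threshold and hit rates can be made concrete with a small sketch. The scores below are invented for illustration and are not MatureP training scores. On a finite sample the two hit rates may never be exactly equal, so the helper returns the candidate threshold at which they are closest.

```python
def hit_rates(pos_scores, neg_scores, threshold):
    """Fraction of positives scored above the threshold and
    fraction of negatives scored at or below it."""
    pos_hit = sum(s > threshold for s in pos_scores) / len(pos_scores)
    neg_hit = sum(s <= threshold for s in neg_scores) / len(neg_scores)
    return pos_hit, neg_hit


def balanced_threshold(pos_scores, neg_scores):
    """Observed score at which the two hit rates are closest to equal."""
    candidates = sorted(set(pos_scores) | set(neg_scores))

    def gap(threshold):
        pos_hit, neg_hit = hit_rates(pos_scores, neg_scores, threshold)
        return abs(pos_hit - neg_hit)

    return min(candidates, key=gap)


# Invented scores: positives are secretory, negatives cytoplasmic.
pos = [2.1, 1.4, 0.6, -0.3]
neg = [-1.8, -0.9, -0.2, 0.5]

print(hit_rates(pos, neg, 0.0))
print(balanced_threshold(pos, neg))
```

Lowering the threshold raises the positive hit rate and lowers the negative one, which is the trade-off the two curves display.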

Datasets

Escherichia coli
505 Sec-dependent secretory and 2365 cytoplasmic sequences of the Escherichia coli K-12 proteome (STEPdb) were used during the machine learning analysis. The class of secretory proteins includes eight sub-cellular categories of STEPdb (see table below). Only proteins that utilize the Sec secretion system for their translocation from the cytoplasm to the periplasm were included; 39 proteins with a Tat signal peptide or that use the flagellar Type III system were excluded. The cleavage sites of the type I signal peptides (e.g. periplasmic proteins excluding lipoproteins) were predicted using SignalP 4.0 and Phobius. The cleavage sites of the type II signal peptides (i.e. inner and outer membrane lipoproteins) were predicted with LipoP.

	Sub-cellular Location	STEPdb nomenclature	# Proteins
Sec Secretory proteins	Peripheral inner membrane protein facing the periplasm	F2	10
	Inner Membrane Lipoprotein	E	21
	Periplasmic	G	295
	Peripheral outer membrane protein facing the periplasm	F3	8
	Outer Membrane Lipoprotein	I	94
	Outer Membrane β-barrel protein	H	64
	Peripheral outer membrane protein facing the extra-cellular space	F4	12
	Extra-cellular	X	1
	Total		505
Cytoplasmic	Cytoplasmic	A	1851
	Peripheral proteins	F1	514
	Total		2365

Other Gram-negative and Gram-positive bacteria

To test whether the features that MatureP selects are universal, we measured its effectiveness in predicting secretory proteins from 25 Gram-negative and 10 Gram-positive bacteria from various phyla (7120 and 1361 secretory proteins respectively, see tables below). These were identified as Sec secretory proteins by combining SignalP 4.0, LipoP and PRED-TAT.

#	Strain (Uniprot)	Gram class	Organism ID (Uniprot)	# Proteins	# Secretory	# Proteins with Type I signal peptides	# proteins with Type II signal peptides
Gamma
1	Salmonella bongori N268-08	-	1197719	4751	376	265	111
2	Yersinia pestis bv. Antiqua (strain Antiqua)	-	360102	4136	331	246	85
3	Citrobacter freundii UCI 31	-	1400136	4932	472	351	121
4	Klebsiella pneumoniae (strain 342)	-	507522	5739	518	383	135
5	Pseudomonas fluorescens	-	294	7426	783	565	218
6	Acinetobacter baumannii (strain ACICU)	-	405416	3746	359	219	140
7	Coxiella burnetii (strain RSA 331 / Henzerling II)	-	360115	1892	90	55	35
8	Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 / ATCC 33152 / DSM 7513)	-	272624	2950	225	167	58
9	Haemophilus influenzae	-	727	3568	349	257	92
Beta
10	Neisseria gonorrhoeae	-	485	6015	561	389	172
11	Bordetella pertussis (strain CS)	-	1017264	3275	286	223	63
12	Ralstonia pickettii (strain 12J)	-	402626	4891	490	374	116
Alpha
13	Bartonella quintana JK 19	-	1134507	1308	27	0	27
14	Brucella melitensis biotype 2 (strain ATCC 23457)	-	546272	3125	216	173	43
15	Candidatus Liberibacter asiaticus str. Ishi-1	-	931202	1068	31	10	21
Proteobacteria
16	Helicobacter pylori (strain HPAG1)	-	357544	1542	107	69	38
17	Campylobacter lari (strain RM2100 / D67 / ATCC BAA-1060)	-	306263	1545	96	58	38
18	Campylobacter coli 2548	-	887315	1809	95	50	45
Chlamydiae
19	Chlamydia trachomatis (strain D/UW-3/Cx)	-	272561	897	46	27	19
20	Chlamydophila pneumoniae	-	83558	4081	205	119	86
Bacteroidetes
21	Bacteroides fragilis (strain YCH46)	-	295405	4598	856	385	471
22	Capnocytophaga canimorsus (strain 5)	-	860228	2395	324	126	198
Mollicutes
23	Mycoplasma pneumoniae 309	-	1112856	708	50	5	45
Bacilli
24	Streptococcus equinus (Streptococcus bovis)	-	1335	1996	59	27	32
Spirochetes
25	Borrelia hermsii YOR			1591	168	17	151
Total					7120	4560	2560

#	Strain (Uniprot)	Gram class	Organism ID (Uniprot)	# Proteins	# Secretory	# Proteins with Type I signal peptides	# proteins with Type II signal peptides
1	Enterococcus faecalis (strain 62)	+	936153	3011	154	81	73
2	Streptococcus pneumoniae (strain 70585)	+	488221	2179	74	31	43
3	Streptococcus uberis (strain ATCC BAA-854 / 0140J)	+	218495	1761	59	25	34
4	Staphylococcus aureus (strain NCTC 8325)	+	93061	2892	112	54	58
5	Listeria ivanovii WSLC3009	+	1457190	2773	160	83	77
6	Bacillus cereus (strain ATCC 10987)	+	222523	5835	292	126	166
Clostridia
7	Clostridium tetani (strain Massachusetts / E88)	+	212717	2416	117	47	70
Actinobacteria
8	Mycobacterium tuberculosis (strain ATCC 25177 / H37Ra)	+	419947	3993	125	62	63
9	Mycobacterium paratuberculosis (strain ATCC BAA-968 / K-10)	+	262316	4321	135	65	70
10	Mycobacterium bovis	+	1765	4353	133	67	66
Total					1361	641	720

Non redundant datasets

According to Nielsen et al., the training and test sets should be non-redundant: similar (homologous) sequences should be discarded to avoid overestimating the predictive performance of the classifiers. We performed redundancy reduction on the original dataset (above) following the procedure used by SignalP, using the algorithm of Hobohm with iterative position-specific alignments. The blast+ suite of NCBI was used: the makeblastdb command converts the input fasta files into blast database files, and the psiblast command implements the position-specific iterated basic local alignment search of Altschul et al. This resulted in a non-redundant dataset of 1070 cytoplasmic proteins, 207 preproteins and 247 mature domain sequences.
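
The idea behind redundancy reduction can be sketched as a greedy selection in the style of Hobohm's first algorithm: a sequence is kept only if it is not similar to any sequence already kept. The text does not state which Hobohm variant or which similarity cut-off was used, so both are assumptions here. The similar pairs are taken as given, for instance parsed from alignment hits, and the identifiers are hypothetical.

```python
def reduce_redundancy(ids, similar_pairs):
    """Greedy selection of a non-redundant subset.

    ids: sequence identifiers in the order they should be considered
    similar_pairs: iterable of (id_a, id_b) pairs judged homologous
    """
    neighbours = {}
    for a, b in similar_pairs:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    kept = []
    for seq_id in ids:
        # Keep the sequence only if none of its neighbours was kept before.
        if neighbours.get(seq_id, set()).isdisjoint(kept):
            kept.append(seq_id)
    return kept


# Hypothetical identifiers and similarity pairs.
print(reduce_redundancy(["p1", "p2", "p3", "p4"], [("p1", "p2"), ("p3", "p4")]))
```

The result depends on the order of `ids`, which is why such procedures usually sort the input by some quality criterion first.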
