RUPEE Protein Structure Search

Search Database

The databases available for searching, along with their corresponding versions, are shown in the following table.

Database        Version
PDB Chain       as of 2022-07-16
SCOPe           2.08
CATH            v4.3
AlphaFold DB    v2

Search By

You can search by a structure id or an uploaded PDB file.

Search By Structure Id

The structure id can be any SCOPe or CATH domain identifier, or a PDB id concatenated with a PDB chain id to identify a whole chain within the PDB. The structure id does not have to match the database type you are searching; this is a feature of RUPEE that arose naturally when implementing the PDB file upload feature.
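As an illustration of the three kinds of structure id, the sketch below classifies an id by its shape. The patterns are assumptions based on the commonly seen forms of these identifiers (for example d1dlwa_ for SCOPe, 1oaiA00 for CATH, and 1abcA for a PDB chain); they are not taken from RUPEE, and the function name classify_structure_id is hypothetical.

```python
import re

# Assumed identifier shapes (illustrative only, not RUPEE's own rules):
#   SCOPe domain: 'd' + 4-character PDB id + chain character + domain character
#   CATH domain:  4-character PDB id + chain character + 2-digit domain number
#   PDB chain:    4-character PDB id + chain id
SCOPE_RE = re.compile(r"^d[0-9][a-z0-9]{3}[a-z0-9_.][a-z0-9_]$")
CATH_RE = re.compile(r"^[0-9][a-zA-Z0-9]{3}[A-Za-z0-9][0-9]{2}$")
CHAIN_RE = re.compile(r"^[0-9][a-zA-Z0-9]{3}[A-Za-z0-9]+$")


def classify_structure_id(structure_id):
    """Guess which kind of identifier a structure id is.

    The domain patterns are tried first because a CATH id such as
    1oaiA00 would otherwise also match the looser PDB chain pattern.
    """
    s = structure_id.strip()
    if SCOPE_RE.match(s):
        return "scope"
    if CATH_RE.match(s):
        return "cath"
    if CHAIN_RE.match(s):
        return "chain"
    return "unknown"
```

Because the id type is inferred from the id itself, nothing in this sketch depends on which database is being searched, which mirrors the behavior described above.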

Search By PDB File

When uploading a PDB file, only the first chain of the first model is considered. Additionally, all backbone atoms (i.e. N, CA, C, and O) should be present for the search to be effective. If you want to find structures similar to a given domain, then upload the domain. If you want to find structures similar to a protein chain, then upload the chain and search the PDB Chain database.
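Before uploading, it can help to check that the first chain of the first model has complete backbones. The following is a minimal sketch that reads the fixed-column ATOM records of a PDB file; the function name first_chain_backbone_report is hypothetical, and the sketch is not how RUPEE itself parses uploads.

```python
BACKBONE = {"N", "CA", "C", "O"}


def first_chain_backbone_report(pdb_text):
    """Return (chain_id, missing) for the first chain of the first model.

    `missing` maps (residue number, insertion code) to the sorted list of
    backbone atom names absent from that residue.
    """
    chain_id = None
    atoms_by_residue = {}
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record == "ENDMDL":
            break  # stop at the end of the first model
        if record != "ATOM" or len(line) < 27:
            continue
        current_chain = line[21]  # column 22: chain identifier
        if chain_id is None:
            chain_id = current_chain
        elif current_chain != chain_id:
            break  # a second chain has started
        # columns 23-26: residue sequence number, column 27: insertion code
        residue_key = (line[22:26].strip(), line[26].strip())
        atom_name = line[12:16].strip()  # columns 13-16: atom name
        atoms_by_residue.setdefault(residue_key, set()).add(atom_name)
    missing = {
        key: sorted(BACKBONE - names)
        for key, names in atoms_by_residue.items()
        if not BACKBONE <= names
    }
    return chain_id, missing
```

An empty `missing` dictionary means every residue in the considered chain has all four backbone atoms.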

Search Filter

The available search filters change dynamically based on the selected search database and search by criteria. SCOPe and CATH are hierarchical classifications. CATH designates a representative domain for each grouping at each level of its hierarchy whereas SCOPe, as far as I am aware, does not. Whole PDB chains, on the other hand, are not classified into a hierarchy at all.

Given the above, the SCOPe and CATH databases allow you to filter the search results to structures whose classification differs from that of the query structure at a chosen hierarchy level. In addition, the CATH database allows you to filter by hierarchy level representatives. Currently, search filters are not provided for searching the PDB Chain database.

Search filters allow for the discovery of structural similarities between differently classified domains while preventing the results from being buried by known similarities.

Search Types

Six search types are available. The first three are the recommended search types.

Full-Length

Return structures similar to the full length of the query structure.

Contained In

Return structures that contain the query structure.

Contains

Return structures similar to a fragment of the query structure.

RMSD

Search on RMSD, which is ubiquitous and generally useful.

To avoid too many trivial matches, the Contains search type is limited to structures at least one third the size of the query structure. Containment searches will also return many structures similar to the full length of the query structure, since a full-length match is a trivial case of containment.
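The size limit for the Contains search type can be expressed as a one-line check. This is only a sketch: it assumes that size is measured as a residue count, and the function name passes_contains_size_limit is hypothetical.

```python
def passes_contains_size_limit(query_length, candidate_length):
    """True if the candidate is at least one third the size of the query.

    Sizes are assumed to be residue counts. Comparing 3 * candidate with
    the query keeps the check in integer arithmetic.
    """
    return 3 * candidate_length >= query_length
```

For a 300-residue query, a 100-residue candidate passes the check and a 99-residue candidate does not.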

Search Mode (referred to as Operating Modes in the PLoS ONE paper)

For Fast mode, the initial filtering with min-hashing and LSH still functions as described in the PLoS ONE paper. However, to increase the utility of Fast mode, instead of immediately returning the min-hashing and LSH results without alignment scores, TM-align alignments are run for the results, which are then sorted by alignment score. For containment searches, the length of one of the structures is used to normalize the TM-score, rather than the average length of the compared structures as is used for the Full-Length search type.
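The effect of the normalization length can be seen with the standard TM-score formula, in which each aligned residue pair at distance d contributes 1 / (1 + (d / d0)^2) and the sum is divided by the normalization length. The sketch below is not RUPEE's code: the function names are hypothetical, the 0.5 floor on d0 is a common guard for short structures included here as an assumption, and which structure supplies the length in a containment search is not stated above, so the example simply uses the shorter one.

```python
def tm_d0(norm_length):
    """Distance scale d0 of the TM-score for a given normalization length."""
    if norm_length <= 15:
        return 0.5  # assumed floor; the cube-root formula is undefined here
    return max(0.5, 1.24 * (norm_length - 15) ** (1.0 / 3.0) - 1.8)


def tm_score(distances, norm_length):
    """TM-score from aligned-pair distances (in angstroms)."""
    d0 = tm_d0(norm_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / norm_length


# A 50-residue query superimposed perfectly onto part of a 150-residue target.
distances = [0.0] * 50
query_length, target_length = 50, 150

average_normalized = tm_score(distances, (query_length + target_length) / 2.0)  # 0.5
shorter_normalized = tm_score(distances, min(query_length, target_length))      # 1.0
```

With average-length normalization the perfect but partial match scores only 0.5, whereas normalizing by a single structure's length lets it score 1.0, which is why a containment search needs a different normalization from a Full-Length search.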

For Top-Aligned mode, the initial filtering with min-hashing and LSH still functions as described in the PLoS ONE paper. However, once filtered matches are obtained, instead of adjusting Jaccard similarity estimates, a simplified Needleman-Wunsch residue descriptor alignment between the query structure and all filtered structures is performed. Mismatches and gaps each score -1 and matches score +1. For containment searches, depending on whether the search type is Contained In or Contains, one of the sides of the dynamic programming matrix is not penalized for the opening gap and end gap. Likewise, for containment searches, the length of one of the structures is used to normalize the TM-score, rather than the average length of the compared structures as is used for the Full-Length search type. Once the rough alignment scores are obtained for the filtered structures, TM-align alignments are run as described in the PLoS ONE paper.
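The scoring scheme described above can be sketched as a small dynamic program. This is an illustration under stated assumptions rather than RUPEE's implementation: residue descriptors are treated as any comparable values, the function name descriptor_alignment_score and the free_overhang parameter are hypothetical, and the mapping of Contained In to a free target overhang and Contains to a free query overhang is one reading of the text.

```python
def descriptor_alignment_score(query, target, free_overhang=None):
    """Needleman-Wunsch score with match +1, mismatch -1, gap -1.

    free_overhang selects the sequence whose unaligned leading and
    trailing residues are not penalized:
      None     -> global alignment (Full-Length)
      "target" -> target ends are free (assumed for Contained In)
      "query"  -> query ends are free (assumed for Contains)
    """
    n, m = len(query), len(target)
    gap = -1
    # Row 0: aligning an empty query prefix against target[:j].
    prev = [0 if free_overhang == "target" else gap * j for j in range(m + 1)]
    best_last_column = prev[m]  # best score with the whole target consumed
    for i in range(1, n + 1):
        # Column 0: aligning query[:i] against an empty target prefix.
        curr = [0 if free_overhang == "query" else gap * i]
        for j in range(1, m + 1):
            pair = 1 if query[i - 1] == target[j - 1] else -1
            curr.append(max(prev[j - 1] + pair,   # align the two residues
                            prev[j] + gap,        # gap in the target
                            curr[j - 1] + gap))   # gap in the query
        best_last_column = max(best_last_column, curr[m])
        prev = curr
    if free_overhang == "target":
        return max(prev)         # trailing target residues are free
    if free_overhang == "query":
        return best_last_column  # trailing query residues are free
    return prev[m]
```

For example, aligning the descriptors [1, 2, 3] against [9, 9, 1, 2, 3, 9] scores 0 globally, because the three matches are cancelled by three gaps, but scores 3 when the target overhang is free.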

All-Aligned mode skips the min-hashing and LSH steps. Instead, for each structure in the searched database, it performs the simplified Needleman-Wunsch algorithm to obtain an alignment that is used as the initial alignment in a modified TM-align algorithm that does not attempt to find initial alignments by other means. This allows RUPEE to apply the modified TM-align to all available structures in a reasonable amount of time, typically between 5 and 10 minutes. The results of All-Aligned mode are virtually identical to those of Top-Aligned mode for known structures. However, for novel structures, such as those output by protein structure prediction protocols, All-Aligned mode is an improvement.