Publication Detail
CATHe: Detection of remote homologues for CATH superfamilies using embeddings from protein language models
Publication Type: Working discussion paper
Authors: Nallapareddy V, Bordin N, Sillitoe I, Heinzinger M, Littmann M, Waman V, Sen N, Rost B, Orengo C
Publication date: 13/03/2022
Status: Published
Abstract
CATH is a protein domain classification resource that combines an automated workflow of structure and sequence comparison with expert manual curation to construct a hierarchical classification of evolutionary and structural relationships. The aim of this study was to develop algorithms for detecting remote homologues that might be missed by state-of-the-art HMM-based approaches. The proposed algorithm for this task (CATHe) combines a neural network with sequence representations obtained from protein language models. The dataset consisted of remote homologues with less than 20% sequence identity. The CATHe models trained on the 1773 largest and the 50 largest CATH superfamilies had accuracies of 85.6 ± 0.4% and 98.15 ± 0.30%, respectively. To examine whether CATHe was able to detect more remote homologues than HMM-based approaches, we employed a dataset consisting of protein regions that had annotations in Pfam but not in CATH. For this experiment, we used highly reliable CATHe predictions (expected error rate <0.5%), which provided CATH annotations for 4.62 million Pfam domains. For a subset of these domains from Homo sapiens, we structurally validated 90.86% of the predictions by comparing their corresponding AlphaFold structures with experimental structures from the CATHe-predicted superfamilies.