Clustering Profile Hidden Markov Models based on Profile-Profile Alignment
Ahmad Shawqi Tamimi
Supervisors: Dr. Hashem Tamimi and Dr. Yaqoub Ashhab
A profile hidden Markov model (profile-HMM) is a statistical model that represents a protein sequence family. Many databases now store protein family information as profile-HMMs. These databases are growing rapidly, which calls for indexing them so that similar profiles can be retrieved both accurately and quickly.
In this thesis, we introduce a novel method for clustering profile-HMMs based on profile-profile alignment. The resulting clusters are used as a basis for indexing profile databases.
Three clustering approaches are used and compared: hierarchical, k-means, and connected components. With these three approaches we obtained time reductions of up to 54.69\%, 68.53\%, and 60.07\%, respectively, and accuracies of 96.22\%, 90.23\%, and 86.88\%, respectively. These results were obtained after applying an overlapping clustering technique with each approach.
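As a rough illustration of the connected-components idea only, the sketch below groups profiles whose pairwise score meets a cutoff, using a union-find structure. The score dictionary, the threshold value, and the function name are assumptions made for this example; they are not taken from the thesis, and the overlapping clustering step is not modeled.

```python
def connected_component_clusters(scores, threshold):
    """Cluster items linked by pairwise scores >= threshold.

    scores: dict mapping (item_a, item_b) to a hypothetical
    profile-profile alignment score (higher means more similar).
    """
    parent = {}

    def find(x):
        # Register unseen items as their own root, then walk to the root
        # with path compression.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        root_a, root_b = find(a), find(b)
        # Merge the two components only when the pair is similar enough.
        if score >= threshold and root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for item in list(parent):
        clusters.setdefault(find(item), set()).add(item)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical scores between five profiles labeled A to E.
example_scores = {
    ("A", "B"): 0.9,
    ("B", "C"): 0.8,
    ("C", "D"): 0.2,
    ("D", "E"): 0.7,
}
print(connected_component_clusters(example_scores, threshold=0.5))
```

With the assumed cutoff of 0.5, the weak C-D link is ignored, so the profiles fall into two groups. A query profile would then need to be compared against one group rather than the whole collection, which is the intuition behind using clusters as an index.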