This procedure continues until no additional merges can together due to the chain effects

Therefore, the problem is formulated as finding all the strongly connected subgraphs in the graph. Moreover, since PCI-32765 proteins usually contain more than one domain, a practical algorithm must be able to find the overlapping subgraphs. In graph theory, a subgraph that is more highly connected than other parts of the graph is also called a community. The community-finding problem has received much attention since the seminal paper by Newman. Unfortunately, the overlapping community-finding problem has not been tackled in most of the traditional graph-based or clustering algorithms. In 2005, Palla et al. proposed a clique percolation method for uncovering overlapping communities. They defined the k-clique community as a set of nodes belonging to adjacent k-cliques, i.e., cliques with k nodes. Later, Kumpula et al. proposed a more efficient clique percolation algorithm to find the overlapping kcommunities, for a fixed k. Their algorithm works in a sequential manner. This algorithm can detect the overlapping kclique communities in linear time in terms of the number of kcliques in the graph. However, none of these algorithms can be directly applied to the domain finding problem. Both algorithms require the enumeration of all cliques with sizes smaller than kmax, where kmax is the size of the largest clique in the graph. This is not practical for proteome-scale domain detection, in which we have a dense graph of about 20,000-70,000 nodes. A populated domain can appear hundreds or even thousands of times in a genome. On the other hand, one may suggest using a small value instead of kmax to overcome this issue. However, this will cause irrelevant domains to be merged. Here, we propose a heuristic algorithm that does not enumerate all the small cliques by using the properties of the domain detection problem. First, from our protein sequence-indexing step, we extract and store all the sequences that share the same hash seeds. According to the way our graph is defined, all such sequences are connected to each other and thus form a clique. Second, the more frequently a hash seed appears, the higher the confidence assigned to this seed. Larger cliques therefore have higher confidence. According to these properties, in order to avoid the chain effect caused by the 2-clique, our algorithm is designed to work in a backward manner. It first eliminates all the edges with weights smaller than a pre-defined threshold. We use two as the default value, which means that two sequences are considered to be homologous if they contain at least two common hash seeds. The algorithm then begins with the largest clique in the graph, i.e., the one that corresponds to the most frequent hash seed. If there are other cliques with the same size kmax in the graph, our algorithm projects the cliques into kmax {cliques using the same method described in. Each connected component in this projection corresponds to a kmax{clique community. The communities are then compared with the communities with larger size. If the majority of the nodes of the smaller community are shared between the two, these two communities are merged.