Library for unsupervised clustering of large data sets of T cell receptor sequences.
ClusTCR is a two-step clustering approach that combines the speed of the Faiss Clustering Library with the accuracy of the Markov Clustering Algorithm to efficiently group large sets of T cell receptor sequences.
In the first step, ClusTCR roughly divides the dataset into ‘superclusters’. To determine these superclusters, ClusTCR uses the Faiss Clustering Library, which was developed for rapid clustering of dense vectors through efficient indexing. Input sequences are first mapped to a numerical embedding that reflects their physicochemical properties. ClusTCR then applies an efficient K-means implementation to compute n centroids, which form an index. Finally, a similarity search on the index assigns each sequence to its nearest centroid.
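The first step can be illustrated with a minimal sketch in plain Python. This is not ClusTCR's code and it does not call Faiss: a small Lloyd's K-means stands in for the Faiss index, and the embedding uses a single property (the Kyte-Doolittle hydropathy scale) chosen only for illustration. The function names `embed`, `kmeans` and `superclusters` are hypothetical.

```python
# Kyte-Doolittle hydropathy scale. Using this single property is an
# assumption for illustration; ClusTCR's actual embedding may differ.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def embed(seq, length):
    """Map a CDR3 sequence to a fixed-length vector, zero-padded on the right."""
    values = [HYDROPATHY[aa] for aa in seq[:length]]
    return values + [0.0] * (length - len(values))


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, centroids):
    """Index of the centroid closest to vec (ties go to the lowest index)."""
    return min(range(len(centroids)), key=lambda i: sq_dist(vec, centroids[i]))


def kmeans(vectors, k, iterations=10):
    """Plain Lloyd's algorithm, seeded with the first k vectors."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for vec in vectors:
            groups[nearest(vec, centroids)].append(vec)
        for i, members in enumerate(groups):
            if members:  # keep the previous centroid if a group is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def superclusters(sequences, k):
    """Embed the sequences, fit k centroids, and assign each sequence to one."""
    length = max(len(s) for s in sequences)
    vectors = [embed(s, length) for s in sequences]
    centroids = kmeans(vectors, k)
    return [nearest(v, centroids) for v in vectors]


sequences = ["CASSLGQAYEQYF", "CASSLGQSYEQYF", "CAWSVGGNTEAFF", "CAWSVGGNTEAFF"]
labels = superclusters(sequences, k=2)
```

Each entry of `labels` is the index of the supercluster for the sequence at the same position. In a real pipeline the K-means and nearest-centroid search would be delegated to Faiss, which is what makes this step fast on large repertoires.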
In the second step, ClusTCR reclusters each individual supercluster to accurately identify the specificity groups within it. Adaptive immune receptor repertoires can be represented as graphs in which nodes represent the sequences and edges represent similarity between sequences. To create this graph, ClusTCR uses efficient sequence hashing to find every pair of sequences with an edit distance of at most 1. Hamming distance (HD) is the default metric for similarity between TCRs, which implies that all sequences within a single cluster have equal length. Next, ClusTCR uses the resulting similarity graph to identify potential epitope-specific clusters. Evaluating the graph structure allows the interrogation of sequence-based relationships in the repertoire, because similar sequences share edges within the graph. To this end, ClusTCR applies the Markov clustering algorithm (MCL), which simulates stochastic flow inside the graph and identifies the dense network substructures where flow is high. These substructures correspond to dense groups of CDR3 sequences with similar sequence characteristics, and they form the clusters in ClusTCR’s output. This allows efficient and accurate clustering of relatively small sets of CDR3 sequences (up to 50 000). Combined with the first step, the clustering can be made efficient for any TCR dataset.
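The second step can be sketched in the same spirit. The code below is an illustrative, unoptimized version under stated assumptions, not ClusTCR's implementation: `hamming_pairs` finds pairs at Hamming distance 1 by hashing each sequence with one position masked, and `mcl` is a textbook dense-matrix Markov clustering loop (expansion by squaring, inflation by element-wise power). Both function names, the inflation value of 2.0 and the fixed iteration count are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def hamming_pairs(sequences):
    """Pairs of distinct sequences at Hamming distance 1, found by hashing."""
    buckets = defaultdict(set)
    for seq in set(sequences):
        for i in range(len(seq)):
            # Mask one position; sequences sharing a key differ only there.
            buckets[seq[:i] + "_" + seq[i + 1:]].add(seq)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def normalize(matrix):
    """Scale every column to sum to 1 (column-stochastic matrix)."""
    n = len(matrix)
    sums = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    return [[matrix[r][c] / sums[c] for c in range(n)] for r in range(n)]


def mcl(nodes, edges, inflation=2.0, iterations=50):
    """Minimal Markov clustering on an undirected, unweighted graph."""
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    # Adjacency matrix with self-loops on the diagonal.
    matrix = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    for a, b in edges:
        matrix[index[a]][index[b]] = matrix[index[b]][index[a]] = 1.0
    matrix = normalize(matrix)
    for _ in range(iterations):
        # Expansion: square the matrix to let flow spread along paths.
        matrix = [[sum(matrix[r][k] * matrix[k][c] for k in range(n))
                   for c in range(n)] for r in range(n)]
        # Inflation: raise entries to a power and renormalize,
        # which strengthens strong flows and weakens weak ones.
        matrix = normalize([[v ** inflation for v in row] for row in matrix])
    clusters = set()
    for r in range(n):
        if matrix[r][r] > 1e-6:  # row r belongs to an attractor
            clusters.add(frozenset(nodes[c] for c in range(n)
                                   if matrix[r][c] > 1e-6))
    return clusters


sequences = ["CASSLGQAYEQYF", "CASSLGQSYEQYF", "CASSLGQTYEQYF",
             "CAWSVGGNTEAFF", "CAWSVGGNTEAFY"]
pairs = hamming_pairs(sequences)
clusters = mcl(sorted(set(sequences)), pairs)
```

In this toy example the three `CASSLGQ?YEQYF` variants form one cluster and the two `CAWSVGGNTEAF?` variants form another. The dense matrix is only practical for small graphs; a sparse representation would be needed for a supercluster of realistic size.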