Keeping our organizers and attendees safe from malicious and unwanted spam is a challenge, especially given the high volume of messages routed through our system every day.
One way we deal with this volume of data is to cluster similar messages together to find patterns in sender behavior. For example, if someone is contacting thousands of different organizers with similar messages, that behavior is suspect and will be examined.
The big question is: how can we compare every single message we see with every other message efficiently and accurately? In this article, we’ll explore a technique known as Multi-Index Locality Sensitive Hashing.
To perform the comparison efficiently, we pre-process the data with a series of steps:
- Message Tokenization
- Locality Sensitive Hashing
- Multi-Index Optimization
Let’s first define what similar messages are. Here we have an example of two similar messages, A and B:
A = "Is there a dress code for this event? Thanks!"
B = "Hi, is there a DRESS CODE to this event"
To our human eyes they’re obviously similar, but we want to determine this similarity quantitatively. The solution is to break each message up into tokens, and then treat each message as a bag of tokens. The simplest, naive way to do tokenization is to split a message on spaces/punctuation and convert each character to lowercase. The result of tokenizing the messages above would be:
tokenize(A) -> tA = "is","there","a","dress","code","for","this","event","thanks"
tokenize(B) -> tB = "hi","is","there","a","dress","code","to","this","event"
I’ll leave it as an exercise to the reader to come up with more interesting ways to do tokenization that handle contractions, plurals, foreign languages, etc.
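As a minimal sketch, the naive split-and-lowercase tokenizer described above might look like this (English-only, and deliberately simple):

```python
import re

def tokenize(message):
    # Lowercase the message, then split on runs of anything that
    # isn't a lowercase letter or digit (spaces, punctuation, etc.).
    # Note: this naive version drops non-ASCII characters entirely.
    return [token for token in re.split(r"[^a-z0-9]+", message.lower()) if token]

tokenize("Is there a dress code for this event? Thanks!")
# -> ['is', 'there', 'a', 'dress', 'code', 'for', 'this', 'event', 'thanks']
```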
To calculate the similarity between these two bags of tokens, we’ll use an estimation known as the Jaccard Similarity Coefficient. This is defined as “the ratio of sizes of the intersection and union of A and B”. Therefore, in our example:
Similarity = Jaccard(A, B) = |A ∩ B| / |A ∪ B|
           = size(intersection(tA, tB)) / size(union(tA, tB))
           = size("is","there","a","dress","code","this","event") /
             size("is","there","a","dress","code","for","this","event","thanks","hi","to")
           = 7 / 11
           ≈ 0.64
We’ll then set a threshold above which we consider two messages to be similar. Then, given a set of M messages, we simply compute the similarity of each message to every other message. This works in theory, but in practice there are cases where this metric is unreliable (e.g. if one message is significantly longer than the other), not to mention horribly inefficient (O(N²M²), where N is the number of tokens per message). We need to do things smarter!
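The brute-force approach described above can be sketched as follows; the threshold of 0.5 is an illustrative assumption, not a value from production:

```python
from itertools import combinations

def naive_similar_pairs(messages, threshold=0.5):
    # Tokenize each message once, then compare every pair: O(M^2) pairs,
    # each costing a set intersection/union over the messages' tokens.
    token_sets = [set(m.lower().split()) for m in messages]
    similar = []
    for i, j in combinations(range(len(messages)), 2):
        a, b = token_sets[i], token_sets[j]
        union = a | b
        if union and len(a & b) / len(union) >= threshold:
            similar.append((i, j))
    return similar
```

This is exactly the quadratic blow-up the rest of the pre-processing pipeline is designed to avoid.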