John Smith. J Smith. Smith, John. How to find if this John is the same as that John!
Have you ever searched for a contact on your phone and come up with several duplicate or near-duplicate entries? (I’m fairly certain this isn’t just me!) For me, this tends to happen when I forget that I already have a particular contact saved and create a new one for a new number. Duplicate contacts on a phone are a fairly minor annoyance and, despite my crappy memory, a fairly infrequent one at that.
However, for companies and organizations with huge databases of client information maintained by many different people, it is quite common to have multiple entries for the same entity with tiny variations in the data. These variations could include, for example, misspelt names, addresses written differently, use of special characters, and abbreviated/non-abbreviated names. Such duplicate entries can end up exploding the size of databases, which in turn can slow down entire systems. They can also make proper data analysis difficult and can even cause misleading results.
Duplicate entries in this context refer to multiple entries created by mistake for the same entity. The entries are duplicate conceptually but are different in terms of the raw information content.
Removing these duplicates has obvious benefits but it is surprisingly difficult to do! It can be done manually for small databases but quickly becomes impractical as the database size grows. Rule-based methods, with string matching, regex, etc., can be used, but the ways in which near-duplicates vary are quite diverse and are typically hard to capture with fixed rules.
Looking for a solution to this problem, we took a step back and really looked at why a human can (usually) easily tell whether two entries are duplicates despite the actual data being different. The reason, of course, is that we look at the meaning behind the data, rather than doing string matching in our heads!
For example, a human can deduce that Liverpool Football Club, Liverpool FC, and LFC all refer to the same entity, especially if the rest of the information (address, phone number, etc.) is also similar. Writing a fixed rule to look for this kind of relationship, however, is not straightforward, especially considering that there could be countless such variations, which also differ in type from field to field (phone numbers with/without country code, “street” abbreviated to “St.” in addresses, etc.).
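To make the limits of plain string matching concrete, here is a minimal sketch using difflib from the Python standard library. It is only an illustration of a character-level comparison; the helper name string_similarity is made up for this example and is not part of any library.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Character-level similarity between two strings, in the range [0, 1].

    Hypothetical helper for illustration only: it lowercases both inputs
    and delegates to difflib.SequenceMatcher from the standard library.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


full_name = "Liverpool Football Club"

# A partial abbreviation still shares most of its characters with the full name.
score_fc = string_similarity(full_name, "Liverpool FC")

# The initialism shares only three characters, so the score drops sharply
# even though a human reads it as the same club.
score_lfc = string_similarity(full_name, "LFC")

print(f"Liverpool FC: {score_fc:.2f}")
print(f"LFC:          {score_lfc:.2f}")
```

The initialism scores far lower than the partial abbreviation, so any single cut-off on a character-level score would either miss it or let through a lot of unrelated entries.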
Our solution essentially boils down to the same “logic” that a human would use. Instead of only trying to match characters and strings, we also match the meaning behind the data.
Fortunately, the incredible progress in Natural Language Processing in the last few years has enabled us to accurately represent natural text as numerical vectors which encode the semantic meaning of the original text. The Sentence Transformers library supports this task by making it easy to generate accurate representations for sentences with state-of-the-art Transformer models.
The code snippets (not production level) below provide a rough demonstration of the process.
Let’s assume we start with a list of text (corpus), where each list item represents an entry in a database (the list item could simply be the values concatenated together).
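A rough sketch of this step could look like the following. The records, the field names, the record_to_text and embed_corpus helpers, and the choice of the all-MiniLM-L6-v2 model are all assumptions made for illustration; any pretrained Sentence Transformers model could be swapped in.

```python
from typing import Dict, List

# Made-up records standing in for database rows (illustrative data only).
records: List[Dict[str, str]] = [
    {"name": "Liverpool Football Club", "address": "Anfield Road, Liverpool", "phone": "+44 151 000 0000"},
    {"name": "Liverpool FC", "address": "Anfield Rd, Liverpool", "phone": "0151 000 0000"},
    {"name": "John Smith", "address": "12 High Street, Leeds", "phone": ""},
]


def record_to_text(record: Dict[str, str]) -> str:
    """Concatenate the non-empty field values of one record into a single string."""
    return " ".join(str(value) for value in record.values() if value)


# The corpus: one string per database entry.
corpus: List[str] = [record_to_text(record) for record in records]


def embed_corpus(texts: List[str], model_name: str = "all-MiniLM-L6-v2"):
    """Encode each text into an embedding vector with Sentence Transformers.

    The import is done inside the function so the rest of this sketch runs
    even when the library is not installed. The model name is an assumption.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    # encode() returns one embedding per input text.
    return model.encode(texts, convert_to_tensor=True)


# Usage (downloads the model on first call):
# embeddings = embed_corpus(corpus)
```

Concatenating the fields is the simplest option. Depending on the data, it may work better to embed some fields separately, but that is beyond this rough demonstration.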
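With the Sentence Transformers library, creating the embeddings for the text is that easy: load a pretrained model and pass the corpus to its encode method.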
Comparing a pair of embeddings and obtaining the similarity score between them is now quite simple. Since the embeddings are nothing more than vectors of numbers, we can use cosine similarity to find how similar they are to each other.
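For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below spells that out in plain Python on toy three-dimensional vectors (real embeddings have hundreds of dimensions). The function and variable names are hypothetical; in practice the sentence_transformers.util module provides a cos_sim helper that does the same thing on tensors.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings.
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a, different length
vec_c = [-3.0, 0.0, 1.0]  # perpendicular to vec_a (dot product is 0)

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Vectors pointing in the same direction score close to 1 regardless of their length, while perpendicular vectors score 0, which is why the score works as a measure of how close two meanings are.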
As we are interested in finding duplicates, we want to compare each embedding against all the other embeddings and flag the pairs whose similarity score is high.
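A brute-force version of this all-pairs comparison can be sketched as follows. The toy embeddings, the find_duplicate_pairs helper, and the 0.95 threshold are assumptions for illustration; a sensible threshold depends on the model and the data and needs tuning. Comparing every pair is quadratic in the number of entries, so for larger corpora the library's util.paraphrase_mining helper is probably a better fit.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicate_pairs(
    embeddings: Sequence[Sequence[float]], threshold: float = 0.8
) -> List[Tuple[int, int, float]]:
    """Return (i, j, score) for every pair of entries scoring at or above the threshold.

    Each unordered pair is compared exactly once; results are sorted with the
    most similar pair first.
    """
    pairs = []
    for i, j in combinations(range(len(embeddings)), 2):
        score = cosine_similarity(embeddings[i], embeddings[j])
        if score >= threshold:
            pairs.append((i, j, score))
    pairs.sort(key=lambda pair: pair[2], reverse=True)
    return pairs


# Toy 3-dimensional vectors standing in for real embeddings.
toy_embeddings = [
    [0.9, 0.1, 0.0],  # entry 0
    [0.8, 0.2, 0.0],  # entry 1: points in nearly the same direction as entry 0
    [0.0, 0.1, 0.9],  # entry 2: points in a very different direction
]

duplicates = find_duplicate_pairs(toy_embeddings, threshold=0.95)
for i, j, score in duplicates:
    print(f"Entries {i} and {j} look like duplicates (score {score:.2f})")
```

The flagged pairs are candidates rather than verdicts, so it is worth reviewing them (or at least the borderline ones) before merging or deleting anything.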
That’s it! A simple but powerful way to find duplicates and clean up a database.
I am a consultant in Deep Learning and AI-related technology for Skil.ai. As part of the Deep Learning Research team at Skil.ai, we work towards making AI accessible to small businesses and big tech alike. This article is aimed towards sharing our knowledge.