I have two strings that represent two institutions. For instance,
a1="University of Milan"
a2="University Milan"
or
a1="University of Milan"
a2="Università di Milano"
I have to tell whether they refer to the same institution.
I built two binary classifiers.
One is based on a fuzzy (character-level edit) distance between the strings, i.e., the amount of work needed to change one string into the other (how many characters must be deleted, moved, or inserted).
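To make the character-based distance concrete, here is a minimal sketch. It assumes a plain Levenshtein distance (insertions, deletions, and substitutions, each costing 1), which is only one possible choice and may not match the exact fuzzy distance used above. The function name `levenshtein` is illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


a1 = "University of Milan"
a2 = "University Milan"
print(levenshtein(a1, a2))  # 3: the characters "o", "f" and one space are deleted
```

In practice the raw count is often divided by the length of the longer string so that a single threshold can be applied to names of different lengths.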
The other is based on a semantic similarity-based distance measure between the two strings, computed using word embeddings.
I don't have a training set, since the character-based distance requires no training and the word embedding model comes pre-trained.
I built a validation set and found, for each of the two distance measures, the threshold that keeps the false positive rate (FPR) below 1% on that set.
At these thresholds, the true positive rate (TPR) of the embeddings-based classifier is higher than that of the character-based one, because the two strings may give the institution's name in different languages (and my embedding model is multilingual).
The AUC of the embeddings-based classifier is also higher than that of the character-based one.
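The threshold search can be sketched as follows. This is an illustration under stated assumptions, not the exact procedure I used: a pair is predicted "same institution" when its distance is at or below the threshold, `labels` is True for pairs that really are the same institution, and the names `fpr_tpr` and `pick_threshold` are hypothetical.

```python
def fpr_tpr(distances, labels, threshold):
    """FPR and TPR of the rule 'distance <= threshold means same institution'."""
    tp = fp = pos = neg = 0
    for d, same in zip(distances, labels):
        predicted = d <= threshold
        if same:
            pos += 1
            tp += predicted
        else:
            neg += 1
            fp += predicted
    if pos == 0 or neg == 0:
        raise ValueError("validation set needs both positive and negative pairs")
    return fp / neg, tp / pos


def pick_threshold(distances, labels, max_fpr=0.01):
    """Largest observed distance whose FPR on the validation set is below max_fpr.

    Returns None when even the smallest distance already breaks the limit."""
    best = None
    for t in sorted(set(distances)):
        fpr, _ = fpr_tpr(distances, labels, t)
        if fpr >= max_fpr:
            break  # FPR never decreases as the threshold grows
        best = t
    return best
```

The same function would be run once per distance measure, giving one threshold for the character-based classifier and one for the embeddings-based classifier.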
The problem is that the embeddings-based classifier is much slower than the character-based one.
My strategy is therefore to classify each pair with the character-based classifier first and, only if it returns a negative, to try again with the embeddings-based classifier, which has the same FPR but a higher recall (TPR).
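A minimal sketch of this two-stage strategy is below. Both distance functions are passed in as parameters, so `char_distance` and `emb_distance` are placeholders for whatever implementations are actually used, and the function names are hypothetical. The second helper measures the FPR and TPR of the combined rule on the validation set, since a pair is accepted as soon as either stage accepts it.

```python
def cascade_match(a, b, char_distance, char_threshold, emb_distance, emb_threshold):
    """Run the cheap character-based test first; call the slow
    embeddings-based test only when the first stage says 'different'."""
    if char_distance(a, b) <= char_threshold:
        return True
    return emb_distance(a, b) <= emb_threshold


def cascade_rates(pairs, labels, **stages):
    """FPR and TPR of the whole cascade on a labelled validation set.

    pairs:  list of (a, b) string pairs
    labels: True when the pair really is the same institution
    stages: keyword arguments forwarded to cascade_match"""
    tp = fp = pos = neg = 0
    for (a, b), same in zip(pairs, labels):
        predicted = cascade_match(a, b, **stages)
        if same:
            pos += 1
            tp += predicted
        else:
            neg += 1
            fp += predicted
    if pos == 0 or neg == 0:
        raise ValueError("validation set needs both positive and negative pairs")
    return fp / neg, tp / pos
```

Calling `cascade_rates` with the two thresholds found on the validation set gives the error rates of the combined classifier directly, which can then be compared with those of each classifier on its own.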
Is this strategy correct?