$\begingroup$

I have two strings that represent two institutions. For instance,

a1="University of Milan"
a2="University Milan"

or

a1="University of Milan"
a2="Università di Milano"

I have to tell whether they refer to the same institution.

I built two binary classifiers.

One is based on a fuzzy (edit) distance between the strings, i.e., the amount of work needed to change one string into the other: how many characters you have to delete, move, or insert.
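For illustration only, here is a minimal sketch of one common character-based distance, the Levenshtein distance. This is an assumption about what "fuzzy distance" means here: it counts insertions, deletions, and substitutions, and does not model the "move" operation mentioned above. The function names are made up for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn one string into the other."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # drop a character of a
                current[j - 1] + 1,      # add a character of b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the length of the longer string."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

With this definition, "University of Milan" and "University Milan" are at distance 3 (the three characters of "of " are deleted). Normalizing by length makes one threshold usable across short and long names.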

The other is based on a semantic similarity-based distance measure between the two strings, computed using word embeddings.

I don't have a training set, since neither classifier needs one: the character-based distance requires no training, and the word-embedding model comes pre-trained.

I built a validation set and, for each of the two distance measures, found the threshold that keeps the false positive rate (FPR) below 1% on it.
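A rough sketch of how such a threshold could be picked, assuming a pair is predicted "same institution" when its distance is at most the threshold. The function names and the brute-force scan over observed distances are illustrative choices, not necessarily what I actually ran.

```python
def rates(distances, labels, threshold):
    """Return (TPR, FPR) for the rule: predict a match when distance <= threshold."""
    tp = fp = pos = neg = 0
    for d, is_match in zip(distances, labels):
        predicted = d <= threshold
        if is_match:
            pos += 1
            tp += predicted
        else:
            neg += 1
            fp += predicted
    return (tp / pos if pos else 0.0), (fp / neg if neg else 0.0)


def pick_threshold(distances, labels, max_fpr=0.01):
    """Largest observed distance whose validation FPR stays <= max_fpr.

    Returns (threshold, tpr, fpr), or None if no threshold qualifies.
    """
    best = None
    for t in sorted(set(distances)):
        tpr, fpr = rates(distances, labels, t)
        # TPR never decreases as t grows, so ties go to the larger threshold.
        if fpr <= max_fpr and (best is None or tpr >= best[1]):
            best = (t, tpr, fpr)
    return best
```

Note that with a small validation set a 1% FPR is estimated from very few negative pairs, so the chosen threshold can be noisy.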

At these thresholds, the TPR of the embeddings-based classifier is higher than that of the character-based one, because the two strings may give the institution's name in different languages (and my embedding model is multilingual).

The AUC of the embeddings-based classifier is also higher than that of the character-based one (although a higher TPR at a single FPR does not by itself imply a higher AUC).

The problem is that the embeddings-based classifier is much slower than the character-based one.

My strategy is then to classify the strings with the character-based classifier and, if that classifier gives me a negative outcome, try again with the embeddings-based classifier, which has the same FPR but higher recall.
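In code, the strategy looks roughly like the sketch below. `fast` and `slow` stand for the two already-thresholded classifiers, and the helper names are hypothetical. Because the combined rule predicts a match when either classifier does, its FPR lies somewhere between the larger of the two individual FPRs and their sum, so the second helper re-measures it on the validation pairs.

```python
from typing import Callable, Iterable, Tuple

Classifier = Callable[[str, str], bool]


def make_fallback_matcher(fast: Classifier, slow: Classifier) -> Classifier:
    """Combine two classifiers: run `slow` only when `fast` says "no match"."""
    def match(a: str, b: str) -> bool:
        # `or` short-circuits, so `slow` is skipped whenever `fast` is positive.
        return fast(a, b) or slow(a, b)
    return match


def measure_rates(match: Classifier,
                  pairs: Iterable[Tuple[str, str]],
                  labels: Iterable[bool]) -> Tuple[float, float]:
    """Return (TPR, FPR) of `match` on labelled validation pairs."""
    tp = fp = pos = neg = 0
    for (a, b), is_match in zip(pairs, labels):
        predicted = match(a, b)
        if is_match:
            pos += 1
            tp += predicted
        else:
            neg += 1
            fp += predicted
    return (tp / pos if pos else 0.0), (fp / neg if neg else 0.0)
```

The speed-up depends on how often the fast classifier is positive: every pair it rejects, including all true non-matches, still goes through the slow classifier.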

Is this strategy correct?

$\endgroup$

1 Answer

$\begingroup$

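Yes, this is a sound strategy. If you also fed the output of the first classifier to the second, it would become a cascading classifier, which is a form of ensemble learning. One caveat: the combined rule predicts a match whenever either classifier does, so its FPR can exceed 1% (it is bounded by the sum of the two individual FPRs). Re-check it on your validation set.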

This goes a bit beyond the scope of what you asked, but:

  • If you know roughly which institutions and languages you'll be dealing with, you could build a simple lookup for some common cases.
  • I can also imagine that many institution names contain a description of the institution (e.g., school, department, university, institute) followed by a qualifier (e.g., a country, a city name, a person's name). You could probably parse each string to separate these parts and then match on the individual components (e.g., both are universities, but one is in Milan and the other in Rome).
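A minimal sketch of the parsing idea from the second bullet. The keyword table and stopword list are hypothetical and far too small for real use; they only illustrate separating the institution type from the qualifier.

```python
import re
import unicodedata

# Hypothetical lookup: surface form -> canonical institution type.
INSTITUTION_TYPES = {
    "university": "university", "universita": "university",
    "universite": "university", "universidad": "university",
    "institute": "institute", "istituto": "institute",
    "school": "school", "scuola": "school",
    "department": "department", "dipartimento": "department",
}
# Hypothetical list of filler words to ignore.
STOPWORDS = {"of", "the", "di", "de", "degli", "del", "della", "studi"}


def strip_accents(text: str) -> str:
    """Remove diacritics, e.g. 'Università' -> 'Universita'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def parse_institution(name: str):
    """Split a name into (set of institution types, set of qualifier tokens)."""
    tokens = re.findall(r"[a-z]+", strip_accents(name).lower())
    kinds = {INSTITUTION_TYPES[t] for t in tokens if t in INSTITUTION_TYPES}
    qualifiers = {t for t in tokens
                  if t not in INSTITUTION_TYPES and t not in STOPWORDS}
    return kinds, qualifiers
```

For "University of Milan" and "Università di Milano" this yields the same type, `university`, with qualifiers `milan` and `milano`. The qualifiers would still need a fuzzy comparison or a place-name lookup, but a pair such as Milan versus Rome can be rejected cheaply.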
$\endgroup$
  • $\begingroup$ Hmm, but a cascading classifier would lose the performance benefit of running only the fast classifier when it gives a positive outcome. Is there a name for the particular strategy I'm using? The Wikipedia page you linked talks about "stacking ensembles", but that seems to be yet another thing. $\endgroup$ Commented 13 hours ago
  • $\begingroup$ You're not forced to go on to the second step if you're satisfied with the result from the first classifier. $\endgroup$ Commented 11 hours ago
  • $\begingroup$ I'm not satisfied with the result of the first classifier. Its recall is too low. $\endgroup$ Commented 10 hours ago
