Feature: "American Time" aboard the "Science" (1)

Question

I have two strings that represent two institutions. For instance,

a1="University of Milan"
a2="University Milan"

or

a1="University of Milan"
a2="Università di Milano"

I have to tell whether they refer to the same institution.

I built two binary classifiers.

One is based on the fuzzy (edit) distance between the strings, i.e., the amount of work needed to change one string into the other (how many characters to delete, how many to move, how many to insert).
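The question does not say which fuzzy distance is used. As a minimal sketch, assuming plain Levenshtein distance (insertions, deletions, and substitutions, with no "move" operation), a character-based distance could look like this. The function names `levenshtein` and `normalized_distance` are illustrative, not taken from the question.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


a1 = "University of Milan"
a2 = "University Milan"
print(levenshtein(a1, a2), normalized_distance(a1, a2))
```

Normalizing by length makes a single threshold more meaningful across short and long names, though whether that matches the questioner's setup is an assumption.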

The other is based on a semantic similarity-based distance measure between the two strings, computed using word embeddings.
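The question does not name the embedding model or the distance used on top of it. A common choice is cosine distance between the two embedding vectors; the sketch below assumes that choice and takes the vectors as plain lists of floats, so the (unspecified) multilingual model that produces them is left out.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Toy vectors standing in for the embeddings of two institution names.
print(cosine_distance([1.0, 2.0], [2.0, 4.0]))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```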

I don't have a training set as the character-based distance does not require training, and the word embedding model comes pre-trained.

I built a validation set and found out the threshold of the two distance measures that guarantees an FPR below 1%.
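One way to pick such a threshold, sketched under the assumption that a pair is predicted "same institution" when its distance is at or below the threshold: scan candidate thresholds in increasing order and keep the largest one whose false positive rate on the validation set stays strictly below the target. The function name `pick_threshold` and the toy data are hypothetical.

```python
def pick_threshold(distances, labels, max_fpr=0.01):
    """Return (threshold, fpr, tpr) for the largest threshold whose FPR
    is strictly below max_fpr, or None if no threshold qualifies.

    labels[i] is True when pair i really is the same institution.
    """
    positives = [d for d, y in zip(distances, labels) if y]
    negatives = [d for d, y in zip(distances, labels) if not y]
    if not positives or not negatives:
        raise ValueError("validation set needs both positive and negative pairs")
    best = None
    for t in sorted(set(distances)):
        fpr = sum(d <= t for d in negatives) / len(negatives)
        if fpr >= max_fpr:
            break  # FPR only grows as the threshold grows
        tpr = sum(d <= t for d in positives) / len(positives)
        best = (t, fpr, tpr)
    return best


# Toy validation set; max_fpr is loosened because there are only two negatives.
distances = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
labels = [True, True, True, False, True, False]
print(pick_threshold(distances, labels, max_fpr=0.5))
```

With a 1% target, the validation set needs well over a hundred negative pairs for the estimated FPR to be meaningful at all.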

At these thresholds, the TPR of the embeddings-based classifier is higher than that of the character-based one, because the two strings can give the institution's name in different languages (and my embedding model is multilingual).

Stated otherwise, the AUC of the embeddings-based classifier is higher than the character-based one.

The problem is that the embeddings-based classifier is much slower than the character-based one.

My strategy is then to classify the strings with the character-based classifier and, if that classifier gives me a negative outcome, try again with the embeddings-based classifier, which has the same FPR but higher recall.
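The fallback described here can be sketched as a short-circuiting "or": the slow classifier runs only when the fast one says no. Since a pair is accepted when either classifier accepts it, the combined false positives are the union of the two sets, so the combined FPR can be as high as the sum of the individual FPRs; it may be worth re-measuring it on the validation set. The names `cascade` and `rates`, and the two toy matchers, are hypothetical stand-ins for the thresholded classifiers.

```python
def cascade(a, b, fast_match, slow_match):
    """Accept if the fast classifier accepts; otherwise ask the slow one."""
    return fast_match(a, b) or slow_match(a, b)  # `or` skips slow_match on a hit


def rates(pairs, labels, predict):
    """Return (TPR, FPR) of predict(a, b) over labelled validation pairs."""
    tp = fp = pos = neg = 0
    for (a, b), same in zip(pairs, labels):
        predicted = bool(predict(a, b))
        if same:
            pos += 1
            tp += predicted
        else:
            neg += 1
            fp += predicted
    return tp / pos, fp / neg


# Toy stand-ins: the real matchers would compare a distance to a threshold.
def fast(a, b):
    return a.lower() == b.lower()


def slow(a, b):
    return {a, b} == {"University of Milan", "Università di Milano"}


pairs = [
    ("University of Milan", "university of milan"),
    ("University of Milan", "Università di Milano"),
    ("University of Milan", "University of Rome"),
]
labels = [True, True, False]
print(rates(pairs, labels, fast))
print(rates(pairs, labels, lambda a, b: cascade(a, b, fast, slow)))
```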

Is this strategy correct?

Valentin Calomme · Accepted Answer · 2025-08-08 08:09:14Z

Yes, this is a sound strategy. If you also fed the output of the first classifier into the second, it would become a cascade of classifiers (cascading), which is a form of ensemble learning.

This goes a bit beyond the scope of what you asked, but:

If you know roughly which institutions and languages you'll be dealing with, you could build a simple lookup for some common cases.
I can also imagine that many institution names contain a description of the institution (e.g., school, department, university, institute) followed by a qualifier (e.g., a country, a city name, a person's name). You could probably parse each string to separate these parts and then match on the individual components (e.g., both are universities, but one is in Milan and the other in Rome).
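A rough sketch of the component idea, assuming a small hand-made vocabulary: each name is split into institution-type words (mapped to a canonical type across languages) and qualifier words. The tables `TYPE_WORDS` and `STOP_WORDS` and the functions `parse` and `types_conflict` are purely illustrative; a real vocabulary would have to be built for the institutions and languages at hand.

```python
# Hypothetical vocabulary: type words in a few languages -> canonical type.
TYPE_WORDS = {
    "university": "university",
    "università": "university",
    "universität": "university",
    "institute": "institute",
    "istituto": "institute",
    "school": "school",
    "department": "department",
}
STOP_WORDS = {"of", "di", "the", "degli", "studi"}


def parse(name: str):
    """Split a name into (canonical types, qualifier tokens)."""
    types, qualifiers = set(), set()
    for token in name.lower().replace(",", " ").split():
        if token in STOP_WORDS:
            continue
        if token in TYPE_WORDS:
            types.add(TYPE_WORDS[token])
        else:
            qualifiers.add(token)
    return types, qualifiers


def types_conflict(a: str, b: str) -> bool:
    """True when both names state a type and the types do not overlap."""
    types_a, _ = parse(a)
    types_b, _ = parse(b)
    return bool(types_a) and bool(types_b) and types_a.isdisjoint(types_b)


print(parse("University of Milan"))
print(parse("Università di Milano"))
print(types_conflict("University of Milan", "Institute of Milan"))
```

The qualifiers still differ across languages ("milan" vs "milano"), so they would need their own comparison, for example with one of the two distance measures from the question.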


uhm, but the cascading classifier would lose the performance benefit of using only the fast classifier when it gives a positive outcome. Is there a name for the particular strategy I'm using? The wikipedia page you linked talks about "stacking ensembles", but it seems to be yet another thing
– robertspierre
Commented 13 hours ago
You're not forced to go towards the other step if you're satisfied with the results from the first classifier
– Valentin Calomme
Commented 11 hours ago
I'm not satisfied with the result of the first classifier. Its recall is too low.
– robertspierre
Commented 10 hours ago
