Full-text search relevance is measured in?

- Last updated: 2015-12-09
Question:

I am making a quiz system, and when quizmakers insert questions into the Question Bank, I am to check the DB for duplicate / very highly similar questions.

Testing MySQL's MATCH() ... AGAINST(), the highest relevance I get is 30+, when I test against a 100% similar string.

So what exactly is the relevance? To quote the manual:

Relevance values are non-negative floating-point numbers. Zero relevance means no similarity. Relevance is computed based on the number of words in the row, the number of unique words in that row, the total number of words in the collection, and the number of documents (rows) that contain a particular word.

My problem is how to use the relevance value to test whether a string is a duplicate. If it's a 100% duplicate, prevent it from being inserted into the Question Bank. But if it is only somewhat similar, prompt the quizmaker to verify whether to insert it or not. So how do I do that? 30+ for a 100% identical string is not a percentage, so I'm stumped.

Thanks in advance.

Answer:

andygeers is on the right track: those numbers have no empirical meaning other than their relation to each other, and they cannot be used on their own to determine what is or is not an "exact match". You need to determine that yourself. Even aside from the limitations of fulltext search ranking, there's also the open question of just what you consider to constitute an "exact match". (Actual text only, or do soundex matches count? Do synonyms (e.g., "couch" vs. "sofa") count as matching or as distinct? Should an attempt be made to compensate for misspellings? Etc.)

If I had the need to perform such a check, I would grab only the highest-ranked entry returned by the fulltext search, remove any designated stopwords, normalize whitespace, convert to lowercase, do the comparison, and leave it at that until I encountered a case that called for it to be refined further. It's not really all that much extra work - if you specify the language you're using for your application, you could probably find someone around here who could write the normalization function within a dozen or so lines of code.
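A minimal sketch of such a normalization function in Python, for illustration only. The `STOPWORDS` set and the names `normalize` and `is_exact_duplicate` are hypothetical, and stripping punctuation is an extra assumption beyond the steps listed above. In practice the stopword list should mirror whatever list the fulltext index itself uses.

```python
import re

# Hypothetical stopword list; ideally it mirrors the list the fulltext
# index is configured with.
STOPWORDS = {"a", "an", "and", "is", "of", "the", "to", "what"}


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and stopwords, and collapse whitespace."""
    # Assumption: only ASCII letters and digits matter; everything else
    # (punctuation, extra spaces) acts as a word separator.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(w for w in words if w not in STOPWORDS)


def is_exact_duplicate(new_question: str, top_hit: str) -> bool:
    """Compare a new question with the highest-ranked fulltext hit."""
    return normalize(new_question) == normalize(top_hit)
```

A new question would be rejected only when `is_exact_duplicate` returns true for the top-ranked row; any other fulltext hit could still be shown to the quizmaker for manual review.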

Answer:

The basic data structure for a text retrieval system is an Inverted Index. This is essentially a list of words found in the document collection with a list of the documents they occur in. It can also have metadata about the occurrence for each document, such as the number of times the word appears.
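As a rough illustration of that structure, the sketch below builds a tiny inverted index in Python. The function name `build_inverted_index` and the sample documents are made up for this example, and splitting on whitespace is a simplifying assumption; a real engine tokenizes far more carefully.

```python
from collections import defaultdict


def build_inverted_index(docs: dict[int, str]) -> dict[str, dict[int, int]]:
    """Map each word to {doc_id: number of occurrences in that document}."""
    index: dict[str, dict[int, int]] = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumption: lowercase and split on whitespace is enough here.
        for word in text.lower().split():
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return dict(index)


# Hypothetical sample data.
docs = {
    1: "what is the capital of france",
    2: "what is the capital of spain",
    3: "name the capital city of france",
}
index = build_inverted_index(docs)
print(index["france"])   # {1: 1, 3: 1}
print(index["capital"])  # {1: 1, 2: 1, 3: 1}
```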

Documents containing the words can be queried by matching on the search terms. To determine relevance, a heuristic known as a Cosine Ranking is calculated on the hits. This works by constructing an n-dimensional vector with one component for each of the n search terms. You can also weight the search terms if desired. This vector gives a point in n-dimensional space that corresponds to your search terms.

A similar vector based on the weighted occurrences in each document can be constructed from the inverted index, with each axis corresponding to the axis for the same search term. If you normalise both vectors to unit length and calculate their dot product, you get the cosine of the angle between them. A cosine of 1.0 is cos(0), which means the vectors lie along a common line from the origin. The closer together the vectors are, the smaller the angle and the closer the cosine is to 1.0.
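The cosine calculation can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not what MySQL does internally: it uses raw word counts as weights, builds the document vector over all of the document's words rather than only the search terms, and the name `cosine_similarity` is hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(query: str, document: str) -> float:
    """Cosine of the angle between two bag-of-words count vectors."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    # Dot product over the query terms; Counter yields 0 for missing words.
    dot = sum(count * d[term] for term, count in q.items())
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    norm_d = math.sqrt(sum(c * c for c in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_q * norm_d)
```

Unlike the raw relevance number, this value is bounded between 0 and 1 for count vectors. It ignores word order, though, so a score close to 1.0 still does not prove the two strings are identical.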

If you sort the search results by the cosine (or bung them into a priority queue as mg does) you get the most relevant documents first. Cleverer relevance algorithms tend to fiddle with the weights of the search terms, skewing the dot product in favour of terms with high relevance.
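The ranking step might look like the following sketch, which uses Python's `heapq` module as the priority queue. The `top_matches` helper and the scores are invented for illustration.

```python
import heapq


def top_matches(scores: dict[int, float], k: int = 3) -> list[tuple[int, float]]:
    """Return the k highest-scoring (doc_id, score) pairs, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical cosine scores keyed by document id.
scores = {1: 0.67, 2: 0.12, 3: 0.95, 4: 0.40}
print(top_matches(scores, k=2))  # [(3, 0.95), (1, 0.67)]
```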

If you want to dig a little, Managing Gigabytes by Witten, Moffat, and Bell discusses the internal architecture of text retrieval systems.

Answer:

I don't know the specifics of the MySQL function you're using, but I imagine it could be that there is no absolute meaning for those numbers - they're just designed to be compared with other values produced by the same function. To check for an absolute match you could select out the text itself and compare manually.

Comment: I prefer to use MySQL's search engine whenever possible. If I were to do the comparison myself, I would need to do a lot of preparation and checking, e.g. remove all whitespace and special characters, convert everything to uppercase, and whatnot. That is my last resort.