How can I create an ordered list of the most common substrings inside of my MySQL varchar column?

- Content updated: 2014-12-30

I have a MySQL database table with a couple thousand rows. The table is set up like so:

id | text

The id column is an auto-incrementing integer, and the text column is a 200-character varchar.

Say I have the following rows:

3 | I think I'll have duck tonight

4 | Maybe the chicken will be alright

5 | I have a pet duck now, awesome!

6 | I love duck

Then the list I'm wanting to generate might be something like:

  • 3 occurrences of 'duck'
  • 3 occurrences of 'I'
  • 2 occurrences of 'have'
  • 1 occurrence of 'chicken'
  • etc.

Plus, I'll probably want to maintain a list of substrings to exclude from the list, like 'I', 'will' and 'have'. It's important to note that I do not know what people will post.

I do not have a list of words that I want to monitor, I just want to find the most common substrings. I'll then filter out any erroneous substrings that are not interesting from the list manually by editing the query.

Can anyone suggest the best way to do this? Thanks everyone!

Answer:

MySQL already does this for you.

First, make sure your table is a MyISAM table.

Then define a FULLTEXT index on your column.

On a shell command line navigate to the folder where your MySQL data is stored, then type:

myisam_ftdump -c yourtablename 1 >wordfreq.dump

You can then process wordfreq.dump to drop the unwanted global-weight column and sort by frequency, descending.

You could do all the above with a single command line and some sed/awk wizardry no doubt. And you could incorporate it into your program without needing a dump file.
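
As a rough sketch of that post-processing step, here it is in Python rather than sed/awk. It assumes each useful line of wordfreq.dump holds three whitespace-separated fields (count, global weight, word), which is what I'd expect from the -c option, but check it against your own dump. The function names parse_ftdump and print_frequencies are made up for illustration.

```python
def parse_ftdump(lines):
    """Return (word, count) pairs sorted by count, highest first.

    Assumes each data line looks like "count weight word"; any line
    that does not match that shape is skipped.
    """
    rows = []
    for line in lines:
        parts = line.split()
        # Skip blank lines and anything not shaped like "count weight word".
        if len(parts) != 3 or not parts[0].isdigit():
            continue
        count, _weight, word = parts  # the weight column is discarded
        rows.append((word, int(count)))
    # Highest count first; ties are broken alphabetically by word.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


def print_frequencies(path="wordfreq.dump"):
    """Print "count<TAB>word" for every word in the dump file."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        for word, count in parse_ftdump(fh):
            print(f"{count}\t{word}")
```

Calling print_frequencies("wordfreq.dump") prints one count and word per line, most frequent first.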

More info on myisam_ftdump here: http://dev.mysql.com/doc/refman/5.0/en/myisam-ftdump.html

Oh... one more thing: the default stopword list for MySQL is compiled into the engine, and by default words of 3 or fewer characters are not indexed. The full list is here:

http://dev.mysql.com/doc/refman/5.0/en/fulltext-stopwords.html

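If this list isn't adequate for your needs, or you need words with fewer than 4 characters to count, you can set the ft_stopword_file and ft_min_word_len server variables, restart the server, and rebuild the FULLTEXT index. The alternative is to recompile MySQL with different rules for FULLTEXT. I don't recommend that!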

Reply from rmh: Fantastic, thank you!

Answer:

Extract to a flat file and then use your favorite quick language (Perl, Python, Ruby, etc.) to process the flat file.

If you don't have one of these languages as part of your skillset, this is a perfect little task to start using one, and it won't take you long.

Some database tasks are just so much easier to do OUTSIDE of the database.
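
For what it's worth, a minimal sketch of that in Python might look like the following. It assumes the text column has already been exported one row per line; the file name rows.txt, the IGNORE set, and the count_words helper are all placeholders I made up. The word-splitting rule (runs of letters, digits and apostrophes, lowercased) is also just one possible choice.

```python
import re
from collections import Counter

# Hypothetical ignore list; edit it as uninteresting words turn up.
IGNORE = {"i", "will", "have"}


def count_words(lines, ignore=IGNORE):
    """Count how often each word occurs across all lines, skipping ignored words."""
    counts = Counter()
    for line in lines:
        # Lowercase the line, then treat runs of letters, digits
        # and apostrophes as words.
        for word in re.findall(r"[a-z0-9']+", line.lower()):
            if word not in ignore:
                counts[word] += 1
    return counts


def print_top(path="rows.txt", limit=20):
    """Print the most common words in the exported file, most frequent first."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        for word, count in count_words(fh).most_common(limit):
            print(f"{count} occurrences of '{word}'")
```

Run print_top("rows.txt") after each export, and add to IGNORE whenever a boring word shows up near the top.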

Answer:

You might want to look into MySQL's full-text parser plugins.