How can I create an ordered list of the most common substrings inside of my MySQL varchar column?
I have a MySQL database table with a couple thousand rows. The table is set up like so:
id | text
The id column is an auto-incrementing integer, and the text column is a 200-character varchar.
Say I have the following rows:
3 | I think I'll have duck tonight
4 | Maybe the chicken will be alright
5 | I have a pet duck now, awesome!
6 | I love duck
Then the list I'm wanting to generate might be something like:
- 3 occurrences of 'duck'
- 3 occurrences of 'I'
- 2 occurrences of 'have'
- 1 occurrence of 'chicken'
- etc.
Plus, I'll probably want to maintain a list of substrings to ignore, like 'I', 'will' and 'have'. It's important to note that I do not know what people will post.
I do not have a list of words that I want to monitor; I just want to find the most common substrings. I'll then filter out any uninteresting substrings manually by editing the query.
Can anyone suggest the best way to do this? Thanks everyone!
MySQL already does this for you.
First, make sure your table is a MyISAM table.
Then define a FULLTEXT index on your column.
On a shell command line navigate to the folder where your MySQL data is stored, then type:
myisam_ftdump -c yourtablename 1 >wordfreq.dump
You can then process wordfreq.dump to eliminate the unwanted column and sort by frequency descending.
You could do all the above with a single command line and some sed/awk wizardry no doubt. And you could incorporate it into your program without needing a dump file.
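As a rough sketch of that post-processing step, the Python below reads the dump text, drops the middle column, and sorts by count. It assumes each line of `myisam_ftdump -c` output has three whitespace-separated fields (count, weight, word); check your own dump before relying on that layout. The function name `parse_ftdump` and the sample lines are made up for illustration.

```python
from typing import List, Tuple


def parse_ftdump(text: str) -> List[Tuple[str, int]]:
    """Return (word, count) pairs from dump text, most frequent first.

    Assumption: every data line looks like "<count> <weight> <word>".
    Lines that do not match that shape are skipped.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3 or not parts[0].isdigit():
            continue  # blank line or unexpected layout
        count, _weight, word = parts  # the weight column is discarded
        rows.append((word, int(count)))
    # Sort by count descending, then alphabetically so ties are stable.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


# Made-up sample lines in the assumed layout, not real tool output.
sample = """\
        3            1.0986123 duck
        1            2.1972246 chicken
        2            1.5040774 have
"""

for word, count in parse_ftdump(sample):
    print(f"{count} occurrences of '{word}'")
```

To run it on the real file, read wordfreq.dump into a string and pass that to `parse_ftdump` instead of `sample`.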
More info on myisam_ftdump here: http://dev.mysql.com/doc/refman/5.0/en/myisam-ftdump.html
Oh, one more thing: MySQL's default stopword list is compiled into the engine, and words of 3 or fewer characters are not indexed. The full list is here:
If this list isn't adequate for your needs, or you need words of 3 or fewer characters to count, you can override the defaults with the ft_stopword_file and ft_min_word_len server variables, then restart the server and rebuild the FULLTEXT index. The other route is to recompile MySQL with different rules for FULLTEXT. I don't recommend that!
Fantastic, thank you!
Extract to a flat file and then use your favorite quick language (perl, python, ruby, etc.) to process the flat file.
If you don't have one of these languages as part of your skillset, this is a perfect little task to start using one, and it won't take you long.
Some database tasks are just so much easier to do OUTSIDE of the database.
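For instance, a minimal Python sketch of the flat-file approach could look like the following. It assumes the export holds one text value per line, treats a "word" as a run of letters, digits, or apostrophes, and counts case-insensitively; the function name `count_words` and the ignore list are illustrative choices, not anything MySQL provides.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def count_words(lines: Iterable[str], ignore: Iterable[str] = ()) -> List[Tuple[str, int]]:
    """Count words across all lines and return (word, count), most common first.

    Words are lowercased before counting, so entries in `ignore`
    should be lowercase as well.
    """
    ignored = set(ignore)
    counts = Counter()
    for line in lines:
        # A word here is a run of letters, digits, or apostrophes,
        # so "I'll" stays one token instead of splitting into "I" and "ll".
        for word in re.findall(r"[A-Za-z0-9']+", line):
            word = word.lower()
            if word not in ignored:
                counts[word] += 1
    return counts.most_common()


# The example rows from the question, standing in for the exported file.
rows = [
    "I think I'll have duck tonight",
    "Maybe the chicken will be alright",
    "I have a pet duck now, awesome!",
    "I love duck",
]

for word, count in count_words(rows, ignore={"i", "will", "have"}):
    print(f"{count} occurrences of '{word}'")
```

Because the function accepts any iterable of strings, an open file object for the exported data can be passed in place of `rows`, and the ignore set can be edited by hand as uninteresting words turn up.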
You might want to look into the MySQL Full-Text Parser Plugins.