MySQL - mark all but 1 matching row
This is similar to this question, but it seems like some of the answers there aren't quite compatible with MySQL (or I'm not doing it right), and I'm having a heck of a time figuring out the changes I need. Apparently my SQL is rustier than I thought it was. I'm also looking to change a column value rather than delete, but I think at least that part is simple...
I have a table like:
    rowid         SERIAL
    fingerprint   TEXT
    duplicate     BOOLEAN
    contents      TEXT
    created_date  DATETIME
I want to set duplicate=true for all but the first (by created_date) of each group by fingerprint. It's easy to mark all of the rows with duplicate fingerprints as dupes. The part I'm getting stuck on is keeping the first.
One of the apps that populates the table does bulk loads of data, with multiple workers loading data from different sources, and the workers' data isn't necessarily partitioned by date, so it's a pain to try to mark these all as they come in (the first one inserted isn't necessarily the first one by date). Also, I already have a bunch of data in there I'll need to clean up either way. So I'd rather just have a relatively efficient query I can run after a bulk load to clean up than try to build it into that app.
Comment: Is (fingerprint, created_date) unique?
MySQL needs to be told explicitly if the values you are grouping by are longer than 1024 bytes. If the data in your fingerprint column can be larger than 1024 bytes, set the max_sort_length variable to a larger number so that the GROUP BY won't silently use only part of your data for grouping.
Once you're certain that MySQL will group your data properly, the following query will set the duplicate flag so that the first fingerprint record has duplicate set to FALSE/0 and any subsequent fingerprint records have duplicate set to TRUE/1:
    UPDATE mytable m1
    INNER JOIN (SELECT fingerprint, MIN(rowid) AS minrow
                FROM mytable m2
                GROUP BY fingerprint) m3
        ON m1.fingerprint = m3.fingerprint
    SET m1.duplicate = m3.minrow != m1.rowid;
Please keep in mind that this solution does not take NULLs into account and if it is possible for the fingerprint field to be NULL then you would need additional logic to handle that case.
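One thing to watch with the query above: MIN(rowid) keeps the first row inserted, while the question asks for the first row by created_date, and the two can differ after an out-of-order bulk load. The sketch below is only an illustration of that difference, not the answerer's method. It uses Python's sqlite3 module as a stand-in for MySQL so it can be run anywhere, and it writes both rules as correlated subqueries because SQLite has no UPDATE ... JOIN (MySQL, in turn, rejects this correlated form on the table being updated, so there the ordering rule would have to go into the joined subquery). The sample rows and the helper names load_sample and flags are made up for the example.

```python
import sqlite3

SAMPLE_ROWS = [
    # (rowid, fingerprint, contents, created_date): hypothetical data
    (1, "aaa", "loaded first, dated later", "2009-03-02 10:00:00"),
    (2, "aaa", "loaded second, dated earlier", "2009-03-01 10:00:00"),
    (3, "bbb", "only copy", "2009-03-01 12:00:00"),
]

# Same rule as the MIN(rowid) query: keep the lowest rowid per fingerprint.
KEEP_LOWEST_ROWID = """
UPDATE mytable
SET duplicate = (rowid != (SELECT MIN(m2.rowid)
                           FROM mytable m2
                           WHERE m2.fingerprint = mytable.fingerprint))
"""

# Rule from the question: keep the earliest created_date; rowid breaks ties.
KEEP_EARLIEST_DATE = """
UPDATE mytable
SET duplicate = (rowid != (SELECT m2.rowid
                           FROM mytable m2
                           WHERE m2.fingerprint = mytable.fingerprint
                           ORDER BY m2.created_date, m2.rowid
                           LIMIT 1))
"""


def load_sample():
    """Build an in-memory table shaped like the one in the question."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mytable (rowid INTEGER PRIMARY KEY, fingerprint TEXT,"
        " duplicate INTEGER DEFAULT 0, contents TEXT, created_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO mytable (rowid, fingerprint, contents, created_date)"
        " VALUES (?, ?, ?, ?)",
        SAMPLE_ROWS,
    )
    return conn


def flags(update_sql):
    """Run one UPDATE on fresh sample data; return {rowid: duplicate}."""
    conn = load_sample()
    conn.execute(update_sql)
    return dict(conn.execute("SELECT rowid, duplicate FROM mytable ORDER BY rowid"))


print(flags(KEEP_LOWEST_ROWID))
print(flags(KEEP_EARLIEST_DATE))
```

With this sample data the two rules disagree about which "aaa" row survives, which is the case the question describes for bulk loads.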
How about a two-step approach, assuming you can go offline during a data load:
- Mark every item as duplicate.
- Select the earliest row from each group, and clear the duplicate flag.
Not elegant, but gets the job done.
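The two steps could look roughly like the sketch below. It is an assumption-laden illustration: SQLite (through Python's sqlite3) stands in for MySQL, the function name two_step_mark and the sample rows are invented, and step 2 assumes that (fingerprint, created_date) is unique. If two rows in a group share the earliest date, both would have their flag cleared.

```python
import sqlite3


def two_step_mark(conn):
    """Mark everything as duplicate, then clear the flag on each group's earliest row."""
    # Both steps run in one transaction, so they commit or roll back together.
    with conn:
        # Step 1: mark every row as a duplicate.
        conn.execute("UPDATE mytable SET duplicate = 1")
        # Step 2: clear the flag where the row carries its group's earliest date.
        conn.execute(
            """
            UPDATE mytable
            SET duplicate = 0
            WHERE created_date = (SELECT MIN(m2.created_date)
                                  FROM mytable m2
                                  WHERE m2.fingerprint = mytable.fingerprint)
            """
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (rowid INTEGER PRIMARY KEY, fingerprint TEXT,"
    " duplicate INTEGER DEFAULT 0, contents TEXT, created_date TEXT)"
)
conn.executemany(
    "INSERT INTO mytable (rowid, fingerprint, created_date) VALUES (?, ?, ?)",
    [
        (1, "aaa", "2009-03-02"),  # hypothetical sample rows
        (2, "aaa", "2009-03-01"),
        (3, "aaa", "2009-03-03"),
        (4, "bbb", "2009-03-01"),
    ],
)
two_step_mark(conn)
result = conn.execute("SELECT rowid, duplicate FROM mytable ORDER BY rowid").fetchall()
print(result)
```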
Comment: This can easily be accomplished with a single, rather easy query. No reason to go to these lengths to complicate things.
Here's a funny way to do it:
    SET @rowid := 0;
    UPDATE mytable
    SET duplicate = (rowid = @rowid),
        rowid = (@rowid:=rowid)
    ORDER BY rowid, created_date;
- First set a user variable to zero, assuming this is less than any rowid in your table.
- Then use the MySQL UPDATE...ORDER BY feature to ensure that the rows are updated in order by rowid, then by created_date.
- For each row, if the current rowid is not equal to the user variable @rowid, set duplicate to 0 (false). This will be true only on the first row encountered with a given value for rowid.
- Then add a dummy set of rowid to its own value, setting @rowid to that value as a side effect.
- As you UPDATE the next row, if it's a duplicate of the previous row, rowid will be equal to the user variable @rowid, and therefore duplicate will be set to 1 (true).
Edit: Now I have tested this, and I corrected a mistake in the line that sets duplicate.
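The idea behind the user-variable trick is an ordered scan that remembers the key of the previous row. The plain-Python sketch below mirrors that logic so it can be stepped through; it is not MySQL code. Two assumptions to flag: it groups on fingerprint, the column the question treats as the duplicate key, rather than on rowid as the statement above is written, and the function name mark_by_ordered_scan and the sample rows are invented. It is also worth knowing that the MySQL manual describes the evaluation order of expressions involving user variables as undefined, and MySQL 8.0 deprecates assigning them in statements other than SET, so the SQL form depends on behavior that is not guaranteed.

```python
from operator import itemgetter


def mark_by_ordered_scan(rows):
    """Set row["duplicate"] on each dict in rows and return the same list."""
    previous = object()  # sentinel equal to no fingerprint, like the initial @rowid
    for row in sorted(rows, key=itemgetter("fingerprint", "created_date")):
        # Decide the flag first, using the key remembered from the previous row...
        row["duplicate"] = row["fingerprint"] == previous
        # ...then remember this row's key, like the dummy assignment in the UPDATE.
        previous = row["fingerprint"]
    return rows


rows = [
    # hypothetical sample rows; fingerprints are assumed to be non-NULL strings
    {"rowid": 1, "fingerprint": "aaa", "created_date": "2009-03-02"},
    {"rowid": 2, "fingerprint": "bbb", "created_date": "2009-03-01"},
    {"rowid": 3, "fingerprint": "aaa", "created_date": "2009-03-01"},
]
for row in mark_by_ordered_scan(rows):
    print(row["rowid"], row["duplicate"])
```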
Here's another way to do it, using MySQL's multi-table UPDATE syntax:
    UPDATE mytable m1
    JOIN mytable m2
        ON (m1.rowid = m2.rowid AND m1.created_date < m2.created_date)
    SET m2.duplicate = 1;
Comment: Doesn't account for duplicate dates...
Comment (Bill Karwin): Oh, yes, you're right. It assumes each date is unique. Ah well.
Comment (Chris): True, but you could do m1.primary_key < m2.primary_key. I know the OP said he wanted to keep the first record by creation date - wouldn't the primary keys of the table (we're talking a UNIQUE AUTO INCREMENT field here, since it's MySQL) necessarily be in ascending chronological order?
Comment (Bill Karwin): @Chris: Yes, usually that's true. But you might not be able to assume rows are inserted in chronological order. That is, m1.primary_key < m2.primary_key might not guarantee that m1.created_date < m2.created_date. YMMV.
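One way to reconcile the two comments is to compare by created_date first and fall back to the primary key only when the dates are equal. The sketch below contrasts the strict date comparison with that tiebreak. It is an illustration under assumptions: SQLite via Python's sqlite3 stands in for MySQL, the test is written with EXISTS instead of the multi-table UPDATE, rows are grouped on fingerprint (the question's duplicate key), and the sample rows and the name duplicates_after are invented.

```python
import sqlite3

# Strict comparison only: rows that share the earliest date never mark each other.
STRICT_DATE = """
UPDATE mytable SET duplicate = 1
WHERE EXISTS (SELECT 1 FROM mytable m1
              WHERE m1.fingerprint = mytable.fingerprint
                AND m1.created_date < mytable.created_date)
"""

# Same test, with rowid as a tiebreaker when the dates are equal.
DATE_THEN_ROWID = """
UPDATE mytable SET duplicate = 1
WHERE EXISTS (SELECT 1 FROM mytable m1
              WHERE m1.fingerprint = mytable.fingerprint
                AND (m1.created_date < mytable.created_date
                     OR (m1.created_date = mytable.created_date
                         AND m1.rowid < mytable.rowid)))
"""


def duplicates_after(update_sql):
    """Return the rowids flagged as duplicates after running update_sql."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mytable (rowid INTEGER PRIMARY KEY, fingerprint TEXT,"
        " duplicate INTEGER DEFAULT 0, contents TEXT, created_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO mytable (rowid, fingerprint, created_date) VALUES (?, ?, ?)",
        [
            (1, "aaa", "2009-03-01 10:00:00"),  # hypothetical rows: 1 and 2 tie
            (2, "aaa", "2009-03-01 10:00:00"),
            (3, "aaa", "2009-03-05 09:00:00"),
        ],
    )
    conn.execute(update_sql)
    cursor = conn.execute("SELECT rowid FROM mytable WHERE duplicate = 1 ORDER BY rowid")
    return [r[0] for r in cursor]


print(duplicates_after(STRICT_DATE))
print(duplicates_after(DATE_THEN_ROWID))
```

With the strict comparison, both rows that share the earliest timestamp stay unmarked; with the tiebreak, exactly one row per fingerprint stays unmarked.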
I don't know the MySQL syntax, but in PLSQL you just do:
    UPDATE t1
    SET duplicate = 1
    FROM MyTable t1
    WHERE rowid != (
        SELECT TOP 1 rowid
        FROM MyTable t2
        WHERE t2.fingerprint = t1.fingerprint
        ORDER BY created_date ASC
    )
That may have some syntax errors, as I'm just typing off the cuff/not able to test it, but that's the gist of it.
MySQL version (not tested):
    UPDATE t1
    SET duplicate = 1
    FROM MyTable t1
    WHERE rowid != (
        SELECT rowid
        FROM MyTable t2
        WHERE t2.fingerprint = t1.fingerprint
        ORDER BY created_date ASC
        LIMIT 1
    )
Comment: SELECT TOP is a Microsoft SQL Server feature. It is not supported in Oracle or MySQL.
Comment (sliderhouserules): Just looked up the MySQL syntax, it's LIMIT.
    UPDATE TheAnonymousTable
    SET duplicate = TRUE
    WHERE rowid NOT IN (
        SELECT rowid
        FROM (SELECT MIN(created_date) AS created_date, fingerprint
              FROM TheAnonymousTable
              GROUP BY fingerprint) AS M,
             TheAnonymousTable AS T
        WHERE M.created_date = T.created_date
          AND M.fingerprint = T.fingerprint
    );
The logic is that the innermost query returns the earliest created_date for each distinct fingerprint as table alias M. The middle query determines the rowid value for each of those rows; it is a nuisance to have to do this (but necessary), and the code assumes that you won't get two records with the same fingerprint and timestamp. This gives you the rowid of the earliest record for each separate fingerprint. Then the outer query (the UPDATE) sets the 'duplicate' flag on every row whose rowid is not one of those earliest rows.
Some DBMS may be unhappy about doing (nested) sub-queries on the table being updated.
If you are only interested in removing duplicates, you can use this technique:
- Create a duplicate of your table (structure only -- no data)
- Perform an INSERT-SELECT
(This type of group by works in MySQL but may or may not work in others.)
    INSERT INTO table_copy (rowid, fingerprint, duplicate, contents, created_date)
    SELECT rowid, fingerprint, 0, contents, MAX(created_date)
    FROM table_original
    GROUP BY fingerprint
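A caveat on the GROUP BY form above: with ONLY_FULL_GROUP_BY enabled (part of the default sql_mode since MySQL 5.7.5) the statement is rejected, and with it disabled MySQL may take rowid and contents from any row in the group, not necessarily the one carrying MAX(created_date). If the goal is the question's "first by created_date", a more deterministic variant is to join each row against its group's earliest date. The sketch below is one possible shape of that, run on SQLite via Python's sqlite3 as a stand-in for MySQL. It assumes (fingerprint, created_date) is unique, and the name copy_without_duplicates and the sample rows are invented for illustration.

```python
import sqlite3

# Copy only the row that carries each fingerprint's earliest created_date.
COPY_EARLIEST = """
INSERT INTO table_copy (rowid, fingerprint, duplicate, contents, created_date)
SELECT o.rowid, o.fingerprint, 0, o.contents, o.created_date
FROM table_original o
JOIN (SELECT fingerprint, MIN(created_date) AS first_date
      FROM table_original
      GROUP BY fingerprint) f
  ON f.fingerprint = o.fingerprint AND f.first_date = o.created_date
"""


def copy_without_duplicates(rows):
    """Load rows into table_original, copy the earliest per fingerprint, return the copy."""
    conn = sqlite3.connect(":memory:")
    for name in ("table_original", "table_copy"):
        conn.execute(
            f"CREATE TABLE {name} (rowid INTEGER PRIMARY KEY, fingerprint TEXT,"
            " duplicate INTEGER, contents TEXT, created_date TEXT)"
        )
    # rows are (rowid, fingerprint, contents, created_date); duplicate starts at 0
    conn.executemany("INSERT INTO table_original VALUES (?, ?, 0, ?, ?)", rows)
    conn.execute(COPY_EARLIEST)
    return conn.execute(
        "SELECT rowid, contents, created_date FROM table_copy ORDER BY rowid"
    ).fetchall()


copied = copy_without_duplicates(
    [
        (1, "aaa", "later text", "2009-03-02"),  # hypothetical sample rows
        (2, "aaa", "earlier text", "2009-03-01"),
        (3, "bbb", "only copy", "2009-03-01"),
    ]
)
print(copied)
```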