MySQL - mark all but 1 matching row

- Content last updated: 2014-12-30

This is similar to this question, but it seems like some of the answers there aren't quite compatible with MySQL (or I'm not doing it right), and I'm having a heck of a time figuring out the changes I need. Apparently my SQL is rustier than I thought it was. I'm also looking to change a column value rather than delete, but I think at least that part is simple...

I have a table like:

rowid SERIAL
fingerprint TEXT
duplicate BOOLEAN
contents TEXT
created_date DATETIME

I want to set duplicate=true for all but the first (by created_date) of each group by fingerprint. It's easy to mark all of the rows with duplicate fingerprints as dupes. The part I'm getting stuck on is keeping the first.

One of the apps that populates the table does bulk loads of data, with multiple workers loading data from different sources, and the workers' data isn't necessarily partitioned by date, so it's a pain to try to mark these all as they come in (the first one inserted isn't necessarily the first one by date). Also, I already have a bunch of data in there I'll need to clean up either way. So I'd rather just have a relatively efficient query I can run after a bulk load to clean up than try to build it into that app.

Thanks!

Comment from noob source: Is (fingerprint, created_date) unique?

Solution:

MySQL needs to be told explicitly if the data you are grouping by is longer than 1024 bytes (see this link for details). If the values in the fingerprint column can exceed 1024 bytes, set the max_sort_length variable (see this link for the allowed values, and this link for how to set it) to a larger number, so that GROUP BY won't silently use only the first part of each value for grouping.

Once you're certain that MySQL will group your data properly, the following query will set the duplicate flag so that the first fingerprint record has duplicate set to FALSE/0 and any subsequent fingerprint records have duplicate set to TRUE/1:

    UPDATE mytable m1
INNER JOIN (SELECT fingerprint
                 , MIN(rowid) AS minrow 
              FROM mytable m2 
          GROUP BY fingerprint) m3 
        ON m1.fingerprint = m3.fingerprint
       SET m1.duplicate = m3.minrow != m1.rowid;

Please keep in mind that this solution does not take NULLs into account and if it is possible for the fingerprint field to be NULL then you would need additional logic to handle that case.
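
To see what the query above computes, here is a minimal sketch that runs the same "keep MIN(rowid) per fingerprint" logic against an in-memory SQLite database from Python. SQLite stands in for MySQL only so the example runs without a server; the sample rows, and the correlated subquery used in place of MySQL's UPDATE ... JOIN, are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE mytable (
           rowid        INTEGER PRIMARY KEY,
           fingerprint  TEXT,
           duplicate    INTEGER,
           contents     TEXT,
           created_date TEXT
       )"""
)
# Hypothetical sample data: for fingerprint "bbb" the later row was inserted first.
conn.executemany(
    "INSERT INTO mytable (rowid, fingerprint, contents, created_date)"
    " VALUES (?, ?, ?, ?)",
    [
        (1, "aaa", "first a", "2014-01-01 10:00:00"),
        (2, "aaa", "second a", "2014-01-02 10:00:00"),
        (3, "bbb", "late b", "2014-01-05 10:00:00"),
        (4, "bbb", "early b", "2014-01-03 10:00:00"),
    ],
)

# Same rule as the MySQL query: a row is a duplicate unless its rowid is
# the smallest rowid among rows sharing its fingerprint.
conn.execute(
    """UPDATE mytable
          SET duplicate = (rowid != (SELECT MIN(m2.rowid)
                                       FROM mytable AS m2
                                      WHERE m2.fingerprint = mytable.fingerprint))"""
)

flags = dict(conn.execute("SELECT rowid, duplicate FROM mytable"))
print(flags)
```

In this sample, row 3 is the one kept for "bbb" even though row 4 has the earlier created_date, so the MIN(rowid) rule matches "first by created_date" only when rows were inserted in date order.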

Solution:

How about a two-step approach, assuming you can go offline during a data load:

  • Mark every item as duplicate.
  • Select the earliest row from each group, and clear the duplicate flag.

Not elegant, but gets the job done.
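
The two steps could look roughly like this. It is only a sketch: it uses SQLite from Python as a stand-in for MySQL so it can run without a server, the sample rows are made up, and breaking created_date ties by the lowest rowid is an assumption, since the answer does not say what to do with ties.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable ("
    " rowid INTEGER PRIMARY KEY, fingerprint TEXT,"
    " duplicate INTEGER, created_date TEXT)"
)
conn.executemany(
    "INSERT INTO mytable (rowid, fingerprint, created_date) VALUES (?, ?, ?)",
    [
        (1, "aaa", "2014-01-02"),
        (2, "aaa", "2014-01-01"),
        (3, "bbb", "2014-01-03"),
    ],
)

# Step 1: mark every row as a duplicate.
conn.execute("UPDATE mytable SET duplicate = 1")

# Step 2: clear the flag on the earliest row of each fingerprint group.
# Ties on created_date are broken by the lowest rowid (an assumption).
conn.execute(
    """UPDATE mytable
          SET duplicate = 0
        WHERE rowid = (SELECT m2.rowid
                         FROM mytable AS m2
                        WHERE m2.fingerprint = mytable.fingerprint
                        ORDER BY m2.created_date, m2.rowid
                        LIMIT 1)"""
)

two_step_flags = dict(conn.execute("SELECT rowid, duplicate FROM mytable"))
print(two_step_flags)
```

Between the two statements every row is flagged, which is why the answer assumes the table can be taken offline (or the two steps wrapped in one transaction) while this runs.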

Comment from sliderhouserules: This can easily be accomplished with a single, rather easy query. No reason to go to these lengths to complicate things.

Solution:

Here's a funny way to do it:

SET @fingerprint := '';

UPDATE mytable
SET duplicate = (fingerprint = @fingerprint),
    fingerprint = (@fingerprint := fingerprint)
ORDER BY fingerprint, created_date;

  • First set a user variable to the empty string, assuming no row has an empty fingerprint.
  • Then use the MySQL UPDATE...ORDER BY feature to ensure that the rows are updated in order by fingerprint, then by created_date.
  • For each row, if the current fingerprint is not equal to the user variable @fingerprint, duplicate is set to 0 (false). This happens only on the first row encountered with a given fingerprint.
  • Then add a dummy assignment of fingerprint to its own value, setting @fingerprint to that value as a side effect.
  • As you UPDATE the next row, if it's a duplicate of the previous row, fingerprint will be equal to the user variable @fingerprint, and therefore duplicate will be set to 1 (true).

Edit: Now I have tested this, and I corrected a mistake in the line that sets duplicate.
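
The single-pass idea can also be sketched outside the database: sort the rows so that members of a group sit next to each other, then compare each row's grouping key with the one remembered from the previous row. This sketch groups on fingerprint, as the question asks; the sample rows and the `mark_duplicates` helper are hypothetical, and the `previous` variable plays the role of the MySQL user variable.

```python
from operator import itemgetter


def mark_duplicates(rows):
    """Return copies of rows with a 'duplicate' flag.

    Rows are visited in (fingerprint, created_date) order, so the first
    row seen for each fingerprint is the earliest one and stays unflagged.
    """
    previous = object()  # sentinel that compares unequal to any fingerprint
    marked = []
    for row in sorted(rows, key=itemgetter("fingerprint", "created_date")):
        marked.append({**row, "duplicate": row["fingerprint"] == previous})
        previous = row["fingerprint"]  # remember the key for the next row
    return marked


rows = [
    {"rowid": 1, "fingerprint": "aaa", "created_date": "2014-01-02"},
    {"rowid": 2, "fingerprint": "bbb", "created_date": "2014-01-01"},
    {"rowid": 3, "fingerprint": "aaa", "created_date": "2014-01-01"},
]
result = {r["rowid"]: r["duplicate"] for r in mark_duplicates(rows)}
print(result)
```

Using a sentinel object rather than 0 or an empty string avoids having to assume that some particular value never occurs as a real fingerprint.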

Solution:

Here's another way to do it, using MySQL's multi-table UPDATE syntax:

UPDATE mytable m1
  JOIN mytable m2 ON (m1.fingerprint = m2.fingerprint AND m1.created_date < m2.created_date)
SET m2.duplicate = 1;

Comment from sliderhouserules: Doesn't account for duplicate dates...

Comment from Bill Karwin: Oh, yes, you're right. It assumes each date is unique. Ah well.

Comment from Chris: True, but you could do m1.primary_key < m2.primary_key. I know the OP said he wanted to keep the first record by creation date, but wouldn't the primary keys of the table (we're talking about a UNIQUE AUTO_INCREMENT field here, since it's MySQL) necessarily be in ascending chronological order?

Comment from Bill Karwin: @Chris: Yes, usually that's true. But you might not be able to assume rows are inserted in chronological order. That is, m1.primary_key < m2.primary_key might not guarantee that m1.created_date < m2.created_date. YMMV.
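
One way to reconcile the two concerns in this thread is to compare created_date first and fall back to the primary key only when the dates are equal. The following is a sketch under assumptions, not a tested MySQL statement: it uses SQLite from Python so it can run without a server, it expresses the self-join as an EXISTS subquery, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable ("
    " rowid INTEGER PRIMARY KEY, fingerprint TEXT,"
    " duplicate INTEGER DEFAULT 0, created_date TEXT)"
)
conn.executemany(
    "INSERT INTO mytable (rowid, fingerprint, created_date) VALUES (?, ?, ?)",
    [
        (1, "aaa", "2014-01-01"),
        (2, "aaa", "2014-01-01"),  # same date as row 1
        (3, "aaa", "2013-12-31"),  # inserted last, but earliest by date
        (4, "bbb", "2014-01-01"),
    ],
)

# A row is a duplicate if another row with the same fingerprint comes
# before it: earlier created_date, or same created_date and lower rowid.
conn.execute(
    """UPDATE mytable
          SET duplicate = 1
        WHERE EXISTS (SELECT 1
                        FROM mytable AS m1
                       WHERE m1.fingerprint = mytable.fingerprint
                         AND (m1.created_date < mytable.created_date
                              OR (m1.created_date = mytable.created_date
                                  AND m1.rowid < mytable.rowid)))"""
)

tie_flags = dict(conn.execute("SELECT rowid, duplicate FROM mytable"))
print(tie_flags)
```

With the tie-breaker in place, exactly one row per fingerprint stays unflagged even when several rows share the earliest date, and out-of-order inserts such as row 3 are still handled by date.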

Solution:

I don't know the MySQL syntax, but in PLSQL you just do:

UPDATE t1
SET duplicate = 1
FROM MyTable t1
WHERE rowid != (
  SELECT TOP 1 rowid FROM MyTable t2
  WHERE t2.fingerprint = t1.fingerprint
  ORDER BY created_date ASC  -- the earliest row in each group is the one to keep
)

That may have some syntax errors, as I'm just typing off the cuff/not able to test it, but that's the gist of it.


MySQL version (not tested):

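-- MySQL has no UPDATE ... FROM and cannot read the update target in a plain
-- subquery, so the "first rowid per fingerprint" lookup goes in a derived table.
UPDATE MyTable t1
  JOIN (SELECT t2.fingerprint,
               (SELECT t3.rowid FROM MyTable t3
                 WHERE t3.fingerprint = t2.fingerprint
                 ORDER BY t3.created_date ASC
                 LIMIT 1) AS first_rowid
          FROM MyTable t2
         GROUP BY t2.fingerprint) firsts
    ON firsts.fingerprint = t1.fingerprint
   SET t1.duplicate = 1
 WHERE t1.rowid != firsts.first_rowid;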

Comment from Bill Karwin: SELECT TOP is a Microsoft SQL Server feature. It is not supported in Oracle or MySQL.

Comment from sliderhouserules: Just looked up the MySQL syntax, it's LIMIT.

Solution:

Untested...

UPDATE TheAnonymousTable
   SET duplicate = TRUE
 WHERE rowid NOT IN
       (SELECT rowid
          FROM (SELECT MIN(created_date) AS created_date, fingerprint
                  FROM TheAnonymousTable
                 GROUP BY fingerprint
               ) AS M,
               TheAnonymousTable AS T
         WHERE M.created_date = T.created_date
           AND M.fingerprint  = T.fingerprint
       );

The logic is that the innermost query returns the earliest created_date for each distinct fingerprint, as table alias M. The middle query determines the rowid for each of those rows; it is a nuisance to have to do this (but necessary), and the code assumes that you won't get two records with the same fingerprint and timestamp. This gives you the rowid of the earliest record for each separate fingerprint. Then the outer query (the UPDATE) sets the 'duplicate' flag on all rows whose rowid is not one of those earliest rows.

Some DBMS may be unhappy about doing (nested) sub-queries on the table being updated.

Solution:

If you are only interested in removing duplicates, you can use this technique:

  1. Create a duplicate of your table (structure only -- no data)
  2. Perform an INSERT-SELECT

(This type of group by works in MySQL but may or may not work in others.)

INSERT INTO table_copy (
  rowid,
  fingerprint,
  duplicate,
  contents,
  created_date
)
SELECT 
  rowid,
  fingerprint,
  0,
  contents,
  MAX(created_date)
FROM table_original
GROUP BY fingerprint