Regex - Compare two capture groups

- Last updated: 2015-12-20
Question:

I'm trying to create a regex to limit our spam intake. The problem is, I'm not exactly fluent in regular expressions. The product of my work below is mostly copy and paste, tweaks, and searches for things to help tweak it more.

What I've decided I want to try is using a regex to match emails where a link misrepresents the hostname.

For example:

<A HREF="http://phishers.org/we_want_your_money.htm">http://someLegitimateSite.com/somewhere </A>

I basically only care about the hostnames, both to limit false positives and to avoid flagging more or less legitimate links such as A HREF...>click here!

To date, I have this:

(HREF="http[s]?:\/\/)(?'hostname1'(.*?))[:|\/|"].*?\"\>(http[s]?:\/\/)(?'hostname2'(.*?))[<|\/|:]

According to https://regex101.com/ I have two named capture groups (hostname1 and hostname2), and a whack of other groups that I'm not sure I care about.

What I want to do is match the string if hostname1 and hostname2 are the same. I get the feeling that it involves either a lookbehind or a lookahead, but I honestly don't know.

EDIT: Thanks to Jan for prototyping this. I, as per the comments in his answer, made one quick addition to add the unaccounted for case of image tags. In the case of large websites (BestBuy for example) they store their images on a different content server, which was triggering the rule. I've decided to exclude image tags, which I BELIEVE (in my very non-expert opinion) I have successfully done. YMMV.

href=["']https?:\/\/(?<hostname>[^\/"]+)[^>]+>((?!<IMG).?)(?:https?:\/\/)?(?!.*\k'hostname')
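
The image-tag exclusion can be tried out in Python. This is only a sketch under a few assumptions: Python's `re` spells the named group `(?P<hostname>...)` and the backreference `(?P=hostname)` rather than `(?<hostname>...)` and `\k'hostname'`, the `<IMG` check is written as a plain negative lookahead placed right after the opening tag, matching is case-insensitive, and the sample URLs are made up.

```python
import re

# Flags a link whose href hostname does not reappear later on the line,
# unless the link body starts with an <img> tag.
MISMATCH_NO_IMG = re.compile(
    r"""href=["']https?://(?P<hostname>[^/"]+)"""  # hostname from the href
    r"""[^>]+>"""                                  # rest of the opening <a ...> tag
    r"""(?!\s*<img\b)"""                           # skip links that wrap an image
    r"""(?:https?://)?(?!.*(?P=hostname))""",      # hostname must not appear afterwards
    re.IGNORECASE,
)

def is_flagged(line: str) -> bool:
    """Return True if the line contains a link this rule would score."""
    return MISMATCH_NO_IMG.search(line) is not None

# Hypothetical sample lines, not taken from real mail.
image_link = '<a href="https://www.shop.example/deal"><IMG SRC="https://cdn.example.net/deal.png"></a>'
fake_link = '<a href="http://somebadsite.com">https://somegoodsite.com</a>'

print(is_flagged(image_link))  # False
print(is_flagged(fake_link))   # True
```

Without the `(?!\s*<img\b)` line, the image link would be flagged, because the content server's hostname differs from the one in the href.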

Comment: It only involves a backreference. See here. However, you might want to switch to an HTML parser to parse your HTML.

Comment: I'm not an HTML expert, but what do parsers do when they encounter duplicate attributes? <a href="here" href="there">
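
The answer depends on the parser. The HTML Living Standard treats a repeated attribute as a parse error and keeps only the first occurrence, but not every library follows that. As a small illustration, assuming Python's standard `html.parser` module (the `AttrCollector` class and `start_tags` helper are hypothetical names for this sketch), both attributes are reported to the caller:

```python
from html.parser import HTMLParser

class AttrCollector(HTMLParser):
    """Records every start tag together with its raw attribute list."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, so duplicates are preserved.
        self.seen.append((tag, attrs))

def start_tags(html: str):
    parser = AttrCollector()
    parser.feed(html)
    parser.close()
    return parser.seen

print(start_tags('<a href="here" href="there">'))
# [('a', [('href', 'here'), ('href', 'there')])]
```

A filter built on such a parser would have to decide for itself which of the two values to trust.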

OP: To be more specific, our spam filtering solution allows us to score emails (or take other actions such as accept/reject) based on a number of criteria. One such criterion, and the one I plan to use, is "Raw Body" "Matches Regular Expression" <regex>. Unfortunately, that removes the possibility of using a parser.

Answer:

It somewhat depends on your programming language. In PHP you could come up with something like:

href=["']https?:\/\/(?<hostname>[^\/]+)[^>]+>(?:https?:\/\/)?\k'hostname'
# match href, =, a single/double quote, http or https, and :// literally
# capture everything up to (but not including) a forward slash in a group called hostname
# followed by anything but >
# followed by >
# optionally match http:// or https:// in a non-capturing group (?:...)?
# then require the previously captured group called hostname (a backreference)

If this is the case, it is presumably not a spam link (href and link text match).
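
For readers whose regex engine is not PCRE, here is a minimal sketch of the same backreference idea in Python. It assumes Python's `re` module, where the named group is written `(?P<hostname>...)` and the backreference `(?P=hostname)`; the `\k'hostname'` spelling is not accepted there. The variable names are only illustrative.

```python
import re

# Port of the pattern above: (?<hostname>...) becomes (?P<hostname>...)
# and \k'hostname' becomes (?P=hostname).
SAME_HOST = re.compile(
    r"""href=["']https?://(?P<hostname>[^/]+)"""  # hostname: everything up to the next /
    r"""[^>]+>"""                                 # rest of the opening tag
    r"""(?:https?://)?(?P=hostname)"""            # link text repeats the hostname
)

match = SAME_HOST.search('<a href="https://example.com/subfolder">example.com</a>')
print(match.group("hostname"))  # example.com
```

The forward slashes are left unescaped because Python patterns are ordinary strings with no `/.../` delimiters.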

An overview:

<A HREF="http://phishers.org/we_want_your_money.htm">http://someLegitimateSite.com/somewhere </A>
<a href="https://example.com/subfolder">example.com</a> <-- will match, the others not
<a href="http://somebadsite.com">https://somegoodsite.com</a>

See a working example here on regex101.com.
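
Running the three sample lines through a Python port of the pattern is a useful check. In this sketch (assuming Python's `re`, where the group is spelled `(?P<hostname>...)` and the backreference `(?P=hostname)`, plus case-insensitive matching), the third line also matches, because `[^\/]+` is free to backtrack to a shared prefix such as `some`. The `STRICT` pattern is one possible tightening and is not part of the original answer: it pins the end of the hostname with a lookahead and stops the link text from extending the name.

```python
import re

SAMPLES = [
    '<A HREF="http://phishers.org/we_want_your_money.htm">http://someLegitimateSite.com/somewhere </A>',
    '<a href="https://example.com/subfolder">example.com</a>',
    '<a href="http://somebadsite.com">https://somegoodsite.com</a>',
]

# Direct port of the answer's pattern.
LOOSE = re.compile(
    r"""href=["']https?://(?P<hostname>[^/]+)[^>]+>(?:https?://)?(?P=hostname)""",
    re.IGNORECASE,
)

# Assumed tightening (hypothetical, for illustration only).
STRICT = re.compile(
    r"""href=["']https?://(?P<hostname>[^/"':>\s]+)(?=[/"':])"""  # whole hostname only
    r"""[^>]*>\s*(?:https?://)?(?P=hostname)(?![\w.-])""",        # text must not extend it
    re.IGNORECASE,
)

def captured_host(pattern, line):
    """Return the captured hostname, or None if the pattern does not match."""
    m = pattern.search(line)
    return m.group("hostname") if m else None

for line in SAMPLES:
    print(captured_host(LOOSE, line), captured_host(STRICT, line))
# None None
# example.com example.com
# some None
```

Whether the prefix match matters depends on how the rule is scored, but it is worth knowing that the loose form can treat two different hosts as "the same" when they merely start alike.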

EDIT: According to your comment, you want the negated result. This can be done via a negative lookahead:

href=["']https?:\/\/(?<hostname>[^\/"]+)[^>]+>(?:https?:\/\/)?(?!.*\k'hostname')
# same as before, except for the last part: (?!...)
# the negative lookahead asserts that the captured hostname does not appear anywhere in the rest of the line

See a working example for this regex here.
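
The negated form can be sketched in Python as well. The assumptions are again Python's `re` spellings (`(?P<hostname>...)`, `(?P=hostname)`) and case-insensitive matching. `MISMATCH` is a direct port. `MISMATCH_URL_TEXT` is a hypothetical variant, not from the answer, that only fires when the link text itself starts with a URL, so that plain "click here!" links are left alone.

```python
import re

# Port of the negated pattern from the answer.
MISMATCH = re.compile(
    r"""href=["']https?://(?P<hostname>[^/"]+)[^>]+>(?:https?://)?(?!.*(?P=hostname))""",
    re.IGNORECASE,
)

# Assumed variant: require the link text to be a URL with a different host.
MISMATCH_URL_TEXT = re.compile(
    r"""href=["']https?://(?P<hostname>[^/"':>\s]+)(?=[/"':])"""  # whole href hostname
    r"""[^>]*>\s*https?://"""                                     # link text must be a URL
    r"""(?!(?P=hostname)(?![\w.-]))""",                           # whose host differs
    re.IGNORECASE,
)

def flagged(pattern, line):
    """Return True if the pattern would score this line."""
    return pattern.search(line) is not None

# Hypothetical sample lines.
fake = '<a href="http://somebadsite.com">https://somegoodsite.com</a>'
honest = '<a href="https://example.com/subfolder">https://example.com/page</a>'
click_here = '<a href="https://example.com/subfolder">click here!</a>'

for line in (fake, honest, click_here):
    print(flagged(MISMATCH, line), flagged(MISMATCH_URL_TEXT, line))
# True True
# False False
# True False
```

The last row shows the trade-off: the direct port also flags links whose text never mentions a hostname at all, which may or may not be wanted in a spam rule.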

OP: That is VERY close to what I need. I'm now experimenting with ways of negating the result, as I need the links which do not match rather than the ones which do. That way it triggers on the bad emails and applies the score to the ones I wish to have held.

Comment: @NetworkingGuy See my updated answer; you need a negative lookahead then.

OP: Jan, thank you. I was putting various ! in various places, but it didn't seem to do the trick. Looks like a touch more bracketing was needed.

OP: Jan, I've discovered one particular use case which I did not account for. I believe I've accounted for it now: image tags as the link "body".

href=["']https?:\/\/(?<hostname>[^\/"]+)[^>]+>((?!<IMG).)(?:https?:\/\/)?(?!.*\k'hostname')