I'm trying to create a regex to limit our spam intake. Problem is, I'm not exactly fluent in regular expressions. The product of my work below is mostly copy and paste, tweaks, and searches for things to help tweak it more.
What I've decided I want to try is using a regex to match emails where a link misrepresents the hostname.
<A HREF="http://phishers.org/we_want_your_money.htm">http://someLegitimateSite.com/somewhere </A>
I basically only care about the hostnames, both to limit false positives and to avoid flagging more-or-less legitimate links such as <A HREF...>click here!</A>
To date, I have this:
According to https://regex101.com/ I have two named capture groups (hostname1 and hostname2), and a whack of other groups that I'm not sure I care about.
What I want to do is match the string only if hostname1 and hostname2 are different, i.e. the visible link text names one host while the HREF points at another. I get the feeling that it involves either a lookbehind or a lookahead, but I honestly don't know.
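For illustration, a common way to express "these two hosts must differ" is a negative lookahead wrapped around a backreference to the first named group. The sketch below is in Python and uses a simplified, hypothetical pattern (it is not the regex from this post); the name `MISMATCH` and the helper `find_mismatched_links` are made up for the example, and it assumes the visible link text begins with an `http://` or `https://` URL.

```python
import re

# Hypothetical, simplified pattern for illustration only.
MISMATCH = re.compile(
    r"""
    <a\s[^>]*?href\s*=\s*["']?https?://   # opening tag up to the href URL scheme
    (?P<hostname1>[^/"'\s>:]+)            # host the link really points to
    (?=[/"'\s>:])                         # pin the end of hostname1 so it cannot shrink on backtracking
    [^>]*>                                # rest of the opening tag
    \s*https?://                          # visible text must itself start with a URL
    (?!(?P=hostname1)(?![\w.-]))          # fail if the visible host is exactly hostname1
    (?P<hostname2>[^/\s<:]+)              # host shown to the reader
    """,
    re.IGNORECASE | re.VERBOSE,
)


def find_mismatched_links(html):
    """Return (href_host, shown_host) pairs for links whose two hosts differ."""
    return [
        (m.group("hostname1").lower(), m.group("hostname2").lower())
        for m in MISMATCH.finditer(html)
    ]


if __name__ == "__main__":
    sample = (
        '<A HREF="http://phishers.org/we_want_your_money.htm">'
        "http://someLegitimateSite.com/somewhere </A>"
    )
    print(find_mismatched_links(sample))
```

The inner `(?![\w.-])` is there so that a shown host like `example.com.evil.org` is not treated as equal to `example.com`. Links whose text is not a URL (the "click here!" case) never match, because the pattern requires a scheme right after the opening tag.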
EDIT: Thanks to Jan for prototyping this. As per the comments on his answer, I made one quick addition to cover the unaccounted-for case of image tags. Large websites (BestBuy, for example) store their images on a different content server, which was triggering the rule. I've decided to exclude image tags, which I BELIEVE (in my very non-expert opinion) I have successfully done. YMMV.