Killing cformsII contact form SPAM with regex

in site by Ian | 4 comments


It used to be a trickle of two or three a day, but lately we’ve been hammered with hundreds of automated SPAM submissions through the cformsII contact form plugin we use with WordPress. The contact form feeds into a Google Groups mailing list that forwards to the whole team.

99% of the SPAM advertised the same dozen designer brand knock-offs: North Face, Gucci handbags, Burberry, etc. This stuff isn’t hard to filter by keyword. First we looked at Google Groups to see if there was a keyword filter to protect group lists from SPAM, naughty language, or whatever. Surprisingly, there was nothing.

Next we checked the cformsII plugin. As versatile and flexible as it is, there’s no easy way to add a banned word list. There are a couple of antispam features, but the honeypot requires CSS changes to the site theme and we’d die before burdening you with a captcha to contact us.

cformsII does allow us to use a regular expression to evaluate each field of the submission form. Searching the support forum turned up tons of requests for a simple regex to filter SPAM words. Each was met with suggestions to search the forum for some epic post on the topic, but we couldn’t find it anywhere.

^(?i)(\b(red bottoms|timberland|beats by dre|burberry|Louis Vuitton|Gucci|uggs|ray ban|north face|Tiffany|Michael Kors|coach)\b)$

First we worked up a regex to match SPAM words with The Regex Coach. The problem is that a field is accepted when the regex matches, so this pattern evaluates to true on the SPAM words and only forms with these words would be accepted. We needed a way to negate the regex result, the equivalent of ! or NOT in languages like C, PHP, Perl, Basic, etc.
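To make the inversion concrete, here is a minimal Python sketch of that first pattern. Python and the helper name passes_validation are assumptions for illustration only; this is not how cformsII runs the check internally. The inline (?i) is swapped for re.IGNORECASE because recent Python versions reject a flag placed after ^.

```python
import re

# First pattern from the post. The inline (?i) is replaced with
# re.IGNORECASE, since Python does not accept a global flag after ^.
MATCH_SPAM = re.compile(
    r"^(\b(red bottoms|timberland|beats by dre|burberry|Louis Vuitton|Gucci"
    r"|uggs|ray ban|north face|Tiffany|Michael Kors|coach)\b)$",
    re.IGNORECASE,
)


def passes_validation(field: str) -> bool:
    """Hypothetical stand-in for a form that accepts a field only when the regex matches."""
    return MATCH_SPAM.match(field) is not None


if __name__ == "__main__":
    for sample in ("gucci", "Hello, I have a question about an order"):
        print(passes_validation(sample), repr(sample))
```

In this sketch the SPAM word passes and the honest message fails, which is exactly backwards. Because of the ^ and $ anchors, only a field made up entirely of one banned phrase matches at all.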

^(?i)(?:(?!\b(red bottoms|timberland|beats by dre|burberry|Louis Vuitton|Gucci|uggs|ray ban|north face|Tiffany|Michael Kors|coach)\b).)*$

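Here’s the final regex we’re using after much painful tinkering. ^ and $ anchor the regex to the start and end of the field, so the whole field has to match. (?i) makes it case insensitive. (?:(?! and .)* negate the list of SPAM words: at every position the negative lookahead (?! ) checks that no banned word starts there, and only then does . consume one character, so only messages without these words are allowed. \b( and )\b wrap the list of bad words that get rejected from the message body and name. \b marks a word boundary, so a banned word only counts when it stands on its own rather than inside a longer word.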
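For anyone who wants to try the negated pattern outside the plugin, here is a rough Python sketch. The names BANNED, ALLOW and is_allowed are made up for this example, and two things differ from the regex above: the case flag is passed as re.IGNORECASE instead of the inline (?i), and re.DOTALL is added so that . also crosses line breaks in a multi-line message. Whether cformsII needs that second adjustment is not something this sketch can tell you.

```python
import re

# Same word list as the regex in the post.
BANNED = [
    "red bottoms", "timberland", "beats by dre", "burberry",
    "Louis Vuitton", "Gucci", "uggs", "ray ban", "north face",
    "Tiffany", "Michael Kors", "coach",
]

# Escape each phrase and join them into one alternation.
WORDS = "|".join(re.escape(word) for word in BANNED)

# At every position: assert that no banned word starts here, then
# consume one character. The anchors force this to hold for the whole text.
ALLOW = re.compile(
    r"^(?:(?!\b(?:" + WORDS + r")\b).)*$",
    re.IGNORECASE | re.DOTALL,  # DOTALL is an assumption of this sketch
)


def is_allowed(text: str) -> bool:
    """Return True when the text contains none of the banned words."""
    return ALLOW.match(text) is not None


if __name__ == "__main__":
    samples = (
        "Hi, I have a question about my order.",
        "Cheap GUCCI handbags here",
        "We took the stagecoach into town",
        "My coach said hello",
    )
    for sample in samples:
        print(is_allowed(sample), repr(sample))
```

The last two samples show what \b buys you and what it doesn’t: “stagecoach” gets through because “coach” is not a separate word there, but an honest message that mentions a coach is still rejected. Generic words in the list will cause false positives like that.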

Hope this helps out anyone else dealing with repetitive SPAM from the same few bots. Obviously it’s not a perfect solution for everything, but it stopped the flood instantly without modifying site themes or forcing you to use a captcha on the contact form.


Comments

  1. Sleepwalker3 says:

    A few suggestions which may possibly help –
    You may perhaps consider something like many sites I see where they ask you a simple question with pictures (often the question itself is a picture). Simple things like “What is 1+4?”. I think that’s much less annoying than some cryptic disguised, contorted alpha-numeric code that you have to strain to read and often get wrong.
    Another good example (but don’t tell him I said so ;) I’ve seen is over on the Ultrakeet site, look down below http://ultrakeet.com.au/contact/

    If you want a nice simple way to deal with RegEx in almost any flavour, then consider the very powerful RegEx utils ‘RegExMagic’ and ‘RegExBuddy’ over at http://www.just-great-software.com/
    These could save you a lot of time and hair-pulling.

    • Sjaak says:

      That samoerai kat looks cool!!

      Also like the preferred snailmail contact form :)

      • Sleepwalker3 says:

        Yes, Ahmad’s a pretty funny character, he comes up with some good stuff.
        With the ‘question on a picture’ stuff I was talking about, they usually have a bunch of pictures that are put up randomly asking you simple questions, but ones that might confuse a script, so even if they have good OCR, it would probably get it wrong anyway.

  2. Sleepwalker3 says:

    Also, if looking for a tutorial on RegEx’s there is a link to one here http://www.regexbuddy.com/tutorial.html and the guy from JG Soft who wrote those utils also has written a best selling book on the subject ‘Regular Expressions Cookbook’ that is now in 2nd edition either in print form or electronic readers. You can get it from Amazon or O’Reilly. http://www.amazon.com/exec/obidos/ASIN/1449319432/jgsbookselection
