Silflay Hraka?


Bigwig is a systems administrator at a public university
Hrairoo is the proprietor of a quality used bookstore
Kehaar is.
Woundwort is a professor of counseling at a private university

The Hraka RSS feed

Email
bigwig AT nc.rr.com

Friends of Hraka
InstaPundit
Daily Pundit
cut on the bias
Meryl Yourish
This Blog Is Full Of Crap
Winds of Change
A Small Victory
Silent Running
Dr. Weevil
Little Green Footballs
ColdFury
Oceanguy
Fragments from Floyd
VodkaPundit
Allah
The Feces Flinging Monkey
the skwib
Dean's World
Little Tiny Lies
The Redsugar Muse
Sperari
Natalie Solent
From the Mrs.
ErosBlog
The Anti-Idiotarian Rottweiler
On the Third Hand
Public Nuisance
Not a Fish
Rantburg
AMCGLTD
WeckUpToThees!
Electric Venom
Skippy, The Bush Kangaroo
Common Sense and Wonder
Neither Here Nor There
Wizbang!
Bogieblog
ObscuroRant
RocketJones
The Greatest Jeneration
Ravenwolf
Ipse Dixit
TarHeelPundit
Blog On the Run
blogatron
Redwood Dragon
Notables
Greeblie Blog
Have A Cuppa Tea
A Dog's Life
IMAO
Zonitics.com
Iberian Notes
Midwest Conservative Journal
A Voyage to Arcturus
HokiePundit
Trojan Horseshoes
In Context
dcthornton.blog
The People's Republic of Seabrook
Country Store
Blog Critics
Chicago Boyz
Hippy Hill News
Kyle Still Free Press
The Devil's Excrement
The Fat Guy
War Liberal
Assume the Position
Balloon Juice
Iron Pen In A Velvet Glove
IsraPundit
Freedom Lives
Where Worlds Collide
Knot by Numbers
How Appealing
South Knox Bubba
Heretical Ideas
The Kitchen Cabinet
Dustbury.com
tonecluster
Bo Cowgill
mtpolitics.net
Raving Atheist
The Short Strange Trip
Shark Blog
Hoplites
Jimspot
Ron Bailey's Weblog
Cornfield Commentary
Testify!
Northwest Notes
pseudorandom
The Blog from the Core
Ain'tNoBadDude
CroMagnon
The Talking Dog
WTF Is It Now??
Blue Streak
Smarter Harper's Index
nikita demosthenes
Bloviating Inanities
Sneakeasy's Joint
Ravenwood's Universe
The Eleven Day Empire
World Wide Rant
All American
Pdawwg
The Rant
The Johnny Bacardi Show
The Head Heeb
Viking Pundit
Mercurial
Oscar Jr. Was Here
Just Some Poor Schmuck
Katy & Bruce Loebrich
But How's The Coffee?
Roscoe Ellis
Foolsblog
Sasha Castel
Dodgeblogium
Susskins Central Dispatch
DoggerelPundit
Josh Heit
Attaboy
Aaron's Rantblog
MojoMark
As I was saying...
Blog O' Dob
Dr. Frank's Blogs Of War
Betsy's Page
A Knob for Brightness
Fresh Bilge
The Politburo Diktat
Drumwaster's rants
Curt's Page
The Razor
An Unsealed Room
The Legal Bean
Helloooo chapter two!
As I Was Saying...
SkeptiLog AGOG!
Tong family blog
Vox Beth
Velociblog
I was thinking
Judicious Asininity
This Woman's Work
Fragrant Lotus
DaGoddess
Single Southern Guy
Caerdroia
GrahamLester.Com
Jay Solo's Verbosity
TacJammer
Snooze Button Dreams
Horologium
You Big Mouth, You!
From the Inside looking Out
Night of the Lepus
No Watermelons Allowed
From The Inside Looking Out
Lies, Damn Lies, and Statistics
Suburban Blight
Aimless
The SmarterCop
Dog of Flanders
From Behind the Wall of Sleep
Beaker's Corner
Bad State of Gruntledness
Who Tends The Fires
Granny Rant
Elegance Against Ignorance
Moxie.nu
Eccentricity
Say What?
Blown Fuse
Wait 'til Next Year
The Pryhills
The Whomping Willow
The National Debate
The Skeptician
Zach Everson
MonkeyWatch
Geekward Ho
Argghhh!!!
Life in New Orleans
Rotten Miracles
Fringe
The Biomes Blog
illinigirl
See What You Share
Truthprobe
Blog dElisson
Your Philosophy Sucks
Watauga Rambler
Socialized Medicine
Consternations
Verging on Pertinence
Read My Lips
ambivablog
Soccerdad
The Flannel Avenger
Butch Howard's WebLog
Castle Argghhh!
Andrew Hofer
kschlenker.com
Moron Abroad
White Pebble
Darn Floor
Wizblog
tweedler
Pajama Pundits
BabyTrollBlog
Cadmusings
Goddess Training 101
A & W
Medical Madhouse
Slowly Going Sane
The Oubliette
American Future
Right Side Redux
See The Donkey
Newbie Trucker
The Right Scale
Running Scared
Ramblings Journal
Focus On Reality
Wyatt's Torch

December 14, 2004

Filtering

Understandably at the moment I am not at my mental peak, my body having made the decision to divert internal resources to fighting off microbes rather than enlisting them in support of higher functions, like thought or typing ability, but I'm still perplexed by the time it has taken me to do rather simple tasks--such as creating a regular expression for my MT-blacklist instance.

Something I've noticed over the past week about the comment spam that shows up at Hraka is that the URLs embedded within it are, in many cases, characterized by an inordinate number of dashes, along the lines of http://liza-minnellis-house-of-animal-fun. To block future iterations similar to that URL, I've been adding "-minnelli" and "minnelli-" to the blacklist, so that if in future a spammer tries to advertise http://vincent-minnellis-house-of-animal-spanking the attempt will fail. I've done the same thing with the vast majority of sexual terms, especially those that have been purposefully misspelled in an attempt to get around other filter entries, such as "-erotik," "sexe-," and "bestialitee-".
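A minimal sketch of that kind of substring blacklist, written in Python purely for illustration. The `BLACKLIST` entries and the `is_blocked` helper are made-up names, and this is not how MT-Blacklist itself is implemented; it only shows the idea of rejecting a URL when any listed fragment appears in it.

```python
# Hypothetical substring blacklist. The "minnelli" entries use the double n
# so that they actually occur inside the example URLs.
BLACKLIST = ["-minnelli", "minnelli-", "-erotik", "sexe-", "bestialitee-"]


def is_blocked(url: str) -> bool:
    """Return True if any blacklist entry appears anywhere in the URL."""
    lowered = url.lower()
    return any(entry in lowered for entry in BLACKLIST)


print(is_blocked("http://liza-minnellis-house-of-animal-fun"))          # True
print(is_blocked("http://vincent-minnellis-house-of-animal-spanking"))  # True
print(is_blocked("http://example.com/books"))                           # False
```

The drawback is visible in the list itself: every new term needs two entries, one with the dash in front and one with it behind.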

A more elegant solution than an endless list of terms ended or prepended with a dash is to come up with one rule that blocks entries containing more than their fair share of dashes, which is where regular expressions come in.

A regular expression is basically a way of allowing a computer to match a pattern within a given amount of text. At its most basic one would look something like this: "h.t"--which, as the dot inside means "any single character," would cause the computer to catch all instances of "hat," "hit," and "hot" in a chunk of text, as well as more nonsensical entries like "h8t" and "hzt."
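To see the dot at work, here is a small sketch using Python's `re` module. That module is an assumption chosen for illustration (the blacklist itself runs on a different engine), and `find_h_dot_t` and the sample string are invented names.

```python
import re


def find_h_dot_t(text: str):
    """Return every run made of 'h', exactly one character of any kind, then 't'."""
    return re.findall(r"h.t", text)


sample = "hat hit hot h8t hzt ht heat"
print(find_h_dot_t(sample))  # ['hat', 'hit', 'hot', 'h8t', 'hzt']
```

Note that "ht" and "heat" are left alone: the dot stands for exactly one character, no more and no fewer.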

The character "$" means "at the end of a line," so "h.t$" would match any of the above occurrences as long as they were the last word before a carriage return. There's also a symbol for "the beginning of a line." It's "^", which gives us "^h.t".
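A short sketch of the two anchors, again assuming Python's `re` module as a stand-in; `anchored` and the sample lines are hypothetical. One Python-specific detail: by default "^" and "$" anchor to the whole string, so the `re.MULTILINE` flag is needed to make them apply to each line of a longer text.

```python
import re


def anchored(line: str):
    """Return (matches "^h.t", matches "h.t$") for a single line of text."""
    at_start = re.search(r"^h.t", line) is not None
    at_end = re.search(r"h.t$", line) is not None
    return at_start, at_end


for line in ["hot", "a red hat", "hit the ball", "hat trick hut"]:
    print(line, anchored(line))

# With re.MULTILINE, "$" matches at the end of every line, not just the last.
text = "a red hat\nhit the ball\nso hot"
print(re.findall(r"h.t$", text, re.MULTILINE))  # ['hat', 'hot']
```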

One of the more common elements in a regular expression is the asterisk, which matches zero or more occurrences of the character immediately preceding it. The regular expression ".*" matches any number of any characters, including none at all, for example.
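The "zero or more" part is the bit that tends to surprise people, so here is a small sketch, assuming Python's `re` module; `whole_match` is a hypothetical helper that reports whether a pattern accounts for an entire string.

```python
import re


def whole_match(pattern: str, text: str) -> bool:
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


# "o*" means zero or more letter o's between the h and the t.
print(whole_match(r"ho*t", "ht"))      # True  (zero o's)
print(whole_match(r"ho*t", "hot"))     # True  (one o)
print(whole_match(r"ho*t", "hoooot"))  # True  (several o's)
print(whole_match(r"ho*t", "hat"))     # False ('a' is not an o)

# ".*" matches anything on a line, including nothing at all.
print(whole_match(r".*", ""))                 # True
print(whole_match(r".*", "any text at all"))  # True
print(whole_match(r".*", "two\nlines"))       # False (the dot stops at a newline)
```

The last line matters for a comment filter: in Python's default mode the dot does not cross a line break, so ".*" only ever covers text within one line.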

What I'm interested in is a regular expression that will hit on any instance of a dash followed by two or more dashes in a comment. Here's my first attempt: "-.*-.*-.*", which in English I think translates as "match any instance where a dash is followed by any run of characters (possibly none), then a second dash, then another run of characters, then a third dash, and finally whatever else is left on the line."

My past experience with regular expressions leads me to believe that this is probably incorrect, as they can be fiendishly complicated beasties, and my first attempts at them almost never work.

At least, if commenting happens to be somewhat wonky today, you'll know why.

Update: It seems to have worked, though I cut the final filter entry down to "-.*-.*-". Let me know if you have trouble with commenting, though.
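For anyone who wants to poke at the filter, here is a rough sketch in Python. The `re` module is an assumption standing in for the blacklist's own regex engine, and `too_many_dashes` and `count_based` are invented names. It also checks that the shortened entry and the first attempt flag the same strings, since a trailing ".*" is allowed to match nothing.

```python
import re

DASH_FILTER = re.compile(r"-.*-.*-")       # the shortened entry
FIRST_ATTEMPT = re.compile(r"-.*-.*-.*")   # the original attempt


def too_many_dashes(comment: str) -> bool:
    """Return True if any single line of the comment holds three or more dashes."""
    return DASH_FILTER.search(comment) is not None


def count_based(comment: str) -> bool:
    """The same test without a regex: count the dashes on each line."""
    return any(line.count("-") >= 3 for line in comment.split("\n"))


samples = [
    "http://liza-minnellis-house-of-animal-fun",   # five dashes
    "a well-known, time-tested idea",              # two dashes
    "a first-rate, well-known, time-tested idea",  # three dashes in honest prose
    "one-\ntwo-\nthree-",                          # three dashes, one per line
    "---",                                         # nothing between the dashes
]

for s in samples:
    same = (FIRST_ATTEMPT.search(s) is not None) == too_many_dashes(s)
    print(repr(s), too_many_dashes(s), count_based(s), same)
```

The third sample is the cost of the rule: an ordinary sentence with three hyphenated words on one line gets caught too. The fourth passes because, in this sketch, the dot does not match a line break, so the dashes are counted per line.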

Posted by Bigwig at December 14, 2004 11:48 AM | TrackBack
Postscript:
First time visitor to House Hraka? Wondering if everything we produce could possibly be as brilliant/stupid/evil/pedantic/insipid/inspired as the post you just read? Check out the Hraka Essentials, the (mostly) reader-selected guide to Hraka's best posts, and decide for yourself.
Comments

"My past experience with regular expressions leads me to believe that this is probably incorrect, as they can be fiendishly complicated beasties, and my first attempts at them almost never work."

You, sir, just made my day... I feel better now about my irrational fear of the things.

Posted by: ben at December 14, 2004 12:27 PM

How funny! Especially as I'd picked you to be the first to respond from the "Nice try, but it could be done so much more elegantly like this" crowd--which is a compliment, in case you're wondering. :)

Posted by: Bigwig at December 14, 2004 12:50 PM

"...I'd picked you to be the first to respond from the "Nice try, but it could be done so much more elegantly like this" crowd..."

Ask, and you shall receive...

I've tried to start a fight of sorts with Jay Allen to the effect that it just ought to become standard practice (good citizenship) in the blogging community to either allow commenting or comment indexing, but not both.

If your markup is properly written, all of your comments are in a contiguous block. It's a lot easier to make a timestamp comparison or evaluate a single boolean value from a database record you've already retrieved - and consequently deny service of that markup to anything with googlebot.com (for example) in the REMOTE_HOST - than it is to check against a BIG WHOPPING BLACKLIST, yes?

Then, the ISPs themselves could check a blacklist they maintain, and weed out customers who are too lazy to undertake the proper care and feeding of their comment threads.

It would probably also be necessary in this approach to close comment threads at a randomly assigned time within a given range, so that spammers could never know for sure when their time's run out.

Throw in a simple moderation system, and you're good to go.

The transition would be a painful, time-consuming, weeks-long process of debugging code, getting people on the bandwagon, and waiting for spammers to get a clue... but the spam issue would end once the spammers realize that their unwanted linkage isn't making it into Google, et al.

There is your elegant solution: don't change the code, change the process.

Posted by: ben at December 14, 2004 03:00 PM

I suspect strongly that "erotik" and "sexe" are not intentional misspellings, but foreign-language.

German and/or French, in fact.

Posted by: Sigivald at December 15, 2004 03:19 PM

#!/usr/bin/perl
use strict;
use warnings;

# the trick I use is to use tr to translate a character to itself
# since this returns the number of translations performed, it
# effectively counts the occurrences of that character *very*
# efficiently

# heh, your blocker blocked it. - translated to : for this example

# a plain list rather than qw(), which would keep the quote marks as part of each string
my @strs = ( 'asdfasdf:adfasdfa:adsfa:asdfasda:::', ':', '::', 'asdfa:', 'asdfasdf' );

foreach my $str (@strs)
{
    foreach my $thresh (0..7)
    {
        # tr/:/:/ leaves $str unchanged and returns the number of colons in it
        my $count = ($str =~ tr/:/:/);
        if ($count > $thresh)
        {
            print "$str would fail: $count is more than $thresh\n";
        }
        else
        {
            print "$str is OK in my contrived example\n";
        }
    }
}

Posted by: Jeff Medcalf at December 15, 2004 08:13 PM

The standard reference work is Mastering Regular Expressions, by Jeffrey Friedl. Watch out: some of those things are frightening. Jeff Medcalf above is right; that's a great method for your purpose.

Posted by: Eric Jablow at December 15, 2004 11:42 PM

Ohgod I own that book. I hate regular expressions, but they once had a strange fascination for me. Now they just make my brain hurt. I skipped over most of your post, even. :)

I once did some perl projects that involved massaging text file databases. You ever write something that you look at later and don't understand?

Posted by: Greg at December 17, 2004 10:17 AM
Post a comment Note: Comments with more than two dashes per line will be blocked as spam.








