Effective Spam Filtering Techniques With Eudora Email

 


 

 

REGULAR EXPRESSIONS  - "regexp (case insensitive)"


As Used in Eudora 5.1 and later for Windows
Regular Expressions are also available in Eudora for Macintosh starting in Version 6.

"Regular expressions" are used to match patterns of characters in many programming languages.  There are various implementations of regular expressions, and Eudora's is based on the POSIX  implementation. Regular expressions use a set of special characters and notation to allow functions such as wildcard characters, character set substitutions, the logical "or" operator, grouping of characters or expressions into sub-expressions, and searching from the beginning or end of a line. The ability to match patterns is a powerful tool when creating email filters, and regular expressions allow the creation of complex and effective filtering rules.

There are two forms of the regular expression verb in Eudora:

• "matches regexp"
• "matches regexp (case insensitive)"

These verbs are found in the Eudora Filters window, in the drop-down list where you assign the relationship between an email header and your text search string(s). The "matches regexp" is case sensitive. I've heard rumors that it's buggy but I haven't tested it much yet. So far I've used the "matches regexp (case insensitive)" option exclusively, but I'm beginning some testing of filter rules using case sensitivity.

These are the characters with special powers when used with regular  expressions:

. ?  *  +  |  \  [ ]  { }  ( ) ^  $

Because they are special, you can't search for them by themselves. To search for one of the special characters, you must put the \ (backslash) "escape" character in front of it. For example, to search for a literal period " . " you would put "backslash period" like this: " \. " To find a dollar sign, search for "\$". And to find a backslash, use "\\".

The minus " - " is special only inside the square brackets and only if it is between other characters "[a-z0-9 ]".
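As a rough illustration of the escaping rule, the sketch below puts a backslash in front of every special character in a piece of literal text. It is written in Python, whose `re` module is only a stand-in here and is not Eudora's POSIX-based engine; the helper name `escape_literal` is made up for this example.

```python
import re

# The special characters listed above.
SPECIALS = ". ? * + | \\ [ ] { } ( ) ^ $".split()

def escape_literal(text: str) -> str:
    """Put a backslash in front of every special character in `text`."""
    return "".join("\\" + ch if ch in SPECIALS else ch for ch in text)

pattern = escape_literal("$19.95 (was $29.95)")
print(pattern)  # \$19\.95 \(was \$29\.95\)

# The escaped pattern finds the literal text and nothing looser.
assert re.search(pattern, "Only $19.95 (was $29.95)!")
assert not re.search(pattern, "Only $19x95 (was $29.95)!")
```

The escaped string is what you would type into the filter's text box; without the backslashes, the periods would act as wildcards and the parentheses as a group.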

 


regular expressions - the Special Characters

   

.

. Period - A "wildcard", used to represent any one character including spaces. Especially handy with a multiplier after it (asterisk, question mark or plus sign), to find zero, one or many of any unspecified characters.

Example1:  ".1"  - Finds "a1" or "B1" or "c1" or " 1" (<sp>1) etc.
Example2:  "1.3" - Finds "123" or "1z3" or "1 3" or "1\3".
Example3:  "123.*567"  - Finds "123567" or "1234567" or "123lots of stuff  can go here when used with an asterisk!567"

Weirdness*  (Tested only in v5.1) When searching for non-alphanumeric characters (punctuation marks and white space):

  • a single non-alphanumeric character will be found by one, two or three periods
  • two sequential non-alphanumeric characters will be found by two, three or four periods
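The period examples can be tried outside Eudora. The sketch below uses Python's `re` module, which is an assumption made for illustration only: it is not Eudora's engine, and it will not reproduce the v5.1 weirdness noted above. The helper name `finds` is made up for this example.

```python
import re

def finds(pattern: str, text: str) -> bool:
    """True if `pattern` occurs anywhere in `text`, ignoring case."""
    return re.search(pattern, text, re.IGNORECASE) is not None

# The period stands for exactly one character of any kind.
assert finds(r".1", "B1")
assert not finds(r".1", "1")        # nothing in front of the 1
assert finds(r"1.3", "1 3")
assert not finds(r"1.3", "13")      # nothing between 1 and 3

# Period plus asterisk: zero or more of anything.
assert finds(r"123.*567", "123567")
assert finds(r"123.*567", "123 lots of stuff 567")
```

One difference to keep in mind: in Python the period does not match a line break unless the `re.DOTALL` flag is given, so multi-line text may behave differently there than in a Eudora filter.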
   

?

? Question Mark - A multiplier for the previous character, character [set] or (group|sub-expression). It will match zero or one of the previous entity.

Example1: "Web-?Site" - Finds "Website" or "Web-Site"
Example2: "Clicki?n?g?" - Finds "Click" or "Clicking" or "Clicki" or "Clickin"
Example3: "Click(ing)? Here" - Finds "Click Here" or "Clicking Here".
Example4: "Click ?Here" - Finds "Click Here" or "ClickHere".

   

*

* Asterisk - A multiplier for the previous character, character [set] or (group|sub-expression). It will match zero or more of the previous entity.

Example1: "12*3" - Finds "123" or "13" or "1223" or "12222222223"
Example2: ".*\.com"  - Finds ".com" and also "anything.com", "aol.com" or "www.fountainofspam.com".

   

+

+ Plus Sign - A multiplier for the previous character, character [set] or (group|sub-expression). It will match one or more of the previous entity, but not zero.

Example1: "12+3" - Finds "123" or "1223" or "1222222223" etc.
Example2: "http://[0-9]+\.[0-9]+\.[0-9]" - Finds "http://1.2.3"  or "http://123.45.678" etc.
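To see the three multipliers side by side, the sketch below runs "12?3", "12*3" and "12+3" against the same three strings. Python's `re` module is used as a stand-in (an assumption, not Eudora's engine), and `which_match` is a made-up helper that requires the whole string to match so the differences are easy to see.

```python
import re

def which_match(pattern, texts):
    """Return the texts that `pattern` matches from start to end."""
    return [t for t in texts if re.fullmatch(pattern, t)]

texts = ["13", "123", "1223"]
print(which_match(r"12?3", texts))  # ['13', '123']
print(which_match(r"12*3", texts))  # ['13', '123', '1223']
print(which_match(r"12+3", texts))  # ['123', '1223']

# Multipliers also apply to a [set], as in the numeric URL example.
assert re.search(r"http://[0-9]+\.[0-9]+\.[0-9]", "see http://123.45.678/x")
```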

   

|

| "OR" - Used between characters, words or phrases to find "one or the other"

Example1: "This|that|the other"   
Example2: "This and (that|the other)"

Notes* Eudora's regexp Help page states that the parentheses are required with the "or" symbol - but that is not correct. The parentheses are useful as shown above, but they are not required.  Mind your spaces when using the "or" symbol, as they are included in the search.

The "|" character is located on your keyboard above the "\", (it may appear on the keyboard to be split in the center) - so to make this character just press <shift> and <backslash> "\".
(The ASCII value of "|" is Dec 124, Hex 7C)
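The two points in the note above (parentheses limit how far the "or" reaches, and spaces next to the "|" count) can be checked with a short sketch. It uses Python's `re` module as a stand-in for Eudora's engine, which is an assumption; `finds` is a made-up helper.

```python
import re

def finds(pattern: str, text: str) -> bool:
    """True if `pattern` occurs anywhere in `text`, ignoring case."""
    return re.search(pattern, text, re.IGNORECASE) is not None

# Without parentheses the "|" splits the whole expression in two.
assert finds(r"This and that|the other", "the other")
# With parentheses, "This and " is required in front of either choice.
assert not finds(r"This and (that|the other)", "the other")
assert finds(r"This and (that|the other)", "This and the other")

# Spaces next to the "|" become part of the alternatives.
assert finds(r"that|the other", "that")
assert not finds(r"that | the other", "that")   # would need "that " with a trailing space
```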

   

\

\ Backslash - The "escape" character - it foils the special powers of any other special character immediately following it, rendering it "ordinary". Makes it possible to search for the special characters.
Example1: "123\.456" - Finds 123.456
Example2: "123\\456" - Finds 123\456
Example3: "attached-file\.(com|exe|bat)" - Finds "attached-file.com" or "attached-file.exe" or "attached-file.bat"

   

[ ]

[ ] Brackets - used to create a [set] of characters, from which we will find one and only one item {unless told otherwise}. The "minus" sign takes on special meaning inside the brackets if it is between other characters, forming a range of characters to include in the search, such as "[a-z]" or "[1-9]". The minus sign is normal if it is the first or last character in the brackets. The "caret" sign "^", when placed first in the brackets, changes the meaning inside the brackets to "not this" or "not these", for alphanumeric characters only. The caret sign ^ has NO special meaning if not in the first position. The other special characters are also all stripped of their special powers when placed inside the brackets, and become ordinary.

Example1: [aeiou] - Finds any one occurrence of an "a" or "e" or "i" or "o" or "u".
Example2: [howdy ] - Finds any one occurrence of "h" or "o" or "w" or "d" or "y" or a <space>.
Example3: [0-9a-z] - Finds any number "0" to "9" or any letter "a" through "z" or "A" through "Z" (with case insensitive search).
Example4: [$0-9]{5} - Finds any five sequential "$" signs and/or numbers "0" through "9".
Example5: 123[^A-F]456 - Finds "123(any one thing except A through F)456"
Example6: "[A-Z]<!--" - Finds words broken by HTML comment tags. For example:
    "S<!-- haha -->EX" or "Nor<!-- html comment -->ton Antivirus".

Weirdness* The caret sign seems either not to work, or to work erratically, for negating punctuation marks or white space characters, and sometimes won't work for alphanumerics if there are adjacent non-alphanumeric characters (v5.1). I haven't quite figured out what the rules are for this yet.
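A few of the bracket examples, run through Python's `re` module as a sketch. This is a stand-in and not Eudora's engine, so it shows how the patterns are meant to behave and will not reproduce the caret weirdness described above; `finds` is a made-up helper.

```python
import re

def finds(pattern: str, text: str) -> bool:
    """True if `pattern` occurs anywhere in `text`, ignoring case."""
    return re.search(pattern, text, re.IGNORECASE) is not None

# Five set members in a row: "$1000" qualifies, "$100" does not.
assert finds(r"[$0-9]{5}", "Win $1000 now")
assert not finds(r"[$0-9]{5}", "Win $100 now")

# A leading caret negates the set.
assert finds(r"123[^A-F]456", "123x456")
assert not finds(r"123[^A-F]456", "123B456")

# A letter directly followed by an HTML comment opener.
assert finds(r"[A-Z]<!--", "Nor<!-- html comment -->ton Antivirus")
assert not finds(r"[A-Z]<!--", "Norton Antivirus <!-- comment -->")
```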

   

{ }

{ } Squiggly Brackets - Put a number {2} or a range of numbers {1,5} between them to specify how many of the previous character or group you wish to find together (sequentially): either an exact count, or a minimum and a maximum.

Example1: "ABC{4}D" - Finds "ABCCCCD".
Example2: "(Http:.*){3}" - Finds any three occurrences of "Http:" separated by anything (because of the period-asterisk wildcard combination included in the parenthesis).
Example3: "(Http:){3}" - Finds "Http:Http:Http:"
Example4: "\$.?.?.,?[0-9]{3}" - Finds "$1000" or "$1,000" or "$22,000" or "$399,456,789".
Example5: "ABC{2,4}D" - Finds "ABCCD" or "ABCCCD" or "ABCCCCD" but not "ABCD".
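The counting examples can be checked with the sketch below, again using Python's `re` module as a stand-in for Eudora's engine (an assumption). It also shows a side effect of Example4: because the periods accept any character, the price pattern is looser than it looks. `finds` is a made-up helper.

```python
import re

def finds(pattern: str, text: str) -> bool:
    """True if `pattern` occurs anywhere in `text`, ignoring case."""
    return re.search(pattern, text, re.IGNORECASE) is not None

assert finds(r"ABC{4}D", "ABCCCCD")
assert not finds(r"ABC{4}D", "ABCCCD")      # only three C's
assert finds(r"ABC{2,4}D", "ABCCD")
assert not finds(r"ABC{2,4}D", "ABCD")      # one C is below the minimum

price = r"\$.?.?.,?[0-9]{3}"
for text in ("$1000", "$1,000", "$22,000", "$399,456,789"):
    assert finds(price, text)
assert not finds(price, "$99")
assert finds(price, "$abc123")   # the periods accept letters as well as digits
```

If the looseness matters, replacing each period with "[0-9]" would restrict the leading characters to digits.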

   

( )

( ) Parentheses - used to make a group of things or a sub-expression. Works well with the "|" (or) symbol. Groups and sub-expressions can be used anywhere a single character could be used in a regular expression, and repeated sub-expressions using the multipliers " ?*+ " are allowed.

Example1: "This and (that|the other)" Finds "This and that" or "This and the other".
Example2: "123( optional words )?456"  - finds "123456", or "123 optional words 456"
Example3: "123( )*ABC" - finds "123ABC", "123 ABC" or "123      ABC" (any number of spaces between "123" and "ABC").
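The grouping examples, sketched with Python's `re` module as a stand-in for Eudora's engine (an assumption); `finds` is a made-up helper. The point to notice is that a multiplier after the closing parenthesis applies to the whole group, not only to the last character in it.

```python
import re

def finds(pattern: str, text: str) -> bool:
    """True if `pattern` occurs anywhere in `text`, ignoring case."""
    return re.search(pattern, text, re.IGNORECASE) is not None

optional = r"123( optional words )?456"
assert finds(optional, "123456")
assert finds(optional, "123 optional words 456")
assert not finds(optional, "123 optional 456")   # the group is all or nothing

assert finds(r"123( )*ABC", "123ABC")
assert finds(r"123( )*ABC", "123      ABC")

# A group with "|" keeps the alternatives next to the shared prefix.
attachment = r"attached-file\.(com|exe|bat)"
assert finds(attachment, "open attached-file.exe now")
assert not finds(attachment, "open attached-file.txt now")
```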

   

^

^ Caret - (When not in square brackets) Represents the start of the line - the character following it must be the first character of a line. When used as the first character in square brackets, the Caret means "[^not these]".

Example1: "^Adv:" - If applied to the Subject Header will find email with subjects starting with "Adv:".
Example2: "^<X-html>" - Finds "<X-html>" but only if it is at the beginning of a line.

NOTE* Eudora treats each header as one line, and the entire body of an email message as one line.

   

$

$ Dollar sign - Represents the end of the line. The character preceding it must be the last character of a line for a match to be found.

Example: "^A.*Z$"   If applied to any header, will look for any header whose first character is an "A" and whose last character is a "Z".

NOTE* Eudora treats each header as one line, and the entire body of an email message as one line.
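The "one line" behavior in the notes above has a rough parallel in Python's `re` module: unless the `re.MULTILINE` flag is given, "^" and "$" anchor to the whole string. The sketch below relies on that parallel, which is an assumption about similarity and not a description of Eudora's internals; the subject and body strings are invented.

```python
import re

def finds(pattern: str, text: str) -> bool:
    """True if `pattern` occurs anywhere in `text`, ignoring case."""
    return re.search(pattern, text, re.IGNORECASE) is not None

subject = "Adv: cheap watches"
body = "Hello,\nAdv: cheap watches\n"

assert finds(r"^Adv:", subject)
# "Adv:" starts the second line of the body, but not the body as a whole.
assert not finds(r"^Adv:", body)

assert finds(r"^A.*Z$", "A to Z")
assert not finds(r"^A.*Z$", "A to Z and back")
```

In practice this means "^" is most useful on headers such as Subject, and of little use for finding text at the start of a line somewhere inside a message body.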

   

-

Minus sign or Hyphen - When used inside the square brackets [ ] and between two alpha or numeric characters it denotes a range of characters to search for. Otherwise it is treated as normal.

Example1: [a-z] - Finds any one letter "a" through "z" or "A" through "Z" (with case-insensitive match)
Example2: [0-9a-z] - Finds any one number "0" to "9" or any letter "a" through "z" or "A" through "Z"
Example3: [-a-z] - Finds any one "-", "a" through "z" or "A" through "Z"
Example4: [0-9-] - Finds any one number "0" through "9" or a hyphen.

   
 
The following "character class" sets are written in lower case only.
This way:  "[[:alpha:]]" ,  and not like this:  "[[:ALPHA:]]" 
   

 [[:alpha:]]

[[:alpha:]] - Represents any one alphabet character; same as "[a-z]" in a case-insensitive search. Other characters may be included in the search by placing them within the outer brackets.
[^[:alpha:]] matches one non-alpha character.

   

[[:digit:]]

[[:digit:]] - Represents any one number character; same as "[0-9]". Other characters may be included in the search by placing them within the outer brackets.
[^[:digit:]] matches one non-numeric character.

Example1: "abc[[:digit:]]def" - Finds "abc1def" or "abc7def" etc.
Example2: "a[[:digit:]!@$]b" - Finds "a1b" or "a!b" or "a@b" or "a$b".

   

[[:blank:]]

[[:blank:]] - Represents a <space> or <tab>. Other characters may also be included in the search by placing them within the outer brackets.
[^[:blank:]] matches one non-blank character.

   

[[:punct:]]

[[:punct:]] - Represents any one punctuation character. If it's a character you can see, and it's not [A-Z] or [0-9], this probably gets it. (Does not catch space or tab). Other characters may be included in the search by placing them within the outer brackets.
[^[:punct:]] matches one non-punctuation character.

Example1: "123[[:punct:]]456" - Finds "123!456" or "123&456" or "123@456".

   

[[:space:]]

[[:space:]] - Represents any one whitespace character - space, tab, carriage return, linefeed. Other characters may be included in the search by placing them within the outer brackets.
[^[:space:]] matches one non-space character.

Example1: "123[[:space:]]456" - Finds "123 456" or "123<tab>456" or "123<cr>456". But will not find "123<cr><lf>456".
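Example2: "Click([[:space:]]{3})?Here" - Finds "ClickHere" or "Click<sp><cr><lf>Here" (zero or exactly three whitespace characters). To also find "Click Here", use a range instead: "Click[[:space:]]{0,3}Here".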

   

[[:graph:]]

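[[:graph:]] - Finds any one visible character: a letter, number or punctuation mark. Under the POSIX definition a <space> is not included.
[^[:graph:]] matches one non-visible character (such as a space or a control character).

Example1: "123[[:graph:]]456" - Finds "123@456" or "123A456" etc.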

   

[[:cntrl:]]

[[:cntrl:]] - Matches any one non-printable character such as carriage return or linefeed.
[^[:cntrl:]] matches one printable character.

Example1: "Hello[[:cntrl:]]{2}Bye!" - Will find:
"Hello
Bye!
"

   

[[:alnum:]]

 

[[:alnum:]] - Finds any one alpha or numeric character. Same as "[0-9a-z]" in a case-insensitive search.
[^[:alnum:]] matches one non-alphanumeric character.

   

[[:xdigit:]]

 

[[:xdigit:]] - Finds any one hexadecimal character "0123456789ABCDEF" (or lower case "abcdef").
[^[:xdigit:]] matches one non-hexadecimal character.
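Not every regular expression engine understands the "[[:name:]]" notation. Python's `re` module, for one, does not, so to test these patterns there they have to be rewritten as ordinary ranges. The sketch below does that with a small lookup table. The table is an assumption based on the plain ASCII definitions of the classes, and `translate` is a made-up helper, not part of any library.

```python
import re

# Plain ASCII equivalents, written as the *inside* of a bracket set.
POSIX_CLASSES = {
    "alpha": "a-zA-Z",
    "digit": "0-9",
    "alnum": "0-9a-zA-Z",
    "xdigit": "0-9A-Fa-f",
    "blank": " \\t",
    "space": " \\t\\n\\r\\f\\v",
    "punct": "!-/:-@\\[-`{-~",
    "graph": "!-~",
    "cntrl": "\\x00-\\x1f\\x7f",
}

def translate(pattern: str) -> str:
    """Replace each [:name:] in `pattern` with its ASCII range."""
    return re.sub(r"\[:([a-z]+):\]",
                  lambda m: POSIX_CLASSES[m.group(1)], pattern)

print(translate("a[[:digit:]!@$]b"))  # a[0-9!@$]b

assert re.search(translate("a[[:digit:]!@$]b"), "a@b")
assert re.search(translate("123[[:punct:]]456"), "123&456")
# One [[:space:]] matches one character, so <cr><lf> is one too many.
assert not re.search(translate("123[[:space:]]456"), "123\r\n456")
assert re.search(translate("Hello[[:cntrl:]]{2}Bye!"), "Hello\r\nBye!")
```

Because the replacement drops straight into the enclosing brackets, extra characters and a leading caret keep working: "[^[:alnum:]]" becomes "[^0-9a-zA-Z]".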

   

\<

Doesn't Seem to Work
But according to the Eudora help page it represents "the start of a word."

   

\>

Doesn't Seem to Work
But according to the Eudora help page it represents "the end of a word."

   

 

 

Note* "contains", "doesn't contain", "regexp", and "regexp(case insensitive)" will all search for individual characters, groups or words within other larger words or character strings, so keep this in mind in choosing your search terms. Searching for the word "sex" for example will also find "Essex" or "heterosexual" or "sextant". Phrase searches can run into a similar trap: Searching for "the other" will also locate "Team Grenthe otherwise won the match", for example.
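One possible workaround for the word-within-a-word trap, since "\<" and "\>" don't seem to work, is to spell out the boundary using only the symbols described on this page: require either the start of the line or a non-letter before the word, and a non-letter or the end of the line after it. The sketch below tests the idea with Python's `re` module as a stand-in (an assumption). Given the caret weirdness noted for v5.1, it may not behave the same way in Eudora, so treat it as something to test and not as a proven filter. `finds` is a made-up helper.

```python
import re

def finds(pattern: str, text: str) -> bool:
    """True if `pattern` occurs anywhere in `text`, ignoring case."""
    return re.search(pattern, text, re.IGNORECASE) is not None

naive = r"sex"
bounded = r"(^|[^a-z])sex([^a-z]|$)"

# The naive search trips over longer words.
assert finds(naive, "Greetings from Essex")
assert finds(naive, "a sextant for sale")

# The bounded version only accepts the word standing alone.
assert not finds(bounded, "Greetings from Essex")
assert not finds(bounded, "a sextant for sale")
assert finds(bounded, "SEX!!!")
assert finds(bounded, "free sex now")
```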

 
