Evaluation Ranges

IP statistics in GBUdb are evaluated in two dimensions: probability and confidence. They are usually plotted with the probability figure on the x axis (left to right, from -1 to +1) and the confidence figure on the y axis (top to bottom, from 0.0 to 1.0).

The envelope for each evaluation range can then be drawn as a collection of points. These ranges are evaluated in a priority sequence so that overlapping ranges can easily be resolved. The priority is (from highest priority to lowest) White, Black, Caution, Undefined. A higher priority range always overrides a lower priority range.
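
The priority rule above can be sketched in a few lines of Python. This is only an illustration, not SNF's actual implementation: the function name `resolve_range` and the idea of passing in the set of ranges whose envelopes contain the IP's statistics are assumptions made for the example.

```python
# Priority order from the text above, highest first.
PRIORITY = ("White", "Black", "Caution")


def resolve_range(matching_ranges):
    """Return the winning range for an IP.

    matching_ranges is a (hypothetical) set of range names whose
    envelopes contain the IP's probability/confidence point.
    """
    for name in PRIORITY:
        if name in matching_ranges:
            return name
    # No envelope matched, so the lowest priority range applies.
    return "Undefined"


print(resolve_range({"Caution", "Black"}))
```

Because the ranges are checked from highest priority down, an IP that sits inside both the Black and Caution envelopes resolves to Black, and an IP inside no envelope falls through to Undefined.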

Below is an ASCII-art representation of the default GBUdb Range Map. SNFServer writes this map to the <licenseid>_snf_engine_cfg.log file as a debugging aid whenever it interprets a new configuration. The configuration log can be compared with the snf_engine.xml file to locate discrepancies.

Range Map - [W]hite [B]lack [C]aution [  ]Normal

    |-9876543210123456789+|
    |               CCCCCC|0
    |               CCCCCC|0.1
    |                CCCBB|0.2
    |                 CCBB|0.3
    |W                 CBB|0.4
    |W                  BB|0.5
    |W                  BB|0.6
    |WW                 BB|0.7
    |WW                 BB|0.8
    |WW                 BB|0.9
    |WWW                BB|1
    |---------------------|
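
When checking a configuration it can help to read the map programmatically. The following Python sketch parses the map exactly as printed above and looks up the range for a given probability and confidence. It assumes that each column is a 0.1 step of probability from -1 to +1, that each row is a 0.1 step of confidence from 0 to 1, and that a point belongs to the nearest cell; the function names are hypothetical, and SNF itself may assign boundary values differently.

```python
RANGE_MAP = """\
|-9876543210123456789+|
|               CCCCCC|0
|               CCCCCC|0.1
|                CCCBB|0.2
|                 CCBB|0.3
|W                 CBB|0.4
|W                  BB|0.5
|W                  BB|0.6
|WW                 BB|0.7
|WW                 BB|0.8
|WW                 BB|0.9
|WWW                BB|1
|---------------------|
"""

LEGEND = {"W": "White", "B": "Black", "C": "Caution", " ": "Normal"}


def parse_range_map(text):
    """Return the data rows of the map, one string of cells per row."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = line[1:line.index("|", 1)]
        # Keep only data rows; the header and footer contain other characters.
        if set(cells) <= set(LEGEND):
            rows.append(cells)
    return rows


def lookup(rows, probability, confidence):
    """Map a (probability, confidence) point to the nearest cell (assumption)."""
    col = min(max(round((probability + 1.0) * 10), 0), len(rows[0]) - 1)
    row = min(max(round(confidence * 10), 0), len(rows) - 1)
    return LEGEND[rows[row][col]]


grid = parse_range_map(RANGE_MAP)
print(lookup(grid, 1.0, 0.5))
```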
			

White

IPs that fall in the white range consistently produce good messages. Normally if the source IP of a message falls in this range then the GBUdb will override any pattern matching rules so that the message will not be tagged as spam. Learning will continue, however, so if a good IP turns bad it will eventually be pushed out of this range and lose that privilege.

Caution

IPs that fall in the caution range are likely to be spam producers, but there is not yet enough confidence to treat them as bad sources (depending upon your system policy). It could be that the first few messages from this IP are unlucky spam from a mixed source that will later produce mostly ham.

In testing it is *almost* always true that if one of the first dozen or so messages from a new IP is spam, then the source is a bad source and any messages that did not match were simply so new that no patterns were in the rulebase yet. Early on, our default for the caution range extended all the way to a probability of -0.9, so that if any of the first few messages turned out to be spam the system was highly prejudiced against the source. Unfortunately, this did cause a few false positives in early training. The current default settings are very conservative in order to avoid as many false positives as we can.

Some systems may find that they can re-tune this range to be extremely prejudiced against new IPs with great success. Others will most likely leave this range mapped as it is rather than risk an occasional false positive from a new mixed source.

By default, if a message comes through with an IP source in this range and no pattern match is found, then SNF will produce a 40 result code. This is a unique code associated with the caution range. Filtering systems that translate SNF result codes to weighting schemes may want to choose an alternate weight for messages that are tagged with this code.

Black

IPs that fall in this range consistently produce bad messages. It is extremely unlikely that any legitimate source will fall in this range. By default, if a message comes through with an IP source in this range and no pattern match is found then SNF will produce a 63 result code. This result code is typically associated with IP black rules. If the message does match a pattern rule (white or black) then the pattern rule will determine the result code.

Truncate

How much more black could it be? The answer is none. None more black. - or - These go to eleven.

IPs that fall in this range are "blacker than black". That is, they fall in the black range but in addition to that their probability figure is sufficiently high that we are willing to cut the scanning process short and base the scan result solely on the GBUdb result. This saves CPU cycles and increases throughput at the expense of some detail about the message contents.

By default, if a message comes through with an IP source in this range, the message is truncated as soon as the source IP is identified and SNF will produce a 20 result code. This result code is unique to this mode. Filtering systems may want to treat messages differently when SNF tags them with this code, either by translating the code to a different (probably higher) weight or by disabling some later tests. All of these choices are, of course, a matter of system policy.
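
The default behaviors described in the sections above can be summarized as a small decision function. This Python sketch is illustrative only and is not SNF's code: the function name, the use of `None` for "no pattern match", and the use of 0 as a placeholder for "not tagged as spam" are assumptions. Only the codes 20, 40, and 63 come from the text.

```python
CODE_TRUNCATE = 20  # unique to the truncate range
CODE_CAUTION = 40   # unique to the caution range
CODE_BLACK = 63     # typically associated with IP black rules
CODE_NOT_SPAM = 0   # assumption: placeholder for "not tagged as spam"


def default_result_code(ip_range, pattern_code=None):
    """Sketch of the default result code for a message.

    ip_range is the range of the source IP; pattern_code is the code from
    a matching pattern rule, or None if no pattern matched.
    """
    if ip_range == "Truncate":
        # Scanning is cut short, so any pattern result is ignored.
        return CODE_TRUNCATE
    if ip_range == "White":
        # A white IP normally overrides pattern matching rules.
        return CODE_NOT_SPAM
    if pattern_code is not None:
        # A matching pattern rule determines the result code.
        return pattern_code
    if ip_range == "Black":
        return CODE_BLACK
    if ip_range == "Caution":
        return CODE_CAUTION
    return CODE_NOT_SPAM


print(default_result_code("Caution"))
```

A filtering system that weights results could key its policy on the returned code, for example giving 20 a higher weight than 63 and giving 40 a lower one; the actual weights are a matter of local policy.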