Message Sniffer.TechnicalDetails.GBUdb
From ARM-KB
An introduction to GBUdb
GBUdb is an IP reputation system based on a collaborative behavior monitoring network. Each SNF node learns the behaviors of the message source IPs it encounters and then shares that information with all of the other SNF nodes in the cloud.
The name is derived from the four principal types of IP classifications in the system:
- Good - Administratively white-listed.
- Bad - Administratively black-listed.
- Ugly or Undefined - Evaluated by learned behavior.
- Ignore or Infrastructure - Gateways, or mid-way MTAs.
How does GBUdb work (in general)
When a message is processed by SNF the source IP of the message is determined and evaluated by the GBUdb. Then the message is scanned by the pattern matching engine in the usual way.
If the IP is white listed (Good), black listed (Bad), or confidence in the learned behavior of the IP is high then the scan result will be influenced by the GBUdb information. If the GBUdb doesn't have any strong evidence about a given IP then the SNF pattern scanner operates normally.
Some examples
- Caution / Black - If a message fails to match a black SNF pattern rule but the IP is known to be bad then a nonzero result will be returned so that the message will be considered spam. This mode reduces false negatives (leakage).
- White - If a message matches a black SNF pattern rule but the IP is known to be good then the result will be forced to zero (typ) so that the message will not be considered spam. This mode reduces false positives.
Special cases
- Truncate - If the IP is known to be very, very bad, then the SNF pattern scan will be interrupted as soon as the source IP can be determined and a special result code will be returned to indicate that the message was truncated. This saves CPU cycles for other work and improves system throughput. This mode can also reduce leakage since there is no chance that a message coming from a truncated source will fail to match a pattern rule and slip past the scanner.
- Auto Panic - If the IP is known to be good and the SNF pattern scan matches a new black rule then the system will "auto-panic". This causes the new pattern rule to be put into a temporary rule-panic list so that it becomes inert. Telemetry from the system will notify us of the conflict so that we can correct the troublesome rule.
The system is designed so that each individual SNF node retains its own unique perspective on the IPs it encounters while sharing that information with all other SNF nodes and gathering additional information from them.
What GBUdb knows and learns
Each IP is represented by three pieces of information. These are a type flag, an integer count of bad encounters, and an integer count of good encounters.
In general, the type flag allows for gross classifications of each IP and the good and bad counters provide for fine details.
The good and bad counters combine to provide a pair of statistical values. These are the probability figure which represents how likely a message is to be "spam" or "ham", and the confidence figure which represents how much current data we have to support the probability assessment. Each counter is a 15 bit integer with a range from 0 to 32767.
Probability
The probability figure is based on the ratio of bad encounters to good encounters. This value ranges from -1, which indicates a 100% probability that the message will be "ham", to +1, which indicates a 100% probability that the message will be "spam". In the center we have 0, which indicates a 50/50 probability of "spam" or "ham".
The probability is calculated as follows:
Probability = (Bad - Good) / (Bad + Good);
| Probability | Expectation |
|---|---|
| +1.0 | Expect 100% of messages to be spam |
| +0.9 | Expect 95% of messages to be spam |
| +0.8 | Expect 90% of messages to be spam |
| +0.7 | Expect 85% of messages to be spam |
| +0.6 | Expect 80% of messages to be spam |
| +0.5 | Expect 75% of messages to be spam |
| +0.4 | Expect 70% of messages to be spam |
| +0.3 | Expect 65% of messages to be spam |
| +0.2 | Expect 60% of messages to be spam |
| +0.1 | Expect 55% of messages to be spam |
| 0.0 | Expect 50% of messages to be spam |
| -0.1 | Expect 55% of messages to be ham |
| -0.2 | Expect 60% of messages to be ham |
| -0.3 | Expect 65% of messages to be ham |
| -0.4 | Expect 70% of messages to be ham |
| -0.5 | Expect 75% of messages to be ham |
| -0.6 | Expect 80% of messages to be ham |
| -0.7 | Expect 85% of messages to be ham |
| -0.8 | Expect 90% of messages to be ham |
| -0.9 | Expect 95% of messages to be ham |
| -1.0 | Expect 100% of messages to be ham |
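As a rough illustration of the formula and table above, the following Python sketch computes the probability figure from the two counters and converts it to the expected share of spam. The function names are ours, and returning 0.0 for an IP with no recorded encounters is an assumption; the text does not say how that case is handled.

```python
def probability(bad: int, good: int) -> float:
    """Probability figure in [-1.0, +1.0]: -1 is all ham, +1 is all spam."""
    total = bad + good
    if total == 0:
        # Assumption: an IP with no recorded encounters is treated as neutral.
        return 0.0
    return (bad - good) / total


def expected_spam_fraction(p: float) -> float:
    """Convert a probability figure to the expected fraction of spam messages."""
    # (bad - good) / (bad + good) equals 2 * bad / (bad + good) - 1,
    # so the spam share bad / (bad + good) is (1 + p) / 2.
    return (1.0 + p) / 2.0


# 90 bad and 10 good encounters land on the +0.8 row of the table.
p = probability(bad=90, good=10)
print(p, expected_spam_fraction(p))
```

The second function is simply the table expressed as arithmetic: each step of 0.1 in the probability figure moves the expectation by 5 percentage points.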
Confidence
The confidence figure is based on the raw number of encounters remembered. This value ranges from 0 which indicates no samples and no confidence to 1.0 which indicates a very high confidence. The confidence value is calculated on a logarithmic scale as follows:
Confidence = log(Bad + Good) / log(16383.5);
| Bad + Good | Confidence |
|---|---|
| 1 | 0.0 |
| 2 | 0.071429 |
| 4 | 0.142858 |
| 8 | 0.214287 |
| 16 | 0.285716 |
| 32 | 0.357145 |
| 64 | 0.428574 |
| 128 | 0.500003 |
| 256 | 0.571432 |
| 512 | 0.642861 |
| 1024 | 0.71429 |
| 2048 | 0.785719 |
| 4096 | 0.857148 |
| 8192 | 0.928577 |
| 16384 and above | 1.0 |
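The confidence calculation can be sketched in a few lines of Python. The clamp to 1.0 follows the last row of the table, and treating zero samples as zero confidence follows the description above; the function and constant names are ours.

```python
import math

# Divisor constant taken from the formula above.
FULL_CONFIDENCE_SAMPLES = 16383.5


def confidence(bad: int, good: int) -> float:
    """Confidence figure in [0.0, 1.0] on a logarithmic scale."""
    total = bad + good
    if total < 1:
        return 0.0  # no samples, no confidence
    # 16384 encounters or more are clamped to full confidence.
    return min(1.0, math.log(total) / math.log(FULL_CONFIDENCE_SAMPLES))


for samples in (1, 2, 128, 16384):
    print(samples, round(confidence(samples, 0), 6))
```

Because the scale is logarithmic, each doubling of the encounter count adds roughly the same increment (about 0.0714) to the confidence figure.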
IP Types
There are four basic types of IPs known by the system. These types provide a context for interpreting the other data that is stored about each IP and often imply specific actions to be taken on messages from those IPs.
Infrastructure (Type = Ignore)
Ignore Flag Basics
IPs with the Ignore flag are considered to be part of the messaging infrastructure and so they cannot be considered as the source IP for a message. In fact, the source IP for a message is considered to be the first IP (received header) that is not marked as infrastructure with the Ignore flag.
The Ignore flag should be set for any trusted IP that may appear in a Received header:
- Localhost (127.0.0.1)
- Your own inbound SMTP gateways
- Internal servers that create email
- Routers that make email reports
- Firewalls that make email reports
- Web servers that produce email (Be careful!)
- Other trusted systems that may generate email
- Outbound servers from mixed-source ISPs (more on this later)
The GBUdb learns which IPs to ignore either from the command line SNFClient utility, SNF_XCI transactions, or most commonly from the GBUdbIgnoreList.txt file. Each time the SNFServer reloads its configuration information the GBUdbIgnoreList.txt file is applied to the GBUdb database. This ensures that infrastructure IPs will be part of the GBUdb in all cases.
If you want to remove the Ignore flag from an IP in the GBUdb then you must use the SNFClient or an SNF_XCI request to do so. Simply removing the IP from the GBUdbIgnoreList will NOT remove the flag from the database.
Drill Down Training
Some large and important ISPs have a problem with sending out spam from their customers. Nonetheless we must accept many legitimate messages from these same mixed sources. In these cases one solution that GBUdb provides is drill-down training.
By adding the troublesome outbound server's IP to your Ignore list you instruct GBUdb to look beyond that IP to the next one in the Received headers when determining the true IP source of the message. This way the GBUdb is able to make a distinction between the IP of the mixed-source's outbound SMTP server and the IP that originated the message. As a result the GBUdb may be able to learn which sources behind that mixed source can be trusted and which cannot.
Consider:
- Received: from mixed.mta.at.big.isp [12.34.56.78] so forth and so on Ignored
- Received: from hijacked.host.at.big.isp [23.45.67.89] so forth and so on Source
- GBUdb learns not to trust IP 23.45.67.89.
Consider:
- Received: from mixed.mta.at.big.isp [12.34.56.78] so forth and so on Ignored
- Received: from well.behaved.host.at.big.isp [34.56.78.90] so forth and so on Source
- GBUdb learns to trust 34.56.78.90.
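The source selection rule used in these examples can be sketched as follows. This is a minimal illustration, assuming the IPs have already been extracted from the Received headers in the order the headers appear in the message; the function name and the `None` fallback are ours and are not part of SNF.

```python
from typing import Iterable, Optional, Set


def find_source_ip(received_ips: Iterable[str], ignored: Set[str]) -> Optional[str]:
    """Return the first Received-header IP that does not carry the Ignore flag."""
    for ip in received_ips:
        if ip not in ignored:
            return ip
    # Assumption: if every hop is infrastructure there is no usable source.
    return None


# Drill-down: the mixed-source MTA is ignored, so the next hop becomes the source.
ignore_list = {"127.0.0.1", "12.34.56.78"}
print(find_source_ip(["12.34.56.78", "23.45.67.89"], ignore_list))
```

Without the ISP's outbound server on the Ignore list, the same call would stop at 12.34.56.78 and all customers behind it would share one reputation.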
The Good (Type = Good)
Good IPs are considered to be administratively white-listed. It is presumed that messages from these IPs should be passed through the system without regard to their content.
In this context the counters are free to be used for other purposes. For the time being they will be used to store their ordinary statistics.
The Bad (Type = Bad)
Bad IPs are considered to be administratively black-listed. It is presumed that messages from these IPs should be blocked or discarded without regard to their content.
In this context the counters are free to be used for other purposes. For the time being they will be used to store their ordinary statistics.
The Ugly (Type = Ugly)
Ugly IPs are evaluated based on their observed behavior. If messages from an IP consistently match black pattern rules then they will have statistics that predict more "spam" from this source. Similarly if messages from an IP consistently match white pattern rules or more commonly fail to match pattern rules then they will have statistics that predict "ham" is more likely.
GBUdb Evaluation Ranges
IP statistics in GBUdb are evaluated in two dimensions. This is usually represented graphically with the probability figure on the x axis (horizontally left to right from -1 to +1) and the confidence figure on the y axis (vertically from top to bottom from 0.0 to 1.0).
The envelope for each evaluation range can then be drawn as a collection of points. These ranges are evaluated in a priority sequence so that overlapping ranges can easily be resolved. The priority is (from highest priority to lowest) White, Black, Caution, Undefined. A higher priority range always overrides a lower priority range.
Below is an ASCII-art representation of the default GBUdb Range Map. This ASCII-art is written to the <licenseid>_snf_engine_cfg.log file as a debugging aid. This file is produced by the SNFServer whenever it interprets a new configuration. The configuration log can be compared with the snf_engine.xml file to locate discrepancies.
Range Map - [W]hite [B]lack [C]aution [ ]undefined
|-9876543210123456789+|
|            CCCCCCCCC|0
|            CCCCCCCCC|0.1
|            CCCCBBBBB|0.2
|             CBBBBBBB|0.3
|             BBBBBBBB|0.4
|            BBBBBBBBB|0.5
|           BBBBBBBBBB|0.6
|W        BBBBBBBBBBBB|0.7
|W       BBBBBBBBBBBBB|0.8
|WW     BBBBBBBBBBBBBB|0.9
|WW    BBBBBBBBBBBBBBB|1
|---------------------|
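To make the map easier to read, here is a small Python sketch that treats it as a lookup grid: 21 columns for probability -1.0 to +1.0 and 11 rows for confidence 0.0 to 1.0. It rests on two assumptions. First, the column positions are reconstructed on the basis that white cells sit against the left edge and caution and black cells against the right edge. Second, snapping values to the nearest 0.1 is only an illustration; the real engine evaluates the range envelopes configured in snf_engine.xml, not a fixed grid.

```python
WIDTH = 21  # one column per 0.1 step of probability, from -1.0 to +1.0

# (left-anchored cells, right-anchored cells) for confidence 0.0, 0.1, ... 1.0.
ROWS = [
    ("", "C" * 9),
    ("", "C" * 9),
    ("", "C" * 4 + "B" * 5),
    ("", "C" + "B" * 7),
    ("", "B" * 8),
    ("", "B" * 9),
    ("", "B" * 10),
    ("W", "B" * 12),
    ("W", "B" * 13),
    ("WW", "B" * 14),
    ("WW", "B" * 15),
]
RANGE_MAP = [left + right.rjust(WIDTH - len(left)) for left, right in ROWS]
NAMES = {"W": "white", "B": "black", "C": "caution", " ": "undefined"}


def classify(probability: float, confidence: float) -> str:
    """Look up the range for a (probability, confidence) pair on the grid."""
    col = min(WIDTH - 1, max(0, round(probability * 10) + 10))
    row = min(len(RANGE_MAP) - 1, max(0, round(confidence * 10)))
    return NAMES[RANGE_MAP[row][col]]


print(classify(1.0, 1.0), classify(-1.0, 1.0), classify(0.5, 0.0), classify(0.0, 0.5))
```

The grid already reflects the priority order, so a single lookup is enough here; an envelope-based evaluation would test White, then Black, then Caution, and fall through to Undefined.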
White
IPs that fall in the white range consistently produce good messages. Normally if the source IP of a message falls in this range then the GBUdb will override any pattern matching rules so that the message will not be tagged as spam. Learning will continue, however, so if a good IP turns bad it will eventually be pushed out of this range and lose that privilege.
Caution
IPs that fall in the caution range are likely to be spam producers, however there is not yet enough confidence to treat them as bad sources (depending upon your system policy). It could be that the first few messages from this IP are unlucky spam from a mixed source that later will produce mostly ham.
In testing it is *almost* always true that if one of the first dozen or so messages from a new IP is spam then the source is a bad source, and any messages that did not match were simply so new that no patterns were in the rulebase yet. Early on, our default for the caution range extended all the way to a probability of -0.9 so that if any of the first few messages turned out to be spam the system was highly prejudiced. Unfortunately this did cause a few false positives in early training. The current default settings are very conservative in order to avoid as many false positives as we can.
Some systems may find that they can re-tune this range to be extremely prejudicial of new IPs with great success. Others will most likely leave this range mapped as it is rather than risk an occasional false positive from a new mixed source.
By default, if a message comes through with an IP source in this range and no pattern match is found then SNF will produce a 40 result code. This is a unique code associated with the caution range. Filtering systems that translate SNF result codes to weighting schemes may want to choose an alternate weight for messages that are tagged with this code.
Black
IPs that fall in this range consistently produce bad messages. It is extremely unlikely that any legitimate source will fall in this range. By default, if a message comes through with an IP source in this range and no pattern match is found then SNF will produce a 63 result code. This result code is typically associated with IP black rules. If the message does match a pattern rule (white or black) then the pattern rule will determine the result code.
Truncate
How much more black could it be? The answer is none. None more black. - or - These go to eleven.
IPs that fall in this range are "blacker than black". That is, they fall in the black range but in addition to that their probability figure is sufficiently high that we are willing to cut the scanning process short and base the scan result solely on the GBUdb result. This saves CPU cycles and increases throughput at the expense of some detail about the message contents.
By default, if a message comes through with an IP source in this range the message is truncated as soon as the source IP is identified and SNF will produce a 20 result code. This result code is unique to this mode. Filtering systems may want to treat messages differently when SNF tags them with this code either by translating the code to a different (probably higher) weight, or by disabling some later tests. All of these choices are, of course, a matter of system policy.
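Pulling the default behaviors of the ranges together, the decision flow might look something like the Python sketch below. The result codes 0, 20, 40 and 63 come from the text; everything else is an assumption made for illustration, including the function name, the string labels for the ranges, and the convention that the pattern scanner returns `None` when no rule matches. The "peek" mechanism is left out.

```python
from typing import Callable, Optional

RESULT_CLEAN = 0      # white range: result forced to zero
RESULT_TRUNCATE = 20  # truncate range: scan skipped
RESULT_CAUTION = 40   # caution range, no pattern match
RESULT_BLACK = 63     # black range, no pattern match


def scan_message(ip_range: str, pattern_scan: Callable[[], Optional[int]]) -> int:
    """Combine the GBUdb range of the source IP with the pattern scan result."""
    if ip_range == "truncate":
        # The pattern scan is never started, which is where the CPU saving comes from.
        return RESULT_TRUNCATE
    pattern_result = pattern_scan()  # None means no rule matched (our convention)
    if ip_range == "white":
        return RESULT_CLEAN
    if pattern_result is not None:
        return pattern_result  # a matching rule decides the result
    if ip_range == "black":
        return RESULT_BLACK
    if ip_range == "caution":
        return RESULT_CAUTION
    return RESULT_CLEAN


print(scan_message("black", lambda: None), scan_message("caution", lambda: None))
```

Passing the scanner as a callable keeps the sketch honest about ordering: for a truncated source the scan function is simply never invoked.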
The Blindness Paradox (and how to get out of it)
As previously stated, messages from IPs in the other ranges continue to be scanned by SNF's pattern matching engine. Messages in the truncate range are not scanned however. This can create what is known as the blindness paradox.
The blindness paradox says that a spam filtering system may become so good at filtering out spam that it can no longer see what spam looks like.
In order to prevent this, the truncate mode also has a "peek" setting that allows some fraction of truncated messages to be scanned in the normal way. This allows the pattern matching engine to "see" what kinds of messages are coming from the IP source and retrain the GBUdb, albeit at a slower rate than normal.
If an IP source in the truncate range suddenly becomes a source of good messages then the combination of re-training through the "peek" mechanism and regular GBUdb "condensation" will eventually force the IP back into the ordinary black range where all of its messages will be evaluated by the SNF pattern matching engine.
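One simple way to picture the "peek" setting is as a random sample of truncated messages. The Python sketch below is only that: the parameter name `peek_fraction` is hypothetical, and the text does not say whether SNF samples randomly or on some fixed schedule.

```python
import random


def should_peek(peek_fraction: float, rng: random.Random) -> bool:
    """Decide whether a message from a truncate-range IP gets a full scan anyway."""
    # rng.random() is in [0.0, 1.0), so 0.0 never peeks and 1.0 always peeks.
    return rng.random() < peek_fraction


rng = random.Random(42)  # seeded only so the demonstration is repeatable
peeked = sum(should_peek(0.05, rng) for _ in range(10_000))
print(f"{peeked} of 10000 truncated messages were scanned normally")
```

The smaller the fraction, the more CPU is saved and the more slowly the GBUdb can notice that a truncated source has changed its behavior.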
If the system administrator notices the change before the GBUdb then they can always use the SNFClient utility (or an SNF_XCI transaction) to immediately update their system.
Virtual Spam Traps
On the dark side of the blindness paradox it is possible that new kinds of spam may be coming from known bad message sources. We might otherwise never see these messages until they start coming from new, as yet unknown IPs in the form of leakage.
This is actually an opportunity in disguise. Since we have known bad message sources and we have a high confidence in that assessment we can randomly sample messages from these sources and if they are new to us (they do not match SNF pattern rules) then we can send those samples to special (virtual) spam traps for evaluation. This has many benefits:
- We can code pattern rules for new spam campaigns more quickly since we don't have to wait for them to be reported by customers as leakage and we don't have to wait for them to arrive at our other (conventional) spam traps.
- We don't have to create conventional spam traps that may be difficult to seed and may be easily discovered and avoided by the blackhats.
- Virtual spam traps have no identity so they cannot be easily identified nor can they be easily avoided. Essentially, these virtual spam traps are everywhere and nowhere at the same time. Once a bad message source is identified the virtual spam trap system is silently "plugged in" to that message source.
For security reasons some systems may choose not to participate in the virtual spam trap program. For this reason it can easily be turned off without compromising the "peek" functionality that prevents the blindness paradox.
Information Sharing and Security
The GBUdb Cloud
When an IP's good or bad counter reaches an even power of 2 due to a new encounter then an "alert" is sent to the GBUdb cloud. The alert contains the composite data for that IP from the perspective of the node creating the alert. The SYNC server integrates this information into the "consensus" and returns a "reflection". The reflection is a composite record for that IP from the cloud's perspective. The reflection is then integrated into the local GBUdb node.
In order to maintain individuality and prevent poisoning, the "consensus" database is isolated from each GBUdb node mathematically. The maximum influence any single transaction can convey is the magnitude of the "opinion" of each peer based on the formula:
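The alert trigger can be sketched as a counter update that reports when the new value is a power of 2. Several details here are our assumptions rather than documented behavior: reading "even power of 2" as any exact power of 2 (1, 2, 4, 8, ...), counting 1 as a power of 2, and letting the counter saturate at the 15-bit maximum.

```python
MAX_COUNT = 32767  # 15-bit counter limit described earlier


def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ...; a power of 2 has exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


def record_encounter(count: int) -> tuple:
    """Return (new_count, alert_needed) after one more encounter."""
    # Assumption: the counter saturates rather than wrapping around.
    new_count = min(count + 1, MAX_COUNT)
    alert = new_count != count and is_power_of_two(new_count)
    return new_count, alert


count = 0
for _ in range(20):
    count, alert = record_encounter(count)
    if alert:
        print("alert at count", count)
```

Triggering on powers of 2 means alerts become progressively rarer as an IP accumulates history, which keeps cloud traffic roughly logarithmic in the number of encounters.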
LocalCount = LocalCount + Log2(RemoteCount)
Here Log2 is effectively the number of bits required to represent the remote count. For example, if the good count of one peer is 7 (requiring 3 bits) then the amount that the other peer will accept as influence is 3. If the good count of one peer is 255 (requiring 8 bits) then the other peer will be influenced by a count of 8.
Because this equation is applied at each interaction, the influence of any single node on any other single node across the cloud is even smaller.
Consider:
- A <---> (SYNC/CLOUD) <---> B
Suppose node A attempts to influence node B by posting a good count of 1024. 1024 requires 11 bits so the cloud registers 11 where perhaps the original count was 0 (to keep things simple). When B syncs with the cloud the reflection will show a count of 11. The number 11 requires 4 bits and so the influence of A on B has been limited from 1024 to only 4.
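The arithmetic in this example can be reproduced with Python's built-in `int.bit_length()`, which returns the number of bits needed to represent an integer. The function name `influence` is ours, and starting both the cloud and node B at a count of 0 simply follows the simplification used in the example.

```python
def influence(remote_count: int) -> int:
    """Amount a peer accepts from a remote count: the bits needed to represent it."""
    return remote_count.bit_length()


# A ---> (SYNC/CLOUD) ---> B, with the cloud and B both starting from 0.
a_good = 1024
cloud_good = 0 + influence(a_good)   # the cloud registers 11
b_good = 0 + influence(cloud_good)   # B accepts only 4
print(cloud_good, b_good)
```

Two hops reduce a claimed count of 1024 to an influence of 4, which is why a single node cannot push the consensus very far on its own.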
The combination of SYNC event pacing and the above transfer formula has the effect of giving the general consensus of the GBUdb cloud precedence over any single node (or small number of nodes). It also ensures that the reflections received by each node are diluted enough to preserve individuality while being potent enough to encourage a perspective that is likely to be correct given the overall experience of the cloud.
The SYNC process
Every minute or so (dynamically adjusted by the SYNC server), each SNF node connects to one of our SYNC servers to share its GBUdb data and report errors and statistics. The data that is sent and received is generally plain-text (so you can see what's happening) however strong encryption is applied to the authentication protocol.
In addition to this authentication scheme the SNF SYNC servers also monitor the behavior of SNF nodes and automatically reject connections that are unknown or behave badly in any way.
Maintenance
All IP reputation systems must "forget" what they know from time to time. IPs go out of service or change hands; infected systems are cleaned and clean systems become infected.
GBUdb statistics degrade over time through "condensation". Condensation is when the good and bad counts of a GBUdb record are both divided by 2. More precisely they are usually right-shifted one bit. The result is that the Confidence figure degrades while the Probability figure remains the same (or at least very close to it).
For example, consider an IP that has 100 bad encounters and 50 good encounters. The resulting probability figure can be calculated as (100 - 50) / (100 + 50) giving us a probability figure of .33333. After condensation the bad count would be 50 and the good count would be 25. Now the probability figure can be calculated as (50 - 25) / (50 + 25) giving the result .33333.
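The same example can be checked with a short Python sketch. The right shift follows the description above; the helper names are ours.

```python
def probability(bad: int, good: int) -> float:
    """Probability figure as defined earlier; 0.0 when there are no samples (our choice)."""
    total = bad + good
    return (bad - good) / total if total else 0.0


def condense(bad: int, good: int) -> tuple:
    """One condensation cycle: both counters are right-shifted by one bit."""
    return bad >> 1, good >> 1


before = (100, 50)
after = condense(*before)
print(before, probability(*before))
print(after, probability(*after))
```

With odd counts the shift drops the low bit, so the probability figure can drift slightly; that is the "or at least very close to it" caveat in practice.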
For all Ugly IPs, once both the good and bad counts have reached zero the record is removed from the database. IPs with other flags must remain because the flags carry important information.
Condensation can be triggered by a number of factors. Most commonly (by default) a condensation cycle is triggered once per day. The result is that if a particular IP is not seen for 15 days then information about that IP will be "forgotten". In most cases information about an IP is lost much more quickly since it is unlikely there would have been enough encounters to generate 15 bits worth of data. More commonly IPs from defunct bot-net addresses tend to disappear in 3-5 days.
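The 15 day figure follows directly from the counter width: a right shift removes one bit per cycle, so a full 15-bit count needs 15 cycles to reach zero. The sketch below counts cycles for an Ugly IP that is never seen again, assuming one condensation cycle per day and no other triggers; the function name is ours.

```python
def cycles_until_forgotten(bad: int, good: int) -> int:
    """Condensation cycles needed before both counters reach zero."""
    cycles = 0
    while bad > 0 or good > 0:
        bad >>= 1
        good >>= 1
        cycles += 1
    return cycles  # equals max(bad, good).bit_length() for the starting counts


print(cycles_until_forgotten(32767, 0))  # a saturated counter
print(cycles_until_forgotten(20, 3))     # a short-lived source with a few encounters
```

A source with only a handful to a few dozen encounters therefore drops out within a few cycles, which is in line with the 3-5 day observation for defunct bot-net addresses.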
These numbers are only guesses based on observations during testing. They are intended to show the basic concepts involved and not to predict day-to-day activity. It is very likely that you will experience different results.
More to come
This is a work in progress... ;-)

