Friday, April 11, 2008
On email, what comes in it
A friend recently posted the following:
80-90% of ALL email is directory harvesting attacks. 60-70% of the rest is spam or phishing. 1-5% of email is legit. Really makes you think about the invisible hand of email security, doesn't it?
Those of us on the front lines of email security (which isn't quite me, I'm more of a field commander than a front line researcher) suspected as much. And yes, most people, nay, the vast majority, don't realize exactly what the signal-to-noise ratio is for email. Or even suspect the magnitude. I suspect that the statistic of, "80% of email is crap," is well known, but I don't think people even realize that the number is closer to, "95% of email is crap."
Looking at statistics on the mail filter in front of Exchange, it looks like 5.9% of incoming messages for the last 7 days are clean. That is a LOT of messages getting dropped on the floor. This comes to just shy of 40,000 legitimate mail messages a day. For comparison, mail coming in from Titian (the student email system, and unpublished backup MTA) has a 'clean' rate of 42.5%, or 2,800ish legit messages a day.
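The percentages above imply total volumes that are easy to back out. Here is a quick sketch of that arithmetic in Python. The function name is made up for illustration, and the inputs are the rounded figures quoted above, so the results are only approximate.

```python
def implied_total(clean_per_day: float, clean_rate: float) -> float:
    """Back out total daily messages from the clean count and the clean rate."""
    if not 0 < clean_rate <= 1:
        raise ValueError("clean_rate must be a fraction in (0, 1]")
    return clean_per_day / clean_rate

# Rounded figures from the paragraph above.
exchange_total = implied_total(40_000, 0.059)   # filter in front of Exchange
titian_total = implied_total(2_800, 0.425)      # Titian, the student system

for name, total, clean in (("Exchange", exchange_total, 40_000),
                           ("Titian", titian_total, 2_800)):
    print(f"{name}: ~{total:,.0f} messages/day, ~{total - clean:,.0f} not clean")
```

On those rounded inputs, that is somewhere around 680,000 messages a day arriving at the Exchange filter, against roughly 6,600 for Titian.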
People expect their email to be legitimate. Directory-harvesting attacks do constitute the majority of discrete emails; these are the messages you receive that have weird subjects, come from people you don't know, but don't have anything in the body. They're looking to see which addresses result in 'no person by that name here' messages and which seemingly deliver. This is also why people unfortunate enough to have usernames or emails like "fred@" or "cindy@" have the worst spam problems in any organization.
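To make the harvesting pattern concrete, here is a minimal sketch of one way a border filter could spot it: count, per sending address, how many recipients were tried and how many actually exist. This is not how our filter works (that is proprietary). The function name, the event format, and both thresholds are assumptions made up for illustration.

```python
from collections import defaultdict

def suspected_harvesters(events, min_attempts=20, max_valid_ratio=0.1):
    """Flag senders whose recipients are mostly unknown addresses.

    events: iterable of (sender_ip, recipient, recipient_exists) tuples.
    Thresholds are illustrative guesses, not tuned values.
    """
    attempts = defaultdict(int)
    valid = defaultdict(int)
    for sender_ip, _recipient, recipient_exists in events:
        attempts[sender_ip] += 1
        if recipient_exists:
            valid[sender_ip] += 1
    return sorted(
        ip for ip, n in attempts.items()
        if n >= min_attempts and valid[ip] / n <= max_valid_ratio
    )

# Made-up sample: one sender guessing common names, one ordinary sender.
guesses = ["fred", "cindy", "bob", "alice", "dave"] * 5
events = [("203.0.113.9", f"{name}{i}@example.edu", False)
          for i, name in enumerate(guesses)]
events += [("198.51.100.4", "registrar@example.edu", True)] * 3
print(suspected_harvesters(events))
```

A sender that tries dozens of addresses and hits almost none of them looks very different from a normal correspondent, which is what makes this kind of traffic cheap to reject once it is recognized.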
As I've mentioned many times, we're actively considering migrating student email to one of the free email services offered by Google or Microsoft. This is because historically student email has had a budget of "free", and our current strategy is not working. It is not working because the email filters aren't robust enough to meet expectations. Couple that with the expectation of effectively unlimited mail quota (thank you Google) and student email is no longer a "free" service. We can either spend $30,000 or more on an effective commercial anti-spam product, or we can give our email to the free services in exchange for valuable demographic data.
It's very hard to argue with economics like that.
One thing that you haven't seen yet in this article is viruses. In the last 7 days, our border email filter saw that 0.108% of incoming messages contained viruses. This is a weensy bit misleading, since the filter will drop connections with bad reputations before even accepting mail, and that may very well cut down the number of reported viruses. But the fact remains that viruses in email are not the threat they once were. All the action these days is on subverted and outright evil web sites, and social engineering (a form of virus of the mind).
This is another example of how expectation and reality differ. After years of being told, and in many cases living through the after-effects of it, people know that viruses come in email. The fact that the threat is so much more based on social engineering hasn't penetrated as far, so products aimed at the consumer call themselves anti-virus when in fact most of the engineering in them was pointed at spam filtering.
Anti-virus for email is ubiquitous enough these days that it is clear that the malware authors out there don't bother with email vectors for self-propagating software any more. That's not where the money is. The threat has moved on from cleverly disguised .exe files to cunningly wrought (in their minds) emails enticing the gullible to hit a web site that will infest them through the browser. These are the emails that border filters try to keep out, and it is a fundamentally harder problem than .exe files were.
The big commercial vendors get the success rate they do for email cleaning in part because they deploy large networks of sensors all across the internet. Each device or software-install a customer turns on can potentially be a sensor. The sensors report back to the mother database, and proprietary and patented methods are used to distill out anti-spam recipes/definitions/modules for publishing to subscribed devices and software. There is nothing saying that an open-source product can't do this, but the mother-database is a big cost that someone has to pay for and is a very key part of this spam fighting strategy. Bayesian filtering only goes so far.
And yet, people expect email to just be clean. Especially at work. That is a heavy expectation to meet.
Friday, January 11, 2008
Disk-space over time
I've mentioned before that I do SNMP-based queries against NetWare and drop the resulting disk-usage data into a database. The current incarnation of this database went in during August of 2004, so I have nearly three and a half years of data in it now. You can see some real trends in how we manage data in the charts.
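For anyone wanting to try something similar, the storage side can be very small. Below is a minimal sketch using Python's built-in sqlite3 module. It is not the actual database behind these charts: the table layout, function names, volume name, and byte counts are all hypothetical, and the SNMP polling itself is left out.

```python
import sqlite3

def open_store(path=":memory:"):
    """Create (if needed) a table of disk-usage samples and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS disk_usage (
               sampled_on TEXT NOT NULL,    -- ISO date of the poll
               volume     TEXT NOT NULL,    -- volume name
               used_bytes INTEGER NOT NULL,
               PRIMARY KEY (sampled_on, volume)
           )"""
    )
    return conn

def record_sample(conn, sampled_on, volume, used_bytes):
    """Store one poll result; re-polling the same day overwrites the row."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO disk_usage VALUES (?, ?, ?)",
            (sampled_on, volume, used_bytes),
        )

def usage_series(conn, volume):
    """Return [(date, used_bytes), ...] for one volume, oldest first."""
    rows = conn.execute(
        "SELECT sampled_on, used_bytes FROM disk_usage "
        "WHERE volume = ? ORDER BY sampled_on",
        (volume,),
    )
    return rows.fetchall()

conn = open_store()
# Hypothetical values; a real poller would get these from an SNMP query.
record_sample(conn, "2008-01-04", "STUDENT1", 410_000_000_000)
record_sample(conn, "2008-01-11", "STUDENT1", 415_000_000_000)
print(usage_series(conn, "STUDENT1"))
```

Keying on date plus volume keeps one row per volume per poll, which is all a chart of usage over time needs.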
To show you what I'm talking about, I'm going to post a chart based on the student-home-directory data. We have three home-directory volumes for students, which hold between 7,000 and 8,000 home directories. We load-balance by number of directories rather than least-size. The chart:

As you can see, I've marked up our quarters. Winter/Spring is one segment on this chart since Spring Break is hard to isolate on these scales. We JUST started Winter 2008, so the last dot on the chart is data from this week. If you squint in (or zoom in like I can) you can see that last dot is elevated from the dot before it, reflecting this week's classes.
There are several sudden jumps on the chart. Fall 2005. Spring 2005. Spring 2007 was a big one. Fall 2007 just as large. These reflect student delete processes. Once a student hasn't been registered for classes for a specified period of time (I don't know what it is offhand, but I think 2 terms) their account goes on the 'ineligible' list and gets purged. We do the purge once a quarter except for Summer. The Fall purge is generally the biggest in terms of numbers, but not always. Sometimes the number of students purged is so small it doesn't show on this chart.
We do get some growth over the summer, which is to be expected. The only time when classes are not in session is generally from the last half of August to the first half of September. Our printing volumes are also w-a-y down during that time.
Because the Winter purge is so tiny, Winter quarter tends to see the biggest net-gain in used disk-space. Fall quarter's net-gain sometimes comes out a wash due to the size of that purge. Yet if you look at the slopes of the lines for Fall, correcting for the purge of course, you see it matches Winter/Spring.
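"Correcting for the purge" can be done mechanically: treat any sharp fall between consecutive readings as a purge and total it separately from ordinary growth. The sketch below shows the idea. The function name, the 2% threshold, and the sample readings are made up for illustration and are not taken from the chart.

```python
def split_growth(samples, purge_drop=0.02):
    """Separate organic growth from purge drops in a usage series.

    samples: used-space readings, oldest first (any consistent unit).
    purge_drop: a fall of more than this fraction between consecutive
    readings is treated as a purge (an assumed threshold).
    Returns (organic_growth, purged, net_change).
    """
    organic = 0.0
    purged = 0.0
    for prev, cur in zip(samples, samples[1:]):
        delta = cur - prev
        if prev > 0 and -delta / prev > purge_drop:
            purged += -delta      # sudden fall: count it as a purge
        else:
            organic += delta      # ordinary week-to-week change
    net = (samples[-1] - samples[0]) if samples else 0
    return organic, purged, net

# Hypothetical Fall-quarter readings in GB, with a purge in week 3.
fall = [400, 404, 380, 385, 390, 395]
organic, purged, net = split_growth(fall)
print(f"organic +{organic:.0f} GB, purged {purged:.0f} GB, net {net:+d} GB")
```

In this made-up quarter the net change is negative even though the underlying growth is steady, which is the "comes out a wash" effect described above.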
Somewhere in here, and I can't remember where, we increased the default student directory-quota from 200MB to 500MB. We've found Directory Quotas to be a much better method of managing student directory sizes than User Quotas. If I remember my architectures right, directory quotas are only possible because of how NSS is designed.
If you take a look at the "Last Modified Times" chart in the Volume Inventory for one of the student home-directory volumes you get another interesting picture:

We have a big whack of data aged 12 months or newer. That said, we have non-trivial amounts of data aged 12 months or older. This represents where we'd get big savings when we move to OES2 and can use Dynamic Storage Technology (formerly known as 'shadowvolumes'). Because these are students and students only stick around for so long, we don't have a lot of stuff in the "older than 2 years" column that is very present on the Faculty/Staff volumes.
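The same kind of age breakdown can be approximated on any file tree by bucketing file sizes on last-modified time. This is a rough sketch only: it does not reproduce how the Volume Inventory builds its chart, and the bucket boundaries and labels here are assumptions picked to mirror the discussion above.

```python
import os
import time

# Bucket upper bounds in days; the labels are illustrative, not the
# Volume Inventory's actual columns.
BUCKETS = [(365, "12 months or newer"), (730, "1-2 years"),
           (float("inf"), "older than 2 years")]

def bucket_for(age_days):
    """Return the label of the first bucket whose bound covers age_days."""
    for limit, label in BUCKETS:
        if age_days <= limit:
            return label

def bytes_by_age(root, now=None):
    """Walk a directory tree and total file sizes per last-modified bucket."""
    now = time.time() if now is None else now
    totals = {label: 0 for _, label in BUCKETS}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.stat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            age_days = max(0.0, (now - st.st_mtime) / 86400)
            totals[bucket_for(age_days)] += st.st_size
    return totals
```

Calling `bytes_by_age("/some/home/volume")` (a placeholder path) returns bytes per bucket; anything that lands in the older buckets is the kind of data a slower, cheaper storage tier could hold.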
Being the 'slow, cheap' storage device is a role well suited to the MSA1500 that has been plaguing me. If for some reason we fail to scare up funding to replace our EVA3000 with another EVA less filled-to-capacity, this could buy a couple of years of life on the EVA3000. Unfortunately, we can't go to OES2 until Novell ships an eDirectory-enabled AFP server for Linux, currently scheduled for late 2008 at the earliest.
Anyway, here is some insight into some of our storage challenges! Hope it has been interesting.
Friday, December 29, 2006
Year in review: web-stats
So I won't have Saturday and Sunday in these stats. Big deal, don't care much. But heck, these are fun!
MyWeb servers
Total sessions 2006
- Student: 436,048
- Facstaff: 11,640
Total pageviews 2006
- Student: 2,927,305
- Facstaff: 760,077
Total hits 2006
- Student: 8,178,044
- Facstaff: 1,282,892
Total bytes transferred 2006
- Student: 1.74 TB
- Facstaff: 116.37 GB
Clearly, the Student myweb gets a LOT more traffic. And bigger traffic, too, as the hits ratio is 8:1 but the data transferred ratio is closer to 10:1. Now, let's look at some averages:
MyWeb servers
Average pageviews per session
- Student: 6.71
- Facstaff: 65.30
Average hits per session
- Student: 18.75
- Facstaff: 110.21
Average bytes per session
- Student: 4.18 MB
- Facstaff: 10.24 MB
Average length of session (HH:MM:SS)
- Student: 00:04:14
- Facstaff: 00:42:21
THAT'S a difference! The traffic generated by the FacStaff MyWeb looks to be a lot more browsing, where I'd guess a lot of the traffic coming out of the Student MyWeb is one-off stuff like user-avatars in web-forums, and media files. The average session length is night-and-day, though; average session length of 42 minutes? Clearly there is a lot more 'dwell' on instructor sites than on the student sites.
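The per-session averages are just the 2006 totals divided by the session counts, and a short Python sketch reproduces them. It assumes the four totals are read in the order sessions, pageviews, hits, and bytes, and that TB and GB are binary units (multiples of 1024). The variable and function names are made up for illustration.

```python
# 2006 totals, read in the order sessions, pageviews, hits, bytes.
MB_PER_GB = 1024
MB_PER_TB = 1024 * 1024

totals = {
    "Student":  {"sessions": 436_048, "pageviews": 2_927_305,
                 "hits": 8_178_044, "megabytes": 1.74 * MB_PER_TB},
    "Facstaff": {"sessions": 11_640, "pageviews": 760_077,
                 "hits": 1_282_892, "megabytes": 116.37 * MB_PER_GB},
}

def per_session(site):
    """Divide each total by the session count for one site."""
    t = totals[site]
    return {key: t[key] / t["sessions"]
            for key in ("pageviews", "hits", "megabytes")}

for site in totals:
    avg = per_session(site)
    print(f"{site}: {avg['pageviews']:.2f} pageviews, "
          f"{avg['hits']:.2f} hits, {avg['megabytes']:.2f} MB per session")
```

Rounded to two places, the results line up with the pageview, hit, and byte averages listed above. With decimal units the bytes-per-session figures come out lower, so the binary-unit reading looks like the right one.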
At a guess, I'd say that students are using the Student MyWeb service more as a link-space than as a web-page host. Since that space is linked in with their home-directory quota, and also isn't bandwidth limited, it provides a much better experience than things like Photobucket. Since items like avatars are files that get hit a LOT, icon-hosting is something of a bugaboo and having your own host for that is handy.
One 125K JPG file is the number one hit file on the Student MyWeb for fall quarter, and is also the #4 file in terms of bytes transferred. This file is a Friendster background. Clearly, this person has a lot of traffic to their Friendster site. Of the top 10 files-by-hit, 7 are clearly icons, avatars, or other 'personalization' images. On the flip side, all but one of the top 10 files-by-transfer are movies; this one lone (big) JPG file is the exception.
Looking at the Facstaff side, 8 of the top 10 files-by-hit are components of web-pages hosted on myweb.facstaff.wwu.edu. One is the feed-file for this blog, and the other is an EXE file hosted by ATUS that seems to get a lot of off-campus requests. The files-by-bytes are a bit different, in that four of the top 10 are EXE files from the same page. There are a PDF and three PPT files in the top 10 by byte, and only one movie.
Interesting stuff.
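Rankings like "top files by hit" and "top files by bytes" are simple to rebuild from raw access logs. The sketch below assumes logs in Apache's Common Log Format, which may not be what the MyWeb servers actually write. The regular expression, function name, and any sample lines fed to it are illustrative only, and this is not the stats package that produced the numbers above.

```python
import re
from collections import Counter

# Common Log Format request line, e.g.
# 192.0.2.1 - - [15/Dec/2006:10:00:00 -0800] "GET /~user/a.jpg HTTP/1.1" 200 1234
LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def top_files(lines, n=10):
    """Return (top n paths by hit count, top n paths by bytes sent)."""
    hits, sent = Counter(), Counter()
    for line in lines:
        m = LINE.search(line)
        if not m:
            continue  # not a request line we recognise
        hits[m["path"]] += 1
        if m["size"] != "-":      # "-" means no body was sent
            sent[m["path"]] += int(m["size"])
    return hits.most_common(n), sent.most_common(n)
```

Feeding it an open log file (`top_files(open("access.log"))`, with a placeholder filename) gives both lists at once. A small, frequently requested image tops the first list while a handful of large movies dominate the second, which is the split described above.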
Labels: stats
Friday, December 15, 2006
Yummy stats (fall quarter)
Today is the close of fall quarter, so I'm taking a look at Student MyWeb usage for the quarter. First, some quick hits:
- 235GB were transferred.
- 609,501 page views.
- Top file by hit: http://myweb.students.wwu.edu/~beattya/cs102/Warming_Up_for_the_Night_s_Howl__Gray_Wolf.jpg at around 89,000 hits.
- Top file by bytes: http://myweb.students.wwu.edu/~conroyg/telltale.mov at 12.71GB. This is specifically interesting, since it got there only over the past week. I should probably take a look at that and see if it's something I have to officially notice.
- #1 browser used (by hit): IE, at 48%
- #1 browser used (by session): "Mozilla Compatible Agent" at 47%
- #1 platform (by hit): Windows, at 66% (of which WinXP is 94%). #2 is Macintosh, at 19%.
And for the Facstaff MyWeb:
- 146GB were transferred.
- 516,866 pageviews.
- Top file by hit: http://myweb.facstaff.wwu.edu/%7Ebowkerb/ATUS/Utilities/OfficeConverterPack.EXE
- Top file by bytes: http://myweb.facstaff.wwu.edu/%7Ebowkerb/ATUS/Utilities/OfficeConverterPack.EXE by a long shot. This represented 25% of traffic!
- #1 browser used (by hit): IE at 61.2%
- #1 browser used (by session): IE at 38.62%
- #1 platform (by hit): Windows, at 79% (of which WinXP is 91%). #2 is 'unknown' at 12%, and Macintosh comes in at 7%.
Labels: stats
