The music one isn't too hard. Basically, I've got a book of music that I've been working on, off and on, for about ten years, and every once in a while it needs to be updated, or someone finds an error. In this case an error in a tune needed to be corrected. This is a multi-step process, the first step of which is to make sure I have all of the software and the source material.
First off, I fire up Finale and open the tune's source file. Years ago I notated everything in an earlier version of Finale, so loading wasn't too much of a problem. I found out it had been a while since I last updated things, since the first complaint I got was that the file needed to be converted to the new format. I think I originally had things in Finale 2003, and now I have Finale 2008, so there are minor differences. I should get the latest version, but probably won't for a while - this version does what I need it to do.
Finale is a software package for notating and writing sheet music, and it has quite a number of features that make it stand out from the various free packages. It does have a drawback, though: it takes a few steps to get music into a book format, since it is geared towards writing orchestral scores and part sheets for a single piece of music at a time, rather than a book of distinct pieces.
So I made my changes (which involved removing and redefining some section marks), saved the results, and started step 2.
Step 2 is getting the music into a document capable of handling pages, indexes, text, and images. For this I went with Microsoft Word. My version of this software has also changed, so there are some formatting differences here and there. To get the music into the Word document I first print it off as a PDF or XPS file, expand it out to a common size, and then do a screen print, pasting the result into the Word document. Yes, it is a bit convoluted, and I lose some resolution here and there, but it works.
After pasting in the new image, I realized that it looked different. In the past ten years screen resolution has gone up tremendously, but I haven't kept the document up to date since about 2006. I'll need to go through the book a tune at a time and correct everything. I'll also look at a better way of embedding the images into the document so that they look better.
The other thing I was doing this week involved spam classification. I (like most others) get a lot of unsolicited email, commonly known as spam. I have a spam filter, but it isn't the greatest. I've tried a few others, but they fall short in various ways. I'd like a filter that is more accurate, and that can possibly categorize the mail even further: put the Faire-related stuff into one folder, the need-to-keep stuff in another, the spam into the spam folder, and the Nigerian scam emails into their own special place. So I am playing with this to hone some skills in design and testing, and to investigate techniques for categorizing and detecting spam.
So at the first cut there are a handful of boxes on the sheet. First is something to work with the email server, second is the classification engine, and third is something to tie the two together. I'll need some more pieces, but those are the first-cut components.
So first, the mail library.
I didn't want to write the raw connectivity to the mail server. IMAP and POP3 are well defined but complex, and I didn't feel the need to do that part myself. A couple of Google searches later I found a number of .NET libraries that I could make use of. After writing some test applications and seeing how they behaved, I finally settled on MailKit.
This library was relatively simple to use and had everything I needed. After making sure that I could easily navigate through my email and view individual messages, I added the NuGet reference to my project solution and continued on.
One thing I wanted was to run this as a separate process from the mail client itself, which would allow it to work with a wider array of clients. There are two ways to get email from a server: POP3 and IMAP. POP3 typically downloads messages and removes them from the server, while IMAP leaves them there. IMAP also gives you a number of other capabilities, such as folders and the ability to run multiple clients at the same time. While I have seen separate filter applications for POP3, they need some specialized setup on the client to read email through the filter application, and the application also has to act as a pseudo email server. That also makes it very hard, if not impossible, to secure the channel, because the filter needs to sit in the middle of the connection - which it can't do if you want the connection to be secure.
IMAP, on the other hand, allows multiple clients to access the server at the same time, each on its own connection. This lets the filter application act as a separate client and manipulate the email independently, with the changes then reflected in the actual email client.
So I had my email connection. I created some scaffolding around it to make it do what I wanted, and moved on to the other side of the software.
To filter email you need something to do the actual filtering. There are a number of techniques available: whitelists, blacklists, SpamHaus lists, and text analysis, to name a few. A good spam filter will usually combine several of these.
A whitelist is a list of email addresses (usually) that you will unconditionally accept as valid senders. While it can contain things other than email addresses, those are the main thing used. I will be implementing a whitelist based on email addresses.
A blacklist is the exact opposite: the list of email addresses you will immediately toss out as spam. These are usually a bit harder to maintain, since spam does not consistently come from the same people all the time.
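The list checks run before any statistical analysis, since they decide a message outright. The project itself is in .NET, but the logic is language-neutral; here is a minimal sketch in Python, with made-up addresses and a hypothetical helper name:

```python
# Hypothetical whitelist/blacklist check; addresses are examples only.
WHITELIST = {"friend@example.com", "boss@example.com"}
BLACKLIST = {"offers@spammy.example"}

def list_verdict(sender):
    """Return 'good' or 'spam' if a list decides the message, else None."""
    addr = sender.strip().lower()
    if addr in WHITELIST:
        return "good"
    if addr in BLACKLIST:
        return "spam"
    return None  # no list matched; fall through to textual analysis
```

Returning `None` when neither list matches keeps the lists as a fast pre-filter, with the heavier text analysis only running on undecided mail.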
SpamHaus maintains lists of IP addresses and domain names for known spamming operations. You can query their servers to check whether the sender's IP or domain comes from one of these locations.
Textual analysis is the process of looking at the text of a message and deciding whether it is spam based on that text. Usually there is a set of rules and algorithms for this. I'll be exploring a number of different methods down the road, but the first will be a naive Bayesian analysis of the words in the message.
The Reverend Thomas Bayes developed his theorem on probability well before the US was a country. Basically, it lets you calculate the probability of an event given some observed evidence. For our spam filter, the probability of an email being spam is based on the probability of each individual token appearing in a spam email, with those probabilities then combined according to the total occurrences of the tokens across the entire set.
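To make the per-token part concrete, here is a hedged numeric illustration (in Python, though the project is .NET). The counts are invented for the example, and the function assumes equal prior odds of spam and good mail, which is a common simplification:

```python
# Illustrative only: Bayes' theorem for one token, with made-up counts.
def token_spam_probability(spam_with_token, spam_total,
                           good_with_token, good_total):
    """P(spam | token), assuming equal priors for spam and good mail."""
    p_token_given_spam = spam_with_token / spam_total
    p_token_given_good = good_with_token / good_total
    return p_token_given_spam / (p_token_given_spam + p_token_given_good)

# Suppose "prize" appeared in 80 of 100 spam emails but only 2 of 100 good ones:
p = token_spam_probability(80, 100, 2, 100)   # 0.8 / (0.8 + 0.02), about 0.976
```

A token that is much more common in spam than in good mail ends up with a probability near 1; a neutral token like "the" lands near 0.5.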
To do this I need to be able to tokenize the message, and then compare the tokens to a database of words and their probabilities. Once I have the probabilities I can calculate the probability that the email is spam, and act on it accordingly. Finally, once the user has confirmed that the email was categorized correctly (or incorrectly), I need to feed this information back into the system so that it can learn. This feedback loop is necessary, since me guessing which words are spam-related and which are not is not going to happen.
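The tokenizing step can be as simple as splitting on word characters. A minimal sketch (again in Python for illustration; a real tokenizer might also keep headers, URLs, and punctuation patterns):

```python
import re

def tokenize(message):
    """Lower-cased word tokens from a message body. Deliberately simple."""
    return re.findall(r"[a-z0-9']+", message.lower())

tokenize("Win a FREE prize now!")   # ['win', 'a', 'free', 'prize', 'now']
```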
I got the analyzer working this afternoon. It wasn't too hard - I created an in-memory repository to hold the data, and made sure I could serialize it in and out easily. The actual math isn't hard either - basically it is the sum of the probabilities of the message being spam (we'll call that P) divided by P plus the sum of the probabilities of the email not being spam (we'll call that NP): P / (P + NP).
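That combining formula, sketched in Python (illustrative, not the actual .NET code):

```python
def message_spam_score(token_probs):
    """Combine per-token spam probabilities as P / (P + NP), where
    P sums the spam probabilities and NP sums their complements."""
    p = sum(token_probs)
    np_ = sum(1.0 - t for t in token_probs)
    return p / (p + np_)

message_spam_score([0.9, 0.8, 0.1])   # 1.8 / (1.8 + 1.2) = 0.6
```

Worth noting: since each token contributes its probability to P and its complement to NP, P + NP is just the token count, so this score is the average token probability. Paul Graham's well-known variant combines the probabilities multiplicatively instead, which separates spam from good mail more sharply; that could be a later refinement.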
The database is relatively simple - it is a list of words, the number of times each has occurred in any email, and the number of times it has occurred in a spam email. The training function is also relatively simple: after the user confirms that an email is categorized correctly, I update the appropriate row in the table - increment the number of occurrences, and, if it is a spam message, the spam count.
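The table and training update might look like this (a Python sketch; counting each word once per message is my assumption here, not necessarily the project's choice):

```python
# word -> [times seen in any email, times seen in a spam email]
counts = {}

def train(tokens, is_spam):
    """Called once the user confirms a message's category."""
    for word in set(tokens):               # count each word once per message
        row = counts.setdefault(word, [0, 0])
        row[0] += 1
        if is_spam:
            row[1] += 1

train(["win", "free", "prize"], is_spam=True)
train(["meeting", "free"], is_spam=False)
# counts["free"] is now [2, 1]: seen in two emails, one of them spam
```

The per-word spam probability then falls out of the row directly as spam count divided by total count.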
The final part will be developing the interactions with the mail server. I have the capability to watch the mail folders for changes; when a change occurs my application will be notified. Here's what I see happening:
When a new message arrives, categorize it. If it is spam, move it to the spam folder; if it is good, leave it in the inbox.
If the user deletes the email from the inbox, it is validated as good. Update the database by incrementing the appropriate word counts.
If the user deletes the email from the spam folder, it is validated as spam. Update the database.
If the user moves the email to the spam folder, or vice versa, then we had a classification error. Again, we update the database (or we can wait until delete time - that might be easier).
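Those rules amount to a small event dispatcher. A hypothetical sketch (Python, with the classify/train/move functions standing in for the real components; the 0.5 threshold and "Spam" folder name are assumptions):

```python
def on_new_message(msg, classify, move_to_spam):
    """New mail in the inbox: classify, and move if it scores as spam."""
    if classify(msg) >= 0.5:                 # assumed spam threshold
        move_to_spam(msg)

def on_delete(msg, folder, train):
    """Deleting is treated as confirmation that the folder was right."""
    train(msg, is_spam=(folder == "Spam"))

def on_move(msg, to_folder, train):
    """A manual move means we misclassified; retrain with the new label."""
    train(msg, is_spam=(to_folder == "Spam"))
```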
This can also be extrapolated to multiple categories. We do the same process, except there are more folders to monitor. When an email comes into the inbox we again categorize it, but instead of producing a single percentage we run the same process for each category we are tracking; the category with the highest percentage wins. Moving emails again triggers the miscategorization action, and deleting triggers the validation action.
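The multi-category decision is just an argmax over per-category scores. Sketched in Python, with invented category names matching the folders mentioned earlier:

```python
def categorize(score_by_category):
    """Pick the category whose model scored the message highest."""
    return max(score_by_category, key=score_by_category.get)

categorize({"Spam": 0.2, "Faire": 0.7, "Keep": 0.4})   # 'Faire'
```

Each category would keep its own word-count table, scored exactly like the spam/not-spam case.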
While doing this I determined that I will need a couple of other components, primarily dealing with the UI and running the application itself. I'll need a Windows service to run the actual spam filter, and a UI to let the user adjust the system's parameters. I might also need another small UI for testing, allowing me to run the filter outside a service, since services are annoying, to say the least, to run in a debugger.
So next week I'll get my middle layer in place - the one tying the mail components to the analysis components, and I'll have a fully functioning spam filter.
And the music re-edit - that will need doing some time or other, too.