I built a version that polls the server for changes, but I don't really like it. Polling makes the client do work on every cycle, and most of the time nothing has changed. It is handy to fetch everything in one pass, but it just doesn't feel efficient.
Under the IMAP specs, a connection can only watch one folder at a time (the one it currently has selected), which means that changes in other folders won't be caught. It would be nice to see changes in multiple folders, but you can't do that over a single connection.
And that's the loophole I think I can exploit. I only want to watch a couple of folders: the inbox and the spam folder. If I open multiple connections to the mail server, I can do this, and according to the IMAP specs I am allowed to. One of the primary differences between IMAP and POP is that IMAP lets multiple clients access the server at the same time, gives messages unique IDs, and generally allows a constantly open connection to the server.
So that's what I'm going to do. I'm going to rebuild things a bit to support multiple connections to the server, rather than trying to manipulate everything through a single one. The two connections will point to the inbox and the spam folder respectively, and each will watch for changes in its own folder.
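As a rough sketch of the two-connection idea, assuming Python's standard imaplib: the helper below opens one connection per folder and selects that folder on it. The name open_watchers, the folder name "Spam", and the connect parameter (there so a stand-in can replace the real server while testing) are my own assumptions. The actual watching, for instance through the IMAP IDLE extension, is left out.

```python
import imaplib


def open_watchers(host, user, password, folders=("INBOX", "Spam"),
                  connect=imaplib.IMAP4_SSL):
    """Open one IMAP connection per folder and select that folder on it.

    Returns a dict mapping folder name to its dedicated connection.
    """
    watchers = {}
    for folder in folders:
        conn = connect(host)            # a separate connection for each folder
        conn.login(user, password)
        status, _ = conn.select(folder)  # each connection has one selected folder
        if status != "OK":
            raise RuntimeError(f"could not select folder {folder!r}")
        watchers[folder] = conn
    return watchers
```

Calling open_watchers("mail.example.com", user, password) would give back two independent connections, one sitting on the inbox and one on the spam folder.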
The inbox connection will watch for incoming new emails. When one is received, it will be scored, and if the score passes the threshold, the email will be moved to the spam folder. If an email is deleted, the application will be trained on it as a good email. If an email is moved out of the inbox to a folder other than spam, it is considered good and trained as such.
The spam folder will also be watched. When an email is deleted there, it will be trained as a spam message. If it is moved to the inbox, it will be ignored. If it is moved someplace else, it will be trained as a good email.
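The rules for the two watched folders can be written down as a small decision function. This is only a sketch: the event names ("new", "deleted", "moved"), the returned action strings, the folder name "Spam", and the 0.9 threshold are all placeholders I made up for illustration. Cases the rules don't mention, such as a move from the inbox straight to spam, return None.

```python
INBOX = "INBOX"
SPAM = "Spam"  # assumed folder name; servers differ


def training_action(watched, event, destination=None):
    """Map a folder event to 'score', 'train_good', 'train_spam', 'ignore', or None."""
    if watched == INBOX:
        if event == "new":
            return "score"
        if event == "deleted":
            return "train_good"
        if event == "moved" and destination not in (None, INBOX, SPAM):
            return "train_good"
    elif watched == SPAM:
        if event == "deleted":
            return "train_spam"
        if event == "moved" and destination == INBOX:
            return "ignore"
        if event == "moved" and destination not in (None, SPAM):
            return "train_good"
    return None  # a case the rules do not cover


def should_move_to_spam(score, threshold=0.9):
    """True when a new email's score reaches the (assumed) spam threshold."""
    return score >= threshold
```

For example, training_action("Spam", "moved", "Receipts") gives "train_good", while training_action("Spam", "moved", "INBOX") gives "ignore".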
I'm also pulling back the scope a bit. Originally I wanted to allow multiple categories, and move things to different folders based on this categorization. That wouldn't be a problem if I could keep all of the different categories with the email and have the user acknowledge each of them. But an email can only sit in a single bin at a time. If you open your email client you'll notice that you can't have the same email in separate folders. You can copy the email, but that is a copy, not the original. Since I don't wish to disrupt the normal email reading workflow just to categorize a message in multiple places, I'll stick with the binary choice of Spam or No Spam.
Another problem is training on an email that has more than one valid category. If the user hasn't checked all of the categories that the email belongs to, it could end up miscategorized, hampering training. There is also the problem of tracking which categories an email falls into. Do we just choose the best one, or dump all of the possible categories into the headers in some form or fashion?
If we just go with the best, we can train on that category and ignore all of the others. This would work, but it will slow the training quite a bit and require more emails to train. Finding the best category will also be problematic, since we will probably need to get a percentage for each one and then select the highest.
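Picking the best category from per-category percentages might look something like this. It is a sketch only: the raw scores are assumed to come from whatever classifier is in use, and the function name and category names are made up.

```python
def best_category(scores):
    """Turn raw per-category scores into shares and return the top one.

    scores: dict mapping category name to a non-negative score.
    Returns (category, share), where share is between 0 and 1.
    """
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("need at least one positive score")
    shares = {name: value / total for name, value in scores.items()}
    best = max(shares, key=shares.get)
    return best, shares[best]


# Hypothetical scores for one email.
print(best_category({"work": 3.0, "family": 1.0}))  # ('work', 0.75)
```

Only the winning category would then be trained on, which is where the slower training comes from.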
Another way to possibly get to a single category is a series of binary categorizations, much like a search tree. At the top of the tree we would decide between spam and non-spam. Further down the non-spam side there might be a series of other binary splits, getting to finer and finer detail.
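To make the tree idea concrete, here is a minimal sketch. Each node holds one yes/no test and two branches, and a branch is either another node or a final category name. The keyword checks are stand-ins for real trained binary classifiers, and the category names are invented.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Node:
    test: Callable[[str], bool]    # one binary classifier
    if_yes: Union["Node", str]     # subtree or final category
    if_no: Union["Node", str]


def categorize(node, text):
    """Walk down the tree until a leaf (a category name) is reached."""
    while isinstance(node, Node):
        node = node.if_yes if node.test(text) else node.if_no
    return node


# Hypothetical tree: spam vs non-spam first, then work vs personal.
tree = Node(
    test=lambda t: "pills" in t,
    if_yes="spam",
    if_no=Node(test=lambda t: "meeting" in t, if_yes="work", if_no="personal"),
)

print(categorize(tree, "cheap pills"))        # spam
print(categorize(tree, "meeting at noon"))    # work
print(categorize(tree, "dinner on friday"))   # personal
```

Every email ends up in exactly one leaf, so the result is always a single category no matter how deep the tree grows.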
For now I'll just do the simple yes or no on whether an email is spam. If that works, I'll expand it to more explicit categorization of email. It still won't solve the multiple category problem, but it will at least be a consistent way of looking at the email.