Hello,
I'm trying to implement that to classify email as spam or not spam. I read that one technique is to delete the most frequent and the least frequent words. So I'm wondering this:
Words from spam email:
"kill", freq=100
"ate", freq=40
"viagra", freq=39
"love", freq=5
...
"break", freq=1
Words from non-spam email:
"love", freq=85
"hello", freq=90
...
"semp", freq=2
Which words must I delete? I read that I should delete, for example, the top 100 words, and maybe rare ones like "semp" and "break". But what about "love"? What is its frequency? It is rare in spam but very common in non-spam. Must I pick the most frequent words in the non-spam set and then in the spam set separately? Or must I sum their frequencies across both sets and then check for the most and least frequent words?
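To make the second option concrete, here is a rough sketch of what I mean by summing the frequencies first. It only uses the counts I listed above (the elided words are left out), and the helper name `words_to_delete` and the thresholds `top_n=1` and `min_freq=3` are made up for this tiny example. I'm not claiming this is the correct way to do it; it is just what I would try.

```python
from collections import Counter

# Counts copied from the lists above; the elided entries are left out.
spam_counts = Counter({"kill": 100, "ate": 40, "viagra": 39, "love": 5, "break": 1})
ham_counts = Counter({"love": 85, "hello": 90, "semp": 2})

# Second option: sum the per-class frequencies into one count per word.
total_counts = spam_counts + ham_counts


def words_to_delete(counts, top_n, min_freq):
    """Return the top_n most frequent words plus every word rarer than min_freq.

    Hypothetical helper; top_n and min_freq are illustrative thresholds.
    """
    most_common = {word for word, _ in counts.most_common(top_n)}
    rare = {word for word, freq in counts.items() if freq < min_freq}
    return most_common | rare


# Made-up thresholds, chosen only because the example vocabulary is tiny.
removed = words_to_delete(total_counts, top_n=1, min_freq=3)
kept = sorted(set(total_counts) - removed)

print("combined frequency of 'love':", total_counts["love"])
print("removed:", sorted(removed))
print("kept:", kept)
```

With the summed counts, "love" ends up with a combined frequency of 90 and is kept, while "kill", "break" and "semp" are removed. If I filtered each set separately instead, I would call `words_to_delete` once on `spam_counts` and once on `ham_counts`, and "love" could be treated differently in each set.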
Thanks,
I hope you understand my problem.



