naive bayes

**mickey** · 10th September 2009, 13:10

Hello,
I'm trying to implement naive Bayes to classify email as spam or not spam. I read that one technique is to delete the most frequent and least frequent words. So I'm wondering this:

words from spam email:
"kill", freq=100
"ate", freq=40
"viagra", freq=39
"love", freq=5
......................
"break", freq=1

words from non-spam email:
"love", freq=85
"hello", freq=90
...........................
"semp", freq=2

Which words must I delete? I read that I should delete, for example, the top 100 words, and maybe "semp" and "break". But what about "love"? What is its frequency? It's very common in non-spam. Must I pick the most frequent words in the non-spam set and then in the spam set? Or must I sum their frequencies and then check for the most and least frequent?

thanks,
Hope you understand my problem

**wysota** · 10th September 2009, 17:23

What do those numbers represent? Percentages?

**mickey** · 10th September 2009, 18:43

Actually I made a mistake; the numbers are frequencies of the words. I calculated them by reading a lot of emails from a file; for each word I have its frequency in the spam emails and in the non-spam emails:

"Hello", FreqInSpam=100, FreqInNonSpam=110
"Viagra", FreqInSpam=70, FreqInNonSpam=0
"the", FreqInSpam=200, FreqInNonSpam=400
..................................................

**wysota** · 10th September 2009, 21:40

A very naive approach would be to calculate the percentage of occurrences in each of the categories:
Hello - S: 100/(100+110) = 100/210 = 48%; H: 52%
Viagra - S: 100%; H: 0%
the - S: 33%; H: 67% (actually this is a stop word, I'd discard it)

Now, given enough samples per word (e.g. 1000 samples per word), calculate the total probability of the content being spam or ham. Discard all words that don't have enough samples.
E.g. the probability of "Hello Viagra" being spam is 48% (48% × 100%), while the probability of "Hello the" being spam is 16% (48% × 33%). You can also perform another calculation - add the percentages instead of multiplying them (i.e. average them): (100+48)/200 and (48+33)/200 respectively.
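
A minimal Java sketch of the naive calculation above, using the three example words from this thread. The class and method names (`NaiveSpamScore`, `spamShare`, `product`, `average`) are made up for illustration, and this is only the rough ratio approach, not a full naive Bayes classifier.

```java
// Illustrative sketch only: names are hypothetical, not from any library.
final class NaiveSpamScore {

    // Share of a word's occurrences that came from spam: spam / (spam + ham).
    static double spamShare(int freqInSpam, int freqInHam) {
        int total = freqInSpam + freqInHam;
        if (total == 0) {
            throw new IllegalArgumentException("word has no samples");
        }
        return (double) freqInSpam / total;
    }

    // First variant: multiply the per-word shares.
    static double product(double... shares) {
        double result = 1.0;
        for (double share : shares) {
            result *= share;
        }
        return result;
    }

    // Second variant: average the per-word shares.
    static double average(double... shares) {
        double sum = 0.0;
        for (double share : shares) {
            sum += share;
        }
        return sum / shares.length;
    }

    public static void main(String[] args) {
        double hello = spamShare(100, 110);
        double viagra = spamShare(70, 0);
        double the = spamShare(200, 400);

        System.out.printf("Hello Viagra: product=%.2f average=%.2f%n",
                product(hello, viagra), average(hello, viagra));
        System.out.printf("Hello the:    product=%.2f average=%.2f%n",
                product(hello, the), average(hello, the));
    }
}
```

Note that the product shrinks as more words are multiplied in, so long messages score low with this variant regardless of content; that is one reason to prefer the real formula.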

Or take the real formula:
http://en.wikipedia.org/wiki/Bayesian_probability
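
For reference, the usual naive Bayes form of Bayes' theorem for this problem can be written as below. The symbols are my own notation, not taken from the link: S and H stand for spam and ham, and w_1 ... w_n are the words of the message. It assumes the words are conditionally independent given the class, which is the "naive" part.

```latex
P(S \mid w_1, \dots, w_n)
  = \frac{P(S) \prod_{i=1}^{n} P(w_i \mid S)}
         {P(S) \prod_{i=1}^{n} P(w_i \mid S) + P(H) \prod_{i=1}^{n} P(w_i \mid H)}
```

Here P(w_i | S) would be estimated from the frequency of the word in the spam emails, and P(S), P(H) from the share of spam and non-spam emails in the training set.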

**mickey** · 10th September 2009, 22:46

You said "stop word": how can I remove these words? Do I need a list of these words to remove? Where can I find that list?
My idea was to cut out the words with the highest frequency; I read about a 'top list' that should refer to that, is that right?
BTW it's not clear to me what a top list is; maybe the 100 words with the highest frequency? But if I had:

"hello" 300
"hi" 150
.............

Which do I remove? Only hello? Or hi as well?

**wysota** · 10th September 2009, 23:05

Originally Posted by mickey

You said "stop word": how can I remove these words?

Just don't consider them when evaluating the message.

Do I need a list of these words to remove?

Yes.

Where can I find that list?

It depends on the language of the message.
http://en.wikipedia.org/wiki/Stop_words

BTW it's not clear to me what a top list is; maybe the 100 words with the highest frequency? But if I had:

"hello" 300
"hi" 150
.............

Which do I remove? Only hello? Or hi as well?

Why do you want to remove them? Unless they are stop words, of course (I don't think they are).

If you want something more advanced, you can also perform the stemming process.
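
To give an idea of what stemming does, here is a toy suffix-stripping sketch in Java. It is deliberately crude and is not the Porter algorithm or any library's stemmer; the class name `ToyStemmer` and the suffix list are assumptions for illustration. A real project would use an established stemming algorithm instead.

```java
import java.util.Locale;

// Toy illustration of stemming: strip a few common English suffixes.
// Hypothetical helper, far simpler than a real stemmer such as Porter's.
final class ToyStemmer {

    // Checked in order, so "ing" and "ed" win over the plain "s".
    private static final String[] SUFFIXES = {"ing", "ed", "es", "s"};

    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            // Keep at least three characters so short words like "is" or "red" survive.
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"Walking", "walked", "walks", "is"}) {
            System.out.println(word + " -> " + stem(word));
        }
    }
}
```

The point is that "walking", "walked" and "walks" all count as the same word "walk" when you compute the frequencies.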

**mickey** · 11th September 2009, 00:26

I chose the wrong words as an example, sorry; I only need stop words. But must I put the words into a List on my own, or are there files with stop words around? I need them in English.
I actually need stemming too, but I don't understand how to do it. I mean: can I integrate something built by third parties (but which one?) into my Java code? Which one is the simplest to use?

**wysota** · 11th September 2009, 09:38

Originally Posted by mickey

But must I put the words into a List on my own, or are there files with stop words around? I need them in English.

Follow the link I gave you and scroll down.

**mickey** · 11th September 2009, 13:30

Originally Posted by wysota

Follow the link I gave you and scroll down.

Sorry, I saw it; I don't know how to use it, so I thought to put it inside the Java code this way:

List<String> stopwords = new ArrayList<String>(Arrays.asList("a", "about",................));

Is it too ugly?
What I need for lemmatization is a hint about something very quick to use; is there any Java standard library for it? I didn't find one and it seems odd to me...

thanks,

EDIT:
One more thing is not clear: I removed the stop words while reading them from the file, but I know that there's a technique that removes the words with the highest frequency. Are stop words and this last technique the same thing?

**wysota** · 14th September 2009, 08:38

Originally Posted by mickey

Is it too ugly?

It's slow, that's for sure. Some kind of hash or dictionary based approach would be more efficient.
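
As a rough sketch of the hash-based approach, assuming Java and `java.util.HashSet`: lookups in a `HashSet` take constant time on average, while `List.contains` scans the whole list for every word. The class name `StopWordFilter` is hypothetical, and the five words in the set are only a tiny sample, not a real English stop-word list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Sketch of stop-word removal backed by a hash set (names are illustrative).
final class StopWordFilter {

    // Tiny sample for illustration; a real list would be loaded from a file.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "about", "the", "and", "of"));

    // Returns the tokens that are not stop words, comparing case-insensitively.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Buy", "the", "viagra", "about", "now");
        System.out.println(removeStopWords(tokens)); // prints [Buy, viagra, now]
    }
}
```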

One more thing is not clear: I removed the stop words while reading them from the file, but I know that there's a technique that removes the words with the highest frequency. Are stop words and this last technique the same thing?

Stopwords are words with very high frequency in a particular language. They don't have to appear in each message (and not even in each message in this particular language). So this is something different. If a message has a lot of "buy" and "viagra" words, you probably shouldn't remove them, regardless of the fact that they have high frequency - they are still spam carrying words.