Thread: naive bayes

  1. #1
    Join Date
    Jan 2006
    Posts
    976
    Qt products
    Qt3
    Platforms
    Windows
    Thanks
    53

    Default naive bayes

    Hello,
    I'm trying to implement naive Bayes to classify email as spam or not spam. I read that one technique is to delete the most frequent and the least frequent words. So I'm wondering about this:

    words from Spam email:
    "kill", freq =100
    "ate", freq=40
    "viagra", freq=39
    "love", freq =5
    ......................
    "break", 1

    words from non-Spam email:
    "love", freq=85
    "hello", freq=90
    ...........................
    "semp", 2


    Which words must I delete? I read that I should delete, for example, the top 100 words; maybe also "semp" and "break". But what about "love"? What is its frequency? It is very common in non-spam. Must I pick the most frequent words in the non-spam set and then in the spam set? Or must I sum their frequencies and then look for the most and least frequent?

    thanks,
    Hope you understand my problem
    Regards

  2. #2
    Join Date
    Jan 2006
    Location
    Warsaw, Poland
    Posts
    33,373
    Qt products
    Qt3 Qt4 Qt5 Qt/Embedded
    Platforms
    Unix/X11 Windows Android Maemo/MeeGo
    Thanks
    4
    Thanked 5,019 Times in 4,795 Posts
    Wiki edits
    10

    Default Re: naive bayes

    What do those numbers represent? Percentages?
    Your biological and technological distinctiveness will be added to our own. Resistance is futile.

    Please ask Qt related questions on the forum and not using private messages or visitor messages.


  3. #3

    Actually I made a mistake; the numbers are word frequencies. I calculated them by reading a lot of emails from a file; for each word I have its frequency in the spam emails and in the non-spam emails:

    "Hello", FreqInSpam=100, FreqInNonSpam=110
    "Viagra", FreqInSpam=70, FreqInNonSpam=0
    "the", FreqInSpam=200, FreqInNonSpam=400
    ..................................................................
    Regards

  4. #4

    A very naive approach would be to calculate the percentage of occurrences in each category (S = spam, H = ham):
    Hello - S: 100/(100+110) = 100/210 = 48%; H: 52%
    Viagra - S: 100%; H: 0%
    the - S: 33%; H: 67% (actually this is a stop word, I'd discard it)

    Now, given enough samples per word (e.g. 1000 samples per word), calculate the total probability of the content being spam or ham. Discard all words that don't have enough samples.
    For example, multiplying the spam percentages gives "Hello Viagra" a spam score of 48% (0.48 * 1.00), while "Hello the" gets 16% (0.48 * 0.33). You can also perform another calculation and average the percentages instead of multiplying them: (100+48)/200 = 74% and (48+33)/200 = 40.5%, respectively.
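
To make the arithmetic of the naive approach concrete, here is a minimal Java sketch. The class and method names (NaivePercentages, spamShare, product, average) are made up for illustration; only the counts come from this thread. Note that these scores are rough heuristics and not proper probabilities.

```java
public class NaivePercentages {
    // Share of a word's occurrences that fall in spam: spam / (spam + ham).
    public static double spamShare(int inSpam, int inHam) {
        int total = inSpam + inHam;
        if (total == 0) {
            throw new IllegalArgumentException("word was never seen");
        }
        return (double) inSpam / total;
    }

    // Combine per-word shares by multiplying them.
    public static double product(double... shares) {
        double p = 1.0;
        for (double s : shares) {
            p *= s;
        }
        return p;
    }

    // Combine per-word shares by averaging them.
    public static double average(double... shares) {
        if (shares.length == 0) {
            throw new IllegalArgumentException("no shares given");
        }
        double sum = 0.0;
        for (double s : shares) {
            sum += s;
        }
        return sum / shares.length;
    }

    public static void main(String[] args) {
        // Counts taken from the thread: (frequency in spam, frequency in non-spam).
        double hello = spamShare(100, 110);
        double viagra = spamShare(70, 0);
        double the = spamShare(200, 400);

        System.out.println("Hello Viagra, product: " + product(hello, viagra));
        System.out.println("Hello the, product:    " + product(hello, the));
        System.out.println("Hello Viagra, average: " + average(hello, viagra));
        System.out.println("Hello the, average:    " + average(hello, the));
    }
}
```

The product shrinks with every extra word, so long messages always score low; that is one reason to prefer the real formula.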

    Or take the real formula:
    http://en.wikipedia.org/wiki/Bayesian_probability
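
For comparison, a rough sketch of what a naive Bayes score could look like in Java. It is not taken from the linked page: the class name NaiveBayesSketch, the equal priors for spam and ham, and the add-one (Laplace) smoothing are all assumptions chosen to keep the example short. The counts are the ones posted above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NaiveBayesSketch {
    // Per-word counts: index 0 = occurrences in spam, index 1 = occurrences in ham.
    private final Map<String, int[]> counts = new HashMap<>();
    private long spamTotal = 0;
    private long hamTotal = 0;

    // Register a word once; calling this twice for the same word would skew the totals.
    public void addWord(String word, int inSpam, int inHam) {
        counts.put(word, new int[] {inSpam, inHam});
        spamTotal += inSpam;
        hamTotal += inHam;
    }

    // P(spam | words), ASSUMING equal priors and add-one smoothing.
    public double spamProbability(List<String> words) {
        int vocab = counts.size();
        double logSpam = Math.log(0.5);
        double logHam = Math.log(0.5);
        for (String w : words) {
            int[] c = counts.get(w);
            if (c == null) {
                continue; // words never seen in training are ignored
            }
            // Smoothing keeps a zero count (e.g. "viagra" in ham) from zeroing the product.
            logSpam += Math.log((c[0] + 1.0) / (spamTotal + vocab));
            logHam += Math.log((c[1] + 1.0) / (hamTotal + vocab));
        }
        // Normalise in log space so long messages do not underflow.
        double max = Math.max(logSpam, logHam);
        double s = Math.exp(logSpam - max);
        double h = Math.exp(logHam - max);
        return s / (s + h);
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.addWord("hello", 100, 110);
        nb.addWord("viagra", 70, 0);
        nb.addWord("the", 200, 400);
        System.out.println(nb.spamProbability(Arrays.asList("hello", "viagra")));
        System.out.println(nb.spamProbability(Arrays.asList("hello", "the")));
    }
}
```

Unlike the plain product of percentages, the result is normalised against the ham score, so it stays a probability between 0 and 1 no matter how many words the message has.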


  5. #5

    You said "stop word": how can I remove these words? Do I need a list of the words to remove? Where can I find that list?
    My idea was to cut out the words with the highest frequency; I read about a 'top list' that should refer to that, is that right?
    BTW it's not clear to me what a top list is; maybe the 100 words with the highest frequency; but if I had:

    "hello" 300
    "hi" 150
    .............

    which do I remove? Only hello? Or hi as well?
    Regards

  6. #6

    Originally posted by mickey:
    you said "stop word": how can I remove these words?
    Just don't consider them when evaluating the message.
    Do I need a list of these words to remove?
    Yes.
    Where can I find that list?
    It depends on the language of the message.
    http://en.wikipedia.org/wiki/Stop_words

    BTW it's not clear to me what a top list is; maybe the 100 words with the highest frequency; but if I had:

    "hello" 300
    "hi" 150
    .............

    which do I remove? only hello? or hi as well?
    Why do you want to remove them? Unless they are stop words, of course (I don't think they are).


    If you want something more advanced, you can also apply stemming, i.e. reduce each word to its stem before counting it.
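
To give an idea of what stemming does, here is a deliberately crude suffix stripper in Java. This is a toy and not the Porter algorithm or any real library's stemmer; the class name ToyStemmer, the suffix list and the minimum stem length are all assumptions made for illustration. It only shows the idea of mapping "buying" and "buy" to the same token before counting.

```java
import java.util.Locale;

public class ToyStemmer {
    // Suffixes to strip, tried in this order (illustrative, far from complete).
    private static final String[] SUFFIXES = {"ing", "ed", "s"};
    // Never cut a word down to fewer than this many characters.
    private static final int MIN_STEM = 3;

    public static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.endsWith("ss")) {
            return w; // keep words like "kiss" intact
        }
        for (String suffix : SUFFIXES) {
            int stemLength = w.length() - suffix.length();
            if (w.endsWith(suffix) && stemLength >= MIN_STEM) {
                return w.substring(0, stemLength);
            }
        }
        return w;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"Buying", "offered", "pills", "kiss", "red"}) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```

A real stemmer handles many more cases (doubled consonants, "-ies", irregular forms), so treat this only as a picture of the preprocessing step.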


  7. #7

    I chose the wrong words as an example, sorry; I only need stop words. But must I put the words into a list on my own, or is there a file with stop words available somewhere? I need them in English.
    I actually need stemming too, but I don't understand how to do it. I mean: can I integrate something built by a third party (but which one?) into my Java code? Which one is the simplest to use?
    Regards

  8. #8

    Originally posted by mickey:
    But must I put the words into a list on my own, or is there a file with stop words available somewhere? I need them in English.
    Follow the link I gave you and scroll down.
    Last edited by wysota; 13th September 2009 at 16:44.


  9. #9

    Originally posted by wysota:
    Follow the link I gave you and scroll down.
    Sorry, I saw it, but I don't know how to use it; so I thought of putting it inside the Java code this way:
    Java code:
    List<String> stopwords = new ArrayList<String>(Arrays.asList("a", "about",................));
    Is it too ugly?
    For lemmatization, what I need is a hint about something very quick to use; is there any Java standard library for it? I didn't find one, which seems odd to me...

    thanks,

    EDIT:
    One more thing is not clear: I removed the stop words while reading the emails from the file, but I know that there's a technique that removes the words with the highest frequency; are stop-word removal and this technique the same thing?
    Last edited by mickey; 11th September 2009 at 14:05.
    Regards

  10. #10

    Originally posted by mickey:
    Is it too ugly?
    It's slow, that's for sure: every lookup has to scan the whole list. Some kind of hash- or dictionary-based approach would be more efficient.
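
A hash-based version could look roughly like the sketch below. The class name StopWordFilter and the six sample words are placeholders for illustration, not the list from the Wikipedia page; a real English stop list is much longer. java.util.HashSet gives constant-time lookups on average, while ArrayList.contains walks the list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordFilter {
    // Tiny illustrative sample only; load the full list from a file in practice.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "about", "the", "and", "of", "to"));

    // Returns the tokens that are not stop words, compared case-insensitively.
    public static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Buy", "the", "viagra", "about", "now");
        System.out.println(removeStopWords(tokens));
    }
}
```

The filter is applied once per token while reading a message, so the cost per message stays proportional to its length regardless of how long the stop list is.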

    One more thing is not clear: I removed the stop words while reading the emails from the file, but I know that there's a technique that removes the words with the highest frequency; are stop-word removal and this technique the same thing?
    Stop words are words with very high frequency in a particular language. They don't have to appear in each message (not even in each message written in that language). So this is something different. If a message has a lot of "buy" and "viagra" words, you probably shouldn't remove them, regardless of the fact that they have high frequency: they are still spam-carrying words.


Qt is a trademark of The Qt Company.