Thread: Detect language from the unicode

  1. #1
    Join Date
    Jan 2008
    Location
    Bengaluru
    Posts
    144
    Thanks
    8
    Thanked 7 Times in 7 Posts
    Qt products
    Qt3 Qt4
    Platforms
    Windows

    Is there any way to detect the language of a Unicode string? Please let me know of any approaches that you feel might help me.
    Thanks in advance.

  2. #2
    Join Date
    Jan 2006
    Location
    Warsaw, Poland
    Posts
    33,359
    Thanks
    3
    Thanked 5,015 Times in 4,792 Posts
    Qt products
    Qt3 Qt4 Qt5 Qt/Embedded
    Platforms
    Unix/X11 Windows Android Maemo/MeeGo
    Wiki edits
    10

    You can use statistics: search the input for common words (or sets of words) unique to a given language. You can try looking at which character sets are used, but that will only detect alphabets and not languages.
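A minimal sketch of the common-word idea, in Python rather than Qt for brevity. The word lists and the `guess_language` helper are illustrative assumptions, not part of any library; a usable detector would need far larger lists and many more languages.

```python
import re

# Illustrative, hand-picked function words (an assumption for this sketch).
COMMON_WORDS = {
    "english": {"the", "and", "of", "is", "to", "in", "that", "it"},
    "french": {"le", "la", "et", "les", "des", "est", "une", "pas"},
    "german": {"der", "die", "und", "das", "ist", "nicht", "ein", "sie"},
}

def guess_language(text):
    """Return (language, hits) for the best match, or (None, 0) if nothing matched."""
    words = re.findall(r"\w+", text.lower())
    # Count how many tokens of the input appear in each language's word list.
    scores = {
        lang: sum(word in vocab for word in words)
        for lang, vocab in COMMON_WORDS.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return None, 0
    return best, scores[best]
```

With a single hit, as in a short French sentence, the guess is weak evidence at best, which is why longer inputs matter.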
    Your biological and technological distinctiveness will be added to our own. Resistance is futile.

    Please ask Qt related questions on the forum and not using private messages or visitor messages.


  3. #3
    Join Date
    Mar 2009
    Location
    Brisbane, Australia
    Posts
    7,729
    Thanks
    13
    Thanked 1,610 Times in 1,537 Posts
    Qt products
    Qt4 Qt5
    Platforms
    Unix/X11 Windows
    Wiki edits
    17

    It would be unusual to get a single string in isolation: external data can help. For example, if the string is part of an address and you also collect the country, you get another criterion to help limit the possibilities. The presence of RTL characters or direction marks can also limit the options.
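As a rough illustration of the RTL hint, the following Python sketch checks the Unicode bidirectional class of each character. The `rtl_hints` helper and its result keys are hypothetical names; only `unicodedata.bidirectional` comes from the standard library.

```python
import unicodedata

# Strong right-to-left bidi classes: R (e.g. Hebrew) and AL (Arabic letters).
RTL_CLASSES = {"R", "AL"}
# Explicit direction marks: LRM, RLM and ALM.
DIRECTION_MARKS = {"\u200e", "\u200f", "\u061c"}

def rtl_hints(text):
    """Report whether the text contains RTL characters or explicit direction marks."""
    return {
        "rtl_chars": any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text),
        "direction_marks": any(ch in DIRECTION_MARKS for ch in text),
    }
```

A positive result only narrows the candidates to right-to-left scripts; it does not say which language is written in that script.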

  4. #4

    Thanks wysota. I have an idea now: I can store the Unicode alphabet of every language together with the language name, and then run a "contains" check on the input to detect the language. Can I compare Unicode alphabets like that?

    Originally posted by ChrisW67:
    It would be unusual to get a single string in isolation: external data can help. For example, if the string is part of an address and you also collect the country, you get another criterion to help limit the possibilities. The presence of RTL characters or direction marks can also limit the options.
    Yeah, nice idea. But my sample data has all the languages intermixed. Let me see if I can find any language-related information in the file.
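To make the limits of the alphabet comparison concrete, here is a small Python sketch that tallies letters by script. It is a heuristic under a stated assumption: Python's standard library does not expose the Unicode Script property, so the first word of each character name (for example `LATIN` or `CYRILLIC`) stands in for the script. `script_histogram` is a hypothetical helper name.

```python
import unicodedata
from collections import Counter

def script_histogram(text):
    """Count letters per script, using the first word of the character name as a proxy."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace and combining marks
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts
```

Both "Hello" and "Bonjour" come out as `LATIN`, so this check separates alphabets but cannot tell English from French.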

  5. #5

    If you have a large text with mixed languages it will be extremely hard to separate them. Besides, naively checking every possible entry will take ages. You need to preprocess the text by building some kind of dictionary that counts words or phrases, filter out entries common to many languages, and then employ some statistical apparatus to guess the language. The easiest to guess are definitely Chinese and most other Asian languages, as well as languages whose alphabet is used by only a small number of languages (Arabic, Hebrew, etc.). English will probably be the most difficult to detect, as many texts quote terms coming from English, especially technical texts.

    Language recognition is a domain in itself; don't expect to write a 100-200 line program that will do what you want. You can improve your chances by connecting to some online ontology database (such as WordNet for English) to detect phrases or even sentences that have a meaning in a particular language.
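The preprocessing steps described above (count words, drop entries shared between languages, then score) could be sketched in Python roughly as follows. The function names and the tiny training samples in the usage note are assumptions for illustration; real profiles would be built from large corpora.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())

def build_profiles(samples):
    """samples maps language -> training text; returns language -> Counter of distinctive words."""
    counts = {lang: Counter(tokenize(text)) for lang, text in samples.items()}
    # For each word, count in how many languages it occurs.
    spread = Counter(word for c in counts.values() for word in c)
    # Keep only words seen in exactly one language.
    return {
        lang: Counter({w: n for w, n in c.items() if spread[w] == 1})
        for lang, c in counts.items()
    }

def score(text, profiles):
    """Fraction of tokens in the text that are distinctive for each language."""
    tokens = tokenize(text)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {
        lang: sum(1 for t in tokens if t in prof) / total
        for lang, prof in profiles.items()
    }
```

For example, with one English and one French training sentence that both contain "police", that word is filtered out of both profiles and contributes nothing to either score.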

  6. #6

    Then you have a real problem. The only approach would be statistical, and have fun with text like:
    Tom approached the man in uniform, "Je ne parle pas bien français. Pouvez-vous m'aider à trouver un poste de police?" The man replied, "Je ne parle pas français non plus. Sprechen Sie Deutsch?"

  7. #7

    If you want statistics then you definitely need more than 1000 tokens.