Thread: Regex to filter words containing english alphabets

  1. #1
    Join Date
    Mar 2011
    Location
    New Delhi, India
    Posts
    31
    Thanks
    6
    Qt products
    Qt4 Qt/Embedded
    Platforms
    Windows Symbian S60

    Default Regex to filter words containing english alphabets

    I have text like the sample below. I want to keep only the words made up of English letters (i.e. no special characters, no operators, no quotes, etc.).

    For the input below, I expect the output shown after it:

    Input:

    Buy 1 lakh SMS at 5p/SMS & Get 1lakh Data & keyword for 2 months free. Pay Rs.2000 more & Get a Dynamic 15 pages WebSite. Call 9811968238.
    Output:

    Buy lakh SMS at SMS Get lakh Data keyword for months free Pay Rs more Get Dynamic pages WebSite Call
    Qt Code:
    QFile file("E:\\SMS\\dout.csv");
    file.open(QIODevice::ReadWrite | QIODevice::Text);
    QTextStream out(&file);
    out << "This file is generated by Qt\n\n\n";

    QFile file2("E:\\SMS\\CCHECK.txt");
    file2.open(QIODevice::ReadWrite | QIODevice::Text);
    QTextStream cc_in(&file2);

    QString chk_ln = cc_in.readLine();

    while(!chk_ln.isNull())
    {
        //Problem in below line
        QStringList list2 = chk_ln.split(QRegExp("\\W+"), QString::SkipEmptyParts);

        for (int i = 0; i < list2.size(); ++i)
        {
            out<<list2[i]<<" ";
        }
        out<<"\n";

        chk_ln = cc_in.readLine();
    }

    Any help is appreciated !!

  2. #2
    Join Date
    Jan 2006
    Location
    Warsaw, Poland
    Posts
    33,359
    Thanks
    3
    Thanked 5,015 Times in 4,792 Posts
    Qt products
    Qt3 Qt4 Qt5 Qt/Embedded
    Platforms
    Unix/X11 Windows Android Maemo/MeeGo
    Wiki edits
    10

    Qt Code:
    QFile in(...);
    if(!in.open(...)) ...;
    QFile out(...);
    if(!out.open(...)) ...;
    char c;
    QChar ch;
    // Copy the input one character at a time, keeping only letters and whitespace
    while(!in.atEnd()){
        in.getChar(&c);
        ch = c;
        if(ch.isSpace() || ch.isLetter())
            out.putChar(c);
    }
    Your biological and technological distinctiveness will be added to our own. Resistance is futile.

    Please ask Qt related questions on the forum and not using private messages or visitor messages.
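
For readers without a Qt setup, the same character-by-character idea can be sketched in plain standard C++. This is only an illustration, not the code from the post: the function name `keep_letters_and_spaces` is hypothetical, and it assumes ASCII input in the default "C" locale, where `std::isalpha` accepts only A-Z and a-z.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: copy only letters and whitespace from `input`.
// Assumes the default "C" locale, so only ASCII letters count as letters.
std::string keep_letters_and_spaces(const std::string& input)
{
    std::string result;
    result.reserve(input.size());
    for (unsigned char c : input) {
        // <cctype> functions must be given an unsigned char value.
        if (std::isalpha(c) || std::isspace(c))
            result.push_back(static_cast<char>(c));
    }
    return result;
}
```

Because rejected characters are simply dropped, neighbouring fragments are joined: with this sketch "5p/SMS" becomes "pSMS", which differs from the output the original poster listed.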


  3. The following user says thank you to wysota for this useful post:

    dipeshtech (31st March 2011)

  4. #3

    Thanks...!!

    I got it done like this:

    Qt Code:
    QString temp;

    while(!chk_ln.isNull())
    {
        QStringList list2 = chk_ln.split(QRegExp("\\W+"), QString::SkipEmptyParts);

        for (int i = 0; i < list2.size(); ++i)
        {
            temp= list2[i]; //added
            if(temp.contains(QRegExp("[0-9]"))) //added
            { continue; } //added
            out<<list2[i]<<" ";
        }
        out<<"\n";

        chk_ln = cc_in.readLine();
    }

    But, thanks for answering!

  5. #4
    Join Date
    Apr 2010
    Posts
    769
    Thanks
    1
    Thanked 94 Times in 86 Posts
    Qt products
    Qt3 Qt4
    Platforms
    Unix/X11

    What happens when your input contains punctuation marks?

    I think you want something along the lines of /[a-zA-Z]+/, or a simple check of the ASCII/Unicode value range, which can take letters with diacritical marks into account while excluding all control characters, punctuation marks and whitespace.
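
A letters-only character class can be tried out with the standard `<regex>` library. The sketch below is an assumption-laden illustration, not code from the thread: `letter_words` is a hypothetical name, and it matches runs of ASCII letters instead of splitting on non-word characters. In Qt 4 a comparable effect could probably be had by splitting on `QRegExp("[^A-Za-z]+")`, though that variant is not tested here.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: return every maximal run of ASCII letters in `text`.
std::vector<std::string> letter_words(const std::string& text)
{
    // Built once and reused on later calls.
    static const std::regex word("[A-Za-z]+");

    std::vector<std::string> words;
    for (std::sregex_iterator it(text.begin(), text.end(), word), end; it != end; ++it)
        words.push_back(it->str());
    return words;
}
```

Unlike the digit-rejecting loop, this keeps the letter part of mixed tokens, so "1lakh" yields "lakh" and "Rs.2000" yields "Rs".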

  6. #5

    Yes, you spotted it right; what you describe is exactly what I am looking for.

    But the solution from my previous post also gives the right answer when punctuation marks are present.

    I tested it.

  7. #6

    Using a regular expression just to see if a single character is a digit is a really inefficient idea.
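
As a rough illustration of the alternative, a digit test needs no regular expression at all. The sketch below uses plain standard C++ rather than Qt, and `contains_digit` is a hypothetical name; in Qt code the per-character test would presumably be `QChar::isDigit()`.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: true if any character of `token` is a decimal digit.
bool contains_digit(const std::string& token)
{
    return std::any_of(token.begin(), token.end(),
                       [](unsigned char c) { return std::isdigit(c) != 0; });
}
```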

  8. #7

    Yes, I agree it is a bit inefficient, but I didn't want to read character by character, so I tried this. I am not an expert and I am new to Qt, so please forgive my ignorance. Anyway, thanks for pointing it out. (I am still learning.)

  9. #8

    The fact that you didn't write code for reading character by character doesn't mean the code you have written doesn't read character by character internally. It does, and it does so inefficiently: QString::contains() with a regexp that evaluates to a single character tries to match a string that is one character long, so it effectively does a character-by-character evaluation, and it is slower than testing a single character directly because it works on strings. It's fastest to do an ASCII value comparison:
    Qt Code:
    if((c>='A' && c<='Z') || (c>='a' && c<='z')) { /* character is ok */ }
    On most architectures this compiles to only a handful of machine instructions. If you have a lot of comparisons to make, the performance difference compared with the regexp is significant.
    Last edited by wysota; 31st March 2011 at 23:39. Reason: spelling corrections
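
A small sketch of the ASCII comparison as a reusable function, assuming an ASCII-compatible character set. The name `is_ascii_letter` is hypothetical. A single range from 'A' to 'z' would also accept the six ASCII characters that sit between 'Z' and 'a' ([, \, ], ^, _ and `), so the sketch tests the two letter ranges separately.

```cpp
// Hypothetical helper: true only for A-Z and a-z.
// Assumes an ASCII-compatible execution character set.
bool is_ascii_letter(char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}
```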

  11. #9

    Yes! I was thinking along these lines and had this point in mind:

    doesn't internally read character by character
    but I wasn't aware (actually ignorant) that it is slower than character-by-character evaluation. I need to process a large amount of data, and on my mobile device once the algorithm is implemented, so this would really make a difference to my processing.

    Thanks a ton for explaining this so clearly.

  12. #10

    Quote Originally Posted by dipeshtech
    but, wasn't aware (actually ignorant) that it is slower than character by character evaluation.
    You have the additional overhead of compiling a state machine for the regular expression in every iteration of the loop. The least you can do is move the regular expression out of the loop so that it is built once and lives across iterations. Even then, a one-character regular expression is simply inefficient, except perhaps when the expression tests many classes of characters rather than just one as in your case, and definitely not when used with contains().
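
The hoisting advice can be illustrated with the standard `<regex>` library, used here as a stand-in for QRegExp. This is a sketch under that assumption, and `drop_tokens_with_digits` is a hypothetical name: the pattern object is constructed once before the loop and reused for every token.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: keep only the tokens that contain no digit.
std::vector<std::string> drop_tokens_with_digits(const std::vector<std::string>& tokens)
{
    // Constructed once, outside the loop, instead of once per token.
    const std::regex digit("[0-9]");

    std::vector<std::string> kept;
    for (const std::string& token : tokens) {
        if (std::regex_search(token, digit))
            continue;               // token contains a digit: skip it
        kept.push_back(token);
    }
    return kept;
}
```

This only removes the repeated construction cost; each token is still scanned by the regex engine, which is the remaining inefficiency the post describes.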

  13. #11

    I think I will go with simple character-by-character tokenization to filter the words. That should reduce the overhead.

  14. #12

    The fastest approach I can think of is to use something similar to QStringRef: instead of copying data from the original string in every iteration, mark the position and length of every valid token, and at the end extract those tokens from the string in one go, possibly merging the marks so that you can extract areas as large as possible. That way you avoid the split, the data copying and other expensive operations. You also avoid a side effect of your current implementation, which does not preserve whitespace.
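
One possible reading of this suggestion, sketched in plain standard C++ rather than with QStringRef: a first pass records (start, length) pairs for each run of letters without copying anything, and a second pass extracts them in one go. The names `Span`, `in_letter_range`, `letter_spans` and `join_spans` are hypothetical, the input is assumed to be ASCII, and merging of adjacent marks is left out to keep the sketch short.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// (start offset, length) of one token inside the original string.
using Span = std::pair<std::size_t, std::size_t>;

// Assumes ASCII: true only for A-Z and a-z.
static bool in_letter_range(char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

// Pass 1: mark every maximal run of letters; no characters are copied.
std::vector<Span> letter_spans(const std::string& text)
{
    std::vector<Span> spans;
    std::size_t i = 0;
    while (i < text.size()) {
        if (!in_letter_range(text[i])) { ++i; continue; }
        const std::size_t start = i;
        while (i < text.size() && in_letter_range(text[i])) ++i;
        spans.emplace_back(start, i - start);
    }
    return spans;
}

// Pass 2: extract the marked runs, separated by single spaces.
std::string join_spans(const std::string& text, const std::vector<Span>& spans)
{
    std::string out;
    out.reserve(text.size());
    for (const Span& s : spans) {
        if (!out.empty()) out.push_back(' ');
        out.append(text, s.first, s.second);
    }
    return out;
}
```

Joining with a single space is a choice made for this sketch; keeping the original whitespace would mean also marking the whitespace runs between tokens.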

  15. #13

    I will have to try it out. I don't fully follow it at present, but I will try it for sure.

    It's already dawn here, will try after some rest.
