Results 1 to 14 of 14

Thread: Reading large unsorted file

  1. #1
    Join Date
    Jul 2016
    Posts
    6
    Thanks
    3
    Qt products
    Qt5
    Platforms
    MacOS X Windows

    Default Reading large unsorted file

    Hi everybody!

    I have a few text files with roughly 50 million lines. Each line has 3 fields separated by a space, being the first one a ID.

    The file is not sorted so there are multiple entries for the same ID all over the file. I would like to create different files, one for each ID, but with my limited capabilities I have no idea on how to make this fast. Any advice?

    Thanks

    ps: I want the lines in each file sorted by the second column.
    Last edited by GSS; 14th July 2016 at 13:45.

  2. #2
    Join Date
    Jan 2006
    Location
    Graz, Austria
    Posts
    8,416
    Thanks
    37
    Thanked 1,544 Times in 1,494 Posts
    Qt products
    Qt3 Qt4 Qt5
    Platforms
    Unix/X11 Windows

    Default Re: Reading large unsorted file

    I would do this in two steps:

    1) read the input file line by line and create temporary files for each ID
    2) then process each of those files, by reading the data into memory into a sorted container, e.g QMap, then writing that into the target file

    Cheers,
    _

  3. #3
    Join Date
    Jul 2016
    Posts
    6
    Thanks
    3
    Qt products
    Qt5
    Platforms
    MacOS X Windows

    Default Re: Reading large unsorted file

    Quote Originally Posted by anda_skoa View Post
    I would do this in two steps:

    1) read the input file line by line and create temporary files for each ID
    2) then process each of those files, by reading the data into memory into a sorted container, e.g QMap, then writing that into the target file

    Cheers,
    _
    Thanks!

    Wouldn't it be even better/faster/simpler if I put all the initial data into a QMultiMap (I'm not sure if this is even possible)?

  4. #4
    Join Date
    Jan 2008
    Location
    Alameda, CA, USA
    Posts
    5,230
    Thanks
    302
    Thanked 864 Times in 851 Posts
    Qt products
    Qt5
    Platforms
    Windows

    Default Re: Reading large unsorted file

    Wouldn't it be even better/faster/simpler if I put all the initial data into a QMultiMap
    I would use QMap< QPair< QString, QString >, QString >. QMap sorts by key value. The QPair<> key has the ID as the first entry, the column 1 value as the second entry. QPair's less than operator compares the first values and then the second values if there is a tie. Thus, when you finish building the map, you can read it out in order sorted by ID then column 1 within ID.

    Assuming you can fit 50 million lines in memory, that is.
    Last edited by d_stranz; 14th July 2016 at 18:23.
    <=== The Great Pumpkin says ===>
    Please use CODE tags when posting source code so it is more readable. Click "Go Advanced" and then the "#" icon to insert the tags. Paste your code between them.

  5. The following user says thank you to d_stranz for this useful post:

    GSS (18th July 2016)

  6. #5
    Join Date
    Jan 2006
    Location
    Graz, Austria
    Posts
    8,416
    Thanks
    37
    Thanked 1,544 Times in 1,494 Posts
    Qt products
    Qt3 Qt4 Qt5
    Platforms
    Unix/X11 Windows

    Default Re: Reading large unsorted file

    Quote Originally Posted by d_stranz View Post
    Assuming you can fit 50 million lines in memory, that is.
    Exactly, hence the idea of processing each ID separately.

    But if you have the memory to do all in one go then I would go for a map of maps or a hash of maps.

    Cheers,
    _

  7. #6
    Join Date
    Apr 2013
    Location
    Prague
    Posts
    258
    Thanks
    3
    Thanked 65 Times in 59 Posts
    Qt products
    Qt4
    Platforms
    Unix/X11

    Default Re: Reading large unsorted file

    If you cannot process 50 millions of records on memory, search an algorithm named file merging. It is an algorithm which allows you sorting huge files using your hard disk and a small amount of memory rather effectively. 50 millions of records isn't a big task for the algorithm.

  8. #7
    Join Date
    Jan 2011
    Location
    Sri Lanaka
    Posts
    64
    Thanks
    39
    Qt products
    Qt4
    Platforms
    MacOS X Unix/X11 Windows Symbian S60

    Default Re: Reading large unsorted file

    if you want to make it fast use Qt Concurrent. no matter how bad your code is Qt Concurrent will user all your cores and finish the job fast.


    #include <QtConcurrent>

    void readDataFile(QString filePath)
    {
    //read data from file here
    }

    int main(int argc, char *argv[])
    {
    QtConcurrent::run(QThreadPool::globalInstance(), &readDataFile, "C:\DataFile1.txt");
    QtConcurrent::run(QThreadPool::globalInstance(), &readDataFile, "C:\DataFile2.txt");
    QtConcurrent::run(QThreadPool::globalInstance(), &readDataFile, "C:\DataFile3.txt");
    }

  9. #8
    Join Date
    Jul 2016
    Posts
    6
    Thanks
    3
    Qt products
    Qt5
    Platforms
    MacOS X Windows

    Default Re: Reading large unsorted file

    Hi everyone!

    I tried the suggestions of using QMap and QPair but it takes ages! For example, one of the files has almost 25 million lines and it took 14 hours just to read with the following code:

    Qt Code:
    1. #include <QCoreApplication>
    2. #include <QMultiMap>
    3. #include <QtCore/QFile>
    4. #include <QTextStream>
    5. #include <QStringList>
    6. #include <iostream>
    7. #include <QDebug>
    8. #include <QDir>
    9. #include <QByteArray>
    10. #include <QTime>
    11.  
    12. using namespace std;
    13.  
    14. int main(int argc, char *argv[])
    15. {
    16. QTime myTimer;
    17. QCoreApplication a(argc, argv);
    18. QPair < QString, QString > pair1;
    19. QMap< QPair < QString, QString >, QString > map1;
    20. QMap< QString, QString > map2;
    21. QMap< QString, QMap< QString, QString > > map3;
    22. QString id, time, value;
    23.  
    24. QFile file("File2.txt");
    25. if (!file.open(QIODevice::ReadOnly | QIODevice::Text)) {
    26. qDebug() << file.fileName() << file.error() << file.errorString();
    27. return 0;
    28. }
    29.  
    30. QTime begin, end;
    31. begin = begin.currentTime();
    32.  
    33. QTextStream in(&file);
    34. myTimer.start();
    35. int count = 0;
    36. /*while (!in.atEnd()) {
    37.   QString line = in.readLine();
    38.   QStringList list1 =line.split(' ');
    39.   pair1.first = list1[0];
    40.   pair1.second = list1[1];
    41.   map1.insert(pair1, list1[2]);
    42.   cout << count++ << "\n";
    43.   }*/
    44. while (!in.atEnd()) {
    45. QString line = in.readLine();
    46. QStringList list1 =line.split(' ');
    47. map2.insert(list1[1], list1[2]);
    48. map3.insert(list1[0], map2);
    49. cout << count++ << "\n";
    50. }
    51.  
    52. end = end.currentTime();
    53. cout << begin.hour() << ":" << begin.minute() << ":" << begin.second() << " " << myTimer.elapsed() << " " << end.hour() << ":" << end.minute() << ":" << end.second() << "\n";
    54. return 0;
    55. }
    To copy to clipboard, switch view to plain text mode 

    I'll give a try to the the QtConcurrent.

  10. #9
    Join Date
    Mar 2008
    Location
    Kraków, Poland
    Posts
    1,536
    Thanked 284 Times in 279 Posts
    Qt products
    Qt4
    Platforms
    Unix/X11 Windows

    Default Re: Reading large unsorted file

    As I see You don't reset map2. So after next line map2 contains all lines from begin to current.
    I think that first line in whie loop should be :
    Qt Code:
    1. map2.clear();
    To copy to clipboard, switch view to plain text mode 

  11. The following user says thank you to Lesiok for this useful post:

    GSS (18th July 2016)

  12. #10
    Join Date
    Jan 2006
    Location
    Graz, Austria
    Posts
    8,416
    Thanks
    37
    Thanked 1,544 Times in 1,494 Posts
    Qt products
    Qt3 Qt4 Qt5
    Platforms
    Unix/X11 Windows

    Default Re: Reading large unsorted file

    Quote Originally Posted by GSS View Post
    I'll give a try to the the QtConcurrent.
    No point in doing that, QtConcurrent is for parallelizing multiple tasks, you only have one task.

    Quote Originally Posted by Lesiok View Post
    As I see You don't reset map2. So after next line map2 contains all lines from begin to current.
    I think that first line in whie loop should be :
    Qt Code:
    1. map2.clear();
    To copy to clipboard, switch view to plain text mode 
    No, every loop iteration is a single record.

    The loop needs to work with map3 and only with map3.

    Something like
    Qt Code:
    1. map3[list1[0].insertMulti(list1[1], list1[2]);
    To copy to clipboard, switch view to plain text mode 
    i.e. retrieve or create the inner map for key "list1[0]" then insert the current inner pair, allowing multiple values for key "list1[1]"

    Since the result is written back into files again, one could probably even avoid the encoding and decoding and just work with QByteArray instead of QString.

    Is the machine's physical RAM large enough to do that without swapping?
    Such a long time sounds like as if the machine started swapping.

    Cheers,
    _
    Last edited by anda_skoa; 18th July 2016 at 12:08.

  13. #11
    Join Date
    Jul 2016
    Posts
    6
    Thanks
    3
    Qt products
    Qt5
    Platforms
    MacOS X Windows

    Default Re: Reading large unsorted file

    Quote Originally Posted by Lesiok View Post
    As I see You don't reset map2. So after next line map2 contains all lines from begin to current.
    I think that first line in whie loop should be :
    Qt Code:
    1. map2.clear();
    To copy to clipboard, switch view to plain text mode 
    Well spotted. This reduced the running time to 40 minutes.

    Quote Originally Posted by anda_skoa View Post
    The loop needs to work with map3 and only with map3.

    Something like
    Qt Code:
    1. map3[list1[0].insertMulti(list1[1], list1[2]);
    To copy to clipboard, switch view to plain text mode 
    i.e. retrieve or create the inner map for key "list1[0]" then insert the current inner pair, allowing multiple values for key "list1[1]"

    Since the result is written back into files again, one could probably even avoid the encoding and decoding and just work with QByteArray instead of QString.

    Is the machine's physical RAM large enough to do that without swapping?
    Such a long time sounds like as if the machine started swapping.

    Cheers,
    _
    I'm still running with your suggestions so I cannot tell if it will take less than the 40 minutes.

    Regarding memory, it doesn't seem to be an issue. The file has almost 400 MB and the application uses less than 3 GB while running. Anyway, I'm not familiar with the concepts of "swap memory".

    Using QByteArray seems to be interesting but how could I then sort the entries and put everything back to files?

    Thanks!

    EDIT: Maybe memory is an issue. I just got a "Segmentation fault" and only half of the lines were read. Perhaps this happened now and not with the previous attempts because I have other applications running now.

    EDIT2:
    Quote Originally Posted by anda_skoa View Post
    The loop needs to work with map3 and only with map3.

    Something like
    Qt Code:
    1. map3[list1[0].insertMulti(list1[1], list1[2]);
    To copy to clipboard, switch view to plain text mode 
    I run for only 1/4 of the lines and it took 10 minutes, which means 40 minutes as the correction suggested by Lesiok.
    Last edited by GSS; 18th July 2016 at 14:21.

  14. #12
    Join Date
    Mar 2008
    Location
    Kraków, Poland
    Posts
    1,536
    Thanked 284 Times in 279 Posts
    Qt products
    Qt4
    Platforms
    Unix/X11 Windows

    Default Re: Reading large unsorted file

    Quote Originally Posted by anda_skoa View Post
    No, every loop iteration is a single record._
    Sorry anda_skoa, you're wrong. map2 is declared outside the loop thus every iteration appends next elements. Adding a clear shortened significantly duration of process.

  15. #13
    Join Date
    Jan 2006
    Location
    Graz, Austria
    Posts
    8,416
    Thanks
    37
    Thanked 1,544 Times in 1,494 Posts
    Qt products
    Qt3 Qt4 Qt5
    Platforms
    Unix/X11 Windows

    Default Re: Reading large unsorted file

    Quote Originally Posted by Lesiok View Post
    Sorry anda_skoa, you're wrong.
    Maybe, but I don't think so :-)

    Quote Originally Posted by Lesiok View Post
    map2 is declared outside the loop thus every iteration appends next elements.
    yes

    Quote Originally Posted by Lesiok View Post
    Adding a clear shortened significantly duration of process.
    Yes, and it also makes map3 only contain the last map2 for each key, which will alway be only one entry since it is cleared for every line in the file.

    My understanding was that the programs should keep all entries for a "map3" key, not just the last pair.

    Quote Originally Posted by GSS View Post
    Regarding memory, it doesn't seem to be an issue. The file has almost 400 MB and the application uses less than 3 GB while running. Anyway, I'm not familiar with the concepts of "swap memory".
    Swap is a file or partition on disk to which the operating system "swaps" memory that is currently not used by the running application, e.g. memory used by another application which is currently not running.
    See https://en.wikipedia.org/wiki/Swap_memory

    Getting a system into a state where it starts swapping heavily wil slow it "to its knees", since there is a lot of I/O overhead with a slow (compared to RAM) memory device.

    Quote Originally Posted by GSS View Post
    Using QByteArray seems to be interesting but how could I then sort the entries and put everything back to files?
    I am not sure what you mean, QFile is a QIODevice subclass, it has read/write functions for QByteArray.

    Cheers,
    _

  16. The following user says thank you to anda_skoa for this useful post:

    GSS (18th July 2016)

  17. #14
    Join Date
    Jul 2016
    Posts
    6
    Thanks
    3
    Qt products
    Qt5
    Platforms
    MacOS X Windows

    Default Re: Reading large unsorted file

    Quote Originally Posted by anda_skoa View Post
    Yes, and it also makes map3 only contain the last map2 for each key, which will alway be only one entry since it is cleared for every line in the file.

    My understanding was that the programs should keep all entries for a "map3" key, not just the last pair.
    You are absolutely right. It has to be as you say or it will replace and not append.

    Quote Originally Posted by anda_skoa View Post
    I am not sure what you mean, QFile is a QIODevice subclass, it has read/write functions for QByteArray.
    Sorry, I though you were referring to QFile::readAll() and work from there.

    I will explore the use of QByteArray to avoid the conversion to QString.

    Thank you very much!

Similar Threads

  1. Advice: Reading large text file.
    By enricong in forum Qt Programming
    Replies: 7
    Last Post: 16th July 2011, 13:11
  2. High performance large file reading on OSX
    By mikeee7 in forum Qt Programming
    Replies: 2
    Last Post: 15th October 2009, 15:18
  3. To large exe file
    By wydesenej in forum Installation and Deployment
    Replies: 8
    Last Post: 24th January 2009, 22:44
  4. large file handling
    By sakthi in forum Qt-based Software
    Replies: 1
    Last Post: 30th October 2008, 01:34
  5. large file management
    By sakthi in forum Qt Programming
    Replies: 1
    Last Post: 22nd October 2008, 09:13

Tags for this Thread

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Digia, Qt and their respective logos are trademarks of Digia Plc in Finland and/or other countries worldwide.