Results 1 to 5 of 5

Thread: File based merge sort

  1. #1
    Join Date
    Dec 2010
    Posts
    2
    Qt products
    Qt4
    Platforms
    Unix/X11

    Default File based merge sort

    Hi!

    I try to solve word count problem (calculate count of each word in files array) with very big set of data - larger than my pc's memory (RAM).

    Format of each file:
    string: int
    string: int

    I can save intermediate results in files, but at the end i'll have hundreds of files, each of them is equal in size than my pc's memory.

    Can i merge all this intermediate files in 1 with Qt tools? Is there an implementation of merge sort algorithm in qt or community code? Like this http://adaszewski.fachowcy.pl/file-based-merge-sort/ but working with Qt data containers like QMap?

    Thank you for help!

  2. #2
    Join Date
    Jan 2006
    Location
    Munich, Germany
    Posts
    4,714
    Thanks
    21
    Thanked 418 Times in 411 Posts
    Qt products
    Qt3 Qt4 Qt5 Qt/Embedded
    Platforms
    Unix/X11 Windows

    Default Re: File based merge sort

    I can save intermediate results in files
    What exactly is an intermediate result?
    What is such a result made of in terms of data?
    I don't understand why intermediate results should consume so much memory...
    ==========================signature=============== ==================
    S.O.L.I.D principles (use them!):
    https://en.wikipedia.org/wiki/SOLID_...iented_design)

    Do you write clean code? - if you are TDD'ing then maybe, if not, your not writing clean code.

  3. #3
    Join Date
    Dec 2010
    Posts
    2
    Qt products
    Qt4
    Platforms
    Unix/X11

    Default Re: File based merge sort

    What exactly is an intermediate result?
    Eg:
    word1 23
    word2 12
    Where word* - it's natural language word, and numbers - it's counts of this words in some collection of plain text files. The size of intermediate result is limited by RAM size.

    What is such a result made of in terms of data?
    Qt Code:
    1. typedef QMap<QString, int> WordCount;
    To copy to clipboard, switch view to plain text mode 
    Note that i talk about standard example from Qt examples. The difference is that i have very huge set of source plain text files to process and QMap could not fit into RAM. So i make intermediate results of the same format. After that i want to merge all of intermediate results with summarize counts of the same words in different intermediate results, eg:
    intermediate result 1:
    cat 3
    dog 1
    intermediate result 2:
    bird 10
    cat 5
    final result (after external merge sort):
    bird 10
    cat 8
    dog 1

    Intermediate results stores in separate files limited by size of RAM. So i cant just read all files in memory and sort them, because each file is little less than memory size.
    I know that there is a standard algorithm called "external memory merge sort" which can sort many large sets of data in one single largest set.
    So i could not find such implementation for Qt and asked people about it.

    May be there is an unofficial plugin for Qt?

    Sorry for my chinglish...

  4. #4
    Join Date
    Mar 2009
    Location
    Brisbane, Australia
    Posts
    7,729
    Thanks
    13
    Thanked 1,610 Times in 1,537 Posts
    Qt products
    Qt4 Qt5
    Platforms
    Unix/X11 Windows
    Wiki edits
    17

    Default Re: File based merge sort


  5. #5
    Join Date
    Aug 2009
    Location
    Belgium
    Posts
    310
    Thanks
    10
    Thanked 31 Times in 25 Posts
    Qt products
    Qt4
    Platforms
    Windows

    Default Re: File based merge sort

    So you have a text file that is so big that even the word count is a bit less than your computers' memory ??? Wow. Can you mail it to me ?

    Merging two results files is easy, especially if they are already sorted. Roughly something like this :
    Qt Code:
    1. //- Open two results files
    2. // - read a line of each file
    3. //- until both files are completely read :
    4. // - are the words identical ?
    5. // - yes -> write word and the sum to your output file
    6. // read next line in both files
    7. // - no -> write only the word that is lowest in the alphabet to your output file
    8. // read next line from this words' results file
    To copy to clipboard, switch view to plain text mode 

    You could speed this up by comparing e.g. 100 files simultaneously.

    Maybe you could also use a MySql database, but that will probably be too slow for word counting operations. You could store the results of all the input text files in a single table, and then execute a grouping and summing SQL statement onto it. I wonder if you would hit any limitation of MySql.

    Best regards,
    Marc

Similar Threads

  1. Replies: 0
    Last Post: 31st July 2010, 17:27
  2. How to sort a Qlist AND get a sort key?
    By agerlach in forum Qt Programming
    Replies: 3
    Last Post: 26th July 2010, 18:44
  3. How to merge two QPixmaps in to one.
    By live_07 in forum Qt Programming
    Replies: 2
    Last Post: 23rd September 2009, 08:19
  4. How to merge two QPixmaps in to one.
    By babu198649 in forum Newbie
    Replies: 1
    Last Post: 17th August 2008, 13:59
  5. Can I split file and merge them.
    By safknw in forum Newbie
    Replies: 1
    Last Post: 3rd October 2006, 09:40

Tags for this Thread

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Digia, Qt and their respective logos are trademarks of Digia Plc in Finland and/or other countries worldwide.