EDIT: OK. Such as other user, I couldn't understand the trick to work on block of 256k
Last edited by mickey; 12th July 2007 at 17:40. Reason: error in reply
Regards
Suppose you have file X and you want to sort it. Using merge sort you would do it this way:
- blockSize := 1.
- while not end of file X:
- copy blockSize lines to file A,
- if no more lines in file X:
- END: file A is sorted.
- copy blockSize lines to file B.
- rewind files X, A and B.
- while not end of file A and B:
- merge two blocks from A and B and write them to X,
- rewind files X, A and B.
- blockSize := 2 * blockSize.
- go to 2.
For example, let X= 42, 24, 67, 45, 79, 33, 76, 85, 29, 59, 26, 92, 30, 56, 81, 27.
Iteration 1:
A = 24, 45, 33, 85, 59, 92, 56, 27
B = 42, 67, 79, 76, 29, 26, 30, 81
X = 24, 42, 45, 67, 33, 79, 76, 85, 29, 59, 26, 92, 30, 56, 27, 81
Now X consists of 2-element sorted blocks.
Iteration 2 (this time blockSize is 2):
A = 24, 42, 33, 79, 29, 59, 30, 56
B = 45, 67, 76, 85, 26, 92, 27, 81
X = 24, 42, 45, 67, 33, 76, 79, 85, 26, 29, 59, 92, 27, 30, 56, 81
Now 4-element blocks are sorted.
Iteration 3 (blockSize = 4):
A = 24, 42, 45, 67, 26, 29, 59, 92
B = 33, 76, 79, 85, 27, 30, 56, 81
X = 24, 33, 42, 45, 67, 76, 79, 85, 26, 27, 29, 30, 56, 59, 81, 92
Final iteration:
A = 24, 33, 42, 45, 67, 76, 79, 85,
B = 26, 27, 29, 30, 56, 59, 81, 92
X = 24, 26, 27, 29, 30, 33, 42, 45, 56, 59, 67, 76, 79, 81, 85, 92
Note that:
- you don't have to read whole blocks to merge them,
- you don't have to start with blockSize = 1 (you can use some more efficient algorithm to sort blocks in memory, dump them into a file and then continue with merge sort),
- when the size of the input file isn't a power of 2, you will have a block that is shorter than blockSize.
The thing becames interesting....
I still haven't clear a point: How do I put lines of file X in the vector line?
When I speak of "speed" I mean that I know read a file line by line isn't so efficent..(!?)Qt Code:
while( getline (ifile, temp, '\n') ) { lines[l] = temp; l++; }To copy to clipboard, switch view to plain text mode
Then you said to work on 3 files: but file access and work on files is so efficent? (maybe it'll be better use 2 vector instead of 2 files?)
thanks
Regards
I would use operator>>.
If you read 2GB file into 512MB of RAM, over 3/4 of it will land back on the hard drive in a swap partition/file. If you divide that file into several blocks and sort them in memory and finally merge them, it should be faster than sorting a vector which 3/4 are in swap.
If you write each sorted part of the file into separate temporary file and then merge all of those files at once (not two at a time as in classical merge sort), you will need only one merge operation. This means that you will have to read and write every line from/to hard drive only twice.
No, that's the first step, which allows you to save some time.
Instead of starting the merge sort algorithm with blockSize = 1, you can start it with blockSize = 256000, but you'll need a file that contains sorted blocks of 256000 lines. That's where you are going to use the vector and sort().
- f := [A, B].
- fileId := 0.
- while not end of file X:
- vector := read 256000 lines from file X,
- sort vector,
- write vector to file f[fileId],
- fileId := 1 - fileId.
- blockSize := 256000.
- go to 3 from post #23.
Merge sort is one of the basic algorithms that every programmer should know. If you need more explanations, see Algorithms + Data Structures = Programs by Niklaus Wirth.
sorry, last question: your solution counts on many (dim_file/block_size) temporary files? Because that's what I understand only now..
Regards
sorry but I'm thinking that what I don't understand isn't the merge sort...
You said that I can do with 2 files A and B. So I work with 256000 lines. Sort they and put in file A; then read other 256000 lines, sort(), and put in file B. Now using mergesort to merge two file and write they in the first 500.000 lines of file X. Now I go on. 256.000*2 = 512.000; now the first 512000 lines of X are sorted! But the file isn't finish. Here begin what I don't understand....(I'm thinking this: ); rewind X,A,B. from line 512.001 of X: take a block of 256.000, sort, copy on A; take the next 256.000 block of X, sort, copy on B; mergesort on A and B and overwrite lines from 512.001 to 1024.000; take a block from of 256.000 lines of X from line 1024.000; there's only 120000 lines! Sort them; restart: blocksize 256.000*2=512.000; copy lines from 1 to 512.000 to A (they're sorted); copy lines from 512.001 to B (sorted too); mergesort on A and B and write 1024.000 lines on X; take lines from 1024.000 to end; copy 1024.000 to A and copy the 120.000 lines to B; mergesort on A and B and write on X. Stop. Is it this? It seems me slow...
Regards
Before you start to merge anything, you have to split the whole file, not only the first blocks.
If you have a file 'X' with 2048k of lines, you have to do the following:
- repeat 4 times:
- lines := read 256k lines from X,
- sort lines
- write lines to file A
- lines := read 256k lines from X,
- sort lines
- write lines to file B
- rewind files
- repeat 4 times:
- merge 256k line blocks from A and B and append them to X
- rewind files
- repeat 2 times:
- append 512k lines from X to A
- append 512k lines from X to B
- rewind files
- repeat 2 times:
- merge 512k line blocks from A and B and append them to X
- rewind files
- append 1024k lines from X to A
- append 1024k lines from X to B
- rewind files
- merge 1024k line blocks from A and B and append them to X
Of course it's slow. If you can't fit your data in memory, everything will be slow, but other methods might be event slower.
Although there might be other solution. If you find such function f, that if line1 < line2, then f(line1) < f(line2), you can sort values returned by that function instead.
Bookmarks