hi, my file contains lines of strings (it's a text); At the moment I'd like order every line of the file ( line1 < line2)
hi, my file contains lines of strings (it's a text); At the moment I'd like order every line of the file ( line1 < line2)
Regards
First of all there's a small application called sort that will do that for you, but I'm not sure if it's available on windows.
If you still want to do it yourself, it sounds like a job for merge sort. You can save some time by starting with the biggest blocks that fit into your RAM and sorting them using quick sort or similar algorithm.
If you know the maximum line length or at least its approximate value, you can create std::vector< std::string > of some constant size and preallocate space in every string (you can do that using two-parameter version of std::vector's constructor).
I'm confused; I know the maximum dimension of lines (1KB) but I cannot know the file size... When I say vector <string> I mean every element of vector is a line of the file.
Are you saying to do somthing like: vector<string> line (10, ""); ?? (if it's so I have allocate 10 lines.. and I don't know how many lines could have the file).
Regards
Luckily it isn't required for merge sort.
Yes. To be exact, something like:Where MAX_LINE_LEN = 1024 and NLINES is around 256k. Remember that the point is to avoid reallocations and swapping.Qt Code:
std::string temp; temp.reserve( MAX_LINE_LEN ); std::vector< std::string > lines( NLINES, temp ); // or if std::string tries to be smart: std::vector< std::string > lines( NLINES ); for( ... ) { line->reserve( MAX_LINE_LEN ); }To copy to clipboard, switch view to plain text mode
This way you can read 256k lines into preallocated space, which should be fast enough. Then you can sort that vector using std::sort, dump everyting into a temporary file and read another set of lines. And so on until the end of file. When you are finished you can free the memory and continue with merge sort algorithm.
No, that's the first step, which allows you to save some time.
Instead of starting the merge sort algorithm with blockSize = 1, you can start it with blockSize = 256000, but you'll need a file that contains sorted blocks of 256000 lines. That's where you are going to use the vector and sort().
- f := [A, B].
- fileId := 0.
- while not end of file X:
- vector := read 256000 lines from file X,
- sort vector,
- write vector to file f[fileId],
- fileId := 1 - fileId.
- blockSize := 256000.
- go to 3 from post #23.
Merge sort is one of the basic algorithms that every programmer should know. If you need more explanations, see Algorithms + Data Structures = Programs by Niklaus Wirth.
sorry, last question: your solution counts on many (dim_file/block_size) temporary files? Because that's what I understand only now..
Regards
sorry but I'm thinking that what I don't understand isn't the merge sort...
You said that I can do with 2 files A and B. So I work with 256000 lines. Sort they and put in file A; then read other 256000 lines, sort(), and put in file B. Now using mergesort to merge two file and write they in the first 500.000 lines of file X. Now I go on. 256.000*2 = 512.000; now the first 512000 lines of X are sorted! But the file isn't finish. Here begin what I don't understand....(I'm thinking this: ); rewind X,A,B. from line 512.001 of X: take a block of 256.000, sort, copy on A; take the next 256.000 block of X, sort, copy on B; mergesort on A and B and overwrite lines from 512.001 to 1024.000; take a block from of 256.000 lines of X from line 1024.000; there's only 120000 lines! Sort them; restart: blocksize 256.000*2=512.000; copy lines from 1 to 512.000 to A (they're sorted); copy lines from 512.001 to B (sorted too); mergesort on A and B and write 1024.000 lines on X; take lines from 1024.000 to end; copy 1024.000 to A and copy the 120.000 lines to B; mergesort on A and B and write on X. Stop. Is it this? It seems me slow...
Regards
Before you start to merge anything, you have to split the whole file, not only the first blocks.
If you have a file 'X' with 2048k of lines, you have to do the following:
- repeat 4 times:
- lines := read 256k lines from X,
- sort lines
- write lines to file A
- lines := read 256k lines from X,
- sort lines
- write lines to file B
- rewind files
- repeat 4 times:
- merge 256k line blocks from A and B and append them to X
- rewind files
- repeat 2 times:
- append 512k lines from X to A
- append 512k lines from X to B
- rewind files
- repeat 2 times:
- merge 512k line blocks from A and B and append them to X
- rewind files
- append 1024k lines from X to A
- append 1024k lines from X to B
- rewind files
- merge 1024k line blocks from A and B and append them to X
Of course it's slow. If you can't fit your data in memory, everything will be slow, but other methods might be event slower.
Although there might be other solution. If you find such function f, that if line1 < line2, then f(line1) < f(line2), you can sort values returned by that function instead.
Bookmarks