Best way to parse this line

**mickey** · 24th February 2008, 22:22

Hello,
I have to parse (and then process) lines read from a file, such as 10,50,30,4. A line like 10,22,3,^,22 contains a ^, which marks an error, so I have to reject that line.
My code seems OK; here it is:

Qt Code:

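#include <cstdlib>
#include <string>
#include <vector>
using namespace std;

void insertInOtherStructure(const vector<double>& line); // defined elsewhere

// Splits every line on delim and converts the fields to double.
// A line that contains a "^" field is rejected as a whole.
void myFunction(const vector<string>& linesOfFile, char delim = ',') {
	for (vector<string>::const_iterator it = linesOfFile.begin(); it != linesOfFile.end(); ++it) {
		vector<double> line;
		bool lineIsOk = true;
		string::size_type pStart = 0; // first character of the current field
		while (lineIsOk) {
			string::size_type pEnd = it->find(delim, pStart); // npos on the last field
			string subStr = it->substr(pStart, pEnd == string::npos ? string::npos : pEnd - pStart);
			if (subStr != "^")
				line.push_back(atof(subStr.c_str()));
			else
				lineIsOk = false;
			if (pEnd == string::npos) break; // no more delimiters on this line
			pStart = pEnd + 1;
		}
		if (lineIsOk) insertInOtherStructure(line);
	}
}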

The problem is that this works only with delim=','. I'd like to keep parsing lines in roughly this way (it seems simple, doesn't it?) but with other delimiters. For example, I'd like to handle lines like
10 2 22 200 1
or
10 space space space 22 space 2 33
where whoever created the file put in extra spaces (I'd still like to parse such a line as it is).

Could you help me, please? Thanks a lot.

**wysota** · 24th February 2008, 22:59

Your case is very simple. You can have three types of tokens in your data stream:
- digits
- a separator
- an "error" token

Everything else is a parse error (you can even drop the last token type and treat the error character as a parse error as well).

First you'll need a lexer function that reads the data stream and returns tokens. Then you need a parsing function that fetches tokens from the lexer and interprets them according to a grammar of your choice. The parsing function knows which symbols to expect and can either interpret them or report a parse error.

Here is a primitive lexer:

Qt Code:

#include <istream>
using namespace std;

enum Token { EndOfStream=-1, EndOfLine=-2, Separator=-3, Error=-4 };

// Returns the next token: a non-negative number or one of the negative Token values.
int lexer(istream &stream){
  int value = 0;
  bool hasValue = false;
  while(true){
    int ch = stream.peek(); // look at the next character without consuming it
    if(ch>='0' && ch<='9'){
      // a digit, accumulate the value
      value = value*10 + (ch-'0');
      hasValue = true;
      stream.get(); // consume the digit
      continue;
    }
    if(hasValue) return value; // the number has ended; the peeked character stays for the next call
    if(ch==istream::traits_type::eof()) return EndOfStream; // peek() reported the end of the stream
    stream.get(); // consume the character that was peeked
    switch(ch){
      case ' ': case '\t': break; // white space - ignore
      case '\n': return EndOfLine;
      case ',': return Separator;
      default: return Error;
    }
  }
}

The parser is really a finite state machine:

Qt Code:

enum State { NumberOrSeparatorOrEnd, Number, SeparatorOrEnd, NumberOrEnd };
void parser(){
//...
int token;
State state = NumberOrEnd;
do {
  token = lexer(stream); // fetch the next token
  switch(state){
    case NumberOrEnd:
      if(token==EndOfLine || token==EndOfStream) return;
      else if(token>=0){
        processNumber(token);
        state = SeparatorOrEnd; // expect a separator or end
      } else {
        error = true;
        return; // parse error
      }
      break;
    case SeparatorOrEnd:
      if(token==Separator){
        state = Number; // a number must follow a separator
      } else if(token==EndOfLine || token==EndOfStream){
        return;
      } else {
        error = true; return; // parse error
      }
      break;
//...
  }
} while(true); // the cases shown leave the loop through return
//...
}

And there you have a complete, extensible parser.

**mickey** · 24th February 2008, 23:52

OK, but isn't there a faster way: put each whole line of the file (delimiters included) into a vector<string> and then use string methods to extract the numbers?

**wysota** · 24th February 2008, 23:59

The method I provided is as fast as it gets in terms of complexity: you process each character only once, so the cost is O(n), where n is the number of characters in the input.

**mickey** · 29th February 2008, 20:25

Hello,

Could you suggest what to change to handle a file that contains double values instead of integers (e.g. 2.3 2.55 3 4 2.3 0 1)? Or better: what should I change if I don't know in advance what type the values will be (float, double, int)?

**wysota** · 29th February 2008, 22:59

The only difference is that after a digit you may also expect a decimal separator (a comma or a dot), and after the decimal separator at least one digit (and no further commas or dots).

**mickey** · 1st March 2008, 01:20

One more question, please. As you can see in my first post, I pass delim as a parameter, so I can read files that use ',' or any other separator of my choice. But I can also choose ' ' as delim; what happens inside your switch in that case (it won't compile)? In general, ' ' and '\t' can appear in the file because whoever wrote the file (the user) may have been careless and inserted extra spaces (my program should accept the file even if it is not perfectly formatted). But what do I do if the user decides, as they are allowed to, to use ' ' as the delimiter (e.g. 200 3 4 5 3 instead of 200,3,4,5,3)? I have no idea how to handle this case.

Thanks in advance.

**wysota** · 1st March 2008, 08:55

Originally Posted by mickey

> But I can also choose ' ' as delim; what happens inside your switch in that case (it won't compile)?

If you want tokens to be configurable, either replace the switch with comparisons using the equality operator, or check in each case whether the symbol is the token you expect under the current configuration.

> In general, ' ' and '\t' can appear in the file because whoever wrote the file (the user) may have been careless and inserted extra spaces (my program should accept the file even if it is not perfectly formatted).

The lexer ignores spaces and tabs: it treats the [\t ]+ regular expression as a token separator (just like, for example, C syntax does). \n should probably be in that group as well, but I guess that depends on your needs.

> But what do I do if the user decides, as they are allowed to, to use ' ' as the delimiter (e.g. 200 3 4 5 3 instead of 200,3,4,5,3)? I have no idea how to handle this case.

See above.

BTW, have you seen bison/flex and QLALR?

**mickey** · 1st March 2008, 16:03

Hello,
I'm not sure whether I misunderstood you or you misunderstood me (I'm referring to the case of ' ' as the delimiter). In any case, I modified your original switch to cope with it. Is this right, or is there a cleaner way to do it? The problem with the previous switch and a ' ' delimiter was that as soon as the lexer saw a blank it immediately returned Separator as the token, which may be wrong because another blank, a '\n' or a '\t' could follow. Maybe the code explains what I mean better:

Qt Code:

switch(ch){
	case ' ': case '\t':
		if (delim == ' ') {
			switch (fileStream.peek()) {
				case delim: break;
				case '\n': break;
				case '\t': break;
				default: return Separator;
			};
		}
		else break; // white space - ignore
	case '\n': line++; return EndOfLine;
	case ',':
		if (delim == ',')
			return Separator;
		else return Error;
	case -1: return EndOfStream;
	default: return Error;
};

Second question:
enum State { NumberOrEnd, Number };
State token = Number;
cout << token; // I'd like this to print the string "Number" and not '1'; is it possible?

**wysota** · 1st March 2008, 16:16

This case code of yours doesn't look good... If you want whitespace as a separator, simply change your grammar; you don't have to change the lexer, except maybe to remove the part that handles commas. You don't need to report a separator, just return numbers.

Here is the EBNF for comma-separated lines:

EBNF Code:

CommaSeparatedLine = Number, { Separator, Number } ;

And here is one for space separated lines:

EBNF Code:

SpaceSeparatedLine = Number, { Number } ;

In the second case you simply don't handle commas. As the lexer returns the value of a whole Number rather than single digits, this should be easy.
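
To make the difference between the two grammars concrete, here is a small sketch that works on tokens that have already been produced by a lexer, so it stays independent of how the characters are read. The function names and the use of a `std::vector<int>` as the token source are assumptions for illustration; the token codes are the ones from the lexer earlier in the thread.

```cpp
#include <cstddef>
#include <vector>

enum Token { EndOfStream = -1, EndOfLine = -2, Separator = -3, Error = -4 };

// CommaSeparatedLine: a Number, then any number of (Separator, Number) pairs.
bool parseCommaSeparated(const std::vector<int> &tokens, std::vector<int> &out) {
    out.clear();
    bool expectNumber = true;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        int t = tokens[i];
        if (expectNumber) {
            if (t < 0) return false;          // a number is mandatory here
            out.push_back(t);
            expectNumber = false;
        } else {
            if (t == EndOfLine || t == EndOfStream) return true;
            if (t != Separator) return false; // two numbers in a row are an error
            expectNumber = true;
        }
    }
    return !expectNumber; // out of tokens: fine only right after a number
}

// SpaceSeparatedLine: a Number followed by any number of Numbers.
bool parseSpaceSeparated(const std::vector<int> &tokens, std::vector<int> &out) {
    out.clear();
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        int t = tokens[i];
        if (t == EndOfLine || t == EndOfStream) break;
        if (t < 0) return false;              // Separator and Error are both rejected
        out.push_back(t);
    }
    return !out.empty();                      // the grammar requires at least one Number
}
```

The second function has no state at all, which is the point of the remark above: once blanks are swallowed by the lexer, a space-separated line is just a run of numbers.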

Oh, and about your question on printing enum names: it's possible in Qt when you register the enum with the meta-object system using Q_ENUMS.

**mickey** · 1st March 2008, 16:57

My parser has to accept files that use delim as the (only) separator. delim is a command-line parameter, so sometimes it could be ',' and at other times a blank. If the user doesn't specify anything on the command line, the (only) accepted separator defaults to ','.

**wysota** · 1st March 2008, 17:03

How about you just use strchr() and atof() and forget about the parser?
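
A minimal sketch of that approach is shown below. The helper name `splitLine`, the rule that a `^` field rejects the whole line, and the decision to skip empty fields only when the delimiter is a space (so that runs of spaces count as one separator) are my assumptions based on the earlier posts.

```cpp
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Splits line on delim with strchr() and converts each field with atof().
// Returns false if a field is the "^" error marker.
bool splitLine(const std::string &line, char delim, std::vector<double> &out) {
    out.clear();
    const char *p = line.c_str();
    while (true) {
        const char *end = std::strchr(p, delim);   // next delimiter, or null on the last field
        std::string field = end ? std::string(p, end) : std::string(p);
        if (field.empty() && delim == ' ') {
            // consecutive spaces: nothing between them, skip
        } else if (field == "^") {
            return false;                           // reject the whole line
        } else {
            // atof() returns 0.0 for text that is not a number and reports no error
            out.push_back(std::atof(field.c_str()));
        }
        if (!end) return true;                      // that was the last field
        p = end + 1;
    }
}
```

This is shorter than the lexer and parser pair, at the price of weaker error detection: anything `atof()` cannot convert silently becomes 0.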

**mickey** · 3rd March 2008, 00:29

At the moment I'm enjoying it. What do you think about this mixed if/switch (is it acceptable, or is a pure if chain better)? Thanks.

Qt Code:

if (token >=0) { // in this case the token is a number, which may have more than one digit
   ..............
}
else {
  switch (token) {
     case Special:
     case EndOfLine:
     case EndOfStream:
     case ........
  };
}

**wysota** · 3rd March 2008, 01:08

Switch and if are equivalent, so that doesn't really matter...

**mickey** · 3rd March 2008, 01:43

? But with if, in the worst case I have to do n comparisons, where n is the number of ifs in the chain. switch is implemented in a totally different way; with it I have to do only one comparison, don't I? It should be quicker...

**wysota** · 3rd March 2008, 07:08

Originally Posted by mickey

> switch is implemented in a totally different way; with it I have to do only one comparison, don't I?

Not really. There is no machine instruction that makes an arbitrary number of comparisons at once. Compile to assembly (gcc -S) and see for yourself.
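
A small pair of functions to try this on could look as follows. The function names and the return codes are made up for the experiment; only the token values come from the thread. Saved under some name such as `dispatch.cpp` (a hypothetical file name), it can be compiled with `g++ -S dispatch.cpp` and the two listings compared side by side. What the compiler emits for each form depends on the compiler version and the optimization level, so the listing is the only reliable answer.

```cpp
enum Token { EndOfStream = -1, EndOfLine = -2, Separator = -3, Error = -4 };

// Dispatch written as a switch.
int classifyWithSwitch(int token) {
    switch (token) {
        case EndOfStream: return 0;
        case EndOfLine:   return 1;
        case Separator:   return 2;
        default:          return 3;
    }
}

// The same dispatch written as an if chain.
int classifyWithIf(int token) {
    if (token == EndOfStream) return 0;
    if (token == EndOfLine)   return 1;
    if (token == Separator)   return 2;
    return 3;
}
```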

**mickey** · 9th March 2008, 11:04

Hello,
I'm still using the parser and, assuming it's a good approach, I have this question. In many parts of my program I need to know how many lines the file has and how many elements each line has (10, 20, 20 40 -> this line has 4 elements). My program has to check that all lines have the same number of elements, so I thought of doing this check inside the parser: if they don't, the parser reports an Error. I also thought of counting the lines of the file in the parser. All of this clutters the parser with additional ifs and variables. Wouldn't it be better to move the line counting out of the parser (into the class constructor, for example)? I don't like the parser becoming so large... Any suggestions, please?

**wysota** · 9th March 2008, 13:06

Let me put it this way - the parser is something that deals with grammar (syntax), not logic (semantics). I suggest you parse your file into some kind of list or tree or whatever else you want and then perform semantic checks or actions on the resulting structure. Of course you can combine the two operations to save time when you can deduce while parsing that the result will be useless (like when the number of items in each line differs). If you wrap that all in a class with different methods performing different tasks, the code should be simple and easily readable.
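
As an illustration of keeping the semantic check apart from the parser, the sketch below assumes the parser has already filled a vector of rows and then verifies, in a separate step, that all rows have the same length. The `Table` typedef and the function name `isRectangular` are placeholders of mine.

```cpp
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<double> > Table;

// Semantic check run after parsing: true if every row has as many
// elements as the first one. The line count is simply rows.size().
bool isRectangular(const Table &rows) {
    for (std::size_t i = 1; i < rows.size(); ++i) {
        if (rows[i].size() != rows[0].size()) return false;
    }
    return true;
}
```

The parser stays unaware of this rule, so it can be changed or dropped without touching the grammar code.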

**mickey** · 9th March 2008, 15:55

OK, I could put the data in a vector<vector<double> >, as you saw earlier, and then check everything (e.g. find the number of lines, the number of elements in a line, and the min and max values of each line). All of that requires scanning the vector of vectors from beginning to end, so I thought of doing it while parsing (but that would clutter the parser). So, having read what you wrote above, I'm thinking of writing one method that, during parsing, pushes the values into the vector, increments the element count, and checks for min and max. Is this what you mean? That way the parser looks lighter... Thanks.
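
One way to read this plan is a small collector class that the parser calls once per accepted line, so the bookkeeping lives outside the parsing code but no second pass over the data is needed. The class name `ParsedData` and its members are purely illustrative, not something from the thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

class ParsedData {
public:
    ParsedData() : consistent_(true) {}

    // Called by the parser for each accepted line; empty rows are ignored.
    void addRow(const std::vector<double> &row) {
        if (row.empty()) return;
        if (!rows_.empty() && row.size() != rows_[0].size())
            consistent_ = false;                  // element count differs from the first line
        rows_.push_back(row);
        mins_.push_back(*std::min_element(row.begin(), row.end()));
        maxs_.push_back(*std::max_element(row.begin(), row.end()));
    }

    std::size_t lineCount() const { return rows_.size(); }
    bool consistent() const { return consistent_; }
    double rowMin(std::size_t i) const { return mins_[i]; }
    double rowMax(std::size_t i) const { return maxs_[i]; }

private:
    std::vector<std::vector<double> > rows_;
    std::vector<double> mins_;                    // minimum of each stored row
    std::vector<double> maxs_;                    // maximum of each stored row
    bool consistent_;
};
```

With this split the parser only needs one extra call per line, and the counting and min/max logic can be tested without any input file.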

**wysota** · 9th March 2008, 16:11

There is no single correct way to complete the task. You may do as you see fit depending on what's most important for you.
