Regular expressions: Automata theory applied

This was originally posted in this TFI topic about bulletin board code parsing. Alright, for this I have to introduce you a bit into automata theory and languages. First a couple of definitions:

  • An alphabet in automata theory is a set of symbols (ASCII characters in applications usually) that can be used in the input of your automaton.
  • A language consists of all the strings of symbols that an automaton accepts.

That all sounds very vague probably. Let's give an example of a finite state automaton (every automaton has states, we have a finite number of them), we have the following automaton: Automata diagram The alphabet for this one is: a, b, c (so input streams will only contain those characters, this is only for didactic purposes, in practice the alphabet might be the whole ASCII table).

An automaton consists of states (named q0, q1 and q2 in this case). What's a state? Well in your program it will probably just be a variable, having the current state as value (you can number them or name them). An automaton has an initial state (q0 in this one) and an accepting state (meaning "this string of symbols is part of our language"), q2 in this case (you can see this because of the additional ellipse around it). The symbols next to the arrows are symbols.

How does it work? Ok, let's see if the string "acccbc" is in the language defined by this FSA (Finite State Automaton).

  • We start in q0, the first symbol we have to "eat up" is the a, is there an arrow (transition) from q0 which says a? Yes there is, and it leads to state q1.
  • Ok, we're now in q1 and as input we have cccbc left. Is there a transition from q1 with a c as symbol? Yes there is and it leads back to q1.
  • We're now in q1 again and as input we have ccbc left. Is there a transition from q1 with a c as symbol? Yes there is and it leads back to q1.
  • We're now in q1 still and as input we have cbc left. Is there a transition from q1 with a c as symbol? Yes there is and it leads back to q1.
  • We're now in q1 again and as input we have bc left. Is there a transition from q1 with b as symbol? Yes there is and it leads to q2, an accepting state. Are we done then? No, an input string is only accepted when the last symbol leads to an accepting state.
  • We're now in q2 and as input we have c left. Is there a transition from q2 with c as symbol? Yes there is, and it leads to q1

Ok, we're done. We ended up in q1 and as that's not an accepting state so acccbc is not in the language described by the FSA, it does not match. If we left out the last c: acccb, it would be accepted and would have matched.

You probably wonder what on earth is the use of this. Well, everyone who used regular expressions and probably everyone has, as indirectly used FSA's. Let's look at the FSA I just demonstrated, what's the regex that belongs to that? It's ac*(bc+)*b. Regular expression are something that came from automata theory as a simple way to describe languages. Every regular expression can be translated to a non-deterministic finite state automaton with lambda transition which on it's turn can be translated to a non-detereministic finite state automaton (NFA) and a NFA can be tranlated into a finite state automaton. Internally this is what PHP/Perl/Whatever does. Every regular expression you use is first translated into a FSA and then executed. With minor enhancements this is what preg_match (and ereg) do, they match an input string with a regular expression (in practice FSA) and if it matches, return true. preg_replace is a minor modification to that, it can also replace certain sub-strings matching a regex with other expressions.

What happens when you use multiple preg_replaces is that on every run each of them is translated, the whole input is then scanned and actions are taken. On the second one this process is repeated, so if you use 20 preg_replaces your input is scanned 20 times. Many people think regexes do magic by guessing if a certain pattern matches, but of course it's not, it just scans your input.