This is the simplest parsing method, used by some scratch wikis and "write a wiki in a minimal number of lines of code" engines. It's pretty easy to understand and implement in any language that supports regular expressions, but it is very limited and not guaranteed to produce valid HTML. Still, it works most of the time.
The idea is to apply a number of regexp substitutions:
s/&/&amp;/
s/</&lt;/
s/>/&gt;/
s/^/<p>/
s/$/<\/p>/
s/\n(\s*\n)+/<\/p><p>/
s/\n----/<hr>/
s/\n====([^=\n]*)=*/<\/p><h3>\1<\/h3><p>/
s/\n===([^=\n]*)=*/<\/p><h2>\1<\/h2><p>/
s/\n==([^=\n]*)=*/<\/p><h1>\1<\/h1><p>/
s/\n\s*\*\s+(.*)/<\/p><ul><li>\1<\/li><\/ul><p>/
s/\n\s*\*\*\s+(.*)/<\/p><ul><li><ul><li>\1<\/li><\/ul><\/li><\/ul><p>/
s/\n\s*\*\*\*\s+(.*)/<\/p><ul><li><ul><li><ul><li>\1<\/li><\/ul><\/li><\/ul><\/li><\/ul><p>/
s/\n\{\{\{(([^}]|\}[^}]|\}\}[^}])*)\n\}\}\}/<\/p><pre>\1<\/pre><p>/
s/\{\{\{(([^}]|\}[^}]|\}\}[^}])*)\}\}\}/<code>\1<\/code>/
s/\/\/(([^\/]|\/[^\/])*)\/\//<em>\1<\/em>/
s/\*\*(([^\*]|\*[^\*])*)\*\*/<strong>\1<\/strong>/
s/\[\[(\w+)\]\]/<a href="wiki?\1">\1<\/a>/
s/\[\[(\w+)\|(.*)\]\]/<a href="wiki?\1">\2<\/a>/
s/\[\[(http:[^\]|]*)\]\]/<a href="\1">\1<\/a>/
s/\[\[(http:[^|]*)\|(.*)\]\]/<a href="\1">\2<\/a>/
s/<p><\/p>//
s/<\/li><\/ul><ul><li>//
s/<\/li><\/ul><ul><li>//
s/<\/li><\/ul><ul><li>//
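The substitution list above can be sketched in Python as an ordered table of patterns fed to `re.sub`. This is only an illustration covering a subset of the rules, not a reference implementation: the names `ESCAPES`, `RULES` and `render` are made up for this example, the heading rules are tried longest marker first, and the text is padded with newlines so that rules anchored on `\n` also fire on the first line.

```python
import re

# HTML escaping must run before any tags are inserted.
ESCAPES = [(r"&", "&amp;"), (r"<", "&lt;"), (r">", "&gt;")]

# Ordered (pattern, replacement) pairs; a subset of the rules above.
# Longer heading markers are tried first so "==" does not swallow "===".
RULES = [
    (r"\n====[ \t]*(.*?)[ \t]*=*(?=\n)", r"</p><h3>\1</h3><p>"),
    (r"\n===[ \t]*(.*?)[ \t]*=*(?=\n)", r"</p><h2>\1</h2><p>"),
    (r"\n==[ \t]*(.*?)[ \t]*=*(?=\n)", r"</p><h1>\1</h1><p>"),
    (r"\n----", "</p><hr><p>"),
    (r"\n(\s*\n)+", "</p><p>"),                      # blank lines split paragraphs
    (r"\*\*((?:[^*]|\*[^*])*)\*\*", r"<strong>\1</strong>"),
    (r"//((?:[^/]|/[^/])*)//", r"<em>\1</em>"),      # note: also hits "http://"
    (r"\[\[(\w+)\]\]", r'<a href="wiki?\1">\1</a>'),
    (r"\[\[(\w+)\|(.*?)\]\]", r'<a href="wiki?\1">\2</a>'),
    (r"<p>\s*</p>", ""),                             # drop empty paragraphs
]


def render(text: str) -> str:
    """Apply every substitution in order to the whole text."""
    out = text
    for pattern, repl in ESCAPES:
        out = re.sub(pattern, repl, out)
    # Wrap in one paragraph; the padding newlines let "\n=="-style rules
    # match on the first line and give the lookaheads a newline at the end.
    out = "<p>\n" + out.strip() + "\n</p>"
    for pattern, repl in RULES:
        out = re.sub(pattern, repl, out)
    return out


sample = "== Title ==\nSome **bold** and //italic// text with [[HomePage]].\n\nSecond & last."
print(render(sample))
```

Because every rule rescans the whole string, the order of the table matters, and a rule that inserts tags can change what later rules see.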
You get the point. This technique has a number of drawbacks:
This is a very popular approach. It's pretty fast, as it scans the raw text exactly twice.
The idea is to create two kinds of regular expressions: one to split the text into blocks of different kinds (paragraphs, headings, lists, preformatted blocks, etc.), and character-level ones, a different one for each kind of block, to process the contents of each block.
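A minimal sketch of this two-level scheme in Python might look as follows. Everything here is an assumption made for illustration: the names `BLOCK_RE`, `INLINE_RE`, `inline` and `render`, the flat (non-nested) lists, and the single shared inline expression used for all block kinds except preformatted text, which is only escaped.

```python
import html
import re

# Pass 1: one alternation classifies the text into blocks.
BLOCK_RE = re.compile(r"""
      ^\{\{\{\n(?P<pre>.*?)\n\}\}\}[ \t]*$                      # preformatted
    | ^(?P<level>={2,4})[ \t]*(?P<heading>[^\n]*?)[ \t]*=*[ \t]*$
    | ^(?P<rule>----)[ \t]*$
    | (?P<list>(?:^[ \t]*\*[ \t]+[^\n]*\n?)+)                   # flat list
    | (?P<para>^(?![ \t]*$)[^\n]*\n?                            # first line
        (?:^(?![ \t]*$|==|----|\{\{\{|[ \t]*\*[ \t])[^\n]*\n?)*)
""", re.MULTILINE | re.DOTALL | re.VERBOSE)

# Pass 2: character-level markup inside a block.
INLINE_RE = re.compile(r"""
      \*\*(?P<bold>.+?)\*\*
    | //(?P<em>.+?)//
    | \[\[(?P<target>\w+)(?:\|(?P<label>[^\]]+))?\]\]
""", re.VERBOSE)


def inline(text: str) -> str:
    """Escape HTML, then replace inline markup in a single scan."""
    def repl(m: re.Match) -> str:
        if m.group("bold") is not None:
            return "<strong>" + m.group("bold") + "</strong>"
        if m.group("em") is not None:
            return "<em>" + m.group("em") + "</em>"
        target = m.group("target")
        label = m.group("label") or target
        return f'<a href="wiki?{target}">{label}</a>'
    return INLINE_RE.sub(repl, html.escape(text, quote=False))


def render(text: str) -> str:
    out = []
    for m in BLOCK_RE.finditer(text):
        if m.group("pre") is not None:
            # Preformatted text is escaped but gets no inline processing.
            out.append("<pre>" + html.escape(m.group("pre"), quote=False) + "</pre>")
        elif m.group("level") is not None:
            n = len(m.group("level")) - 1          # "==" maps to <h1>
            out.append(f"<h{n}>{inline(m.group('heading'))}</h{n}>")
        elif m.group("rule") is not None:
            out.append("<hr>")
        elif m.group("list") is not None:
            items = re.findall(r"^[ \t]*\*[ \t]+(.*)$", m.group("list"), re.MULTILINE)
            out.append("<ul>" + "".join(f"<li>{inline(i)}</li>" for i in items) + "</ul>")
        else:
            out.append("<p>" + inline(m.group("para").strip()) + "</p>")
    return "\n".join(out)


sample = "== Title ==\nFirst //para// line\nsecond line\n\n* one\n* **two**\n\n----\n{{{\nraw <b>\n}}}\n"
print(render(sample))
```

Since each block is emitted as a complete element, the output is well-nested by construction, which the plain substitution list cannot promise.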