Copyright (C) by the contributors. Some rights reserved, license BY-SA.

This page (revision-43) was last changed on 17-Dec-2007 18:47 by YaroslavStavnichiy  

This page was created on 28-Aug-2006 11:25 by Christoph

Difference between version and

At line 373 changed one line
For Creole 0.4 I'd like to bring out the issue of spaces after the bullets. The current (0.3) draft and previous specs have this ugly special case:
Moved part of discussion to [Talk.RequireSpaceAfterBulletProposal].
----
At line 375 changed 3 lines
{{{
About unordered lists and bold: a line starting with ** (including optional whitespace before and afterwards), immediately following an unordered list element a line above, will be treated as a nested unordered list element. Otherwise it will be treated as the beginning of bold text. Also note that bold and/or italics cannot span lines in a list.
}}}
How about allowing spaces __in between__ the asterisks/hashes? I've seen at least one user trying to escape the bold markup this way and I think it's pretty creative. Would it collide with something?
At line 379 changed one line
I think it's ugly and complicates the parser needlessly. Also, many wikis already have very similar list markup, just without this special case -- making them accept both Creole and native markup at the same time would require some sort of a hack (I can't even imagine one currently).
-- RadomirDopieralski, 2007-01-22
At line 381 changed one line
One possible way of getting rid of that special case and still keeping list markup unambiguous with bold markup is //requiring// a space after the bullet.
Here is another problem, maybe someone can give some advice. My parser couldn't handle the "smart" resolving of ambiguity between bold and second-level lists, because of its construction:
At line 383 changed one line
Now, this is a different case than with space //before// the bullet. There are wiki engines that don't allow space before the bullet, and those that require it -- making it optional is really the only way to make them agree.
It parses the text in two passes, using two regular expressions. The first regular expression divides the text into block-level elements and also decides the type of each block. That decision is final and cannot be changed later. The second pass is performed for some blocks only and handles character-level markup such as bold, italics and links.
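The two-pass design described above can be sketched roughly as follows. This is only an illustration under assumptions, not the poster's actual parser: the names `BLOCK_RE`, `BOLD_RE` and `parse` are hypothetical, only list lines and paragraphs are handled, and every line is treated as its own block.

```python
import re

# Pass 1: a single regex splits the text into blocks and fixes each block's type.
# Hypothetical, simplified rule set: a line starting with bullets is a list line,
# any other non-blank line is a paragraph line.
BLOCK_RE = re.compile(
    r"(?P<list>^[ \t]*[*#]+.*$)"
    r"|(?P<para>^(?![ \t]*$).*$)",
    re.MULTILINE,
)

# Pass 2: character-level markup inside a block (only bold is shown here).
BOLD_RE = re.compile(r"\*\*(.+?)\*\*")


def parse(text):
    """Return (block_type, html_text) pairs, one per non-blank line."""
    blocks = []
    for m in BLOCK_RE.finditer(text):
        kind = m.lastgroup          # decided in pass 1, never revisited
        body = m.group(kind)
        if kind == "list":
            # Strip indentation and the bullet run before inline parsing.
            body = body.lstrip(" \t").lstrip("*#").strip()
        blocks.append((kind, BOLD_RE.sub(r"<strong>\1</strong>", body)))
    return blocks
```

With this sketch, `parse("**bold line**")` classifies the line as a list in the first pass, so the second pass never sees the opening `**`. That is the kind of final, context-free decision the post is describing.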
At line 385 changed one line
On the other hand, no wiki engine I know prohibits the space after the bullet. Some require it.
Up until now every list item was considered a separate block-level element. Since the division was made using a single regular expression, and lookbehind patterns are required to have a fixed length, I couldn't find a way to implement the context-dependent special case (where {{{**}}} at the beginning of the line is a list only if there is a {{{*}}} at the beginning of the previous line).
At line 387 changed 3 lines
Moreover, putting a space after most punctuation characters is a tradition, and for many people -- a reflex. I can see nothing unnatural in requiring it -- and it simplifies the parsers and the specs -- making Creole both easier to implement and to teach.
By the way, there is a (pretty ugly) hack to get a bold line even if the above special case is removed (remove the single space):
Now I'm trying a different approach: treat the whole list as a block element, and divide it into list items in an additional, intermediate pass. I thought that I could just require the list block to start with a single asterisk or hash, and treat everything else as a normal paragraph -- then the bold would be properly handled in the second pass. But I was too optimistic. This is a proper list (according to the current Creole spec) even when it starts with multiple asterisks:
At line 391 changed one line
{{{}} }**bold line**
***first list item bold**
* second list item
** sublist
At line 393 removed one line
-- [[RadomirDopieralski]], 2006-12-14
At line 395 changed one line
Why not accept both (asterisks and dashes)? And it goes with the unofficial [Goals] {{{Rule of least surprise}}} and some others...
Now I'm baffled. How to recognize that a block of text is a list without parsing all the character-level markup inside it first?
At line 397 changed one line
-- [EricChartre], 2006-12-28
-- RadomirDopieralski, 2007-02-05
At line 399 changed one line
Regarding the possible ambiguity of the asterisks, there is none (for the parser anyway) if the specs do not allow bold text to span multiple lines and require that bold text end at some point with **. Also, I __don't__ think that a user would ever, on purpose, do something like:
Question: How should the following markup be rendered?
At line 402 changed 2 lines
** is this text bold
** or are these just two second-level list items
# first list item
# second list item
**continued bold on the next line**
At line 406 changed one line
meaning
I think we should require that nested list characters match like this (MediaWiki style):
At line 409 changed 2 lines
<em> is this text bold<br />
</em> or are these just two second-level list items
* One
*# Two
*#* Three
At line 413 added one line
Then the first sample above can easily be resolved as a bold continuation, which, in my view, is what a user would more naturally expect.
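One possible reading of this matching rule, sketched under assumptions: a line counts as a list item only if all of its markers except the last repeat the start of the previous list line's markers. The helper names `leading_markers` and `is_list_item` are hypothetical and the rule is an interpretation of the proposal, not spec text.

```python
def leading_markers(line):
    """Return the run of '*' and '#' at the start of a line, ignoring indentation."""
    stripped = line.lstrip(" \t")
    n = 0
    while n < len(stripped) and stripped[n] in "*#":
        n += 1
    return stripped[:n]


def is_list_item(line, prev_markers):
    """prev_markers: markers of the previous list line ('' if there was none).

    Assumed rule: the parent chain (every marker but the last) must match
    the beginning of the previous list line's markers.
    """
    markers = leading_markers(line)
    return bool(markers) and prev_markers.startswith(markers[:-1])
```

Under this reading, `**continued bold on the next line**` after a `#` item is not a list item, because `*` does not match `#`, so it can fall through to bold. After a `*` item the same line still qualifies as a nested item, which is the case that stays unresolved.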
At line 414 changed 7 lines
However, the parser must do a look-ahead or a two-level parsing...
-- [EricChartre], 2006-12-28
I don't think there is any ambiguity in the example given above. I believe the asterisks signify strong, as it seems illogical to start a sub-list directly.
And the following would be considered list items.
Although this is not resolvable in this way:
At line 422 changed 3 lines
* List
** SubItem 1
** SubItem 2
* first list item
* second list item
**continued bold on the next line**
At line 427 changed 3 lines
-- [JaredWilliams], 2006-12-30
Yes, the problem is rather with these examples:
One more example:
At line 431 changed 2 lines
**foo**bar**baz
**one**two
# first list item
# second list item
* - what is this?
an asterisk in continuation of second list item,
a third list item,
or first item of a new list?
At line 435 changed 16 lines
They could be parsed as:
----
__foo__bar__baz__
__one__two
----
or
----
** foo__bar__baz
** one__two__
----
or
----
** foo__bar__baz
__one__two
----
You can't really decide without infinite (unbounded) lookahead -- and that's a great problem if you need to use a ready-made parsing algorithm or parser framework -- this rules out most of the extensible, plugin-based wiki engines.
This can only be resolved to a literal asterisk if we require blank lines between adjacent but unconnected list blocks. I don't think that is necessary. I think this should rather be rendered as the first item of a new list. A single asterisk at the beginning of a line should always be escaped - in list items as well as in paragraphs.
At line 452 changed 13 lines
You can't just make list or bold the default here -- because there are popular use cases for both:
__Paragraph titles__ are often integrated in the paragraph, like in this example. They are traditionally distinguished by making them bold. Italics won't do.
* multilevel lists
** can contain __bold__ fragments
Really, I think that requiring a space after the list bullets is a simple and effective solution. And it also removes the conflict with {{{#pragma}}} and {{{# numbered list}}} for many wiki engines.
-- RadomirDopieralski, 2006-12-30
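The space-after-bullet rule proposed above can be sketched with a single line-local regular expression. This is a minimal illustration, assuming bullets are runs of `*` or `#` with optional leading whitespace; `LIST_ITEM_RE` and `classify` are hypothetical names, not part of any spec or engine.

```python
import re

# Assumed rule: a bullet run counts as a list marker only when it is
# followed by at least one space or tab.
LIST_ITEM_RE = re.compile(r"^[ \t]*([*#]+)[ \t]+(.*)$")


def classify(line):
    """Return (kind, depth, text) using only the line itself, with no context."""
    m = LIST_ITEM_RE.match(line)
    if m:
        return ("list", len(m.group(1)), m.group(2))
    return ("text", 0, line)
```

With this rule `classify("** sublist")` is a depth-2 list item, while `classify("**bold line**")` and `classify("#pragma")` are left as ordinary text for the inline pass. No lookahead or state between lines is needed.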
I have my parser doing this
And one more:
At line 466 changed 2 lines
**foo**bar**baz
**one**two
* One
*# One.Two
*#* One.Two.Three
* Two
# One
#* One.Two
#*# One.Two.Three
At line 469 removed 15 lines
is
{{{<div><p>
<strong>foo</strong>bar<strong>baz</strong>one<strong>two</strong>
</p></div>}}}
But
{{{
*list
**foo**bar**baz
**one**two
}}}
is
{{{<div><ul><li>list<ul>
<li>foo<strong>bar</strong>baz</li>
<li>one<strong>two</strong></li>
</ul></li></ul></div>}}}
At line 485 changed 5 lines
Which I think covers it.
-- [JaredWilliams], 2006-12-30
How does it look as a regular expression? Something like:
I think we have two unconnected lists here. Same as below:
At line 491 changed one line
(?=\n\s*\*+\s*.*)\n\s*\*+\s*(.*)
* One
*# One.Two
*#* One.Two.Three
* Two
# One
#* One.Two
#*# One.Two.Three
At line 493 removed one line
as an additional rule for the lists? Or did you just write your own algorithm and remember the state between the lines?
At line 495 changed 6 lines
-- RadomirDopieralski, 2006-12-30
I don't use regular expressions.
But here is the algorithm in PHP in any case, called when the parser has seen {{{\n[-*#]}}}, with $i holding the position of the {{{[-*#]}}}.
What about this:
At line 502 changed 54 lines
/*
 * $text is the creole text
 * $i is the current position in $text
 * $l is the strlen($text)
 * $doc is the DOM Document
 * $node is the current position in the DOM Document
 * $listMap = array('-' => 'ul', '*' => 'ul', '#' => 'ol');
 */
// Traverse up the DOM tree, from our current position, looking for open lists.
$lists = array();
for ($n = $node; $n; $n = $n->parentNode)
    if ($n->nodeName == 'ol' || $n->nodeName == 'ul')
        array_unshift($lists, $n);
// See how many lists we can match... from the $text
$j = 0;
while (isset($text[$i + $j], $lists[$j], $listMap[$text[$i + $j]])
    && $listMap[$text[$i + $j]] == $lists[$j]->nodeName)
    ++$j;
// See how many list markers left...
$k = strspn($text, '-#*', $i + $j);
switch ($k)
{
    case 1:
        // Going a level deeper..
        if (isset($lists[$j - 1]))
            $node = $lists[$j - 1]->lastChild;
        else if ($j == 0 && $node->nodeName == 'li')
            $node = $node->parentNode;
        // Create UL or OL...
        $node = $this->insertElement($node, $listMap[$text[$i + $j]]);
        $node = $node->appendChild($doc->createElement('li'));
        $i += $j + $k;
        break;
    case 0:
        // List item of the most recent open list.
        $node = $this->insertElement($lists[$j - 1], 'li');
        $i += $j;
        break;
    default:
        // Horizontal line...
        if (strspn($text, '-', $i) >= 4)
        {
            $this->insertElement($node, 'hr');
            $i += $j + $k;
        }
        break;
}
# One
#** Two bold
#*# Two.Three (with literal # in front)
At line 558 changed one line
So **foo**bar**baz doesn't get recognised as a list, as $k = 2, and gets left alone for the inline parser to interpret as <strong>. But for *list\n**foo**bar**baz, $k = 1 on both lines.
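The counting step described above can be illustrated without a DOM. The following is a simplified sketch, not a port of the PHP: `classify_markers` and `LIST_TAGS` are hypothetical names, the open lists are passed in as a plain list of tag names (outermost first), and the horizontal-rule branch is left out.

```python
# Marker characters and the list element each one opens.
LIST_TAGS = {"-": "ul", "*": "ul", "#": "ol"}


def classify_markers(line, open_lists):
    """Decide what the leading markers of a line mean, given the open lists.

    open_lists: tag names of the currently open lists, outermost first.
    Returns "nest" (open a deeper list), "item" (new item in an open list)
    or "inline" (not a list; leave the markers for the inline parser).
    """
    # j: how many leading markers match the already open lists, level by level.
    j = 0
    while (j < len(line) and j < len(open_lists)
           and LIST_TAGS.get(line[j]) == open_lists[j]):
        j += 1
    # k: how many further marker characters follow the matched ones.
    k = 0
    while j + k < len(line) and line[j + k] in LIST_TAGS:
        k += 1
    if k == 1:
        return "nest"
    if k == 0 and j > 0:
        return "item"
    return "inline"
```

For example, `classify_markers("**foo**bar**baz", [])` gives `"inline"` because k is 2, while the same line with `["ul"]` open gives `"nest"` because one marker matches the open list and exactly one is left over.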
Note: this wiki renders it with a bug (bold is not closed and goes up to my signature):
At line 560 changed one line
-- [JaredWilliams], 2006-12-30
# One
#** Two bold
#*# Two.Three (with literal # in front)
At line 562 removed one line
----
At line 564 changed one line
As I've mentioned in [Raph's 0.4 recommendations], I'm in favor of using trailing whitespace to disambiguate second level list bullets from bold. It's simple and easy to understand. I am not in favor of "magic" algorithms to resolve the ambiguity. I think that non-local algorithms are especially undesirable for bullet lists, because they're often rearranged by cutting and pasting. Requiring trailing whitespace is also NotNew.
Any comments?
At line 566 changed 3 lines
From what I can tell in the above tangled discussion, it's also Radomir's favored solution. It seems to me we should be able to reach consensus on this issue fairly easily. Am I off base?
-- [RaphLevien], 2007-01-07
-- YaroslavStavnichiy, 2007-12-17
|=Version|=Date Modified|=Size|=Author|=Changes|=Change note|
|43|17-Dec-2007 18:47|14.815 kB|YaroslavStavnichiy|to previous| |
|42|26-Sep-2007 09:31|13.154 kB|ChuckSmith|to previous, to last restore| |
|41|26-Sep-2007 01:04|13.184 kB|207.44.238.95|to previous, to last| |