Discussion:
Regex and Syntax Highlighting
Graeme Geldenhuys
2010-05-25 12:34:15 UTC
Hi,

Does anybody know of a website or article I can read about how to
integrate regular expressions with an editor to end up with an editor
that can handle syntax highlighting? It doesn't need to be specific to
Object Pascal (that would be too easy and ideal). ;-)

For example, mcedit (from Midnight Commander), gEdit (Gnome's default
editor) etc. all use regex to handle their syntax highlighting. What I
would like to find out is how to use regex with an editor component.
Both are new to me, so yes, I'm down on both counts, but I am very
eager to learn. :-)

As far as I can see, I need to find out about the following requirements:

* What events must the editor component make available (eg: OnLineDraw)
* How to use regular expressions with such events
* What if syntax highlighting spans multiple lines (eg: a multi-line
  comment in Object Pascal)
* Syntax definition file layout. I guess I can look at gEdit's spec.
  They use XML files to define each syntax highlighter. If I can
  piggy-back on their definition files, I'll instantly have a whole
  bunch of syntax highlighters available.

And yes I was told before that using regex for syntax highlighting is
slow, but I think that's a matter of implementation. The editors I
have seen and used are more than fast enough even on large files. The
huge benefit of external (runtime) syntax highlighting via something
like regex is that it is very simple to extend by anybody that knows
regex. This means, no need for custom components to do syntax
highlighting like SynEdit, no recompiling of components or
applications etc.

Anyway, this editor component will form part of a larger project I am
working on. I already have the basics of an editor component and need
to find out what else I need to implement for syntax highlighting to
be possible in that editor component.

Any thoughts, suggestions, pointers - any information or links to

I found a very nice article that explains the design of an editor that
must support Unicode and syntax highlighting. So far the author has
implemented the basic editor and Unicode support, but hasn't reached the
part I am interested in - syntax highlighting. :-( Nonetheless,
this is an interesting tutorial to read. You can find it at the
following URL.

--
Regards,
- Graeme -

_______________________________________________
fpGUI - a cross-platform Free Pascal GUI toolkit
http://opensoft.homeip.net/fpgui/

--
Marco van de Voort
2010-05-25 12:38:01 UTC
Post by Graeme Geldenhuys
And yes I was told before that using regex for syntax highlighting is
slow, but I think that's a matter of implementation. The editors I
have seen and used are more that fast enough even on large files. The
huge benefit of external (runtime) syntax highlighting via something
like regex is that it is very simple to extend by anybody that knows
regex.
Not as much, since it is not just the rule, but they must integrate as well.

But the main problem IMHO is that regex is not suited to highlight many
languages that require correct detection of matching tokens, like the Pascal

--
Graeme Geldenhuys
2010-05-25 13:14:01 UTC
Post by Marco van de Voort
But the main problem IMHO is that regex is not suited to highlight many
languages that require correct detection of matching tokens, like the Pascal
mcedit (editor part of Midnight Commander) supports 68 different file
formats by default. That alone is already very impressive.

gEdit supports 85 different file formats by default (included in
Ubuntu 10.04). Even more impressive than mcedit.

In both cases, I simply had to tweak one or two lines of the pascal
syntax definition files to get the Object Pascal syntax correct for
Free Pascal specific features I use and for my code. So this is
clearly not so much of a problem and regex is sufficiently flexible
for most (if not all) languages.

I don't want to make this a debate about whether regex is a good choice
or not. I am going to use regex no matter what others say; I simply need
to find out more about how to tie it into an editor component and what
is required from the editor component. I'll download the code for the
gtksourceview component (which is used in gEdit) and see if that gives
me some hints.

Thinking about it, I should also take a look at Sun's Java code. I
quite like their Document interface and how flexible it is to use,
for anything from something as simple as a TEdit to something as
complex as a full-blown programming editor or WYSIWYG editor.
--
Regards,
- Graeme -


--
Zaher Dirkey
2010-05-25 18:08:46 UTC
I believe regex is not a good choice.
If you compare the speed, you will find SynEdit is faster.
I ported it to PHP to do the syntax highlighting for my site, and I found my
highlighter is about 10 times faster than the famous one (GeSHi), which uses regex.

http://www.parmaja.com/forums/viewtopic.php?id=45
http://qbnz.com/highlighter/

--
Zaher Dirkey
Graeme Geldenhuys
2010-05-26 08:43:56 UTC
Post by Zaher Dirkey
I beleave regex is not a good choice.
If you compare the speed you will find SynEdit is more faster.
Again, I think this is more related to implementation details. Some
editors do it well, some others don't. Some of the editors I looked
at that use regular expressions for syntax highlighting managed to
open 10MB source code files (like the old macosall.pp unit), and syntax
highlighting was instant (even when you scroll to the end of the
file).

So it is definitely *not* a hard and fast rule that an editor which uses
regular expressions for syntax highlighting is automatically
slower than others.
--
Regards,
- Graeme -


--
Zaher Dirkey
2010-05-26 10:11:17 UTC
Post by Graeme Geldenhuys
Post by Zaher Dirkey
I beleave regex is not a good choice.
If you compare the speed you will find SynEdit is more faster.
Again, I think this is more related to implementation details. Some
editors do it great, some others don't. Some of the editors I looked
at that uses regular expressions for syntax highlighting, managed to
open 10MB source code files (like the old macosall.pp unit) and syntax
highlighting was instant (even when you scroll to the end of the
file).
So it is definitely *not* a hard and fast rule that if an editor uses
regular expressions for syntax highlighting, that it is immediately
slower than others.
I meant the mechanism of SynEdit, not SynEdit itself. Let us call it "Line
Feeding Highlighting", LFH if you like :P
Regex is used on the whole file in memory, but LFH works line by line: you can
generate the colored output on the fly without loading the file into memory,
just line by line.
SynEdit works that way, the LFH one.

In fact I like FSHL (PHP): you write the rules, the engine generates the
scanner source from them, and it then uses the generated scanner. The rules
are easier than the SynEdit mechanism.
http://www.hvge.sk/scripts/fshl/

Best Regards
--
Zaher Dirkey
Graeme Geldenhuys
2010-05-26 11:06:59 UTC
Post by Zaher Dirkey
I meant the mechanism of SynEdit not SynEdit it self, Let us call "Line
Feeding Highlighting", LFH if you like :P
RegEx used on whale file in memory, but that LFH do it line by line, you can
generate the colored and syntax online without load it in memory, just line
by line.
SynEdit use that way the LFH one.
I doubt SynEdit only uses a line by line method. What happens when a
comment block covers multiple lines? The syntax highlighter needs to
keep track of where the comment block started and continue marking
everything as a comment until it finds the matching closing comment
block tag.

Like I said in my first post, I don't know how they currently
integrate regex with a syntax highlighter - that's the whole point of
the exercise, to find out how. You can very easily run a regex on a
line by line basis, but I doubt that is the best way of doing it,
because it will also produce problems with things like comment blocks.
Post by Zaher Dirkey
http://www.hvge.sk/scripts/fshl/
--
Regards,
- Graeme -


--
Martin
2010-05-26 11:19:46 UTC
Post by Graeme Geldenhuys
Post by Zaher Dirkey
I meant the mechanism of SynEdit not SynEdit it self, Let us call "Line
Feeding Highlighting", LFH if you like :P
RegEx used on whale file in memory, but that LFH do it line by line, you can
generate the colored and syntax online without load it in memory, just line
by line.
SynEdit use that way the LFH one.
I doubt SynEdit only uses a line by line method. What happens when a
comment block covers multiple lines. The syntax highlighter needs to
keep track of when the comment block started and continue marking
everything as a comment until it finds the matching closing comment
block tag.
It keeps a general state for the end of each line (which is the beginning
of the next):
- eg: currently in a comment, nested 3 deep

If a change in a line changes its end state, then further lines are
rescanned until their states match the stored ones again. In the worst case,
that is the rest of the file; but very often, it doesn't need to look beyond
the current line.
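
A minimal sketch of the line-state idea described above, in Python rather than Pascal so it can be run as-is. It is not SynEdit's code: the names end_state, scan_all and rescan_from are made up for illustration, the only state tracked is the nesting depth of { } comments, and it assumes an in-place edit that does not add or remove lines.

```python
from typing import List


def end_state(line: str, depth: int) -> int:
    """Return the { } comment nesting depth at the end of `line`,
    given the depth at its start (hypothetical helper)."""
    for ch in line:
        if ch == "{":
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
    return depth


def scan_all(lines: List[str]) -> List[int]:
    """Full scan: store one end-of-line state per line."""
    states, depth = [], 0
    for line in lines:
        depth = end_state(line, depth)
        states.append(depth)
    return states


def rescan_from(lines: List[str], states: List[int], first_changed: int) -> int:
    """Rescan after lines[first_changed] was edited in place.

    Stops as soon as a recomputed end state equals the stored one,
    because every later line then starts in the same state as before.
    Returns the number of lines that had to be looked at.
    """
    depth = states[first_changed - 1] if first_changed > 0 else 0
    scanned = 0
    for i in range(first_changed, len(lines)):
        depth = end_state(lines[i], depth)
        scanned += 1
        if states[i] == depth:
            break
        states[i] = depth
    return scanned
```

Editing a line without touching comment delimiters makes rescan_from look at that one line only; opening a comment makes it continue until the line where the comment closes again.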

---
try MacOsAll.pp (300000 lines) in your regex highlighter

--
Graeme Geldenhuys
2010-05-26 12:25:04 UTC
On 26 May 2010 13:19, Martin <***@mfriebe.de> wrote:

Thanks for the SynEdit implementation details.
Post by Martin
try MacOsAll.pp (300000 lines) in your regex highlighter
You guys don't listen! :-)

MacOSAll.pp was split some time ago into multiple include files. No
problems, I checked out an older version which was 10MB in size
(277,380 lines of text to be exact).

Using jEdit v4.3.1 (which is a Java program and uses regex for
highlighting) opened that file in under 1 second and syntax
of the file. Again, instantly moved there and instantly the syntax
highlighting was done!

THE SLOWNESS YOU GUYS ARE MENTIONING IS BASED ON A CRAP IMPLEMENTATION.

I don't know what editor you guys used to test syntax highlighting,
but clearly it was a crap editor. jEdit being a Java program is damn
fast (imagine that, a Java app being fast.) and extremely efficient
with LARGE files. So regexp syntax highlighting, implemented
correctly, does not slow down syntax highlighting!!
--
Regards,
- Graeme -


--
Mattias Gärtner
2010-05-26 12:59:10 UTC
Post by Graeme Geldenhuys
[...]
Post by Martin
try MacOsAll.pp (300000 lines) in your regex highlighter
You guys don't listen! :-)
MacOSAll.pp was split some time ago into multiple include files. No
problems, I checked out an older version which was 10MB in size
(277,380 lines of text to be exact).
If others want to try: It's in fpc 2.4.0.
Post by Graeme Geldenhuys
Using jEdit v4.3.1 (which is a Java program and uses regex for
highlighting) opened that file in under 1 second and syntax
of the file. Again, instantly moved there and instantly the syntax
highlighting was done!
This is a fake. But a nice one. See below.
Post by Graeme Geldenhuys
THE SLOWNESS YOU GUYS ARE MENTIONING IS BASED ON AN CRAP IMPLEMENTATION.
Maybe it is hard to implement it fast *and* flexible?
Post by Graeme Geldenhuys
I don't know what editor you guys used to test syntax highlighting,
but clearly it was a crap editor. jEdit being a Java program is damn
fast (imagine that, a Java app being fast.) and extremely efficient
with LARGE files. So regexp syntax highlighting, implemented
correctly, does not slow down syntax highlighting!!
Indeed. For a regex highlighter jedit is very fast.
Just replace all (* and *) with { } in macosall.pp.
jedit needs only 5 seconds to scan here. That is quite impressive for
a regex highlighter. OTOH just pressing the up key gives 100% cpu and the
cursor moves very slowly. So I would not say that jedit is "extremely
efficient with LARGE files". The random access of files is impressive
though.

Martin, while doing the same in synedit: It seems that after every
replace the highlighter is started. When doing multiple replaces only
one start is needed. Maybe this can be improved.

Mattias

--
Graeme Geldenhuys
2010-05-26 13:24:33 UTC
Post by Mattias Gärtner
This is a fake. But a nice one. See below.
I still don't understand why you say it's fake? I see syntax
highlighter code, so it works.
Post by Mattias Gärtner
Indeed. For a regex highlighter jedit is very fast.
Just replace all (* and *) with { } in macosall.pp.
jedit needs only 5 seconds to scan here.
Mine took about 1 second, whereas before it was so quick I couldn't
measure it. So 1 second on a 10MB file is more than good enough for me. And
yes, all code after the initial (* was highlighted as comments - one
color of text.
Post by Mattias Gärtner
That is quite impressing for a
regex highlighter. OTOH just pressing up key gives 100% cpu and the cursor
moves very slowly. So I would not say that jedit is "extremely efficient
with LARGE files". The random access of files is impressing though.
Nope, didn't experience anything like that here. The initial Ctrl+end
to normal idle behaviour. Moving the cursor up or down made no
difference - cpu around 2-5% as normal and cursor movement was as fast
as any other application. PgUp and PgDn repeatedly made the cpu go to
about 20-30% but that is normal too. Maybe your computer needs a
reboot. ;-)

Attached is the CPU History graph.
--
Regards,
- Graeme -

--
Mattias Gaertner
2010-05-26 18:54:20 UTC
On Wed, 26 May 2010 15:24:33 +0200
Post by Graeme Geldenhuys
Post by Mattias Gärtner
This is a fake. But a nice one. See below.
I still don't understand why you say it's fake? I see syntax
highlighter code, so it works.
I apologize.
Apparently the machine where I tested had a bug.
time than editing and then showing highlighting at end. Together with
the feature list on the jedit site I came to the wrong
conclusion that jedit has a fancy algorithm to open files with random
access, which would be cool. But it does not.
I have now tested on another machine: there both take the same time.
This machine is faster, and I guess on an even faster machine it could
reach the one second.

Apparently it uses some kind of line state too and only updates
till the visible area. Maybe synedit can do the same. At the moment
Mattias

--
Martin
2010-05-26 19:09:47 UTC
Post by Mattias Gaertner
Apparently it uses some kind of line state too and only updates
till the visible area. Maybe synedit can do the same. At the moment
It's been on my todo for a long time (and some prep work, like moving
highlighter related functions from synedit to highlighter have started....)

It will be limited though => because folding needs to know the full deal
(as scrollbars depend on the total of visible (unfolded) lines).
Of course, if no nodes are folded => folding doesn't care (even if
switched on). It's only when a node actually is folded => then folding
needs to verify it still exists.

Another improvement (but much less noticeable) would be to separate
the structural and the visual scan.

Certain info (like making a "+" red) is only needed when you scan for
display.
It doesn't matter for the overall structure. Also, many keywords make no
difference for the structure (as far as the highlighter is concerned),
"deprecated" for example. In a structural scan it doesn't need to be a
keyword.

Martin

--
Mattias Gaertner
2010-05-26 19:29:42 UTC
On Wed, 26 May 2010 20:09:47 +0100
Post by Martin
Post by Mattias Gaertner
Apparently it uses some kind of line state too and only updates
till the visible area. Maybe synedit can do the same. At the moment
It's been on my todo for a long time (and some prep work, like moving
highlighter related functions from synedit to highlighter have started....)
It will be limited though => because folding needs to know the full deal
(as scrollbars depend on the total of visible (unfolded) lines.
Of course, if no nodes are folded => folding doesn't care (even if
switches on). It's only when a node actually is folded => then folding
needs to verify it still exists.
I see.
But at least "replace all" could be improved.
Post by Martin
Another improvement (but much less noticable) would be to separate
structural and visual scan.
But that would mean traversing the code two times. Bad for
processor caching.
Post by Martin
certain info (like making a "+" red) is only needed when you scan for
display.
It doesn't matter for the overall structure. Also many keywords make no
difference for the structure (as far as the highlighter is concerned)
"deprecated" for example. In structural scan it doesn't need to be a
keyword.
I will send you a gprof log.

Mattias

--
Martin
2010-05-26 19:41:39 UTC
Post by Mattias Gaertner
On Wed, 26 May 2010 20:09:47 +0100
Post by Martin
Post by Mattias Gaertner
Apparently it uses some kind of line state too and only updates
till the visible area. Maybe synedit can do the same. At the moment
It's been on my todo for a long time (and some prep work, like moving
highlighter related functions from synedit to highlighter have started....)
It will be limited though => because folding needs to know the full deal
(as scrollbars depend on the total of visible (unfolded) lines.
Of course, if no nodes are folded => folding doesn't care (even if
switches on). It's only when a node actually is folded => then folding
needs to verify it still exists.
I see.
But at least "replace all" could be improved.
Hm, how do you mean? Unless there is a bug I don't know about.

The search replace is done inside a "PaintLock" => that (to the very
best of my knowledge) prevents the highlighter from doing anything.

Only one exception:
- if you chose to "prompt and confirm" => then each time a prompt comes,
the paintlock is interrupted, and a rescan is done => but this is needed,
because in order to prompt, the display must be updated => the
highlighter must scan.

But without a prompt, it should only scan once at the end of the replace.
(if there is evidence it does more => let me know)
Post by Mattias Gaertner
Post by Martin
Another improvement (but much less noticable) would be to separate
structural and visual scan.
But that would mean traversing two times though the code. Bad for
processor caching.
Not more than currently:

1) The highlighter scans the text in order to update all ranges (but that
does not store any mid-line info)
2) each time a line is painted => that line is parsed, and returned token by
token to be painted.

So the first scan can skip a lot of details => but it requires a lot of
rewrite work on the highlighter.
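
To illustrate the two scans listed above, here is a rough Python sketch (not SynEdit's implementation). The names structural_scan, paint_tokens, KEYWORDS and TOKEN_RE are invented for the example, the keyword list is a tiny made-up subset, and only non-nesting { } comments are treated as structure.

```python
import re
from typing import List, Tuple

KEYWORDS = {"begin", "end", "if", "then", "procedure"}  # illustrative subset
TOKEN_RE = re.compile(r"\{|\}|[A-Za-z_]\w*|\s+|.")


def structural_scan(lines: List[str]) -> List[bool]:
    """Scan 1: store only whether each line ends inside a { } comment.
    No keywords, no operators, no mid-line info."""
    in_comment, ranges = False, []
    for line in lines:
        for ch in line:
            if ch == "{":
                in_comment = True
            elif ch == "}":
                in_comment = False
        ranges.append(in_comment)
    return ranges


def paint_tokens(lines: List[str], ranges: List[bool],
                 index: int) -> List[Tuple[str, str]]:
    """Scan 2: tokenize one line in full detail, only when it is painted.
    The start state comes from the range stored for the previous line."""
    in_comment = ranges[index - 1] if index > 0 else False
    out = []
    for m in TOKEN_RE.finditer(lines[index]):
        text = m.group()
        if in_comment:
            style = "comment"
            if text == "}":
                in_comment = False
        elif text == "{":
            style = "comment"
            in_comment = True
        elif text.lower() in KEYWORDS:
            style = "keyword"
        elif text.isspace():
            style = "space"
        elif text[0].isalpha() or text[0] == "_":
            style = "identifier"
        else:
            style = "symbol"
        out.append((text, style))
    return out
```

An editor would call structural_scan (or an incremental version of it) after edits, and paint_tokens only for the lines currently on screen.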

There are plenty of other optimizations in and outside the highlighter....

Martin

--
Mattias Gaertner
2010-05-26 22:26:24 UTC
On Wed, 26 May 2010 20:41:39 +0100
Post by Martin
Post by Mattias Gaertner
On Wed, 26 May 2010 20:09:47 +0100
Post by Martin
Post by Mattias Gaertner
Apparently it uses some kind of line state too and only updates
till the visible area. Maybe synedit can do the same. At the moment
It's been on my todo for a long time (and some prep work, like moving
highlighter related functions from synedit to highlighter have started....)
It will be limited though => because folding needs to know the full deal
(as scrollbars depend on the total of visible (unfolded) lines.
Of course, if no nodes are folded => folding doesn't care (even if
switches on). It's only when a node actually is folded => then folding
needs to verify it still exists.
I see.
But at least "replace all" could be improved.
Hm How to you mean. Unless there is a bug I don't know about.
You are right, it is not the highlighter. I will send you the gprof
output.

Mattias

--
Martin
2010-05-26 19:49:55 UTC
Post by Mattias Gaertner
On Wed, 26 May 2010 20:09:47 +0100
Post by Martin
Post by Mattias Gaertner
Apparently it uses some kind of line state too and only updates
till the visible area. Maybe synedit can do the same. At the moment
It's been on my todo for a long time (and some prep work, like moving
highlighter related functions from synedit to highlighter have started....)
It will be limited though => because folding needs to know the full deal
(as scrollbars depend on the total of visible (unfolded) lines.
Of course, if no nodes are folded => folding doesn't care (even if
switches on). It's only when a node actually is folded => then folding
needs to verify it still exists.
I see.
It doesn't mean that a deferred scan would not help at all.

Many people do not have folded nodes (collapsed nodes) at start up => if
all code is unfolded, the scan can be deferred, or done on idle.

of synedit doing all the calling.... but it's still some work left....

Martin

--
Hans-Peter Diettrich
2010-05-27 00:33:17 UTC
Post by Graeme Geldenhuys
THE SLOWNESS YOU GUYS ARE MENTIONING IS BASED ON AN CRAP IMPLEMENTATION.
Well, the goal has shifted away from only *syntax* highlighting.
Post by Graeme Geldenhuys
I don't know what editor you guys used to test syntax highlighting,
but clearly it was a crap editor. jEdit being a Java program is damn
fast (imagine that, a Java app being fast.) and extremely efficient
with LARGE files. So regexp syntax highlighting, implemented
correctly, does not slow down syntax highlighting!!
RegExp per se is a DFA, but recognition of multiple possible tokens
requires a NFA. This increases the O() complexity of the algorithm.

an file, before the highlighter can be used.

So it depends on the *concrete* syntax, how fast or slow the lexer
automaton can be.

Next comes the storage of modifications. A mere viewer can work directly
on the immutable file, but when the source can be modified at runtime,
with undo-tracking, and foldable blocks come into play, and UTF-8 and
tab expansion, then it may take longer to retrieve the text to show, the
highlighter must be fault-tolerant to cover temporarily invalid tokens,
and the source may need a reparse on every single insert/delete.

So yes, a syntax highlighter *can* be amazingly fast, as I know from my
own experiments, but this can vary dramatically with more complex
requirements.

IMO it's a matter of preferences, whether one wants to construct an
editor in the first place, and add syntax highlighting to it, or whether
one wants to implement a syntax highlighter for a file viewer. So it
doesn't make sense to compare apples and oranges, and to suspect a
crappy implementation, unless one knows all related requirements.

DoDi

--
Graeme Geldenhuys
2010-05-27 06:29:40 UTC
file, before the highlighter can be used.
I don't think this is true. I briefly looked at the jEdit code that
manages the syntax highlighting, and it doesn't do multiple passes
over the file. It tokenizes a line, if that line contains something
that could span multiple lines (like { or (* style comments), it
carries the state and context over to the next line. Then it processes
the next line etc... only a single pass seems to be used.

jEdit also supports code-folding, but I haven't looked at how they tie
that into the whole TextEdit component yet. The more I play with jEdit
the more impressed I get regarding all its features and how
extensible it is via its plugin system. A pretty amazing piece of
undo-tracking, and foldable blocks come into play, and UTF-8 and tab
expansion, then it may take longer to retrieve the text to show, the
highlighter must be fault-tolerant to cover temporarily invalid tokens,
jEdit supports all that and more, and it is very fast! Like I said, a
very impressive piece of engineering.
--
Regards,
- Graeme -


--
Hans-Peter Diettrich
2010-05-27 09:42:04 UTC
Post by Graeme Geldenhuys
file, before the highlighter can be used.
I don't think this is true. I briefly looked at the jEdit code that
manages the syntax highlighting, and it doesn't do multiple passes
over the file.
I didn't say that parsing has to be done in multiple passes.

DoDi

--
Graeme Geldenhuys
2010-05-27 12:48:12 UTC
Post by Hans-Peter Diettrich
Post by Graeme Geldenhuys
file, before the highlighter can be used.
I don't think this is true. I briefly looked at the jEdit code that
manages the syntax highlighting, and it doesn't do multiple passes
over the file.
I didn't say that parsing has to be done in multiple passes.
My mistake, I took "pre-scan" as one pass and then "highlighting" as
another pass. It seems perfectly possible to use a single pass/scan of
the source code to manage syntax highlighting - as long as state and
context are carried forward when language syntax is detected that could
span multiple lines (like { or (* style comments).
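
A small sketch of that single pass, written in Python so it can be tried directly. It is not jEdit's code: RULES, tokenize_line and the state names are hypothetical. Each state has its own regex, the tokenizer returns the state it ends in, and the caller feeds that state into the next line.

```python
import re
from typing import List, Tuple

# One regex per state; the state machine decides which one applies.
RULES = {
    "code": re.compile(
        r"(?P<brace_open>\{)|(?P<paren_open>\(\*)|(?P<string>'[^']*')"
        r"|(?P<word>[A-Za-z_]\w*)|(?P<other>\s+|.)"
    ),
    "brace_comment": re.compile(r"(?P<close>\})|(?P<text>[^}]+)"),
    "paren_comment": re.compile(r"(?P<close>\*\))|(?P<text>(?:[^*]|\*(?!\)))+)"),
}


def tokenize_line(line: str, state: str = "code") -> Tuple[List[Tuple[str, str]], str]:
    """Tokenize one line starting in `state`; return (tokens, end_state)."""
    tokens, pos = [], 0
    while pos < len(line):
        # Every rule set ends in a catch-all, so match() always succeeds
        # on a line without embedded newlines.
        m = RULES[state].match(line, pos)
        kind, text = m.lastgroup, m.group()
        if state == "code":
            if kind == "brace_open":
                style, state = "comment", "brace_comment"
            elif kind == "paren_open":
                style, state = "comment", "paren_comment"
            else:
                style = kind
        else:
            style = "comment"
            if kind == "close":
                state = "code"
        tokens.append((text, style))
        pos = m.end()
    return tokens, state


def tokenize_lines(lines: List[str]) -> List[List[Tuple[str, str]]]:
    """Single pass over all lines, carrying the state from line to line."""
    state, result = "code", []
    for line in lines:
        tokens, state = tokenize_line(line, state)
        result.append(tokens)
    return result
```

Note that a { inside a (* *) comment stays plain comment text here, because the "paren_comment" rule set has no rule for it.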
--
Regards,
- Graeme -


--
Mattias Gärtner
2010-05-26 12:24:19 UTC
Post by Graeme Geldenhuys
Post by Zaher Dirkey
I meant the mechanism of SynEdit not SynEdit it self, Let us call "Line
Feeding Highlighting", LFH if you like :P
LFH normally comes with a line state (some booleans or counters). Same
as synedit. But synedit supports arbitrary states (the default
implementation implements one stack). You need a stack for different
keyword sets. For example for the method modifiers. This is not fully
used in the synedit pascal highlighter, because IMO highlighting some
variables as keywords is not a big deal and because IFDEFs and macros
make it hard to implement fully.
For example:

procedure DoIt(
{$IFNDEF FPC} ); {$ELSE} i: integer = 0); inline; macro_modifier;
{$ENDIF}
Post by Graeme Geldenhuys
Post by Zaher Dirkey
RegEx used on whale file in memory, but that LFH do it line by line, you can
generate the colored and syntax online without load it in memory, just line
by line.
That's good for logs and csv files. That does not work well for
sources, where blocks span many lines. You need the line state.
Post by Graeme Geldenhuys
[...]
Like I said in my first post, I don't know how they currently
integrate regex with a syntax highlighter - that's the whole point of
the exercise, to find out how. You can very easily run a regex on a
line by line basis, but I doubt that is the best way of doing it,
because it will also produce problems with things like comment blocks.
AFAIK they do. Because regex can not count nor save states, you need a
state machine, which selects which set of regex to use.
Post by Graeme Geldenhuys
Post by Zaher Dirkey
http://www.hvge.sk/scripts/fshl/
http://code.google.com/p/fshl/
Their configs are php. I think the gtksourceview (gedit) syntax is
easier to understand. As far as I can see it is powerful enough. It is
not as fast and flexible as synedit, but it is enough to highlight even
blocks.

Graeme, when you start implementing a highlighting machine, you might
want to consider code folding too.

Mattias

--
Martin
2010-05-26 12:41:58 UTC
Post by Mattias Gärtner
Post by Zaher Dirkey
I meant the mechanism of SynEdit not SynEdit it self, Let us call "Line
Feeding Highlighting", LFH if you like :P
LFH normally comes with a line state (some booleans or counters). Same
as synedit. But synedit supports arbitrary states (the default
implementation implements one stack). You need a stack for different
keyword sets. For example for the method modifiers. This is not fully
used in the synedit pascal highlighter, because IMO highlighting some
variables as keywords is not a big deal and because IFDEFs and macros
make it hard to implement fully.
procedure DoIt(
{$IFNDEF FPC} );
{$ELSE} i: integer = 0); inline; macro_modifier; {$ENDIF}
IFDEFs pose a bigger problem:

if foo then begin
bar();
{$IFDEF a} end; {$endif}
xyz();
{$IFnDEF a} end; {$endif}

The ifdef may not be possible to evaluate (within the highlighter;
codetools could)

for xyz => 2 states are possible, and would need to be maintained.
With more ifdefs, any number of states for a single line is possible.

So IFDEFs should always balance begin/end correctly => otherwise there is
currently no way to deal with them.

----
As for using special colors for "inactive" code => that isn't something
the highlighter would do (because the highlighter works per file => it
has no way to know)

A special IDE-specific SynEdit extension (that combines info from the
highlighter and codetools) will one day deal with this

Martin

--
Graeme Geldenhuys
2010-05-26 12:43:46 UTC
Post by Graeme Geldenhuys
line by line basis, but I doubt that is the best way of doing it,
because it will also produce problems with things like comment blocks.
AFAIK they do. Because regex can not count nor save states, you need a state
machine, which selects which set of regex to use.
I guess I will have to delve into the jEdit code, as that is extremely
fast no matter the file size. So clearly following their design should
yield very good results.
Graeme, when you start implementing a highlighting machine, you might want
to consider code folding too.
Umm, personally I hate code folding, but it might be useful to some.
The "rich edit/view" component I want to create will be used as a
programming editor, but I would also like it to be used as the
RichView component used in DocView. Docview's current "richview"
component is only a read-only component which supports Unicode text,
varying fonts, varying fonts on the same line, embedded images,
margins, varying text colors, hyper links etc... I would like to see
if I can implement all that and editing/syntax highlighting in a
single component, plus adding "elastic tabstops" support to boot.

A tall order, but so far a very interesting exercise. I have already
learned a whole bunch of things I never knew about editors. :) For
example, the various designs for managing the buffer: Gap buffer,
linked lists, Piece Chains, etc... pretty amazing stuff.
--
Regards,
- Graeme -


--
Marco van de Voort
2010-05-25 18:23:35 UTC
Post by Graeme Geldenhuys
syntax definition files to get the Object Pascal syntax correct for
Free Pascal specific features I use and for my code. So this is
clearly not so much of a problem and regex is sufficiently flexible
for most (if not all) languages.
I just pointed out a theoretical deficit of regex. It is up to you whether you
do something with it before you commit yourself to this. So again, to be
clear:

Standard regex can't deal with nested structures like nesting of comments.

Maybe your regex library has some substitute or workaround to disambiguate
common cases or not. I don't know much about what the variants can do.

I just recommend to look into it before you spend too much time on this.

Since you already identified several editors with support for it, a quick
test to see if they can fully parse such a construct should give more comfort.

--
Graeme Geldenhuys
2010-05-26 08:37:20 UTC
Post by Marco van de Voort
Standard regex can't deal with nested structures like nesting of comments.
OK. Do you know of some complex code maybe included in Lazarus or FPC
that I could use as sample code to test?
Post by Marco van de Voort
Since you already identified several editors with support for it, a quick
test to see if they can fully parse such a construct should give more comfort.
I installed jEdit yesterday. It supports a mammoth 177 different
syntax highlighter styles for all types of source code, text files
like xml/html/css, config files etc. They also use a combination of
regex and various code rules.

Interesting thing is that for all the different syntax styles they map
the colors back to a handful of hard-coded color keywords. eg:
KEYWORD1, KEYWORD2, OPERATOR, LITERAL1 etc. The editor component then
uses those color keywords to syntax highlight the text being viewed.
Attached is an example of such a color setup. For the Object Pascal
syntax they map, for example, Keywords to KEYWORD1, Directives to
KEYWORD2 etc. This gives me a nice idea of how to tie the syntax
definition rules back into the editor component.
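
A rough sketch of that indirection, in Python for brevity: the syntax definition only knows a handful of generic token types, and the editor only knows how to colour those types. The token type names are the jEdit ones mentioned above; the colour values, the regexes and the function name are invented for illustration and are not jEdit's actual rules.

```python
import re

# Editor side: a fixed, small set of token types with colours (values made up).
THEME = {
    "KEYWORD1": "#000080",
    "KEYWORD2": "#008080",
    "OPERATOR": "#800000",
    "LITERAL1": "#008000",
}

# Definition side: a hypothetical Object Pascal definition. Every rule maps a
# regex to one of the generic token types, never directly to a colour.
PASCAL_RULES = [
    (re.compile(r"\b(?:begin|end|const|var|procedure)\b", re.IGNORECASE), "KEYWORD1"),
    (re.compile(r"\b(?:deprecated|cdecl|overload)\b", re.IGNORECASE), "KEYWORD2"),
    (re.compile(r"'(?:[^']|'')*'"), "LITERAL1"),
    (re.compile(r":=|[-+*/=<>]"), "OPERATOR"),
]


def colour_spans(line, rules=PASCAL_RULES, theme=THEME):
    """Return (start, end, colour) spans for one line of text."""
    spans = []
    pos = 0
    while pos < len(line):
        for pattern, token_type in rules:
            m = pattern.match(line, pos)   # first rule matching at pos wins
            if m:
                spans.append((m.start(), m.end(), theme[token_type]))
                pos = m.end()
                break
        else:
            pos += 1                       # no rule matched: plain text
    return spans
```

With this split, a new language only needs a new rule list, and a new colour scheme only needs a new THEME table; neither requires recompiling the editor.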

As for nested syntax etc.: jEdit handles LaTeX very well, and if you
have ever tried to write a LaTeX parser, you will know that LaTeX can be
very hard to work with. Yet jEdit manages to do it damn well. Its
syntax rules are a lot more complex than for most other languages
(which was to be expected), but it did handle things like
\begin{verbatim} .... \end{verbatim} correctly and skipped highlighting
whatever syntax came between those tags.

Anyway, it seems like I am starting to collect some nice ideas and
getting more source code to look at to draw ideas from. Still no idea
how they execute the regex statements against the text, but hopefully
looking at the source code from jEdit, gEdit etc will reveal some
clues.
--
Regards,
- Graeme -

Marco van de Voort
2010-05-26 09:13:11 UTC
[message body not shown in the archive]

--
Florian Klaempfl
2010-05-26 09:30:11 UTC
Post by Marco van de Voort
But are they complete/correct ? :-) I know that many editors aren't. They
support a basic easy subset and that is it. Stuff like using directive
names as variables where allowed, support for & to escape keywords etc.
A nice test is also
const
eof = ^Z;

--
Florian Klaempfl
2010-05-26 09:37:07 UTC
Post by Florian Klaempfl
Post by Marco van de Voort
But are they complete/correct ? :-) I know that many editors aren't. They
support a basic easy subset and that is it. Stuff like using directive
names as variables where allowed, support for & to escape keywords etc.
A nice test is also
const
eof = ^Z;
.. which makes me wonder if
type
a = array[^A..^Z] of integer;+

is valid pascal as well (fpc doesn't like it currently).

--
Florian Klaempfl
2010-05-26 09:39:07 UTC
Post by Florian Klaempfl
Post by Florian Klaempfl
Post by Marco van de Voort
But are they complete/correct ? :-) I know that many editors aren't. They
support a basic easy subset and that is it. Stuff like using directive
names as variables where allowed, support for & to escape keywords etc.
A nice test is also
const
eof = ^Z;
.. which makes me wonder if
type
a = array[^A..^Z] of integer;+
is valid pascal as well (fpc doesn't like it currently).
Oops, without the + obviously.

type
a = array[^A..^Z] of integer;

--
Graeme Geldenhuys
2010-05-26 10:59:39 UTC
Post by Marco van de Voort
Just that
codeblock 1
{  xxx
{ yyy }
zzz }
codeblock 2
is coloured properly. And xxx, yyy and zzz can contain (commented) code too, of
course.
....and that will produce a nice compiler warning (and for good
reason, which is why neither I nor anyone on my team will ever use that
format). It's basic Pascal 101! :-)

frm_learnerlist.pas(653,3) Warning: Comment level 2 found

Here is the example I used to produce the above compiler warning:

{
procedure TLearnerListForm.PerformTabQuery;
const
eof = ^Z;
var
lData: TViewFilter;
begin
{ load data based on view }
lData := TViewFilter.Create(sgb.View);
try
lData.TabLetters := pcName1.ActivePage.Text;
sgb.Data := lData.Data;
finally
lData.Free;
end;
end;
}

I tested: jEdit, gEdit, MSEide, Lazarus IDE and mcedit. Only Lazarus
IDE syntax highlighted the above code as one single block of commented
code. But considering that this code gives a compiler warning, that
doesn't say much.

Changing the above code to the more correct commenting style when
nested comments apply; suddenly *all* editors passed with flying
colors, and the FPC compiler gave no warnings. :-)

(*
procedure TLearnerListForm.PerformTabQuery;
const
eof = ^Z;
var
lData: TViewFilter;
begin
{ load data based on view }
lData := TViewFilter.Create(sgb.View);
try
lData.TabLetters := pcName1.ActivePage.Text;
sgb.Data := lData.Data;
finally
lData.Free;
end;
end;
*)

Removing the outer comment block, again all editors correctly applied
syntax highlighting. The one exception being gEdit which highlighted
the identifier eof incorrectly, but I am pretty sure a minor tweak to
the pascal.lang file will fix that.

NOTE:
I never said all editors using regular expressions are 100% - then
neither is Lazarus IDE's syntax highlighting. Only thing is, normally
the regex way is easier to fix without the need for improving the
parser, highlighting component and recompiling the whole IDE.

Problems in Lazarus IDE:
* deprecated modifier is incorrectly highlighted, no matter where you use it.
* A method named 'write(...)' will be incorrectly highlighted.
  Lazarus thinks identifier write is the same as when it is used
  for a property setter method.
      property Name: string read FName write FName;
  vs
      procedure Write(...);

... I remember seeing a few more in Lazarus, but can't remember them now.
--
Regards,
- Graeme -

--
Martin Schreiber
2010-05-26 12:13:19 UTC
Post by Graeme Geldenhuys
I tested: jEdit, gEdit, MSEide, Lazarus IDE and mcedit. Only Lazarus
IDE syntax highlighted the above code as one single block of commented
code.
MSEide has another definition file (pascal2.sdef) which handles nested
comments. The default is pascal.sdef for Delphi compatibility.

Martin

--
Mattias Gärtner
2010-05-26 12:27:13 UTC
Post by Martin Schreiber
Post by Graeme Geldenhuys
I tested: jEdit, gEdit, MSEide, Lazarus IDE and mcedit. Only Lazarus
IDE syntax highlighted the above code as one single block of commented
code.
MSEide has another definition file (pascal2.sdef) which handles nested
comments. The default is pascal.sdef for Delphi compatibility.
Why not choose the right highlighter automatically by parsing the mode
directive and the compiler options?
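
As a very rough sketch of what that could look like (Python for brevity): scan the source for a {$mode ...} directive and pick the definition file from it. The file names pascal.sdef and pascal2.sdef are the MSEide ones mentioned above; the function name and the default_mode parameter are hypothetical, and the sketch assumes that nested comments are off in Delphi and TP mode and on in the other modes. It ignores directives hidden inside comments or strings, and the compiler options would have to be supplied by the caller as the default mode.

```python
import re

# Matches e.g. {$mode objfpc} or {$MODE Delphi}
MODE_DIRECTIVE = re.compile(r"\{\$mode\s+(\w+)\s*\}", re.IGNORECASE)

# Assumption: these modes do not allow nested comments, all others do.
NON_NESTING_MODES = {"delphi", "tp"}


def pick_definition(source, default_mode="fpc"):
    """Choose a syntax definition file from the first {$mode} directive.

    default_mode stands in for whatever the project's compiler options say
    when the source itself has no directive.
    """
    m = MODE_DIRECTIVE.search(source)
    mode = m.group(1).lower() if m else default_mode
    if mode in NON_NESTING_MODES:
        return "pascal.sdef"     # Delphi-compatible: comments do not nest
    return "pascal2.sdef"        # handles nested comments
```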

Mattias

--
Martin Schreiber
2010-05-26 12:37:57 UTC
Post by Mattias Gärtner
Post by Martin Schreiber
MSEide has another definition file (pascal2.sdef) which handles nested
comments. The default is pascal.sdef for Delphi compatibility.
Why not choose the right highlighter automatically by parsing the mode
directive and the compiler options?
Too complicated for me to implement, I am a simple man. ;-)

Martin

--
Marco van de Voort
2010-05-26 13:10:25 UTC
Post by Graeme Geldenhuys
....and that will produce a nice compiler warning (and for good
reason, which is why neither I nor anyone on my team will ever use that
format). It's basic Pascal 101! :-)
Not really, since TP didn't even support it.
Post by Graeme Geldenhuys
frm_learnerlist.pas(653,3) Warning: Comment level 2 found
Warnings are not illegal. They are just warnings.
Post by Graeme Geldenhuys
I tested: jEdit, gEdit, MSEide, Lazarus IDE and mcedit. Only Lazarus
IDE syntax highlighted the above code as one single block of commented
code. But considering that this code gives a compiler warning, that
doesn't say much.
If I factor in Martin S.'s comment that MSEide does it properly in Delphi mode
(and that is a feature, not a bug, see above), it is clear: the real Pascal IDEs
do it fine, the ones with the large number of highlighters don't.

I think that says enough, and fits fine with the remark I made about average
quality of such highlighters.

I would rather have a decent one for the primary format I make advanced use of
than 100 for the ones I don't use, or use only occasionally.
Post by Graeme Geldenhuys
Changing the above code to the more correct commenting style when
nested comments apply; suddenly *all* editors passed with flying
colors, and the FPC compiler gave no warnings. :-)
For that rearranging I would like proper highlighting. This is exactly one
of the cases where I WOULD want highlighting to work fine, to keep an overview.
Post by Graeme Geldenhuys
I never said all editors using regular expressions are 100% - then
neither is Lazarus IDE's syntax highlighting. Only thing is, normally
the regex way is easier to fix without the need for improving the
parser, highlighting component and recompiling the whole IDE.
I think you are taking the difficult road. Of course, with enough fixes and
extra state tricks you can get it somewhat working. And if a language needs
more you add yet another hacky workaround.

But for what exactly? To recycle the nice readable format of regex ?!?!?!?
Post by Graeme Geldenhuys
* deprecated modifier is incorrectly highlighted, no matter where you use it.
* A method named 'write(...)' will be incorrectly highlighted.
Lazarus thinks identifier write is
the same as when it is used for a property setter method.
property Name: string read FName write FName;
vs
procedure Write(...);
... I remember seeing a few more in Lazarus, but can't remember them now.
I was not talking about missing features. I was talking about constructs
where regex will have more problems than other principles, namely nested
constructs.

That's because regex alone can't even parse (1+(1))=2 for correctness.

--
Graeme Geldenhuys
2010-05-26 13:47:02 UTC
Post by Marco van de Voort
Not really, since TP didn't even support it.
I honestly can't say, except that I have known about the (* and { comment
styles for a very long time, and somewhere in that time I was taught never
to nest two of the same comment style, but always to alternate them.
Post by Marco van de Voort
Warnings are not illegal. They are just warnings.
Well if it is such accepted practice, then why is it a Warning in the
first place?
Post by Marco van de Voort
(and that is a feature, not a bug, see above), it is clear: the real Pascal IDEs
do it fine, the ones with the large number of highlighters don't.
And it could be that nobody (or very few developers) uses a
stand-alone editor to edit Pascal or Object Pascal code, because they were
introduced to IDEs from the start. 99% of the time it's a rather
simple fix to get the other stand-alone editors to highlight like
Lazarus IDE. Many times it's just simple things that FPC or Delphi
introduced over the years that were not in the original Pascal (think
TP here) days.
Post by Marco van de Voort
I was not talking about missing features.
It's not missing features in Lazarus IDE, it's simple bugs in the
highlighter code.
Post by Marco van de Voort
where regex will have more problems than other principles, namely nested
constructs.
Nested if statements, nested procedures, nested comments (the correct
style) all work fine with most regex highlighters, so I really don't
see your issue. But enough said, we could go on forever like this,
but I have better things to do right now.

BTW:
I tried all the examples listed on this page (which FPC doesn't
support in most cases), and jEdit formatted them without problems,
perfectly, just like the website shows. Yes, nested types, nested
classes etc. Lazarus and MSEide did too, by the way [just to be fair].
:-)

--
Regards,
- Graeme -

--
Marco van de Voort
2010-05-26 13:58:35 UTC
Post by Graeme Geldenhuys
Post by Marco van de Voort
where regex will have more problems than other principles, namely nested
constructs.
Nested if statements, nested procedures, nested comments (the correct
style) all work fine with most regex highlighters, so I really don't
see your issue. But enough said, we could go on forever like this,
but I have better things to do right now.
I tried all the examples listed on this page (which FPC doesn't
support in most cases), and jEdit formatted them without problems,
perfectly, just like the website shows. Yes, nested types, nested
classes etc. Lazarus and MSEide did too, by the way [just to be fair].
:-)
Note that the nesting is about matching nested _structures_ with highlighting,
not about highlighting keywords inside nested structures. That's why the primary
example was comments, not if..then: for comments it matters how the
result is highlighted.
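
To make the distinction concrete, here is a small sketch (Python for brevity) of finding where a { } comment ends. A single non-greedy regex such as \{.*?\} stops at the first closing brace; tracking a nesting depth next to the regex is one simple way to find the matching one. The function name and the nested flag are invented for this example, and the sketch ignores (* *) comments and braces inside strings.

```python
import re

BRACE = re.compile(r"[{}]")


def comment_extent(text, start, nested=True):
    """Return the index just past the comment opening at text[start] == '{'.

    With nested=False an inner '{' is ignored, so the first '}' closes the
    comment (Delphi/TP style). Returns len(text) if it is never closed.
    """
    depth = 0
    for m in BRACE.finditer(text, start):
        if m.group() == "{":
            if nested or depth == 0:
                depth += 1
        else:
            depth -= 1
            if depth == 0:
                return m.end()
    return len(text)


# The codeblock example quoted earlier in the thread.
sample = "codeblock 1\n{  xxx\n{ yyy }\nzzz }\ncodeblock 2"
start = sample.index("{")
nested_end = comment_extent(sample, start)               # comment ends after "zzz }"
flat_end = comment_extent(sample, start, nested=False)   # comment ends after "{ yyy }"
```

In the nested case everything up to and including "zzz }" is comment and codeblock 2 is code again; in the non-nested case "zzz }" is already treated as code, which is the difference that shows up in the highlighting.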

--
Marco van de Voort
2010-05-27 09:17:35 UTC
Post by Graeme Geldenhuys
I tried all the examples listed on this page (which FPC doesn't
support in most cases), and jEdit formatted them without problems,
perfectly, just like the website shows. Yes, nested types, nested
classes etc. Lazarus and MSEide did too, by the way [just to be fair].
:-)
That's not so surprising, since that link only contains relatively new
syntax that the compiler is only starting to understand in recent times (see
e.g. the many bugs for fcl-passrc about these kinds of topics).

But the comment example is D2 or so (if not D1, but I have no experience
with that one). So if jEdit doesn't support it, it is more likely due to a
flawed concept.

--
Juha Manninen
2010-05-28 15:58:59 UTC
Post by Graeme Geldenhuys
I installed jEdit yesterday. It supports a mammoth 177 different
syntax highlighter styles for all types of source code, text files
like xml/html/css, config files etc. They also use a combination of
regex and various code rules.
Hi,
as noted, regex can't parse any recursive, nested structures. Much of the
logic is hard-coded in the editor.
It sounds like you are re-inventing the wheel with another regex highlighter.
You could as well stitch an existing editor and highlighter to your IDE.

Enter Perl6 regex...
Its syntax has grammar and rules and can parse anything!

be FULLY configurable, and also be new and innovative.
It would also work as a "codetools replacement" for browsing and syntax
checking.
Highlighting different nested blocks of XML would be easy, too.

The problem is that Perl6 is still under construction. I don't know how well
it currently supports an API for C (or Pascal).
One option is to write such a Perl6 regex component in Pascal. There are many
implementations of the current syntax which could work as a base for a "Rule".
Then a "Grammar" would contain many Rules.

Just my thoughts...

Regards,
Juha

--

João Marcelo Vaz
2010-05-25 13:18:50 UTC
Hi Graeme,

Have you seen Colorer-take5 on the following URL
http://colorer.sourceforge.net/ ?

It's cited in the tutorial you pointed at

João Marcelo
--
_______________________________________________
Lazarus mailing list
http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
Hans-Peter Diettrich
2010-05-25 14:02:53 UTC
Post by Graeme Geldenhuys
Does anybody know of a website or article I can read about how to
integrate regular expressions with an editor to end up with a editor
that can handle syntax highlighting. It doesn't need to be specific to
Object Pascal (that would be too easy and ideal). ;-)
I wonder how you want to use multiple regexp's to detect different
syntax elements.

DoDi

--
Marcos Douglas
2010-05-25 18:01:01 UTC
On Tue, May 25, 2010 at 9:34 AM, Graeme Geldenhuys
Post by Graeme Geldenhuys
Hi,
Does anybody know of a website or article I can read about how to
integrate regular expressions with an editor to end up with a editor
that can handle syntax highlighting. It doesn't need to be specific to
Object Pascal (that would be too easy and ideal). ;-)
I do not know if TextAdept uses regexp, but I know it uses the Lua
language to customize the whole editor.

Marcos Douglas

--
Graeme Geldenhuys
2010-05-26 08:39:06 UTC
Post by Marcos Douglas