DocWebSed in 'raw' format

{{TableOfContents}}

= Introduction =
WebSed is a tbwiki processor for extracting information from other pages or
web sites, and using that as part of tbwiki output.

= Processor Syntax =
The WebSed processor uses the contents of the processor block to find
external data, and to assign names to individual pieces of data.  The
processor block also contains the output text into which this data is placed.

The processor block starts with the processor name and has the following internal parts:
 * a line indicating the source
 * a match specification (one or more regular expression statements)
 * a delimiter (always "---")
 * a format specification, which is the output text, containing data references

Here's an example:
{{{
{{{#!WebSed
http://testlab.celinuxforum.org:8000/files/UserTimTestWebsed.html
phone=^phone\s*=\s*(.*)
---
Tim's phone number is %%%%(phone)s
}}}
}}}

Assuming that the file UserTimTestWebsed.html had the contents:
{{{
   foo bar
   phone=123-4567
   baz baf
}}}
This example would print out the text:

{{{
   Tim's phone number is 123-4567
}}}

In this example, the html file is read from the indicated web site, and
the lines are examined, looking for "phone=<something>".

That "<something>" is assigned to the websed variable 'phone'.

This variable can be placed in the block output for this processor block,
as indicated.
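
The steps above can be approximated in a few lines of Python. This is only an illustrative sketch, not the actual WebSed implementation: the helper name extract_first is hypothetical, the source text is supplied as a string instead of being read from the web site, and the source lines are assumed not to be indented (the pattern is anchored with "^"). The variable reference is written with a single percent because the text is passed directly to Python's "%" operator.

```python
import re

def extract_first(pattern, text):
    """Return group 1 of the first line the pattern matches, or None."""
    regex = re.compile(pattern)
    for line in text.splitlines():
        found = regex.search(line)
        if found:
            return found.group(1)
    return None

# Stand-in for the contents of UserTimTestWebsed.html (unindented).
source = "foo bar\nphone=123-4567\nbaz baf\n"

# One match expression: variable name on the left, regex on the right.
values = {"phone": extract_first(r"^phone\s*=\s*(.*)", source)}

# Place the variable into the output text.
output = "Tim's phone number is %(phone)s" % values
print(output)
```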


= WebSed configuration specification =
{{{
{{{#!WebSed
<source>
<match_spec>
---
<format_spec>
}}}
}}}

== source line ==
The "source" line can refer to a page in this wiki or an external web page.
If the reference starts with an exclamation point, it refers to a page
in this wiki, e.g. !FrontPage

If the name includes 'http:' or 'https:', then the page is assumed to
be an external page, and it is opened with Python's url library.

Otherwise, the reference is presumed to be an internal page, but it is
still opened with Python's url library, using a path relative to this
wiki.
 
The URL should be of the form: name:password@http://domain/dir/file?params
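
The three cases for the source line can be sketched as a small decision function. This is a hedged illustration of the rules as described in this section, not WebSed's own code: the function name classify_source and the labels "wiki", "external" and "relative" are made up for the example, and no page is actually opened.

```python
def classify_source(ref):
    """Classify a source reference (hypothetical helper and labels)."""
    ref = ref.strip()
    if ref.startswith("!"):
        # Leading exclamation point: a page in this wiki.
        return ("wiki", ref[1:])
    if "http:" in ref or "https:" in ref:
        # Assumed to be an external page.
        return ("external", ref)
    # Anything else: presumed internal, path relative to this wiki.
    return ("relative", ref)

print(classify_source("!FrontPage"))
print(classify_source("https://example.org/data.html"))
print(classify_source("files/data.txt"))
```

Because the test is "includes" and not "starts with", a reference written with a leading name:password@ part still falls into the external case in this sketch.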


== match specification ==
A match specification has lines describing the method to parse the source
data, and to assign values to named websed variables.

The specification can be one or more match expressions, each on its own line.
The simplest specification is a single match expression consisting
of one variable and one regular expression (with one group).


Possible lines:
 * <var>=<regex>
 * <var>_endpat=<regex>
 * <var>_precursors=<var list>
 * search_space_start=<regex>
 * search_space_end=<regex>
 * linear
 * =<regex>    (anonymous sequence)

Any line starting with '#' is considered a comment and is ignored by the
websed engine.
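
As a rough illustration of how the possible lines differ, the following sketch sorts a match-spec line into one of the kinds listed above. The function name classify_spec_line and the kind labels are hypothetical, treating blank lines like comments is an assumption, and the real engine may parse these lines differently.

```python
def classify_spec_line(line):
    """Return (kind, variable, value) for one match-spec line (hypothetical)."""
    line = line.strip()
    if not line or line.startswith("#"):
        return ("ignored", None, None)       # comment (or blank: assumption)
    if line == "linear":
        return ("linear", None, None)
    # Split on the first '=' only, since the regex may contain '=' too.
    name, sep, value = line.partition("=")
    if not sep:
        return ("unknown", None, line)
    if name == "":
        return ("anonymous", None, value)
    if name in ("search_space_start", "search_space_end"):
        return (name, None, value)
    if name.endswith("_endpat"):
        return ("endpat", name[:-len("_endpat")], value)
    if name.endswith("_precursors"):
        return ("precursors", name[:-len("_precursors")], value.split(","))
    return ("match", name, value)

print(classify_spec_line(r"phone=^phone\s*=\s*(.*)"))
print(classify_spec_line("foo_precursors=bar,cat"))
```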

All regular expressions use python "re" syntax.  See the [[http://docs.python.org/library/re.html|Python re library documentation]] for details.

Note that WebSed uses the "search" method when matching. Regular expression patterns may match anywhere on the line, not just at the beginning of the line.
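
The difference between the "search" and "match" methods of Python's re module can be seen in a short example. The sample line is invented for illustration.

```python
import re

line = "contact name: Tim"
pattern = r"name:\s*(.*)"

# search() scans the whole line for the pattern.
by_search = re.search(pattern, line)

# match() only tries the pattern at the start of the line.
by_match = re.match(pattern, line)

# To restrict search() to the start of the line, anchor the pattern with ^.
anchored = re.search(r"^name:\s*(.*)", line)

print(by_search.group(1))   # the text captured by the group
print(by_match, anchored)   # neither finds the pattern at position 0
```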


=== match expressions ===
A match expression specifies a variable name and a regular expression to
match, in order to provide a value for that variable.

Match expressions are of the form "<variable_name>=<regular expression>"

The regular expression should include a group (an expression inside parentheses).  The text that matches this group is used as the value for the variable.

Each line in the source data is compared with the right-hand side of the
match expression (the regular expression), and if a match is found, the
sub-expression which matches the first group is saved as the value for the variable named on the left-hand side of the match expression.

Example:

user=name:\s*(.*)

Will match a line with the string "name:" in it, and assign everything
past the whitespace (if any) following the colon to the websed variable "user".



=== end patterns ===
Most of the time, a match expression will provide a single value (from 
the group in the regular expression) for a match variable.  However, you
can also copy a set of lines from the source data, using a match expression
and an associated "end pattern match expression".

To specify an end pattern, suffix the name of the variable with "_endpat"
in the match expression.

Data collection starts when the match expression is matched, and
stops when the _endpat match expression is matched.

For example:

{{{
   foo=^BEGIN()
   foo_endpat=^END()
}}}

Will collect all the lines BETWEEN the lines in the source that start with
"BEGIN" and "END".  Note that the pattern expression must still have a group,
even though the group is empty.  The group is not currently used (but may be in
the future to support multi-line blocks starting or ending in the middle of a line).
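
The start/end behaviour can be sketched as a simple two-state loop. This is an assumption-laden illustration, not the WebSed source: the function name collect_block is hypothetical, the lines that match the start and end patterns are left out of the result (following "BETWEEN" above), and collection stops at the first end match.

```python
import re

def collect_block(start_pattern, end_pattern, text):
    """Collect the lines between a start match and an end match."""
    start = re.compile(start_pattern)
    end = re.compile(end_pattern)
    collected = []
    collecting = False
    for line in text.splitlines():
        if not collecting:
            if start.search(line):
                collecting = True      # start line itself is not kept
        elif end.search(line):
            break                      # end line itself is not kept
        else:
            collected.append(line)
    return "\n".join(collected)

source = "header\nBEGIN\nline one\nline two\nEND\nfooter\n"
foo = collect_block(r"^BEGIN()", r"^END()", source)
print(foo)
```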




=== precursors and the websed parsing state machine ===
The match specification allows for the description of a complex state machine,
which allows the regular expression to match only in certain contexts.
It is not, however, a full context sensitive parser.

For a given variable, it is possible to define a list of precursors,
which are variables which must be matched before a pattern 
can match the first variable.

This means that a complex state machine can be specified for parsing
the variable values from the source data.

For example, if I'm scanning for the word "foo" in the text, then
I can specify that I must first see matches for "cat" and "bar".

{{{
  cat=black (cat)
  bar=(the) bar
  foo=foo is (.*)[.]
  foo_precursors=bar,cat
}}}

In the above example, the following text would trigger the matching of foo:
{{{
  I thought I saw a black cat
  in the bar.
  The value of foo is ufo.
}}}
Note that the value of 'foo' would be "ufo", but it would only be assigned if
both cat and bar were seen.  They do not have to be matched in any particular order.
If a specific order of matching is required for 'cat' and 'bar', then
a precursor can be used on those as well (e.g. bar_precursors=cat)

Notice that 'cat' would have the value "cat", while "bar" would have the value
"the".

If the word 'linear' appears on a line by itself, it indicates that the
search expressions preceding it must be matched in the order given
in the match spec.

For example, if I listed three search expressions, foo, bar, and baz,
followed by 'linear', then the 'baz' expression would not match until both
the foo and bar expressions had matched (in that order).
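
One way to picture 'linear' is as shorthand for giving each expression all of the earlier expressions as precursors. Whether WebSed implements it this way is an assumption, and the helper name linear_precursors is hypothetical.

```python
def linear_precursors(names):
    """Map each variable to the variables listed before it."""
    return {name: list(names[:i]) for i, name in enumerate(names)}

chain = linear_precursors(["foo", "bar", "baz"])
print(chain)
```

Under this reading, 'bar' waits for 'foo', and 'baz' waits for both 'foo' and 'bar', which gives the ordering described above.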


=== Anonymous sequences ===
If a match line starts with '=' (that is, the variable name is missing in
the match expression), then the match specifies an "anonymous sequence".

An anonymous sequence can be used to match several data items on a single line.

If used, a match spec may contain only one anonymous sequence, and the
format string should use string format specifiers instead of named format
variables.

For example, the following could be used to reword a sentence found in the source:

{{{
{{{#!WebSed
=the (\w*) bike was (\w*)
---
I believe that the %%s bicycle might have been %%s.
}}}
}}}
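
In plain Python, the same idea looks like the sketch below: the groups of a single match are fed, in order, to positional "%s" specifiers. The input sentence is invented for illustration, and the single percent is used because the string is passed directly to Python's "%" operator.

```python
import re

# Hypothetical line of source data.
line = "I heard the red bike was stolen yesterday"

found = re.search(r"the (\w*) bike was (\w*)", line)

# groups() returns the captured items as a tuple, in pattern order.
items = found.groups()

output = "I believe that the %s bicycle might have been %s." % items
print(output)
```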




== format specification ==
The format specification has the text to return for this block.
Usually, this will include references to matched items parsed from the source.

<format_spec> can be empty, in which case a list of any variables found is
printed, with their names (if any), separated by spaces.

'''NOTE:''' variables in the format specification have to use double percents
in their declarations, like so:

{{{#!YellowBox
%%%%(var_name)s
}}}
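
In Python's "%" string formatting, a doubled percent is the escape for a literal percent sign, so a doubled reference survives one formatting pass and is substituted on the next. The sketch below shows that mechanism only. Whether WebSed or the wiki engine applies two passes in this way is an assumption, not something this page confirms.

```python
values = {"phone": "123-4567"}

# A reference written with a doubled percent.
doubled = "Tim's phone number is %%(phone)s"

# First pass (assumed): formatting with no arguments turns "%%" into "%".
single = doubled % ()

# Second pass: the named reference is filled in from the dictionary.
output = single % values

print(single)
print(output)
```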




= Examples =
See [[TestWebSedProcessor]] for example uses of this processor.


TBWiki engine 1.9.1 by Tim Bird