DocWebSed 

Introduction

WebSed is a tbwiki processor for extracting information from other pages or web sites, and using that as part of tbwiki output.

Processor Syntax

The WebSed processor uses the contents of the processor block to find external data and to assign names to individual pieces of that data. The processor block also contains the text into which this data is placed.

The processor block starts with the processor name and has the following internal parts:

  • a line indicating the source
  • a match specification (one or more regular expression statements)
  • a delimiter (always "---")
  • a format specification, which is the output text, containing data references

Here's an example:

{{{#!WebSed
http://testlab.celinuxforum.org:8000/files/UserTimTestWebsed.html
phone=^phone\s*=\s*(.*)
---
Tim's phone number is %%(phone)s
}}}

Assuming that the file UserTimTestWebsed.html had the contents:

   foo bar
   phone=123-4567
   baz baf

This example would print out the text:

   Tim's phone number is 123-4567

In this example, the HTML file is read from the indicated web site, and its lines are examined for text of the form "phone=<something>".

That "<something>" is assigned to the websed variable 'phone'.

This variable can be placed in the block output for this processor block, as indicated.
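
The flow of this example can be sketched in Python. This is only an illustration, not the tbwiki implementation: the function name `run_websed_block` is hypothetical, the source text is passed in directly instead of being fetched, only simple `<var>=<regex>` lines are handled, and it is assumed that the first matching line supplies the value.

```python
import re

def run_websed_block(block_body, source_text):
    """Hypothetical sketch: apply a WebSed-style block body to source text."""
    lines = block_body.strip().splitlines()
    delim = lines.index("---")             # the delimiter is always "---"
    source_ref = lines[0]                  # first line names the source (not fetched here)
    match_lines = lines[1:delim]
    format_spec = "\n".join(lines[delim + 1:])

    values = {}
    for line in match_lines:
        if not line.strip() or line.startswith("#"):
            continue                       # skip blank lines and comments
        var, regex = line.split("=", 1)    # split on the first "=" only
        for src_line in source_text.splitlines():
            m = re.search(regex, src_line)
            if m:
                values[var] = m.group(1)
                break                      # assumption: first match wins

    # Assumption: "%%(name)s" is reduced to "%(name)s" before the final substitution.
    return format_spec.replace("%%", "%") % values

block = r"""
http://testlab.celinuxforum.org:8000/files/UserTimTestWebsed.html
phone=^phone\s*=\s*(.*)
---
Tim's phone number is %%(phone)s
"""
source = "foo bar\nphone=123-4567\nbaz baf\n"
result = run_websed_block(block, source)
print(result)  # Tim's phone number is 123-4567
```

Splitting each match line on the first "=" only is what lets the regular expression itself contain an "=" character, as the phone pattern does.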

WebSed configuration specification

{{{#!WebSed
<source>
<match_spec>
---
<format_spec>
}}}

source line

The "source" line can refer to a page in this wiki or an external web page. If the reference starts with an exclamation point, it refers to a page in this wiki, e.g. !FrontPage

If the name includes 'http:' or 'https:', then the page is assumed to be an external page, and it is opened with Python's URL library.

Otherwise, the reference is presumed to be an internal page, but it is still opened with Python's url library, using a path relative to this wiki.

The URL should be of the form: name:password@http://domain/dir/file?params
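
The three cases for the source line can be summarized with a small Python sketch. The helper name `classify_source`, the labels it returns, and the sample references are hypothetical and are not taken from the tbwiki source; the sketch only classifies the reference and does not open anything.

```python
def classify_source(ref):
    """Hypothetical helper: classify a WebSed source line by the rules described above."""
    ref = ref.strip()
    if ref.startswith("!"):
        # Leading exclamation point: a page in this wiki
        return ("wiki_page", ref[1:])
    if "http:" in ref or "https:" in ref:
        # Contains a scheme: treated as an external page
        return ("external", ref)
    # Anything else: presumed internal, opened relative to this wiki
    return ("internal_relative", ref)

wiki_ref = classify_source("!FrontPage")
external_ref = classify_source("http://example.com/status.html")
relative_ref = classify_source("files/data.txt")
print(wiki_ref, external_ref, relative_ref)
```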

match specification

A match specification has lines describing the method to parse the source data, and to assign values to named websed variables.

The specification can be one or more match expressions, each on its own line. The simplest specification is a single match expression consisting of one variable and one regular expression (with one group).

Possible lines:

  • <var>=<regex>
  • <var>_endpat=<regex>
  • <var>_precursors=<var list>
  • search_space_start=<regex>
  • search_space_end=<regex>
  • linear
  • =<regex> (anonymous sequence)

Any line starting with '#' is considered a comment and is ignored by the websed engine.

All regular expressions use Python "re" syntax. See the Python re library documentation for details.

Note that WebSed uses the "search" method when matching. Regular expression patterns may match anywhere on the line, not just at the beginning of the line.
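
The practical difference between "search" and "match" in Python's re module is shown below. The sample line and patterns are made up for illustration.

```python
import re

line = "contact phone=123-4567"
pattern = r"phone=(.*)"

# match() only succeeds if the pattern matches at the start of the string
print(re.match(pattern, line))             # None

# search() scans the whole line, which is the behavior described for WebSed
print(re.search(pattern, line).group(1))   # 123-4567

# To restrict a pattern to the start of the line, anchor it with "^"
anchored = r"^phone=(.*)"
print(re.search(anchored, line))           # None
```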

match expressions

A match expression gives a variable name and a regular expression to match, in order to provide a value for that variable.

Match expressions are of the form "<variable_name>=<regular expression>".

The regular expression should include a group (an expression inside parentheses). The text that matches this group is used as the value for the variable.

Each line in the source data is compared with the right-hand side of the match expression (the regular expression), and if a match is found, the sub-expression which matches the first group is saved as the value for the variable named in the left-hand side of the match expression.

Example:

user=name:\s*(.*)

This will match a line containing the string "name:" and assign everything after the colon, skipping any whitespace that follows it, to the websed variable "user".
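
A rough Python equivalent of evaluating this one match expression is sketched below. The function name `apply_match_expression` and the sample source lines are hypothetical, and returning the first matching line's value is an assumption.

```python
import re

def apply_match_expression(expr, source_lines):
    """Hypothetical sketch: evaluate one '<var>=<regex>' line against source lines."""
    var, regex = expr.split("=", 1)
    for line in source_lines:
        m = re.search(regex, line)
        if m:
            return var, m.group(1)   # text matched by the first group
    return var, None                 # no line matched

source_lines = ["id: 42", "name:   Tim Bird", "role: admin"]
user_result = apply_match_expression(r"user=name:\s*(.*)", source_lines)
print(user_result)  # ('user', 'Tim Bird')
```

The `\s*` sits outside the group, so the whitespace after the colon is consumed by the pattern but is not part of the captured value.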

end patterns

Most of the time, a match expression will provide a single value (from the group in the regular expression) for a match variable. However, you can also copy a set of lines from the source data, using a match expression and an associated "end pattern match expression".

To specify an end pattern, suffix the name of the variable with "_endpat" in the match expression.

The collecting of data will start when the match expression is matched, and collecting of data will stop when the _endpat match expression is matched.

For example:

   foo=^BEGIN()
   foo_endpat=^END()

This will collect all the lines between the lines in the source that start with "BEGIN" and "END". Note that the pattern expression must still have a group, even though the group is empty. The group is not currently used (but may be in the future to support multi-line blocks starting or ending in the middle of a line).
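
The start/stop behavior can be sketched in Python as follows. The function name `collect_block` and the sample source are hypothetical, and joining the collected lines with newlines is an assumption about how the value is stored.

```python
import re

def collect_block(start_regex, end_regex, source_lines):
    """Hypothetical sketch: gather the lines strictly between a start and an end match."""
    collected = []
    collecting = False
    for line in source_lines:
        if not collecting:
            if re.search(start_regex, line):
                collecting = True    # the start line itself is not kept
        elif re.search(end_regex, line):
            break                    # the end line itself is not kept
        else:
            collected.append(line)
    return "\n".join(collected)

source_lines = ["header", "BEGIN", "alpha", "beta", "END", "footer"]
foo = collect_block(r"^BEGIN()", r"^END()", source_lines)
print(foo)
```

With the sample input, `foo` holds the two lines "alpha" and "beta".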

precursors and the websed parsing state machine

The match specification allows for the description of a complex state machine, which allows the regular expression to match only in certain contexts. It is not, however, a full context sensitive parser.

For a given variable, it is possible to define a list of precursors, which are variables which must be matched before a pattern can match the first variable.

This means that a complex state machine can be specified for parsing the variable values from the source data.

For example, if I'm scanning for the word "foo" in the text, I can specify that I must see matches for "cat" and "bar" first.

  cat=black (cat)
  bar=(the) bar
  foo=foo is (.*)[.]
  foo_precursors=bar,cat

In the above example, the following text would trigger the matching of foo:

  I thought I saw a black cat
  in the bar.
  The value of foo is ufo.

Note that the value of 'foo' would be "ufo", but it would only be assigned if both cat and bar were seen first. They do not have to be matched in any particular order. If a specific order of matching is required for 'cat' and 'bar', then a precursor can be used on those as well (e.g. bar_precursors=cat).

Notice that 'cat' would have the value "cat", while "bar" would have the value "the".
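
The precursor rule can be sketched as a small Python scanner. The function name `scan_with_precursors` and the dictionary layout are hypothetical; the sketch also assumes that each variable keeps its first match.

```python
import re

def scan_with_precursors(patterns, precursors, source_lines):
    """Hypothetical sketch: a variable may match only after all its precursors have."""
    values = {}
    for line in source_lines:
        for var, regex in patterns.items():
            if var in values:
                continue                         # assumption: keep the first match
            needed = precursors.get(var, [])
            if not all(p in values for p in needed):
                continue                         # precursors not yet seen
            m = re.search(regex, line)
            if m:
                values[var] = m.group(1)
    return values

patterns = {
    "cat": r"black (cat)",
    "bar": r"(the) bar",
    "foo": r"foo is (.*)[.]",
}
precursors = {"foo": ["bar", "cat"]}

text = ["I thought I saw a black cat", "in the bar.", "The value of foo is ufo."]
in_order = scan_with_precursors(patterns, precursors, text)
print(in_order)   # {'cat': 'cat', 'bar': 'the', 'foo': 'ufo'}

# If the foo line comes before its precursors, foo is never assigned
early = ["The value of foo is ufo.", "I thought I saw a black cat", "in the bar."]
too_early = scan_with_precursors(patterns, precursors, early)
print(too_early)  # {'cat': 'cat', 'bar': 'the'}
```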

If the word 'linear' appears on a line by itself, it indicates that the search expressions preceding it must be matched in the order given in the match spec.

For example, if I listed three search expressions, foo, bar, and baz, then the 'baz' expression would not match until both the foo and bar expressions had matched (in that order).
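
One way to picture 'linear' ordering is a scanner that only ever tries the next unmatched expression. This is a guess at the behavior described above, not the tbwiki code: the function name `scan_linear`, the sample patterns, and the choice to test one expression per source line are all assumptions.

```python
import re

def scan_linear(ordered_patterns, source_lines):
    """Hypothetical sketch: only the next unmatched expression is tried on each line."""
    values = {}
    pending = list(ordered_patterns)
    for line in source_lines:
        if not pending:
            break                     # everything has matched
        var, regex = pending[0]
        m = re.search(regex, line)
        if m:
            values[var] = m.group(1)
            pending.pop(0)            # move on to the next expression
    return values

ordered = [("foo", r"foo=(\d+)"), ("bar", r"bar=(\d+)"), ("baz", r"baz=(\d+)")]
lines = ["baz=1", "foo=2", "bar=3", "baz=4"]
linear_values = scan_linear(ordered, lines)
print(linear_values)  # {'foo': '2', 'bar': '3', 'baz': '4'}
```

The first "baz=1" line is ignored because foo and bar have not matched yet; baz takes its value from the later line.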

Anonymous sequences

If a match line starts with '=' (that is, the variable name is missing in the match expression), then the match specifies an "anonymous sequence".

An anonymous sequence can be used to match several data items on a single line.

If used, a match spec may contain only one anonymous sequence, and the format string should use string format specifiers instead of named format variables.

For example, the following could be used to reword the sentence:

{{{#!WebSed
=the (\w*) bike was (\w*)
---
I believe that the %s bicycle might have been %s.
}}}
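
In Python terms, an anonymous sequence resembles filling positional "%s" specifiers from the groups of a single match. The source sentence below is invented for illustration, since the example above does not show one.

```python
import re

# Hypothetical source line
source_line = "I think the red bike was stolen"

m = re.search(r"the (\w*) bike was (\w*)", source_line)
format_spec = "I believe that the %s bicycle might have been %s."

# groups() returns the captured items in order, matching the %s positions
output = format_spec % m.groups()
print(output)  # I believe that the red bicycle might have been stolen.
```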

format specification

The format specification has the text to return for this block. Usually, this will include references to matched items parsed from the source.

<format_spec> can be empty, in which case a list of any variables found is printed, with their names (if any), separated by spaces.

NOTE that variables in the format specification have to use double-percents in their declarations, like so:

    %%(var_name)s
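
The snippet below shows how Python's %-style formatting treats a doubled percent sign. Whether tbwiki actually applies an earlier formatting pass to the block text is an assumption made here for illustration; the page only states that the double percent is required.

```python
template = "Tim's phone number is %%(phone)s"

# A first formatting pass with no arguments turns "%%" into a literal "%"
after_first_pass = template % ()
print(after_first_pass)   # Tim's phone number is %(phone)s

# A second pass can then substitute the named variable
final = after_first_pass % {"phone": "123-4567"}
print(final)              # Tim's phone number is 123-4567
```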

Examples

See TestWebSedProcessor for example uses of this processor.

TBWiki engine 1.9.1 by Tim Bird