| <table align="right"><tr><td><div class="toc"> | | |
| Contents: | | |
| <ul> | | |
| <li><a href="#Introduction">Introduction</a></li> | | |
| <li><a href="#Processor_Syntax">Processor Syntax</a></li> | | |
| <li><a href="#WebSed_configuration_specification">WebSed configuration specification</a></li> | | |
| <ul> | | |
| <li><a href="#source_line">source line</a></li> | | |
| <li><a href="#match_specification">match specification</a></li> | | |
| <ul> | | |
| <li><a href="#match_expressions">match expressions</a></li> | | |
| <li><a href="#end_patterns">end patterns</a></li> | | |
| <li><a href="#precursors_and_the_websed_parsing_state_machine">precursors and the websed parsing state machine</a></li> | | |
| <li><a href="#Anonymous_sequences">Anonymous sequences</a></li> | | |
| </ul> | | |
| <li><a href="#format_specification">format specification</a></li> | | |
| </ul> | | |
| <li><a href="#Examples">Examples</a></li> | | |
| </ul> | | |
| </div></td></tr></table> | | |
| <p> | | |
| <h1><a name="Introduction">Introduction</a> | | |
| </h1> | | |
| WebSed is a tbwiki processor for extracting information from other pages or | | |
| web sites, and using that as part of tbwiki output. | | |
| <p> | | |
| <h1><a name="Processor_Syntax">Processor Syntax</a> | | |
| </h1> | | |
| The WebSed processor uses the contents of the processor block to find | | |
| external data, and to assign names to individual pieces of data. The | | |
| processor block also contains the output text into which this data is placed. | | |
| <p> | | |
| The processor block starts with the processor name and has the following internal parts: | | |
| <ul><li>a line indicating the source | | |
| <li>a match specification (one or more regular expression statements) | | |
| <li>a delimiter (always "---") | | |
| <li>a format specification, which is the output text, containing data references | | |
| </ul> | | |
| <p> | | |
| Here's an example: | | |
| <pre> | | |
| {{{#!WebSed | | |
| http ://testlab.celinuxforum.org:8000/files/UserTimTestWebsed.html | | |
| phone=^phone\s*=\s*(.*) | | |
| --- | | |
| Tim's phone number is %%(phone)s | | |
| }}} | | |
| </pre> | | |
| <p> | | |
| Assuming that the file UserTimTestWebsed.html had the contents: | | |
| <pre> | | |
| foo bar | | |
| phone=123-4567 | | |
| baz baf | | |
| </pre> | | |
| This example would print out the text: | | |
| <p> | | |
| <pre> | | |
| Tim's phone number is 123-4567 | | |
| </pre> | | |
| <p> | | |
| In this example, the HTML file is read from the indicated web site, and | | |
| its lines are examined, looking for "phone=<something>". | | |
| <p> | | |
| That "<something>" is assigned to the websed variable 'phone'. | | |
| <p> | | |
| This variable can be placed in the block output for this processor block, | | |
| as indicated. | | |
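<p>
The steps in this example can be approximated in plain Python. This is only a sketch of the idea, not the WebSed implementation: the helper name <code>extract</code> is hypothetical, the source lines are hard-coded instead of fetched, and it assumes the first matching line supplies the value. Plain Python formatting uses a single percent sign here, unlike the double percent required inside a processor block.

```python
import re


def extract(lines, pattern):
    """Return group 1 of the first line the pattern matches, else None.

    Hypothetical helper; "first match wins" is an assumption.
    """
    regex = re.compile(pattern)
    for line in lines:
        m = regex.search(line)
        if m:
            return m.group(1)
    return None


# Contents of the example source file, one entry per line.
source_lines = ["foo bar", "phone=123-4567", "baz baf"]

# One named variable, filled from the match expression.
values = {"phone": extract(source_lines, r"^phone\s*=\s*(.*)")}

# Substitute the variable into the output text.
output = "Tim's phone number is %(phone)s" % values
print(output)
```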
| <p> | | |
| <h1><a name="WebSed_configuration_specification">WebSed configuration specification</a> | | |
| </h1> | | |
| <pre> | | |
| {{{#!WebSed | | |
| <source> | | |
| <match_spec> | | |
| --- | | |
| <format_spec> | | |
| }}} | | |
| </pre> | | |
| <p> | | |
| <h2><a name="source_line">source line</a> | | |
| </h2> | | |
| The "source" line can refer to a page in this wiki or an external web page. | | |
| If the reference starts with an exclamation point, it refers to a page | | |
| in this wiki (e.g. !FrontPage). | | |
| <p> | | |
| If the name includes 'http:' or 'https:', then the page is assumed to | | |
| be an external page, and it is opened with Python's url library. | | |
| <p> | | |
| Otherwise, the reference is presumed to be an internal page, but it is | | |
| still opened with Python's url library, using a path relative to this | | |
| wiki. | | |
| <p> | | |
| The URL should be of the form: name:password@<a href="http://domain/dir/file?params">http://domain/dir/file?params</a> | | |
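<p>
The three cases above can be summarized as a small decision function. This is a sketch under assumptions: the function name <code>classify_source</code> and the labels it returns are hypothetical and are not taken from the WebSed code. It only classifies the reference and does not open anything.

```python
def classify_source(ref):
    """Classify a source line following the rules described above.

    Returns a (kind, reference) tuple; the kind labels are illustrative.
    """
    if ref.startswith("!"):
        # Leading exclamation point: a page in this wiki.
        return ("wiki", ref[1:])
    if "http:" in ref or "https:" in ref:
        # Contains a URL scheme: treated as an external page.
        return ("external", ref)
    # Anything else: presumed internal, opened relative to this wiki.
    return ("internal", ref)


print(classify_source("!FrontPage"))
print(classify_source("http://domain/dir/file?params"))
print(classify_source("SomePage"))
```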
| <p> | | |
| <h2><a name="match_specification">match specification</a> | | |
| </h2> | | |
| A match specification has lines describing the method to parse the source | | |
| data, and to assign values to named websed variables. | | |
| <p> | | |
| The specification can be one or more match expressions, each on its own line. | | |
| The simplest specification is a single match expression consisting | | |
| of one variable and one regular expression (with one group). | | |
| <p> | | |
| Possible lines: | | |
| <ul><li><var>=<regex> | | |
| <li><var>_endpat=<regex> | | |
| <li><var>_precursors=<var list> | | |
| <li>search_space_start=<regex> | | |
| <li>search_space_end=<regex> | | |
| <li>linear | | |
| <li>=<regex> (anonymous sequence) | | |
| </ul> | | |
| <p> | | |
| Any line starting with '#' is considered a comment and is ignored by the | | |
| websed engine. | | |
| <p> | | |
| All regular expressions use python "re" syntax. See the <a href="http://docs.python.org/library/re.html">Python re library documentation</a> for details. | | |
| <p> | | |
| Note that WebSed uses the "search" method when matching. Regular expression patterns may match anywhere on the line, not just at the beginning of the line. | | |
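<p>
The difference between "search" and "match" in Python's re module can be seen in a short example. The sample line is made up for illustration. To restrict a pattern to the beginning of the line under "search", anchor it with '^'.

```python
import re

line = "user name: tim"          # made-up sample line
pattern = r"name:\s*(.*)"

# search() finds the pattern anywhere on the line.
searched = re.search(pattern, line)

# match() only succeeds if the pattern matches at the start of the line.
matched = re.match(pattern, line)

# With search(), a leading '^' gives the same start-of-line restriction.
anchored = re.search("^" + pattern, line)

print(searched.group(1) if searched else None)
print(matched)
print(anchored)
```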
| <p> | | |
| <h3><a name="match_expressions">match expressions</a> | | |
| </h3> | | |
| A match expression indicates a variable name and a regular expression to | | |
| match, in order to provide a value for that variable. | | |
| <p> | | |
| Match expressions are of the form "<variable_name>=<regular expression>" | | |
| <p> | | |
| The regular expression should include a group (an expression inside parentheses). The text that matches this group is used as the value for the variable. | | |
| <p> | | |
| Each line in the source data is compared with the right-hand side of the | | |
| match expression (the regular expression), and if a match is found, the | | |
| sub-expression which matches the first group is saved as the value for the variable named on the left-hand side of the match expression. | | |
| <p> | | |
| Example: | | |
| <pre> | | |
| user=name:\s*(.*) | | |
| </pre> | | |
| This will match a line with the string "name:" in it, and assign everything | | |
| past the whitespace (if any) following the colon to the websed variable "user". | | |
| <p> | | |
| <h3><a name="end_patterns">end patterns</a> | | |
| </h3> | | |
| Most of the time, a match expression will provide a single value (from | | |
| the group in the regular expression) for a match variable. However, you | | |
| can also copy a set of lines from the source data, using a match expression | | |
| and an associated "end pattern match expression". | | |
| <p> | | |
| To specify an end pattern, suffix the name of the variable with "_endpat" | | |
| in the match expression. | | |
| <p> | | |
| Data collection starts when the match expression is matched, and stops when the _endpat match expression | | |
| is matched. | | |
| <p> | | |
| For example: | | |
| <p> | | |
| <pre> | | |
| foo=^BEGIN() | | |
| foo_endpat=^END() | | |
| </pre> | | |
| <p> | | |
| This will collect all the lines BETWEEN the lines in the source that start with | | |
| "BEGIN" and "END". Note that the pattern expression must still have a group, | | |
| even though the group is empty. The group is not currently used (but may be in | | |
| the future to support multi-line blocks starting or ending in the middle of a line). | | |
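<p>
The collection behaviour described above can be sketched in Python as follows. This is an illustration, not the WebSed code: the function name <code>collect_block</code> is hypothetical, and it assumes that the BEGIN and END lines themselves are excluded and that only the first block is collected.

```python
import re


def collect_block(lines, start_pattern, end_pattern):
    """Collect the lines between a start match and an end match.

    Hypothetical helper; the delimiter lines themselves are not kept.
    """
    start_re = re.compile(start_pattern)
    end_re = re.compile(end_pattern)
    collected = []
    collecting = False
    for line in lines:
        if not collecting:
            # Wait for the start pattern before collecting anything.
            if start_re.search(line):
                collecting = True
        elif end_re.search(line):
            # The end pattern stops collection.
            break
        else:
            collected.append(line)
    return collected


lines = ["header", "BEGIN", "alpha", "beta", "END", "footer"]
block = collect_block(lines, r"^BEGIN()", r"^END()")
print(block)
```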
| <p> | | |
| <h3><a name="precursors_and_the_websed_parsing_state_machine">precursors and the websed parsing state machine</a> | | |
| </h3> | | |
| The match specification allows for the description of a complex state machine, | | |
| which allows the regular expression to match only in certain contexts. | | |
| It is not, however, a full context sensitive parser. | | |
| <p> | | |
| For a given variable, it is possible to define a list of precursors, | | |
| which are variables which must be matched before a pattern | | |
| can match the first variable. | | |
| <p> | | |
| This means that a complex state machine can be specified for parsing | | |
| the variable values from the source data. | | |
| <p> | | |
| For example, if I'm scanning for the word "foo" in the text, then | | |
| I can specify that I must first see matches for "cat" and "bar". | | |
| <p> | | |
| <pre> | | |
| cat=black (cat) | | |
| bar=(the) bar | | |
| foo=foo is (.*)[.] | | |
| foo_precursors=bar,cat | | |
| </pre> | | |
| <p> | | |
| In the above example, the following text would trigger the matching of foo: | | |
| <pre> | | |
| I thought I saw a black cat | | |
| in the bar. | | |
| The value of foo is ufo. | | |
| </pre> | | |
| Note that the value of 'foo' would be "ufo", but it would only be assigned if | | |
| both cat and bar were seen. They do not have to be matched in any order. | | |
| If a specific order of matching is required for 'cat' and 'bar', then | | |
| a precursor can be used on those as well (e.g. bar_precursors=cat). | | |
| <p> | | |
| Notice that 'cat' would have the value "cat", while "bar" would have the value | | |
| "the". | | |
| <p> | | |
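The precursor rule can be sketched in Python as a gate that is checked before each pattern is tried. This is a simplified illustration, not the WebSed state machine: the function name <code>scan</code> is hypothetical, and it assumes each variable keeps its first match.

```python
import re


def scan(lines, patterns, precursors):
    """Match named patterns line by line, honouring precursor lists.

    Hypothetical helper; a variable keeps the first value it matches.
    """
    values = {}
    for line in lines:
        for name, pattern in patterns.items():
            if name in values:
                continue
            # Skip this pattern until all of its precursors have matched.
            if any(p not in values for p in precursors.get(name, [])):
                continue
            m = re.search(pattern, line)
            if m:
                values[name] = m.group(1)
    return values


patterns = {
    "cat": r"black (cat)",
    "bar": r"(the) bar",
    "foo": r"foo is (.*)[.]",
}
precursors = {"foo": ["bar", "cat"]}
text = [
    "I thought I saw a black cat",
    "in the bar.",
    "The value of foo is ufo.",
]
result = scan(text, patterns, precursors)
print(result)
```

If the first two lines are removed from the text, 'foo' is never assigned, because its precursors have not been seen.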
| If the word 'linear' appears on a line by itself, it indicates that the | | |
| search expressions preceding it must be matched in the order given | | |
| in the match spec. | | |
| <p> | | |
| For example, if I listed three search expressions, foo, bar, and baz, | | |
| followed by 'linear', then the 'baz' expression would not match until both the foo and bar | | |
| expressions had matched (in that order). | | |
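<p>
One way to picture 'linear' is as shorthand for giving each expression all of the expressions listed before it as precursors. Whether WebSed implements it this way is an assumption; the helper name <code>linear_precursors</code> is hypothetical.

```python
def linear_precursors(names):
    """Build a precursor table that forces matching in the listed order.

    Hypothetical helper: each name requires every earlier name first.
    """
    return {name: names[:i] for i, name in enumerate(names) if i > 0}


table = linear_precursors(["foo", "bar", "baz"])
print(table)
```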
| <p> | | |
| <h3><a name="Anonymous_sequences">Anonymous sequences</a> | | |
| </h3> | | |
| If a match line starts with '=' (that is, the variable name is missing in | | |
| the match expressions), then the match specifies an "anonymous sequence". | | |
| <p> | | |
| An anonymous sequence can be used to match several data items on a single line. | | |
| <p> | | |
| If used, a match spec may contain only one anonymous sequence, and the | | |
| format string should use string format specifiers instead of named format | | |
| variables. | | |
| <p> | | |
| For example, the following could be used to reword the sentence: | | |
| <p> | | |
| <pre> | | |
| {{{#!WebSed | | |
| =the (\w*) bike was (\w*) | | |
| --- | | |
| I believe that the %s bicycle might have been %s. | | |
| }}} | | |
| </pre> | | |
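<p>
In Python terms, an anonymous sequence behaves like positional string formatting with the groups of a single match. The sample sentence below is made up for illustration, and this sketch is not taken from the WebSed code.

```python
import re

line = "I heard the red bike was stolen"   # made-up source line

# Two unnamed groups captured from one line.
m = re.search(r"the (\w*) bike was (\w*)", line)

# groups() returns a tuple, which fills the %s specifiers in order.
output = "I believe that the %s bicycle might have been %s." % m.groups()
print(output)
```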
| <p> | | |
| <h2><a name="format_specification">format specification</a> | | |
| </h2> | | |
| The format specification has the text to return for this block. | | |
| Usually, this will include references to matched items parsed from the source. | | |
| <p> | | |
| <format_spec> can be empty, in which case a list of any variables found is | | |
| printed, with their names (if any), separated by spaces. | | |
| <p> | | |
| <p> | | |
| <b>NOTE that variables in the format specification have to use double-percents | | |
| in their declarations, like so: | | |
| <p> | | |
| <ul><div style="background-color:#ffffe0; padding:5px; border-style: solid solid solid solid; border-width: 1px 1px 1px 1px;"> | | |
| <pre>%%(var_name)s</pre></div></ul> | | |
| | | |
| <p> | | |
| <h1><a name="Examples">Examples</a> | | |
| </h1> | | |
| See <a style="color:red;" href="/tbwiki/TestWebSedProcessor">TestWebSedProcessor</a> for example uses of this processor. | | |
| <p> | | |
| </b> | | |