`woob.browser.filters.html`

class CSS(selector=None, default=_NO_DEFAULT)

Bases: _Selector

Select HTML elements with a CSS selector

For example:

obj_foo = CleanText(CSS('div.main'))

will take the text of all <div> having CSS class “main”.

select(selector, item)

class XPath(selector=None, default=_NO_DEFAULT)

Bases: _Selector

Select HTML elements with an XPath selector

exception XPathNotFound: Bases: ItemNotFound

exception AttributeNotFound: Bases: ItemNotFound

class Attr(selector, attr, default=_NO_DEFAULT)

Bases: Filter

Get the text value of an HTML attribute.

Get value from attribute attr of HTML element matched by selector.

For example:

obj_foo = Attr('//img[@id="thumbnail"]', 'src')

will take the “src” attribute of <img> whose “id” is “thumbnail”.

filter(el)

Raises: XPathNotFound if no element is found
Raises: AttributeNotFound if the element doesn’t have the requested attribute
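The two failure modes above can be illustrated with a minimal standard-library sketch. It is not woob's implementation: `get_attr`, `ElementNotFound` and `MissingAttribute` are hypothetical stand-ins for `Attr.filter`, `XPathNotFound` and `AttributeNotFound`, and `xml.etree.ElementTree` is used in place of lxml.

```python
import xml.etree.ElementTree as ET


class ElementNotFound(Exception):
    """Hypothetical stand-in for XPathNotFound."""


class MissingAttribute(Exception):
    """Hypothetical stand-in for AttributeNotFound."""


def get_attr(root, path, attr):
    # Step 1: locate the element; fail if the selector matches nothing.
    el = root.find(path)
    if el is None:
        raise ElementNotFound(path)
    # Step 2: read the attribute; fail if the element does not carry it.
    if attr not in el.attrib:
        raise MissingAttribute(attr)
    return el.attrib[attr]


doc = ET.fromstring(
    '<html><body>'
    '<img id="thumbnail" src="/img/t.png"/>'
    '<img id="bare"/>'
    '</body></html>'
)

src = get_attr(doc, './/img[@id="thumbnail"]', 'src')
```

Keeping the two errors distinct lets a caller tell "the page layout changed" apart from "the element exists but lacks the attribute".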

class Link(selector=None, default=_NO_DEFAULT)

Bases: Attr

Get the link URI of an element.

If the <a> tag is not found, an IndexError exception is raised.

class AbsoluteLink(selector=None, default=_NO_DEFAULT)

Bases: Link

Get the absolute link URI of an element.
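What "absolute" means here can be sketched with the standard library, assuming the link is resolved against the URL of the page it was found on. The page URL below is hypothetical, and this shows the general resolution rule rather than woob's own code.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page that contains the links.
page_url = "https://example.org/accounts/list?page=2"

# A relative href is resolved against the directory of the page.
relative = urljoin(page_url, "detail/42")

# A root-relative href keeps only the scheme and host of the page.
rooted = urljoin(page_url, "/help")

# An href that is already absolute is returned unchanged.
absolute = urljoin(page_url, "https://other.example/x")
```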

class CleanHTML(selector=None, options=None, default=_NO_DEFAULT)

Bases: Filter

Convert HTML to text (Markdown) using html2text.

See also

html2text site

filter(txt): This method has to be overridden by child classes.

classmethod clean(txt, options=None)

class FormValue(selector=None, default=_NO_DEFAULT)

Bases: Filter

Extract a Python value from a form element.

Checkboxes and radio buttons return booleans, while all other elements return text. For <select> tags, the user-visible text of the option is returned.

filter(el): This method has to be overridden by child classes.
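The rules described for FormValue can be sketched as follows. This is an illustration with `xml.etree.ElementTree`, not woob's code: `form_value` is a hypothetical helper, and falling back to the first option when none is marked as selected is an assumption modelled on browser behaviour.

```python
import xml.etree.ElementTree as ET


def form_value(el):
    """Hypothetical helper mirroring the rules described for FormValue."""
    # Checkboxes and radio buttons: True when the "checked" attribute is present.
    if el.tag == "input" and el.get("type") in ("checkbox", "radio"):
        return "checked" in el.attrib
    # <select>: user-visible text of the selected option.
    if el.tag == "select":
        options = el.findall("option")
        selected = [o for o in options if "selected" in o.attrib]
        # Assumption: with no explicit selection, use the first option.
        chosen = selected[0] if selected else (options[0] if options else None)
        return (chosen.text or "").strip() if chosen is not None else ""
    # Everything else: the text held in the "value" attribute.
    return el.get("value", "")


form = ET.fromstring(
    '<form>'
    '<input type="checkbox" name="news" checked="checked"/>'
    '<input type="checkbox" name="ads"/>'
    '<input type="text" name="login" value="alice"/>'
    '<select name="lang">'
    '<option value="fr">French</option>'
    '<option value="en" selected="selected">English</option>'
    '</select>'
    '</form>'
)

# Map each field name to the Python value extracted from it.
values = {el.get("name"): form_value(el) for el in form}
```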

class HasElement(selector, yesvalue=True, novalue=False)

Bases: Filter

Returns yesvalue if the selector finds elements, novalue otherwise.

filter(value): This method has to be overridden by child classes.
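The yesvalue/novalue behaviour of HasElement amounts to a presence test. A minimal sketch, assuming a hypothetical `has_element` helper built on `xml.etree.ElementTree` instead of woob's selectors:

```python
import xml.etree.ElementTree as ET


def has_element(root, path, yesvalue=True, novalue=False):
    # findall returns a list; an empty list means the selector matched nothing.
    return yesvalue if root.findall(path) else novalue


doc = ET.fromstring('<div><p class="error">Bad password</p></div>')

# Default values: a plain boolean.
has_error = has_element(doc, './/p[@class="error"]')

# Custom values: map presence or absence to arbitrary objects.
status = has_element(doc, './/p[@class="warning"]', yesvalue="warn", novalue="ok")
```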

class TableCell(*names, **kwargs)

Bases: _Filter

Used with TableElement, gets the cell element from its name.

For example:

>>> from woob.capabilities.bank import Transaction
>>> from woob.browser.elements import TableElement, ItemElement
>>> from woob.browser.filters.standard import CleanText, Date
>>> from woob.browser.filters.html import TableCell
>>> class table(TableElement):
...     head_xpath = '//table/thead/th'
...     item_xpath = '//table/tbody/tr'
...     col_date =    u'Date'
...     col_label =   [u'Name', u'Label']
...     class item(ItemElement):
...         klass = Transaction
...         obj_date = Date(TableCell('date'))
...         obj_label = CleanText(TableCell('label'))
...

TableCell handles table cells whose “colspan” attribute changes the width of the column. For example, <td colspan="2"> occupies two columns instead of one, which shifts every following column; this shift must be taken into account when matching cell values to column heads.
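The column shift can be made concrete with a small sketch. It is a simplified model rather than woob's code: `cell_for_column` is a hypothetical helper, and each row is reduced to the list of its cells' colspan values.

```python
def cell_for_column(colspans, head_index):
    """Return the index of the cell that covers header column head_index.

    colspans lists the colspan of each cell of a row, in document order.
    """
    position = 0  # first header column covered by the current cell
    for cell_index, colspan in enumerate(colspans):
        if position <= head_index < position + colspan:
            return cell_index
        position += colspan
    raise IndexError(head_index)


# A row of three cells under four header columns; the second cell spans two.
row = [1, 2, 1]

# Which cell sits under each of the four header columns.
mapping = [cell_for_column(row, i) for i in range(4)]
```

Without the colspan bookkeeping, header column 3 would be looked up as the fourth cell, which does not exist in this row.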

exception ColumnNotFound: Bases: FilterError

class ReplaceEntities(selector=None, symbols='', replace=[], children=True, newlines=True, transliterate=False, normalize='NFC', **kwargs)

Bases: CleanText

Filter to replace HTML entities like “&eacute;” or “&#x42;” with their Unicode counterpart.

filter(data): This method has to be overridden by child classes.
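The kind of substitution involved can be shown with the standard library's `html.unescape`. This only illustrates entity decoding in general; whether ReplaceEntities relies on this function is an assumption not confirmed here.

```python
from html import unescape

# Named entity: &eacute; is U+00E9 (LATIN SMALL LETTER E WITH ACUTE).
decoded_named = unescape("caf&eacute;")

# Numeric (hexadecimal) entity: &#x42; is U+0042, the letter B.
decoded_numeric = unescape("&#x42;ank")
```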
