woob.browser.filters.html

class CSS(selector=None, default=_NO_DEFAULT)[source]

Bases: _Selector

Select HTML elements with a CSS selector.

For example:

obj_foo = CleanText(CSS('div.main'))

will take the text of every <div> element that has the CSS class “main”.

select(selector, item)[source]
class XPath(selector=None, default=_NO_DEFAULT)[source]

Bases: _Selector

Select HTML elements with an XPath selector.

exception XPathNotFound[source]

Bases: ItemNotFound

exception AttributeNotFound[source]

Bases: ItemNotFound

class Attr(selector, attr, default=_NO_DEFAULT)[source]

Bases: Filter

Get the text value of the attribute attr from the HTML element matched by selector.

For example:

obj_foo = Attr('//img[@id="thumbnail"]', 'src')

will take the “src” attribute of <img> whose “id” is “thumbnail”.

filter(el)[source]
Raises:

XPathNotFound if no element is found

AttributeNotFound if the element doesn’t have the requested attribute
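
The two failure modes listed above can be pictured with a standard-library stand-in. This is a sketch only, not woob's implementation: `get_attr`, `ElementMissing` and `AttributeMissing` are hypothetical names defined here, taking the first match is an assumption, and `xml.etree.ElementTree` supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET


class ElementMissing(LookupError):
    """Hypothetical stand-in for XPathNotFound."""


class AttributeMissing(LookupError):
    """Hypothetical stand-in for AttributeNotFound."""


def get_attr(root, path, attr):
    # Step 1: the selector must match at least one element.
    matches = root.findall(path)
    if not matches:
        raise ElementMissing(path)
    # Step 2: the first match must carry the requested attribute.
    element = matches[0]
    if attr not in element.attrib:
        raise AttributeMissing(attr)
    return element.attrib[attr]


doc = ET.fromstring(
    '<div><img id="thumbnail" src="/t.png"/><img id="bare"/></div>'
)
src = get_attr(doc, ".//img[@id='thumbnail']", "src")
```

Keeping the two errors distinct lets a caller tell "the page layout changed" (no element) apart from "the element is there but incomplete" (no attribute).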

class Link(selector=None, default=_NO_DEFAULT)[source]

Bases: Attr

Get the link URI of an element.

If the <a> tag is not found, an exception IndexError is raised.

class AbsoluteLink(selector=None, default=_NO_DEFAULT)[source]

Bases: Link

Get the absolute link URI of an element.
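
Making a link absolute amounts to resolving the href against the address of the page it came from. A minimal sketch of that resolution with the standard library's `urljoin`; the URLs are made up, and how woob obtains the page URL is not shown here.

```python
from urllib.parse import urljoin

# Hypothetical address of the page the links were found on.
page_url = "https://example.org/shop/list.html"

# A relative href is resolved against the page's directory.
relative = urljoin(page_url, "item?id=3")

# A root-relative href keeps only the scheme and host of the page.
rooted = urljoin(page_url, "/about")

# An href that is already absolute is returned unchanged.
already_absolute = urljoin(page_url, "https://other.example/x")
```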

class CleanHTML(selector=None, options=None, default=_NO_DEFAULT)[source]

Bases: Filter

Convert HTML to text (Markdown) using html2text.

See also

html2text site

filter(txt)[source]

This method has to be overridden by child classes.

classmethod clean(txt, options=None)[source]
class FormValue(selector=None, default=_NO_DEFAULT)[source]

Bases: Filter

Extract a Python value from a form element.

Checkboxes and radio buttons return booleans, while all other elements return text. For <select> tags, the user-visible text is returned.
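
The rule above can be illustrated with a small standard-library sketch. It is not woob's code: `form_value` is a hypothetical helper, the fallback to the first option when none is selected is an assumption, and the markup is written as well-formed XML so that `xml.etree.ElementTree` can parse it.

```python
import xml.etree.ElementTree as ET


def form_value(el):
    # Checkboxes and radio buttons: the value is whether they are checked.
    if el.tag == "input" and el.get("type") in ("checkbox", "radio"):
        return "checked" in el.attrib
    # <select>: return the visible text of the selected option
    # (assumption: fall back to the first option if none is selected).
    if el.tag == "select":
        selected = [o for o in el.iter("option") if "selected" in o.attrib]
        chosen = selected[0] if selected else el.find("option")
        return chosen.text if chosen is not None else ""
    # Everything else: the raw text of the "value" attribute.
    return el.get("value", "")


checkbox = ET.fromstring('<input type="checkbox" checked="checked"/>')
text_input = ET.fromstring('<input type="text" value="hello"/>')
select = ET.fromstring(
    '<select><option value="fr">France</option>'
    '<option value="de" selected="selected">Germany</option></select>'
)

checked = form_value(checkbox)
typed = form_value(text_input)
country = form_value(select)
```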

filter(el)[source]

This method has to be overridden by child classes.

class HasElement(selector, yesvalue=True, novalue=False)[source]

Bases: Filter

Returns yesvalue if the selector finds elements, novalue otherwise.

filter(value)[source]

This method has to be overridden by child classes.

class TableCell(*names, **kwargs)[source]

Bases: _Filter

Used with TableElement, gets the cell element from its name.

For example:

>>> from woob.capabilities.bank import Transaction
>>> from woob.browser.elements import TableElement, ItemElement
>>> from woob.browser.filters.standard import CleanText, Date
>>> from woob.browser.filters.html import TableCell
>>> class table(TableElement):
...     head_xpath = '//table/thead/th'
...     item_xpath = '//table/tbody/tr'
...     col_date =    u'Date'
...     col_label =   [u'Name', u'Label']
...     class item(ItemElement):
...         klass = Transaction
...         obj_date = Date(TableCell('date'))
...         obj_label = CleanText(TableCell('label'))
...

TableCell handles table tags whose “colspan” attribute changes the width of a column. For example, <td colspan="2"> occupies two columns instead of one, which shifts every following column. This shift must be taken into account when matching cell values with column heads.
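
The column shift can be made concrete with a little arithmetic on colspan values. The sketch below is an illustration under assumptions, not TableCell's actual code: `column_starts` and `cell_for_column` are hypothetical helpers, and each row is represented only by the list of its cells' colspans.

```python
def column_starts(colspans):
    """Return the column index at which each cell of a row starts."""
    starts = []
    position = 0
    for span in colspans:
        starts.append(position)
        position += span
    return starts


def cell_for_column(colspans, column):
    """Return the index of the cell covering `column`, or None."""
    position = 0
    for index, span in enumerate(colspans):
        if position <= column < position + span:
            return index
        position += span
    return None


# Head row: <th>Date</th><th colspan="2">Label</th><th>Amount</th>
head_spans = [1, 2, 1]
head_starts = column_starts(head_spans)

# Body row made of four plain cells. "Amount" is the third head cell,
# but it starts at column 3, so it maps to the fourth body cell.
body_spans = [1, 1, 1, 1]
amount_cell = cell_for_column(body_spans, head_starts[2])
```

Matching head and body cells by position alone would pick the third body cell here; matching by column index picks the fourth.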

exception ColumnNotFound[source]

Bases: FilterError

class ReplaceEntities(selector=None, symbols='', replace=[], children=True, newlines=True, transliterate=False, normalize='NFC', **kwargs)[source]

Bases: CleanText

Filter to replace HTML entities like “&eacute;” or “&#x42;” with their Unicode counterparts.
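
The kind of replacement described above can be tried with the standard library's `html.unescape`. This only illustrates entity decoding on plain strings; it is not a claim about how ReplaceEntities is implemented.

```python
from html import unescape

# Named entity: &eacute; becomes the single character "é".
named = unescape("caf&eacute;")

# Hexadecimal numeric entity: &#x42; is code point 0x42, the letter "B".
numeric = unescape("&#x42;")
```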

filter(data)[source]

This method has to be overridden by child classes.