weboob.browser.filters.standard

exception weboob.browser.filters.standard.FilterError

Bases: weboob.exceptions.ParseError

exception weboob.browser.filters.standard.ColumnNotFound

Bases: weboob.browser.filters.base.FilterError

exception weboob.browser.filters.standard.RegexpError

Bases: weboob.browser.filters.base.FilterError

exception weboob.browser.filters.standard.FormatError

Bases: weboob.browser.filters.base.FilterError

class weboob.browser.filters.standard.Filter(selector=None, default=NO_DEFAULT)

Bases: weboob.browser.filters.base._Filter

Class used to filter on an HTML element given as call parameter to return matching elements.

Filters can be chained, so the parameter supplied to the constructor can be either an xpath selector string or another filter that is called first.

>>> from lxml.html import etree
>>> f = CleanDecimal(CleanText('//p'), replace_dots=True)
>>> f(etree.fromstring('<html><body><p>blah: <span>229,90</span></p></body></html>'))
Decimal('229.90')
Parameters: default – default value in case the filter fails to find or parse the requested value
filter(value)

This method has to be overridden by children classes.

select(selector, item)
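
The chaining rule can be pictured with a small stand-alone model. This is only a sketch under stated assumptions, not weboob's implementation: MiniFilter, Strip and ToInt are hypothetical names, and a dict key lookup stands in for the xpath selection on an HTML element.

```python
class MiniFilter:
    """Hypothetical stand-in for a chainable filter (not weboob code)."""

    def __init__(self, selector=None):
        self.selector = selector

    def __call__(self, item):
        if callable(self.selector):
            # The selector is another filter: run it first.
            value = self.selector(item)
        elif self.selector is not None:
            # Stand-in for an xpath lookup on an HTML element.
            value = item[self.selector]
        else:
            value = item
        return self.filter(value)

    def filter(self, value):
        raise NotImplementedError


class Strip(MiniFilter):
    def filter(self, value):
        return value.strip()


class ToInt(MiniFilter):
    def filter(self, value):
        return int(value)


# Strip runs on the selected value, then ToInt runs on Strip's result.
chained = ToInt(Strip('price'))
result = chained({'price': ' 42 '})
```

Here result is the integer 42: the outer filter never sees the raw item, only what the inner filter returned.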
class weboob.browser.filters.standard.Base(base, selector=None, default=NO_DEFAULT)

Bases: weboob.browser.filters.base.Filter

Change the base element used in filters.

>>> Base(Env('header'), CleanText('./h1'))  # doctest: +SKIP
class weboob.browser.filters.standard.Env(name, default=NO_DEFAULT)

Bases: weboob.browser.filters.base._Filter

Filter to get an environment value of the item.

It is used for example to get page parameters, or when there is a parse() method on ItemElement.

class weboob.browser.filters.standard.TableCell(*names, **kwargs)

Bases: weboob.browser.filters.base._Filter

Used with TableElement, gets the cell element from its name.

For example:

>>> from weboob.capabilities.bank import Transaction
>>> from weboob.browser.elements import TableElement, ItemElement
>>> class table(TableElement):
...     head_xpath = '//table/thead/th'
...     item_xpath = '//table/tbody/tr'
...     col_date =    u'Date'
...     col_label =   [u'Name', u'Label']
...     class item(ItemElement):
...         klass = Transaction
...         obj_date = Date(TableCell('date'))
...         obj_label = CleanText(TableCell('label'))
...

The ‘colspan’ variable enables the handling of table cells whose “colspan” attribute changes the width of a column: for example, <td colspan="2"> occupies two columns instead of one, which shifts all the following columns. This shift must be taken into account when matching cell values with column heads.

call_with_colspan(item)
call_without_colspan(item)
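
The column shift caused by colspan can be illustrated without any HTML. The sketch below is not how TableCell is implemented; cell_for_column is a hypothetical helper, and each row is assumed to be given as a list of (text, colspan) pairs.

```python
def cell_for_column(cells, column_index):
    """Return the text of the cell that covers column_index.

    cells is a list of (text, colspan) pairs for one table row.
    """
    position = 0
    for text, colspan in cells:
        # This cell covers columns position .. position + colspan - 1.
        if position <= column_index < position + colspan:
            return text
        position += colspan
    raise IndexError('no cell covers column %d' % column_index)


# The second cell spans two columns, so the amount sits in column 3
# even though it is only the third cell of the row.
row = [('2019-01-02', 1), ('Coffee shop', 2), ('-3.50', 1)]
amount = cell_for_column(row, 3)
```

Indexing the row directly with the header position (row[3]) would fail here, which is the situation the colspan handling addresses.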
class weboob.browser.filters.standard.RawText(selector=None, children=False, default=NO_DEFAULT)

Bases: weboob.browser.filters.base.Filter

Get raw text from an element.

Unlike CleanText, whitespace is kept as is.

Parameters: children (bool) – whether to get text from the child elements of the selected elements
filter(value)

This method has to be overridden by children classes.

class weboob.browser.filters.standard.CleanText(selector=None, symbols='', replace=[], children=True, newlines=True, normalize='NFC', **kwargs)

Bases: weboob.browser.filters.base.Filter

Get a cleaned text from an element.

It first replaces all tabs and runs of multiple spaces (and newlines too, if newlines is True) with a single space, then strips the resulting string.

The result is coerced into unicode, and optionally normalized according to the normalize argument.

Then it replaces all symbols given in the symbols argument.

>>> CleanText().filter('coucou ') == u'coucou'
True
>>> CleanText().filter(u'coucou coucou') == u'coucou coucou'
True
>>> CleanText(newlines=True).filter(u'coucou\r\n coucou ') == u'coucou coucou'
True
>>> CleanText(newlines=False).filter(u'coucou\r\n coucou ') == u'coucou\ncoucou'
True
Parameters:
  • symbols (list) – list of strings to remove from text
  • replace (list[tuple[str, str]]) – optional pairs of text replacements to perform
  • children (bool) – whether to get text from the child elements of the selected elements
  • newlines (bool) – if True, newlines will be converted to space too
  • normalize (str or None) – Unicode normalization to perform
classmethod clean(txt, children=True, newlines=True, normalize='NFC')
filter(value)

This method has to be overridden by children classes.

classmethod remove(txt, symbols)
classmethod replace(txt, replace)
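
The cleaning steps described for CleanText can be approximated with the standard library alone. This is a rough sketch, not the library's code: clean_text is a hypothetical function, and the handling of newlines=False (cleaning each line separately) is an assumption chosen to agree with the doctests above.

```python
import re
import unicodedata


def clean_text(txt, symbols='', newlines=True, normalize='NFC'):
    if newlines:
        # Any whitespace run, newlines included, becomes one space.
        txt = re.sub(r'\s+', ' ', txt)
    else:
        # Keep line breaks; collapse spaces and tabs inside each line.
        lines = [re.sub(r'[ \t]+', ' ', line).strip()
                 for line in txt.splitlines()]
        txt = '\n'.join(lines)
    txt = txt.strip()
    if normalize:
        txt = unicodedata.normalize(normalize, txt)
    # Remove every unwanted symbol, then strip again.
    for symbol in symbols:
        txt = txt.replace(symbol, '')
    return txt.strip()
```

For instance, clean_text('1 000 €', symbols=['€']) gives '1 000'.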
class weboob.browser.filters.standard.Lower(selector=None, symbols='', replace=[], children=True, newlines=True, normalize='NFC', **kwargs)

Bases: weboob.browser.filters.standard.CleanText

Extract text with CleanText and convert to lower-case.

Parameters:
  • symbols (list) – list of strings to remove from text
  • replace (list[tuple[str, str]]) – optional pairs of text replacements to perform
  • children (bool) – whether to get text from the child elements of the selected elements
  • newlines (bool) – if True, newlines will be converted to space too
  • normalize (str or None) – Unicode normalization to perform
filter(value)

This method has to be overridden by children classes.

class weboob.browser.filters.standard.Upper(selector=None, symbols='', replace=[], children=True, newlines=True, normalize='NFC', **kwargs)

Bases: weboob.browser.filters.standard.CleanText

Extract text with CleanText and convert to upper-case.

Parameters:
  • symbols (list) – list of strings to remove from text
  • replace (list[tuple[str, str]]) – optional pairs of text replacements to perform
  • children (bool) – whether to get text from the child elements of the selected elements
  • newlines (bool) – if True, newlines will be converted to space too
  • normalize (str or None) – Unicode normalization to perform
filter(value)

This method has to be overridden by children classes.

class weboob.browser.filters.standard.Capitalize(selector=None, symbols='', replace=[], children=True, newlines=True, normalize='NFC', **kwargs)

Bases: weboob.browser.filters.standard.CleanText

Extract text with CleanText and capitalize it.

Parameters:
  • symbols (list) – list of strings to remove from text
  • replace (list[tuple[str, str]]) – optional pairs of text replacements to perform
  • children (bool) – whether to get text from the child elements of the selected elements
  • newlines (bool) – if True, newlines will be converted to space too
  • normalize (str or None) – Unicode normalization to perform
filter(value)

This method has to be overridden by children classes.

class weboob.browser.filters.standard.CleanDecimal(selector=None, replace_dots=False, sign=None, legacy=True, default=NO_DEFAULT)

Bases: weboob.browser.filters.standard.CleanText

Get a cleaned Decimal value from an element.

replace_dots is False by default. A dot is interpreted as a decimal separator.

If replace_dots is set to True, all dots are removed and ‘,’ is used as the decimal separator (often useful for French values).

If replace_dots is a tuple, the first element will be used as the thousands separator, and the second as the decimal separator.

See http://en.wikipedia.org/wiki/Thousands_separator#Examples_of_use

For example, for the UK style (as in 1,234,567.89):

>>> CleanDecimal('./td[1]', replace_dots=(',', '.'))  # doctest: +SKIP
Parameters: sign – function accepting the text as param and returning the sign
classmethod French(*args, **kwargs)
classmethod SI(*args, **kwargs)
classmethod US(*args, **kwargs)
filter(value)

This method has to be overridden by children classes.
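
The three forms of replace_dots can be summarised in a few lines of plain Python. The sketch below only models the separator rules stated above; parse_decimal is a hypothetical helper, it ignores the sign and legacy arguments, and picking the first number found in the text is an assumption.

```python
import re
from decimal import Decimal


def parse_decimal(text, replace_dots=False):
    if isinstance(replace_dots, tuple):
        # (thousands separator, decimal separator)
        thousands_sep, decimal_sep = replace_dots
    elif replace_dots:
        # Dots are dropped, the comma is the decimal separator.
        thousands_sep, decimal_sep = '.', ','
    else:
        # A dot is the decimal separator, nothing is dropped.
        thousands_sep, decimal_sep = '', '.'
    if thousands_sep:
        text = text.replace(thousands_sep, '')
    if decimal_sep != '.':
        text = text.replace(decimal_sep, '.')
    match = re.search(r'-?\d+(?:\.\d+)?', text)
    if match is None:
        raise ValueError('no number found in %r' % text)
    return Decimal(match.group(0))
```

With these rules, parse_decimal('1,234,567.89', replace_dots=(',', '.')) yields Decimal('1234567.89') and parse_decimal('blah: 229,90', replace_dots=True) yields Decimal('229.90').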

class weboob.browser.filters.standard.Field(name)

Bases: weboob.browser.filters.base._Filter

Get an attribute of the object.

Example:

obj_foo = CleanText('//h1')
obj_bar = Field('foo')

will make “bar” field equal to “foo” field.

class weboob.browser.filters.standard.Regexp(selector=None, pattern=None, template=None, nth=0, flags=0, default=NO_DEFAULT)

Bases: weboob.browser.filters.base.Filter

Apply a regex.

>>> from lxml.html import etree
>>> doc = etree.fromstring('<html><body><p>Date: <span>13/08/1988</span></p></body></html>')
>>> Regexp(CleanText('//p'), r'Date: (\d+)/(\d+)/(\d+)', '\\3-\\2-\\1')(doc) == u'1988-08-13'
True
>>> (Regexp(CleanText('//body'), r'(\d+)', nth=1))(doc) == u'08'
True
>>> (Regexp(CleanText('//body'), r'(\d+)', nth=-1))(doc) == u'1988'
True
>>> (Regexp(CleanText('//body'), r'(\d+)', template='[\\1]', nth='*'))(doc) == [u'[13]', u'[08]', u'[1988]']
True
>>> (Regexp(CleanText('//body'), r'Date:.*'))(doc) == u'Date: 13/08/1988'
True
>>> (Regexp(CleanText('//body'), r'^(?!Date:).*', default=None))(doc)
>>>
expand(m)
filter(value)
Raises: RegexpError if pattern was not found
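
The roles of template and nth in the doctests above can be reproduced with the re module. This is a sketch rather than the filter's actual code: apply_regexp is a hypothetical function, ValueError stands in for RegexpError, and returning the first group (or the whole match when the pattern has no group) in the absence of a template is an assumption that agrees with the examples shown.

```python
import re


def apply_regexp(text, pattern, template=None, nth=0, flags=0):
    matches = list(re.finditer(pattern, text, flags))
    if not matches:
        raise ValueError('pattern %r not found' % pattern)

    def render(match):
        if template is not None:
            # Backreferences such as \1 are filled from the match.
            return match.expand(template)
        return match.group(1) if match.re.groups else match.group(0)

    if nth == '*':
        # Every match, in order of appearance.
        return [render(match) for match in matches]
    # nth may be negative to count from the last match.
    return render(matches[nth])


text = 'Date: 13/08/1988'
iso = apply_regexp(text, r'Date: (\d+)/(\d+)/(\d+)', '\\3-\\2-\\1')
```

In this sketch iso is '1988-08-13', while apply_regexp(text, r'(\d+)', nth=-1) returns '1988'.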
class weboob.browser.filters.standard.Map(selector, map_dict, default=NO_DEFAULT)

Bases: weboob.browser.filters.base.Filter

Map selected value to another value using a dict.

Example:

TYPES = {
    'Concert': CATEGORIES.CONCERT,
    'Cinéma': CATEGORIES.CINE,
}

obj_type = Map(CleanText('./li'), TYPES)
Parameters: selector – key from map_dict to use
filter(value)
Raises: ItemNotFound if key does not exist in dict
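
The lookup-or-default behaviour can be sketched as follows. The names map_value and _NO_DEFAULT are hypothetical, and KeyError stands in for ItemNotFound; the point is only to show how a sentinel distinguishes "no default given" from a default of None.

```python
_NO_DEFAULT = object()  # sentinel, so that None remains a valid default


def map_value(key, map_dict, default=_NO_DEFAULT):
    try:
        return map_dict[key]
    except KeyError:
        if default is _NO_DEFAULT:
            raise
        return default


TYPES = {'Concert': 'CONCERT', 'Cinéma': 'CINE'}
found = map_value('Concert', TYPES)
fallback = map_value('Opéra', TYPES, default=None)
```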
class weboob.browser.filters.standard.DateTime(selector=None, default=NO_DEFAULT, dayfirst=False, translations=None, parse_func=<function parse>, fuzzy=False)

Bases: weboob.browser.filters.base.Filter

Parse date and time.

Parameters:
  • dayfirst (bool) – if True, the day is the first element in the string to parse
  • parse_func – the function to use for parsing the datetime
  • translations (list[tuple[str, str]]) – string replacements from site locale to English
filter(value)

This method has to be overridden by children classes.
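
The effect of dayfirst and translations can be illustrated with the standard library. Note the substitutions: the sketch uses datetime.strptime with a few fixed formats instead of the parse function named in the signature, and it treats translations as plain substring replacements, which is an assumption. parse_datetime is a hypothetical helper.

```python
from datetime import datetime


def parse_datetime(text, dayfirst=False, translations=None):
    # Translate locale-specific words to English before parsing.
    for source, english in translations or []:
        text = text.replace(source, english)
    if dayfirst:
        formats = ['%d/%m/%Y', '%d %B %Y']
    else:
        formats = ['%m/%d/%Y', '%B %d %Y']
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError('unable to parse %r' % text)


as_day_first = parse_datetime('01/02/2020', dayfirst=True)
as_month_first = parse_datetime('01/02/2020')
```

The same string is read as 1 February 2020 in the first call and as 2 January 2020 in the second. A French date such as '13 août 1988' can be handled by passing translations=[('août', 'August')] together with dayfirst=True.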

class weboob.browser.filters.standard.Date(selector=None, default=NO_DEFAULT, dayfirst=False, translations=None, parse_func=<function parse>, fuzzy=False)

Bases: weboob.browser.filters.standard.DateTime

Parse date.

filter(value)

This method has to be overridden by children classes.

class weboob.browser.filters.standard.Time(selector=None, default=NO_DEFAULT)

Bases: weboob.browser.filters.base.Filter

Parse time.

filter(value)

This method has to be overridden by children classes.

klass

alias of datetime.time

kwargs = {'hour': 'hh', 'minute': 'mm', 'second': 'ss'}
class weboob.browser.filters.standard.DateGuesser(selector, date_guesser, **kwargs)

Bases: weboob.browser.filters.base.Filter

class weboob.browser.filters.standard.Duration(selector=None, default=NO_DEFAULT)

Bases: weboob.browser.filters.standard.Time

Parse a duration as timedelta.

klass

alias of datetime.timedelta

kwargs = {'hours': 'hh', 'minutes': 'mm', 'seconds': 'ss'}
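
The klass and kwargs attributes of Time and Duration suggest that named pieces of the parsed text are passed as keyword arguments to klass. The sketch below shows that idea only; the regular expression, the accepted "hh:mm[:ss]" input shape and the build helper are assumptions, not taken from the library.

```python
import datetime
import re

# Assumed input shape "hh:mm" or "hh:mm:ss"; the real pattern may differ.
PATTERN = re.compile(r'(?P<hh>\d+):(?P<mm>\d+)(?::(?P<ss>\d+))?')

TIME_KWARGS = {'hour': 'hh', 'minute': 'mm', 'second': 'ss'}
DURATION_KWARGS = {'hours': 'hh', 'minutes': 'mm', 'seconds': 'ss'}


def build(klass, kwargs, text):
    match = PATTERN.search(text)
    if match is None:
        raise ValueError('unable to parse %r' % text)
    # Map each keyword of klass to its named group; missing parts are 0.
    values = {name: int(match.group(group) or 0)
              for name, group in kwargs.items()}
    return klass(**values)


as_time = build(datetime.time, TIME_KWARGS, '14:05:30')
as_duration = build(datetime.timedelta, DURATION_KWARGS, '1:30')
```

The same parsing code serves both classes; only the target class and the keyword names change.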
class weboob.browser.filters.standard.MultiFilter(*args, **kwargs)

Bases: weboob.browser.filters.base.Filter

filter(values)

This method has to be overridden by children classes.

class weboob.browser.filters.standard.CombineDate(date, time)

Bases: weboob.browser.filters.standard.MultiFilter

Combine separate Date and Time filters into a single datetime.

filter(value)

This method has to be overridden by children classes.
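
Once the two sub-filters have produced a date and a time, merging them is a one-liner in the standard library. The following sketch shows that final step only; combine is a hypothetical name and the values are made up for illustration.

```python
import datetime


def combine(date_value, time_value):
    # Build a single datetime from a date part and a time part.
    return datetime.datetime.combine(date_value, time_value)


merged = combine(datetime.date(1988, 8, 13), datetime.time(14, 5))
```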

class weboob.browser.filters.standard.Format(fmt, *args)

Bases: weboob.browser.filters.standard.MultiFilter

Combine multiple filters with string-format.

Example:

obj_title = Format('%s (%s)', CleanText('//h1'), CleanText('//h2'))

will concatenate the text from all <h1> and all <h2> (but put the latter between parentheses).

Parameters:
  • fmt (str) – string format suitable for “%”-formatting
  • args – other filters to insert in fmt string. There should be as many args as there are “%” in fmt.
filter(value)

This method has to be overridden by children classes.
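
The way Format combines several filters can be modelled with plain callables. This is an illustrative sketch: make_format is a hypothetical helper, and lambdas reading a dict stand in for CleanText filters applied to a page.

```python
def make_format(fmt, *filters):
    def apply(item):
        # Evaluate every sub-filter on the item, then "%"-format.
        return fmt % tuple(f(item) for f in filters)
    return apply


title = make_format('%s (%s)',
                    lambda item: item['h1'],
                    lambda item: item['h2'])
formatted = title({'h1': 'Concert', 'h2': 'Paris'})
```

Here formatted is 'Concert (Paris)'. As with "%"-formatting in general, a mismatch between the number of placeholders and the number of filters raises TypeError.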

class weboob.browser.filters.standard.Join(pattern, selector=None, textCleaner=<class 'weboob.browser.filters.standard.CleanText'>, newline=False, addBefore='', addAfter='')

Bases: weboob.browser.filters.base.Filter

filter(value)

This method has to be overridden by children classes.

class weboob.browser.filters.standard.Type(selector=None, type=None, minlen=0, default=NO_DEFAULT)

Bases: weboob.browser.filters.base.Filter

Get a cleaned value of any type from an element text. The type argument can be any callable (class, function, etc.). By default an empty string is not parsed; this can be changed by specifying minlen=False. Otherwise, a minimal length can be specified with minlen.

>>> Type(CleanText('./td[1]'), type=int)  # doctest: +SKIP
>>> Type(type=int).filter(42)
42
>>> Type(type=int).filter('42')
42
>>> Type(type=int, default='NaN').filter('')
'NaN'
>>> Type(type=list, minlen=False, default=list('ab')).filter('')
[]
>>> Type(type=list, minlen=0, default=list('ab')).filter('')
['a', 'b']
filter(value)

This method has to be overridden by children classes.
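
The interplay between minlen and default shown in the doctests can be reconstructed as below. This is a guess at the logic that is consistent with those doctests, not the library's source: convert and _NO_DEFAULT are hypothetical names, and ValueError stands in for the filter's own error.

```python
_NO_DEFAULT = object()  # sentinel meaning "no default supplied"


def convert(value, type_func, minlen=0, default=_NO_DEFAULT):
    too_short = (
        isinstance(value, str)
        and minlen is not False   # minlen=False disables the check
        and len(value) <= minlen
    )
    if too_short:
        if default is _NO_DEFAULT:
            raise ValueError('value %r is too short' % value)
        return default
    return type_func(value)
```

The identity test against False matters: minlen=0 and minlen=False compare equal in Python, yet the doctests show that they behave differently.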

class weboob.browser.filters.standard.Eval(func, *args)

Bases: weboob.browser.filters.standard.MultiFilter

Evaluate a function with given ‘deferred’ arguments.

>>> F = Field; Eval(lambda a, b, c: a * b + c, F('foo'), F('bar'), F('baz')) # doctest: +SKIP
>>> Eval(lambda x, y: x * y + 1).filter([3, 7])
22

Example:

obj_ratio = Eval(lambda x: x / 100, Env('percentage'))
Parameters: func – function to apply to all filters. The function should accept as many args as there are filters passed to Eval.
filter(value)

This method has to be overridden by children classes.

class weboob.browser.filters.standard.BrowserURL(url_name, **kwargs)

Bases: weboob.browser.filters.standard.MultiFilter

filter(value)

This method has to be overridden by children classes.

class weboob.browser.filters.standard.Async(name, selector=None)

Bases: weboob.browser.filters.base.Filter

Selector that uses another page fetched earlier.

Often used in combination with the AsyncLoad filter. Requires that the other page’s URL is matched with a Page by the Browser.

Example:

class item(ItemElement):
    load_details = Field('url') & AsyncLoad

    obj_description = Async('details') & CleanText('//h3')
filter(*args)

This method has to be overridden by children classes.

loaded_page(item)
class weboob.browser.filters.standard.AsyncLoad(selector=None, default=NO_DEFAULT)

Bases: weboob.browser.filters.base.Filter

Load a page asynchronously for later use.

Often used in combination with the Async filter.

Parameters: default – default value in case the filter fails to find or parse the requested value
class weboob.browser.filters.standard.QueryValue(selector, key, default=NO_DEFAULT)

Bases: weboob.browser.filters.base.Filter

Extract the value of a parameter from a URL with a query string.

>>> from lxml.html import etree
>>> from .html import Link
>>> f = QueryValue(Link('//a'), 'id')
>>> f(etree.fromstring('<html><body><a href="http://example.org/view?id=1234"></a></body></html>')) == u'1234'
True
filter(value)

This method has to be overridden by children classes.
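
The extraction itself can be done with urllib.parse from the standard library. The sketch below covers that step only; query_value is a hypothetical helper, and taking the first value when a key is repeated is an assumption.

```python
from urllib.parse import parse_qs, urlparse


def query_value(url, key):
    # parse_qs maps each key to the list of its values.
    params = parse_qs(urlparse(url).query)
    return params[key][0]


item_id = query_value('http://example.org/view?id=1234', 'id')
```

A missing key raises KeyError in this sketch.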

class weboob.browser.filters.standard.Coalesce(*args, **kwargs)

Bases: weboob.browser.filters.standard.MultiFilter

Returns the first value that is not falsy, or default if all values are falsy.

filter(value)

This method has to be overridden by children classes.
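
The selection rule can be written in a few lines. In this sketch coalesce and _NO_DEFAULT are hypothetical names, and raising ValueError when no default is supplied is an assumption about the error case.

```python
_NO_DEFAULT = object()  # sentinel meaning "no default supplied"


def coalesce(values, default=_NO_DEFAULT):
    for value in values:
        if value:
            # First value that is not falsy wins.
            return value
    if default is _NO_DEFAULT:
        raise ValueError('all values are falsy')
    return default


first = coalesce(['', None, 'label', 'other'])
fallback_value = coalesce(['', 0, None], default='n/a')
```

Note that 0 and the empty string count as falsy, so they are skipped just like None.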