Table Object

Note

This section covers everything you need to use TableDataExtractor. The other sections (Input, Output, History, Footnotes, Algorithms, Cell Parser, and Exceptions) are for reference only.

Represents a table in a highly standardized format.

class tabledataextractor.table.table.Table(file_path, table_number=1, **kwargs)[source]

Main TableDataExtractor object that includes the raw (input), cleaned (processed) and labelled tables. Represents the table input (.csv, .html, Python list, URL) in a highly standardized category table format, using the MIPS (Minimum Indexing Point Search) algorithm.

Optional configuration keywords (defaults):

  • use_title_row = True
    A title row will be assumed if possible.
  • use_prefixing = True
    Will perform the prefixing steps if row or column index cells are not unique.
  • use_spanning_cells = True
    Will duplicate spanning cells in the row and column header regions if needed.
  • use_header_extension = True
    Will extend the row and column header beyond the MIPS-defined headers, if needed.
  • use_footnotes = True
    Will copy the footnote text into the appropriate cells of the table and remove the footnote prefix.
  • use_max_data_area = False
    If True, the max data area will be used to determine the cell CC2 in the main MIPS algorithm. It is probably never necessary to set this to True.
  • standardize_empty_data = True
    Will standardize empty cells in the data region to ‘NoValue’.
  • row_header = None
    If an integer is given, it is the index of the last column that belongs to the row header. This overrides the MIPS algorithm. For example, row_header = 0 will make only the first column a row header.
  • col_header = None
    If an integer is given, it is the index of the last row that belongs to the column header. This overrides the MIPS algorithm. For example, col_header = 0 will make only the first row a column header.
Parameters:
  • file_path (str | list) – Path to an .html or .csv file, a URL, or a list object that is used as input
  • table_number (int) – Number of the table to read, if there are several at the given URL or in the HTML file
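
As a minimal sketch of how the constructor and the keywords above might be combined: the sample rows, the chosen keyword values and the helper name load_table are illustrative assumptions, not part of the library.

```python
# Hypothetical sample input: a table given as a Python list of rows.
rows = [
    ["", "2018", "2019"],
    ["Germany", "1.5", "0.6"],
    ["France", "1.8", "1.5"],
]

# Keywords from the list above; these values are chosen for illustration only.
config = {"use_title_row": False, "use_footnotes": False}


def load_table(source, **kwargs):
    """Build a Table from a path, URL or list (hypothetical helper)."""
    # Imported lazily so the sketch can be read without the package installed.
    from tabledataextractor.table.table import Table

    return Table(source, **kwargs)
```

With the package installed, load_table(rows, **config).print() should display the raw table, the cleaned table and the labels.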
category_table

Standardized table, where each row corresponds to a single data point of the original table. The columns are the row and column categories to which the data point belongs.

Type:list
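
To illustrate the idea of one entry per data point, here is a small sketch in plain Python. It assumes a table with a single header row and a single header column, and it assumes each entry is stored as [value, row_categories, column_categories]. build_category_table is a hypothetical helper and not the library's implementation, which handles far more general layouts.

```python
def build_category_table(rows):
    """Flatten a table with one header row and one header column."""
    col_header = rows[0][1:]  # skip the stub cell in the top-left corner
    category_table = []
    for row in rows[1:]:
        row_category = row[0]
        for col_category, value in zip(col_header, row[1:]):
            # One entry per data point, with the categories it belongs to.
            category_table.append([value, [row_category], [col_category]])
    return category_table


rows = [
    ["", "2018", "2019"],
    ["Germany", "1.5", "0.6"],
    ["France", "1.8", "1.5"],
]
category_table = build_category_table(rows)
```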
col_header

Column header of the table.

Type:numpy.ndarray
configs

Configuration keywords set at the creation of the Table instance.

Type:dict
contains(pattern)[source]

Returns True if the table contains a string that matches the given pattern.

Parameters:pattern – Regular expression for input
Returns:True/False
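
The kind of check described here can be sketched with the standard re module. Whether the library searches anywhere in a cell or only at its start is not stated above, so the use of re.search is an assumption, and table_contains is a hypothetical stand-in for the method.

```python
import re


def table_contains(rows, pattern):
    """Return True if any cell contains a match for the regular expression."""
    regex = re.compile(pattern)
    # re.search matches anywhere inside the cell text (assumed behaviour).
    return any(regex.search(str(cell)) for row in rows for cell in row)


rows = [
    ["", "2018", "2019"],
    ["Germany", "1.5", "0.6"],
]
has_year = table_contains(rows, r"\d{4}")
has_spain = table_contains(rows, r"Spain")
```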
data

Data region of the table.

Type:numpy.ndarray
footnotes

List of footnotes in the table. Each footnote is an instance of Footnote.

Type:list[Footnote]
history

Indicates which algorithms have been applied to the table by TableDataExtractor.

Type:History
labels

Cell labels.

Type:list
pre_cleaned_table

Cleaned-up table. This table is used for labelling the table regions, finding data-cells and building the category table.

Type:numpy.array
pre_cleaned_table_empty

Mask array with True for all empty cells of the pre_cleaned_table.

Type:numpy.array
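
A mask of this kind can be sketched with nested lists. The library stores a NumPy array, and its exact definition of an empty cell is not given here, so the whitespace-only rule below is an assumption and empty_mask is a hypothetical helper.

```python
def empty_mask(rows):
    """Return a same-shaped nested list with True where a cell is empty."""
    # Assumption: a cell counts as empty if it holds only whitespace.
    return [[str(cell).strip() == "" for cell in row] for row in rows]


rows = [
    ["", "2018"],
    ["Germany", " "],
]
mask = empty_mask(rows)
```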
print()[source]

Prints the raw table (input), cleaned table (processed by TableDataExtractor) and labels (regions of the table) nicely.

print_raw_table()[source]

Prints raw input table nicely.

raw_table

Input table, as provided to TableDataExtractor.

Type:numpy.array
row_categories

Table where the original stub header is the first row(s) and all subsequent rows are the row categories of the original table. The assumption is made that the stub header labels row categories (that is, cells below the stub header). The row_categories table can be used if the row categories need to be analyzed as data themselves, which can occur if the header regions of the original table intentionally have duplicate elements.

Type:TrivialTable
row_header

Row header of the table.

Type:numpy.ndarray
stub_header

Stub header of the table.

Type:numpy.ndarray
subtables

List of all subtables. Each subtable is an instance of Table.

Type:list[Table]
title_row

Title row of the table.

Type:list
to_csv(file_path)[source]

Saves the raw_table to a .csv file.

to_pandas()[source]

Converts the Table into a Pandas DataFrame, taking the complex MultiIndex structure of the table into account.

Returns:pandas.DataFrame
transpose()[source]

Transposes the Table and performs the analysis again. In this way, when working interactively from a Jupyter notebook, it is possible to input a table and then transpose it to see what it looks like and whether the results of the standardization are different.

class tabledataextractor.table.table.TrivialTable(file_path, table_number=1, **kwargs)[source]

Trivial Table object. No high-level analysis will be performed. The MIPS algorithm is never run. This table doesn’t have footnotes, a title row or subtables.

Optional configuration keywords (defaults):

  • standardize_empty_data = False
    Will standardize empty cells in the data region to ‘NoValue’.
  • clean_row_header = False
    Removes duplicate rows that span the whole table (all columns).
  • row_header = 0
    The column up to which the row header is defined.
  • col_header = 0
    The row up to which the column header is defined.
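
How the two indices partition a table can be sketched in plain Python. The sketch treats both indices as inclusive, in line with the statement that row_header = 0 makes only the first column a row header. split_regions is a hypothetical helper that returns nested lists, whereas the library exposes these regions as NumPy arrays.

```python
def split_regions(rows, row_header=0, col_header=0):
    """Split a table into its four regions using inclusive header indices."""
    n_cols = row_header + 1  # number of columns in the row header
    n_rows = col_header + 1  # number of rows in the column header
    return {
        "stub_header": [r[:n_cols] for r in rows[:n_rows]],
        "col_header": [r[n_cols:] for r in rows[:n_rows]],
        "row_header": [r[:n_cols] for r in rows[n_rows:]],
        "data": [r[n_cols:] for r in rows[n_rows:]],
    }


rows = [
    ["Country", "2018", "2019"],
    ["Germany", "1.5", "0.6"],
    ["France", "1.8", "1.5"],
]
regions = split_regions(rows, row_header=0, col_header=0)
```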
col_header

Column header of the table.

Type:numpy.ndarray
footnotes

None

labels

Cell labels.

Type:numpy.array
row_header

Row header of the table. Enables a one-column table.

Type:numpy.ndarray
subtables

None

title_row

None