Table Object
Note
This section covers everything you need to use TableDataExtractor. The other sections (Input, Output, History, Footnotes, Algorithms, Cell Parser, and Exceptions) are for reference only.
Represents a table in a highly standardized format.
-
class tabledataextractor.table.table.Table(file_path, table_number=1, **kwargs)
Main TableDataExtractor object that includes the raw (input), cleaned (processed) and labelled tables. Represents the table input (.csv, .html, Python list, URL) in a highly standardized category table format, using the MIPS (Minimum Indexing Point Search) algorithm.
Optional configuration keywords (defaults):
use_title_row = True
- A title row will be assumed if possible.
use_prefixing = True
- Will perform the prefixing steps if row or column index cells are not unique.
use_spanning_cells = True
- Will duplicate spanning cells in the row and column header regions if needed.
use_header_extension = True
- Will extend the row and column header beyond the MIPS-defined headers, if needed.
use_footnotes = True
- Will copy the footnote text into the appropriate cells of the table and remove the footnote prefix.
use_max_data_area = False
- If True, the max data area will be used to determine the cell CC2 in the main MIPS algorithm. It is probably never necessary to set this to True.
standardize_empty_data = True
- Will standardize empty cells in the data region to ‘NoValue’.
row_header = None
- If an integer is given, it indicates the index of the row header columns. This overrides the MIPS algorithm. For example, row_header = 0 will make only the first column a row header.
col_header = None
- If an integer is given, it indicates the index of the column header rows. This overrides the MIPS algorithm. For example, col_header = 0 will make only the first row a column header.
Parameters: - file_path (str | list) – Path to a .html or .csv file, a URL, or a list object that is used as input
- table_number (int) – Number of the table to read, if there are several at the given URL or in the HTML file
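As a rough illustration of what the row_header and col_header overrides mean, the sketch below splits a small list-of-lists table into the four regions that the Table object exposes (stub_header, col_header, row_header, data). It does not use TableDataExtractor and is not the library's implementation: split_regions is a hypothetical helper, and it assumes that the integer is the index up to and including which the header extends.

```python
# Hypothetical helper for illustration only; not part of TableDataExtractor.
# Assumption: row_header / col_header give the last index (inclusive) that
# belongs to the respective header region.
def split_regions(table, row_header=0, col_header=0):
    head_cols = row_header + 1  # number of leading columns in the row header
    head_rows = col_header + 1  # number of leading rows in the column header
    return {
        "stub_header": [row[:head_cols] for row in table[:head_rows]],
        "col_header": [row[head_cols:] for row in table[:head_rows]],
        "row_header": [row[:head_cols] for row in table[head_rows:]],
        "data": [row[head_cols:] for row in table[head_rows:]],
    }


# Made-up example table given as a Python list.
table = [
    ["Sample", "Width", "Height"],
    ["A", "1.0", "2.0"],
    ["B", "3.0", "4.0"],
]

regions = split_regions(table, row_header=0, col_header=0)
for name, cells in regions.items():
    print(name, cells)
```

With row_header = 0 and col_header = 0, only the first column and the first row are treated as headers, and the cell where they overlap becomes the stub header.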
-
category_table
Standardized table, where each row corresponds to a single data point of the original table. The columns are the row and column categories to which the data point belongs.
Type: list
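The idea of one row per data point can be sketched in plain Python as follows. This is a simplified illustration, not the library's algorithm: build_category_table is a hypothetical helper that assumes a single header row and a single header column, and the [value, row categories, column categories] layout of each entry is an assumption made for this example.

```python
# Hypothetical helper for illustration only; not part of TableDataExtractor.
# Assumptions: the first row is the column header, the first column is the
# row header, and each output entry is [value, [row cats], [column cats]].
def build_category_table(table):
    col_labels = table[0][1:]
    entries = []
    for row in table[1:]:
        row_label = row[0]
        for col_label, value in zip(col_labels, row[1:]):
            entries.append([value, [row_label], [col_label]])
    return entries


# Made-up example table.
example = [
    ["Sample", "Width", "Height"],
    ["A", "1.0", "2.0"],
    ["B", "3.0", "4.0"],
]

category_rows = build_category_table(example)
for entry in category_rows:
    print(entry)
```

Each of the four data cells ends up in its own entry together with the row and column labels it belongs to.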
-
col_header
Column header of the table.
Type: numpy.ndarray
-
contains(pattern)
Returns True if the table contains a particular string.
Parameters: pattern – Regular expression for input
Returns: True/False
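A regular-expression search over all cells can be pictured with the standard re module, as in the sketch below. The function table_contains is a hypothetical stand-in written for this example. Whether the library searches every region of the table, and whether it uses search or full-match semantics, is not stated here, so both are assumptions.

```python
import re


# Hypothetical stand-in for illustration only; not the library's method.
# Assumption: a table "contains" the pattern if re.search finds a match
# in at least one cell of a list-of-lists table.
def table_contains(table, pattern):
    regex = re.compile(pattern)
    return any(regex.search(str(cell)) for row in table for cell in row)


# Made-up example table.
cells = [
    ["Sample", "Width", "Height"],
    ["A", "1.0", "2.0"],
    ["B", "3.0", "4.0"],
]

print(table_contains(cells, r"Wid"))
print(table_contains(cells, r"Depth"))
```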
-
data
Data region of the table.
Type: numpy.ndarray
-
footnotes
List of footnotes in the table. Each footnote is an instance of Footnote.
Type: list[Footnote]
-
history
Indicates which algorithms have been applied to the table by TableDataExtractor.
Type: History
-
pre_cleaned_table
Cleaned-up table. This table is used for labelling the table regions, finding data-cells and building the category table.
Type: numpy.array
-
pre_cleaned_table_empty
Mask array with True for all empty cells of the pre_cleaned_table.
Type: numpy.array
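The relationship between a cleaned table and its empty-cell mask can be sketched with nested lists. The library stores these as NumPy arrays, so this is a simplified picture: empty_mask is a hypothetical helper, and treating whitespace-only strings as empty is an assumption made for this example.

```python
# Hypothetical helper for illustration only; not part of TableDataExtractor.
# Assumption: a cell counts as empty if it is an empty or whitespace-only string.
def empty_mask(table):
    return [[str(cell).strip() == "" for cell in row] for row in table]


# Made-up cleaned table with one empty data cell.
cleaned = [
    ["Sample", "Width"],
    ["A", ""],
    ["B", "3.0"],
]

mask = empty_mask(cleaned)
for row in mask:
    print(row)
```

The mask has the same shape as the table, with True marking the positions of empty cells.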
-
print()
Prints the raw table (input), cleaned table (processed by TableDataExtractor) and labels (regions of the table) nicely.
-
raw_table
Input table, as provided to TableDataExtractor.
Type: numpy.array
-
row_categories
Table where the original stub header is the first row(s) and all subsequent rows are the row categories of the original table. The assumption is made that the stub header labels row categories (that is, cells below the stub header). The row_categories table can be used if the row categories need to be analyzed as data themselves, which can occur if the header regions of the original table intentionally have duplicate elements.
Type: TrivialTable
-
row_header
Row header of the table.
Type: numpy.ndarray
-
stub_header
Stub header of the table.
Type: numpy.ndarray
-
class tabledataextractor.table.table.TrivialTable(file_path, table_number=1, **kwargs)
Trivial Table object. No high-level analysis will be performed. The MIPS algorithm is never run. This table doesn’t have footnotes, a title row or subtables.
Optional configuration keywords (defaults):
standardize_empty_data = False
- Will standardize empty cells in the data region to ‘NoValue’.
clean_row_header = False
- Removes duplicate rows that span the whole table (all columns).
row_header = 0
- The column up to which the row header is defined.
col_header = 0
- The row up to which the column header is defined.
-
col_header
Column header of the table.
Type: numpy.ndarray
-
footnotes
None
-
labels
Cell labels.
Type: numpy.array
-
row_header
Row header of the table. Enables a one-column table.
Type: numpy.ndarray
-
subtables
None
-
title_row
None