Table Object

Note

This is everything you need to use TableDataExtractor. The other sections (Input, Output, History, Footnotes, Algorithms, Cell Parser, and Exceptions) are for reference only.

Represents a table in a highly standardized format.

class tabledataextractor.table.table.Table(file_path, table_number=1, **kwargs)

Main TableDataExtractor object that includes the raw (input), cleaned (processed) and labelled tables. Represents the table input (.csv, .html, Python list, URL) in a highly standardized category table format, using the MIPS (Minimum Indexing Point Search) algorithm.

Optional configuration keywords (defaults):

use_title_row = True

A title row will be assumed if possible.

use_prefixing = True

Will perform the prefixing steps if row or column index cells are not unique.

use_spanning_cells = True

Will duplicate spanning cells in the row and column header regions if needed.

use_header_extension = True

Will extend the row and column header beyond the MIPS-defined headers, if needed.

use_footnotes = True

Will copy the footnote text into the appropriate cells of the table and remove the footnote prefix.

use_max_data_area = False

If True, the max data area will be used to determine the cell CC2 in the main MIPS algorithm. It is probably never necessary to set this to True.

standardize_empty_data = True

Will standardize empty cells in the data region to ‘NoValue’.

row_header = None

If an integer is given, it indicates the index of the row_header columns. This overrides the MIPS algorithm. For example, row_header = 0 will make only the first column a row header.

col_header = None

If an integer is given, it indicates the index of the col_header rows. This overrides the MIPS algorithm. For example, col_header = 0 will make only the first row a column header.
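
The effect of integer row_header and col_header values can be pictured as plain slicing of the table. The following is a minimal sketch in pure Python, not the library's implementation. The helper split_regions is hypothetical, and it assumes that the given index is the last header column (or row), so that all columns (or rows) up to and including it belong to the header.

```python
def split_regions(table, row_header=0, col_header=0):
    """Split a rectangular list-of-lists table into four regions.

    row_header: index of the last row-header column (0 = first column only).
    col_header: index of the last column-header row (0 = first row only).
    Treating the index as inclusive is an assumption of this sketch.
    """
    n_header_rows = col_header + 1
    n_header_cols = row_header + 1
    return {
        "stub_header": [row[:n_header_cols] for row in table[:n_header_rows]],
        "col_header": [row[n_header_cols:] for row in table[:n_header_rows]],
        "row_header": [row[:n_header_cols] for row in table[n_header_rows:]],
        "data": [row[n_header_cols:] for row in table[n_header_rows:]],
    }


table = [
    ["Material", "Tc / K", "Tn / K"],
    ["A", "10", "20"],
    ["B", "30", "40"],
]
regions = split_regions(table, row_header=0, col_header=0)
```

With both values set to 0, only the first column and the first row are treated as headers, and the remaining cells form the data region.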

Parameters:	file_path (str | list) – Path to a .html or .csv file, a URL, or a list object that is used as input
	table_number (int) – Number of the table to read, if there are several at the given URL or in the HTML file

category_table

Standardized table, where each row corresponds to a single data point of the original table. The columns are the row and column categories to which the data point belongs.

Type:	list
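
To make the idea of a category table concrete, here is a small pure-Python sketch that pairs every data cell with its row and column categories. The function build_category_table and the [value, row categories, column categories] layout of each entry are assumptions made for illustration; the library may order or nest these fields differently.

```python
def build_category_table(row_header, col_header, data):
    """Return one [value, row_categories, col_categories] entry per data cell."""
    category_table = []
    for i, data_row in enumerate(data):
        for j, value in enumerate(data_row):
            # Row categories: all row-header cells on the same row.
            row_categories = list(row_header[i])
            # Column categories: all column-header cells in the same column.
            col_categories = [header_row[j] for header_row in col_header]
            category_table.append([value, row_categories, col_categories])
    return category_table


row_header = [["A"], ["B"]]
col_header = [["Tc / K", "Tn / K"]]
data = [["10", "20"], ["30", "40"]]
category_table = build_category_table(row_header, col_header, data)
```

Each of the four data cells yields exactly one entry, so a 2 x 2 data region produces a category table with four rows.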

col_header

Column header of the table.

Type:	numpy.ndarray

configs

Configuration keywords set at the creation of the Table instance.

Type:	dict
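
One way to picture the resulting dictionary is as the documented defaults overlaid with the keywords passed at creation. The sketch below is an illustration only: resolve_configs is a hypothetical helper, and rejecting unknown keywords is a design choice of this sketch rather than documented library behaviour.

```python
# Defaults as listed in the configuration keywords above.
DEFAULT_CONFIGS = {
    "use_title_row": True,
    "use_prefixing": True,
    "use_spanning_cells": True,
    "use_header_extension": True,
    "use_footnotes": True,
    "use_max_data_area": False,
    "standardize_empty_data": True,
    "row_header": None,
    "col_header": None,
}


def resolve_configs(**kwargs):
    """Overlay user-supplied keywords on the defaults (hypothetical helper)."""
    unknown = set(kwargs) - set(DEFAULT_CONFIGS)
    if unknown:
        raise ValueError(f"Unknown configuration keywords: {sorted(unknown)}")
    return {**DEFAULT_CONFIGS, **kwargs}


configs = resolve_configs(use_footnotes=False, row_header=0)
```

Keywords that are not passed keep their default values, so configs always holds one entry per keyword.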

contains(pattern)

Returns True if the table contains a particular string.

Parameters:	pattern – Regular expression for input
Returns:	True/False
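
A check of this kind can be sketched with the standard re module. The function table_contains below is hypothetical and works on a plain list of lists. It assumes that the pattern may match anywhere inside a cell (re.search semantics), which the description above does not specify.

```python
import re


def table_contains(table, pattern):
    """Return True if any cell of the table matches the regular expression."""
    regex = re.compile(pattern)
    return any(regex.search(cell) for row in table for cell in row)


table = [["Material", "Tc / K"], ["A", "10"]]
found_unit = table_contains(table, r"/\s*K")   # matches the "Tc / K" cell
found_pressure = table_contains(table, r"GPa")  # no cell mentions GPa
```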

data

Data region of the table.

Type:	numpy.ndarray

footnotes

List of footnotes in the table. Each footnote is an instance of Footnote.

Type:	list[Footnote]

history

Indicates which algorithms have been applied to the table by TableDataExtractor.

Type:	History

labels

Cell labels.

Type:	list

pre_cleaned_table

Cleaned-up table. This table is used for labelling the table regions, finding data cells and building the category table.

Type:	numpy.ndarray

pre_cleaned_table_empty

Mask array with True for all empty cells of the pre_cleaned_table.

Type:	numpy.ndarray
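
The mask has the same shape as the table it describes. The sketch below shows the idea with nested lists to stay free of dependencies, whereas the attribute itself is an array. The helper empty_mask is hypothetical, and treating whitespace-only cells as empty is an assumption of this sketch.

```python
def empty_mask(table):
    """Return a same-shaped mask with True wherever a cell is empty."""
    # Assumption: a cell counts as empty if it holds only whitespace.
    return [[cell.strip() == "" for cell in row] for row in table]


pre_cleaned = [["", "Tc / K"], ["A", " "]]
mask = empty_mask(pre_cleaned)
```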

print()

Prints the raw table (input), cleaned table (processed by TableDataExtractor) and labels (regions of the table) nicely.

print_raw_table()

Prints the raw input table nicely.

raw_table

Input table, as provided to TableDataExtractor.

Type:	numpy.ndarray

row_categories

Table where the original stub header is the first row(s) and all subsequent rows are the row categories of the original table. The assumption is made that the stub header labels the row categories (that is, the cells below the stub header). The row_categories table can be used if the row categories need to be analyzed as data themselves, which can occur if the header regions of the original table intentionally have duplicate elements.

Type:	TrivialTable

row_header

Row header of the table.

Type:	numpy.ndarray

stub_header

Stub header of the table.

Type:	numpy.ndarray

subtables

List of all subtables. Each subtable is an instance of Table.

Type:	list[Table]

title_row

Title row of the table.

Type:	list

to_csv(file_path)

Saves the raw_table to a .csv file.
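
Writing a table of strings as CSV can be sketched with the standard csv module. This is an illustration, not the library's code: write_table_csv is a hypothetical helper, it writes to any text stream instead of a file path, and the dialect settings are assumptions.

```python
import csv
import io


def write_table_csv(table, stream):
    """Write a list-of-lists table to a text stream as CSV."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerows(table)


raw_table = [["Material", "Tc / K"], ["A, doped", "10"]]
buffer = io.StringIO()
write_table_csv(raw_table, buffer)
csv_text = buffer.getvalue()

# Reading the text back recovers the original cells, including the comma.
round_trip = list(csv.reader(io.StringIO(csv_text)))
```

Cells that contain the delimiter are quoted by the writer, so a cell such as "A, doped" survives the round trip as a single cell.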

to_pandas()

Converts the Table into a Pandas DataFrame, taking the complex MultiIndex structure of the table into account.

Returns:	pandas.DataFrame

transpose()

Transposes the Table and performs the analysis again. This is useful when working interactively, for example in a Jupyter notebook: a table can be input and then transposed to see what it looks like and whether the results of the standardization differ.
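
The axis swap itself can be sketched in pure Python. The helper transpose_table is hypothetical and covers only the swap of rows and columns for a rectangular table; it does not rerun any analysis, which the method described above does.

```python
def transpose_table(table):
    """Swap rows and columns of a rectangular list-of-lists table."""
    return [list(column) for column in zip(*table)]


table = [["Material", "Tc / K"], ["A", "10"], ["B", "30"]]
transposed = transpose_table(table)
```

Transposing twice returns the original table, which makes it easy to compare both orientations.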

class tabledataextractor.table.table.TrivialTable(file_path, table_number=1, **kwargs)

Trivial Table object. No high-level analysis will be performed. The MIPS algorithm is never run. This table doesn’t have footnotes, a title row or subtables.