Algorithms

Algorithms for TableDataExtractor.

tabledataextractor.table.algorithms.build_category_table(table, cc1, cc2, cc3, cc4)[source]

Builds the category table for a given input table. This is the original header factorization according to Embley et al., DOI: 10.1007/s10032-016-0259-1. This version is not used; build_category_table is used instead.

Parameters:
  • table (Numpy array) – Table on which to perform the categorization
  • cc1 – key MIPS cell
  • cc2 – key MIPS cell
  • cc3 – key MIPS cell
  • cc4 – key MIPS cell
Returns:

category table as numpy array

tabledataextractor.table.algorithms.categorize_header(header)[source]

Performs header categorization (calls the SymPy fact function) for a given table.

Parameters:
  • header (Numpy array) – Header region
Returns:

factor_list

tabledataextractor.table.algorithms.clean_row_header(pre_cleaned_table, cc2)[source]

Cleans the row header by removing duplicate rows that span the whole table.

tabledataextractor.table.algorithms.clean_unicode(array)[source]

Replaces problematic unicode characters in a given numpy array.

Parameters:
  • array (numpy.array) – Input array
Returns:

Cleaned array

tabledataextractor.table.algorithms.duplicate_columns(table)[source]

Returns True if there are duplicate columns in the table and False if there are none.

Parameters:
  • table – Table to check
Returns:

True or False

tabledataextractor.table.algorithms.duplicate_rows(table)[source]

Returns True if there are duplicate rows in the table and False if there are none.

Parameters:
  • table – Table to check
Returns:

True or False
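
The two duplicate checks can be illustrated with a minimal pure-Python sketch. It assumes a table stored as a list of lists of strings rather than a numpy array, and the names has_duplicate_rows and has_duplicate_columns are hypothetical; this is not the library's implementation.

```python
def has_duplicate_rows(table):
    """Return True if any two rows of the table are identical."""
    rows = [tuple(row) for row in table]
    return len(set(rows)) < len(rows)


def has_duplicate_columns(table):
    """Return True if any two columns are identical (rows of the transpose)."""
    return has_duplicate_rows(zip(*table))


table = [
    ["a", "b", "a"],
    ["1", "2", "1"],
]
print(has_duplicate_rows(table))     # the two rows differ
print(has_duplicate_columns(table))  # first and third columns are equal
```

Checking columns through the transpose keeps both tests on one code path.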

tabledataextractor.table.algorithms.duplicate_spanning_cells(table_object, array)[source]

Duplicates cell contents into the appropriate spanning cells. This is sometimes necessary for .csv files where information has been lost, or if the source table is not properly formatted.

Cells outside the row/column header (such as data cells) will not be duplicated. MIPS is run to perform a check for that.

Algorithm according to Nagy and Seth, 2016, in Procs. ICPR 2016, Cancun, Mexico.

Parameters:
  • table_object (Table) – Input Table object
  • array (Numpy array) – Table to use as input
Returns:

Array with spanning cells copied, if necessary. Alternatively, returns the original table.
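
A simplified sketch of the idea, not the algorithm of Nagy and Seth and not the library's code: header labels are copied rightwards into the empty cells they span, and rows outside the header are left alone. The function name fill_spanning_header is hypothetical. The sketch also assumes that the caller supplies the number of header rows (the library runs MIPS for that check) and that an empty cell is the empty string.

```python
def fill_spanning_header(table, n_header_rows):
    """Copy each header label rightwards into the empty cells it spans.

    Only the first n_header_rows rows are changed; data rows are copied as-is.
    The input table is not modified.
    """
    result = [list(row) for row in table]
    for row in result[:n_header_rows]:
        for j in range(1, len(row)):
            if row[j] == "" and row[j - 1] != "":
                row[j] = row[j - 1]
    return result


table = [
    ["", "2019", "", "2020", ""],
    ["A", "1", "", "3", "4"],
]
for row in fill_spanning_header(table, n_header_rows=1):
    print(row)
```

The empty data cell in the second row stays empty, which mirrors the rule that cells outside the row/column header are not duplicated.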

tabledataextractor.table.algorithms.empty_cells(array, regex='^([\\s\\-\\–\\—\\"]+)?$')[source]

Returns a mask with True for all empty cells in the original array and False for non-empty cells.

Parameters:
  • array (numpy array) – Input array to return the mask for
  • regex (str) – The regular expression which defines an empty cell (can be tweaked).

tabledataextractor.table.algorithms.empty_string(string, regex='^([\\s\\-\\–\\—\\"]+)?$')[source]

Returns True if a particular string is empty, which is defined with a regular expression.

Parameters:
  • string (str) – Input string for testing
  • regex (str) – The regular expression which defines an empty cell (can be tweaked).
Returns:

True/False
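
To show what the default regular expression treats as empty, here is a minimal sketch using only the standard re module. The pattern is the default from the signatures above, written as a raw string; the helper names is_empty and empty_mask are hypothetical, and a list of lists stands in for the numpy array.

```python
import re

# Default pattern from the signatures above, written as a raw string.
EMPTY = re.compile(r'^([\s\-\–\—\"]+)?$')


def is_empty(string, pattern=EMPTY):
    """True if the string is blank or holds only whitespace, dashes, or double quotes."""
    return pattern.match(string) is not None


def empty_mask(rows, pattern=EMPTY):
    """Boolean mask with the same shape as the list-of-lists table."""
    return [[is_empty(cell, pattern) for cell in row] for row in rows]


rows = [["Sample", "—", "1.5"], ["", " - ", "-2"]]
print(empty_mask(rows))
```

A lone dash counts as empty, but "-2" does not, because the digit falls outside the character class.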

tabledataextractor.table.algorithms.find_cc1_cc2(table_object, cc4, array)[source]

Main MIPS (Minimum Indexing Point Search) algorithm, according to Embley et al., DOI: 10.1007/s10032-016-0259-1. Searches for critical cells CC1 and CC2. MIPS locates the critical cells that define the minimum row and column headers needed to index every data cell.

Parameters:
  • table_object (Table) – Input Table object
  • cc4 ((int, int)) – Position of CC4 cell found with find_cc4()
  • array (numpy array) – table to search for CC1 and CC2
Returns:

cc1, cc2
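
The notion of a minimum indexing point can be illustrated with a brute-force sketch. It is not the MIPS search of Embley et al. and not the library's implementation. It assumes CC1 is fixed at the top-left cell, takes CC4 as given, and tries every candidate CC2, keeping the one with the smallest header area for which all column-header paths and all row-header paths are unique. The function names and the list-of-lists table are assumptions made for illustration.

```python
def all_distinct(items):
    """True if no item occurs twice."""
    items = list(items)
    return len(set(items)) == len(items)


def minimum_indexing_point(table, cc4):
    """Brute-force CC2 search with CC1 fixed at (0, 0).

    Returns the (row, column) of the smallest header corner for which every
    column-header path and every row-header path is unique, or None.
    """
    last_row, last_col = cc4
    best_area, best_cc2 = None, None
    for r in range(last_row):        # candidate last row of the column header
        for c in range(last_col):    # candidate last column of the row header
            col_paths = [tuple(table[i][j] for i in range(r + 1))
                         for j in range(c + 1, last_col + 1)]
            row_paths = [tuple(table[i][j] for j in range(c + 1))
                         for i in range(r + 1, last_row + 1)]
            if all_distinct(col_paths) and all_distinct(row_paths):
                area = (r + 1) * (c + 1)
                if best_area is None or area < best_area:
                    best_area, best_cc2 = area, (r, c)
    return best_cc2


table = [
    ["", "2019", "2019", "2020", "2020"],
    ["", "Q1", "Q2", "Q1", "Q2"],
    ["A", "1", "2", "3", "4"],
    ["B", "5", "6", "7", "8"],
]
print(minimum_indexing_point(table, cc4=(3, 4)))  # (1, 0)
```

One header row is not enough here because the year labels repeat; two header rows and one header column index every data cell uniquely.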

tabledataextractor.table.algorithms.find_cc3(table_object, cc2)[source]

Searches for critical cell CC3, as the leftmost cell of the first filled row of the data region.

Comment on implementation

There are two options on how to implement the search for CC3:

  1. With the possibility of Notes rows directly below the header (default):
    • the first half-filled row below the header is considered the start of the data region, just like for the CC4 cell
    • implemented by Embley et al.
  2. Without the possibility of Notes rows directly below the header:
    • the first row below the header is considered the start of the data region
    • for scientific tables it might be more common that the first data row only has a single entry
    • this can be chosen by commenting/uncommenting the code within this function
Parameters:
  • table_object (Table) – Input Table object
  • cc2 ((int,int)) – Tuple, position of CC2 cell found with find_cc1_cc2()
Returns:

cc3
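
The two options can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the table is a list of lists of strings, "half filled" is read as at least half of the cells to the right of the row header being non-empty, and the switch is exposed as a hypothetical allow_notes parameter (in the library the choice is made by editing the function body).

```python
def find_cc3_sketch(table, cc2, allow_notes=True):
    """Locate CC3 for a list-of-lists table, given CC2 as (row, column).

    allow_notes=True skips rows below the header until one has a data region
    that is at least half filled (option 1); allow_notes=False takes the
    first row below the header unconditionally (option 2).
    """
    header_row, header_col = cc2
    first_data_col = header_col + 1
    for r in range(header_row + 1, len(table)):
        data_cells = table[r][first_data_col:]
        filled = sum(1 for cell in data_cells if cell.strip())
        if not allow_notes or 2 * filled >= len(data_cells):
            return (r, first_data_col)
    return None


table = [
    ["", "x", "y", "z"],
    ["all values in arbitrary units", "", "", ""],
    ["A", "1", "2", "3"],
]
print(find_cc3_sketch(table, cc2=(0, 0)))                     # (2, 1)
print(find_cc3_sketch(table, cc2=(0, 0), allow_notes=False))  # (1, 1)
```

With option 1 the notes row is skipped; with option 2 it becomes the first data row.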

tabledataextractor.table.algorithms.find_cc4(table_object)[source]

Searches for critical cell CC4.

Searches from the bottom of the pre-cleaned table for the last row with a minority of empty cells. Rows with at most a few empty cells are assumed to be part of the data region rather than notes or footnotes rows (which usually only have one or two non-empty cells).
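
A minimal sketch of this bottom-up search, assuming a list-of-lists table of strings and assuming that CC4 sits in the last column of the row that is found. The name find_cc4_sketch is hypothetical and the code is not taken from the library.

```python
def find_cc4_sketch(table):
    """Return (row, column) of the bottom-right data cell, or None."""
    for r in range(len(table) - 1, -1, -1):
        row = table[r]
        n_empty = sum(1 for cell in row if not cell.strip())
        if n_empty < len(row) - n_empty:  # empty cells are the minority
            return (r, len(row) - 1)
    return None


table = [
    ["", "x", "y"],
    ["A", "1", "2"],
    ["B", "3", "4"],
    ["* footnote text", "", ""],
]
print(find_cc4_sketch(table))  # (2, 2)
```

The footnote row has two empty cells out of three, so the search moves up to the row above it.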

Parameters:
  • table_object (Table) – Input Table object
Returns:

cc4

tabledataextractor.table.algorithms.find_note_cells(table_object, labels_table)[source]

Searches for all non-empty cells that have not been labelled differently.

Parameters:
  • table_object (Table) – Input Table object
  • labels_table (Numpy array) – table that holds all the labels
Returns:

Tuple

tabledataextractor.table.algorithms.find_row_header_table(category_table, stub_header)[source]

Constructs a Table from the row categories of the original table.

Parameters:
  • category_table (list) – ~tabledataextractor.table.table.Table.category_table
  • stub_header (numpy.ndarray) – ~tabledataextractor.table.table.Table.stub_header
Returns:

list

tabledataextractor.table.algorithms.find_title_row(table_object)[source]

Searches for the topmost non-empty row.

Parameters:
  • table_object (Table) – Input Table object
Returns:

int

tabledataextractor.table.algorithms.header_extension_down(table_object, cc1, cc2, cc4)[source]

Extends the header downwards, if no prefixing was done and if the appropriate stub header is empty. For column-header expansion downwards, only the first cell of the stub header has to be empty. For row-header expansion to the right, the whole stub header column above has to be empty.

Parameters:
  • table_object (Table) – Input Table object
  • cc2 ((int, int)) – Critical cell CC2
  • cc1 ((int, int)) – Critical cell CC1
  • cc4 ((int, int)) – Critical cell CC4
Returns:

New cc2

tabledataextractor.table.algorithms.header_extension_up(table_object, cc1)[source]

Extends the header after the main MIPS run.

Algorithm according to Nagy and Seth, 2016, “Table Headers: An entrance to the data mine”, in Procs. ICPR 2016, Cancun, Mexico.

Parameters:
  • table_object (Table) – Input Table object
  • cc1 – CC1 critical cell
Returns:

cc1_new

tabledataextractor.table.algorithms.pre_clean(array)[source]

Removes empty and duplicate rows and columns that extend over the whole table.
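
One way to picture this cleaning step is the pure-Python sketch below. It assumes a list-of-lists table of strings, treats a row or column as empty when all of its cells are blank, and keeps the first copy of any duplicate. The function names are hypothetical, and the library's own rules for what counts as empty or duplicate may differ.

```python
def drop_empty_and_duplicate_rows(table):
    """Keep the first copy of each row and drop rows whose cells are all blank."""
    seen, kept = set(), []
    for row in table:
        key = tuple(row)
        if key in seen or not any(cell.strip() for cell in row):
            continue
        seen.add(key)
        kept.append(list(row))
    return kept


def pre_clean_sketch(table):
    """Clean the rows, then clean the columns by working on the transpose."""
    rows = drop_empty_and_duplicate_rows(table)
    cols = drop_empty_and_duplicate_rows(zip(*rows))
    return [list(row) for row in zip(*cols)]


table = [
    ["", "x", "x", ""],
    ["", "", "", ""],
    ["A", "1", "1", ""],
    ["A", "1", "1", ""],
]
print(pre_clean_sketch(table))  # [['', 'x'], ['A', '1']]
```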

Parameters:
  • array (Numpy array) – Input array

tabledataextractor.table.algorithms.prefix_duplicate_labels(table_object, array)[source]

Prefixes duplicate labels in first row or column where this is possible, by adding a new row/column containing the preceding (to the left or above) unique labels, if available.

Nested prefixing is not supported.

The algorithm is not completely selective and there might be cases where its application is undesirable. However, on standard datasets it significantly improves table-region classification.

Algorithm for column headers:

  1. Run MIPS to find the old header region, without prefixing.
  2. For each row in the table, can meaningful prefixing be done in this row?
    • yes –> do the prefixing and go to 3. Prefixing of only one row is possible. Accept the prefixing only if the prefixed rows/cells are above the end of the header (not in the data region); the prefixed cells can still be above the header.
    • no –> go to 2, next row
  3. Run MIPS to get the new header region.
  4. Accept the prefixing only if it has not made the header region start lower than before and has not made the header region wider than before.

The algorithm has been modified from Embley et al., DOI: 10.1007/s10032-016-0259-1.

Parameters:
  • table_object (Table) – Input Table object
  • array (Numpy array) – Table to use as input and to do the prefixing on
Returns:

Table with added rows/columns with prefixes, or, input table, if no prefixing was done
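
The construction of the added row can be illustrated for a single header row. This sketch covers only that step, under one possible reading of the description: a repeated label receives the nearest preceding unique label as its prefix. It leaves out the MIPS runs and the acceptance checks, and the name prefix_row is hypothetical rather than part of the library.

```python
from collections import Counter


def prefix_row(labels):
    """Build the extra row placed above a header row with repeated labels.

    Each repeated label gets the nearest preceding unique label as its prefix.
    Unique labels, empty cells, and repeated labels with no unique label
    before them get an empty string.
    """
    counts = Counter(labels)
    prefixes, current = [], ""
    for label in labels:
        if label and counts[label] == 1:
            current = label  # a unique label starts a new group
            prefixes.append("")
        else:
            prefixes.append(current if label else "")
    return prefixes


header = ["Rutile", "a", "c", "Anatase", "a", "c"]
print(prefix_row(header))
# ['', 'Rutile', 'Rutile', '', 'Anatase', 'Anatase']
```

Read together with the original row, each repeated label now has a distinct path, such as ("Rutile", "a") and ("Anatase", "a").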

tabledataextractor.table.algorithms.split_table(table_object)[source]

Splits table into subtables. Yields Table objects.

Algorithm:
If the stub header is repeated in the column header section, the table is split up before the repeated element.
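
The splitting rule can be sketched with a generator. The sketch assumes a list-of-lists table whose stub header is the top-left cell and whose column header is the first row; it yields plain lists instead of Table objects, and the name split_columns_at_repeated_stub is hypothetical.

```python
def split_columns_at_repeated_stub(table):
    """Yield subtables, splitting before each column whose top cell repeats the stub header."""
    stub = table[0][0]
    starts = [j for j, label in enumerate(table[0])
              if j == 0 or (stub and label == stub)]
    bounds = starts + [len(table[0])]
    for start, end in zip(bounds, bounds[1:]):
        yield [row[start:end] for row in table]


table = [
    ["Name", "Value", "Name", "Value"],
    ["a", "1", "b", "2"],
]
for part in split_columns_at_repeated_stub(table):
    print(part)
```

An empty stub header never triggers a split in this sketch, so such a table is yielded once, unchanged.
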
Parameters:
  • table_object (Table) – Input Table object

tabledataextractor.table.algorithms.standardize_empty(array)[source]

Returns an array with the empty cells of the input array standardized to ‘NoValue’.

Parameters:
  • array (numpy.array) – Input array
Returns:

Array with standardized empty cells