Features

TableDataExtractor uses a variety of algorithms to represent a table in a standardized format. These algorithms work independently of the format in which the input table was provided, so TableDataExtractor works equally well for .csv and .html files.

Standardized Table

The main feature of TableDataExtractor is the standardization of the input table. All algorithms and features presented here aim to produce a higher-quality standardized table, which can subsequently be used for automated parsing and automated retrieval of information from the table.

The standardized table (category table) can be retrieved as a list via table.category_table or simply printed with print(table). Table example from Embley et al. (2016):

[1]:
from tabledataextractor import Table

file = '../examples/tables/table_example_footnotes.csv'
table = Table(file)

table.print_raw_table()
print(table)
1 Development
Country        Million dollar  Million dollar  Million dollar  Percentage of GNI  Percentage of GNI
               2007            2010*           2011* a.        2007               2011
First table
Australia      3735            4580            4936            0.95               1
Greece         2669            3826            4799            0.32               0.35
New Zealand    320             342             429             0.27               0.28
OECD/DAC c     104206          128465          133526          0.27               0.31
c              (unreliable)
* world bank
a.


+--------+---------------------------+-------------------------------------------+
|  Data  |       Row Categories      |             Column Categories             |
+--------+---------------------------+-------------------------------------------+
|  3735  |       ['Australia']       |         ['Million dollar', '2007']        |
|  4580  |       ['Australia']       |   ['Million dollar', '2010 world bank ']  |
|  4936  |       ['Australia']       | ['Million dollar', '2011 world bank    '] |
|  0.95  |       ['Australia']       |       ['Percentage of GNI', '2007']       |
|   1    |       ['Australia']       |       ['Percentage of GNI', '2011']       |
|  2669  |         ['Greece']        |         ['Million dollar', '2007']        |
|  3826  |         ['Greece']        |   ['Million dollar', '2010 world bank ']  |
|  4799  |         ['Greece']        | ['Million dollar', '2011 world bank    '] |
|  0.32  |         ['Greece']        |       ['Percentage of GNI', '2007']       |
|  0.35  |         ['Greece']        |       ['Percentage of GNI', '2011']       |
|  320   |      ['New Zealand']      |         ['Million dollar', '2007']        |
|  342   |      ['New Zealand']      |   ['Million dollar', '2010 world bank ']  |
|  429   |      ['New Zealand']      | ['Million dollar', '2011 world bank    '] |
|  0.27  |      ['New Zealand']      |       ['Percentage of GNI', '2007']       |
|  0.28  |      ['New Zealand']      |       ['Percentage of GNI', '2011']       |
| 104206 | ['OECD/DAC (unreliable)'] |         ['Million dollar', '2007']        |
| 128465 | ['OECD/DAC (unreliable)'] |   ['Million dollar', '2010 world bank ']  |
| 133526 | ['OECD/DAC (unreliable)'] | ['Million dollar', '2011 world bank    '] |
|  0.27  | ['OECD/DAC (unreliable)'] |       ['Percentage of GNI', '2007']       |
|  0.31  | ['OECD/DAC (unreliable)'] |       ['Percentage of GNI', '2011']       |
+--------+---------------------------+-------------------------------------------+
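
The idea behind the category table can be illustrated with a minimal, library-independent sketch. It assumes that the number of header rows and stub columns is already known and that spanning cells and footnotes have already been resolved. The function name build_category_table and the three-element entry layout are illustrative assumptions that mirror the printed columns above; this is not TableDataExtractor's implementation.

```python
def build_category_table(grid, n_header_rows, n_stub_cols):
    """Return [data, row_categories, column_categories] for every data cell.

    Hypothetical helper for illustration only; the header sizes are assumed
    to be known in advance rather than detected.
    """
    entries = []
    for r in range(n_header_rows, len(grid)):
        row_categories = grid[r][:n_stub_cols]
        for c in range(n_stub_cols, len(grid[r])):
            # Collect the header cells stacked above this column, top to bottom.
            column_categories = [grid[h][c] for h in range(n_header_rows)]
            entries.append([grid[r][c], row_categories, column_categories])
    return entries


# Cut-down version of the example above: two header rows, one stub column.
grid = [
    ["Country", "Million dollar", "Percentage of GNI"],
    ["Country", "2007", "2007"],
    ["Australia", "3735", "0.95"],
    ["Greece", "2669", "0.32"],
]

entries = build_category_table(grid, n_header_rows=2, n_stub_cols=1)
for entry in entries:
    print(entry)
```

Each entry pairs one data value with the full path of row and column headers that index it, which is what makes the standardized table convenient for automated parsing.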

Nested Headers and Cell Labelling

The data region of an input table is isolated, taking complex row/column header structures into account and preserving the information about which header categories a particular data point belongs to. The table cells are labelled, according to their role in the table, as Data, Row Header, Column Header, Stub Header, Title, Footnote, Footnote Text, and Note cells.

[2]:
from tabledataextractor.output.print import print_table
table.print_raw_table()
print_table(table.labels)
1 Development
Country        Million dollar  Million dollar  Million dollar  Percentage of GNI  Percentage of GNI
               2007            2010*           2011* a.        2007               2011
First table
Australia      3735            4580            4936            0.95               1
Greece         2669            3826            4799            0.32               0.35
New Zealand    320             342             429             0.27               0.28
OECD/DAC c     104206          128465          133526          0.27               0.31
c              (unreliable)
* world bank
a.


TableTitle         TableTitle  TableTitle         TableTitle                 TableTitle  TableTitle
StubHeader         ColHeader   ColHeader          ColHeader                  ColHeader   ColHeader
StubHeader         ColHeader   ColHeader & FNref  ColHeader & FNref & FNref  ColHeader   ColHeader
Note               /           /                  /                          /           /
RowHeader          Data        Data               Data                       Data        Data
RowHeader          Data        Data               Data                       Data        Data
RowHeader          Data        Data               Data                       Data        Data
RowHeader & FNref  Data        Data               Data                       Data        Data
FNprefix           FNtext      /                  /                          /           /
FNprefix & FNtext  /           /                  /                          /           /
FNprefix           /           /                  /                          /           /
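
As a rough sketch of what the core labels express, the snippet below assigns StubHeader, ColHeader, RowHeader and Data labels to a grid whose header sizes are assumed to be known. The function label_cells and its parameters are hypothetical; the real labelling also detects titles, notes and footnotes, and finds the header region itself, none of which is attempted here.

```python
def label_cells(n_rows, n_cols, n_header_rows, n_stub_cols):
    """Label every cell of an n_rows x n_cols grid by its role.

    Hypothetical helper: the header sizes are given, not detected.
    """
    labels = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            if r < n_header_rows:
                # Top-left block is the stub header, the rest are column headers.
                row.append("StubHeader" if c < n_stub_cols else "ColHeader")
            else:
                # Below the header rows: row headers on the left, data elsewhere.
                row.append("RowHeader" if c < n_stub_cols else "Data")
        labels.append(row)
    return labels


labels = label_cells(n_rows=4, n_cols=3, n_header_rows=2, n_stub_cols=1)
for row in labels:
    print("  ".join(row))
```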


Prefixing of Headers

In many tables the headers are inconclusive, meaning that they include duplicate elements that are usually highlighted in bold or italics. Thanks to the highlighting, a reader can still understand the structure of the table. However, TableDataExtractor does not take any graphical features into account and considers only the raw content of the cells in tabular format, so in some cases a prefixing step is needed to find the header region correctly.

Since the main algorithm used to find the data region, the MIPS algorithm (Minimum Indexing Point Search), relies on duplicate entries in the header regions, the prefixing step is done iteratively. First, the headers are found, and only afterwards is the prefixing performed. The results before and after prefixing are then compared to decide whether to accept the prefixing.

Two examples of prefixing are shown below, for the column header and the row header, respectively (examples from Embley et al., 2016). Prefixing can be turned off by passing the use_prefixing = False keyword argument when creating the Table instance.

[3]:
file = '../examples/tables/table_example8.csv'
table = Table(file)
table.print()
Table 9.
Year      Short messages  Change %  Other messages  Multimedia messages  Change %
2003      1647218         24.3      347             2314
2004      2193498         33.2      439             7386                 219.2


Table 9.
                          Short messages                                       Multimedia messages
Year      Short messages  Change %        Other messages  Multimedia messages  Change %
2003      1647218         24.3            347             2314
2004      2193498         33.2            439             7386                 219.2


TableTitle  TableTitle  TableTitle  TableTitle  TableTitle  TableTitle
StubHeader  ColHeader   ColHeader   ColHeader   ColHeader   ColHeader
StubHeader  ColHeader   ColHeader   ColHeader   ColHeader   ColHeader
RowHeader   Data        Data        Data        Data        Data
RowHeader   Data        Data        Data        Data        Data


[4]:
file = '../examples/tables/table_example9.csv'
table = Table(file)
table.print()
Year                           2003     2004
Short messages/thousands       1647218  2193498
Change %                       24.3     33.2
Other messages                 347      439
Multimedia messages/thousands  2314     7386
Change %                                219.2


                               Year                           2003     2004
                               Short messages/thousands       1647218  2193498
Short messages/thousands       Change %                       24.3     33.2
                               Other messages                 347      439
                               Multimedia messages/thousands  2314     7386
Multimedia messages/thousands  Change %                                219.2


StubHeader  StubHeader  ColHeader  ColHeader
RowHeader   RowHeader   Data       Data
RowHeader   RowHeader   Data       Data
RowHeader   RowHeader   Data       Data
RowHeader   RowHeader   Data       Data
RowHeader   RowHeader   Data       Data
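
The effect of prefixing in the two examples above can be approximated with a short sketch: every header that occurs more than once receives, as its prefix, the nearest preceding header that is unique. This is only a simplified reading of the examples. The function name prefix_duplicates is hypothetical, and the iterative accept-or-reject step described above is not reproduced.

```python
from collections import Counter


def prefix_duplicates(headers):
    """Return one prefix per header; empty string where no prefix is needed.

    Hypothetical helper: a duplicated header is prefixed with the nearest
    preceding header that occurs only once.
    """
    counts = Counter(h for h in headers if h)
    prefixes = []
    last_unique = ""
    for header in headers:
        if header and counts[header] > 1:
            prefixes.append(last_unique)
        else:
            prefixes.append("")
            if header:
                last_unique = header
    return prefixes


column_header = ["Year", "Short messages", "Change %",
                 "Other messages", "Multimedia messages", "Change %"]
column_prefixes = prefix_duplicates(column_header)
print(column_prefixes)

row_header = ["Short messages/thousands", "Change %", "Other messages",
              "Multimedia messages/thousands", "Change %"]
row_prefixes = prefix_duplicates(row_header)
print(row_prefixes)
```

For the column header the prefixes form a new row above the existing header, and for the row header they form a new column to its left, which corresponds to the extra row and column visible in the outputs above.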


Spanning Cells

Spanning cells are commonly encountered in tables. This information is easy to retrieve if the table is provided in .html format. However, if the table is provided as a .csv file or a Python list, the content of a spanning cell needs to be duplicated into each of the cells it spans. TableDataExtractor does that automatically.

The duplication of spanning cells can be turned off by setting use_spanning_cells = False when creating the Table instance. Table example from Embley et al. (2016):

[5]:
file = '../examples/tables/te_04.csv'
table = Table(file)
table.print()
Pupils in comprehensive schools
Year                             School  Pupils                                           Grade 1  Leaving certificates
                                         Pre-primary  Grades          Additional  Total
                                                      6 Jan   9 Jul
1990                             4869    2189         389410  197719              592920  67427    61054
1991                             4861    2181         389411  197711  3601        592921  67421


Pupils in comprehensive schools
Year                             School  Pupils       Pupils  Pupils  Pupils      Pupils  Grade 1  Leaving certificates
Year                             School  Pre-primary  Grades  Grades  Additional  Total   Grade 1  Leaving certificates
Year                             School  Pre-primary  6 Jan   9 Jul   Additional  Total   Grade 1  Leaving certificates
1990                             4869    2189         389410  197719              592920  67427    61054
1991                             4861    2181         389411  197711  3601        592921  67421


TableTitle  TableTitle  TableTitle  TableTitle  TableTitle  TableTitle  TableTitle  TableTitle  TableTitle
StubHeader  ColHeader   ColHeader   ColHeader   ColHeader   ColHeader   ColHeader   ColHeader   ColHeader
StubHeader  ColHeader   ColHeader   ColHeader   ColHeader   ColHeader   ColHeader   ColHeader   ColHeader
StubHeader  ColHeader   ColHeader   ColHeader   ColHeader   ColHeader   ColHeader   ColHeader   ColHeader
RowHeader   Data        Data        Data        Data        Data        Data        Data        Data
RowHeader   Data        Data        Data        Data        Data        Data        Data        Data
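
To illustrate what duplicating spanning cells means, the sketch below expands cells with explicit row and column spans, as they would be available from .html input, into a rectangular grid. The function expand_spans and the (text, rowspan, colspan) cell format are assumptions made for this example, and the span extents are a reading of the header shown above. For .csv input the extents are not given and have to be inferred, which this sketch does not attempt.

```python
def expand_spans(rows):
    """Expand (text, rowspan, colspan) cells into a rectangular grid.

    Hypothetical helper: the text of a spanning cell is copied into every
    grid position that the cell covers.
    """
    grid = {}
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already covered by a span from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# Header of the example above, with assumed span extents.
header = [
    [("Year", 3, 1), ("School", 3, 1), ("Pupils", 1, 5),
     ("Grade 1", 3, 1), ("Leaving certificates", 3, 1)],
    [("Pre-primary", 2, 1), ("Grades", 1, 2),
     ("Additional", 2, 1), ("Total", 2, 1)],
    [("6 Jan", 1, 1), ("9 Jul", 1, 1)],
]

expanded = expand_spans(header)
for row in expanded:
    print(row)
```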


Subtables

If several tables are nested within a single input table and have compatible header structures, TableDataExtractor will process them automatically. table.subtables will contain a list of those subtables, where each entry is an instance of the TableDataExtractor Table class.

[6]:
file = '../examples/tables/te_06.csv'
table = Table(file)
table.print_raw_table()

table.subtables[0].print_raw_table()
table.subtables[1].print_raw_table()
table.subtables[2].print_raw_table()
Material  Tc    A  Material     Tc    A  Material  Tc
Bi6Tl3    6.5   x  TiN          1.4   y  TiO2      1.1
Sb2Tl7    5.5   y  TiC          1.1   x  TiO3      1.2
Na2Pb5    7.2   z  TaC          9.2   x  TiO4      1.3
Hg5Tl7    3.8   x  NbC          10.1  a  TiO5      1.4
Au2Bi     1.84  x  ZrB          2.82  x  TiO6      1.5
CuS       1.6   x  TaSi         4.2   x  TiO7      1.6
VN        1.3   x  PbS          4.1   x  TiO8      1.7
WC        2.8   x  Pb-As alloy  8.4   x  TiO9      1.8
W2C       2.05  x  Pb-Sn-Bi     8.5   x  TiO10     1.9
MoC       7.7   x  Pb-As-Bi     9.0   x  TiO11     1.10
Mo2C      2.4   x  Pb-Bi-Sb     8.9   x  TiO12     1.11


Material  Tc    A
Bi6Tl3    6.5   x
Sb2Tl7    5.5   y
Na2Pb5    7.2   z
Hg5Tl7    3.8   x
Au2Bi     1.84  x
CuS       1.6   x
VN        1.3   x
WC        2.8   x
W2C       2.05  x
MoC       7.7   x
Mo2C      2.4   x


Material     Tc    A
TiN          1.4   y
TiC          1.1   x
TaC          9.2   x
NbC          10.1  a
ZrB          2.82  x
TaSi         4.2   x
PbS          4.1   x
Pb-As alloy  8.4   x
Pb-Sn-Bi     8.5   x
Pb-As-Bi     9.0   x
Pb-Bi-Sb     8.9   x


Material  Tc
TiO2      1.1
TiO3      1.2
TiO4      1.3
TiO5      1.4
TiO6      1.5
TiO7      1.6
TiO8      1.7
TiO9      1.8
TiO10     1.9
TiO11     1.10
TiO12     1.11
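
One simple way to picture the subtable split in this example is to cut the table wherever the first header cell repeats in the header row. The sketch below does only that. The function split_subtables is hypothetical and assumes a single header row with side-by-side subtables; it makes no claim about how TableDataExtractor decides whether header structures are compatible.

```python
def split_subtables(table):
    """Split a table column-wise wherever the first header cell repeats.

    Hypothetical helper: assumes a single header row and subtables that
    sit side by side.
    """
    header = table[0]
    starts = [i for i, cell in enumerate(header) if cell == header[0]]
    bounds = starts + [len(header)]
    return [[row[a:b] for row in table] for a, b in zip(bounds, bounds[1:])]


# First rows of the example above.
table = [
    ["Material", "Tc", "A", "Material", "Tc", "A", "Material", "Tc"],
    ["Bi6Tl3", "6.5", "x", "TiN", "1.4", "y", "TiO2", "1.1"],
    ["Sb2Tl7", "5.5", "y", "TiC", "1.1", "x", "TiO3", "1.2"],
]

subtables = split_subtables(table)
for subtable in subtables:
    print(subtable)
```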


Footnotes

TableDataExtractor handles footnotes by copying the footnote text into the appropriate cells where the footnotes have been referenced. This is a useful feature for automatic parsing of the category table. The copying of the footnote text can be prevented by using the use_footnotes = False keyword argument on Table creation.

Each footnote is a TableDataExtractor.Footnote object that contains all the footnote-relevant information. It can be inspected with print(table.footnotes[0]). Table example from Embley et al. (2016):

[7]:
file = '../examples/tables/table_example_footnotes.csv'
table = Table(file)
table.print()

print(table.footnotes[0])
print(table.footnotes[1])
print(table.footnotes[2])
1 Development
Country        Million dollar  Million dollar  Million dollar  Percentage of GNI  Percentage of GNI
               2007            2010*           2011* a.        2007               2011
First table
Australia      3735            4580            4936            0.95               1
Greece         2669            3826            4799            0.32               0.35
New Zealand    320             342             429             0.27               0.28
OECD/DAC c     104206          128465          133526          0.27               0.31
c              (unreliable)
* world bank
a.


1 Development
Country                Million dollar  Million dollar    Million dollar       Percentage of GNI  Percentage of GNI
Country                2007            2010 world bank   2011 world bank      2007               2011
First table
Australia              3735            4580              4936                 0.95               1
Greece                 2669            3826              4799                 0.32               0.35
New Zealand            320             342               429                  0.27               0.28
OECD/DAC (unreliable)  104206          128465            133526               0.27               0.31
c                      (unreliable)
* world bank
a.


TableTitle         TableTitle  TableTitle         TableTitle                 TableTitle  TableTitle
StubHeader         ColHeader   ColHeader          ColHeader                  ColHeader   ColHeader
StubHeader         ColHeader   ColHeader & FNref  ColHeader & FNref & FNref  ColHeader   ColHeader
Note               /           /                  /                          /           /
RowHeader          Data        Data               Data                       Data        Data
RowHeader          Data        Data               Data                       Data        Data
RowHeader          Data        Data               Data                       Data        Data
RowHeader & FNref  Data        Data               Data                       Data        Data
FNprefix           FNtext      /                  /                          /           /
FNprefix & FNtext  /           /                  /                          /           /
FNprefix           /           /                  /                          /           /


Prefix: 'c'    Text: '(unreliable)'                                                 Ref. Cells: [(7, 0)]   References: ['OECD/DAC c']
Prefix: '*'    Text: 'world bank'                                                   Ref. Cells: [(2, 2), (2, 3)]   References: ['2010*', '2011* a.']
Prefix: 'a.'   Text: ''                                                             Ref. Cells: [(2, 3)]   References: ['2011 world bank  a.']
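
The copying of footnote text into referencing cells can be sketched with a small regular-expression substitution. The helper apply_footnotes and its matching rule are assumptions for illustration: a reference is taken to be the footnote prefix preceded by whitespace or a digit and followed by whitespace or the end of the cell. Unlike the output above, the sketch also collapses the leftover whitespace.

```python
import re


def apply_footnotes(cell, footnotes):
    """Replace footnote references in a cell with the footnote text.

    Hypothetical helper: `footnotes` maps a prefix such as '*' to its text.
    """
    for prefix, text in footnotes.items():
        # A reference must follow whitespace or a digit and end the token.
        pattern = r"(?<=[\s\d])" + re.escape(prefix) + r"(?=\s|$)"
        cell = re.sub(pattern, lambda _match: " " + text + " ", cell)
    # Collapse the whitespace left behind by the substitutions.
    return " ".join(cell.split())


# Prefixes and texts as printed for the three footnotes above.
footnotes = {"c": "(unreliable)", "*": "world bank", "a.": ""}

cells = ["OECD/DAC c", "2010*", "2011* a.", "Percentage of GNI", "Greece"]
resolved = [apply_footnotes(cell, footnotes) for cell in cells]
print(resolved)
```

Cells that merely contain the letter of a prefix inside a word, such as 'Percentage of GNI' or 'Greece', are left unchanged by this matching rule.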