Features
TableDataExtractor uses a variety of algorithms to represent a table in a standardized format. They work independently of the input format in which the table was provided. Thus, TableDataExtractor works equally well for .csv files and .html files.
Standardized Table
The main feature of TableDataExtractor is the standardization of the input table. All algorithms and features presented here aim to produce a higher-quality standardized table, which can subsequently be used for automated parsing and automated retrieval of information from the table.
The standardized table (category table) can be output as a list with table.category_table or simply printed with print(table). Table example from Embley et al. (2016):
[1]:
from tabledataextractor import Table
file = '../examples/tables/table_example_footnotes.csv'
table = Table(file)
table.print_raw_table()
print(table)
1 Development
Country Million dollar Million dollar Million dollar Percentage of GNI Percentage of GNI
2007 2010* 2011* a. 2007 2011
First table
Australia 3735 4580 4936 0.95 1
Greece 2669 3826 4799 0.32 0.35
New Zealand 320 342 429 0.27 0.28
OECD/DAC c 104206 128465 133526 0.27 0.31
c (unreliable)
* world bank
a.
+--------+---------------------------+-------------------------------------------+
| Data | Row Categories | Column Categories |
+--------+---------------------------+-------------------------------------------+
| 3735 | ['Australia'] | ['Million dollar', '2007'] |
| 4580 | ['Australia'] | ['Million dollar', '2010 world bank '] |
| 4936 | ['Australia'] | ['Million dollar', '2011 world bank '] |
| 0.95 | ['Australia'] | ['Percentage of GNI', '2007'] |
| 1 | ['Australia'] | ['Percentage of GNI', '2011'] |
| 2669 | ['Greece'] | ['Million dollar', '2007'] |
| 3826 | ['Greece'] | ['Million dollar', '2010 world bank '] |
| 4799 | ['Greece'] | ['Million dollar', '2011 world bank '] |
| 0.32 | ['Greece'] | ['Percentage of GNI', '2007'] |
| 0.35 | ['Greece'] | ['Percentage of GNI', '2011'] |
| 320 | ['New Zealand'] | ['Million dollar', '2007'] |
| 342 | ['New Zealand'] | ['Million dollar', '2010 world bank '] |
| 429 | ['New Zealand'] | ['Million dollar', '2011 world bank '] |
| 0.27 | ['New Zealand'] | ['Percentage of GNI', '2007'] |
| 0.28 | ['New Zealand'] | ['Percentage of GNI', '2011'] |
| 104206 | ['OECD/DAC (unreliable)'] | ['Million dollar', '2007'] |
| 128465 | ['OECD/DAC (unreliable)'] | ['Million dollar', '2010 world bank '] |
| 133526 | ['OECD/DAC (unreliable)'] | ['Million dollar', '2011 world bank '] |
| 0.27 | ['OECD/DAC (unreliable)'] | ['Percentage of GNI', '2007'] |
| 0.31 | ['OECD/DAC (unreliable)'] | ['Percentage of GNI', '2011'] |
+--------+---------------------------+-------------------------------------------+
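As a rough illustration of how the category table might be consumed downstream, the sketch below assumes that each entry is a [data, row_categories, column_categories] triple, mirroring the three printed columns above. The rows are copied by hand from that output, and select is a hypothetical helper, not part of TableDataExtractor.

```python
# Hand-copied excerpt of the printed category table above. The
# [data, row_categories, column_categories] layout is an assumption
# based on the three printed columns.
category_table = [
    ['3735', ['Australia'], ['Million dollar', '2007']],
    ['0.95', ['Australia'], ['Percentage of GNI', '2007']],
    ['2669', ['Greece'], ['Million dollar', '2007']],
    ['0.32', ['Greece'], ['Percentage of GNI', '2007']],
    ['0.35', ['Greece'], ['Percentage of GNI', '2011']],
]


def select(entries, row=None, column=None):
    """Hypothetical helper: data values whose categories contain the given labels."""
    return [
        data
        for data, row_categories, column_categories in entries
        if (row is None or row in row_categories)
        and (column is None or column in column_categories)
    ]


greece_gni = select(category_table, row='Greece', column='Percentage of GNI')
print(greece_gni)
```

Because every data point carries its full list of row and column categories, a lookup of this kind needs no knowledge of the original table layout.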
Nested Headers and Cell Labelling
The data region of an input table is isolated, taking complex row/column header structures into account and preserving the information about which header categories a particular data point belongs to. The table cells are labelled, according to their role in the table, as Data, Row Header, Column Header, Stub Header, Title, Footnote, Footnote Text, and Note cells.
[2]:
from tabledataextractor.output.print import print_table
table.print_raw_table()
print_table(table.labels)
1 Development
Country Million dollar Million dollar Million dollar Percentage of GNI Percentage of GNI
2007 2010* 2011* a. 2007 2011
First table
Australia 3735 4580 4936 0.95 1
Greece 2669 3826 4799 0.32 0.35
New Zealand 320 342 429 0.27 0.28
OECD/DAC c 104206 128465 133526 0.27 0.31
c (unreliable)
* world bank
a.
TableTitle TableTitle TableTitle TableTitle TableTitle TableTitle
StubHeader ColHeader ColHeader ColHeader ColHeader ColHeader
StubHeader ColHeader ColHeader & FNref ColHeader & FNref & FNref ColHeader ColHeader
Note / / / / /
RowHeader Data Data Data Data Data
RowHeader Data Data Data Data Data
RowHeader Data Data Data Data Data
RowHeader & FNref Data Data Data Data Data
FNprefix FNtext / / / /
FNprefix & FNtext / / / / /
FNprefix / / / / /
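The label grid has the same shape as the raw table, so the two can be walked in parallel. The sketch below is a minimal, library-independent illustration: raw and labels are small hand-written nested lists (not the objects returned by TableDataExtractor), and cells_with_label is a hypothetical helper. Compound labels such as 'ColHeader & FNref' are split on ' & ', which is an assumption based on the printed output above.

```python
# Small hand-written stand-ins for a raw table and its label grid.
raw = [
    ['Country', 'Million dollar', 'Percentage of GNI'],
    ['Australia', '3735', '0.95'],
    ['OECD/DAC c', '104206', '0.27'],
]
labels = [
    ['StubHeader', 'ColHeader', 'ColHeader'],
    ['RowHeader', 'Data', 'Data'],
    ['RowHeader & FNref', 'Data', 'Data'],
]


def cells_with_label(raw, labels, wanted):
    """Hypothetical helper: (row, column, content) of every cell carrying the label."""
    return [
        (i, j, raw[i][j])
        for i, label_row in enumerate(labels)
        for j, label in enumerate(label_row)
        if wanted in label.split(' & ')  # compound labels carry several roles
    ]


print(cells_with_label(raw, labels, 'Data'))
print(cells_with_label(raw, labels, 'FNref'))
```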
Prefixing of headers
In many tables the headers are non-conclusive, meaning that they include duplicate elements that are usually highlighted in bold or italics. Thanks to the highlighting, a human reader can still understand the structure of the table. However, since TableDataExtractor does not take any graphical features into account, and only considers the raw content of cells in tabular format, a prefixing step needs to be performed in some cases to find the header region correctly.
Since the main algorithm used to find the data region, the MIPS algorithm (Minimum Indexing Point Search), relies on duplicate entries in the header regions, the prefixing step is done in an iterative fashion. First, the headers are found, and only afterwards is the prefixing performed. The results before and after prefixing are then compared to decide whether to accept the prefixing or not.
Two examples of prefixing are shown below, for the column and row header, respectively (examples from Embley et al. 2016). Prefixing can be turned off by setting the use_prefixing = False keyword argument upon creation of the Table instance.
[3]:
file = '../examples/tables/table_example8.csv'
table = Table(file)
table.print()
Table 9.
Year Short messages Change % Other messages Multimedia messages Change %
2003 1647218 24.3 347 2314
2004 2193498 33.2 439 7386 219.2
Table 9.
Short messages Multimedia messages
Year Short messages Change % Other messages Multimedia messages Change %
2003 1647218 24.3 347 2314
2004 2193498 33.2 439 7386 219.2
TableTitle TableTitle TableTitle TableTitle TableTitle TableTitle
StubHeader ColHeader ColHeader ColHeader ColHeader ColHeader
StubHeader ColHeader ColHeader ColHeader ColHeader ColHeader
RowHeader Data Data Data Data Data
RowHeader Data Data Data Data Data
[4]:
file = '../examples/tables/table_example9.csv'
table = Table(file)
table.print()
Year 2003 2004
Short messages/thousands 1647218 2193498
Change % 24.3 33.2
Other messages 347 439
Multimedia messages/thousands 2314 7386
Change % 219.2
Year 2003 2004
Short messages/thousands 1647218 2193498
Short messages/thousands Change % 24.3 33.2
Other messages 347 439
Multimedia messages/thousands 2314 7386
Multimedia messages/thousands Change % 219.2
StubHeader StubHeader ColHeader ColHeader
RowHeader RowHeader Data Data
RowHeader RowHeader Data Data
RowHeader RowHeader Data Data
RowHeader RowHeader Data Data
RowHeader RowHeader Data Data
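The idea behind row-header prefixing can be sketched with a deliberately simplified heuristic: every header that occurs more than once is paired with the nearest preceding header that occurs only once. This is not the iterative, MIPS-based procedure TableDataExtractor uses, and prefix_duplicates is a hypothetical name; the sketch only shows how duplicated entries such as 'Change %' become unambiguous.

```python
from collections import Counter


def prefix_duplicates(headers):
    """Pair each duplicated header with the nearest preceding unique header.

    Simplified illustration only; returns (prefix, header) tuples, with an
    empty prefix for headers that are already unique.
    """
    counts = Counter(headers)
    result = []
    last_unique = ''
    for header in headers:
        if counts[header] > 1:
            # Duplicate entry: disambiguate it with the last unique header.
            result.append((last_unique, header))
        else:
            result.append(('', header))
            last_unique = header
    return result


row_headers = [
    'Short messages/thousands',
    'Change %',
    'Other messages',
    'Multimedia messages/thousands',
    'Change %',
]
for prefix, header in prefix_duplicates(row_headers):
    print(repr(prefix), repr(header))
```

After this step every (prefix, header) pair is unique, which is the property the header search needs.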
Spanning cells
Spanning cells are commonly encountered in tables. This information is easy to retrieve if the table is provided in .html format. However, if the table is provided as a .csv file or a Python list, the content of a spanning cell needs to be duplicated into each of the cells it spans. TableDataExtractor does that automatically.
The duplication of spanning cells can be turned off by setting use_spanning_cells = False at creation of the Table instance. Table example from Embley et al. (2016):
[5]:
file = '../examples/tables/te_04.csv'
table = Table(file)
table.print()
Pupils in comprehensive schools
Year School Pupils Grade 1 Leaving certificates
Pre-primary Grades Additional Total
6 Jan 9 Jul
1990 4869 2189 389410 197719 592920 67427 61054
1991 4861 2181 389411 197711 3601 592921 67421
Pupils in comprehensive schools
Year School Pupils Pupils Pupils Pupils Pupils Grade 1 Leaving certificates
Year School Pre-primary Grades Grades Additional Total Grade 1 Leaving certificates
Year School Pre-primary 6 Jan 9 Jul Additional Total Grade 1 Leaving certificates
1990 4869 2189 389410 197719 592920 67427 61054
1991 4861 2181 389411 197711 3601 592921 67421
TableTitle TableTitle TableTitle TableTitle TableTitle TableTitle TableTitle TableTitle TableTitle
StubHeader ColHeader ColHeader ColHeader ColHeader ColHeader ColHeader ColHeader ColHeader
StubHeader ColHeader ColHeader ColHeader ColHeader ColHeader ColHeader ColHeader ColHeader
StubHeader ColHeader ColHeader ColHeader ColHeader ColHeader ColHeader ColHeader ColHeader
RowHeader Data Data Data Data Data Data Data Data
RowHeader Data Data Data Data Data Data Data Data
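The duplication step can be sketched with a simple heuristic. The header region below is loosely modelled on the example above, with empty strings standing in for the cells assumed to be covered by a span; the exact positions of the empty cells in the original file are a guess. fill_spanning_cells is a hypothetical function and not the algorithm implemented in TableDataExtractor: an empty cell is filled from its left neighbour when both share the same parent header in the row above, and from the cell above otherwise.

```python
def fill_spanning_cells(header_rows):
    """Fill empty header cells from the left or from above (simplified heuristic)."""
    filled = [list(row) for row in header_rows]
    for i, row in enumerate(filled):
        for j, cell in enumerate(row):
            if cell:
                continue
            if i == 0:
                # Top row: an empty cell can only continue a horizontal span.
                if j > 0:
                    row[j] = row[j - 1]
            elif j > 0 and filled[i - 1][j] == filled[i - 1][j - 1]:
                # Same parent header as the left neighbour: horizontal span.
                row[j] = row[j - 1]
            else:
                # Different parent header: treat as a vertical span from above.
                row[j] = filled[i - 1][j]
    return filled


header = [
    ['Year', 'School', 'Pupils', '', '', '', '', 'Grade 1', 'Leaving certificates'],
    ['', '', 'Pre-primary', 'Grades', '', 'Additional', 'Total', '', ''],
    ['', '', '', '6 Jan', '9 Jul', '', '', '', ''],
]
for row in fill_spanning_cells(header):
    print(row)
```

On this input the heuristic reproduces the three header rows of the processed table shown above; more irregular span layouts would need extra information, such as the span attributes available in .html input.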
Subtables
If there are many tables nested within a single input table, and if they are of a compatible header structure, TableDataExtractor will automatically process them. table.subtables will contain a list of those subtables, where each entry will be an instance of the TableDataExtractor Table class.
[6]:
file = '../examples/tables/te_06.csv'
table = Table(file)
table.print_raw_table()
table.subtables[0].print_raw_table()
table.subtables[1].print_raw_table()
table.subtables[2].print_raw_table()
Material Tc A Material Tc A Material Tc
Bi6Tl3 6.5 x TiN 1.4 y TiO2 1.1
Sb2Tl7 5.5 y TiC 1.1 x TiO3 1.2
Na2Pb5 7.2 z TaC 9.2 x TiO4 1.3
Hg5Tl7 3.8 x NbC 10.1 a TiO5 1.4
Au2Bi 1.84 x ZrB 2.82 x TiO6 1.5
CuS 1.6 x TaSi 4.2 x TiO7 1.6
VN 1.3 x PbS 4.1 x TiO8 1.7
WC 2.8 x Pb-As alloy 8.4 x TiO9 1.8
W2C 2.05 x Pb-Sn-Bi 8.5 x TiO10 1.9
MoC 7.7 x Pb-As-Bi 9.0 x TiO11 1.10
Mo2C 2.4 x Pb-Bi-Sb 8.9 x TiO12 1.11
Material Tc A
Bi6Tl3 6.5 x
Sb2Tl7 5.5 y
Na2Pb5 7.2 z
Hg5Tl7 3.8 x
Au2Bi 1.84 x
CuS 1.6 x
VN 1.3 x
WC 2.8 x
W2C 2.05 x
MoC 7.7 x
Mo2C 2.4 x
Material Tc A
TiN 1.4 y
TiC 1.1 x
TaC 9.2 x
NbC 10.1 a
ZrB 2.82 x
TaSi 4.2 x
PbS 4.1 x
Pb-As alloy 8.4 x
Pb-Sn-Bi 8.5 x
Pb-As-Bi 9.0 x
Pb-Bi-Sb 8.9 x
Material Tc
TiO2 1.1
TiO3 1.2
TiO4 1.3
TiO5 1.4
TiO6 1.5
TiO7 1.6
TiO8 1.7
TiO9 1.8
TiO10 1.9
TiO11 1.10
TiO12 1.11
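For side-by-side subtables like the ones above, the splitting idea can be sketched by cutting the table at every column where the first header entry repeats. This is only an illustrative stand-in for the library's own compatibility check, whose exact criteria are not described here; split_side_by_side is a hypothetical helper and the rows are a hand-copied excerpt.

```python
def split_side_by_side(rows):
    """Split a table into column blocks, starting a new block wherever the
    first header entry repeats (simplified illustration)."""
    header = rows[0]
    starts = [j for j, name in enumerate(header) if name == header[0]]
    bounds = starts + [len(header)]
    return [
        [row[start:end] for row in rows]
        for start, end in zip(bounds, bounds[1:])
    ]


rows = [
    ['Material', 'Tc', 'A', 'Material', 'Tc', 'A', 'Material', 'Tc'],
    ['Bi6Tl3', '6.5', 'x', 'TiN', '1.4', 'y', 'TiO2', '1.1'],
    ['Sb2Tl7', '5.5', 'y', 'TiC', '1.1', 'x', 'TiO3', '1.2'],
]
for subtable in split_side_by_side(rows):
    for row in subtable:
        print(row)
    print()
```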
Footnotes
TableDataExtractor handles footnotes by copying the footnote text into the appropriate cells where the footnotes have been referenced. This is a useful feature for automatic parsing of the category table. The copying of the footnote text can be prevented by using the use_footnotes = False keyword argument on Table creation.
Each footnote is a TableDataExtractor.Footnote object that contains all the footnote-relevant information. It can be inspected with print(table.footnotes[0]). Table example from Embley et al. (2016):
[7]:
file = '../examples/tables/table_example_footnotes.csv'
table = Table(file)
table.print()
print(table.footnotes[0])
print(table.footnotes[1])
print(table.footnotes[2])
1 Development
Country Million dollar Million dollar Million dollar Percentage of GNI Percentage of GNI
2007 2010* 2011* a. 2007 2011
First table
Australia 3735 4580 4936 0.95 1
Greece 2669 3826 4799 0.32 0.35
New Zealand 320 342 429 0.27 0.28
OECD/DAC c 104206 128465 133526 0.27 0.31
c (unreliable)
* world bank
a.
1 Development
Country Million dollar Million dollar Million dollar Percentage of GNI Percentage of GNI
Country 2007 2010 world bank 2011 world bank 2007 2011
First table
Australia 3735 4580 4936 0.95 1
Greece 2669 3826 4799 0.32 0.35
New Zealand 320 342 429 0.27 0.28
OECD/DAC (unreliable) 104206 128465 133526 0.27 0.31
c (unreliable)
* world bank
a.
TableTitle TableTitle TableTitle TableTitle TableTitle TableTitle
StubHeader ColHeader ColHeader ColHeader ColHeader ColHeader
StubHeader ColHeader ColHeader & FNref ColHeader & FNref & FNref ColHeader ColHeader
Note / / / / /
RowHeader Data Data Data Data Data
RowHeader Data Data Data Data Data
RowHeader Data Data Data Data Data
RowHeader & FNref Data Data Data Data Data
FNprefix FNtext / / / /
FNprefix & FNtext / / / / /
FNprefix / / / / /
Prefix: 'c' Text: '(unreliable)' Ref. Cells: [(7, 0)] References: ['OECD/DAC c']
Prefix: '*' Text: 'world bank' Ref. Cells: [(2, 2), (2, 3)] References: ['2010*', '2011* a.']
Prefix: 'a.' Text: '' Ref. Cells: [(2, 3)] References: ['2011 world bank a.']
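The copying step can be sketched as follows. FootnoteSketch is a hypothetical stand-in that keeps only the prefix, text and referencing cell indices shown in the printout above; it is not the library's Footnote class, and apply_footnote is likewise an illustrative helper. The sketch replaces a trailing footnote prefix in each referencing cell with the footnote text.

```python
from dataclasses import dataclass


@dataclass
class FootnoteSketch:
    """Hypothetical stand-in for a footnote: prefix, text and referencing cells."""
    prefix: str
    text: str
    ref_cells: list  # (row, column) indices of cells that reference the footnote


def apply_footnote(table, footnote):
    """Replace a trailing footnote prefix in each referencing cell with the text."""
    for i, j in footnote.ref_cells:
        cell = table[i][j]
        if footnote.prefix and cell.endswith(footnote.prefix):
            base = cell[: -len(footnote.prefix)].rstrip()
            table[i][j] = f'{base} {footnote.text}'.strip()
    return table


table_cells = [
    ['Country', '2007', '2010*'],
    ['OECD/DAC c', '104206', '128465'],
]
footnotes = [
    FootnoteSketch(prefix='*', text='world bank', ref_cells=[(0, 2)]),
    FootnoteSketch(prefix='c', text='(unreliable)', ref_cells=[(1, 0)]),
]
for footnote in footnotes:
    apply_footnote(table_cells, footnote)
print(table_cells)
```

Restricting the replacement to the listed referencing cells avoids touching unrelated cells that merely happen to end in the same character as a footnote prefix.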