Basics

Input

  • from a file, as .csv or .html
  • from a URL (if the page contains more than one table, use the table_number argument to select one)
  • from a Python list object
[1]:
table_path = '../examples/tables/table_example.csv'

from tabledataextractor import Table
table = Table(table_path)

First, we will check the original table, which is now stored as table.raw_table. We can display it with the print_raw_table() function of TableDataExtractor:

[2]:
table.print_raw_table()
                           Rutile     Rutile  Rutile  Anatase    Anatase  Anatase
                           a = b (Å)  c (Å)   u       a = b (Å)  c (Å)    u
Computational  This study  4.64       2.99    0.305   3.83       9.62     0.208
Computational  GGA [25]    4.67       2.97    0.305   3.80       9.67     0.207
Computational  GGA [26]    4.63       2.98    0.305   -          -        -
Computational  HF [27]     -          -       -       3.76       9.85     0.202
Experimental   Expt. [23]  4.594      2.958   0.305   3.785      9.514    0.207


TableDataExtractor provides a category table, where each row corresponds to a single data point. This is the main result of TableDataExtractor. We can simply print the table to see it:

[3]:
print(table)
+-------+---------------------------------+--------------------------+
|  Data |          Row Categories         |    Column Categories     |
+-------+---------------------------------+--------------------------+
|  4.64 | ['Computational', 'This study'] | ['Rutile', 'a = b (Å)']  |
|  2.99 | ['Computational', 'This study'] |   ['Rutile', 'c (Å)']    |
| 0.305 | ['Computational', 'This study'] |     ['Rutile', 'u']      |
|  3.83 | ['Computational', 'This study'] | ['Anatase', 'a = b (Å)'] |
|  9.62 | ['Computational', 'This study'] |   ['Anatase', 'c (Å)']   |
| 0.208 | ['Computational', 'This study'] |     ['Anatase', 'u']     |
|  4.67 |  ['Computational', 'GGA [25]']  | ['Rutile', 'a = b (Å)']  |
|  2.97 |  ['Computational', 'GGA [25]']  |   ['Rutile', 'c (Å)']    |
| 0.305 |  ['Computational', 'GGA [25]']  |     ['Rutile', 'u']      |
|  3.80 |  ['Computational', 'GGA [25]']  | ['Anatase', 'a = b (Å)'] |
|  9.67 |  ['Computational', 'GGA [25]']  |   ['Anatase', 'c (Å)']   |
| 0.207 |  ['Computational', 'GGA [25]']  |     ['Anatase', 'u']     |
|  4.63 |  ['Computational', 'GGA [26]']  | ['Rutile', 'a = b (Å)']  |
|  2.98 |  ['Computational', 'GGA [26]']  |   ['Rutile', 'c (Å)']    |
| 0.305 |  ['Computational', 'GGA [26]']  |     ['Rutile', 'u']      |
|   -   |  ['Computational', 'GGA [26]']  | ['Anatase', 'a = b (Å)'] |
|   -   |  ['Computational', 'GGA [26]']  |   ['Anatase', 'c (Å)']   |
|   -   |  ['Computational', 'GGA [26]']  |     ['Anatase', 'u']     |
|   -   |   ['Computational', 'HF [27]']  | ['Rutile', 'a = b (Å)']  |
|   -   |   ['Computational', 'HF [27]']  |   ['Rutile', 'c (Å)']    |
|   -   |   ['Computational', 'HF [27]']  |     ['Rutile', 'u']      |
|  3.76 |   ['Computational', 'HF [27]']  | ['Anatase', 'a = b (Å)'] |
|  9.85 |   ['Computational', 'HF [27]']  |   ['Anatase', 'c (Å)']   |
| 0.202 |   ['Computational', 'HF [27]']  |     ['Anatase', 'u']     |
| 4.594 |  ['Experimental', 'Expt. [23]'] | ['Rutile', 'a = b (Å)']  |
| 2.958 |  ['Experimental', 'Expt. [23]'] |   ['Rutile', 'c (Å)']    |
| 0.305 |  ['Experimental', 'Expt. [23]'] |     ['Rutile', 'u']      |
| 3.785 |  ['Experimental', 'Expt. [23]'] | ['Anatase', 'a = b (Å)'] |
| 9.514 |  ['Experimental', 'Expt. [23]'] |   ['Anatase', 'c (Å)']   |
| 0.207 |  ['Experimental', 'Expt. [23]'] |     ['Anatase', 'u']     |
+-------+---------------------------------+--------------------------+
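
To make the idea of "one row per data point" concrete, the following is a minimal sketch of how a data grid and its headers can be flattened into entries of the form [value, row categories, column categories]. It is an illustration only, not TableDataExtractor's actual implementation: the function build_category_table and the 2x2 excerpt of the example table are assumptions made for this sketch.

```python
# Hypothetical 2x2 excerpt of the example table (not read from the real file).
row_header = [["Computational", "This study"],
              ["Experimental", "Expt. [23]"]]
# One list per header level, as in the column header shown in this tutorial.
col_header = [["Rutile", "Anatase"],
              ["a = b (Å)", "a = b (Å)"]]
data = [["4.64", "3.83"],
        ["4.594", "3.785"]]


def build_category_table(data, row_header, col_header):
    """Flatten a grid into [value, row categories, column categories] entries."""
    # Regroup the column header so that each item describes one column.
    col_categories = [list(levels) for levels in zip(*col_header)]
    category_table = []
    for i, row in enumerate(data):
        for j, value in enumerate(row):
            category_table.append(
                [value, list(row_header[i]), list(col_categories[j])]
            )
    return category_table


category_table = build_category_table(data, row_header, col_header)
for entry in category_table:
    print(entry)
```

Each data cell ends up paired with every header label that applies to it, which is the structure printed in the table above.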

If we want to further process the category table, we can access it as a list of lists:

[4]:
print(table.category_table)
[['4.64', ['Computational', 'This study'], ['Rutile', 'a = b (Å)']], ['2.99', ['Computational', 'This study'], ['Rutile', 'c (Å)']], ['0.305', ['Computational', 'This study'], ['Rutile', 'u']], ['3.83', ['Computational', 'This study'], ['Anatase', 'a = b (Å)']], ['9.62', ['Computational', 'This study'], ['Anatase', 'c (Å)']], ['0.208', ['Computational', 'This study'], ['Anatase', 'u']], ['4.67', ['Computational', 'GGA [25]'], ['Rutile', 'a = b (Å)']], ['2.97', ['Computational', 'GGA [25]'], ['Rutile', 'c (Å)']], ['0.305', ['Computational', 'GGA [25]'], ['Rutile', 'u']], ['3.80', ['Computational', 'GGA [25]'], ['Anatase', 'a = b (Å)']], ['9.67', ['Computational', 'GGA [25]'], ['Anatase', 'c (Å)']], ['0.207', ['Computational', 'GGA [25]'], ['Anatase', 'u']], ['4.63', ['Computational', 'GGA [26]'], ['Rutile', 'a = b (Å)']], ['2.98', ['Computational', 'GGA [26]'], ['Rutile', 'c (Å)']], ['0.305', ['Computational', 'GGA [26]'], ['Rutile', 'u']], ['-', ['Computational', 'GGA [26]'], ['Anatase', 'a = b (Å)']], ['-', ['Computational', 'GGA [26]'], ['Anatase', 'c (Å)']], ['-', ['Computational', 'GGA [26]'], ['Anatase', 'u']], ['-', ['Computational', 'HF [27]'], ['Rutile', 'a = b (Å)']], ['-', ['Computational', 'HF [27]'], ['Rutile', 'c (Å)']], ['-', ['Computational', 'HF [27]'], ['Rutile', 'u']], ['3.76', ['Computational', 'HF [27]'], ['Anatase', 'a = b (Å)']], ['9.85', ['Computational', 'HF [27]'], ['Anatase', 'c (Å)']], ['0.202', ['Computational', 'HF [27]'], ['Anatase', 'u']], ['4.594', ['Experimental', 'Expt. [23]'], ['Rutile', 'a = b (Å)']], ['2.958', ['Experimental', 'Expt. [23]'], ['Rutile', 'c (Å)']], ['0.305', ['Experimental', 'Expt. [23]'], ['Rutile', 'u']], ['3.785', ['Experimental', 'Expt. [23]'], ['Anatase', 'a = b (Å)']], ['9.514', ['Experimental', 'Expt. [23]'], ['Anatase', 'c (Å)']], ['0.207', ['Experimental', 'Expt. [23]'], ['Anatase', 'u']]]
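
As a sketch of such further processing, the snippet below drops the '-' placeholders and converts the remaining values to floats. It works on a short excerpt copied from the output above, so it runs without TableDataExtractor. The helper numeric_entries is a hypothetical name introduced for this example.

```python
# Excerpt of the category table printed above: [value, row cats, column cats].
category_table = [
    ['4.63', ['Computational', 'GGA [26]'], ['Rutile', 'a = b (Å)']],
    ['-', ['Computational', 'GGA [26]'], ['Anatase', 'a = b (Å)']],
    ['3.76', ['Computational', 'HF [27]'], ['Anatase', 'a = b (Å)']],
    ['3.785', ['Experimental', 'Expt. [23]'], ['Anatase', 'a = b (Å)']],
]


def numeric_entries(category_table, missing='-'):
    """Return (float value, row categories, column categories), skipping gaps."""
    result = []
    for value, row_cats, col_cats in category_table:
        if value == missing:
            continue  # no data was reported for this cell
        result.append((float(value), tuple(row_cats), tuple(col_cats)))
    return result


# Keep only the Anatase values, together with the study they come from.
anatase_a = [
    (value, rows)
    for value, rows, cols in numeric_entries(category_table)
    if cols[0] == 'Anatase'
]
print(anatase_a)
```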

We may wish to access other elements of the table, such as the title row, the row or column headers, and the data:

[5]:
print("Title row:     \n", table.title_row)
print("Row header:    \n", table.row_header)
print("Column header: \n", table.col_header)
print("Data:          \n", table.data)
Title row:
 0
Row header:
 [['Computational' 'This study']
 ['Computational' 'GGA [25]']
 ['Computational' 'GGA [26]']
 ['Computational' 'HF [27]']
 ['Experimental' 'Expt. [23]']]
Column header:
 [['Rutile' 'Rutile' 'Rutile' 'Anatase' 'Anatase' 'Anatase']
 ['a = b (Å)' 'c (Å)' 'u' 'a = b (Å)' 'c (Å)' 'u']]
Data:
 [['4.64' '2.99' '0.305' '3.83' '9.62' '0.208']
 ['4.67' '2.97' '0.305' '3.80' '9.67' '0.207']
 ['4.63' '2.98' '0.305' '-' '-' '-']
 ['-' '-' '-' '3.76' '9.85' '0.202']
 ['4.594' '2.958' '0.305' '3.785' '9.514' '0.207']]

If needed, we can transpose the whole table. This yields the same category table, with the row and column categories interchanged:

[6]:
table.transpose()
print(table)
+-------+--------------------------+---------------------------------+
|  Data |      Row Categories      |        Column Categories        |
+-------+--------------------------+---------------------------------+
|  4.64 | ['Rutile', 'a = b (Å)']  | ['Computational', 'This study'] |
|  4.67 | ['Rutile', 'a = b (Å)']  |  ['Computational', 'GGA [25]']  |
|  4.63 | ['Rutile', 'a = b (Å)']  |  ['Computational', 'GGA [26]']  |
|   -   | ['Rutile', 'a = b (Å)']  |   ['Computational', 'HF [27]']  |
| 4.594 | ['Rutile', 'a = b (Å)']  |  ['Experimental', 'Expt. [23]'] |
|  2.99 |   ['Rutile', 'c (Å)']    | ['Computational', 'This study'] |
|  2.97 |   ['Rutile', 'c (Å)']    |  ['Computational', 'GGA [25]']  |
|  2.98 |   ['Rutile', 'c (Å)']    |  ['Computational', 'GGA [26]']  |
|   -   |   ['Rutile', 'c (Å)']    |   ['Computational', 'HF [27]']  |
| 2.958 |   ['Rutile', 'c (Å)']    |  ['Experimental', 'Expt. [23]'] |
| 0.305 |     ['Rutile', 'u']      | ['Computational', 'This study'] |
| 0.305 |     ['Rutile', 'u']      |  ['Computational', 'GGA [25]']  |
| 0.305 |     ['Rutile', 'u']      |  ['Computational', 'GGA [26]']  |
|   -   |     ['Rutile', 'u']      |   ['Computational', 'HF [27]']  |
| 0.305 |     ['Rutile', 'u']      |  ['Experimental', 'Expt. [23]'] |
|  3.83 | ['Anatase', 'a = b (Å)'] | ['Computational', 'This study'] |
|  3.80 | ['Anatase', 'a = b (Å)'] |  ['Computational', 'GGA [25]']  |
|   -   | ['Anatase', 'a = b (Å)'] |  ['Computational', 'GGA [26]']  |
|  3.76 | ['Anatase', 'a = b (Å)'] |   ['Computational', 'HF [27]']  |
| 3.785 | ['Anatase', 'a = b (Å)'] |  ['Experimental', 'Expt. [23]'] |
|  9.62 |   ['Anatase', 'c (Å)']   | ['Computational', 'This study'] |
|  9.67 |   ['Anatase', 'c (Å)']   |  ['Computational', 'GGA [25]']  |
|   -   |   ['Anatase', 'c (Å)']   |  ['Computational', 'GGA [26]']  |
|  9.85 |   ['Anatase', 'c (Å)']   |   ['Computational', 'HF [27]']  |
| 9.514 |   ['Anatase', 'c (Å)']   |  ['Experimental', 'Expt. [23]'] |
| 0.208 |     ['Anatase', 'u']     | ['Computational', 'This study'] |
| 0.207 |     ['Anatase', 'u']     |  ['Computational', 'GGA [25]']  |
|   -   |     ['Anatase', 'u']     |  ['Computational', 'GGA [26]']  |
| 0.202 |     ['Anatase', 'u']     |   ['Computational', 'HF [27]']  |
| 0.207 |     ['Anatase', 'u']     |  ['Experimental', 'Expt. [23]'] |
+-------+--------------------------+---------------------------------+
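
The effect of transposition can be reproduced on a small grid in plain Python. This is a simplified sketch with single-level headers, not the library's code. The names flatten, row_labels and col_labels are assumptions made for the example.

```python
# Hypothetical 2x2 excerpt with single-level headers.
data = [["4.64", "2.99"],
        ["4.594", "2.958"]]
row_labels = ["This study", "Expt. [23]"]
col_labels = ["a = b (Å)", "c (Å)"]


def flatten(data, row_labels, col_labels):
    """List every cell as [value, row label, column label], row by row."""
    return [[value, row_labels[i], col_labels[j]]
            for i, row in enumerate(data)
            for j, value in enumerate(row)]


# Transposing turns columns into rows, so the two header lists swap roles.
transposed = [list(column) for column in zip(*data)]

original_entries = flatten(data, row_labels, col_labels)
transposed_entries = flatten(transposed, col_labels, row_labels)

for entry in transposed_entries:
    print(entry)
```

The data points are the same in both listings. Only the order of the entries and the roles of the two categories change, as in the printed table above.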

Output & Pandas

  • as csv file
  • as Pandas DataFrame

To store the table as .csv:

[7]:
table.to_csv('./saved_table.csv')

The table can also be converted to a Pandas DataFrame object:

[8]:
import pandas
df = table.to_pandas()
df
[8]:
                    Computational                            Experimental
                    This study  GGA [25]  GGA [26]  HF [27]  Expt. [23]
Rutile   a = b (Å)  4.64        4.67      4.63      -        4.594
         c (Å)      2.99        2.97      2.98      -        2.958
         u          0.305       0.305     0.305     -        0.305
Anatase  a = b (Å)  3.83        3.80      -         3.76     3.785
         c (Å)      9.62        9.67      -         9.85     9.514
         u          0.208       0.207     -         0.202    0.207

We can now use all the powerful features of Pandas to interpret the content of the table. Let's say that we are interested in the experimental values for ‘Anatase’:

[9]:
df.loc['Anatase','Experimental']
[9]:
           Expt. [23]
a = b (Å)  3.785
c (Å)      9.514
u          0.207
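
For numerical work, the cell contents usually need to be converted to numbers first. The sketch below assumes that the DataFrame holds strings and uses '-' for missing entries, as the quoted values in the category table suggest. It uses a small hand-made frame named raw (a hypothetical stand-in for the real DataFrame) so that it runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for part of the DataFrame: strings, '-' for gaps.
raw = pd.DataFrame(
    {"GGA [26]": ["4.63", "-"], "HF [27]": ["-", "3.76"]},
    index=["Rutile", "Anatase"],
)

# errors="coerce" turns anything that is not a number (here '-') into NaN.
numeric = raw.apply(pd.to_numeric, errors="coerce")

print(numeric)
print(numeric.dtypes)
```

After the conversion, numerical operations such as mean() skip the missing entries by default.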

The most powerful feature of TableDataExtractor is that it will automatically create a MultiIndex for the Pandas DataFrame, which would traditionally be done by hand for every individual table.

[10]:
print(df.index)
print(df.columns)
MultiIndex(levels=[['Anatase', 'Rutile'], ['a = b (Å)', 'c (Å)', 'u']],
           labels=[[1, 1, 1, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
MultiIndex(levels=[['Computational', 'Experimental'], ['Expt. [23]', 'GGA [25]', 'GGA [26]', 'HF [27]', 'This study']],
           labels=[[0, 0, 0, 0, 1], [4, 1, 2, 3, 0]])
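
For comparison, this is roughly what building such a MultiIndex by hand looks like for a small excerpt of the table. It is a sketch using standard Pandas calls. The name manual_df and the choice of excerpt are assumptions for the example.

```python
import pandas as pd

# Row index: (phase, lattice parameter) pairs, typed out manually.
index = pd.MultiIndex.from_tuples([
    ("Rutile", "a = b (Å)"), ("Rutile", "c (Å)"),
    ("Anatase", "a = b (Å)"), ("Anatase", "c (Å)"),
])
# Column index: (method type, source) pairs.
columns = pd.MultiIndex.from_tuples([
    ("Computational", "This study"), ("Experimental", "Expt. [23]"),
])
values = [["4.64", "4.594"],
          ["2.99", "2.958"],
          ["3.83", "3.785"],
          ["9.62", "9.514"]]

manual_df = pd.DataFrame(values, index=index, columns=columns)

# The same kind of selection as above now works on the hand-built frame.
anatase_expt = manual_df.loc["Anatase", "Experimental"]
print(anatase_expt)
```

Every header tuple has to be written out and kept in step with the data, which quickly becomes tedious for larger tables.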

Or, we might be interested in only the ‘c (Å)’ values from the table. Here, ilevel_1 refers to level 1 (the second level) of the row index, which holds the labels ‘a = b (Å)’, ‘c (Å)’ and ‘u’:

[11]:
df.query('ilevel_1 == "c (Å)"')
[11]:
                Computational                            Experimental
                This study  GGA [25]  GGA [26]  HF [27]  Expt. [23]
Rutile   c (Å)  2.99        2.97      2.98      -        2.958
Anatase  c (Å)  9.62        9.67      -         9.85     9.514
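
The same selection can also be written without a query string. The sketch below shows two alternatives from standard Pandas, a cross-section with xs and a boolean mask on the index level. It uses a small hand-built frame named small_df (a hypothetical stand-in) so that it runs on its own.

```python
import pandas as pd

# Hypothetical stand-in with the same two-level row index as the table.
index = pd.MultiIndex.from_tuples([
    ("Rutile", "a = b (Å)"), ("Rutile", "c (Å)"),
    ("Anatase", "a = b (Å)"), ("Anatase", "c (Å)"),
])
small_df = pd.DataFrame(
    {"This study": ["4.64", "2.99", "3.83", "9.62"]}, index=index
)

# Cross-section on index level 1; drop_level=False keeps both index levels.
by_xs = small_df.xs("c (Å)", level=1, drop_level=False)

# Boolean mask built from the labels of index level 1.
by_mask = small_df[small_df.index.get_level_values(1) == "c (Å)"]

print(by_xs)
```

Both forms avoid quoting the label inside a query string, which can be convenient when labels contain quotation marks or are stored in variables.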