Basics
Input
- from file, as .csv or .html
- from url (if the provided url contains more than one table, use the table_number argument to choose one)
- from python list object
[1]:
table_path = '../examples/tables/table_example.csv'
from tabledataextractor import Table
table = Table(table_path)
First, we will check out the original table, which is now stored as table.raw_table. We can use the print_raw_table() function within TableDataExtractor:
[2]:
table.print_raw_table()
                           Rutile     Rutile  Rutile  Anatase    Anatase  Anatase
                           a = b (Å)  c (Å)   u       a = b (Å)  c (Å)    u
Computational  This study  4.64       2.99    0.305   3.83       9.62     0.208
Computational  GGA [25]    4.67       2.97    0.305   3.80       9.67     0.207
Computational  GGA [26]    4.63       2.98    0.305   -          -        -
Computational  HF [27]     -          -       -       3.76       9.85     0.202
Experimental   Expt. [23]  4.594      2.958   0.305   3.785      9.514    0.207
TableDataExtractor provides a category table, where each row corresponds to a single data point. This is the main result of TableDataExtractor. We can simply print the table to see it:
[3]:
print(table)
+-------+---------------------------------+--------------------------+
|  Data |          Row Categories         |    Column Categories     |
+-------+---------------------------------+--------------------------+
|  4.64 | ['Computational', 'This study'] | ['Rutile', 'a = b (Å)']  |
|  2.99 | ['Computational', 'This study'] |   ['Rutile', 'c (Å)']    |
| 0.305 | ['Computational', 'This study'] |     ['Rutile', 'u']      |
|  3.83 | ['Computational', 'This study'] | ['Anatase', 'a = b (Å)'] |
|  9.62 | ['Computational', 'This study'] |   ['Anatase', 'c (Å)']   |
| 0.208 | ['Computational', 'This study'] |     ['Anatase', 'u']     |
|  4.67 |  ['Computational', 'GGA [25]']  | ['Rutile', 'a = b (Å)']  |
|  2.97 |  ['Computational', 'GGA [25]']  |   ['Rutile', 'c (Å)']    |
| 0.305 |  ['Computational', 'GGA [25]']  |     ['Rutile', 'u']      |
|  3.80 |  ['Computational', 'GGA [25]']  | ['Anatase', 'a = b (Å)'] |
|  9.67 |  ['Computational', 'GGA [25]']  |   ['Anatase', 'c (Å)']   |
| 0.207 |  ['Computational', 'GGA [25]']  |     ['Anatase', 'u']     |
|  4.63 |  ['Computational', 'GGA [26]']  | ['Rutile', 'a = b (Å)']  |
|  2.98 |  ['Computational', 'GGA [26]']  |   ['Rutile', 'c (Å)']    |
| 0.305 |  ['Computational', 'GGA [26]']  |     ['Rutile', 'u']      |
|   -   |  ['Computational', 'GGA [26]']  | ['Anatase', 'a = b (Å)'] |
|   -   |  ['Computational', 'GGA [26]']  |   ['Anatase', 'c (Å)']   |
|   -   |  ['Computational', 'GGA [26]']  |     ['Anatase', 'u']     |
|   -   |   ['Computational', 'HF [27]']  | ['Rutile', 'a = b (Å)']  |
|   -   |   ['Computational', 'HF [27]']  |   ['Rutile', 'c (Å)']    |
|   -   |   ['Computational', 'HF [27]']  |     ['Rutile', 'u']      |
|  3.76 |   ['Computational', 'HF [27]']  | ['Anatase', 'a = b (Å)'] |
|  9.85 |   ['Computational', 'HF [27]']  |   ['Anatase', 'c (Å)']   |
| 0.202 |   ['Computational', 'HF [27]']  |     ['Anatase', 'u']     |
| 4.594 |  ['Experimental', 'Expt. [23]'] | ['Rutile', 'a = b (Å)']  |
| 2.958 |  ['Experimental', 'Expt. [23]'] |   ['Rutile', 'c (Å)']    |
| 0.305 |  ['Experimental', 'Expt. [23]'] |     ['Rutile', 'u']      |
| 3.785 |  ['Experimental', 'Expt. [23]'] | ['Anatase', 'a = b (Å)'] |
| 9.514 |  ['Experimental', 'Expt. [23]'] |   ['Anatase', 'c (Å)']   |
| 0.207 |  ['Experimental', 'Expt. [23]'] |     ['Anatase', 'u']     |
+-------+---------------------------------+--------------------------+
If we want to further process the category table, we can access it as a list of lists:
[4]:
print(table.category_table)
[['4.64', ['Computational', 'This study'], ['Rutile', 'a = b (Å)']], ['2.99', ['Computational', 'This study'], ['Rutile', 'c (Å)']], ['0.305', ['Computational', 'This study'], ['Rutile', 'u']], ['3.83', ['Computational', 'This study'], ['Anatase', 'a = b (Å)']], ['9.62', ['Computational', 'This study'], ['Anatase', 'c (Å)']], ['0.208', ['Computational', 'This study'], ['Anatase', 'u']], ['4.67', ['Computational', 'GGA [25]'], ['Rutile', 'a = b (Å)']], ['2.97', ['Computational', 'GGA [25]'], ['Rutile', 'c (Å)']], ['0.305', ['Computational', 'GGA [25]'], ['Rutile', 'u']], ['3.80', ['Computational', 'GGA [25]'], ['Anatase', 'a = b (Å)']], ['9.67', ['Computational', 'GGA [25]'], ['Anatase', 'c (Å)']], ['0.207', ['Computational', 'GGA [25]'], ['Anatase', 'u']], ['4.63', ['Computational', 'GGA [26]'], ['Rutile', 'a = b (Å)']], ['2.98', ['Computational', 'GGA [26]'], ['Rutile', 'c (Å)']], ['0.305', ['Computational', 'GGA [26]'], ['Rutile', 'u']], ['-', ['Computational', 'GGA [26]'], ['Anatase', 'a = b (Å)']], ['-', ['Computational', 'GGA [26]'], ['Anatase', 'c (Å)']], ['-', ['Computational', 'GGA [26]'], ['Anatase', 'u']], ['-', ['Computational', 'HF [27]'], ['Rutile', 'a = b (Å)']], ['-', ['Computational', 'HF [27]'], ['Rutile', 'c (Å)']], ['-', ['Computational', 'HF [27]'], ['Rutile', 'u']], ['3.76', ['Computational', 'HF [27]'], ['Anatase', 'a = b (Å)']], ['9.85', ['Computational', 'HF [27]'], ['Anatase', 'c (Å)']], ['0.202', ['Computational', 'HF [27]'], ['Anatase', 'u']], ['4.594', ['Experimental', 'Expt. [23]'], ['Rutile', 'a = b (Å)']], ['2.958', ['Experimental', 'Expt. [23]'], ['Rutile', 'c (Å)']], ['0.305', ['Experimental', 'Expt. [23]'], ['Rutile', 'u']], ['3.785', ['Experimental', 'Expt. [23]'], ['Anatase', 'a = b (Å)']], ['9.514', ['Experimental', 'Expt. [23]'], ['Anatase', 'c (Å)']], ['0.207', ['Experimental', 'Expt. [23]'], ['Anatase', 'u']]]
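Because every entry has the form [data, row categories, column categories], ordinary list processing is enough for simple queries. The following is a minimal sketch that works on a hand-copied excerpt of the output above rather than on a live Table object. The variable names and the to_float helper are illustrative assumptions, not part of TableDataExtractor. It collects the numeric 'Rutile' values and skips the '-' placeholders:

```python
# Excerpt of the category table printed above, copied by hand for illustration.
category_table = [
    ['4.64', ['Computational', 'This study'], ['Rutile', 'a = b (Å)']],
    ['3.83', ['Computational', 'This study'], ['Anatase', 'a = b (Å)']],
    ['-', ['Computational', 'HF [27]'], ['Rutile', 'a = b (Å)']],
    ['3.76', ['Computational', 'HF [27]'], ['Anatase', 'a = b (Å)']],
    ['4.594', ['Experimental', 'Expt. [23]'], ['Rutile', 'a = b (Å)']],
]


def to_float(cell):
    """Hypothetical helper: return the cell as a float, or None if it is not numeric."""
    try:
        return float(cell)
    except ValueError:
        return None


# Keep only entries whose column categories include 'Rutile' and whose data is numeric.
rutile_values = {}
for data, row_categories, col_categories in category_table:
    value = to_float(data)
    if 'Rutile' in col_categories and value is not None:
        rutile_values[tuple(row_categories)] = value

print(rutile_values)
```

The row categories are converted to tuples only so that they can serve as dictionary keys.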
We may wish to access other elements of the table, such as the title row, the row or column headers, and the data:
[5]:
print("Title row:     \n", table.title_row)
print("Row header:    \n", table.row_header)
print("Column header: \n", table.col_header)
print("Data:          \n", table.data)
Title row:
 0
Row header:
 [['Computational' 'This study']
 ['Computational' 'GGA [25]']
 ['Computational' 'GGA [26]']
 ['Computational' 'HF [27]']
 ['Experimental' 'Expt. [23]']]
Column header:
 [['Rutile' 'Rutile' 'Rutile' 'Anatase' 'Anatase' 'Anatase']
 ['a = b (Å)' 'c (Å)' 'u' 'a = b (Å)' 'c (Å)' 'u']]
Data:
 [['4.64' '2.99' '0.305' '3.83' '9.62' '0.208']
 ['4.67' '2.97' '0.305' '3.80' '9.67' '0.207']
 ['4.63' '2.98' '0.305' '-' '-' '-']
 ['-' '-' '-' '3.76' '9.85' '0.202']
 ['4.594' '2.958' '0.305' '3.785' '9.514' '0.207']]
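To see how these regions relate to the category table, the sketch below rebuilds a few category-table entries from a small excerpt of the row header, column header and data. It uses plain Python lists instead of the arrays printed above, and it only illustrates the relationship. It is not TableDataExtractor's actual implementation, and all variable names are assumptions made for this example:

```python
# Small excerpt of the header and data regions shown above, as plain lists.
row_header = [['Computational', 'This study'], ['Experimental', 'Expt. [23]']]
col_header = [['Rutile', 'Rutile', 'Anatase'],
              ['a = b (Å)', 'c (Å)', 'a = b (Å)']]
data = [['4.64', '2.99', '3.83'],
        ['4.594', '2.958', '3.785']]

# Column categories are read downwards: one label from each header row.
col_categories = [list(labels) for labels in zip(*col_header)]

# Pair every data cell with the categories of its row and of its column.
category_table = [
    [cell, row_header[i], col_categories[j]]
    for i, row in enumerate(data)
    for j, cell in enumerate(row)
]

for entry in category_table:
    print(entry)
```

Each printed entry has the same [data, row categories, column categories] layout as the entries of table.category_table.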
If needed, we can transpose the whole table. This yields the same category table, with the row and column categories interchanged:
[6]:
table.transpose()
print(table)
+-------+--------------------------+---------------------------------+
|  Data |      Row Categories      |        Column Categories        |
+-------+--------------------------+---------------------------------+
|  4.64 | ['Rutile', 'a = b (Å)']  | ['Computational', 'This study'] |
|  4.67 | ['Rutile', 'a = b (Å)']  |  ['Computational', 'GGA [25]']  |
|  4.63 | ['Rutile', 'a = b (Å)']  |  ['Computational', 'GGA [26]']  |
|   -   | ['Rutile', 'a = b (Å)']  |   ['Computational', 'HF [27]']  |
| 4.594 | ['Rutile', 'a = b (Å)']  |  ['Experimental', 'Expt. [23]'] |
|  2.99 |   ['Rutile', 'c (Å)']    | ['Computational', 'This study'] |
|  2.97 |   ['Rutile', 'c (Å)']    |  ['Computational', 'GGA [25]']  |
|  2.98 |   ['Rutile', 'c (Å)']    |  ['Computational', 'GGA [26]']  |
|   -   |   ['Rutile', 'c (Å)']    |   ['Computational', 'HF [27]']  |
| 2.958 |   ['Rutile', 'c (Å)']    |  ['Experimental', 'Expt. [23]'] |
| 0.305 |     ['Rutile', 'u']      | ['Computational', 'This study'] |
| 0.305 |     ['Rutile', 'u']      |  ['Computational', 'GGA [25]']  |
| 0.305 |     ['Rutile', 'u']      |  ['Computational', 'GGA [26]']  |
|   -   |     ['Rutile', 'u']      |   ['Computational', 'HF [27]']  |
| 0.305 |     ['Rutile', 'u']      |  ['Experimental', 'Expt. [23]'] |
|  3.83 | ['Anatase', 'a = b (Å)'] | ['Computational', 'This study'] |
|  3.80 | ['Anatase', 'a = b (Å)'] |  ['Computational', 'GGA [25]']  |
|   -   | ['Anatase', 'a = b (Å)'] |  ['Computational', 'GGA [26]']  |
|  3.76 | ['Anatase', 'a = b (Å)'] |   ['Computational', 'HF [27]']  |
| 3.785 | ['Anatase', 'a = b (Å)'] |  ['Experimental', 'Expt. [23]'] |
|  9.62 |   ['Anatase', 'c (Å)']   | ['Computational', 'This study'] |
|  9.67 |   ['Anatase', 'c (Å)']   |  ['Computational', 'GGA [25]']  |
|   -   |   ['Anatase', 'c (Å)']   |  ['Computational', 'GGA [26]']  |
|  9.85 |   ['Anatase', 'c (Å)']   |   ['Computational', 'HF [27]']  |
| 9.514 |   ['Anatase', 'c (Å)']   |  ['Experimental', 'Expt. [23]'] |
| 0.208 |     ['Anatase', 'u']     | ['Computational', 'This study'] |
| 0.207 |     ['Anatase', 'u']     |  ['Computational', 'GGA [25]']  |
|   -   |     ['Anatase', 'u']     |  ['Computational', 'GGA [26]']  |
| 0.202 |     ['Anatase', 'u']     |   ['Computational', 'HF [27]']  |
| 0.207 |     ['Anatase', 'u']     |  ['Experimental', 'Expt. [23]'] |
+-------+--------------------------+---------------------------------+
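The effect of the transposition on the category table can be mimicked with plain lists. This is a rough sketch on a hand-copied excerpt, assuming a table with two rows and two columns. The names n_rows, n_cols, swapped and transposed are made up for this example, and the code does not call or reproduce TableDataExtractor's own transpose():

```python
# Excerpt of the original (untransposed) category table: 2 rows x 2 columns.
category_table = [
    ['4.64', ['Computational', 'This study'], ['Rutile', 'a = b (Å)']],
    ['2.99', ['Computational', 'This study'], ['Rutile', 'c (Å)']],
    ['4.67', ['Computational', 'GGA [25]'], ['Rutile', 'a = b (Å)']],
    ['2.97', ['Computational', 'GGA [25]'], ['Rutile', 'c (Å)']],
]
n_rows, n_cols = 2, 2

# Step 1: interchange the row and column categories of every entry.
swapped = [[data, col_cats, row_cats] for data, row_cats, col_cats in category_table]

# Step 2: walk the original table column by column instead of row by row.
# The entry for original cell (i, j) sits at position i * n_cols + j.
transposed = [swapped[i * n_cols + j] for j in range(n_cols) for i in range(n_rows)]

for entry in transposed:
    print(entry)
```

The resulting order (4.64, 4.67, 2.99, 2.97) matches the first entries of each category group in the transposed output above.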
Output & Pandas
- as csv file
- as Pandas DataFrame
To store the table as .csv:
[7]:
table.to_csv('./saved_table.csv')
The table can also be converted to a Pandas DataFrame object:
[8]:
import pandas
df = table.to_pandas()
df
[8]:
| | | Computational | | | | Experimental |
|---|---|---|---|---|---|---|
| | | This study | GGA [25] | GGA [26] | HF [27] | Expt. [23] |
| Rutile | a = b (Å) | 4.64 | 4.67 | 4.63 | - | 4.594 |
| | c (Å) | 2.99 | 2.97 | 2.98 | - | 2.958 |
| | u | 0.305 | 0.305 | 0.305 | - | 0.305 |
| Anatase | a = b (Å) | 3.83 | 3.80 | - | 3.76 | 3.785 |
| | c (Å) | 9.62 | 9.67 | - | 9.85 | 9.514 |
| | u | 0.208 | 0.207 | - | 0.202 | 0.207 |
We can now use all the powerful features of Pandas to interpret the content of the table. Let's say that we are interested in the experimental values for ‘Anatase’:
[9]:
df.loc['Anatase','Experimental']
[9]:
| | Expt. [23] |
|---|---|
| a = b (Å) | 3.785 |
| c (Å) | 9.514 |
| u | 0.207 |
The most powerful feature of TableDataExtractor is that it will automatically create a MultiIndex for the Pandas DataFrame, which would traditionally be done by hand for every individual table.
[10]:
print(df.index)
print(df.columns)
MultiIndex(levels=[['Anatase', 'Rutile'], ['a = b (Å)', 'c (Å)', 'u']],
           labels=[[1, 1, 1, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
MultiIndex(levels=[['Computational', 'Experimental'], ['Expt. [23]', 'GGA [25]', 'GGA [26]', 'HF [27]', 'This study']],
           labels=[[0, 0, 0, 0, 1], [4, 1, 2, 3, 0]])
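For comparison, this is roughly what building such a DataFrame by hand looks like. The sketch uses only standard Pandas calls (pd.MultiIndex.from_arrays and the DataFrame constructor) on a reduced, hand-typed copy of the table with three of the five columns. The name manual_df is an assumption for this example, and the cell values are kept as strings, as in the category table:

```python
import pandas as pd

# Header labels typed out manually, as one would without TableDataExtractor.
index = pd.MultiIndex.from_arrays([
    ['Rutile', 'Rutile', 'Rutile', 'Anatase', 'Anatase', 'Anatase'],
    ['a = b (Å)', 'c (Å)', 'u', 'a = b (Å)', 'c (Å)', 'u'],
])
columns = pd.MultiIndex.from_arrays([
    ['Computational', 'Computational', 'Experimental'],
    ['This study', 'GGA [25]', 'Expt. [23]'],
])

# Data cells, one inner list per row of the transposed table.
values = [
    ['4.64', '4.67', '4.594'],
    ['2.99', '2.97', '2.958'],
    ['0.305', '0.305', '0.305'],
    ['3.83', '3.80', '3.785'],
    ['9.62', '9.67', '9.514'],
    ['0.208', '0.207', '0.207'],
]
manual_df = pd.DataFrame(values, index=index, columns=columns)

# The same selection as before now works on the hand-built frame.
print(manual_df.loc['Anatase', 'Experimental'])
```

Every header label has to be repeated and aligned manually here, which is the step that table.to_pandas() takes care of.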
Or, we might be interested in only the ‘c (Å)’ values from the table. Here, ilevel_1 refers to level 1 of the row index, which contains a = b (Å), c (Å) and u:
[11]:
df.query('ilevel_1 == "c (Å)"')
[11]:
| | | Computational | | | | Experimental |
|---|---|---|---|---|---|---|
| | | This study | GGA [25] | GGA [26] | HF [27] | Expt. [23] |
| Rutile | c (Å) | 2.99 | 2.97 | 2.98 | - | 2.958 |
| Anatase | c (Å) | 9.62 | 9.67 | - | 9.85 | 9.514 |
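A similar selection can be written with DataFrame.xs, and the result can then be converted to numbers with pd.to_numeric. The sketch below is an illustration on a small hand-built frame, not on the output of table.to_pandas(). The name demo_df is an assumption, and the example assumes that the cells are strings with '-' marking missing values, as in the category table:

```python
import pandas as pd

# Reduced, hand-typed stand-in for the DataFrame shown above.
index = pd.MultiIndex.from_arrays([
    ['Rutile', 'Rutile', 'Anatase', 'Anatase'],
    ['a = b (Å)', 'c (Å)', 'a = b (Å)', 'c (Å)'],
])
columns = pd.MultiIndex.from_arrays([
    ['Computational', 'Computational', 'Experimental'],
    ['GGA [26]', 'HF [27]', 'Expt. [23]'],
])
demo_df = pd.DataFrame(
    [['4.63', '-', '4.594'],
     ['2.98', '-', '2.958'],
     ['-', '3.76', '3.785'],
     ['-', '9.85', '9.514']],
    index=index, columns=columns,
)

# Select every row whose second index level is 'c (Å)', keeping both levels.
c_values = demo_df.xs('c (Å)', level=1, drop_level=False)

# Convert the strings to floats; '-' cannot be parsed and becomes NaN.
c_numeric = c_values.apply(pd.to_numeric, errors='coerce')
print(c_numeric)
```

After the conversion, numerical operations such as differences between computed and experimental values can be applied directly, with missing entries propagating as NaN.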