FinTabNet is a large dataset of document images taken from the earnings reports of Fortune 500 companies.
The dataset is open-sourced by IBM Research and is freely available for download on the IBM Developer Data Asset Exchange.
This notebook can be found on Watson Studio.
In this section, we will download and extract the dataset. We will also install dependencies as needed.
For this notebook we will install two PyPI packages, pdf2image and pypdf2, as shown in the cell below.
Additionally, for all cells to work, we also need a system-wide package called poppler. It must be installed in the Watson Studio environment. Please follow the instructions here to install a custom conda package. Specifically, we need to install the conda package poppler.
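Before running the PDF cells, it can help to confirm that poppler is actually visible to the notebook kernel. The sketch below is one way to do that, assuming poppler exposes its usual `pdftoppm` and `pdfinfo` command-line tools on the `PATH`; the helper name `poppler_available` is hypothetical and not part of the original notebook.

```python
import shutil


def poppler_available():
    """Return True if the poppler command-line tools are on the PATH.

    Assumption: pdf2image shells out to `pdftoppm` and `pdfinfo`,
    so both must be discoverable for the conversion cells to work.
    """
    return all(shutil.which(tool) is not None for tool in ("pdftoppm", "pdfinfo"))


if poppler_available():
    print("poppler found")
else:
    print("poppler not found: install the conda package before continuing")
```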
!pip install pdf2image
!pip install pypdf2
Requirement already satisfied: pdf2image in /opt/conda/envs/Python-3.7-main/lib/python3.7/site-packages (1.14.0)
Requirement already satisfied: pillow in /opt/conda/envs/Python-3.7-main/lib/python3.7/site-packages (from pdf2image) (7.2.0)
Requirement already satisfied: pypdf2 in /opt/conda/envs/Python-3.7-main/lib/python3.7/site-packages (1.26.0)
# importing prerequisites
import sys
import requests
import tarfile
import json
import numpy as np
import pdf2image
from os import path
from PIL import Image
from PIL import ImageFont, ImageDraw
from glob import glob
from matplotlib import pyplot as plt
from pdf2image import convert_from_path
from PyPDF2 import PdfFileReader
from IPython.display import display, HTML
import copy
%matplotlib inline
fname = 'examples.tar.gz'
url = 'https://dax-cdn.cdn.appdomain.cloud/dax-fintabnet/1.0.0/' + fname
r = requests.get(url)
open(fname , 'wb').write(r.content)
639829
#Extracting the dataset
tar = tarfile.open(fname)
tar.extractall()
tar.close()
# Verifying the file was extracted properly
data_path = "examples/"
path.exists(data_path)
True
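The extraction step above trusts the archive contents. As an optional hardening step, the sketch below filters out any archive member whose path would land outside the destination directory. The helper `safe_members` and the directory name `extracted` are hypothetical, and the demonstration uses a tiny in-memory archive rather than the real `examples.tar.gz`.

```python
import io
import os
import tarfile


def safe_members(tar, dest="."):
    """Yield only the members whose resolved path stays inside `dest`."""
    root = os.path.realpath(dest)
    for member in tar.getmembers():
        target = os.path.realpath(os.path.join(root, member.name))
        if os.path.commonpath([root, target]) == root:
            yield member


# Build a small in-memory archive: one normal entry, one that tries to escape.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    for name in ("examples/a.txt", "../escape.txt"):
        data = b"demo"
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

# Re-open the archive and list the members that would be extracted.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    kept = [m.name for m in safe_members(tar, "extracted")]

print(kept)  # ['examples/a.txt']
```

With the real archive, the same generator could be passed as `tar.extractall(path=dest, members=safe_members(tar, dest))`.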
In this section, we visualize the annotations by overlaying them on the underlying page image. We also display the HTML structure of each table for comparison.
# Define color code
colors = [(255, 0, 0),(0, 255, 0)]
categories = ["table", "cell"]
# Function to viz the annotation
def markup(image, annotations, pdf_height):
    ''' Draws the bounding box and label of each annotation
    '''
    draw = ImageDraw.Draw(image, 'RGBA')
    for annotation in annotations:
        # PDF coordinates start at the bottom-left corner and image coordinates
        # at the top-left, so flip the y-axis. Local variables are used so that
        # the annotation is not modified and the cell can be re-run safely.
        x0, y0, x1, y1 = annotation['bbox']
        y0, y1 = pdf_height - y1, pdf_height - y0
        label = categories[annotation['category_id'] - 1]
        # Draw bbox
        draw.rectangle(
            (x0, y0, x1, y1),
            outline=colors[annotation['category_id'] - 1] + (255,),
            width=2
        )
        # Draw label
        w, h = draw.textsize(text=label)
        if y1 < h:
            # Box sits at the very top of the page: put the label at its top-right corner
            draw.rectangle((x1, y0, x1 + w, y0 + h), fill=(64, 64, 64, 255))
            draw.text((x1, y0), text=label, fill=(255, 255, 255, 255))
        else:
            # Otherwise put the label just above and to the left of the box
            draw.rectangle((x0 - w, y0 - h, x0, y0), fill=(64, 64, 64, 255))
            draw.text((x0 - w, y0 - h), text=label, fill=(255, 255, 255, 255))
    return np.array(image)
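The y-axis flip that `markup` applies to each bounding box can be looked at in isolation. The sketch below assumes, as the code above implies, that boxes are stored as `[x0, y0, x1, y1]` in PDF coordinates with the origin at the bottom-left of the page. The helper name `pdf_bbox_to_image_bbox` and the sample numbers are illustrative only; the function returns a new list rather than changing its input.

```python
def pdf_bbox_to_image_bbox(bbox, pdf_height):
    """Convert [x0, y0, x1, y1] from PDF space (origin bottom-left)
    to image space (origin top-left) by mirroring the y values."""
    x0, y0, x1, y1 = bbox
    # The old top edge (y1) becomes the new top, measured from the page top.
    return [x0, pdf_height - y1, x1, pdf_height - y0]


# Illustrative values: a 792-point-high page and a box 100 to 150 points
# above the bottom edge.
print(pdf_bbox_to_image_bbox([50, 100, 200, 150], 792))  # [50, 642, 200, 692]
```

Because the page is rendered at the same size as the PDF media box, no extra scaling is needed; if the image were rendered larger, each coordinate would also have to be multiplied by the scale factor.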
# Parse the JSON file and read all the images and labels
with open('examples/FinTabNet_1.0.0_table_example.jsonl', 'r') as fp:
    images = {}
    for line in fp:
        sample = json.loads(line)
        # Index images
        if sample['filename'] in images:
            annotations = images[sample['filename']]["annotations"]
            html = images[sample['filename']]["html"]
        else:
            annotations = []
            html = ""
        # Collect the bounding box of every non-empty cell
        for t, token in enumerate(sample["html"]["cells"]):
            if "bbox" in token:
                annotations.append({"category_id":2, "bbox": token["bbox"]})
        #Build html table
        cnt = 0
        for t, token in enumerate(sample["html"]["structure"]["tokens"]):
            html += token
            if token=="<td>":
                html += "".join(sample["html"]["cells"][cnt]["tokens"])
                cnt += 1
        # Add the bounding box of the table itself
        annotations.append({"category_id": 1, "bbox": sample["bbox"]})
        images[sample['filename']] = {'filepath': 'examples/pdf/' + sample["filename"], 'html': html, 'annotations': annotations}
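To make the HTML reconstruction step easier to follow, here is a minimal sketch of the same logic on a hand-made sample. The function name `build_html` and the toy tokens are hypothetical; only the field layout (`structure` tokens plus a list of `cells`, each with its own `tokens`) is taken from the parsing code above. Like that code, it inserts cell content only after a plain `<td>` token.

```python
def build_html(structure_tokens, cells):
    """Interleave cell contents into the structural HTML tokens."""
    parts = []
    cell_index = 0
    for token in structure_tokens:
        parts.append(token)
        if token == "<td>":
            # Each opening <td> consumes the next cell, in reading order.
            parts.append("".join(cells[cell_index]["tokens"]))
            cell_index += 1
    return "".join(parts)


# Toy sample (not taken from the dataset): one row with two cells.
toy_structure = ["<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>"]
toy_cells = [{"tokens": ["2", "0", "1", "7"]}, {"tokens": ["$", "8", "1"]}]

print(build_html(toy_structure, toy_cells))
# <tr><td>2017</td><td>$81</td></tr>
```

Collecting the pieces in a list and joining once at the end avoids repeated string concatenation, which matters little here but scales better for large tables.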
# Visualize annotations and print HTML tables
import matplotlib
matplotlib.rcParams['figure.dpi'] = 250
for i, (filename, image) in enumerate(images.items()):
    # Read the page size from the PDF so the rendered image matches PDF units
    pdf_page = PdfFileReader(open(image["filepath"], 'rb')).getPage(0)
    pdf_shape = pdf_page.mediaBox
    pdf_height = pdf_shape[3]-pdf_shape[1]
    pdf_width = pdf_shape[2]-pdf_shape[0]
    converted_images = convert_from_path(image["filepath"], size=(pdf_width, pdf_height))
    img = converted_images[0]
    print("Table HTML for page #{}".format(i))
    display(HTML(image['html']))
    plt.figure()
    plt.imshow(markup(img, image['annotations'], pdf_height))
    plt.title("Page # {}".format(i))
    plt.axis('off')
Table HTML for page #0
2017 | 2016 | 2015 | |
Minimum rentals | $2,814 | $2,394 | $2,249 |
Contingent rentals(1) | 178 | 214 | 194 |
$2,992 | $2,608 | $2,443 |
Operating Leases | Aircraftand RelatedEquipment | Facilitiesand Other | |
TotalOperatingLeases | 2018 | $398 | $2,047 |
$2,445 | 2019 | 343 | 1,887 |
2,230 | 2020 | 261 | 1,670 |
1,931 | 2021 | 203 | 1,506 |
1,709 | 2022 | 185 | 1,355 |
1,540 | Thereafter | 175 | 7,844 |
8,019 | Total | $1,565 | $16,309 |
Table HTML for page #1
Amount Reclassified from AOCI | ||||
Affected Line Item in theIncome Statement | 2017 | 2016 | 2015 | |
Amortization of retirement plans prior servicecredits, before tax | $120 | $121 | $115 | |
Salaries and employee benefits | Income tax benefit | (44) | (45) | (43) |
Provision for income taxes | AOCI reclassifications, net of tax | $76 | $76 | $72 |
2017 | 2016 | 2015 | |
Foreign currency translation gain (loss): | |||
Balance at beginning of period | $(514) | $(253) | $81 |
Translation adjustments | (171) | (261) | (334) |
Balance at end of period | (685) | (514) | (253) |
Retirement plans adjustments: | |||
Balance at beginning of period | 345 | 425 | 425 |
Prior service credit and other arising during period | 1 | (4) | 72 |
Reclassifications from AOCI | (76) | (76) | (72) |
Balance at end of period | 270 | 345 | 425 |
Accumulated other comprehensive (loss) income at end of period | $(415) | $(169) | $172 |
Table HTML for page #2
2018 | $81 |
2019 | 71 |
2020 | 55 |
2021 | 44 |
2022 | 41 |
2017 | 2016 | GrossCarryingAmount | Accumulated Amortization | Net BookValue | GrossCarryingAmount | |
Accumulated Amortization | Net BookValue | Customer relationships | $656 | $(203) | $453 | $912 |
$(156) | $756 | Technology | 54 | (26) | 28 | 123 |
(16) | 107 | Trademarks and other | 136 | (88) | 48 | 202 |
(57) | 145 | Total | $846 | $(317) | $529 | $1,237 |
2017 | 2016 | |
Accrued Salaries and Employee Benefits | ||
Salaries | $431 | $478 |
Employee benefits, including variable compensation | 781 | 804 |
Compensated absences | 702 | 690 |
$1,914 | $1,972 | |
Accrued Expenses | ||
Self-insurance accruals | $976 | $837 |
Taxes other than income taxes | 283 | 311 |
Other | 1,971 | 1,915 |
$3,230 | $3,063 |
Table HTML for page #3
(in millions, except per share amounts) | FirstQuarter | SecondQuarter | ThirdQuarter | Fourth Quarter |
2017(1) | ||||
Revenues | $14,663 | $14,931 | $14,997 | $15,728 |
Operating income | 1,264 | 1,167 | 1,025 | 1,581 |
Net income | 715 | 700 | 562 | 1,020 |
Basic earnings per common share(2) | 2.69 | 2.63 | 2.11 | 3.81 |
Diluted earnings per common share(2) | 2.65 | 2.59 | 2.07 | 3.75 |
2016(3) | ||||
Revenues | $12,279 | $12,453 | $12,654 | $12,979 |
Operating income (loss) | 1,144 | 1,137 | 864 | (68) |
Net income (loss) | 692 | 691 | 507 | (70) |
Basic earnings (loss) per common share(2) | 2.45 | 2.47 | 1.86 | (0.26) |
Diluted earnings (loss) per common share(2) | 2.42 | 2.44 | 1.84 | (0.26) |
Table HTML for page #4
2017 | 2016 | 2015 | |
Low | 3.25% | 2.75% | 4.50% |
High | 4.50 | 4.50 | 7.00 |
Weighted-average | 4.03 | 3.82 | 5.90 |
Table HTML for page #5
Aircraft andAircraft Related | Other(1) | Total | |
2018 | $1,777 | $1,440 | $3,217 |
2019 | 1,729 | 508 | 2,237 |
2020 | 1,933 | 400 | 2,333 |
2021 | 1,341 | 309 | 1,650 |
2022 | 1,276 | 198 | 1,474 |
Thereafter | 2,895 | 499 | 3,394 |
Total | $10,951 | $3,354 | $14,305 |
B767F | B777F | Total | |
2018 | 14 | 4 | 18 |
2019 | 15 | 2 | 17 |
2020 | 16 | 3 | 19 |
2021 | 10 | 3 | 13 |
2022 | 10 | 4 | 14 |
Thereafter | 6 | - | 6 |
Total | 71 | 16 | 87 |
Table HTML for page #6
2017 | Percent of Revenue 2017 | |
Revenues | $7,401 | 100.0% |
Operating expenses: | ||
Salaries and employee benefits | 2,077 | 28.1 |
Purchased transportation | 3,049 | 41.2 |
Rentals | 353 | 4.8 |
Depreciation and amortization | 239 | 3.2 |
Fuel | 225 | 3.1 |
Maintenance and repairs | 143 | 1.9 |
Intercompany charges | 17 | 0.2 |
Other | 1,214 | 16.4 |
Total operating expenses | 7,317 | 98.9% |
Operating income | $84 | |
Operating margin | 1.1% | |
Package: | ||
Average daily packages | 1,022 | |
Revenue per package (yield) | $24.77 | |
Freight: | ||
Average daily pounds | 3,608 | |
Revenue per pound (yield) | $0.56 |
Table HTML for page #7
2017 | 2016 | TotalNumber ofSharesPurchased | AveragePrice Paidper Share | TotalPurchasePrice | TotalNumber ofSharesPurchased | |
AveragePrice Paidper Share | TotalPurchasePrice | Common stock repurchases | 2,955,000 | $172.13 | $509 | 18,225,000 |
Table HTML for page #8
2017 | 2016 | |
Funded Status of Plans: | ||
Projected benefit obligation (PBO) | $29,913 | $29,602 |
Fair value of plan assets | 26,312 | 24,271 |
Funded status of the plans | $(3,601) | $(5,331) |
Cash Amounts: | ||
Cash contributions during the year | $2,115 | $726 |
Benefit payments during the year | $2,310 | $912 |
MeasurementDate | Discount Rate |
5/31/2017 | 4.08% |
5/31/2016 | 4.13 |
5/31/2015 | 4.42 |
5/31/2014 | 4.60 |
Table HTML for page #9
Net Book Value at May 31, | Range | 2017 | |
2016 | Wide-body aircraft and related equipment | 15 to 30 years | $9,103 |
$8,356 | Narrow-body and feeder aircraft and related equipment | 5 to 18 years | 3,099 |
3,180 | Package handling and ground support equipment | 3 to 30 years | 3,862 |
3,249 | Information technology | 2 to 10 years | 1,114 |
1,051 | Vehicles | 3 to 15 years | 3,400 |
3,084 | Facilities and other | 2 to 40 years | 5,403 |