# Why PDF processing is difficult

This article is a work in progress and is not yet complete.

# Extracting financial transactions from grids

Take one example: finding a table of financial transactions and extracting each cell, row by row.
PDF has no concept of a table, only text, bitmaps, and vectors. What looks like a table is created by aligning text boxes into position.
A row can wrap over multiple lines, a text box can span multiple columns, a single column can hold multiple boxes, footnotes and sidenotes have to be ignored, the table can span multiple pages, and so on.
Contrast this with HTML, which has a semantic structure for a table, named `<table>`: you can easily find the beginning and end of the table, and each row inside it is marked by `<tr>`.
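
To make the alignment problem concrete, here is a minimal sketch of rebuilding rows and columns from positioned text. It is only an illustration under stated assumptions: `TextBox`, `group_into_rows`, `assign_columns`, the tolerance values, and the sample coordinates are all hypothetical, and the text boxes are assumed to have already been pulled out of the PDF by some extraction library. It does not handle multi-line rows, boxes that span columns, footnotes, or page breaks.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    """A positioned piece of text (hypothetical type, not a library class)."""
    x: float   # left edge of the box
    y: float   # baseline, measured downwards from the top of the page
    text: str


def group_into_rows(boxes, y_tolerance=2.0):
    """Cluster boxes whose baseline is within y_tolerance of the row's first box."""
    rows = []
    for box in sorted(boxes, key=lambda b: (b.y, b.x)):
        if rows and abs(box.y - rows[-1][0].y) <= y_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    return rows


def assign_columns(row, column_starts, x_slack=1.0):
    """Put each box in the right-most column that starts at or before its left edge."""
    cells = [""] * len(column_starts)
    for box in sorted(row, key=lambda b: b.x):
        index = max(
            (i for i, start in enumerate(column_starts) if box.x >= start - x_slack),
            default=0,
        )
        # Several boxes can land in one column; join them with a space.
        cells[index] = (cells[index] + " " + box.text).strip()
    return cells


# Made-up sample data: two transaction rows with slightly uneven baselines.
boxes = [
    TextBox(40, 100, "01/03"), TextBox(120, 100.4, "Card purchase"), TextBox(400, 100, "-25.00"),
    TextBox(40, 114, "02/03"), TextBox(120, 114, "Salary"), TextBox(400, 113.8, "1500.00"),
]
column_starts = [40, 120, 400]  # assumed known, e.g. measured from the header row
table = [assign_columns(row, column_starts) for row in group_into_rows(boxes)]
for cells in table:
    print(cells)
```

Even in this toy version, the row tolerance and the column boundaries have to be guessed or measured per document, which is exactly the information a `<table>` and `<tr>` would have given for free.
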

Statements also have their own quirks, e.g.:
- [Capitec stamp removal](file:///spike/v9/priv/lib/core/pdf/data/capitec-stamp/algorithm.md)

For a much more in-depth look at the weird things you can find in PDFs, see this excellent article:
- Why are pdfs difficult for a machine to process