# Why PDF processing is difficult
This article is a work in progress and is not yet complete.
# Extract financial transactions from grids
- Take one example: finding a table of financial transactions and extracting each cell, row by row.
- PDF has no concept of a table, only text, bitmaps, and vectors. A table is created by aligning text boxes into position.
- Complications include multi-line rows, text boxes that span multiple columns, multiple boxes in a single column, footnotes and sidenotes to ignore, tables that span multiple pages, ...
- Contrast this with HTML, which has a semantic structure for a table, named `<table>`.
- You can easily find the beginning and end of the table and each row inside it (`<tr>`).
- Statement quirks
- e.g. [Capitec stamp removal](file:///spike/v9/priv/lib/core/pdf/data/capitec-stamp/algorithm.md)
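
To make the "aligning text boxes into position" point concrete, here is a minimal sketch of how rows might be recovered from positioned text alone. It assumes a PDF extractor has already produced text boxes with coordinates. The `TextBox` shape, the `group_into_rows` helper, and the `y_tolerance` value are all hypothetical, chosen for illustration, and not taken from any particular library or from the code base linked above.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    """A positioned text fragment (hypothetical shape, not a real library type)."""
    x: float  # left edge of the box
    y: float  # vertical position, measured from the top of the page
    text: str


def group_into_rows(boxes, y_tolerance=2.0):
    """Group boxes into rows by vertical position, then order cells left to right.

    A box joins the current row if its y is within y_tolerance of the
    first box in that row; otherwise it starts a new row.
    """
    rows = []
    for box in sorted(boxes, key=lambda b: (b.y, b.x)):
        if rows and abs(box.y - rows[-1][0].y) <= y_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [[b.text for b in sorted(row, key=lambda b: b.x)] for row in rows]


# Boxes arrive in arbitrary order with slightly different y values,
# which is typical of text that only *looks* like a table.
boxes = [
    TextBox(300, 100.4, "-45.00"),
    TextBox(50, 100.0, "2024-01-03"),
    TextBox(120, 100.2, "Grocery store"),
    TextBox(50, 115.0, "2024-01-04"),
    TextBox(120, 115.1, "Salary"),
    TextBox(300, 114.9, "1200.00"),
]
rows = group_into_rows(boxes)
for row in rows:
    print(row)
```

Even this toy version depends on a hand-picked tolerance, and it would misread the multi-line rows and column-spanning boxes listed above. With HTML, the same structure is stated explicitly in the markup.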
# General PDF processing complexities
- For a much more in-depth look at the weird things you can find in PDFs, see this excellent article: