# Why PDF processing is difficult

This article is a work in progress and is not yet complete.

# Extracting financial transactions from grids

  • take one example - finding a table of financial transactions and extracting each cell, row by row
  • PDF has no concept of a table - only text, bitmaps and vectors. A table is created by aligning text boxes into position
  • complications: rows that wrap onto multiple lines, text boxes that span multiple columns, multiple boxes in a single column, footnotes and sidenotes to ignore, tables that span multiple pages ...
  • contrast this with HTML - which has a semantic structure for a table, named `<table>` - you can easily find the beginning and end of the table, and each row is inside a `<tr>`
  • Statement quirks
    • e.g. [Capitec stamp removal](file:///spike/v9/priv/lib/core/pdf/data/capitec-stamp/algorithm.md)
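
To make the "aligning text boxes into position" point concrete, here is a minimal sketch of rebuilding rows and cells from positioned text boxes. It is not the algorithm used in this project. `TextBox`, `group_into_rows`, `assign_columns` and `extract_table` are hypothetical names invented for illustration, and the sketch assumes a PDF library has already produced the boxes, that `y` increases down the page, and that the left edge of each column is known in advance. It deliberately ignores the harder cases listed above (multi-line rows, boxes spanning columns, footnotes, page breaks).

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    """Hypothetical positioned text fragment, as a PDF library might report it."""
    x: float   # left edge of the box
    y: float   # vertical position, assumed to increase down the page
    text: str


def group_into_rows(boxes, y_tolerance=2.0):
    """Cluster boxes into rows: a box joins the current row if its y is
    within y_tolerance of the first box in that row."""
    rows = []
    for box in sorted(boxes, key=lambda b: (b.y, b.x)):
        if rows and abs(box.y - rows[-1][0].y) <= y_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    return rows


def assign_columns(row, column_starts, x_slack=1.0):
    """Put each box into the right-most column whose left edge is at or
    before the box (with a little slack), joining boxes that share a cell."""
    cells = [""] * len(column_starts)
    for box in sorted(row, key=lambda b: b.x):
        index = 0
        for i, start in enumerate(column_starts):
            if box.x >= start - x_slack:
                index = i
        cells[index] = (cells[index] + " " + box.text).strip()
    return cells


def extract_table(boxes, column_starts, y_tolerance=2.0):
    """Return the table as a list of rows, each a list of cell strings."""
    rows = group_into_rows(boxes, y_tolerance)
    return [assign_columns(row, column_starts) for row in rows]


if __name__ == "__main__":
    # Made-up sample data: two transactions, with slightly misaligned baselines.
    sample = [
        TextBox(50, 100.0, "2021-07-01"),
        TextBox(150, 100.4, "Card"),
        TextBox(185, 100.4, "purchase"),
        TextBox(400, 99.8, "-120.00"),
        TextBox(50, 115.0, "2021-07-02"),
        TextBox(150, 115.0, "Salary"),
        TextBox(400, 115.0, "5000.00"),
    ]
    for cells in extract_table(sample, column_starts=[50, 150, 400]):
        print(cells)
```

Even this toy version needs two tuning parameters (`y_tolerance` and `x_slack`), and it has to be told where the columns start. In HTML the equivalent job is a walk over the `<tr>` and `<td>` elements.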

# General PDF processing complexities

Updated: 7/21/2021, 9:29:43 AM