# Authenticity
With the proliferation of online pdf editing tools, it is quite easy to fraudulently modify a pdf. It is very important for lenders to be able to detect this kind of modification, yet it can be impossible to detect by manual inspection.
We have a feature called `authenticity` which allows you to distinguish modified statements from original ones. This page describes the feature in detail.
# Fraudulent modification example
In the example clip below we see how easy it is to modify a pdf using a freely available online pdf editor. In the clip we add a 1 in front of the income amount to change it from R 42,000 to R 142,000!
# How to establish whether a pdf is authentic
# Meta data
Every pdf contains some additional information that indicates which software was used to create and/or modify the pdf. This additional information is called meta data. You can see some of the meta data fields (PDF Producer & PDF Version) in Acrobat Reader in the picture below:
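Outside of a pdf reader, these fields can also be inspected programmatically. The following is a minimal Python sketch using only the standard library. It assumes an unencrypted pdf whose document information dictionary is stored uncompressed and uses simple literal strings; many real-world pdfs need a proper pdf parser instead. The function name `read_meta_data` and the sample values are illustrative and are not part of the Spike API.

```python
import re


def read_meta_data(pdf_bytes: bytes) -> dict:
    """Pull the pdf version and a few Info-dictionary fields out of raw pdf bytes.

    Assumption: the Info dictionary is uncompressed and its values are plain
    literal strings such as /Producer (Some Software 1.0).
    """
    meta = {}
    # A pdf file starts with a header such as "%PDF-1.7".
    header = re.match(rb"%PDF-(\d\.\d)", pdf_bytes)
    if header:
        meta["version"] = header.group(1).decode("ascii")
    for field in (b"Producer", b"Creator"):
        # Only handles strings without nested parentheses or escape sequences.
        found = re.search(rb"/" + field + rb"\s*\(([^()\\]*)\)", pdf_bytes)
        if found:
            meta[field.decode("ascii").lower()] = found.group(1).decode("latin-1")
    return meta


# Hypothetical pdf fragment, invented for illustration.
sample = (
    b"%PDF-1.4\n1 0 obj\n"
    b"<< /Producer (Example Statement Engine 2.1) /Creator (Example Bank) >>\n"
    b"endobj\n"
)
print(read_meta_data(sample))
```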
# Unmodified statements aka originals
When we receive a pdf we can use the meta data to check whether it is the original pdf that was produced by the bank, or whether it has been modified by some other pdf software. Note that this is complicated slightly by a number of factors:
- First, there are many ways to obtain statements:
  - some banks have multiple ways of obtaining statements, e.g. download from internet banking vs received in monthly email vs access via banking app
  - there can be multiple formats for each of these methods, e.g. the ABSA site allows you to (1) generate an e-statement on the fly from the transactions history screen, but also allows you to (2) download "archived statements"
  - each of these methods may have different meta data
- Second, banks change their statement formats from time to time, e.g. to give the statement a more modern look, or to display new information like COVID alerts.
- Third, in some cases the underlying technology that is used by the bank to generate the statement is changed, i.e. the statement may look the same to you, but the meta data has changed.
- Finally, not all statements are procured in a manner which keeps meta data consistent: see the Tables section.
So it's not as simple as checking the meta data on the first statement which we receive and then assuming that all originals will have the same meta data. In fact it takes quite a bit of effort for Spike to keep our list of original meta data up to date for each statement type.
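One way to picture such a list is as a lookup from statement type to every meta data combination that has been seen on a genuine original. The sketch below is purely illustrative: the statement type names, the meta data values and the `is_original` helper are all invented, and it does not describe how Spike actually stores this information.

```python
# Hypothetical data: each statement type maps to the set of
# (producer, pdf version) pairs observed on genuine originals.
KNOWN_ORIGINALS = {
    "EXAMPLE_BANK_INTERNET_BANKING": {
        ("Example Statement Engine 2.1", "1.4"),
        ("Example Statement Engine 3.0", "1.7"),  # after a technology change
    },
    "EXAMPLE_BANK_MONTHLY_EMAIL": {
        ("Example Mailer PDF", "1.5"),
    },
}


def is_original(statement_type: str, producer: str, version: str) -> bool:
    """Return True if this meta data is on the list of known originals."""
    return (producer, version) in KNOWN_ORIGINALS.get(statement_type, set())
```

Because one statement type can have several valid entries, a new bank format or generator means adding an entry rather than replacing one.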
# You can't simply reject all non-original statements
Beware the phenomenon of the innocently modified pdf! There are a number of ways in which the meta data can be modified from the original, even when the user hasn't explicitly tried to change the pdf. The following is a typical scenario:
- the user downloads their statement from their bank's website
- the pdf opens in their browser or pdf reader like Acrobat Reader
- the user clicks File > Save as in the pdf reader in order to save the file somewhere else on their computer - this inadvertently changes the meta data!
- they then send this innocently modified pdf to you.
Roughly 80% of the pdfs which we process are original. What about the remaining 20%?
We don't want to slow down your back-office process by demanding that you re-request a pdf from a user when it has been innocently modified. However, it is not so easy to distinguish the innocent modifications from the malicious ones.
# Solution = the Spike authenticity feature
# The Spike pdf metadata database
When trying to work out whether you should accept a non-original pdf, or ask the user to resend an unmodified pdf, the key question to consider is: "what capabilities does the software which was used to create the pdf have?" Crucially:
- was it a pdf editor, i.e. software with write capabilities? This could have been used to modify the transactions.
- or was it a reader, i.e. software with read capabilities only? This couldn't have been used to edit the pdf.
In order to make this call Spike maintains a meta data database - listing all pdf software which we've encountered to date, and what category of software it falls into. The category determines whether the pdf could have been edited or not. As you can imagine keeping this database up to date requires a lot of work on our side - for every pdf that we receive with non-matching meta data we have to research what software was used to create the pdf, and then make a judgement call as to whether this software can edit pdfs or not.
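The idea of matching meta data against a database of known software can be sketched as a small rule list. This is a simplified illustration only: the software names and patterns are invented, the real database is maintained by Spike and is not shown here, and treating an empty producer string as `blank` is a simplifying assumption. The returned values reuse the category names from the table of pdf software categories.

```python
import re

# Hypothetical rules: (pattern matched against the producer string, category).
RULES = [
    (re.compile(r"example statement engine", re.IGNORECASE), "original"),
    (re.compile(r"example reader", re.IGNORECASE), "reader"),
    (re.compile(r"example editor", re.IGNORECASE), "writer"),
]


def categorise(producer: str) -> str:
    """Map a producer string to a software category using the rule list."""
    if not producer.strip():
        return "blank"
    matched = [category for pattern, category in RULES if pattern.search(producer)]
    if not matched:
        return "unmatched"  # no rule yet for this software
    if len(matched) > 1:
        return "multiple"  # ambiguous, more than one rule matched
    return matched[0]


print(categorise("Example Reader 11"))
```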
Amongst all of the pdf software that we've seen to date we've identified a handful of pdf software categories. These categories are what you see in the `authenticity` field for any processed pdf. See the table for the full list of categories, and a guideline for the level of caution which you should apply when you observe each category of pdf.
Clearly, if a pdf has been saved by a pdf writer then you want to scrutinize it more closely, and possibly reject the pdf and ask for the original. It's easy for a user to make fraudulent changes to their statements with a pdf writer (see the fraudulent modification example above).
We would give this a red caution level.
# What does Spike do when a modified pdf is detected? Does it stop processing the pdf?
No: we will always attempt to extract transactions.
What we do is the following:
- we will attempt to extract all transactions from a pdf regardless of whether it has been modified or not
- errors:
  - some pdf modifications so fundamentally change the structure of the pdf that we fail to extract the transactions
  - these will come back as processing errors
  - note: see the guide on how to identify modified statements
  - basically, if it errors, you should be suspicious of the pdf
- success: assuming that we successfully extract the transactions from the pdf, we will display an `authenticity` field for the pdf. Consult the table for guidelines on the level of caution which you should apply when you observe this category of pdf.
# How Spike displays pdf authenticity
If your pricing plan includes the `authenticity` feature then you'll see this field in the pdf converter.
Authenticity is also surfaced in the JSON returned by the API (see the `authenticity` field on line 39 of the example below):
```json
{
  "parser": "ABSA_ACTIVESAVE_ALL_0",
  "statement": {
    "bank": "ABS",
    "accountNumber": "9217334673",
    "statementNumber": "00",
    "dates": {
      "issuedOn": "2017-05-13T00:00:00.000Z",
      "from": "2017-04-01T00:00:00.000Z",
      "to": "2017-04-30T00:00:00.000Z"
    },
    "nameAddress": ["MR I COPELYN", "20 SYDNEY STREET", "GREEN POINT", "8005"]
  },
  "transactions": [
    {
      "date": "2017-04-01T00:00:00.000Z",
      "description": ["Balance brought forward"],
      "balance": 61769.24,
      "id": 1
    },
    {
      "date": "2017-04-01T00:00:00.000Z",
      "description": ["Direct debit", "Afrihost a11220426 e69d3vc"],
      "amount": -549,
      "balance": 61220.24,
      "id": 2
    },
    {
      "date": "2017-04-01T00:00:00.000Z",
      "description": ["Direct debit", "M-choice m-choice45149223"],
      "amount": -874,
      "balance": 60346.24,
      "id": 3
    }
    // ...
  ],
  "valid": true,
  "breaks": [],
  "authenticity": "original"
}
```
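As a rough illustration of consuming this response, the Python sketch below parses a trimmed-down copy of the JSON above and reads the `authenticity` field. How you obtain the response text (endpoint, authentication) is not covered here. The use of `.get()` reflects an assumption that the field may be absent when your plan does not include the feature; check your own responses to confirm.

```python
import json

# A trimmed-down response with the same shape as the example above.
response_text = """
{
  "parser": "ABSA_ACTIVESAVE_ALL_0",
  "transactions": [
    {
      "date": "2017-04-01T00:00:00.000Z",
      "description": ["Balance brought forward"],
      "balance": 61769.24,
      "id": 1
    }
  ],
  "valid": true,
  "breaks": [],
  "authenticity": "original"
}
"""

result = json.loads(response_text)

# Assumption: the field may be missing if the feature is not on your plan.
authenticity = result.get("authenticity")

print(f"parser={result['parser']} "
      f"transactions={len(result['transactions'])} "
      f"authenticity={authenticity}")
```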
# Tables
# List of pdf software categories
category | caution level | description |
---|---|---|
original | green | this is the original unmodified statement as produced by the bank |
library | green | the meta data identifies a pdf library which is not currently in use by any original pdf pipeline that we know of (i.e. probably an original pipeline, but for a pdf which is not part of our statement library) |
reader | green | the pdf has been modified, but it's probably safe - i.e. "innocently modified" - the pdf software that modified the file does not offer pdf editing capabilities |
browser | green | the pdf has been modified, but it's probably safe - i.e. has been opened in a browser and then file-save-as'ed. Browsers do not offer pdf editing capabilities. |
printToPdf | orange | the user used a program which supports a Print option, and then used a printer driver which does print-to-pdf. It is possible that the file could have been edited in this case. This does occur routinely with some statement types though - e.g. most nedbank statements are Microsoft:PrintToPDF |
scan | orange | the statement was likely printed and then scanned. This does not indicate whether the data has been modified, however in most cases the scan will not be successfully parsed by our software anyway. |
writer | red | the meta data indicates that this file was modified by a pdf writer. The transactions may have been edited. |
blank | red | means that all the meta.info fields are blank. This probably indicates tampering - all legitimate pdf software writes metadata. |
unmatched | - | the meta data didn't match any rules in our database - we will manually review this in due course and update our database |
unknown | - | returned when Spike can't access the meta data - e.g. when the pdf is password protected and you don't supply the correct password |
multiple | - | the meta data matched multiple rules in our database - we will manually review this in due course and update our database (this should happen infrequently) |
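
If you want to act on the `authenticity` value in your own back-office code, the table above can be expressed as a simple lookup. The caution levels below are copied from the table; the `suggested_action` function and the actions it returns are a hypothetical policy for illustration, not a Spike recommendation, and should be adapted to your own risk appetite.

```python
# Caution levels as listed in the table above.
CAUTION_LEVELS = {
    "original": "green",
    "library": "green",
    "reader": "green",
    "browser": "green",
    "printToPdf": "orange",
    "scan": "orange",
    "writer": "red",
    "blank": "red",
    # The table gives no caution level ("-") for these three categories.
    "unmatched": None,
    "unknown": None,
    "multiple": None,
}


def suggested_action(authenticity: str) -> str:
    """Hypothetical policy mapping a category to a back-office action."""
    level = CAUTION_LEVELS.get(authenticity)
    if level == "green":
        return "accept"
    if level == "orange":
        return "review"
    if level == "red":
        return "scrutinise or request original"
    # No caution level, or a category this lookup does not know about.
    return "manual review"


print(suggested_action("writer"))
```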
# Statements with no "original" form
Some statements are browser-generated, i.e. created entirely by client-side code that is running on a bank's internet banking website. In this case there is no `original` form - statements will have a variety of meta data. These types of statement have the highest error rate because the internal structure of the pdf varies greatly depending on the browser that was used. They are also poorly suited to modification detection: the statement is much easier to modify in the browser, and there is then no automated way for us to detect that it has been modified.
A slight variant on this is where the website opens the pdf within the browser. In this case the user is presented with an interface with 2 "save" buttons - one is a download button, the other is actually a print button. In the former case you wind up with a pdf that has an "original" form; in the latter you get a pdf with browser meta data (see example).
Thankfully we handle both of these cases for you. We know which statements are browser generated and print-to-pdf'd and have a big database of browser / print-to-pdf metadata. So we can give you the best guess as to whether the statement is in its original form versus modified.