March 24, 2021

Paying duplicate invoices is a serious problem for any business and can cause large financial losses. Detecting duplicate entries among a large number of invoices requires manual effort and is usually very tedious. On average, around 0.05% to 0.1% of invoices are paid as duplicates, which amounts to a huge loss when aggregated over a year or more. Duplicate invoice entries happen for the following reasons:

  • The same invoice is sent more than once for processing because of delayed payment.

  • The same invoice arrives with minute changes because of incorrect character recognition by OCR; for instance, an 'O' is replaced by a '0', or an 'I' is replaced by a pipe (|).

  • Unintentional data-entry mistakes made while entering hundreds of invoices per day.

  • Payments from multiple source documents; for example, the same invoice is submitted in both PO and non-PO formats for faster processing.

  • Fraudulent activities such as trying to claim reimbursements multiple times.

Most organizations still depend on manual effort to identify duplicate invoices and stop financial leakage.

Can we automate the detection of such invoices? The answer is yes. We can achieve it with the help of machine learning (ML).

Several methodologies can automate duplicate invoice detection, such as a static rule-driven method or an AI-driven approach based on Principal Component Analysis (PCA). We will talk about PCA here.

PCA mechanism

With the PCA mechanism, we check not only for exact duplicates but also for near-matched invoices. Exactly matching invoices can be detected easily with database queries, but the rules in that approach are static and cover only a pre-decided set of fields. In the PCA approach, a varying number of invoice attributes can be used, and the reduced-dimensional space is recomputed each time the data changes. Thus, the PCA approach remains much more dynamic and robust.
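To make the limitation of static rules concrete, here is a minimal sketch of exact matching on a fixed set of fields. The records, field names and the `exact_duplicates` helper are assumptions made up for illustration; they do not come from any particular invoicing system.

```python
from collections import defaultdict

# Hypothetical invoice records; the field names are assumptions for illustration.
invoices = [
    {"id": 1, "vendor": "ACME", "invoice_no": "INV-1001", "amount": 250.00},
    {"id": 2, "vendor": "ACME", "invoice_no": "INV-1001", "amount": 250.00},  # exact resubmission
    {"id": 3, "vendor": "ACME", "invoice_no": "INV-10O1", "amount": 250.00},  # OCR read a '0' as 'O'
]

# The static, pre-decided set of fields that the rule compares.
KEY_FIELDS = ("vendor", "invoice_no", "amount")

def exact_duplicates(records, key_fields=KEY_FIELDS):
    """Group the ids of records that agree on every key field."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[field] for field in key_fields)
        groups[key].append(rec["id"])
    # Only groups with more than one record are duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]

duplicate_groups = exact_duplicates(invoices)
print(duplicate_groups)
```

Invoices 1 and 2 are grouped together, but invoice 3 is missed because a single misread character changes the key. Near-matches of this kind are what the PCA approach is meant to catch.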

Detecting duplicates

PCA is one of the most efficient ways to detect duplicate invoices because it takes a dataset with a large number of features and converts it into an equivalent low-dimensional representation without losing much information or variability. The compact output of PCA is then used to detect duplicate entities based on proximity matching.

For duplicate invoice detection, we need a historic dataset and a test set in which we want to detect duplicates. We then preprocess both datasets, remove unwanted features, and derive additional features that help in detecting duplicates.
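The preprocessing step could look roughly like the sketch below. The field names, the `OCR_FIXES` character mapping and the chosen features are all assumptions for illustration; a real pipeline would choose features that suit its own invoice data. The point is that the same function is applied to both the historic set and the test set.

```python
from datetime import date

# Assumed mapping of characters that OCR commonly confuses in invoice numbers.
OCR_FIXES = str.maketrans({"O": "0", "|": "I"})

def normalize_invoice_no(raw):
    """Trim, upper-case and undo the assumed OCR confusions."""
    return raw.strip().upper().translate(OCR_FIXES)

def to_features(invoice):
    """Turn one invoice dict into a numeric feature vector.

    Unwanted fields (here the hypothetical 'scan_id') are dropped simply
    by not being selected.
    """
    number = normalize_invoice_no(invoice["invoice_no"])
    digits = "".join(ch for ch in number if ch.isdigit())
    return [
        float(invoice["amount"]),
        float(invoice["invoice_date"].toordinal()),  # date as a day count
        float(digits) if digits else 0.0,            # numeric part of the number
        float(len(number)),                          # derived feature: length
    ]

historic = [
    {"invoice_no": "INV-1001", "amount": 250.0,
     "invoice_date": date(2021, 3, 1), "scan_id": "a1"},
]
test = [
    {"invoice_no": " inv-10O1", "amount": 250.0,
     "invoice_date": date(2021, 3, 1), "scan_id": "b7"},
]

# Apply the identical transformation to both datasets.
historic_X = [to_features(inv) for inv in historic]
test_X = [to_features(inv) for inv in test]
```

After normalization, the OCR-damaged number in the test invoice yields the same feature vector as its historic counterpart.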

Once the data is ready to fit the ML model, we apply PCA, which transforms the data into an n-dimensional space defined by a set of linearly uncorrelated variables called principal components. Here, ‘n’ is the number of principal components retained to capture most of the variation in the data. The number of dimensions ‘n’ can be configured at the customer level or selected automatically by setting a threshold on the percentage of variation retained (say 97%). Since all the invoices (both historic and new) are now in a reduced-dimensional vector space, we can find similar invoices and probable duplicates by calculating the distances between them. The distance between two invoices in the n-dimensional vector space can be calculated using the Euclidean distance or Pearson’s correlation distance. If the distance between a new invoice and a historic invoice is very close to zero, we can consider the new invoice a duplicate of the historic one.
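The following is a minimal sketch of this step using NumPy, with PCA computed through a singular value decomposition. The feature values, the function names and the `DISTANCE_THRESHOLD` value are assumptions for illustration. Fitting the components on the historic set only and then projecting both sets is also a simplifying assumption; the components could equally be refit on the combined data whenever it changes.

```python
import numpy as np

def fit_pca(X, variance_threshold=0.97):
    """Fit PCA on X and keep enough components to reach the variance threshold."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0                      # avoid division by zero for constant features
    Z = (X - mean) / std                     # standardize so no feature dominates
    _, s, vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)          # share of variance per component
    n = int(np.searchsorted(np.cumsum(explained), variance_threshold) + 1)
    n = min(n, len(s))
    return mean, std, vt[:n]

def project(X, mean, std, components):
    """Map feature vectors into the reduced n-dimensional space."""
    return ((X - mean) / std) @ components.T

def nearest_historic(test_P, hist_P):
    """Index of, and Euclidean distance to, the closest historic invoice."""
    d = np.linalg.norm(test_P[:, None, :] - hist_P[None, :, :], axis=2)
    return d.argmin(axis=1), d.min(axis=1)

# Hypothetical numeric features: amount, day count, invoice number digits, number length.
hist_X = np.array([
    [250.0, 10, 1001, 8],
    [1200.5, 12, 1002, 8],
    [75.2, 40, 2003, 8],
    [980.0, 41, 3010, 9],
    [430.0, 70, 4100, 9],
])
test_X = np.array([
    [75.2, 40, 2003, 8],       # same features as the third historic invoice
    [5000.0, 90, 9999, 10],    # a genuinely new invoice
])

mean, std, components = fit_pca(hist_X, variance_threshold=0.97)
hist_P = project(hist_X, mean, std, components)
test_P = project(test_X, mean, std, components)

idx, dist = nearest_historic(test_P, hist_P)
DISTANCE_THRESHOLD = 0.05      # assumed tolerance for "very close to zero"
is_duplicate = dist < DISTANCE_THRESHOLD
```

Here `idx` gives, for each test invoice, the position of its closest historic invoice, and `is_duplicate` flags the ones whose distance falls under the assumed threshold. Pearson's correlation distance could be swapped in for the Euclidean norm in `nearest_historic`.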

At times, two invoices may be very similar to one another but in reality be two different invoices. With the PCA technique described above, these invoices would be classified as duplicates. To avoid such exception cases, we need to define further rules to identify the false positives (invoices wrongly identified as duplicates). With these rules in place, the results will be more accurate both in identifying the true duplicates and in separating out very similar but distinct invoices.
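The article does not specify which rules to use, so the sketch below shows one hypothetical rule purely as an illustration: a candidate pair with different invoice numbers and different billing months is treated as a recurring charge rather than a duplicate. The field names and the rule itself are assumptions.

```python
from datetime import date

def billing_period(invoice):
    """Year and month of the invoice date."""
    return (invoice["invoice_date"].year, invoice["invoice_date"].month)

def is_false_positive(candidate, historic):
    """Hypothetical rule: a different invoice number AND a different billing
    month suggests a recurring charge, not a duplicate."""
    different_number = candidate["invoice_no"] != historic["invoice_no"]
    different_period = billing_period(candidate) != billing_period(historic)
    return different_number and different_period

# Monthly rent: same vendor and amount, so the pair would sit very close
# together in the reduced space, yet these are two distinct invoices.
rent_feb = {"vendor": "LANDLORD", "invoice_no": "R-2021-02",
            "amount": 1500.0, "invoice_date": date(2021, 2, 1)}
rent_mar = {"vendor": "LANDLORD", "invoice_no": "R-2021-03",
            "amount": 1500.0, "invoice_date": date(2021, 3, 1)}

# A resubmission of the February invoice a few days later.
rent_feb_again = dict(rent_feb, invoice_date=date(2021, 2, 9))

rent_pair_is_false_positive = is_false_positive(rent_mar, rent_feb)
resubmission_is_false_positive = is_false_positive(rent_feb_again, rent_feb)
```

Candidate pairs flagged by the distance check would be passed through rules like this one, and only the pairs that no rule clears would be reported as duplicates.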


Today, businesses are expanding, and so is their data. Without data we cannot analyze our business, nor can we pinpoint the areas where financial leakage is happening on a daily basis. Manual review is reliable up to a certain point. But when it comes to millions of records, we can neither afford that many people nor afford to make mistakes. Therefore, we have to turn to automation techniques such as ML-based PCA. Using PCA, we can ramp up the efficiency of the invoicing process and remove human intervention at certain stages.

Rajarshi Ghoshal works with the Platform Solutions group at TCS. He possesses strong skills in Python, Data Science, Machine Learning and Web Services. He has been contributing to the development of embedded analytics and to the implementation and hosting of web services for predictive analytics use cases in the home-grown HRMS and Procurement platforms of TCS. He holds a bachelor’s degree in Electronics and Communication Engineering from West Bengal University of Technology.