In 2016, there were 39.2 million daily card transactions in the United Kingdom alone. By 2026, that number is expected to grow to 60 million per day, or roughly 700 transactions every second. But as card transactions grow, so does fraud. According to a 2019 report, fraud losses amounted to roughly £2.5 billion in 2018. The good news is that financial institutions are constantly investing in robust processes and early-alert systems to sniff out fraudulent credit card transactions. In 2018 alone, leading credit card companies detected and prevented almost two-thirds of all fraudulent attempts: £2 in every £3 of attempted fraud was stopped.
At the same time, as financial institutions and credit card companies benefit from credit and debit card growth, they are adopting newer technologies such as near-field communication (NFC) and tap-to-pay to enhance the user experience (UX). However, such technologies also make financial instruments more vulnerable to fraud. Typically, fraudulent transactions are authorized when card holders are tricked into making payments to a fraudster, or when the authorization is carried out by the fraudster without the card holder's knowledge. This puts financial institutions and merchants at risk, as they are responsible for offsetting most of the money lost to fraud.
Financial institutions are therefore always looking for reliable solutions to detect fraud.
But identifying a handful of fraudulent transactions among several thousand genuine ones is a challenge. On the face of it, it might seem like a simple classification problem that machine learning can solve. However, this is not the case: machine learning models learn from training data, and their performance is sensitive to the number of training samples available for each class, genuine and fraudulent alike. In credit card transactions, this balance does not exist, causing standard machine learning techniques to fail.
Such class imbalance problems are not new. For call centers using audio data for machine learning, it is a challenge to differentiate a delighted customer from an annoyed one, because the audio data contains a far greater proportion of annoyed (dissatisfied) customers and very few segments corresponding to delighted (happy) ones. Similarly, class imbalance appears in medical diagnostics. For instance, building a machine learning system to identify COVID-19 from chest X-ray images is hard because the training data contains far fewer samples from COVID-19-positive patients than from healthy ones.
Importance of Class Balance in Machine Learning
Imbalance in training data biases machine learning models toward the majority class, such as genuine credit card transactions, at the cost of the minority class, such as fraudulent transactions. This happens simply because the majority class dominates the learning samples. It is a known challenge, commonly addressed by rebalancing the training data so that the two classes contribute equal numbers of samples: either by reducing the number of majority samples (under-sampling) or by artificially increasing the number of minority samples using some heuristic (over-sampling). But is there a better way?
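The two conventional rebalancing strategies mentioned above can be illustrated with a minimal sketch. The dataset sizes and feature values here are toy assumptions, not figures from any real fraud dataset:

```python
import random

random.seed(0)

# Toy imbalanced dataset: 1000 genuine (label 0) and 20 fraudulent (label 1)
# samples, each with a single synthetic feature.
majority = [([random.gauss(0, 1)], 0) for _ in range(1000)]
minority = [([random.gauss(3, 1)], 1) for _ in range(20)]

# Under-sampling: randomly drop majority samples down to the minority count.
balanced_under = random.sample(majority, len(minority)) + minority

# Over-sampling: randomly repeat minority samples up to the majority count.
balanced_over = majority + random.choices(minority, k=len(majority))

print(len(balanced_under), len(balanced_over))  # 40 2000
```

Under-sampling discards potentially useful majority data, while naive over-sampling merely duplicates minority points, which is why heuristics that synthesize new minority samples are often preferred in practice.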
What about training machines the way we teach the English alphabet? Children are often confused by similar-looking letters such as “b” and “d” and take a while to register the differences. What if, instead of teaching one letter at a time, as is the convention in school, the teaching involved showing two letters together, say “b” and “d”? The child would learn both the similarities and the differences between the two letters, and this might help the child intuitively understand the distinction in both concept and application.
A Novel Data Representation for Effective Learning in Class Imbalanced Scenarios
In contrast to most existing methods for class imbalance, which rely on over-sampling the minority class or under-sampling the majority class to balance the training data, we propose a novel data representation that mimics how letters of the alphabet can be learned: two samples are presented simultaneously to train the classifier. We call this Simultaneous Two-Sample Learning, or s2s-learning for short.
Keeping the classifier the same to allow a fair comparison, we can compare training on two samples simultaneously against training on a single sample. Our studies show that recognizing fraudulent transactions using two samples simultaneously achieves an absolute improvement of 12.4% over techniques that train on a single sample. This novel data representation outperforms conventional machine learning systems, especially when the imbalance in the training data is significant. Although s2s-learning requires a small change in the machine learning architecture to accommodate two simultaneous samples instead of one, it introduces no significant change in computational time or memory requirements during training. Experimental results on several highly imbalanced benchmark data-sets also demonstrate that simultaneously learning two samples at a time is more beneficial than the conventional method of learning a single sample at a time.
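One way such two-sample training instances could be constructed is sketched below. This is a hypothetical illustration of the general idea, not the exact formulation used in our experiments: each training instance concatenates a minority sample with a majority sample, labeled by the ordered class pair, so every fraudulent sample is seen many times, each time next to a different genuine one:

```python
import itertools
import random

random.seed(1)

# Toy data: 6 genuine (majority) and 2 fraudulent (minority) samples,
# each with a single synthetic feature.
genuine = [[random.gauss(0, 1)] for _ in range(6)]
fraud = [[random.gauss(3, 1)] for _ in range(2)]

# Build two-sample instances: concatenated features plus an ordered
# class-pair label, in both orders, so the classifier learns from
# (genuine, fraud) and (fraud, genuine) presentations alike.
pairs = []
for g, f in itertools.product(genuine, fraud):
    pairs.append((g + f, (0, 1)))
    pairs.append((f + g, (1, 0)))

print(len(pairs))  # 6 * 2 * 2 = 24
```

Note how the pairing itself softens the imbalance: each of the 2 fraudulent samples appears in 12 of the 24 training instances, without duplicating or discarding any raw data.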
We believe this will make a difference to the outcome of an algorithm and thereby help correct data imbalances. In other words, a simple and creative approach to solving a complex problem.