There are many uses for anomaly detection, ranging from scientific observation to financial transaction monitoring.
An anomaly is a single observation, or a set of observations, that does not conform to the properties exhibited by the rest of the data.
Although anomalies are often considered undesirable in certain fields, they form a highly specialised subset of a system's behaviour that can provide valuable insight.
A system's health can be threatened by outliers, particularly in the domains of computer networks, security systems, and intrusion detection systems. Spikes and dips in monitored activity should always be examined, since they can indicate malicious attacks, malware initiating network scans, hosts losing connectivity, or host crashes.
Outlier detection and novelty detection are two methods used by data analysts to detect anomalies. An outlier is an observation that differs markedly from the other data points in the training dataset. A novelty, by contrast, does not appear in the training dataset but shows up in new, unseen data.
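The distinction can be sketched with scikit-learn, which supports both settings. This is a minimal illustration on synthetic data (the cluster parameters and sample sizes below are arbitrary assumptions, not from the article): an outlier detector is fitted on data that may already contain anomalies, while a novelty detector is fitted on clean data and then applied to new points.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
# Synthetic training data: a tight cluster plus a few scattered outliers.
normal = rng.normal(loc=0.0, scale=0.5, size=(100, 2))
outliers = rng.uniform(low=-4, high=4, size=(5, 2))
X_train = np.vstack([normal, outliers])

# Outlier detection: the training set itself may contain anomalies.
iso = IsolationForest(contamination=0.05, random_state=42).fit(X_train)
train_labels = iso.predict(X_train)          # +1 = inlier, -1 = outlier

# Novelty detection: fit on clean data only, then flag unseen points.
lof = LocalOutlierFactor(novelty=True).fit(normal)
X_new = np.array([[0.1, -0.2], [5.0, 5.0]])  # one ordinary point, one novelty
new_labels = lof.predict(X_new)
```

Note that `LocalOutlierFactor` only exposes `predict` for new data when constructed with `novelty=True`, which reflects the definition above: novelties are judged against a training set assumed to be clean.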
Common Reasons for Outliers
The most common reasons for outliers are:
- Errors in the data (measurement errors, rounding errors, and transcription errors);
- Noisy data points;
- Hidden patterns in the dataset (fraud and attacks).
In other words, outlier processing depends both on the data and on the domain. Noisy data points should be filtered out, and errors in the data should be corrected. Some applications focus on the anomalies themselves, while others simply remove them before analysing the data in more depth.
True and False Outcomes

True positives are outcomes where the model correctly predicts the positive class (non-anomalous data classified as non-anomalous). True negatives are outcomes where the model correctly predicts the negative class (anomalous data classified as anomalous). False positives occur when the model incorrectly predicts the positive class (anomalous data classified as non-anomalous), and false negatives occur when the model incorrectly predicts the negative class (non-anomalous data classified as anomalous).

If there is no prior knowledge of anomalies and their patterns, unsupervised anomaly detection makes sense. In this case, one can use Isolation Forests, One-Class SVM, or k-means. The main goal is to separate the observations into clusters and to analyse the structure and size of those clusters. Outlier detection methods can be tested on various open datasets, for example, Outlier Detection DataSets.
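The four outcomes can be computed directly from an unsupervised detector's predictions when ground-truth labels are available for evaluation. This sketch (synthetic data and parameter choices are assumptions) uses a One-Class SVM and follows the scikit-learn convention of +1 for normal and -1 for anomalous points; with the article's convention of "positive = non-anomalous", the confusion matrix unravels as shown.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.normal(size=(200, 2))                   # mostly normal activity
X_test = np.vstack([rng.normal(size=(20, 2)),         # 20 normal points
                    rng.uniform(4, 6, size=(5, 2))])  # 5 anomalies
y_true = np.array([1] * 20 + [-1] * 5)                # +1 = normal, -1 = anomalous

clf = OneClassSVM(nu=0.05, gamma="scale").fit(X_train)
y_pred = clf.predict(X_test)

# With labels=[-1, 1] the flattened matrix reads:
# tn (anomalous flagged as anomalous), fn (anomalous passed as normal),
# fp (normal flagged as anomalous),    tp (normal passed as normal).
tn, fn, fp, tp = confusion_matrix(y_true, y_pred, labels=[-1, 1]).ravel()
```

The `nu` parameter bounds the fraction of training points treated as outliers, so some false positives on normal data are expected by construction.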
Analyzing Isolation Forests for Unsupervised Anomaly Detection
The Isolation Forests method builds an ensemble of randomized Decision Trees and combines their results. Each Decision Tree is grown until every observation in the training sample is isolated in its own leaf; each new branch is created by selecting a random feature and a random split value.
By measuring the depths of the Decision Tree leaves, the algorithm distinguishes normal points from outliers: outliers are isolated after fewer splits and therefore end up in shallower leaves. The scikit-learn library implements this method. In some applications, anomaly detection methods are employed simply to clean data of noisy points and observation errors.
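The scikit-learn implementation mentioned above can be exercised in a few lines; the data here is a synthetic sketch (cluster and outlier coordinates are assumptions). `score_samples` reflects the leaf-depth idea: points isolated near the root of the random trees receive the lowest scores.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(1)
X = np.vstack([rng.normal(0, 1, size=(300, 2)),   # dense normal cluster
               [[8.0, 8.0], [-9.0, 7.0]]])        # two obvious outliers

forest = IsolationForest(n_estimators=100, contamination="auto",
                         random_state=1).fit(X)

# Higher scores correspond to deeper (normal) leaves; outliers score lowest.
scores = forest.score_samples(X)
labels = forest.predict(X)  # +1 = inlier, -1 = outlier
```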
In business applications such as intrusion detection and fraud detection, on the other hand, anomaly detection is the goal itself. In his project, Machine Learning Model: Python Sklearn & Keras on Education Ecosystem, Andrey demonstrates that Isolation Forests is one of the simplest and most effective methods for unsupervised anomaly detection, implemented with the Scikit-Learn library.