iGroup Learning and iDetect for Dynamic Anomaly Detection

iGroup Learning and iDetect for Dynamic Anomaly Detection with Applications in Maritime Threat Detection

Chencheng Cai, Rong Chen, Alexander D. Liu, Fred S. Roberts, Minge Xie

Department of Statistics and CCICADA Center

Rutgers University

Piscataway, New Jersey 08854

Email: [email protected]

Abstract. The maritime transportation system is critical to the US and world economy. This paper reports on two novel statistical tools: iGroup learning, for individualized grouping and baseline distribution formation, and iDetect, for subsequent individualized detection of anomalous deviations from the baseline distribution. These statistical methods are being developed, tested, and implemented in the context of maritime threat detection, but can be easily applied in other areas. In the maritime domain, the tools aim to provide early warnings of anomalies and assessments of resulting risk for vessels being monitored. The paper presents some preliminary results about these tools and specifically reports on a case study aimed at finding anomalous behavior for vessels approaching a port.

1. Introduction

Security is of paramount importance to human existence. Accurate and early detection of threats is becoming increasingly crucial to security. Novel and sophisticated statistical methods and algorithms are needed to enhance our ability to detect threats and to take advantage of the vast amount of readily available data.

The maritime transportation system is critical to the US and world economy [9]. After the 9/11 terrorist attack, air traffic has seen significant enhancement in security and threat prevention. The detection and prevention of threats from maritime traffic, however, lags behind, and it raises many challenges, due to the vast ocean space and complex river systems, long and often unmonitored shorelines, extremely busy ports, and the sheer volume of goods being transported (accounting for about 25% of US GDP). Threats through maritime traffic can be multi-faceted and serious, ranging from human and drug trafficking, smuggling, and transport of nuclear material and dirty bombs, to garbage dumping and illegal fishing. Hence it is important to have an efficient early detection and risk assessment system for maritime traffic over space and time. Identifying unusual maritime traffic rapidly in terms of time and accurately in terms of location is critical to being able to apprehend those who are violating some law or planning some illegal or dangerous act. With today’s data gathering capability and global regulations and agreements, which lead to large amounts of data obtained through what is called the Automatic Identification System for vessels (see Sec. 2), it is now possible to achieve such a goal with sophisticated and advanced statistical tools.

We have started to develop two novel statistical tools: iGroup learning, for individualized grouping and baseline distribution formation, and iDetect, for subsequent individualized detection of anomalous deviations from the baseline distribution and hence accurate risk assessment. These statistical methods are being developed, tested and implemented in the context of maritime threat detection, although they are general statistical methods that can be easily applied in other areas such as cell phone monitoring, cyber security, and anti-money laundering. We are consolidating publicly available maritime traffic data sets and geological/geographical/geophysical data sets that provide the necessary information for analysis, including vessel characteristics and spatial and temporal characteristics of vessel movement. Our tools aim to provide early warnings of anomalies and assessments of resulting risk for vessels being monitored, finding ways to trace vessel movements, with an emphasis on threat detection. In this paper, we present some preliminary results obtained from our iGroup learning and iDetect tools.

978-1-5386-3443-1/18/$31.00 ©2018 IEEE

2. Automatic Identification System (AIS)

Since the year 2000, the International Maritime Organization (IMO) has required, via the International Convention for the Safety of Life at Sea [8], that all ships of 300 gross tons or more, and all passenger ships, install an identification and location device on board that consistently and automatically transmits dynamic data (location, course, speed, etc.) and voyage-related information (destination, estimated time of arrival (ETA), etc.) to Vessel Traffic Services (VTS) stations as well as to other ships. The Automatic Identification System (AIS), developed by technical committees under the auspices of the IMO, uses Global Positioning Systems (GPS), gyrocompass, other shipboard sensors and digital VHF radio communication equipment. Vessel identifiers such as vessel name and VHF call sign are programmed into the device and are also included in the transmittal.

The global AIS system receives data from approximately 1,000,000 ships, with updates for each ship as frequently as every two seconds while in motion and every three minutes while at anchor. AIS tracks ships automatically by electronically linking data with other ships, AIS base stations, and satellites. This system enables ships to share positional data with other ships, and while its primary initial function was for traffic management, collision avoidance, and other safety applications, it has extensive other uses. Overall, AIS offers awareness about vessels operating within the maritime transportation system [1], [6].

Specifically, AIS data include the vessel’s Maritime Mobile Service Identity (MMSI), a unique identification number; navigation status (at anchor, under way using engine(s), or not under command); rate of turn; speed over ground; position accuracy; longitude and latitude; course over ground in degrees; true heading from gyrocompass; and time stamp. Less frequent broadcasts also include the IMO ship identification number (which remains unchanged upon transfer of the ship’s registration to another country); the international radio call sign assigned to the vessel by its country of registry; vessel name; type of ship/cargo; dimensions of ship; type of positioning system (GPS, Differential Global Positioning Systems (DGPS) or Long Range Navigation (LORAN)); draught of ship; destination; and ETA at destination. Figure 1 shows a partial route of a typical vessel, using AIS data. The grey area shown, obtained from AIS data and methods we have developed, depicts the ‘normal’ routes taken by similar ships. A ship that goes out of the grey area may indicate a high level of abnormality, that is, an anomaly.

The data is massive. For example, US coast and waterway related information accounts for 32 GB of AIS data each day. Discussions with the US Coast Guard (USCG) Research and Development Center and USCG Headquarters have led to the idea that AIS data could be used to flag early warning of changed shipping patterns and other anomalous behavior. Selected AIS data are available in the public domain at https://marinecadastre.gov/ais/.


Figure 1. Partial Route of a Vessel from AIS Data

3. iGroup Learning and iDetect

To deal with heterogeneity in the population, conventional methods often cluster or group individual entities into subgroups and use the data in the same subgroup for statistical analysis, as often seen in subgroup analysis, personalized medicine and fusion learning (see for example [3], [5]). The clustering and subgrouping are typically performed a priori. Such approaches have several disadvantages. First, the forming of subgroups depends on a pre-determined number of clusters, a parameter that is difficult to determine. Second, all analytical outcomes and inferences (e.g., estimated parameters, test results) are identical for all individuals in the same subgroup. More importantly, in many situations, there may not be any clear-cut and well divided subgroup structure in the population. In these situations, conventional subgroup analysis imposes an artificial grouping structure on the population, and the analysis often leads to large biases and thus invalid inference.

We have begun to develop a novel statistical method, individualized grouping (iGroup), to be used to identify a group of vessels that behave similarly to a target individual vessel, and hence enable us to effectively establish the baseline (normal) behavior of the target vessel in terms of a joint distribution of features. In contrast to the traditional methods, iGroup focuses on each individual and forms one individualized group for each individual, by locating individuals that share similar characteristics with the target individual. It sidesteps all the aforementioned difficulties by forming an iGroup specifically for the target individual while ignoring other entities that have little in common with the target, even if there is no clear-cut and well divided subgroup structure. In our anomaly application, such a group (formed based on standard features) allows us to establish with high accuracy a baseline behavior (joint distribution of risk-related features) for the individualized group, against which to detect any anomaly for the target individual.

While customized grouping for a specific individual is not new, our method has a solid theoretical foundation. A detailed discussion of the theoretical development of the methods is beyond the scope of this paper, but roughly speaking the iGroup methodology is similar to nonparametric estimation in spirit. There are significant differences, however, and nonparametric estimation can be viewed as a special case of iGroup. iGroup is different from clustering in that it is localized, while clustering is a global partition technique.

Our preliminary analysis suggests that iGroup Learning is robust and effective for handling heterogeneity arising from diverse sources in big data, and that it is ideally suited for goal-directed applications in individualized inference. Additionally, in terms of computation, by ignoring a large number of irrelevant entities and zooming in directly on individuals, iGroup Learning is parallel in nature and can scale up better for big data than existing competitors.

We have also begun to develop an individual detection (iDetect) method to score the anomaly (abnormality) of the target individual against the baseline distribution of features obtained from its own iGroup. This is essentially an outlier detection problem. However, the challenge here is that the features under consideration may be complex and of mixed type, and the features’ importance may not be equal. A new approach is needed to assess the deviation of an observation from its baseline distribution. Our iDetect is based on the idea of data depth [4], extended to complex space, noisy features and features of unequal importance.
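To make the notion of data depth concrete, the following is a minimal sketch of the classical Mahalanobis depth, one of the depth notions discussed in [4]. It is not the extended iDetect score, whose form is not given here; the function name `mahalanobis_depth` and the toy baseline are illustrative assumptions, and the features are assumed to be purely numerical.

```python
import numpy as np

def mahalanobis_depth(x, baseline):
    """Mahalanobis depth of point x with respect to a baseline sample.

    Depth is 1 / (1 + squared Mahalanobis distance to the sample mean),
    so it lies in (0, 1] and is largest at the center of the data.
    """
    baseline = np.asarray(baseline, dtype=float)
    x = np.asarray(x, dtype=float)
    center = baseline.mean(axis=0)
    cov = np.cov(baseline, rowvar=False)      # sample covariance of features
    diff = x - center
    md2 = diff @ np.linalg.solve(cov, diff)   # squared Mahalanobis distance
    return 1.0 / (1.0 + md2)

# Hypothetical two-feature baseline (for illustration only).
baseline = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
depth_center = mahalanobis_depth([0.5, 0.5], baseline)
depth_far = mahalanobis_depth([3.0, 3.0], baseline)
```

A low depth relative to the baseline sample suggests that the observation lies far from the data center and is a candidate outlier.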

4. Features for the Applications and their Use with iGroup and iDetect

We identified key feature groups to be used in iGroup Learning and iDetect through consultation with maritime traffic experts. A partial list includes the following:

• Vessel Profile: vessel type, dimensions, draught, etc.

• Risk Features: country of registration, ports visited in the near past, etc.

• At-anchor Features: AIS indicator, on-land indicator, port indicator, etc.

• In-motion Features: average speed or acceleration, maximum speed or acceleration, etc.

• Functional Voyage Features: trajectory of vessel, time series of speed or acceleration, etc.

• Dyadic Product Formula Features: dyadic parameters of time series of a variety of variables.

For maritime traffic monitoring, the iGroup method works this way. Suppose $A$ is a given vessel and $d(\cdot, \cdot)$ is an appropriate similarity or distance measure between the target’s feature set $D_A$ and vessel $k$’s feature set $D_k$. Then we look for the iGroup (clique) consisting of all vessels $k$ such that $d(D_A, D_k) < \tau$, where $\tau$ is a threshold determined by optimizing some criterion. Compared with clustering methods, iGroup has the following advantages: it is not necessary to specify the number of clusters, and grouping is based on an individual vessel’s characteristics.
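The thresholding rule just described can be sketched as follows. This is only an illustration under stated assumptions: the function name `igroup`, the toy feature vectors, the Euclidean distance and the fixed threshold `tau=1.0` are all hypothetical stand-ins, since the criterion used to choose the threshold is not specified here.

```python
import numpy as np

def igroup(distances, target, tau):
    """Return indices k != target whose distance to the target is below tau.

    distances: square matrix with distances[i, j] = d(D_i, D_j).
    """
    row = np.asarray(distances, dtype=float)[target]
    members = np.flatnonzero(row < tau)
    return members[members != target]       # exclude the target itself

# Hypothetical numeric feature vectors for five vessels.
features = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0], [5.1, 4.9]])

# Euclidean distance is used here only as a placeholder for d(., .).
diff = features[:, None, :] - features[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

clique_0 = igroup(dist, target=0, tau=1.0)
clique_3 = igroup(dist, target=3, tau=1.0)
```

Each target gets its own group, so groups formed for different targets may overlap, and no number of clusters is ever specified.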

Our detection method measures the (potential) anomaly of a target individual vessel against its cohort (iGroup) in terms of feature distribution. Because the detection rule is based on an individual vessel’s own baseline distribution, the procedure is called individualized detection (iDetect). There are many possible outlier detection rules [2], for example:

• 1.5 interquartile range rule: detect outliers based on empirical quantiles.

• two/three standard deviation rule: detect outliers lying more than two/three standard deviations away from the mean.

• parametric distribution-based quantile rules: detect outliers using quantiles estimated from a parametric distribution model.

• multiple hypothesis testing procedures: detect outliers for which a corresponding hypothesis test is rejected.

• data depth method: detect outliers that are far away from the data center.
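The first two rules in the list can be sketched for a single numerical feature as follows. The function names, the default constants and the toy baseline are illustrative assumptions; the sketch scores one target value against the values observed in its iGroup.

```python
import numpy as np

def iqr_rule(value, baseline, k=1.5):
    """Flag value if it lies more than k * IQR outside the quartiles."""
    q1, q3 = np.percentile(np.asarray(baseline, dtype=float), [25, 75])
    spread = q3 - q1
    return bool(value < q1 - k * spread or value > q3 + k * spread)

def sd_rule(value, baseline, n_sd=2.0):
    """Flag value if it is more than n_sd sample standard deviations from the mean."""
    x = np.asarray(baseline, dtype=float)
    return bool(abs(value - x.mean()) > n_sd * x.std(ddof=1))

# Hypothetical baseline values of one feature within an iGroup.
baseline = [10, 11, 12, 13, 14, 15, 16, 17, 18]

flag_iqr_far = iqr_rule(25, baseline)
flag_iqr_near = iqr_rule(15, baseline)
flag_sd_far = sd_rule(25, baseline)
flag_sd_near = sd_rule(15, baseline)
```

Both rules are univariate; features of mixed type or unequal importance call for something like the depth-based approach mentioned earlier.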

5. Case Study: Finding Anomalies for Vessels Approaching a Port

To illustrate our ideas, we focused on 534 vessels/voyages (tankers, cargo vessels) arriving in the Port of Newark between July and November 2014. We investigated behavior from the crossing of the 12-nautical-mile US territorial sea (TS) boundary to arrival at the port. The trajectory, a functional feature, was used as the standard feature for iGroup. Outliers in duration (time spent from TS boundary to port) were detected based on the two standard deviation rule. See Figure 2.

Figure 2. Trajectories of 534 Vessels Heading to Port of Newark

Figure 3. Cliques found by iGroup for Four Vessels/Voyages. Top is Vessel/Voyage, Bottom is its Clique.

The distance between trajectories was set to the dynamic time warping (DTW) distance [7], which finds an optimal match between two given sequences (two time series). The eight parts of Figure 3 demonstrate the cliques found by iGroup for four selected vessels/voyages. The top is the vessel trajectory and the bottom is the set of trajectories in its clique.
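For readers unfamiliar with DTW, a minimal sketch of the textbook dynamic-programming recursion is given below. It is an unconstrained version whose cost grows with the product of the two sequence lengths; the exact variant used in the case study (for example, any window constraint) is not stated here. The function name `dtw_distance` and the use of Euclidean distance between individual points are assumptions.

```python
import numpy as np

def dtw_distance(a, b):
    """Unconstrained dynamic time warping distance between two sequences.

    a, b: arrays of shape (n,) or (n, d), e.g. a trajectory of positions.
    The pointwise cost is the Euclidean distance between matched points.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.ndim == 1:
        a = a[:, None]
    if b.ndim == 1:
        b = b[:, None]
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = np.linalg.norm(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

same_shape = dtw_distance([0, 1, 2], [0, 1, 1, 2])   # same path, different pacing
shifted = dtw_distance([0, 0], [1, 1])               # constant offset of 1
```

Because warping absorbs differences in pacing, two vessels that follow the same route at different speeds can still be close under this distance, which is why duration is examined separately as the feature for outlier detection.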

We then looked for outliers that have an abnormal time duration (from territorial sea boundary to port) compared to vessels in their clique (vessels with a similar trajectory). Outliers were detected by the two standard deviation rule: a vessel is flagged if its time duration is at least two standard deviations from its clique mean. In total, 95 outliers were detected. These are shown in Figure 4: (a) 50 vessels had a prior dock before the Port of Newark (left); (b) 18 vessels were anchored somewhere outside the port for an extremely long time (middle); (c) the other 27 vessels were traveling too fast or too slow compared with their cliques (right). At least (b) could reflect a vessel that was a threat because it was potentially unloading illegal goods such as weapons or narcotics.

Figure 4. Outliers among Vessels/Voyages in Trajectories of Vessels Heading to Port of Newark

Our algorithm can be completed in $O(K^2 T)$ time, where $K = 534$ is the number of trajectories and $T \approx 200$ is the average number of time stamps in the DTW algorithm used to calculate the distance between trajectories. On a 6-core machine, the algorithm took about 26 minutes to do the grouping and detect the outliers. With a single thread, we estimate that the run time would rise to roughly 2.5 hours.
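As a rough check on the scale of this computation, the figures above can be worked out explicitly. The pair count assumes that the distance is symmetric, so each unordered pair of trajectories is compared once, and the single-thread figure assumes ideal linear scaling across the six cores; both are assumptions rather than statements about the actual implementation.

```latex
\[
\binom{K}{2} = \frac{534 \cdot 533}{2} = 142{,}311 \ \text{pairwise distances},
\qquad
K^2 T \approx 534^2 \cdot 200 \approx 5.7 \times 10^{7},
\]
\[
6 \times 26\ \text{min} = 156\ \text{min} \approx 2.6\ \text{hours}.
\]
```

The last line is consistent with the rough single-thread estimate of about 2.5 hours.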

6. Future Work and Research Challenges

Next steps in our research include:

• Broaden the feature sets by including more geological information, vessel history, etc.

• Extend the current results to the case of high-dimensional features.

• Investigate theoretical properties of iGroup and iDetect.

• Develop an individualized feature selection scheme.

• Develop an online system that enables online updating of features and online outlier detection.

The research challenges here include the following:

• Data quality: part of the AIS data is incomplete, unreliable or deliberately misleading.

• Handling functional data: Treatment of functional data is significantly different from that of numerical values.

• Extracting useful features from functional data is generally difficult. (E.g., a trajectory is a function.)

• Dimension reduction: It is challenging to select from such a large number of features an efficient small subset that could be collected or derived from our data sets.

Acknowledgements. The authors thank the National Science Foundation for support under grant DMS-1737857 to Rutgers University. This paper has benefitted from discussions with Captain David Nichols (US Coast Guard, retired) about vessel features and the idea of investigating anomalies in vessel arrivals at a port.

References

[1] DiRenzo III, J., Goward, D., Roberts, F.S.: The Little-known Challenge of Maritime Cyber Security. Proc. of the Int. Conf. on Information, Intelligence, Systems and Applications (IISA), IEEE, 1-5 (2015). DOI: 10.1109/IISA.2015.7388071

[2] Hodge, V., Austin, J.: A Survey of Outlier Detection Methodologies. Artificial Intelligence Review 22.2, 85-126 (2004).

[3] Lipkovich, I., Dmitrienko, A.: Tutorial in Biostatistics: Data-driven Subgroup Identification and Analysis in Clinical Trials. Statistics in Medicine 36 (2017).

[4] Liu, R., Parelius, J.M., Singh, K.: Multivariate Analysis by Data Depth: Descriptive Statistics, Graphics and Inference (with discussion and a rejoinder by Liu and Singh). The Annals of Statistics 27, 783-858 (1999).

[5] Loh, W.Y., Fu, H., Man, M., Champion, V., Yu, M.: Identification of Subgroups with Differential Treatment Effects for Longitudinal and Multiresponse Variables. Statistics in Medicine 35 (2016).

[6] Help Net Security: Digital Ship Pirates: Researchers Crack Vessel Tracking System, October 16, 2013, http://www.net-security.org/secworld.php?id=15781, accessed Feb. 21, 2015.

[7] Sakoe, H., Chiba, S.: Dynamic Programming Algorithm Optimization for Spoken Word Recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing 26.1, 43-49 (1978).

[8] International Maritime Organization: AIS Transponders. http://www.imo.org/en/OurWork/safety/navigation/pages/ais.aspx, accessed Feb. 13, 2017.

[9] United States Coast Guard: Western Hemisphere Strategy, Sept. 2014.
