Leakage (machine learning)


In statistics and machine learning, leakage (also known as data leakage or target leakage) is the use of information in the model training process which would not be expected to be available at prediction time, causing the predictive scores (metrics) to overestimate the model's utility when run in a production environment.[1]

Leakage is often subtle and indirect, making it hard to detect and eliminate. Leakage can cause a statistician or modeler to select a suboptimal model, which could be outperformed by a leakage-free model.[1]

Leakage modes


Leakage can occur at many steps of the machine learning process. Its causes can be grouped into two sources of leakage for a model: features and training examples.[1]

Feature leakage


Feature or column-wise leakage is caused by the inclusion of columns that are a duplicate of the label, a proxy for the label, or the label itself. These features, known as anachronisms, will not be available when the model is used for predictions, and result in leakage if included when the model is trained.[2]

For example, including a "MonthlySalary" column when predicting "YearlySalary"; or "MinutesLate" when predicting "IsLate".
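
The sketch below is a minimal, hypothetical illustration of this (synthetic data, invented column names echoing the "MinutesLate"/"IsLate" case, scikit-learn assumed): the leaky column is derived from the label itself, so cross-validation reports near-perfect accuracy that could never be reproduced in production, where the column is unknown at prediction time.

    # Hypothetical sketch of feature (column-wise) leakage.
    # "MinutesLate" is derived from the label "IsLate", so it is an anachronism:
    # it does not exist at prediction time, but it makes validation look near-perfect.
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    n = 1000
    df = pd.DataFrame({
        "distance_km": rng.uniform(1, 50, n),   # legitimate predictor
        "rush_hour": rng.integers(0, 2, n),     # legitimate predictor
    })
    # Noisy label: whether a delivery is late.
    late_prob = 0.15 + 0.7 * ((df["distance_km"] > 30) | (df["rush_hour"] == 1))
    df["IsLate"] = (rng.uniform(0, 1, n) < late_prob).astype(int)
    # Leaky column: only measurable after the outcome is already known.
    df["MinutesLate"] = df["IsLate"] * rng.uniform(5, 60, n)

    y = df["IsLate"]
    model = RandomForestClassifier(random_state=0)
    leaky = cross_val_score(model, df[["distance_km", "rush_hour", "MinutesLate"]], y, cv=5)
    clean = cross_val_score(model, df[["distance_km", "rush_hour"]], y, cv=5)
    print(f"with leaky feature:    {leaky.mean():.3f}")  # close to 1.0, misleading
    print(f"without leaky feature: {clean.mean():.3f}")  # realistic estimate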

Training example leakage


Row-wise leakage is caused by improper sharing of information between rows of data. Types of row-wise leakage include:

  • Premature featurization: fitting feature-engineering steps (e.g. a MinMax scaler or an n-gram vocabulary) on the full dataset before the cross-validation/train/test split. Such transformers must be fitted on the training split only and then used to transform the test split (see the first sketch after this list).
  • Duplicate rows shared between the train, validation and test sets (e.g. oversampling or bootstrap sampling a dataset to pad its size before splitting, placing different rotations/augmentations of the same image in different splits, or duplicating minority-class rows to up-sample before splitting)
  • Non-i.i.d. data
    • Time leakage (e.g. splitting a time-series dataset randomly instead of chronologically, so that some test data predates the training data; a time-based train/test split or rolling-origin cross-validation should be used instead)
    • Group leakage: not splitting on a grouping column (e.g. Andrew Ng's group had 112,120 x-rays of 30,805 patients, i.e. several images per patient. The paper split the images randomly instead of ensuring that all images of a patient landed in the same split, so the model could partially memorize patients rather than learn to recognize pneumonia in chest x-rays.[3][4] See the second sketch after this list.)
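
The first sketch below is a minimal, hypothetical example (scikit-learn assumed) of premature featurization and its fix. Univariate feature selection stands in for the preprocessing step because it makes the effect easy to see, but the same pattern applies to MinMax scaling, n-gram vocabularies or imputation: on pure noise data the honest cross-validated score is chance level, about 0.5, while the prematurely featurized version appears far better.

    # Hypothetical sketch of premature featurization on pure-noise data.
    # Leaky version: the feature selector is fitted on ALL rows before cross-validation,
    # so features that correlate with y by chance leak information into every test fold.
    # Leak-free version: the selector sits inside a Pipeline and is refitted on the
    # training rows of each fold only.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5000))   # no real signal at all
    y = rng.integers(0, 2, 100)

    X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)      # fitted on everything
    leaky = cross_val_score(LogisticRegression(), X_leaky, y, cv=5).mean()

    pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression())
    clean = cross_val_score(pipe, X, y, cv=5).mean()                # selector refit per fold

    print(f"premature featurization: {leaky:.2f}")  # typically well above chance
    print(f"leak-free pipeline:      {clean:.2f}")  # close to 0.5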
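
For the non-i.i.d. cases, the split itself has to respect time order and group membership. The second sketch (hypothetical placeholder arrays; scikit-learn's GroupKFold and TimeSeriesSplit assumed) checks that no group, such as a patient ID, is shared between training and test indices, and that every test fold is strictly newer than its training fold.

    # Hypothetical sketch of split strategies for non-i.i.d. data.
    import numpy as np
    from sklearn.model_selection import GroupKFold, TimeSeriesSplit

    X = np.arange(12).reshape(-1, 1)          # placeholder features
    y = np.zeros(12)                          # placeholder labels
    patient_id = np.repeat([0, 1, 2, 3], 3)   # e.g. 3 images per patient

    # Group leakage fix: every patient's rows stay together in one split.
    for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups=patient_id):
        assert set(patient_id[train_idx]).isdisjoint(patient_id[test_idx])

    # Time leakage fix: test rows are always later than all training rows.
    for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
        assert train_idx.max() < test_idx.min()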

A 2023 review found data leakage to be "a widespread failure mode in machine-learning (ML)-based science", having affected at least 294 academic publications across 17 disciplines, and causing a potential reproducibility crisis.[5]


References

  1. ^ a b c Shachar Kaufman; Saharon Rosset; Claudia Perlich (January 2011). "Leakage in data mining: Formulation, detection, and avoidance". Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. Vol. 6. pp. 556–563. doi:10.1145/2020408.2020496. ISBN 9781450308137. S2CID 9168804. Retrieved 13 January 2020.
  2. ^ Soumen Chakrabarti (2008). "9". Data Mining: Know it All. Morgan Kaufmann Publishers. p. 383. ISBN 978-0-12-374629-0. Anachronistic variables are a pernicious mining problem. However, they aren't any problem at all at deployment time—unless someone expects the model to work! Anachronistic variables are out of place in time. Specifically, at data modeling time, they carry information back from the future to the past.
  3. ^ Guts, Yuriy (30 October 2018). Target Leakage in Machine Learning (Talk). AI Ukraine Conference. Ukraine – via YouTube.
  4. ^ Roberts, Nick (16 November 2017). "Replying to @AndrewYNg @pranavrajpurkar and 2 others". Brooklyn, NY, USA: Twitter. Archived from the original on 10 June 2018. Retrieved 13 January 2020. Replying to @AndrewYNg @pranavrajpurkar and 2 others ... Were you concerned that the network could memorize patient anatomy since patients cross train and validation? "ChestX-ray14 dataset contains 112,120 frontal-view X-ray images of 30,805 unique patients. We randomly split the entire dataset into 80% training, and 20% validation."
  5. ^ Kapoor, Sayash; Narayanan, Arvind (August 2023). "Leakage and the reproducibility crisis in machine-learning-based science". Patterns. 4 (9): 100804. doi:10.1016/j.patter.2023.100804. ISSN 2666-3899. PMC 10499856. PMID 37720327.