> DATA SCIENCE

Video credit: gonin/Creatas Video+ / Getty Images Plus via Getty Images

Emerging Opportunities for Machine Learning in Food Safety: Potential and Pitfalls

Determining the right tools and the right data to use them on

By Xiangyu Deng, Ph.D., Shuhao Cao, Ph.D., and Abigail L. Horn, Ph.D.


Increasingly, large volumes of data associated with and throughout the farm-to-table continuum can be used to inform food safety and public health.¹ From pathogen genomes to consumer reviews, innovative applications that use machine learning to analyze these data are on the horizon. This article is meant to accompany a recent scholarly review on the topic² by providing a primer and synopsis for the general food safety audience.

Whether machine learning, or the more encompassing subject of artificial intelligence, may turn into a major boon or a disappointing bust for food safety is already a much-discussed topic, including here at Food Safety Magazine. In our effort to review emerging applications of machine learning in food safety, we tried to examine both potentials and pitfalls, hoping to contribute to constructive and cautious practices of these still-new approaches in food safety analytics.

Primer on Machine Learning

Although machine learning has been around for a couple of decades, it has recently become a very hot topic. Buzzwords such as “big data” and “artificial intelligence” (AI) have appeared in the media since the mid-’90s. Computer vision, natural language processing, and data science, fields predicated on machine-learning techniques, have been growing explosively. Some even argue that the proliferation of machine learning is the “third industrial revolution.” For years, perhaps without realizing the transformative power of machine learning, people have been enjoying its conveniences, ranging from smartphone apps to autonomous vehicles to QR Code recognition to translation of complex sentences into different languages in real time. In sum, data-driven, machine-learning-powered decision making has gradually become common practice.

The old adage “Practice makes perfect” captures an important aspect of machine learning: A machine-learning-based computer program is “trained” by data (practice) to better fulfill its purpose. For model-based machine learning, this training procedure often involves letting the model “see” a continuous stream of training data, so that, over many iterations, the model performs the same task better and better. For example, if a computer program were designed to distinguish cats from dogs in images, then a data set consisting of images previously labeled as dog or cat must be given to the model to train it. The math behind the curtain will try to minimize the difference between the model’s predictions of the labels and the true labels, often called the “ground truth” data. Meanwhile, certain validation measures are taken to ensure that a trained model can generalize its predictive power to unseen data, that is, that it is not overfit to the training data.
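The training-and-validation loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the one-feature data set is synthetic, the model is a plain logistic regression fitted by gradient descent, and the names `train_logistic`, `accuracy`, and `held_out` are invented for this example.

```python
import math
import random


def sigmoid(z):
    """Map a raw score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(examples, epochs=200, lr=0.5):
    """Fit a weight and a bias by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in examples:
            error = sigmoid(w * x + b) - y  # prediction minus ground truth
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / len(examples)
        b -= lr * grad_b / len(examples)
    return w, b


def accuracy(model, examples):
    """Share of examples whose predicted label matches the true label."""
    w, b = model
    correct = sum((sigmoid(w * x + b) >= 0.5) == (y == 1) for x, y in examples)
    return correct / len(examples)


rng = random.Random(0)
# Synthetic data (an assumption): label 1 tends to have larger feature values.
data = [(rng.gauss(-1.0, 1.0), 0) for _ in range(100)]
data += [(rng.gauss(1.0, 1.0), 1) for _ in range(100)]
rng.shuffle(data)

train, held_out = data[:150], data[150:]  # the model never sees held_out while training
model = train_logistic(train)
print("training accuracy:", accuracy(model, train))
print("held-out accuracy:", accuracy(model, held_out))
```

A large gap between the two printed numbers would be one sign of overfitting; comparable numbers suggest the model generalizes to data it has not seen.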


In many ways, the success of a machine-learning model depends on the quality of data, following the computer science proverb “Garbage in, garbage out.” Choosing the “right” model to fit the data is important: A simple model may not capture certain important patterns, while a complicated model may pick up too much noise from the data and lose generalizability (i.e., overfit). Choosing the right model for the application is also important. For communication and translation of work with key stakeholders and decision makers, an interpretable machine-learning model that returns features and prediction results that are comprehensible to humans is often preferred over more complex “black box” models.  


Machine Learning Using Genomic Data

A major category of machine-learning applications in food safety centers on whole-genome sequencing (WGS) data. First, established surveillance and monitoring infrastructure, such as PulseNet, GenomeTrakr, and the National Antimicrobial Resistance Monitoring System (NARMS), has been producing copious amounts of pathogen genomes, creating scalability challenges for data analytics and a constant need to cope with new data. Machine learning is presumably positioned to address these challenges and needs. Second, certain pathogen characteristics of food safety significance are mechanistically complicated and not well understood, for example, the association between particular pathogens and their sources. Machine learning may help uncover hidden patterns from WGS data that are less identifiable with traditional methods. Finally, machine learning holds promise to solve difficult problems in the biomedical sciences using genomic data. A recent prominent case is a stunning breakthrough in predicting protein folding through deep learning of amino acid sequences.

Given the relatively small genomes of foodborne pathogens, one may assume that machine-learning analyses of these genomes would be mere iteration or extension of established methodology. However, domain-specific opportunities and challenges continue to arise.

In antimicrobial resistance (AMR) monitoring, AMR prediction using WGS data is considered an accurate and efficient alternative to or enhancement of phenotypic antimicrobial susceptibility testing (AST). Typically, a curated panel of AMR genes is searched for in the genome of interest to yield susceptible or resistant classifications. This rule-based approach requires a priori knowledge of genetic determinants of AMR and thus cannot identify AMR caused by novel or uncatalogued genes. Machine-learning models that are agnostic of known AMR determinants and unbiased by any AMR gene curation have naturally been explored for genomic AST. In a cutting-edge study,³ Nguyen et al. broke down over 5,000 Salmonella genomes into small sequences (k-mers) and used them as features fed into a machine-learning model to predict the minimum inhibitory concentration (MIC) of 15 drugs. The MIC prediction accuracy of the model met the U.S. Food and Drug Administration (FDA) standards for automated systems with all 15 drugs measured by major errors (false-resistant results) and 7 of the 15 measured by very major errors (false-susceptible results).
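To make the idea of k-mers as features concrete, the following Python sketch counts overlapping k-mers in short toy sequences and arranges the counts as rows that a model could consume. It is a simplified illustration, not the pipeline used in the cited study; the sequences, the choice of k = 3, and the function names `kmer_counts` and `feature_matrix` are assumptions made for this example.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping substring of length k in a DNA sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def feature_matrix(genomes, k):
    """Turn named sequences into rows of k-mer counts over a shared vocabulary."""
    counts = {name: kmer_counts(seq, k) for name, seq in genomes.items()}
    vocabulary = sorted(set().union(*counts.values()))
    rows = {name: [c[kmer] for kmer in vocabulary] for name, c in counts.items()}
    return vocabulary, rows


# Toy sequences (hypothetical); real genomes run to millions of bases.
toy_genomes = {"isolate_A": "ATGCGATGCA", "isolate_B": "ATGCGTTGCA"}
vocabulary, rows = feature_matrix(toy_genomes, k=3)
for name, row in rows.items():
    print(name, dict(zip(vocabulary, row)))
```

Each row has one column per k-mer seen in any genome, so no list of known resistance genes is needed to build the features. Real analyses use much longer k-mers, which makes the vocabulary very large.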

In zoonotic source attribution, that is, predicting livestock sources of food pathogens, machine learning leveraging WGS and associated labels (“metadata”) has shown great potential. Zhang et al.⁴ applied a machine-learning classifier to predict food-animal sources of Salmonella Typhimurium. Trained on more than 1,200 genomes of known sources in the U.S., the classifier correctly attributed isolates from seven of eight major outbreaks linked to food animals in the U.S. from 1994 to 2013.

While promising, such applications are still nascent and premature for deployment. First, some machine-learning models are inherently difficult to interpret, and common performance metrics are inadequate for determining whether a model is ready to be deployed in practice. Domain expertise and commonsense knowledge are critical for evaluating machine-learning solutions. The “rule” is: One shall not “listen (only) to data.” Second, in clinical and public health settings, standardized methods are often desired. However, the design of training sets, the choice of models, and the strategy for validation all pose difficulties for standardization. Finally, as mentioned above, training sets greatly affect the outcome of the analysis, and “garbage in, garbage out” is a common pitfall. In the case of machine learning with food safety genomic data, inflated source attribution accuracy can be derived from oversampling of closely related Salmonella genomes of the same source in the training set.⁴,⁶ 
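One common guard against the inflated accuracy described above is to hold out whole groups of closely related genomes, so that near-identical isolates never appear in both the training set and the test set. The Python sketch below shows such a grouped split. It is a minimal illustration and assumes each isolate already carries a relatedness label; the field name `cluster`, the toy records, and the function `grouped_split` are hypothetical.

```python
import random
from collections import defaultdict


def grouped_split(isolates, test_fraction=0.25, seed=0):
    """Hold out whole clusters so related isolates never straddle train and test."""
    by_cluster = defaultdict(list)
    for isolate in isolates:
        by_cluster[isolate["cluster"]].append(isolate)

    clusters = sorted(by_cluster)
    random.Random(seed).shuffle(clusters)
    n_test = max(1, round(len(clusters) * test_fraction))

    test = [iso for c in clusters[:n_test] for iso in by_cluster[c]]
    train = [iso for c in clusters[n_test:] for iso in by_cluster[c]]
    return train, test


# Toy records (hypothetical): 40 isolates spread over 8 relatedness clusters.
isolates = [
    {"id": f"isolate_{n}", "cluster": f"cluster_{n % 8}",
     "source": "poultry" if n % 2 else "swine"}
    for n in range(40)
]
train, test = grouped_split(isolates)
shared = {i["cluster"] for i in train} & {i["cluster"] for i in test}
print(len(train), "training isolates,", len(test), "test isolates,",
      len(shared), "clusters shared")
```

With a purely random split, isolates from the same cluster would usually land on both sides, and the test score would partly measure how well the model recognizes near-duplicates instead of how well it attributes new strains.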

Machine Learning Using Novel Data Streams

Another major area of application of machine-learning techniques in food safety involves the use of novel data streams (NDS)—emerging sources of data that are created continuously and passively by individuals going about their daily lives, also called “data in the wild”—which include text (social media), trade, and transactional data. Because these data are generated on the consumer level, their value has been found mainly in surveillance of food safety events at the last mile of the food supply chain.

Text data, in the form of Twitter posts,⁸⁻¹² reviews on Yelp¹³⁻¹⁵ and Amazon,¹⁶ news, or blogs,¹⁷ have been monitored to capture near-real-time food safety violations or illness reports linked to restaurants or retailers that are not reported through the official channels (e.g., filing reports with the local health department) and would otherwise go unrecognized. 

The machine-learning task commonly applied here is to train a classifier to identify a set of keywords, such as “illness,” “food poisoning,” and “throw up,” that are associated with food safety-related incidents at a given restaurant or retailer.
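As a rough illustration of how a classifier can learn which words signal a possible incident, the Python sketch below trains a small naive Bayes model on a handful of labeled texts. Naive Bayes is chosen here only for brevity and is not necessarily what the cited systems use; the example reviews are invented, and the names `tokenize`, `train_naive_bayes`, and `predict` are assumptions for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(labeled_texts):
    """Count words per label (1 = possible illness report, 0 = other)."""
    word_counts = {0: Counter(), 1: Counter()}
    doc_counts = Counter()
    for text, label in labeled_texts:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = set(word_counts[0]) | set(word_counts[1])
    return word_counts, doc_counts, vocabulary


def predict(model, text):
    """Return the label with the higher smoothed log-probability."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (0, 1):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1)
                                  / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Invented training examples; real systems learn from thousands of labeled posts.
reviews = [
    ("got food poisoning after the chicken, threw up all night", 1),
    ("felt sick and had to throw up after eating here", 1),
    ("stomach illness the next day, never again", 1),
    ("great service and the pasta was delicious", 0),
    ("lovely patio, friendly staff, good coffee", 0),
    ("the burger was tasty and the fries were crisp", 0),
]
model = train_naive_bayes(reviews)
print(predict(model, "I think I got food poisoning from the salad"))
print(predict(model, "great coffee and friendly staff"))
```

The model weighs every word it has seen, so the "keywords" emerge from the labeled examples instead of being fixed in advance.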

Over the past decade, a number of foodborne illness surveillance systems built on machine-learning classifiers have moved out of the lab into systems piloted by some of the country’s largest city health departments. While the original concept behind these systems was to identify outbreaks of foodborne illness, in practice, there have been no notable successes in identifying outbreaks, either active or historical, that traditional surveillance techniques have missed. The main value in application has instead been found in identifying food safety violations at restaurants, whether or not an outbreak has occurred. The most successful approach to date does not use social media data but combines Google search and location history to identify restaurants violating health codes.¹⁸ Pioneered by a team including Google researchers and the Chicago and Las Vegas health departments, this system uses machine learning to identify a set of Google search terms predictive of foodborne illness and then uses Google location history to link the users placing those search terms to restaurants they have frequented.

Machine-learning models based on these “wild” data sources have critical biases that cannot be overlooked, following the “garbage in, garbage out” principle. User attribution of foodborne illness to a specific food item or consumption location is notoriously difficult to confirm given the incubation periods of foodborne pathogens, multiplicity of consumed foods, and inaccuracies of recall.¹⁹ Users of social media including Twitter and Yelp represent a convenience sample of a younger, wealthier, predominantly urban population with specific race/ethnicity biases and are not a representative sample of society; additionally, platform penetration is known to vary by geographic region.²⁰⁻²² Research has shown that consumer stereotypes around ethnic foods drive the implication of such restaurants on social media, and likewise in food safety surveillance systems based on social media data.²⁰,²³ The reader is referred to a recent article in Food Safety Magazine for a more in-depth explanation of some of these biases. 

Transactional data, electronic records generated at point of sale, are another form of NDS that hold much promise for foodborne disease outbreak investigations, although still with very few applications in practice. Transactional data provide an objective history of consumption records that can be used to generate hypotheses at the early stages of an investigation about the causative food vehicle, as well as the causative location of contamination in retail or a restaurant. While there have been numerous successful examples using individual consumer checkout data collected from known case patients together with standard case-control statistical methods, machine learning finds application with aggregated data, which exist in the form of aggregated store-based or spatially aggregated retail sales, loyalty card data, and credit card transaction records. To protect privacy, sensitive information such as usernames and addresses is anonymized before sharing with researchers. A notable example involves work by Kaufman et al.,²⁴,²⁵ utilizing aggregated sales data of hundreds of individual products sold in retail to identify the causative food in large-scale spatially distributed outbreaks. A pattern-matching approach was developed that identifies as likely culprits the food items with sales patterns more closely resembling the outbreak distribution. A machine-learned clustering analysis complements the method by identifying clusters of products that are indistinguishable. The approach has been applied in practice, and with some edits and improvements was demonstrated by the Norwegian Institute of Public Health to help identify the source of an enterohemorrhagic Escherichia coli outbreak.²⁶
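The pattern-matching idea can be illustrated with a simple likelihood score: if cases arise in proportion to where a product is sold, the product whose regional sales shares best explain the observed case counts scores highest. The Python sketch below is a deliberately simplified version of that reasoning, not the published method; the products, regions, numbers, and function names are hypothetical.

```python
import math


def log_likelihood(case_counts, sales_by_region, smoothing=1e-6):
    """Multinomial log-likelihood of the cases under one product's sales shares."""
    total_sales = sum(sales_by_region.values())
    if total_sales <= 0:
        return float("-inf")  # a product with no sales cannot explain any case
    score = 0.0
    for region, cases in case_counts.items():
        share = sales_by_region.get(region, 0.0) / total_sales
        score += cases * math.log(share + smoothing)  # smoothing avoids log(0)
    return score


def rank_products(case_counts, sales):
    """Order products from most to least consistent with the outbreak pattern."""
    scores = {product: log_likelihood(case_counts, by_region)
              for product, by_region in sales.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical aggregated sales (units per region) and outbreak case counts.
sales = {
    "product_A": {"north": 80, "south": 15, "east": 5},
    "product_B": {"north": 30, "south": 40, "east": 30},
    "product_C": {"north": 10, "south": 10, "east": 80},
}
cases = {"north": 16, "south": 3, "east": 1}
print(rank_products(cases, sales))
```

In this toy example the cases concentrate in the north, so the product sold mostly in the north ranks first. Products with nearly identical sales patterns would receive nearly identical scores, which is why a clustering step to flag indistinguishable products is a natural complement.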

Trade data, traditionally recorded or logged for company operations or statistical analysis, have also found innovative application in food safety surveillance and risk assessment alongside machine-learning techniques. Examples of such data sources include supply chain or transshipment logistics records, federal and international trade statistics, and production and consumption data. In one application, import-export data accessed from a public data source, ImportGenius, were combined with FDA import inspection records to train a system to predict food import firms likely to fail FDA site inspections.²⁷ The trade data add value because the trained model selects features relating not only to producers and suppliers (information contained in the FDA import inspection data alone) but also to supply chain network structural relationships that are most predictive of risk (information particular to the trade structure), which in combination improves identification of firms likely to fail FDA site inspections by more than 40 percent over existing approaches. Trade data have also been used to develop spatially resolved models of the aggregate structure of national food supply chains in Germany²⁸ and the U.S.,²⁹ with machine learning used to train models to ensure that the properties of the estimated networks follow known structural properties of observed empirical food-flow networks.


Pitfalls and Limitations of Machine Learning in Novel Data Stream Data Analysis

It should be clear from the survey of examples provided in this article that although machine-learning models have been shown to supplement and/or complement existing data and analysis techniques to address food safety challenges, they are not a replacement for investigative work. Beyond the issue of choosing the right data to train a model, a researcher must also choose the most appropriate methodology to approach each food safety challenge, and machine learning is not a fit for every situation. Principled mathematical or predictive mechanistic modeling techniques, which introduce a structure or logic into a problem approach, are a better fit for certain tasks, especially when limited training data are available. Examples include agent-based modeling to predict likely hazards in a production facility, or network-theoretic modeling to identify the probabilistically most likely source of an outbreak and identify risks in the international food supply chain. Oftentimes, the most practical solution requires not sophisticated modeling but a sophisticated IT system for capturing and securely sharing data in real time. In such cases, a blockchain solution may work, as may a system built around software accessible to food safety risk analyzers. An example of the latter is the set of supply chain data input, mapping, and visualization tools developed and deployed to assist traceback during the 2011 enterohemorrhagic E. coli outbreak linked to contaminated sprouts, one of the largest foodborne outbreaks in Germany.³⁰,³¹

But we should close on a note of possibility. Many promising opportunities exist for food safety by extending machine-learning applications from related fields or sources of data. Aggregated transactional data from loyalty cards,³² credit card transactions,³³ restaurant sales, and online grocery shopping³⁴ or delivery,³⁵ most of which have been used in nutrition applications, could find application in foodborne outbreak source attribution analysis to identify the culprit food item or market/restaurant where contaminated products were sold. Mobility records logged by smartphones have been applied extensively in tracing the spread of infectious diseases, including coronavirus disease 2019, but so far have found only one application in the foodborne disease space.¹⁸ Foodborne illness or hazard surveillance systems built on social media, search query, or company message board data could be extended to other areas of food safety, including product recalls, allergens, or food safety regulations. Further promising opportunities exist for combining data sources, such as the combination of genomic and supply chain data in source attribution.³⁶ So long as careful attention is consistently paid to identifying sources of bias and pitfalls, machine learning together with big data promises to usher food safety into what Frank Yiannas, the FDA deputy commissioner for food policy and response, calls a “new era of smarter food safety.”

References

  1. Marvin, H.J., et al. 2017. “Big Data in Food Safety: An Overview.” Crit Rev Food Sci Nutr 57(11): 2286–2295.
  2. https://www.annualreviews.org/doi/abs/10.1146/annurev-food-071720-024112.
  3. Nguyen, M., et al. 2019. “Using Machine Learning to Predict Antimicrobial MICs and Associated Genomic Features for Nontyphoidal Salmonella.” J Clin Microbiol 57(2): e01260–18.
  4. Zhang, S., et al. 2019. “Zoonotic Source Attribution of Salmonella enterica Serotype Typhimurium Using Genomic Surveillance Data, United States.” Emerg Infect Dis 25(1): 82–91.
  5. Nisbet, R., G. Miner, and J. Elder, Handbook of Statistical Analysis and Data Mining Applications (Cambridge, MA: Academic, 2009).
  6. Wheeler, N.E. 2019. “Tracing Outbreaks with Machine Learning.” Nat Rev Microbiol 17(5): 269.
  7. Althouse, B.M., et al. 2015. “Enhancing Disease Surveillance with Novel Data Streams: Challenges and Opportunities.” EPJ Data Sci 4: 17.
  8. Devinney, K., et al. 2018. “Evaluating Twitter for Foodborne Illness Outbreak Detection in New York City.” Online J Public Health Inform 10(1): e120.
  9. Harris, J.K., et al. 2017. “Using Twitter to Identify and Respond to Food Poisoning: The Food Safety STL Project.” J Public Health Manag Pract 23(6): 577–580.
  10. Harrison, C., et al. 2014. “Using Online Reviews by Restaurant Patrons to Identify Unreported Cases of Foodborne Illness—New York City, 2012–2013.” Morb Mortal Wkly Rep 63: 441–445.
  11. Kuehn, B.M. 2014. “Agencies Use Social Media to Track Foodborne Illness.” JAMA 312: 117–118.
  12. Sadilek, A., et al., “Deploying nEmesis: Preventing Foodborne Illness by Data Mining Social Media,” in Proceedings of the Twenty-Eighth AAAI Conference on Innovative Applications (Menlo Park, CA: Association for the Advancement of Artificial Intelligence, 2017), 3982–3989.
  13. Effland, T., et al. 2018. “Discovering Foodborne Illness in Online Restaurant Reviews.” J Am Med Inform Assoc 25: 1586–1592.
  14. Nsoesie, E.O., et al. 2014. “Online Reports of Foodborne Illness Capture Foods Implicated in Official Foodborne Outbreak Reports.” Prev Med 67: 264–269.
  15. Schomberg, J.P., et al. 2016. “Supplementing Public Health Inspection via Social Media.” PLOS ONE 11(3): e0152117.
  16. Maharana, A., et al. 2019. “Detecting Reports of Unsafe Foods in Consumer Product Reviews.” JAMIA Open 2: 330–338.
  17. Kate, K., et al., “FoodSIS: A Text Mining System to Improve the State of Food Safety in Singapore,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York: Association of Computing Machinery, 2014), 1709–1718.
  18. Sadilek, A., et al. 2018. “Machine-Learned Epidemiology: Real-Time Detection of Foodborne Illness at Scale.” npj Digit Med 1: 36.
  19. Gertler, M., et al. 2017. “Assessment of Recall Error in Self-Reported Food Consumption Histories Among Adults—Particularly Delay of Interviews Decrease Completeness of Food Histories—Germany, 2013.” PLOS ONE 12: e0179121.
  20. Altenburger, K.M. and D.E. Ho. 2019. “When Algorithms Import Private Bias into Public Enforcement: The Promise and Limitations of Statistical Debiasing Solutions.” J Inst Theor Econ 175: 98–122.
  21. Oldroyd, R.A., et al. 2018. “Identifying Methods for Monitoring Foodborne Illness: Review of Existing Public Health Surveillance Techniques.” JMIR Public Health Surveill 4: e57.
  22. Tufekci, Z. “Big Questions for Social Media Big Data: Representativeness, Validity and Other Methodological Pitfalls,” in Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media (Palo Alto, CA: AAAI Press, 2014).
  23. Zukin, S., et al. 2015. “The Omnivore’s Neighborhood? Online Restaurant Reviews, Race, and Gentrification.” J Consum Cult 17: 459–479.
  24. Kaufman, J., et al. 2014. “A Likelihood-Based Approach to Identifying Contaminated Food Products Using Sales Data: Performance and Challenges.” PLOS Comput Biol 10: e1003692.
  25. Hu, K., et al. 2016. “A Modeling Framework to Accelerate Food-Borne Outbreak Investigations.” Food Contr 59: 53–58.
  26. Norström, M., et al. 2015. “An Adjusted Likelihood Ratio Approach Analysing Distribution of Food Products to Assist the Investigation of Foodborne Outbreaks.” PLOS ONE 10: e0134344.
  27. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3374620.
  28. Balster, A. and H. Friedrich. 2019. “Dynamic Freight Flow Modelling for Risk Evaluation in Food Supply.” Transp Res E 121: 4–22.
  29. Lin, X., et al. 2019. “Food Flows between Counties in the United States.” Environ Res Lett 14: 084011.
  30. Weiser, A.A., et al. 2013. “Trace-Back and Trace-Forward Tools Developed Ad Hoc and Used During the STEC O104:H4 Outbreak 2011 in Germany and Generic Concepts for Future Outbreak Situations.” Foodborne Pathog Dis 10: 263–269.
  31. Weiser, A.A., et al. 2016. “FoodChain-Lab: A Trace-Back and Trace-Forward Tool Developed and Applied During Food-Borne Disease Outbreak Investigations in Germany and Europe.” PLOS ONE 11: e0151977.
  32. Aiello, L.M., et al. 2019. “Large-Scale and High-Resolution Analysis of Food Purchases and Health Outcomes.” EPJ Data Sci 8: 14.
  33. Singh, V.K., et al. 2015. “Money Walks: Implicit Mobility Behavior and Financial Well-Being.” PLOS ONE 10: e0136628.
  34. Huyghe, E., et al. 2017. “Clicks as a Healthy Alternative to Bricks: How Online Grocery Shopping Reduces Vice Purchases.” J Mark Res 54: 61–74.
  35. Schulz, E., et al. 2019. “Structured, Uncertainty-Driven Exploration in Real-World Consumer Choice.” Proc Natl Acad Sci USA 116: 13903–13908.
  36. Dallman, T., et al. 2016. “Phylogenetic Structure of European Salmonella Enteritidis Outbreak Correlates with National and International Egg Distribution Network.” Microb Genom 2: e000070.

Xiangyu Deng, Ph.D., is an associate professor in the Center for Food Safety at the University of Georgia.

Shuhao Cao, Ph.D., is a lecturer in the Department of Mathematics and Statistics at Washington University.

Abigail L. Horn, Ph.D., is a postdoctoral fellow in the Center for Applied Network Analysis at the University of Southern California Keck School of Medicine.

APRIL/MAY 2021
