Frontiers in Big Data
Ki-Cook: clustering multimodal cooking representations through knowledge-infused learning
Saini Rohan Rao1, Revathy Venkataramanan2, Amit Sheth2, Ronak Kaoshik3, Anirudh Sundara Rajan4, Swati Padhee5
1. Department of Computational Science and Engineering, Technical University of Munich, Munich, Germany
2. Department of Computer Science, Artificial Intelligence Research Institute, University of South Carolina, Columbia, SC, United States
3. Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, United States
4. Department of Computer Science, University of Wisconsin Madison, Madison, WI, United States
5. Department of Computer Science, Wright State University, Dayton, OH, United States
Keywords: cooking process modeling; cross-modal retrieval; ingredient prediction; knowledge-infused learning; multimodal learning; representation learning; clustering
DOI: 10.3389/fdata.2023.1200840
Received: 2023-04-05; Accepted: 2023-06-26; Published: 2023
Source: Frontiers
Abstract
Cross-modal recipe retrieval has gained prominence due to its ability to retrieve a text representation given an image representation and vice versa. Clustering these recipe representations based on similarity is essential to retrieve relevant information about unknown food images. Existing studies cluster similar recipe representations in the latent space based on class names. Due to inter-class similarity and intra-class variation, associating a recipe with a class name does not provide sufficient knowledge about recipes to determine similarity. However, recipe title, ingredients, and cooking actions provide detailed knowledge about recipes and are a better determinant of similar recipes. In this study, we utilized this additional knowledge of recipes, such as ingredients and recipe title, to identify similar recipes, with particular attention to rare ingredients. To incorporate this knowledge, we propose a knowledge-infused multimodal cooking representation learning network, Ki-Cook, built on the procedural attribute of the cooking process. To the best of our knowledge, this is the first study to adopt a comprehensive recipe similarity determinant to identify and cluster similar recipe representations. The proposed network also incorporates ingredient images to learn multimodal cooking representations. Since the motivation for clustering similar recipes is to retrieve relevant information for an unknown food image, we evaluated the ingredient retrieval task. Our empirical analysis shows that the proposed model improves Coverage of Ground Truth by 12% and Intersection over Union by 10% compared to the baseline models. On average, the representations learned by our model contain an additional 15.33% of rare ingredients compared to the baseline models. Owing to this difference, our qualitative evaluation shows a 39% improvement in clustering similar recipes in the latent space compared to the baseline models, with an inter-annotator agreement (Fleiss' kappa) of 0.35.
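The two retrieval metrics named above can be read as set comparisons between predicted and ground-truth ingredient lists. A minimal sketch of how such metrics are typically computed (the function and variable names here are illustrative, not taken from the Ki-Cook codebase):

```python
def iou(predicted, truth):
    """Intersection over Union of two ingredient sets."""
    predicted, truth = set(predicted), set(truth)
    return len(predicted & truth) / len(predicted | truth)

def coverage(predicted, truth):
    """Fraction of ground-truth ingredients that were retrieved."""
    predicted, truth = set(predicted), set(truth)
    return len(predicted & truth) / len(truth)

# Example: 3 of 4 ground-truth ingredients retrieved, 1 spurious prediction.
pred = {"flour", "sugar", "butter", "saffron"}
gt = {"flour", "sugar", "butter", "eggs"}
print(round(iou(pred, gt), 2))       # 0.6  (3 shared / 5 in union)
print(round(coverage(pred, gt), 2))  # 0.75 (3 shared / 4 in ground truth)
```

Coverage rewards recall of the true ingredients, while IoU additionally penalizes over-prediction, which is why the paper reports both.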
License: Unknown
Copyright © 2023 Venkataramanan, Padhee, Rao, Kaoshik, Sundara Rajan and Sheth.