Tutorials

Data Security and Privacy in Machine Unlearning: Recent Advances, Challenges, and Future Perspectives

  • Aobo Chen (Iowa State University), Wei Qian (Iowa State University), Zheyuan Liu (University of Notre Dame), Shagufta Mehnaz (Penn State University), Tianhao Wang (University of Virginia), Mengdi Huai (Iowa State University)
  • Tutorial page: https://awzstm.github.io/Data_Security_and_Privacy_in_Machine_Unlearning_Tutorial/
  • Machine unlearning has gained significant attention in recent years. However, the development of machine unlearning is associated with inherent vulnerabilities and threats, posing significant security and privacy challenges for researchers and practitioners. This tutorial will focus on the following aspects: (1) providing a comprehensive review of security and privacy challenges in machine unlearning from the data mining perspective; (2) introducing cutting-edge techniques for mitigating security and privacy risks in machine unlearning from both data and model perspectives; and (3) identifying open challenges and proposing promising future research directions in robust unlearning.

AI-Driven Multimodal Frameworks for Healthcare Decision-Making

  • Jiaming Cui (Virginia Tech), Xuan Wang (Virginia Tech), Zhe Zeng (University of Virginia), Hongru Du (University of Virginia)
  • Tutorial page: https://people.cs.vt.edu/jiamingcui/icdm25/index.html
  • Modern healthcare systems generate vast amounts of multimodal data, including electronic health records, clinical notes, medical images, physiological signals, and genomic sequences. While machine learning and data mining have shown remarkable promise, effectively unifying multimodal data into trustworthy, interpretable, and actionable frameworks remains a central challenge. This tutorial will provide a comprehensive overview of recent advances in AI-driven multimodal frameworks for healthcare, with four core themes: (1) foundation models for multimodal healthcare data, (2) neurosymbolic approaches for trustworthy high-stakes decisions, (3) integration of mechanistic modeling into machine learning for enhanced explainability, and (4) generative AI for population health applications. Through case studies and open discussions, we will highlight practical methodologies, key challenges, and future research directions. The tutorial is designed to bridge cutting-edge data mining research with pressing healthcare needs, fostering collaboration between AI researchers and healthcare practitioners.

Geospatial Foundation Models: Algorithms and Applications

  • Ranga Raju Vatsavai (North Carolina State University)
  • Tutorial page: https://rvatsavai.github.io/tutorials/
  • Foundation models are deep learning models trained on massive datasets and high-end computing resources. Recent advances have enabled them to perform a broad range of general tasks, including language processing, summarization, question answering, code generation, problem-solving, and reasoning. Geospatial foundation models are specifically trained on large-scale geospatial and temporal data. While general-purpose foundation models have demonstrated their capabilities in numerous popular applications, such as natural language generation, question answering, and text summarization, applications of geospatial foundation models are just beginning to emerge. This tutorial will first summarize recent advancements in geospatial foundation models and then describe their various applications. Key topics will include: (1) Data: preparing geospatial data (e.g., remote sensing) for AI; (2) Models: architectures such as NASA/IBM’s Prithvi, the Allen Institute’s SatlasPretrain, Stanford’s SatMAE, and SpectralGPT; and (3) Applications: techniques such as prompting, fine-tuning, and zero/few-shot learning for domains including climate and weather, spatial hazards, agriculture and forestry, and soil moisture prediction.

Federated Stochastic Compositional and Bilevel Optimization

  • Hongchang Gao (Temple University), Xinwen Zhang (Temple University)
  • Tutorial page: https://hcgao.github.io/tutorial_icdm2025.html
  • In recent years, Federated Learning has emerged as a rapidly growing area of research, sparking the development of a wide range of algorithmic approaches. Yet, the majority of these efforts have been confined to tackling conventional optimization problems, often overlooking broader machine learning paradigms. This tutorial shifts the focus to two increasingly important problem formulations: stochastic compositional optimization (SCO) and stochastic bilevel optimization (SBO). These frameworks encompass a variety of advanced learning scenarios that go well beyond standard objective minimization, including model-agnostic meta-learning, classification with imbalanced data, contrastive self-supervised learning, graph-based neural models, and neural architecture search. The inherently nested structure of SCO and SBO poses unique challenges, particularly in the Federated Learning setting, where both computational and communication constraints must be carefully managed. In response, a new line of research has emerged, aiming to adapt and extend optimization techniques to the federated setting for these complex problems. Despite this progress, the resulting methodologies remain relatively underexplored in the broader machine learning and data mining communities. This tutorial seeks to bridge that gap. We will provide a comprehensive overview of the theoretical foundations, algorithmic innovations, and practical applications of federated SCO and SBO. Participants will leave with a clear understanding of the challenges unique to these problems, the latest techniques developed to address them, and actionable insights on applying these methods to real-world federated learning applications.

Uncertainty Quantification and Mitigation in Large Language Models

  • Longchao Da (Arizona State University), Xiaoou Liu (Arizona State University), Hua Wei (Arizona State University)
  • Tutorial page: https://darl-genai.github.io/ICDM-UQ-LLM-Tutorial/
  • This tutorial introduces practical methods to quantify and mitigate uncertainty in Large Language Models (LLMs), covering input, reasoning, parameter, and prediction sources. Attendees will learn cutting-edge techniques and real-world strategies to support decision-making with more trustworthy, interpretable, and reliable LLM generations across high-stakes applications.

Behavior-Aware Data Valuation for LLMs at Scale

  • Zhaozhuo Xu (Stevens Institute of Technology), Huawei Lin (Rochester Institute of Technology), Weijie Zhao (Rochester Institute of Technology), Denghui Zhang (Stevens Institute of Technology)
  • Tutorial page: https://github.com/huawei-lin/RapidIn
  • Data valuation provides principled methods to quantify how training data shapes model performance and behavior, improving traceability, interpretability, and efficiency. Yet classical approaches, like influence functions and Shapley values, do not scale to trillion-token corpora or hundred-billion-parameter models. This tutorial introduces recent advances that address these challenges, including both second-order- and first-order-based approaches, particularly the recently proposed linearized influence kernel, an efficient metric that scales to LLMs with more than 600 billion parameters. This progress is largely attributed to RapidIn, a technique that enables near real-time estimation of training data influence. We will also demonstrate the slowly changing phenomenon, which enables forward-looking valuation of future training data before full model training. By integrating principled algorithms, system-level optimizations, case studies, and a hands-on demonstration, the tutorial bridges theory and practice, providing participants with both the knowledge and practical skills needed to apply scalable data valuation to real LLM scenarios.

Time Series Analysis Unraveled: Motifs, Forecasting, and Explainability

  • Jessica Lin (George Mason University), Panagiotis Papapetrou (Stockholm University), Li Zhang (The University of Texas Rio Grande Valley)
  • Time series data are ubiquitous in domains such as healthcare, finance, industry, and environmental science. Extracting insights from such data remains challenging due to complexity, scale, and the need for interpretability. This tutorial provides a comprehensive introduction to modern approaches for time series analysis, spanning classification, forecasting, motif discovery, and explainability. We cover both classical techniques and recent advances, including deep learning, transformers, and foundation models, with a special focus on interpretable methods. Participants will gain a practical understanding of how motifs and semantic structures can complement model-based approaches, and how explainability techniques can bridge the gap between statistical accuracy and human understanding. By combining methodological depth with application-driven examples, the tutorial equips researchers and practitioners with the conceptual tools to tackle real-world time series problems. This is the first time the tutorial is presented in this format, with a holistic integration of motifs and explainability.

AI for Precision Medicine: Integrative Analysis of Histopathology Images and Spatial Omics

  • Ninghui Hao (Harvard Medical School), Boshen Yan (Carnegie Mellon University), Dong Li (Baylor University), Chen Zhao (Baylor University), Guihong Wan (Harvard Medical School)
  • Tutorial page: https://sites.google.com/view/icdm25tutorial-ai4pm/main
  • Hematoxylin and eosin (H&E) imaging is the gold standard in clinical pathology, providing detailed visualization of tissue and cellular morphology. Spatial omics complements H&E images by offering spatially resolved molecular profiles. Together, these modalities are transforming our ability to interrogate tissues by integrating molecular resolution with structural context. In this tutorial, we present recent advances in spatial omics technologies combined with H&E histopathology. Methodologically, we trace the shift from statistical deconvolution to deep learning, highlighting graph neural networks (GNNs), transformers, and encoder-based architectures for cross-modal alignment and microenvironment modeling. Downstream tasks such as cell deconvolution, spatial domain identification, spatial reconstruction, and gene imputation are systematically reviewed with showcase applications. We conclude with open challenges in interpretability and point toward multi-omics spatial assays and scalable multimodal frameworks. This tutorial equips the audience with a unified view of integrating spatial biology and pathology at the cellular level through advanced machine learning methods.

Fairness in Language Models: A Tutorial

  • Zichong Wang (Florida International University), Avash Palikhe (Florida International University), Zhipeng Yin (Florida International University), Wenbin Zhang (Florida International University)
  • Tutorial page: https://github.com/vanbanTruong/Fairness-in-Large-Language-Models
  • Language Models (LMs) achieve outstanding performance across diverse applications but often produce biased outcomes, raising concerns about their trustworthy deployment. These concerns call for fairness research specific to LMs; however, most existing work in machine learning assumes access to model internals or training data, conditions that rarely hold in practice. As LMs exert growing societal influence, it becomes increasingly important to understand and address fairness challenges unique to these models. To this end, our tutorial begins by showcasing real-world examples of bias to highlight practical implications and uncover underlying sources. We then define fairness concepts tailored to LMs, review methods for bias evaluation and mitigation, and present a multi-dimensional taxonomy of benchmark datasets for fairness assessment. We conclude by outlining research challenges, aiming to provide the community with conceptual clarity and practical tools for fostering fairness in LMs.

Responsible GenFMs: From Foundational Principles to Real-World Impact

  • Yue Huang (University of Notre Dame), Canyu Chen (Northwestern University), Lu Cheng (University of Illinois Chicago), Bhavya Kailkhura (Lawrence Livermore National Laboratory), Manling Li (Northwestern University), Xiangliang Zhang (University of Notre Dame)
  • Generative foundation models (GenFMs)—encompassing large language and multimodal models—are reshaping the landscape of information retrieval and knowledge management. Yet, their rapid integration also raises pressing concerns around social responsibility, trust, and governance. This tutorial provides a practical, end-to-end introduction to responsible GenFMs, covering core concepts, multi-dimensional risk frameworks (spanning safety, privacy, robustness, truthfulness, fairness, and machine ethics), cutting-edge evaluation benchmarks, and proven mitigation approaches. We incorporate real-world case studies and hands-on exercises with open-source tools, while highlighting perspectives from both policy and industry, including emerging regulatory trends and enterprise adoption practices. The session concludes by addressing open challenges and offering actionable insights tailored for the ICDM community.

Explaining the “Unexplainable” Large Language Models

  • Zhen Tan (Arizona State University), Song Wang (University of Central Florida), Jing Ma (Case Western Reserve University), Jundong Li (University of Virginia), Huan Liu (Arizona State University)
  • Tutorial page: https://icdm-explain-tutorial.github.io/
  • The integration of Large Language Models (LLMs) into critical societal functions has intensified the urgent demand for transparency and trust. While post-hoc attribution and Chain-of-Thought reasoning serve as primary explainability approaches, they often prove unreliable, yielding brittle or illusory insights into model behavior. This tutorial establishes the theoretical intractability of complete mechanistic explanations, clarifies intrinsic barriers to full transparency, and introduces principled user-centric alternatives such as concept-based interpretability and controlled data attribution. We review their foundations and modern extensions for comprehensive explanation, inference-time intervention, and editability, demonstrating how these methods enable effective human-AI collaboration in high-stakes scientific applications.

Multiple Clustering: From Classical Foundations to Interactive, User-Guided Methods

  • Jiawei Yao (University of Washington), Juhua Hu (University of Washington), Jian Pei (Duke University)
  • Tutorial page: TBA
  • Modern datasets rarely admit a single partition; alternative clusterings arise from different semantics, feature subspaces, views, or downstream goals. This tutorial surveys how to generate, compare, and interpret such alternatives, spanning from classical methods to recent deep approaches. A key strand is interactive discovery, where user intent can be expressed via side information or natural language to guide how the data are grouped. In this new direction, we treat LLM‑ and vision–language–assisted guidance as one option among several (alongside pairwise constraints and subspace priors), and discuss prompt formulation, alignment objectives, and typical failure modes (prompt sensitivity, confounding, spurious diversity). Evaluation, illustrated with image, text, and biomedical data, covers validity–diversity trade‑offs, stability under resampling, subgroup fairness/calibration, and user‑alignment diagnostics.

Computational Pathology Foundation Models: Datasets, Adaptation Strategies, and Evaluations

  • Dong Li (Baylor University), Ninghui Hao (Harvard Medical School), Xintao Wu (University of Arkansas), Guihong Wan (Harvard Medical School), Chen Zhao (Baylor University)
  • This tutorial provides a comprehensive overview of multi-modal foundation models for computational pathology, an emerging field at the intersection of AI and digital pathology. We adopt a model-centric taxonomy that organizes methods into three paradigms: (1) vision-language models, including both non-LLM-based and instruction-tuned multi-modal LLMs; (2) vision-knowledge models integrating structured resources such as ontologies and knowledge graphs; and (3) vision-gene expression models linking histopathology with molecular profiles. The tutorial highlights recent advances in architectures, pretraining objectives, and adaptation strategies, as well as diverse downstream tasks such as classification, report generation, and visual question answering. We also address key challenges in dataset construction, evaluation protocols, and cross-modal alignment. Finally, we discuss broader impacts, including opportunities to improve diagnostic accuracy and healthcare access, alongside risks related to bias, interpretability, and ethical deployment. This tutorial targets researchers and practitioners seeking trustworthy, domain-adapted models for medical AI.