Short Courses (May 20-21, 2024)

An Introduction to the Statistical Foundations of Transfer Learning

Instructors: Dr. Yang Feng and Ye Tian

Yang Feng is a Professor of Biostatistics at New York University. He obtained his Ph.D. in Operations Research at Princeton University in 2010. Feng’s research interests encompass the theoretical and methodological aspects of machine learning, high-dimensional statistics, network models, and nonparametric statistics, leading to a wealth of practical applications. He has published more than 70 papers in statistical and machine learning journals. His research has been funded by multiple grants from the National Institutes of Health (NIH) and the National Science Foundation (NSF), notably the NSF CAREER Award. He is currently an Associate Editor for the Journal of the American Statistical Association (JASA), the Journal of Business & Economic Statistics (JBES), and the Annals of Applied Statistics (AoAS). His professional recognition includes being named a fellow of the American Statistical Association (ASA) and the Institute of Mathematical Statistics (IMS), as well as an elected member of the International Statistical Institute (ISI).

Ye Tian is a fifth-year PhD student in Statistics at Columbia University. He is actively working on data aggregation and its intersections with other topics such as high-dimensional statistics, robust statistics, latent variable models, and differential privacy. He is a recipient of the 2022 ICSDS Student Travel Award, the 2023 IMS Hannan Graduate Student Travel Award, and the 2023 NESS Student Research Award.

Abstract: This course offers a comprehensive introduction to the statistical foundations underpinning the prevalent machine learning technique: transfer learning. We delve into how transfer learning effectively transfers knowledge from one task to another in an adaptive and robust fashion, thereby enhancing model performance across both supervised and unsupervised learning frameworks. Various transfer learning frameworks will be discussed, including covariate shift and posterior drift, under different assumptions such as model sparsity and low-rank structure. Additionally, we will cover transfer learning in a reliable setting, addressing privacy concerns and potential dataset contaminations, which will be connected to other topics such as federated learning, differential privacy, and robust statistics. Tailored for data science professionals, the course aims to provide an in-depth understanding of problem formulations, algorithms, and their theoretical underpinnings within the context of transfer learning. Participants will gain the essential knowledge required to skillfully implement these techniques in diverse supervised and unsupervised learning scenarios.
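As a toy illustration of the covariate-shift framework named in the abstract (not part of the course materials): when the covariate distribution differs between a source and a target task but the conditional distribution of the label stays the same, importance weights w(x) = p_target(x) / p_source(x) let source samples stand in for the target task. The sketch below, in Python, assumes both covariate densities are known Gaussians, which is an idealization chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Covariate shift: x ~ N(0, 1) on the source task, x ~ N(1, 1) on the target task.
x_src = rng.normal(0.0, 1.0, size=200_000)

def density(x, mu):
    # Standard normal density with mean mu (variance 1), known here by assumption.
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

# Importance weights w(x) = p_target(x) / p_source(x).
w = density(x_src, 1.0) / density(x_src, 0.0)

# Labels depend on x the same way under both tasks (the covariate-shift assumption).
y_src = 2.0 * x_src + rng.normal(size=x_src.size)

# The unweighted source mean estimates E_source[y] = 0;
# the importance-weighted mean estimates E_target[y] = 2.
naive = y_src.mean()
weighted = np.average(y_src, weights=w)
```

The same reweighting idea applies to empirical risk minimization: multiplying each source-task loss term by w(x) yields an unbiased estimate of the target-task risk.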

Large Scale Spatial Data Science

Instructors: Dr. Marc G. Genton, Dr. Sameh Abdulah, and Dr. Mary Lai Salvaña

Marc G. Genton is a Distinguished Professor of Statistics at the Spatio-Temporal Statistics and Data Science (STSDS) research group, King Abdullah University of Science and Technology (KAUST), Saudi Arabia. He received his Ph.D. in Statistics in 1996 from the EPFL, Lausanne, Switzerland. He also holds an M.S. degree in Applied Mathematics teaching from EPFL. Prior to joining KAUST, he held faculty positions at MIT, North Carolina State University, the University of Geneva, and Texas A&M University. His research centers around spatial and spatio-temporal statistics. His interests include the statistical analysis, visualization, modeling, prediction, and uncertainty quantification of spatio-temporal data, with applications in environmental and climate science, renewable energies, geophysics, and marine science. He was awarded the 2020 Georges Matheron Lectureship from the International Association for Mathematical Geosciences and the 2023 Barnett Award from the Royal Statistical Society.

Sameh Abdulah obtained his MS and Ph.D. degrees from Ohio State University, Columbus, USA, in 2014 and 2016, respectively. Presently, he serves as a research scientist at the Extreme Computing Research Center (ECRC), King Abdullah University of Science and Technology, Saudi Arabia. His research focuses on various areas, including high-performance computing applications, big data, bitmap indexing, handling large spatial datasets, parallel spatial statistics applications, algorithm-based fault tolerance, and machine learning and data mining algorithms. Sameh was a part of the KAUST team nominated for the ACM Gordon Bell Prize in 2022 for their work on large-scale climate/weather modeling and prediction.

Mary Lai Salvaña is an Assistant Professor in Statistics at the University of Connecticut (UConn). Prior to joining UConn, she was a Postdoctoral Fellow in the Department of Mathematics at the University of Houston. She received her Ph.D. in Statistics from King Abdullah University of Science and Technology (KAUST), Saudi Arabia. She obtained her BS and MS degrees in Applied Mathematics from Ateneo de Manila University, Philippines, in 2015 and 2016, respectively. Her research interests include extreme and catastrophic events, risks, disasters, spatial and spatio-temporal statistics, environmental statistics, computational statistics, large-scale data science, and high-performance computing.

Abstract: Spatial data science involves the analysis of spatial data distributions, patterns, and correlations within a predefined geographic area. This science field assumes that nearby spatial data points exhibit some association. Historically, most spatial datasets were manageable, allowing for exact inference using sequential processing software. However, recent advancements in data collection techniques have led to a surge in data volume, posing significant challenges for large-scale spatial data analysis. High-Performance Computing (HPC) has emerged as a valuable tool for addressing these challenges in various spatial applications, allowing researchers to tackle the vast datasets that have become commonplace. The advent of parallel processing hardware systems, including shared and distributed memory multiprocessors and GPU accelerators, has made it feasible to process big data in spatial statistics. Parallel computing can relieve the computational and memory limitations of large-scale Gaussian random process inference. This course aims to provide an overview of spatial statistics, explore existing approximation methods for Gaussian random processes, delve into state-of-the-art HPC techniques, and demonstrate how these techniques can solve large-scale spatial problems. We aim to encompass the parallel implementation of existing tools and modern approximation methods, such as low-rank approximation at a granular level and multi- and mixed-precision approximation to mitigate the computational load.
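The computational bottleneck motivating the approximations above can be made concrete: evaluating the exact Gaussian log-likelihood requires factorizing the n x n covariance matrix, which costs O(n^3) time and O(n^2) memory. Below is a minimal NumPy sketch of the exact computation via a Cholesky factorization; it is an illustration only, not the course software, and the exponential covariance model and parameter values are arbitrary choices for the example.

```python
import numpy as np

def exp_cov(coords, variance=1.0, range_=0.3):
    # Exponential covariance: C(h) = variance * exp(-h / range_),
    # where h is the Euclidean distance between locations.
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return variance * np.exp(-d / range_)

def gaussian_loglik(y, C):
    # Exact Gaussian log-likelihood via Cholesky: O(n^3) time, O(n^2) memory.
    # This cubic cost is what low-rank and mixed-precision methods mitigate.
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L, y)                  # solves L @ alpha = y
    logdet = 2.0 * np.sum(np.log(np.diag(L)))      # log det(C)
    n = len(y)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + alpha @ alpha)

rng = np.random.default_rng(1)
coords = rng.uniform(size=(100, 2))                # 100 random locations in the unit square
C = exp_cov(coords) + 1e-8 * np.eye(100)           # small jitter for numerical stability
y = np.linalg.cholesky(C) @ rng.standard_normal(100)  # one simulated realization
ll = gaussian_loglik(y, C)
```

Doubling n multiplies the factorization cost by roughly eight, which is why datasets with millions of locations require the HPC and approximation techniques the course covers.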

The course content will cover the basic concepts of large-scale spatial statistics on parallel systems through synthetic and real data examples using both exact and approximation methods. The course will also provide a comprehensive comparison between existing geostatistics packages (fields and geoR) and the cutting-edge HPC packages (ExaGeoStatR and MPCR) to show the main contribution and benefits of using HPC techniques on leading-edge parallel hardware architectures such as GPUs and supercomputers. In general, the course will cover both the theoretical part of spatial statistics and HPC systems and the practical part through coding exercises and performance measurements.

Tentative Course Outline

  1. Spatial Statistics Overview -- 30 minutes -- Marc Genton
  2. Advanced Approximation Techniques for Large-Scale Spatial Data: Low-Rank and Mixed-Precision Approaches -- 30 minutes -- Marc Genton
  3. An Overview of High-Performance Computing (HPC) -- 30 minutes -- Sameh Abdulah
  4. Recent R Packages for Gaussian Likelihood Inference -- 30 minutes -- Sameh Abdulah
  5. Exploring Geostatistics R Packages: An Overview with Code Examples -- 40 minutes -- Sameh Abdulah
  6. Large-Scale Geostatistics with R Packages: Demonstrated with Code Examples -- 40 minutes -- Mary Salvaña
  7. Advanced Practical Applications Illustrated with Code Examples -- 30 minutes -- Mary Salvaña
  8. Discussion and Conclusions -- 10 minutes -- Marc Genton

Learning Outcomes

By the end of the course, attendees are expected to:

  1. Gain a fundamental understanding of spatial statistics and its applications across various academic disciplines.
  2. Acquire knowledge of established approximation techniques for managing large spatial datasets, including Tile Low-Rank (TLR) and Mixed Precision (MP) methods.
  3. Explore contemporary High-Performance Computing (HPC) systems and learn how to leverage them for accelerating and managing big data applications.
  4. Discover practical tools for spatial data analysis in R, such as fields and geoR, illustrated through code examples.
  5. Gain insight into the implementation specifics of the recent HPC packages, ExaGeoStatR and MPCR, including their core components and various parallel computation options.
  6. Develop the skills to analyze real-world, large-scale spatial data using HPC tools.

Statistical Network Analysis in R

Instructor: Dr. Eric Kolaczyk

Eric Kolaczyk is a professor in the Department of Mathematics and Statistics, and inaugural director of the McGill Computational and Data Systems Initiative (CDSI). He is also an associate academic member of Mila, the Quebec AI Institute. His research is focused at the point of convergence where statistical and machine learning theory and methods support human endeavors enabled by computing and engineered systems, frequently from a network-based perspective of systems science. He collaborates regularly on problems in computational biology, computational neuroscience and, most recently, AI-assisted chemistry and materials science. He has published over 100 articles, including several books on the topic of network analysis. As an associate editor, he has served on the boards of JASA and JRSS-B in statistics, IEEE IP and TNSE in engineering, and SIMODS in mathematics. He formerly served as co-chair of the US National Academies of Sciences, Medicine, and Engineering Roundtable on Data Science Education. He is an elected fellow of the AAAS, ASA, and IMS, an elected senior member of the IEEE, and an elected member of the ISI.

Abstract: A gentle introduction to the statistical analysis of network data, largely through the lens of the R package igraph. Topics to be covered include basic definitions and concepts in networks, manipulation and visualization of network data, and tools for describing network characteristics, as well as a brief look at select inferential topics such as node clustering (aka ‘community detection’) and network modeling. Material will be drawn largely from Chs 1-4 of Kolaczyk and Csardi (2020), Statistical Analysis of Network Data in R, 2nd Edition, and selected later chapters.
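The course itself works in R with igraph, but the kind of descriptive network statistics it covers can be illustrated language-neutrally. The sketch below computes two basic characteristics, the degree sequence and the edge density, for a toy undirected graph in plain Python; the graph is invented for the example and is not from the course.

```python
# A toy undirected graph as an edge list (illustrative data, not course material).
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
nodes = sorted({v for e in edges for v in e})

# Degree of each node: the number of edges incident to it.
degree = {v: 0 for v in nodes}
for u, w in edges:
    degree[u] += 1
    degree[w] += 1

# Edge density: the fraction of the n*(n-1)/2 possible edges that are present.
n, m = len(nodes), len(edges)
density = 2 * m / (n * (n - 1))
```

In igraph these quantities are one-liners (degree and density functions on a graph object); the point here is only that "describing network characteristics" reduces to simple computations on the edge list.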

Informative Prior Elicitation Using Historical Data, Expert Opinion, and Other Sources

Instructor: Dr. Joseph G. Ibrahim

Dr. Joseph G. Ibrahim is Alumni Distinguished Professor of Biostatistics at the University of North Carolina. Dr. Ibrahim's areas of research focus are Bayesian inference, missing data problems, cancer, and clinical trials. With over 31 years of experience working in Bayesian methods, Dr. Ibrahim directs the UNC Laboratory for Innovative Clinical Trials. He is also the Director of Graduate Studies in UNC's Department of Biostatistics. He is an Elected Fellow of ASA, IMS, ISBA, ISI, and RSS.

Abstract: This full-day short course is designed to give biostatisticians and data scientists a comprehensive overview of informative prior elicitation from historical data, expert opinion, and other data sources, such as real-world data, prior predictions, estimates, and summary statistics. We focus on both Bayesian design and analysis, and examples will be presented for several types of applications, such as clinical trials, observational studies, and environmental studies, as well as other areas in biomedical research. The methods we present will be demonstrated using Stan, SAS, and the newly developed R packages hdbayes and BayesPPD.

The first part of the course gives a brief but broad overview of Bayesian inference, examining concepts of Bayesian design and analysis such as (i) Bayesian type I error and power, (ii) calculation of posterior and predictive distributions, (iii) MCMC sampling methods, (iv) fundamental concepts in informative and non-informative prior elicitation, (v) Bayesian point and interval estimation, and (vi) Bayesian hypothesis testing. These topics will be presented in a general context as well as in several regression settings, including linear and generalized linear models, models for longitudinal data, and survival models. The first part of the course contains two sections.

The second part of the course will focus broadly on advanced methods for informative prior elicitation, including (i) informative prior elicitation from historical data using the power prior (PP) and its variations, including the normalized power prior, the partial borrowing power prior, the asymptotic power prior, and the scale transformed power prior (STRAPP). In addition, (ii) the Bayesian hierarchical model (BHM), the commensurate prior, and the robust meta-analytic mixture prior (MAP) will also be examined, and the properties and performance of the four priors (BHM, PP, commensurate, robust MAP) will be analytically compared and studied via simulations and real data analyses of case studies. We will also examine (iii) informative prior elicitation from predictions, including the hierarchical prediction prior (HPP) and the Information Matrix (IM) prior, as well as (iv) strategies for informative prior elicitation from expert opinion. Finally, we discuss (v) synthesis of randomized controlled trial and real-world data using Bayesian nonparametric methods. For (i)-(iv), we will present examples in the context of both Bayesian design and analysis and demonstrate the performance of these priors through several simulation studies and case studies involving real data in the context of linear and generalized linear models, longitudinal data, and survival data. We will also demonstrate the implementation of these priors through the hdbayes and BayesPPD R packages, SAS, Nimble, and Stan.
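To make the power prior concrete in its simplest case: for a normal mean with known variance, raising the historical likelihood to the power a0 in [0, 1] simply discounts the historical sample size from n0 to a0 * n0 before conjugate updating. The sketch below assumes a flat initial prior and a known variance; the function name and parameter values are illustrative, not from the course packages.

```python
# Power prior: pi(theta | D0, a0) is proportional to L(theta | D0)^a0 * pi_0(theta).
# For a normal mean with known variance sigma^2 and flat pi_0, the power prior is
# N(ybar0, sigma^2 / (a0 * n0)): the historical data count as a0 * n0 observations.

def power_prior_posterior(ybar, n, ybar0, n0, a0, sigma2=1.0):
    """Posterior mean and variance of theta given current data (ybar, n)
    and historical data (ybar0, n0) discounted by the power parameter a0."""
    prec = (n + a0 * n0) / sigma2                       # posterior precision
    mean = (n * ybar + a0 * n0 * ybar0) / (n + a0 * n0)  # precision-weighted mean
    return mean, 1.0 / prec

# a0 = 0 ignores the historical data entirely; a0 = 1 pools the two datasets.
m0, v0 = power_prior_posterior(ybar=1.0, n=50, ybar0=2.0, n0=100, a0=0.0)
m1, v1 = power_prior_posterior(ybar=1.0, n=50, ybar0=2.0, n0=100, a0=1.0)
```

Intermediate values of a0 interpolate between the two extremes, which is the basic borrowing mechanism the normalized, partial borrowing, and other power prior variants refine.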

Introduction to Causal Inference: From Theory to Practice

Instructor: Dr. Linbo Wang

Dr. Linbo Wang is an assistant professor in the Department of Statistical Sciences and the Department of Computer and Mathematical Sciences at the University of Toronto. He is also a faculty affiliate at the Vector Institute for Artificial Intelligence and holds affiliate positions in the Department of Computer Science at the University of Toronto and the Department of Statistics at the University of Washington. Before assuming these roles, he was a postdoctoral researcher at the Harvard T.H. Chan School of Public Health. He obtained his Ph.D. from the University of Washington. He has published many articles and research papers in prestigious journals and conferences and received the ICSA Outstanding Young Researcher Award in 2022. His research focuses on causality and its interplay with statistics and machine learning. He has taught courses on causal inference to a diverse audience, ranging from students at various levels to professionals in academia and industry.

Abstract: When attempting to make sense of data, decision makers often encounter causal inquiries such as, “Would COVID-19 case numbers have been lower with an earlier lockdown?” It is widely acknowledged that correlation does not imply causation, and a perfectly conducted randomized experiment is considered the gold standard for drawing causal conclusions. However, there is a significant gap between these two extremes: perfect randomized experiments are not always available, and decision-makers frequently need to infer causality, not just correlation.

This course offers a quick overview of causal inference concepts and methods, designed for those new to the field but familiar with basic statistical tools such as regression models and with R programming. It covers the mathematical underpinnings and cutting-edge methods of causal analysis, aiming to squeeze as much evidence as possible from imperfect studies about the causal effects of interest. Key topics we’ll explore include:
  • The different languages of causal inference, including potential outcomes, graphical models (e.g., DAGs and SWIGs), and structural equation models, along with how to translate between them.
  • How to identify causal effects from observational and imperfectly randomized studies (i.e., dealing with unmeasured confounding).
  • When regression models provide consistent estimates of causal effects.
  • How to estimate causal effects when regression does not suffice.
  • Implementation of causal estimation methods using R.
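Although the course implements its methods in R, the gap between correlation and causation described above can be illustrated with a short Python sketch. Inverse probability weighting, one standard estimator in this field, removes the bias that a naive group comparison suffers under confounding; the data-generating process and all parameter values below are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# A binary confounder Z affects both treatment assignment T and outcome Y.
z = rng.binomial(1, 0.5, n)
p_treat = np.where(z == 1, 0.8, 0.2)            # propensity score e(Z), known here
t = rng.binomial(1, p_treat)
y = 2.0 * t + 3.0 * z + rng.normal(size=n)      # true causal effect of T is 2

# Naive comparison of treated vs. untreated is confounded by Z
# (expected value roughly 2 + 3 * (0.8 - 0.2) = 3.8).
naive = y[t == 1].mean() - y[t == 0].mean()

# Inverse probability weighting reweights each unit by 1 / P(its own treatment),
# recovering the average causal effect of 2.
ipw = np.mean(t * y / p_treat) - np.mean((1 - t) * y / (1 - p_treat))
```

In observational practice the propensity score must itself be estimated, typically by a regression of T on the measured confounders, which is where the course's regression and identification topics connect.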

Tutorial on Deep Learning and Generative AI

Instructor: Dr. Haoda Fu

Dr. Haoda Fu is an Associate Vice President and an Enterprise Lead for Machine Learning, Artificial Intelligence, and Digital Connected Care at Eli Lilly and Company. Dr. Fu is a Fellow of the ASA (American Statistical Association) and of the IMS (Institute of Mathematical Statistics). He is also an adjunct professor in the Department of Biostatistics at the University of North Carolina at Chapel Hill and at the Indiana University School of Medicine. Dr. Fu received his Ph.D. in Statistics from the University of Wisconsin-Madison in 2007 and joined Lilly after that. Since joining Lilly, he has been very active in statistics and data science methodology research. He has more than 100 publications in areas such as Bayesian adaptive design, survival analysis, recurrent event modeling, personalized medicine, indirect and mixed treatment comparison, joint modeling, Bayesian decision making, and rare events analysis. In recent years, his research has focused on machine learning and artificial intelligence. His work has been published in various top journals including JASA, JRSS, Biometrika, Biometrics, ACM, IEEE, JAMA, and the Annals of Internal Medicine. He has taught machine learning and AI topics at large industry conferences, including an FDA workshop. He has served on boards of directors and as program chair and committee chair for statistical organizations such as ICSA, ENAR, and the ASA Biopharmaceutical Section. He is a member of the COPSS Snedecor Award committee for 2022-2026 and has served as an associate editor for JASA Theory and Methods since 2023.

Abstract: Designed specifically for individuals possessing a strong foundation in statistics and biostatistics, this course seeks to bridge the gap into the realm of deep learning and generative AI. Beginning with fundamental knowledge of deep learning, participants will be guided through hands-on implementations using the PyTorch framework. As we delve deeper, the course will unpack popular architectures that have reshaped the landscape of artificial intelligence, including CNN, GNN, ResNet, U-net, attention mechanisms, and transformers. Given the increasing importance of AI in healthcare, special emphasis will be laid on techniques tailor-made for medical imagery and drug discovery, such as SE(3) equivariant machine learning. As a culmination, participants will be introduced to the various facets of generative AI, encompassing GANs, VAEs, DDPM, and score-based generative models. Whether you're seeking to apply these technologies in healthcare, research, or any other domain, this tutorial promises a comprehensive insight into the world of generative AI and deep learning. For this short course, we will use Python; necessary packages such as PyTorch and NumPy are required. All the software and packages used in this short course are free.
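As one concrete example of the architectures listed above, scaled dot-product attention, the core operation of the transformer, can be sketched in a few lines of NumPy. This is a hedged illustration rather than course material; the course itself works in PyTorch, and the shapes below are arbitrary.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Computes softmax(Q K^T / sqrt(d_k)) V: each query row attends to all keys,
    # and the resulting weights form a convex combination of the value rows.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))   # 4 queries of dimension 8
K = rng.standard_normal((6, 8))   # 6 keys of dimension 8
V = rng.standard_normal((6, 8))   # 6 values of dimension 8
out, attn = scaled_dot_product_attention(Q, K, V)  # out has shape (4, 8)
```

Multi-head attention, covered in the course, runs several such maps in parallel on learned linear projections of Q, K, and V and concatenates the results.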