1. A Short Course on Optimization for Data Science and Machine Learning Problems (Full day; May 31, 2025)
2. Statistical Methods for Composite Time-to-Event Outcomes: Win Ratio and Beyond (Full day; May 31, 2025)
3. Statistical Tests for Bioequivalence and Biosimilarity (Half day - AM Session; May 31, 2025)
4. Statistical Methods for Time-to-Event Data from Multiple Sources: A Causal Inference Perspective (Full day; June 1, 2025)
5. From Estimands to Robust Inference of Treatment Effects in Platform Trials (Half day - AM Session; June 1, 2025)
6. Statistics Meets Tensors: Methods, Theory, and Applications (Full day; June 1, 2025)
Abstract: Optimization lies at the heart of modern data science, offering scalable solutions for high-dimensional problems in statistics and machine/deep learning. The first part of the course will cover: (i) the fundamentals of gradient-based optimization and (ii) advanced optimization methods. These algorithms will be illustrated through applications in high-dimensional statistics and machine learning, including sparse regression, matrix completion, graphical models and feed-forward neural networks. The second part will explore key recent developments in optimization driven by challenges in machine and deep learning. It will briefly cover: (i) Federated and distributed learning, where decentralized optimization techniques enable efficient model training across multiple devices while preserving data privacy. (ii) Minimax optimization, a powerful framework for adversarial learning, robust statistics, and generative modeling. (iii) Bilevel optimization, which has gained prominence in the last 2-3 years for applications such as hyperparameter tuning, meta-learning, and reinforcement learning. The course will balance core concepts with sufficient technical depth, providing an accessible yet insightful perspective on the latest advances in optimization.
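To give a concrete flavor of the gradient-based methods in Part I, the following minimal R sketch runs proximal gradient descent (ISTA) on a lasso-penalized regression with simulated data; the dimensions, step size, and penalty level are illustrative assumptions, not course material.

    # Proximal gradient descent (ISTA) for the lasso:
    # minimize (1/2n)||y - X b||^2 + lambda * ||b||_1  (illustrative sketch)
    set.seed(1)
    n <- 200; p <- 50
    X <- matrix(rnorm(n * p), n, p)
    beta_true <- c(rep(2, 5), rep(0, p - 5))        # sparse truth (assumed)
    y <- X %*% beta_true + rnorm(n)

    soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

    lambda <- 0.1
    step <- 1 / max(eigen(crossprod(X) / n, symmetric = TRUE)$values)  # 1/L
    beta <- rep(0, p)
    for (k in 1:500) {
      grad <- crossprod(X, X %*% beta - y) / n      # gradient of the smooth part
      beta <- soft_threshold(beta - step * grad, step * lambda)  # proximal step
    }
    sum(beta != 0)                                  # number of selected variables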
Course Outline: Part I: (a) Fundamentals of Gradient-Based Optimization
Lu Mao is an Associate Professor in the
Department of Biostatistics and Medical Informatics at the
University of Wisconsin–Madison. He joined the department as
an Assistant Professor after earning his PhD in Biostatistics
from UNC Chapel Hill in 2016. His research interests include
survival analysis, particularly composite outcomes, as well as
causal inference, semiparametric theory, and clinical trials.
He is currently the principal investigator of an NIH R01 grant
on statistical methodology for composite time-to-event
outcomes in cardiovascular trials and an NSF grant on causal
inference in randomized trials with noncompliance. Beyond
methodological research, he collaborates with medical
researchers in cardiology, radiology, oncology, and health
behavioral interventions, where time-to-event and longitudinal
data are routinely analyzed. He has also taught several short
courses on statistical methods for composite outcomes to broad
audiences, including a recent one at the 2024 Joint
Statistical Meetings (JSM) in Portland, OR.
Course Outline
Elena Rantou, PhD, is a Master Scientist
in the Office of Biostatistics/OTS/CDER. She joined FDA in
2013. Since 2019, she has been a lead mathematical statistician working with generic and biosimilar products. Her research focuses mainly on assessing bioequivalence of topical/dermatological generic products, characterizing outliers in replicate PK studies, detecting data anomalies, and using AI/ML in drug development. She has
contributed to various working groups and worked towards
guidance development. She is part of the leadership of the FDA
Modeling and Simulation working group and co-chairs the AI/ML
and the digital health technologies (DHT) Regulatory Review
Committee in the Office of Biostatistics. Elena holds a PhD
from American University, Washington, DC, and prior to joining the FDA, she worked in academia and as a statistical consultant for over 15 years.
Abstract: Statistical testing for bioequivalence plays a crucial role in the regulatory approval of generic drugs, ensuring that they have the same rate and extent of absorption as a reference drug. It is also used to confirm that a follow-on therapeutic biologic product, like a biosimilar monoclonal antibody, is highly similar to its reference biologic, with no clinically meaningful differences. This half-day course will cover various types of bioequivalence studies, including in-vivo pharmacokinetic (PK) studies, comparative clinical endpoint studies, and in-vitro studies. We will explore the different statistical tests applicable to each type of study, addressing both continuous and discrete endpoints. These concepts will be explained in theory and illustrated with examples from approved marketing applications. Furthermore, challenges encountered during the review of these studies have led to the development of advanced regulatory statistical methodologies. These challenges include issues such as outliers, statistical power, sample size considerations, study design, and variability in drug performance. The course will highlight these challenges and demonstrate how they are addressed so that bioequivalence assessments are both accurate and reliable.
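As a simple illustration of the testing logic, the sketch below applies the two one-sided tests (TOST) procedure for average bioequivalence to simulated log-transformed AUC data from a parallel design, using the usual 80%-125% limits; actual submissions typically involve crossover designs and more elaborate models, so this is only a schematic example.

    # TOST for average bioequivalence on log(AUC), simulated parallel design
    set.seed(2)
    log_auc_test <- rnorm(24, mean = log(100), sd = 0.25)   # test product
    log_auc_ref  <- rnorm(24, mean = log(105), sd = 0.25)   # reference product

    diff_mean <- mean(log_auc_test) - mean(log_auc_ref)
    se <- sqrt(var(log_auc_test) / 24 + var(log_auc_ref) / 24)
    ci90 <- diff_mean + c(-1, 1) * qt(0.95, df = 46) * se   # 90% CI (simplified pooled df)
    exp(ci90)                                               # CI for the geometric mean ratio
    ci90[1] > log(0.80) & ci90[2] < log(1.25)               # conclude BE if TRUE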
Xiaofei Wang is a Professor of
Biostatistics and Bioinformatics at Duke University School of
Medicine, and the Director of Statistics for Alliance
Statistics and Data Management Center. Dr. Wang has been
involved in clinical trials, observational studies, and
translational studies for Alliance/CALGB and Duke Cancer
Institute. His methodology research has been funded by NIH
with a focus on biased sampling, causal inference, survival
analysis, methods for predictive and diagnostic medicine, and
clinical trial design. He is an Associate Editor for
Statistics in Biopharmaceutical Research and an elected Fellow of the American Statistical Association (ASA).
Shu Yang is an Associate Professor of
Statistics, Goodnight Early Career Innovator, and University
Faculty Scholar at North Carolina State University. Her
primary research interest is causal inference and data
integration, particularly with applications to comparative
effectiveness research in health studies. She also works
extensively on methods for missing data and spatial
statistics. Dr. Yang has been a Principal Investigator on research projects funded by the U.S. NSF, NIH, and FDA. She is a recipient of the COPSS Emerging Leader Award.
Abstract: The short course will review important statistical methods for survival data arising from multiple data sources, including randomized clinical trials and observational studies. It consists of four parts, all of which will be discussed in a unified causal inference framework. In each part, we will review the theoretical background. Supplemented with data examples, we will emphasize the application of these methods in practice and their implementation in freely available statistical software. Each part takes approximately two hours to cover.
Part 1: (Instructor: Xiaofei Wang)
In Part 1, we
will review key issues and methods in designing randomized
clinical trials (RCTs). Statistical tests, such as the logrank test and its weighted variants, inference for the hazard ratio under the Cox proportional hazards (PH) model, and causal estimands based on survival functions (e.g., the restricted mean survival time (RMST) difference), will be discussed. Examples and data from cancer clinical trials will be used to illustrate these methods. In addition, standard survival analysis methods, such as the Kaplan-Meier estimator, the logrank test, and Cox PH models, have been commonly used to analyze survival data arising from observational studies, in which treatment groups are not randomly assigned as in RCTs. We will start by introducing the statistical framework of causal inference and then shift the focus to causal inference methods for survival data. We will review various methods that allow valid visualization of and testing for confounder-adjusted survival curves and RMST differences, including the G-formula, Inverse Probability of Treatment Weighting, Propensity Score Matching, calibration weighting, and Augmented Inverse Probability of Treatment Weighting. Examples and data from cancer observational studies
will be used to illustrate these methods.
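As a small illustration of the confounder-adjustment methods above, the following R sketch (simulated data, survival package) fits a propensity score model, forms inverse probability of treatment weights, and produces IPTW-adjusted Kaplan-Meier curves and a weighted Cox model; the data-generating model is an assumption made only for the example.

    library(survival)
    set.seed(3)
    n <- 500
    x <- rnorm(n)                                   # a single baseline confounder
    trt <- rbinom(n, 1, plogis(0.8 * x))            # nonrandomized treatment
    t_event <- rexp(n, rate = exp(-0.5 * trt + 0.7 * x))
    t_cens  <- rexp(n, rate = 0.1)
    time <- pmin(t_event, t_cens)
    status <- as.numeric(t_event <= t_cens)

    ps <- glm(trt ~ x, family = binomial)$fitted.values   # propensity scores
    w  <- ifelse(trt == 1, 1 / ps, 1 / (1 - ps))          # IPTW weights

    fit_km  <- survfit(Surv(time, status) ~ trt, weights = w)   # adjusted curves
    fit_cox <- coxph(Surv(time, status) ~ trt, weights = w, robust = TRUE)
    summary(fit_cox)$conf.int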
Part 2: (Instructor: Shu Yang)
In Part 2, we
will cover the objectives and methods that allow integrative
data analyses from RCTs and observational studies. These
methods exploit the complementing features of RCTs and
observational studies to estimate the average treatment effect
(ATE), heterogeneity of treatment effect (HTE), and
individualized treatment rules (ITRs) over a target
population. First, we will review existing statistical methods for generalizing RCT findings to a target population, leveraging the representativeness of the observational studies. Due to population heterogeneity, the ATE and ITRs estimated from RCTs may lack external validity/generalizability to the target population. We will
review the statistical methods for conducting generalizable
RCT analysis for the targeted ATE and ITRs, including inverse
probability sampling weighting, calibration weighting, outcome
regression, and doubly robust estimators. R software and
applications will also be covered. Second, we will review existing statistical methods for integrating RCTs and observational studies for robust and efficient estimation of the HTE. RCTs have been regarded as the gold standard for treatment effect evaluation because treatment is randomized, but they may be underpowered to detect HTEs due to practical limitations. On the other hand, large observational studies contain rich information on how patients respond to treatment, although treatment assignment may be confounded. We will review statistical methods
for robust and efficient estimation of the HTE leveraging the
treatment randomization in RCTs and rich information in
observational studies, including test-based integrative
analysis, selective borrowing, and confounding function
modeling. R software and applications will also be
covered.
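To make the generalizability idea concrete, the following R sketch (simulated data; the covariate shift and outcome model are assumptions for illustration only) reweights an RCT sample toward an observational target population using inverse odds of sampling weights and compares the reweighted estimate with the naive RCT-only estimate.

    # Generalizing an RCT treatment effect to a target population (sketch)
    set.seed(4)
    n_rct <- 400; n_obs <- 2000
    x_rct <- rnorm(n_rct, mean = 0.5)               # RCT over-represents large x
    x_obs <- rnorm(n_obs, mean = 0)                 # target (observational) sample
    a <- rbinom(n_rct, 1, 0.5)                      # randomized treatment in RCT
    y <- 1 + (1 + 0.5 * x_rct) * a + x_rct + rnorm(n_rct)   # effect varies with x

    # Sampling model: probability of RCT membership given covariates
    dat <- data.frame(s = c(rep(1, n_rct), rep(0, n_obs)), x = c(x_rct, x_obs))
    p_s <- glm(s ~ x, family = binomial, data = dat)$fitted.values[dat$s == 1]
    w <- (1 - p_s) / p_s                            # inverse odds weights for RCT subjects

    ate_rct    <- mean(y[a == 1]) - mean(y[a == 0])           # RCT-population estimate
    ate_target <- weighted.mean(y[a == 1], w[a == 1]) -
                  weighted.mean(y[a == 0], w[a == 0])         # target-population estimate
    c(ate_rct = ate_rct, ate_target = ate_target)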
Learning Strategy
The course material will blend
concepts, methods, and real-data applications. It will also
describe how to implement the methods using R packages.
Pre-requisites
Attendees are expected to be
familiar with survival analysis and some concepts of causal
inference, but a deep understanding of the general principles
of causal inference is not required.
Ting Ye is an Assistant Professor in
Biostatistics at the University of Washington. Her research
aims to accelerate human health advances through data-driven
discovery, development, and delivery of clinical, medical, and
scientific breakthroughs, spanning the design and analysis of
complex innovative clinical trials, causal inference in
biomedical big data, and quantitative medical research. Ting
is a recipient of the School of Public Health's Genentech
Endowed Professorship and the NIH Maximizing Investigators'
Research Award (MIRA). Ting is a leader in covariate
adjustment for randomized clinical trials. She has published over ten papers on this topic, including four in top-tier journals such as
JASA, JRSSB, and Biometrika, two of which have been cited in
the FDA's official guidance. The RobinCar R package, developed
by her research group, has become a standard software tool in
the field. She is also the co-founder and co-chair of an ASA
Biopharmaceutical Section Scientific Working Group on
Covariate Adjustment.
Abstract: A platform trial is an innovative clinical trial design that uses a master protocol (i.e., one overarching protocol) to evaluate multiple treatments in an ongoing manner and can accelerate the evaluation of new treatments. However, its flexibility introduces inferential challenges, with two fundamental ones being the precise definition of treatment effects and robust, efficient inference on these effects. Central to these challenges is defining an appropriate target population for the estimand, as the populations represented by some commonly used analysis approaches can arbitrarily depend on the randomization ratio or trial type. In this short course, we will first establish a clear framework for constructing clinically meaningful estimands with precise specification of the population of interest. In particular, we introduce the concept of the Entire Concurrently Eligible (ECE) population, which preserves the integrity of randomized comparisons while remaining invariant to both the randomization ratio and trial type. This framework provides a solid foundation for future design, analysis, and research in platform trials. Next, we will present weighting and post-stratification methods for estimation of treatment effects with minimal assumptions. To fully leverage the efficiency potential of platform trials, we will also present model-assisted approaches for baseline covariate adjustment to gain efficiency while maintaining robustness against model misspecification. Additionally, we will discuss and compare the asymptotic distributions of the proposed estimators and introduce robust variance estimators. Throughout the course, we will illustrate these concepts and methods through case studies and demonstrate their implementation using the R package RobinCID.
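As a schematic illustration (simulated data and generic base R code, not the RobinCID interface), the sketch below contrasts a naive pooled comparison with a period-stratified ("post-stratified") comparison against concurrent controls when the randomization ratio and the outcome distribution both change over the life of a platform trial.

    set.seed(5)
    period <- rep(1:3, times = c(120, 150, 130))        # enrollment periods (assumed)
    n <- length(period)
    trt <- rbinom(n, 1, ifelse(period == 1, 1/2, 1/3))  # randomization ratio changes
    y <- 0.3 * trt + 0.2 * period + rnorm(n)            # outcome drifts across periods

    naive <- mean(y[trt == 1]) - mean(y[trt == 0])      # pooled, ignores periods

    by_period <- split(data.frame(y, trt), period)
    diffs <- sapply(by_period, function(d) mean(d$y[d$trt == 1]) - mean(d$y[d$trt == 0]))
    wts <- sapply(by_period, nrow) / n                  # weights proportional to period size
    stratified <- sum(wts * diffs)                      # post-stratified estimate
    c(naive = naive, stratified = stratified)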
Course Outline
Anru Zhang is the Eugene Anson Stead, Jr. M.D. Associate Professor at Duke University, with a primary faculty appointment held jointly in the Department of Biostatistics & Bioinformatics and the Department of Computer Science. He obtained his bachelor's degree from Peking
University in 2010 and his Ph.D. from the University of
Pennsylvania in 2015. His work focuses on high-dimensional
statistical inference, tensor learning, generative models, and
applications in electronic health records and microbiome data
analysis. He won the IMS Tweedie Award, the COPSS Emerging
Leader Award, and the ASA Gottfried E. Noether Junior Award.
His research is currently supported by two NIH R01 Grants (as
PI and MPI) and an NSF CAREER Award.
Abstract:
High-dimensional high-order/tensor data refers to data organized in the form of large-scale arrays spanning three or more dimensions, which is becoming increasingly prevalent across various fields,
including biology, medicine, psychology, education, and
machine learning. Specifically, tensor data is prevalent in
biological and medical research, playing a crucial role in
various studies. For instance, in longitudinal microbiome
research, microbiome samples are collected from multiple
subjects (units) at multiple time points to analyze the
abundance of bacteria (variables) over time. Depending on the
taxonomic level under investigation, there can be hundreds or
thousands of bacterial taxa in the feature mode, with many
taxa exhibiting strong correlations in their abundance
patterns. In the field of neurological science, techniques
like Magnetic Resonance Imaging (MRI), functional MRI, and
electroencephalogram (EEG) have been developed to measure
neurological activities in three-dimensional brain regions.
These imaging data are often stored in the form of
tensors.
Compared to low-dimensional or low-order data, the distinct
characteristics of high-dimensional high-order data pose unprecedented challenges to the statistics community. For the
most part, classical methods and theory tailored to matrix
data may no longer apply to high-order data. While previous
studies have attempted to address this issue by transforming
high-order data into matrices or vectors through vectorization
or matricization, this paradigm often leads to loss of
intrinsic tensor structures, and as a result, suboptimal
outcomes in subsequent analyses. Another major challenge stems
from the computational side, as the high-dimensional
high-order structure introduces computational difficulties
unseen in the matrix counterpart. Many fundamental concepts
and methods developed for matrix data cannot be extended to
high-order data in a tractable manner; for instance, naive
extensions of concepts such as operator norm, singular values,
and eigenvalues all become NP-hard to compute.
From a methodology perspective, with the rapid expansion of tensor datasets, fundamental statistical analysis tools, such as dimension reduction, regression, classification, discriminant analysis, and clustering, face unique aspects and significant challenges compared to traditional statistics. These difficulties arise from both statistical and computational perspectives, giving rise to the ubiquitous phenomenon of computational-statistical tradeoffs. Given these challenges and the growing importance of tensor data analysis, we are offering the short course "Statistics Meets Tensors: Methods, Theory, and Applications" at the New England Statistics Symposium (NESS) 2025.
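To connect the outline below to something executable, here is a minimal base R sketch (simulated data; dimensions and ranks are arbitrary choices for illustration) of mode unfoldings and a truncated higher-order SVD (HOSVD), one building block behind the power iteration and higher-order orthogonal iteration methods listed in the outline.

    set.seed(6)
    d <- c(20, 25, 30)
    X <- array(rnorm(prod(d)), dim = d)              # an illustrative 3-way tensor

    # Mode-k unfolding: bring mode k to the front, then flatten column-major
    unfold <- function(A, mode) {
      perm <- c(mode, setdiff(seq_along(dim(A)), mode))
      matrix(aperm(A, perm), nrow = dim(A)[mode])
    }

    r <- c(3, 3, 3)                                  # target multilinear rank
    U <- lapply(1:3, function(k) svd(unfold(X, k))$u[, 1:r[k], drop = FALSE])

    # Core tensor: project each mode onto its leading singular vectors
    G <- X
    for (k in 1:3) {
      perm <- c(k, setdiff(1:3, k))
      Gk <- t(U[[k]]) %*% unfold(G, k)               # mode k now has size r[k]
      dims <- dim(G); dims[k] <- r[k]
      G <- aperm(array(Gk, dim = dims[perm]), order(perm))
    }
    dim(G)                                           # 3 x 3 x 3 core tensor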
Tentative Course Outline
Introduction: Background, applications of tensor methods to data in a tensor format, and applications of tensor methods to data in other formats.
Tensor algebra: Concepts (order, fibers, slices, norms, etc.), rank-one tensors, tensor decomposition, tensor rank, uniqueness of tensor decomposition (Kruskal condition).
Probabilistic models of tensors: Tensor regression, tensor SVD, tensor completion, high-order clustering.
Methods: Power iteration, higher-order orthogonal iteration, importance sketching.
Applications: Computational imaging, high-dimensional longitudinal data, multivariate relational data.
Theory: Information-theoretic limits, computational-statistical trade-offs.