- I combine massive datasets, distributed computing, and fundamental statistical principles to understand the world's true complexity.
- Principled statistical analysis of massive datasets
- Multiphase inference and preprocessing
- Parallel and distributed statistical algorithms
- Application Areas
- Systems biology
- Information networks
- Xiao-Li Meng, Harvard University Department of Statistics
- Edo Airoldi, Harvard University Department of Statistics
- Selected publications
- Estimating latent processes on a network from indirect measurements
Airoldi, E.A. & Blocker, A.W.
- To appear in Journal of the American Statistical Association.
- arXiv:1212.0178 [stat.ME]
- Semi-parametric Robust Event Detection for Massive
Time-Domain Databases (2012).
Blocker, A.W. & Protopapas, P.
- Published in Statistical Challenges in Modern Astronomy V, Springer-Verlag, 177-189.
- Full version available at arXiv:1301.3027 [stat.AP]
- Deconvolution of mixing time series on a graph (2011).
Blocker, A. W. & Airoldi, E.A.
- Published in Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011)
- arXiv:1105.2526 [stat.ME]
- Winner of IBM Thomas J. Watson Research Center Student Research Award, April 2011
- A Bayesian approach to the analysis of time symmetry in
light curves: Reconsidering Scorpius X-1 occultations
Blocker, A.W., Protopapas, P., & Alcock, C.
- Selected talks
- Preprocessing, Multiphase Inference, and Massive Data in Theory and Practice
- Analyses of massive datasets are often built upon
preprocessed data (e.g., microarray experiments) or constitute a
form of preprocessing themselves (e.g., dimensionality reduction
or feature extraction). Such preprocessing is often vital, but
it is rife with subtleties and pitfalls. When such steps are
taken, the data analysis effectively becomes a collaborative
endeavor by all parties involved in data collection,
preprocessing and curation, and downstream inference. Each party
does not and often cannot have a perfect understanding of the
entire phenomenon at hand; the final results will inevitably
contain some combination of their judgments, and some
preprocessing can irreversibly destroy information from the raw
data. Furthermore, even if each party has done their absolute
best given the information and resources available to them, the
final result may still fall short of the best possible when it
is evaluated in the traditional single-phase framework due to
the problem of uncongeniality (Meng, 1994, Statistical Science).
We therefore need a core of statistical theory for such
multiphase inference problems. This talk presents some building
blocks for a multiphase theoretical framework, illustrated by
some applied examples. Our work highlights the importance of
providing information beyond optimal estimators for downstream
analyses; however, such information need not correspond to
sufficient statistics, even in theory.
Presented July 13, 2012 at MMDS 2012
- The Potential and Perils of Preprocessing: A Multiphase Investigation
- Preprocessing forms an oft-neglected foundation for a wide
range of statistical analyses. However, it is rife with
subtleties and pitfalls. Decisions made in preprocessing
constrain all later analyses and are typically irreversible.
Hence, data analysis becomes a collaborative endeavor by all
parties involved in data collection, preprocessing and
curation, and downstream inference. Even if each party has done
its best given the information and resources available to them,
the final result may still fall short of the best possible when
evaluated in the traditional single-phase inference framework.
This is particularly relevant as we enter the era of "big
data". The technologies driving this data explosion are subject
to complex new forms of measurement error. Simultaneously, we
are accumulating increasingly massive databases of scientific
analyses. As a result, preprocessing has become more vital (and
potentially more dangerous) than ever before.
In this talk, we propose a theoretical framework for the
analysis of preprocessing under the banner of multiphase
inference. We provide some initial theoretical foundations for
this area, building upon previous work in multiple imputation.
We motivate this foundation with two problems from biology and
astrophysics, illustrating multiphase pitfalls and potential
solutions. These examples also serve to emphasize the practical
motivations behind multiphase analyses --- both technical and
statistical. This work suggests several rich directions for
further research into the statistical principles underlying preprocessing.
Presented May 11, 2012 upon receipt of the inaugural Arthur P. Dempster Award. Covered in the Harvard Gazette. Further information is available from the Harvard Statistics Department.
- Semi-parametric Robust Event Detection for Massive Time-Series Datasets
The detection and analysis of events within massive collections
of time-series has become an extremely important task for
time-domain astronomy. In particular, many scientific
investigations (e.g. the analysis of microlensing and other
transients) begin with the detection of isolated events in
irregularly-sampled series with both non-linear trends and
non-Gaussian noise. I will discuss a semi-parametric, robust,
parallel method for identifying variability and isolated events
at multiple scales in the presence of the above complications.
This approach harnesses the power of Bayesian modeling while
maintaining much of the speed and scalability of more ad-hoc
machine learning approaches. I will also contrast this work
with event detection methods from other fields, highlighting
the unique challenges posed by astronomical surveys. Finally, I
will present initial results from the application of this
method to 87.2 million EROS sources, where we have obtained a
greater than 100-fold reduction in candidates for certain types of events.
Presented August 25, 2010 at the Workshop on Computational Astrostatistics (http://hea-www.harvard.edu/AstroStat/CAS2010/)
- Discussion: The Promise & Peril of Synthetic & Integrated Data
A discussion of the potential gains and problems with
integrated, synthetic data of the type produced by Longitudinal
Employer-Household Dynamics at the US Census Bureau.
Presented June 21, 2010 at the ICSA 2010 Applied Statistics Symposium as part of the "Informative But Not Invasive" panel, organized by Xiao-Li Meng.
Followed presentations by Jeremy Wu (US Census Bureau), John Abowd (Cornell University), and Jerry Reiter (Duke University).
- Doing Right By Massive Data: Using Probability Modeling To Advance The Analysis Of Huge Astronomical Datasets
- The analysis of extremely large, complex datasets is
becoming an increasingly important task across the sciences.
This trend is especially prevalent in
astronomy, as large-scale surveys such as SDSS, Pan-STARRS, and
the LSST deliver (or promise to deliver) terabytes of data per
night. While both the statistics and machine-learning
communities have offered approaches to these problems, neither
has produced a completely satisfactory approach. Working in the
context of event detection for the MACHO LMC data, I will
present an approach that combines much of the power of Bayesian
probability modeling with the efficiency and scalability
typically associated with more ad-hoc machine learning
approaches. This provides both rigorous assessments of
uncertainty and improved statistical efficiency on a dataset
containing approximately 20 million sources and 40 million
individual time series. I will also discuss how this framework
could be extended to related problems.
Presented at NESS 2010
- Two Problems in X-ray Astronomy
- Discussion of my work on two projects in X-ray astronomy: the development of a hierarchical Bayesian replacement for "stacking" and the analysis of events in X-ray light curves. For each problem, I outlined the development of an improved model for the data and the computational methods employed. I also discussed the unique challenges that each case has presented from a cultural perspective.
- Selected posters
- Deconvolution of Mixing Time Series on a Graph (UAI 2011)
- In many applications we are interested in making inference on latent time series from indirect measurements, which are often low-dimensional projections resulting from mixing or aggregation. Positron emission tomography, super-resolution, and network traffic monitoring are some examples. Inference in such settings requires solving a sequence of ill-posed inverse problems, y_t = A x_t, where the projection mechanism provides information on A. We consider problems in which A specifies mixing on a graph of time series that are bursty and sparse. We develop a multilevel state-space model for mixing time series and an efficient approach to inference. A simple model is used to calibrate regularization parameters that lead to efficient inference in the multilevel state-space model. We apply this method to the problem of estimating point-to-point traffic flows on a network from aggregate measurements. Our solution outperforms existing methods for this problem, and our two-stage approach suggests an efficient inference strategy for multilevel models of multivariate time series.
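The inverse problem above can be made concrete with a toy example. The sketch below (Python, purely illustrative; it is not the paper's multilevel state-space method, and the tiny routing matrix is hypothetical) solves a single step y_t = A x_t by non-negative least squares, as in network traffic tomography where A is a known routing matrix:

```python
import numpy as np
from scipy.optimize import nnls

# Hypothetical routing matrix: 2 link measurements, 3 origin-destination flows.
# Each row records which flows traverse a given link.
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])

x_true = np.array([2.0, 0.0, 3.0])  # sparse, non-negative latent flows
y = A @ x_true                      # aggregate link loads, y_t = A x_t

# One ill-posed inverse step: fewer measurements (2) than unknowns (3).
# Non-negativity acts as a weak regularizer; the full method adds a
# temporal prior across steps to resolve the remaining ambiguity.
x_hat, resid = nnls(A, y)
print(x_hat, resid)
```

Because the system is underdetermined, x_hat reproduces y exactly but need not equal x_true; the point of the multilevel model is precisely the extra structure that picks out the right flows.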
- fastGHQuad: Fast, numerically stable evaluation of Gauss-Hermite quadrature rules in R/C++
- Available via CRAN
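As an illustration of what such rules compute (shown in Python via NumPy rather than the package's R/C++ API): an n-node Gauss-Hermite rule integrates f(x) exp(-x^2) exactly for polynomials up to degree 2n-1, which yields Gaussian expectations after a change of variables.

```python
import numpy as np

# Gauss-Hermite rule: nodes/weights for integrals of f(x) * exp(-x^2).
nodes, weights = np.polynomial.hermite.hermgauss(5)

# E[f(Z)] for Z ~ N(0, 1) via the change of variables z = sqrt(2) * x:
# E[f(Z)] = (1/sqrt(pi)) * sum_i w_i f(sqrt(2) * x_i)
def gh_expectation(f):
    return np.sum(weights * f(np.sqrt(2.0) * nodes)) / np.sqrt(np.pi)

# With 5 nodes the rule is exact for polynomials up to degree 9,
# so moments such as E[Z^2] = 1 are recovered to machine precision.
print(gh_expectation(lambda z: z ** 2))
```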
- networkTomography: A wide range of network tomography algorithms with several commonly-used datasets, including all methods from Blocker and Airoldi (2011) and Airoldi and Blocker (2012).
- Available via CRAN
- bayesstack: Bayesian x-ray stacking analysis
- Part of the ChaMP software packages for the analysis of multiwavelength surveys
- Kalman tools for Matlab
- Kalman filter & smoother
- Allows for control inputs in the state equation & an affine term in the measurement equation
- Maximum likelihood estimation of linear state-space systems
- Implementation of the expectation maximization algorithm
- Can estimate input matrix and/or affine term in measurement equation
- Optional diagonal restrictions on state & observation noise covariance matrices
- 12/06/2007: Updated with moderate efficiency improvements for M-step routines & major change in EM convergence criterion (relative instead of absolute change)
- 12/13/2007: Significant efficiency improvements and further tweaking of EM convergence criterion
- Licensed under LGPL v3.0
- A technical note on the EM algorithm for affine state-space systems & its usage
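For readers unfamiliar with the affine extensions above, here is a minimal one-step sketch in Python (the Matlab toolbox itself is more general; the model and dimensions here are hypothetical) for x_{t+1} = F x_t + B u_t + w_t, y_t = H x_t + d + v_t, where d is the affine term in the measurement equation:

```python
import numpy as np

def kalman_step(x, P, y, u, F, B, H, d, Q, R):
    """One predict/update cycle for an affine state-space model."""
    # Predict: x_{t+1|t} = F x_t + B u_t
    x_pred = F @ x + B @ u
    P_pred = F @ P @ F.T + Q
    # Update with affine measurement y_t = H x_t + d + v_t
    innov = y - (H @ x_pred + d)
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ innov
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Hypothetical scalar example: static true state 5.0, measurement offset d = 1.
F = np.eye(1); B = np.zeros((1, 1)); H = np.eye(1); d = np.array([1.0])
Q = np.eye(1) * 1e-6; R = np.eye(1) * 0.5
x, P = np.zeros(1), np.eye(1) * 10.0
for _ in range(50):
    y = np.array([6.0])  # noise-free readings: true state plus offset
    x, P = kalman_step(x, P, y, np.zeros(1), F, B, H, d, Q, R)
print(x)  # estimate approaches 5.0
```

The EM routines in the toolbox estimate F, H, Q, R (and optionally B and d) by alternating this smoother with closed-form M-steps.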
- Some useful scripts for R
- bagginglm.R: The beginning of a set of functions for bagging LMs and GLMs. Very preliminary. Licensed under GPL v2.0
- AICc.R: A function to calculate corrected AIC (AIC with an adjustment term for small-sample bias). This is written in the same way as the base AIC function, and will work for any model with a logLik method.
- split.data.R: A simple function to break apart a data frame or multivariate time series; it is particularly useful for dealing with the latter. Includes an option to omit missing values while splitting.
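The small-sample correction that AICc.R computes is the standard one, AICc = AIC + 2k(k+1)/(n-k-1) for k parameters and n observations. A Python equivalent (function name hypothetical):

```python
def aicc(loglik, k, n):
    """Corrected AIC: AIC plus the small-sample bias adjustment."""
    aic = -2.0 * loglik + 2.0 * k
    return aic + 2.0 * k * (k + 1) / (n - k - 1)

# Example: log-likelihood -100 with k = 3 parameters, n = 30 observations.
print(aicc(-100.0, 3, 30))  # 206 + 24/26 ≈ 206.923
```

The correction term vanishes as n grows, so AICc agrees with AIC for large samples while penalizing heavily parameterized models more when n is small.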
- exif2kmz: a Python script to convert geotagged images to a KMZ file
- Requires pyexiv2 and Python Imaging libraries.
- Creates a KMZ file with a placemark for each image and the images themselves.
- Licensed under the MIT license
- Current Affiliations
- PhD Student with Harvard University Department of Statistics
- Building theory of multiphase inference and preprocessing with Xiao-Li Meng
- Collaborating with the O'Shea lab at the FAS Center for Systems Biology on methods for estimating chromatin structure from high-throughput sequencing data
- Working with the Time Series Center (part of the IIC) on computationally-intensive time series analysis (focus on event detection).
- Currently working with astrostatistics group (CHASC) on X-ray stacking for ChaMP.
- Teaching assistant with Harvard University Department of Statistics
- TF for Statistics 211 (Statistical Inference) with Joe Blitzstein and Carl Morris, Spring 2012
- TF for Statistics 244 (Linear and Generalized Linear Models) with Alan Agresti, Fall 2011
- Head TF for Statistics 111 (Introduction to Theoretical Statistics) with Edo Airoldi, Spring 2011
- TF for Statistics 221 (Statistical Computing and Learning) with Edo Airoldi, Fall 2010
- Head TF for Statistics 104 with Kenneth Stanley, Spring 2009
- TF for Statistics 104 with Kenneth Stanley, Fall 2008
- Awards and Honors
- Arthur P. Dempster Award (inaugural) for "Preprocessing and Multiphase Inference in Theory and Practice"; awarded May 2012
- NESS Best Paper Award for "Template-based methods for analyzing chromatin structure dynamics genome-wide" with Edo Airoldi; awarded at NESS 2012 in April 2012
- IBM Thomas J. Watson Research Center Student Research Award for “Deconvolution of mixing time series on a graph” with Edo Airoldi; awarded at NESS 2011 in April 2011
- NESS Student Paper Award (Sponsored by Microsoft and Google) for “A Bayesian Approach to the Analysis of Time Symmetry in Light Curves: Reconsidering Scorpius X-1 Occultations” with Pavlos Protopapas; awarded April 2010
- Harvard University Certificate of Distinction in Teaching for Statistics 221, Fall 2010; awarded April 2011
- Pierce Fellowship Recipient, Harvard University Graduate School of Arts & Sciences, September 2009
- Phi Beta Kappa initiate, Epsilon Chapter of MA, 2008
- Boston University Alumnus, Class of 2008
- Bachelors in Mathematics & Economics
- Masters in Economics
- PhD-level coursework in statistics & econometrics
- Intern with Weiss Asset Management (June 2008 - June 2009)
- Research Assistant with Boston University Department of Economics
- Intern with UBS Fixed Income Research
- US Rates & Govt. Bonds Group
- Senior Research & IT Advisor, Matté &
- Research Assistant, Boston University School of Management