 I combine massive datasets, distributed computing, and fundamental statistical principles to understand the world's true complexity.
 Methodological
 Principled statistical analysis of massive datasets
 Multiphase inference and preprocessing
 Parallel and distributed statistical algorithms
 Application Areas
 Systems biology
 Astronomy
 Information networks
Current
 Director, Data Science with Foresite Capital Management
 Staff data scientist at GRAIL
 Senior (data) scientist and tech lead for computational biology at Verily (formerly Google Life Sciences / Google [x])
 Statistician / data scientist at Google on the Ads Quality team
 PhD Student with Harvard University Department of Statistics
 Built theory of multiphase inference and preprocessing with Xiao-Li Meng
 Collaborated with the O'Shea lab at the FAS Center for Systems Biology on methods for estimating chromatin structure from high-throughput sequencing data
 Worked with the Time Series Center (part of the IIC) on computationally intensive time series analysis (focus on event detection).
 Worked with astrostatistics group (CHASC) on X-ray stacking for ChaMP.
 Teaching assistant with Harvard University Department of Statistics
 TF for Statistics 211 (Statistical Inference) with Joe Blitzstein and Carl Morris, Spring 2012
 TF for Statistics 244 (Linear and Generalized Linear Models) with Alan Agresti, Fall 2011
 Head TF for Statistics 111 (Introduction to Theoretical Statistics) with Edo Airoldi, Spring 2011
 TF for Statistics 221 (Statistical Computing and Learning) with Edo Airoldi, Fall 2010
 Head TF for Statistics 104 with Kenneth Stanley, Spring 2009
 TF for Statistics 104 with Kenneth Stanley, Fall 2008
 Selected publications
 Estimating latent processes on a network from indirect
measurements (2012).
Airoldi, E.A. & Blocker, A.W. To appear in Journal of the American Statistical Association.
 arXiv:1212.0178 [stat.ME]
 Semiparametric Robust Event Detection for Massive
Time-Domain Databases (2012).
Blocker, A.W. & Protopapas, P. Published in Statistical Challenges in Modern Astronomy V, Springer-Verlag, 177-189.
 Full version available at arXiv:1301.3027 [stat.AP]
 Deconvolution of mixing time series on a graph (2011).
Blocker, A. W. & Airoldi, E.A.  Published in Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011)
 arXiv:1105.2526 [stat.ME]
 Winner of IBM Thomas J. Watson Research Center Student Research Award, April 2011
 A Bayesian approach to the analysis of time symmetry in
light curves: Reconsidering Scorpius X-1 occultations
(2009).
Alexander W Blocker, Pavlos Protopapas, & Charles Alcock.  Published in The Astrophysical Journal (doi: 10.1088/0004-637X/701/2/1742)
 arXiv:0904.0645v1 [astro-ph.IM]
 Winner of NESS Student Paper Award (Sponsored by Microsoft and Google), April 2010
 Selected talks
 Preprocessing, Multiphase Inference, and Massive Data in Theory and Practice
 Analyses of massive datasets are often built upon
preprocessed data (e.g., microarray experiments) or constitute a
form of preprocessing themselves (e.g., dimensionality reduction
or feature extraction). Such preprocessing is often vital, but
it is rife with subtleties and pitfalls. When such steps are
taken, the data analysis effectively becomes a collaborative
endeavor by all parties involved in data collection,
preprocessing and curation, and downstream inference. Each party
does not and often cannot have a perfect understanding of the
entire phenomenon at hand; the final results will inevitably
contain some combination of their judgments, and some
preprocessing can irreversibly destroy information from the raw
data. Furthermore, even if each party has done their absolute
best given the information and resources available to them, the
final result may still fall short of the best possible when it
is evaluated in the traditional single-phase framework due to
the problem of uncongeniality (Meng, 1994, Statistical Science).
We therefore need a core of statistical theory for such
multiphase inference problems. This talk presents some building
blocks for a multiphase theoretical framework, illustrated by
some applied examples. Our work highlights the importance of
providing information beyond optimal estimators for downstream
analyses; however, such information need not correspond to
sufficient statistics, even in theory.
Presented July 13, 2012 at MMDS 2012
 The Potential and Perils of Preprocessing: A Multiphase Investigation
 Preprocessing forms an oft-neglected foundation for a wide
range of statistical analyses. However, it is rife with
subtleties and pitfalls. Decisions made in preprocessing
constrain all later analyses and are typically irreversible.
Hence, data analysis becomes a collaborative endeavor by all
parties involved in data collection, preprocessing and
curation, and downstream inference. Even if each party has done
its best given the information and resources available to it,
the final result may still fall short of the best possible when
evaluated in the traditional single-phase inference framework.
This is particularly relevant as we enter the era of "big
data". The technologies driving this data explosion are subject
to complex new forms of measurement error. Simultaneously, we
are accumulating increasingly massive databases of scientific
analyses. As a result, preprocessing has become more vital (and
potentially more dangerous) than ever before.
In this talk, we propose a theoretical framework for the
analysis of preprocessing under the banner of multiphase
inference. We provide some initial theoretical foundations for
this area, building upon previous work in multiple imputation.
We motivate this foundation with two problems from biology and
astrophysics, illustrating multiphase pitfalls and potential
solutions. These examples also serve to emphasize the practical
motivations behind multiphase analyses, both technical and
statistical. This work suggests several rich directions for
further research into the statistical principles underlying
preprocessing.
Presented May 11, 2012 upon receipt of the inaugural Arthur P. Dempster Award. Covered in the Harvard Gazette. Further information is available from the Harvard Statistics Department.
 Semiparametric Robust Event Detection for Massive Time-Series Datasets

The detection and analysis of events within massive collections
of time series has become an extremely important task for
time-domain astronomy. In particular, many scientific
investigations (e.g. the analysis of microlensing and other
transients) begin with the detection of isolated events in
irregularly sampled series with both nonlinear trends and
non-Gaussian noise. I will discuss a semiparametric, robust,
parallel method for identifying variability and isolated events
at multiple scales in the presence of the above complications.
This approach harnesses the power of Bayesian modeling while
maintaining much of the speed and scalability of more ad-hoc
machine learning approaches. I will also contrast this work
with event detection methods from other fields, highlighting
the unique challenges posed by astronomical surveys. Finally, I
will present initial results from the application of this
method to 87.2 million EROS sources, where we have obtained a
greater than 100-fold reduction in candidates for certain types
of phenomena.
Presented August 25, 2010 at the Workshop on Computational Astrostatistics (http://hea-www.harvard.edu/AstroStat/CAS2010/)
 Discussion: The Promise & Peril of Synthetic & Integrated Data

A discussion of the potential gains and problems with
integrated, synthetic data of the type produced by Longitudinal
Employer-Household Dynamics at the US Census Bureau.
Presented June 21, 2010 at the ICSA 2010 Applied Statistics Symposium as part of the "Informative But Not Invasive" panel, organized by Xiao-Li Meng.
Followed presentations by Jeremy Wu (US Census Bureau), John Abowd (Cornell University), and Jerry Reiter (Duke University).
 Doing Right By Massive Data: Using Probability Modeling To Advance The Analysis Of Huge Astronomical Datasets
 The analysis of extremely large, complex datasets is
becoming an increasingly important task in the analysis of
scientific data. This trend is especially prevalent in
astronomy, as large-scale surveys such as SDSS, Pan-STARRS, and
the LSST deliver (or promise to deliver) terabytes of data per
night. While both the statistics and machine-learning
communities have offered approaches to these problems, neither
has produced a completely satisfactory approach. Working in the
context of event detection for the MACHO LMC data, I will
present an approach that combines much of the power of Bayesian
probability modeling with the efficiency and scalability
typically associated with more ad-hoc machine learning
approaches. This provides both rigorous assessments of
uncertainty and improved statistical efficiency on a dataset
containing approximately 20 million sources and 40 million
individual time series. I will also discuss how this framework
could be extended to related problems.
Presented at NESS 2010
 Two Problems in X-ray Astronomy
 Discussion of my work on two projects in X-ray astronomy: the development of a hierarchical Bayesian replacement for "stacking" and the analysis of events in X-ray light curves. For each problem, I outlined the development of an improved model for the data and the computational methods employed. I also discussed the unique challenges that each case has presented from a cultural perspective.
 Selected posters
 Deconvolution of Mixing Time Series on a Graph (UAI 2011)
 In many applications we are interested in making inference on latent time series from indirect measurements, which are often low-dimensional projections resulting from mixing or aggregation. Positron emission tomography, super-resolution, and network traffic monitoring are some examples. Inference in such settings requires solving a sequence of ill-posed inverse problems, y_t = A x_t, where the projection mechanism provides information on A. We consider problems in which A specifies mixing on a graph of time series that are bursty and sparse. We develop a multilevel state-space model for mixing time series and an efficient approach to inference. A simple model is used to calibrate regularization parameters that lead to efficient inference in the multilevel state-space model. We apply this method to the problem of estimating point-to-point traffic flows on a network from aggregate measurements. Our solution outperforms existing methods for this problem, and our two-stage approach suggests an efficient inference strategy for multilevel models of multivariate time series.
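 As a rough illustration of why a single problem y_t = A x_t is ill-posed when A aggregates more flows than there are measurements, the minimal Python sketch below recovers one time slice with plain ridge (Tikhonov) regularization. This is not the multilevel state-space method described in the poster. The routing matrix A, the measurements y, the penalty lam, and all function names are hypothetical and chosen only for illustration.

```python
def solve_linear(M, b):
    """Solve M z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(M, b)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    z = [0.0] * n
    for r in reversed(range(n)):
        s = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - s) / aug[r][r]
    return z


def matvec(A, x):
    """Return the matrix-vector product A x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def ridge_estimate(A, y, lam):
    """Minimize ||y - A x||^2 + lam * ||x||^2.

    Uses the identity x = A^T (A A^T + lam I)^{-1} y, which only needs a
    solve in the (small) measurement dimension.
    """
    m, n = len(A), len(A[0])
    gram = [[sum(a * b for a, b in zip(A[i], A[j])) + (lam if i == j else 0.0)
             for j in range(m)] for i in range(m)]
    z = solve_linear(gram, y)
    return [sum(A[i][j] * z[i] for i in range(m)) for j in range(n)]


# Hypothetical example: two aggregate link loads, three unobserved flows.
A = [[1.0, 1.0, 0.0],
     [0.0, 1.0, 1.0]]
y = [5.0, 7.0]

x_hat = ridge_estimate(A, y, lam=1e-8)  # close to the minimum-norm solution
x_alt = [5.0, 0.0, 7.0]                 # a different x that fits y exactly
```

 Both x_hat and x_alt reproduce the measurements, so the data alone cannot distinguish them; the regularizer (or, in the poster's setting, the model for the latent series) is what selects among the candidates.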
 Software
 golm: Linear models in Go using a cgo to BLAS/LAPACK interface.
 fastGHQuad: Fast, numerically stable evaluation of Gauss-Hermite quadrature rules in R/C++
 Available via CRAN
 networkTomography: A wide range of network tomography algorithms with several commonly used datasets, including all methods from Blocker and Airoldi (2011) and Airoldi and Blocker (2012).
 Available via CRAN
 bayesstack: Bayesian X-ray stacking analysis
 Part of the ChaMP software packages for the analysis of multiwavelength surveys
 Kalman tools for Matlab
 Kalman filter & smoother
 Allow for control inputs in state equation & affine term in measurement equation
 Maximum likelihood estimation of linear state-space systems
 Implementation of the expectation maximization algorithm
 Can estimate input matrix and/or affine term in measurement equation
 Optional diagonal restrictions on state & observation noise covariance matrices
 12/06/2007: Updated with moderate efficiency improvements for M-step routines & major change in EM convergence criterion (relative instead of absolute change)
 12/13/2007: Significant efficiency improvements and further tweaking of EM convergence criterion
 Licensed under LGPL v3.0
 A technical note on the EM algorithm for affine state-space systems & its usage
 Some useful scripts for R
 bagginglm.R: The beginning of a set of functions for bagging LMs and GLMs. Very preliminary. Licensed under GPL v2.0
 AICc.R: A function to calculate corrected AIC (AIC with an adjustment term for small-sample bias). This is written in the same way as the base AIC function, and will work for any model with a logLik method.
 split.data.R: A simple function to break apart a data frame or multivariate time series; it is particularly useful for dealing with the latter. Includes an option to omit missing values while splitting.
 exif2kmz: a Python script to convert geotagged images to a KMZ file
 Requires pyexiv2 and Python Imaging libraries.
 Creates a KMZ file with a placemark for each image and the images themselves.
 Licensed under the MIT license
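 For reference, the adjustment that a corrected AIC such as the one in AICc.R above applies is commonly written as AICc = AIC + 2k(k+1)/(n-k-1), with k estimated parameters and n observations. The minimal Python sketch below computes that standard form from a log-likelihood. It is not the R script itself, and the function name and arguments are hypothetical.

```python
def aicc(log_lik, k, n):
    """Corrected AIC from a maximized log-likelihood.

    log_lik: maximized log-likelihood of the fitted model
    k:       number of estimated parameters
    n:       number of observations
    """
    if n - k - 1 <= 0:
        # The correction term is undefined when n <= k + 1.
        raise ValueError("AICc requires n > k + 1")
    aic = -2.0 * log_lik + 2.0 * k
    correction = (2.0 * k * (k + 1)) / (n - k - 1)
    return aic + correction


# Hypothetical numbers: log-likelihood -10 with 2 parameters.
small_sample = aicc(-10.0, k=2, n=10)      # correction term is 12/7
large_sample = aicc(-10.0, k=2, n=10_000)  # correction is nearly zero
```

 The correction shrinks toward zero as n grows, so AICc and AIC agree for large samples and differ mainly when n is small relative to k.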
 Awards and Honors
 Arthur P. Dempster Award (inaugural) for "Preprocessing and Multiphase Inference in Theory and Practice"; awarded May 2012
 NESS Best Paper Award for "Template-based methods for analyzing chromatin structure dynamics genome-wide" with Edo Airoldi; awarded at NESS 2012 in April 2012
 IBM Thomas J. Watson Research Center Student Research Award for "Deconvolution of mixing time series on a graph" with Edo Airoldi; awarded at NESS 2011 in April 2011
 NESS Student Paper Award (Sponsored by Microsoft and Google) for "A Bayesian Approach to the Analysis of Time Symmetry in Light Curves: Reconsidering Scorpius X-1 Occultations" with Pavlos Protopapas; awarded April 2010
 Harvard University Certificate of Distinction in Teaching for Statistics 221, Fall 2010; awarded April 2011
 Pierce Fellowship Recipient, Harvard University Graduate School of Arts & Sciences, September 2009
 Phi Beta Kappa initiate, Epsilon Chapter of MA, 2008
 Background
 PhD in Statistics from Harvard University, May 2013
 Advised by Xiao-Li Meng and Edo Airoldi
 Dissertation title: "Distributed and Multiphase Inference in Theory and Practice: Principles, Modeling, and Computation for High-Throughput Science"
 Boston University Alumnus, Class of 2008
 Bachelor's in Mathematics & Economics
 Master's in Economics
 PhD-level coursework in statistics & econometrics
 Formerly:
 Intern with Weiss Asset Management (June 2008 - June 2009)
 Research Assistant with Boston University Department of Economics
 Intern with UBS Fixed Income Research
 US Rates & Govt. Bonds Group
 Senior Research & IT Advisor, Matté & Company
 Research Assistant, Boston University School of Management