Alexander W Blocker


Blog (Randomized Blocker)
Research statement
I combine massive datasets, distributed computing, and fundamental statistical principles to understand the world's true complexity.
Research interests
  • Methodological
    • Principled statistical analysis of massive datasets
    • Multiphase inference and preprocessing
    • Parallel and distributed statistical algorithms
  • Application Areas
    • Systems biology
    • Astronomy
    • Information networks
Affiliations
Current
  • Statistician at Google on the Ads Quality team
Former
  • PhD Student with Harvard University Department of Statistics
    • Built theory of multiphase inference and preprocessing with Xiao-Li Meng
    • Collaborated with the O'Shea lab at the FAS Center for Systems Biology on methods for estimating chromatin structure from high-throughput sequencing data
    • Worked with the Time Series Center (part of the IIC) on computationally intensive time series analysis (focus on event detection)
    • Worked with astrostatistics group (CHASC) on X-ray stacking for ChaMP
  • Teaching assistant with Harvard University Department of Statistics
    • TF for Statistics 211 (Statistical Inference) with Joe Blitzstein and Carl Morris, Spring 2012
    • TF for Statistics 244 (Linear and Generalized Linear Models) with Alan Agresti, Fall 2011
    • Head TF for Statistics 111 (Introduction to Theoretical Statistics) with Edo Airoldi, Spring 2011
    • TF for Statistics 221 (Statistical Computing and Learning) with Edo Airoldi, Fall 2010
    • Head TF for Statistics 104 with Kenneth Stanley, Spring 2009
    • TF for Statistics 104 with Kenneth Stanley, Fall 2008
Selected publications
Selected talks
  • Preprocessing, Multiphase Inference, and Massive Data in Theory and Practice
  • Analyses of massive datasets are often built upon preprocessed data (e.g., microarray experiments) or constitute a form of preprocessing themselves (e.g., dimensionality reduction or feature extraction). Such preprocessing is often vital, but it is rife with subtleties and pitfalls. When such steps are taken, the data analysis effectively becomes a collaborative endeavor by all parties involved in data collection, preprocessing and curation, and downstream inference. Each party does not and often cannot have a perfect understanding of the entire phenomenon at hand; the final results will inevitably contain some combination of their judgments, and some preprocessing can irreversibly destroy information from the raw data. Furthermore, even if each party has done their absolute best given the information and resources available to them, the final result may still fall short of the best possible when it is evaluated in the traditional single-phase framework due to the problem of uncongeniality (Meng, 1994, Statistical Science). We therefore need a core of statistical theory for such multiphase inference problems. This talk presents some building blocks for a multiphase theoretical framework, illustrated by some applied examples. Our work highlights the importance of providing information beyond optimal estimators for downstream analyses; however, such information need not correspond to sufficient statistics, even in theory.

    Presented July 13, 2012 at MMDS 2012
  • The Potential and Perils of Preprocessing: A Multiphase Investigation
  • Preprocessing forms an oft-neglected foundation for a wide range of statistical analyses. However, it is rife with subtleties and pitfalls. Decisions made in preprocessing constrain all later analyses and are typically irreversible. Hence, data analysis becomes a collaborative endeavor by all parties involved in data collection, preprocessing and curation, and downstream inference. Even if each party has done its best given the information and resources available to them, the final result may still fall short of the best possible when evaluated in the traditional single-phase inference framework. This is particularly relevant as we enter the era of "big data". The technologies driving this data explosion are subject to complex new forms of measurement error. Simultaneously, we are accumulating increasingly massive databases of scientific analyses. As a result, preprocessing has become more vital (and potentially more dangerous) than ever before. In this talk, we propose a theoretical framework for the analysis of preprocessing under the banner of multiphase inference. We provide some initial theoretical foundations for this area, building upon previous work in multiple imputation. We motivate this foundation with two problems from biology and astrophysics, illustrating multiphase pitfalls and potential solutions. These examples also serve to emphasize the practical motivations behind multiphase analyses --- both technical and statistical. This work suggests several rich directions for further research into the statistical principles underlying preprocessing.

    Presented May 11, 2012 upon receipt of the inaugural Arthur P. Dempster Award. Covered in the Harvard Gazette. Further information is available from the Harvard Statistics Department.
  • Semi-parametric Robust Event Detection for Massive Time-Series Datasets
  • The detection and analysis of events within massive collections of time-series has become an extremely important task for time-domain astronomy. In particular, many scientific investigations (e.g. the analysis of microlensing and other transients) begin with the detection of isolated events in irregularly-sampled series with both non-linear trends and non-Gaussian noise. I will discuss a semi-parametric, robust, parallel method for identifying variability and isolated events at multiple scales in the presence of the above complications. This approach harnesses the power of Bayesian modeling while maintaining much of the speed and scalability of more ad-hoc machine learning approaches. I will also contrast this work with event detection methods from other fields, highlighting the unique challenges posed by astronomical surveys. Finally, I will present initial results from the application of this method to 87.2 million EROS sources, where we have obtained a greater than 100-fold reduction in candidates for certain types of phenomena.

    Presented August 25, 2010 at the Workshop on Computational Astrostatistics (http://hea-www.harvard.edu/AstroStat/CAS2010/)
  • Discussion: The Promise & Peril of Synthetic & Integrated Data
  • A discussion of the potential gains and problems with integrated, synthetic data of the type produced by Longitudinal Employer-Household Dynamics at the US Census Bureau.

    Presented June 21, 2010 at the ICSA 2010 Applied Statistics Symposium as part of the "Informative But Not Invasive" panel, organized by Xiao-Li Meng.
    Followed presentations by Jeremy Wu (US Census Bureau), John Abowd (Cornell University), and Jerry Reiter (Duke University).
  • Doing Right By Massive Data: Using Probability Modeling To Advance The Analysis Of Huge Astronomical Datasets
  • The analysis of extremely large, complex datasets is becoming an increasingly important task in the analysis of scientific data. This trend is especially prevalent in astronomy, as large-scale surveys such as SDSS, Pan-STARRS, and the LSST deliver (or promise to deliver) terabytes of data per night. While both the statistics and machine-learning communities have offered approaches to these problems, neither has produced a completely satisfactory approach. Working in the context of event detection for the MACHO LMC data, I will present an approach that combines much of the power of Bayesian probability modeling with the efficiency and scalability typically associated with more ad-hoc machine learning approaches. This provides both rigorous assessments of uncertainty and improved statistical efficiency on a dataset containing approximately 20 million sources and 40 million individual time series. I will also discuss how this framework could be extended to related problems.

    Presented at NESS 2010
  • Two Problems in X-ray Astronomy
  • Discussion of my work on two projects in X-ray astronomy: the development of a hierarchical Bayesian replacement for "stacking" and the analysis of events in X-ray light curves. For each problem, I outlined the development of an improved model for the data and the computational methods employed. I also discussed the unique challenges that each case has presented from a cultural perspective.
Selected posters
  • Deconvolution of Mixing Time Series on a Graph (UAI 2011)
  • In many applications we are interested in making inference on latent time series from indirect measurements, which are often low-dimensional projections resulting from mixing or aggregation. Positron emission tomography, super-resolution, and network traffic monitoring are some examples. Inference in such settings requires solving a sequence of ill-posed inverse problems, y_t = A x_t, where the projection mechanism provides information on A. We consider problems in which A specifies mixing on a graph of time series that are bursty and sparse. We develop a multilevel state-space model for mixing time series and an efficient approach to inference. A simple model is used to calibrate regularization parameters that lead to efficient inference in the multilevel state-space model. We apply this method to the problem of estimating point-to-point traffic flows on a network from aggregate measurements. Our solution outperforms existing methods for this problem, and our two-stage approach suggests an efficient inference strategy for multilevel models of multivariate time series.
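To make the phrase "ill-posed inverse problem" in the abstract above concrete, here is a minimal sketch in Python. It is not the method from the poster: the matrix A, the flow values, and all variable names are hypothetical toy choices. It only shows why y_t = A x_t cannot be inverted directly when there are fewer aggregate measurements than latent flows, which is the reason a model and regularization are needed.

```python
import numpy as np

# Hypothetical toy routing matrix: 3 observed aggregate loads (rows)
# are sums of 4 unobserved point-to-point flows (columns).
A = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0, 0.0],
])

# Hypothetical sparse "true" flows at one time point, and what we observe.
x_true = np.array([5.0, 0.0, 0.0, 2.0])
y = A @ x_true

# A has rank 3 but 4 columns, so one direction of x is invisible in y.
null_dir = np.array([1.0, -1.0, -1.0, 1.0])
x_alt = x_true + 0.5 * null_dir  # a different x with the same measurements

# The minimum-norm least-squares solution fits y exactly,
# yet it is not the sparse vector that generated the data.
x_min_norm = np.linalg.pinv(A) @ y

print("rank of A:", np.linalg.matrix_rank(A))
print("A @ x_alt equals y:", np.allclose(A @ x_alt, y))
print("min-norm solution:", np.round(x_min_norm, 3))
```

Every vector of the form x_true + c * null_dir reproduces the same measurements, so extra structure (here, the sparsity and burstiness the abstract mentions) has to choose among them.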
Software
  • go-lm: Linear models in Go using a cgo interface to BLAS/LAPACK
  • fastGHQuad: Fast, numerically-stable evaluation of Gauss-Hermite quadrature rules in R/C++
  • networkTomography: A wide range of network tomography algorithms with several commonly-used datasets, including all methods from Blocker and Airoldi (2011) and Airoldi and Blocker (2012)
  • bayesstack: Bayesian x-ray stacking analysis
  • Kalman tools for Matlab
    • Kalman filter & smoother
      • Allow for control inputs in state equation & affine term in measurement equation
    • Maximum likelihood estimation of linear state-space systems
      • Implementation of the expectation maximization algorithm
      • Can estimate input matrix and/or affine term in measurement equation
      • Optional diagonal restrictions on state & observation noise covariance matrices
    • 12/06/2007: Updated with moderate efficiency improvements for M-step routines & major change in EM convergence criterion (relative instead of absolute change)
    • 12/13/2007: Significant efficiency improvements and further tweaking of EM convergence criterion
    • Licensed under LGPL v3.0
  • A technical note on the EM algorithm for affine state-space systems & its usage
  • Some useful scripts for R
    • bagginglm.R: The beginning of a set of functions for bagging LMs and GLMs. Very preliminary. Licensed under GPL v2.0
    • AICc.R: A function to calculate corrected AIC (AIC with an adjustment term for small-sample bias). This is written in the same way as the base AIC function, and will work for any model with a logLik method.
    • split.data.R: A simple function to break apart a data frame or multivariate time series; it is particularly useful for dealing with the latter. Includes an option to omit missing values while splitting.
  • exif2kmz: a Python script to convert geotagged images to a KMZ file
    • Requires pyexiv2 and Python Imaging libraries.
    • Creates a KMZ file with a placemark for each image and the images themselves.
    • Licensed under the MIT license
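For readers unfamiliar with the corrected AIC mentioned in the AICc.R entry above, the usual small-sample correction is AICc = AIC + 2k(k+1)/(n - k - 1), where k is the number of estimated parameters and n the number of observations. The sketch below is written in Python for illustration and is not the AICc.R script itself; the function name `aicc` and its arguments are hypothetical.

```python
def aicc(log_likelihood, n_params, n_obs):
    """Corrected AIC from a maximized log-likelihood (illustrative helper).

    n_params is k, the number of estimated parameters;
    n_obs is n, the number of observations.
    """
    if n_obs - n_params - 1 <= 0:
        raise ValueError("AICc requires n_obs > n_params + 1")
    aic = 2.0 * n_params - 2.0 * log_likelihood
    correction = 2.0 * n_params * (n_params + 1) / (n_obs - n_params - 1)
    return aic + correction


# Example with made-up numbers: log-likelihood -10, 2 parameters, 20 observations.
print(aicc(-10.0, 2, 20))
```

The correction term shrinks toward zero as n grows relative to k, so AICc and AIC agree for large samples.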
Awards and Honors
  • Arthur P. Dempster Award (inaugural) for "Preprocessing and Multiphase Inference in Theory and Practice"; awarded May 2012
  • NESS Best Paper Award for "Template-based methods for analyzing chromatin structure dynamics genome-wide" with Edo Airoldi; awarded at NESS 2012 in April 2012
  • IBM Thomas J. Watson Research Center Student Research Award for “Deconvolution of mixing time series on a graph” with Edo Airoldi; awarded at NESS 2011 in April 2011
  • NESS Student Paper Award (Sponsored by Microsoft and Google) for “A Bayesian Approach to the Analysis of Time Symmetry in Light Curves: Reconsidering Scorpius X-1 Occultations” with Pavlos Protopapas; awarded April 2010
  • Harvard University Certificate of Distinction in Teaching for Statistics 221, Fall 2010; awarded April 2011
  • Pierce Fellowship Recipient, Harvard University Graduate School of Arts & Sciences, September 2009
  • Phi Beta Kappa initiate, Epsilon Chapter of MA, 2008
Background
  • PhD in Statistics from Harvard University, May 2013
    • Advised by Xiao-Li Meng and Edo Airoldi
    • Dissertation title: "Distributed and Multiphase Inference in Theory and Practice: Principles, Modeling, and Computation for High-Throughput Science"
  • Boston University Alumnus, Class of 2008
    • Bachelor's in Mathematics & Economics
    • Master's in Economics
    • PhD-level coursework in statistics & econometrics
  • Formerly:
    • Intern with Weiss Asset Management (June 2008 - June 2009)
    • Research Assistant with Boston University Department of Economics
    • Intern with UBS Fixed Income Research
      • US Rates & Govt. Bonds Group
    • Senior Research & IT Advisor, Matté & Company
    • Research Assistant, Boston University School of Management

Github profile

arXiv author page

CV

LinkedIn Profile