- Research interests
- Statistical analysis of massive datasets
- Dependent data (time series, networks, spatial, etc.)
- Analysis of data under non-sampling variability
- Privacy-protected analysis and data release
- Semi-parametric methods
- Efficient computational and Monte Carlo methods
- Advisors
- Xiao-Li Meng, Harvard University Department of Statistics (Primary)
- Edo Airoldi, Harvard University Department of Statistics
- Selected publications
- A Bayesian approach to the analysis of time symmetry
in light curves: Reconsidering Scorpius X-1 occultations.
Alexander Blocker, Pavlos Protopapas, & Charles Alcock.
- arXiv:0904.0645v1 [astro-ph.IM]
- Published in ApJ (doi: 10.1088/0004-637X/701/2/1742)
- Winner of NESS Student Paper Award, April 2010
- Selected talks
- Discussion: The Promise & Peril of Synthetic & Integrated Data
- A discussion of the potential gains and problems with integrated, synthetic
data of the type produced by Longitudinal Employer-Household Dynamics at the
US Census Bureau.
Presented June 21, 2010 at the ICSA 2010 Applied Statistics Symposium as part of the "Informative But Not Invasive" panel, organized by Xiao-Li Meng.
Followed presentations by Jeremy Wu (US Census Bureau), John Abowd (Cornell University), and Jerry Reiter (Duke University).
- Doing Right By Massive Data: Using Probability Modeling To Advance The Analysis Of Huge Astronomical Datasets
- The analysis of extremely large, complex datasets is becoming an
increasingly important task in the analysis of scientific data. This trend is
especially prevalent in astronomy, as large-scale surveys such as SDSS,
Pan-STARRS, and the LSST deliver (or promise to deliver) terabytes of data
per night. While both the statistics and machine-learning communities
have offered approaches to these problems, neither has produced a
completely satisfactory approach. Working in the context of event
detection for the MACHO LMC data, I will present an approach that
combines much of the power of Bayesian probability modeling with the
efficiency and scalability typically associated with more ad-hoc machine
learning approaches. This provides both rigorous assessments of
uncertainty and improved statistical efficiency on a dataset containing
approximately 20 million sources and 40 million individual time series. I
will also discuss how this framework could be extended to related
problems.
Presented at NESS 2010
- Two Problems in X-ray Astronomy
- Discussion of my work on two projects in X-ray astronomy: the development of a hierarchical Bayesian replacement for "stacking" and the analysis of events in X-ray light curves. For each problem, I outlined the development of an improved model for the data and the computational methods employed. I also discussed the unique challenges that each case has presented from a cultural perspective.
- Software
- bayesstack: Bayesian X-ray stacking analysis
- Part of the ChaMP software packages for the analysis of multiwavelength surveys
- Kalman tools for Matlab
- Kalman filter & smoother
- Allow for control inputs in state equation & affine term in measurement equation
- Maximum likelihood estimation of linear state-space systems
- Implementation of the expectation maximization algorithm
- Can estimate input matrix and/or affine term in measurement equation
- Optional diagonal restrictions on state & observation noise covariance matrices
- 12/06/2007: Updated with moderate efficiency improvements for M-step routines & major change in EM convergence criterion (relative instead of absolute change)
- 12/13/2007: Significant efficiency improvements and further tweaking of EM convergence criterion
- Licensed under LGPL v3.0
- A technical note on the EM algorithm for affine state-space systems & its usage
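To illustrate the model class the Kalman tools above address, here is a minimal scalar sketch in Python. It is not the Matlab code itself. It assumes a state equation x_t = a*x_{t-1} + b*u_t + w_t and a measurement equation y_t = c*x_t + d + v_t, with noise variances q and r. This notation and the function name `kalman_step` are illustrative assumptions, not taken from the package.

```python
def kalman_step(m, p, y, u, a, b, c, d, q, r):
    """One predict/update step of a scalar Kalman filter.

    Assumed (hypothetical) model, for illustration only:
        x_t = a * x_{t-1} + b * u_t + w_t,   w_t ~ N(0, q)
        y_t = c * x_t + d + v_t,             v_t ~ N(0, r)
    m, p are the previous filtered mean and variance of the state.
    """
    # Predict: propagate the state through the state equation,
    # including the control input u.
    m_pred = a * m + b * u
    p_pred = a * a * p + q

    # Update: compare the observation with its prediction,
    # which includes the affine term d.
    innovation = y - (c * m_pred + d)
    s = c * c * p_pred + r          # innovation variance
    gain = p_pred * c / s           # Kalman gain
    m_new = m_pred + gain * innovation
    p_new = (1.0 - gain * c) * p_pred
    return m_new, p_new


if __name__ == "__main__":
    # Filter a short, made-up series with a constant control input.
    m, p = 0.0, 1.0
    for y in [3.0, 4.1, 5.2]:
        m, p = kalman_step(m, p, y, u=1.0, a=1.0, b=1.0,
                           c=1.0, d=1.0, q=0.1, r=1.0)
        print(round(m, 3), round(p, 3))
```

The matrix-valued case replaces the scalar products with matrix products and the division by `s` with a linear solve, but the control input and affine term enter in the same two places.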
- Some useful scripts for R
- bagginglm.R: The beginning of a set of functions for bagging LMs and GLMs. Very preliminary. Licensed under GPL v2.0
- AICc.R: A function to calculate corrected AIC (AIC with an adjustment term for small-sample bias). This is written in the same way as the base AIC function, and will work for any model with a logLik method.
- split.data.R: A simple function to break apart a data frame or multivariate time series; it is particularly useful for dealing with the latter. Includes an option to omit missing values while splitting.
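For reference, the correction that AICc.R applies can be sketched in a few lines of Python. This is not the R source. It assumes the usual small-sample form, AICc = AIC + 2k(k+1)/(n-k-1), where k is the number of estimated parameters and n the number of observations. The function name `aicc` and its arguments are hypothetical.

```python
def aicc(log_lik, k, n):
    """Corrected AIC from a log-likelihood value.

    Assumes the common small-sample form (an assumption, not read
    from AICc.R):
        AIC  = 2k - 2 * log_lik
        AICc = AIC + 2k(k + 1) / (n - k - 1)
    log_lik: maximized log-likelihood
    k:       number of estimated parameters
    n:       number of observations
    """
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    aic = 2 * k - 2 * log_lik
    return aic + 2 * k * (k + 1) / (n - k - 1)


if __name__ == "__main__":
    # Made-up values: the correction shrinks as n grows.
    print(aicc(log_lik=-10.0, k=2, n=10))
    print(aicc(log_lik=-10.0, k=2, n=1000))
```

As n grows relative to k, the adjustment term goes to zero and AICc approaches the ordinary AIC.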
- exif2kmz: a Python script to convert geotagged images to a KMZ file
- Requires pyexiv2 and Python Imaging libraries.
- Creates a KMZ file with a placemark for each image and the images themselves.
- Licensed under GPL v2.0
- Current Affiliations
- PhD Student with Harvard University Department of Statistics
- Statistical researcher with LEHD group at United States Census Bureau, developing methods for inference and disclosure limitation with imputed and synthetic data.
- Working with the Time Series Center (part of the IIC) on computationally-intensive time series analysis (focus on event detection).
- Currently working with astrostatistics group (CHASC) on X-ray stacking for ChaMP.
- Teaching assistant with Harvard University Department of Statistics
- Head TF for Statistics 104 with Professor Stanley for Spring 2009
- Background
- Boston University Alumnus, Class of 2008
- Bachelor's in Mathematics & Economics
- Master's in Economics
- PhD-level coursework in statistics & econometrics
- Formerly:
- Teaching assistant for two sections of Stat 104 (Harvard, Fall 2008)
- Intern with Weiss Asset Management (June 2008 - June 2009)
- Research Assistant with Boston University Department of Economics
- Intern with UBS Fixed Income Research
- US Rates & Govt. Bonds Group
- Senior Research & IT Advisor, Matté & Company
- Research Assistant, Boston University School of Management