<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Randomized Blocker</title>
	<atom:link href="http://www.awblocker.com/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.awblocker.com/blog</link>
	<description>Statistics and computing with massive data</description>
	<lastBuildDate>Mon, 16 Jan 2012 02:36:44 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Low-level vs. high-level numerical programming</title>
		<link>http://www.awblocker.com/blog/2012/01/low-level-vs-high-level-numerical-programming/</link>
		<comments>http://www.awblocker.com/blog/2012/01/low-level-vs-high-level-numerical-programming/#comments</comments>
		<pubDate>Mon, 16 Jan 2012 02:31:39 +0000</pubDate>
		<dc:creator>awblocker</dc:creator>
				<category><![CDATA[Computing]]></category>
		<category><![CDATA[Numerical development]]></category>

		<guid isPermaLink="false">http://www.awblocker.com/blog/?p=50</guid>
		<description><![CDATA[First, apologies for the delay in posting; I know that your life has been dimmed by my absence, but I did finish a paper for AoAS. Now, enough hyperbole &#8212; more computing. I came across an older (2006) blog post via this HN post arguing that C/C++ are poor tools for numerical and scientific programming. It&#8217;s a &#8230;<p><a href="http://www.awblocker.com/blog/2012/01/low-level-vs-high-level-numerical-programming/" class="more-link">Read More</a></p>]]></description>
			<content:encoded><![CDATA[<p>First, apologies for the delay in posting; I know that your life has been dimmed by my absence, but I did finish a paper for AoAS. Now, enough hyperbole &#8212; more computing.</p>
<p>I came across <a href="http://scienceblogs.com/goodmath/2006/11/the_c_is_efficient_language_fa.php">an older (2006) blog post</a> via <a href="http://news.ycombinator.com/item?id=3442172">this HN post</a> arguing that C/C++ are poor tools for numerical and scientific programming. It&#8217;s a bit dated, but I agree with many of Mark&#8217;s points. In particular, it is much more difficult to write efficient low-level numerical code in C than in some other languages, such as Fortran or functional languages like OCaml.</p>
<p>However, I&#8217;m not ready to ditch C/C++ and start building my next project in one of those languages. For these decisions, I think it&#8217;s important to distinguish between low- and high-level numerical programming. This point is being hashed out somewhat in the HN thread, but it seems that relatively few of the commenters there are building analytics software for scientific users. From my perspective, I want three (conflicting) traits in my programming language for numerical development:</p>
<p><span id="more-50"></span></p>
<ol>
<li>Extremely high efficiency</li>
<li>Easy integration with data management code and libraries</li>
<li>Fast development so I can test a wide range of computational approaches</li>
</ol>
<p>Mark argues that C is not ideal for 1, which I agree with. However, I find C excellent for 2 and not terrible for 3.</p>
<p>C is not ideal for the lowest-level forms of numerical development. The best low-level code (e.g. BLAS and LAPACK) is typically written in Fortran or extraordinarily tweaked, algorithmically-generated C (e.g. <a href="http://math-atlas.sourceforge.net/">ATLAS</a>). However, I&#8217;m not rewriting BLAS or LAPACK. When I need very fast code, I almost always use C/C++ with hooks to those libraries. In C++, I sometimes also use <a href="http://eigen.tuxfamily.org/">Eigen</a>, which performs very well with relatively low verbosity. I know that the people building these libraries have done a vastly better job optimizing the lowest-level details than I could in a decade or two, so I build upon their work.</p>
<p>In terms of integration with other languages (I&#8217;m particularly focused on R and Python), C/C++ is excellent. There are built-in APIs and well-written interfaces (e.g. <a href="http://dirk.eddelbuettel.com/code/rcpp.html">Rcpp</a> and <a href="http://www.boost.org/doc/libs/1_48_0/libs/python/doc/">Boost.Python</a>) that make development vastly easier. Fortran is all right on this point, although it usually entails more difficulties. OCaml, Lisp, etc. see far less support in the community and are generally less attractive for building reusable code.</p>
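<p>As a small illustration of how little glue the C side can require, here is a minimal sketch that calls the C math library from Python using the standard-library <code>ctypes</code> module. This is not Rcpp or Boost.Python; it assumes a POSIX system where the C math library can be located, and the helper name <code>c_cos</code> is made up for the example.</p>

```python
import ctypes
import ctypes.util
import math

# Locate the C math library. On POSIX systems, if the lookup returns None,
# CDLL(None) falls back to the symbols already loaded into the Python
# process, which normally include the math library. (Assumption: POSIX.)
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# ctypes assumes int arguments and return values unless told otherwise,
# so declare the C signature explicitly: double cos(double).
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double


def c_cos(x):
    """Hypothetical helper: call the C library's cos() on a Python float."""
    return libm.cos(x)


if __name__ == "__main__":
    # Compare the C call with Python's own math.cos.
    print(c_cos(1.0), math.cos(1.0))
```

<p>The same pattern (load a shared library, declare argument and return types, call) applies to your own compiled C code, although wrappers like the ones linked above handle richer types far more comfortably.</p>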
<p>For speed of development, C/C++ is not great. However, for much of the development process, I typically work in a higher-level language (R or Python), dropping to C/C++ once the algorithm&#8217;s overall form is fixed. Also, when working with BLAS/LAPACK/Eigen, it is quite easy to switch between different algorithms for tasks like matrix factorizations or eigendecompositions. OCaml and other modern languages can excel for speed of development, while Fortran is essentially a draw with C/C++; after all, BLAS and LAPACK are, at their core, Fortran libraries.</p>
<p>So, for now, I&#8217;m staying with C/C++ for high-performance numerical development. They&#8217;re not the prettiest languages, and they&#8217;re not the easiest to work with, but they are very fast and they play nicely with everything else I use. The only numerical developers that I&#8217;m really jealous of at this point are (ironically) those working in .Net, most of whom work in finance. They have access to F# along with the full range of .Net languages and development tools, so they can drop into functional code where it&#8217;s useful and still have all of their components play together (reasonably) nicely. Ideally, we&#8217;ll see this type of integration on the open-source side in the near future.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.awblocker.com/blog/2012/01/low-level-vs-high-level-numerical-programming/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Thoughts on MIC (Reshef et al. 2011)</title>
		<link>http://www.awblocker.com/blog/2011/12/thoughts-on-mic-reshef-et-al-2011/</link>
		<comments>http://www.awblocker.com/blog/2011/12/thoughts-on-mic-reshef-et-al-2011/#comments</comments>
		<pubDate>Sat, 31 Dec 2011 21:25:51 +0000</pubDate>
		<dc:creator>awblocker</dc:creator>
				<category><![CDATA[EDA]]></category>
		<category><![CDATA[Nonparametrics]]></category>

		<guid isPermaLink="false">http://www.awblocker.com/blog/?p=21</guid>
		<description><![CDATA[I recently went over this paper on the maximal information coefficient (MIC), which has garnered a lot of attention among the statistics blogosphere and some of my applied collaborators. My initial reaction is that it looks like a nice addition to the range of exploratory techniques, but it has some major limitations. Also, if you are reading this &#8230;<p><a href="http://www.awblocker.com/blog/2011/12/thoughts-on-mic-reshef-et-al-2011/" class="more-link">Read More</a></p>]]></description>
			<content:encoded><![CDATA[<p>I recently went over <a href="http://www.sciencemag.org/content/334/6062/1518.abstract">this paper on the maximal information coefficient (MIC)</a>, which has garnered a lot of attention in the statistics blogosphere and among some of my applied collaborators. My initial reaction is that it looks like a nice addition to the range of exploratory techniques, but it has some major limitations. Also, if you are reading this paper, I strongly recommend going through <a href="http://www.sciencemag.org/content/suppl/2011/12/14/334.6062.1518.DC1/Reshef.SOM.pdf">the supplement</a> (PDF) as well; it has most of the important technical details about their methods.</p>
<p><span id="more-21"></span></p>
<p>On the positive side, the measures the authors demonstrate appear easy to use and reasonably effective for the detection of non-linear relationships between variables with moderate amounts of noise. MIC is also more interpretable than many other such statistics (e.g. unnormalized mutual information), has a useful relationship to R^2 in linear settings, and has very nice invariance properties under monotone univariate transformations. I am also intrigued by their definition of &#8220;equitability&#8221; for measures of association; it seems useful, but tricky to pin down when, for example, looking at different noise structures.</p>
<p>Now, my reservations. First (and foremost, on a technical level), their technique for estimation depends heavily on the maximal grid size B(n). This plays the role of an inverse bandwidth parameter &#8212; large values allow for finer grids. Setting B(n) too large leads to overfitting and inflated estimates of MIC, as the authors show in the supplement. To address this, the authors suggest setting B(n) to n^0.6 as a reasonable choice based on simulation studies. This is somewhat analogous to scaling results obtained for other nonparametric methods, but the authors of this paper do not attempt to find the optimal coefficient on this rate for any estimation criterion. I found this somewhat surprising given the attention devoted to optimal bandwidth selection for many existing nonparametric methods.</p>
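<p>To make the role of B(n) concrete, here is a rough sketch of a simplified MIC-like statistic. It is not the authors&#8217; estimator: the paper searches over grid partitions, whereas this version only tries equal-frequency bins, and all function names are made up for the illustration. It keeps the two ingredients discussed above: mutual information on an x-by-y grid normalized by log min(x, y), and a search restricted to grids with x * y below B(n) = n^exponent.</p>

```python
import math
import random
from collections import Counter


def equal_frequency_bins(values, k):
    """Label each value with one of k bins holding roughly equal counts."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = rank * k // n
    return labels


def mutual_information(bx, by):
    """Plug-in mutual information (in nats) between two label sequences."""
    n = len(bx)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    return sum((c / n) * math.log(c * n / (px[i] * py[j]))
               for (i, j), c in joint.items())


def approx_mic(xs, ys, exponent=0.6):
    """Largest normalized MI over equal-frequency grids with kx * ky below B(n).

    Simplification (assumption): equal-frequency bins only, no optimization
    of the grid boundaries as in the paper.
    """
    n = len(xs)
    cap = n ** exponent  # B(n); 0.6 is the exponent suggested in the paper
    best = 0.0
    for kx in range(2, n + 1):
        if kx * 2 >= cap:
            break  # even the smallest grid with kx columns is too large
        bx = equal_frequency_bins(xs, kx)
        for ky in range(2, n + 1):
            if kx * ky >= cap:
                break
            by = equal_frequency_bins(ys, ky)
            score = mutual_information(bx, by) / math.log(min(kx, ky))
            best = max(best, score)
    return best


random.seed(0)
xs = [random.random() for _ in range(200)]
noise = [random.random() for _ in range(200)]  # independent of xs

print("identical variables:", approx_mic(xs, xs))
print("independent, exponent 0.6:", approx_mic(xs, noise, 0.6))
print("independent, exponent 0.9:", approx_mic(xs, noise, 0.9))
```

<p>Raising the exponent can only enlarge the set of grids being searched, so the score for independent data can only stay the same or go up. That is the overfitting mechanism in miniature.</p>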
<p>On a broader level, I have some cautions on the proposed methods:</p>
<ul>
<li>The basic approach of searching a vast amount of raw data for complex relationships can be problematic. What surfaces are often artifacts of the measurement process. Conversely, using preprocessed data can simply reveal effects of the processing rather than science. The examples used in this paper appear solid, but I worry about such issues if this approach is broadly adopted.</li>
<li>There is no free lunch. This is a neat technique, but it (necessarily) has less power to detect linear relationships than, e.g., F tests in the presence of Gaussian noise. MIC is also (currently) limited to bivariate relationships; scaling it to more than two variables appears difficult. I like MIC as an addition to my toolbox, but it&#8217;s not going to replace everything else.</li>
<li>Multiple testing is still tricky. The authors handle it fairly well (FDR control via Benjamini-Hochberg), but this is a point of concern, as the number of pairwise relationships scales as p^2. This will pose a greater challenge if this approach is pushed to relationships between more than two variables.</li>
<li>Their techniques really look for a lack of independence, not any particular form of non-linearity. This is great for exploration, but it is also very limited compared to model-based approaches. MIC and related methods cannot be used directly for prediction, nor can they be easily integrated with other information in a principled way, as probabilistic models can.</li>
</ul>
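<p>For readers unfamiliar with the multiple-testing point above, here is a minimal sketch of the standard Benjamini-Hochberg step-up procedure, along with the pair count that drives the concern. The function names and the example p-values are made up for illustration; this is the textbook procedure, not code from the paper.</p>

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return booleans marking which hypotheses BH rejects at FDR level q."""
    m = len(pvalues)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k (1-based) whose p-value is at or below k * q / m.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if rank * q / m >= pvalues[i]:
            cutoff = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected


def n_pairs(p):
    """Number of distinct variable pairs among p variables: p * (p - 1) / 2."""
    return p * (p - 1) // 2


if __name__ == "__main__":
    # Made-up p-values, purely for illustration.
    example = [0.041, 0.001, 0.216, 0.008, 0.039]
    print(benjamini_hochberg(example, q=0.05))
    print(n_pairs(1000))
```

<p>With 1,000 variables there are already 499,500 pairs to screen, so the thresholds applied to the smallest p-values become very strict.</p>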
<p>Overall, MIC and its related methods look like a solid contribution to the literature on nonparametric EDA, but they are not a panacea. I am curious to see how they are received and developed by the rest of the statistical community.</p>
<p><strong>Edit: Moved commentary below the fold</strong></p>
]]></content:encoded>
			<wfw:commentRss>http://www.awblocker.com/blog/2011/12/thoughts-on-mic-reshef-et-al-2011/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Starting the blog</title>
		<link>http://www.awblocker.com/blog/2011/12/starting-the-blog/</link>
		<comments>http://www.awblocker.com/blog/2011/12/starting-the-blog/#comments</comments>
		<pubDate>Fri, 23 Dec 2011 05:07:19 +0000</pubDate>
		<dc:creator>awblocker</dc:creator>
				<category><![CDATA[Announcements]]></category>

		<guid isPermaLink="false">http://www.awblocker.com/blog/?p=10</guid>
		<description><![CDATA[Welcome to the blog. I intend to update this at least weekly with comments, questions, and musings on the union of statistics, computing, and massive data (intersection is so restrictive). It is my hope that you find my ramblings informative, thought-provoking, and/or rage-inducing &#8212; really, I&#8217;ll take any or all of the three. Happy holidays, &#8230;<p><a href="http://www.awblocker.com/blog/2011/12/starting-the-blog/" class="more-link">Read More</a></p>]]></description>
			<content:encoded><![CDATA[<p>Welcome to the blog. I intend to update this at least weekly with comments, questions, and musings on the union of statistics, computing, and massive data (intersection is so restrictive). It is my hope that you find my ramblings informative, thought-provoking, and/or rage-inducing &#8212; really, I&#8217;ll take any or all of the three. Happy holidays, and welcome again!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.awblocker.com/blog/2011/12/starting-the-blog/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>


