Notes

Keep thinking about this quote from Jeroen Ooms’s recent piece on the arXiv:

> The role and shape of data is the main characteristic that distinguishes scientific computing. In most general purpose programming languages, data structures are instances of classes with well-defined fields and methods. […] Strictly defined structures make it possible to write code implementing all required operations in advance without knowing the actual content of the data. It also creates a clear separation between developers and users [emphasis added]. Most applications do not give users direct access to raw data. Developers focus in implementing code and designing data structures, whereas users merely get to execute a limited set of operations.
>
> This paradigm does not work for scientific computing. Developers of statistical software have relatively little control over the structure, content, and quality of the data. Data analysis starts with the user supplying a dataset, which is rarely pretty. Real world data come in all shapes and formats. They are messy, have inconsistent structures, and invisible numeric properties. Therefore statistical programming languages define data structures relatively loosely and instead implement a rich lexicon for interactively manipulating and testing the data. Unlike software operating on well-defined data structures, it is nearly impossible to write code that accounts for any scenario and will work for every possible dataset. Many functions are not applicable to every instance of a particular class, or might behave differently based on dynamic properties such as size or dimensionality. For these reasons there is also less clear of a separation between developers and users in scientific computing. The data analysis process involves simultaneously debugging of code and data where the user iterates back and forth between manipulating and analyzing the data. Implementations of statistical methods tend to be very flexible with many parameters and settings to specify behavior for the broad range of possible data. And still the user might have to go through many steps of cleaning and reshaping to give data the appropriate structure and properties to perform a particular analysis.
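To make Ooms’s point concrete: in R you interrogate a dataset’s structure interactively rather than relying on a declared schema. A minimal sketch (the toy data frame here is invented for illustration):

```r
# A dataset arrives with whatever shape and quality the source provides;
# we interrogate it interactively rather than trusting a fixed schema.
d <- data.frame(id = c(1, 2, 3),
                weight = c(5.2, NA, 4.8),
                site = c("A", "A", "B"))

str(d)             # inspect structure: column types were inferred, not declared
dim(d)             # size is a dynamic property, discovered at runtime
any(is.na(d))      # TRUE: test for missing values before analyzing
summary(d$weight)  # numeric properties only become visible when we look
```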

Good inspiration for this week’s assignment:

Notes for SWC training: short motivational pitch on R

What’s the difference between a novice programmer and a professional programmer?
The novice pauses a moment before doing something stupid.

In this course, we’ll be learning about the R programming environment. You may already have heard of R or have used it before.

It’s the number one language for statistical programming, and it’s widely used in industry, putting these skills in high demand.

Dice’s recent tech salary survey showed that programmers with expertise in R topped the salary charts (at $115,531, above Hadoop, MapReduce, C, and cloud, mobile, or UI/UX design skills).

You may not (always) like the R programming environment. The syntax is often challenging, it can do counterintuitive things, and it’s terrible at mind reading.

Try not to take this out on your hardware. Hardware is expensive.
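For the curious, two of the classic surprises mentioned above (plain base-R behavior, easy to reproduce at the console):

```r
# Vector recycling: the shorter vector is silently reused
c(1, 2, 3, 4) + c(10, 20)     # 11 22 13 24, no warning

# Factors print like strings but are stored as integer codes
f <- factor(c("10", "20", "30"))
as.numeric(f)                 # 1 2 3 -- the underlying codes, not the labels
as.numeric(as.character(f))   # 10 20 30 -- the usual workaround
```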

R’s strength lies in data.

No other major language has key statistical concepts like ‘missing data’ baked in at the lowest level.
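For example, NA is a first-class value in R’s atomic vectors, and missingness propagates through computations unless you explicitly opt out:

```r
x <- c(4.1, NA, 5.6, 7.2)  # NA lives inside an ordinary numeric vector

mean(x)                # NA: missingness propagates by default
mean(x, na.rm = TRUE)  # ~5.63, once you explicitly opt in to dropping it
is.na(x)               # FALSE TRUE FALSE FALSE
NA > 5                 # NA: comparisons with a missing value are missing too
```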

We will not be approaching R as a series of recipes to perform predefined tasks.

In scientific research we can seldom predict just what our data will look like in advance or what our analysis will require. This makes it impossible to always rely on pre-made software and the graphical interfaces you may be familiar with, and blurs the lines between users and developers. Unlike other languages such as C, Java or Python that are often used to build ‘end-user’ software in which the underlying language is completely invisible, R does not make this distinction. R gives a lot of power to the end user. And with great power comes great responsibility.
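One small illustration of that blurred user/developer line (just a base-R sketch): the source of any function is open for inspection, and nothing stops a user from redefining it in their own session.

```r
# Typing a function's name without parentheses prints its definition:
sd            # the full source of the standard deviation function, no secrets

# And nothing stops a user from redefining it for their own session:
sd <- function(x, ...) stop("patched by the user!")
# sd(1:10) would now raise the error above; rm(sd) restores the original
```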