The economics of data aggregation and analysis are being disrupted by falling costs for storage and CPU power, the continuing shift of business processes online, and the deluge of data that is being generated as a consequence.
Innovative technologies have emerged to cope with the storage and retrieval of Big Data, yet tools for analysis have received less attention. Many emerging data sets do not fit within existing software paradigms: either their size overwhelms traditional desktop tools such as Excel, or their range of data types (geocodes, for example) prevents them from being pipelined into more powerful, but narrowly designed, tools. Most importantly, closed-source tools cannot keep pace with the leading edge of innovation in statistical and machine-learning algorithms.
Enter the open source programming language R. R has been dubbed the lingua franca for statistical computing and graphical analysis, with a pedigree tracing back several decades at Bell Labs. Though its million-plus users are concentrated within academia, R is gaining currency within several high-profile quantitative analysis groups, including Google’s Customer Insights team and Barclays Global Investors. In addition, R’s extensibility via user-contributed packages has spawned an active developer community.
In this session, I will focus on applying R’s powerful visualization and analysis capabilities to the kinds of large, multidimensional data sets that increasingly confront developers. Along the way, I will highlight R’s functional programming features, its compact syntax for statistical modeling, and its ease of connectivity with persistent data stores.
In particular, I will present the following two case studies applying R to large, freely available data sets:
- an analysis of NASA’s Landsat imagery of Brazil’s center-west agricultural regions to detect correlates for soybean harvest yields, and a derived predictor of the Brazilian soybean market based in part on these correlates.
- a validation of Bill James’ sabermetrics approach to batting performance using 30 years of Major League Baseball statistics, and a derived predictor for batters’ salaries.
For all of its strengths, R has an admittedly steep learning curve. While source code for these examples will be provided, this talk will emphasize techniques and approach over detail. This session seeks to give developers the courage to learn R, the confidence to include it in their OSS arsenal, and the wisdom to recognize opportunities for its use.
Michael E. Driscoll is a Principal at Dataspora, Inc., a business analytics consultancy in San Francisco. He has eight years of experience developing large-scale databases and inference algorithms in academia and industry, with applications ranging from metal-breathing microbes to municipal real estate. He also founded CustomInk.com, an Inc. 500 e-commerce firm, and served on its board until 2008.
He is the co-chair of the Bay Area R Users Group, and has used R extensively for the visualization and analysis of genome data, GIS data, and macroeconomic data sets.
Michael has a Ph.D. in Bioinformatics and Systems Biology from Boston University, where he was a DOE Computational Science Graduate Fellow, and an A.B. from Harvard College.