Open Source Analytics: Visualization and Predictive Modeling of Big Data with the R Programming Language

Michael Driscoll (Metamarkets)
Emerging Topics, Programming
Location: Exhibit Hall 3
Average rating: 3.61 (18 ratings)

The economics of data aggregation and analysis are being disrupted by
falling costs for storage and CPU power, the continuing shift of
business processes online, and the deluge of data that is being
generated as a consequence.

Innovative technologies have emerged to cope with the storage and retrieval of Big Data, yet analysis tools have received less emphasis. Many emerging data sets do not fit within existing software paradigms: either their size overwhelms traditional desktop tools such as Excel, or their range of data types (geocodes, for example) prevents them from being pipelined into more powerful but narrowly designed tools. Most importantly, closed-source tools cannot keep pace with the leading edge of innovation in statistical and machine-learning algorithms.

Enter the open source programming language R. R has been dubbed the
lingua franca for statistical computing and graphical analysis, with a
pedigree tracing back several decades to the S language developed at
Bell Labs. Though its
million-plus users are concentrated within academia, R is gaining
currency within several high-profile quantitative analysis groups,
including Google’s Customer Insights team and Barclays Global
Investors. In addition, R’s extensibility via user-contributed
packages has spawned an active developer community.

In this session, I will focus on applying R’s powerful visualization
and analysis capabilities to the kinds of large, multidimensional data
sets that increasingly confront developers. Along the way, I will
highlight R’s functional programming features, its compact syntax for
statistical modeling, and its ease of connectivity with persistent
data stores.

In particular, I will present the following two case studies applying R to large, freely available data sets:

- an analysis of NASA’s Landsat imagery of Brazil’s center-west
agricultural regions to detect correlates for soybean harvest yields,
and a derived predictor of the Brazilian soybean market based in part
on these correlates.

- a validation of Bill James’ sabermetrics approach to batting
performance using 30 years of Major League Baseball statistics, and a
derived predictor for batters’ salaries.

For all of its strengths, R has an admittedly steep learning curve.
While source code for these examples will be provided, this talk will
emphasize techniques and approach over detail. This session seeks to
give developers the courage to learn R, the confidence to include it
in their OSS arsenal, and the wisdom to recognize opportunities for
its use.

Michael Driscoll

Metamarkets

Michael E. Driscoll is a Principal at Dataspora, Inc., a business analytics consultancy in San Francisco. He has eight years of experience developing large-scale databases and inference algorithms across academia and industry, with applications ranging from metal-breathing microbes to municipal real estate. He also founded, and until 2008 served on the board of, CustomInk.com, an Inc. 500 e-commerce firm.

He is the co-chair of the Bay Area R Users Group, and has used R extensively for the visualization and analysis of genome data, GIS data, and macroeconomic data sets.

Michael has a Ph.D. in Bioinformatics and Systems Biology from Boston University, where he was a DOE Computational Science Graduate Fellow, and an A.B. from Harvard College.

Comments

Jason Buberel
07/23/2009 11:17pm PDT

There was a bit too much PR fluff plus Tufte advocacy and too little focus on what R is and what it can do. There was also little or no practical information on the tools that have been built around the language to make it do the cool things that Michael talked about.

There was more time/effort devoted to “How to do good visualization” which could have been summarized to a few book recommendations, and too little “Here is how to get started munging, modeling and visualizing data using the R toolkit”.
