An overview of the state of the art for bringing together the analytical power of the R language with the big data capabilities of Hadoop.
Data Analysis is often wrapped in a bit of mystery, with specialized tools, fancy terminology, and difficult techniques. This tutorial takes a different stance: we will review a set of basic methods and techniques, which are nevertheless essential if you want to think about and understand data. Particular emphasis is placed on ways to gain insight through graphical methods.
Learn how to cobble together a PostgreSQL database, install a few handy R packages, a pinch of language extensions, and a handful of publicly available data to generate a forest monitoring platform to help landscape managers make better decisions using basic design-engineering paradigms to perform quick trade-off analyses.
This hands-on tutorial aims at learning the basics of the important machine learning algorithms in Mahout. It aims to help you get it up and running on a Hadoop cluster. Mahout is open source implementation of a collection of algorithms designed from ground up to sift through terabytes of data and help bring out important patterns which are otherwise not in the reach of standard tools.
Imagine for a moment doing a JOIN on two HBase tables, crazy talk right? Well now you can thanks to Hive. True, it is only meant to be used in a batch context, but we have being doing it for a few months now at StumbleUpon and our analysts and engineers love it. This presentation will cover how the Hive-HBase integration works and how we use it at our company.
Time Series sensors are being ubiquitously integrated in places like cell phones, environmental sensors, and the smart grid. As we scale out this type of data RDBMS systems strain to scale with the high insertion rates and real time query requirements. In this talk we introduce “Lumberyard” which is a scalable indexing and low latency fuzzy pattern searching time series data.
We produce gorgeous LaTeX reports while harnessing the power of R on the backend. The data is pulled from our PostgreSQL database, the analysis and visualizations are fast and distributed thanks to Redis. We'll talk about weaving together open source tools to build powerful analytics reporting engines that rival the commercial alternatives.
Synthetic biology is a new field where basic biological components can be engineered to create something new. It often involves DNA synthesizers, ligation, promoters, and polymerase chain reaction -- which may or may not be safe for your in silico environment. However, as the size and complexity of the systems increase, tools become more and more important, thus CAD for biology has emerged.