Real-time Streaming Analysis for Hadoop and Flume

Aaron Kimball (Magnify Consulting)
Average rating: 3.62 (8 ratings)

This talk introduces an open-source SQL-based system for continuous or ad-hoc analysis of streaming data built on top of the Flume data collection platform for Hadoop.

Big data analytics based on Hadoop often require aggregating data in a large data store like HDFS or HBase, and then running periodic MapReduce processes over this data set. Getting “near real time” results requires running MapReduce jobs more frequently over smaller data sets, which has a practical frequency limit based on the size of the data and complexity of the analytics; the lower bound on analysis latency is on the order of minutes. This has spawned a trend of building custom analytics directly into the data ingestion pipeline, enabling some streaming operations such as early alerting, index generation, or real-time tuning of ad systems before performing less time-sensitive (but more comprehensive) analysis in MapReduce.
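
As a rough illustration of the kind of in-pipeline streaming operation mentioned above (early alerting), the following Python sketch counts error events in a trailing time window and raises an alert when a threshold is reached. It is not the tool presented in this talk and does not use Flume's API; the event shape, the function name `error_alerts`, and the window and threshold values are all assumptions made for the example. It also assumes events arrive in timestamp order.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple

# Hypothetical event shape for this sketch: (timestamp_seconds, is_error).
Event = Tuple[float, bool]


def error_alerts(events: Iterable[Event], window_s: float = 60.0,
                 threshold: int = 3) -> Iterator[Tuple[float, int]]:
    """Yield (timestamp, error_count) each time an error arrives and the
    number of errors in the trailing window is at or above the threshold.

    Assumes events are supplied in non-decreasing timestamp order.
    """
    recent: Deque[float] = deque()  # timestamps of errors still in the window
    for ts, is_error in events:
        # Evict errors that have fallen out of the trailing window.
        while recent and recent[0] <= ts - window_s:
            recent.popleft()
        if is_error:
            recent.append(ts)
            if len(recent) >= threshold:
                yield ts, len(recent)


if __name__ == "__main__":
    # Made-up sample stream, processed one event at a time as it "arrives".
    stream = [(0, True), (10, False), (20, True), (30, True), (100, True)]
    for ts, count in error_alerts(stream):
        print(f"alert at t={ts}: {count} errors in the last 60s")
```

Because each event is handled as it arrives and only the timestamps inside the window are kept, the alert latency is bounded by event delivery rather than by the interval between batch jobs. With the sample stream above, a single alert fires at t=30.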

We present an open-source tool which extends the Flume data collection platform with a SQL-like language for analysis over streaming event-based data sets. We will discuss the motivation for the system, its architecture and interaction with Flume, potential applications, and examples of its usage.

Aaron Kimball

Magnify Consulting

Aaron is the principal consultant at Magnify Consulting. Magnify helps organizations develop and execute their big data strategy.

He is a committer on the Apache Hadoop project and has been working with Hadoop since 2007. Prior to Magnify, Aaron founded WibiData, Inc. in 2010, a software company that engineers solutions for the large-scale user-centric data challenges that face today’s enterprises. Aaron previously worked at Cloudera, a company that provides an enterprise platform, support, and services built around Hadoop. Aaron founded the open source Apache Sqoop data import tool and Apache MRUnit Hadoop testing library projects. Aaron holds a B.S. in Computer Science from Cornell University and an M.S. in Computer Science from the University of Washington.

Comments on this page are now closed.


Sheeri K. Cabral
09/05/2011 5:04am PDT

Video for this talk can be found at www.youtube.com/watch?v=POJ...