A look at the state of data storage, management & analysis, from SQL
to NoSQL, “NewSQL” and beyond. I will explain why the core premises of
data management have changed; tell some of the tales of success and failure I have collected on the topic; and share some
counterintuitive rules of thumb about the sometimes mind-blowing,
sometimes nerve-racking reality of life with an alternative
Equipped with little more than a burning desire to succeed and a river of open source software, learn how you can build a test bed for developing and testing machine learning algorithms on a scale-out infrastructure on a shoestring budget.
Adding security to an existing product is never easy, but our team at Yahoo added strong authentication to Apache Hadoop by integrating it with Kerberos. This project was delivered on time and is currently deployed on all of Yahoo's 40,000 Hadoop computers. Come learn how we added security to Hadoop and why it matters.
An overview of the state of the art for bringing together the analytical power of the R language with the big data capabilities of Hadoop.
You've heard about NoSQL. You've heard about the Cloud. What if you could spin up something like HBase in a couple of minutes and try out both at the same time? By the end of this session, you'll know how to do just that, in a way that is portable across several NoSQL projects and dozens of compute clouds.
The data & analytics teams at Etsy build up and tear down more than a thousand independent Hadoop clusters on EC2 each month. This talk discusses the benefits of this approach, where Elastic MapReduce serves as a "meta-cluster" in which on-demand Hadoop clusters can be created, used, and shut down quickly and easily.
In November, Facebook launched a new version of Messages that combines chat, SMS, email, and Messages into a real-time conversation. Facebook relies on Apache HBase, a NoSQL-style database, for storing this real-time message data. This talk will elaborate on our decision process, system configuration, scaling issues, and advantages gained by choosing Open Source.
This hands-on tutorial teaches the basics of the important machine learning algorithms in Mahout and helps you get it up and running on a Hadoop cluster. Mahout is an open source implementation of a collection of algorithms designed from the ground up to sift through terabytes of data and bring out important patterns that are otherwise out of reach of standard tools.
Imagine for a moment doing a JOIN on two HBase tables. Crazy talk, right? Well, now you can, thanks to Hive. True, it is only meant to be used in a batch context, but we have been doing it for a few months now at StumbleUpon, and our analysts and engineers love it. This presentation will cover how the Hive-HBase integration works and how we use it at our company.
Hadoop gives you the ability to process massive amounts of data at scale. This presentation will show you how Hadoop makes use of commodity hardware to let you build a system that scales, deals gracefully with the failure of individual nodes, and gives you the power of Map/Reduce to process petabytes.
Time series sensors are being integrated ubiquitously into places like cell phones, environmental monitors, and the smart grid. As this type of data scales out, RDBMS systems strain to keep up with the high insertion rates and real-time query requirements. In this talk we introduce “Lumberyard”, which provides scalable indexing and low-latency fuzzy pattern searching over time series data.
If you've ever had to move from data center to data center or to the cloud, or from old hardware to new hardware, you know that it's even more painful than moving house. In this presentation, survivors will tell you how to stay sane (and how to get it right) with a case study from Mozilla: moving 30TB of crash reports with no downtime in data collection.
This talk introduces an open-source SQL-based system for continuous or ad-hoc analysis of streaming data built on top of Flume-based data collection for Hadoop.
Attendees will understand how to use a new tool to extend their Hadoop data collection pipeline with real-time streaming analytics.
The last few years have brought a wealth of new data technologies organized around horizontal scalability. This talk will cover the essential infrastructure areas: real-time stream processing, offline data crunching, large-scale data deployments and live serving. The focus will be on how these ingredients come together to enable innovative data-driven products at LinkedIn.
Algorithms are getting raunchier, tools more potent and competitions more intimate! Let us mix analytics tools (like R & Mahout) and a dash of algorithmics to work on Big Data analytics competitions and see if the answer is always 42. In the process we will explore and apply a few good algorithms to the Heritage Health competition …
YARN is the next generation of Hadoop Map-Reduce designed to scale out much further while allowing for running applications other than pure Map-Reduce in a highly fault-tolerant manner.