As new data sets become available through municipal Open Data initiatives, how can these be leveraged to reveal insights and build services for communities? This talk shows an open source project based on the City of Palo Alto “Open Data Platform”, demonstrating how to work with public GIS data available there for parks, roads, trees, etc.
Starting from the raw data, we review simple techniques for discovery and modeling, then use Cascalog, Hadoop, and R to structure the GIS export into data products and accompanying visualizations. The end result creates a data service for a mobile app: “Find a shady spot on a summer day in which to walk near downtown Palo Alto. While on a long conference call. Sipping a latte or enjoying some fro-yo.” Extensions to the app incorporate other data sources to provide insights for the community: for example, monitoring invasive vs. endangered species, or the proximity of toxin-producing species near day-care centers.
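To make the "shady spot" data product concrete, here is a minimal Python sketch of the idea, not the project's Cascalog code. It assumes the GIS export has already been reduced to tree records and road-segment midpoints; the field names (`lat`, `lon`, `height_m`), the 25 m radius, and the sample coordinates are all hypothetical. It uses a brute-force great-circle distance check, which is fine for a toy sample but would need a spatial index or grid-based join at scale.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def shade_score(segment, trees, radius_m=25.0):
    """Sum the heights of trees within radius_m of a segment midpoint.

    Tree height is a crude, assumed proxy for shade.
    """
    return sum(
        t["height_m"]
        for t in trees
        if haversine_m(segment["lat"], segment["lon"], t["lat"], t["lon"]) <= radius_m
    )

def rank_segments(segments, trees, radius_m=25.0):
    """Return segments ordered from most to least estimated shade."""
    return sorted(segments, key=lambda s: shade_score(s, trees, radius_m), reverse=True)

# Hypothetical sample records, not taken from the Palo Alto data set.
TREES = [
    {"lat": 37.44491, "lon": -122.16141, "height_m": 12.0},
    {"lat": 37.44489, "lon": -122.16139, "height_m": 8.0},
]
SEGMENTS = [
    {"name": "segment_b", "lat": 37.4500, "lon": -122.1700},
    {"name": "segment_a", "lat": 37.4449, "lon": -122.1614},
]

if __name__ == "__main__":
    for seg in rank_segments(SEGMENTS, TREES):
        print(seg["name"], shade_score(seg, TREES))
```

A mobile client would then query the ranked segments near the user's location rather than recomputing scores on the device.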
Cascalog is an open source project from Twitter authored by Nathan Marz, Sam Ritchie, et al., which integrates the Cascading open source API into the Clojure language. Contributions, sample apps, and case studies have been published by a number of organizations, including Climate Corp, REDD Metrics, YieldBot, Nokia Maps, Factual, and Harvard School of Public Health. This talk includes code and data, but also explores the process of approaching Open Data from the perspective of developing a data product, from start to finish. Cascalog functions are typically only a few lines long, so the code involved is brief and simple to grasp.
The talk also explores simple practices for test-driven development with large-scale data workflows based on unique features in Cascalog. We also touch on CS theory, going all the way back to the original “relational model” paper by Edgar Codd to discuss some of the unique properties of Cascalog. These aspects are useful for a wide range of data-driven apps.
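As a rough illustration of why a relational style lends itself to test-driven work, the sketch below writes selection, projection, and join as small pure functions and checks them against a tiny in-memory sample. It is plain Python rather than Cascalog, and the function names, field names, and records are hypothetical; it only shows the general practice of testing each step on a few tuples before running against the full data.

```python
def select(rows, pred):
    """Relational selection: keep the rows that satisfy pred."""
    return [r for r in rows if pred(r)]

def project(rows, fields):
    """Relational projection: keep only the named fields of each row."""
    return [{f: r[f] for f in fields} for r in rows]

def natural_join(left, right, key):
    """Join two lists of dict rows on a shared key field."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

# Hypothetical sample tuples standing in for a tree inventory and a species table.
TREE_ROWS = [
    {"tree_id": 1, "species": "species_a"},
    {"tree_id": 2, "species": "species_b"},
    {"tree_id": 3, "species": "species_a"},
]
SPECIES_ROWS = [
    {"species": "species_a", "toxic": True},
    {"species": "species_b", "toxic": False},
]

def toxic_tree_ids(trees, species):
    """Compose the three operators: join, then select, then project."""
    joined = natural_join(trees, species, "species")
    return project(select(joined, lambda r: r["toxic"]), ["tree_id"])

if __name__ == "__main__":
    # A unit-test style check on the small sample.
    assert toxic_tree_ids(TREE_ROWS, SPECIES_ROWS) == [{"tree_id": 1}, {"tree_id": 3}]
    print("ok")
```

Because each operator takes rows and returns rows, the same checks apply whether a step is tested alone or as part of a longer composed query.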
This example app project began as a seminar at CMU West, showing students examples of how to work with the Palo Alto Open Data initiative, plus how to leverage open source tools for Big Data. The intended audience needs some exposure to programming, but the focus is mostly on process: understanding how to approach large-scale data. This project is also used as a case study in “Enterprise Data Workflows with Cascading”, an upcoming O’Reilly book.
GitHub repo for the open source project (code + wiki): https://github.com/Cascading/CoPA
City of Palo Alto Open Data portal based on Junar (open data): http://www.cityofpaloalto.org/gov/depts/it/open_data/
Slides from Open Data Bay Area meetup (also 40-minute format): http://www.slideshare.net/pacoid/using-cascalog-to-build-an-app-based-on-city-of-palo-alto-open-data
Cascalog project: https://github.com/nathanmarz/cascalog/wiki
Director of Data Science at Concurrent in SF, and a committer on the Cascading open source project. 10+ years leading innovative Data teams, 25+ years in the tech industry overall. Background in math/stats and distributed computing. Expertise in Hadoop, R, AWS, predictive analytics, machine learning, and NLP. O’Reilly author: “Enterprise Data Workflows with Cascading”.