Wikipedia contains a wealth of collective knowledge, but its semi-structured design and idiosyncratic markup make mining it a formidable challenge. This session will examine techniques for extracting explicit facts from semantically weak data sources.
The session will use WEX, a preprocessed normalization of Wikipedia designed to make this corpus easily accessible to developers interested in machine learning, natural language processing, or knowledge extraction. We will first discuss how WEX is prepared, as a guide to creating mineable structures from semi-structured data, and then cover approaches to machine extraction on structures of mixed data quality.
The session is targeted at intermediate developers with an interest in machine learning or knowledge extraction (though no experience is assumed with either).
The demonstrations use the XPath support in Postgres 8.3 to simplify the programming model, and the examples are presented in Python, but the data and principles carry over to any modern data infrastructure.
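As a rough illustration of the XPath-style approach, the sketch below pulls simple (subject, property, value) facts out of an XML rendering of an article using only the Python standard library. The element names, attribute names, and sample values are invented for this example and are not the actual WEX schema; in the session's setup a comparable path expression would presumably be evaluated inside the database with Postgres's `xpath()` function rather than in application code.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup: the tags, attributes, and values here are
# assumptions made for illustration, not the real WEX format.
ARTICLE_XML = """
<article title="Example City">
  <template name="Infobox Settlement">
    <param name="population_total">12345</param>
    <param name="area_total"></param>
    <param name="subdivision_name">Example State</param>
  </template>
  <paragraph>
    <sentence>Example City is a city in
      <link target="Example State">Example State</link>.</sentence>
  </paragraph>
</article>
"""


def extract_facts(xml_text):
    """Return (subject, property, value) triples from template parameters."""
    root = ET.fromstring(xml_text)
    subject = root.get("title")
    facts = []
    # Path expression: every <param> directly under any <template>.
    for param in root.findall(".//template/param"):
        value = (param.text or "").strip()
        if value:  # skip empty parameters, a common data-quality gap
            facts.append((subject, param.get("name"), value))
    return facts


def extract_links(xml_text):
    """Return the targets of all links that carry a target attribute."""
    root = ET.fromstring(xml_text)
    return [link.get("target") for link in root.findall(".//link[@target]")]


if __name__ == "__main__":
    for fact in extract_facts(ARTICLE_XML):
        print(fact)
    print(extract_links(ARTICLE_XML))
```

Template parameters tend to be the most explicit facts in this kind of markup, while links embedded in sentences are weaker evidence that usually needs further processing, which is why the two are kept as separate functions here.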
While developing an Internet laboratory for studying economic equilibria, Jamie started one of the first ISPs in San Francisco so he could get a better connection at home. He finally got a real job as CTO at DETERMINE Software (now part of Selectica), helping create order in the unstructured world of enterprise contract management. He is now helping to organize the world’s structured information at Metaweb, where he oversees data operations.
Colin fights information entropy on a daily basis using a wide arsenal of machine learning and semantic analytic techniques. The results of his efforts appear as millions of assertions in Freebase. Prior to joining Metaweb, Colin helped users organize their world through his work on the IRIS semantic desktop project at SRI.
Toby Segaran is the author of the O’Reilly title “Programming Collective Intelligence”, Amazon’s top-selling AI book, and the Director of Software Development at Genstruct, a biotechnology company. He loves applying data-mining algorithms to everything from pharmaceutical trials to the Technorati Top 100.