Elasticsearch: The Missing Tutorial

Erik Rose (Mozilla), Laura Thomson (Mozilla Corporation)
Cloud | Databases & Datastores
Portland 255
Tutorial. Please note: to attend, your registration must include Tutorials.
Average rating: 4.36 (33 ratings)

THIS TUTORIAL HAS REQUIREMENTS AND INSTRUCTIONS LISTED BELOW

In this tutorial, we focus on what’s missing from the documentation and do not assume you’re already a Lucene expert. Going behind the curtain, we explore the data structures used for indexing, the algorithms that make faceting so fast, and the tradeoffs involved in replication and sharding. From these fundamentals, you will be able to draw reasonable conclusions about how to make your own use cases efficient. Finally, we show how to avoid the mistakes we made, both in design and deployment, so you can build a stable cluster in days rather than months.

Intro

Elasticsearch started as a one-man show.

  • Scope of project
  • What we’ve used it for – Mozilla (AMO, SUMO, crash logs), Votizen

Docs look good at first, but you quickly realize there’s a lot missing.

Characteristics of ES as a datastore

  • Where in CAP?
  • What is it good for?
  • What is it bad at?
  • Why is it typically a secondary datastore?

Definition of terms

  • Shards, replicas, mapping, indexing, etc.

Basic data structure

Document IDs

Type-guessing

Mappings

Arrays

  • How they’re searched
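
The point about array searching can be illustrated with a toy model (plain Python, not Elasticsearch code). It assumes the commonly described behavior in which an array of objects is flattened at index time into one bag of values per field, so the pairing of values inside a single array element is lost unless the field is mapped as nested. The `flatten` helper and the sample document are made up for illustration.

```python
from collections import defaultdict


def flatten(doc, prefix=""):
    """Toy model: collapse nested objects and arrays into field -> set of values."""
    fields = defaultdict(set)
    for key, value in doc.items():
        path = prefix + key
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, dict):
                # Recurse into sub-objects, merging their values under a dotted path.
                for sub_path, sub_values in flatten(item, path + ".").items():
                    fields[sub_path] |= sub_values
            else:
                fields[path].add(item)
    return fields


doc = {
    "title": "ES tutorial",
    "comments": [
        {"author": "alice", "stars": 1},
        {"author": "bob", "stars": 5},
    ],
}
flat = flatten(doc)

# Both conditions hold on the flattened document, even though no single
# comment is by alice AND has 5 stars.
cross_match = "alice" in flat["comments.author"] and 5 in flat["comments.stars"]
print(dict(flat))
print("alice gave 5 stars?", cross_match)
```

This kind of false positive is one reason to consider a nested mapping or a separate document per element.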

Nesting and inter-document relationships

Querying

Filters vs. Queries

Filters are cached, so filter when you can.

Queries are more powerful: fuzzy stuff, scoring, etc.
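
As a rough sketch of combining the two, the request body below uses the `filtered` query form from the ES 1.x line this tutorial targets (later major versions replaced it). The scored part is a `match` query and the unscored, cacheable part is a `term` filter. The field names `body` and `status` and the helper `filtered_search` are assumptions for illustration only.

```python
import json


def filtered_search(text, status):
    """Build an ES 1.x-style body: scored match query plus unscored term filter."""
    return {
        "query": {
            "filtered": {
                # Query part: analyzed and scored.
                "query": {"match": {"body": text}},
                # Filter part: yes/no only, no scoring.
                "filter": {"term": {"status": status}},
            }
        }
    }


body = filtered_search("crash on startup", "open")
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent as the body of a search request, for example with curl.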

Term vs. match and why this will save you days of pain
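
One way to see the difference is a toy inverted index in plain Python (not Elasticsearch itself). It assumes a simple analyzer that splits on non-word characters and lowercases. A term lookup uses the value exactly as given, while a match lookup analyzes the input first. All names here are illustrative.

```python
import re
from collections import defaultdict


def analyze(text):
    """Stand-in for a standard-style analyzer: split on non-word chars, lowercase."""
    return [tok.lower() for tok in re.findall(r"\w+", text)]


index = defaultdict(set)  # term -> set of document ids
docs = {1: "Quick Brown Fox", 2: "Lazy dog"}
for doc_id, text in docs.items():
    for term in analyze(text):
        index[term].add(doc_id)


def term_query(value):
    """Look the value up exactly as given, with no analysis."""
    return index.get(value, set())


def match_query(text):
    """Analyze the input first, then look up each resulting term (OR semantics)."""
    hits = set()
    for term in analyze(text):
        hits |= index.get(term, set())
    return hits


print(term_query("Quick"))   # finds nothing: the index only holds "quick"
print(term_query("quick"))   # finds document 1
print(match_query("Quick"))  # finds document 1, because the input is analyzed
```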

Text phrase queries

Building glorious towers of boolean logic and why EC2 will make you sorry

Faceting

Scoring

  • Custom-scoring queries, e.g. weighting more complete user profiles
  • Boosts

Exercises (scattered throughout this section):

  • writing different types of queries
  • using explain to see how a query is processed

Mappings and Analysis

What a mapping is and what it is not

Exercise:

  • Load some documents (provided), look at the default mapping generated
  • Contains some things that ES handles poorly by default (dates, nesting, etc)

4-fold path to analysis

  • Char filter
  • Tokenizer
  • Token filter
  • Analyzer
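
The four pieces can be sketched as plain Python functions. This is a toy model of how the stages compose, not Elasticsearch's implementation; the stopword list and function names are made up for illustration.

```python
import re

STOPWORDS = {"the", "a", "of"}  # illustrative only, not ES's actual list


def strip_html(text):
    """Char filter: rewrite the raw character stream before tokenizing."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text):
    """Tokenizer: cut the character stream into tokens."""
    return text.split()


def lowercase(tokens):
    """Token filter: transform each token."""
    return [t.lower() for t in tokens]


def drop_stopwords(tokens):
    """Token filter: remove unwanted tokens."""
    return [t for t in tokens if t not in STOPWORDS]


def make_analyzer(char_filters, tokenizer, token_filters):
    """Analyzer: char filters, then one tokenizer, then token filters, in order."""
    def analyzer(text):
        for char_filter in char_filters:
            text = char_filter(text)
        tokens = tokenizer(text)
        for token_filter in token_filters:
            tokens = token_filter(tokens)
        return tokens
    return analyzer


analyze = make_analyzer([strip_html], whitespace_tokenizer, [lowercase, drop_stopwords])
print(analyze("<b>The</b> Missing Tutorial"))
```

Order matters: lowercasing runs before stopword removal here so that "The" is caught by a lowercase stopword list.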

Parallels with DB indexing

What kinds of analyses are there?

  • All the standard stuff
    • Stopwords
    • Stemming
    • Ngrams
  • Various field types
  • Multi fields

Choosing appropriate analysis: what kinds speed which queries?

Common cases

  • Email addresses
  • Usernames
  • Street addresses
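
For the first two cases, one possible setup (an assumption, not necessarily what the speakers recommend) is a custom analyzer built on the `uax_url_email` tokenizer for email addresses and on the `keyword` tokenizer for usernames, each followed by the `lowercase` token filter. The body below uses ES 1.x mapping syntax; the `user` type, field names, and analyzer names are hypothetical.

```python
import json

index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                # Keep each email address as a single lowercased token.
                "email_analyzer": {
                    "type": "custom",
                    "tokenizer": "uax_url_email",
                    "filter": ["lowercase"],
                },
                # Treat the whole username as one case-insensitive token.
                "username_analyzer": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase"],
                },
            }
        }
    },
    "mappings": {
        "user": {  # hypothetical document type
            "properties": {
                "email": {"type": "string", "analyzer": "email_analyzer"},
                "username": {"type": "string", "analyzer": "username_analyzer"},
            }
        }
    },
}
print(json.dumps(index_body, indent=2))
```

The printed JSON could be used as the body when creating an index, and the analyzers can then be checked with the _analyze API.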

Exercises:

  • Testing analyzers with the _analyze API
  • Write a mapping, including appropriate analyzers, to improve upon the default one above

Multi-language support

Query analyzers (vs. index analyzers)

Shrinking your index

  • What’s the point?
    • Is every part of your index equally hot?
    • Is your index bigger than RAM?
    • How’s your I/O speed?
  • Compression
  • `_source`: to store or not to store?

An example ES integration

How to index

  • Bulk indexing
    • How to monitor progress (yield)
  • Exercise in bulk indexing
  • post-save hooks for updating
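
A minimal sketch of the bulk-plus-yield idea: a generator that builds newline-delimited bulk payloads in chunks and yields a running count, so the caller can report progress between requests. The helper name `bulk_chunks`, the index and type names, and the `id` field are assumptions; the actual HTTP call is left out.

```python
import json


def bulk_chunks(docs, index, doc_type, chunk_size=500):
    """Yield (count_so_far, body) pairs; each body is one bulk request payload."""
    lines, sent = [], 0
    for doc in docs:
        # Each document takes two lines: an action line, then the source.
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc["id"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
        sent += 1
        if sent % chunk_size == 0:
            yield sent, "\n".join(lines) + "\n"
            lines = []
    if lines:
        # Flush the final, possibly smaller, chunk.
        yield sent, "\n".join(lines) + "\n"


docs = ({"id": i, "title": "doc %d" % i} for i in range(5))
for sent, body in bulk_chunks(docs, "articles", "article", chunk_size=2):
    # A real integration would POST `body` to the _bulk endpoint here.
    print("prepared", sent, "documents so far")
```

Because the input is consumed lazily, the whole corpus never has to sit in memory at once.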

Libraries: what to use in some popular languages (Python, PHP, Ruby)

  • some have query builders, some are more bare metal

What to do with ES query results

  • To hit the primary DB, or not to hit it?

Fancy/advanced features (not covered in depth, and may be omitted for time, but have slides)

Synonyms

Suggesters

Autocompletion – via prefixing, via autocomplete suggester (beta)

Percolation

Deployment and Administration

Don’t trust new versions too readily. It moves fast but furiously.

Give it big RAM up front.

All those lovely Java tuning switches: not necessary

Use an up-to-date JVM and a modern OS. Difference between life and death.

Clustering

  • How do replicas and shards relate?
  • Having more shards speeds a single query. Having more replicas speeds multiple-query throughput.
  • Pitfalls: ES makes friends too easily; protect with firewalls, turn off multicast
  • Have enough nodes to have all shards mirrored
  • Adding nodes with no downtime
    • Telling NewNode about OldNode is enough. OldNode will make friends. Then edit OldNode’s config later and restart it.
  • ES really likes to have a cluster. You’re not feeling lucky.
  • What split-brain is and how to avoid it
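
One common guard against split-brain in ES 1.x is to require a strict majority of master-eligible nodes before a master can be elected, via the `discovery.zen.minimum_master_nodes` setting. The arithmetic is sketched below; the function name is made up, and whether the tutorial recommends exactly this approach is an assumption.

```python
def minimum_master_nodes(master_eligible):
    """Strict majority: two disjoint partitions can never both reach it."""
    return master_eligible // 2 + 1


for n in (1, 2, 3, 5):
    print(n, "master-eligible nodes -> quorum of", minimum_master_nodes(n))
```

With only two master-eligible nodes the quorum is also two, so losing either node leaves the cluster without a master, which is one reason three is often suggested as a practical minimum.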

Monitoring

  • What to monitor
  • Tools like paramedic, BigDesk, Marvel
  • How to monitor (status API)
    • Exercise: looking at status API

Sharding tradeoffs

  • Watch for large Java heaps
  • Too few means too many GC pauses.
  • I/O on EC2 sucks.
  • It’s not single-thread-per-shard or anything.

Deploying new mappings and synonyms without moving files around

Planning for the future

Changing mappings

Mergeable and unmergeable changes

Reindexing

ES isn’t a good primary store, in most cases, because of the brittleness of mappings.

However, the update API exists, and versioning dodges race conditions.
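
The versioning idea is optimistic concurrency control: a write names the version it was based on and is rejected if the document has changed since. The toy in-memory store below imitates that behavior in plain Python; the class and method names are hypothetical and this is not the Elasticsearch client API.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version."""


class ToyStore:
    """In-memory stand-in for a versioned document store."""

    def __init__(self):
        self._docs = {}  # id -> (version, source)

    def get(self, doc_id):
        return self._docs[doc_id]

    def index(self, doc_id, source, version=None):
        current = self._docs.get(doc_id, (0, None))[0]
        # Reject the write if the caller's expected version is out of date.
        if version is not None and version != current:
            raise VersionConflict("expected %s, found %s" % (version, current))
        self._docs[doc_id] = (current + 1, source)
        return current + 1


store = ToyStore()
store.index("42", {"votes": 0})

# Two writers read the same version, then both try to write.
v_a, doc_a = store.get("42")
v_b, doc_b = store.get("42")
store.index("42", {"votes": doc_a["votes"] + 1}, version=v_a)
try:
    store.index("42", {"votes": doc_b["votes"] + 1}, version=v_b)
except VersionConflict as exc:
    print("second writer must re-read and retry:", exc)
```

Without the version check, the second write would silently overwrite the first and one vote would be lost.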

TUTORIAL REQUIREMENTS AND INSTRUCTIONS FOR ATTENDEES

* Attendees need to install ES 1.2.x, and have a way to send HTTP POST requests, e.g. curl.

Erik Rose

Mozilla

Erik Rose leads Mozilla’s DXR project, which does regex searches and static analysis on large codebases like Firefox. He is an Elasticsearch veteran, maintaining the pyelasticsearch library, transitioning Mozilla’s support knowledgebase to ES, and building a burly cluster to do realtime fuzzy matching against the entire corpus of U.S. voters. Erik is a frequent speaker at conferences around the world, the author of “Plone 3 for Education”, and the nexus of the kind of animal magnetism that comes only from writing your own bio.

Laura Thomson

Mozilla Corporation

Laura Thomson is a Senior Engineering Manager at Mozilla Corporation. She works with the Web Engineering team, which is responsible for the Firefox crash reporting system and other developer tools, and the Release Engineering team, which is responsible for shipping Firefox.

Laura is the co-author of “PHP and MySQL Web Development” and “MySQL Tutorial”. She is a veteran speaker at Open Source conferences worldwide.

Comments

Bernt Rostad
06/22/2014 10:55pm PDT

Thank you, I’m looking forward to the tutorial :)

Sophia DeMartini
06/20/2014 9:26am PDT

Hi Erik – it should be updated above now; please let me know if you see anywhere else it needs to be changed.

Thanks,
Sophia

Erik Rose
06/20/2014 9:08am PDT

Thanks, Josh! I still see 0.90 at http://www.oscon.com/oscon2014/public/schedule/detail/34677#requirements, but I expect it’s due to caching and will take care of itself.

Josh Simmons
06/20/2014 9:02am PDT

Just saw these comments come through – went ahead and updated the requirements to say 1.2.x rather than 0.90.x. LMK if there’s anything else I can do to help!

Cheers,
Josh

Erik Rose
06/20/2014 8:51am PDT

Not only is it okay, but version 1.2 is in fact going to be required (though most exercises should probably work with older versions). I asked the OSCON staff to update this page awhile ago; hopefully they’ll get to it soon. Thanks for asking!

Bernt Rostad
06/19/2014 10:37pm PDT

Hi,

The tutorial requirements I received on email this morning said “Attendees need to install ES 0.90.x”. I’ve got ES 1.2.0 running on my laptop, will I have to downgrade to 0.90.x or is it OK to use this newer version of Elasticsearch?

Cheers,
bernt