
MongoDB's New Matcher


MongoDB 2.5.0 (an unstable dev build) has a new implementation of the “Matcher”. The old Matcher is the bit of code in Mongo that takes a query and decides whether a document matches a query expression. It also has to understand indexes so that it can do things like create subsets of queries suitable for index covering. However, the structure of the Matcher code hasn’t changed significantly in more than four years, and until this release it could not be easily extended. It was also structured in such a way that its knowledge could not be reused for query optimization. It was clearly ready for a rewrite.

The “New Matcher” in 2.5.0 is a total rewrite. It contains three separate pieces: an abstract syntax tree (hereafter ‘AST’) for match expressions, a parser from BSON into said AST, and a Matcher API layer that simulates the old Matcher interface while using all-new internals. This new version is much easier to extend, easier to reason about, and will allow us to use the same structure for matching as for query analysis and rewriting.

This Matcher rewrite is part of a larger project to restructure query execution, optimize queries, and lay the groundwork for more advanced queries in the future. One planned optimization is index intersection. For example, if you have an index on each of the ‘a’ and ‘b’ attributes, we want a query of the form { a : 5 , b : 6 } to do an index intersection of the two indexes rather than use just one index and discard the documents from that index that don’t match. Index intersection would also be suitable for merging geospatial, text, and regular indexes together in fun and interesting ways (i.e. a query to return all the users in a 3.5 mile radius of a location with a greater than #x# reputation who are RSVP’ed ‘yes’ for an event).
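
As a rough illustration, here is what index intersection would enable from the mongo shell (a hypothetical session; the collection name is made up):

db.users.ensureIndex({ a : 1 })
db.users.ensureIndex({ b : 1 })

// With index intersection, this query could use both single-field indexes,
// rather than scanning one index and filtering out documents that don't match the other predicate.
db.users.find({ a : 5, b : 6 })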

A good example of an extension we’d like to enable is self referential queries, such as finding all documents where a = b + c. (This would be written { a : { $sum : [ “$b” , “$c” ] } }). With the new Matcher, such queries are easy to implement as a native part of the language.
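
To make this concrete, here is how such a query might look from the mongo shell, assuming the proposed $sum syntax lands as described (a sketch of a planned feature, not something 2.5.0 supports today):

// Find documents where the 'a' field equals the sum of the 'b' and 'c' fields.
db.collection.find({ a : { $sum : [ "$b", "$c" ] } })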

Now that the Matcher re-write is ready for testing, we’d love people to help test it by trying out MongoDB 2.5.0. (Release Notes)

Code

By Eliot Horowitz, 10gen CTO and MongoDB core contributor. You can find the original post on his personal blog.


Go Agent, Go


10gen introduced MongoDB Backup Service in early May. Creating a backup service for MongoDB was a new challenge, and we used the opportunity to explore new technologies for our stack. The final implementation of the MongoDB Backup Service agent is written in Go, an open-source, natively compiled language initiated and maintained by Google.

Why did we Go with Go?

The Backup Service started as a Java project, but as the project matured, the team wanted to move to a language that compiled natively on the machine. After considering a few options, the team decided that Go was the best fit for its C-like syntax, strong standard library, the resolution of concurrency problems via goroutines, and painless multi-platform distribution.

mgo

As an open-source company, 10gen is fortunate to work with MongoDB developers around the world who build open-source tools for new and emerging languages to provide users with a breadth of options to access MongoDB. One of the MongoDB Masters, Gustavo Niemeyer, has spent over two years building mgo, the MongoDB driver for Go. In that time he’s developed a great framework for accessing MongoDB through Go and Gustavo has been a valuable resource as we’ve built out the Backup Service. In his own words:

“It’s great to see not only 10gen making good use of the Go language for first-class services, but contributing to that community of developers by providing its support for the development of the Go driver in multiple ways.”

Programming the backup agent in Go and the mgo driver has been extremely satisfying. Between the lightweight syntax, the first-class concurrency and the well documented, idiomatic libraries such as mgo, Go is a great choice for writing anything from small scripts to large distributed applications.
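
For flavor, here is a minimal sketch of what talking to MongoDB through mgo looks like (the database, collection, and struct are invented for this example; the import path reflects the 2013-era releases):

package main

import (
	"fmt"
	"log"

	"labix.org/v2/mgo"
	"labix.org/v2/mgo/bson"
)

type Backup struct {
	Name string
	Size int64
}

func main() {
	// Connect to a local mongod and work with the test.backups collection.
	session, err := mgo.Dial("localhost")
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer session.Close()

	backups := session.DB("test").C("backups")
	if err := backups.Insert(&Backup{Name: "nightly", Size: 1024}); err != nil {
		log.Fatalf("insert: %v", err)
	}

	var result Backup
	if err := backups.Find(bson.M{"name": "nightly"}).One(&result); err != nil {
		log.Fatalf("find: %v", err)
	}
	fmt.Println(result.Name, result.Size)
}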

Starting a Java project often begins with a group debate: “Maven or Ant? JUnit or TestNG? Spring or Guice?” Go has a number of conventions through which the Go team has created a sensible, uniform development experience with the holy trinity of tools: go build, go test, and go fmt.

  • The organization of source code and libraries is standardized to allow using the go build tool. See details here

  • Test files named XXX_test.go containing functions named TestXXX are run automatically with go test (see the sketch after this list)

  • Braces are required on if statements, and the opening brace goes on the same line as the if condition. E.g.

if x {
     doSomething()
}

instead of:

if x 
{
    doSomething()
}
  • Functions whose names end with an f (e.g. Printf, Fatalf) are string-formatting functions; go vet checks that the number of format verbs (e.g. %v) matches the number of arguments passed to the function.
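
A minimal sketch tying these conventions together (the package and file names are invented for illustration):

// greeter_test.go -- run automatically by "go test" because of the _test.go
// suffix and the TestXXX function name.
package greeter

import (
	"fmt"
	"testing"
)

func greet(name string) string {
	return fmt.Sprintf("Hello, %s", name) // go vet verifies %s has a matching argument
}

func TestGreet(t *testing.T) {
	if got := greet("Go"); got != "Hello, Go" {
		t.Fatalf("unexpected greeting: %q", got)
	}
}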

mgo is a real pleasure to use with high-quality code, thorough documentation and an API that is a thoughtful, natural blend of idiomatic Go and MongoDB. Our team owes a lot of thanks to Gustavo for his hard work on this project.

There are other Go projects being explored at the moment and we hope to see more people using mgo in production going forward.

By the 10gen Backup Team

Integrating MongoDB Text Search with a Python App


By Mike O’Brien, 10gen Software engineer and maintainer of Mongo-Hadoop

With the release of MongoDB 2.4, it’s now pretty simple to take an existing application that already uses MongoDB and add new features that take advantage of text search. Prior to 2.4, adding text search to a MongoDB app would have required writing code to interface with another system like Solr, Lucene, ElasticSearch, or something else. Now that it’s integrated with the database we are already using, we can accomplish the same result with reduced complexity, and fewer moving parts in the deployment.

Here we’ll go through a practical example of adding text search to Planet MongoDB, our blog aggregator site.

Planet MongoDB is built in Python, uses the excellent Flask web framework, and stores feed content in a collection called posts. We’ll add some code that enables us to search over posts for any keyword terms we want. As you’ll see, the amount of code and configuration that needs to be added to accomplish this is quite small.

Initial Setup

Before you can actually use any text search features, you have to explicitly enable them. You can do this by restarting mongod with the additional command line option --setParameter textSearchEnabled=true, or at runtime from the mongo shell by running db.adminCommand({setParameter: 1, textSearchEnabled: true}). Since you’re hopefully developing and testing on a different deployment than you use for production, don’t forget to do this on both.
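
For reference, those two options look like this:

# at startup:
mongod --setParameter textSearchEnabled=true

# or at runtime, from the mongo shell:
db.adminCommand({ setParameter: 1, textSearchEnabled: true })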

Creating Indexes

The next critical step is to create the text search index on the field you want to make searchable. In our case, we want our searches to be able to find hits in the article titles as well as the content. However, since the article titles are more prominent, we want to consider matches in the title to rank a bit higher overall in the search than matches in the content body. We can do this by setting weights on the fields.

To do this, we’ll add a line of python code to the application that is executed upon startup which creates the index we need, if it doesn’t already exist:

db.posts.ensure_index([
      ('body', 'text'),
      ('title', 'text'),
  ],
  name="search_index",
  weights={
      'title':100,
      'body':25
  }
)

Running searches

At this point, we now have a collection of data, and we’ve created a text index that can be used to do searches on arbitrary keywords. We just need to write some code that will actually run searches and render the results.

Unlike regular MongoDB queries, text search is implemented as a special command that returns a document containing a ‘results’ field, an array of the highest-scoring documents that matched. To use it, run the command with the additional field search which contains the keywords to match against. To use this in the app, we just grab the request parameter containing what the user typed into the search box and pass it as an argument to the text search command, and then render a page containing the search results.

@app.route('/search')
def search():
    query = request.args.get('q', '')
    text_results = db.command('text', 'posts', search=query, limit=SEARCH_LIMIT)
    doc_matches = (res['obj'] for res in text_results['results'])
    return render_template("search.html", results=doc_matches)

Filtering

In addition to finding docs that match text queries, you may want to filter the result set even further based on other criteria and fields in the documents. To do this, add a filter field to the text search command containing the additional filtering logic, in the exact same style as a regular find() query. In this case, we want to restrict the results to only the blog posts that are related to MongoDB, which is determined by a field in the posts called related. Modifying the call to db.command to include this, we get:

text_results = db.command('text', 'posts', search=query, filter={'related':True}, limit=SEARCH_LIMIT)

Pagination

In practice, most applications want to just show a few results on a page at a time, and then provide some kind of “previous/next” links to navigate through multiple pages of matches. We can tweak the existing code to accomplish this too, by adding a parameter page to indicate where we are in the results, and rendering 10 results at a time.

So now, we’ll parse out the page param and slice out the necessary items from the array returned in results, using an additional arg limit to return only as many documents as needed. On the results page, we can then just generate a link to the next page of results by constructing the same search link but incrementing page in the Jinja template.

PAGE_SIZE = 10
try:
    page = int(request.args.get("page", 0))
except ValueError:
    page = 0

start = page * PAGE_SIZE
end = (page + 1) * PAGE_SIZE
text_results = db.command('text', 'posts', search=query, filter={'related': True}, limit=end)
doc_matches = [res['obj'] for res in text_results['results'][start:end]]
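
On the results page, the “next” link can then reuse the same search URL with an incremented page. A hypothetical snippet from search.html, assuming query and page are passed to the template:

{# search.html (sketch): link to the next page of results #}
<a href="{{ url_for('search', q=query, page=page + 1) }}">Next page</a>
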
Wrap-up

The rest of the work to finish up is all on the user-interface side. We add a form with a single input element for the user to type the query into, write the code to display the posts returned by the text search command, and it’s up and running. Although it was quick and easy to add a functional text-search feature to the app, this only scratches the surface of how it all works. To learn more, refer to the docs on text search.

The MEAN Stack: Mistakes You're Probably Making With MongooseJS, And How To Fix Them


This is a guest post from Valeri Karpov, a MongoDB Hacker and co-founder of the Ascot Project.  For more MEAN Stack wisdom, check out his blog at TheCodeBarbarian.com.  Valeri originally coined the term MEAN Stack while writing for the MongoDB blog, and you can find that post here.

If you’re familiar with Ruby on Rails and are using MongoDB to build a NodeJS app, you might miss some slick ActiveRecord features, such as declarative validation. Diving into most of the basic tutorials out there, you’ll find that many basic web development tasks are more work than you’d like. For example, if we borrow the style of http://howtonode.org/express-mongodb, a route that pulls a document by its ID will look something like this:

app.get('/document/:id', function(req, res) { 
  db.collection('documents', function(error, collection) {
    collection.findOne({ _id : collection.db.bson_serializer.ObjectID.createFromHexString(req.params.id) },
        function(error, document) {
          if (error || !document) {
            res.render('error', {});
          } else {

            res.render('document', { document : document });
          }
        });
  });
});

In my last guest post on the MongoDB blog, I touched on MongooseJS, a schema and usability wrapper for MongoDB in NodeJS. MongooseJS was developed by LearnBoost, an education startup based in San Francisco, and is maintained by 10gen. MongooseJS lets us take advantage of MongoDB’s flexibility and performance benefits while using development paradigms similar to Ruby on Rails and ActiveRecord. In this post, I’ll go into more detail about how The Ascot Project uses Mongoose for our data, some best practices we’ve learned, and some pitfalls we’ve found that aren’t clearly documented.

Before we dive into the details of working with Mongoose, let’s take a second to define the primary objects that we will be using. Loosely speaking, Mongoose’s schema setup is defined by 3 types: Schema, Connection, and Model.

  • A Schema is an object that defines the structure of any documents that will be stored in your MongoDB collection; it enables you to define types and validators for all of your data items.

  • A Connection is a fairly standard wrapper around a database connection.

  • A Model is an object that gives you easy access to a named collection, allowing you to query the collection and use the Schema to validate any documents you save to that collection. It is created by combining a Schema, a Connection, and a collection name.

  • Finally, a Document is an instantiation of a Model that is tied to a specific document in your collection.

Okay, now we can jump into the dirty details of MongooseJS. Most MongooseJS apps will start something like this:

var Mongoose = require('mongoose');
var myConnection = Mongoose.createConnection('localhost', 'mydatabase');

var MySchema = new Mongoose.Schema({
  name : {
    type : String,
    default : 'Val',
    enum : ['Val', 'Valeri', 'Valeri Karpov']
  },
  created : {
    type : Date,
    default : Date.now
  }
});
var MyModel = myConnection.model('mycollection', MySchema);
var myDocument = new MyModel({});

What makes this code so magical? There are 4 primary advantages that Mongoose has over the default MongoDB wrapper:

1. MongoDB uses named collections of arbitrary objects, and a Mongoose JS Model abstracts away this layer. Because of this, we don’t have to deal with tasks such as asynchronously telling MongoDB to switch to that collection, or work with the annoying createFromHexString function. For example, in the above code, loading and displaying a document would look more like:

app.get('/document/:id', function(req, res) {

  Document.findOne({ _id : req.params.id }, function(error, document) {
    if (error || !document) {
      res.render('error', {});
    } else {

      res.render('document', { document : document });
    }
  });
});

2. Mongoose Models handle the grunt work of setting default values and validating data. In the above example myDocument.name = ‘Val’, and if we try to save with a name that’s not in the provided enum, Mongoose will give us back a nice error. If you want to learn a bit more about the cool things you can do with Mongoose validation, you can check out my blog post on how to integrate Mongoose validation with [AngularJS](http://thecodebarbarian.wordpress.com/2013/05/12/how-to-easily-validate-any-form-ever-using-angularjs/).
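
A quick sketch of what that looks like in practice (the exact shape of the error object varies between Mongoose versions):

var badDocument = new MyModel({ name : 'Bob' }); // 'Bob' is not in the enum
badDocument.save(function(error) {
  // error is set, and error.errors.name describes the failed enum validation
  console.log(error.errors.name.message);
});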

3. Mongoose lets us attach functions to our models:

MySchema.methods.greet = function() { return 'Hello, ' + this.name; }; 
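
Calling the method on a document instance then looks like this (assuming the method was attached before the model was compiled):

console.log(myDocument.greet()); // 'Hello, Val', since name defaulted to 'Val'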


4. Mongoose handles limited sub-document population using manual references (i.e. no MongoDB DBRefs), which gives us the ability to mimic a familiar SQL join. For example:

var UserGroupSchema = new Mongoose.Schema({
  users : [{ type : Mongoose.Schema.ObjectId, ref : 'mycollection' }]
});

var UserGroup = myConnection.model('usergroups', UserGroupSchema);
var group = new UserGroup({ users : [myDocument._id] });
group.save(function() {
  UserGroup.find().populate('users').exec(function(error, groups) {
    // groups contains every document in usergroups, with the users field populated
    console.log(groups[0].users[0].name); // Prints 'Val'
  });
});

In the last few months, my team and I have learned a great deal about working with Mongoose and using it to open up the true power of MongoDB. Like most powerful tools, it can be used well and it can be used poorly, and unfortunately a lot of the examples you can find online fall into the latter. Through trial and error over the course of Ascot’s development, my team has settled on some key principles for using Mongoose the right way:

1 Schema = 1 file

A schema should never be declared in app.js, and you should never have multiple schemas in a single file (even if you intend to nest one schema in another). While it is often expedient to inline everything into app.js, keeping schemas in separate files lowers the barrier to entry for understanding your code base and makes tracking changes much easier.
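
For example, each schema can live in its own module and be exported wherever it’s needed (the file layout here is just for illustration):

// models/User.js
var Mongoose = require('mongoose');

exports.UserSchema = new Mongoose.Schema({
  username : { type : String }
});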

Mongoose can’t handle multi-level population yet, and populated fields are not Documents. Nesting schemas is helpful but it’s an incomplete solution. Design your schemas accordingly.

Let’s say we have a few interconnected Models:

var ImageSchema = new Mongoose.Schema({
  url : { type : String},

  created : { type : Date, default : Date.now }
});

var Image = db.model('images', ImageSchema);

var UserSchema = new Mongoose.Schema({

  username : { type : String },

  image : { type : Mongoose.Schema.ObjectId, ref : 'images' }
});


UserSchema.methods.greet = function() {
  return 'Hello, ' + this.name;
};

var User = db.model('users', UserSchema);

var Group = new Mongoose.Schema({

  users : [{ type : Mongoose.Schema.ObjectId, ref : 'users' }]
});

Our Group Model contains a list of Users, which in turn each have a reference to an Image. Can MongooseJS resolve these references for us? The answer, it turns out, is yes and no.

Group.
  find({}).
  populate('users').
  populate('users.image').
  exec(function(error, groups) {
    groups[0].users[0].username; // OK

    groups[0].users[0].greet(); // ERROR - greet is undefined

    groups[0].users[0].image; // Is still an ObjectId, doesn't get populated
    groups[0].users[0].image.created; // Undefined
  });

In other words, you can call ‘populate’ to easily resolve an ObjectID into the associated object, but you can’t call ‘populate’ to resolve an ObjectID that’s contained in that object. Furthermore, since the populated object is not technically a Document, you can’t call any functions you attached to the schema. Although this is definitely a severe limitation, it can often be avoided by the use of nested schemas. For example, we can define our UserSchema like this:

var UserSchema = new Mongoose.Schema({
  username : { type : String },

  image : [ImageSchema]
});

In this case, we don’t have to call ‘populate’ to resolve the image. Instead, we can do this:

Group.
  find({}).
  populate('users').
  exec(function(error, groups) {
    groups[0].users[0].image.created; // Date associated with image
  });

However, nested schemas don’t solve all of our problems, because we still don’t have a good way to handle many-to-many relationships. Nested schemas are an excellent solution for cases where the nested document can only exist as part of exactly one parent document. In the above example, we implicitly assume that a single image belongs to exactly one user; no other user can reference the exact same image object.

For instance, we shouldn’t have UserSchema as a nested schema of Group’s schema, because a User can be a part of multiple Groups, and thus we’d have to store separate copies of a single User object in multiple Groups. Furthermore, a User ought to be able to exist in our database without being part of any groups.

Declare your models exactly once and use dependency injection; never declare them in a routes file.

This is best expressed in an example:

// GOOD
exports.listUsers = function(User) {
  return function(req, res) {
    User.find({}, function(error, users) {
      res.render('list_users', { users : users });
    });
  }
};
// BAD

var db = Mongoose.createConnection('localhost', 'database');
var Schema = require('../models/User.js').UserSchema;

var User = db.model('users', Schema);


exports.listUsers = function(req, res) {
  User.find({}, function(error, users) {
    res.render('list_users', { users : users });
  });
};

The biggest problem with the "bad" version of listUsers shown above is that if you declare your model at the top of this particular file, you have to define it in every file where you use the User model. This leads to a lot of error-prone find-and-replace work for you, the programmer, whenever you want to do something like rename the Schema or change the collection name that underlies the User model.

Early in Ascot’s development we made this mistake with a single file, and ended up with a particularly annoying bug when we changed our MongoDB password several months later. The proper way to do this is to declare your Models exactly once, include them in your app.js, and pass them to your routes as necessary.
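
A sketch of what that wiring might look like in app.js (the file names and route path are illustrative):

// app.js (sketch)
var express = require('express');
var Mongoose = require('mongoose');

var app = express();
var db = Mongoose.createConnection('localhost', 'database');

// Declare the model exactly once...
var User = db.model('users', require('./models/User.js').UserSchema);

// ...and inject it into the route handler from the "good" example above.
app.get('/users', require('./routes/user.js').listUsers(User));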

In addition, note that the "bad" listUsers is impossible to unit test. The User schema in the "bad" example is inaccessible through calls to require, so we can’t mock it out for testing. In the "good" example, we can write a test easily using Nodeunit:

var UserRoutes = require('./routes/user.js');

// A minimal mock of the User model (not shown in the original post): find()
// simply hands back the in-memory collection, so the route runs without a database.
var mockUser = {
  collection : [],
  find : function(query, callback) { callback(null, this.collection); }
};

exports.testListUsers = function(test) {
  mockUser.collection = [{ name : 'Val' }];

  var fnToTest = UserRoutes.listUsers(mockUser);
  fnToTest({},
    { render : function(view, params) {
        test.equals(mockUser.collection, params.users);
        test.done();
      }
    });
};

And speaking of Nodeunit:

Unit tests catch mistakes, encourage you to write modular code, and allow you to easily make sure your logic works. They are your friend.

I’ll be the first to say that writing unit tests can be very annoying. Some tests can seem trivial, they don’t necessarily catch all bugs, and often you write way more test code than actual production code. However, a good suite of tests can save you a lot of worry; you can make changes and then quickly verify that you haven’t broken any of your modules. Ascot Project currently uses Nodeunit for our backend unit tests; Nodeunit is simple, flexible, and works well for us.

And there you have it! Mongoose is an excellent library, and if you’re using MongoDB and NodeJS, you should definitely consider using it. It will save you from writing a lot of extra code, it’ll handle some basic population, and it’ll handle all your validation and object creation grunt work. This adds up to more time spent building awesome stuff, and less time trying to figure out how to get your database interface to work.

Have any questions about the code featured in this post? Want to suggest a better approach? Feel like telling me why the MEAN Stack is the worst thing that ever happened in the history of the world and how horrible I am? Go ahead and leave a comment below, or shoot me an email at valkar207@gmail.com and I’ll do my best to answer any questions you might have. You can also find me on github at https://github.com/vkarpov15. My current venture is called The Ascot Project, and you can find that over at www.AscotProject.com.

Ruby, Rails, MongoDB and the Object-Relational Mismatch


by Emily Stolfo, Ruby Engineer and Evangelist at 10gen

MongoDB is a popular choice among developers in part because it permits a one-to-one mapping between object-oriented (OO) software objects and database entities. Ruby developers are at a great advantage in using MongoDB because they are already used to working with and designing software that is purely object-oriented.

Most of the discussions I’ve had about MongoDB and Ruby assume Ruby knowledge and explain why MongoDB is a good fit for the Rubyist. This post will do the opposite; I’m going to assume you know a few things about MongoDB but not much about Ruby. In showing the Rubyist’s OO advantage, I’ll share a bit about the Ruby programming language and its popularity, explain specifically how the majority of Ruby developers are using MongoDB, and then talk about the future of the 10gen Ruby driver in the context of the Rails community.

Ruby who?

The Ruby programming language was created 20 years ago by Yukihiro Matsumoto, “Matz” to the Ruby community. The language, although not made immensely popular until the introduction of the Rails web framework many years later, is perhaps best known for its founding philosophy. Matz has made numerous statements saying that he strived to create a language that follows principles of good user interface design. Namely, Ruby is intended to make the developer experience more pleasant and to facilitate programmer productivity. Matz has said that he wanted to combine the flexibility of Perl with the object-orientation of Smalltalk. The result is an elegant, flexible, and practical language that is indeed a pleasure to use.

Ruby is a purely object-oriented language. This means that while other languages have primitive types for programming “atoms” such as integers, booleans, and null, Ruby represents even these as full-fledged objects. Classes, once instantiated, are objects that have properties (instance variables) and performable actions (methods). Even classes themselves are instances of the class Class. Let’s look at an example in the Ruby interpreter:

> 2.object_id
 => 5 
> false.object_id
 => 0 
> true.object_id
 => 20
> nil.object_id
 => 8

As you can see, even integers, booleans, and Ruby’s null object “nil” all have object ids. This implies that they are more than just primitives; they are objects complete with a class, properties, and methods. Even further, we can see that nil has methods!

> nil.methods
 => [:to_i, :to_f, :to_s, :to_a, :to_h, :inspect, :&, :|, :^, :nil?, :to_r, :rationalize, :to_c, :===, :=~, :!~, :eql?, :hash, :class, :singleton_class, :clone, :dup, :taint, :tainted?, :untaint, :untrust, :untrusted?, :trust, :freeze, :frozen?, :methods, :singleton_methods, :protected_methods, :private_methods, :public_methods, :instance_variables...etc] 

Integers can have methods too:

> i = 0
> 4.times do
>     puts "#{i%2 == 0 ? 'heeey' : 'hoooo'}" 
>   i += 1
> end
heeey
hoooo
heeey
hoooo

In addition to the expected “primitive” types, Ruby provides other base types such as arrays and hashes. The hash will become particularly relevant later on in this post, but for now, I’ll just share the syntax:

> document = {}
 => {} 
> document["id"] = "emilys"
 => "emilys" 
> document
 => {"id"=>"emilys"} 

Ruby software engineers strive to embrace the OO nature of the language, sometimes to the extreme. The language is strongly and dynamically typed. This allows the Rubyist to design software that is highly modular and that relies on duck typing, i.e. taking advantage of object behavior rather than statically defined types. A good Rubyist aims to reduce dependencies and increase the flexibility of code. For example, the language doesn’t support multiple inheritance, but it does provide modules: groupings of common methods that cannot be instantiated on their own but can be “mixed in” to a class to give it extra functionality. All of these characteristics together (dynamic, object-oriented, flexible, modular) make Ruby code a pleasure to write and maintain.
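
A tiny sketch of a module mix-in (the module and class names are invented for illustration):

module Greetable
  def greet
    "Hello, #{name}"
  end
end

class User
  include Greetable        # mixes the module's methods into the class

  attr_reader :name

  def initialize(name)
    @name = name
  end
end

User.new("Emily").greet    # => "Hello, Emily"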

Rails

I mentioned above that Ruby’s popularity really blossomed with the creation of Ruby on Rails in 2003. The web framework was created by a web programmer, David Heinemeier Hansson (“DHH”), who noticed that the web stack was a given and developers were repeating themselves over and over in building technology to wrap it. He decided to extract the common elements of web engineering into a modular, reusable framework. DHH unveiled his framework in a presentation that has become iconic in web programming. He uses a DSL to create a web application in one command and then starts up a server.

> rails new my_app
> cd my_app
> rails server

So why did Rails become so popular? Rails was built to make web programming faster, easier, and more manageable. By introducing a number of conventions and sticking to OO, web programmers could go from zero to a fully working app in a relatively short amount of time with little configuration. By the same token, they could take an existing app and quickly understand the codebase well enough to maintain and develop it. We’ve already discussed the approachability of the Ruby programming language, and Rails follows many of the same principles. Rails is a solution for making web programming simpler.

I teach a Ruby on Rails class at Columbia University, and I often tell my students that Ruby on Rails is the gateway drug to web programming. It makes web development more accessible to newcomers.

MongoDB

MongoDB is a document database that focuses on developer needs. (Notice a common theme yet?) There’s no need for an army of database administrators to maintain a MongoDB cluster and the database’s flexibility allows for application developers to define and manipulate a schema themselves instead of relying on a separate team of dedicated engineers. Assuming that the many advantages of using MongoDB are familiar to you, it might seem natural for all Rails and Ruby developers to choose MongoDB as their first choice of datastore. Unfortunately, MongoDB is far from the default.

The Object-relational impedance mismatch and Active Record

The Active Record pattern describes the mapping of an object instance to a row in a relational database table, using accessor methods to retrieve columns/properties, and the ability to create, update, read, and delete entities from the database. It was first named by Martin Fowler in his book, Patterns of Enterprise Application Architecture.

The pattern has numerous limitations, referred to as the object-relational impedance mismatch. Some of these technical difficulties are structural. In OO programming, objects may be composed of other objects. The Active Record pattern maps these sub-objects to separate tables, thus introducing issues concerning the representation of relationships and encapsulated data. Rails’ biggest contribution to web programming was arguably not the framework itself, but rather its object-relational mapper, Active Record. Active Record uses macros to create relationships between objects and single table inheritance to represent inheritance.

The best solution to date for the object-relational impedance mismatch is Active Record, but this assumes your datastore is relational. It’s also, fundamentally, a hack. What if we were to use a more OO datastore?

MongoDB and Rails take on the Object relational impedance mismatch

As we see massive growth in data, an increased diversity of content, and a demand for shorter development cycles, the infrastructure developers rely upon must rise to meet new challenges that traditional technologies were not designed to address. MongoDB has gained immense popularity because it fills many of the strongest modern technical demands, while still being developer-friendly and low-cost.

Given all that has been discussed regarding Rails and Ruby, wouldn’t it make sense to use MongoDB with Rails? The answer is yes: it makes a lot of sense. Nevertheless, Rails wasn’t originally built to use a document database so you must use a separate gem in place of Active Record.

MongoMapper and Mongoid are the two leading gems that make it possible to use MongoDB as a datastore with Rails. MongoMapper, a project by John Nunemaker of GitHub, is a simple ORM for MongoDB. Mongoid, in particular, has become quite popular since its creation four years ago by Durran Jordan. Mongoid’s goal is to provide a familiar API to Active Record, while still leveraging MongoDB’s schema flexibility, document design, atomic modifiers, and rich query interface.

It’s largely due to these two gems that MongoDB can credit its traction in the Rails and Ruby community. In the past, Rails developers had to jump through a number of hoops in order to use one of these alternate ODMs with Rails, but the web framework then further modularized the database abstraction layer (the M component in MVC) to make it possible for a Rails developer to create an app without Active Record. Now all you have to do is:

> rails new my_app --skip-active-record
> cd my_app
> [edit Gemfile and add mongoid or mongo_mapper]
> bundle install
> rails server

To illustrate a brief example using Mongoid: Rails model files are altered so that classes no longer inherit from ActiveRecord::Base; instead, they include the Mongoid::Document module and define the schema directly in the model file. Database migrations are not necessary!

class Course
  include Mongoid::Document
  field :name, type: String
  embeds_many :lectures
end

Additionally, you have a number of configuration options available, such as allow_dynamic_fields, which allows you to define attributes on an object that aren’t in the model file’s schema. You can then add some logic in your model file if you need to do something different depending on the existence or absence of such a field.
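
For instance, with dynamic fields enabled, something like the following works even though difficulty was never declared with a field macro (a sketch; the behavior described is Mongoid 3.x’s, so treat the details as an assumption):

course = Course.new(name: "Databases")
course[:difficulty] = "advanced"   # not in the schema; allowed when allow_dynamic_fields is on
course.save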

I’m not going to go into too much detail on using Rails with MongoDB, because that’s a whole blog post in itself and both MongoMapper and Mongoid’s docs are fantastic. Instead, it’d be worth devoting a paragraph or two talking about the future of Ruby and MongoDB.

Future

Rails is not required to use MongoDB with Ruby. You can use either of those two gems or the 10gen Ruby driver directly in another framework, such as Sinatra, or in the context of any other Ruby program. This is where the base Hash class in the Ruby language is relevant: one of the many roles of a MongoDB driver is to serialize/deserialize BSON documents into some native representation of a document in the given language. Luckily, Ruby’s Hash class is both idiomatically familiar to Rubyists and a very close representation of a document. See for yourself:

MongoDB document representing a tweet:

{
    "_id" : ObjectId("51073d4c4eeb4f4247b5c8f9"),
    "text" : "Just saw Star Trek.  It was the best Star Trek movie yet!",
    "created_at" : "Wed May 15 19:06:41 +0000 2010",
    "entities" : {
        "user_mentions" : [
            {
                "indices" : [
                    7,
                    20
                ],
                "screen_name" : "davess",
                "id" : 17916546
            }
        ],
        "urls" : [ ],
        "hashtags" : [ ]
    },
    "retweeted" : false,
    "user" : {
        "location" : "United States",
        "created_at" : "Wed Apr 01 10:14:11 +0000 2009",
        "description" : "MongoDB Ruby driver engineer, Adjunct faculty at Columbia",
        "time_zone" : "New York",
        "screen_name" : "EmilyS",
        "lang" : "en",
        "followers_count" : 152,
    },
    "favorited" : false,
}

The corresponding Ruby hash:

{
    "_id" => ObjectId("51073d4c4eeb4f4247b5c8f9"),
    "text" => "Just saw Star Trek.  It was the best Star Trek movie yet!",
    "created_at" => "Wed May 15 19:06:41 +0000 2010",
    "entities" => {
        "user_mentions" => [
            {
                "indices" => [
                    7,
                    20
                ],
                "screen_name" => "davess",
                "id" => 17916546
            }
        ],
        "urls" => [ ],
        "hashtags" => [ ]
    },
    "retweeted" => false,
    "user" => {
        "location" => "United States",
        "created_at" => "Wed Apr 01 10:14:11 +0000 2009",
        "description" => "MongoDB Ruby driver engineer",
        "time_zone" => "New York",
        "screen_name" => "EmilyS",
        "lang" => "en",
        "followers_count" => 152,
    },
    "favorited" => false,
}

It can’t get any closer than that. Whether they are simple hashes serialized using the driver directly or instantiated classes persisted through an ODM, Ruby objects map seamlessly to MongoDB documents.

Note: Mongoid has its own driver, called moped, as of version 3.x. Therefore, if you’re on Mongoid 3.x, you’re not using 10gen’s Ruby driver.

MongoMapper, on the other hand, does use the 10gen driver in all versions.

Mongoid, MongoMapper and Beyond

Given how passionate the Ruby team at 10gen feels about the language being one of the best fits for MongoDB, we want to strengthen our relationship with the Ruby community. We are always looking for opportunities to support even more Rubyists, and thus have been working with Durran Jordan to build new bson and mongo gems that Mongoid will use in the near future. We’ve also seen great adoption of MongoMapper and hope that more Rubyists, specifically Rails developers, will benefit from the continued improvement of and collaboration on these open source projects.

2dsphere, GeoJSON, and Doctrine MongoDB


By Jeremy Mikola, 10gen software engineer and maintainer of Doctrine MongoDB ODM.

It seems that GeoJSON is all the rage these days. Last month, Ian Bentley shared a bit about the new geospatial features in MongoDB 2.4. Derick Rethans, one of my PHP driver teammates and a renowned OpenStreetMap aficionado, recently blogged about importing OSM data into MongoDB as GeoJSON objects. A few days later, GitHub added support for rendering .geojson files in repositories, using a combination of Leaflet.js, MapBox, and OpenStreetMap data. Coincidentally, I visited a local CloudCamp meetup last week to present on geospatial data, and for the past two weeks I’ve been working on adding support for MongoDB 2.4’s geospatial query operators to Doctrine MongoDB.

Doctrine MongoDB is an abstraction for the PHP driver that provides a fluent query builder API among other useful features. It’s used internally by Doctrine MongoDB ODM, but is completely usable on its own. One of the challenges in developing the library has been supporting multiple versions of MongoDB and the PHP driver. The introduction of read preferences last year is one such example. We wanted to still allow users to set slaveOk bits for older server and driver versions, but allow read preferences to apply for newer versions, all without breaking our API and abiding by semantic versioning. Now, the setSlaveOkay() method in Doctrine MongoDB will invoke setReadPreference() if it exists in the driver, and fall back to the deprecated setSlaveOkay() driver method otherwise.
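
For a MongoCursor $cursor, the fallback boils down to a runtime capability check along these lines (a simplified sketch, not the library’s actual code):

// Prefer read preferences when the installed driver supports them;
// otherwise fall back to the deprecated slaveOkay flag.
if (method_exists($cursor, 'setReadPreference')) {
    $cursor->setReadPreference('secondaryPreferred');
} else {
    $cursor->setSlaveOkay(true);
}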

Query Builder API

Before diving into the geospatial changes for Doctrine MongoDB, let’s take a quick look at the query builder API. Suppose we had a collection, test.places, with some OpenStreetMap annotations (key=value strings) stored in a tags array and a loc field containing longitude/latitude coordinates in MongoDB’s legacy point format (a float tuple) for a 2d index. Doctrine’s API allows queries to be constructed like so:

$connection = new \Doctrine\MongoDB\Connection();
    $collection = $connection->selectCollection('test', 'places');

    $qb = $collection->createQueryBuilder()
        ->field('loc')
            ->near(-73.987415, 40.757113)
            ->maxDistance(0.00899928)
        ->field('tags')
            ->equals('amenity=restaurant');

    $cursor = $qb->getQuery()->execute();

This above example executes the following query:

   {
        "loc": {
            "$near": [-73.987415, 40.757113],
            "$maxDistance": 0.00899928
        },
        "tags": "amenity=restaurant"
    }

This simple query will return restaurants within half a kilometer of 10gen’s NYC office at 229 West 43rd Street. If only it was so easy to find good restaurants near Times Square!

Supporting New and Old Geospatial Queries

When the new 2dsphere index type was introduced in MongoDB 2.4, operators such as $near and $geoWithin were changed to accept GeoJSON geometry objects in addition to their legacy point and shape arguments. $near was particularly problematic because of its optional $maxDistance argument. As shown above, $maxDistance previously sat alongside $near and was measured in radians. It now sits within $near and is measured in meters. Using a 2dsphere index and GeoJSON points, the same query takes on a whole new shape:

   {
        "loc": {
            "$near": {
                "$geometry": {
                    "type": "Point",
                    "coordinates" [-73.987415, 40.757113]
                },
                "$maxDistance": 500
            }
        },
        "tags": "amenity=restaurant"
    }

This posed a hurdle for Doctrine MongoDB’s query builder, because we wanted to support 2dsphere queries without drastically changing the API. Unfortunately, there was no obvious way for near() to discern whether a pair of floats denoted a legacy or GeoJSON point, or whether a number signified radians or meters in the case of maxDistance(). I also anticipated we might run into a similar quandary for the $geoWithin builder method, which accepts an array of point coordinates.

Method overloading seemed preferable to creating separate builder methods or introducing a new "mode" parameter to handle 2dsphere queries. Although PHP has no language-level support for overloading, it is commonly implemented by inspecting an argument’s type at runtime. In our case, this would necessitate having classes for GeoJSON geometries (e.g. Point, LineString, Polygon), which we could differentiate from the legacy geometry arrays.

Introducing a GeoJSON Library for PHP

A cursory search for GeoJSON PHP libraries turned up php-geojson, from the MapFish project, and geoPHP. I was pleased to see that geoPHP was available via Composer (PHP’s de facto package manager), but neither library implemented the GeoJSON spec in its entirety. This seemed like a ripe opportunity to create such a library, and so geojson was born a few days later.

At the time of this writing, 2dsphere support for Doctrine’s query builder is still being developed; however, I envision it will take the following form when complete:

  use GeoJson\Geometry\Point;

    // ...

    $qb = $collection->createQueryBuilder()
        ->field('loc')
            ->near(new Point([-73.987415, 40.757113]))
            ->maxDistance(0.00899928)
        ->field('tags')
            ->equals('amenity=restaurant');

All of the GeoJson classes implement JsonSerializable, one of the newer interfaces introduced in PHP 5.4, which will allow Doctrine to prepare them for MongoDB queries with a single method call. One clear benefit over the legacy geometry arrays is that the GeoJson library performs its own validation. When a Polygon is passed to geoWithin(), Doctrine won’t have to worry about whether all of its rings are closed LineStrings; the library would catch such an error in the constructor. This helps achieve a separation of concerns, which in turn increases the maintainability of both libraries.
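
Since every GeoJson class implements JsonSerializable, turning a geometry into the fragment MongoDB expects can be as simple as the following (a small sketch reusing the Point shown above):

  use GeoJson\Geometry\Point;

    $point = new Point([-73.987415, 40.757113]);

    // json_encode() invokes jsonSerialize() and yields the GeoJSON form:
    // {"type":"Point","coordinates":[-73.987415,40.757113]}
    echo json_encode($point);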

I look forward to finishing up 2dsphere support for Doctrine MongoDB in the coming weeks. In the meantime, if you happen to fall in the fabled demographic of PHP developers in need of a full GeoJSON implementation, please give geojson a look and share some feedback.

Real-time Profiling a MongoDB Cluster


by A. Jesse Jiryu Davis, Python Evangelist at 10gen

In a sharded cluster of replica sets, which server or servers handle each of your queries? What about each insert, update, or command? If you know how a MongoDB cluster routes operations among its servers, you can predict how your application will scale as you add shards and add members to shards.

Operations are routed according to the type of operation, your shard key, and your read preference. Let’s set up a cluster and use the system profiler to see where each operation is run. This is an interactive, experimental way to learn how your cluster really behaves and how your architecture will scale.


Setup

You’ll need a recent install of MongoDB (I’m using 2.4.4), Python, a recent version of PyMongo (at least 2.4—I’m using 2.5.2) and the code in my cluster-profile repository on GitHub. If you install the Colorama Python package you’ll get cute colored output. These scripts were tested on my Mac.

Sharded cluster of replica sets

Run the cluster_setup.py script in my repository. It sets up a standard sharded cluster for you running on your local machine. There’s a mongos, three config servers, and two shards, each of which is a three-member replica set. The first shard’s replica set is running on ports 4000 through 4002, the second shard is on ports 5000 through 5002, and the three config servers are on ports 6000 through 6002:

The setup

For the finale, cluster_setup.py makes a collection named sharded_collection, sharded on a key named shard_key.

In a normal deployment, we’d let MongoDB’s balancer automatically distribute chunks of data among our two shards. But for this demo we want documents to be on predictable shards, so my script disables the balancer. It makes a chunk for all documents with shard_key less than 500 and another chunk for documents with shard_key greater than or equal to 500. It moves the high chunk to replset_1:

client = MongoClient()  # Connect to mongos.
admin = client.admin  # admin database.

# Pre-split.

admin.command(
    'split', 'test.sharded_collection',
    middle={'shard_key': 500})

admin.command(
    'moveChunk', 'test.sharded_collection',
    find={'shard_key': 500},
    to='replset_1')

If you connect to mongos with the MongoDB shell, sh.status() shows there’s one chunk on each of the two shards:

{ "shard_key" : { "$minKey" : 1 } } -->> { "shard_key" : 500 } on : replset_0 { "t" : 2, "i" : 1 }
{ "shard_key" : 500 } -->> { "shard_key" : { "$maxKey" : 1 } } on : replset_1 { "t" : 2, "i" : 0 }

The setup script also inserts a document with a shard_key of 0 and another with a shard_key of 500. Now we’re ready for some profiling.

Profiling

Run the tail_profile.py script from my repository. It connects to all the replica set members. On each, it sets the profiling level to 2 (“log everything”) on the test database, and creates a tailable cursor on the system.profile collection. The script filters out some noise in the profile collection—for example, the activities of the tailable cursor show up in the system.profile collection that it’s tailing. Any legitimate entries in the profile are spat out to the console in pretty colors.
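
In rough terms, each connection does something like the following (a simplified sketch, not the repository’s actual code):

# Simplified sketch of tail_profile.py's per-member loop (PyMongo 2.x API).
from pymongo import MongoClient
import time

member = MongoClient('localhost', 4000)          # one replica set member
db = member.test

db.set_profiling_level(2)                        # "log everything" on the test database

cursor = db.system.profile.find({}, tailable=True)
while cursor.alive:
    for entry in cursor:
        # coloring and noise filtering omitted here
        print('%s %s' % (entry['op'], entry['ns']))
    time.sleep(0.1)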

Experiments

Targeted queries versus scatter-gather

Let’s run a query from Python in a separate terminal:

>>> from pymongo import MongoClient
>>> # Connect to mongos.
>>> collection = MongoClient().test.sharded_collection
>>> collection.find_one({'shard_key': 0})
{'_id': ObjectId('51bb6f1cca1ce958c89b348a'), 'shard_key': 0}

tail_profile.py prints:

replset_0 primary on 4000: query test.sharded_collection {"shard_key": 0}

The query includes the shard key, so mongos reads from the shard that can satisfy it. Adding shards can scale out your throughput on a query like this. What about a query that doesn’t contain the shard key?:

>>> collection.find_one({})

mongos sends the query to both shards:

replset_0 primary on 4000: query test.sharded_collection {"shard_key": 0}
replset_1 primary on 5000: query test.sharded_collection {"shard_key": 500}

For fan-out queries like this, adding more shards won’t scale out your query throughput as well as it would for targeted queries, because every shard has to process every query. But we can scale throughput on queries like these by reading from secondaries.

Queries with read preferences

We can use read preferences to read from secondaries:

>>> from pymongo.read_preferences import ReadPreference
>>> collection.find_one({}, read_preference=ReadPreference.SECONDARY)

tail_profile.py shows us that mongos chose a random secondary from each shard:

replset_0 secondary on 4001: query test.sharded_collection {"$readPreference": {"mode": "secondary"}, "$query": {}}
replset_1 secondary on 5001: query test.sharded_collection {"$readPreference": {"mode": "secondary"}, "$query": {}}

Note how PyMongo passes the read preference to mongos in the query, as the $readPreference field. mongos targets one secondary in each of the two replica sets.

Updates

With a sharded collection, updates must either include the shard key or be “multi-updates”. An update with the shard key goes to the proper shard, of course:

>>> collection.update({'shard_key': -100}, {'$set': {'field': 'value'}})

replset_0 primary on 4000: update test.sharded_collection {"shard_key": -100}

mongos only sends the update to replset_0, because we put the chunk of documents with shard_key less than 500 there.

A multi-update hits all shards:

>>> collection.update({}, {'$set': {'field': 'value'}}, multi=True)

replset_0 primary on 4000: update test.sharded_collection {}
replset_1 primary on 5000: update test.sharded_collection {}

A multi-update on a range of the shard key need only involve the proper shard:

>>> collection.update({'shard_key': {'$gt': 1000}}, {'$set': {'field': 'value'}}, multi=True)

replset_1 primary on 5000: update test.sharded_collection {"shard_key": {"$gt": 1000}}

So targeted updates that include the shard key can be scaled out by adding shards. Even multi-updates can be scaled out if they include a range of the shard key, but multi-updates without the shard key won’t benefit from extra shards.

Commands

In version 2.4, mongos can use secondaries not only for queries, but also for some commands. You can run count on secondaries if you pass the right read preference:

>>> cursor = collection.find(read_preference=ReadPreference.SECONDARY)
>>> cursor.count()

replset_0 secondary on 4001: command count: sharded_collection
replset_1 secondary on 5001: command count: sharded_collection

Whereas findAndModify, since it modifies data, is run on the primaries no matter your read preference:

>>> db = MongoClient().test
>>> db.command(
...     'findAndModify',
...     'sharded_collection',
...     query={'shard_key': -1},
...     remove=True,
...     read_preference=ReadPreference.SECONDARY)

replset_0 primary on 4000: command findAndModify: sharded_collection

Go Forth And Scale

To scale a sharded cluster, you should understand how operations are distributed: are they scatter-gather, or targeted to one shard? Do they run on primaries or secondaries? If you set up a cluster and test your queries interactively like we did here, you can see how your cluster behaves in practice, and design your application for future growth.

Read Jesse’s blog, Emptysquare, and follow him on GitHub.

Leafblower: Winner of the MongoDB March Madness Hackathon


In March, 10gen hosted a global hackathon called MongoDB March Madness and challenged the community to build tools to support the growing MongoDB community. Leafblower, a project built at the Sydney March Madness Hackathon, was the global winner.

Who are rdrkt?

Andy Sharman and Josh Stevenson work together on independent projects under their brand rdrkt (pronounced "redirect") and capitalize on their largely overlapping skill sets, but distinct passions, to build cutting-edge web applications. Josh is passionate about flexible, scalable back-end systems and Andy is passionate about accessible, responsive user interfaces.

Josh, originally from the United States, now lives and works full-time in Sydney, Australia for Excite Digital Media. He has experience with the LAMP stack, CakePHP, ZendFramework, and MongoDB. He’s also worked on projects which have required his significant expertise with jQuery, HTML5 and CSS3. He’s very active in the Sydney area tech community, regularly attending PHP, MongoDB, WebDirections and WebBlast meetups.

Andy, from the UK, also now lives in Sydney and works for Visualjazz Isobar. He has frontend experience with HTML5, vanilla JavaScript, jQuery, MooTools, and WebGL with Three.js. In addition to that, he has experience with backend languages such as PHP, C#.NET, and Node.js, and CMSs such as Sitecore, Magento, Joomla, and WordPress.

What is Leafblower?

Leafblower is a real-time dashboard platform we conceived and programmed for the MongoDB March Madness Hackathon. Its primary focus is on bleeding-edge, up-to-the-second application analytics and monitoring to help campaign managers and system administrators react quickly to ever-changing trends and traffic spikes. We live in an age where one link on Reddit or a viral post on Facebook can mean the difference between "getting your break" and catastrophic failure of the application’s server infrastructure. Leafblower helps users understand what is happening "right now" in their applications and, most importantly, their server availability and usage, by cutting out as much aggregation as possible and focusing on capturing and reporting data samples as they happen.

Where did the idea for Leafblower come from?

At the start of the Sydney-based MongoDB hackathon, Andy wanted to build some sort of analytics or reporting toolset to help people learn the way Mongo databases function and grow, but it wasn’t until the starting bell rang that Josh had the idea of doing real-time monitoring rather than historical analysis. That’s how Leafblower was born.

We went on to win the Sydney hackathon, and decided to really buckle down and clean up what we worked on for the global competition, rather than trying to build out too many extra features.

Anyone can look at and download everything needed to install Leafblower on GitHub.

How does Leafblower work?

After the initial idea for Leafblower was formed, the rest of the project flowed together organically. The platform uses MongoDB as the storage engine to run all the data "blocks" which display statistics intuitively for end-users. Blocks render themselves using information passed to them via a Node.js communication layer, which itself connects to a backend API server. This allows many clients to be connected and viewing dashboards without requiring any additional load on the API server and thus the application itself.

Leafblower is currently running on a Linux/Apache/MongoDB/PHP/Node.js system; however, Linux could be swapped out for any OS, and Apache could be swapped out for any web server, assuming that the correct versions of the required services are installed and configured to play nicely. There is a full list of required software available on our GitHub page. Leafblower uses well-supported cross-OS services so you can run an instance with the operating system you’re used to.

Rather than trying to create a platform that is generic enough to be useful to all the different applications out there, our goal was to create enough useful Blocks and then put the power to customize or extend Block functionality in the hands of the people actually using Leafblower to monitor their apps. NoSQL solutions like MongoDB are the perfect fit for projects looking to let developers extend the platform with minimal-to-zero prior knowledge of how data is stored and passed around internally. For example, while some of the elements of the configuration documents in MongoDB are required to function properly, whole portions are completely arbitrary depending on the data needs of a given Block. With MongoDB this is no sweat. For us, the advantage is not so much about "no schema" as it is about a "flexible schema".

When will it be ready for a production environment?

After completing the early stages of the project and winning the MongoDB Global March Madness Hackathon, Josh and Andy have a number of features and services, both short- and long-term, planned. Some of the items on their shortlist are:

  • Users and Groups to which profiles can be assigned
  • Easier setup and installation/configuration for users who want to deploy Leafblower on their own hardware
  • A full suite of Blocks of both a basic and advanced nature, such as API Blocks for Twitter/Facebook/Google which do live social/analytics monitoring
  • Automatic historical data comparison to make it easier to compare current live data with past trends
  • Admin panel enhancements to make users, groups, profiles, and Blocks easier to configure

User logins and access groups are our first priority and we expect to include them in the next release of Leafblower on Github. We’re also hoping the community will engage with Leafblower and help us build out our library of standard Blocks.

Why did rdrkt use MongoDB?

Leafblower is a real-time monitoring platform, not a complete self-contained application. In order for it to be flexible enough to monitor all sorts of potential data feeds and APIs, we knew a flexible schema was the way to go. For example, let’s look at a document from the blocks collection:

{
    "_id" : "geoCheckIns",
    "type" : "geospacial",
    "title" : "Visualize checkins as they occur.",
    "description" : "Block for visualizing the checkins occurring in your app bound by a circle centered at a given lat/lon and radius",
    "ttl" : 5000,
    "options" : {
        "lat" : -33.873651,
        "lon" : -151.2068896,
        "radius" : 50,
        "host" : "localhost",
        "db" : "demo",
        "collection" : "checkins"
    }
}

In the case of a block, each one will always have the fields _id, type, title, description, ttl and options; however, options is a subdocument and may contain an arbitrary number of fields. In this example, we need to know the database host, database name, and collection containing checkin data to monitor, as well as a center point and radius for the great circle we draw. From those data points, the block will be able to render a map centered over our point and drop pins onto the map as users check in to your application. This particular block would be useful for someone monitoring their social media campaign as it goes live in a specific city or region.
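
To make that concrete, here is a hypothetical sketch of the query such a block might run in the mongo shell, built from the options above. The location field name and the assumption that the radius is in miles are ours, not Leafblower’s actual implementation:

// Hypothetical query built from the block's options above (field names
// and radius units are assumptions, not Leafblower's actual code).
// Assumes you are connected to the "demo" database on "localhost".
var opts = { lat: -33.873651, lon: -151.2068896, radius: 50, collection: "checkins" };
db[opts.collection].find({
    location: {
        $within: {
            // $centerSphere takes [ [ lon, lat ], radius in radians ];
            // dividing by ~3959 converts a radius in miles to radians
            $centerSphere: [ [ opts.lon, opts.lat ], opts.radius / 3959 ]
        }
    }
});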

Let’s look at another block type for contrast:

{
    "_id" : "memory",
    "type" : "system",
    "title" : "Memory",
    "description" : "View statistics about the memory usage on your server.",
    "ttl" : 1000,
    "options" : {

    }
}

In this example, we left options completely empty because we are retrieving the monitoring data from the server the user is currently connected to. Our application still expects an object to exist when we access options, so we maintain the structure of our document despite not needing to read any further data for this type of block. We might expect sysadmins to use a modified version of this block for monitoring servers they administer.

This is a good illustration of what we mean when we say “flexible schema”. It’s an important distinction from “no schema”: there is a clear idea of how the data is structured, we’ve just allowed for flexibility in certain places where it’s advantageous. NoSQL solutions like MongoDB are very attractive to many developers because they can interact with the database as if they are simply storing and retrieving objects from a datastore. This is a dangerous path to follow, however, because a “set it and forget it” attitude will quickly lead to scaling and performance issues as an application grows. Know the structure of your data and only make changes to that structure incrementally.

MongoDB is easy to scale and has a vibrant user community. Leafblower gives back to that community by creating useful tools that showcase MongoDB’s advantages while teaching users how it works. It’s a win for everyone: for ourselves, for MongoDB, and for the IT community at large.

Want more information?

Leafblower:

rdrkt:


Libbson

Libbson is a new shared library written in C for developers wanting to work with the BSON serialization format.

Its API will feel natural to C programmers but can also be used as the base of a C extension in higher-level MongoDB drivers.

The library contains everything you would expect from a BSON implementation. It has the ability to work with documents in their serialized form, iterate elements within a document, overwrite fields in place, generate ObjectIds, convert to and from JSON, validate data, and more. Some lessons were learned along the way that are beneficial for those choosing to implement BSON themselves.

Improving small document performance

A common use case of BSON is relatively small documents. Allocating and freeing many of them has a profound impact on the memory allocator in userspace, causing what is commonly known as “memory fragmentation”. Memory fragmentation can make it more difficult for your allocator to locate a contiguous region of memory.

In addition to increasing allocation latency, it increases the memory requirements of your application to overcome that fragmentation.

To help with this issue, the bson_t structure contains 120 bytes of inline space that allows BSON documents to be built directly on the stack as opposed to the heap.

When the document size grows past 120 bytes it will automatically migrate to a heap allocation.

Additionally, bson_t will grow its buffers in powers of two. This is standard when working with buffers and arrays, as it amortizes the overhead of growing the buffer versus calling realloc() every time data is appended. 120 bytes was chosen to align bson_t to the size of two sequential cachelines on x86_64 (each 64 bytes).

This may change based on future research, but not before a stable ABI has been reached.
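
To make that concrete, here is a minimal sketch of building a small document with libbson. Because the result fits in the 120 inline bytes, the bson_t below should never touch the heap (the header name and exact function signatures may vary slightly between libbson versions):

#include <stdio.h>
#include <bson.h>

int main (void)
{
   bson_t doc;                        /* lives on the stack */

   bson_init (&doc);                  /* starts in the 120 bytes of inline space */
   bson_append_utf8 (&doc, "name", -1, "The Dignity", -1);
   bson_append_int32 (&doc, "visits", -1, 42);

   printf ("document length: %u bytes\n", (unsigned int) doc.len);

   bson_destroy (&doc);               /* nothing was allocated on the heap here */
   return 0;
}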

Single allocation for nested documents

One strength of BSON is its ability to nest objects and arrays. Often, when serializing these nested documents, each sub-document is serialized independently and then appended to the parent’s buffer.

As you might imagine, this takes quite a toll on the allocator. It can generate many small allocations that are created only to be discarded immediately after being appended to the parent’s buffer. Libbson allows for building sub-documents directly into the parent document’s buffer.

Doing so helps avoid this costly fragmentation. The topmost document will grow its underlying buffers in powers of two each time the allocation would overflow.
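
A rough sketch of what that looks like with libbson’s begin/end API (again, treat the details as approximate):

#include <bson.h>

int main (void)
{
   bson_t parent;
   bson_t child;

   bson_init (&parent);
   bson_append_utf8 (&parent, "name", -1, "parent", -1);

   /* the child is built directly inside the parent's buffer --
      no separate allocation and no copy when it is "appended" */
   bson_append_document_begin (&parent, "child", -1, &child);
   bson_append_int32 (&child, "answer", -1, 42);
   bson_append_document_end (&parent, &child);

   bson_destroy (&parent);
   return 0;
}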

Parsing BSON documents from network buffers

Another common area for allocator fragmentation is during BSON document parsing. Libbson allows parsing and iteration of BSON documents directly from your incoming network buffer.

This means the only allocations created are those needed for your higher level language such as a PyDict if writing a Python extension.
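
As an illustration, a received buffer can be wrapped and iterated without copying it; in this sketch the “network buffer” is simulated from a locally built document, and the API details may differ slightly between libbson versions:

#include <stdio.h>
#include <bson.h>

int main (void)
{
   bson_t src;
   bson_t view;
   bson_iter_t iter;
   const uint8_t *buf;

   /* pretend this document arrived over the wire */
   bson_init (&src);
   bson_append_utf8 (&src, "hello", -1, "world", -1);
   buf = bson_get_data (&src);

   /* wrap the raw buffer without copying it, then walk its elements */
   if (bson_init_static (&view, buf, src.len) &&
       bson_iter_init (&iter, &view)) {
      while (bson_iter_next (&iter)) {
         printf ("key: %s\n", bson_iter_key (&iter));
      }
   }

   bson_destroy (&src);
   return 0;
}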

Developers writing C extensions for their driver may choose to implement a “generator” style parsing of documents to help keep memory fragmentation low.

A technique we’ve yet to explore is implementing a hashtable-esque structure backed by BSON, only deserializing the entire buffer after a threshold of keys has been accessed.

Generating BSON documents into network buffers

Much like parsing BSON documents, generating documents and placing them into your network buffers can be hard on your memory allocator. To help keep this fragmentation down, Libbson provides support for serializing your document to BSON directly within a buffer of your choosing.

This is ideal for situations such as writing a sequence of BSON documents into a MongoDB message.
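
Libbson’s writer API is the piece that supports this; the sketch below appends several documents back-to-back into one growable buffer (the realloc hook and error handling are simplified, and names should be checked against your libbson version):

#include <bson.h>

int main (void)
{
   uint8_t *buf = NULL;
   size_t buflen = 0;
   bson_writer_t *writer;
   bson_t *doc;
   int i;

   /* documents are serialized straight into buf, which grows as needed */
   writer = bson_writer_new (&buf, &buflen, 0, bson_realloc_ctx, NULL);

   for (i = 0; i < 3; i++) {
      bson_writer_begin (writer, &doc);
      bson_append_int32 (doc, "i", -1, i);
      bson_writer_end (writer);
   }

   bson_writer_destroy (writer);
   bson_free (buf);
   return 0;
}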

Generating Object Ids without Synchronization

Applications are often doing ObjectId generation, especially in high insert environments. The uniqueness of generated ObjectIds is critical to avoiding duplicate key errors across multiple nodes.

Highly threaded environments create a local contention point slowing the rate of generation. This is because the threads must synchronize on the increment counter of each sequential ObjectId. Failure to do so could cause collisions that would not be detected until after a network round-trip. Most drivers implement the synchronization with an atomic increment or a mutex if atomics are not available.

Libbson will use atomic increments and in some cases avoid synchronization altogether if possible. One such case is a non-threaded environment.

Another is when running on Linux as both threads and processes are in the same namespace.

This allows the use of the thread identifier as the pid within the ObjectId.
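
A small sketch of generating ObjectIds against an explicit context follows; the flag name shown is an assumption based on current libbson headers, so check the version you build against:

#include <stdio.h>
#include <bson.h>

int main (void)
{
   bson_context_t *ctx;
   bson_oid_t oid;
   char str[25];

   /* BSON_CONTEXT_NONE suits a context used from a single thread,
      so no locking or atomic counter is required */
   ctx = bson_context_new (BSON_CONTEXT_NONE);

   bson_oid_init (&oid, ctx);
   bson_oid_to_string (&oid, str);
   printf ("generated: %s\n", str);

   bson_context_destroy (ctx);
   return 0;
}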

You can find Libbson at https://github.com/mongodb/libbson and discuss design choices with its author, Christian Hergert, who can be found on twitter as @hergertme.

Morphia Version 0.101 Released

The Java Team has released Morphia, version 0.101. Rumors of Morphia’s demise have been greatly exaggerated.

This release formalizes the 0.99-SNAPSHOT code that many have been using for years. Apart from some formatting and minor style changes, there is no functional change from what’s been on Google Code for the past few years.

We’re currently doing a follow-up release to incorporate the enhancements made in James Green’s fork. At that point, we can start working through the backlog of issues logged against Morphia and begin evolving the project.

As part of the mission to revive the project, we’ve migrated the Morphia code (and wiki, and issues) from Google Code to GitHub. This should make it much easier to contribute to Morphia via pull requests. Feel free to submit pull requests for the bugs or irritating documentation fixes that you want. The Google Code site will notify users that Github is now the place to be. You can find the Github repo here.

We released 0.101 into Maven Central, so you’ll be able to get this version and future versions easily through Maven or Gradle. You can find the release notes on Github.

Further down the line, Morphia will run on the new Java driver that we’ve been working on all year.

We’re aware that the community has been somewhat split between the snapshot release and James’s fork, and we’ve spoken to James Green, who has maintained Morphia for a while, about our plans. We want to make sure that we’re all moving in the same direction and that we involve the community in our efforts. We’d like to thank James for the work he’s done so far keeping Morphia alive, and we look forward to working with him and the whole Morphia community more closely in the future.

The Java Team: Jeff Yemin, Justin Lee & Trisha Gee

NeDB: a lightweight Javascript database using MongoDB's API

This is a guest post by Louis Chatriot

Sometimes you need database functionality but want to avoid the constraints that come with installing a full-blown solution. Maybe you are writing a Node service or web application that needs to be easily packageable, such as a continuous integration server. Maybe you’re writing a desktop application with Node Webkit, and don’t want to ask your users to install an external database. That’s when you need NeDB.

NeDB is a lightweight database written entirely in Javascript that implements the well-known and loved MongoDB API. It is packaged as a Node module that can be used with a simple require, and it can act as an in-memory only or persistent datastore. You can think of it as SQLite for MongoDB projects.

var Nedb = require('nedb')
  , planets = new Nedb({ filename: 'path/to/data.db', autoload: true });
// Let's insert some data
planets.insert({ name: 'Earth', satellites: 1 }, function (err) {
  planets.insert({ name: 'Mars', satellites: 2 }, function (err) {
    planets.insert({ name: 'Jupiter', satellites: 67 }, function (err) {
      
      // Now we can query it the usual way
      planets.find({ satellites: { $lt: 10 } }, function (err, docs) {
        // docs is an array containing Earth and Mars
      });
    });
  });
});

Features

NeDB implements the most widely used features of MongoDB:

  • CRUD operations including upserts
  • Ability to persist data
  • Expressive query language where you can use dot notation (to query on nested documents), regular expressions, comparison operators ($lt, $lte, $gt, $gte, $in, $nin, $exists) and logical operators ($and, $or, $not)
  • Document modifiers $set, $inc, $push, $pop, $addToSet and $each (a short example follows this list)
  • A browser version
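
As a quick sketch of those operators and modifiers in use (continuing the planets datastore from the example above):

// Give Jupiter one more satellite and tag it, then query with operators
planets.update(
  { name: 'Jupiter' },
  { $inc: { satellites: 1 }, $set: { type: 'gas giant' } },
  {},   // options: no multi-update, no upsert
  function (err, numReplaced) {
    // numReplaced is 1 if Jupiter was found
    planets.find({ satellites: { $gte: 60 }, type: { $exists: true } }, function (err, docs) {
      // docs contains the updated Jupiter document
    });
  }
);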

Performance

Of course, NeDB is not a replacement for a “real” database such as MongoDB, so its goal is not to be as fast as possible but to be fast enough. And it is: using indexing, it achieves about 5,000 writes and 25,000 reads per second. If you need more than this, you’re probably not writing a small application!

Want to try it?

You can install it with npm; the module name is nedb. You can also check the Github repository to read the documentation, give feedback, raise issues or send pull requests.

Introducing the MongoDB Driver for the Rust Programming Language

Discuss on Hacker News

This is a guest post by Jao-ke Chin-Lee and Jed Estep, who are currently interns at 10gen. This summer they were tasked with building a Rust driver for MongoDB.

Today we are open sourcing the alpha release of a MongoDB driver for the Rust programming language. This is the culmination of two months of work with help from the Rust community, and we’re excited to be sharing our initial version. We are looking forward to feedback from both the Rust and MongoDB communities.

About Rust

Rust is a multi-paradigm, systems-oriented language currently in development at Mozilla. It features elements of imperative, functional, and object-oriented languages, as well as including ambitious new features such as affine types that allow for smart automatic memory management and concurrency. Its powerful type- and memory-safety features run at compile time, and still maintain performance typical of a low-level language. As the language continues to grow, the availability of a MongoDB driver written in native Rust exposes both Rust and MongoDB to new audiences.

About the MongoDB Driver

The driver is Apache licensed, and is implemented purely in Rust without relying on any C extensions or libraries (with the exception of an embedded copy of md5, which can be removed when Rust offers a native implementation).

Using the driver will feel familiar to users of current MongoDB drivers. Basic BSON types, such as strings, numbers, and objects, have a built-in internal representation based around Rust’s algebraic data types. Users can also implement a trait (analogous to a Haskell typeclass, similar to a Java interface) which allows them to treat types native to their codebase as though they were native to BSON as well.

Once a user’s data is formatted, interacting with the database is done through familiar objects like Collection, DB, Cursor, and Client. These objects have similar APIs to their counterparts in other languages, and presently offer CRUD, indexing, as well as administrative functionality in a framework that is Rustically abstracted and balances the philosophy of Rust’s static guarantees with the flexibility of MongoDB. A small example is offered below.

Example Usage

First we need to import the mongo library and the classes we’ll be using.

extern mod mongo;

use mongo::client::*;
use mongo::util::*;
use mongo::coll::*;
use mongo::db::*;

In order to connect with a Mongo server, we first create a client.

let client = @Client::new();

To connect to an unreplicated, unsharded server running on localhost, port 27017 (MONGO_DEFAULT_PORT), we use the connect method:

match client.connect(~"127.0.0.1", MONGO_DEFAULT_PORT) {
    Ok(_) => (),
    Err(e) => fail!(e.to_str()),
}

We create a capped collection named “capped” in the database “rust_ex” by first creating a handle to the “rust_ex” database (which may or may not be empty) and then calling create_collection with arguments specifying the size of the capped collection (the collection must not already exist).

let db = DB::new(~"rust_ex", client);
match db.create_collection(~"capped", None, Some(~[CAPPED(100000), MAX_DOCS(20)])) {
    Ok(_) => (),
    Err(e) => fail!(e.to_str()),
};

Now we create a tailable cursor to extract documents where the value of the “a” field is divisible by 5, and project on the “msg” field.

let coll = Collection::new(~"rust_ex", ~"capped", client);
let mut cursor = match coll.find(   Some(SpecNotation(~"{ 'a':{'$mod':[5,0]} }")),
                                    Some(SpecNotation(~"{ 'msg':1 }")),
                                    None) {
    Ok(c) => c,     // JSON-formatted strings are automatically converted to BSON
    Err(e) => fail!(e.to_str()),
};
cursor.add_flags(~[CUR_TAILABLE, AWAIT_DATA]);

Then we spawn a thread to populate the capped collection. Note that for the first insert, we specify None as the write concern, which indicates the default of 1, whereas for the subsequent inserts, we specify the write concern as journaled.

coll.insert(~"{ 'a':0, 'msg':'first insert' }", None);
let n = 50;
do spawn {
    let tmp_client = @Client::new();
    tmp_client.connect(~"127.0.0.1", MONGO_DEFAULT_PORT);

    let coll = Collection::new(~"rust_ex", ~"capped", tmp_client);
    let mut i = 1;
    for n.times {
        match coll.insert(  fmt!("{ 'a':%?, 'msg':'insert no. %?' }", i, i),
                            Some(~[JOURNAL(true)])) {
            Ok(_) => (),
            Err(e) => println(fmt!("%s", e.to_str())),
        };
        i += 1;
    }
    tmp_client.disconnect();
}

Meanwhile, in the main thread, we iterate on the results returned from the tailable cursor.

for 25.times {
    let mut p = cursor.next();
    while p.is_none() && !cursor.is_dead() { p = cursor.next(); }
    if cursor.is_dead() { break; }
    println(fmt!("read %?", p.unwrap().to_str()));
}

Finally, we disconnect the client. This client can be reused to connect to other servers afterwards.

match client.disconnect() {
    Ok(_) => (),
    Err(e) => fail!(e.to_str()),
}

For similar examples, including a worked one in which user-implemented structs are inserted into and read from the database, please refer to the examples.

Resources

Please find more examples, as well as the source code which we encourage you to check out and play around with, at the GitHub repository. We also have documentation available. Keep in mind that the driver currently will only build on the Rust 0.7 release, and will not work with other versions of Rust. We welcome any feedback and contributions; create an issue or submit a pull request!

About Us

The MongoDB Rust driver was developed by Jao-ke Chin-Lee and Jed Estep, who are currently interns at 10gen. We were drawn to the project by the innovation of Rust as well as the idea of bridging the Rust and MongoDB communities.

Acknowledgements

Many thanks to the Rust community on IRC and rust-dev for their guidance in Rust, and for developing such an exciting language. Special thanks to 10gen for hosting us as interns, Stacy Ferranti and Ian Whalen for managing the internship program, and our mentors Tyler Brock and Andrew Morrow for their help and support throughout the project.

The Most Popular Pub Names

By Ross Lawley, MongoEngine maintainer and Scala Engineer at 10gen

Earlier in the year I gave a talk at MongoDB London about the different aggregation options with MongoDB. The topic recently came up again in conversation at a user group, so I thought it deserved a blog post.

Gathering ideas for the talk

I wanted to give a more interesting aggregation talk than the standard “counting words in text”, and as the aggregation framework gained shiny 2dsphere geo support in 2.4, I figured I’d use that. I just needed a topic…

What is top of mind for us Brits?

Two things immediately sprang to mind: weather and beer.

I opted to focus on something close to my heart: beer :) But what to aggregate about beer? Then I remembered an old pub quiz favourite…

What is the most popular pub name in the UK?

I knew there was some great open data, including a wealth of information on pubs available from the awesome OpenStreetMap project. I just needed to get at it, and happily the Overpass API provides a simple “xapi” interface for OSM data. All I needed was anything tagged with amenity=pub within the bounds of the UK, and with their xapi interface this is as simple as a wget:

http://www.overpass-api.de/api/xapi?*[amenity=pub][bbox=-10.5,49.78,1.78,59]

Once I had an osm file, I used the imposm python library to parse the XML and then convert it to the following GeoJSON format:

{
  "_id" : 451152,
  "amenity" : "pub",
  "name" : "The Dignity",
  "addr:housenumber" : "363",
  "addr:street" : "Regents Park Road",
  "addr:city" : "London",
  "addr:postcode" : "N3 1DH",
  "toilets" : "yes",
  "toilets:access" : "customers",
  "location" : {
      "type" : "Point",
      "coordinates" : [-0.1945732, 51.6008172]
  }
}

Then it was a case of simply inserting each pub as a document into MongoDB. I quickly noticed that the data needed a little cleaning, as I was seeing duplicate pub names, for example “The Red Lion” and “Red Lion”. Because I wanted to make a wordle, I normalised all the pub names.
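
The normalisation itself was nothing fancy; a rough Python sketch of the idea is below (the real logic lives in osm2mongo.py and may differ in detail):

def normalise(name):
    # collapse whitespace and title-case, then give every pub a "The"
    # prefix so "Red Lion" and "The Red Lion" count as the same name
    name = " ".join(name.split()).title()
    if not name.startswith("The "):
        name = "The " + name
    return name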

If you want to know more about the importing process, the full loading code is available on github: osm2mongo.py

Top pub names

It turns out finding the most popular pub names is very simple with the aggregation framework. Just group by the name and then sum up all the occurrences. To get the top five most popular pub names we sort by the summed value and then limit to 5:

db.pubs.aggregate([
  {"$group":
     {"_id": "$name",
      "value": {"$sum": 1}
     }
  },
  {"$sort": {"value": -1}},
  {"$limit": 5}
]);
For the whole of the UK this returns:
  1. The Red Lion
  2. The Royal Oak
  3. The Crown
  4. The White Hart
  5. The White Horse

image

Top pub names near you

At MongoDB London I thought that was too easy, so I filtered to find the top pub names near the conference, showing off some of the geo functionality that became available in MongoDB 2.4. To limit the result set, add a $match stage that checks the location is within a 2-mile radius using $centerSphere. Just provide the coordinates [ <long>, <lat> ] and a radius of roughly 2 miles converted to radians (3959 miles is approximately the radius of the earth, so divide 2 by 3959):

db.pubs.aggregate([
    { "$match" : { "location":
                 { "$within":
                   { "$centerSphere": [[-0.12, 51.516], 2 / 3959] }}}
    },
    { "$group" :
       { "_id" : "$name",
         "value" : { "$sum" : 1 } }
    },
    { "$sort" : { "value" : -1 } },
    { "$limit" : 5 }
  ]);

What about where I live?

At the conference I looked at the most popular pub names near the venue. That’s great if you happen to live in the centre of London, but what about everyone else in the UK? So for this blog post I decided to update the demo code and make it dynamic based on where you live.

See: pubnames.rosslawley.co.uk

Apologies to those outside the UK - the demo app doesn’t have data for the whole world - though it’s surely possible to do.

Cheers

All the code is available in my repo on GitHub, including the BSON file of the pubs and the wordle code - so fork it and start playing with MongoDB’s great geo features!

Securing MongoDB on Windows Azure

By Sridhar Nanjesudwaran, Windows Azure lead at 10gen

I have used the MongoDB Installer for Windows Azure to deploy my MongoDB instance on a Windows Virtual Machine on Windows Azure. It is not my production environment but I would still like to secure it. What do I need to do to secure this standalone instance?

Let us take a look at the possible issues and how you would resolve each of them.

  • Password
  • Administrator username
  • Endpoints

Password

We are assuming you have created a strong password for the Administrator user. If not, make sure to set one now.

Administrator Username

The user name cannot be specified using the installer; it is always “Administrator”. The background here is that when Azure Virtual Machines were in preview, “Administrator” was the only username allowed when creating Windows Virtual Machines. This was recently fixed, but the installer has not yet been modified to allow a custom username. To secure the instance it would be a good idea to change the username. You can change the username by logging onto the instance.

Once you remote desktop to the instance, you can change the username from PowerShell. To change:

$user = Get-WMIObject Win32_UserAccount -Filter "Name='Administrator'"
$username = ""   # set this to your new administrator name
$user.Rename($username)

You can verify the username changed by logging out of the instance and retrying with Administrator – this should fail. Now retry with the username you just created which should succeed.

Endpoints

By default the installer creates 3 endpoints on the instance. The endpoints are for

  • RDP (starting at 3389)
  • MongoDB (starting at 27017)
  • PowerShell remoting (starting at 5985)

We are going to secure the endpoints by

  1. Removing the ports when not required
  2. Choosing non-standard ports
  3. Securing them to your location

Removing endpoints

Remove the endpoints if they are not necessary. The PowerShell remoting endpoint is only required for the initial setup; it is not needed unless you explicitly want to continue to use PowerShell remoting to manage the instance, so you should remove it. If you do want to use PowerShell remoting to manage the instance, it is more secure to add the endpoint via an Azure interface (the CLI, PowerShell or the Management Portal) only when needed.

To remove the PowerShell remoting endpoint, from a Windows Azure PowerShell console:

# Remove PowerShell remoting endpoints
Get-AzureVM -ServiceName "myservice" | Remove-AzureEndpoint -Name endpname-5985-5985 | Update-AzureVM

The default remoting endpoint name is “endpname-5985-5985”. The service name is the same as the dns prefix you specified in the installer to create the instance. Similarly remove the RDP endpoint. Add it when needed as opposed to keeping it open all the time.

Choosing non-standard ports

Only add the RDP endpoint when necessary. When adding it, ensure you do not use the default port of 3389 for the external load balancer. To create the endpoint for RDP, from a Windows Azure PowerShell console:

# Add RDP endpoints to the single VM
Get-AzureVM -ServiceName "myservice" | Add-AzureEndpoint -Name rdp -LocalPort 3389 -Protocol tcp | Update-AzureVM

The above sets the load balancer port to an arbitrary one from the ephemeral range.

If an RDP endpoint already exists (like the default one created by the installer), you can change the load balancer port to a non-standard port from a Windows Azure PowerShell console:

# Update RDP endpoint external port (50123 is an arbitrary non-standard choice)
Get-AzureVM -ServiceName "myservice" | Set-AzureEndpoint -Name rdp -LocalPort 3389 -PublicPort 50123 -Protocol tcp | Update-AzureVM

To check the external port you can get it from the management portal or use Windows Azure PowerShell:

# Get RDP endpoint external port
Get-AzureVM -ServiceName "myservice" | Get-AzureEndpoint

Securing the endpoint to your location:

Prior to the recent updates to Windows Azure and Windows Azure PowerShell, the only method of securing endpoints was using firewall rules on the actual instance. While this does help secure the instance, it still allows for malicious DoS attacks. With the recent updates, in addition to firewall rules you can secure your endpoints by specifying a set of addresses that can access them (a white list). You want to secure the MongoDB endpoints to only allow your MongoDB client/app machines (perhaps in addition to administrator machines) to access them.

Also, if you are enabling the RDP endpoint, secure it by only allowing access from the specified administrator machines. Using Windows Azure PowerShell:

# Setup the ACL
$acl = New-AzureAclConfig
Set-AzureAclConfig -AddRule Permit -RemoteSubnet "mysubnet" -Order 1 -ACL $acl -Description "Lockdown MongoDB port"

# Update the endpoint with the ACL
Get-AzureVM -ServiceName "myservice" | Set-AzureEndpoint -Name endpname-27017-27017 -PublicPort 27017 -LocalPort 27017 -Protocol tcp -ACL $acl | Update-AzureVM

Here “mysubnet” is the subnet you want to allow access, specified in CIDR format.

Mongo-Hadoop 1.1

by Mike O’Brien, MongoDB Kernel Tools Lead and maintainer of Mongo-Hadoop, the Hadoop Adapter for MongoDB

Hadoop is a powerful, JVM-based platform for running Map/Reduce jobs on clusters of many machines, and it excels at doing analytics and processing tasks on very large data sets.

Since MongoDB excels at storing large operational data sets for applications, it makes sense to explore using these together - MongoDB for storage and querying, and Hadoop for batch processing.

The Mongo-Hadoop Adapter

We recently released version 1.1 of the Mongo-Hadoop Adapter. The Mongo-Hadoop adapter makes it easy to use MongoDB databases, or MongoDB backup files in .bson format, as the input source or output destination for Hadoop Map/Reduce jobs. By inspecting the data and computing input splits, Hadoop can process the data in parallel so that very large datasets can be processed quickly.

The Mongo-Hadoop adapter also includes support for Pig and Hive, which allow very sophisticated MapReduce workflows to be executed just by writing very simple scripts.

  • Pig is a high-level scripting language for data analysis and building map/reduce workflows
  • Hive is a SQL-like language for ad-hoc queries and analysis of data sets on Hadoop-compatible file systems.

Hadoop streaming is also supported, so map/reduce functions can be written in any language besides Java. Right now the Mongo-Hadoop adapter supports streaming in Ruby, Node.js and Python.

How it Works

How the Hadoop Adapter works

  • The adapter examines the MongoDB Collection and calculates a set of splits from the data
  • Each of the splits gets assigned to a node in the Hadoop cluster
  • In parallel, Hadoop nodes pull data for their splits from MongoDB (or BSON) and process them locally
  • Hadoop merges results and streams output back to MongoDB or BSON
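
To sketch what the Java MapReduce side of that looks like, a job can be pointed at MongoDB through the adapter's input and output formats. The class names and configuration keys below are the ones shipped with Mongo-Hadoop; the mapper and reducer are placeholders you would supply:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;

public class MongoJobSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // read input documents from this collection...
        conf.set("mongo.input.uri", "mongodb://localhost:27017/demo.in");
        // ...and write the job's output back to this one
        conf.set("mongo.output.uri", "mongodb://localhost:27017/demo.out");

        Job job = new Job(conf, "mongo-hadoop sketch");
        job.setJarByClass(MongoJobSketch.class);
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        // job.setMapperClass(MyMapper.class);   // your map/reduce logic goes here
        // job.setReducerClass(MyReducer.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}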

I’ll be giving an hour-long webinar on What’s New with the Mongo-Hadoop integration. The webinar will cover

  • Using Java MapReduce with Mongo-Hadoop
  • Using Hadoop Streaming for other non-JVM languages
  • Writing Pig Scripts with Mongo-Hadoop
  • Mongo-Hadoop usage with Elastic MapReduce to easily kick off your Hadoop jobs

  • Overview of MongoUpdateWriteable: Using the result output from Hadoop to modify an existing output collection

The webinar will be offered twice on August 8:

Register for the Webinar on August 8


Improving Driver Documentation: The MongoDB Meta Driver

This is a guest post, written by Mario Alvarez, a MongoDB intern for Summer 2013

This summer, I worked on developing the Meta Driver project, an effort to re-work the drivers documentation, creating an implementable specification of driver functionality that could be applied to different drivers in different languages.

The Problem

Part of MongoDB’s appeal to developers is its ease of use. One important way in which MongoDB provides a good user experience for developers is its ecosystem of drivers - bindings for interacting with the database in a variety of languages, with one for each (major) language. Whereas many other databases force developers to construct queries and database commands using a specific query language, MongoDB’s drivers allow for smooth, idiomatic interaction with the database. Drivers must balance conflicting goals: on the one hand, they should allow developers to write code naturally in the language of their choice; on the other, drivers should strive to provide a relatively consistent experience across languages, to minimize the difficulty of switching between MongoDB-oriented development in different languages.

Because of the language-dependence of providing a natural developer experience, as well as the drivers’ varying authorship (many originated as, or still are, community-supported projects) MongoDB’s drivers embody a wide variety of design decisions regarding these and other tradeoffs. This, coupled with the lack of a fully clear specification of driver functionality, makes it difficult to create or maintain drivers - in cases where the current spec falls short, it is hard to know which other driver to look to as an example.

The Meta Driver is part of a solution to this issue. It creates an implementable specification of driver functionality, as well as creating a reference driver implementation, following the principles of Behavior-Driven Development, to demonstrate a driver meeting the specification. While this project alone is not a complete answer to the problem of inconsistencies within the driver ecosystem, it provides a useful framework around which to begin the process of standardization, as well as the basis of a single, unified driver documentation. In order to achieve these goals, an emphasis on improving the quality and consistency of driver documentation and on more hands-on management of drivers by 10gen itself will be necessary, among other things.

Behavior-Driven Development (BDD)

Behavior-Driven Development is a development methodology that emphasizes the creation and maintenance of useful documentation. Its principles and techniques help make the Meta Driver possible.

BDD has its roots in the Agile community; particularly, in Test-Driven Development (TDD). Under TDD, a developer first writes tests describing the functionality she wants her code to implement, then writes the minimum amount of useful code that will make the tests pass. As a discipline, TDD can be a useful defense against over-architecting code at the beginning, and also leads to the creation of a comprehensive testing framework that (at least ideally) completely describes the code’s functionality, making the code much easier to maintain.

BDD builds on this methodology, taking it a step further. Rather than beginning with writing tests, the Behavior-Driven developer begins by writing a human-readable specification of the behaviors her code should implement. These specifications do not have arbitrary structure; they are written in a format (generally, a simple specification language called Gherkin) that computers can parse as well. Thus, the specifications form a common ground between human and machine understanding of the specified software. In fact, Gherkin’s syntax is so natural that it can be read by non-technical stakeholders in a project, enabling the creation of a common language shared by everyone involved, minimizing the possibility for misunderstanding or ambiguity. The specification is oriented around user-stories, describing various scenarios of interaction with the software, and the behavior the software should exhibit in each scenario.

Next, BDD translates the written specifications into tests run against the code. This is done by writing a series of step definitions that map each step of each scenario to code implementing the step. Once these are written, the developer proceeds just as with TDD, hacking until the tests pass. The difference is that, rather than just a series of tests, the developer has a human-readable specification for her code, and a guarantee that the specification is met. This helps to solve a difficult problem in software engineering: keeping documentation current as the documented codebase changes. If documentation has behavior specifications at its core, it will always be current. BDD serves to keep code and documentation “honest”: if the two do not match, the tests will not pass, and the developer will know where the problem is.

BDD in action: an example

The specifications for a project in BDD are contained in .feature files, which describe scenarios, each with a series of steps, as described above. Here is a simple example.
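
What follows is an illustrative sketch rather than the exact feature used by the reference driver:

Feature: Serialize Ruby objects to BSON
  As a driver author
  I want Ruby hashes to round-trip through BSON
  So that application data can be stored in MongoDB

  Scenario: Serialize a hash containing a string
    Given a Ruby hash with the key "name" and the string value "mongo"
    When I serialize the hash to BSON
    Then the result is a valid BSON document containing the string "mongo"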

These files are placed in the features directory (default location) at the root of the project being specified. The step definitions are placed in features/support. These are written in the language of the code the specifications are being run against (in this case, .rb files). Below are the step definitions implementing the steps making up the feature excerpt shown above.
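
Step definitions for the sketch above could look roughly like this; BSON.serialize and BSON.deserialize are placeholders standing in for whatever API the reference implementation actually exposes:

Given(/^a Ruby hash with the key "(.*)" and the string value "(.*)"$/) do |key, value|
  @hash = { key => value }
end

When(/^I serialize the hash to BSON$/) do
  @bson = BSON.serialize(@hash)           # placeholder for the reference driver's API
end

Then(/^the result is a valid BSON document containing the string "(.*)"$/) do |value|
  document = BSON.deserialize(@bson)      # placeholder for the reference driver's API
  document.values.should include(value)
end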

Cucumber matches the provided regular expressions against lines in the .feature file, using matching groups (surrounded by parentheses) to extract arguments that are then passed in to the provided code blocks. In this way, steps can be made flexible, able to accept a variety of different arguments and thus able to model a wider range of use-cases.

To run these steps, one must first install Cucumber (this can be done via RubyGems: gem install cucumber). The cucumber executable can take a variety of options; cucumber --help will give a fairly detailed description. Without any options, cucumber looks in the default features directory and attempts to run all the tests described by the features and step definitions it finds there. For each step, Cucumber outputs whether the test succeeded or failed, and if a step failed, Cucumber prints information about the specific assertion failure that caused it to fail. For this example, every step should pass and Cucumber finishes by printing a summary of the scenarios and steps it ran.

BDD and Drivers: The Meta Driver

BDD can play a particularly useful role in helping to define and standardize driver behavior across languages. Since the Gherkin specifications of driver functionality are language-independent, they can serve as a common reference for all drivers, and be directly implemented as tests on each driver. In this way, drivers can be held to a common set of functionality.

Meta Driver Challenges

In practice, this approach can lead to some complications. The specifications must be written with an eye toward being independent of the special quirks and needs of each language. Otherwise, overly-complex step definitions would be required in order to “hack around” the shortcomings of the specifications; this would obscure the direct relationship between the feature-files and the tests to which feature scenarios correspond, limiting the specification’s usefulness. As the specifications are linked to an ever-widening array of drivers and languages, they will likely have to be revised to take into account any new language-dependencies we discover in the process. Slight differences between the specifications used in different languages may need to be introduced, if languages have differences that cannot be reconciled at the feature-spec level. We hope to keep these differences to a minimum, if they are needed at all; so far, they have not been.

Current Work

So far, I have specified a new BSON implementation for Ruby (the code for which - but not the Cucumber spec - can be found here). Building on this implementation, I have implemented and specified the MongoDB wire protocol, allowing the reference driver to communicate with a real MongoDB server. Finally, I implemented the beginnings of a CRUD API, enabling a user to issue insert, update, delete, and find commands to the database.

I have also begun work on writing step definitions in Python to attach the specification to the PyMongo driver, as a first test-case for using the specification across programming-language lines, and on a pre-existing, production driver. So far I have implemented steps on PyMongo for BSON and the wire protocol. Adjusting for differences in the driver APIs and the tools available for Ruby and Python caused some minor hitches, but as we’d hoped, only small changes were needed in order to make the specifications work with PyMongo, and the changes can be back-ported to the Ruby driver in order to keep the two consistent.

Driver Specification

The full specification can be found on Github. Here is an example of some features and step implementations for BSON, in the same vein as the above example. This excerpt contains specifications and step implementations for deserializing objects containing individual BSON values. Here is part of a feature:
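
The scenarios below are an illustrative sketch in that spirit rather than a verbatim excerpt:

Feature: Deserialize BSON values

  Scenario: Deserialize a document containing an int32
    Given a BSON document containing the int32 value 42 keyed by "k"
    When I deserialize the document
    Then the value for "k" is the int32 42

  Scenario: Deserialize a document containing a string
    Given a BSON document containing the string value "hello" keyed by "k"
    When I deserialize the document
    Then the value for "k" is the string "hello"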

Here are the corresponding step definitions:
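
Again sketched against placeholder serialization helpers:

Given(/^a BSON document containing the (\w+) value (.*) keyed by "k"$/) do |type, value|
  @expected = (type == "string") ? value.delete('"') : Integer(value)
  @bson = BSON.serialize({ "k" => @expected })   # placeholder serialization helper
end

When(/^I deserialize the document$/) do
  @document = BSON.deserialize(@bson)            # placeholder deserialization helper
end

Then(/^the value for "k" is the (\w+) (.*)$/) do |type, value|
  expected = (type == "string") ? value.delete('"') : Integer(value)
  @document["k"].should == expected
end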

This example tests a few simple cases of the general problem of BSON deserialization: deserializing single objects, wrapped in a BSON document with one field (the key for the field is ‘k’). This excerpt is a good example of how feature files can be constructed, beginning with simple cases of functionality and building up toward larger, more advanced ones (later in the same file, deserialization of larger, more complex, heterogeneous BSON objects is tested). For more examples, see the full feature file and the step definitions in the repository.

Using the Specification with Production Drivers

The Meta Driver specification is ultimately designed to be integrated with a wide array of production drivers, in order to ensure that all conform to a common set of behaviors. In order to integrate the specifications with a new driver, step definitions in that driver’s language must be written that link the specification to calls against the driver code. While this is not an especially difficult task, it is nontrivial (differences between languages and between API designs for different drivers mean that the same steps must sometimes be implemented differently for different drivers). Ideally, the only difference between languages and drivers should be the step definitions; the feature files shared between them should be identical. In practice, some small differences may need to be tolerated, but they should be kept to an absolute minimum, since such differences reduce the usefulness of the Meta Driver by undermining its status as a single universal specification.

Binding to PyMongo

Though the vast majority of work in “porting” the Meta Driver specification to other, production drivers has yet to be done, I have begun work on binding the specification to PyMongo, the production MongoDB Python driver. I chose Python because it is largely similar to Ruby in terms of the paradigms it supports, but is different enough to create challenges and expose whatever subtle dependencies on Ruby or the Ruby implementation might exist in the Meta Driver features. Binding the specification to the production Ruby driver would be less likely to reveal such problems (though support for the production Ruby driver is an eventual goal).

So far, using the Behave implementation of Gherkin in Python, I have been able to get the BSON and wire-protocol features to run successfully on PyMongo, with only minimal changes to the specification (mostly, omitting some scenarios to compensate for functionality present in the reference driver’s API that is not supported by PyMongo; some of these can be worked around, and most of the ones that cannot are not very important).

There are frameworks for many other programming languages available. The Cucumber wiki has a partial listing, although there are many others, including (for some languages) multiple implementations, with distinct strengths and weaknesses. For example, I chose Behave over Lettuce and Freshen (two other Python Gherkin implementations) because it was better maintained and appeared to have a cleaner API for defining steps. Choosing between different Gherkin/Cucumber implementations in the target language is another important part of the process of making that language’s driver work with the Meta Driver specification.

Conclusion

Behavior-Driven Development is a useful way to create documentation for programs, encouraging the creation of docs that are both useful to humans and directly linked to the code they specify. Because of the polyglot nature of MongoDB drivers, and the importance of providing a consistent user experience for developers working with the drivers, BDD was an extremely good fit for the Meta Driver project. However, BDD has many other applications, across different products and different languages. Any project important enough to require accurate and up-to-date docs should consider incorporating BDD into its development methodology. While it is far from a “silver bullet” for documenting and testing code, it can streamline the process of doing both well, allowing developers to extend and maintain their code with confidence.

We believe that there are many other opportunities to deploy BDD across the 10gen/MongoDB codebase, and hope that this is just the beginning of 10gen embracing this methodology.

Future Steps

A lot still remains to be done in achieving the goal of a general specification and testing across many drivers. The specification must be extended further, to more completely cover CRUD operations, as well as to support functionality for administrative commands, read preference, write concern, replica sets, and other more advanced but crucial features of MongoDB. At the same time, the specification will need to be applied to other drivers, in order to expand its reach and to learn more about the varying needs and characteristics of different drivers. The specifications will likely need to be revised during this process.

It will also be important to document the process of attaching the Meta Driver specification to a new driver. Since I have only done this once (and there is still more work to do on PyMongo), I won’t be able to document this process very completely by the time I leave; it will likely have to be put together over time. I will also try to compile some of the best-practices for writing Cucumber specifications I discovered while working on the Meta Driver itself, to help out non-driver projects that want to make use of BDD.

Resources

For information about the Cucumber platform, cukes.info, the Cucumber website, is an excellent resource, as is the Cucumber wiki. The wiki contains a listing of Cucumber/Gherkin implementations in various languages, though it is not a complete catalogue. The wiki also has a useful page on the Gherkin specification language.

This project is completely open-source, and can be found in MongoDB’s github, here. The readme file in that repository contains some useful practical information for getting started using the Meta Driver, not covered here.

Acknowledgements

I’d like to thank Mike Friedman, my mentor, as well as Gary Murakami, Tyler Brock, and the rest of the drivers team at 10gen for their help and support on this project. I’d also like to thank 10gen as a whole, for a wonderful summer internship!

The MongoDB Java Driver 3.0

By Trisha Gee, MongoDB Java Engineer and Evangelist

You may have heard that the JVM team at 10gen is working on a 3.0 version of the Java driver. We’ve actually been working on it since the end of last year, and it’s probably as surprising to you as it is to me that we still haven’t finished it yet. But this is a bigger project than it might seem, and we’re working hard to get it right.

So why update the driver? What are we trying to achieve?

Well, the requirements are:

  • More maintainable
  • More extensible
  • Better support for ODMs, third party libraries and other JVM languages
  • More idiomatic for Java developers

That’s all very nice, but it’s a bit fluffy. You can basically summarise it as “better all round”, which is probably the requirement of any major upgrade. Since it’s too fluffy to guide us in our development, we came up with the following design goals.

Design Goals

  • Consistency
  • Cleaner design
  • Intuitive API
  • Understandable Exceptions
  • Test Friendly
  • Backwards compatible

Consistency

Java developers using the driver will have encountered a number of inconsistencies: the way you do things in the shell, or in other drivers, is not always the same way you do things in the Java driver. Even using just the Java driver, methods are confusingly named (what’s the difference between createIndex and ensureIndex, for example?); the order of parameters is frequently different; often methods are overloaded but sometimes you chain methods; there are helpers such as QueryBuilder but sometimes you need to manually construct a DBObject, and so on.

If you’re working within the driver, the inconsistencies in the code will drive you mad if you’re even slightly OCD: use of whitespace, position of curly braces, position of fields, mixed field name conventions and so on. All of this may seem pedantic to some people, but it makes life unnecessarily difficult if you’re learning to use the driver, and it means that adding features or fixing bugs takes longer than it should.

Cleaner Design

It’s easy to assume that the driver has a single, very simple, function - to serialise Java to BSON and back again. After all, its whole purpose is to act as a facilitator between your application and MongoDB, so surely that’s all it does - turn your method call and Java objects into wire-protocol messages and vice versa. And while this is an important part of what the driver does, it’s not its only function. MongoDB is horizontally scalable, so that means your application might not be talking to just a single physical machine - you could be reading from one of many secondaries, you could be writing to and reading from a sharded environment, you could be working with a single server. The driver aims to make this as transparent as possible to your application, so it does things like server discovery, selects the appropriate server, and tries to reuse the right connection where appropriate. It also takes care of connection pooling. So as well as serialisation and deserialisation, there’s a whole connection management piece.

The driver also aims to provide the right level of abstraction between the protocol and your application - the driver has a domain of its own, and should be designed to represent that domain in a sane way - with Documents, Collections, Databases and so on exposed to your application in a way that you can intuitively use.

But it’s not just application developers that are using the driver. By implementing the right shaped design for the driver, we can make it easier for other libraries and drivers to reuse some of the low-level code (e.g. BSON protocol, connection management, etc) but put their own API on the front of it - think Spring Data, Morphia, and other JVM languages like Scala. Instead of thinking of the Java driver as the default way for Java developers to access MongoDB, we can think of this as the default JVM driver, on top of which you can build the right abstractions. So we need to make it easier for other libraries to reuse the internals without necessarily having to wrap the whole driver.

All this has led us to design the driver so that there is a Core, around which you can wrap an API - in our case, we’re providing a backward-compatible API that looks very much like the old driver’s API, and we’re working on a new fluent API (more on that in the next section). This Core layer (with its own public API) is what ODMs and other drivers can talk to in order to reuse the common functionality while providing their own API. Using the same core across multiple JVM drivers and libraries should give consistency to how the driver communicates with the database, while allowing application developers to use the library with the most intuitive API for their own needs.

Intuitive API

We want an API that:

  1. Feels natural to Java developers
  2. Is logical if you’ve learnt how to talk to MongoDB via the shell (since most of our documentation references the shell)
  3. Is consistent with the other language drivers.

Given those requirements, it might not be a surprise that it’s taking us a while to come up with something that fits all of them, and this process is still in progress. However, from a Java point of view, we would like the following:

  1. Static typing is an advantage of Java, and we don’t want to lose that. In particular, we’re keen for the IDE to help you out when you’re trying to decide which methods to use and what their parameters are. We want Cmd+space to give you the right answers.
  2. Generics. They’ve been around for nearly 10 years; we should probably use them in the driver.
  3. We want to use names and terms that are familiar in the MongoDB world. So, no more DBObject, please welcome Document.
  4. More helpers to create queries and objects in a way that makes sense and is self-describing

The API is still evolving; what’s in Github WILL change. You can take a look if you want to see where we are right now, but we make zero guarantees that what’s there now will make it into any release.

Understandable Exceptions

When you’re troubleshooting someone’s problems, it becomes obvious that some of the exceptions thrown by the driver are not that helpful. In particular, it’s quite hard to understand whether it’s the server that threw an error (e.g. you’re trying to write to a secondary, which is not allowed) or the driver (e.g. can’t connect to the server, or can’t serialise that sort of Java object). So we’ve introduced the concept of Client and Server Exceptions. We’ve also introduced a lot more exceptions, so that instead of getting a MongoException with some message that you might have to parse and figure out what to do, we’re throwing specific exceptions for specific cases (for example, MongoInvalidDocumentException).

This should be helpful for anyone using the driver - whether you’re using it directly from your application, whether a third party is wrapping the driver and needs to figure out what to do in an exceptional case, or whether you’re working on the driver itself - after all, the code is open source and anyone can submit a pull request.

Test Friendly

The first thing I tried to do when I wrote my first MongoDB & Java application was mock the driver - while you’ll want some integration tests, you may also want to mock or stub the driver so you can test your application in isolation from MongoDB. But you can’t. All the classes are final and there are no interfaces. While there’s nothing wrong with performing system/integration/functional tests on your database, there’s often a need to test areas in isolation to have simple, fast-running tests that verify something is working as expected.

The new driver makes use of interfaces at the API level so that you can mock the driver to test your application, and the cleaner, decoupled design makes it easier to create unit tests for the internals of the driver. And now, after a successful spike, we’ve started implementing Spock tests, both functional and unit, to improve the coverage and readability of the internal driver tests.

In addition, we’re trying to implement more acceptance tests (which are in Java, not Groovy/Spock). The goal here is to have living documentation for the driver - not only for how to do things (“this is what an insert statement looks like”) but also to document what happens when things don’t go to plan (“this is the error you see when you pass null values in”). These tests are still very much a work in progress, but we hope to see them grow and evolve over time.

Backwards Compatible

Last, but by no means least, all this massive overhaul of design, architecture, and API MUST be backwards compatible. We are committed to all our existing users; we don’t want them to have to do a big bang upgrade simply to get the new and improved driver. And we believe in providing users with an upgrade path which lets them migrate gradually from the old driver, and the old API, to the new driver and new API. This has made development a little bit more tricky, but we think it’s made it easier to validate the design of the new driver - not least because we can run existing test suites against the compatible new driver (the compatible-mode driver exposes the old API but uses the new architecture), to verify that the behaviour is the same as it used to be, other than deprecated functionality.

In Summary

It was time for the Java Driver for MongoDB to have a bit of a facelift. To ensure a quality product, the drivers team at 10gen decided on a set of design goals for the new driver and have been hard at work creating a driver that meets these criteria.

In the next post, we’ll cover the new features in the 3.0 driver and show you where to find it.

The MongoDB Web Shell

About

The MongoDB Web Shell is a web application designed to emulate some of the features of the mongo terminal shell. This project has three main uses: try.mongodb.org, 10gen Education online classes, and the MongoDB API documentation.

In these three different contexts, users will be able to familiarize themselves with the MongoDB interface and basic commands, both independently and as part of a 10gen Education homework assignment.

See a screenshot of the state of the browser shell prior to this summer below:

Architecture

The user interfaces with the new MongoDB Web Shell through any web browser. A majority of the codebase for the project is written in Javascript executed in the browser. It provides a responsive command line interface similar to the MongoDB desktop shell. Because the MongoDB console is also implemented in Javascript, it is convenient to evaluate user-written code locally.

The browser-based shell interacts with a backing mongod instance through a RESTful interface implemented on top of Flask, a Python web application microframework. The backend provides

  1. Sandboxing capability to manage sessions and access to resources on the backing database
  2. A framework for preloading data into a given resource
  3. The ability to verify data state for use in the online classes

One interesting problem we ran into is allowing the blocking paradigm of code used by the mongo shells without actually blocking in the web browser. Browser tabs run on a single thread, so if Javascript code blocks on that thread, you can’t scroll or select things or interact with the tab at all. However, running code such as length = db.foo.aggregate(pipeline).results.length traditionally requires that the aggregate call is a blocking request so that it can fetch the remote data and return it in a single function call. Normally, we would just rewrite our function calls to take callbacks, but since this is a web shell, we don’t have control over the code being written. Our solution was to write a Javascript evaluator that could pause execution at any time, similar to coroutines. This solution manifested itself in a library called suspend.js, which evaluates Javascript program strings and can pause execution, perform some arbitrary, possibly asynchronous actions, then resume execution with the result of those actions. Thus users can still write code in the blocking paradigm that the mongo shell uses without locking up the browser.

1) try.mongodb.org

One of the primary goals of the project is to replace the existing browser shell on try.mongodb.org. The newer version provides better support for Javascript syntax and MongoDB features in a cleaner UI. Among the features present in the shell are syntax highlighting (replacing the green-on-black styling), fully-featured Javascript, and a new tutorial system that extends beyond the shell window into the containing page.

A screenshot of the new shell:

One of the challenges in developing a fully featured shell was the ability to execute user-written code in a sandboxed environment within the browser. Originally, the shell relied on parsing the code into an abstract syntax tree (AST) and mutating the source to namespace variables in a manner where var x would become mongo.shells[index].vars['x']. We quickly realized that this implementation would be difficult to maintain given potential feature growth and the number of edge case scenarios. Instead, the latest version of the shell delegates the scoping of variables to the browser by executing in an iframe element that intrinsically provides many sandboxing features.
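A minimal sketch of the iframe approach is below; it is illustrative only, and shellDbProxy stands in for whatever objects the shell actually exposes to user code:

// Each shell instance gets a hidden iframe, so user-declared variables live
// in the iframe's own global scope rather than leaking into the host page.
var sandbox = document.createElement('iframe');
sandbox.style.display = 'none';
document.body.appendChild(sandbox);

var sandboxWindow = sandbox.contentWindow;
sandboxWindow.db = shellDbProxy;   // hypothetical: expose only the shell's db object

function evalInSandbox(src) {
  // "var x = 5" creates sandboxWindow.x, not a variable on the outer window
  return sandboxWindow.eval(src);
}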

2) 10gen Education

The MongoDB Web Shell will also make its debut on the 10gen Education platform, where users learn to develop for and administer MongoDB quickly and efficiently. Currently, the assignments for the online classes must be downloaded and solved locally on an individual installation of MongoDB. With the web shell, students will be able to complete and submit assignments entirely within the education platform, without needing to download and install their own copy of MongoDB or sample data. This online solution makes it more convenient for users to take courses wherever they may be and makes it easier to validate the correct completion of an assignment. Furthermore, users will no longer be required to load their own datasets using mongorestore, since that is handled directly by the shell. Finally, since all the work is stored in their online account, students will be able to pick up where they left off from any computer.

3) MongoDB Documentation

Finally, the MongoDB Web Shell will be embedded into the API reference pages as part of the database’s documentation. This will allow a user to immediately test the commands detailed on the page and to evaluate their effects against provided sample data sets.

Resources

This project is Apache licensed open source software and is freely available through this Github repository. We have provided documentation using the Github wiki and welcome any feedback or contributions through issues and pull requests. The mentioned suspend.js library can be found at this Github repository.

About Us

The MongoDB Web Shell was developed by Ryan Chan and Danny Weinberg, who are currently interns at 10gen.

Acknowledgements

Special thanks to 10gen for hosting us as interns, Stacy Ferranti and Ian Whalen for managing the internship program, and our mentors Ian Bentley and Emily Stolfo for their help and support throughout the project.

Mongoose 3.7.0 (Unstable) Released

By EJ Bensing, MongoDB intern for Summer 2013

I’ve spent the last 2 months interning at 10gen, the MongoDB company, working on Mongoose. It has been a lot of fun and I’ve learned a ton about Node.js, MongoDB, and building open source libraries. I’m going to save all of that for a different post though, and instead talk about the newest release of Mongoose.

Unstable

To start things off, this is an unstable release. This means that it contains potentially breaking changes or other major updates, and thus should probably not be used in production. You can tell this is an unstable release because of the middle digit. Starting from 3.7, odd middle digits mean unstable, even mean stable. This is identical to the Node.js and MongoDB versioning schemes.

Mquery

The largest change in this release is under the hood. We are now using the Mquery library to power querying in Mongoose. This allowed us to strip a lot of logic out of Mongoose and move it into a separate module. This should make things easier to maintain in the future and allows for the query engine to be reused elsewhere.

It does add some neat custom query functionality though:

var greatMovies = Movies.where('rating').gte(4.5).toConstructor();

// use it!
greatMovies().count(function (err, n) {
  console.log('There are %d great movies', n);
});

greatMovies().where({ name: /^Life/ }).select('name').find(function (err, docs) {
  console.log(docs);
});

Note: This feature is currently unreleased, but expected out in a 3.7.x update soon. Watch this pull request

Benchmarks

We now have some basic benchmarks for Mongoose. We will definitely expand these in the future, and they will allow us to start improving Mongoose’s performance and bringing it more in line with the native driver. For the most part, Mongoose has almost every feature you can think of; however, it is a little slow when compared to the driver. These benchmarks are our first step in working towards fixing that issue.

There are 7 different benchmarks:

Each of these can be run on the command line using node, although it is recommended you have the MONGOOSE_DEV environment variable set to true when you run the benchmarks. We will also have graphs automatically generated using a new tool I built, gh-benchmarks. An example running on my development fork can be found here.

More Geospatial Support

We’ve added a lot in the geospatial area. First, we’ve added support for $geoWithin. This should largely be an unnoticed change; however, if you’re using a version of MongoDB before 2.4, you’ll need to make sure to set Mongoose to continue using $within.
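A minimal sketch of that opt-out, assuming the use$geoWithin flag described in the Mongoose query documentation:

var mongoose = require('mongoose');

// MongoDB 2.4 renamed $within to $geoWithin, and Mongoose 3.7 emits
// $geoWithin by default. On servers older than 2.4, flip this flag so
// geometry queries built with where().within() keep using $within.
mongoose.Query.use$geoWithin = false;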

Next, we’ve added support for Haystack indices and the $geoNear command. This can be accessed directly off the model object.

var mongoose = require('mongoose');
var Schema = mongoose.Schema;

var schema = new Schema({
  pos : [Number],
  type: String
});

// geoNear needs a geospatial index on the queried field
schema.index({ "pos" : "2dsphere" });

var Geo = mongoose.model('Geo', schema);

// find documents near [9, 9] using spherical geometry
Geo.geoNear([9, 9], { spherical : true, maxDistance : .1 }, function (err, results, stats) {
  console.log(results);
});

Finally, we’ve also added support for the $geoSearch command.

var schema = new Schema({
  pos : [Number],
  complex : {},
  type: String
});

// geoSearch requires a geoHaystack index
schema.index({ "pos" : "geoHaystack", type : 1 }, { bucketSize : 1 });

var Geo = mongoose.model('Geo', schema);

// find documents with type "place" near [9, 9]
Geo.geoSearch({ type : "place" }, { near : [9, 9], maxDistance : 5 }, function (err, results, stats) {
  console.log(results);
});

Aggregation Builder

We now have basic support for building aggregation queries the same way we build normal queries.

Example:

Model.aggregate({ $match: { age: { $gte: 21 }}}).unwind('tags').exec(cb);
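The builder chains pipeline stages much like the query builder chains conditions. A slightly longer sketch, with illustrative stage arguments, might look like this:

// Count documents per tag for users who are at least 21 (illustrative).
Model.aggregate({ $match: { age: { $gte: 21 } } })
  .unwind('tags')
  .group({ _id: '$tags', count: { $sum: 1 } })
  .exec(function (err, results) {
    console.log(results);
  });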

See the change notes for more information.

Tons of Bug fixes and other minor enhancements

Check the change notes to see everything that has gone into this release.

  • Release Notes
  • Docs
  • About Unstable Releases

$push to sorted array

By Sam Weaver, MongoDB Solutions Architect and Alberto Lerner, MongoDB Kernel Lead

MongoDB 2.4 introduced a feature that many have requested for some time - the ability to create a capped array.

Capped arrays are great for any application that needs a fixed-size list. For example, if you’re designing an ecommerce application with MongoDB and want to include a listing of the last 5 products viewed, you previously had to issue a $push request for each new item viewed, and then a $pop to kick the oldest item out of the array. Whilst this method was effective, it wasn’t necessarily efficient. Let’s take an example of the old way to do this:

First we would need to create a document to represent a user which contains an array to hold the last products viewed:

db.products.insert({last_viewed:["bike","cd","game","bike","book"]})
db.products.findOne()
{
    "_id" : ObjectId("51ff97d233c4f2089347cab6"),
	"last_viewed" : [
		"bike",
		"cd",
		"game",
		"bike",
		"book"
	]
}

We can see the user has looked at a bike, cd, game, bike and book. Now, if they look at a pair of skis, we need to push “skis” into the array:

db.products.update({},{$push: {last_viewed: "skis"}})
db.products.findOne()
{
	"_id" : ObjectId("51ff97d233c4f2089347cab6"),
	"last_viewed" : [
		"bike",
		"cd",
		"game",
		"bike",
		"book",
		"skis"
	]
}

You can see at this point we have 6 values in the array. Now we would need a separate operation to pop “bike” out:

db.products.update({},{$pop: {last_viewed: -1}})
db.products.findOne()
{
	"_id" : ObjectId("51ff97d233c4f2089347cab6"),
	"last_viewed" : [
		"cd",
		"game",
		"bike",
		"book",
		"skis"
	]
}

In MongoDB 2.4, we combined these two operations to maintain a limit for arrays sorted by a specific field.

Using the same example document above, it is now possible to maintain a fixed-size array in a single update operation by using $slice:

db.products.update({},{$push:{last_viewed:{$each:["skis"],$slice:-5}}})

You push the value “skis” into the last_viewed array and then slice it to the last 5 elements. This gives us:

db.products.findOne()
{
	"_id" : ObjectId("51ff9a2d33c4f2089347cab7"),
	"last_viewed" : [
		"cd",
		"game",
		"bike",
		"book",
		"skis"
	]
}

Mongo maintains the array in natural order and trims it to 5 elements. With a negative $slice, the array is trimmed from the front so that the most recent elements are kept; in 2.4 the $slice value must be negative or zero, and slicing from the front with a positive value is one of the planned extensions. You can also keep the array ordered by a field of its elements by passing $sort together with $each and $slice, although in 2.4 $sort only applies to arrays of embedded documents. This helps avoid unbounded document growth and keeps the most relevant items available in a single document read.
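As a sketch of the $sort variant (the collection, field names, and userId selector here are illustrative, not from the post above), you could keep only a user’s five highest-scoring posts by sorting ascending and slicing from the end:

// Push a new post, sort the array by score ascending, then keep the last
// five elements - i.e. the five highest scores. In 2.4, $sort requires the
// array elements to be documents and must be used with $each and $slice.
db.users.update(
  { _id: userId },   // hypothetical selector
  { $push: {
      topPosts: {
        $each:  [ { title: "New post", score: 42 } ],
        $sort:  { score: 1 },
        $slice: -5
      }
  }}
)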

There are lots of other applications for this feature:

  • Keeping track of the newest messages in a messaging system
  • Managing recent clicks on a website
  • Last accessed/viewed products
  • Top X users/comments/posts

The list goes on.

This feature is available in MongoDB 2.4, but there are many extensions requested, such as full $sort and $slice semantics in $push (SERVER-8069), and making the $slice operation optional (SERVER-8746). Both of these are planned for the 2.5 development series.

Special thanks to Yuri Finkelstein from eBay who was very enthusiastic about this feature and inspired this blog post.
