As I wrote previously in “DynamoDB and Me”, I’ve been using Amazon’s hosted NoSQL datastore for some new projects, including Lensmob.com. I like it, but it inevitably led me to write a library for better, higher-level usage: Dynochemy.
The following is the story of the evolution of this library: how it started as a simple wrapper to enable easier integration with Tornado and then grew into its own datastore framework. The story isn’t over yet, but you might find this article worthwhile if you’re using DynamoDB in the real world or if you just enjoy a good software engineering yarn.
Early April, 2012: The Beginning
The beginnings of Dynochemy came from two deficiencies in existing libraries:
- Lack of async support
- Lack of a reasonable API for performing operations, rather than manually constructing requests
I found asyncdynamo, a library written by some developers at bit.ly. It plugs into boto and provides an async interface. The most obvious hurdle at that time was how low-level it was. Rather than doing something like:
db.put({'name': 'Rhett Garber'})
You ended up with something more like:
data = json.dumps({'TableName': 'MyTable', 'Item': {'name': {'S': 'Rhett Garber'}}})
client.make_request('PutItem', body=data, callback=all_done)
(In later versions of boto, they added some high-level support that made creating these requests easier; however, integrating with the new async client would still be a challenge.)
Adding a nice little API for each operation seemed straightforward enough. Plus, it gave me the opportunity to really dive deep into what DynamoDB supported. This first version of Dynochemy supported the basic operations of put, get, scan, and query, and was fairly high-level. I wasn’t sure exactly how I would end up wanting to integrate Dynochemy into different types of applications, so I wanted to support a few different APIs: callback, defer, or synchronous. So I ended up with four ways to do the exact same thing:
# Synchronous, dict-style
db['123'] = {'name': 'Rhett Garber'}

# Synchronous
db.put({'name': 'Rhett Garber', 'id': '123'})

# Defer
df = db.put_defer({'name': 'Rhett Garber', 'id': '123'})
df(ioloop=self.ioloop)

# Callback
db.put_async({'name': 'Rhett Garber', 'id': '123'}, callback=after_put)
This is the first time I’ve mentioned defers. I had a vague recollection of the defer concept from when I worked with Twisted. The general idea is to have an object that represents the completion of some asynchronous task. Twisted’s version seemed too complicated, so of course I tried to implement my own. Through the course of this project, I’ve come to understand why Twisted is so complicated. I also accidentally ended up with a defer system that is almost identical (or at least compatible) to the futures now built into Python (http://www.python.org/dev/peps/pep-3148/). Don’t knock reinventing the wheel; it’s a great learning process.
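To make the idea concrete, here’s a rough sketch of what I mean by a defer. This is an illustration of the concept, not Dynochemy’s actual class:

# Illustrative sketch of a defer: an object that records a single result
# (or error) once and hands it back on demand. Not Dynochemy's actual code.
class Defer(object):
    def __init__(self):
        self.done = False
        self.result = None
        self.error = None
        self._callbacks = []

    def callback(self, result, error=None):
        # Record the outcome and notify anyone already waiting.
        self.done = True
        self.result = result
        self.error = error
        for cb in self._callbacks:
            cb(self)

    def add_done_callback(self, cb):
        if self.done:
            cb(self)
        else:
            self._callbacks.append(cb)

    def __call__(self):
        # Synchronous access, following the "result, error = df()" convention.
        return self.result, self.error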
Late April, 2012: Some Tooling
I soon realized this was becoming a larger project than a simple wrapper around asyncdynamo. Testing and development needed to be a little more streamlined. Connecting to a live DynamoDB installation was a pain, especially if I wanted any automated test cases.
I decided it shouldn’t be too hard to create a backing store that was SQLite-based. It wouldn’t be async-capable, but I could at least reasonably test my data operations.
With a little refactoring, I had a pluggable client I could set as my database abstraction’s interface.
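In rough outline, it looked something like the following. The names here are my own shorthand, not Dynochemy’s exact interfaces; the point is that the database abstraction only ever talks to a client object with a make_request-style method, so backends can be swapped freely.

# Illustrative sketch of a pluggable client; Dynochemy's real interfaces differ.
class DynamoClient(object):
    def __init__(self, async_client):
        self.async_client = async_client

    def make_request(self, operation, body, callback=None):
        # Delegate to the real async HTTP client (e.g. asyncdynamo).
        return self.async_client.make_request(operation, body=body, callback=callback)

class SQLiteClient(object):
    def __init__(self, engine):
        self.engine = engine

    def make_request(self, operation, body, callback=None):
        # Translate the DynamoDB-style request into SQL, run it synchronously,
        # then invoke the callback immediately.
        result = self._execute(operation, body)
        if callback is not None:
            callback(result)
        return result

    def _execute(self, operation, body):
        raise NotImplementedError  # PutItem/GetItem/Query -> SQLAlchemy calls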
June, 2012: Functional Completeness?
By this point, I had most of the operations supported and I was starting to use Dynochemy in a real product.
Batch operations were particularly challenging to design a good API for. I realized that batch operations can span multiple tables, and my API didn’t really allow for that because the db instance was really a specific table.
July, 2012: Tables and Errors
A major change to the API allowed Dynochemy to support multiple tables. Rather than:
db['123'] = {'name': 'Rhett Garber'}
We now do:
db.MyTable['123'] = {'name': 'Rhett Garber'}
I also started to understand just how complicated it was to handle errors in async code. Keeping my error handling straight would continue to haunt this project.
Another cool feature to come out of this time period was code to run through all the pages of a query, so you could do something like:
q = db.MyTable.query(hash_key).range(1234324, 1234340).async()
results, err = run_all(q)
This would run the query until all the results were found. In production, I soon learned, this was almost completely useless because you’ll quickly exhaust your provisioned throughput and then the query just fails.
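For what it’s worth, running all the pages boils down to re-issuing the query with DynamoDB’s LastEvaluatedKey until none comes back. A rough sketch of the loop (not Dynochemy’s actual run_all):

# Sketch of paging through a DynamoDB query. DynamoDB signals more pages via
# LastEvaluatedKey, which is passed back as ExclusiveStartKey on the next request.
# 'execute_query' is a placeholder for whatever issues the actual HTTP request.
def run_all_pages(execute_query, request):
    items = []
    while True:
        response = execute_query(request)
        items.extend(response.get('Items', []))
        last_key = response.get('LastEvaluatedKey')
        if not last_key:
            break
        request['ExclusiveStartKey'] = last_key
    return items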
Late August, 2012: Operations and Solvents
I finally got to one of the major reasons I wanted this library in the first place: Dealing with provisioning and rate limiting.
Ever since I introduced the batch operations, I thought something just wasn’t quite right with my design. I knew that would come to a head when I wanted my library to deal with provisioning and retries.
The system I really wanted was to be able to say: “Do this set of operations, tell me when they are all done”. The library should transparently handle however many underlying requests to DynamoDB are required. Perhaps they can all be batched together, perhaps some of them need to be retried.
Also, I knew at some point I wanted to integrate Dynochemy with a caching layer like memcache. Being able to “replay” database operations against other plugins could result in some very useful applications.
So I created a new set of abstractions that interacted with the existing “raw” interfaces to give me these properties.
For lack of a better term, this is solved with a Solvent. A Solvent is a set of operations against one or more DynamoDB tables. When the solvent is executed, some number of HTTP requests are made to handle these operations. Eventually, a result comes back. The client can then examine the results for each operation.
It looks something like:
s = Solvent()
put1 = s.MyTable.put({'id': '123', 'name': 'Alice'})
put2 = s.MyTable.put({'id': '124', 'name': 'Bob'})
q = s.OtherTable.query(hash_key).limit(10)
res, err = s.run(db)
# Print all the query results
for r in res[q]:
    print r
Very functional and powerful. But also pretty verbose. Especially around error handling.
September, 2012: Views
As I got deeper into real-world use of this new datastore, some common patterns kept coming up. For Lensmob, a common access pattern is to query all the albums for a user. But you’ll also probably want to query all the users for an album. DynamoDB has very limited query options, so this means we need a table with the following structure:
{
    'album': 'album1',  # Hash Key
    'user': 'user1',    # Range Key
}
And a separate table organized just the opposite:
{
    'album': 'album1',  # Range Key
    'user': 'user1',    # Hash Key
}
Then we can do queries like:
db.AlbumUsers.query(album_id).limit(20)
db.UserAlbums.query(user_id).limit(20)
There are other types of secondary metadata that might need to be maintained: keeping a count of how many photos an album has, for instance. It could be very expensive to fetch all the photos for an album each time you want to display the count. Maintaining a counter like that can be very tricky, though, since each modification to a photo may need to update the counter.
Maintaining these associations and counters is pretty similar, but tedious to do manually. So, now that we had a smarter, higher-level interface to DynamoDB, we had the tools to automate this. I called this feature ‘Views’.
To create a view, you basically create a class that describes how a secondary meta entity is to be maintained. It uses the visitor design pattern: all operations in a solvent are delivered to each registered view, allowing it to create additional operations.
For example:
class UserAlbumsView(View):
    table = AlbumTable
    view_table = UserAlbumTable

    @classmethod
    def add(cls, entity):
        return [PutOperation(cls.view_table, {'album': entity['album'], 'user': entity['user']})]

    @classmethod
    def remove(cls, entity):
        return [DeleteOperation(cls.view_table, {'album': entity['album'], 'user': entity['user']})]
With this view, any album that’s created automatically has another table maintained that is organized by user. It is important to understand what is happening behind the scenes: a minimum of two sequential DynamoDB calls are required to maintain a view like this. The first simply adds the album. Second, we do any followup operations, such as adding to an index. We can’t really do them together in the same batch operation, or else our views could be inconsistent with the actual tables (imagine there wasn’t enough capacity on the album table, but there was on the user-album table).
Of course, maintaining the views is just half the story; we also want to query them. The View class also acts as something you can query against in a solvent. When you query against a View class, each page of the query results is fed into a BatchGetItem with the appropriate keys. This gives the query the ability to automatically return the final objects to you, not the intermediate relationship objects.
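The photo-count case mentioned earlier fits the same mold. Here’s a hypothetical sketch, assuming an increment-style operation exists alongside PutOperation and DeleteOperation; the names below are mine, not necessarily Dynochemy’s:

# Hypothetical: a View maintaining a per-album photo count. 'IncrementOperation'
# stands in for an atomic-add operation (DynamoDB's UpdateItem with an ADD
# action); it is an assumed name, not a confirmed Dynochemy API.
class AlbumPhotoCountView(View):
    table = PhotoTable
    view_table = AlbumCounterTable

    @classmethod
    def add(cls, entity):
        return [IncrementOperation(cls.view_table, {'album': entity['album']}, 'photo_count', 1)]

    @classmethod
    def remove(cls, entity):
        return [IncrementOperation(cls.view_table, {'album': entity['album']}, 'photo_count', -1)]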
January, 2013: Streamlining
After heavy use of Dynochemy in Lensmob and other projects, one thing kept annoying me: Error handling. The standard pattern for running a solvent in Tornado looked like this:
s = Solvent()
get_op = s.get(AlbumTable, album_id)
res, err = yield tornado.gen.Task(s.run_async, self.db)

# Check if the overall solvent failed
if err:
    raise err

# Check if the GetItem failed
album, err = res[get_op]
if err == ItemNotFoundError:
    return None
elif err:
    raise err

return album
Error handling this way was getting pretty annoying and repetitive. All that work just to get one thing from the database?
About this time I discovered the Python ‘futures’ PEP and library. It comes built into Python 3, but is available as a library for Python 2.7. Tornado also has some built-in handling for using futures, but it’s a little rough. Rather than trying to convert my entire library to futures, I took a more conservative approach and just made some changes to my own ‘defer’ class with respect to error handling.
Now, rather than the convention of:
result, error = df()
I changed it so that a defer will raise any generated exception when the result is asked for.
So rather than recording an error for a defer as:
df.callback(result, error=AnError)
Now, a successful result is recorded as:
df.callback(result)
Where an error is recorded as:
df.exception(AnError)
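In sketch form (again my own illustration, not the library’s exact code), the defer now stores either a result or an exception and re-raises when the result is accessed:

# Sketch of the revised defer: store a result or an exception, and re-raise
# the exception whenever the result is asked for.
class Defer(object):
    def __init__(self):
        self._result = None
        self._exception = None

    def callback(self, result):
        self._result = result

    def exception(self, exc):
        self._exception = exc

    def __call__(self):
        if self._exception is not None:
            raise self._exception
        return self._result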
This makes the common pattern of getting results from a Solvent much more straightforward:
s = Solvent()
get_op = s.get(AlbumTable, album_id)
res = yield tornado.gen.Task(s.run_async, self.db)

# Check if the GetItem failed
try:
    album = res[get_op]
except ItemNotFoundError:
    return None

return album
The Future
I have not discussed much about how the schema design for Lensmob evolved during this period. There is a lot of really great code that makes for a pretty useful datastore that is not in Dynochemy, but is in my application code base. This part of the application makes heavy use of Views, and uses just two DynamoDB tables: Entity and EntityIndex. It is inspired by the FriendFeed MySQL-as-NoSQL schema design.
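For a sense of the shape, here is my reading of that FriendFeed-style approach, not necessarily the exact Lensmob schema: entities live in one table keyed by id with a serialized body, and every queryable relationship lives in the index table pointing back at an entity id.

# Illustrative only: a plausible FriendFeed-style layout, not the actual schema.
{
    'id': 'album1',           # Hash Key in the Entity table
    'body': '{"name": ...}',  # Serialized entity attributes
}

{
    'index': 'user_albums:user1',  # Hash Key in the EntityIndex table
    'entity': 'album1',            # Range Key, pointing back at the entity
}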
Dynochemy is still pretty hard to use and I would be really surprised if many people looked at it and knew that it solved their problems.
In the next steps of this library’s evolution, I would like to do several things to clean up its use:
- Use Python futures and hopefully make use of other tooling around them already present in many libraries (Tornado)
- Clean up the interfaces and naming so that Solvent is a first-class citizen.
- Implement a caching plugin (and formalize the plugin interface)
- Integrate my custom schema in a way that makes it the default choice for designing an app. This includes tooling for re-building views and schema migrations.
If I actually implement the above, Dynochemy moves from being a library for accessing DynamoDB to a higher-level datastore that simply uses DynamoDB as a backing store.
This raises the question: is DynamoDB still a necessary requirement? My SQLite backing has actually been very useful and is pretty close to being production-ready as well. It uses SQLAlchemy, so it should be fairly straightforward to run it against MySQL (or Postgres or whatever).
One downside of re-orienting Dynochemy to run against SQL datastores is the lack of async support. One direction I would like to investigate is handling the transactions against the database in a separate thread pool, using futures.ThreadPoolExecutor or some such built-in tooling for executing futures. Anybody who knows me should be coughing up their coffee right now, since the idea that I would ever suggest using a thread is crazy. However, I think the futures interface, and the fact that Dynochemy’s threading can be totally isolated from the application (much like a ZeroMQ application), makes it possible this won’t become a multi-threaded disaster. This is a direction for future investigation anyway; no promises.
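The shape of that idea, as a sketch: only ThreadPoolExecutor is real standard-library API here; the rest of the names are placeholders, not existing Dynochemy code.

# Sketch: run blocking SQL-backed operations on a private thread pool and hand
# back futures. The pool stays inside the datastore layer, so application code
# never touches threads directly. 'sql_client' is a placeholder.
from concurrent.futures import ThreadPoolExecutor

class ThreadedSQLBackend(object):
    def __init__(self, sql_client, max_workers=4):
        self.sql_client = sql_client
        self.executor = ThreadPoolExecutor(max_workers=max_workers)

    def run_operation(self, operation):
        # Returns a Future immediately; the blocking work happens on the pool.
        return self.executor.submit(self.sql_client.execute, operation)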
Conclusion
I hope you enjoyed this long history of the development of a little-used python library. I don’t think developers write these things often enough.
For any potential users of Dynochemy, I think a good amount of caution is warranted. It’s proven pretty stable for our application so far, but a single user does not a battle-tested library make. Anybody who identified similar shortcomings in existing libraries and was excited to use DynamoDB should be interested in contributing and understanding how it works if they hope to make use of it.