Amazon SimpleDB - A different perspective

If you're new here, you may want to subscribe to my RSS feed. Thanks for visiting!

Amazon AWS

Amazon has launched SimpleDB and the BlogoSphere has already gone gaga over it (e.g. here, here, here and here) and a few of them should know better or in fact looked at it more closely before foaming at the mouth.

I was looking at the syntax for queries and one thing stuck out for me. This is that you cannot compare an ‘identifier’ with another ‘identifier’. That is all queries are based upon you filtering down against constant values.

Here is the attribute comparison syntax allowed.

attribute_comparision ::= identifier comp_op constant |
	      "not" identifier comp_op constant |
	      identifier comp_op constant "and" attribute_comparision |
	      "not" identifier comp_op constant "and" attribute_comparision |
	      identifier comp_op constant "or" attribute_comparision |
	      "not" identifier comp_op constant "or" attribute_comparision

What I was looking for was,

	 identifier comp_op identifier

What does this mean? Well for one you cannot build pivot tables. So for a simple example lets say you have the concept of bookmarks for a user.

If I was building this with any other database I would very likely have a separate table for all the unique URL’s that everyone was storing. And then have a pivot table that joined between the user and the URL table.

So we have..
user table

Index User
1 Nick
2 Fred
3 Bob

URL table

Index URL
1 http://www.bbc.co.uk
2 http://www.gamespot.com
3 http://www.techcrunch.com

Pivot

User Index Url Index
1 1
1 2
2 3
3 1
3 2
3 3

The pivot shows that nick has two URL’s, fred has 1 and bob has 3.

So how do we represent this in SimpleDB? Well it does have something very clever which lets you assign up to 256 pairs of key + value per item.

So you could do a number of things..

1) Assign each url to to a differently named attribute, e.g.

Item (name -> nick)
(url1 -> http://www.bbc.co.uk)
(url2 -> http://www.gamespot.com)

The problem with this is that you would quickly run out of attributes and it is very
inefficient. Plus if you wanted to perform a search to find who was interesting in a
particular URL you would have to check against every possible key name e.g. (url1,url2,url3) .. anyway makes me feel sick just writing it.

2) Store the URL multiple times under items which include a common identifier

item (name -> nick)
item (name -> nick) (url -> http://www.bbc.co.uk)
item (name -> nick) (url -> http://www.gamespot.com)

We can then perform a query.

['name' = 'nick'']

Would give us all the items with ‘nick’ in them but we would have to go through them and find those with the right attributes. We could also find all the users who were interested in a particular url.

['url' = 'http://www.bbc.co.uk']

Back to identifiers

But going back to my original point because we cannot do ‘identifier’ = ‘identifier’ you cannot setup a pivot that would join the URL + NAME together. Before I get shouted at (am sure it will happen) I do (hopefully) understand that the approach is completely different. What I do not understand is how you get around data duplication.

Pivot tables are all about not having duplicate data. In my SimpleDB example I was having to insert ‘http://www.bbc.co.uk’ for every user, rather than just a pivot index that would join them together. The result of this? unless SimpleDB is very clever and removes duplicates (possibly of course) then the size of your database will be a factor larger than if it was stored efficiently than in your usual SQL style database.

Limitations

Other major limitations include not being able to order the results, no way to count results, no way to sum, max, min and lots more.

One other current limitation - 250 million attributes per domain.

Now 250 million may sound like a lot of things, but in database terms let me be clear its tiny. An attribute is any pairing of data under each item. So lets say we have a typical table with 10 fields, stored under SimpleDB you would be left with 25 million rows. Double the number of fields, and you only have 12 million. If you start factoring in that you have to treat a single domain as containing multiple tables you start seeing that 250m as a serious limitation.

Conclusion

Don’t get me wrong this is definitely a generation shift but I really thought it was worth taking a different perspective on this. I have already signed up and will be tinkering around with the PHP library to see what I can make of it.

There Are 8 Responses So Far. »

  1. After all the hype i’v read about this so far, i’v got to say that this is the most thought out and realistic look at amazon’s latest offering. I had a quick glance at it myself and kinda fobbed it off as I love my sql and subselects far too much to give them up but now after reading this I can see there really are major limitations to it that could come to bite you if you aren’t prepared for them!

  2. Nick,

    Could you provide a reference (URL) for the 250 million attribute limit? I looked but have not find it yet.

    I was hoping to read the wording to make sure I understood the limit. Every mention I’ve seen [so far] of attribute is analogous to a column or field *name*, not a per row instance of that field. It seems like those row/field instances are referred to as values in what I’ve seen so far.

    I don’t think a limit of 250 million unique columns in most application would be problematic. I’m just wondering if there’s some potentially confusing communication due to the non-relational implementation.

    Unfortunately I’ve only found a few doc files so far and the service is newly public so I’m sure I’m missing something critical.

    Best,

    Dave

  3. Nick, I think you might be able to do this with SDS.

    From the Amazon website:-

    However unlike a spreadsheet, Amazon SimpleDB allows for multiple values to be associated with each “cell” (e.g., for item “123,” the attribute “color” can have both value “blue” and value “red”).

    Also in the glossary there is a mention of “multi-valued attributes”:-

    http://docs.amazonwebservices.com/AmazonSimpleDB/2007-11-07/GettingStartedGuide/SDB_Glossary.html

    It sounds like attributes can store arrays of values.

  4. You are right as far as I can see, that there is no join syntax in SimpleDB. Yet I don’t see why you cannot implement the schem m:n schema you described. The only problem is, you need to do more then one query.

    E.g. for all useres having a specific URL:
    one query to get the id of the url.
    one using that id to get all relevant user ids from the ‘pivot table’
    one query to get the users by id.

    Where is the problem?

    Of course doing this isn’t as fast (and easy) as doing a join on three tables in oracle (or whatever DB you prefer) but I don’t think SimpleDB is thought to be a competition for Oracle

    Jens

  5. Which actually might be nicer than using pivot tables.

    I’ve seen people use comma-separated lists on quite a few occasions (because they either don’t understand normalisation/joins or because they are lazy and don’t want to maintain the overhead).

  6. You’re loving your RDBMS too much. When it comes to scalability, it might be a good idea to drop 2NF support to make the thing fly. Instead of giving you an all-in-one data storage for every purpose, SimpleDB is about breaking the CAP theorem. With it, Amazon offers you a very special building block with it’s own specific strengths and weaknesses. In this case, you’ll get exceptional scalability and durability, paired with the lack of relations (and thus native 2NF support).

    You should learn to exploit specialized systems like SimpleDB (or Memcached, S3, BerkleyDB, lighttpd, etc.). Your applications will scale better and more importantly, it’s just more fun to work with those tools! :-)

    Marten

  7. Maarten,

    Already use S3, memcache, lighttpd and I use them all within their own specialised areas of use as and when I require them within an overall infrastructure.

    As I stated I see SimpleDB for what it is, and not what the ‘hype’ is trying to make it, its a simple tool thats useful for storing non-relational data in a very scalable way. But the irony here is that the simplistic datasets that it can scale so well are things you could cover on a dedicated box running a lightweight SQL setup.

    Bottom line is that is NOT a replacement for anything complex out there right now.

  8. This is your old Relational DBMS vs Column DBMS argument.
    Relational data has its uses, normally for analysis and data mining, but sparse data such as data that is normally used to drive a site therefore column oriented dbs make more sense (if indeed that is your data model).

    The biggest mistake, and one that I think will also be the most common on simpleDB, is going to be people that force a relational model onto the service. all of them will be slow, combersome and complex because they’ll have to duplicate (to varying degrees of success) the relational model in something that doesn’t support it.

    Don’t force a model on something, either learn the new model, or don’t use the service.

    That said, personally I’d be much more in favor of running my own column db (say hbase or cStore) off ec2 than simpleDb because of the limits stated, as I also agree that 250m is nowhere enough when you start scaling this out. So I agree with the model, just not amazons limited version of it.

Post a Response