Using Google App Engine is really easy for developers: you can use the language you prefer (Java, Python, and a bunch of dynamic languages running on top of the JVM) with the usual APIs.
An old Java addict can also use the usual javax.servlet, javax.mail, javax.cache, javax.persistence and so on, and everything works great, but...
...but after a while you feel that something is "wrong". Yes, you are using JPA, but your "relations" don't work as expected, and even if you have read that the underlying store is not SQL, you don't realize immediately that SQL is only the tip of the iceberg, because the DataStore is not even a relational database. After being shocked for a while I decided to study how this piece of software works (there is not much information on the official GAE site, in my opinion), because to work well with it you must think the way it was designed.
Let's start with a formal definition.
"Google App Engine Datastore is a schema-less persistence system, whose fundamental persistence unit is called Entity, composed of an immutable Key and a collection of mutable properties."
From this (quite) formal definition we can understand that it is a modern DBMS, but under the hood it is very different from Oracle or MySQL. Reading on the net I understood that the building blocks are a bunch of Google technologies that I tried to represent schematically this way:
Looking at this picture we must consider that:
But the guest star is one of the main technologies that run at Google:
"A Bigtable is a sparse, distributed, persistent multi-dimensional sorted map"
Coming back down to Earth, we can say that BigTable is not only a hash-table-like storage: thanks to the way it is designed and implemented, it also gives us the concepts of "logical table" and "logical row". Before introducing these two concepts let's take a look at the physical organization of the associative array:
From here it's only a small step to BigTable's logical ROW view (example from the official BigTable doc):
The logical row in BigTable has some important features:
Those rows are stored inside the Google cloud in tablets. Every single tablet contains a range of rows, lexicographically ordered by row key, and is the minimal unit of distribution and load balancing (about 200MB). It's really important to choose row keys that minimize the number of tablets accessed, to maximize efficiency and hence performance.
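To get a feeling for why the choice of row key matters, here is a minimal sketch that uses a plain sorted map in place of a table. It is only an illustration, not BigTable code: the class `RowLocalitySketch` and its `scanPrefix` method are names I made up, and the reversed-domain row keys in the usage note are just an assumed key convention.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical illustration: a TreeMap stands in for a table whose rows
// are kept in lexicographic order of their row keys.
final class RowLocalitySketch {
    private final TreeMap<String, String> rows = new TreeMap<>();

    void put(String rowKey, String value) {
        rows.put(rowKey, value);
    }

    // Returns the keys of all rows whose key starts with the given prefix.
    // Because keys are sorted, these rows sit next to each other, so a real
    // store would need to touch only a few tablets to serve the scan.
    List<String> scanPrefix(String prefix) {
        // Character.MAX_VALUE is the largest char, so prefix + MAX_VALUE is an
        // upper bound for every realistic key that starts with the prefix.
        SortedMap<String, String> range =
                rows.subMap(prefix, prefix + Character.MAX_VALUE);
        return new ArrayList<>(range.keySet());
    }
}
```

With keys such as "com.cnn.www/index.html" and "com.cnn.money", a call to `scanPrefix("com.cnn.")` reads one contiguous range, while keys that share no common prefix would be scattered across the whole key space.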
Another interesting abstraction in BigTable is the "column family", a group of columns that represent the same concept; for instance "anchor:com.yahoo.www" and "anchor:com.google.www" are two columns that belong to the same family, "anchor". A column is always named with the syntax "family:qualifier". Column families are not only used to group similar concepts, they are also the minimal unit of access control: for every column family in a table we can decide who can read, write and/or update.
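The "family:qualifier" naming and the per-family access control can be sketched like this. Everything here is an assumption made for illustration: `ColumnFamilyAclSketch`, `familyOf`, `allowRead` and `canRead` are invented names, and only read permission is modelled.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration: permissions are attached to the column family,
// never to a single column.
final class ColumnFamilyAclSketch {
    // family name -> users allowed to read every column of that family
    private final Map<String, Set<String>> readers = new HashMap<>();

    // Extracts the family part from a "family:qualifier" column name.
    static String familyOf(String column) {
        int colon = column.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("expected family:qualifier, got " + column);
        }
        return column.substring(0, colon);
    }

    void allowRead(String family, String user) {
        readers.computeIfAbsent(family, f -> new HashSet<>()).add(user);
    }

    // A user can read a column only if allowed on the column's family.
    boolean canRead(String user, String column) {
        Set<String> allowed = readers.get(familyOf(column));
        return allowed != null && allowed.contains(user);
    }
}
```

Granting read access on "anchor" covers both "anchor:com.yahoo.www" and "anchor:com.google.www" at once, which is the point of making the family the unit of access control.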
The last concept we must understand to work with BigTable is the "timestamp", a 64-bit integer that recalls the long field we use for optimistic locking in many ORMs. Here it is used to keep many versions of the same row in the table, so we have to choose a policy for creating new versions and garbage-collecting old ones. We can decide to keep the versions newer than time T, or maybe the last n versions; this is a choice for the developer who is using the table (unfortunately versioning is a BigTable feature that we can't use in DataStore).
So we have an idea of what BigTable is, but with DataStore we get more. In fact, on top of it DataStore adds:
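The two garbage-collection policies mentioned above can be sketched on a single cell. This is my own toy model, not BigTable's API: `VersionedCellSketch`, `keepLast` and `keepNewerThan` are hypothetical names, and timestamps are plain long values.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Hypothetical illustration of one cell holding several timestamped versions.
final class VersionedCellSketch {
    // Sorted with the newest timestamp first.
    private final TreeMap<Long, String> versions =
            new TreeMap<>(Comparator.reverseOrder());

    void write(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    // Most recent value, or null if the cell has no versions.
    String latest() {
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    // Policy 1: keep only the n most recent versions.
    void keepLast(int n) {
        while (versions.size() > n) {
            versions.pollLastEntry(); // last entry in this order = oldest version
        }
    }

    // Policy 2: keep only the versions strictly newer than time t.
    void keepNewerThan(long t) {
        versions.keySet().removeIf(ts -> ts <= t);
    }

    // Surviving timestamps, newest first.
    List<Long> timestamps() {
        return new ArrayList<>(versions.keySet());
    }
}
```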
Let's use a simple example for a better understanding. Suppose we have an Entity defined this way (a very simple JPA view):
and we want to execute this JPA-QL query:
Thinking about the way BigTable stores data, we can imagine such a query to be a resource-intensive task. But it is not: the DataStore comes to our help and organizes data so that we can query them easily thanks to indexes. A DataStore index is a table with "well organized" keys and no values, lexicographically ordered in a way that makes access to table data very fast. An example tailored to the previous query is this:
where the rows with key1 and key3 are selected.
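The idea of an index made only of sorted keys can be sketched in a few lines. Take it as a rough model under my own assumptions, not as the real DataStore layout: the kind "Person", the property "name", the `|` separator and the class `PropertyIndexSketch` are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration: an index is a sorted set of composite keys with
// no values; each index row has the form kind|property|value|entityKey.
final class PropertyIndexSketch {
    private static final char SEP = '|';
    private final TreeSet<String> indexRows = new TreeSet<>();

    void index(String kind, String property, String value, String entityKey) {
        indexRows.add(kind + SEP + property + SEP + value + SEP + entityKey);
    }

    // Equality filter: a single contiguous scan over the sorted index rows.
    List<String> keysWhereEquals(String kind, String property, String value) {
        String prefix = kind + SEP + property + SEP + value + SEP;
        List<String> keys = new ArrayList<>();
        for (String row : indexRows.tailSet(prefix)) {
            if (!row.startsWith(prefix)) {
                break; // we have left the matching range, stop scanning
            }
            keys.add(row.substring(prefix.length()));
        }
        return keys;
    }
}
```

If key1 and key3 are indexed with the same value and key2 with a different one, an equality filter on that value jumps straight to the first matching row and stops as soon as the prefix changes, without ever looking at the entities themselves.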
We must also be aware that indexes are not a holy grail: being fast is not just a matter of having an index, but of finding the right balance between the indexes and the time we spend updating them at every insert. In fact, while an index is not size sensitive, the number of indexes has a big impact on the insert time of rows in a table, because every index must be updated.
SELECT p
This is done by walking every property's index and intersecting the results, starting with the first index (selected row in yellow):
Here it is clear that "key1" is selected because it follows the path from the first index to the last.
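Walking several indexes and intersecting the results can be sketched as a merge over sorted key lists. This is a simplified model of the idea, not the actual DataStore algorithm: `IndexIntersectionSketch` is a made-up class, and it assumes that each index has already produced its matching entity keys in ascending order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: every per-property index yields a sorted list of
// matching entity keys; the query result is the intersection of those lists.
final class IndexIntersectionSketch {

    // Intersects two key lists that are each sorted in ascending order.
    static List<String> intersect(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) {          // key present in both indexes
                out.add(a.get(i));
                i++;
                j++;
            } else if (cmp < 0) {    // advance whichever side is behind
                i++;
            } else {
                j++;
            }
        }
        return out;
    }

    // Starts from the first index and narrows the candidates index by index.
    static List<String> intersectAll(List<List<String>> perIndexMatches) {
        if (perIndexMatches.isEmpty()) {
            return new ArrayList<>();
        }
        List<String> result = perIndexMatches.get(0);
        for (int k = 1; k < perIndexMatches.size(); k++) {
            result = intersect(result, perIndexMatches.get(k));
        }
        return result;
    }
}
```

A key survives only if it appears in every list, which is the "path from the first index to the last" described above.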
Here the ROOT Entity (Fiorentina) is the key to making the transaction work. All entities in this entity group refer to its version timestamp, which acts as a flag for a successfully completed transaction. If all has gone well, the commit updates this timestamp to flag that the entities in this group are all ok. If something went wrong, nothing is done on the root entity, and dirty entities (those without a valid ROOT entity) will be garbage-collected later.
Thanks to this abstraction we can write code like the following, which is very common in JPA programming and really easy to do on a relational database:
Note that we only need to persist the ROOT entity and that we have utility methods (addXXX) that manage the inverse mapping necessary for real ownership of an object. If something goes wrong before the commit, everything will be rolled back.
We are used to thinking "relationally" because the winning technology of the last 35 years has been the Relational Database Management System. Often we think that everything can be well designed with the relational model, but this may not be true; just think of the effort we need every time we map our Java objects, even with modern ORMs.
This is not only a technological challenge, because this time we have to work on our "forma mentis" in order to fully use the new cloud-computing tools that Google and other vendors are going to provide us.