Today I want to re-visit the Identity Bus/Hub issue, which is also caught up with the choice between completely virtual directories and persistent storage metadirectories (see "Building an Identity Bus," Part 1 and Part 2).
Microsoft’s Kim Cameron still believes that applications need their own local data storage and proposes a test problem:
“Sometimes an application needs to do complex searches involving information 'mastered' in multiple locations. I’ll make up a very simple ‘two location’ example to demonstrate the issue:
‘What purchases of computers were made by employees who have been at the company for less than two years?’
Here we have to query ‘all the purchases of computers’ from the purchasing system, and ‘all employees hired within the last two years’ from the HR system, and find the intersection.”
Kim then states that in an “Identity Hub/virtualized directory” world, “…performing this query remotely and bringing down each result set is very expensive.” The implication is that all employee data and all purchase data must be downloaded to a temporary location (even fast local RAM) where the SQL join can be performed. But that really isn’t the case.
Surprisingly, Oracle’s Clayton Donley (the creator of the OctetString virtual directory) seems to agree with Cameron when he says of my proposal, “that functionality would likely be persistent cache, which if you look under the covers is exactly the same as a metadirectory in that it will copy data locally. In fact, the data may even be stored (again!) in a relational database.”
These arguments (including those made by Macehiter Ward-Dutton’s Neil Macehiter in support of Donley and Cameron) all share a fatal flaw: they are premised on copying all potentially relevant data to local storage (either disk or RAM) where a sort and join can be done. That’s simply not necessary!
I will assume that the HR system assigns each new hire an employee number, and that these numbers are sequential using a known sequencing scheme. I’ll further assume that the purchasing system records an indicator (such as the employee number) of the person ordering the merchandise. To find all employees hired in the last two years, I simply query the employee database for the single earliest record with a hire date inside that two-year window, retrieve it, and note the employee number. That number and all subsequent numbers represent every employee hired in the past two years. Now I need only query the purchasing database for all purchases of computers by persons whose employee number is equal to or greater than the one I’ve retrieved. So the only data that actually traverses the network is the single record from the employee database and those purchasing records that satisfy my query – there are no “waste records” cluttering the network.
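To make the idea concrete, here is a minimal sketch of the two-query approach in Python. The table names, column names, and sample data are all hypothetical, and two in-memory SQLite databases merely stand in for the remote HR and purchasing systems – the point is that only one HR record and the matching purchase records ever cross the “network”:

```python
import sqlite3
from datetime import date, timedelta

# Hypothetical HR system: sequential employee numbers, hire dates.
hr = sqlite3.connect(":memory:")
hr.execute("CREATE TABLE employees (emp_no INTEGER, name TEXT, hire_date TEXT)")
hr.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
    (100, "Alice", "2005-03-01"),   # hired more than two years ago
    (101, "Bob",   "2007-06-15"),   # hired within the last two years
    (102, "Carol", "2008-01-10"),
])

# Hypothetical purchasing system: each order carries the buyer's employee number.
po = sqlite3.connect(":memory:")
po.execute("CREATE TABLE purchases (emp_no INTEGER, item TEXT)")
po.executemany("INSERT INTO purchases VALUES (?, ?)", [
    (100, "computer"),
    (101, "computer"),
    (102, "desk"),
    (102, "computer"),
])

# Fixed "today" keeps the example deterministic; two years ~ 730 days.
cutoff = (date(2008, 6, 1) - timedelta(days=730)).isoformat()

# Query 1: a single record from HR -- the earliest employee number
# whose hire date falls inside the two-year window.
(first_new_emp,) = hr.execute(
    "SELECT MIN(emp_no) FROM employees WHERE hire_date >= ?", (cutoff,)
).fetchone()

# Query 2: only the purchase records that satisfy the question
# travel back -- no full result sets, no local join of both tables.
results = po.execute(
    "SELECT emp_no, item FROM purchases WHERE item = 'computer' AND emp_no >= ?",
    (first_new_emp,),
).fetchall()
print(results)  # [(101, 'computer'), (102, 'computer')]
```

Because the employee numbers are sequential, one scalar value (`first_new_emp`) is enough to push the “hired in the last two years” condition down into the purchasing system’s own query, rather than joining the two full result sets locally.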
To my thinking, this is much less expensive in terms of network bandwidth and CPU time (not to mention RAM and disk space) than the method Kim proposes.
Still, I am open to further discussion of this, even with those who disagree!