Before the first relational
databases arrived, programmers used various methods to store and access
information: simple files, sequential records, indexed sequential files, hierarchical
record structures, hash codes and so on. What these solutions had in common was
that they were basically low level; most relationships in the data had to be expressed in
code, mixing data structure with code structure. Relational databases
introduced the 100% principle, which guarantees consistency even in the case of
faulty programs.
I have the feeling that we are
in a similar situation with internet programming. Every website has some kind
of search (i.e. document indexing and a crawler), taxonomy,
document management, registration, personalization and recommendation. From one perspective these
are all database services: they store and connect information. However, the
information is connected in code - in other words, internet sites are now where
corporate databases were in the seventies and eighties. How nice would it
be to have a standard like SQL to provide all the common services for Web
applications!
Naturally I don't really know
what a system like this would look like; I can only hope that somebody will come
up with a nice theory, as Edgar Frank Codd did.
Until then we are left to our own imagination. What I would like to have is
something like SQL with some extensions.
First I need my unique domain:
"CREATE DOMAIN <domain>" - all domains have a globally
unique ID.
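Purely as an illustration (the function name and the use of a UUID are my own assumptions, not part of any standard), the globally unique ID could be as simple as:

    import uuid

    # Hypothetical sketch: registering a web "domain" and giving it a
    # globally unique identifier. Names are invented for illustration.
    def create_domain(name):
        return {"name": name, "domain_id": str(uuid.uuid4())}

    print(create_domain("example.org"))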
Then I can start the crawler:
"CREATE WEBINDEX
SITES <url1>, <url2>, ... LEAVE
<urla>, <urlb>, ... DEEP 5, FREQUENCY
INSTANT"
- this will create an index by crawling the websites at url1, url2, ... but
not urla and urlb, digging at most 5 levels deep in the chain of linked
websites, and instantly updating the index if any document changes.
I can also add or remove
documents (i.e. URLs) manually with the INSERT and DELETE commands.
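I have no idea how a real engine would implement this, but just to make the parameters concrete, here is a small Python sketch of what SITES, LEAVE and DEEP could mean, together with the manual INSERT/DELETE. The function and data names are my own invention, the link graph is hard-coded instead of fetched from the web, and the instant re-indexing of FREQUENCY INSTANT is not modelled:

    from collections import deque

    # Hypothetical sketch of CREATE WEBINDEX ... SITES ... LEAVE ... DEEP 5:
    # crawl from the start URLs, skip the excluded ones, stop at a maximum
    # link depth, and keep the visited URLs as the index.
    def build_webindex(links, sites, leave, deep):
        index, queue = set(), deque((url, 0) for url in sites)
        while queue:
            url, depth = queue.popleft()
            if url in index or url in leave or depth > deep:
                continue
            index.add(url)
            for linked in links.get(url, []):
                queue.append((linked, depth + 1))
        return index

    # A toy link graph standing in for the real web.
    links = {"url1": ["url2", "urla"], "url2": ["urlb", "url3"], "url3": []}
    index = build_webindex(links, sites=["url1", "url2"],
                           leave={"urla", "urlb"}, deep=5)

    # The manual INSERT / DELETE commands would simply add or remove documents.
    index.add("url4")
    index.discard("urla")
    print(sorted(index))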
Metadata is similar. I can also
create, link and remove thesauri, taxonomies and keywords with
the CREATE, ADD and REMOVE METADATA commands. Naturally URLs can also be added to
the metadata elements. Once I have metadata I can add it to a WEBINDEX
structure. The trick is that the WEBINDEX structure is self-learning: it
can take the metadata element - URL pairs from the metadata and classify
all elements in the index accordingly. In fact it can make an automatic
classification even without any metadata defined; for that purpose a standard taxonomy
is used.
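To make the self-learning idea a bit more tangible, here is a naive sketch of how metadata element - URL pairs could be propagated to the rest of the index. The word-overlap scoring and all data are my own invention; a real engine would of course use a much more sophisticated classifier:

    # Hypothetical sketch: propagate metadata labels from labelled URLs to the
    # rest of the index by simple word overlap.
    docs = {
        "url1": "red wine bordeaux vintage",
        "url2": "white wine riesling dry",
        "url3": "laptop battery charger usb",
        "url4": "sweet dessert wine vintage",
        "url5": "usb cable adapter laptop",
    }
    labels = {"url1": "wine", "url2": "wine", "url3": "electronics"}  # metadata element - URL pairs

    def classify(url):
        words = set(docs[url].split())
        scores = {}
        for labelled_url, label in labels.items():
            overlap = len(words & set(docs[labelled_url].split()))
            scores[label] = scores.get(label, 0) + overlap
        return max(scores, key=scores.get)

    for url in docs:
        if url not in labels:
            print(url, "->", classify(url))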
SEARCH is much like what we know from the Google input box, but with a lot of parameters defining metadata, ranking and other document selection criteria.
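A parameterised SEARCH could then look something like this sketch, where the parameter names, the toy data and the crude word-overlap ranking are again my own assumptions:

    # Hypothetical SEARCH sketch: a free-text query plus extra selection criteria.
    docs = {
        "url1": {"text": "red wine bordeaux vintage", "topic": "wine"},
        "url2": {"text": "laptop battery charger usb", "topic": "electronics"},
        "url3": {"text": "sweet dessert wine vintage", "topic": "wine"},
    }

    def search(query, topic=None, limit=10):
        words = set(query.split())
        hits = []
        for url, doc in docs.items():
            if topic and doc["topic"] != topic:
                continue                                       # metadata-based selection
            score = len(words & set(doc["text"].split()))      # crude ranking
            if score:
                hits.append((score, url))
        return [url for score, url in sorted(hits, reverse=True)[:limit]]

    print(search("vintage wine", topic="wine"))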
Recommendation works the same
way as taxonomy. We use simple keyword - URL pairs, where the keyword
is typically a unique user ID and the URL identifies a document or product.
Then we simply search for that keyword, excluding the results with 100%
confidence (because those are the defined keyword - URL pairs), and return, say,
the 10 best matches.
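Spelled out as a toy sketch (data and names invented by me, and the confidence score reduced to simple word overlap): the user ID is attached as a keyword to the documents the user has already touched, similar documents come back with less than 100% confidence, and those become the recommendations:

    # Hypothetical recommendation sketch: keyword (user id) - URL pairs plus a
    # crude similarity score; exclude the exact pairs (the 100% matches) and
    # return the best remaining documents.
    docs = {
        "url1": "red wine bordeaux vintage",
        "url2": "white wine riesling dry",
        "url3": "laptop battery charger usb",
        "url4": "sweet dessert wine vintage",
    }
    user_pairs = {"user42": {"url1", "url2"}}   # documents user42 already has

    def recommend(user, limit=10):
        seen = user_pairs[user]
        profile = set(word for url in seen for word in docs[url].split())
        scores = []
        for url, text in docs.items():
            if url in seen:
                continue                         # skip the 100% confidence hits
            overlap = len(profile & set(text.split()))
            if overlap:
                scores.append((overlap, url))
        return [url for overlap, url in sorted(scores, reverse=True)[:limit]]

    print(recommend("user42"))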
What remains is the content
management part. The domain as a whole, or parts of it individually, can be set so
that versions are preserved (or not), and locks and statuses can be set on
domain elements (typically documents). Workflow can be defined in a similar
manner.
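Again, just to make it concrete, here is a sketch of what version preservation, locks and statuses on a domain element might boil down to; the class and field names are invented:

    # Hypothetical content management sketch: a domain element that keeps its
    # version history and carries a lock and a status.
    class Document:
        def __init__(self, url, content, keep_versions=True):
            self.url = url
            self.versions = [content]
            self.keep_versions = keep_versions
            self.locked_by = None
            self.status = "draft"

        def update(self, content, user):
            if self.locked_by not in (None, user):
                raise RuntimeError("document is locked by " + self.locked_by)
            if self.keep_versions:
                self.versions.append(content)
            else:
                self.versions[-1] = content

    doc = Document("url1", "first draft")
    doc.locked_by = "editor"
    doc.update("second draft", "editor")
    doc.status = "published"
    print(doc.versions, doc.status)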
As far as I can tell, the IDOL language is fairly close to this vision, and HP is working on a unified relational and web database.