Before the first relational
databases arrived, programmers used various methods to store and access
information: simple files, sequential records, indexed sequential files, hierarchical
record structures, hash codes and so on. What these solutions had in common was
that they were basically low level; most relationships in the data had to be expressed in
code, mixing data structure with code structure. Relational databases
introduced the 100% principle, which guarantees consistency even in the case of
faulty programs.
I have the feeling that we are
in a similar situation with internet programming. Every website has some kind
of search (i.e. document indexing and a crawler), taxonomy,
document management, registration, personalization and recommendation. From one perspective these
are all database services: they store and connect information. However, the
information is connected in code - in other words, internet sites are now where
corporate databases were in the seventies and eighties. How nice would it
be to have a standard like SQL to provide all the common services for Web
applications!
Naturally I don't really know
what a system like this would look like; I can only hope that somebody will come
up with a nice theory, as Edgar Frank Codd did.
Until then we are left to our own imagination. What I would like to have is
something like SQL with some extensions.
First I need my unique domain:
"CREATE DOMAIN <domain>" - all domains have a globally
unique ID.
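Purely as an illustration (the function name and the use of a UUID are my own assumptions, not part of any standard), the globally unique ID could be as simple as:

    import uuid

    # Hypothetical sketch: registering a web "domain" and giving it a
    # globally unique identifier. Names are invented for illustration.
    def create_domain(name):
        return {"name": name, "domain_id": str(uuid.uuid4())}

    print(create_domain("example.org"))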
Then I can start the crawler:
"CREATE WEBINDEX
SITES <url1>, <url2>, ... LEAVE
<urla>, <urlb>, ... DEEP 5, FREQUENCY
INSTANT"
- this will create an index by crawling the websites at url1, url2, ... but
not urla and urlb, digging at most 5 levels deep in the chain of linked
websites, and instantly updating the index if any document changes.
I can also add or remove
documents (i.e. URLs) manually with the INSERT and DELETE commands.
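I have no idea how a real engine would implement this, but just to make the parameters concrete, here is a small Python sketch of what SITES, LEAVE and DEEP could mean, together with the manual INSERT/DELETE. The function and data names are my own invention, the link graph is hard-coded instead of fetched from the web, and the instant re-indexing of FREQUENCY INSTANT is not modelled:

    from collections import deque

    # Hypothetical sketch of CREATE WEBINDEX ... SITES ... LEAVE ... DEEP 5:
    # crawl from the start URLs, skip the excluded ones, stop at a maximum
    # link depth, and keep the visited URLs as the index.
    def build_webindex(links, sites, leave, deep):
        index, queue = set(), deque((url, 0) for url in sites)
        while queue:
            url, depth = queue.popleft()
            if url in index or url in leave or depth > deep:
                continue
            index.add(url)
            for linked in links.get(url, []):
                queue.append((linked, depth + 1))
        return index

    # A toy link graph standing in for the real web.
    links = {"url1": ["url2", "urla"], "url2": ["urlb", "url3"], "url3": []}
    index = build_webindex(links, sites=["url1", "url2"],
                           leave={"urla", "urlb"}, deep=5)

    # The manual INSERT / DELETE commands would simply add or remove documents.
    index.add("url4")
    index.discard("urla")
    print(sorted(index))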
Metadata is similar. I can also
create, link and remove thesauri, taxonomies and keywords with
the CREATE, ADD and REMOVE METADATA commands. Naturally URLs can also be added to
the metadata elements. Once I have metadata I can add it to a WEBINDEX
structure. The trick is that the WEBINDEX structure is self-learning: it
can take the metadata element - URL pairs from the metadata and classify
all elements in the index accordingly. In fact it can make an automatic
classification even without any metadata defined; for that purpose a standard taxonomy
is used.
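To make the self-learning idea a bit more tangible, here is a naive sketch of how metadata element - URL pairs could be propagated to the rest of the index. The word-overlap scoring and all data are my own invention; a real engine would of course use a much more sophisticated classifier:

    # Hypothetical sketch: propagate metadata labels from labelled URLs to the
    # rest of the index by simple word overlap.
    docs = {
        "url1": "red wine bordeaux vintage",
        "url2": "white wine riesling dry",
        "url3": "laptop battery charger usb",
        "url4": "sweet dessert wine vintage",
        "url5": "usb cable adapter laptop",
    }
    labels = {"url1": "wine", "url2": "wine", "url3": "electronics"}  # metadata element - URL pairs

    def classify(url):
        words = set(docs[url].split())
        scores = {}
        for labelled_url, label in labels.items():
            overlap = len(words & set(docs[labelled_url].split()))
            scores[label] = scores.get(label, 0) + overlap
        return max(scores, key=scores.get)

    for url in docs:
        if url not in labels:
            print(url, "->", classify(url))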
SEARCH is much like what we know from the Google input box, but with a lot of parameters defining metadata, ranking and other document selection criteria.
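A parameterised SEARCH could then look something like this sketch, where the parameter names, the toy data and the crude word-overlap ranking are again my own assumptions:

    # Hypothetical SEARCH sketch: a free-text query plus extra selection criteria.
    docs = {
        "url1": {"text": "red wine bordeaux vintage", "topic": "wine"},
        "url2": {"text": "laptop battery charger usb", "topic": "electronics"},
        "url3": {"text": "sweet dessert wine vintage", "topic": "wine"},
    }

    def search(query, topic=None, limit=10):
        words = set(query.split())
        hits = []
        for url, doc in docs.items():
            if topic and doc["topic"] != topic:
                continue                                       # metadata-based selection
            score = len(words & set(doc["text"].split()))      # crude ranking
            if score:
                hits.append((score, url))
        return [url for score, url in sorted(hits, reverse=True)[:limit]]

    print(search("vintage wine", topic="wine"))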
Recommendation works the same
way as taxonomy. We use simple keyword - URL pairs, where the keyword
is typically a unique user ID and the URL identifies a document or product.
Then we simply search for that keyword, excluding the results with 100%
confidence (because those are the defined keyword - URL pairs), and return, say,
the 10 best matches.
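Spelled out as a toy sketch (data and names invented by me, and the confidence score reduced to simple word overlap): the user ID is attached as a keyword to the documents the user has already touched, similar documents come back with less than 100% confidence, and those become the recommendations:

    # Hypothetical recommendation sketch: keyword (user id) - URL pairs plus a
    # crude similarity score; exclude the exact pairs (the 100% matches) and
    # return the best remaining documents.
    docs = {
        "url1": "red wine bordeaux vintage",
        "url2": "white wine riesling dry",
        "url3": "laptop battery charger usb",
        "url4": "sweet dessert wine vintage",
    }
    user_pairs = {"user42": {"url1", "url2"}}   # documents user42 already has

    def recommend(user, limit=10):
        seen = user_pairs[user]
        profile = set(word for url in seen for word in docs[url].split())
        scores = []
        for url, text in docs.items():
            if url in seen:
                continue                         # skip the 100% confidence hits
            overlap = len(profile & set(text.split()))
            if overlap:
                scores.append((overlap, url))
        return [url for overlap, url in sorted(scores, reverse=True)[:limit]]

    print(recommend("user42"))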
What remains is the content
management part. The domain as a whole, or parts of it individually, can be set so
that versions are preserved (or not), and locks and statuses can be set on
domain elements (typically documents). Workflow can be defined in a similar
manner.
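Again, just to make it concrete, here is a sketch of what version preservation, locks and statuses on a domain element might boil down to; the class and field names are invented:

    # Hypothetical content management sketch: a domain element that keeps its
    # version history and carries a lock and a status.
    class Document:
        def __init__(self, url, content, keep_versions=True):
            self.url = url
            self.versions = [content]
            self.keep_versions = keep_versions
            self.locked_by = None
            self.status = "draft"

        def update(self, content, user):
            if self.locked_by not in (None, user):
                raise RuntimeError("document is locked by " + self.locked_by)
            if self.keep_versions:
                self.versions.append(content)
            else:
                self.versions[-1] = content

    doc = Document("url1", "first draft")
    doc.locked_by = "editor"
    doc.update("second draft", "editor")
    doc.status = "published"
    print(doc.versions, doc.status)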
As far as I can tell, the IDOL language is fairly close to this vision, and HP is working on a unified relational and web database.