Showing posts with label Oracle. Show all posts
Showing posts with label Oracle. Show all posts

Tuesday, September 04, 2007

Database Build Script "Greatest Hits"

I know its been a quiet time on this blog for a while now, but I've noticed that I'm still getting visitors looking up old blog posts. It's especially true of the posts that relate to "The Patch Runner". Many of them come through a link from Wilfred van der Deijl, mainly his great post of "Version control of Database Objects". The patch runner is my grand idea for a version controlled database build script that you can use to give your developers sandbox databases to play with as well as ensuring that your live database upgrades work first time, every time. It's all still working perfectly here, and people still seem to be interested, so with that in mind I've decided to collate them a little bit. basically provide an index of all the posts I've made over the years that directly relate to database build scripts, sandboxes and version control. So, Rob's database build script 'Greatest Hits': All of the posts describe processes and patch runners that are very similar to those that I use in my work every day. I started playing with these theories over 3 years ago now and there is no way I'd go back to implement database upgrades the way I did before. However, I'd LOVE to hear ideas on how things can be improved. I'd be amazed if my three year old thinking was still up to date! Technorati Tags: , , , , ,

Thursday, July 12, 2007

Can a change in execution plan change the results?

We've been using Oracle Domain indexes for a while now in order to search documents to get back a ranked order of things that meet certain criteria. The documents are releated to people, and we augment the basic text search with other filters and score metrics based on the 'people' side of things to get an overall 'suitability' score for the results in a search. Without giving too much away about the business I work with I can't really tell you much more about the product than that, but it's probably enough of a background for this little gem. We've known for a while that the domain index 'score' returned from a 'contains' clause is based not only on the document to which that score relates, but also on the rest of the set that is searched. An individual document score does not live in isolation, rather in lives in the context of the whole result set. No problem. As I say, we've known this for a while and so have our customers. Quite a while ago they stopped asking what the numbers mean and learned to trust them. However, today we realised something. Since the results are affected by the result set that is searched, this means that the results can be affected by the order in which the optimizer decides to execute a query. I can't give you a full end to end example, but I can assure you that the following is most definately the case on one of our production domain indexes (names changed, obviously): We have a two column table 'document_index', which contains 'id' and 'document_contents'. Both columns have an index. The ID being the primary key and the other being a domain index. The following SQL gives the related execution path: SELECT id, SCORE( 1 ) FROM document_index WHERE CONTAINS( document_contents, :1, 1 ) > 0 AND id = :2 SELECT STATEMENT TABLE ACCESS BY INDEX ROWID SCOTT.DOCUMENT_INDEX DOMAIN INDEX SCOTT.DOCUMENT_INDEX_IDX01 However, the alternative SQL gives this execution path: SELECT id, SCORE( 1 ) FROM document_index WHERE CONTAINS( document_contents, 'Some text', 1 ) > 0 AND id = :2 SELECT STATEMENT TABLE ACCESS BY INDEX ROWID SCOTT.DOCUMENT_INDEX INDEX UNIQUE SCAN SCOTT.DOCUMENT_INDEX_PK Normally, this kind of change in execution path wouldn't be a problem. But as stated earlier, the result of a score operation against a domain index is not just dependant on the individual records, but the context of the whole result set. The first execution provides you a score for the single document in the context of the all the documents in the table, the second gives you a score within the context of just that document. The scores are different. Now obviously, this is an extreme example, but more subtle examples will almost certainly exist if you combine the domain index lookups with any other where clause criteria. This is especially true if you're using literal values instead of bind variables in which case you may find the execution path changing between calls to the 'same' piece of SQL. My advice? Well, we're going to split our domain index look ups from all the rest of the filtering criteria, that way we can prepare the set of documents we want the search to be within and know that the scoring algorithm will be applied consistently.

Sunday, December 31, 2006

P-dd 0.1 finally released

So, with just 6 1/2 hours to go before the year ends I've finally managed to get P-dd - The PHP Database Documentor up to version 0.1 standard.

The blog's up and running, a slab of source code sits on Google code and at last I feel like I can stand beside the library and say "yeah, well it's good enough for now".

You can find the code here: http://code.google.com/p/p-dd/

And the blog here: http://p-dd.blogspot.com/

So what does it do?

Put simply, it's a library of PHP classes that allow for the easy generation of documentation from a set of database sources.

The idea is that, over time, database sources will be added that will allow for the collection of meta-data from all the major database players (Oracle / MySql / Postgres / etc) and produce documentation in most of the popular forms (HTML / XML / RTF / PDF / etc) including ER diagrams.

The aim is to make the library simple to use to produce either applications that output documentation for static publication or applications that allow for navigation through the database structure. Note that it is not the aim of the project to produce either of these applications, merely to allow for their creation.

It is also recognised that in the future it is desirable to take the library into a more analysis role. For example - inferring foreign keys that are not explicitly stated, either by examining the table structures or the data within those tables.

The library is very much in its early stages though, and for now we've got the following:


  • A database model that consists of:

    • Tables

    • Columns

    • Primary Keys

    • Foreign Keys



  • The database model can be created from the following sources:

    • Oracle

    • XML File



  • The model can be rendered into the following formats:

    • HTML

    • XML

    • Graphviz Neato Diagram (producing an ER diagram)





There are also lots of other little goodies in there such as datasource in dependant filters, a datasource caching system that limits the round trips to the database and a plethora of examples showing how components can be used as well as a simple Oracle database viewer application to show off what can be possible with just a small amount of work.

I hope the code is of use, and I'm fully committed to getting more and more functionality into the code as soon as possible in the new year.

Note: The eagle eyed of you may notice that I've added a new sidebar to this blog which will list the blog posts from the P-dd blog...


Technorati Tags: , , , , , , , , , , , , ,

Wednesday, December 06, 2006

Repeating a point

The other day I mentioned the principle "Don't repeat yourself". I think it may have inspired Andy Clarke to write this up, and he's quite right. It comes from the Pragmatic Programmers.

APC's spot on in his description as it relates to writing code, but he doesn't go far enough.

DRY relates to every part of software development, not just the bit where you're knocking out code.

If, in any part of the process, you find you have a duplication of knowledge then you have a potential problem.

Anyone ever read that comment at the top of a procedure and found its description doesn't match the code that follows?

Watched that demonstration video and found that it's showing you an utterly different version of the system to that which you've just installed?

Looked at that definitive ER diagram and found it's missing half the tables?

Well, don't put a comment at the top of the procedure, instead document the behaviour by writing an easy to read unit test for it. Whilst the knowledge might be duplicated (the test duplicates the knowledge inside the procedure), at least those pieces of knowledge are validated against each other (if you have to repeat, put in some validation)

Don't have a team writing automated functional tests and another producing videos, write your video scripts as automated tests and have them generated with every build.

Instead of manually creating a set of ER diagrams and documentation on what the system will be like, write some documentation generation software and have it generated from the current state of the database instead.

You might notice that there's a running theme here... generation. Well yup. One of the best ways of reducing the chances of discrepancies between sources of knowledge is by ensuring there is only one representation of that knowledge and generating the others.

It's one of the reasons why I've been working on the new Open Source library 'P-dd' (Php Database Documentor). It's intended be a simple library for the production of database documentation from a number of different sources - the ultimate aim is to be able to read from any of the major RDBMS systems, Wikis, XML files and suchlike and be able to simply output many different forms, HTML, GIF, PDF, XML, Open Office Doc. Over the next week I intend on letting people know where they can find it, in its early form...