First and foremost, John and Hillary thank you both for your responses... I've not established what I set out to discover as yet, but there were several useful ideas, and you both asked me to clarify my question.
When I initially asked about performance - I suppose I should have made it clearer that I am primarily concerned by scalability. I can put-up with the wall-clock speed of execution at present but I need to establish that as the number of records grows over time that my solution won't grind to a halt.
The advice to limit the number of results was good - I'd omitted that for brevity - the actual form of query which interests me includes such a limit - I'd have better stated the query:
Select top 100 * from
A left outer join B on A.bid=B.id
left outer join C on A.cid=C.id
Where
contains(A.*, ?word?)
or contains (B.*, ?word?)
or contains (C.*, ?word?)
Assume that the ?word? I choose to search upon is sufficiently unusual to limit my result-set after application of the where clause to between, say, 0 and 10 rows and that all these rows must be returned to my client ? hence for the circumstances I was considering the TOP 100 would likely have no significant effect on performance. As you might have guessed, the query I posted is a significant simplification of a series of real world problems (hence the generic names) the detailed explanation of which is way beyond the scope of a post here. In my example the relationship from A to B and from A to C is 1..n. For some of the tables I actually use B.id is unique in B; in some C.id unique in C; for some both are unique; for some neither. Many queries and updates within the system only act upon these tables individually (hence fixing the table structure.) I also require support for data mining searches for which the result set matches that returned for the query above. While I can see a potential for improving performance by using a simplified table structure (and avoiding the complexity of evaluation strategy introduced by left-outer-join) this is not practical as a consequence of other binding project constraints. I need to keep the distinct tables I introduced as A, B and C ? their relationship can be considered ?set-in-stone.? I realise that a single table would eliminate concerns about performance but that idea is not an option for this project.
I?m still a little unclear about your terminology (specifically ?scans of the FTS catalogs? and ?table scans?) I suspect that this relate closely to what I consider the primary performance concern.
I assume by "table scans" you suggest that SQLServer might evaluate both outer joins for each record in A then establish if the candidate result rows match any of the three contains clauses? If this is the case then the performance would be worse than linear in the number of records in A ? and hence exhibit relatively poor scalability.
I assume ?scans of the FTS catalogs? you mean that SQLServer might use the text index to efficiently establish rows from A matching the contains predicate ranging over A; outer join those rows with B and C to get the initial result rows; establish in B using the text index rows matching the contains predicate ranging over B then inner join those results with A, then outer join that result with C; finally use the text index for C to efficiently establish the remaining rows in C before inner joining those with A and outer joining with B ? all the while carefully arranging not to return the exact same row twice? This would be a workable strategy but it feels ?unlikely? as it both fails to explain observed performances during my (extremely limited and potentially very error prone) experiments and because when I specified two OUTER JOIN expressions in my query I wouldn?t have expected the optimiser to implement this using two inner-join and four outer-join relationships unless I re-write my query to explicitly hint that this is the way I need the query to be evaluated. While this approach to establish the result set is relatively complex, it offers far better scalability ? especially if table A happens to have a large number of records.
I'm not in a position to give concrete sp_help output - primarily as this question is a massive simplification of the actual queries I use... Tables typically have 20-50 fields of various types (CLOB; BLOB; integer; date; etc. etc.); All tables have an integer primary key and in example above assume indices are defined on B.id and C.id - each of which is can be assumed ordinal (say integer). The textual information is free-text enabled in every table.
I didn't understand John's stored procedure - but the idea of a stored procedure to make explicit an evaluation strategy for the desired result-set is certainly an interesting and technically plausible possibility... all be it one which would prove quite a challenge in the context of my project!
I'm not encountered the "Best Practices Analyzer Tool" - though the advice on http://www.sqldude.4t.com/rules.html looks familiar enough. Is this tool supplied with SQLServer or is it 3rd party product? Would this tool tell me how my query is being optimised?
At this stage I'm trying to establish if my hunch is accurate about the scalability of queries of the type illustrated above. I am aware that the three contains clauses would be detrimental to performance - and I am aware that if I had a single flat table that any performance issues would likely disappear. I'm also aware that I could use an alternative sequence of queries to efficiently establish the same result rows - but it would be a big task to re-write my query generator to make this optimisation and I am reluctant to do this if there is an alternative. Queries of the form above work OK for my current test data on a quick server - (hundreds of records) but I want a solution which will scale to support hundreds of thousands to millions of records. I'm still not clear how the query above will be optimised ? and, as such, I'm still not in a position to estimate scalability. I can?t use a single table and avoid outer-joins ? but I might be able to alter the way in which the tables are indexed if that would influence the optimiser in a positive way.
Mal, I am interested in this your problem for purely academic reasons. I
think we can come up with a scalable solution that will be optimal for you.
Would you mind is we take this offline, for obvious reasons. I am happy to
sign an NDA if you feel you need protection.
I will need exact table structures, but not necessarily names, as table
widths and data types are a factor.
BTW - this service will be done at no charge and then with your permission,
I will post the solution back to the group.

Signature
Hilary Cotter
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
> First and foremost, John and Hillary thank you both for your responses... I've not established what I set out to discover as yet, but there were
several useful ideas, and you both asked me to clarify my question.
> When I initially asked about performance - I suppose I should have made it clearer that I am primarily concerned by scalability. I can put-up with the
wall-clock speed of execution at present but I need to establish that as the
number of records grows over time that my solution won't grind to a halt.
> The advice to limit the number of results was good - I'd omitted that for brevity - the actual form of query which interests me includes such a
limit - I'd have better stated the query:
> Select top 100 * from
> A left outer join B on A.bid=B.id
[quoted text clipped - 5 lines]
>
> Assume that the 'word' I choose to search upon is sufficiently unusual to
limit my result-set after application of the where clause to between, say, 0
and 10 rows and that all these rows must be returned to my client - hence
for the circumstances I was considering the TOP 100 would likely have no
significant effect on performance. As you might have guessed, the query I
posted is a significant simplification of a series of real world problems
(hence the generic names) the detailed explanation of which is way beyond
the scope of a post here. In my example the relationship from A to B and
from A to C is 1..n. For some of the tables I actually use B.id is unique in
B; in some C.id unique in C; for some both are unique; for some neither.
Many queries and updates within the system only act upon these tables
individually (hence fixing the table structure.) I also require support for
data mining searches for which the result set matches that returned for the
query above. While I can see a potential for improving performance by using
a simplified table structure (and avoiding the complexity of evaluation
strategy introduced by left-outer-join) this is not practical as a
consequence of other binding project constraints. I need to keep the
distinct tables I introduced as A, B and C - their relationship can be
considered "set-in-stone." I realise that a single table would eliminate
concerns about performance but that idea is not an option for this project.
> I'm still a little unclear about your terminology (specifically "scans of
the FTS catalogs" and "table scans") I suspect that this relate closely to
what I consider the primary performance concern.
> I assume by "table scans" you suggest that SQLServer might evaluate both outer joins for each record in A then establish if the candidate result rows
match any of the three contains clauses? If this is the case then the
performance would be worse than linear in the number of records in A - and
hence exhibit relatively poor scalability.
> I assume "scans of the FTS catalogs" you mean that SQLServer might use the
text index to efficiently establish rows from A matching the contains
predicate ranging over A; outer join those rows with B and C to get the
initial result rows; establish in B using the text index rows matching the
contains predicate ranging over B then inner join those results with A, then
outer join that result with C; finally use the text index for C to
efficiently establish the remaining rows in C before inner joining those
with A and outer joining with B - all the while carefully arranging not to
return the exact same row twice? This would be a workable strategy but it
feels 'unlikely' as it both fails to explain observed performances during my
(extremely limited and potentially very error prone) experiments and because
when I specified two OUTER JOIN expressions in my query I wouldn't have
expected the optimiser to implement this using two inner-join and four
outer-join relationships unless I re-write my query to explicitly hint that
this is the way I need the query to be evaluated. While this approach to
establish the result set is relatively complex, it offers far better
scalability - especially if table A happens to have a large number of
records.
> I'm not in a position to give concrete sp_help output - primarily as this question is a massive simplification of the actual queries I use... Tables
typically have 20-50 fields of various types (CLOB; BLOB; integer; date;
etc. etc.); All tables have an integer primary key and in example above
assume indices are defined on B.id and C.id - each of which is can be
assumed ordinal (say integer). The textual information is free-text enabled
in every table.
> I didn't understand John's stored procedure - but the idea of a stored procedure to make explicit an evaluation strategy for the desired result-set
is certainly an interesting and technically plausible possibility... all be
it one which would prove quite a challenge in the context of my project!
> I'm not encountered the "Best Practices Analyzer Tool" - though the advice on http://www.sqldude.4t.com/rules.html looks familiar enough. Is this tool
supplied with SQLServer or is it 3rd party product? Would this tool tell me
how my query is being optimised?
> At this stage I'm trying to establish if my hunch is accurate about the scalability of queries of the type illustrated above. I am aware that the
three contains clauses would be detrimental to performance - and I am aware
that if I had a single flat table that any performance issues would likely
disappear. I'm also aware that I could use an alternative sequence of
queries to efficiently establish the same result rows - but it would be a
big task to re-write my query generator to make this optimisation and I am
reluctant to do this if there is an alternative. Queries of the form above
work OK for my current test data on a quick server - (hundreds of records)
but I want a solution which will scale to support hundreds of thousands to
millions of records. I'm still not clear how the query above will be
optimised - and, as such, I'm still not in a position to estimate
scalability. I can't use a single table and avoid outer-joins - but I might
be able to alter the way in which the tables are indexed if that would
influence the optimiser in a positive way.
> **********************************************************************
> Sent via Fuzzy Software @ http://www.fuzzysoftware.com/
> Comprehensive, categorised, searchable collection of links to ASP & ASP.NET resources...