Saturday, February 25, 2012

2005 T-SQL Logic and writing LOOPs?

My apologies for the long explanation, and I don't have any experience writing LOOPs.

Using the information from TABLE#1, I'm looking to create an output file that looks like TABLE#2. In my example below I flagged GROUPS A, B, C & G with a value of "1" in TABLE#2 and GROUPS D, E & F with a value of "2". Even though GROUPS A & G never appear in the same row together, both get flagged with a "1", because there is a row with GROUPS C & G together and C is linked to A in row 2. GROUPS D, E & F are flagged with a "2" because D & F are together and F is tied to E, so I flag all 3 with "2". TABLE#2 should only have distinct values in the DISTINCT_GROUPS column; however, the FLAG number repeats depending on how many GROUPS get the same flag, i.e. GROUPS A, B, C & G all share the number "1", yet each of A, B, C & G shows up only once in the DISTINCT_GROUPS column - see the TABLE#2 output below. TABLE#1 has millions of rows, so optimization is very important.

Even though there are millions of GROUPS in TABLE#1, they are distinct by row. For example, A & B will only be in the table once; however, I could also have B & A, where B is in the first column and A in the second. Each GROUP could appear hundreds of times if it SELLs or BUYs from many other GROUPS. Thanks,

CREATE TABLE [dbo].[TABLE_1](

[BUYER_GROUP] [char](4) NOT NULL,

[SELLER_GROUP] [char](4) NOT NULL)

insert into TABLE_1 VALUES ('A','B')

insert into TABLE_1 VALUES ('C','A')

insert into TABLE_1 VALUES ('D','F')

insert into TABLE_1 VALUES ('B','C')

insert into TABLE_1 VALUES ('F','E')

insert into TABLE_1 VALUES ('C','G')

SELECT * FROM TABLE_1

BUYER_GROUP SELLER_GROUP

----------- ------------

A B

C A

D F

B C

F E

C G

GO

CREATE TABLE [dbo].[TABLE_2](

[DISTINCT_GROUPS] [char](4) NOT NULL,

[FLAG] [int] NULL

) ON [PRIMARY]

insert into TABLE_2 VALUES ('A', 1)

insert into TABLE_2 VALUES ('B', 1)

insert into TABLE_2 VALUES ('C', 1)

insert into TABLE_2 VALUES ('D', 2)

insert into TABLE_2 VALUES ('E', 2)

insert into TABLE_2 VALUES ('F', 2)

insert into TABLE_2 VALUES ('G', 1)

SELECT * FROM TABLE_2

DISTINCT_GROUPS FLAG

--------------- -----------

A 1

B 1

C 1

D 2

E 2

F 2

G 1

James,

Although you said you have millions of rows in "table_1", how many rows are returned by the query below?

select distinct

buyer_group as col1,

seller_group as col2

from table_1

where (buyer_group < seller_group)

union

select distinct

seller_group as col1,

buyer_group as col2

from table_1

where (buyer_group > seller_group)

The reason I ask is that these constitute all the pairs that are the "initial set" for your problem.
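For your sample data, that query returns six normalized pairs - (A,B), (A,C), (B,C), (C,G), (D,F) and (E,F) - with the smaller value always in col1, so each undirected relationship is stored exactly once.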

Meanwhile, I will try to construct a loop over a result set such as above, a loop that terminates when a SQL query creates no rows that are not already in the "table of pairs."

Once we have a substantial table of pairs, we can probably use a CTE to make a "string" of each of the groups, and then "split" the string into its component parts to populate the table you seek.

That's what I'm thinking at this moment, anyway.
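To make that concrete, here is a rough, untested sketch of such a CTE. It assumes the normalized pairs from the query above have been landed in a #pairs table (the names are placeholders), and because it enumerates every simple path it would only be practical on small groups - treat it purely as an illustration:

with walk (start_member, current_member, chain) as
(
    -- anchor: every distinct member starts a walk at itself
    select m, m, cast('/' + rtrim(m) + '/' as varchar(max))
    from (select col1 as m from #pairs
          union
          select col2 from #pairs) members
    union all
    -- step across a pair in either direction, skipping members
    -- already recorded in the chain string
    select w.start_member,
           case when p.col1 = w.current_member then p.col2 else p.col1 end,
           w.chain + rtrim(case when p.col1 = w.current_member
                                then p.col2 else p.col1 end) + '/'
    from walk w
    inner join #pairs p
        on w.current_member in (p.col1, p.col2)
    where charindex('/' + rtrim(case when p.col1 = w.current_member
                                     then p.col2 else p.col1 end) + '/',
                    w.chain) = 0
)
select start_member as group_member,
       min(current_member) as group_key -- smallest reachable member labels the group
from walk
group by start_member
option (maxrecursion 0) -- deep chains can exceed the default limit of 100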

Dan

|||

Hi Dan, thanks for replying.

My current table has 3,851,668 rows, and after running your code I have 1,925,834 rows - exactly half the records from TABLE#1. That makes sense, because I forgot to mention that the relationships have to be reciprocal: A & B and B & A. So if A buys from D but D never buys from A, that row would not be included in my dataset. So far I'm liking your approach, because you've already reduced my table size by half and the relationships can still be determined. Thanks Dan, let me know if you need anything else to assist you in helping me,

|||

James,

Please give this a try.

My current concern is with the "triple JOIN" with #pairs. If that is a problem, maybe the piece used for the LEFT OUTER JOIN must be a separate temporary table.

To use your own TABLE_1, you must change only the initial population of #pairs, as I'm sure you will readily recognize.

Dan

-- Look at some T-SQL code.

if object_id('tempdb..#table_1') is not null drop table #table_1

if object_id('tempdb..#pairs') is not null drop table #pairs

create table #table_1

(

buyer_group char(4),

seller_group char(4)

)

insert into #TABLE_1 VALUES ('A','B')

insert into #TABLE_1 VALUES ('C','A')

insert into #TABLE_1 VALUES ('D','F')

insert into #TABLE_1 VALUES ('B','C')

insert into #TABLE_1 VALUES ('F','E')

insert into #TABLE_1 VALUES ('C','G')

-- Populate a table with all initial "pairs".

select

col1,

col2

into #pairs

from

(

select distinct

buyer_group as col1,

seller_group as col2

from #table_1

where (buyer_group < seller_group)

union

select distinct

seller_group as col1,

buyer_group as col2

from #table_1

where (buyer_group > seller_group)

) tmp

;

create index pairs_ndx1

on #pairs (col1, col2)

create index pairs_ndx2

on #pairs (col2, col1)

-- Add new pairs to the table, using a WHILE loop.
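-- (Because each pair is stored with the smaller value in col1, two
-- join shapes are needed: p1.col2 = p2.col1 chains (x,y) and (y,z)
-- into (x,z), while p1.col2 = p2.col2 folds (x,z) and (y,z) into (x,y).)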

declare @new_pairs table

(

col1 char(4),

col2 char(4)

)

insert into @new_pairs

(

col1,

col2

)

select distinct

p1.col1,

p2.col2

from #pairs p1

inner join #pairs p2

on p1.col2 = p2.col1 -- col2 with col1

left outer join #pairs p3

on p1.col1 = p3.col1 and

p2.col2 = p3.col2

where (p3.col1 is null) -- pair is not yet in data

and (p1.col1 < p2.col2) -- should not be necessary to test this

union

select distinct

p1.col1,

p2.col1

from #pairs p1

inner join #pairs p2

on p1.col2 = p2.col2 -- col2 with col2

left outer join #pairs p3

on p1.col1 = p3.col1 and

p2.col1 = p3.col2

where (p3.col1 is null) -- pair is not yet in data

and (p1.col1 < p2.col1) -- here it is necessary to test this

;

while ((select count(*) from @new_pairs) > 0)

begin

insert into #pairs

(

col1,

col2

)

select

col1,

col2

from @new_pairs

delete from @new_pairs

insert into @new_pairs

(

col1,

col2

)

select distinct

p1.col1,

p2.col2

from #pairs p1

inner join #pairs p2

on p1.col2 = p2.col1 -- col2 with col1

left outer join #pairs p3

on p1.col1 = p3.col1 and

p2.col2 = p3.col2

where (p3.col1 is null) -- pair is not yet in data

and (p1.col1 < p2.col2) -- should not be necessary to test this

union

select distinct

p1.col1,

p2.col1

from #pairs p1

inner join #pairs p2

on p1.col2 = p2.col2 -- col2 with col2

left outer join #pairs p3

on p1.col1 = p3.col1 and

p2.col1 = p3.col2

where (p3.col1 is null) -- pair is not yet in data

and (p1.col1 < p2.col1) -- here it is necessary to test this

;

end

-- Check the progress

-- select *

-- from #pairs

-- order by 1, 2

-- ;

-- Create the table of groups.

if object_id('tempdb..#group_numbers') is not null drop table #group_numbers

create table #group_numbers

(

initial_member char(4),

group_number int identity(1,1)

)

insert into #group_numbers

(

initial_member

)

select distinct

p1.col1

from #pairs p1

left outer join

(

select distinct

col2

from #pairs

) tmp

on p1.col1 = tmp.col2

where (tmp.col2 is null) -- only want the first member of each group

order by 1
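-- (With the sample data the expanded #pairs ends up holding
-- (A,B) (A,C) (A,G) (B,C) (B,G) (C,G) and (D,E) (D,F) (E,F), so the
-- only col1 values that never appear in col2 are A and D; they get
-- identity values 1 and 2, matching the FLAG column in TABLE_2.)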

-- Create a table of group members.

if object_id('tempdb..#group_members') is not null drop table #group_members

create table #group_members

(

group_member char(4),

group_number int

)

insert into #group_members

(

group_member,

group_number

)

select distinct

initial_member,

group_number

from #group_numbers

union

select distinct

col2,

group_number

from #group_numbers gn

inner join #pairs p

on gn.initial_member = p.col1

;

select *

from #group_members

order by 1, 2

;

|||

You're the best Dan, thanks so much for your help - your query worked amazingly. I've been doing QA checks and everything looks great, no issues so far. This is my first time using a forum and I will definitely continue. Much appreciated, have a great weekend.

Regards,

James

|||

James,

Thanks for the kind words.

I hope there are enough comments that you can follow the various steps.

I am interested to learn how long it takes to run with your large dataset. By using the #pairs temporary table, and non-clustered indexes, we are free from concerns about your actual table and its indexes. For such small indexes and only a few million rows, and with non-clustered indexes, the INSERT actions should go reasonably quickly.

If you have any questions about any step, please let me know.

I, too, have learned a great deal from this and other Microsoft forums. There are lots of great people sharing their knowledge with the rest of us. For that I am quite thankful.

Have a nice weekend.

Dan

|||

Hi Dan,

When I tested the query on Friday it ran very fast using 1 million rows of data; now, with 4 million, it's been running for 8 hours. It looks like it has been spending a long time on the insert statements - when I look at the Activity Monitor, sometimes the status is suspended and the Wait Time can be 15 or 30 seconds. Since I started the query over 8 hours ago the CPU is 6,943,000 and the Physical IO is 2,204,800 - not sure if this is good or not, however I thought I would mention it. Most of the flags would only include 5, 10 or 30 members; however, I think it's possible for a few hundred thousand groups to get FLAGGED with the same number, so the insert is taking a long time. For now I'm letting it run and will check tonight from home - hopefully it finishes. I've tried understanding the logic, however I need some more time to really sit down and see what's going on - with all the joins it's confusing for me to follow. Thanks again, I will follow up and let you know how long it took.

Thanks,

James

|||

James,

I consider 8 hours for this process to be completely unacceptable.

I will re-work the earlier code so that it does not need the LEFT OUTER JOIN -- though the underlying logic will remain the same.

I will include some timing statements so that you can use the "Results to Text" feature of SSMS to monitor progress. It will be nice to get some row-counts at each step in the loop, to get some idea of what is consuming all the time.

Dan

|||

Try this out, please. I put in a limit on the number of iterations of the loop, and some timing statements.

Please use "Results to Text" and post its output, to help identify a solution route (I hope).

Thanks.

Dan

declare @MyMin varchar(3)

declare @MySec varchar(3)

declare @MyMS varchar(4)
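-- (right('00' + ..., 2) below left-pads each date part with zeros,
-- so a value like 7 prints as '07' and the timestamps line up)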

set @MyMin = right(('00' + cast(datepart(mi, getdate()) as varchar(3))), 2)

set @MySec = right(('00' + cast(datepart(ss, getdate()) as varchar(3))), 2)

set @MyMS = right(('000' + cast(datepart(ms, getdate()) as varchar(4))), 3)

print 'Beginning process ' + @MyMin + ':' + @MySec + '.' + @MyMS

if object_id('tempdb..#pairs') is not null drop table #pairs

-- Populate a table with all initial "pairs".

select

col1,

col2

into #pairs

from

(

select distinct

buyer_group as col1,

seller_group as col2

from table_1

where (buyer_group < seller_group)

union

select distinct

seller_group as col1,

buyer_group as col2

from table_1

where (buyer_group > seller_group)

) tmp

;

create index pairs_ndx1

on #pairs (col1, col2)

create index pairs_ndx2

on #pairs (col2, col1)

set @MyMin = right(('00' + cast(datepart(mi, getdate()) as varchar(3))), 2)

set @MySec = right(('00' + cast(datepart(ss, getdate()) as varchar(3))), 2)

set @MyMS = right(('000' + cast(datepart(ms, getdate()) as varchar(4))), 3)

print '#pairs and indexing complete: ' + @MyMin + ':' + @MySec + '.' + @MyMS

-- Add new pairs to the table, using a WHILE loop.

declare @new_pairs table

(

col1 char(4),

col2 char(4)

)

insert into @new_pairs

(

col1,

col2

)

select distinct

p1.col1,

p2.col2

from #pairs p1

inner join #pairs p2

on p1.col2 = p2.col1 -- col2 with col1

left outer join #pairs p3

on p1.col1 = p3.col1 and

p2.col2 = p3.col2

where (p3.col1 is null) -- pair is not yet in data

and (p1.col1 < p2.col2) -- should not be necessary to test this

union

select distinct

p1.col1,

p2.col1

from #pairs p1

inner join #pairs p2

on p1.col2 = p2.col2 -- col2 with col2

left outer join #pairs p3

on p1.col1 = p3.col1 and

p2.col1 = p3.col2

where (p3.col1 is null) -- pair is not yet in data

and (p1.col1 < p2.col1) -- here it is necessary to test this

;

set @MyMin = right(('00' + cast(datepart(mi, getdate()) as varchar(3))), 2)

set @MySec = right(('00' + cast(datepart(ss, getdate()) as varchar(3))), 2)

set @MyMS = right(('000' + cast(datepart(ms, getdate()) as varchar(4))), 3)

print '@new_pairs initial population complete: ' + @MyMin + ':' + @MySec + '.' + @MyMS

declare @loop_counter int

set @loop_counter = 0

while (((select count(*) from @new_pairs) > 0) and (@loop_counter < 10))

begin

insert into #pairs

(

col1,

col2

)

select

col1,

col2

from @new_pairs

delete from @new_pairs

insert into @new_pairs

(

col1,

col2

)

select distinct

p1.col1,

p2.col2

from #pairs p1

inner join #pairs p2

on p1.col2 = p2.col1 -- col2 with col1

left outer join #pairs p3

on p1.col1 = p3.col1 and

p2.col2 = p3.col2

where (p3.col1 is null) -- pair is not yet in data

and (p1.col1 < p2.col2) -- should not be necessary to test this

union

select distinct

p1.col1,

p2.col1

from #pairs p1

inner join #pairs p2

on p1.col2 = p2.col2 -- col2 with col2

left outer join #pairs p3

on p1.col1 = p3.col1 and

p2.col1 = p3.col2

where (p3.col1 is null) -- pair is not yet in data

and (p1.col1 < p2.col1) -- here it is necessary to test this

;

set @MyMin = right(('00' + cast(datepart(mi, getdate()) as varchar(3))), 2)

set @MySec = right(('00' + cast(datepart(ss, getdate()) as varchar(3))), 2)

set @MyMS = right(('000' + cast(datepart(ms, getdate()) as varchar(4))), 3)

set @loop_counter = @loop_counter + 1

print 'loop count: ' + cast(@loop_counter as varchar(5)) + '; time: ' + @MyMin + ':' + @MySec + '.' + @MyMS

end

|||

James,

And here is a version that uses a temp table, instead of a table variable. It also leaves room for expansion in the indexes, in case index maintenance is the main reason for the lengthy duration.
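(With FILLFACTOR = 50 the index pages are built half empty, so the repeated INSERTs inside the loop cause fewer page splits; PAD_INDEX applies the same padding to the intermediate index levels.)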

Dan

declare @MyMin varchar(3)

declare @MySec varchar(3)

declare @MyMS varchar(4)

set @MyMin = right(('00' + cast(datepart(mi, getdate()) as varchar(3))), 2)

set @MySec = right(('00' + cast(datepart(ss, getdate()) as varchar(3))), 2)

set @MyMS = right(('000' + cast(datepart(ms, getdate()) as varchar(4))), 3)

print 'Beginning process ' + @MyMin + ':' + @MySec + '.' + @MyMS

if object_id('tempdb..#pairs') is not null drop table #pairs

-- Populate a table with all initial "pairs".

select

col1,

col2

into #pairs

from

(

select distinct

buyer_group as col1,

seller_group as col2

from table_1

where (buyer_group < seller_group)

union

select distinct

seller_group as col1,

buyer_group as col2

from table_1

where (buyer_group > seller_group)

) tmp

;

create index pairs_ndx1

on #pairs (col1, col2)

WITH PAD_INDEX, FILLFACTOR = 50

create index pairs_ndx2

on #pairs (col2, col1)

WITH PAD_INDEX, FILLFACTOR = 50

set @MyMin = right(('00' + cast(datepart(mi, getdate()) as varchar(3))), 2)

set @MySec = right(('00' + cast(datepart(ss, getdate()) as varchar(3))), 2)

set @MyMS = right(('000' + cast(datepart(ms, getdate()) as varchar(4))), 3)

print '#pairs and indexing complete: ' + @MyMin + ':' + @MySec + '.' + @MyMS

-- Add new pairs to the table, using a WHILE loop.

create table #new_pairs

(

col1 char(4),

col2 char(4)

)

insert into #new_pairs

(

col1,

col2

)

select distinct

p1.col1,

p2.col2

from #pairs p1

inner join #pairs p2

on p1.col2 = p2.col1 -- col2 with col1

left outer join #pairs p3

on p1.col1 = p3.col1 and

p2.col2 = p3.col2

where (p3.col1 is null) -- pair is not yet in data

and (p1.col1 < p2.col2) -- should not be necessary to test this

union

select distinct

p1.col1,

p2.col1

from #pairs p1

inner join #pairs p2

on p1.col2 = p2.col2 -- col2 with col2

left outer join #pairs p3

on p1.col1 = p3.col1 and

p2.col1 = p3.col2

where (p3.col1 is null) -- pair is not yet in data

and (p1.col1 < p2.col1) -- here it is necessary to test this

;

create index new_pairs_ndx

on #new_pairs (col1, col2)

WITH PAD_INDEX, FILLFACTOR = 50

set @MyMin = right(('00' + cast(datepart(mi, getdate()) as varchar(3))), 2)

set @MySec = right(('00' + cast(datepart(ss, getdate()) as varchar(3))), 2)

set @MyMS = right(('000' + cast(datepart(ms, getdate()) as varchar(4))), 3)

print '#new_pairs initial population complete: ' + @MyMin + ':' + @MySec + '.' + @MyMS

declare @loop_counter int

set @loop_counter = 0

while (((select count(*) from #new_pairs) > 0) and (@loop_counter < 10))

begin

insert into #pairs

(

col1,

col2

)

select

col1,

col2

from #new_pairs

delete from #new_pairs

insert into #new_pairs

(

col1,

col2

)

select distinct

p1.col1,

p2.col2

from #pairs p1

inner join #pairs p2

on p1.col2 = p2.col1 -- col2 with col1

where (p1.col1 < p2.col2) -- should not be necessary to test this

union

select distinct

p1.col1,

p2.col1

from #pairs p1

inner join #pairs p2

on p1.col2 = p2.col2 -- col2 with col2

where (p1.col1 < p2.col1) -- here it is necessary to test this

;

delete np -- remove entries already in #pairs

from #new_pairs np

inner join #pairs p

on p.col1 = np.col1 and

p.col2 = np.col2

set @MyMin = right(('00' + cast(datepart(mi, getdate()) as varchar(3))), 2)

set @MySec = right(('00' + cast(datepart(ss, getdate()) as varchar(3))), 2)

set @MyMS = right(('000' + cast(datepart(ms, getdate()) as varchar(4))), 3)

set @loop_counter = @loop_counter + 1

print 'loop count: ' + cast(@loop_counter as varchar(5)) + '; time: ' + @MyMin + ':' + @MySec + '.' + @MyMS

end

|||

Hi Dan, sorry for the late reply - I took a couple of days off but now I'm back. Thanks for posting the above information; I will give this a try over the weekend and let you know how it goes. Thanks again, Jim

|||

Jim,

Ultimately I think we want a solution that works in less than 10 minutes.

The only thing that might prevent us from having such a solution is if there is some "large loop" in the data, where AAAA does not sell directly to ZZZZ, but AAAA sells to AAAB, which sells to AAAC, which sells to AAAD, ... which sells to ZZZX, which sells to ZZZY, which sells to ZZZZ. (AAAA also does not sell directly to any of the "intermediate" entries in this list.)

In such a case the current loop would have to iterate approx. 26*26*26*26 (about 457,000) times -- which is a lot of looping!

Dan

|||

Hi Dan, not sure if you're still around, however I've finally had a chance to try the code again with an even larger dataset. Everything works great except for what you mentioned above, when there are a lot of iterations. Without the one very large iteration I have - which can cover over 500k groups - the query can finish in only 10 to 15 minutes. But once it hits that large iteration it runs for over 10 hours, and I've yet to finish running the code to give you any results; yesterday they rebooted the server because of too much CPU usage. Now, with the larger dataset - 18 million rows - there could possibly be an even larger iteration that I haven't identified within the dataset. Not sure if there is a way to have the query bypass large iterations, or some way I can identify and release the memory if I'm taking up too much? Anyway, thanks for your help to this point. Cheers

|||

Jim,

I am unaware of any way that you could determine, in advance, that you have some very large loop, such as with 500K "buyer-seller" pairs.

Is such a large loop relevant to what you are trying to do? Can you ignore it, and put in some iteration count limit to find only loops of less than 100 buyer-seller pairs?
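(The @loop_counter < 10 test in the scripts above is exactly that kind of limit; adjusting the constant controls how many expansion passes the loop makes before giving up.)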

I suppose 500K "buyer-seller" pairs could represent an incredible money-laundering scheme!

Dan

|||

Thanks Dan, yes that's what I've done. For now I've just set aside this large loop, so I've been using the query that you wrote, and it runs in 10 minutes - sometimes even quicker. Thanks a lot for your help.

Jim
