My apologies for the long explanation; I don't have any experience writing loops.
Using the information from TABLE#1, I'm looking to create an output file that looks like TABLE#2. In my example below I flagged groups A, B, C and G with a value of "1" in TABLE#2, and groups D, E and F with a value of "2". Even though groups A and G never appear in the same row together, both get flagged with "1", because there is a row with C and G together, and C is linked to A in row 2. Groups D, E and F are flagged with "2" because D and F appear together and F is tied to E, so I flag all three with "2".
TABLE#2 should have only distinct values in the DISTINCT_GROUPS column; the FLAG number, however, repeats depending on how many groups receive the same flag. For example, groups A, B, C and G all share the number "1", yet each of A, B, C and G shows up only once in the DISTINCT_GROUPS column (see the TABLE#2 output file below). TABLE#1 has millions of rows, so optimization is very important.
Even though there are millions of groups in TABLE#1, the rows are distinct. For example, A & B will only be in the table once, though I could also have B & A, where B is in the first column and A in the second. However, each group could appear hundreds of times if it sells to or buys from many other groups. Thanks,
CREATE TABLE [dbo].[TABLE_1](
[BUYER_GROUP] [char](4) NOT NULL,
[SELLER_GROUP] [char](4) NOT NULL)
insert into TABLE_1 VALUES ('A','B')
insert into TABLE_1 VALUES ('C','A')
insert into TABLE_1 VALUES ('D','F')
insert into TABLE_1 VALUES ('B','C')
insert into TABLE_1 VALUES ('F','E')
insert into TABLE_1 VALUES ('C','G')
SELECT * FROM TABLE_1
BUYER_GROUP SELLER_GROUP
----------- ------------
A           B
C           A
D           F
B           C
F           E
C           G
GO
CREATE TABLE [dbo].[TABLE_2](
[DISTINCT_GROUPS] [char](4) NOT NULL,
[FLAG] [int] NULL
) ON [PRIMARY]
insert into TABLE_2 VALUES ('A', 1)
insert into TABLE_2 VALUES ('B', 1)
insert into TABLE_2 VALUES ('C', 1)
insert into TABLE_2 VALUES ('D', 2)
insert into TABLE_2 VALUES ('E', 2)
insert into TABLE_2 VALUES ('F', 2)
insert into TABLE_2 VALUES ('G', 1)
SELECT * FROM TABLE_2
DISTINCT_GROUPS FLAG
--------------- ----
A               1
B               1
C               1
D               2
E               2
F               2
G               1
James,
You said you have millions of rows in "table_1", but how many rows are returned by the query below?
select distinct
buyer_group as col1,
seller_group as col2
from table_1
where (buyer_group < seller_group)
union
select distinct
seller_group as col1,
buyer_group as col2
from table_1
where (buyer_group > seller_group)
The reason I ask is that these constitute all the pairs that are the "initial set" for your problem.
Meanwhile, I will try to construct a loop over a result set such as above, a loop that terminates when a SQL query creates no rows that are not already in the "table of pairs."
Once we have a substantial table of pairs, we can probably use a CTE to make a "string" of each of the groups, and then "split" the string into its component parts to populate the table you seek.
That's what I'm thinking at this moment, anyway.
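For what it's worth, here is a minimal recursive-CTE sketch of that reachability idea (my assumptions: SQL Server 2005 or later, and the TABLE_1 you posted). The running "path" string guards against cycles such as A-B-C-A, but this approach explodes on large components, so treat it as an illustration rather than the answer. On your six-row sample it returns exactly your TABLE#2: A, B, C, G flagged 1 and D, E, F flagged 2.
with edges as
(
    -- every relationship in both directions; rtrim because char(4) pads
    select rtrim(buyer_group)  as g1,
           rtrim(seller_group) as g2
    from TABLE_1
    union
    select rtrim(seller_group),
           rtrim(buyer_group)
    from TABLE_1
),
reach (grp, linked, pth) as
(
    -- anchor: every direct pair, with a running list of visited groups
    select g1,
           g2,
           cast(',' + g1 + ',' + g2 + ',' as varchar(max))
    from edges
    union all
    -- recurse: extend each path by one hop, never revisiting a group
    select r.grp,
           e.g2,
           r.pth + e.g2 + ','
    from reach r
    inner join edges e
        on e.g1 = r.linked
    where r.pth not like '%,' + e.g2 + ',%'
)
-- label each group with the smallest name in its component,
-- then turn that label into a dense 1, 2, 3, ... flag
select grp as DISTINCT_GROUPS,
       dense_rank() over
           (order by min(case when linked < grp then linked else grp end)) as FLAG
from reach
group by grp
order by DISTINCT_GROUPS
option (maxrecursion 0)   -- chains can be longer than the default limit of 100
;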
Dan
|||Hi Dan, thanks for replying.
My current table has 3,851,668 rows, and after running your code I have 1,925,834 rows, exactly half the records from TABLE#1. That makes sense because of something I forgot to mention: the relationships are always reciprocal, both A & B and B & A, so if A buys from D but D never buys from A, that row wouldn't exist in my dataset in the first place. So far I'm liking your approach, because you've already reduced my table size by half and the relationships can still be determined. Thanks Dan, let me know if you need anything else to assist you in helping me.
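P.S. Just in case I ever do get non-reciprocal rows that need to be excluded explicitly, I assume a self-join like the sketch below would keep only the pairs that appear in both directions (your UNION by itself just normalizes the ordering). With the little six-row sample from my first post it would return nothing, since that sample isn't reciprocal; my real data is.
-- keep only the pairs present in BOTH directions, smaller name first
select distinct
    t1.buyer_group  as col1,
    t1.seller_group as col2
from table_1 t1
inner join table_1 t2
    on  t1.buyer_group  = t2.seller_group
    and t1.seller_group = t2.buyer_group   -- the reciprocal row must exist
where (t1.buyer_group < t1.seller_group)
;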
|||James,
Please give this a try.
My current concern is the "triple JOIN" on #pairs. If that turns out to be a problem, the piece used for the LEFT OUTER JOIN may have to become a separate temporary table.
To use your own TABLE_1, you only need to change the initial population of #pairs, as I'm sure you will readily recognize.
Dan
-- Look at some T-SQL code.
if object_id('tempdb..#table_1') is not null drop table #table_1
if object_id('tempdb..#pairs') is not null drop table #pairs
create table #table_1
(
buyer_group char(4),
seller_group char(4)
)
insert into #TABLE_1 VALUES ('A','B')
insert into #TABLE_1 VALUES ('C','A')
insert into #TABLE_1 VALUES ('D','F')
insert into #TABLE_1 VALUES ('B','C')
insert into #TABLE_1 VALUES ('F','E')
insert into #TABLE_1 VALUES ('C','G')
-- Populate a table with all initial "pairs".
select
col1,
col2
into #pairs
from
(
select distinct
buyer_group as col1,
seller_group as col2
from #table_1
where (buyer_group < seller_group)
union
select distinct
seller_group as col1,
buyer_group as col2
from #table_1
where (buyer_group > seller_group)
) tmp
;
create index pairs_ndx1
on #pairs (col1, col2)
create index pairs_ndx2
on #pairs (col2, col1)
-- Add new pairs to the table, using a WHILE loop.
declare @new_pairs table
(
col1 char(4),
col2 char(4)
)
insert into @new_pairs
(
col1,
col2
)
select distinct
p1.col1,
p2.col2
from #pairs p1
inner join #pairs p2
on p1.col2 = p2.col1 -- col2 with col1
left outer join #pairs p3
on p1.col1 = p3.col1 and
p2.col2 = p3.col2
where (p3.col1 is null) -- pair is not yet in data
and (p1.col1 < p2.col2) -- should not be necessary to test this
union
select distinct
p1.col1,
p2.col1
from #pairs p1
inner join #pairs p2
on p1.col2 = p2.col2 -- col2 with col2
left outer join #pairs p3
on p1.col1 = p3.col1 and
p2.col1 = p3.col2
where (p3.col1 is null) -- pair is not yet in data
and (p1.col1 < p2.col1) -- here it is necessary to test this
;
while ((select count(*) from @new_pairs) > 0)
begin
insert into #pairs
(
col1,
col2
)
select
col1,
col2
from @new_pairs
delete from @new_pairs
insert into @new_pairs
(
col1,
col2
)
select distinct
p1.col1,
p2.col2
from #pairs p1
inner join #pairs p2
on p1.col2 = p2.col1 -- col2 with col1
left outer join #pairs p3
on p1.col1 = p3.col1 and
p2.col2 = p3.col2
where (p3.col1 is null) -- pair is not yet in data
and (p1.col1 < p2.col2) -- should not be necessary to test this
union
select distinct
p1.col1,
p2.col1
from #pairs p1
inner join #pairs p2
on p1.col2 = p2.col2 -- col2 with col2
left outer join #pairs p3
on p1.col1 = p3.col1 and
p2.col1 = p3.col2
where (p3.col1 is null) -- pair is not yet in data
and (p1.col1 < p2.col1) -- here it is necessary to test this
;
end
-- Check the progress
-- select *
-- from #pairs
-- order by 1, 2
-- ;
-- Create the table of groups.
if object_id('tempdb..#group_numbers') is not null drop table #group_numbers
create table #group_numbers
(
initial_member char(4),
group_number int identity (1,1)
)
insert into #group_numbers
(
initial_member
)
select distinct
p1.col1
from #pairs p1
left outer join
(
select distinct
col2
from #pairs
) tmp
on p1.col1 = tmp.col2
where (tmp.col2 is null) -- only want the first member of each group
order by 1
-- Create a table of group members.
if object_id('tempdb..#group_members') is not null drop table #group_members
create table #group_members
(
group_member char(4),
group_number int
)
insert into #group_members
(
group_member,
group_number
)
select distinct
initial_member,
group_number
from #group_numbers
union
select distinct
col2,
group_number
from #group_numbers gn
inner join #pairs p
on gn.initial_member = p.col1
;
select *
from #group_members
order by 1, 2
;
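If you then want your TABLE_2 exactly as posted, I believe this final step would load it (my assumptions: the loop ran to completion, and your real TABLE_2 starts out empty):
-- #group_members already holds one row per (member, component),
-- so it maps straight onto your DISTINCT_GROUPS / FLAG layout
insert into TABLE_2
(
DISTINCT_GROUPS,
FLAG
)
select group_member,
       group_number
from #group_members
;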
|||You're the best Dan, thanks so much for your help; your query worked amazingly. I've been doing QA checks and everything looks great, no issues so far. This is my first time using a forum and I will definitely continue. Much appreciated, have a great weekend.
Regards,
James
|||James,
Thanks for the kind words.
I hope there are enough comments that you can follow the various steps.
I am interested to learn how long it takes to run with your large dataset. By using the #pairs temporary table with its own non-clustered indexes, we are free from concerns about your actual table and its indexes. With such small index keys and only a few million rows, the INSERT actions should go reasonably quickly.
If you have any questions about any step, please let me know.
I, too, have learned a great deal from this and other Microsoft forums. There are lots of great people sharing their knowledge with the rest of us. For that I am quite thankful.
Have a nice weekend.
Dan
|||Hi Dan,
When I tested the query on Friday it ran very fast on 1 million rows of data; now, with 4 million, it has been running for 8 hours. It looks like most of the time is spent on the INSERT statements. When I look at the Activity Monitor, the status is sometimes SUSPENDED and the wait time can be 15 or 30 seconds. Since I started the query over 8 hours ago the CPU is at 6,943,000 and the Physical IO at 2,204,800; not sure if that is good or not, but I thought I would mention it. Most of the flags would only cover 5, 10 or 30 members, however I think it's possible for a few hundred thousand groups to get FLAGGED with the same number, so the insert takes a long time. For now I'm letting it run and will check tonight from home; hopefully it finishes. I've tried understanding the logic but need some more time to really sit down and see what's going on; with all the joins it's confusing for me to follow. Thanks again, I will follow up and let you know how long it took.
Thanks,
James
|||James,
I consider 8 hours for this process to be completely unacceptable.
I will re-work the earlier code so that it does not need the LEFT OUTER JOIN -- though the underlying logic will remain the same.
I will include some timing statements so that you can use the "Results to Text" feature of SSMS to monitor progress. It will be nice to get some row-counts at each step in the loop, to get some idea of what is consuming all the time.
Dan
|||Try this out, please. I put in a limit on the number of iterations of the loop, and some timing statements.
Please use "Results to Text" and post its output, to help identify a solution route (I hope).
Thanks.
Dan
declare @MyMin varchar(3)
declare @MySec varchar(3)
declare @MyMS varchar(4)
set @MyMin = right(('00' + cast(datepart(mi, getdate()) as varchar(3))), 2)
set @MySec = right(('00' + cast(datepart(ss, getdate()) as varchar(3))), 2)
set @MyMS = right(('000' + cast(datepart(ms, getdate()) as varchar(4))), 3)
print 'Beginning process ' + @MyMin + ':' + @MySec + '.' + @MyMS
if object_id('tempdb..#pairs') is not null drop table #pairs
-- Populate a table with all initial "pairs".
select
col1,
col2
into #pairs
from
(
select distinct
buyer_group as col1,
seller_group as col2
from table_1
where (buyer_group < seller_group)
union
select distinct
seller_group as col1,
buyer_group as col2
from table_1
where (buyer_group > seller_group)
) tmp
;
create index pairs_ndx1
on #pairs (col1, col2)
create index pairs_ndx2
on #pairs (col2, col1)
set @MyMin = right(('00' + cast(datepart(mi, getdate()) as varchar(3))), 2)
set @MySec = right(('00' + cast(datepart(ss, getdate()) as varchar(3))), 2)
set @MyMS = right(('000' + cast(datepart(ms, getdate()) as varchar(4))), 3)
print '#pairs and indexing complete: ' + @MyMin + ':' + @MySec + '.' + @MyMS
-- Add new pairs to the table, using a WHILE loop.
declare @new_pairs table
(
col1 char(4),
col2 char(4)
)
insert into @new_pairs
(
col1,
col2
)
select distinct
p1.col1,
p2.col2
from #pairs p1
inner join #pairs p2
on p1.col2 = p2.col1 -- col2 with col1
left outer join #pairs p3
on p1.col1 = p3.col1 and
p2.col2 = p3.col2
where (p3.col1 is null) -- pair is not yet in data
and (p1.col1 < p2.col2) -- should not be necessary to test this
union
select distinct
p1.col1,
p2.col1
from #pairs p1
inner join #pairs p2
on p1.col2 = p2.col2 -- col2 with col2
left outer join #pairs p3
on p1.col1 = p3.col1 and
p2.col1 = p3.col2
where (p3.col1 is null) -- pair is not yet in data
and (p1.col1 < p2.col1) -- here it is necessary to test this
;
set @MyMin = right(('00' + cast(datepart(mi, getdate()) as varchar(3))), 2)
set @MySec = right(('00' + cast(datepart(ss, getdate()) as varchar(3))), 2)
set @MyMS = right(('000' + cast(datepart(ms, getdate()) as varchar(4))), 3)
print '@new_pairs initial population complete: ' + @MyMin + ':' + @MySec + '.' + @MyMS
declare @loop_counter int
set @loop_counter = 0
while (((select count(*) from @new_pairs) > 0) and (@loop_counter < 10))
begin
insert into #pairs
(
col1,
col2
)
select
col1,
col2
from @new_pairs
delete from @new_pairs
insert into @new_pairs
(
col1,
col2
)
select distinct
p1.col1,
p2.col2
from #pairs p1
inner join #pairs p2
on p1.col2 = p2.col1 -- col2 with col1
left outer join #pairs p3
on p1.col1 = p3.col1 and
p2.col2 = p3.col2
where (p3.col1 is null) -- pair is not yet in data
and (p1.col1 < p2.col2) -- should not be necessary to test this
union
select distinct
p1.col1,
p2.col1
from #pairs p1
inner join #pairs p2
on p1.col2 = p2.col2 -- col2 with col2
left outer join #pairs p3
on p1.col1 = p3.col1 and
p2.col1 = p3.col2
where (p3.col1 is null) -- pair is not yet in data
and (p1.col1 < p2.col1) -- here it is necessary to test this
;
set @MyMin = right(('00' + cast(datepart(mi, getdate()) as varchar(3))), 2)
set @MySec = right(('00' + cast(datepart(ss, getdate()) as varchar(3))), 2)
set @MyMS = right(('000' + cast(datepart(ms, getdate()) as varchar(4))), 3)
set @loop_counter = @loop_counter + 1
print 'loop count: ' + cast(@loop_counter as varchar(5)) + '; time: ' + @MyMin + ':' + @MySec + '.' + @MyMS
end
|||James,
And here is a version that uses a temp table, instead of a table variable. It also leaves room for expansion in the indexes, in case index maintenance is the main reason for the lengthy duration.
Dan
declare @MyMin varchar(3)
declare @MySec varchar(3)
declare @MyMS varchar(4)
set @MyMin = right(('00' + cast(datepart(mi, getdate()) as varchar(3))), 2)
set @MySec = right(('00' + cast(datepart(ss, getdate()) as varchar(3))), 2)
set @MyMS = right(('000' + cast(datepart(ms, getdate()) as varchar(4))), 3)
print 'Beginning process ' + @MyMin + ':' + @MySec + '.' + @MyMS
if object_id('tempdb..#pairs') is not null drop table #pairs
-- Populate a table with all initial "pairs".
select
col1,
col2
into #pairs
from
(
select distinct
buyer_group as col1,
seller_group as col2
from table_1
where (buyer_group < seller_group)
union
select distinct
seller_group as col1,
buyer_group as col2
from table_1
where (buyer_group > seller_group)
) tmp
;
create index pairs_ndx1
on #pairs (col1, col2)
WITH (PAD_INDEX = ON, FILLFACTOR = 50)
create index pairs_ndx2
on #pairs (col2, col1)
WITH (PAD_INDEX = ON, FILLFACTOR = 50)
set @MyMin = right(('00' + cast(datepart(mi, getdate()) as varchar(3))), 2)
set @MySec = right(('00' + cast(datepart(ss, getdate()) as varchar(3))), 2)
set @MyMS = right(('000' + cast(datepart(ms, getdate()) as varchar(4))), 3)
print '#pairs and indexing complete: ' + @MyMin + ':' + @MySec + '.' + @MyMS
-- Add new pairs to the table, using a WHILE loop.
if object_id('tempdb..#new_pairs') is not null drop table #new_pairs
create table #new_pairs
(
col1 char(4),
col2 char(4)
)
insert into #new_pairs
(
col1,
col2
)
select distinct
p1.col1,
p2.col2
from #pairs p1
inner join #pairs p2
on p1.col2 = p2.col1 -- col2 with col1
left outer join #pairs p3
on p1.col1 = p3.col1 and
p2.col2 = p3.col2
where (p3.col1 is null) -- pair is not yet in data
and (p1.col1 < p2.col2) -- should not be necessary to test this
union
select distinct
p1.col1,
p2.col1
from #pairs p1
inner join #pairs p2
on p1.col2 = p2.col2 -- col2 with col2
left outer join #pairs p3
on p1.col1 = p3.col1 and
p2.col1 = p3.col2
where (p3.col1 is null) -- pair is not yet in data
and (p1.col1 < p2.col1) -- here it is necessary to test this
;
create index new_pairs_ndx
on #new_pairs (col1, col2)
WITH (PAD_INDEX = ON, FILLFACTOR = 50)
set @MyMin = right(('00' + cast(datepart(mi, getdate()) as varchar(3))), 2)
set @MySec = right(('00' + cast(datepart(ss, getdate()) as varchar(3))), 2)
set @MyMS = right(('000' + cast(datepart(ms, getdate()) as varchar(4))), 3)
print '#new_pairs initial population complete: ' + @MyMin + ':' + @MySec + '.' + @MyMS
declare @loop_counter int
set @loop_counter = 0
while (((select count(*) from #new_pairs) > 0) and (@loop_counter < 10))
begin
insert into #pairs
(
col1,
col2
)
select
col1,
col2
from #new_pairs
delete from #new_pairs
insert into #new_pairs
(
col1,
col2
)
select distinct
p1.col1,
p2.col2
from #pairs p1
inner join #pairs p2
on p1.col2 = p2.col1 -- col2 with col1
where (p1.col1 < p2.col2) -- should not be necessary to test this
union
select distinct
p1.col1,
p2.col1
from #pairs p1
inner join #pairs p2
on p1.col2 = p2.col2 -- col2 with col2
where (p1.col1 < p2.col1) -- here it is necessary to test this
;
delete np -- remove entries that are already in the table
from #new_pairs np
inner join #pairs p
on p.col1 = np.col1 and
p.col2 = np.col2
set @MyMin = right(('00' + cast(datepart(mi, getdate()) as varchar(3))), 2)
set @MySec = right(('00' + cast(datepart(ss, getdate()) as varchar(3))), 2)
set @MyMS = right(('000' + cast(datepart(ms, getdate()) as varchar(4))), 3)
set @loop_counter = @loop_counter + 1
print 'loop count: ' + cast(@loop_counter as varchar(5)) + '; time: ' + @MyMin + ':' + @MySec + '.' + @MyMS
end
|||Hi Dan, sorry for the late reply; I took a couple of days off, but now I'm back. Thanks for posting the above information. I will give it a try over the weekend and let you know how it goes. Thanks again, Jim
|||Jim,
Ultimately I think we want a solution that works in less than 10 minutes.
The only thing that might prevent us from having such a solution is some "large loop" in the data, where AAAA does not sell directly to ZZZZ, but AAAA sells to AAAB, which sells to AAAC, which sells to AAAD, ... which sells to ZZZX, which sells to ZZZY, which sells to ZZZZ. (AAAA also does not sell directly to any of the "intermediate" entries in the chain.)
In such a case the closure itself becomes the problem: a connected component of n members ultimately produces n * (n - 1) / 2 rows in #pairs, so a chain running through all 26^4 = 456,976 possible four-character names would mean on the order of 10^11 pairs -- which is a lot of work!
Dan
|||Hi Dan, not sure if you're still around, however I've finally had a chance to try the code again with an even larger dataset. Everything works great except for what you mentioned above, when one set of connected groups is very large. Without that one very large set I have, which can be over 500K groups flagged with the same number, the query finishes in only 10 to 15 minutes. But once it hits that large set it runs for over 10 hours, and I've yet to finish running the code to give you any results; yesterday they rebooted the server because of too much CPU usage. Now, with the larger dataset of 18 million rows, there could be an even larger set that I haven't identified. Not sure if there is a way to have the query bypass the large sets, or some way I can identify and release the memory if I'm taking up too much? Anyway, thanks for your help to this point. Cheers
|||Jim,
I am unaware of any way to determine, in advance, that you have some very large loop, such as one with 500K "buyer-seller" pairs.
Is such a large loop relevant to what you are trying to do? Can you ignore it, and put in an iteration count limit so that you find only the loops of fewer than, say, 100 buyer-seller pairs?
I suppose 500K "buyer-seller" pairs could represent an incredible money-laundering scheme!
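If setting them aside works for you, something along these lines (a sketch on my part; the 100-member threshold is only an assumption, and it presumes the iteration cap let the loop finish) could remove the oversized groups from #group_members before you load your final table:
-- find the components that grew past the threshold ...
if object_id('tempdb..#oversized') is not null drop table #oversized
select group_number
into #oversized
from #group_members
group by group_number
having count(*) > 100
;
-- ... and remove their members
delete gm
from #group_members gm
inner join #oversized o
    on gm.group_number = o.group_number
;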
Dan
|||Thanks Dan, yes, that's what I've done. For now I have set aside that large loop, and the query you wrote runs in 10 minutes, sometimes even quicker. Thanks a lot for your help.
Jim