Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
159 views
in Technique[技术] by (71.8m points)

c# - Compare 2 unordered recordset in memory

Below is my application database table which contains SQL queries stored in a table: QueryStorage

Id           Query           ConnectionString       Rdbms
1            select...        Data Source           Sql Server
2            select...        Data Source           Oracle

The SQL queries in the above table are updated through web service and we are not allowed to update above the queries, though we can add something on top of the query something like this:

Query stored in table: select id as LinkedColumn, Amount as CompareColumn from Source

Tweaked query from my c# app:select Q.LinkedColumn, Q.CompareColumn from (stored sql query) as Q

I am trying to compare 2 unordered list like below:

Query executed for Id = 1(Sql server) from QueryStorage table records are like below:

select Id as LinkedColumn,CompareColumn from Source

List 1:

LinkedColumn     CompareColumn
1                100
2                200
3                300
4                400
5                500
6                600
7                700
8                800
9                900
10               1000

Query executed for Id = 2(Oracle) from QueryStorage table records are like below:

select Id as LinkedColumn,CompareColumn from Target

List 2:

LinkedColumn       CompareColumn
10                 10
9                  20
8                  30
7                  40
6                  50
5                  60
4                  70
3                  80
2                  90
1                  5

I want to join on LinkedColumn from source to target and then do comparison on CompareColumn which should give me following output:

SrcLinkedColumn      SrcCompareColumn     TgtLinkedColumn    TgtCompareColumn
    1                       100           1                  5
    2                       200           2                  90               

Logic:

var data = (from s in List1.AsEnumerable()
             join t in List2.AsEnumerable() on s.Field<string>("LinkedColumn") equals t.Field<string>("LinkedColumn")
             where s.Field<decimal>("CompareColumn") != t.Field<decimal>("CompareColumn")
                        select new
                        {
                            srcLinkedcol = s.Field<string>("LinkedColumn"),
                            srcCompareCol = s.Field<decimal>("CompareColumn"),
                            tgtLinkedCol = t.Field<string>("LinkedColumn"),
                            tgtCompareCol = t.Field<decimal>("CompareColumn")
                        }).ToList();

There will be millions of records from source to target which I want to compare with so big issue is of out of memory exception which we are facing right now by loading all data in memory and then doing comparison with above linq query.

There are 2 solutions which I have thought of like below:

1) Open 2 ordered data reader.

Pros :

    - No memory exception
    - Fast as there will be 1 to 1 comparision of LinkedColumn for List1 and 
      List2 records.

Cons :

    - Order by is require on LinkedColumns and as i have no control over 
      query as because it is dumped by webservice in QueryStorage table so 
      user is explicitly require to submit query with order by on 
      LinkedColumn.
    - Wont work if order by on Linkedcolumn is not present.
    - Order by query have performance overhead so because of this user may not include order by on LinkedColumn in query.                               

2) Compare chunk by chunk records by tweaking query and adding OffSet and FetchRowNext. This is how I am thinking for algorithm:

enter image description here

But i still feel that with second approach i can get memory exception issue because at some steps where data from source and target are not matching i will be storing them inside buffer(datatable or list etc..) for next chunk comparision.

Can anybody please guide me what should be the good algorithm for this or any better way to address this?

Note : I dont want to use LinkedServer and SSIS.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Your second solution looks like an attempt to re-invent the External merge sort. It is a valid approach if your data doesn't fit into memory, but it is overkill when your datasets fit into memory.

24 million rows with two numbers each (8 bytes) is only ~200MB of memory. Two datasets are 400MB. This is nothing on the modern desktop hardware.

Load both datasets in memory. In simple arrays. Sort arrays in memory. Find the difference by scanning both sorted arrays once. You don't need fancy LINQ for that.

Here is pseudocode. We have Array1 and Array2. Maintain two variables that contain the current index of each array: Idx1 and Idx2. Move the indexes along the arrays looking for places where LinkedColumn match.

Idx1 = 0; Idx2 = 0;
while (true)
{
    // scan arrays until both indexes point to the same LinkedColumn
    // here we assume that both arrays are sorted by LinkedColumn
    // and that values in LinkedColumn are unique

    while (Idx1 < Array1.Length && Idx2 < Array2.Length &&
        Array1[Idx1].LinkedColumn < Array2[Idx2].LinkedColumn)
    {
        // move along Array1
        Idx1++;
    }

    while (Idx1 < Array1.Length && Idx2 < Array2.Length &&
        Array1[Idx1].LinkedColumn > Array2[Idx2].LinkedColumn)
    {
        // move along Array2
        Idx2++;
    }

    // at this point both indexes point to the same LinkedColumn
    // or one or both of the arrays are over

    if (Idx1 >= Array1.Length || Idx2 >= Array2.Length)
    {
        break;
    }

    // compare the values
    if (Array1[Idx1].CompareColumn != Array2[Idx2].CompareColumn)
    {
        // TODO: output/save/print the difference
    }

    Idx1++; Idx2++;
}

OR

you can dump both datasets in a database of your choice in two tables T1 and T2, create unique index on LinkedColumn in both tables and run this query:

SELECT
     T1.LinkedColumn  AS SrcLinkedColumn
    ,T1.CompareColumn AS SrcCompareColumn
    ,T2.LinkedColumn  AS DstLinkedColumn
    ,T2.CompareColumn AS DstCompareColumn
FROM
    T1 INNER JOIN T2 ON T1.LinkedColumn = T2.LinkedColumn
WHERE
    T1.CompareColumn <> T2.CompareColumn
ORDER BY
    T1.LinkedColumn
;

The pseudocode above performs the same merge join as the DBMS server would perform for this query.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...