Forum Discussion

SAGAR1985's avatar
SAGAR1985
Copper Contributor
Jul 24, 2024

How to create view joining two table of similar records without duplicating.

Hi All,

Hope all are doing great. I have a small query regarding the Join function to create view. I have two table in which table 1 is having 40k records and table 2 is having 40k records. But when i am joining the both to create view, it is showing 47k records. Please note that i am combining table using two common column in both, one is "Calendar_Date" and one is "Personal_ID". In some cases few Personal ID has duplicate records for each month.  So while creating the view the, these duplicate records gets multiplied . Hence the total records exceeds 47k. How to make it same 40k records which i am having in each table.

 

Any help would be highly appreciated. 

 

Regards,

Sagar

  • SAGAR1985 I think I might see where the issue lies.

    Your query for the [dbo].[ARC_Skill] table is selecting distinct Personal_IDs, meaning you are counting unique IDs. If there's duplicates, you're not seeing them.

     

     

    select distinct Personal_ID from [dbo].[ARC_Skill] where Calendar_day = '2024-05-31';
    --42678 rows

     

     


    If you do the following queries, I believe you'll find that you have the same amount of duplicates in both your tables:

     

     

    Select 
    	Personal_ID, /* The personal_ID */
    	Calendar_day, /* The calendar day */
    	Count(Personal_ID) as NumberOfDuplicates /* Number of rows with this given Personal_ID and Calendar_day */
    from [dbo].[ARC_SKILL]
    where Calendar_day = '2024-05-31'
    Group By Personal_ID, Calendar_day /* Grouping enables counting rows pr. Personal_ID */
    Order by NumberOfDuplicates Desc /* Largest number on top */
    /* You may have to use 'order by COUNT(Personal_ID) desc' instead */
    
    Select 
    	Personal_ID, /* The personal_ID */
    	Calendar_day, /* The calendar day */
    	Count(Personal_ID) as NumberOfDuplicates /* Number of rows with this given Personal_ID and Calendar_day */
    from [dbo].[Skill_Analysis]
    where Calendar_day = '2024-05-31'
    Group By Personal_ID, Calendar_day /* Grouping enables counting rows pr. Personal_ID */
    Order by NumberOfDuplicates Desc /* Largest number on top */
    /* You may have to use 'order by COUNT(Personal_ID) desc' instead */

     

     


    In your counting table at the end, we see the columns

    Row LabelView CountActual
    RL1x^2x
    RL2y^2y


    This means you've accidentally gotten a cartesian cross product where your tables contain the same amount of duplicates for the combination of [Personal_ID] and [Calendar_day].

    For RL '51512295', that means all 11 rows with a given [Personal_ID] and [Calendar_day] in [ARC_Skill] has been matched separately with all 11 rows with the same [Personal_ID] and [Calendar_day] in [Skill_Analysis]. 11*11 = 121.

    ----
    In conclusion, [Personal_ID] and [Calendar_day] are not sufficient to uniquely match between these tables.
    I hope there's another column you can add in the join predicate in order to get the expected result.

    • SAGAR1985's avatar
      SAGAR1985
      Copper Contributor

      Thanks SivertSolem,
      I can share the code and few sample data with it.

      The below is my table 1 total no of rows indicated after --.
      select distinct Personal_ID from [dbo].[ARC_Skill] where Calendar_day = '2024-05-31';--42678

      One more table has no of rows given after --
      select * from [dbo].[Skill_Analysis] where Calendar_day = '2024-05-31';--42678
      In both there 42678 no of records/rows. But when i have created View with Left Join , the no of rows is showing 47390.

      select * from [dbo].[VW_ARC_SKILL] where Calendar_day = '2024-05-31';--47390

      Now coming to my View code. Please find it below.
      ------
      ALTER VIEW [dbo].[VW_ARC_SKILL]
      AS
      SELECT
      B.Calendar_day AS Calendar_day,
      B.Employee_Name AS Employee_Name,
      B.Personal_ID AS Personal_ID,
      B.Sub_Band AS Sub_Band,
      B.Band_Grouping AS Band_Grouping,
      S.Job AS Job,
      B.Job_Group AS Job_Group,
      B.Never_Billed_Ageing AS Never_Billed_Ageing,
      B.Employee_Group AS Employee_Group,
      B.WFO_WT AS WFO_WT,
      B.WFH_WT AS WFH_WT,
      S.PhyACT_Days_Worked AS PhyACT_Days_Worked,
      S.PhyACT_Total_Working_Days AS PhyACT_Total_Working_Days,
      B.WFO_Category AS WFO_Category,
      B.HR_L2 AS HR_L2


      FROM
      [DB_Nippon_ARC].[dbo].[ARC_Skill] AS B
      LEFT JOIN
      [DB_Nippon_ARC].[dbo].[Skill_Analysis] AS S ON B.Calendar_day = S.Calendar_day AND B.Personal_ID = S.Personal_ID;
      GO
      ---------
      And for verification purpose i have taken the total view data and found few Personal ID have been duplicated multiple times.
      For example:
      -

      Row Labels In View count Actual count
      51512295 121 11
      52220994 100 10
      51789131 100 10
      40181872 100 10
      52221344 81 9
      52160816 81 9
      52160892 81 9
      52220993 64 8
      52212900 64 8
      52212169 49 7
      52234502 36 6
      52212233 36 6
      52212904 36 6
      52221374 36 6
      52141407 36 6
      52037147 36 6
      51985793 36 6
      51938909 36 6
      51627184 36 6
      40201464 36 6
      40101433 36 6
      51511318 36 6
      40130571 36 6
      52227723 25 5
      52227721 25 5
      52237307 25 5


      ---------------------------------
      In the above if you look at the top right is showing actual no of times the personal ID is repeated, but while joining into view, in view the repetition has been multiplied. for example 1st Personal ID
      51512295 is actually present for 11 times but while joining in view it is coming as 11*11= 121 times.

      How to corrcet this to view only for 42678 records !

      Regards,
      Sagar

      • SivertSolem's avatar
        SivertSolem
        Iron Contributor

        SAGAR1985 I think I might see where the issue lies.

        Your query for the [dbo].[ARC_Skill] table is selecting distinct Personal_IDs, meaning you are counting unique IDs. If there's duplicates, you're not seeing them.

         

         

        select distinct Personal_ID from [dbo].[ARC_Skill] where Calendar_day = '2024-05-31';
        --42678 rows

         

         


        If you do the following queries, I believe you'll find that you have the same amount of duplicates in both your tables:

         

         

        Select 
        	Personal_ID, /* The personal_ID */
        	Calendar_day, /* The calendar day */
        	Count(Personal_ID) as NumberOfDuplicates /* Number of rows with this given Personal_ID and Calendar_day */
        from [dbo].[ARC_SKILL]
        where Calendar_day = '2024-05-31'
        Group By Personal_ID, Calendar_day /* Grouping enables counting rows pr. Personal_ID */
        Order by NumberOfDuplicates Desc /* Largest number on top */
        /* You may have to use 'order by COUNT(Personal_ID) desc' instead */
        
        Select 
        	Personal_ID, /* The personal_ID */
        	Calendar_day, /* The calendar day */
        	Count(Personal_ID) as NumberOfDuplicates /* Number of rows with this given Personal_ID and Calendar_day */
        from [dbo].[Skill_Analysis]
        where Calendar_day = '2024-05-31'
        Group By Personal_ID, Calendar_day /* Grouping enables counting rows pr. Personal_ID */
        Order by NumberOfDuplicates Desc /* Largest number on top */
        /* You may have to use 'order by COUNT(Personal_ID) desc' instead */

         

         


        In your counting table at the end, we see the columns

        Row LabelView CountActual
        RL1x^2x
        RL2y^2y


        This means you've accidentally gotten a cartesian cross product where your tables contain the same amount of duplicates for the combination of [Personal_ID] and [Calendar_day].

        For RL '51512295', that means all 11 rows with a given [Personal_ID] and [Calendar_day] in [ARC_Skill] has been matched separately with all 11 rows with the same [Personal_ID] and [Calendar_day] in [Skill_Analysis]. 11*11 = 121.

        ----
        In conclusion, [Personal_ID] and [Calendar_day] are not sufficient to uniquely match between these tables.
        I hope there's another column you can add in the join predicate in order to get the expected result.

Resources