  1. Hive
  2. HIVE-8120 Umbrella JIRA tracking Parquet improvements
  3. HIVE-11763

Use * instead of sum(hash(*)) in Parquet predicate pushdown (PPD) integration tests

Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: None
    • Labels: None

    Description

      The integration tests for Parquet predicate push down (PPD) use the following query to validate the values filtered:

      select sum(hash(*)) from ...
      

      It would be better to use select * from ... instead, so that the filtered values themselves can be checked. A hash sum makes it difficult to tell whether a given value was filtered out.
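
      As a minimal sketch of the difference, the two validation styles might look like this for a single predicate. The column name b for the boolean column of newtypestbl is an assumption made for illustration; it is not given in this description.

      ```sql
      -- Current style: the expected-output file records a single number,
      -- so a wrongly filtered row only shows up as a different hash total.
      select sum(hash(*)) from newtypestbl where b = true;

      -- Proposed style: the surviving rows are printed, so the expected
      -- output shows directly which values passed the predicate.
      -- (b is a hypothetical column name.)
      select * from newtypestbl where b = true;
      ```

      With the two sample rows inserted by the test, the second query would be expected to list only the "apple"/"bee" rows, assuming b holds the boolean value.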

      Also, we can limit the number of rows in the INSERT ... SELECT statement to avoid displaying many rows when validating the data. I think a LIMIT 2 on each SELECT would be enough.

      For example, parquet_ppd_boolean.q has this:

      insert overwrite table newtypestbl
      select * from (
        select cast("apple" as char(10)), cast("bee" as varchar(10)), 0.22, true from src src1
        union all
        select cast("hello" as char(10)), cast("world" as varchar(10)), 11.22, false from src src2
      ) uniontbl;
      

      If we use LIMIT 2 on each SELECT, we reduce the number of rows inserted:

      insert overwrite table newtypestbl
      select * from (
        select cast("apple" as char(10)), cast("bee" as varchar(10)), 0.22, true from src src1 LIMIT 2
        union all
        select cast("hello" as char(10)), cast("world" as varchar(10)), 11.22, false from src src2 LIMIT 2
      ) uniontbl;
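
      If a bare LIMIT inside a UNION ALL branch is not accepted, or does not apply per branch (this may depend on the Hive version and has not been checked against the patch), one possible variant wraps each branch in its own subquery. The aliases part1 and part2 are hypothetical names introduced for this sketch.

      ```sql
      insert overwrite table newtypestbl
      select * from (
        -- each branch is limited inside its own aliased subquery
        select * from (
          select cast("apple" as char(10)), cast("bee" as varchar(10)), 0.22, true
          from src src1 limit 2
        ) part1
        union all
        select * from (
          select cast("hello" as char(10)), cast("world" as varchar(10)), 11.22, false
          from src src2 limit 2
        ) part2
      ) uniontbl;
      ```

      This keeps at most two rows per branch, so the table would hold at most four rows.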
      

      Attachments

        1. HIVE-11763.2.patch
          185 kB
          Sergio Peña

          People

            Assignee: Sergio Peña (spena)
            Reporter: Sergio Peña (spena)
            Votes: 0
            Watchers: 2