Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-10746

Hive 1.2.0+Tez produces 1-byte FileSplits from mapred.TextInputFormat

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Fixed
    • 0.14.0, 0.14.1, 1.1.0, 1.1.1, 1.2.0
    • 1.2.1, 2.0.0
    • Hive, Tez
    • None
    • Use sane split min-sizes when using legacy mapred.InputFormat::getSplits(job, num)

    Description

      The following query:

      SELECT appl_user_id, arsn_cd, COUNT(*) as RecordCount FROM adw.crc_arsn GROUP BY appl_user_id,arsn_cd ORDER BY appl_user_id;
      

      runs consistently fast in Spark and Mapreduce on Hive 1.2.0. When attempting to run this same query against Tez as the execution engine it consistently runs for over 300-500 seconds this seems extremely long. This is a basic external table delimited by tabs and is a single file in a folder. In Hive 0.13 this query with Tez runs fast and I tested with Hive 0.14, 0.14.1/1.0.0 and now Hive 1.2.0 and there clearly is something going awry with Hive w/Tez as an execution engine with Single or small file tables. I can attach further logs if someone needs them for deeper analysis.

      HDFS Output:

      hadoop fs -ls /example_dw/crc/arsn
      Found 2 items
      -rwxr-x---   6 loaduser hadoopusers          0 2015-05-17 20:03 /example_dw/crc/arsn/_SUCCESS
      -rwxr-x---   6 loaduser hadoopusers    3883880 2015-05-17 20:03 /example_dw/crc/arsn/part-m-00000
      

      Hive Table Describe:

      hive> describe formatted crc_arsn;
      OK
      # col_name              data_type               comment             
                       
      arsn_cd                 string                                      
      clmlvl_cd               string                                      
      arclss_cd               string                                      
      arclssg_cd              string                                      
      arsn_prcsr_rmk_ind      string                                      
      arsn_mbr_rspns_ind      string                                      
      savtyp_cd               string                                      
      arsn_eff_dt             string                                      
      arsn_exp_dt             string                                      
      arsn_pstd_dts           string                                      
      arsn_lstupd_dts         string                                      
      arsn_updrsn_txt         string                                      
      appl_user_id            string                                      
      arsntyp_cd              string                                      
      pre_d_indicator         string                                      
      arsn_display_txt        string                                      
      arstat_cd               string                                      
      arsn_tracking_no        string                                      
      arsn_cstspcfc_ind       string                                      
      arsn_mstr_rcrd_ind      string                                      
      state_specific_ind      string                                      
      region_specific_in      string                                      
      arsn_dpndnt_cd          string                                      
      unit_adjustment_in      string                                      
      arsn_mbr_only_ind       string                                      
      arsn_qrmb_ind           string                                      
                       
      # Detailed Table Information             
      Database:               adw                      
      Owner:                  LOADUSER@EXA.EXAMPLE.COM   
      CreateTime:             Mon Apr 28 13:28:05 EDT 2014     
      LastAccessTime:         UNKNOWN                  
      Protect Mode:           None                     
      Retention:              0                        
      Location:               hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn        
      Table Type:             EXTERNAL_TABLE           
      Table Parameters:                
              EXTERNAL                TRUE                
              transient_lastDdlTime   1398706085          
                       
      # Storage Information            
      SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       
      InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
      OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat       
      Compressed:             No                       
      Num Buckets:            -1                       
      Bucket Columns:         []                       
      Sort Columns:           []                       
      Storage Desc Params:             
              field.delim             \t                  
              line.delim              \n                  
              serialization.format    \t                  
      Time taken: 1.245 seconds, Fetched: 54 row(s)
      

      Explain Hive 1.2.0 w/Tez:

      STAGE DEPENDENCIES:
        Stage-1 is a root stage
        Stage-0 depends on stages: Stage-1
      
      STAGE PLANS:
        Stage: Stage-1
          Tez
            Edges:
              Reducer 2 <- Map 1 (SIMPLE_EDGE)
              Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
      
      
      Explain Hive 0.13 w/Tez:
      STAGE DEPENDENCIES:
        Stage-1 is a root stage
        Stage-0 is a root stage
      
      STAGE PLANS:
        Stage: Stage-1
          Tez
            Edges:
              Reducer 2 <- Map 1 (SIMPLE_EDGE)
              Reducer 3 <- Reducer 2 (SIMPLE_EDGE) 
      

      Results:

      	Hive 1.2.0 w/Spark 1.3.1:
       		Finished successfully in 7.09 seconds
      
      	Hive 1.2.0 w/Mapreduce:
      		Stage 1: 32 Seconds
      		Stage 2: 35 Seconds
      
      	Hive 1.2.0 w/Tez 0.5.3:
      		Time taken: 565.025 seconds, Fetched: 11516 row(s)
      	
      	Hive 0.13 w/Tez 0.4.0:
      		Time taken: 13.552 seconds, Fetched: 11516 row(s)
      

      And finally looking at the Dag Attempt that is stuck for 500 seconds or so in Tez it looks to be stuck running the same method over and over again:

      8 duration=2561 from=org.apache.hadoop.hive.ql.exec.tez.RecordProcessor>
      2015-05-18 19:58:41,719 INFO [TezChild] exec.Utilities: PLAN PATH = hdfs://xhadnnm1p.example.com:8020/tmp/hive/gss2002/dbc4b0b5-7859-4487-a56d-969440bc5e90/hive_2015-05-18_19-58-25_951_5497535752804149087-1/gss2002/_tez_scratch_dir/4e635121-c4cd-4e3f-b96b-9f08a6a7bf5d/map.xml
      2015-05-18 19:58:41,822 INFO [TezChild] exec.MapOperator: MAP[4]: records read - 1
      2015-05-18 19:58:41,835 INFO [TezChild] io.HiveContextAwareRecordReader: Processing file hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
      2015-05-18 19:58:41,848 INFO [TezChild] io.HiveContextAwareRecordReader: Processing file hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
      
      ......
      
      
      2015-05-18 20:07:46,560 INFO [TezChild] io.HiveContextAwareRecordReader: Processing file hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
      2015-05-18 20:07:46,574 INFO [TezChild] io.HiveContextAwareRecordReader: Processing file hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
      2015-05-18 20:07:46,587 INFO [TezChild] io.HiveContextAwareRecordReader: Processing file hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
      2015-05-18 20:07:46,603 INFO [TezChild] io.HiveContextAwareRecordReader: Processing file hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
      2015-05-18 20:07:46,603 INFO [TezChild] log.PerfLogger: </PERFLOG method=TezRunProcessor start=1431993518764 end=1431994066603 duration=547839 from=org.apache.hadoop.hive.ql.exec.tez.TezProcessor>
      2015-05-18 20:07:46,603 INFO [TezChild] exec.MapOperator: 4 finished. closing... 
      2015-05-18 20:07:46,603 INFO [TezChild] exec.MapOperator: RECORDS_IN_Map_1:13440
      

      Attachments

        1. HIVE-10746.2.patch
          3 kB
          Gopal Vijayaraghavan
        2. HIVE-10746.1.patch
          3 kB
          Gopal Vijayaraghavan
        3. slow_query_output.zip
          447 kB
          Greg Senia

        Activity

          People

            gopalv Gopal Vijayaraghavan
            gss2002 Greg Senia
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: