Saturday, October 1, 2022
Optimizing Hive on Tez Performance


Tuning Hive on Tez queries can never be done with a one-size-fits-all approach. Query performance depends on the size of the data, file types, query design, and query patterns. During performance testing, evaluate and validate configuration parameters and any SQL modifications. It is advisable to make one change at a time while performance testing the workload, and it is best to assess the impact of tuning changes in your development and QA environments before using them in production. Cloudera WXM can assist in evaluating the benefits of query changes during performance testing.

Tuning Guidelines

It has been observed across several migrations from CDH distributions to CDP Private Cloud that Hive on Tez queries tend to perform slower than on older execution engines like MR or Spark. This is usually caused by differences in out-of-the-box tuning behavior between the execution engines. Additionally, users may have done tuning in the legacy distribution that is not automatically carried over in the conversion to Hive on Tez. For users upgrading from the HDP distribution, this discussion also helps to review and validate whether the properties are correctly configured for performance in CDP.

The steps below help you identify the areas to focus on that might degrade performance.

Step 1: Verify and validate the YARN Capacity Scheduler configurations. A misconfigured queue can hurt query performance by placing an arbitrary cap on the resources available to the user. Validate the user-limit factor, min-user-limit percent, and maximum capacity. (Refer to the YARN – The Capacity Scheduler blog to understand these configuration settings.)

Step 2: Review the relevance of any safety valves (the non-default values for Hive and HiveServer2 configurations) for Hive and Hive on Tez. Remove any legacy and outdated properties.

Step 3: Identify the area of slowness, such as map tasks, reduce tasks, and joins.

  1. Review the generic Tez engine and platform tunable properties.
  2. Review the map tasks and tune them, increasing or decreasing the task counts as required.
  3. Review the reduce tasks and tune them, increasing or decreasing the task counts as required.
  4. Review any concurrency-related issues. There are two kinds of concurrency issues, as listed below:
    • Concurrency among users within a queue. This can be tuned using the user limit factor of the YARN queue (refer to the details in the Capacity Scheduler blog).
    • Concurrency across pre-warmed containers for Hive on Tez sessions, as discussed in detail below.

Understanding parallelization in Tez

Before changing any configurations, you must understand the mechanics of how Tez works internally. For example, this includes understanding how Tez determines the correct number of mappers and reducers. Reviewing the Tez architecture design and the details of how initial task parallelism and auto-reduce parallelism work will help you optimize query performance.

Understanding numbers of mappers

Tez determines the number of mapper tasks using the initial input data for the job. In Tez, the number of tasks is determined by the grouping splits, which is equivalent to the number of mappers determined by the input splits in MapReduce jobs.

  • tez.grouping.min-size and tez.grouping.max-size determine the number of mappers. The default value for min-size is 16 MB and for max-size is 1 GB.
  • Tez determines the number of tasks such that the data per task is in line with the grouping max/min size.
  • Decreasing tez.grouping.max-size increases the number of tasks/mappers.
  • Increasing tez.grouping.max-size decreases the number of tasks.
  • Consider the following example:
    • Input data (input shards/splits): 1000 files (around 1.5 MB each)
    • Total data size would be 1000 * 1.5 MB = ~1.5 GB
    • Tez might try processing this data with at least two tasks because the maximum data per task could be 1 GB. Eventually, Tez could force the 1000 files (splits) to be combined into two tasks, leading to slower execution times.
    • If tez.grouping.max-size is reduced from 1 GB to 100 MB, the number of mappers could increase to 15, providing better parallelism. Performance then improves because the work is spread across 15 concurrent tasks instead of two.

The above is an example scenario. In a production environment that uses binary file formats like ORC or Parquet, determining the number of mappers can get complicated, because it depends on the storage type, the split strategy, and HDFS block boundaries.
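
The arithmetic in the example above can be sketched in a few lines of Python. This is a deliberate simplification and not Tez's actual grouping algorithm: it assumes the task count is just the total input size divided by tez.grouping.max-size, rounded up, and it ignores tez.grouping.min-size, available cluster capacity, and file-format split strategies. The function name estimate_mappers is hypothetical.

```python
MB = 1024 * 1024
GB = 1024 * MB


def estimate_mappers(total_bytes: int, grouping_max_size: int = 1 * GB) -> int:
    """Rough lower bound on mapper tasks: ceil(total input / max group size).

    Assumption: a simplified model of split grouping, for illustration only.
    """
    if total_bytes <= 0:
        return 0
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-total_bytes // grouping_max_size)


# 1000 files of about 1.5 MB each, roughly 1.5 GB in total.
total_input = 1000 * (3 * MB // 2)

print(estimate_mappers(total_input))            # default max-size of 1 GB
print(estimate_mappers(total_input, 100 * MB))  # max-size lowered to 100 MB
```

Under these assumptions the two calls yield 2 and 15 tasks, which matches the figures in the example.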

Note: A higher degree of parallelism (e.g. a high number of mappers/reducers) does not always translate to better performance, since it could lead to fewer resources per task and higher resource wastage due to task overhead.

Understanding the numbers of reducers

Tez uses a number of mechanisms and settings to determine the number of reducers required to complete a query.

  • Tez determines the reducers automatically based on the data (number of bytes) to be processed.
  • If hive.tez.auto.reducer.parallelism is set to true, Hive estimates data size and sets parallelism estimates. Tez will sample source vertices' output sizes and adjust the estimates at runtime as necessary.
  • By default the maximum number of reducers is set to 1009 (hive.exec.reducers.max).
  • Hive/Tez estimates the number of reducers using the following formula and then schedules the Tez DAG:

Max(1, Min(hive.exec.reducers.max [1009], ReducerStage estimate / hive.exec.reducers.bytes.per.reducer)) x hive.tez.max.partition.factor [2]

  • The following three parameters can be tweaked to increase or decrease the number of reducers:
    1. hive.exec.reducers.bytes.per.reducer
      Size per reducer. Change this to a smaller value to increase parallelism or to a larger value to decrease parallelism. Default Value = 256 MB [i.e. if the input size is 1 GB then 4 reducers will be used]
    2. hive.tez.min.partition.factor Default Value = 0.25
    3. hive.tez.max.partition.factor Default Value = 2.0
      Increase for more reducers. Decrease for fewer reducers.
  • Users can manually set the number of reducers by using mapred.reduce.tasks. This is not recommended and you should avoid using it.
  • Recommendations:
    • Avoid setting the reducers manually.
    • Adding more reducers does not always guarantee better performance.
    • Depending on the reduce stage estimates, tweak the hive.exec.reducers.bytes.per.reducer parameter to a lower or higher value if you want to increase or decrease the number of reducers.
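
As a rough illustration, the reducer formula above can be written out in Python. This is a sketch under stated assumptions and not Hive's actual implementation: the function name estimate_reducers is hypothetical, rounding up to whole reducers is an assumption, and the defaults are the values quoted in this section for hive.exec.reducers.bytes.per.reducer, hive.exec.reducers.max, and hive.tez.max.partition.factor.

```python
import math
from typing import Tuple

MB = 1024 * 1024


def estimate_reducers(
    stage_bytes: int,
    bytes_per_reducer: int = 256 * MB,   # hive.exec.reducers.bytes.per.reducer
    max_reducers: int = 1009,            # hive.exec.reducers.max
    max_partition_factor: float = 2.0,   # hive.tez.max.partition.factor
) -> Tuple[int, int]:
    """Return (base estimate, estimate scaled by the max partition factor)."""
    # Rounding up to a whole reducer is an assumption of this sketch.
    by_size = math.ceil(stage_bytes / bytes_per_reducer)
    base = max(1, min(max_reducers, by_size))
    return base, math.ceil(base * max_partition_factor)


print(estimate_reducers(1024 * MB))            # 1 GB reduce-stage estimate
print(estimate_reducers(1024 * MB, 128 * MB))  # smaller bytes per reducer
```

With the defaults, a 1 GB estimate gives a base of 4 reducers (the figure quoted for hive.exec.reducers.bytes.per.reducer) and 8 after applying the partition factor. Halving the bytes per reducer doubles both numbers. When auto reducer parallelism is enabled, the count actually used can still be adjusted at runtime.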

Concurrency

This section aims to help in understanding and tuning concurrent sessions for Hive on Tez, such as running multiple Tez AM containers. The properties below control the default queues and the number of sessions.

  • hive.server2.tez.default.queues: A list of comma-separated values corresponding to YARN queues for which to maintain a Tez session pool.
  • hive.server2.tez.sessions.per.default.queue: The number of Tez sessions (DAGAppMaster) to maintain in the pool per YARN queue.
  • hive.server2.tez.initialize.default.sessions: If enabled, HiveServer2 (HS2), at startup, will launch all necessary Tez sessions within the specified default.queues to meet the sessions.per.default.queue requirements.

When you define the properties listed above, HiveServer2 will create one Tez Application Master (AM) for each default queue, multiplied by the number of sessions, when the HiveServer2 service starts. Hence:

Total Tez sessions = (HiveServer2 instances) x (default.queues) x (sessions.per.default.queue)

Understanding via example:

  • hive.server2.tez.default.queues = "queue1, queue2"
  • hive.server2.tez.sessions.per.default.queue = 2
    => HiveServer2 will create 4 Tez AMs (2 for queue1 and 2 for queue2).
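
The session-pool arithmetic can be checked with a small Python helper. This is only a sketch of the multiplication described above. The function name total_tez_sessions is hypothetical, and it assumes the queue list is passed as the same comma-separated string used for hive.server2.tez.default.queues.

```python
def total_tez_sessions(hs2_instances: int, default_queues: str, sessions_per_queue: int) -> int:
    """Pooled Tez AMs = HS2 instances x default queues x sessions per queue."""
    # Split the comma-separated queue list and ignore empty entries.
    queues = [q.strip() for q in default_queues.split(",") if q.strip()]
    return hs2_instances * len(queues) * sessions_per_queue


# One HiveServer2 instance, two default queues, two sessions per queue.
print(total_tez_sessions(1, "queue1, queue2", 2))
```

For the example above this gives 4 pooled Tez AMs. Running three HiveServer2 instances with the same settings would give 12.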

Note: The pooled Tez sessions are always running, even on an idle cluster.

If there is continuous usage of HiveServer2, those Tez AMs will keep running, but if your HS2 is idle, those Tez AMs will be killed based on the timeout defined by tez.session.am.dag.submit.timeout.secs.

Case 1: Queue name is not specified

  • A query will only use a Tez AM from the pool (initialized as described above) if one does not specify a queue name (tez.queue.name). In this case, HiveServer2 will pick one of the idle/available Tez AMs (the queue name here may be randomly selected).
  • If one does not specify a queue name, the query remains in a pending state with HiveServer2 until one of the default Tez AMs from the initialized pool is available to serve the query. There won't be any message in the JDBC/ODBC client or in the HiveServer2 log file. Because no message is generated while the query is pending, the user may think the JDBC/ODBC connection or HiveServer2 is broken, but it is waiting for a Tez AM to execute the query.

Case 2: Queue name specified

  • If one does specify the queue name, it doesn't matter how many initialized Tez AMs are in use or idle: HiveServer2 will create a new Tez AM for this connection and the query can be executed (if the queue has available resources).

Guidelines/recommendations for concurrency:

  • For use cases or queries where one doesn't want users limited to the same Tez AM pool, set hive.server2.tez.initialize.default.sessions to false. Disabling this could reduce contention on HiveServer2 and improve query performance.
  • Additionally, increase the number of sessions with hive.server2.tez.sessions.per.default.queue.
  • If there are use cases requiring a separate or dedicated Tez AM pool for each group of users, one will need dedicated HiveServer2 services, each with its respective default queue name and number of sessions, and ask each group of users to use their respective HiveServer2.

Container reuse and prewarm containers

  • Container reuse:
    This is an optimization that limits the startup time impact on containers. It is turned on by setting tez.am.container.reuse.enabled to true. This saves time interacting with YARN. It also keeps container groups alive, spins up containers faster, and skips YARN queues.
  • Prewarm containers:
    The number of containers is related to the number of YARN execution containers that will be attached to each Tez AM by default. This same number of containers will be held by each AM, even when the Tez AM is idle (not executing queries).
    The downside appears in cases where too many containers sit idle and are not released, since the containers defined here are held by the Tez AM even when it is idle. These idle containers continue taking resources in YARN that other applications could potentially use.
    The properties below are used to configure prewarm containers:
    • hive.prewarm.enabled
    • hive.prewarm.numcontainers

General Tez tuning parameters

Review the properties listed below as a first-level check when dealing with performance degradation of Hive on Tez queries. You might need to set or tune some of these properties in accordance with your query and data properties. It is best to assess the configuration properties in development and QA environments, and then push them to production environments depending on the results.

  • hive.cbo.enable
    Setting this property to true enables cost-based optimization (CBO). CBO is part of Hive's query processing engine. It is powered by Apache Calcite. CBO generates efficient query plans by examining tables and conditions specified in the query, eventually reducing the query execution time and improving resource utilization.
  • hive.auto.convert.join
    Setting this property to true allows Hive to convert a common join into a mapjoin based on the input file size.
  • hive.auto.convert.join.noconditionaltask.size
    You will want to perform as many mapjoins as possible in the query. This size configuration lets the user control what size of table can fit in memory. The value represents the sum of the sizes of tables that can be converted to hashmaps that fit in memory.
    The recommendation is to set this to ⅓ the size of hive.tez.container.size.
  • tez.runtime.io.sort.mb
    The size of the sort buffer when output is sorted. The recommendation is to set this to 40% of hive.tez.container.size, up to a maximum of 2 GB. It will rarely need to be above this maximum.
  • tez.runtime.unordered.output.buffer.size-mb
    This is the memory used when the output does not need to be sorted. It is the size of the buffer to use if not writing directly to disk. The recommendation is to set this to 10% of hive.tez.container.size.
  • hive.exec.parallel
    This property enables parallel execution of Hive query stages. By default, this is set to false. Setting this property to true helps to parallelize the independent query stages, resulting in overall improved performance.
  • hive.vectorized.execution.enabled
    Vectorized query execution is a Hive feature that greatly reduces the CPU usage for typical query operations like scans, filters, aggregates, and joins. By default this is set to false. Set this to true.
  • hive.merge.tezfiles
    By default, this property is set to false. Setting this property to true merges the Tez files. Using this property could increase or decrease the execution time of the query depending on the size of the data or the number of files to merge. Assess your query performance in lower environments before using this property.
  • hive.merge.size.per.task
    This property describes the size of the merged files at the end of a job.
  • hive.merge.smallfiles.avgsize
    When the average output file size of a job is less than this number, Hive will start an additional map-reduce job to merge the output files into bigger files. By default, this property is set to 16 MB.
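
The three memory recommendations above (⅓, 40% capped at 2 GB, and 10% of the container size) can be turned into starting values with a short Python sketch. The helper name suggest_memory_settings is hypothetical. The sketch assumes hive.tez.container.size is given in MB, that hive.auto.convert.join.noconditionaltask.size is expressed in bytes, and that the two Tez buffer sizes are expressed in MB. Treat the output as a first guess to validate in a lower environment, not as final settings.

```python
MB = 1024 * 1024


def suggest_memory_settings(container_size_mb: int) -> dict:
    """Derive starting values from hive.tez.container.size (in MB)."""
    return {
        # One third of the container, converted to bytes.
        "hive.auto.convert.join.noconditionaltask.size": container_size_mb * MB // 3,
        # 40% of the container in MB, capped at 2 GB.
        "tez.runtime.io.sort.mb": min(container_size_mb * 40 // 100, 2048),
        # 10% of the container in MB.
        "tez.runtime.unordered.output.buffer.size-mb": container_size_mb * 10 // 100,
    }


# Print session-level SET statements for a 4 GB container.
for name, value in suggest_memory_settings(4096).items():
    print(f"SET {name}={value};")
```

For an 8 GB container the sort buffer would hit the 2 GB cap, while the other two values keep scaling with the container size.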

Summary

This blog covered some basic troubleshooting and tuning guidelines for Hive on Tez queries with respect to CDP. As the very first step in query performance analysis, you should verify and validate all the configurations set on the Hive and Hive on Tez services. Every change made should be tested to ensure that it makes a measurable and beneficial improvement. Query tuning is a specialized effort, and not all queries perform better by changing the Tez configuration properties. You may encounter scenarios where you need to deep dive into the SQL query to optimize and improve its execution and performance. Contact your Cloudera Account and Professional Services team for guidance if you require additional assistance with performance tuning efforts.
