
royw on "gradual deterioration of Hive performance"


Hi,

We are developing a daily batch process with Hive (DSE 3.0). As more tables are loaded and processed, we are observing a distinct slowdown of Hive for identical HQL batch runs (operating on batch data of approximately the same size).

One distinct factor that appears to be related to the slowdown is the increasing delay between the completion of the MapReduce job and the generation of the result table. The following is an example of the symptom we are observing: for the simple HQL execution below, the last entry in the execution log has a timestamp of 17:53:53, while the target file shows a timestamp of 17:54:18 (converted from UTC), so there is a 25-second gap between the processing finishing and the output table appearing. When we first started this daily processing, the gap was about 1 or 2 seconds; as more data has been processed, the delay has gradually increased to the current 25 seconds. Our current DSE node size is about 80GB.
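
For reference, this is roughly how we compare the two timestamps (the log and table paths are from the run shown below; the tail/awk one-liner is just a sketch that assumes the log lines start with a timestamp):

# last timestamp written to the Hive execution log
tail -n 1 /tmp/root/root_20130513175353_58d446d0-b8c5-4206-9d22-bdb692c98d14.log | awk '{print $1, $2}'
# modification time of the output file on CFS (dfs -stat prints UTC)
hive -e "dfs -stat /user/hive/warehouse/staging.db/tmp_bind_00_20130116A/000000_0;"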

Based on the CFS design, we cannot think of any reason why the amount of data already stored would negatively affect new data insertion. I am wondering if anyone else has experienced a similar problem, or has any idea which configuration options could contribute to this?
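
In case anyone wants to compare notes, we dump the effective configuration before each run and diff it between runs; these are just the standard Hive CLI commands, nothing DSE-specific:

hive -e "set -v;" > hive_settings.txt
# look for anything reducer-, scratch-, or tmp-related that might explain the move cost
grep -i -e reducer -e scratch -e tmp hive_settings.txt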

thanks,
Roy

##HQL output:
> insert overwrite table tmp_bind_00_20130116A
> select distinct
> concat(cast(site_id as string),'-',bind_id) as bind_guid
> , bind_date
> , day_of_week
> , bind_id as ori_bind_id
> , site_id
> , device_id
> , substring(bind_date,1,10) as bind_day
> from rawdata_00_20130116
> ;
Automatically selecting local only mode for query
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Execution log at: /tmp/root/root_20130513175353_58d446d0-b8c5-4206-9d22-bdb692c98d14.log
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2013-05-13 17:53:15,452 null map = 0%, reduce = 0%
2013-05-13 17:53:21,456 null map = 100%, reduce = 0%
2013-05-13 17:53:27,459 null map = 100%, reduce = 100%
Ended Job = job_local_0001
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Loading data to table staging.tmp_bind_00_20130116A
Table staging.tmp_bind_00_20130116A stats: [num_partitions: 0, num_files: 0, num_rows: 21685, total_size: 0, raw_data_size: 2233555]
OK
bind_guid bind_date day_of_week ori_bind_id site_id device_id bind_day
Time taken: 73.574 seconds

##excerpt of execution log (/tmp/root/root_20130513175353_58d446d0-b8c5-4206-9d22-bdb692c98d14.log) -- note the roughly 14-second gap between the two "Moving tmp dir" entries:

2013-05-13 17:53:39,467 INFO exec.ExecDriver (SessionState.java:printInfo(391)) - Ended Job = job_local_0001
2013-05-13 17:53:39,473 INFO exec.FileSinkOperator (Utilities.java:mvFileToFinalPath(1267)) - Moving tmp dir: cfs://IVM-CRS-VM41/tmp/hive-root/hive_2013-05-13_17-53-05_049_2835076602177774524/_tmp.-ext-10000 to: cfs://IVM-CRS-VM41/tmp/hive-root/hive_2013-05-13_17-53-05_049_2835076602177774524/_tmp.-ext-10000.intermediate
2013-05-13 17:53:53,348 INFO exec.FileSinkOperator (Utilities.java:mvFileToFinalPath(1278)) - Moving tmp dir: cfs://IVM-CRS-VM41/tmp/hive-root/hive_2013-05-13_17-53-05_049_2835076602177774524/_tmp.-ext-10000.intermediate to: cfs://IVM-CRS-VM41/tmp/hive-root/hive_2013-05-13_17-53-05_049_2835076602177774524/-ext-10000

##timestamp of target table:
> dfs -stat /user/hive/warehouse/staging.db/tmp_bind_00_20130116A/000000_0;
2013-05-13 21:54:18
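
(The stat above is in UTC; a quick way to convert it to our local time, assuming GNU date:)

date -d '2013-05-13 21:54:18 UTC'
# prints the same instant in the local zone, i.e. 17:54:18 for us (UTC-4)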

