core-user  

Re: Speed up a job that's been running for 60+ hours (long)

Tamer Elsayed
Thu, 29 May 2008 07:07:09 -0700

Hi,

The number of pairs you are emitting is totally dominated by the most frequent
users (i.e., the users with the longest lists of groups). If you can accept
approximate results, then I'd suggest that you drop the *top* 1% (or even
0.1%) of users based on their frequencies. In a very similar study at the
University of Maryland, with about a million documents (which correspond to
the groups in your problem), we managed to get a linear-time approximation of
this inherently quadratic problem. Here is the link to the study:

http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf

In a search-related application, we found experimentally that this trick
(specifically, dropping the top 0.1%) results in a drop of just 2% in
effectiveness. Of course, this might be different for different
applications.
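
A minimal sketch of the kind of filter I mean (the cutoff value and the names
here are assumptions, not from your job; you would pick the threshold from the
groups-per-user distribution of your own data):

    // Drop the most group-heavy users so the quadratic pair emission stays bounded.
    // GROUP_CUTOFF ~ roughly the 99th (or 99.9th) percentile of groups-per-user,
    // measured in a cheap counting pass over the data first.
    private static final int GROUP_CUTOFF = 500; // illustrative value only

    public void map(Text user, MapWritable groups,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      if (groups.size() > GROUP_CUTOFF) {
        return; // skip this user entirely; the final counts become approximate
      }
      // ... normal pair emission for everyone else ...
    }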

Tamer

On 5/26/08, jkupferman <[EMAIL PROTECTED]> wrote:
>
>
> Hi Yuri,
> So each user actually outputs (n^2 - n)/2 records, where n is the number of
> groups it is a member of. If the groups are arranged in an array from
> 0...n-1, then group x is paired with every group from x+1...n-1.
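>
> In code, the emission loop is roughly the following (a simplified sketch;
> the array name and the ONE constant are just illustrative, my real mapper
> builds the keys from the MapWritable):
>
>     // groupIds: this user's group ids, sorted ascending (indices 0..n-1)
>     // ONE is a reusable IntWritable(1) field on the mapper class
>     for (int x = 0; x < groupIds.length; x++) {
>       for (int y = x + 1; y < groupIds.length; y++) {
>         // every unordered pair is emitted exactly once => (n^2 - n)/2 records
>         output.collect(new Text(groupIds[x] + "-" + groupIds[y]), ONE);
>       }
>     }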
>
> But yes, it does output a LOT of records. This is why I used the combiner,
> which has been shown to decrease the number of output records by about 10x,
> and based on my understanding the combiner runs locally, so only the combined
> records actually make it to the sort.
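>
> For reference, wiring the combiner in is a one-liner in the job setup (the
> class names here are made up; the point is that the combiner and the reducer
> can share the same summing logic):
>
>     JobConf conf = new JobConf(GroupOverlapJob.class); // hypothetical driver class
>     conf.setMapperClass(PairMapper.class);     // hypothetical mapper
>     conf.setCombinerClass(SumReducer.class);   // runs map-side, pre-sums pairs
>     conf.setReducerClass(SumReducer.class);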
>
> I took a look at the implementation and the output is buffered, so hopefully
> that helps; if it were written directly to disk on every output, it would be
> understandable why it is slow. I have io.file.buffer.size set to 4096. Since
> I am outputting so much, should I increase this size quite a bit? How big
> should I be looking to make it?
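>
> (If it does make sense to raise it, I assume it is just a matter of something
> like the following in the job conf; the 64 KB value is only an illustration,
> not a tuned number:)
>
>     // io.file.buffer.size defaults to 4096 bytes; a larger buffer means
>     // fewer, larger I/O operations on each buffered stream.
>     conf.setInt("io.file.buffer.size", 65536);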
>
> Thanks for the help
>
>
>
>
> Yuri Kudryavcev-2 wrote:
> >
> > Hi.
> >
> > I really would like some input on this case, since I'm trying to scale up
> > a similar algorithm.
> >
> > I could be totally wrong, so please correct me. :)
> > So you're emitting C(n,2) group pairs from every user record by enumerating
> > group pairs?
> > For n = 100 groups for an average user, that's 4950 output records for
> > every user. Do you see similar numbers in the logs?
> > I think inflating the intermediate volume of records by this factor is what
> > degrades performance.
> >
> > - Yuri.
> >
> > On 5/26/08, jkupferman <[EMAIL PROTECTED]> wrote:
> >>
> >>
> >> Hi everyone,
> >> I am using Hadoop (0.17) to try to do some large-scale user comparisons,
> >> and although the programs are all written, it's taking incredibly long to
> >> run and it seems like it should be going faster. I would really like some
> >> insight into what I could do to speed this up aside from just "add more
> >> computers". I would really appreciate some help from all of the sagacious
> >> Hadoop core-users.
> >>
> >> The basic idea is that there are a bunch of users, each of which is in
> >> some groups. I would like to know how many users each combination of
> >> groups has in common. I laid out the data using sequence files, which
> >> seems to be working well and quickly: each sequence file entry has a Text
> >> user name and a MapWritable which contains all of the groups that user is
> >> in. The map function takes in each user and then outputs all of the
> >> combinations of the groups the user is a part of, each with a 1 as the
> >> instance counter (like in wordcount). So user x, which is a member of
> >> groups 1,2,3,4, will output 1-2,1-3,1-4,2-3,2-4,3-4 as keys. Given that
> >> there are a lot of users, I made a combiner which reduces the number of
> >> records by about 10x. The reducer is really simple: it just sums up the
> >> total for each combination and then outputs it to a file. Just as an
> >> aside, I make sure to use IntWritables just about everywhere, which I
> >> hoped would help since there are inevitably tons of comparisons going on.
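> >>
> >> To be concrete, the reducer is essentially the wordcount pattern; a rough
> >> sketch of what mine does:
> >>
> >>     public void reduce(Text pair, Iterator<IntWritable> values,
> >>                        OutputCollector<Text, IntWritable> output,
> >>                        Reporter reporter) throws IOException {
> >>       int sum = 0;
> >>       while (values.hasNext()) {
> >>         sum += values.next().get(); // number of users sharing this group pair
> >>       }
> >>       output.collect(pair, new IntWritable(sum));
> >>     }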
> >>
> >> This is being done on about 4 GB of user data on a 20 Large-instance
> >> cluster on Amazon's EC2. With that much data there are about 240 map
> >> tasks, and I have it set to run 10 map tasks per task tracker. With those
> >> settings the slaves are running at about 100% CPU and memory is just
> >> about at capacity, but there is almost no paging. Although the tasks seem
> >> to be progressing, some of the tasks that have just completed ran for 30+
> >> hours. Some of the tasks have failed with a "Lost task tracker:" error,
> >> which I intend to fix with HADOOP-3403_0_20080516.patch whenever this job
> >> finishes.
> >>
> >> It seemed to me that the problem might have been calling the collector so
> >> many times, since users can be in 1000s of groups and it does about n^2
> >> comparisons. I tried another version which outputs only n records per user
> >> by having each entry output a map, but this did not prove much better in
> >> the test trials I ran, and the extra work in the reducer is really a
> >> killer.
> >>
> >> It is not clear to me what is dragging down this job, or what I can do to
> >> increase the rate at which it is computing. Although there is quite a bit
> >> of data, it doesn't seem like it should be taking this long on 20 nodes.
> >> Any help/questions/comments would be greatly appreciated. Thanks for all
> >> of your help.
> >>
> >>
> >>
> >>
> >
> >
>
>
>


-- 
Proud to be a follower of the "Best of Mankind"
"And remember your Lord when you forget, and say: perhaps my Lord will guide
me to something nearer than this to right conduct."