Tamer Elsayed
Thu, 29 May 2008 07:07:09 -0700
Hi,

The number of pairs you are emitting is totally dominated by the most frequent users (i.e., the users with the longest lists of groups). If you can accept approximate results, I'd suggest that you drop the *top* 1% (or even 0.1%) of users based on their frequencies. In a very similar study at the University of Maryland, with about a million documents (which correspond to the groups in your problem), we managed to get a linear-time approximation of this quadratic-complexity problem. Here is the link to the study:

http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf

In a search-related application, we found experimentally that this trick (specifically, dropping the top 0.1%) results in a drop of just 2% in effectiveness. Of course, this might be different for different applications.

Tamer

On 5/26/08, jkupferman <[EMAIL PROTECTED]> wrote:
>
> Hi Yuri,
> So each user actually outputs (n^2-n)/2 records, where n is the number of groups it is a member of. If the groups were arranged in an array from 0...n-1, then group x is paired with every group from x+1...n-1.
>
> But yes, it does output a LOT of records. This is why I used the combiner, which has been shown to decrease the number of output records about 10x, and based on my understanding the combiner is run locally, so only the combined records actually make it to the sort.
>
> I took a look at the implementation and the output is buffered, so hopefully that helps, since if it were written directly to disk on every output it's understandable why it would be slow. I have io.file.buffer.size set to 4096; since I am outputting so much, should I increase this quite a bit? How big should I be looking to make it?
>
> Thanks for the help
>
> Yuri Kudryavcev-2 wrote:
> >
> > Hi.
> >
> > I really would like some input on this case, since I'm trying to scale up a similar algorithm.
> >
> > I could be totally wrong, please correct me.
> > So you're emitting C(n,2) group pairs from every user record by going over all group pairs? For n = 100 groups for an average user, that's 4950 output records per user. Do you see similar numbers in the logs? I think inflating the intermediate records in this proportion degrades performance.
> >
> > - Yuri.
> >
> > On 5/26/08, jkupferman <[EMAIL PROTECTED]> wrote:
> >>
> >> Hi everyone,
> >> I am using Hadoop (0.17) to try to do some large-scale user comparisons, and although the programs are all written, it's taking incredibly long to run and it seems like it should be going faster. I would really like some insight into what I could do to speed this up aside from just "add more computers". I would really appreciate some help from all of the sagacious Hadoop core-users.
> >>
> >> The basic idea is that there are a bunch of users, each of which is in some groups. I would like to know how many users each combination of groups has in common. I laid out the data using sequence files, which seems to be working well and quickly; each sequence file entry has a Text user name and a MapWritable containing all of the groups they are in. The map function takes in each user and outputs every combination of the groups it is a part of, together with a 1 as the instance counter (like in wordcount). So user x, which is a member of groups 1,2,3,4, will output 1-2,1-3,1-4,2-3,2-4,3-4 as keys.
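A minimal sketch of what that pair-emitting map function might look like, assuming the old org.apache.hadoop.mapred API and the (Text user, MapWritable of groups) input described above. GroupPairMapper and MAX_GROUPS are hypothetical names; the fixed cutoff is only a crude stand-in for dropping the top ~0.1% most frequent users, which in practice you would derive from the actual group-count distribution:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class GroupPairMapper extends MapReduceBase
        implements Mapper<Text, MapWritable, Text, IntWritable> {

      // Hypothetical cutoff: skip the heaviest users instead of emitting
      // millions of pairs for them (approximates the "drop the top 0.1%" idea).
      private static final int MAX_GROUPS = 1000;

      private static final IntWritable ONE = new IntWritable(1);
      private final Text pair = new Text();

      public void map(Text user, MapWritable groups,
                      OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        if (groups.size() > MAX_GROUPS) {
          return;                       // accept a small loss in accuracy
        }
        // Assume the group ids are the keys of the MapWritable; sort them so
        // each unordered pair is emitted exactly once, smallest id first.
        List<String> ids = new ArrayList<String>();
        for (Writable g : groups.keySet()) {
          ids.add(g.toString());
        }
        Collections.sort(ids);
        for (int i = 0; i < ids.size(); i++) {
          for (int j = i + 1; j < ids.size(); j++) {
            pair.set(ids.get(i) + "-" + ids.get(j));
            output.collect(pair, ONE);  // (n^2 - n) / 2 records per user
          }
        }
      }
    }

With the cutoff in place, no single user can contribute more than MAX_GROUPS*(MAX_GROUPS-1)/2 records, which is what tames the otherwise quadratic blow-up from users who belong to thousands of groups.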
> >> Given that there are a lot of users, I made a combiner which reduces the number of records about 10x. The reducer is really simple: it just sums up the total for each combination and then outputs it to a file. Just as an aside, I make sure to use IntWritables just about everywhere, which I hoped would help since there are inevitably tons of comparisons going on.
> >>
> >> This is being done on about 4GB of user data on a cluster of 20 Large instances on Amazon's EC2. With that much data there are about 240 map tasks, and I have it set to run 10 map tasks per tasktracker. With those settings the slaves are running at about 100% CPU and memory is just about at capacity, but there is almost no paging. Although the tasks seem to be progressing, some of the tasks that have just completed have run for 30+ hours. Some of the tasks have failed with a "Lost task tracker:" error, which I intend to fix with HADOOP-3403_0_20080516.patch whenever this job finishes.
> >>
> >> It seemed to me that the problem might be calling the collector so many times, since users can be in 1000's of groups and it does about n^2 comparisons. I tried another version which outputs only n times by having each entry output a map, but this did not prove much better in the test trials I ran, and the extra work in the reducer is really a killer.
> >>
> >> It is not clear to me what is dragging down this job, or what I can do to increase the rate at which it is computing. Although there is quite a bit of data, it doesn't seem like it should be taking this long on 20 nodes. Any help/questions/comments would be greatly appreciated. Thanks for all of your help.
> >>
> >> --
> >> View this message in context:
> >> http://www.nabble.com/Speed-up-a-job-thats-been-running-for-60%2B-hours-%28long%29-tp17465721p17465721.html
> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
> --
> View this message in context:
> http://www.nabble.com/Speed-up-a-job-thats-been-running-for-60%2B-hours-%28long%29-tp17465721p17474577.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.

--
Proud to be a follower of the "Best of Mankind"
"And remember your Lord when you forget, and say: It may be that my Lord will guide me to something nearer than this to the right course."
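For completeness, a rough sketch of the summing combiner/reducer the quoted message describes (the same class can be registered as both combiner and reducer), again assuming the old org.apache.hadoop.mapred API; PairCountReducer is a made-up name:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class PairCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text pair, Iterator<IntWritable> counts,
                         OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        int sum = 0;
        while (counts.hasNext()) {
          sum += counts.next().get();   // add up partial counts for this group pair
        }
        output.collect(pair, new IntWritable(sum));
      }
    }

Because addition is associative, running this class as the combiner collapses the per-node duplicates of each pair key before the sort and shuffle, which is where the roughly 10x reduction in intermediate records comes from.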
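And a hypothetical driver wiring it together, touching the io.file.buffer.size question from the quoted message. The property names are real, but the values shown are only illustrative; io.file.buffer.size mainly affects stream and SequenceFile I/O, while the map-side sort buffer (io.sort.mb) tends to matter more for a job emitting this many small records:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;

    public class GroupPairJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(GroupPairJob.class);
        conf.setJobName("group-pair-counts");

        // Input is the (Text user, MapWritable groups) sequence files described above.
        conf.setInputFormat(SequenceFileInputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        conf.setMapperClass(GroupPairMapper.class);
        conf.setCombinerClass(PairCountReducer.class);   // local pre-aggregation
        conf.setReducerClass(PairCountReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // Illustrative values: bump the 4 KB I/O buffer to 64 KB, and give the
        // map-side sort buffer more room (only if the task JVMs have heap to
        // spare, since the nodes are already close to memory capacity).
        conf.setInt("io.file.buffer.size", 65536);
        conf.setInt("io.sort.mb", 200);

        JobClient.runJob(conf);
      }
    }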