unafold is taking long to finish...

apadr007
Joined: 07/02/2013

Hi, I'm running this command on 22,000 transcripts:

hybrid-ss-min --suffix DAT -n DNA --tracebacks 1 input.fa

After running for 4 days, it has completed only 6% of the transcripts. Why is this taking so long? Here is what top shows:

top - 10:28:50 up 100 days, 14:16, 1 user, load average: 1.03, 1.02, 1.00
Tasks: 256 total, 2 running, 254 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 6.2%ni, 93.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 49449116k total, 4240360k used, 45208756k free, 277588k buffers
Swap: 1020116k total, 0k used, 1020116k free, 775860k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
24047 padrona 39 19 1264m 1.0g 736 R 100.1 2.1 5493:42 hybrid-ss-min

It's running at 100% CPU with 25 GB of RAM allocated to the job. Is there a way to run this on multiple threads? At this pace, the run will literally take a month to finish, which seems ridiculous. I must be doing something wrong, right? How do you guys speed up your jobs?

Thanks,
AP

NMarkham
Joined: 09/29/2011
Sorry for the late reply, but hopefully my comments will have some benefit to future users at least.

  1. This is unrelated to the performance issue, but I should clear up some misconceptions about the command-line options:
    • --suffix overrides --NA, --tmin, --tinc, --tmax, --sodium and --magnesium. In this case, the DAT rules are for RNA, and adding -n DNA has no effect. (I regret that usage like this is silently ignored, rather than resulting in an error.)
    • There is no --tracebacks option in hybrid-ss-min; --tracebacks is used in the partition function programs to compute stochastic tracebacks. However, since hybrid-ss-min outputs an MFE structure by default, this one probably isn't hurting anything.
  2. As for the running time: what you described seems more or less reasonable. The run time depends, of course, on two factors: how long the sequence is and how powerful the CPU is. For example, on a fairly modest Core i5, I find it takes about 12 minutes to fold 7,000 bases. The time grows cubically, so that e.g. 14,000 bases would probably take eight times as long (roughly 96 minutes), and so on. (Memory usage grows quadratically, if that's a concern.)

    I believe UNAFold compares favorably to other similar software (notably mfold) regarding folding times. I'd be interested to hear of any software that can do it significantly faster.

  3. There is an obvious way to run this computation in parallel, even without any explicit support for it in UNAFold: make two files, each with 11,000 transcripts, and run two copies of hybrid-ss-min at the same time. :) The downside is that memory consumption will double, but the upside is that there will be almost no overhead, so the running time will virtually halve.
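
To make points 2 and 3 concrete, here is a rough Python sketch. It is not part of UNAFold, and every function name in it is made up for illustration. It extrapolates folding time cubically from the single data point quoted above (about 12 minutes for 7,000 bases on one particular CPU, so treat the numbers as ballpark only) and splits a FASTA file into several parts. Instead of cutting the file into equal transcript counts, as suggested above, it balances the parts by estimated cost with a simple greedy rule, which is my own variation and may help when transcript lengths vary a lot.

```python
import heapq


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def estimated_minutes(n_bases, ref_bases=7000, ref_minutes=12.0):
    """Cubic extrapolation from an assumed reference point.

    The defaults are the figures quoted in the reply (one machine only).
    """
    return ref_minutes * (n_bases / ref_bases) ** 3


def split_by_cost(records, n_parts):
    """Greedy split: give the longest remaining sequence to the lightest part."""
    heap = [(0.0, i) for i in range(n_parts)]  # (estimated minutes, part index)
    parts = [[] for _ in range(n_parts)]
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        cost, i = heapq.heappop(heap)
        parts[i].append((header, seq))
        heapq.heappush(heap, (cost + estimated_minutes(len(seq)), i))
    return parts


def write_parts(parts, prefix="part"):
    """Write each part to <prefix><i>.fa and return the file names."""
    names = []
    for i, part in enumerate(parts):
        name = f"{prefix}{i}.fa"
        with open(name, "w") as fh:
            for header, seq in part:
                fh.write(f"{header}\n{seq}\n")
        names.append(name)
    return names
```

For example, reading input.fa with read_fasta, passing the records through split_by_cost with n_parts=2, and calling write_parts would produce part0.fa and part1.fa. You would then start one hybrid-ss-min --suffix DAT process per file, keeping in mind that each process needs its own memory.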