
Multiple Jobs In A Single Host



Message boards : Number crunching : Multiple Jobs In A Single Host

Profile Laurence
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 12 Sep 14
Posts: 575
Credit: 128,447
RAC: 1,010
Message 236 - Posted: 8 Apr 2015, 15:00:53 UTC

At the moment we have set the max_jobs_in_progress to be 1. This means that when you join this project only one VM should be started no matter how many cores you have available. This has been done to avoid being greedy and swamping your machine with VMs.

Are any of you running more than one CMS-dev VM on your host and if so how did you manage to override this parameter?

Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 818
Credit: 656,279
RAC: 5,709
Message 237 - Posted: 9 Apr 2015, 0:21:52 UTC - in response to Message 236.

I'm not, but when we get closer to production I'd like to be able to select myself how many to run -- e.g. I have 20-core servers with 128 GB of RAM that should be able to run more than one instance. (Not to mention that 60-core, 240-thread Xeon Phi languishing in the lab, tho' I don't think VirtualBox will run in its limited resources... :-)

Profile Steve Hawker*
Joined: 6 Mar 15
Posts: 16
Credit: 129,381
RAC: 542
Message 238 - Posted: 9 Apr 2015, 0:39:58 UTC - in response to Message 237.

I'm not, but when we get closer to production I'd like to be able to select myself how many to run --


Oh yes, please please please!!!

Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 818
Credit: 656,279
RAC: 5,709
Message 239 - Posted: 9 Apr 2015, 1:42:26 UTC - in response to Message 238.

I'm not, but when we get closer to production I'd like to be able to select myself how many to run --

Oh yes, please please please!!!

Just keep in mind that this is a very resource-intensive project, even as we're running it in pre-beta. I had a case recently where tasks were failing because a 5+ GB VM was left in a slot directory where subsequent tasks were run; as they accumulated results the directory exceeded 10 GB, a limit in our set-up, and the tasks failed.
Also, we do extensive network traffic, in the tens to hundreds of megabytes. This is really not viable on my home network, as I have (had! It's been very flaky since the Sunday before Easter) a maximum of 6 Mbps, or roughly 45 MB/min.
Given that 6 Mbps just barely covers a 720p video stream from iPlayer, I get some interruptions to the BBC's News at Six. Luckily I have the option now of fibre-to-the-cabinet, and the cabinet lives in my front hedge, so I can get 60 Mbps or so if I ever get the round tuits. At a guess, I'd say one needs at least a 10 Mbps connexion to avoid CMS@Home intruding excessively into one's other (home -- I have up to 1 Gbps at work :-) network activities.
There's also the issue of disk space usage, but I haven't really quantified that yet. And of course memory, our current VMs have a 1 GB memory space each.
And don't forget, all of these parameters are subject to change once we start processing "real-life" work flows.
I don't mean to be a Cassandra but, all in all, this will never be a set-and-forget project like SETI@Home can usually be.

LCB001
Joined: 5 Apr 15
Posts: 3
Credit: 684,200
RAC: 2,451
Message 240 - Posted: 9 Apr 2015, 13:49:17 UTC

I had a case recently where tasks were failing because a 5+ GB VM was left in a slot directory where subsequent tasks were run; as they accumulated results the directory exceeded 10 GB, a limit in our set-up, and the tasks failed.

This just happened to me with my first two CMS-dev WUs, except it was work from other projects that was getting computation errors due to lack of disk space.

This had me puzzled until I found the two VDIs left in the slots directory, which meant that between the slots and the regular files CMS-dev was taking more than 12.5 GB of space on the 128 GB SSD I use for BOINC, and this without one running at the moment.

Now that I know to keep an eye out for leftovers it won't be a problem for me, but others might be in for an unpleasant surprise if this isn't fixed.

Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 818
Credit: 656,279
RAC: 5,709
Message 408 - Posted: 26 May 2015, 15:31:33 UTC - in response to Message 240.
Last modified: 26 May 2015, 15:36:12 UTC

I had a case recently where tasks were failing because a 5+ GB VM was left in a slot directory where subsequent tasks were run; as they accumulated results the directory exceeded 10 GB, a limit in our set-up, and the tasks failed.

This just happened to me with my first two CMS-dev WUs, except it was work from other projects that was getting computation errors due to lack of disk space.

This had me puzzled until I found the two VDIs left in the slots directory, which meant that between the slots and the regular files CMS-dev was taking more than 12.5 GB of space on the 128 GB SSD I use for BOINC, and this without one running at the moment.

Now that I know to keep an eye out for leftovers it won't be a problem for me, but others might be in for an unpleasant surprise if this isn't fixed.

Well, hopefully we've found the cause for that -- see this thread if you haven't already.

But let's get back to the subject! The question of running more than one task at a time has arisen again. It doesn't make much sense at present, given that we're not producing "real" data yet, but how do we do it without overloading smaller machines? vLHC allows two tasks to be active at a time. In principle the user can use app_config.xml in the project directory to limit the number of concurrent jobs. I just tried on both a Windows and a Linux machine, using
<app_config>
  <app>
    <name>vboxwrapper</name>
    <max_concurrent>1</max_concurrent>
  </app>
</app_config>

The documentation says that the project needs to be reset if this is changed, but I had to stop and restart BOINC as well. Oops, the Windows machine is the one that won't start a CMS task properly, and it looks like it's having the same problem with vLHC -- "VM Hypervisor failed to enter an online state in a timely fashion". The Linux box is now running just one task though, with the other in "waiting to run" state. I'll let it play for a couple of days and see what happens.

Profile MAGIC Quantum Mechanic
Joined: 8 Apr 15
Posts: 96
Credit: 888,083
RAC: 1,606
Message 409 - Posted: 26 May 2015, 20:52:54 UTC - in response to Message 408.

Yeah, over at vLHC the usual way members get one or two tasks is just by setting the preferences on their account.

Most still run X2, but we still have a few that just run one task at a time.

And there we can also run the regular tasks or the new Databridge tasks.

Since on occasion one type will not have any tasks, mine is set to run X2 with either version, which usually keeps my hosts running (like right now, when the Databridge tasks are empty, so mine switched back to the other version).

http://lhcathome2.cern.ch/vLHCathome/top_hosts.php?sort_by=total_credit

That would work here too.....now or in the future.

(and nice not having those "snapshots" with Databridge)

Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 818
Credit: 656,279
RAC: 5,709
Message 411 - Posted: 27 May 2015, 21:20:14 UTC - in response to Message 409.

Well, the .xml file I posted above does seem to work for the vLHC tasks. It's a little bit chicken-and-egg as to what order you need to do i) creating the file; ii) resetting the project; and iii) stopping and restarting BOINC -- but it works itself out in short order.
Now the question is: can we ship a one-job-at-a-time default file, and then raise the per-PC job limit, so that people who want, and have the capacity, to run more than one can easily edit the file to that end?

Richard Haselgrove
Joined: 4 May 15
Posts: 64
Credit: 54,671
RAC: 0
Message 412 - Posted: 27 May 2015, 21:29:19 UTC - in response to Message 411.

Well, the .xml file I posted above does seem to work for the vLHC tasks. It's a little bit chicken-and-egg as to what order you need to do i) creating the file; ii) resetting the project; and iii) stopping and restarting BOINC -- but it works itself out in short order.
Now the question is: can we ship a one-job-at-a-time default file, and then raise the per-PC job limit, so that people who want, and have the capacity, to run more than one can easily edit the file to that end?

Where are you seeing anything about resetting the project?

My recipe for doing that would be
i) Create the file
ii) Issue the 'Read config files' command from BOINC Manager

If you have two tasks running at once, and set <max_concurrent> to 1, one of them will stop. Simple as that.
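On the command line, that two-step recipe might look roughly like the sketch below. The data-directory path and project-directory name are illustrative assumptions, and boinccmd --read_cc_config is the command-line counterpart of the Manager's 'Read config files' command:

```shell
# Sketch of the recipe above on a Linux host. The BOINC data directory
# and project-directory name are illustrative assumptions; adjust both.
BOINC_DIR="${BOINC_DIR:-/tmp/boinc-demo}"    # e.g. /var/lib/boinc-client
PROJ_DIR="$BOINC_DIR/projects/lhcathome2.cern.ch_vLHCathome"
mkdir -p "$PROJ_DIR"

# Step i: create app_config.xml in the project directory.
cat > "$PROJ_DIR/app_config.xml" <<'EOF'
<app_config>
  <app>
    <name>vboxwrapper</name>
    <max_concurrent>1</max_concurrent>
  </app>
</app_config>
EOF

# Step ii: ask the running client to reread its config files
# (the command-line counterpart of "Read config files"; it needs a
# live client, so it is left commented out in this sketch):
# boinccmd --read_cc_config

echo "Wrote $PROJ_DIR/app_config.xml"
```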

Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 818
Credit: 656,279
RAC: 5,709
Message 415 - Posted: 28 May 2015, 9:51:05 UTC - in response to Message 412.

In http://boinc.berkeley.edu/wiki/client_configuration#Application_configuration:
"If you remove app_config.xml, or one of its entries, you must reset the project in order to restore the proper values."
Perhaps I'm misreading it.

Profile PDW
Joined: 20 May 15
Posts: 215
Credit: 1,951,272
RAC: 2,147
Message 416 - Posted: 28 May 2015, 10:04:30 UTC - in response to Message 415.

That's what it says!

I do know that if all you change is the max_concurrent value, then just doing a 'Read config files' will take effect immediately.

Richard Haselgrove
Joined: 4 May 15
Posts: 64
Credit: 54,671
RAC: 0
Message 417 - Posted: 28 May 2015, 10:30:01 UTC - in response to Message 415.

I think the operative word in that note might be 'remove' - either the whole file, or some entry types within it.

I suspect some entries, notably the thread count settings for MT applications, may have a delayed impact - they might only come into effect when a new task is started, or when new work is fetched from the project concerned. And the GPU count is slow to display, even though it comes into effect immediately. Altogether, it's a slightly clumsy and perhaps unfinished (though incredibly useful) mechanism.

I do have an editing account for that Wiki, so if I can think of a better form of words (or if anyone else can suggest one), I can post it.

Yeti
Joined: 29 May 15
Posts: 117
Credit: 331,531
RAC: 0
Message 557 - Posted: 13 Aug 2015, 16:01:50 UTC - in response to Message 415.

In http://boinc.berkeley.edu/wiki/client_configuration#Application_configuration:
"If you remove app_config.xml, or one of its entries, you must reset the project in order to restore the proper values."
Perhaps I'm misreading it.


Hmm, this depends on what you are changing:

If you allow one or more cores, or take them away, it is enough to say "Read config files" and BOINC can react to it.

If you change the number of CPUs a VM can use, then this will work only with the next started VM/WU, so resetting may be a good idea. For running VMs the number of allowed cores is not changed.

m
Volunteer tester
Joined: 20 Mar 15
Posts: 188
Credit: 222,693
RAC: 612
Message 585 - Posted: 15 Aug 2015, 16:21:29 UTC
Last modified: 15 Aug 2015, 16:30:01 UTC

Note that there is a "bug": whilst the app_config file will allow only one task to run at a time, the work-fetch process doesn't take this into account. So more tasks can be downloaded, only for the "excess" to sit there waiting. This can also result in idle cores (presumably the one that would have run the waiting task). It may be possible to circumvent this using the avg_ncpus setting, but it didn't work for me. Setting this to 1 allowed only one task to be downloaded, but after a few seconds BOINC simply went back and got another.
This is average CPUs and can be fractional, so setting it to a lower value might work, but I've not tried.
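For anyone else experimenting with avg_ncpus: per the BOINC client-configuration docs it belongs in an <app_version> section keyed by app name, alongside the <app> section used for max_concurrent. A hedged sketch (whether the vbox apps here also need a <plan_class> entry is untested):

```xml
<app_config>
  <app>
    <name>vboxwrapper</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>vboxwrapper</app_name>
    <avg_ncpus>0.5</avg_ncpus>
  </app_version>
</app_config>
```

As noted above, though, this seems to influence the client's scheduling estimates only; it did not stop extra tasks being fetched.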

m
Volunteer tester
Joined: 20 Mar 15
Posts: 188
Credit: 222,693
RAC: 612
Message 1002 - Posted: 5 Sep 2015, 11:13:36 UTC - in response to Message 585.

Note that there is a "bug": whilst the app_config file will allow only one task to run at a time, the work-fetch process doesn't take this into account. So more tasks can be downloaded, only for the "excess" to sit there waiting. This can also result in idle cores (presumably the one that would have run the waiting task). It may be possible to circumvent this using the avg_ncpus setting, but it didn't work for me. Setting this to 1 allowed only one task to be downloaded, but after a few seconds BOINC simply went back and got another.
This is average CPUs and can be fractional, so setting it to a lower value might work, but I've not tried.


Looks as though this may have been fixed in v7.6.9. From the version history: "client: fix job scheduling bug that starves CPU instances". I haven't specifically tested it, but it is running OK on Atlas so far.

Thanks, Rom.

Richard Haselgrove
Joined: 4 May 15
Posts: 64
Credit: 54,671
RAC: 0
Message 1003 - Posted: 5 Sep 2015, 12:13:09 UTC - in response to Message 1002.

Note that there is a "bug": whilst the app_config file will allow only one task to run at a time, the work-fetch process doesn't take this into account. So more tasks can be downloaded, only for the "excess" to sit there waiting. This can also result in idle cores (presumably the one that would have run the waiting task). It may be possible to circumvent this using the avg_ncpus setting, but it didn't work for me. Setting this to 1 allowed only one task to be downloaded, but after a few seconds BOINC simply went back and got another.
This is average CPUs and can be fractional, so setting it to a lower value might work, but I've not tried.


Looks as though this may have been fixed in v7.6.9. From the version history: "client: fix job scheduling bug that starves CPU instances". I haven't specifically tested it, but it is running OK on Atlas so far.

Thanks, Rom.

That was a very specific bug - introduced by mistake in v7.6.3 - that Cliff Harding found and we worked through from here.

Unless you were seeing similar symptoms in cpu_sched_debug logging - specifically like

[cpu_sched_debug] using 2.00 out of 6 CPUs

I doubt this change is relevant to you. Work fetch and app_config.xml still aren't hooked up.
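For anyone wanting to check for the same symptoms, the cpu_sched_debug log flag is enabled in cc_config.xml in the BOINC data directory (a minimal sketch; other flags and options omitted), and takes effect after a 'Read config files':

```xml
<cc_config>
  <log_flags>
    <cpu_sched_debug>1</cpu_sched_debug>
  </log_flags>
</cc_config>
```

The resulting [cpu_sched_debug] lines then appear in the event log, as in the example quoted above.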

Jim1348
Joined: 17 Aug 15
Posts: 11
Credit: 84,896
RAC: 974
Message 1010 - Posted: 5 Sep 2015, 19:14:23 UTC - in response to Message 585.
Last modified: 5 Sep 2015, 19:16:18 UTC

Note that there is a "bug": whilst the app_config file will allow only one task to run at a time, the work-fetch process doesn't take this into account. So more tasks can be downloaded, only for the "excess" to sit there waiting. This can also result in idle cores (presumably the one that would have run the waiting task).

Quite true. It has not been a problem here so far, since the limit set for the CMS tasks is only 1. But to prevent CPU starvation on other projects, you can adjust the "resource share" for each project accordingly.

For example, I have 6 cores available (2 are reserved for GPUs), and if I want 2 cores for Project A and 4 cores for Project B, I use the app_config to limit Project A to 2 work units max. Then I adjust the resource share to 50 for Project A (against Project B's default of 100), which sets those downloads to 1/3 of the total. I don't need to do anything for Project B, and it all works out well most of the time. Occasionally there are mis-estimates for the running time, but they get corrected eventually.
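That share arithmetic can be sanity-checked quickly. This sketch assumes Project B stays at the default resource share of 100, so Project A's share of 50 gives it 50/150 = 1/3 of the work:

```shell
# Resource-share arithmetic for the example above (shares are relative;
# Project B keeping the default 100 is an assumption).
a=50   # Project A's resource share
b=100  # Project B's resource share
awk -v a="$a" -v b="$b" 'BEGIN {
  printf "Project A fraction: %.3f (~%.1f of 6 cores)\n", a/(a+b), 6*a/(a+b)
}'
# Prints: Project A fraction: 0.333 (~2.0 of 6 cores)
```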

You don't actually need the app_config at all in that case if you are willing to live with long-term averages, but there are projects that take a lot of memory where I do want a limit at all times.

m
Volunteer tester
Joined: 20 Mar 15
Posts: 188
Credit: 222,693
RAC: 612
Message 1011 - Posted: 5 Sep 2015, 20:07:17 UTC - in response to Message 1003.

Note that there is a "bug": whilst the app_config file will allow only one task to run at a time, the work-fetch process doesn't take this into account. So more tasks can be downloaded, only for the "excess" to sit there waiting. This can also result in idle cores (presumably the one that would have run the waiting task). It may be possible to circumvent this using the avg_ncpus setting, but it didn't work for me. Setting this to 1 allowed only one task to be downloaded, but after a few seconds BOINC simply went back and got another.
This is average CPUs and can be fractional, so setting it to a lower value might work, but I've not tried.


Looks as though this may have been fixed in v7.6.9. From the version history: "client: fix job scheduling bug that starves CPU instances". I haven't specifically tested it, but it is running OK on Atlas so far.

Thanks, Rom.

That was a very specific bug - introduced by mistake in v7.6.3 - that Cliff Harding found and we worked through from here.

Unless you were seeing similar symptoms in cpu_sched_debug logging - specifically like

[cpu_sched_debug] using 2.00 out of 6 CPUs

I doubt this change is relevant to you. Work fetch and app_config.xml still aren't hooked up.


You're right. Thanks, Richard.

Toby Broom
Joined: 19 Aug 15
Posts: 25
Credit: 1,220,315
RAC: 3,649
Message 1032 - Posted: 7 Sep 2015, 17:18:34 UTC
Last modified: 7 Sep 2015, 17:22:34 UTC

I followed these instructions for vLHC to set up all my computers to run >1 task. I still use it on one computer to crank out more work.

http://lhcathome2.cern.ch/vLHCathome/forum_thread.php?id=1154&postid=13411#13411

On Atlas@Home I make use of app_config to tune the WUs that my PCs can handle based on the amount of RAM, as it stomps your computer if you let it run free. The only computer that's close to working right is the one with 80 GB of RAM.
