Big Data Processing framework (Spark and Dask) on PBS

Hi everyone,

A quick word to share with you some work that has been done on using Spark and Dask with PBS at CNES (the French space agency). PBS scripts to launch Spark- or Dask-based clusters are available in this repo: https://github.com/guillaumeeb/big-data-frameworks-on-pbs.
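For readers who have not opened the repo yet, a script of this kind typically starts a Dask scheduler on the first allocated node and one worker on each remaining node. Here is a minimal hedged sketch of that pattern; the resource selection, ports, and the assumption that `dask-scheduler`/`dask-worker` are on the PATH of every node are illustrative, not taken from the CNES scripts:

```shell
#!/bin/bash
#PBS -N dask-cluster
#PBS -l select=3:ncpus=4:mem=16gb
#PBS -l walltime=01:00:00

# Hypothetical sketch, not the actual repo script.
# Assumes dask-scheduler and dask-worker are installed on all nodes.

# The first host in the allocation runs the scheduler.
SCHEDULER_HOST=$(head -n 1 "$PBS_NODEFILE")
SCHEDULER_ADDRESS="tcp://${SCHEDULER_HOST}:8786"

pbsdsh -n 0 -- dask-scheduler --port 8786 &
sleep 10   # give the scheduler time to come up

# Start one worker per remaining allocated chunk.
NCHUNKS=$(wc -l < "$PBS_NODEFILE")
for i in $(seq 1 $((NCHUNKS - 1))); do
    pbsdsh -n "$i" -- dask-worker "$SCHEDULER_ADDRESS" --nthreads 4 &
done

wait   # keep the PBS job alive while the Dask cluster runs
```

A Spark script follows the same shape, with `start-master.sh`/`start-worker.sh` in place of the Dask commands.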

Has anyone already done this? Do you have something to share?
Are there plans to add similar functionality to PBS? I discussed this a bit with @subhasisb some time ago, but I don't know what the current situation is.
I'd be happy to get any feedback on this, so feel free to answer or ask anything.

Cheers,
Guillaume.


Hi @guillaumeeb,

Thanks for the update on this work. This is going to benefit the entire PBS community.
We can certainly look at including links to your work from the PBS Professional github pages etc.

To start with, we will help by testing out these scripts in the short term.

Regards,
Subhasis

Hi @subhasisb, any update on testing the scripts? Thanks.

To update a bit here: we are now exclusively using Dask on our cluster. I have thus contributed to a project that has been around for quite some time now:
https://jobqueue.dask.org/en/latest/

I encourage everyone to take a look at it; it can really simplify cluster usage, from job arrays to complex workflows to distributed science.

See more about Dask here:
