Cuong Nguyen | 783da640a2 | [ci][release] repeated run for release tests (#43472) | 7 months ago
Justin Yu | aed14c8134 | [train] Simplify `ray.train.xgboost/lightgbm` (5/n): Remove `xgboost_ray` and `lightgbm_ray` dependencies (for release tests) (#43425) | 7 months ago
Balaji Veeramani | 006c83f6cd | [Data] Move batch inference release tests to Data team (#43389) | 8 months ago
Balaji Veeramani | 9ddc08c144 | [Data] Remove "inference" release test (#43365) | 8 months ago
Scott Lee | 54cd4ca9a4 | [Data] Add heterogeneous Ray Data + Train release test (#42618) | 8 months ago
Scott Lee | 681976584b | [Data] Increase timeout for `iter_tensor_batches_benchmark_multi_node` release test (#43286) | 8 months ago
can | c6094a96aa | [ci] mark serve_autoscaling_multi_deployment.aws as unstable | 8 months ago
Justin Yu | 71d37ff204 | [ci][train] Remove unnecessary `xgboost_ray`/`lightgbm_ray` reinstalls for release tests (#43176) | 8 months ago
matthewdeng | 6eb1814ecc | [train] remove DEFAULT_NCCL_SOCKET_IFNAME (#42808) | 8 months ago
Stephanie Wang | 851d154a81 | [core] Make microbenchmark stable again (#42813) | 8 months ago
Jiajun Yao | b05e38be6f | Mark many_pgs as stable (#42687) | 9 months ago
Cuong Nguyen | 6f7b66b687 | [ci] fix workspace_template_serving_stable_diffusion (#42397) | 9 months ago
Ricky Xu | f76081f0f4 | [core][ci] Run dag microbenchmark seperately as unstable (#42360) | 9 months ago
Cuong Nguyen | 3ddaef2b5d | [ci] upgrade release tests to py39 (#42102) | 9 months ago
Artur Niederfahrenhorst | 7634169bdb | [Workspace template] Fix version conflict with torch (#42094) | 10 months ago
Cuong Nguyen | 405735828a | [ci] disable finetuning tests (#42038) | 10 months ago
Cuong Nguyen | 2daf1fc4a7 | [ci] mark some rllib as non release blocking (#41698) | 10 months ago
Hao Chen | 27b794cd1d | [data][train] default ingest resource limits should exclude resources used by training (#41603) | 10 months ago
Andrew Xue | d4baa3f9dc | [data] add task filter for node killer (#41099) | 10 months ago
Scott Lee | d1a1e7bbd2 | [Data] Fix timeout on `read_images_train_4_gpu_chaos` release test (#41368) | 11 months ago
Gene Der Su | a9a17aed08 | [CI][Serve] unjail serve_handle_wide_ensemble.aws (#41345) | 11 months ago
Gene Der Su | d3ac31ec80 | [CI][Serve] unjail serve_handle_long_chain.aws (#41344) | 11 months ago
Gene Der Su | 636234f94a | [Serve] fix long_running_serve.aws (#41322) | 11 months ago
Archit Kulkarni | a1a9a48e5b | [CI] [Cluster] Fix example GCP GPU/docker example cluster YAML file (#41134) | 11 months ago
Scott Lee | e9d5ac9eae | [Data] e2e multi-node train benchmark (#41034) | 11 months ago
Andrew Xue | 80018ffaf4 | [data] add data worker killer (#41112) | 11 months ago
Balaji Veeramani | 29aea3ddd6 | [Data] Add fault tolerance to remote tasks (#41084) | 11 months ago
Balaji Veeramani | 9d137f9fd4 | [Data] Add multi-node `read_images` benchmark (#40683) | 11 months ago
Balaji Veeramani | daf4f62697 | [Data] Remove AIR data bulk benchmark (#40801) | 11 months ago
Yunxuan Xiao | 1d15058488 | [2.8][Train] Fix GPT-J Deepspeed Fine-tuning Release Test (#40648) | 1 year ago