Description
I have set up the SwitchML P4 app and controller successfully and compiled the client library for the RDMA backend, but I'm having an issue similar to #8: no communication happens after worker setup. I have two workers with ICRC disabled and have tried this with both the hello_world example and the allreduce benchmark. Here is the output of GLOG_logtostderr=1 GLOG_v=2 ./allreduce_benchmark for one of the workers (the other worker produces similar output, and the behavior is the same for hello_world):
I0902 11:34:34.264950 18461 context.cc:64] Starting switchml context.
I0902 11:34:34.265347 18461 config.cc:139] Using this configuration file 'switchml.cfg'.
I0902 11:34:34.265789 18461 config.cc:216] Printing configuration
I0902 11:34:34.265799 18461 config.cc:219]
[general]
rank = 1
num_workers = 2
num_worker_threads = 4
max_outstanding_packets = 4
packet_numel = 64
backend = rdma
scheduler = fifo
prepostprocessor = cpu_exponent_quantizer
instant_job_completion = 0
controller_ip_str = 10.0.0.1
controller_port = 50099
timeout = 10000
timeout_threshold = 100
timeout_threshold_increment = 100
--(derived)--
max_outstanding_packets_per_worker_thread = 1
I0902 11:34:34.265834 18461 config.cc:270]
[backend.rdma]
msg_numel = 64
device_name = mlx5_0
device_port_id = 1
gid_index = 3
--(derived)--
num_pkts_per_msg = 1
max_outstanding_msgs = 4
max_outstanding_msgs_per_worker_thread = 1
I0902 11:34:34.265851 18461 rdma_backend.cc:42] Setting up worker.
I0902 11:34:34.266697 18461 rdma_endpoint.cc:65] Found Verbs device mlx5_0 with guid 0x98039b03008e0d50
I0902 11:34:34.266713 18461 rdma_endpoint.cc:65] Found Verbs device mlx5_1 with guid 0x98039b03008e0d51
I0902 11:34:34.266721 18461 rdma_endpoint.cc:79] Using Verbs device mlx5_0 gid index 3
I0902 11:34:34.290702 18461 rdma_endpoint.cc:116] GID 0 is 0x80fe 0x500d8efeff9b039a
I0902 11:34:34.290791 18461 rdma_endpoint.cc:116] GID 1 is 0x80fe 0x500d8efeff9b039a
I0902 11:34:34.290869 18461 rdma_endpoint.cc:116] GID 2 is 0 0x401a8c0ffff0000
I0902 11:34:34.290951 18461 rdma_endpoint.cc:116] GID 3 is 0 0x401a8c0ffff0000
I0902 11:34:39.579232 18461 context.cc:99] Switchml context started successfully.
Submitting 5 warmup jobs.
I0902 11:34:39.766185 18467 rdma_utils.h:193] Worker 0 bound to core 0 on NUMA node 0
I0902 11:34:39.766194 18469 rdma_utils.h:193] Worker 2 bound to core 2 on NUMA node 0
I0902 11:34:39.766386 18469 rdma_worker_thread.cc:129] Worker 2 QP 0:0x519 using rkey 5 for remote rkey 63210
I0902 11:34:39.766402 18467 rdma_worker_thread.cc:129] Worker 0 QP 0:0x517 using rkey 1 for remote rkey 63210
I0902 11:34:39.773722 18471 rdma_utils.h:193] Worker 3 bound to core 3 on NUMA node 0
I0902 11:34:39.773824 18471 rdma_worker_thread.cc:129] Worker 3 QP 0:0x51a using rkey 7 for remote rkey 63210
I0902 11:34:39.803228 18468 rdma_utils.h:193] Worker 1 bound to core 1 on NUMA node 0
I0902 11:34:39.803387 18468 rdma_worker_thread.cc:129] Worker 1 QP 0:0x518 using rkey 3 for remote rkey 63210
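As a sanity check on the GID table printed above, I decoded the second half of GID 3 the same way rdma_endpoint.cc seems to print it, i.e. as two raw 64-bit halves in host little-endian order (which is consistent with the fe80 link-local prefix showing up as 0x80fe for GID 0). It decodes to the IPv4-mapped GID for this worker's address (192.168.1.4), so gid_index = 3 at least points at the expected address. This is just my own sketch, not part of SwitchML:

import ipaddress
import struct

# Second 64-bit half of "GID 3 is 0 0x401a8c0ffff0000" from the log above.
gid_low_half = 0x401a8c0ffff0000

# Assumption: the log prints the two raw 64-bit halves of the GID in host
# (little-endian) byte order, so repack them to recover the 16 GID bytes.
gid_bytes = struct.pack('<Q', 0) + struct.pack('<Q', gid_low_half)

gid = ipaddress.IPv6Address(gid_bytes)
print(gid)              # ::ffff:c0a8:104
print(gid.ipv4_mapped)  # 192.168.1.4, this worker's address

Whether index 3 is the RoCE v1 or v2 entry for that address is something I believe can be confirmed via /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/3 on the worker, but I have not dug further into that yet.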
After seeing no progress, when I exit the process I get:
^CSignal 2 received, preparing to exit...
I0902 11:37:39.248157 18462 context.cc:105] Stopping switchml context
I0902 11:37:39.248188 18462 scheduler.cc:48] Waking up waiting threads
I0902 11:37:39.248227 18462 rdma_backend.cc:56] Cleaning up worker.
I0902 11:37:39.248417 18462 stats.cc:97] Stats:
Submitted jobs: #5#
Submitted jobs sizes: #[268435456,268435456,268435456,268435456,268435456,]#
Submitted jobs sizes distribution: #Sum: 1342177280 Mean: 268435456.0000 Max: 268435456 Min: 268435456 Median: 268435456 Stdev: 0.0000 #
Finished jobs: #0#
Worker thread: #0#
Total packets sent: #18#
Total packets received: #0#
Wrong packets received: #0#
Correct packets received: #0#
Number of timeouts: #17#
Worker thread: #1#
Total packets sent: #18#
Total packets received: #0#
Wrong packets received: #0#
Correct packets received: #0#
Number of timeouts: #17#
Worker thread: #2#
Total packets sent: #18#
Total packets received: #0#
Wrong packets received: #0#
Correct packets received: #0#
Number of timeouts: #17#
Worker thread: #3#
Total packets sent: #18#
Total packets received: #0#
Wrong packets received: #0#
Correct packets received: #0#
Number of timeouts: #17#
I0902 11:37:39.248509 18462 context.cc:130] Stopped switchml context
Warmup finished.
Submitting 10 jobs.
Signal handler thread is exiting
Here is the output on the controller side:
SwitchML>show_switch_address
Switch MAC: 00:11:22:33:44:55 IP: 192.168.1.100
SwitchML>show_rdma_workers
Received Sent
Worker ID Worker MAC Worker IP Packets / Bytes Packets / Bytes
0 98:03:9b:83:1a:b2 192.168.1.2 0 / 0 0 / 0
1 98:03:9b:8e:3d:ac 192.168.1.4 0 / 0 0 / 0
SwitchML>show_ports
Port Up Valid Enabled Speed FEC Tx Packets Tx Bytes Rx Packets Rx Bytes Rx Errors Tx Errors FCS Errors
1/0 1 1 1 100G NONE 219 66345 68 10710 0 0 0
2/0 1 1 1 100G NONE 181 60116 106 8959 0 0 0
3/0 1 1 1 100G NONE 265 83132 33 6105 0 0 0
4/0 1 1 1 100G NONE 178 59180 133 12990 0 0 0
SwitchML>show_statistics
Broadcasted Recirculated Retransmitted Dropped
Index Set 0 Set 1 Set 0 Set 1 Set 0 Set 1 Set 0 Set 1
0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0
And on the switch side I get:
bf-sde.pm> show
-----+----+---+----+-------+----+--+--+---+---+---+--------+----------------+----------------+-
PORT |MAC |D_P|P/PT|SPEED |FEC |AN|KR|RDY|ADM|OPR|LPBK |FRAMES RX |FRAMES TX |E
-----+----+---+----+-------+----+--+--+---+---+---+--------+----------------+----------------+-
1/0 |23/0|132|2/ 4|100G |NONE|Ds|Au|YES|ENB|UP | NONE | 68| 219|
2/0 |22/0|140|2/12|100G |NONE|Ds|Au|YES|ENB|UP | NONE | 106| 181|
3/0 |21/0|148|2/20|100G |NONE|Ds|Au|YES|ENB|UP | NONE | 33| 265|
4/0 |20/0|156|2/28|100G |NONE|Ds|Au|YES|ENB|UP | NONE | 133| 178|
My environment is:
Switch: Wedge BF100-32x
SDE: 9.9.0
Python: 3.8
NICs: ConnectX-5
My ports.yaml has:
ports:
  1/0 : {speed: "100G", fec: "none", autoneg: "disable", mac: "98:03:9b:8e:82:98"}
  2/0 : {speed: "100G", fec: "none", autoneg: "disable", mac: "98:03:9b:83:1a:b2"}
  3/0 : {speed: "100G", fec: "none", autoneg: "disable", mac: "98:03:9b:83:34:d2"}
  4/0 : {speed: "100G", fec: "none", autoneg: "disable", mac: "98:03:9b:8e:3d:ac"}
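As far as I can tell, both worker MACs that the controller lists under show_rdma_workers are present in this table (ports 2/0 and 4/0). For completeness, this is the quick cross-check I ran; it is only a sketch and assumes ports.yaml is in the current directory:

import yaml  # PyYAML

# Worker MACs as reported by show_rdma_workers above.
worker_macs = {"98:03:9b:83:1a:b2", "98:03:9b:8e:3d:ac"}

# Path assumed; adjust to wherever the controller reads ports.yaml from.
with open("ports.yaml") as f:
    ports = yaml.safe_load(f)["ports"]

configured_macs = {str(cfg["mac"]).lower() for cfg in ports.values()}
missing = {mac for mac in worker_macs if mac.lower() not in configured_macs}
print("all worker MACs have a configured port" if not missing
      else "not configured: %s" % missing)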
And finally my config file is here:
My guess is that the switch data plane is unreachable as an endpoint for some reason (although exactly one packet per worker thread is not counted as a timeout, so I'm not sure). Is there a way to verify connectivity between the workers and the switch data plane?
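The only basic check I could think of so far is something like the sketch below, run from each worker. It assumes the SwitchML P4 program answers ARP/ICMP for the address shown by show_switch_address, so a failure here might only mean ICMP isn't handled, rather than that RoCE packets to the switch are being dropped:

import subprocess

# Switch data-plane address as reported by show_switch_address above.
SWITCH_IP = "192.168.1.100"

# This only exercises ARP/ICMP handling in the data plane (if that responder
# is enabled for the configured switch address); it says nothing about whether
# RoCE traffic on UDP port 4791 is accepted, but it at least confirms basic
# L2/L3 reachability from the worker NIC.
result = subprocess.run(["ping", "-c", "3", "-W", "1", SWITCH_IP],
                        capture_output=True, text=True)
print(result.stdout)
print("switch address reachable" if result.returncode == 0
      else "no ICMP reply from the switch address")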
Thank you!