14:09:23 #startmeeting OF plugin 14:09:23 Meeting started Wed Sep 17 14:09:23 2014 UTC. The chair is abhijitkumbhare. Information about MeetBot at http://ci.openstack.org/meetbot.html. 14:09:23 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. 14:09:23 The meeting name has been set to 'of_plugin' 14:09:32 #info vishnoianil 14:10:30 #info abhijitkumbhare 14:10:44 hi michal_rehak 14:11:08 The bugs list is at: https://wiki.opendaylight.org/view/OpenDaylight_Bugs#openflowplugin 14:11:20 Under that Helium bugs 14:12:38 michal_rehak - we were on https://bugs.opendaylight.org/show_bug.cgi?id=1626 14:12:47 #info michal_rehak 14:13:08 1626 needs retest 14:15:00 OK - should we reassign it to Jan? 14:15:11 no - I need to retest that 14:15:16 just doing taht 14:15:23 OK 14:16:38 Next one is: https://bugs.opendaylight.org/show_bug.cgi?id=1907 14:18:24 oflibMichal already did some research and it probably originates in OFPLugin 14:18:54 I will have a look 14:19:18 OK 14:21:26 Next one is: https://bugs.opendaylight.org/show_bug.cgi?id=1866 14:22:50 as Hideyuki's message indicates - I am not sure why Tony has moved this to openflowplugin from controller 14:23:09 I need to discuss that with Hideyki and we may move it to lithium 14:23:24 OK 14:23:41 effect of that issue is that VTN does not see statistics 14:23:46 It may a priority for him however 14:23:51 OK 14:23:52 which is not a blocking issue 14:24:30 If not Helium - may be in the Helium stable in November 14:24:42 right 14:25:12 Next one is: https://bugs.opendaylight.org/show_bug.cgi?id=1879 14:25:44 that also need some investigation - I will take that too 14:26:24 OK 14:27:09 Next one Chris O'Shea will be updating 14:27:14 (https://bugs.opendaylight.org/show_bug.cgi?id=1395 ) 14:27:26 But this may go to Lithium 14:28:23 Next one - https://bugs.opendaylight.org/show_bug.cgi?id=1756 - there seems to be a patch for it: https://git.opendaylight.org/gerrit/#/c/11288/ 14:29:37 Needs a review 14:29:53 OK - the other topic 14:29:57 testing 14:30:44 We need to answer the following questions: 14:30:46 1) Features has been tested for your project: 14:31:03 2) Bugs found in RC1 test: 14:31:16 3) Patches: 14:31:37 4) Is your project a compatible projecf: 14:33:55 Also Sujatha has asked the following questions regarding karaf: 14:33:57 As I need to send the OF Plugin documentation for review, I thought of adding the karaf steps for the plugin installation. Would it be possible for you to answer a couple of queries? 14:33:58 1. Are these the karaf features for OF Plugin? 14:33:58 · OpenDaylight Openflow Plugin SouthBound 14:33:58 · OpenDaylight Openflow Plugin Flow Services 14:33:58 2. What is the OF Plugin repository in the Karaf shell? 14:33:58 (repo-add mvn:org.opendaylight.openflowplugin/features-openflowplugin/?-SNAPSHOT/xml/features) 14:33:58 14:36:44 Let's wrap up & continue tomorrow - about testing 14:36:51 and these questions 11:37:40 michal_rehak: just added a comment to that bug 11:37:44 I haven’t seen it 11:37:50 and in fact thought it must have been fixed :) 11:38:06 I guess we can watch to see if it happens again 11:38:15 but am fine with closing the bug for now, if you want 11:44:24 tbachman: ok 11:45:03 tbachman: talking about https://bugs.opendaylight.org/show_bug.cgi?id=1907 ? 11:45:21 yes :) 11:46:00 ok, resolved 11:46:04 thank you 11:46:29 np! 14:12:23 Hi folks 14:13:18 Sorry - a bit late 14:18:08 we have a release review coming up next week: 14:18:11 https://lists.opendaylight.org/pipermail/release/2014-September/000438.html 14:19:30 Also we have several bugs right now 14:19:43 https://bugs.opendaylight.org/buglist.cgi?bug_severity=blocker&bug_severity=critical&bug_severity=major&bug_severity=normal&bug_severity=minor&bug_severity=trivial&bug_status=UNCONFIRMED&bug_status=CONFIRMED&bug_status=IN_PROGRESS&bug_status=WAITING_FOR_REVIEW&columnlist=product%2Ccomponent%2Cassigned_to%2Cbug_status%2Cresolution%2Cshort_desc%2Cchangeddate%2Ccf_target_milestone&f1=cf_target_milestone&list_id=17007&n1=1&o1=su 14:19:43 bstring&product=openflowplugin&query_format=advanced&resolution=---&v1=Lithium 14:20:13 OK - easier to copy paste the following: https://wiki.opendaylight.org/view/OpenDaylight_Bugs#openflowplugin 14:26:09 hi vishnoianil - did you have the chance to look at the RC1? 14:29:15 https://bugs.opendaylight.org/show_bug.cgi?id=1840 should be closed - right? michal_rehak & vishnoianil 14:30:15 OK - will mark it 14:30:43 abhijitkumbhare: probably - Peter can do a workaround by changing xml content to setfield in order to have closer match 14:32:00 Yes - but OF 1.3 to OF 1.0 forward compatibility is not there - OF 1.0 to 1.3 backward compatibility is present. 14:32:41 tests run against of-1.3 14:33:14 and we wrote some backward compatibility hacks for vlan-id and set*port 14:35:56 I think we should try to ensure the critical bugs are fixed soon 14:36:15 I will not be able to make the meeting tomorrow 14:36:32 So we can continue Monday 06:51:27 morning 16:46:34 https://bugs.opendaylight.org/show_bug.cgi?id=1988 16:46:58 Hi, VTN project has detected a blocker bug in OpenFlow plugin. 16:48:11 I'm going to add more detailed information for bug 1988 in Bugzilla. 16:48:47 OK 16:49:29 Due to this bug, VTN Manager system test fails for OF 1.0 switches.... 16:51:35 OK - I think this may be another bug Michal may need to look into. vishnoianil is this something you can take a shot? 16:58:45 abhijitkumbhare: Thanks. This is a blocker bug to VTN project... 21:31:17 I like to start testing with helium RC1 and OF1.3 REST API; from which URL should I download? 21:36:49 rvdp: https://lists.opendaylight.org/pipermail/release/2014-September/000463.html 21:41:48 Using that, but I need to load features? I loaded odl-dlux-all, odl-openflowplugin-all, odl-restconf (what will be in the base release?) Where are the REST docs? 21:44:28 rvdp: fwiw, there’s this: https://wiki.opendaylight.org/view/Karaf:Step_by_Step_Guide#How_to_Test_RC0 21:45:21 found that already; searching further 21:45:24 rvdp: you can see the list of all features by doing feature:list and pipe it in to grep 21:45:42 oh 21:45:43 lol 21:45:46 that wiki page has that :) 09:14:37 michal_rehak: morning :) 09:15:01 (althought I guess it’s closer to “afternoon” for you ;) ) 09:16:01 tbachman: hi 09:16:09 close to noon 09:16:11 :-) 09:16:11 ah 09:16:52 michal_rehak: I’m going to be debugging an issue we’re having with flow programming this am 09:17:31 ok 09:17:34 I haven’t done much debugging with the openflowplugin yet — was hoping you might have some pointers? 09:18:13 well - you probably want to catch main entry points where the flow must pass 09:18:31 the path looks following: 09:19:02 after you write flow to datastore/config FRM is notified (onDataChangeEvent..) 09:19:21 so one brakpoint should be in FlowForwarder class in add method 09:19:36 k 09:19:42 then the next stop is ModelDrivenSwitchImpl.addFlow 09:20:06 k 09:20:09 this will lead you into asynchronous task in OFRpcTaskFactory 09:20:27 something inside of createAddFlowTask 09:21:02 and next stop is in openflowjava - there occurs encoding 09:21:02 okay 09:21:59 michal_rehak: if I wanted to make sure that the flow made it to the config store 09:22:03 using REST 09:22:15 starts in class SerializationFactory.messageToBuffer 09:22:16 do you have the URL example for that? 09:22:36 * tbachman can also go look it up in the yang model…. but is just waking up ;) 09:22:37 on how to push a flow into datastore/config via rest? 09:22:45 to read bcak 09:22:46 back 09:22:59 I want to see if I can see the flows in the config store 09:23:15 (e.g. compare config and operational, see if they’re different or the same) 09:23:22 http://localhost:8080/restconf/config/opendaylight-inventory:nodes/node/openflow:1/table/0 09:23:27 thx! 09:23:40 and operational is just change the “config” above to operational? 09:23:46 here you have DS/config for node openflow:1, table 0 09:23:50 right 09:23:55 k 09:23:55 thx 09:24:03 michal_rehak: got it 09:24:05 http://localhost:8080/restconf/operational/opendaylight-inventory:nodes/node/openflow:1/table/0 09:24:19 any stats avaiable via karaf console, perhaps? 09:24:32 in case you use curl then the xml output is awful - no indentation 09:24:34 or any other means? 09:24:35 lol 09:24:36 yeah 09:24:38 I use postman 09:24:45 ok - then you are fine 09:25:07 stats in karaf 09:25:36 I am not sure if there is a way in karaf to read datastore content into shell 09:25:42 okay 09:25:44 no worries :) 09:25:54 this gives me a start 09:26:01 there are internal stats of OFPlugin 09:26:23 which tell you how many messages passed important points in plugin 09:26:51 but is is started by configSubsytem config file 09:27:11 and I tried to create feature for karaf contaiing only this file, but no success 09:27:19 ah 09:27:29 so, not available via REST today :( 09:27:32 that’s okay 09:27:48 but if you increase log level for stats, it might go at least into log file 09:27:59 okay 09:28:32 09:28:50 and do not forget to start wireshark in order to see what was sent to device 09:28:58 michal_rehak: good idea! ;) 09:29:05 michal_rehak: thx for the help! 09:29:09 am going to try some of this 09:29:20 may try to ping you later in the day, if you’re around 09:29:25 sure 09:29:33 michal_rehak: thx for your help! :) 09:29:39 in 4 hours I have to leave for 2 hours 09:29:45 you are welcome 09:29:48 michal_rehak: no worries! :) 09:29:49 thx 11:11:36 michal_rehak: Could you provide tbachman with pointers on how to check in the config data store via RESTCONF to see if his flows are in the config data store? 11:11:52 michal_rehak: Oh... I see where you did above :) 11:11:56 :-) 11:11:56 :) 11:12:14 I can resend to make it more seem more important 11:12:15 * tbachman has already requested that info, and found it to be different from what’s in operational 11:12:18 lol 11:12:21 michal_rehak: ;) 11:12:37 michal_rehak: edwarnicke am collecting data 11:12:43 and will provide some info soon 11:13:06 will continue the chase down the rabbit-hole why you guys enjoy the data ;) 11:13:16 tbachman: As to FRM points to put breakpoints with the debugger, this patch may provide some hints: https://git.opendaylight.org/gerrit/#/c/11382/ 11:13:28 tbachman: I think the key questions are: 11:13:43 1) Do all the flows get to the switch (apparently not) 11:13:50 2) Do the missing flows get to the OFplugin 11:13:59 3) Do the missing flows get the FRM? 11:14:10 FRM is first to hit (from GBP)? 11:14:10 4) Do the missing flows get to the config data store? 11:14:19 (er, after data store of course ;) ) 11:14:29 edwarnicke: they get to the config data store 11:14:33 GBP -> config data store -> FRM -> OFplugin -> OFjava -> wire -> switch 11:14:39 tbachman: Excellent 11:14:49 But the key point is... where is the chain breaking 11:14:54 exactly :) 11:14:55 Because that's certainly where the bug is :) 11:14:58 lol 11:15:06 will get there 11:15:13 * tbachman is the tortoise 11:15:29 Do look at 11382, as it will show you where to debug to find out if you are getting to the FRM and if the FRM is calling the OFplugin 11:15:36 edwarnicke: thx! 11:15:43 Since that's the next place in the chain :) 11:15:51 is it okay that it failed build? 11:16:11 I guess it’s just a pointer to the code 11:16:17 (for breakpoint purposes) 15:59:30 oflibMichal: Lets wait a moment for michal_rehak 16:00:32 sure, meanwhile going through the 1997 comments 16:00:56 michal_rehak: Welcome :) 16:01:07 michal_rehak: Do you need a moment to go through the 1997 comments? 16:01:23 michal_rehak: oflibMichal is reading the 1997 comments now 16:01:54 edwarnicke: reading 16:07:20 oflibMichal: michal_rehak Take your time, but let me know when you are finished and if you have any questions... then I will tell you the latest finding from tbachman 16:07:49 sure 16:08:05 i am ready, I am implementing some logs in openflowjava to debug easier 16:08:50 oflibMichal: Many thanks :) 16:09:00 michal_rehak: Let me know when you are ready :) 16:17:04 ok, I guess we can 16:18:35 edwarnicke: when i want to build integration karaf distribution ... i need to cd distributions/extra/karaf mvn clean install ? 16:19:02 oflibMichal: Correct 16:19:07 ok 16:20:11 edwarnicke: michal_rehak: building the distribution according to the instructions here https://lists.opendaylight.org/pipermail/ovsdb-dev/2014-September/000720.html 16:23:28 oflibMichal: That is a good set of reproduction instructions for ovsdb, yes :) 16:24:44 tbachman: Wecome 16:24:58 tbachman: Can you explain your latest finding about Groups for the GBP case? 16:25:10 edwarnicke: back :) 16:25:29 we have found that all the flowmods are present on the wire 16:25:51 tbachman: And... 16:26:00 * tbachman is good with dramatic pauses 16:26:02 lol 16:26:11 the are two vSwitches 16:26:18 one gets all the flows, no problems 16:26:29 the other gets all the flows, but there are also two error messages 16:26:37 there are also error messages in the vSwitchd logs 16:26:54 the key will be understanding why those errors are happening 16:27:36 unfortunately, I’ve got a learning curve for interpreting wireshark :) 16:27:37 lol 16:27:56 edwarnicke: you had a theory that the groups might be referring to things that haven’t been created yet? 16:28:02 (i.e. sequencing problem) 16:28:04 also 16:28:08 one thing to keep in mind 16:28:14 this all *used to work* 16:28:20 w/o any change in GBP code 16:28:44 tbachman: Could you pastebin the error messages? 16:28:53 the vswitchd error message? 16:29:11 tbachman: We have made significant performance gains... so race conditions are racier 16:29:22 edwarnicke: I see 16:29:31 tbachman: The error message on the wire (and the vswitchd error too :) ) 16:29:37 k 16:30:39 http://pastebin.com/jzpRGKYn 16:30:43 fwiw 16:30:49 for those on the email that I sent 16:31:00 there is a text file version of the dissected OF messages 16:31:11 so, it should be available in there (with context) as well 16:31:27 oflibMichal1: Note that OVSDB is *not* using Groups (I checked) and so it wouldn't explain their issue 16:31:59 tbachman: What should we search for? 16:32:06 ERROR :) 16:32:09 all caps 16:32:26 * tbachman isn’t sure what edwarnicke is asking? 16:32:36 LOL 16:32:43 When you said 'with context'I thought you'd sent more stuff :) 16:33:20 tbachman: Instead you did exactly what I would have wanted :) 16:33:33 tbachman: And I would have realized that if I had looked closer :) 16:33:48 tbachman: Can you look at your capture with a filter roughly like 16:35:20 oh - do we have groups involved? 16:35:34 this looks like corresponding group is not ready 16:35:54 michal_rehak: (6:31:30 PM) edwarnicke: oflibMichal1: Note that OVSDB is *not* using Groups (I checked) and so it wouldn't explain their issue 16:36:07 (openflow_v4.type == 14 || openflow_v4.type == 1 ||openflow_v4.type == 15 ) 16:36:26 edwarnicke: will do 16:36:41 tbachman: And then see if you can add a filter to filter openflow_v4.type == 14 && tbachman: The idea is to see the sequence of groupmod and flow mod that produced the error 16:37:19 michal_rehak: Yes... for GBP... but not for ovsdb, they don't use them 16:39:29 oflibMichal1: michal_rehak: let me know if you’re interested in the pcapng, and will email 16:40:22 sure, that could be helpful 16:44:19 michal_rehak: sent :) 16:45:59 edwarnicke: michal_rehak oflibMichal1 : updated: http://pastebin.com/U4j328Ww 16:46:58 tbachman: Do you see any GroupMods ? 16:47:55 edwarnicke: with transaction ID 6? 16:49:03 edwarnicke: how do I start karaf in -debug mode ? 16:49:54 Is GroupMod an operation? 16:50:17 oflibMichal1: export KARAF_DEBUG=1 16:50:20 * tbachman realizes he should know, but hasn’t reahed that level of understanding of the OF side fo things yet 16:50:31 * tbachman goes to pull up OF1.3 spec 16:50:45 openflow_v4.type == 15 matches groupmods 16:50:54 openflow_v4.type == 14 matches flowmods 16:51:03 openflow_v4.type == 1 matches errors 16:51:04 So 16:51:12 (openflow_v4.type == 14 || openflow_v4.type == 1 ||openflow_v4.type == 15 ) gives you those things 16:51:28 I am curious if we see GroupMods before or after the FlowMod that produces the errors 16:51:38 There is one 16:51:48 but 16:51:51 to the other switch 16:53:23 so, that’s confusing 16:53:36 the # of flows is what we expect 16:53:42 but we only see one GROUP_MOD 16:53:48 to the one switch that works 16:54:00 could a GROUP_MOD be accidentally sent as a FLOW_MOD? 16:55:18 tbachman: there many featuresReply indicating that there are 1.0 and 1.3 switches 16:55:34 and datapathId is often the same 16:55:47 michal_rehak: I should explain 16:55:51 the vswitches start out 10 16:56:00 and we do an ovs-vsctl to change them to 1.3 16:56:04 could you point us to number where the session started 16:56:05 ? 16:56:28 oh 16:56:30 michal_rehak: unfortunately, I don’t have something that can sequence it that well 16:56:32 I should say 16:56:36 that I started the wireshark 16:56:43 and then ran our python script 16:56:46 I meant the left column in wireshark 16:56:47 which does the ovs-vsctl 16:57:26 oh 16:59:46 ok and one switch is talking from IP 192.168.194.137 and the other one from 192.168.194.138? 17:00:04 michal_rehak: right 17:00:10 and the controller is 192.168.194.137 17:00:19 so, one vSwitch is on the same host as the controller 17:00:52 ok 17:02:08 michal_rehak: I have some old traces 17:02:14 one of which has the “good” case 17:02:18 I want to compare it to the bad one 17:02:35 oflibMichal1: Let us know when you have your debugs in place :) 17:02:40 so both switches were first changed from 1.0 to 1.3 17:02:49 and then started all the business? 17:02:54 michal_rehak: that’s correct 17:03:39 guys 17:03:42 the weird thing is 17:03:54 I only see a single group flow even in the “good” case 17:03:55 guys 17:04:00 I have to dissappear for about 45 17:04:01 :( 17:04:08 edwarnicke: already debugging 17:05:18 ok, there is group sent only to one switch 17:11:11 looks like we lost groupMod message 1 sec after featuresReply from the 137 switch 17:11:41 oflibMichal1: Be aware that the issue is intermittent, and additional logs have made it harder to reproduce 17:12:23 sure, I do the eclipse debug 17:13:00 now when thomas mentioned that all flow are installed and there are errors it looks like conversion / translation issue 17:13:11 oflibMichal1: Oh? 17:13:40 we have a bug in XID assigning - all flowMod messages use the same xid as it was in featureReply 17:13:47 tat should be fixed now 17:13:55 michal_rehak: OK 17:14:02 but it makes hard to pair error to flowMod 17:14:20 edwarnicke: thomas mentioned that they do not use groups ... correct ? 17:15:13 oflibMichal1: I asked ovsdb if they used groups, they said no, and I also grepped their code and did not find any reference to Group 17:15:22 oh, ok 17:15:37 so than the flowMod looks incorrect ... 17:15:38 Action            Type: OFPAT_GROUP (22)            Length: 8            Group ID: 1 17:16:46 there should be no GROUP action ... 17:16:53 or am i missing something ? 17:17:20 how do I check the table pipline - mentioned in the instructions? 17:19:18 oflibMichal1: https://www.opennetworking.org/images/stories/downloads/sdn-resources/onf-specifications/openflow/openflow-spec-v1.3.0.pdf page 18 at the bottom 17:20:53 michal_rehak: table pipeline ? 17:21:10 trying to replicate 17:21:21 after this: ovs-vsctl set-manager tcp::6640 17:21:35 I should see those flows somehow 17:28:22 michal_rehak: Oh... you are asking how to dump their flows? 17:28:51 I obviously failed to start something 17:28:53 michal_rehak: sudo ovs-ofctl dump-flows br-int -O OpenFlow13 17:28:56 let me retry 17:28:59 ack 17:41:46 i need to leave ... i will try to reconnect later, probably in 45 minutes 17:44:02 oflibMichal1: Thank you, see you in about 45 minutes :) 17:47:25 alagalah: is there a fast way to clean and start replicating again? 17:47:35 https://lists.opendaylight.org/pipermail/ovsdb-dev/2014-September/000720.html 17:47:53 I used sudo ovs-vsctrl emer-reset 17:50:32 michal_rehak: You mean to reset ovs? 17:50:37 michal_rehak: sudo mn -c 17:50:44 yes 17:50:46 ok 17:50:53 sudo ovs-vsctl show (see stuff) 17:50:56 sudo mn -c 17:51:00 sudo ovs-vsctl show (clean) 17:53:11 alagalah: I understand you can pick up the wireshark sequencing stuff 17:53:16 alagalah: http://pastebin.com/U4j328Ww was sent by tbachman 17:53:20 It shows the error 17:53:44 (openflow_v4.type == 14 && || openflow_v4.type == 1 ||openflow_v4.type == 15 ) 17:54:03 will give you GroupMods (15) Errors(1) and FlowMods(14) with the offending transaction id 17:56:00 edwarnicke: Ack 17:56:23 edwarnicke: I take it i should ditch my Priority filters then, which is fine. 17:56:39 alagalah: They will always be there for you if you need them :) 17:56:54 (which is my way of saying... I could be wrong, but probably :) ) 17:57:23 edwarnicke: LOLZ, yes, I know... filtering is non-destructive :D 18:08:58 edwarnicke: Applied filter.... interesting results 18:09:06 alagalah: Do tell :) 18:09:44 Firstly we have to start with the assumption from our tests that OVS-s1 has "all the flows" (ie 57) and OVS-s2 does not (55). 18:09:45 BUT 18:10:10 When I apply the filter, (and I tried with xid==6 and without to get both switches) ... I only see errors on s1.... 18:10:41 Still digging, but I want to ensure that thomas has his system setup up for .137 as s1 and .138 as s2... (which I am almost positive it is) but ... hmmm. 18:11:15 I am just going to filter on the Type1 (Errors) and have a dig 18:17:03 4142 18:17:03 4146 18:17:03 4176 18:17:05 edwarnicke: michal_rehak Some interesting things again from the trace... packets 18:17:08 doh 18:17:29 jsut missing groupMod to one of the switches 18:17:29 edwarnicke: michal_rehak Some interesting things again from the trace... packets 4142, 4146, 4176 all have a OFPAT_GROUP (the only ones) as an action 18:17:43 4142, 4146 are proceeded by the error 18:17:46 4176 is not 18:18:02 But its likely that those are the mysterious two flows we are missing michal_rehak to your point 18:18:08 alagalah: Do we get a GroupMod between 4142 and 4176 ? 18:18:29 edwarnicke: That was next on my list to explain why 4176 doesnt get an error 18:18:42 * alagalah not much of an OF debugging maven so I don't know how to filter on GroupMod 18:19:36 * alagalah furiously RTFM'ing 18:21:57 openflow_v4.action.group.group_id only returns 4142, 4146, 4176 and a bunch from s2 18:22:44 alagalah: One moment 18:22:56 openflow_v4.type == 15 matches groupmods 18:23:05 openflow_v4.type == 14 matches flowmods 18:23:12 openflow_v4.type == 1 matches errors 18:23:52 edwarnicke: Thanks, I was running down the: openflow_v4.group_features.actions.all rabbit-hole 18:24:37 GroupMod from .137 (s1?) to .138 (s2?) @PAK: 3657 18:24:51 michal_rehak: Should we have a GroupMod on s1-s1 ? 18:24:57 (note: S1 is also the controller) 18:25:12 So .137 - .138 is really controller - s2 (in my tiny mind) 18:25:58 I believe so as there is reference to that group in failing flows 18:26:09 and that is what the error message told 18:26:33 michal_rehak: Ok, let me see where the GroupMod is built in our code... might take some time as I am not that familiar with it, but I'll dig 18:27:28 ok, great 18:28:15 and I am sorry (missing something obvious) but do we have instructions on how to replicate the scenario which was captured by the pcap session of wireshark? 18:28:32 I mean change from 1.0 to 1.3 and all the many flows with group? 18:28:52 or is this something magic what need some third party app? 18:30:15 michal_rehak: neither unfortunately mate 18:30:59 michal_rehak: We don't know the trigger other than "it happened a few weeks ago we think" but we have made strides but now see this problem (2 flows, tied to a missing GroupMod).... note it is intermittent (Heisenbug) 18:31:39 michal_rehak: I can run our Proof of Concept 5 times in a row with full success, then the 6 it fails... sometimes if fails continuously... no rhyme or reason for the trigger (so suspect a race condition or some such) 18:32:04 alagalah: I think the 'didn't happen a week ago' may just be related to performance bugs being fixed and thus race conditions getting racier :) 18:32:04 michal_rehak: but if it was a race condition, I would still expect us to be sending the GroupMod (out of sequence) but I don't 18:32:07 michal_rehak: Its weird 18:32:29 edwarnicke: I concur... other issues we have faced have been resolved but they have brought this issue to light 18:33:00 tbachman: ping 18:33:07 pong 18:33:12 sorry guys 18:33:17 tbachman: So you should be! 18:33:19 are you guys talking about https://bugs.opendaylight.org/show_bug.cgi?id=1997 ? 18:33:20 * alagalah kidding 18:33:24 abhijitkumbhare: ack 18:33:30 OK 18:34:10 need to dispose of 2 disturbances, be back in 30 mins 18:34:19 lol 18:34:30 * alagalah Is expecting michal_rehak to knock on his door with a shotgun to eliminate one disturbance 18:34:35 lol 18:37:00 * tbachman is recreating his env :( 18:37:13 mac crash 18:37:48 folks 18:37:53 I want to recreate the “good” case 18:38:04 and verify that we see the group mod to both switches 18:39:31 tbachman: That sounds like an excellent plan 18:39:44 abhijitkumbhare: Yep... welcome to #opendaylight-Bug1997 :) 18:39:48 (this channel) 18:39:49 If I don’t, will be worried 18:39:52 what’s weird 18:39:56 is the # of flows is the same 18:39:57 abhijitkumbhare: here's what we think we know 18:40:09 (on the wire) 18:40:09 abhijitkumbhare: Both OVSDB and GBP have found the following to be true: 18:40:17 a) Their flows are getting to the config datastore 18:40:32 b) They are *sometimes* (not always, it can be hard to reproduce) missing a couple of flows on the switch 18:40:47 abhijitkumbhare: GBP has debugged further and determined that 18:40:53 c) All of their flows are getting to OFplugin 18:41:15 d) OFplugin stats show the same number of AddFlows coming in from MD-SAL as going out to OFjava (in the failure case) 18:41:31 Well... #a-d are also true when things work :) 18:41:57 abhijitkumbhare: OVSDB has not debugged that far, so we can only surmise its true for them too 18:42:11 abhijitkumbhare: So tbachman and alagalah have been looking at the wire with wireshark, to see what's going on 18:42:44 abhijitkumbhare: Current theory is that we may be seeing an ordering issue where a FlowMod that references a group arrives before the GroupMod that creates it 18:42:52 abhijitkumbhare: This is because we see the Error for that situation 18:42:59 abhijitkumbhare: alagalah and tbachman are digging more deeply into that 18:43:06 abhijitkumbhare: We are pretty sure that is not the issue for OVSDB 18:43:24 abhijitkumbhare: Because they claim to not be using groups, and a grep through their code fro Group reveals nothing 18:43:35 abhijitkumbhare: So michal_rehak and oflibMichal are working on debugging the OVSDB case 18:43:40 abhijitkumbhare: Does all of that make sense? 18:46:57 tbachman: Say more 18:47:03 * edwarnicke was just running off at the mouth ;) 18:47:08 lol 18:47:09 sorry 18:47:11 just ran a test 18:47:14 seems to be working 18:47:15 so 18:47:17 am looking at flows 18:47:19 Yes 18:47:23 and 18:47:24 I see it 18:47:27 the GROUP_MOD flow 18:47:31 Flows on OVS or FlowMods on the wire? 18:47:33 edwarnicke: your summary is correct 18:47:35 wireshark 18:47:39 Cool 18:47:40 let me verify that I see *both* 18:47:54 edwarnicke: abhijitkumbhare I would also add that between CONFIG and OFPlugin we have not only verified the count of flows but also the flowIDs 18:48:16 * alagalah point of order=temp pumpkinate in 45min 18:48:31 * alagalah Have to meet a friend 18:48:33 alagalah: Point of pumpkination ? 18:48:54 alagalah: Which was a lovely bit of work I might add :) 18:49:05 edwarnicke: What was? 18:49:35 edwarnicke: That 15min python hack ??? I found it to be profoundly ugly LOL ... could do a lot better with difflib.SequenceMatcher but more tooling for another time. 18:51:38 * tbachman is thoroughly confused 18:52:01 edwarnicke: abhijitkumbhare: oflibMichal: michal_rehak: is it possible to piggy-back a group flow onto another message? 18:52:08 b/c 18:52:09 again 18:52:16 I only see *one* group flow 18:52:20 this time to the other switch 18:52:27 (i.e. the one that wasn’t getting one before) 18:53:56 wiat 18:54:01 I had the wrong filter setting 18:54:03 it’s ther 18:54:05 there 18:54:06 to both 18:54:10 tbachman: You may find that there are multiple OFmessages in a wireshark summary line 18:54:12 Use 18:54:18 openflow_v4.type == 15 18:54:21 that’s what I used 18:54:26 interestingly 18:54:28 in this trace 18:54:31 there are 3 :) 18:54:34 That was what I would recommend 18:54:39 tbachman: Intresting 18:54:43 two to one switch 18:54:58 back-to-back 18:57:30 edwarnicke: it may be splitting it into two messages? 18:57:49 the message to the first switch has a length of 436 18:57:58 the two messages to the other switch are each of size 204 18:58:33 tbachman: edwarnicke abhijitkumbhare michal_rehak Its important to note that trace tbachman is referring to is a WORKING example of our POC. The one I referred to above is a FAILING example 18:58:57 alagalah: Thank you for being clear there :) 19:00:30 tbachman: So BAD: No GroupMod for s1, Single GroupMod for s2. GOOD: One GroupMod for s1, Two GroupMods for S2 (potentially we could be seeing OFPack/TCP fragmentation) 19:01:22 fwiw 19:01:30 in one of the bad cases 19:01:35 I see a message of only 364 19:02:31 and in another failure case, it’s only 340 19:03:23 tbachman: Its going to be tough to use length without quite a bit of work..... I kind of started down this road to compare traces (potentially) 19:03:43 tbachman: The issue is for each pak you have a fixed+variable length so a) you can't add the lengths for each trace 19:04:05 b) you could take the lengths for each case, minus (#paks * fixed header length) and normalise both counts 19:04:30 All that is telling you is you are sending the same information in both, but perhaps then the order is wrong 19:04:35 alagalah: ack — but the point is that for a single message, it looks like we are sending 436 19:04:37 but 19:04:44 we’re seeing 340, 364 19:04:45 Easier is just to count manually in the good case Openflow13 parts of the TCP message 19:05:45 * tbachman wonders if this could be an issue of how group flows may be split into multiple messages 19:06:21 abhijitkumbhare: oflibMichal michal_rehak: do any of you know how multiple group flows may be sent (i.e. one or more messages)? 19:06:26 Could we have an issue 19:06:44 where we split across a boundary we shouldn't? 19:06:55 and if we get the ERROR 19:07:02 maybe we don’t complete those? 19:14:59 michal_rehak: Thoughts on tbachman's comments? 19:15:26 * alagalah I'm an idiot... forgot about kiddy nighty night time 19:15:59 * alagalah Sorry folks... I have to pumpkinate... fwiw been on a hangout w tbachman in parallel discussing ideas as its quicker... 19:16:41 so the question is ... do we split group mod message into more group mod messages ? 19:16:43 Thanks for the summary edwarnicke (still going thru the other's comments on/beyond it) - will also relay/playback the conversation to vishnoianil if he is interested (over google chat) 19:17:11 oflibMichal: yeah — just trying to think of why these would look different 19:17:26 since we send the same flows every time (or at least we think we do) 19:17:39 I guess in general 19:17:48 how do you decide how many flows to send in a message? 19:18:14 Is it asynchronous to what’s happening in the data store (e.g. timer or threshold based, and send once either is reached)? 19:18:30 Or does each data store notification result in an OF message? 19:18:54 (where a data store notifcation may have multiple flow) 19:19:01 i believe that there is currently no mechanismi that would allow for message split 19:19:08 oflibMichal: okay 19:19:18 so — it’s tied to the asynch notify 19:19:20 so basically each rpc or notification will be transfered into of message 19:19:29 oflibMichal: got it 19:19:30 so 19:19:39 interesting that one switch gets two messages, and the other gets one 19:19:41 and in the failing case 19:19:54 we only see a message to one of the switches 19:20:05 there is one exception I know, but that is programatically done, - based on the need to install two flows 19:20:16 not because of the length 19:20:18 and it’s bigger than the normal messages (364 or 340, vs. 204) 19:20:30 and that was the vlan id discussion with hideyuki 19:20:33 from vtn 19:20:54 oh — slight correction on my statement 19:21:01 er, nm 19:21:02 :) 19:21:07 * tbachman managed to confuse himseelf 19:21:09 ;) 19:21:14 :-) 19:21:43 tbachman: the number of flows you see in a message depends on the TCP stack 19:21:54 tbachman: Its frame size and tcp window dependent 19:22:03 tbachman: Wireshark is just dissecting the frames 19:22:12 edwarnicke: TCP will send things in chunks, yes, but there’s a notion of OF message boundaries, no? 19:22:20 edwarnicke: ah 19:22:22 good point 19:22:25 so 19:22:33 even though there are two entries in the dissector 19:22:39 those are truly Ethernet packets 19:22:41 tbachman: Exactly 19:22:48 tbachman: yep 19:22:50 got it 19:22:51 k 19:22:55 so 19:23:00 so 19:23:04 I need to put these together 19:23:09 stitch them 19:23:09 tbachman: And a good TCP stack will pack in as much as it can 19:23:14 right :) 19:23:26 In the bad case 19:23:27 oflibMichal: How goes the debugging ? 19:24:14 * tbachman sees a useful feature to add to wireshark — OF message stitching 19:24:48 edwarnicke: I was talking with Michal Rehak on how to get the ovsdb usecase running 19:25:02 because I traced some wrong flow mods 19:25:10 but than he needed to go off 19:25:14 for a while 19:25:35 * vishnoianil is away: Break time, will be back soon 19:26:08 tbachman: You can use follow tcp stream :) 19:26:15 tbachman: Not *quite* what you want though 19:26:16 edwarnicke: :) 19:26:18 lol 19:26:18 tbachman: But yes :)( 19:26:19 so 19:26:23 there’s quite a bit of time 19:26:30 from when that first segment is sent to that switch 19:26:38 and when we get back the error from the vSwitch 19:27:20 like 3 seconds 19:27:33 tbachman: OF is allowed to be tardy :( 19:27:40 that sounds like a fairly long time for a window on a local IP stack 19:28:00 tbachman: OF is allowed to be more tardy than the IP stack :( 19:28:07 lol 19:28:22 hmmm 19:28:23 actually 19:28:25 the point is 19:28:36 that the switch throwing the error is the one that didn’t get the message 19:28:43 * tbachman forgets these things 19:29:38 edwarnicke it probably takes more work to be tardy (and respond tardy) than respond immediately :) 19:29:39 fwiw, in the working case, these messages are sent to each switch 19:29:42 in succession 19:30:39 * tbachman fears that we’re going to have to go through the logs to see the actual flows as they progress from datastore to Ofplugin to Oflowjava 19:30:59 tbachman: I don't think you will have to 19:31:06 19:31:06 :) 19:31:08 tbachman: We just need to know if the issue is succession on the wire 19:31:19 ? 19:31:42 tbachman: If its an issue of succession on the wire 19:31:45 the # of flows is also confusing 19:31:57 b/c they are the same in the pass/fail case 19:32:01 tbachman: Then we just have to look at how you guys are enforcing order on your flow programming 19:32:06 but we don’t see the group flow in the fail case 19:32:17 it’s like the group flow was replaced with something else 19:32:18 tbachman: But in the fail case you have the error, right? 19:32:29 tbachman: OK... *that* could be interesting 19:32:43 * tbachman is just theorizing here 19:32:46 tbachman: We may need to trace that and see if perhaps the GroupFlow is getting lost 19:32:54 tbachman: Let me see what tools we have for that 19:33:09 edwarnicke: it wouldn’t have to be just lost — it would have to be replaced 19:33:14 that’s the only way the counts could add up 19:34:37 tbachman: But a GroupMod is not a FlowMod, right? 19:34:54 tbachman: You have OFplugin stats for both working and failing , right? 19:34:55 edwarnicke: so, would that show up in ovs-ofctl? 19:35:05 edwarnicke: I don’t have working yet 19:35:09 but I can collect those 19:35:12 just a sec... 19:35:28 tbachman: I thought you collected OFplugin stats to verify OFplugin wasn't loosing flows? 19:35:35 Oh 19:35:37 read to fast 19:35:41 You only have stats for failing 19:35:44 correct 19:35:50 you’re talking about jconsole stuff, right? 19:38:38 here’s the jconsole from working/non-working: http://pastebin.com/6JtgCfJz 19:39:21 edwarnicke: would that be “AddGroupInput”? 19:39:55 If so, we have +4/+4 on ENQUEUED and SUBMITTED 19:40:12 Yep 19:40:26 tbachman: And how many GroupMods do you see on the wire? 19:40:38 * tbachman goes back to check 19:41:00 edwarnicke: you’re talking about the failure case, right? 19:43:11 edwarnicke: there are two in the failing case - but am worried about the length there 19:45:52 edwarnicke: abhijitkumbhare oflibMichal michal_rehak: this is looking more suspicious 19:46:23 tbachman: do you have wireshark captures with groupmods for both successful and failure case ? 19:46:29 reading channel - I am currently 30 mins ago 19:46:32 oflibMichal: yes 19:46:41 michal_rehak: distractions asleep? ;) 19:46:46 can you please send me those ? 19:46:51 oflibMichal: sure 19:46:56 on michal.polkorab@pantheon.sk ? 19:46:59 thanks 19:47:06 np 19:48:28 http://pastebin.com/uzaNtMX6 19:48:37 so, in the failure case 19:48:58 we see a single message with 2 group flow mods, with total length of 340 19:49:24 hmmmm 19:49:29 that may be a legitimate one 19:49:29 but 19:49:38 we never see the group flow mod for the other switch 19:50:16 (I say the one of length 340 may be legitimate, b/c that switch appears to work — I guess it’s conceiveable that the two switches are getting slightly different group flow entries, and therefore the 340 may be okay?) 19:50:51 again — fwiw, in the working case, the group flow messages are sent two both switches back-to-back-to-back 19:51:01 (i.e. no intervening flow mod messages) 19:51:04 not sure if that’s important 19:51:15 but, from the perspective of what we’d expect to see 19:52:16 oflibMichal: did you receive it? 19:52:22 (wireshark traces) 19:52:35 yes i did 19:52:41 cool :) 19:54:35 ok, so the lengths look ok 19:54:46 every group mod message is 136 bytes long 19:55:00 oflibMichal: in both the good and bad cases? 19:55:06 yes 19:55:29 please see the length inside the group mod message 19:55:30 * tbachman wonders why in the good case, a single switch gets a message that’s over 400 bytes in length, but in the bad case, the other switch gets a single message of 340 bytes in length 19:55:36 not in the packet header 19:55:41 oflibMichal: okay 19:55:46 good point 19:55:54 but — also good to make sure these line up :) 19:55:54 i can see in good case >>> 19:56:15 tbachman: distractions partially asleep 19:56:21 michal_rehak: lol 19:56:39 hmmm 19:56:44 I wonder how clever the OF dissector is 19:56:52 b/c we could have another packet 19:56:59 that contains the rest of the message 19:56:59 wireshark is known to miss some packages 19:57:48 one packet - size 436 with 2*136 (group mods) +96 (one flow mod) + 68 (packet header size) 19:57:54 tbachman: michal_rehak has a lot more recent experience with wireshark and OF than I do :) 19:57:59 lol 19:58:01 yes 19:58:19 * tbachman is a total dunce in this (and many other) domains 19:58:35 edwarnicke: I have 2 week positive experience with wireshark-1.12.0 19:58:43 michal_rehak: LOL 19:58:47 * tbachman is using 1.12.1 19:59:00 tbachman: You are now the most modern of us all :) 19:59:05 lol 19:59:13 oflibMichal: How are you doing on reproducing the ovsdb case? 19:59:46 tbachman: I shall bow in front of you 19:59:52 lol 19:59:58 have you guys been drinking? 20:00:00 :p 20:00:11 tbachman: I started on Pellegrino in the wee hours of the morning :0 20:00:15 lol 20:00:15 * edwarnicke has always been drinking :) 20:00:22 hydration is a good thing 20:00:33 that’s a centuries old bio-hack ;) 20:03:25 oflibMichal: so, I just realized what you were saying with one of the above messages 20:03:34 there are 2 group mods with one flow mod 20:03:39 interesting 20:04:13 if a group mod fails 20:04:16 and there’s a flow mod with it 20:04:27 then the flowMod fails tooo 20:04:28 does that mean all will fail? 20:04:31 ah 20:04:48 that explains why I can see all the flow messages 20:04:55 but we’re missing a flow in OVS 20:05:01 b/c it got packaged with a group mod 20:05:12 that failed 20:05:34 * tbachman is probably catching up to “start”, whereas michal_rehak and oflibMichal are almost at the finish 20:05:36 ovs can ignore your flow in case there is already another one installed with the same match rules and priority 20:05:52 in the same table 20:07:19 tbachman: back to failing flowMod - it fails when relying on group and corresponding groupMod failed 20:08:07 michal_rehak: since I excell at asking dumb questions, can I ask another? 20:08:18 I would be pleased 20:08:25 does ovs-ofctl dump-flows show just flow mods, or are group flows covered in that as well? 20:09:23 tbachman: that is funny question for ovs does not support groups now 20:09:36 * tbachman is alarmed 20:09:43 really? 20:09:55 * tbachman wonders why readams put those in there 20:10:47 there were some issues with kernel modules since some older version 20:10:56 I guess the thing I’m trying to undertand 20:11:03 for dumping groups there is dump-groups 20:11:09 is that when we do ovs-ofctl dump-flows -O OpenFlow13 20:11:16 and pipe that into wc -l 20:11:25 we get 57 (if you exclude the banner/header) 20:11:32 we expect each switch to have 57 flows 20:11:35 but 20:11:41 I wasn’t sure if that covers group flows as well 20:12:28 in the failing case, we’re short 2 flows 20:12:42 which could be the 2 group flows 20:15:52 groups are listed separate via ovs-ofctl dump-groups 20:15:57 regarding help 20:16:52 I think I am finally able to make the numbers almost add up in my head now :) 20:16:53 thx 20:18:05 tbachman: I still remember the point where I realized all of quantum field theory and statmech was mostly counting... and I was *really* bad at counting ;) 20:18:25 lol 20:18:39 edwarnicke: while you are applying quantum theory 20:18:49 I’m still trying to apply how to code in java ;) 20:18:50 lol 20:19:57 so, this adds up: alagalah@odlgbp1:~/logdata/wireshark$ egrep 'OFPT_GROUP_MOD \(15\)' wiresharkdata-working.txt | wc -l 20:19:58 4 20:19:59 alagalah@odlgbp1:~/logdata/wireshark$ egrep 'OFPT_GROUP_MOD \(15\)' wiresharkdata.txt | wc -l 20:20:00 2 20:20:02 yuck 20:20:03 sorry folks 20:20:05 * tbachman goes back to pastebin 20:21:57 michal_rehak: Thank you for the patch for Bug 1988. 20:22:15 hideyuki: hi! have you tested it? 20:22:35 michal_rehak: I've built the Karaf edition using the latest code of Integration Git repository, and I'm testing bug 1988. 20:22:42 this is what I should have done: http://pastebin.com/Hh8aRHM0 20:22:43 hideyuki: perfect 20:22:46 michal_rehak: But, it seems to me that the bug is not fixed. 20:23:03 hideyuki: :-( 20:23:22 michal_rehak: The result has been not changed. 20:23:28 tbachman: OK... and the OFplugin stats in both cases? 20:23:34 tbachman: For GroupMods? 20:23:46 they were 4 and 4 in both cases 20:23:54 * tbachman scrolls up to retrieve pastebin 20:24:23 http://pastebin.com/6JtgCfJz 20:25:11 * tbachman presumes these are the entries of interest: 20:25:12 TO_SWITCH_ENQUEUED_SUCCESS: MSG[AddGroupInput] -> +4 | 4 20:25:12 oflibMichal: How are you doing on getting debugs going? 20:25:20 TO_SWITCH_SUBMITTED_SUCCESS: MSG[AddGroupInput] -> +4 | 4 20:25:36 oflibMichal: Because it looks like we might be loosing GroupMods south of OFplugin as well 20:26:27 I must miss something 20:26:42 not able to reproduce by these instructions 20:26:43 https://lists.opendaylight.org/pipermail/ovsdb-dev/2014-September/000720.html 20:26:55 no flow mods are send 20:28:04 I was wrong - ovs-2.1.3 supports groups 20:28:24 * tbachman is using 2.1.2 20:29:16 oflibMichal: did you see his 3a/b/c? 20:29:30 tbachman: I misread changelog 20:31:17 yes i did 20:31:22 oflibMichal: Can you bounce over to #opendaylight-ovsdb with me? 20:31:24 but let me recheck 20:32:06 edwarnicke: been trying to summon Madhu_offline, but he may be out 20:32:25 tbachman: the amounts in pastebin are weird - 4groups+114flows = OK and 2groups+116flows is WRONG 20:32:41 michal_rehak: exactly ;) 20:33:05 like there would be some fallback in case without groups add 2 more flows 20:33:24 and device shows how many flows? 20:33:40 * tbachman wonders if it’s possible that the group flows were accidentally sent as flow mods 20:33:51 it shows 55 (out of 57) flows 20:34:16 b/c ovs-ofctl doesn’t appear to show group flows 20:34:25 two switches 20:34:29 one switch gets all 57 20:34:32 the other 55 20:34:42 (or I should say *programs* 55) 20:35:41 so good case is 57 per each switch - 20:35:43 . 20:35:47 114 20:35:51 right 20:36:05 is it possible that the missing 2 groups are turned into flow mods? 20:36:17 unlikely 20:36:47 michal_rehak: I’ll see if I can figure out what the 2 extra flow mods are 20:37:11 that would be great 20:37:27 unless oflibMichal beats me to it ;) 20:37:28 lol 20:37:39 tortoise v. hare 20:37:45 except am going to trip the hare ;) 20:37:46 lol 20:41:18 tbachman: I am slightly missing context but it sounds adventurous 20:41:26 lol 20:42:01 tbachman: ok, i was able to reproduce the ovsdb usecase, but only once now 20:42:08 oflibMichal: nice! 20:42:09 how can i restart the mininet 20:42:36 do i need to shut it down and restart completely ? 20:42:49 cat reset.sh 20:42:50 #!/bin/bash 20:42:50 echo -e '\e[33;1mClean MANAGER:\e[0m' 20:42:50 echo -e '\e[32;1mSee current status - manager.\e[0m' 20:42:50 sudo ovs-vsctl get-manager 20:42:50 echo -e '\e[32;1mClean manager.\e[0m' 20:42:50 sudo ovs-vsctl del-manager 20:42:51 echo -e '\e[32;1mSee current status after clean - manager.\e[0m' 20:42:51 sudo ovs-vsctl get-manager 20:42:53 echo 20:42:53 echo -e '\e[33;1mClean CONTROLLER:\e[0m' 20:42:53 echo -e '\e[32;1mSee current status - ctrl.\e[0m' 20:42:53 sudo ovs-vsctl show 20:42:55 echo -e '\e[32;1mClean controller.\e[0m' 20:44:01 oflibMichal: You can rest mininet with 20:44:03 sudo mn -c 20:44:44 http://pastebin.com/JFUU2NaJ 20:45:14 edwarnicke: will try 20:46:47 michal_rehak: I’m not confident in the 116 number 20:47:33 b/c I used wireshark to filter on just openflow_v4.type == 14 20:47:38 and dumped that to a text file 20:47:42 and greped for OFPT_FLOW_MOD (14) 20:47:44 and that is 114 20:47:50 both good and bad cases 20:47:56 sounds promising 20:48:20 now we are searching for 2 missing groupModes only 20:48:45 ah 20:48:59 I think I figured out where the extra count of 2 came from 20:49:03 the error messages back 20:49:21 michal_rehak: correct 20:52:18 oflibMichal: Did you see steps 3a-c: https://lists.opendaylight.org/pipermail/ovsdb-dev/2014-September/000720.html 20:52:27 yes i did 20:52:29 edwarnicke: he was able to reproduce it 20:53:22 tbachman did you say ovs-ofctl doesn't show groups? 20:53:34 djx: maybe that’s not the right way to say that 20:53:45 we’re sending group mods to OVS 2.1.2 20:53:53 running the above doesn’t seem to show them 20:53:53 but 20:53:55 that may just be 20:54:01 b/c OVS 2.1.2 doesn’t support groups 20:54:09 yes it doe 20:54:10 * tbachman isn’t versed enough yet to know 20:54:12 k 20:54:13 since 2.1.0 20:54:21 djx: is there a separate command to dump those? 20:54:28 sudo ovs-ofctl dump-groups --protocols=OpenFlow13 20:54:32 ah 20:54:33 thx :) 20:54:46 and don't forget the switch 20:54:47 * tbachman apparrently is too lazy to do “man ovs-ofct" 20:54:52 no problemo 20:54:53 the protocols? 20:55:14 sudo ovs-ofctl dump-groups --protocols=OpenFlow13 s1 20:55:18 the s1 part 20:55:19 oflibMichal: Datapoint... when I follow the directions at https://lists.opendaylight.org/pipermail/ovsdb-dev/2014-September/000720.html 20:55:20 ah 20:55:24 oflibMichal: I am missing flow: 20:55:28 cookie=0x0, duration=16.196s, table=0, n_packets=0, n_bytes=0, send_flow_rem dl_type=0x88cc actions=CONTROLLER:65535 20:56:31 I missed the one with table:20 20:58:24 tbachman if you want to update your stuff https://wiki.opendaylight.org/view/CrossProject:Integration_Group:Create_System_Test_Environment 20:58:47 djx: awesome — thx! : 20:58:48 :) 20:59:21 * tbachman will wait to download, as he’s currently tethering, and doesn’t want to kill his monthly GB limit ;) 20:59:32 (test VM) 21:00:00 * djx feels sorry for tbachman GB limit 21:00:05 lol 21:05:09 tbachman: is this ok? 21:05:10 org.opendaylight.ovsdb.openstack.net-virt-providers - 1.0.0.SNAPSHOT | Unable to spot Bridge Identifier for br-ex in OVS|10.0.42.16:39032 21:05:32 2014-09-21 23:03:35,115 | ERROR | otification-1342 | OF13Provider | 340 - org.opendaylight.ovsdb.openstack.net-virt-providers 21:05:42 one sec.... 21:15:38 sorry 21:15:38 back 21:15:51 br-ex shouldn’t be critical 21:16:12 this is just to test out the OVSDB failure, right? 21:16:49 tbachman: I just spotted this in karaf logs 21:17:44 it’s probably fine 21:18:16 I can double check, but I think that’s only needed in order to set up external traffic 21:18:28 they try to install some flows on it if it’s there 21:18:46 * tbachman notes that it has been quite some time since he’s poked around OVSDB code 21:21:05 tbachman: ok, I am exhausted for today - sorry that we did not move much further 21:21:14 michal_rehak: understood 21:21:22 almost 12:30 am there, right? 21:21:42 23:21 21:21:48 ah 21:21:54 one hr off 21:22:06 michal_rehak: thx for your help! 21:22:15 michal_rehak you guys are shifted from the timezone? 21:22:20 we can go on tomorrow 21:22:33 michal_rehak: I will be up at 5am EST 21:22:39 if that’s early enough 21:22:43 11am your time? 21:22:49 djx: no, standard CEST: GMT+2 during summer 21:23:03 djx: late for you, too! ;) 21:23:37 here is only 22:23 but I thought michal had 2 hours more than me 21:24:32 michal_rehak: What is the status of hideyuki 's bug 1988 ? 21:24:37 michal_rehak: one other thing 21:24:53 If you want to try to recreate the GBP test setup 21:24:54 https://wiki.opendaylight.org/view/Group_Policy:Building_and_Running 21:24:56 edwarnicke: need to replicate 21:25:14 * tbachman realizes that he needs to update this to reflect running using karaf 21:25:15 michal_rehak: Where you able to get enough info from hideyuki to replicate it tomorrow morning? 21:25:20 oflibMichal: same: https://wiki.opendaylight.org/view/Group_Policy:Building_and_Running 21:25:43 edwarnicke: nothing new - I will try with OF-1.0 21:27:19 michal_rehak: ACK 21:27:25 oflibMichal: How are you doing? 21:30:09 edwarnicke: nothing new 21:30:17 i am like deleted 21:30:39 oflibMichal: like kill -9 deleted? 21:31:24 oflibMichal: any other info I can provide? 21:31:36 i guess we have all the info 21:31:51 just need to create som statistics inside ofjava 21:32:06 oflibMichal: okay 21:32:07 I may head out for a bit 21:32:08 but i am affraid that's for tomorrow 21:32:21 oflibMichal: sounds good 21:33:56 need to go off 21:34:02 okay 21:34:08 oflibMichal: Pick up again tomorrow morning? 21:34:10 will see you in the a.m. 21:34:13 sorry that I couldn't help more :-( 21:34:19 oflibMichal: we’ll get there! 21:34:28 edwarnicke: definitely 21:37:19 Hi all, I tested sveral times on Bug 1988, and confirmed that bug 1988 is not fixed. 21:37:23 https://bugs.opendaylight.org/show_bug.cgi?id=1988#c8 21:37:50 It seems to me that the result of the issue is not changed. 21:40:36 hideyuki: Michal is going to try to recreate again tomorrow morning with OF10 21:40:47 hideyuki: Could you get up early and get on IRC so we can collaborate on it? 21:44:40 OK. 21:44:58 edwarnicke: Thank you! 08:24:15 hello guys 09:40:38 mbobak: hi :) 09:41:21 oflibMichal: michal_rehak hey guys :_ 09:41:23 :) 09:48:45 * vishnoianil is away: Break time, will be back soon 09:50:19 Morning edwarnicke michal_rehak 09:55:20 tbachman: hi there 09:55:32 oflibMichal: hi! :) 09:55:50 anything new ? 09:55:54 lol 09:55:59 was going to ask the same :) 09:56:04 :-) 09:56:07 did you need more information? 09:56:17 not yet 09:56:18 I can add logs, etc. 09:56:18 k 09:56:29 i have implemented some logs 09:56:40 ah, k 09:57:06 I guess a key question is test env 09:57:08 and it shows that the number of flows that entered ofjava is the same as is written to the switch 09:57:15 in wrong case scenario 09:57:15 oflibMichal: right 09:57:18 right 09:57:25 morning oflibMichal 09:57:32 i wasn't able to reproduce successful scenario this morning 09:57:35 oflibMichal: keep in mind: we actually see all the flow mod messages on the wires 09:57:41 alagalah: hello 09:57:51 oflibMichal: ah — you set up the GBP demo/POC? 09:58:26 * alagalah has already taken note we need to update POC instructions for Karaf 09:58:28 tbachman: i tried with ovsdb instructions so far 09:58:29 just to clarify — it’s just the group mod message to the one switch that’s missing 09:58:34 oflibMichal: ah, I see 09:58:41 those *may* be different issues 09:58:45 will try with the gbp demo now 09:58:55 the GBP one seems to be a missing group message 09:59:10 i will follow these instructions 09:59:10 https://wiki.opendaylight.org/view/Group_Policy:Building_and_Running 09:59:14 oflibMichal: If its helpful I have the POC as OVAs exported 09:59:21 (dropbox) 10:00:25 oflibMichal: the other possibility is that I could run something for you 10:00:41 (i.e. either tell or send me the changes, I can build and run) 10:01:12 hmm, I will try to run the demo first 10:01:18 okay 10:01:24 you need two VMs, fwiw 10:01:50 b/c it uses two vSwitches 10:02:10 ou, ok 10:02:37 so we might first want to try your code modification 10:02:43 oflibMichal: okay 10:02:52 just let me know what to change/add 10:03:14 you could also push a “DO NOT MERGE” gerrit, that I could fetch 10:03:28 whatever works 10:03:34 ok, that would be the best option i guess 10:03:50 tbachman: Let me know if you want me to do the same... 10:03:59 alagalah: thx! 10:04:03 let me try it out first 10:04:21 I have all the build stuff now on a separate VM 10:04:52 * vishnoianil is back (gone 00:16:07) 10:04:52 oflibMichal: we’ve added in some debug messages to the openflowplugin, etc. 10:04:59 tbachman: Cool! We just need to iterate a lot over this since its a bit of a heisenbug.... when you say "build stuff" how is that different from my setup which is Ed's integrated logging ? 10:05:02 do you want/need any of those? 10:05:17 alagalah: did you build that integration? 10:05:24 Or just download it from ed? 10:05:28 tbachman: Just downloaded 10:05:52 I guess we could just build from that 10:06:04 I believe that they might be helpful 10:06:05 * tbachman goes off to check the size of that 10:06:12 *those log from openfowplugin 10:06:32 tbachman: oflibMichal Those are just Karaf debug levels though, right? 10:06:54 alagalah: we added some patches to the openflowplugin, I believe 10:07:13 alagalah: tbachman can you provide me your gerrit emails ? 10:07:15 alagalah: I think it’s in that gdoc for bug tracking 10:07:18 tbachman: Ack, I recall now, and to the controller, but that was prior to Ed sending a pre-built zip 10:07:19 i pushed a dfraft 10:07:27 oflibMichal: thx! 10:07:41 oflibMichal: alagalah@gmail.com for me 10:09:07 tbachman: and what is your gerrit email please ? 10:09:09 oflibMichal: tbachman@yahoo.com to add me 10:09:11 thx 10:09:30 ok, added both 10:09:36 oflibMichal: ty! 10:09:48 it just logs when any flowmod or group mod message is received in openflowjava 10:10:04 and also when these messages are about to be serialized 10:10:31 when i run the ovsdb demo ... i also used only the Error log level and only on openflowjava 10:10:56 oflibMichal: so…. the messages that Ed added were for debug level 10:11:00 do you want those? 10:11:14 we can try running with them 10:11:25 okay 10:11:25 and if we are unable to reproduce the issue 10:11:27 actually 10:11:34 can I do this first with your traces 10:11:34 we might use only openflowjava logs 10:11:43 morning 10:11:46 sure, it is up to you 10:11:50 b/c I still have to pull in and build the changes for openflowplugin 10:11:57 if we need it, I’ll add those as a second iteration 10:11:59 oflibMichal: thx 10:12:00 ok 10:12:05 michal_rehak: afternoon ;) 10:12:46 Hey michal_rehak 10:15:16 * vishnoianil is away: Break time, will be back soon 10:20:44 oflibMichal: building my setup 10:20:57 give me a bit 10:30:40 tbachman: I got an idea - how do you guys setup 2 mininets? 10:30:51 * tbachman listens in 10:30:59 michal_rehak: we set them up using a python script 10:31:01 tbachman: is there a parameter for OVS to force dpid? 10:31:12 michal_rehak: not sure 10:31:17 * tbachman goes off to consult google 10:31:30 michal_rehak: We use python scripts to setup everything, the mininet, the policies, and then REST calls into controller 10:37:05 tbachman: could you point me to those scripts 10:38:02 https://wiki.opendaylight.org/view/Group_Policy:Building_and_Running 10:38:08 michal_rehak: actually, wait 10:38:19 let me just give you a pointer in gitweb 10:38:53 tbachman: because having 2 mininets with same dpid will render into session fights 10:39:09 * tbachman is confused 10:39:13 and disconnect previous device +reconnect 10:39:13 you want the same dpid? 10:39:24 michal_rehak: I think they have different ones already? 10:39:43 no - one controller can not handle 2 nodes with same dpid 10:39:46 or, at least that’s how they’re showing up int the controller 10:39:58 i.e. we see openflow:1 and openflow:2 10:40:34 michal_rehak: https://git.opendaylight.org/gerrit/gitweb?p=groupbasedpolicy.git;a=tree;f=util/testOfOverlay;h=ee972ca7dbc6a62247ac9340bde62c60ab5c865c;hb=HEAD 10:48:35 oflibMichal: okay, so I ran it and got the failure case 10:48:43 there are some ERROR messages from openflowjava in the logs 10:49:04 sure, there should be some 10:49:14 i did implement it intentionally as error 10:49:18 oflibMichal: okay 10:49:19 messages 10:49:25 do you want me to email you the log? 10:49:29 you should se two types 10:49:38 tbachman: please do 10:49:42 okay 10:51:50 sent 10:51:51 tbachman: log received ... thx 10:51:54 np 10:51:59 thank YOU :) 10:54:49 oflibMichal: let me know you’d like the logs from openflowplugin 11:15:22 oflibMichal: michal_rehak FYI, we have distinct DPIDs for each switch 11:18:07 alagalah@odlgbp1:~$ sudo ovs-ofctl show s1 -O OpenFlow13 11:18:08 OFPT_FEATURES_REPLY (OF1.3) (xid=0x2): dpid:0000000000000001 11:18:08 alagalah@odlgbp2:~$ sudo ovs-ofctl show s2 -O OpenFlow13 11:18:09 [sudo] password for alagalah: 11:18:11 OFPT_FEATURES_REPLY (OF1.3) (xid=0x2): dpid:0000000000000002 11:28:38 tbachman: ok, so the log you sent me, this is a failure case ? 11:28:46 oflibMichal: yes 11:28:54 because i can see 114 flowmods 11:28:59 oflibMichal: okay 11:29:03 and just 2 groupmods 11:29:08 interesting 11:29:13 not 4 groupmods? 11:29:18 can you combine with the openflowplugin statistics ? 11:29:29 oflibMichal: the jconsole ones? 11:29:32 no, only 2 groupmods in logs 11:30:12 hmmm ... not sure 11:30:17 those where 11:30:30 we can see the addgroupinput logs 11:30:36 from message spy 11:30:55 those that count messages that were sent into ofjava from ofplugin 11:31:10 oflibMichal: pastebin coming up 11:31:18 great 11:32:11 oflibMichal: http://pastebin.com/wQPhBEkz 11:32:32 ok ... TO_SWITCH_SUBMITTED_SUCCESS: MSG[AddGroupInput] -> +2 | 2 11:32:39 oflibMichal: this is interessting 11:32:40 in this case 11:32:41 we only see 2 11:32:42 which is different from the cases before 11:32:52 * tbachman is even more confused 11:33:04 TO_SWITCH_ENQUEUED_SUCCESS: MSG[AddGroupInput] -> +4 | 4 11:33:11 so 4 came in 11:33:13 and 2 went out 11:33:24 we have a roach motel ;) 11:33:25 lol 11:33:39 * tbachman would understand if oflibMichal doesn’t follow that one — joke in the u.s. 11:33:44 (commercial) 11:33:47 (in u.s.) 11:33:58 no i don't :-( 11:34:00 lol 11:34:01 so 11:34:09 do those counters say what I think? 11:34:20 i.e. that 2 came in to openflowjava 11:34:23 sorry 11:34:26 4 came in to openflowjava 11:34:36 and 2 left openflowjava 11:34:41 no 11:34:54 (1:33:04 PM) tbachman: TO_SWITCH_ENQUEUED_SUCCESS: MSG[AddGroupInput] -> +4 | 4 11:35:01 where do you have this log from ? 11:35:08 that pastebin 11:35:13 because the pastebin you shared contains only 2 group mods 11:35:23 oflibMichal: edwarnicke said there were two parts to this 11:35:25 the enqueued 11:35:26 oh i see 11:35:28 and the submitted 11:35:30 my fault 11:35:31 np 11:35:38 so this means 11:35:44 oflibMichal: fwiw 11:35:45 before 11:35:50 that 4 groupmods were enqueued in openflowplugin 11:35:55 but only 2 reached ofjava 11:35:59 okay 11:36:00 fwiw 11:36:03 before 11:36:06 we did see 4 and 4 11:36:15 (should be in that tarball I sent you before) 11:36:26 yes, agree 11:36:31 oflibMichal: I can re-run 11:36:36 and make sure that’s still the case 11:36:50 tbachman: agree, it might tell us more 11:36:54 I probably should include a local version of the openflowplugin with those logs 11:37:02 to make sure we have apples-apples 11:37:09 true 11:37:22 okay 11:37:24 please send me the new log when you have it 11:37:24 I’ll do that 11:49:36 oflibMichal: sorry this is taking a bit 11:49:40 openflowplugin build for first time 11:49:42 :) 11:49:53 in karaf tests now 11:50:55 tbachman: that's ok ... no worries 11:51:15 oflibMichal: there were some prints in the controller 11:51:18 for FRM 11:51:22 is it okay if we don’t have those? 11:51:26 I can add them 11:51:29 but will be a bit longer 11:51:47 (seems like we’re okay with FRM - it’s ofplugin and ofjava that we’re focusing on?) 11:52:44 well 11:52:47 up to you 11:52:49 k 11:52:52 let’s start w/o it 11:52:55 we can try without frm 11:53:06 and I’ll build those while we’re testing w/o it 11:53:14 ok 11:59:16 oflibMichal: What do these logs mean (I see them in the case where I get missing flows for OVSDB): 11:59:18 https://www.irccloud.com/pastebin/tviasVNL 11:59:28 I have one such log message for every missing flow 12:00:41 it reports missing BarrierReply message for BarrierRequest send 12:00:52 oflibMichal: Could that cause flow programming failures? 12:01:00 oflibMichal: And how long are we waiting for them? 12:03:45 edwarnicke: good morning :) 12:05:00 edwarnicke: I’m setting up my own build environment, so that you don’t have to be my build proxy :) 12:05:10 * tbachman should have thought of that before 12:05:12 tbachman: Excellent :) 12:05:39 the only piece I haven’t included is controller, but will do that next 12:05:44 then I should have the same stuff 12:05:58 (controller, openflowplugin, openflowjava, groupbasedpolicy, integration) 12:07:04 edwarnicke: don't know, need to check if there can be a barrier request inside a barrier request 12:07:22 ACK 12:08:10 edwarnicke: and can you see these logs in successful scenario ? 12:10:27 morning edwarnicke 12:13:47 edwarnicke: Upon receipt, the switch must finish processing all previously-received messages, including sending corresponding reply or error messages, before executing any messages beyond the Barrier Request. 12:13:59 sending a lot of barriers might cause us problems 12:14:07 timeouts and so on 12:15:20 tbachman: how goes the testing ? 12:15:35 oflibMichal: I got it to happen 12:15:36 but 12:15:41 just verifying that I have the logs 12:15:48 ok, great 12:16:26 okay 12:16:49 I think I have them 12:17:38 oflibMichal: will send the logs on 12:17:44 I think you only need karaf.log.1 12:17:46 fwiw 12:18:24 yes, and also the ofplugin (message spy) statistics 12:18:29 to verify 12:19:11 oflibMichal: will get those now 12:20:52 oflibMichal: jconsole stats: http://pastebin.com/M1KkT0GR 12:21:09 oflibMichal: so, this case is different 12:21:10 4 came in 12:21:12 4 went out 12:21:17 oflibMichal: tbachman Just tried the ovsdb case, and got flows going missing in OFplugin: 12:21:27 https://www.irccloud.com/pastebin/dR38nlxX 12:21:28 edwarnicke: that’s good! 12:21:32 reproduceable 12:21:33 and isolated 12:21:42 TO_SWITCH_ENQUEUED_SUCCESS: MSG[AddFlowInput] -> +11 | 11 12:21:43 oflibMichal: not sure what to make of the 2/4 on the last run 12:21:48 TO_SWITCH_SUBMITTED_SUCCESS: MSG[AddFlowInput] -> +9 | 9 12:22:02 other than the fact that it wasn’t the same as what we ran before (i.e. openflowplugin wasn’t built locally) 12:22:03 tbachman: l also have wireshark 12:22:07 edwarnicke: nice! 12:22:08 tbachman: oflibMichal investigating 12:22:11 is the message on the wire? 12:22:12 ah, h 12:22:13 k 12:22:25 My gut is telling me this has something to do with barrier 12:22:48 oflibMichal: michal_rehak how are we handling barriers? 12:24:21 tbachman: there are two attachments in your email 12:24:30 oflibMichal: do karaf.log.1 12:24:32 I think it’s in there 12:24:35 there’s log rotation 12:24:41 what is the difference between karaf.log and .1 ? 12:24:55 log messages are probably in karaf.log.1 12:25:04 when it exceeds 1MB in size 12:25:12 it saves the current log as karaf.log. 12:25:16 where increments 12:25:22 and restarts with a clean karaf.log 12:25:28 so, karaf.log.1 is the oldest log messages 12:25:29 because i can see 639 flows entered ofjava in log *.1 12:25:31 karaf.log is new 12:25:38 oflibMichal: wow :) 12:25:54 ok, so i shall check only the karaf.log ... correct ? 12:26:08 ah — I think your messages only showe up in karaf.log.1? 12:26:25 oflibMichal: fwiw, these log messages are all from the same run 12:26:33 you could cat them into a single big file if you prefer :) 12:34:59 oflibMichal: http://pastebin.com/NkR6Rxrc 12:35:08 that log shows your message 12:38:41 tbachman: any chance that you have wireshark capture ? 12:38:51 oflibMichal: I didn’t get it that time 12:38:52 but 12:38:55 I can get one if you want 12:39:08 oflibMichal: I can probably tell you whether it’s the same error or not though 12:39:11 (i.e no group flow) 12:40:11 2014-09-22T10:17:28.675Z|00031|connmgr|INFO|s1<->tcp:192.168.194.137:6653: sending OFPBAC_BAD_OUT_GROUP error reply to OFPT_FLOW_MOD message 12:40:12 2014-09-22T10:17:28.697Z|00032|connmgr|INFO|s1<->tcp:192.168.194.137:6653: sending OFPBAC_BAD_OUT_GROUP error reply to OFPT_FLOW_MOD message 12:40:22 those are the ovs-vswitchd logs 12:40:33 this is what we see when the group flows aren’t sent 12:40:34 so 12:40:43 even though we don’t have the wireshark traces 12:40:51 this looks exactly like what we’ve seen before 12:40:57 hmm, ok - so the statistics probably don't work correctly 12:41:21 oflibMichal: I will do this again and capture the wire with wireshark 12:41:24 that way we’ll know for sure 12:41:34 because your log shows that there was 4 groupmods enqueued and 4 submitted 12:41:42 right 12:41:44 but in the karaf.log.1 12:41:55 there are only 2 ofjava logs 12:41:57 yeah 12:42:01 on group mod message received 12:42:02 so, that is certainly true 12:42:15 and is consistent with what we’ve seen on the failure case 12:43:45 so those groups mods ale probably lost somewhere inside ofplugin - somewhere between counting and sending ... 12:44:01 okay 12:44:12 at least i undestand it this way 12:44:12 oflibMichal: do you still want a run with the wireshark captures? 12:44:38 hmm, give me one second 12:44:43 need to check one more thing 12:45:01 k 12:46:44 hmm, I believe that the wireshark capture is not needed at the moment 12:47:05 ofjava clearly states that it is encoding only two group mods in last run 12:47:07 oflibMichal: okay 12:58:14 tbachman: Do you see all of your groupmods in the config data store? 12:58:19 tbachman: In the failed case? 12:58:31 tbachman: Or do you see from the OFplugin stats that you are loosing them in the plugin? 12:58:39 edwarnicke: since they show up in openflowplugin, I’m guessing they’re there 12:58:48 4 in, 4 out 12:58:55 but I can check 12:59:00 tbachman: I thought you saw a 4 in 2 out at one point 12:59:07 edwarnicke: not sure about that case 12:59:08 b/c 12:59:16 I didn’t have a locally built openflowplugin 12:59:24 I wasn’t sure if it was truly apples-apples 12:59:29 tbachman: No worries 12:59:40 tbachman: Just keep checking the stats each run and keep an eye out 12:59:45 I also (unfortunately) don’t have that setup running any more 12:59:49 edwarnicke: fair enough 12:59:54 maybe that’s what I need to do here 12:59:56 run a few times 13:00:01 collect more data 13:00:32 tbachman: I'd stick with oflibMichal 13:00:40 tbachman: Just suggesting when you do do a run, check stats 13:00:47 k 13:09:58 Back 13:10:13 (sorry, need to head into office in 20min, but will be on line at top of the hour) 13:13:11 edwarnicke: I just ran 4 times, and they all failed the same way (4 in, 4 out) 13:13:21 tbachman: Good to know :) 13:13:28 oflibMichal: Any clues from your side? 13:13:35 okay — going to move on to other things 13:13:42 oflibMichal: anything I can provide or help with? 13:13:48 I’ll still be around 13:13:52 just focusing on other things 13:14:34 tbachman: nothing that comes into my mind, we have seen that the count are different - so i am trying to investigate this issue 13:14:42 oflibMichal: okay 13:14:50 and michal_rehak will try to debug the barrier issue 13:14:59 if there’s something that you’d like me to try, just push an amendment to that gerrit, and I’ll build and retest 13:15:20 oflibMichal: Is the situation that the stats say you are getting 4 groupMods but the logs only show 2 coming in ? 13:15:44 edwarnicke: I believe that’s the case 13:16:02 we only see his log message in there twice 13:16:06 but stats show 4 in and 4 out 13:16:14 true 13:18:01 oflibMichal: Any idea how to debug further? 13:18:37 well, i am investigating stats collection 13:18:46 when the future is set and so on 13:19:03 if there is a place where the group mod is overwritten or so 13:19:31 HA! I found the step we are loosing flows for ovsdb :) 13:19:42 :) 13:19:47 edwarnicke: congrats! 13:19:53 * tbachman claps approvingly for edwarnicke 13:20:11 edwarnicke: wow, where's the catch ? 13:21:21 We get 11 flows that come into OFRpcTaskFactory line 195, and only 7 that go out at line 206 13:21:26 Its the barrier handling 13:22:05 oflibMichal: michal_rehak Do we do barrrier handling for Groups 13:22:06 ? 13:22:14 tbachman: Do you guys set barrier for your groups? 13:22:58 edwarnicke: by default almost all rpc do handle barrier 13:23:03 .. if modeled 13:23:32 edwarnicke: will check 13:23:35 more importantly 13:23:56 would this change in behavior be the result of just performance improvements in the plugin? 13:24:05 or was there a change in barrier handling recently 13:24:08 tbachman: Possibly 13:24:30 tbachman: So if we are just being impatient about barriers, that could explain it 13:24:37 michal_rehak: are you around? 13:25:01 edwarnicke: yes 13:25:50 edwarnicke: there is max timeout while waiting for barrier response 13:25:57 michal_rehak: I have found were we are loosing OVSDB's flows for 1997 13:26:05 michal_rehak: Where is it set? 13:26:10 1sec 13:26:33 edwarnicke: hardcoded in ModelDrivenSwitchImpl, line 120 13:26:58 michal_rehak: So what do we do if we don't see the barrier reponse in time? 13:27:01 this is common singleton object for all rpc (similar to lightweight pattern) 13:27:16 wait 13:27:19 by hard-coded 13:27:27 does that mean it ignores what the user passed in? 13:27:34 the we get rpcResponse containing exception and success=false 13:27:56 rpc wont continue - no flowMod/groupMod sent 13:28:11 edwarnicke: it looks like we do .setBarrier(false) on all of ours 13:28:19 no, user can specify timeout 13:28:31 and rpc contains of barrier + flowMod 13:28:33 tbachman: Both flows and groups ? 13:28:41 edwarnicke: I believe all of them 13:28:44 those are rpc towards ofjava and both async 13:28:56 michal_rehak: But what happens if we don't see a barrier response in time? 13:29:06 so we chain 2 futures and user can specify timeout for the chain product 13:29:42 then rpc returns failure wrapped inside of RpcResult and does not continue with the message 13:30:10 michal_rehak: OK... so if the user requested a barrier 13:30:14 michal_rehak: We send a barrier 13:30:20 michal_rehak: If we don't see a barrier response in time 13:30:29 michal_rehak: We don't send the subsequent message, correct? 13:30:58 yes 13:31:08 there is 1 sec for the barrier response 13:31:13 OK... so does that mean a flowMod with a barrier could wait as much as a second and then fail? 13:31:23 so even if user specified longer timeout 13:31:30 right 13:31:44 but then we should not see it in stats as leaving ofplugin 13:31:47 michal_rehak: Question... why are we sending barriers before instead of after? 13:32:04 michal_rehak: For the ovsdb case I see in the stats 11 in and 7 out 13:32:21 good question 13:32:41 michal_rehak: How hard would it be to send it after instead of before? 13:32:46 probably better approach is to send it afterwards and add plain barrier to API 13:32:47 * edwarnicke is just exploring possibilities 13:33:12 quite simple 13:33:44 Collection barrierErrors = OFRpcTaskUtil.manageBarrier(getTaskContext(), getInput().isBarrier(), getCookie()); 13:33:45 if (!barrierErrors.isEmpty()) { 13:33:45 OFRpcTaskUtil.wrapBarrierErrors(((SettableFuture>) result), barrierErrors); 13:33:45 } else { ... message } return result; 13:34:06 this pattern is all over the OFRpcTaskFactory 13:34:57 when you said 'quite simple' did you mean it would be 'quite simple' to have the barrier message sent after the flowMod instead of before? 13:37:15 yes 13:37:36 just need to attach right notification content 13:37:43 michal_rehak: Could you prepare a patch that does so? 13:41:56 edwarnicke: you mean today? 13:42:12 Yes... I mean ASAP... how long would it take to do? 13:48:45 michal_rehak: There? 13:49:33 hmm 13:49:35 1 hour 13:50:30 michal_rehak: Excellent :) 13:50:33 michal_rehak: One other thing 13:50:49 michal_rehak: I am curious what you are planning to do with missing barrier requests? 13:51:09 michal_rehak: In terms of error handling 13:51:16 michal_rehak: For the 'after' case 13:54:00 good question 13:54:29 I guess there we wont put hardcoded timeout on the future 13:55:32 michal_rehak: So just send it after and then forget about it? 13:55:42 michal_rehak: Or something else? 13:58:48 the future would never return success 13:59:07 which does not expose the real state on device 13:59:39 but the future here is not mentioned to report truly state on device 14:00:01 so if app puts timeout then there would be exception on app side 14:00:16 hi folks - has anybody seen this before? https://gist.github.com/054171d0b26a6e580497 14:03:36 dave_tucker: hi dave, I haven't seen this log before, but it means that there is no deserializer for error message registered 14:03:49 *for experimenter error message 14:03:57 msgType: 65535 experimenterID: 8992 14:04:08 could it be ONFBIC_DUP_INSTRUCTION? seeing that sent from ovs -> controller 14:04:14 just making sure it's nothing serious 14:05:51 msgType is 0xFFFF, experimenterID is 0x2320? 14:06:06 0x23 is # 14:06:08 0x20 is space 14:06:10 ASCII 14:06:19 maybe not relevant 14:06:36 dave_tucker: That looks a lot to me like the switch sent an experimenter message we know nothing about 14:07:00 ok i'll treat it as an error from ovs right now. will dig down on root cause 14:07:34 edwarnicke: that's the case 14:12:24 * vishnoianil is back (gone 01:18:27) 14:12:55 Hi folks 14:14:25 any progress on https://bugs.opendaylight.org/show_bug.cgi?id=1997 ? 14:16:40 michal_rehak: I have verified that the barrier thing is what is causing 1997... ETA on a clean patch? 14:16:58 Great edwarnicke 14:17:06 abhijitkumbhare: I have isolated the source of the failure for 1997... I have a 'machete' fix that I am using to test with... michal_rehak Is doing the 'clean' patch 14:17:20 abhijitkumbhare: Last I heard from him he was saying ETA of about an hour 14:17:26 Great! 14:18:16 abhijitkumbhare: But... the downside is that GBPs issue is *not* 1997 14:18:24 abhijitkumbhare: So we still have that showstopper to fix 14:18:25 mbobak also have been working 1985 14:18:29 abhijitkumbhare: But oflibMichal is on it :) 14:18:44 No mbobak is on it 14:18:54 that is also GBP raised 14:19:11 abhijitkumbhare: Which one? 14:19:37 mbobak , vishnoianil were on https://bugs.opendaylight.org/show_bug.cgi?id=1985 14:19:52 abhijitkumbhare: This is not 1985 14:19:55 they had found the solution to it during the night 14:20:00 abhijitkumbhare: Is a groupMod getting lost in OFplugin 14:20:13 OK - let me check 14:22:06 Is there a bug number for it? I don't see it 14:22:17 tbachman: alagalah Could you file a bug for your group issue? 14:22:33 abhijitkumbhare: We only just figured out this morning it was different from 1997 14:22:37 edwarnicke: sure 14:22:42 OK 14:22:45 * tbachman didn’t realize that he hadn’t already ;) 14:23:08 OK edwarnicke & tbachman 14:27:12 mbobak - how goes it on https://bugs.opendaylight.org/show_bug.cgi?id=1985 ? 14:28:30 abhijitkumbhare: edwarnicke: oflibMichal: https://bugs.opendaylight.org/show_bug.cgi?id=2023 14:28:49 good 14:28:55 thx tbachman 14:29:02 abhijitkumbhare: np! 14:29:17 is oflibMichal already working on it? 14:29:40 abhijitkumbhare: yes :) 14:29:45 great 14:30:02 abhijitkumbhare: yes, trying 14:30:10 thx oflibMichal 14:31:04 meeting now 14:36:28 tbachman: Do you have your bug? 14:36:41 edwarnicke: I didn’t enter it into the spreadsheet 14:36:49 edwarnicke: https://bugs.opendaylight.org/show_bug.cgi?id=2023 14:36:52 I can add it if you want 14:37:06 hmmm 14:37:07 or not 14:37:10 can’t seem to edit 14:37:20 privileges, maybe 14:38:36 entered 14:38:46 * edwarnicke is full of magical super powers 14:43:51 tbachman: back 14:44:07 alagalah: :) 15:23:30 michal_rehak: question for you 15:23:35 we filed this bug: https://bugs.opendaylight.org/show_bug.cgi?id=1984 15:23:44 ttkacik addressed one point 15:24:02 but then he mentions “First exception sees that RegLoadNotifGroupDescStatsUpdatedCase was used 15:24:02 in place where it is not allowed by model.” 15:24:21 but I didn’t see RegLoadNotifGroupDescStatsUpdatedCase used anywhere in our code 15:24:30 could this be something that’s within openflowplugin? 15:26:16 tbachman: this is from nicira extensions 15:26:59 michal_rehak: so, this should be in our project? Or in openflowplugin? 15:27:29 b/c I greped for the above and didn’t see it 15:27:34 but found it in openflowplugin 15:27:44 tbachman: in principle this should be somewhere outside ODL, but technically it is in OFPlugin repo 15:28:08 okay 15:28:13 those classes should be used when special nicira match/action is used in flow 15:28:16 just trying to find out what to file the bug against 15:28:25 otherwise not 15:31:07 tbachman: bug-1984 looks like you pushed nicira contaminated flow to device somehow and then statistics are trying to read that but corresponding classes are not loaded (missing nicira bundles in controller) 15:31:52 ah 15:32:05 is there a separate bundle for nicira statistics? 15:33:20 michal_rehak: How is 1997 going? 15:33:38 the barriers? 15:33:44 Yes 15:33:58 got distracted, but 15:34:08 michal_rehak: Ah... please make that your highest priority 15:34:09 I have catched some wireshark 15:34:20 ok, now all is gone, I am free to code 15:34:21 :-) 15:34:24 :) 15:34:45 should be matter of one example and then repeat until end of class 15:35:20 michal_rehak: Yep :) 15:35:36 michal_rehak: Did we loose oflibMichal ? 15:56:49 went home 15:56:57 there he will connect again 16:47:54 michal_rehak: Good to know :) 17:17:51 tbachman: Have you gotten to MessageDispatchServiceImpl ? 17:18:05 MessageDispatchServiceImpl.getGroupMod() ? 17:18:16 edwarnicke: just groupMod() 17:18:20 I used that print 17:18:21 saw it 4x 17:18:24 so, that looks good 17:18:26 but am narrowing 17:18:28 So lets drill in deeper 17:18:29 to the actual call 17:18:30 yeah 17:19:58 Looks like ConnectionAdaptorImpl.groupMod() could use logging 17:22:43 Looks like we already have some good logs in ConnectionAdapter to turn on 17:22:54 yeah, I didn’t do those 17:23:05 Yes.. but you can use them :) 17:23:09 lol 17:23:17 * tbachman was admitting his guilt 17:23:18 lol 17:38:25 oflibMichal: Welcome back ! :) 17:38:36 tbachman: Could you summarize what you know for oflibMichal 17:38:37 ? 17:38:53 just a sec.... 17:38:56 hi guys 17:48:21 okay 17:48:21 back 17:48:43 so, I printed the log messages in the RPC entry point in openflowjava 17:50:56 hmmm 17:51:04 * tbachman needs to go back and double-check his work 17:51:24 I’ll be back 17:55:33 sorry guys 17:55:40 just want to make sure I have my story straight 17:56:59 tbachman: What line number in what file is the debug and could you paste in the debug 17:57:10 edwarnicke: sure 17:57:24 line 133 17:57:34 in MessageDispatchServiceImpl.java 17:57:39 there’s already a debug log there 18:02:05 tbachman: And how many GroupMods do you see get there? 18:02:17 edwarnicke: that’s what am double-checking 18:02:32 * tbachman has jumped to too many conclusions before :( 18:02:40 michal_rehak: How are we doing on the fix for 1997 ? 18:02:48 50% 18:03:17 got stuck with Barrier as it is not modeled in md-sal and there is no transactionaware version 18:04:35 * tbachman prepares pastebin 18:05:19 http://pastebin.com/fdzGChpG 18:05:20 okay 18:05:23 so, that shows 2 things 18:05:32 the print statment as the RPC is received in OFjava 18:05:35 (2) 18:05:57 and the print satement at line 133 of MessageDispatchServiceImpl.java 18:06:23 (4) 18:06:50 It seems that the print statement in OFjava is pretty much at the beginning of the RPC 18:06:51 so 18:06:56 I don’t see how it can be lost in there 18:07:07 So, I’m going to continue chasing this in openflowplugni 18:07:11 openflowplugin 18:07:19 (sorry — had some distractions of late) 18:07:44 so it look like we are losing group mod in the groupMod() method ? 18:08:14 oflibMichal: it appears so 18:08:25 hmmm ... ok 18:11:07 oflibMichal: michal_rehak: something interesting 18:12:27 http://pastebin.com/s5LcFQVH 18:12:35 will send you guys the complete log 18:15:14 michal_rehak: oflibMichal: sent you guys the log 18:17:02 tbachman: log received 18:17:05 thx 18:17:19 np 18:17:48 note the error in ConnectionAdapterImpl just below Calling OFLibrary groupMod 18:18:07 Received groupmodinput: 53 4 18:21:30 oflibMichal: michal_rehak: edwarnicke: it seems there’s useful information there? 18:22:25 I believe it is 18:22:40 somehow it looks that the condition 18:22:58 (!session.isValid()) 18:23:11 is true in one case and false in the other case 18:23:30 must discuss when the valid flag is set with michal 18:23:35 okay 18:23:57 oflibMichal: actually 18:24:02 I think we see that print for all 4 caess 18:24:03 cases 18:24:10 tbachman: It sounds like you guys figured something out? 18:24:13 but we only see 2 error messages 18:24:21 edwarnicke: just getting further down the rabbit hole 18:24:29 : http://pastebin.com/s5LcFQVH 18:25:01 tbachman: So... let me make sure I understand 18:25:09 You see 4 messages go into OFjava, right? 18:25:11 ok, so lets analyze the http://pastebin.com/s5LcFQVH 18:25:16 with the code 18:25:39 wrong 18:25:45 I believe that there are two behaviors for method getConnectionAdapter(SwitchConnectionDistinguisher cookie) 18:25:46 well 18:25:48 I see 2 18:26:10 first it logs "finding connecton for cookie value null." 18:26:19 edwarnicke: by virtue of that print statement in OFjava 18:26:26 oflibMichal: but, we see that 4x 18:26:29 which is at line 78 18:26:32 right 18:26:40 I think it hits that for all 4 group mods 18:26:45 (2x switch) 18:26:59 I think that’s a “red herring" 18:27:11 * tbachman wonders if they have an analog for that in slovakia 18:27:27 so the condition if (!session.isValid()) is false and therefore is skipped 18:27:39 not skipped 18:27:43 just throws an exception 18:27:55 in last two cases 18:28:03 not in the first two 18:28:05 oflibMichal: we see this 2x: 2014-09-22 09:07:18,307 | ERROR | OFRpc-3 | ConnectionAdapterImpl | 210 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.5.0.SNAPSHOT | Received groupmodinput: 53 4 18:28:18 exactly 18:28:44 so, we see this 4x: “Calling OFLibrary groupMod” 18:29:03 and this 4x (right after each Calling OFLibrary groupMod”): “finding connecton for cookie value null.” 18:29:09 that's correct problem is elsewhere 18:29:23 oh, so my fault 18:29:27 but we only see this 2x: “Received groupmodinput: 53 4” 18:29:35 i will check the log you sent, not the one from pastebin 18:29:51 and that last one is thrown in ConnectionAdapterImpl 18:29:52 there are only two finding connecton for cookie value null. logs 18:29:59 my guess is those are the 2 that are missing 18:30:47 oh 18:30:50 sorry oflibMichal 18:30:51 you’re right 18:30:53 slight difference 18:31:18 let me check with full log 18:31:20 “finding connecton for cookie value null” vs. “Session for the cookie null is invalid” 18:31:21 want to be sure 18:31:21 my bad 18:31:36 tbachman: ok, exactly that's my point 18:31:49 one session seems to be set to valid 18:31:54 right, sorry 18:31:55 while the other is not 18:32:21 saw “cookie” and “null” and thought they were the same print :) 18:32:28 therefor it throws an connectionException 18:32:39 tbachman: no problem :-) 18:32:49 right 18:32:50 sorry 18:33:15 and that's the reason why there are 2 groupMods missing ... at least that's how my understanding goes 18:33:25 exactly 18:33:30 so, if we can answer that 18:33:36 now we should investigate why one session is valid while the other one is not 18:33:37 we may have an answer to the problem :) 18:33:41 yup 18:35:00 tbachman: oflibMichal Could this be a case of connection flapping where the groupmods are getting lost when the connection is down? 18:35:30 edwarnicke: would we also see flow mods lost as well? 18:35:42 (i.e. why just the group mods) 18:36:42 edwarnicke: connection flapping ? I don't understand 18:37:48 oflibMichal: If the switch connects... and then disconnects, and then connects that's called flapping 18:38:24 oflibMichal: So if the switch connects, we send GroupMods... before they exit OFjava the switch disconnects... we loose them... then the switch reconnects... and the other messages go out 18:38:32 oflibMichal: Does that make sense? 18:39:57 edwarnicke: hmmm ... not sure but it might be the case. The other usecase that comes to my mind is that the "valid" boolean / flag is set incorrectly 18:41:04 edwarnicke: yes it makes sense, I can imagine, that while sending messages downstream the switch disconnects and the session is set to invalid ... (i don't know if there are any timers or so) 18:41:30 tbachman: how are you guys setting up your switches to connect? 18:41:39 edwarnicke: oflibMichal: again, doesn’t this seem strange that it would only affect the group mod flows? 18:42:05 tbachman: Don't you push your groups first? 18:42:08 As a bunch 18:42:11 not first 18:42:12 plus 18:42:26 well, it is really strange that we would lose only groupmods 18:42:33 I’ve also tried this with a 5 second wait after changing the switches from 10 to 13 mode 18:43:00 not saying that it’s not the problem, but just doesn’t seem to make sense completely to me 18:43:15 plus 18:43:20 it’s always the same switch 18:43:29 tbachman: I agree 18:43:30 I’d expect it to be a bit more random if that were the case 18:43:33 oflibMichal: tbachman keep digging :) 18:43:39 :) 18:43:55 oflibMichal: if we can understand how that flag gets set better, that would be great ;) 18:45:15 will try to get more info on that 18:45:39 tbachman: What file what line are you guys seeing stuff? 18:45:44 tbachman: And what? 18:45:57 log file line, or java file line? 18:46:10 edwarnicke: I can send you the karaf log file 18:46:12 that may help 18:46:44 edwarnicke: sent 18:47:11 we see this print 2x: Received groupmodinput 18:47:22 that’s the print that happens on the groupMod RPC call in OFjava 18:47:52 edwarnicke: we see this one 4x: ‘Calling OFLibrary groupMod’ 18:48:38 That’s on entry to MessageDispatchServiceImpl.java, line 201 18:48:42 in groupMod() 18:50:25 tbachman: Specifically... what file what line are you encountering the sesion being invalid? 18:50:29 sorry, cookie 18:50:45 line 74 of the same 18:50:49 that’s the session invalid 18:51:00 the cookie invalid is just a print 18:51:02 not an exception 18:51:12 getConnectionAdapter 18:52:48 And only for the groupMod call ? 18:53:09 yes, 2x of them 18:53:51 * tbachman follows the protocol of IRC’ing edwarnicke, to let him know he has new email :) 18:53:56 So walking up the call tree 18:55:03 This is set in the setValid method of SessionContextOFImpl 18:56:19 that gets created in OFSesionUtil 18:56:26 (these are all in the same package) 18:57:36 and it looks like its 18:57:41 set to true when it’s crated 18:57:42 created 18:58:10 grepping for setValid 18:58:16 reveals a very small set of places 18:58:43 Cool...something to grab a hold of :) 18:59:15 invalidateSessionContext 18:59:27 In SessionManagerOFImpl 18:59:35 (again, same package) 18:59:48 There’s a print for this 18:59:51 I should enable it :) 18:59:57 er 18:59:57 wait 19:00:01 for the error case only 19:00:02 but 19:00:05 maybe I should add one here 19:00:19 there is a TODO in there 19:00:22 notifyListeners ;) 19:00:23 lol 19:00:29 but probably not important 19:01:25 interesting 19:01:39 it hits htis 19:01:40 this 19:01:45 if there’s already a session context 19:01:55 this is in registerSession() 19:02:02 in OFSessionUtil.java 19:02:53 michal_rehak: if you’re around, can you look at registerSession in OFSessionUtil.java? 19:03:02 openflowplugin/src/main/java/org/opendaylight/openflowplugin/openflow/md/core/session/OFSessionUtil.java 19:03:10 or oflibMichal 19:03:12 or edwarnicke 19:03:14 or anyone :) 19:03:16 lol 19:03:26 tbachman: Let me look 19:03:46 line 56 19:03:50 invalidateSessionContext 19:03:56 sets the session.valid flag to false 19:04:06 so it hits this if it thinks there’s already a session 19:05:45 tbachman: yes, bug-1985 19:06:54 michal_rehak: Wasn't mboback fixing that? 19:07:35 this is a warn print 19:07:41 tbachman: So is OFSessionUtil definitely where your session is getting invalidated? 19:07:43 do warns show up in karaf by default? 19:07:49 edwarnicke: not sure 19:07:55 that’s why I was curious about the print 19:08:20 Did you set:log to something for that file? 19:08:30 nope 19:08:31 tbachman: I think you will get WARNS by defautl 19:08:37 Are you seeing a WARN from there? 19:08:40 edwarnicke: I would have thought so too 19:08:45 I don’t see that log message in there 19:08:49 grep’d for “duplicate datapathId occured while registering new switch session" 19:08:51 but not there 19:09:02 I can run again and set it explicitly 19:09:17 default is info 19:10:19 Yep, I don't either 19:10:35 invalidateDeadSessionContext is the other candidate 19:10:57 tbachman: And I do see: openflowplugin - 0.0.3.SNAPSHOT | context for invalidation not found 19:12:08 vishnoianil: Are you around? 19:12:15 vishnoianil: You were looking at 985 right? 19:14:12 lol 19:14:18 edwarnicke: you scared him offf! 19:14:59 * edwarnicke has that effect on people 19:15:32 lol 19:16:17 edwarnicke: so, I tried again 19:16:24 and enabled debug log level for that file 19:16:29 so, it’s definitely not hitting that one 19:16:32 tbachman: Results 19:16:41 I’ll look at the invalidateDeadSessionContext 19:16:41 not hitting invalidateDeadSessionContext 19:17:04 And its not hitting invalidateSessionContext ? 19:17:05 looks like it’s hitting it: 19:17:06 karaf.log.1: at org.opendaylight.openflowplugin.openflow.md.core.session.SessionManagerOFImpl.invalidateDeadSessionContext(SessionManagerOFImpl.java:117)[212:org.opendaylight.openflowplugin:0.0.3.SNAPSHOT] 19:17:36 tbachman: OK 19:17:46 So that looks like we are getting the switch disconnected 19:17:54 yes 19:17:57 tbachman: Is there a failure case pcap somewhere I could look at? 19:18:21 yes 19:18:28 email it? 19:18:33 pull it up locally? 19:18:58 Drop it somewhere i can download it? 19:19:04 sure 19:19:27 actually 19:19:30 just a sec.... 19:20:19 tbachman: now is probably the time for wireshark capture 19:20:27 can you get one ? 19:21:00 oflibMichal: that tarball has one 19:21:03 a failed case 19:21:22 logdata.tar.gz 19:21:29 oflibMichal: did I email that to you? 19:21:41 i don't remember such file 19:21:44 give me a sec 19:21:48 i will check it 19:21:50 k 19:21:53 if not, will send it now 19:22:20 ok, i have it 19:24:24 * tbachman appreciates the ofplugin/ofjava folks staying up so late to help 19:24:26 :) 19:26:35 oflibMichal: fwiw, the one group mod message that is sent is packet 3657 19:26:50 ok, thanks 19:26:59 when it works, I typically see them to each switch, back-to-back 19:29:24 tbachman: I only see *one* groupmod on the wire here though 19:29:35 edwarnicke: right 19:29:47 one message, to the one working switch, with to group mods in it 19:30:02 there is no message to the other switch, which is what we’re expecting 19:30:12 b/c of the deadswitch invalidation 19:30:39 (this is the non-working case) 19:30:43 did we want the working one? 19:30:49 * tbachman has a trace for that as well 19:31:30 tbachman: are you running the controller on one of your host nodes? 19:32:24 yes 19:33:08 Ah OK 19:33:12 Makes more sense 19:33:59 tbachman: OK... I've learned all I can from the pcap 19:34:11 You have no fins anywhere near your group/flowmods 19:34:17 So I don't think thats it 19:35:48 * tbachman agrees 19:36:16 edwarnicke: this doesn’t feel like a comms issue 19:36:26 ? 19:48:38 tbachman: OK... I just talked to Tony 19:48:51 tbachman: And as soon as Michal finishes 1997 19:48:58 tbachman: He will switch to this 19:49:06 edwarnicke: thx 19:49:12 is there anything else I can provide here? 19:49:15 tbachman: Basically the issue appears to be 19:49:32 tbachman: If the switch connects and disconnects to fast 19:49:42 debouncer ;) 19:49:43 lol 19:49:45 It can produce some nasty behavior on race conditions 19:49:57 edwarnicke: question there tho 19:50:02 why is this creeping in now? 19:50:02 tbachman: I am not completely certain thats it... but it is possible 19:50:07 b/c we didn’t see this before 19:50:20 tbachman: Not entirely sure... but a lot of stuff has gotten faster 19:50:24 again — it’s all racy, so maybe something got faster, which caused the problem 19:50:52 tbachman: I should be finished downloading ova's here in a bit 19:51:05 edwarnicke: cool 19:51:05 but 19:51:20 can alagalah or I help here/ 19:51:20 ? 20:01:47 keep digging 20:01:53 I could be wrong 20:12:01 ping 20:12:03 Just back 20:12:17 edwarnicke: tbachman I'm catching up 20:12:32 alagalah: ACK 20:12:49 alagalah: howdy :) 20:12:56 alagalah: tbachman I will say this 20:13:03 Looking at other captures of OF traffic 20:13:11 Yours is the only one I see flapping like that 20:13:14 On the wire 20:13:15 lol 20:13:19 alagalah: How are you guys setting up? 20:13:52 edwarnicke: we use mininet 20:14:05 tbachman: OK... how do you invoke mininet ? 20:14:15 python APIs 20:14:23 Could you point me to the specifics? 20:14:30 in our repo 20:14:39 under util/testOfOverlay 20:14:45 there’s a file, mininet_gbp.py 20:15:03 we do use a system call 20:15:13 to set things like the OF protocll 20:15:15 protocol 20:15:28 and to create the tunnels 20:15:54 towards the end 20:15:55 there’s this: 20:15:56 for sw in switches: 20:15:57 setOFVersion(sw['name']) 20:16:00 after that 20:16:07 I put a time.sleep(5) 20:16:11 and that still didn’t fix it 20:16:34 * tbachman will retry, given the plethora of things that have transpired since the last time he did it 20:16:44 edwarnicke: the other thing 20:16:50 is that this conditions is transient 20:16:50 but 20:16:54 we always get the same result 20:17:01 that’s strange to me 20:17:12 edwarnicke: michal_rehak tbachman All the information about the log zip is in 1997 (look for my comments) 20:17:23 *Yes I know its now the wrong bug, I will update 2023* 20:18:59 alagalah: Looking 20:19:57 alagalah: Yep, tbachman directed us to that 20:20:40 tbachman: Are you setting the OF version before or after you connect to the controller? 20:20:58 * tbachman isn’t sure what’s initiating the connection 20:21:15 tbachman: Let me put it this way 20:21:18 edwarnicke: the protocols is one of the last things that’s done, I think 20:21:27 tbachman: Are you setting the OFVersion before or after you tell it were teh controller *is* 20:22:15 setting the controller IP is done way first 20:22:18 OK 20:22:25 SetOFVersion before you set the IP 20:22:27 And try that 20:22:30 okay 20:22:31 I bet you will be happy :) 20:22:34 lol 20:22:35 but 20:22:43 Note: This is a workaround 20:22:46 would this have *changed* 20:22:47 sorry to keep beating this horse 20:22:49 k 20:22:57 tbachman: race conditions 20:23:31 tbachman: Give the workaround a try :) 20:23:41 edwarnicke: am doing so now :) 20:28:52 so, I’ve tried two different ways of doing it 20:28:58 both have actually made it worse 20:29:24 * tbachman goes back to double-check his work 20:32:29 tbachman: Are you setting the OFversion before setting the controller IP? 20:32:39 edwarnicke: that’s what am trying 20:32:50 end result is undesirable, so far 20:33:55 * tbachman goes back to square 1, trying to sanity check his setup 20:36:29 edwarnicke: ping 20:37:13 tbachman: What are you seeing? 20:37:34 the nodes can’t ping each other at all 20:37:38 so, worse 20:37:43 let me get back to a good baseline 20:40:16 okay, verfied back to working 20:47:04 BTW folks - vishnoianil will work on https://bugs.opendaylight.org/show_bug.cgi?id=2028 tomorrow. I don't think we should stop the artifact cutting for that bug (from the description) 20:48:11 tbachman: may I ask you to test a patch? 20:48:20 michal_rehak: would love to! 20:48:22 tbachman: it fixes sessions killing 20:48:26 oka 20:48:27 okay 20:48:36 https://git.opendaylight.org/gerrit/#/c/11456/1 20:48:48 the there is another patch moving barrier at the end 20:48:56 https://git.opendaylight.org/gerrit/#/c/11457/ 20:49:16 my guess is that the sessions fix is crucial 20:49:45 okay 20:49:57 tbachman: thank you 20:50:04 michal_rehak: thank YOU! :) 20:50:22 michal_rehak: Thanks man 20:50:22 I wrote only one of them :-) 20:50:34 lol 20:50:39 you get to be the messenger 20:50:41 tbachman: ETA? 20:50:57 michal_rehak: can I do a fetch on each? Or do I have to cherry pick? 20:51:07 * tbachman hasn’t done multiple fetches in gerrit 20:51:53 tbachman: I think the best way is to try them separate 20:52:04 wait 20:52:10 just to make sure I understand 20:52:11 moving barrier is not really fixing disconnetions 20:52:15 just fetch the first gerrit 20:52:16 and test 20:52:19 right 20:52:21 then if that doesn’t work 20:52:23 try the second? 20:52:27 right 20:52:36 should I revert the first patch as well? 20:52:40 the first is favorite of mine 20:52:44 yes 20:52:45 Or just apply the second on top of the first? 20:52:46 okay 20:53:02 michal_rehak: I have to rebuild this and integration 20:53:04 give me about 15 mins 20:53:15 ok 20:53:18 perfect 20:53:28 exclamation mark 20:53:52 lol 21:00:52 * tbachman watches openflowplugin karaf tests scroll by 21:01:57 michal_rehak: may be a bit more — openflowplugin builds take a bit 21:02:09 :-) 21:03:58 michal_rehak: first patch applied, and openflowplugin is built — moving on to integration build 21:04:08 :( 21:04:12 fingers crossed 21:04:14 I built on the wrong VM 21:04:16 ugh 21:04:21 give me 15 more to get there 21:04:23 sorry! 21:04:29 no problem 21:04:48 http://en.wikipedia.org/wiki/Ugh! 21:05:18 lol 21:06:10 michal_rehak: Would https://git.opendaylight.org/gerrit/#/c/11456/1 help 2023 ? 21:07:52 I hope so - all the mysteriously missing packets could be caused by killing netty thread by this NPE 21:08:03 tbachman: are you trying https://git.opendaylight.org/gerrit/#/c/11456/1 21:08:12 edwarnicke: building it as we speak 21:08:17 (for the 2nd time :) ) 21:08:18 lol 21:08:25 don’t ask 21:08:45 tbachman: apologies... fielding calls at the same time 21:08:51 edwarnicke: no worries :) 21:09:13 * tbachman understands edwarnicke is in constant communication with many, many parties 21:17:50 michal_rehak: hang in there — still building the openflowplugin, but hoping that wraps up in the next 5 mins 21:17:58 I think the integration build with -nsu is pretty quick 21:18:19 michal_rehak: headed to the summit? 21:21:42 okay, on to integration build 21:24:41 michal_rehak: integration build done :) 21:24:45 * tbachman scp’s zip file 21:26:59 tbachman: now I feel like senior progress bar specialist :-) 21:27:04 lol 21:27:15 you’re acquiring new titles ;) 21:27:26 skills :-) 21:28:35 we’re 1/1 21:28:51 michal_rehak: if we go 5/5, I’d say it’s looking pretty good 21:31:16 2/2 21:31:26 tbachman: michal_rehak 3, 3, 3 ..... 21:32:00 alagalah: 5,5,5 ... :-) 21:32:03 tbachman: michal_rehak woot!!! 4,4,4 .... 21:32:12 oflibMichal: heheheheh nice one mate 21:32:44 * alagalah Waitings for tbachman to deliver the good news :) 21:32:46 3/3 21:33:07 now my progress bar skills are failing 21:33:11 michal_rehak: What is the essence of the patch ? 21:34:03 alagalah: creating of session and registering of services could interfere with session removal 21:34:08 4/4 21:34:09 now those steps are separated 21:35:34 michal_rehak: "registering of services" includes OF1.x version negotiation ? 21:35:38 5/5 :) 21:35:38 (by any chance?) 21:35:44 okay 21:35:48 so, I’ll do about another 10 runs 21:35:49 but 21:35:52 this is looking good 21:35:59 tbachman: Sounds good 21:36:16 michal_rehak (et. al.): thx for the help! 21:36:36 tbachman: michal_rehak oflibMichal edwarnicke : Many thanks for all the effort and patience!!! 21:37:17 michal_rehak: edwarnicke oflibMichal: we need reviewers and committers :) 21:37:21 alagalah: yes - the conflicts might emerge when switching OF-1.0 to OF-1.3 21:37:23 will this make it in for 5pm? 21:39:28 for those of you still counting: 6/6 21:40:23 this one https://git.opendaylight.org/gerrit/#/c/11456/1 can be merged now 21:40:27 7/7 21:40:35 .. we have the possibility 21:40:51 :) 21:40:53 bionic patch 21:43:05 tbachman: Did you apply both 11456 and 11457 ? 21:43:11 just 11456 21:43:16 they don’t need to go together 21:45:09 8/8 21:46:56 9/9 21:48:03 10/10 21:48:05 I’m satisfied 21:48:27 tbachman: So you are fixed? 21:48:34 edwarnicke: it appears so :) 21:48:53 Victory! 21:49:04 edwarnicke: congrats were made above ^^^^^ 21:49:07 thanks to all! 21:49:20 tbachman: Lets merged 21:49:29 * tbachman agrees 21:49:51 hmmm 21:49:54 let me verify my http 21:50:40 michal_rehak: have you and vishnoianil sorted your stuff out on 11456 ? 21:50:56 I just answered 21:51:13 guys 21:51:17 have to step away for 30 21:51:19 back soon 21:51:55 michal_rehak: so am I good to merge? 21:52:06 edwarnicke: I think this is minor detail what Anil suggested 21:52:12 edwarnicke: yes 21:52:32 https://git.opendaylight.org/gerrit/#/c/11456/ merged 21:53:03 michal_rehak: I am trying out https://git.opendaylight.org/gerrit/#/c/11457 21:53:34 michal_rehak: I was able to reliably reproduce the error 21:53:40 michal_rehak: So I should be able to reliably verify its gone 21:54:33 edwarnicke: ok 21:55:11 tbachman: Before you go, could you update bug 2023 indicating that its fixed? 21:55:21 tbachman: So I can close it out on the showstopper spreadsheet? 21:59:08 ok, so I am leaving for today 21:59:15 see you tomorrow 21:59:20 and good job all 21:59:35 thx for all help while debugging 22:18:28 edwarnicke: ping 22:18:35 pong 22:18:38 edwarnicke: Let me raise tbachman to close out the bug for you 22:18:43 edwarnicke: If its urgent 22:18:49 It would help me a lot 22:18:59 Because I could then retire it from the showstopper list 22:19:03 Or if you wish to allege 22:30:31 I lied 22:30:33 40 minutes 22:30:35 am back :) 22:32:39 https://www.irccloud.com/pastebin/56HRtzrS 22:32:45 guhs 22:32:46 gusy 22:32:47 ugh 22:32:48 guys 22:32:48 :) 22:32:57 I have to revisit this 22:32:59 :) 22:33:02 tbachman: huh ? 22:33:04 my tests were only testing out pings 22:33:11 those are working 22:33:11 tbachman: you have 90 minutes.-:) 22:33:14 but 22:33:16 90? 22:33:18 tbachman: you didn't test curls ? 22:33:20 nope 22:33:26 just did now 22:33:49 and ? 22:34:07 only 46 flows now on one of th nodes 22:34:56 the group flows are there now 22:35:25 I’ll re-run 22:35:30 b/c I want to see if this is consistent 22:37:30 same 22:37:31 46 flows 22:37:33 :( 22:37:42 I’ll have to go through the steps again 22:37:49 to figure out where things are going wrong this time 22:37:50 sorry folks 22:38:12 ZZZzzzz 22:38:19 see you tomorrow 22:38:21 bye 22:38:32 michal_rehak: will give you the update once I know it 22:39:00 thank you 22:41:46 only 102 flows are recorded in jconsole 22:41:48 we expect 114 23:24:54 coming into the plugin? 23:26:31 tbachman: opinion on closing 2023 and opening another one ??? This is different 23:38:08 looks like I may have forgotten to end meeting :) 23:38:16 #endmeeting