I'll just assume there's a good reason, but I did want to mention that the TCP LB is a more typical solution.

grpc.max_connection_age_ms and grpc.max_connection_age_grace_ms are server settings. The client gets an error (rpc error: code = Unavailable desc = closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR, debug data:) rather than io.EOF. While we could also solve this problem using proxies, it seems like these are settings that grpclib should support to allow for robust client-side load balancing.

Having to use MaxConnectionAge, which just happens to be coupled to re-resolution, to emulate a polling behaviour seems like a bad workaround to me.

If there are no events, connections are idle but open for as long as possible. Really, we could write whatever logic we want with this and perform the resolution whenever we saw fit; it gives us full control of something that works for us.

UPDATE: I apologize for the misunderstanding. My understanding is that for HTTP/2 the LB does not load balance each request as it does with HTTP/1.1; it just picks a backend, and the connection stays with that backend until the backend closes it or the LB aborts it. The server has to send GOAWAY and finish all pending RPCs in under 1 hour, or the client will see an abort. IMO it should include the grace period; otherwise RPCs will unnecessarily fail because of this feature. But that seems too strict. I'd consider adding jitter to the time we provide, but that seems in direct conflict with your goal.

I do think we should consider providing a way for the client to be configured to periodically re-resolve at some fixed interval.

It's not quite that simple. I just checked, and the top hit that comes up for me on this topic on Google still proposes DNS LB using headless services and makes no mention of the required configuration (max connection age + grace) explained in this thread.

This gives us a way to keep listening until we're 2/2 again, OR we were unsuccessful in getting back to 2/2 (so we stay at 1/2).

I see these options. Option 1: my thinking is that in grpc-go it can be a context that is "done" when MAX_CONNECTION_AGE expires. However, even in this case, this is technically what the user wants, because the RPC cannot exist on the server for longer than the connection.

If our client has streams 1, 2, and 3 connected to server A, when server B comes up we'd like the new stream 4 to consider both servers A and B without having to disrupt streams 1, 2, and 3 connected to server A.

You can use grpc.server()'s options argument: "options: An optional list of key-value pairs (channel_arguments in gRPC runtime) to configure the server." Among the channel args documented in grpc/impl/codegen/grpc_types.h, I have had the following issues with understanding their functionality, specifically when trying to implement them in @grpc/grpc-js. In addition, many of the options are missing information about the default value if the argument is not explicitly set.
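For concreteness, here is a minimal sketch of how those two age settings look on a grpc-go server, where they are exposed as keepalive.ServerParameters rather than as raw channel-arg strings (the durations are illustrative and service registration is omitted):

```go
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		// Counterpart of grpc.max_connection_age_ms: once a connection
		// reaches this age the server sends GOAWAY, so clients re-resolve
		// and reconnect (grpc-go adds +/- 10% jitter to this value).
		MaxConnectionAge: 5 * time.Minute,
		// Counterpart of grpc.max_connection_age_grace_ms: how long
		// in-flight RPCs may run before the connection is force-closed.
		MaxConnectionAgeGrace: 1 * time.Minute,
	}))
	// Register services here, then serve.
	if err := srv.Serve(lis); err != nil {
		log.Fatal(err)
	}
}
```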
This idea is related solely to the gRPC feature of max connection age, and unrelated to the other concern of an LB in the middle killing connections.

I have a service for our gRPC server. Clients (there are many) establish a connection, and then the server may eventually have something to reply with. The connections being 1:1 is more of an attribute of L4 load balancing. Is it a concern?

The reason why it's not working as intended is probably related to the way you are making the reads and writes.

For expensive RPCs, or RPCs where failures are costly, this could be a big win.

A fix/workaround is here: https://gitlab.com/gitlab-org/cluster-integration/gitlab-agent/-/merge_requests/424. Perhaps it can help explain the issue I'm having. The application is running in a Docker container on Kubernetes and exits with code 139, which could mean a segmentation fault.

You may be able to do that, but that isn't generally the case, so it isn't worth spending much time considering.

So if you restart both of the servers, the client will re-resolve and then connect to the new addresses. But if you move just one of the servers, we will simply stop being able to use that one. The goal is to make it "reasonably" efficient.

You want a custom name resolver API in Node.js. @dgquintas and I have talked about possible ways to address this problem.

I noticed that a maintainer of gRPC suggested using MAX_CONNECTION_AGE to load balance long-lived gRPC streams in their talk "Using gRPC for Long-lived and Streaming RPCs" (Eric Anderson, Google).

Ideally we shouldn't get any errors; the client waits on read indefinitely.

We have to have MaxConnectionAge for other reasons, so the question is whether the deficiencies are bad enough to warrant another solution for this specific case. The server does behave differently based on connection age, because of MAX_CONNECTION_AGE support itself. There are also server-controlled connection timeouts like GRPC_ARG_MAX_CONNECTION_AGE_MS and GRPC_ARG_MAX_CONNECTION_IDLE_MS (see the grpc_types.h comments for documentation), which can influence how long a connection lasts.

Now we're running 7 instances, and because we only ever scaled the servers and not the clients, we have no way to serve the surge of traffic (the clients will never refresh the DNS pool), so we stay at 5/7. One possible solution is giving us a function to call that can refresh the pool of hosts via DNS resolution. I don't think there is much argument there.
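To make the "ideally we shouldn't get any errors" point concrete: a client can absorb the GOAWAY-induced Unavailable shown earlier by re-opening its long-lived stream on the same channel. A sketch in grpc-go, where pb.WatcherClient, pb.WatchRequest, and handle are hypothetical stand-ins for a generated server-streaming API:

```go
import (
	"context"
	"io"
	"log"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// watchForever keeps a long-lived server-streaming RPC alive across the
// GOAWAYs produced by MAX_CONNECTION_AGE: when the stream ends cleanly or
// with Unavailable, it re-opens the stream on the same ClientConn, which
// by then may have re-resolved and picked a different backend.
func watchForever(ctx context.Context, client pb.WatcherClient) {
	for ctx.Err() == nil {
		stream, err := client.Watch(ctx, &pb.WatchRequest{})
		if err != nil {
			time.Sleep(time.Second) // crude backoff before retrying
			continue
		}
		for {
			ev, err := stream.Recv()
			if err == io.EOF || status.Code(err) == codes.Unavailable {
				break // server went away (e.g. GOAWAY); reconnect
			}
			if err != nil {
				log.Printf("watch: %v", err)
				break
			}
			handle(ev) // hypothetical application callback
		}
	}
}
```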
Need more intelligent re-resolution of names

Related links and issues:
- https://gist.github.com/carldanley/39d5a0d7f9b1ea865af94481da1e0cac
- https://gist.github.com/carldanley/39d5a0d7f9b1ea865af94481da1e0cac#file-index-js-L50
- https://github.com/grpc/proposal/pull/23/files
- https://github.com/grpc/proposal/blob/master/A9-server-side-conn-mgt.md (Server-side Connection Management)
- Channel not re-resolving hostname after failed connect attempt in gRPC v1.10.x
- Agent stuck with log "resolver returned no addresses"
- Decide on service discovery strategy for Boulder
- Implement on-demand DNS lookup for GRPC predictor
- L96: .NET: Load balancing, connectivity and wait for ready in client
- How to reconnect gRPC connection from DNS resolve by client-side
- Add configurable GrpcMaxConnectionAge, GrpcMaxConnectionAgeGrace
- distirbution of traces/span amongst collector
- feat(server): set grpc MaxConnectionAge for dns connection cycling
- Add support for MAX_CONNECTION_AGE and MAX_CONNECTION_AGE_GRACE equivalents
- Support for `grpc.max_connection_age_ms` and `grpc.max_connection_age_grace_ms`
- Torchserve gRPC configuration options for client-side load-balancing without errors

Running a lookaside balancer means you have to: create your own lookaside load balancer, modify ALL of your clients' code to use the lookaside balancer, and maintain the code and integrations with your service discovery platform of choice (K8s, Consul, ZooKeeper, etc.).

Let me try to answer your question again. Please forget the load balancer; the same problem happens on a single machine with the server and client communicating directly. From what I can tell it would just work.

Then you gracefully close them in the backend after an hour, and you set your LB's MAX_CONNECTION_AGE_GRACE to at least 1 hour + 1 minute.

Wish I had seen this document 2 weeks ago.

This would give us a way to decide when we want to trigger it ourselves. The important bit is that max connection age has +/- 10% jitter, and I don't have to deal with that if I'm provided with an exact number. INT_MAX means unlimited.

One possible solution would be to make the DNS resolver aware of DNS TTLs, so that we can automatically re-resolve after the previous results expire; this would essentially let the DNS data determine how often the clients re-resolve. Using a consistent polling frequency doesn't cause thundering herds, but configuration becomes a problem.

Short-lived requests continue using the same connection until MAX_CONNECTION_AGE expires, unless MAX_CONNECTION_IDLE kicks in, which seems unlikely for you. Thank you.

Are there any other new options I am missing that make basic "proxyless" load balancing easy to get right in k8s across languages these days?

I see that as a feature; when well configured, it is invisible and the grace period is mostly not exceeded (maybe just long-tail RPCs). When using an L4 LB, the max age tends to skew lower, like 1-10 minutes.

We are talking about the LB again, but as I said above, even without an LB in the picture, with direct client-server comms, it's still a problem.

If the load balancer's grace period is no smaller than the backend's maximum poll duration, then there are no issues. Once the client sees a GOAWAY on an existing connection, it will stop using that connection for new streams, even if the connection remains open for a long time for the existing streams that had already been started on it before the GOAWAY was received. That changes a lot of the discussion. As for how long the grace should be: MAX_CONNECTION_AGE or a bit more.
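On the "proxyless in k8s" question: the client half of the headless-service recipe from this thread is just a dns:/// target plus round_robin. A sketch in grpc-go; the service hostname is illustrative:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func dial() *grpc.ClientConn {
	conn, err := grpc.Dial(
		// A dns:/// target against a headless Service, so the resolver
		// returns every pod IP instead of a single ClusterIP.
		"dns:///my-service.default.svc.cluster.local:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		// round_robin spreads new RPCs and streams across all resolved
		// backends; with MAX_CONNECTION_AGE(_GRACE) set on the servers,
		// the pool gets re-resolved and rebalanced over time.
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig":[{"round_robin":{}}]}`),
	)
	if err != nil {
		log.Fatal(err)
	}
	return conn
}
```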
As you can see, it was caused by the ChannelFor testing utility calling Protocol.data_received after the connection was closed.

Our client streams run "forever" and we'd like them to pick up added servers without disrupting them.

Your backend would then limit itself to 23 hours minus 2 minutes. So it's not really a portable solution. As I've also mentioned, many solutions depend on your network topology, which means we solve this for direct connections, then we need another solution for single-LB usages, and then we still may need to consider multi-LB usages.
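One way to implement the "backend limits itself" budget above is to cap each long-lived stream's duration on the server, comfortably below the LB's (or MAX_CONNECTION_AGE's) limit, and end it cleanly so the client reconnects through a fresh resolution. A sketch in grpc-go, with a hypothetical server-streaming Watch handler and an assumed s.events source:

```go
import (
	"time"
)

// maxStreamAge should stay below the surrounding connection limits minus
// the grace period (e.g. the "23 hours minus 2 minutes" budget above).
// The value here is illustrative.
const maxStreamAge = 50 * time.Minute

func (s *server) Watch(req *pb.WatchRequest, stream pb.Watcher_WatchServer) error {
	expired := time.After(maxStreamAge)
	for {
		select {
		case <-stream.Context().Done():
			return stream.Context().Err()
		case <-expired:
			// End the stream with OK; the client sees io.EOF and can
			// reconnect, at which point it may land on a newer backend.
			return nil
		case ev := <-s.events: // hypothetical event source
			if err := stream.Send(ev); err != nil {
				return err
			}
		}
	}
}
```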