|
| 1 | +--- |
| 2 | +title: egress-flow |
| 3 | +authors: |
| 4 | + - "@bnshr" |
| 5 | +reviewers: |
| 6 | + - "@trozet" |
| 7 | + - "@danwinship" |
| 8 | + - "@msherif1234" |
| 9 | +approvers: |
| 10 | + - "@trozet" |
| 11 | + - "@danwinship" |
| 12 | +api-approvers: |
| 13 | + - None |
| 14 | +creation-date: 2025-11-10 |
| 15 | +last-updated: 2025-11-12 |
| 16 | +status: implementable |
| 17 | +--- |
| 18 | + |
| 19 | + |
| 20 | +# Communication egress flows matrix of OpenShift and Operators |
| 21 | + |
| 22 | +## Summary |
| 23 | + |
| 24 | +This enhancement automatically generates the network communication matrix in the |
| 25 | +product documentation for all egress flows of OpenShift (multi-node and |
| 26 | +single-node deployments) and Operators. |
| 27 | + |
| 28 | +## Motivation |
| 29 | + |
| 30 | +Security-conscious customers need an OpenShift flows matrix for regulatory reasons |
| 31 | +and/or to implement firewall rules, on-node or external, that restrict traffic to |
| 32 | +the minimum set of required flows. |
| 33 | + |
| 34 | +### User Stories |
| 35 | + |
| 36 | +- As an OpenShift cluster administrator, I want documentation on the expected |
| 37 | + flows of traffic outgoing from every OpenShift installation so I can set up |
| 38 | + firewall rules (for example, with nftables or an NGFW) to restrict traffic to the |
| 39 | + minimum required set of flows. |
| 40 | + |
| 41 | +### Goals |
| 42 | + |
| 43 | +- Provide a mechanism to automatically generate an accurate and up-to-date |
| 44 | + OpenShift communication egress flows matrix. |
| 45 | + |
| 46 | +- Keep the egress flow matrix documented in the OpenShift release documentation |
| 47 | + up to date, and validate it. |
| 48 | + |
| 49 | +### Non-Goals |
| 50 | +N/A |
| 51 | + |
| 52 | +## Proposal |
| 53 | + |
| 54 | +We propose to leverage the OpenShift Network Observability Operator to collect egress traffic from the cluster to the outside world. |
| 55 | + |
| 56 | +- A communication matrix describing the expected flows of outgoing traffic will |
| 57 | + be included in the documentation of every OpenShift release. |
| 58 | + |
| 59 | +### Workflow Description |
| 60 | + |
| 61 | +An OpenShift administrator would like to get an accurate and up-to-date OpenShift |
| 62 | +communication egress flows matrix. |
| 63 | + |
| 64 | +- The admin reviews OpenShift release documentation to get the included communication |
| 65 | + matrix describing the expected flows of outgoing traffic. |
| 66 | + |
| 67 | +### API Extensions |
| 68 | +N/A |
| 69 | + |
| 70 | +### Topology Considerations |
| 71 | + |
| 72 | +#### Hypershift / Hosted Control Planes |
| 73 | +Out of scope for this proposal. |
| 74 | + |
| 75 | +#### Standalone Clusters |
| 76 | + |
| 77 | +The communication matrix can be generated on standalone clusters. |
| 78 | + |
| 79 | +#### Single-node Deployments or MicroShift |
| 80 | + |
| 81 | +The communication matrix can be generated on single-node deployments and MicroShift. |
| 82 | + |
| 83 | +### Implementation Details/Notes/Constraints |
| 84 | + |
| 85 | +1. OpenShift CI installs the Network Observability Operator in the cluster under test. |
| 86 | +2. The eBPF agent of the Network Observability Operator collects the egress network data, which is aggregated in Loki. Flow logs are retained in Loki for 24 hours. The `FlowCollector` is configured with a sampling rate of 1 so that all flows are captured. |
| 87 | +3. The CI job runs the OpenShift tests to track any flow within the cluster that generates outgoing traffic. |
| 88 | +4. The start and end times of the test run are captured and used to filter the egress flows aggregated in Loki for processing. |
| 89 | +5. The data processing keeps only the egress data that originates from the OpenShift operators. |
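Step 4 above can be sketched as follows. This is a minimal illustration, not the CI job's actual code: it assumes the flows are fetched through Loki's HTTP `query_range` endpoint, and the base URL, the `limit` value, and the helper names `to_unix_nanos` and `build_query_range_url` are hypothetical. The function only builds the request URL for the captured test window and performs no network call.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def to_unix_nanos(ts: datetime) -> int:
    """Convert a timezone-aware datetime to Unix epoch nanoseconds."""
    # Combine whole seconds and microseconds as integers to avoid float rounding.
    return int(ts.timestamp()) * 1_000_000_000 + ts.microsecond * 1_000


def build_query_range_url(base_url: str, logql: str,
                          start: datetime, end: datetime,
                          limit: int = 5000) -> str:
    """Build a Loki query_range URL restricted to the test run's time window.

    base_url and limit are assumptions made for this sketch.
    """
    if end <= start:
        raise ValueError("end must be after start")
    params = {
        "query": logql,
        "start": to_unix_nanos(start),
        "end": to_unix_nanos(end),
        "limit": limit,
        "direction": "forward",
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical test window and Loki address, for illustration only.
    test_start = datetime(2025, 11, 10, 12, 0, 0, tzinfo=timezone.utc)
    test_end = datetime(2025, 11, 10, 14, 30, 0, tzinfo=timezone.utc)
    query = '{K8S_FlowLayer="infra", FlowDirection="1"} | json'
    print(build_query_range_url("http://loki.example:3100", query, test_start, test_end))
```

Passing the window as explicit `start` and `end` parameters keeps flows generated before or after the test run out of the result.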
| 90 | + |
| 91 | +**Basic Loki query** |
| 92 | + |
| 93 | +```{K8S_FlowLayer="infra", FlowDirection="1"} | json | DstSubnetLabel = "" | SrcSubnetLabel = "Pods" | line_format "{{.SrcAddr}},{{.SrcPort}},{{.DstAddr}},{{.DstPort}}" ``` |
| 94 | + |
| 95 | +This query will be refined to identify the Operators that generate the egress flows. |
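The `line_format` stage of the query above emits one `SrcAddr,SrcPort,DstAddr,DstPort` line per flow. A possible way to fold those lines into matrix rows is sketched below. The `EgressFlow` type and the `aggregate_flows` helper are hypothetical names, and dropping the source port (which is usually ephemeral) is an assumption of this sketch rather than something the proposal specifies.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class EgressFlow(NamedTuple):
    """One matrix row; the source port is deliberately omitted (assumption)."""
    src_addr: str
    dst_addr: str
    dst_port: int


def aggregate_flows(lines: Iterable[str]) -> Counter:
    """Count flow records per (source address, destination address, destination port)."""
    counts: Counter = Counter()
    for line in lines:
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 4:
            continue  # skip lines that do not match the expected four fields
        src_addr, _src_port, dst_addr, dst_port = parts
        if not dst_port.isdigit():
            continue  # skip records without a numeric destination port
        counts[EgressFlow(src_addr, dst_addr, int(dst_port))] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines in the format produced by line_format.
    sample = [
        "10.128.0.5,43210,192.0.2.10,443",
        "10.128.0.5,43999,192.0.2.10,443",
        "10.128.2.7,51515,192.0.2.20,6443",
    ]
    for flow, seen in sorted(aggregate_flows(sample).items()):
        print(flow.src_addr, flow.dst_addr, flow.dst_port, seen)
```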
| 96 | + |
| 97 | + |
| 98 | +#### Architecture |
| 99 | + |
| 100 | + |
| 101 | + |
| 102 | + |
| 103 | + |
| 104 | + |
| 105 | +### Risks and Mitigations |
| 106 | + |
| 107 | +1. A sampling rate of 1 for flow capture may cause performance issues. |
| 108 | +2. The small Loki size (1x.small) used in the installation may run out of storage. |
| 109 | +3. Loki could go down, in which case data loss can occur and debugging is necessary. The CI job must be rerun in that case. |
| 110 | + |
| 111 | +### Drawbacks |
| 112 | +N/A |
| 113 | + |
| 114 | +## Open Questions |
| 115 | + |
| 116 | +1. What should the reporting strategy be once we have the egress data report? |
| 117 | +2. Should we automate the reporting to the teams? If yes, how? |
| 118 | +3. Do we need persistent storage for Loki and storage in the Cloud (maybe in AWS)? |
| 119 | + |
| 120 | +## Test Plan |
| 121 | + |
| 122 | +- E2E tests will be added to `openshift-tests` |
| 123 | + - Validate that a freshly generated egress flow matrix matches the |
| 124 | + one documented in the OpenShift release documentation |
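The validation described above amounts to a set comparison between the generated matrix and the documented one. The sketch below shows one way to express it, assuming each matrix entry is reduced to a `(destination, port)` tuple. The `diff_matrices` helper and the sample entries are hypothetical, and the real test would live in `openshift-tests` rather than in a standalone script.

```python
from typing import Dict, Iterable, List, Tuple

Entry = Tuple[str, int]  # (destination, port); an assumed representation


def diff_matrices(generated: Iterable[Entry],
                  documented: Iterable[Entry]) -> Dict[str, List[Entry]]:
    """Report entries that appear in only one of the two matrices."""
    gen, doc = set(generated), set(documented)
    return {
        # Observed on the cluster but missing from the documentation.
        "undocumented": sorted(gen - doc),
        # Documented but not observed during the test run.
        "not_observed": sorted(doc - gen),
    }


def matrices_match(generated: Iterable[Entry], documented: Iterable[Entry]) -> bool:
    """True when both matrices contain exactly the same entries."""
    result = diff_matrices(generated, documented)
    return not result["undocumented"] and not result["not_observed"]


if __name__ == "__main__":
    # Made-up entries for illustration only.
    generated = [("registry.example.com", 443), ("ntp.example.com", 123)]
    documented = [("registry.example.com", 443), ("api.example.com", 443)]
    print(diff_matrices(generated, documented))
```

Reporting the two directions separately distinguishes a new, undocumented flow from a documented flow that the test run simply did not trigger.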
| 125 | + |
| 126 | + |
| 127 | +## Graduation Criteria |
| 128 | + |
| 129 | +### Dev Preview -> Tech Preview |
| 130 | +N/A |
| 131 | + |
| 132 | +### Tech Preview -> GA |
| 133 | +N/A |
| 134 | + |
| 135 | +### Removing a deprecated feature |
| 136 | +N/A |
| 137 | + |
| 138 | +## Upgrade / Downgrade Strategy |
| 139 | +N/A |
| 140 | + |
| 141 | +## Version Skew Strategy |
| 142 | +N/A |
| 143 | + |
| 144 | +## Operational Aspects of API Extensions |
| 145 | +N/A |
| 146 | + |
| 147 | +## Support Procedures |
| 148 | +N/A |
| 149 | + |
| 150 | +## Alternatives (Not Implemented) |
| 151 | +N/A |