The following curl command was run to gather the top 1,000 highest-starred Java repos from GitHub. Note: replacing `stars` with `forks` in the query lets us sort by the most-forked repositories instead.
echo "clone_url" > java_repos.csv
for page in {1..10}; do
curl -s "https://api.github.com/search/repositories?q=language:Java+stars:%3E1&sort=stars&order=desc&per_page=100&page=$page" \
| jq -r '.items[].clone_url'
done >> java_repos.csv
Install joern-lib to work with the generated CPGs from Python:

```bash
pip install git+https://github.com/appthreat/joern-lib.git
```
- Which code metrics derived from code property graphs are most effective at detecting SRP violations and code smells?
- How widespread are code smells (e.g., large methods, tightly coupled classes) across open-source Java projects?
- Can we identify project-level or organization-level trends in software quality based on aggregated CPG metrics?
If you go with Refactoring Opportunities via CPG Metrics, your main goal is to analyze graphs extracted from Java code (via Code Property Graphs) to detect Single Responsibility Principle (SRP) violations and other code smells.
Method complexity
- Why: High complexity often means a method does too much (SRP violation).
- How (see the sketch below):
  - Use NetworkX's `nx.dag_longest_path_length()` on the dataflow subgraph of each function.
  - Count the number of `if`, `for`, `while`, etc. nodes in the control flow (from the CPG).
  - Optionally use graph diameter or maximum out-degree.
- Outcome: Flag methods with complexity > threshold.
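A minimal sketch of these complexity measures, assuming each function's CPG has already been exported as NetworkX DiGraphs and that nodes carry a `type` attribute (the attribute name and the branch-type labels are assumptions, not Joern's exact schema):

```python
import networkx as nx

# Hypothetical control-structure labels; adjust to the real CPG export.
BRANCH_TYPES = {'IF', 'FOR', 'WHILE', 'SWITCH'}

def complexity_metrics(cfg: nx.DiGraph, dfg: nx.DiGraph) -> dict:
    """Approximate complexity from a function's control-flow and dataflow subgraphs."""
    # Number of branching nodes in the control flow graph.
    branches = sum(1 for _, t in cfg.nodes(data='type') if t in BRANCH_TYPES)

    # Longest dataflow chain; only defined for acyclic graphs, so guard it.
    longest_flow = (nx.dag_longest_path_length(dfg)
                    if nx.is_directed_acyclic_graph(dfg) else None)

    max_out = max((d for _, d in cfg.out_degree()), default=0)
    return {'branches': branches,
            'longest_dataflow_path': longest_flow,
            'max_out_degree': max_out}
```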
Coupling (fan-in / fan-out)
- Why: High fan-in/out indicates tight coupling or too many responsibilities.
- How (see the sketch below):
  - For a method node: `fan_in` = number of incoming CALL edges, `fan_out` = number of outgoing CALL edges.
  - Use NetworkX's `G.in_degree()` and `G.out_degree()`.
- Outcome: Rank methods/classes by coupling.
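A sketch of the coupling metric, assuming the whole-program CPG is a NetworkX MultiDiGraph whose edges carry a `label` attribute (the attribute and label names are assumptions):

```python
import networkx as nx

def call_coupling(cpg: nx.MultiDiGraph, method_node):
    """fan_in / fan_out of a method, counting only CALL edges."""
    fan_in = sum(1 for _, _, lbl in cpg.in_edges(method_node, data='label')
                 if lbl == 'CALL')
    fan_out = sum(1 for _, _, lbl in cpg.out_edges(method_node, data='label')
                  if lbl == 'CALL')
    return fan_in, fan_out
```

If you first project the CPG down to a pure call graph, `G.in_degree(n)` and `G.out_degree(n)` give the same numbers directly.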
Variable usage
- Why: A method that touches too many variables might be doing multiple jobs.
- How (see the sketch below):
  - Traverse dataflow edges (`REACHES`, `DEF`, `USE`).
  - Count unique variable nodes touched.
- Outcome: Flag methods that read/write too many distinct variables.
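A sketch of the variable-usage count under the same assumed NetworkX export (the `REACHES`/`DEF`/`USE` labels come from the list above; the `IDENTIFIER` node type and `name` attribute are assumptions):

```python
import networkx as nx

DATAFLOW_LABELS = {'REACHES', 'DEF', 'USE'}

def distinct_variables(cpg: nx.MultiDiGraph, method_nodes: set) -> int:
    """Count unique variable nodes a method touches via dataflow edges."""
    variables = set()
    for u, v, lbl in cpg.edges(data='label'):
        if lbl in DATAFLOW_LABELS and (u in method_nodes or v in method_nodes):
            for n in (u, v):
                if cpg.nodes[n].get('type') == 'IDENTIFIER':
                    variables.add(cpg.nodes[n].get('name', n))
    return len(variables)
```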
Cohesion
- Why: Low cohesion = different responsibilities mashed into one place.
- How (see the sketch below):
  - Build a similarity score between accessed variables within a method.
  - Or cluster nodes within a method and measure internal edge density vs. external edges.
- Outcome: Low cohesion score ⇒ suspected SRP violation.
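A sketch of the second option, edge-density cohesion, where `method_nodes` is the set of CPG nodes belonging to one method:

```python
import networkx as nx

def cohesion_score(cpg: nx.Graph, method_nodes: set) -> float:
    """Fraction of a method's incident edges that stay inside the method.
    Near 1.0 = self-contained; near 0.0 = mostly external coupling."""
    internal = external = 0
    for u, v in cpg.edges():
        u_in, v_in = u in method_nodes, v in method_nodes
        if u_in and v_in:
            internal += 1
        elif u_in or v_in:
            external += 1
    total = internal + external
    return internal / total if total else 1.0
```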
Class-level aggregation
- Why: You can detect entire classes violating SRP by aggregating method metrics.
- How (see the sketch below):
  - Compute averages and variances of the above metrics per class.
  - Flag classes with wide variance (some methods very complex, others not).
- Outcome: Classes with mixed-responsibility methods.
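A sketch of the aggregation step, assuming a `method_metrics` dict keyed by (class name, method name) holding metric dicts like those produced above:

```python
import statistics
from collections import defaultdict

def aggregate_by_class(method_metrics: dict, key: str = 'branches') -> dict:
    """Per-class mean and variance of one metric; high variance suggests
    simple and very complex methods coexisting in the same class."""
    per_class = defaultdict(list)
    for (cls, _method), metrics in method_metrics.items():
        per_class[cls].append(metrics[key])
    return {cls: {'mean': statistics.mean(vals),
                  'variance': statistics.pvariance(vals)}
            for cls, vals in per_class.items()}
```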
Visualization ideas (see the plotting sketch below):
- Histograms of method complexity.
- Scatter plot: fan-out vs. complexity.
- Heatmap: methods in a class vs. their metric scores.
- Table of “Top 10 Suspect Methods” with metrics.
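A quick plotting sketch for the first two ideas, assuming the metrics were dumped to a hypothetical `method_metrics.csv` with `complexity` and `fan_out` columns:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('method_metrics.csv')  # hypothetical output of the metric pass

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(df['complexity'], bins=30)
ax1.set(title='Method complexity', xlabel='complexity', ylabel='# methods')
ax2.scatter(df['fan_out'], df['complexity'], alpha=0.4)
ax2.set(title='Fan-out vs complexity', xlabel='fan-out', ylabel='complexity')
plt.tight_layout()
plt.show()
```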
You can even train a model like a GCN or GAT to predict SRP violations (label some examples yourself; see the sketch after this list):
- Node features: one-hot types, number of connections, data types used.
- Edge types: encode control flow, call, dataflow as multi-relation graphs.
- Target: 0 = healthy method, 1 = smells/SRP violation.
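A minimal PyTorch Geometric sketch of such a classifier (layer sizes and the two-class output are assumptions; a GAT variant would swap `GCNConv` for `GATConv`):

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class SRPClassifier(torch.nn.Module):
    """Two-layer GCN scoring each method node as healthy (0) or smelly (1)."""
    def __init__(self, num_features: int, hidden: int = 64):
        super().__init__()
        self.conv1 = GCNConv(num_features, hidden)
        self.conv2 = GCNConv(hidden, 2)  # logits for the two classes

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.5, training=self.training)
        return self.conv2(x, edge_index)  # train with cross_entropy
```

Note that plain `GCNConv` ignores edge types; for the multi-relation encoding above, PyG's `RGCNConv` is the closer fit.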
Remaining pipeline tasks:
- Add RefactoringMiner to the pipeline.
- Map RefactoringMiner's output to CPG metadata (a mapping sketch follows the code block below).
- Create `data.node_metadata`.
```python
# Example structure for node_metadata and file_map;
# cpg_nodes is assumed to be an iterable of parsed CPG nodes.
node_metadata = {}
file_map = {}

for node in cpg_nodes:
    # Process the raw CPG node (example: extracting the function name).
    node_metadata[node.id] = {
        'function_name': node.function_name,
        'method_signature': node.method_signature,
        # Add other metadata as needed (e.g., line numbers).
    }
    # Map each node to the source file it came from.
    file_map[node.id] = node.file_path
```
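A hedged sketch of the RefactoringMiner mapping. It assumes RefactoringMiner's JSON output has been written to a file and follows its documented `commits[].refactorings[]` shape with `type` and `leftSideLocations` fields (verify against the version you run); it also assumes `node_metadata` was extended with a `line_number` entry:

```python
import json

# Load RefactoringMiner's JSON output.
with open('refminer_output.json') as f:
    rm = json.load(f)

# Collect (file, line-range, refactoring type) spans that were refactored.
refactored_spans = []
for commit in rm.get('commits', []):
    for ref in commit.get('refactorings', []):
        for loc in ref.get('leftSideLocations', []):
            refactored_spans.append(
                (loc['filePath'], loc['startLine'], loc['endLine'], ref['type']))

# Label a CPG node 1 if its file/line falls inside any refactored span.
labels = {}
for node_id, path in file_map.items():
    line = node_metadata[node_id].get('line_number')
    labels[node_id] = int(any(
        path.endswith(fp) and line is not None and start <= line <= end
        for fp, start, end, _type in refactored_spans))
```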
Add the labels to `data.y`:
```python
import torch
from torch_geometric.data import Data

# Example node features and labels (placeholders; adjust as needed)
node_features = torch.tensor([...])  # node features (e.g., embeddings)
node_labels = torch.tensor([...])    # labels for nodes or subgraphs

# Store the graph together with the node metadata and labels.
# edge_index is assumed to be built elsewhere from the CPG's edge list.
data = Data(x=node_features, edge_index=edge_index, y=node_labels,
            node_metadata=node_metadata, file_map=file_map)

# Save the Data object as a .pt file
torch.save(data, 'graph_with_metadata.pt')
```