The following curl command was run to gather the top 1,000 highest-starred Java repos from GitHub. Note: replacing `stars` with `forks` in the query lets us sort by the most-forked repositories instead.
echo "clone_url" > java_repos.csv
for page in {1..10}; do
curl -s "https://api.github.com/search/repositories?q=language:Java+stars:%3E1&sort=stars&order=desc&per_page=100&page=$page" \
| jq -r '.items[].clone_url'
done >> java_repos.csv
Install joern-lib to work with the generated CPGs from Python:

```bash
pip install git+https://github.com/appthreat/joern-lib.git
```
- Which code metrics derived from code property graphs are most effective at detecting SRP violations and code smells?
- How widespread are code smells (e.g., large methods, tightly coupled classes) across open-source Java projects?
- Can we identify project-level or organization-level trends in software quality based on aggregated CPG metrics?
If you go with Refactoring Opportunities via CPG Metrics, your main goal is to analyze graphs extracted from Java code (via Code Property Graphs) to detect Single Responsibility Principle (SRP) violations and other code smells.
Method complexity
- Why: High complexity often means a method does too much (SRP violation).
- How (see the sketch below):
  - Use NetworkX's `nx.dag_longest_path_length()` on the dataflow subgraph of each function.
  - Count the number of `if`, `for`, `while`, etc. nodes in the control flow (from the CPG).
  - Optionally use graph diameter or maximum out-degree.
- Outcome: Flag methods with complexity > threshold.
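A minimal sketch of these complexity measures, assuming each function's CPG has already been exported as NetworkX DiGraphs and that nodes carry a `type` attribute (the attribute name and the branch-type labels are assumptions, not Joern's exact schema):

```python
import networkx as nx

# Hypothetical control-structure labels; adjust to the real CPG export.
BRANCH_TYPES = {'IF', 'FOR', 'WHILE', 'SWITCH'}

def complexity_metrics(cfg: nx.DiGraph, dfg: nx.DiGraph) -> dict:
    """Approximate complexity from a function's control-flow and dataflow subgraphs."""
    # Number of branching nodes in the control flow graph.
    branches = sum(1 for _, t in cfg.nodes(data='type') if t in BRANCH_TYPES)

    # Longest dataflow chain; only defined for acyclic graphs, so guard it.
    longest_flow = (nx.dag_longest_path_length(dfg)
                    if nx.is_directed_acyclic_graph(dfg) else None)

    max_out = max((d for _, d in cfg.out_degree()), default=0)
    return {'branches': branches,
            'longest_dataflow_path': longest_flow,
            'max_out_degree': max_out}
```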
Coupling (fan-in / fan-out)
- Why: High fan-in/out indicates tight coupling or too many responsibilities.
- How (see the sketch below):
  - For a method node: `fan_in` = number of incoming CALL edges, `fan_out` = number of outgoing CALL edges.
  - Use NetworkX's `G.in_degree()` and `G.out_degree()`.
- Outcome: Rank methods/classes by coupling.
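A sketch of the coupling metric, assuming the whole-program CPG is a NetworkX MultiDiGraph whose edges carry a `label` attribute (the attribute and label names are assumptions):

```python
import networkx as nx

def call_coupling(cpg: nx.MultiDiGraph, method_node):
    """fan_in / fan_out of a method, counting only CALL edges."""
    fan_in = sum(1 for _, _, lbl in cpg.in_edges(method_node, data='label')
                 if lbl == 'CALL')
    fan_out = sum(1 for _, _, lbl in cpg.out_edges(method_node, data='label')
                  if lbl == 'CALL')
    return fan_in, fan_out
```

If you first project the CPG down to a pure call graph, `G.in_degree(n)` and `G.out_degree(n)` give the same numbers directly.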
Variable usage
- Why: A method that touches too many variables might be doing multiple jobs.
- How (see the sketch below):
  - Traverse dataflow edges (`REACHES`, `DEF`, `USE`).
  - Count unique variable nodes touched.
- Outcome: Flag methods that read/write too many distinct variables.
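A sketch of the variable-usage count under the same assumed NetworkX export (the `REACHES`/`DEF`/`USE` labels come from the list above; the `IDENTIFIER` node type and `name` attribute are assumptions):

```python
import networkx as nx

DATAFLOW_LABELS = {'REACHES', 'DEF', 'USE'}

def distinct_variables(cpg: nx.MultiDiGraph, method_nodes: set) -> int:
    """Count unique variable nodes a method touches via dataflow edges."""
    variables = set()
    for u, v, lbl in cpg.edges(data='label'):
        if lbl in DATAFLOW_LABELS and (u in method_nodes or v in method_nodes):
            for n in (u, v):
                if cpg.nodes[n].get('type') == 'IDENTIFIER':
                    variables.add(cpg.nodes[n].get('name', n))
    return len(variables)
```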
Cohesion
- Why: Low cohesion = different responsibilities mashed into one place.
- How (see the sketch below):
  - Build a similarity score between accessed variables within a method.
  - Or cluster nodes within a method and measure internal edge density vs. external edges.
- Outcome: Low cohesion score ⇒ suspected SRP violation.
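A sketch of the second option, edge-density cohesion, where `method_nodes` is the set of CPG nodes belonging to one method:

```python
import networkx as nx

def cohesion_score(cpg: nx.Graph, method_nodes: set) -> float:
    """Fraction of a method's incident edges that stay inside the method.
    Near 1.0 = self-contained; near 0.0 = mostly external coupling."""
    internal = external = 0
    for u, v in cpg.edges():
        u_in, v_in = u in method_nodes, v in method_nodes
        if u_in and v_in:
            internal += 1
        elif u_in or v_in:
            external += 1
    total = internal + external
    return internal / total if total else 1.0
```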
Class-level aggregation
- Why: You can detect entire classes violating SRP by aggregating method metrics.
- How (see the sketch below):
  - Compute averages and variances of the above metrics per class.
  - Flag classes with wide variance (some methods very complex, others not).
- Outcome: Classes with mixed-responsibility methods.
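A sketch of the aggregation step, assuming a `method_metrics` dict keyed by (class name, method name) holding metric dicts like those produced above:

```python
import statistics
from collections import defaultdict

def aggregate_by_class(method_metrics: dict, key: str = 'branches') -> dict:
    """Per-class mean and variance of one metric; high variance suggests
    simple and very complex methods coexisting in the same class."""
    per_class = defaultdict(list)
    for (cls, _method), metrics in method_metrics.items():
        per_class[cls].append(metrics[key])
    return {cls: {'mean': statistics.mean(vals),
                  'variance': statistics.pvariance(vals)}
            for cls, vals in per_class.items()}
```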
Visualization ideas (see the plotting sketch below):
- Histograms of method complexity.
- Scatter plot: fan-out vs. complexity.
- Heatmap: methods in a class vs. their metric scores.
- Table of “Top 10 Suspect Methods” with metrics.
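A quick plotting sketch for the first two ideas, assuming the metrics were dumped to a hypothetical `method_metrics.csv` with `complexity` and `fan_out` columns:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('method_metrics.csv')  # hypothetical output of the metric pass

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(df['complexity'], bins=30)
ax1.set(title='Method complexity', xlabel='complexity', ylabel='# methods')
ax2.scatter(df['fan_out'], df['complexity'], alpha=0.4)
ax2.set(title='Fan-out vs complexity', xlabel='fan-out', ylabel='complexity')
plt.tight_layout()
plt.show()
```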
You can even train a model like a GCN or GAT to predict SRP violations (label some examples yourself; see the sketch after this list):
- Node features: one-hot types, number of connections, data types used.
- Edge types: encode control flow, call, dataflow as multi-relation graphs.
- Target: 0 = healthy method, 1 = smells/SRP violation.
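A minimal PyTorch Geometric sketch of such a classifier (layer sizes and the two-class output are assumptions; a GAT variant would swap `GCNConv` for `GATConv`):

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class SRPClassifier(torch.nn.Module):
    """Two-layer GCN scoring each method node as healthy (0) or smelly (1)."""
    def __init__(self, num_features: int, hidden: int = 64):
        super().__init__()
        self.conv1 = GCNConv(num_features, hidden)
        self.conv2 = GCNConv(hidden, 2)  # logits for the two classes

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.5, training=self.training)
        return self.conv2(x, edge_index)  # train with cross_entropy
```

Note that plain `GCNConv` ignores edge types; for the multi-relation encoding above, PyG's `RGCNConv` is the closer fit.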
Remaining pipeline tasks:
- Add RefactoringMiner to the pipeline.
- Map RefactoringMiner's output to CPG metadata (a mapping sketch follows the code block below).
- Create `data.node_metadata`.
```python
# Example structure for node_metadata and file_map;
# cpg_nodes is assumed to be an iterable of parsed CPG nodes.
node_metadata = {}
file_map = {}

for node in cpg_nodes:
    # Process the raw CPG node (example: extracting the function name).
    node_metadata[node.id] = {
        'function_name': node.function_name,
        'method_signature': node.method_signature,
        # Add other metadata as needed (e.g., line numbers).
    }
    # Map each node to the source file it came from.
    file_map[node.id] = node.file_path
```
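A hedged sketch of the RefactoringMiner mapping. It assumes RefactoringMiner's JSON output has been written to a file and follows its documented `commits[].refactorings[]` shape with `type` and `leftSideLocations` fields (verify against the version you run); it also assumes `node_metadata` was extended with a `line_number` entry:

```python
import json

# Load RefactoringMiner's JSON output.
with open('refminer_output.json') as f:
    rm = json.load(f)

# Collect (file, line-range, refactoring type) spans that were refactored.
refactored_spans = []
for commit in rm.get('commits', []):
    for ref in commit.get('refactorings', []):
        for loc in ref.get('leftSideLocations', []):
            refactored_spans.append(
                (loc['filePath'], loc['startLine'], loc['endLine'], ref['type']))

# Label a CPG node 1 if its file/line falls inside any refactored span.
labels = {}
for node_id, path in file_map.items():
    line = node_metadata[node_id].get('line_number')
    labels[node_id] = int(any(
        path.endswith(fp) and line is not None and start <= line <= end
        for fp, start, end, _type in refactored_spans))
```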
Add the labels to `data.y`:
```python
import torch
from torch_geometric.data import Data

# Example node features and labels (placeholders; adjust as needed)
node_features = torch.tensor([...])  # node features (e.g., embeddings)
node_labels = torch.tensor([...])    # labels for nodes or subgraphs

# Store the graph together with the node metadata and labels.
# edge_index is assumed to be built elsewhere from the CPG's edge list.
data = Data(x=node_features, edge_index=edge_index, y=node_labels,
            node_metadata=node_metadata, file_map=file_map)

# Save the Data object as a .pt file
torch.save(data, 'graph_with_metadata.pt')
```