
jcs021102/Thesis


Thesis Artefact Methodology and README

Data Acquisition

Getting Top Repo URLs

The following curl command was run to gather the top 1000 highest-starred Java repositories from GitHub:

Note: replacing stars with forks in the search query lets us filter by the most-forked repositories instead.

echo "clone_url" > java_repos.csv
for page in {1..10}; do
  curl -s "https://api.github.com/search/repositories?q=language:Java+stars:%3E1&sort=stars&order=desc&per_page=100&page=$page" \
    | jq -r '.items[].clone_url'
done >> java_repos.csv
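Once the CSV exists, the repositories can be cloned locally for analysis. A minimal Python sketch of that step; the helper names `read_clone_urls` and `shallow_clone` are illustrative, not part of any library:

```python
import csv
import subprocess
from pathlib import Path

def read_clone_urls(csv_path: str) -> list[str]:
    """Parse the java_repos.csv produced by the curl loop above."""
    with open(csv_path, newline="") as fh:
        return [row["clone_url"] for row in csv.DictReader(fh)]

def shallow_clone(url: str, dest_dir: str) -> None:
    """Shallow-clone one repo; --depth 1 keeps disk usage manageable at 1000 repos."""
    target = Path(dest_dir) / Path(url).stem  # e.g. <dest_dir>/spring-boot
    subprocess.run(["git", "clone", "--depth", "1", url, str(target)], check=True)
```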

Install joern-lib, a Python client for Joern used here to extract Code Property Graphs:

pip install git+https://github.com/appthreat/joern-lib.git

Refactoring Opportunities via CPG Metrics

  1. Which code metrics derived from code property graphs are most effective at detecting SRP violations and code smells?

  2. How widespread are code smells (e.g., large methods, tightly coupled classes) across open-source Java projects?

  3. Can we identify project-level or organization-level trends in software quality based on aggregated CPG metrics?

The main goal of this approach is to analyze graphs extracted from Java code (via Code Property Graphs) to detect Single Responsibility Principle (SRP) violations and other code smells.


🔧 Metrics and How to Analyze Them

1. Method Complexity (Cyclomatic Complexity / Graph Depth)

  • Why: High complexity often means a method does too much (SRP violation).
  • How:
    • Use NetworkX’s nx.dag_longest_path_length() for dataflow subgraphs per function.
    • Count number of if, for, while, etc., nodes in control flow (from CPG).
    • Optionally use graph diameter or max out-degree.
  • Outcome: Flag methods with complexity > threshold.
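The complexity heuristics above can be sketched with NetworkX. The `kind` node attribute and the set of branch kinds are assumptions about how the CPG export labels control-flow nodes:

```python
import networkx as nx

BRANCH_KINDS = {"if", "for", "while", "switch"}  # assumed control-structure labels

def method_complexity(cfg: nx.DiGraph) -> int:
    """Approximate cyclomatic complexity: 1 + number of branch nodes."""
    return 1 + sum(1 for _, kind in cfg.nodes(data="kind") if kind in BRANCH_KINDS)

def dataflow_depth(dag: nx.DiGraph) -> int:
    """Longest path in an acyclic subgraph (the 'graph depth' metric)."""
    return nx.dag_longest_path_length(dag)

# Toy control-flow graph for a single method.
cfg = nx.DiGraph()
cfg.add_node("entry", kind="entry")
cfg.add_node("n1", kind="if")
cfg.add_node("n2", kind="for")
cfg.add_node("exit", kind="exit")
cfg.add_edges_from([("entry", "n1"), ("n1", "n2"), ("n2", "exit")])

print(method_complexity(cfg))  # 1 + two branch nodes = 3
print(dataflow_depth(cfg))     # longest path entry -> n1 -> n2 -> exit = 3 edges
```

Flagging then reduces to `method_complexity(cfg) > threshold` per method.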

2. Fan-in / Fan-out

  • Why: High fan-in/out indicates tight coupling or too many responsibilities.
  • How:
    • For a method node: fan_in = number of incoming CALL edges, fan_out = outgoing CALL edges.
    • Use NetworkX G.in_degree() and G.out_degree().
  • Outcome: Rank methods/classes by coupling.
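A toy call graph shows the degree-based coupling ranking; the method names are invented for illustration:

```python
import networkx as nx

# Toy call graph: each edge is a CALL relation between method nodes.
calls = nx.DiGraph()
calls.add_edges_from([
    ("Controller.handle", "Service.process"),
    ("Batch.run",         "Service.process"),
    ("Service.process",   "Repo.load"),
    ("Service.process",   "Repo.save"),
    ("Service.process",   "Mailer.send"),
])

fan_in  = dict(calls.in_degree())   # incoming CALL edges per method
fan_out = dict(calls.out_degree())  # outgoing CALL edges per method

# Rank methods by combined coupling (fan-in + fan-out).
ranked = sorted(calls.nodes, key=lambda m: fan_in[m] + fan_out[m], reverse=True)
print(ranked[0])  # Service.process: fan-in 2, fan-out 3
```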

3. Number of Data Flows Through a Method

  • Why: A method that touches too many variables might be doing multiple jobs.
  • How:
    • Traverse dataflow edges (REACHES, DEF, USE).
    • Count unique variable nodes touched.
  • Outcome: Flag methods that read/write too many distinct variables.
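A sketch of the variable-counting idea, assuming CPG edges arrive as `(source, edge_type, destination)` triples and that variable nodes carry a `var:` prefix (both are assumptions about the export format):

```python
# Hypothetical edge triples exported from the CPG for one method.
edges = [
    ("method:transfer", "DEF", "var:amount"),
    ("method:transfer", "USE", "var:amount"),
    ("method:transfer", "USE", "var:fromAccount"),
    ("method:transfer", "DEF", "var:toAccount"),
    ("method:transfer", "USE", "var:auditLog"),
]

def touched_variables(method: str, edges) -> set[str]:
    """Unique variable nodes a method defines, uses, or reaches."""
    return {dst for src, kind, dst in edges
            if src == method
            and kind in {"DEF", "USE", "REACHES"}
            and dst.startswith("var:")}

print(len(touched_variables("method:transfer", edges)))  # 4 distinct variables
```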

4. Cohesion (How Related Are Method Components?)

  • Why: Low cohesion = different responsibilities mashed into one place.
  • How:
    • Build a similarity score between accessed variables within a method.
    • Or cluster nodes within a method and measure internal edge density vs external edges.
  • Outcome: Low cohesion score ⇒ suspect SRP violation.
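The edge-density variant of cohesion can be sketched directly on an undirected NetworkX graph; the node names are placeholders for nodes inside and outside one method:

```python
import networkx as nx

def cohesion(g: nx.Graph, members: set[str]) -> float:
    """Internal edge density: internal edges / (internal + boundary edges)."""
    internal = sum(1 for u, v in g.edges if u in members and v in members)
    boundary = sum(1 for u, v in g.edges if (u in members) != (v in members))
    total = internal + boundary
    return internal / total if total else 0.0

g = nx.Graph()
g.add_edges_from([("a", "b"), ("b", "c"),    # edges among the method's own nodes
                  ("c", "x"), ("a", "y")])   # edges out to other methods
print(cohesion(g, {"a", "b", "c"}))  # 2 internal / 4 total = 0.5
```

A score near 0 means the method's nodes mostly connect outward, which is the SRP-suspect case.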

5. Class Responsibility Aggregation

  • Why: You can detect entire classes violating SRP by aggregating method metrics.
  • How:
    • Compute averages and variances of the above per class.
    • Flag classes with wide variance (some methods very complex, others not).
  • Outcome: Classes with mixed-responsibility methods.
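A sketch of the aggregation step using only the standard library; the class names, per-method scores, and variance threshold are invented for illustration:

```python
from statistics import mean, pvariance

# Per-method complexity scores grouped by class (hypothetical numbers).
class_metrics = {
    "OrderService": [2, 3, 2, 21],  # one outlier method among simple ones
    "PriceCalc":    [3, 4, 3, 4],   # uniformly simple
}

def flag_mixed_responsibility(class_metrics, var_threshold=20.0):
    """Flag classes whose per-method complexity variance exceeds a threshold."""
    flagged = []
    for cls, scores in class_metrics.items():
        if pvariance(scores) > var_threshold:
            flagged.append((cls, mean(scores), pvariance(scores)))
    return flagged

print([cls for cls, *_ in flag_mixed_responsibility(class_metrics)])  # ['OrderService']
```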

📊 Output Ideas

  • Histograms of method complexity.
  • Scatter plot: fan-out vs complexity.
  • Heatmap: methods in a class vs their metric scores.
  • Table of “Top 10 Suspect Methods” with metrics.
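The "Top 10 Suspect Methods" table can be produced by sorting per-method records; the records and the unweighted-sum score below are purely illustrative:

```python
# Hypothetical per-method metric records collected from the analyses above.
methods = [
    {"name": "Service.process", "complexity": 18, "fan_out": 9, "vars": 14},
    {"name": "Repo.load",       "complexity": 4,  "fan_out": 2, "vars": 3},
    {"name": "Batch.run",       "complexity": 11, "fan_out": 6, "vars": 8},
]

# Naive suspicion score: an unweighted sum of the raw metrics.
top = sorted(methods,
             key=lambda m: m["complexity"] + m["fan_out"] + m["vars"],
             reverse=True)[:10]

for m in top:
    print(f"{m['name']:<18} cc={m['complexity']:>3} fan_out={m['fan_out']:>2} vars={m['vars']:>3}")
```

In practice the metrics would need normalising before summing, since they live on different scales.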

📁 If You're Using PyTorch Geometric

You can even train a model like GCN or GAT to predict SRP violations (label some examples yourself):

  • Node features: one-hot types, number of connections, data types used.
  • Edge types: encode control flow, call, dataflow as multi-relation graphs.
  • Target: 0 = healthy method, 1 = smells/SRP violation.
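Before reaching for PyTorch Geometric, the feature encoding itself is plain Python; the node-type vocabulary below is an assumption about what the CPG export contains:

```python
NODE_TYPES = ["METHOD", "CALL", "IDENTIFIER", "CONTROL_STRUCTURE"]  # assumed vocab
EDGE_TYPES = {"AST": 0, "CFG": 1, "CALL": 2, "REACHES": 3}          # relation ids

def node_features(node_type: str, degree: int) -> list[float]:
    """One-hot node type plus degree, as described above."""
    one_hot = [1.0 if node_type == t else 0.0 for t in NODE_TYPES]
    return one_hot + [float(degree)]

print(node_features("CALL", 3))          # [0.0, 1.0, 0.0, 0.0, 3.0]
print([EDGE_TYPES[k] for k in ("CFG", "CALL")])  # per-edge relation ids: [1, 2]
```

Stacking these vectors per node gives `data.x`, and the relation ids give the edge-type tensor for a multi-relational GNN.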

Add RefactoringMiner to the pipeline, map its output to the CPG metadata, and create data.node_metadata:

# Example structure for node_metadata and file_map
node_metadata = {}
file_map = {}

for node in cpg_nodes:
    # Process the raw CPG node (example: extracting function name)
    metadata = {
        'function_name': node.function_name,
        'method_signature': node.method_signature,
        # Add other metadata as needed
    }
    node_metadata[node.id] = metadata

    # Create a file map (if necessary)
    file_map[node.id] = node.file_path  # Mapping nodes to their file paths

Add labels to data.y:

import torch
from torch_geometric.data import Data

# Example node features, labels, and connectivity (adjust as needed)
node_features = torch.tensor([...])  # Node features (e.g., embeddings)
node_labels = torch.tensor([...])    # Labels for nodes or subgraphs
edge_index = torch.tensor([...])     # [2, num_edges] COO edge list

# Create a Data object to store the graph along with node metadata and labels
data = Data(x=node_features, edge_index=edge_index, y=node_labels, 
            node_metadata=node_metadata, file_map=file_map)

# Save the data as a .pt file
torch.save(data, 'graph_with_metadata.pt')
