From 826069c90f2409d1b641cc7715f0323d8ead37d5 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sat, 2 Aug 2025 17:04:04 +0000 Subject: [PATCH 1/3] Initial plan From 76505843915692a94ec4d3229eb9273597bac099 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sat, 2 Aug 2025 17:20:59 +0000 Subject: [PATCH 2/3] Create comprehensive Musoq syntax documentation foundation Co-authored-by: Puchaczov <6973258+Puchaczov@users.noreply.github.com> --- .docs2/basic-query-structure.md | 174 +++++++++ .docs2/common-table-expressions.md | 553 ++++++++++++++++++++++++++++ .docs2/coupling-syntax.md | 443 ++++++++++++++++++++++ .docs2/index.md | 134 +++++++ .docs2/schema-data-source-syntax.md | 258 +++++++++++++ .docs2/table-definitions.md | 371 +++++++++++++++++++ 6 files changed, 1933 insertions(+) create mode 100644 .docs2/basic-query-structure.md create mode 100644 .docs2/common-table-expressions.md create mode 100644 .docs2/coupling-syntax.md create mode 100644 .docs2/index.md create mode 100644 .docs2/schema-data-source-syntax.md create mode 100644 .docs2/table-definitions.md diff --git a/.docs2/basic-query-structure.md b/.docs2/basic-query-structure.md new file mode 100644 index 00000000..14be9964 --- /dev/null +++ b/.docs2/basic-query-structure.md @@ -0,0 +1,174 @@ +# Basic Query Structure + +## Overview + +Musoq uses SQL-like syntax with some important extensions and modifications. Understanding the basic query structure is essential for effective use of the tool. 
+ +## Basic Query Anatomy + +The fundamental structure of a Musoq query follows this pattern: + +```sql +[table definitions] +[coupling statements] +[with cte_name as (subquery)] +select column_list +from data_source [alias] +[join_clause] +[where condition] +[group by column_list] +[having condition] +[order by column_list] +[skip number] +[take number] +``` + +## Key Differences from Standard SQL + +### 1. Data Source Syntax +Musoq uses the `#schema.method(parameters)` syntax to specify data sources: + +```sql +-- Query files in a directory +select Name, Length from #os.files('/path/to/directory', true) + +-- Query Git repository commits +select Sha, Message from #git.repository('/path/to/repo') r cross apply r.Commits c +``` + +### 2. Case Sensitivity +- **Column names and methods are case-sensitive** +- **Keywords are not case-sensitive** + +```sql +-- This works +SELECT Name, Length FROM #os.files('/path', true) + +-- This also works +select Name, Length from #os.files('/path', true) + +-- This will fail - wrong column case +select name, length from #os.files('/path', true) -- ERROR +``` + +### 3. Strict Typing +Queries are strictly typed - types must match exactly: + +```sql +-- This works - comparing string to string +select * from #os.files('/path', true) where Name = 'test.txt' + +-- This fails - cannot compare string to number +select * from #os.files('/path', true) where Name = 123 -- ERROR +``` + +### 4. 
Mandatory Aliases for Joins +When using joins with parameterizable sources, aliases are required: + +```sql +-- Correct - alias 'r' is used +select c.Sha from #git.repository('/path') r cross apply r.Commits c + +-- Incorrect - no alias +select Commits.Sha from #git.repository('/path') cross apply Commits -- ERROR +``` + +## Essential Query Components + +### SELECT Clause +Specifies which columns to return: + +```sql +-- Select specific columns +select Name, Length from #os.files('/path', true) + +-- Select all columns +select * from #os.files('/path', true) + +-- Select with expressions +select Name, Length / 1024 as SizeInKB from #os.files('/path', true) +``` + +### FROM Clause +Specifies the data source: + +```sql +-- Simple data source +from #os.files('/path/to/directory', true) + +-- Data source with alias +from #git.repository('/path/to/repo') r + +-- Multiple data sources with joins +from #os.files('/path', true) f +inner join #os.files('/other/path', true) o on f.Name = o.Name +``` + +### WHERE Clause +Filters rows based on conditions: + +```sql +-- Simple condition +where Length > 1000 + +-- Multiple conditions +where Length > 1000 and Extension = '.txt' + +-- Pattern matching +where Name like '%.log' +``` + +## Minimal Working Examples + +### 1. Simple File Listing +```sql +select Name, Length +from #os.files('/tmp', false) +``` + +### 2. Filtered Query with Expressions +```sql +select + Name, + Length / 1024 / 1024 as SizeMB +from #os.files('/home', true) +where Extension = '.pdf' and Length > 1000000 +order by Length desc +take 10 +``` + +### 3. 
Basic Aggregation +```sql +select + Extension, + count(*) as FileCount, + sum(Length) as TotalSize +from #os.files('/documents', true) +group by Extension +having count(*) > 5 +order by TotalSize desc +``` + +## Schema Discovery + +Use the `desc` command to discover available columns for any data source: + +```sql +desc #os.files('/path', true) +``` + +This returns a table showing column names and their data types. + +## Next Steps + +- Learn about [SELECT clause](./select-clause.md) details and expressions +- Understand [data sources](./from-clause-data-sources.md) and how to connect to different types of data +- Explore [filtering](./where-clause-filtering.md) with WHERE clauses + +## Common Gotchas + +1. **Case sensitivity** - Always match the exact case of column names +2. **Type matching** - Ensure types match in comparisons and joins +3. **Aliases required** - Use aliases when joining parameterizable sources +4. **Schema syntax** - Remember the `#schema.method()` format for data sources +5. **Boolean parameters** - Use `true`/`false`, not `1`/`0` for boolean parameters \ No newline at end of file diff --git a/.docs2/common-table-expressions.md b/.docs2/common-table-expressions.md new file mode 100644 index 00000000..76145c07 --- /dev/null +++ b/.docs2/common-table-expressions.md @@ -0,0 +1,553 @@ +# Common Table Expressions (CTEs) + +## Overview + +Common Table Expressions (CTEs) in Musoq create temporary named result sets that can be referenced within a query. CTEs improve query readability and enable complex, multi-step data transformations; recursive CTEs are not yet supported and are planned for a future release. 
+ +## Basic CTE Syntax + +### Simple CTE Structure +```sql +with cte_name as ( + select column1, column2 + from data_source + where condition +) +select * from cte_name; +``` + +### Multiple CTEs +```sql +with +first_cte as ( + select column1, column2 from source1 +), +second_cte as ( + select column3, column4 from source2 +) +select * from first_cte +union all +select * from second_cte; +``` + +## Basic CTE Examples + +### Simple Data Filtering +```sql +with large_files as ( + select Name, Length, Extension + from #os.files('/documents', true) + where Length > 1000000 +) +select + Extension, + count(*) as FileCount, + sum(Length) as TotalSize +from large_files +group by Extension +order by TotalSize desc; +``` + +### Data Transformation +```sql +with normalized_data as ( + select + Upper(Trim(Name)) as CleanName, + Length / 1024 / 1024 as SizeMB, + Extension + from #os.files('/projects', true) + where Extension in ('.cs', '.js', '.py') +) +select + CleanName, + SizeMB, + Extension, + case + when SizeMB < 1 then 'Small' + when SizeMB < 10 then 'Medium' + else 'Large' + end as SizeCategory +from normalized_data +order by SizeMB desc; +``` + +## Advanced CTE Patterns + +### Aggregation and Window Functions +```sql +with commit_stats as ( + select + c.AuthorEmail, + count(*) as CommitCount, + min(c.Date) as FirstCommit, + max(c.Date) as LastCommit + from #git.repository('/repo/path') r + cross apply r.Commits c + group by c.AuthorEmail +), +ranked_contributors as ( + select + AuthorEmail, + CommitCount, + FirstCommit, + LastCommit, + rank() over (order by CommitCount desc) as Rank + from commit_stats +) +select + AuthorEmail, + CommitCount, + FirstCommit, + LastCommit, + Rank +from ranked_contributors +where Rank <= 10; +``` + +### Complex Data Processing +```sql +with method_complexity as ( + select + p.Name as ProjectName, + c.Name as ClassName, + m.Name as MethodName, + m.CyclomaticComplexity, + m.LinesOfCode + from #csharp.solution('/solution/path.sln') s + cross 
apply s.Projects p + cross apply p.Documents d + cross apply d.Classes c + cross apply c.Methods m + where m.CyclomaticComplexity > 1 +), +project_summary as ( + select + ProjectName, + count(*) as MethodCount, + avg(CyclomaticComplexity) as AvgComplexity, + max(CyclomaticComplexity) as MaxComplexity, + sum(LinesOfCode) as TotalLinesOfCode + from method_complexity + group by ProjectName +) +select + ProjectName, + MethodCount, + round(AvgComplexity, 2) as AvgComplexity, + MaxComplexity, + TotalLinesOfCode, + case + when AvgComplexity > 10 then 'High Risk' + when AvgComplexity > 5 then 'Medium Risk' + else 'Low Risk' + end as RiskLevel +from project_summary +order by AvgComplexity desc; +``` + +## CTEs with Different Data Sources + +### File System Analysis +```sql +with file_analysis as ( + select + f.Name, + f.Length, + f.Extension, + f.DirectoryName, + case + when f.Extension in ('.jpg', '.png', '.gif') then 'Image' + when f.Extension in ('.mp4', '.avi', '.mov') then 'Video' + when f.Extension in ('.pdf', '.doc', '.txt') then 'Document' + else 'Other' + end as FileCategory + from #os.files('/media', true) f + where f.Length > 0 +), +category_stats as ( + select + FileCategory, + count(*) as FileCount, + sum(Length) as TotalSize, + avg(Length) as AvgSize + from file_analysis + group by FileCategory +) +select + FileCategory, + FileCount, + round(TotalSize / 1024.0 / 1024.0, 2) as TotalSizeMB, + round(AvgSize / 1024.0, 2) as AvgSizeKB +from category_stats +order by TotalSize desc; +``` + +### Git Repository Analysis +```sql +with author_activity as ( + select + c.AuthorEmail, + c.Date, + DatePart('year', c.Date) as Year, + DatePart('month', c.Date) as Month + from #git.repository('/repo/path') r + cross apply r.Commits c + where c.Date >= DateAdd('year', -1, GetDate()) +), +monthly_activity as ( + select + AuthorEmail, + Year, + Month, + count(*) as CommitCount + from author_activity + group by AuthorEmail, Year, Month +), +activity_trends as ( + select + 
AuthorEmail, + Year, + Month, + CommitCount, + lag(CommitCount, 1) over (partition by AuthorEmail order by Year, Month) as PrevMonthCommits + from monthly_activity +) +select + AuthorEmail, + Year, + Month, + CommitCount, + PrevMonthCommits, + case + when PrevMonthCommits is null then 'New' + when CommitCount > PrevMonthCommits then 'Increasing' + when CommitCount < PrevMonthCommits then 'Decreasing' + else 'Stable' + end as Trend +from activity_trends +where CommitCount > 5 +order by Year desc, Month desc, CommitCount desc; +``` + +## CTEs with Joins and Cross Apply + +### Multi-Source Analysis +```sql +with large_files as ( + select + FullName, + Name, + Length, + Extension + from #os.files('/projects', true) + where Length > 10000000 -- Files larger than 10MB +), +file_metadata as ( + select + lf.FullName, + lf.Name, + lf.Length, + lf.Extension, + m.TagName, + m.Description + from large_files lf + cross apply #os.metadata(lf.FullName) m + where lf.Extension in ('.jpg', '.png', '.tiff') +) +select + Name, + round(Length / 1024.0 / 1024.0, 2) as SizeMB, + TagName, + Description +from file_metadata +where TagName in ('Image Width', 'Image Height', 'Camera Make') +order by Length desc; +``` + +### Complex Data Relationships +```sql +with project_files as ( + select + p.Name as ProjectName, + d.Name as FileName, + d.LinesOfCode, + c.Name as ClassName + from #csharp.solution('/solution.sln') s + cross apply s.Projects p + cross apply p.Documents d + cross apply d.Classes c + where d.LinesOfCode > 100 +), +project_stats as ( + select + ProjectName, + count(distinct FileName) as FileCount, + count(distinct ClassName) as ClassCount, + sum(LinesOfCode) as TotalLinesOfCode, + avg(LinesOfCode) as AvgLinesPerFile + from project_files + group by ProjectName +), +solution_summary as ( + select + sum(FileCount) as TotalFiles, + sum(ClassCount) as TotalClasses, + sum(TotalLinesOfCode) as TotalLinesOfCode, + avg(AvgLinesPerFile) as OverallAvgLinesPerFile + from project_stats +) 
+select + ps.ProjectName, + ps.FileCount, + ps.ClassCount, + ps.TotalLinesOfCode, + round(ps.AvgLinesPerFile, 2) as AvgLinesPerFile, + round((ps.TotalLinesOfCode * 100.0) / ss.TotalLinesOfCode, 2) as PercentOfSolution +from project_stats ps +cross join solution_summary ss +order by ps.TotalLinesOfCode desc; +``` + +## Recursive CTEs (Future Feature) + +### Hierarchical Data Processing +```sql +-- Note: Recursive CTEs are planned for future releases +with recursive directory_tree as ( + -- Anchor: Start with root directory + select + Name, + FullName, + 0 as Level + from #os.directories('/root/path') + where ParentDirectory is null + + union all + + -- Recursive: Add child directories + select + d.Name, + d.FullName, + dt.Level + 1 + from #os.directories('/root/path') d + inner join directory_tree dt on d.ParentDirectory = dt.FullName + where dt.Level < 5 -- Prevent infinite recursion +) +select + replicate(' ', Level) + Name as IndentedName, + FullName, + Level +from directory_tree +order by FullName; +``` + +## Performance Optimization with CTEs + +### Efficient Data Processing +```sql +-- Pre-filter data to reduce processing overhead +with filtered_commits as ( + select + c.Sha, + c.AuthorEmail, + c.Date, + c.Message + from #git.repository('/large/repo') r + cross apply r.Commits c + where c.Date >= DateAdd('month', -6, GetDate()) -- Only recent commits + and c.AuthorEmail like '%@company.com' -- Only company emails +), +author_metrics as ( + select + AuthorEmail, + count(*) as CommitCount, + count(distinct DatePart('week', Date)) as ActiveWeeks + from filtered_commits + group by AuthorEmail + having count(*) > 10 -- Only active contributors +) +select + AuthorEmail, + CommitCount, + ActiveWeeks, + round(CommitCount / cast(ActiveWeeks as decimal), 2) as CommitsPerWeek +from author_metrics +order by CommitsPerWeek desc; +``` + +### Memory-Efficient Processing +```sql +-- Process large datasets in chunks +with batch_processing as ( + select + ((RowNumber() - 1) / 
1000) as BatchId, + Name, + Length, + Extension + from #os.files('/very/large/directory', true) + where Length > 0 +), +batch_summary as ( + select + BatchId, + count(*) as FileCount, + sum(Length) as TotalSize, + max(Length) as MaxSize + from batch_processing + group by BatchId +) +select + BatchId, + FileCount, + round(TotalSize / 1024.0 / 1024.0, 2) as TotalSizeMB, + round(MaxSize / 1024.0 / 1024.0, 2) as MaxSizeMB +from batch_summary +order by BatchId; +``` + +## Best Practices + +### 1. Meaningful CTE Names +```sql +-- Good - descriptive names +with large_image_files as (...), + image_metadata as (...), + processed_images as (...) + +-- Avoid - generic names +with temp1 as (...), + data as (...), + result as (...) +``` + +### 2. Logical Data Flow +```sql +-- Structure CTEs in logical processing order +with +raw_data as ( + -- Initial data extraction + select * from #source.data() +), +cleaned_data as ( + -- Data cleaning and normalization + select CleanField1, CleanField2 from raw_data where IsValid = true +), +enriched_data as ( + -- Add calculated fields + select *, CalculatedField from cleaned_data +), +final_result as ( + -- Final transformations + select FinalField1, FinalField2 from enriched_data +) +select * from final_result; +``` + +### 3. Appropriate Filtering +```sql +-- Filter early to improve performance +with filtered_source as ( + select * + from #large.dataset() + where RelevantField = @parameter -- Filter at source level + and Date >= @startDate +), +processed_data as ( + select ProcessedField + from filtered_source + -- Additional processing on smaller dataset +) +select * from processed_data; +``` + +## Error Handling and Troubleshooting + +### Common CTE Issues + +1. **CTE Not Found** +```sql +-- ERROR: Referencing undefined CTE +select * from undefined_cte; + +-- FIX: Define CTE first +with undefined_cte as (select 1 as value) +select * from undefined_cte; +``` + +2. 
**Circular References** +```sql +-- ERROR: CTEs cannot reference each other circularly +with cte1 as (select * from cte2), + cte2 as (select * from cte1) +select * from cte1; + +-- FIX: Remove circular dependency +with cte1 as (select * from #source.data()), + cte2 as (select * from cte1) +select * from cte2; +``` + +3. **Column Ambiguity** +```sql +-- ERROR: Ambiguous column names +with ambiguous as ( + select Name from #os.files('/path', true) + union all + select Name from #git.repository('/repo') r cross apply r.Files f +) +select Name from ambiguous; -- Which Name? + +-- FIX: Use aliases to clarify +with clarified as ( + select Name as FileName from #os.files('/path', true) + union all + select f.Name as GitFileName from #git.repository('/repo') r cross apply r.Files f +) +select FileName from clarified; +``` + +## Integration with Other Features + +### CTEs with Table Definitions +```sql +table ProcessedTable { + Id 'System.String', + ProcessedValue 'System.String' +}; + +couple #data.source with table ProcessedTable as DataSource; + +with processed_data as ( + select Id, ProcessedValue + from DataSource(@inputPath) + where ProcessedValue is not null +) +select * from processed_data +order by Id; +``` + +### CTEs with Cross Apply +```sql +with document_content as ( + select + f.Name as FileName, + f.GetFileContent() as Content + from #os.files('/documents', true) f + where f.Extension = '.txt' +) +select + dc.FileName, + w.Word, + count(*) as WordCount +from document_content dc +cross apply Split(dc.Content, ' ') w +group by dc.FileName, w.Word +having count(*) > 3 +order by count(*) desc; +``` + +## Next Steps + +- Learn about [joins](./join-operations.md) for combining data from multiple sources +- Explore [cross apply operations](./cross-outer-apply.md) for advanced data relationships +- See [practical examples](./examples-git-insights.md) of CTEs in real-world scenarios \ No newline at end of file diff --git a/.docs2/coupling-syntax.md 
b/.docs2/coupling-syntax.md new file mode 100644 index 00000000..24aaf3a8 --- /dev/null +++ b/.docs2/coupling-syntax.md @@ -0,0 +1,443 @@ +# Coupling Syntax + +## Overview + +Coupling syntax in Musoq provides a powerful way to bind custom table definitions to data sources, creating reusable, type-safe interfaces for data processing. This feature allows you to create strongly-typed abstractions over various data sources. + +## Basic Coupling Syntax + +### Standard Coupling Pattern +```sql +couple #schema.method with table TableName as AliasName; +``` + +**Components:** +- `couple` - Keyword to initiate coupling +- `#schema.method` - The data source schema and method +- `with table` - Connecting phrase +- `TableName` - Previously defined table structure +- `as AliasName` - Alias for the coupled source + +### Complete Example +```sql +-- 1. Define table structure +table PersonTable { + Name 'System.String', + Age 'System.Int32', + Email 'System.String' +}; + +-- 2. Couple with data source +couple #csv.reader with table PersonTable as PersonSource; + +-- 3. 
Use in queries +select Name, Age from PersonSource('/data/people.csv', true, 0); +``` + +## Coupling with Different Data Sources + +### CSV Files +```sql +table EmployeeTable { + EmployeeId 'System.Int32', + FirstName 'System.String', + LastName 'System.String', + Department 'System.String', + Salary 'System.Decimal' +}; + +couple #separatedvalues.csv with table EmployeeTable as EmployeeData; + +select + FirstName + ' ' + LastName as FullName, + Department, + Salary +from EmployeeData('/hr/employees.csv', true, 0) +where Salary > 50000; +``` + +### JSON Data +```sql +table ProductTable { + ProductId 'System.String', + Name 'System.String', + Price 'System.Decimal', + Category 'System.String', + InStock 'System.Boolean' +}; + +couple #json.objects with table ProductTable as ProductCatalog; + +select + Category, + count(*) as ProductCount, + avg(Price) as AvgPrice +from ProductCatalog('/data/products.json') +group by Category; +``` + +### Custom Data Sources +```sql +table ApiResponseTable { + Id 'System.String', + Status 'System.String', + Timestamp 'System.DateTime', + Data 'System.String' +}; + +couple #api.endpoint with table ApiResponseTable as ApiData; + +select Status, count(*) as StatusCount +from ApiData(@endpointUrl, @apiKey) +group by Status; +``` + +## Advanced Coupling Patterns + +### Multiple Couplings for Different Sources +```sql +-- Define common table structure +table CommonDataTable { + Id 'System.String', + Value 'System.String', + Category 'System.String' +}; + +-- Couple with multiple sources +couple #csv.reader with table CommonDataTable as CsvSource; +couple #json.reader with table CommonDataTable as JsonSource; +couple #xml.reader with table CommonDataTable as XmlSource; + +-- Use in union operations +select 'CSV' as Source, * from CsvSource('/data.csv', true, 0) +union all +select 'JSON' as Source, * from JsonSource('/data.json') +union all +select 'XML' as Source, * from XmlSource('/data.xml'); +``` + +### Parameterized Coupling +```sql 
+table ConfigurableTable { + Field1 'System.String', + Field2 'System.Object', + Field3 'System.Decimal?' +}; + +couple #dynamic.source with table ConfigurableTable as DynamicData; + +-- Parameters passed at query time +select * +from DynamicData(@sourcePath, @format, @encoding) +where Field3 is not null; +``` + +### Coupling with Archive Processing +```sql +table ArchiveContentTable { + FileName 'System.String', + Content 'System.String', + Size 'System.Int64' +}; + +couple #archives.content with table ArchiveContentTable as ArchiveData; + +-- Process files within archives +with ArchiveFiles as ( + select FileName, Content, Size + from ArchiveData('/path/to/archive.zip') + where FileName like '%.txt' +) +select + FileName, + Length(Content) as ContentLength, + Size +from ArchiveFiles +where Size > 1000; +``` + +## Type Safety and Schema Validation + +### Automatic Type Conversion +```sql +table TypedTable { + StringField 'System.String', + IntField 'System.Int32', + DecimalField 'System.Decimal', + BoolField 'System.Boolean' +}; + +couple #flexible.source with table TypedTable as TypedData; + +-- Musoq handles type conversion automatically +select + StringField, + IntField * 2 as DoubledInt, + DecimalField / 100 as Percentage, + case when BoolField then 'Yes' else 'No' end as BoolText +from TypedData(@dataSource); +``` + +### Nullable Field Handling +```sql +table OptionalFieldsTable { + RequiredId 'System.String', + OptionalName 'System.String?', + OptionalValue 'System.Decimal?' 
+}; + +couple #data.source with table OptionalFieldsTable as OptionalData; + +select + RequiredId, + coalesce(OptionalName, 'Unknown') as Name, + coalesce(OptionalValue, 0.0) as Value +from OptionalData(@sourcePath) +where RequiredId is not null; +``` + +## Integration with Query Features + +### Coupling with CTEs +```sql +table TransactionTable { + TransactionId 'System.String', + Amount 'System.Decimal', + Date 'System.DateTime', + Category 'System.String' +}; + +couple #financial.data with table TransactionTable as TransactionData; + +with MonthlyTotals as ( + select + Year(Date) as Year, + Month(Date) as Month, + Category, + sum(Amount) as MonthlyTotal + from TransactionData(@transactionFile) + group by Year(Date), Month(Date), Category +) +select + Year, + Month, + Category, + MonthlyTotal, + avg(MonthlyTotal) over (partition by Category) as AvgMonthlyForCategory +from MonthlyTotals +order by Year, Month, Category; +``` + +### Coupling with Cross Apply +```sql +table DocumentTable { + DocumentId 'System.String', + Content 'System.String', + Author 'System.String' +}; + +couple #documents.reader with table DocumentTable as DocumentData; + +select + d.DocumentId, + d.Author, + w.Word, + count(*) as WordCount +from DocumentData(@documentsPath) d +cross apply Split(d.Content, ' ') w +where Length(Trim(w.Word)) > 3 +group by d.DocumentId, d.Author, w.Word +having count(*) > 2 +order by count(*) desc; +``` + +### Coupling with Joins +```sql +table UserTable { + UserId 'System.String', + Name 'System.String', + Email 'System.String' +}; + +table OrderTable { + OrderId 'System.String', + UserId 'System.String', + Amount 'System.Decimal', + OrderDate 'System.DateTime' +}; + +couple #users.data with table UserTable as UserData; +couple #orders.data with table OrderTable as OrderData; + +select + u.Name, + u.Email, + count(o.OrderId) as OrderCount, + sum(o.Amount) as TotalSpent +from UserData(@usersFile) u +inner join OrderData(@ordersFile) o on u.UserId = o.UserId 
+group by u.UserId, u.Name, u.Email +having sum(o.Amount) > 1000 +order by sum(o.Amount) desc; +``` + +## Best Practices + +### 1. Descriptive Coupling Names +```sql +-- Good - descriptive and purpose-specific +couple #csv.reader with table EmployeeTable as EmployeeFromCsv; +couple #api.endpoint with table EmployeeTable as EmployeeFromApi; + +-- Avoid - generic names +couple #csv.reader with table EmployeeTable as Source1; +couple #api.endpoint with table EmployeeTable as Data; +``` + +### 2. Consistent Table Definitions +```sql +-- Define common interfaces for similar data +table StandardPersonTable { + Id 'System.String', + FirstName 'System.String', + LastName 'System.String', + Email 'System.String' +}; + +-- Reuse across multiple sources +couple #csv.reader with table StandardPersonTable as CsvPersons; +couple #json.reader with table StandardPersonTable as JsonPersons; +couple #api.endpoint with table StandardPersonTable as ApiPersons; +``` + +### 3. Error-Resilient Coupling +```sql +table RobustTable { + Id 'System.String', + Data 'System.String?', -- Optional field + Timestamp 'System.DateTime?', -- Optional timestamp + IsValid 'System.Boolean' -- Validation flag +}; + +couple #unreliable.source with table RobustTable as RobustData; + +select * +from RobustData(@source) +where IsValid = true + and Data is not null; +``` + +## Error Handling and Troubleshooting + +### Common Coupling Errors + +1. **Table Not Defined** +```sql +-- ERROR: UndefinedTable not declared +couple #csv.reader with table UndefinedTable as Source; + +-- FIX: Define table first +table UndefinedTable { Field 'System.String' }; +couple #csv.reader with table UndefinedTable as Source; +``` + +2. 
**Schema Mismatch** +```sql +table StrictTable { + ExactFieldName 'System.String' +}; + +-- May fail if CSV has different column names +couple #csv.reader with table StrictTable as StrictSource; + +-- Consider flexible approach +table FlexibleTable { + Field1 'System.String', + Field2 'System.String', + Field3 'System.String' +}; +``` + +3. **Type Conversion Errors** +```sql +table NumericTable { + NumericField 'System.Int32' +}; + +couple #csv.reader with table NumericTable as NumericSource; + +-- Handle potential conversion failures +select + case + when IsNumeric(NumericField) then ToInt32(NumericField) + else 0 + end as SafeNumericField +from NumericSource(@csvFile, true, 0); +``` + +### Debugging Strategies + +1. **Test Table Definitions Separately** +```sql +-- Verify table definition syntax +table TestTable { + Field1 'System.String', + Field2 'System.Int32' +}; +``` + +2. **Use DESC to Understand Source Schema** +```sql +desc #csv.reader('/test.csv', true, 0); +``` + +3. **Start with Simple Coupling** +```sql +-- Begin with single-field tables +table SimpleTable { + Data 'System.String' +}; + +couple #csv.reader with table SimpleTable as SimpleSource; +``` + +## Performance Considerations + +### Efficient Coupling Patterns +```sql +-- Pre-filter data in coupling when possible +table FilteredTable { + RelevantField 'System.String', + ImportantValue 'System.Decimal' +}; + +couple #large.dataset with table FilteredTable as FilteredData; + +-- Apply filters early +select * +from FilteredData(@largePath) +where ImportantValue > @threshold; -- Filter applied efficiently +``` + +### Memory-Efficient Processing +```sql +-- Stream processing for large datasets +table StreamingTable { + BatchId 'System.String', + Data 'System.String' +}; + +couple #streaming.source with table StreamingTable as StreamingData; + +-- Process in batches +select BatchId, count(*) as RecordCount +from StreamingData(@streamSource) +group by BatchId +having count(*) > 100; +``` + +## Next 
Steps + +- Learn about [cross apply operations](./cross-outer-apply.md) for advanced data relationships +- Explore [Common Table Expressions](./common-table-expressions.md) for complex query structures +- See [practical examples](./examples-data-transformation.md) of coupling in real-world scenarios \ No newline at end of file diff --git a/.docs2/index.md b/.docs2/index.md new file mode 100644 index 00000000..6049f679 --- /dev/null +++ b/.docs2/index.md @@ -0,0 +1,134 @@ +# Musoq: Comprehensive SQL Syntax Documentation + +## Introduction + +Musoq is a powerful SQL-like query engine that brings the familiarity of SQL to diverse data sources without requiring a traditional database. This comprehensive documentation covers all syntax elements, constructs, and features supported by Musoq, enabling you to query anything from filesystems and Git repositories to code structures and AI models using familiar SQL syntax. + +**Key Principles:** +- **One query language for everything** - Use SQL syntax across all data sources +- **Read-only by design** - Focus on querying and analysis, not data modification +- **Developer-friendly** - Pragmatic syntax extensions that simplify complex tasks +- **Strongly typed** - All queries are strictly typed with compile-time validation + +## Table of Contents + +### 1. Core SQL Syntax Elements +- [1.1 Basic Query Structure](./basic-query-structure.md) +- [1.2 SELECT Clause](./select-clause.md) +- [1.3 FROM Clause and Data Sources](./from-clause-data-sources.md) +- [1.4 WHERE Clause and Filtering](./where-clause-filtering.md) +- [1.5 ORDER BY Clause and Sorting](./order-by-clause-sorting.md) +- [1.6 GROUP BY Clause and Aggregation](./group-by-clause-aggregation.md) +- [1.7 HAVING Clause](./having-clause.md) + +### 2. 
Advanced Query Constructs +- [2.1 JOIN Operations](./join-operations.md) + - Inner Joins + - Cross Apply + - Outer Apply +- [2.2 Common Table Expressions (CTEs)](./common-table-expressions.md) +- [2.3 Set Operations](./set-operations.md) + - UNION + - EXCEPT + - INTERSECT +- [2.4 Subqueries and Nested Queries](./subqueries-nested-queries.md) + +### 3. Musoq-Specific Syntax Extensions +- [3.1 Schema and Data Source Syntax](./schema-data-source-syntax.md) +- [3.2 Table Definitions](./table-definitions.md) +- [3.3 Coupling Syntax](./coupling-syntax.md) +- [3.4 Cross Apply and Outer Apply](./cross-outer-apply.md) +- [3.5 SKIP and TAKE (Pagination)](./skip-take-pagination.md) +- [3.6 DESC Command for Schema Discovery](./desc-command-schema-discovery.md) + +### 4. Data Types and Type System +- [4.1 Supported Data Types](./supported-data-types.md) +- [4.2 Type Conversion and Casting](./type-conversion-casting.md) +- [4.3 Type Inference](./type-inference.md) +- [4.4 Nullable Types](./nullable-types.md) + +### 5. Expressions and Operators +- [5.1 Arithmetic Expressions](./arithmetic-expressions.md) +- [5.2 Comparison Operators](./comparison-operators.md) +- [5.3 Logical Operators](./logical-operators.md) +- [5.4 Bitwise Operations](./bitwise-operations.md) +- [5.5 String Operations](./string-operations.md) +- [5.6 Date and Time Operations](./date-time-operations.md) + +### 6. Built-in Functions +- [6.1 Aggregate Functions](./aggregate-functions.md) +- [6.2 String Functions](./string-functions.md) +- [6.3 Mathematical Functions](./mathematical-functions.md) +- [6.4 Date and Time Functions](./date-time-functions.md) +- [6.5 Type Conversion Functions](./type-conversion-functions.md) +- [6.6 Conditional Functions](./conditional-functions.md) + +### 7. Control Flow and Conditional Logic +- [7.1 CASE WHEN Expressions](./case-when-expressions.md) +- [7.2 NULL Handling](./null-handling.md) +- [7.3 IN and NOT IN Operators](./in-not-in-operators.md) + +### 8. 
Data Source Integration +- [8.1 File System Data Sources](./filesystem-data-sources.md) +- [8.2 Git Repository Data Sources](./git-repository-data-sources.md) +- [8.3 Code Analysis Data Sources](./code-analysis-data-sources.md) +- [8.4 AI and Machine Learning Integration](./ai-ml-integration.md) +- [8.5 Database Connectivity](./database-connectivity.md) +- [8.6 Custom Data Source Development](./custom-data-source-development.md) + +### 9. Advanced Features +- [9.1 Regular Expression Support](./regex-support.md) +- [9.2 JSON Path Extraction](./json-path-extraction.md) +- [9.3 Dynamic Schema Handling](./dynamic-schema-handling.md) +- [9.4 Error Handling and Debugging](./error-handling-debugging.md) +- [9.5 Performance Optimization](./performance-optimization.md) + +### 10. Best Practices and Patterns +- [10.1 Query Design Patterns](./query-design-patterns.md) +- [10.2 Performance Best Practices](./performance-best-practices.md) +- [10.3 Error Prevention](./error-prevention.md) +- [10.4 Code Style and Conventions](./code-style-conventions.md) + +### 11. Practical Examples and Use Cases +- [11.1 File System Analysis](./examples-filesystem-analysis.md) +- [11.2 Git Repository Insights](./examples-git-insights.md) +- [11.3 Code Quality Analysis](./examples-code-quality.md) +- [11.4 Data Transformation Tasks](./examples-data-transformation.md) +- [11.5 AI-Enhanced Analysis](./examples-ai-analysis.md) + +### 12. Reference +- [12.1 Complete Syntax Reference](./complete-syntax-reference.md) +- [12.2 Function Reference](./function-reference.md) +- [12.3 Data Source Reference](./data-source-reference.md) +- [12.4 Error Messages Reference](./error-messages-reference.md) +- [12.5 Migration Guide](./migration-guide.md) + +## Getting Started + +If you're new to Musoq, we recommend starting with: + +1. **[Basic Query Structure](./basic-query-structure.md)** - Learn the fundamentals +2. 
**[FROM Clause and Data Sources](./from-clause-data-sources.md)** - Understand how to connect to data +3. **[Practical Examples](./examples-filesystem-analysis.md)** - See real-world use cases + +For experienced SQL users, jump to: +- **[Musoq-Specific Syntax Extensions](./schema-data-source-syntax.md)** - Learn what makes Musoq unique +- **[Advanced Query Constructs](./join-operations.md)** - Leverage advanced features + +## Documentation Standards + +This documentation follows accessibility and usability best practices: + +- **Progressive disclosure** - Information is layered from basic to advanced +- **Task-oriented organization** - Content is structured around what you want to accomplish +- **Consistent terminology** - Technical terms are defined and used consistently +- **Comprehensive examples** - Every concept includes practical, working examples +- **Cross-references** - Related topics are linked for easy navigation + +## Contributing + +Found an error or want to improve the documentation? See our [contribution guidelines](./contributing.md) for how to help make these docs better. + +--- + +*This documentation covers Musoq's complete syntax and feature set. Each section builds upon previous concepts, so following the suggested reading order will provide the most comprehensive understanding.* \ No newline at end of file diff --git a/.docs2/schema-data-source-syntax.md b/.docs2/schema-data-source-syntax.md new file mode 100644 index 00000000..0926b81f --- /dev/null +++ b/.docs2/schema-data-source-syntax.md @@ -0,0 +1,258 @@ +# Schema and Data Source Syntax + +## Overview + +Musoq's power comes from its ability to query diverse data sources using a unified SQL-like syntax. The schema system provides a consistent interface to access files, Git repositories, databases, APIs, and more. + +## Data Source Syntax + +### Basic Schema Syntax +All data sources follow the pattern: +```sql +#schema.method(parameter1, parameter2, ...) 
+``` + +**Components:** +- `#` - Prefix identifying a schema +- `schema` - The schema name (e.g., `os`, `git`, `csharp`) +- `method` - The method within the schema (e.g., `files`, `repository`) +- `parameters` - Method-specific parameters in parentheses + +### Schema Discovery +Use the `desc` command to explore available columns: + +```sql +desc #os.files('/path', true) +desc #git.repository('/repo/path') +desc #csharp.solution('/path/to/solution.sln') +``` + +## Core Data Sources + +### Operating System (`#os`) + +#### Files and Directories +```sql +-- List files in directory +select Name, Length, Extension +from #os.files('/path/to/directory', recursive) + +-- Parameters: +-- - path: Directory path (string) +-- - recursive: Include subdirectories (boolean) +``` + +#### File Metadata +```sql +-- Get file metadata including EXIF data +select DirectoryName, TagName, Description +from #os.metadata('/path/to/file.jpg') +``` + +#### Directory Comparison +```sql +-- Compare two directories +select FullName, Status +from #os.dirscompare('/path/to/dir1', '/path/to/dir2') +``` + +### Git Repositories (`#git`) + +#### Repository Analysis +```sql +-- Access Git repository +select * from #git.repository('/path/to/repository') r + +-- Available sub-properties: +-- r.Commits - All commits +-- r.Branches - All branches +-- r.Tags - All tags +-- r.Files - Files in repository +``` + +#### Commit History +```sql +-- Query commit information +select Sha, Message, AuthorEmail, Date +from #git.repository('/repo/path') r +cross apply r.Commits c +``` + +#### Branch Information +```sql +-- List all branches +select Name, IsRemote, Tip +from #git.repository('/repo/path') r +cross apply r.Branches b +``` + +### C# Code Analysis (`#csharp`) + +#### Solution Analysis +```sql +-- Analyze entire solution +select * from #csharp.solution('/path/to/solution.sln') s + +-- Available sub-properties: +-- s.Projects - All projects +-- s.Projects.Documents - Source files +-- 
s.Projects.Documents.Classes - Classes in files +-- s.Projects.Documents.Classes.Methods - Methods in classes +``` + +#### Class and Method Analysis +```sql +-- Find complex methods +select + c.Name as ClassName, + m.Name as MethodName, + m.CyclomaticComplexity +from #csharp.solution('/path/solution.sln') s +cross apply s.Projects p +cross apply p.Documents d +cross apply d.Classes c +cross apply c.Methods m +where m.CyclomaticComplexity > 10 +``` + +### Artificial Intelligence (`#openai`, `#ollama`) + +#### OpenAI Integration +```sql +-- Use GPT models for analysis +select + gpt.Sentiment(Comment) as Sentiment, + Comment +from #flat.file('/comments.txt') f +inner join #openai.gpt('gpt-4') gpt on 1 = 1 +``` + +#### Local AI Models (Ollama) +```sql +-- Use local models +select + llama.DescribeImage(img.Base64File()) as Description, + img.Name +from #os.files('/images', false) img +inner join #ollama.models('llava:13b', 0.0) llama on 1 = 1 +where img.Extension in ('.jpg', '.png') +``` + +### File Formats + +#### CSV and Delimited Files +```sql +-- Query CSV files +select Column1, Column2, Column3 +from #separatedvalues.csv('/path/to/file.csv', hasHeaders, skipLines) + +-- Parameters: +-- - path: File path +-- - hasHeaders: First row contains headers (boolean) +-- - skipLines: Number of lines to skip (integer) +``` + +#### JSON Files +```sql +-- Query JSON data +select PropertyName, PropertyValue +from #json.file('/path/to/file.json') +``` + +#### Archive Files +```sql +-- Query contents of ZIP files +select Key, IsDirectory, Size +from #archives.file('/path/to/archive.zip') +``` + +### Database Connectivity + +#### Airtable +```sql +-- Query Airtable bases +select * from #airtable.table('baseId', 'tableName', 'apiKey') +``` + +## Parameter Types and Guidelines + +### String Parameters +- Always use single quotes: `'/path/to/file'` +- Escape a single quote inside a string by doubling it: `'Can''t stop'` + +### Boolean Parameters +- Use `true` or `false` (case-insensitive) +- 
**Not** `1`/`0` or `'true'`/`'false'` + +### Numeric Parameters +- Integers: `42`, `0`, `-10` +- Decimals: `3.14`, `0.5` + +### Example with Mixed Parameters +```sql +select Name, Size +from #os.files('/home/user/documents', true) -- string, boolean +where Size > 1000000 -- numeric comparison +``` + +## Advanced Schema Usage + +### Parameterized Sources with Variables +```sql +-- Using variables in paths (when supported by execution environment) +select * from #os.files(@DirectoryPath, @RecursiveFlag) +``` + +### Cross Apply with Schema Methods +```sql +-- Process each file's metadata +select + f.Name, + m.TagName, + m.Description +from #os.files('/photos', false) f +cross apply #os.metadata(f.FullName) m +where f.Extension = '.jpg' +``` + +### Multiple Schema Integration +```sql +-- Combine file system and AI analysis +select + f.Name, + ai.ExtractText(f.GetContent()) as ExtractedText +from #os.files('/documents', true) f +inner join #openai.gpt('gpt-4') ai on 1 = 1 +where f.Extension = '.pdf' +``` + +## Schema Aliases and Reusability + +While you cannot alias schemas themselves, you can alias their results: + +```sql +-- Alias the result of a schema method +select fs.Name, fs.Length +from #os.files('/path', true) fs +where fs.Extension = '.txt' +``` + +## Error Handling + +### Common Schema Errors +1. **Invalid schema name**: `#invalid.method()` - schema doesn't exist +2. **Invalid method name**: `#os.invalidmethod()` - method doesn't exist +3. **Wrong parameter count**: `#os.files('/path')` - missing required parameter +4. **Wrong parameter type**: `#os.files('/path', 'true')` - boolean expected, string provided + +### Debugging Tips +1. Use `desc` to verify available schemas and methods +2. Check parameter types and counts +3. Verify file paths exist and are accessible +4. 
Test with simple queries before building complex ones + +## Next Steps + +- Learn about [table definitions](./table-definitions.md) for custom schema coupling +- Explore [cross apply operations](./cross-outer-apply.md) for advanced data source relationships +- See [practical examples](./examples-filesystem-analysis.md) of schema usage in real scenarios \ No newline at end of file diff --git a/.docs2/table-definitions.md b/.docs2/table-definitions.md new file mode 100644 index 00000000..f1fc797c --- /dev/null +++ b/.docs2/table-definitions.md @@ -0,0 +1,371 @@ +# Table Definitions + +## Overview + +Table definitions in Musoq allow you to create custom table schemas that can be coupled with data sources. This feature enables type-safe data processing and provides a way to define structured interfaces for your data. + +## Basic Table Definition Syntax + +### Simple Table Definition +```sql +table TableName { + ColumnName 'DataType' +}; +``` + +### Multiple Column Table +```sql +table PersonTable { + Name 'System.String', + Age 'System.Int32', + Email 'System.String' +}; +``` + +## Supported Data Types + +### Primitive Types +```sql +table DataTypesExample { + StringColumn 'System.String', + IntColumn 'System.Int32', + LongColumn 'System.Int64', + DecimalColumn 'System.Decimal', + DoubleColumn 'System.Double', + FloatColumn 'System.Single', + BoolColumn 'System.Boolean', + DateTimeColumn 'System.DateTime', + DateTimeOffsetColumn 'System.DateTimeOffset', + TimeSpanColumn 'System.TimeSpan', + GuidColumn 'System.Guid' +}; +``` + +### Nullable Types +```sql +table NullableTypesExample { + RequiredName 'System.String', + OptionalAge 'System.Int32?', + OptionalEmail 'System.String?' 
+}; +``` + +### Collection Types +```sql +table CollectionExample { + Tags 'System.String[]', + Numbers 'System.Int32[]', + NestedData 'System.Object[]' +}; +``` + +## Coupling Tables with Data Sources + +### Basic Coupling Syntax +```sql +table UserTable { + Username 'System.String', + Email 'System.String' +}; + +couple #some.datasource with table UserTable as SourceOfUsers; + +select Username, Email from SourceOfUsers(); +``` + +### Complete Example with CSV Processing +```sql +-- Define table structure for CSV data +table EmployeeTable { + Name 'System.String', + Department 'System.String', + Salary 'System.Decimal', + HireDate 'System.DateTime' +}; + +-- Couple with CSV data source +couple #separatedvalues.csv with table EmployeeTable as EmployeeSource; + +-- Query using the coupled table +select + Name, + Department, + Salary, + HireDate +from EmployeeSource('/path/to/employees.csv', true, 0) +where Salary > 50000 +order by HireDate desc; +``` + +## Advanced Table Definition Patterns + +### Complex Data Processing Example +```sql +-- Define structure for JSON processing +table ProductTable { + ProductId 'System.Int32', + Name 'System.String', + Price 'System.Decimal', + Category 'System.String', + InStock 'System.Boolean' +}; + +couple #json.objects with table ProductTable as ProductSource; + +-- Query with aggregation +select + Category, + count(*) as ProductCount, + avg(Price) as AveragePrice, + sum(case when InStock then 1 else 0 end) as InStockCount +from ProductSource('/data/products.json') +group by Category +having count(*) > 5 +order by AveragePrice desc; +``` + +### Archive Processing with Table Definition +```sql +-- Define table for CSV files within archives +table PeopleDetails { + Name 'System.String', + Surname 'System.String', + Age 'System.Int32' +}; + +couple #separatedvalues.comma with table PeopleDetails as SourceOfPeopleDetails; + +-- Process CSV files from within ZIP archive +with Files as ( + select a.Key as InZipPath + from 
#archives.file('./archive.zip') a + where a.IsDirectory = false + and a.Contains(a.Key, '/') = false + and a.Key like '%.csv' +) +select + f.InZipPath, + b.Name, + b.Surname, + b.Age +from #archives.file('./archive.zip') a +inner join Files f on f.InZipPath = a.Key +cross apply SourceOfPeopleDetails(a.GetStreamContent(), true, 0) as b; +``` + +## Dynamic Schema Coupling + +### Runtime Table Definitions +```sql +-- Table definition can be created dynamically based on data discovery +table DynamicTable { + Field1 'System.String', + Field2 'System.Object', + Field3 'System.Decimal?' +}; + +couple #dynamic.source with table DynamicTable as DynamicSource; + +select * from DynamicSource(@runtimeParameter); +``` + +### Flexible Object Processing +```sql +-- Handle semi-structured data +table FlexibleTable { + Id 'System.String', + Data 'System.Object', + Metadata 'System.String?' +}; + +couple #json.lines with table FlexibleTable as FlexibleSource; + +select + Id, + Data, + Metadata +from FlexibleSource('/path/to/jsonlines.jsonl') +where Data is not null; +``` + +## Type Safety and Validation + +### Automatic Type Conversion +```sql +-- Musoq performs automatic type conversion when possible +table NumericTable { + StringNumber 'System.String', + IntNumber 'System.Int32' +}; + +couple #csv.source with table NumericTable as NumericSource; + +-- This query will attempt to convert string to int +select + StringNumber, + IntNumber + 100 as AdjustedNumber +from NumericSource('/numbers.csv', true, 0) +where ToInt32(StringNumber) > 0; -- Explicit conversion for validation +``` + +### Nullable Handling +```sql +table OptionalDataTable { + RequiredField 'System.String', + OptionalField 'System.String?', + NumericField 'System.Int32?' 
+}; + +couple #data.source with table OptionalDataTable as OptionalSource; + +select + RequiredField, + coalesce(OptionalField, 'Default Value') as SafeOptionalField, + coalesce(NumericField, 0) as SafeNumericField +from OptionalSource(@sourcePath) +where RequiredField is not null; +``` + +## Common Patterns and Best Practices + +### 1. Consistent Naming Conventions +```sql +-- Use PascalCase for table and column names +table UserProfileTable { + UserId 'System.Int32', + FirstName 'System.String', + LastName 'System.String', + CreatedDate 'System.DateTime' +}; +``` + +### 2. Appropriate Nullable Usage +```sql +-- Make optional fields nullable +table ContactInfoTable { + Name 'System.String', -- Required + Email 'System.String?', -- Optional + Phone 'System.String?', -- Optional + Address 'System.String?' -- Optional +}; +``` + +### 3. Descriptive Coupling Names +```sql +couple #api.endpoint with table ContactInfoTable as ContactInfoSource; +couple #csv.reader with table ContactInfoTable as CsvContactSource; +couple #json.parser with table ContactInfoTable as JsonContactSource; +``` + +## Error Handling and Troubleshooting + +### Common Table Definition Errors + +1. **Invalid Type Names** +```sql +-- Wrong +table BadTable { + Field 'string' -- ERROR: Use 'System.String' +}; + +-- Correct +table GoodTable { + Field 'System.String' +}; +``` + +2. **Missing Type Quotes** +```sql +-- Wrong +table BadTable { + Field System.String -- ERROR: Type must be quoted +}; + +-- Correct +table GoodTable { + Field 'System.String' +}; +``` + +3. **Coupling Mismatches** +```sql +-- Table and data source must have compatible schemas +table MismatchTable { + Field1 'System.String' +}; + +-- This may fail if CSV has different structure +couple #separatedvalues.csv with table MismatchTable as Source; +``` + +### Debugging Tips + +1. **Test table definitions separately** +```sql +-- Test the table definition first +table TestTable { + Name 'System.String' +}; +``` + +2. 
**Use desc to understand data source schema** +```sql +desc #separatedvalues.csv('/test.csv', true, 0); +``` + +3. **Start with simple coupling** +```sql +-- Begin with basic single-column tables +table SimpleTable { + Data 'System.String' +}; +``` + +## Integration with Other Features + +### With Common Table Expressions (CTEs) +```sql +table ProcessedTable { + OriginalData 'System.String', + ProcessedData 'System.String', + ProcessingDate 'System.DateTime' +}; + +couple #processing.engine with table ProcessedTable as ProcessingSource; + +with ProcessedData as ( + select + OriginalData, + ProcessedData, + ProcessingDate + from ProcessingSource(@inputPath) + where ProcessingDate > @cutoffDate +) +select * from ProcessedData +order by ProcessingDate desc; +``` + +### With Cross Apply Operations +```sql +table DocumentTable { + FileName 'System.String', + Content 'System.String' +}; + +couple #documents.reader with table DocumentTable as DocumentSource; + +select + d.FileName, + w.Word, + count(*) as WordCount +from DocumentSource(@documentsPath) d +cross apply Split(d.Content, ' ') w +group by d.FileName, w.Word +having count(*) > 5; +``` + +## Next Steps + +- Learn about [coupling syntax](./coupling-syntax.md) for advanced data source integration +- Explore [cross apply operations](./cross-outer-apply.md) for complex data relationships +- See [practical examples](./examples-data-transformation.md) of table definitions in action \ No newline at end of file From aa478254cc240515e6a2bb1f321c9955da83e017 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Thu, 7 Aug 2025 15:55:42 +0000 Subject: [PATCH 3/3] Add core SQL syntax documentation - SELECT, FROM, WHERE, ORDER BY, GROUP BY, and HAVING clauses Co-authored-by: Puchaczov <6973258+Puchaczov@users.noreply.github.com> --- .docs2/from-clause-data-sources.md | 418 ++++++++++++++++++++++ .docs2/group-by-clause-aggregation.md | 469 ++++++++++++++++++++++++ 
.docs2/having-clause.md | 404 +++++++++++++++++++++ .docs2/order-by-clause-sorting.md | 450 +++++++++++++++++++++++ .docs2/select-clause.md | 291 +++++++++++++++ .docs2/where-clause-filtering.md | 493 ++++++++++++++++++++++++++ 6 files changed, 2525 insertions(+) create mode 100644 .docs2/from-clause-data-sources.md create mode 100644 .docs2/group-by-clause-aggregation.md create mode 100644 .docs2/having-clause.md create mode 100644 .docs2/order-by-clause-sorting.md create mode 100644 .docs2/select-clause.md create mode 100644 .docs2/where-clause-filtering.md diff --git a/.docs2/from-clause-data-sources.md b/.docs2/from-clause-data-sources.md new file mode 100644 index 00000000..f7a50694 --- /dev/null +++ b/.docs2/from-clause-data-sources.md @@ -0,0 +1,418 @@ +# FROM Clause and Data Sources + +The `FROM` clause specifies the data source for your query. Musoq's power lies in its ability to query diverse data sources using a unified schema-based syntax. + +## Basic FROM Syntax + +```sql +select columns +from #schema.method(parameters) +``` + +The FROM clause in Musoq follows the pattern `#schema.method()` where: +- `#schema` identifies the data source type +- `method` specifies how to access the data +- `parameters` provide configuration for the data source + +## Core Data Sources + +### File System Sources + +#### Files +Query files in directories with flexible filtering: + +```sql +-- Query files in a directory (non-recursive) +select Name, Length, Extension +from #os.files('/path/to/directory', false) + +-- Query files recursively through subdirectories +select Name, Length, Extension, Directory +from #os.files('/path/to/directory', true) +``` + +#### Directories +Access directory information: + +```sql +-- List directories +select Name, CreationTime, LastWriteTime +from #os.directories('/path') + +-- Recursive directory listing +select Name, FullName, Parent +from #os.directories('/path', true) +``` + +### Git Repository Sources + +#### Repository Overview 
+Access Git repository metadata: + +```sql +-- Get repository information +select * +from #git.repository('/path/to/repo') +``` + +#### Commits +Query commit history with rich metadata: + +```sql +-- Access all commits +select c.Sha, c.Author.Name, c.Author.Email, c.Message, c.Date +from #git.repository('/repo') r +cross apply r.Commits c + +-- Filter commits by date range +select c.Sha, c.Message +from #git.repository('/repo') r +cross apply r.Commits c +where c.Date >= '2024-01-01' +``` + +#### Branches and Tags +Explore repository structure: + +```sql +-- List all branches +select b.Name, b.IsRemote, b.Tip.Sha +from #git.repository('/repo') r +cross apply r.Branches b + +-- List all tags +select t.Name, t.Target.Sha, t.Message +from #git.repository('/repo') r +cross apply r.Tags t +``` + +### Code Analysis Sources + +#### C# Solution Analysis +Analyze .NET code structure: + +```sql +-- Analyze entire solution +select p.Name as ProjectName, p.Language +from #csharp.solution('/path/to/solution.sln') s +cross apply s.Projects p + +-- Dive into code structure +select + c.Name as ClassName, + m.Name as MethodName, + m.LinesOfCode, + m.CyclomaticComplexity +from #csharp.solution('/project.sln') s +cross apply s.Projects p +cross apply p.Documents d +cross apply d.Classes c +cross apply c.Methods m +``` + +### Database Sources + +#### Direct Database Queries +Connect to external databases: + +```sql +-- Query PostgreSQL +select * +from #postgres.query('connection_string', 'SELECT * FROM users') + +-- Query SQLite +select * +from #sqlite.query('/path/to/db.sqlite', 'SELECT * FROM products') +``` + +### AI and Machine Learning Sources + +#### OpenAI Integration +Extract structured data using AI: + +```sql +-- Analyze text with AI +select * +from #openai.query('your-api-key', 'model-name', 'Your prompt here') + +-- Process images with vision models +select * +from #openai.vision('your-api-key', '/path/to/image.jpg', 'Describe this image') +``` + +## Data Source Parameters 
+ +### Path Parameters +Most data sources accept path parameters: + +```sql +-- Absolute paths +from #os.files('/home/user/documents', true) + +-- Relative paths (from current working directory) +from #os.files('./src', true) + +-- Windows paths +from #os.files('C:\\Users\\Username\\Documents', false) +``` + +### Boolean Flags +Control data source behavior: + +```sql +-- Recursive vs non-recursive +from #os.files('/path', true) -- Include subdirectories +from #os.files('/path', false) -- Current directory only + +-- Include hidden files +from #os.files('/path', true, true) -- Third parameter for hidden files +``` + +### Connection Strings +For database sources: + +```sql +-- PostgreSQL connection +from #postgres.query( + 'Host=localhost;Database=mydb;Username=user;Password=pass', + 'SELECT * FROM table' +) + +-- SQLite file +from #sqlite.query('/database/file.db', 'SELECT * FROM users') +``` + +## Advanced FROM Patterns + +### Multiple Data Sources with Joins + +Join data from different sources: + +```sql +-- Join files with Git commits +select + f.Name as FileName, + c.Author.Name as LastModifiedBy, + c.Date as LastCommitDate +from #os.files('/repo/src', true) f +inner join #git.repository('/repo') r on 1 = 1 +cross apply r.Commits c +where c.Message like '%' + f.Name + '%' +``` + +### Subqueries in FROM + +Use subqueries as data sources: + +```sql +-- Query derived data +select + Extension, + TotalFiles, + TotalSize +from ( + select + Extension, + Count(*) as TotalFiles, + Sum(Length) as TotalSize + from #os.files('/data', true) + group by Extension +) as FileStats +where TotalFiles > 10 +``` + +### Common Table Expressions (CTEs) + +Define reusable data sources: + +```sql +-- Use CTE for complex data preparation +with LargeFiles as ( + select Name, Length, Extension + from #os.files('/workspace', true) + where Length > 1000000 +), +FileStats as ( + select + Extension, + Count(*) as FileCount, + Avg(Length) as AvgSize + from LargeFiles + group by Extension +) 
+select * from FileStats +order by FileCount desc +``` + +## Schema Discovery + +### DESC Command +Explore available schemas and methods: + +```sql +-- List all available schemas +desc schemas + +-- Explore a specific schema +desc #os + +-- Get detailed information about a method +desc #os.files +``` + +### Dynamic Schema Exploration +Query schema metadata programmatically: + +```sql +-- Find available data sources +select SchemaName, MethodName, Description +from #schema.methods() +where SchemaName like '%git%' +``` + +## Data Source Coupling + +### Table Definitions +Define custom table structures: + +```sql +-- Define a custom table structure +table PersonTable { + Name 'System.String', + Age 'System.Int32', + Email 'System.String' +}; + +-- Couple with a data source +couple #csv.file('/data/people.csv') with table PersonTable as People; + +-- Query the coupled data +select Name, Age +from People +where Age > 25 +``` + +## Performance Considerations + +### Efficient Data Source Usage + +**Choose appropriate recursion levels:** +```sql +-- Efficient: Only when needed +from #os.files('/specific/path', false) + +-- Less efficient: Unnecessary recursion +from #os.files('/', true) +``` + +**Filter early in complex sources:** +```sql +-- Efficient: Filter at source level when possible +select c.Sha, c.Message +from #git.repository('/repo') r +cross apply r.Commits c +where c.Author.Email = 'user@example.com' +``` + +**Use appropriate batch sizes:** +```sql +-- Process large datasets in chunks +select Name, Length +from #os.files('/large/directory', true) +order by Length desc +take 1000 +``` + +## Error Handling + +### Common FROM Clause Errors + +**Invalid paths:** +```sql +-- This will fail if path doesn't exist +from #os.files('/nonexistent/path', true) +``` + +**Missing schema:** +```sql +-- This will fail if schema is not available +from #unknown.source() +``` + +**Incorrect parameters:** +```sql +-- This will fail with wrong parameter count +from #os.files() -- 
Missing required parameters +``` + +### Defensive Patterns +```sql +-- Check if directory exists before querying +select Name, Length +from #os.files('/possibly/missing/path', true) +where Directory is not null +``` + +## Best Practices + +### Data Source Selection +- **Start specific**: Begin with narrow data sources and expand as needed +- **Use appropriate recursion**: Only recurse when you need subdirectory data +- **Leverage schema capabilities**: Use built-in filtering and selection when available + +### Query Organization +- **Consistent aliasing**: Use clear, consistent aliases for data sources +- **Logical grouping**: Group related data sources in complex queries +- **Documentation**: Comment complex data source configurations + +### Performance Optimization +- **Minimize data retrieval**: Select only necessary columns early +- **Use efficient joins**: Prefer cross apply over complex joins when appropriate +- **Batch processing**: Use TAKE and SKIP for large datasets + +## Common Patterns + +### File Analysis Pattern +```sql +-- Standard file system analysis +select + Name, + Extension, + Length, + Directory, + CreationTime +from #os.files('/project', true) +where Extension in ('.cs', '.js', '.py') +``` + +### Git Repository Analysis Pattern +```sql +-- Comprehensive Git analysis +select + c.Sha, + c.Author.Name, + c.Message, + Count(*) over() as TotalCommits +from #git.repository('/repo') r +cross apply r.Commits c +where c.Date >= DateAdd('month', -6, GetDate()) +``` + +### Multi-Source Integration Pattern +```sql +-- Combine multiple data sources +select + 'File' as SourceType, + f.Name as Name, + f.Length as Size +from #os.files('/data', true) f +union all +select + 'Commit' as SourceType, + c.Sha as Name, + Len(c.Message) as Size +from #git.repository('/repo') r +cross apply r.Commits c +``` + +The FROM clause is where Musoq's versatility shines. 
Master these data source patterns to unlock powerful analysis capabilities across your entire development ecosystem. \ No newline at end of file diff --git a/.docs2/group-by-clause-aggregation.md b/.docs2/group-by-clause-aggregation.md new file mode 100644 index 00000000..b14fdd33 --- /dev/null +++ b/.docs2/group-by-clause-aggregation.md @@ -0,0 +1,469 @@ +# GROUP BY Clause and Aggregation + +The `GROUP BY` clause groups rows with similar values and enables aggregate calculations across those groups. This is essential for statistical analysis, summarization, and extracting insights from your data. + +## Basic GROUP BY Syntax + +```sql +select grouping_columns, aggregate_functions +from data_source +group by grouping_columns +``` + +The `GROUP BY` clause creates groups of rows that share the same values in the specified columns, then applies aggregate functions to calculate summary statistics for each group. + +## Simple Grouping + +### Single Column Grouping + +```sql +-- Count files by extension +select Extension, Count(*) as FileCount +from #os.files('/projects', true) +group by Extension + +-- Total size by file extension +select Extension, Sum(Length) as TotalSize +from #os.files('/data', true) +group by Extension +``` + +### Basic Aggregate Functions + +| Function | Description | Example | +|----------|-------------|---------| +| `Count(*)` | Number of rows in group | `Count(*) as RowCount` | +| `Count(column)` | Non-null values in column | `Count(Extension) as FilesWithExt` | +| `Sum(column)` | Total of numeric values | `Sum(Length) as TotalBytes` | +| `Avg(column)` | Average of numeric values | `Avg(Length) as AvgFileSize` | +| `Min(column)` | Smallest value in group | `Min(CreationTime) as OldestFile` | +| `Max(column)` | Largest value in group | `Max(Length) as LargestFile` | + +## Multiple Column Grouping + +### Hierarchical Grouping + +Group by multiple columns to create nested categories: + +```sql +-- File statistics by directory and extension +select 
+ Directory, + Extension, + Count(*) as FileCount, + Sum(Length) as TotalSize, + Avg(Length) as AvgSize +from #os.files('/workspace', true) +group by Directory, Extension +order by Directory, Extension + +-- Git commit analysis by author and month +select + c.Author.Name, + DatePart('year', c.Date) as Year, + DatePart('month', c.Date) as Month, + Count(*) as CommitCount, + Count(distinct c.Sha) as UniqueCommits +from #git.repository('/repo') r +cross apply r.Commits c +group by c.Author.Name, DatePart('year', c.Date), DatePart('month', c.Date) +order by c.Author.Name, Year, Month +``` + +## Complex Aggregations + +### Statistical Analysis + +Calculate comprehensive statistics for each group: + +```sql +-- Detailed file size analysis by extension +select + Extension, + Count(*) as FileCount, + Sum(Length) as TotalBytes, + Avg(Length) as AvgBytes, + Min(Length) as SmallestFile, + Max(Length) as LargestFile, + StdDev(Length) as SizeStdDev, + Round(Sum(Length) / 1024.0 / 1024.0, 2) as TotalMB +from #os.files('/analysis', true) +where Extension is not null +group by Extension +having Count(*) >= 5 -- Only extensions with 5+ files +order by Sum(Length) desc +``` + +### Time-Based Aggregations + +```sql +-- Git activity by month and author +select + DatePart('year', c.Date) as Year, + DatePart('month', c.Date) as Month, + c.Author.Name, + Count(*) as Commits, + Count(distinct DatePart('day', c.Date)) as ActiveDays, + Sum(Len(c.Message)) as TotalMessageChars, + Avg(Len(c.Message)) as AvgMessageLength +from #git.repository('/repo') r +cross apply r.Commits c +where c.Date >= DateAdd('year', -1, GetDate()) +group by + DatePart('year', c.Date), + DatePart('month', c.Date), + c.Author.Name +order by Year desc, Month desc, Commits desc +``` + +## Grouping by Expressions + +### Calculated Grouping Columns + +Group by computed values: + +```sql +-- Group files by size categories +select + case + when Length < 1024 then 'Small (< 1KB)' + when Length < 1048576 then 'Medium 
(1KB-1MB)' + when Length < 10485760 then 'Large (1MB-10MB)' + else 'Very Large (> 10MB)' + end as SizeCategory, + Count(*) as FileCount, + Round(Sum(Length) / 1024.0 / 1024.0, 2) as TotalMB +from #os.files('/data', true) +group by + case + when Length < 1024 then 'Small (< 1KB)' + when Length < 1048576 then 'Medium (1KB-1MB)' + when Length < 10485760 then 'Large (1MB-10MB)' + else 'Very Large (> 10MB)' + end +order by + case + when case + when Length < 1024 then 'Small (< 1KB)' + when Length < 1048576 then 'Medium (1KB-1MB)' + when Length < 10485760 then 'Large (1MB-10MB)' + else 'Very Large (> 10MB)' + end = 'Small (< 1KB)' then 1 + when case + when Length < 1024 then 'Small (< 1KB)' + when Length < 1048576 then 'Medium (1KB-1MB)' + when Length < 10485760 then 'Large (1MB-10MB)' + else 'Very Large (> 10MB)' + end = 'Medium (1KB-1MB)' then 2 + when case + when Length < 1024 then 'Small (< 1KB)' + when Length < 1048576 then 'Medium (1KB-1MB)' + when Length < 10485760 then 'Large (1MB-10MB)' + else 'Very Large (> 10MB)' + end = 'Large (1MB-10MB)' then 3 + else 4 + end +``` + +### Date-Based Grouping + +```sql +-- Group Git commits by day of week +select + case DatePart('weekday', c.Date) + when 1 then 'Sunday' + when 2 then 'Monday' + when 3 then 'Tuesday' + when 4 then 'Wednesday' + when 5 then 'Thursday' + when 6 then 'Friday' + when 7 then 'Saturday' + end as DayOfWeek, + Count(*) as CommitCount, + Count(distinct c.Author.Name) as UniqueAuthors +from #git.repository('/repo') r +cross apply r.Commits c +group by DatePart('weekday', c.Date) +order by DatePart('weekday', c.Date) + +-- Group files by creation hour +select + DatePart('hour', CreationTime) as Hour, + Count(*) as FilesCreated, + Round(Avg(Length), 0) as AvgSize +from #os.files('/logs', true) +where CreationTime >= DateAdd('month', -1, GetDate()) +group by DatePart('hour', CreationTime) +order by DatePart('hour', CreationTime) +``` + +## Advanced Aggregation Functions + +### String Aggregations + +```sql 
+-- Summarize each author's commit count and first/last commit dates (string concatenation is not shown; actual syntax may vary) +select + c.Author.Name, + Count(*) as CommitCount, + Max(c.Date) as LastCommit, + Min(c.Date) as FirstCommit +from #git.repository('/repo') r +cross apply r.Commits c +group by c.Author.Name +order by Count(*) desc +``` + +### Conditional Aggregations + +```sql +-- Count different types of files within each directory +select + Directory, + Count(*) as TotalFiles, + Sum(case when Extension = '.cs' then 1 else 0 end) as CSharpFiles, + Sum(case when Extension = '.js' then 1 else 0 end) as JavaScriptFiles, + Sum(case when Extension = '.css' then 1 else 0 end) as CSSFiles, + Sum(case when Extension in ('.jpg', '.png', '.gif') then 1 else 0 end) as ImageFiles, + Sum(case when Extension is null then 1 else 0 end) as FilesWithoutExtension +from #os.files('/project', true) +group by Directory +order by Directory +``` + +### Financial-Style Aggregations + +```sql +-- Split a total into two conditional sums, as one would for positive and negative values (conceptual example) +select + c.Author.Name, + Count(*) as TotalCommits, + Sum(case when Len(c.Message) > 50 then 1 else 0 end) as DetailedCommits, + Sum(case when Len(c.Message) <= 50 then 1 else 0 end) as BriefCommits, + Avg(Len(c.Message)) as AvgMessageLength +from #git.repository('/repo') r +cross apply r.Commits c +group by c.Author.Name +having Count(*) >= 10 +order by Count(*) desc +``` + +## Filtering Groups with HAVING + +### Basic HAVING Clause + +Filter groups based on aggregate conditions: + +```sql +-- Only show extensions with many files +select + Extension, + Count(*) as FileCount, + Sum(Length) as TotalSize +from #os.files('/large-project', true) +group by Extension +having Count(*) >= 100 -- Only extensions with 100+ files +order by Count(*) desc + +-- Active Git contributors only +select + c.Author.Name, + Count(*) as CommitCount, + Max(c.Date) as LastCommit +from #git.repository('/repo') r +cross apply r.Commits c +group by c.Author.Name +having Count(*) >= 10 -- At least 10 
commits + and Max(c.Date) >= DateAdd('month', -6, GetDate()) -- Recent activity +order by Count(*) desc +``` + +### Complex HAVING Conditions + +```sql +-- Directories with significant activity and size +select + Directory, + Count(*) as FileCount, + Sum(Length) as TotalBytes, + Round(Sum(Length) / 1024.0 / 1024.0, 2) as TotalMB, + Avg(Length) as AvgFileSize +from #os.files('/codebase', true) +group by Directory +having Count(*) >= 20 -- At least 20 files + and Sum(Length) >= 10485760 -- At least 10MB total + and Avg(Length) >= 1024 -- Average file >= 1KB +order by Sum(Length) desc + +-- Team productivity analysis +select + c.Author.Name, + Count(*) as TotalCommits, + Count(distinct DatePart('day', c.Date)) as ActiveDays, + Round(Cast(Count(*) as float) / Count(distinct DatePart('day', c.Date)), 2) as CommitsPerDay +from #git.repository('/repo') r +cross apply r.Commits c +where c.Date >= DateAdd('month', -3, GetDate()) +group by c.Author.Name +having Count(*) >= 15 -- At least 15 commits + and Count(distinct DatePart('day', c.Date)) >= 10 -- Commits on 10+ distinct days of the month + and Cast(Count(*) as float) / Count(distinct DatePart('day', c.Date)) >= 1.5 -- 1.5+ commits per distinct day +order by Count(*) desc +``` + +## Window Functions with Grouping + +### Ranking Within Groups + +```sql +-- Rank files by size within each directory (conceptual - window function support and syntax may vary) +select + Directory, + Name, + Length, + Row_Number() over (partition by Directory order by Length desc) as SizeRank, + Count(*) over (partition by Directory) as FilesInDirectory +from #os.files('/project', true) +where Row_Number() over (partition by Directory order by Length desc) <= 5 +order by Directory, SizeRank +``` + +## Performance Optimization + +### Efficient Grouping Strategies + +```sql +-- Efficient: Filter before grouping +select + Extension, + Count(*) as FileCount, + Sum(Length) as TotalSize +from #os.files('/large-dataset', true) +where Extension is not null -- Filter first + and Length > 0 -- Exclude empty files +group by Extension +having 
Count(*) >= 5 -- Then filter groups +order by Sum(Length) desc + +-- Less efficient: Grouping all data then filtering +select + Extension, + Count(*) as FileCount, + Sum(Length) as TotalSize +from #os.files('/large-dataset', true) +group by Extension +having Count(*) >= 5 + and Sum(Length) > 1048576 +order by Sum(Length) desc +``` + +### Memory-Conscious Grouping + +```sql +-- Use TAKE to limit result set size +select + Extension, + Count(*) as FileCount, + Round(Sum(Length) / 1024.0 / 1024.0, 2) as TotalMB +from #os.files('/huge-directory', true) +group by Extension +order by Count(*) desc +take 20 -- Only top 20 extensions +``` + +## Common Grouping Patterns + +### File System Analysis + +```sql +-- Storage analysis by file type +select + case + when Extension in ('.jpg', '.png', '.gif', '.bmp') then 'Images' + when Extension in ('.mp4', '.avi', '.mov', '.wmv') then 'Videos' + when Extension in ('.mp3', '.wav', '.flac') then 'Audio' + when Extension in ('.pdf', '.doc', '.docx', '.txt') then 'Documents' + when Extension in ('.zip', '.rar', '.7z') then 'Archives' + when Extension in ('.exe', '.msi', '.deb', '.rpm') then 'Executables' + else 'Other' + end as FileCategory, + Count(*) as FileCount, + Round(Sum(Length) / 1024.0 / 1024.0 / 1024.0, 2) as TotalGB, + Round(Avg(Length) / 1024.0 / 1024.0, 2) as AvgSizeMB +from #os.files('/storage', true) +group by + case + when Extension in ('.jpg', '.png', '.gif', '.bmp') then 'Images' + when Extension in ('.mp4', '.avi', '.mov', '.wmv') then 'Videos' + when Extension in ('.mp3', '.wav', '.flac') then 'Audio' + when Extension in ('.pdf', '.doc', '.docx', '.txt') then 'Documents' + when Extension in ('.zip', '.rar', '.7z') then 'Archives' + when Extension in ('.exe', '.msi', '.deb', '.rpm') then 'Executables' + else 'Other' + end +order by Sum(Length) desc +``` + +### Git Repository Insights + +```sql +-- Developer productivity over time +select + c.Author.Name, + DatePart('year', c.Date) as Year, + DatePart('quarter', 
c.Date) as Quarter, + Count(*) as Commits, + Count(distinct DatePart('day', c.Date)) as ActiveDays, + Round(Avg(Len(c.Message)), 1) as AvgMessageLength +from #git.repository('/repo') r +cross apply r.Commits c +where c.Date >= DateAdd('year', -2, GetDate()) +group by + c.Author.Name, + DatePart('year', c.Date), + DatePart('quarter', c.Date) +having Count(*) >= 5 +order by c.Author.Name, Year, Quarter +``` + +### Code Quality Analysis + +```sql +-- Complexity analysis by project and class +select + p.Name as ProjectName, + Count(distinct c.Name) as ClassCount, + Count(m.Name) as MethodCount, + Round(Avg(m.CyclomaticComplexity), 2) as AvgComplexity, + Max(m.CyclomaticComplexity) as MaxComplexity, + Sum(m.LinesOfCode) as TotalLinesOfCode +from #csharp.solution('/solution.sln') s +cross apply s.Projects p +cross apply p.Documents d +cross apply d.Classes c +cross apply c.Methods m +group by p.Name +order by Avg(m.CyclomaticComplexity) desc +``` + +## Best Practices + +### Design Principles +- **Start simple**: Begin with basic grouping, then add complexity +- **Meaningful groups**: Group by columns that create logical categories +- **Appropriate aggregations**: Choose aggregate functions that match your analysis goals +- **Filter effectively**: Use WHERE before GROUP BY and HAVING after GROUP BY + +### Performance Guidelines +- **Filter before grouping**: Reduce data size before applying GROUP BY +- **Limit result sets**: Use TAKE when you don't need all groups +- **Efficient expressions**: Avoid complex calculations in GROUP BY when possible +- **Index considerations**: Group by columns that can be efficiently sorted + +### Readability Guidelines +- **Clear column names**: Use meaningful aliases for aggregate columns +- **Logical ordering**: Order results in a way that supports analysis +- **Consistent formatting**: Maintain consistent patterns across similar grouping queries +- **Document complex logic**: Comment business rules embedded in grouping expressions + +The 
GROUP BY clause is fundamental to data analysis and reporting. Master these patterns to transform raw data into meaningful insights and summaries that drive decision-making. \ No newline at end of file diff --git a/.docs2/having-clause.md b/.docs2/having-clause.md new file mode 100644 index 00000000..2a64c171 --- /dev/null +++ b/.docs2/having-clause.md @@ -0,0 +1,404 @@ +# HAVING Clause + +The `HAVING` clause filters groups created by `GROUP BY` based on aggregate conditions. While `WHERE` filters individual rows before grouping, `HAVING` filters groups after aggregation calculations are complete. + +## Basic HAVING Syntax + +```sql +select columns, aggregate_functions +from data_source +group by columns +having aggregate_condition +``` + +The `HAVING` clause is evaluated after the `GROUP BY` clause has created groups and aggregate functions have been calculated. + +## Fundamental Concepts + +### WHERE vs HAVING + +Understanding the difference between `WHERE` and `HAVING` is crucial: + +```sql +-- WHERE filters rows before grouping +select Extension, Count(*) as FileCount +from #os.files('/data', true) +where Length > 1024 -- Filter files before grouping +group by Extension +having Count(*) >= 10 -- Filter groups after aggregation + +-- This query: +-- 1. Filters files larger than 1KB (WHERE) +-- 2. Groups remaining files by extension +-- 3. Only shows extensions with 10+ files (HAVING) +``` + +### Execution Order + +SQL clauses execute in this order: +1. `FROM` - Identify data source +2. `WHERE` - Filter individual rows +3. `GROUP BY` - Create groups +4. `HAVING` - Filter groups +5. `SELECT` - Choose columns +6. 
`ORDER BY` - Sort results + +## Basic HAVING Conditions + +### Count-Based Filtering + +Filter groups by the number of items they contain: + +```sql +-- Extensions with many files +select Extension, Count(*) as FileCount +from #os.files('/project', true) +group by Extension +having Count(*) >= 50 +order by Count(*) desc + +-- Active Git contributors +select + c.Author.Name, + Count(*) as CommitCount +from #git.repository('/repo') r +cross apply r.Commits c +group by c.Author.Name +having Count(*) >= 20 -- Only authors with 20+ commits +order by Count(*) desc +``` + +### Sum-Based Filtering + +Filter groups by total values: + +```sql +-- Directories consuming significant disk space +select + Directory, + Count(*) as FileCount, + Sum(Length) as TotalBytes, + Round(Sum(Length) / 1024.0 / 1024.0, 2) as TotalMB +from #os.files('/workspace', true) +group by Directory +having Sum(Length) >= 10485760 -- At least 10MB +order by Sum(Length) desc + +-- File types using substantial storage +select + Extension, + Count(*) as FileCount, + Round(Sum(Length) / 1024.0 / 1024.0 / 1024.0, 2) as TotalGB +from #os.files('/storage', true) +group by Extension +having Sum(Length) >= 1073741824 -- At least 1GB +order by Sum(Length) desc +``` + +### Average-Based Filtering + +Filter groups by average values: + +```sql +-- File types with large average sizes +select + Extension, + Count(*) as FileCount, + Round(Avg(Length) / 1024.0 / 1024.0, 2) as AvgSizeMB +from #os.files('/media', true) +group by Extension +having Avg(Length) >= 1048576 -- Average size >= 1MB + and Count(*) >= 5 -- At least 5 files +order by Avg(Length) desc + +-- Git authors with detailed commit messages +select + c.Author.Name, + Count(*) as CommitCount, + Round(Avg(Len(c.Message)), 1) as AvgMessageLength +from #git.repository('/repo') r +cross apply r.Commits c +group by c.Author.Name +having Avg(Len(c.Message)) >= 100 -- Average message >= 100 chars + and Count(*) >= 10 -- At least 10 commits +order by 
Avg(Len(c.Message)) desc +``` + +## Complex HAVING Conditions + +### Multiple Aggregate Conditions + +Combine multiple aggregate functions in HAVING: + +```sql +-- Significant directories with many diverse files +select + Directory, + Count(*) as FileCount, + Count(distinct Extension) as UniqueExtensions, + Sum(Length) as TotalBytes, + Round(Avg(Length), 0) as AvgFileSize +from #os.files('/codebase', true) +group by Directory +having Count(*) >= 50 -- At least 50 files + and Count(distinct Extension) >= 5 -- At least 5 different file types + and Sum(Length) >= 5242880 -- At least 5MB total + and Avg(Length) >= 1024 -- Average file >= 1KB +order by Count(*) desc, Sum(Length) desc +``` + +### Logical Operators in HAVING + +Use AND, OR, and NOT to create complex conditions: + +```sql +-- Active teams or highly productive individuals +select + c.Author.Name, + Count(*) as CommitCount, + Count(distinct DatePart('day', c.Date)) as ActiveDays, + Max(c.Date) as LastCommit +from #git.repository('/repo') r +cross apply r.Commits c +where c.Date >= DateAdd('month', -6, GetDate()) +group by c.Author.Name +having (Count(*) >= 100 and Count(distinct DatePart('day', c.Date)) >= 30) -- Very active + or (Count(*) >= 50 and Count(distinct DatePart('day', c.Date)) >= 40) -- Consistent + or (Count(*) >= 200) -- High volume +order by Count(*) desc +``` + +### Range Conditions + +Filter groups within specific ranges: + +```sql +-- Medium-sized directories (not too small, not too large) +select + Directory, + Count(*) as FileCount, + Round(Sum(Length) / 1024.0 / 1024.0, 2) as TotalMB +from #os.files('/balanced-project', true) +group by Directory +having Count(*) between 10 and 100 -- 10-100 files + and Sum(Length) between 1048576 and 52428800 -- 1MB-50MB +order by Count(*) desc +``` + +## Advanced HAVING Patterns + +### Conditional Aggregation in HAVING + +Use CASE statements within aggregate functions: + +```sql +-- Directories with good balance of code vs other files +select + 
Directory, + Count(*) as TotalFiles, + Sum(case when Extension in ('.cs', '.js', '.py', '.java') then 1 else 0 end) as CodeFiles, + Sum(case when Extension not in ('.cs', '.js', '.py', '.java') then 1 else 0 end) as OtherFiles +from #os.files('/mixed-project', true) +group by Directory +having Count(*) >= 20 -- At least 20 files + and Sum(case when Extension in ('.cs', '.js', '.py', '.java') then 1 else 0 end) >= 10 -- At least 10 code files + and Sum(case when Extension in ('.cs', '.js', '.py', '.java') then 1 else 0 end) * 100.0 / Count(*) >= 50 -- At least 50% code +order by Directory +``` + +### Statistical Filtering + +Filter based on statistical measures: + +```sql +-- File types with consistent sizes (low standard deviation) +select + Extension, + Count(*) as FileCount, + Round(Avg(Length), 0) as AvgSize, + Round(StdDev(Length), 0) as SizeStdDev, + Round(StdDev(Length) / Avg(Length) * 100, 2) as CoefficientOfVariation +from #os.files('/consistent-data', true) +group by Extension +having Count(*) >= 20 -- At least 20 files + and StdDev(Length) / Avg(Length) <= 0.5 -- Low variation (CV <= 50%) +order by StdDev(Length) / Avg(Length) + +-- Note: StdDev function availability may vary by implementation +``` + +## Performance Considerations + +### Efficient HAVING Usage + +```sql +-- Efficient: Filter with WHERE first, then HAVING +select + Extension, + Count(*) as FileCount, + Sum(Length) as TotalSize +from #os.files('/large-dataset', true) +where Length > 0 -- Filter empty files first (WHERE) + and Extension is not null -- Filter files without extension first +group by Extension +having Count(*) >= 100 -- Then filter groups (HAVING) +order by Sum(Length) desc + +-- Less efficient: Only using HAVING +select + Extension, + Count(*) as FileCount, + Sum(Length) as TotalSize +from #os.files('/large-dataset', true) +group by Extension +having Count(*) >= 100 + and Sum(case when Length > 0 then 1 else 0 end) = Count(*) -- Complex condition +order by Sum(Length) desc 
+``` + +### Memory-Conscious Filtering + +```sql +-- Limit groups before expensive calculations +select + Directory, + Count(*) as FileCount, + Sum(Length) as TotalBytes +from #os.files('/huge-filesystem', true) +group by Directory +having Count(*) >= 1000 -- Only large directories +order by Sum(Length) desc +take 50 -- Limit final results +``` + +## Common HAVING Patterns + +### Top N Analysis + +Find the most significant groups: + +```sql +-- Top file types by count and size +select + Extension, + Count(*) as FileCount, + Round(Sum(Length) / 1024.0 / 1024.0, 2) as TotalMB, + Round(Avg(Length) / 1024.0, 2) as AvgKB +from #os.files('/analysis', true) +group by Extension +having Count(*) >= 10 -- Minimum threshold +order by Count(*) desc, Sum(Length) desc +take 15 -- Top 15 extensions + +-- Most active Git contributors in recent months +select + c.Author.Name, + Count(*) as RecentCommits, + Max(c.Date) as LastCommit, + DateDiff('day', Min(c.Date), Max(c.Date)) as ActiveSpan +from #git.repository('/repo') r +cross apply r.Commits c +where c.Date >= DateAdd('month', -6, GetDate()) +group by c.Author.Name +having Count(*) >= 25 -- At least 25 commits +order by Count(*) desc +take 10 -- Top 10 contributors +``` + +### Quality Thresholds + +Identify groups meeting quality criteria: + +```sql +-- Well-maintained code modules (frequent commits, recent activity) +select + Substring(c.Message, 1, 20) as CommitPrefix, + Count(*) as CommitCount, + Count(distinct c.Author.Name) as Contributors, + Max(c.Date) as LastCommit, + DateDiff('day', Max(c.Date), GetDate()) as DaysSinceLastCommit +from #git.repository('/repo') r +cross apply r.Commits c +where c.Date >= DateAdd('month', -12, GetDate()) +group by Substring(c.Message, 1, 20) +having Count(*) >= 15 -- Active development + and Count(distinct c.Author.Name) >= 2 -- Multiple contributors + and DateDiff('day', Max(c.Date), GetDate()) <= 30 -- Recent activity +order by Count(*) desc + +-- Directories with balanced file 
distribution +select + Directory, + Count(*) as TotalFiles, + Count(distinct Extension) as FileTypes, + Round(Count(distinct Extension) * 100.0 / Count(*), 2) as DiversityPercent +from #os.files('/project', true) +group by Directory +having Count(*) >= 20 -- Significant size + and Count(distinct Extension) >= 3 -- Multiple file types + and Count(distinct Extension) * 100.0 / Count(*) >= 15 -- Good diversity (15%+) +order by Count(*) desc +``` + +### Outlier Detection + +Find groups that deviate from normal patterns: + +```sql +-- Unusually large files by type +select + Extension, + Count(*) as FileCount, + Round(Avg(Length) / 1024.0 / 1024.0, 2) as AvgSizeMB, + Round(Max(Length) / 1024.0 / 1024.0, 2) as MaxSizeMB, + Round(Max(Length) / Avg(Length), 2) as SizeRatio +from #os.files('/outlier-analysis', true) +group by Extension +having Count(*) >= 10 -- Enough samples + and Max(Length) / Avg(Length) >= 5 -- Largest file is 5x+ average +order by Max(Length) / Avg(Length) desc + +-- Git authors with unusual commit patterns +select + c.Author.Name, + Count(*) as TotalCommits, + Round(Avg(Len(c.Message)), 1) as AvgMessageLength, + Max(Len(c.Message)) as LongestMessage, + Min(Len(c.Message)) as ShortestMessage +from #git.repository('/repo') r +cross apply r.Commits c +group by c.Author.Name +having Count(*) >= 20 -- Significant activity + and (Max(Len(c.Message)) >= 500 -- Very long messages + or Min(Len(c.Message)) <= 10 -- Very short messages + or Max(Len(c.Message)) / Avg(Len(c.Message)) >= 5) -- High variation +order by Count(*) desc +``` + +## Best Practices + +### Design Guidelines +- **Use WHERE first**: Filter individual rows before grouping when possible +- **Meaningful thresholds**: Set HAVING conditions based on business requirements +- **Combine conditions logically**: Use AND/OR appropriately to create clear filter logic +- **Consider data distribution**: Set thresholds that make sense for your data + +### Performance Guidelines +- **Filter early**: Use WHERE 
to reduce data before GROUP BY operations +- **Simple expressions**: Keep HAVING conditions as simple as possible +- **Limit results**: Use TAKE after HAVING to control result set size +- **Appropriate aggregates**: Choose efficient aggregate functions for your filtering needs + +### Readability Guidelines +- **Clear conditions**: Write HAVING conditions that clearly express business rules +- **Consistent formatting**: Align complex HAVING conditions for readability +- **Meaningful thresholds**: Use round numbers or business-meaningful values when possible +- **Document complex logic**: Comment unusual or domain-specific filtering rules + +### Common Mistakes to Avoid +- **Don't use column aliases in HAVING**: Reference the actual aggregate function +- **Don't filter individual row values**: Use WHERE for row-level filtering +- **Don't create overly complex conditions**: Break complex HAVING into multiple conditions +- **Don't forget about NULL handling**: Consider how NULLs affect your aggregate calculations + +The HAVING clause is essential for creating meaningful summaries from grouped data. Use it to focus on the most significant, relevant, or interesting groups in your analysis, turning raw data into actionable insights. \ No newline at end of file diff --git a/.docs2/order-by-clause-sorting.md b/.docs2/order-by-clause-sorting.md new file mode 100644 index 00000000..46d03391 --- /dev/null +++ b/.docs2/order-by-clause-sorting.md @@ -0,0 +1,450 @@ +# ORDER BY Clause and Sorting + +The `ORDER BY` clause sorts query results by one or more columns or expressions. Musoq supports comprehensive sorting capabilities including multi-column sorting, custom sort orders, and sorting by computed expressions. + +## Basic ORDER BY Syntax + +```sql +select columns +from data_source +order by column [asc|desc] +``` + +Results are sorted in ascending order by default. Use `desc` for descending order. 
+ +## Single Column Sorting + +### Ascending Order (Default) + +```sql +-- Sort files by name (A to Z) +select Name, Length +from #os.files('/documents', true) +order by Name + +-- Explicit ascending order +select Name, Length +from #os.files('/documents', true) +order by Name asc +``` + +### Descending Order + +```sql +-- Sort files by size (largest first) +select Name, Length +from #os.files('/downloads', true) +order by Length desc + +-- Sort Git commits by date (newest first) +select c.Sha, c.Message, c.Date +from #git.repository('/repo') r +cross apply r.Commits c +order by c.Date desc +``` + +## Multiple Column Sorting + +### Hierarchical Sorting + +Sort by multiple columns with different precedence: + +```sql +-- Sort by extension first, then by size within each extension +select Name, Extension, Length +from #os.files('/projects', true) +order by Extension, Length desc + +-- Sort Git commits by author, then by date +select c.Author.Name, c.Date, c.Message +from #git.repository('/repo') r +cross apply r.Commits c +order by c.Author.Name, c.Date desc +``` + +### Mixed Sort Directions + +Combine ascending and descending sorts: + +```sql +-- Group by extension (A-Z), then largest files first within each group +select Name, Extension, Length +from #os.files('/data', true) +order by Extension asc, Length desc + +-- Sort commits by author name (A-Z), then newest first +select c.Author.Name, c.Date, c.Message +from #git.repository('/repo') r +cross apply r.Commits c +order by c.Author.Name asc, c.Date desc +``` + +## Sorting by Expressions + +### Calculated Values + +Sort by computed expressions: + +```sql +-- Sort files by size in MB +select Name, Length, Round(Length / 1024.0 / 1024.0, 2) as MegaBytes +from #os.files('/media', true) +order by Length / 1024.0 / 1024.0 desc + +-- Sort commits by message length +select c.Sha, c.Message, Len(c.Message) as MessageLength +from #git.repository('/repo') r +cross apply r.Commits c +order by Len(c.Message) desc +``` + +### 
String Expressions + +Sort by string manipulations: + +```sql +-- Sort files by uppercase name (case-insensitive) +select Name, Extension +from #os.files('/mixed-case', true) +order by Upper(Name) + +-- Sort by file extension in uppercase +select Name, Extension +from #os.files('/documents', true) +order by Upper(Extension), Name +``` + +### Date and Time Expressions + +Sort by date calculations: + +```sql +-- Sort files by days since creation (newest first) +select Name, CreationTime, DateDiff('day', CreationTime, GetDate()) as DaysOld +from #os.files('/logs', true) +order by DateDiff('day', CreationTime, GetDate()) asc + +-- Sort commits by day of week, then by time +select c.Sha, c.Date, c.Message +from #git.repository('/repo') r +cross apply r.Commits c +order by DatePart('weekday', c.Date), DatePart('hour', c.Date) +``` + +## Conditional Sorting + +### CASE-Based Sorting + +Custom sort orders using CASE expressions: + +```sql +-- Custom priority sorting for file types +select Name, Extension +from #os.files('/project', true) +order by + case Extension + when '.cs' then 1 -- C# files first + when '.js' then 2 -- JavaScript second + when '.css' then 3 -- CSS third + when '.html' then 4 -- HTML fourth + else 5 -- Everything else last + end, + Name + +-- Priority sorting for Git commit types +select c.Sha, c.Message +from #git.repository('/repo') r +cross apply r.Commits c +order by + case + when c.Message like 'fix:%' then 1 -- Bug fixes first + when c.Message like 'feat:%' then 2 -- Features second + when c.Message like 'docs:%' then 3 -- Documentation third + else 4 -- Others last + end, + c.Date desc +``` + +### Size-Based Custom Sorting + +```sql +-- Sort files by size categories +select Name, Length, + case + when Length < 1024 then 'Small' + when Length < 1048576 then 'Medium' + when Length < 10485760 then 'Large' + else 'Very Large' + end as SizeCategory +from #os.files('/data', true) +order by + case + when Length < 1024 then 1 + when Length < 1048576 
then 2 + when Length < 10485760 then 3 + else 4 + end, + Length desc +``` + +## Sorting with Aggregations + +### GROUP BY with ORDER BY + +Sort aggregated results: + +```sql +-- File count by extension, sorted by count +select Extension, Count(*) as FileCount +from #os.files('/source', true) +group by Extension +order by Count(*) desc + +-- Total size by directory, sorted by size +select Directory, Sum(Length) as TotalSize +from #os.files('/project', true) +group by Directory +order by Sum(Length) desc +``` + +### Complex Aggregation Sorting + +```sql +-- Git contributors sorted by activity +select + c.Author.Name, + Count(*) as CommitCount, + Max(c.Date) as LastCommit, + Min(c.Date) as FirstCommit +from #git.repository('/repo') r +cross apply r.Commits c +group by c.Author.Name +order by Count(*) desc, Max(c.Date) desc + +-- File analysis by extension with multiple metrics +select + Extension, + Count(*) as FileCount, + Sum(Length) as TotalSize, + Avg(Length) as AvgSize, + Max(Length) as LargestFile +from #os.files('/codebase', true) +group by Extension +order by Sum(Length) desc, Count(*) desc +``` + +## Advanced Sorting Patterns + +### Null Handling in Sorting + +```sql +-- Handle null values in sorting +select Name, Extension, Length +from #os.files('/mixed', true) +order by + case when Extension is null then 1 else 0 end, -- Nulls last + Extension, + Length desc + +-- Sort with null-safe comparisons +select c.Sha, c.Author.Name, c.Author.Email +from #git.repository('/repo') r +cross apply r.Commits c +order by + case when c.Author.Email is null then 'zzz' else c.Author.Email end, + c.Date desc +``` + +### Nested Property Sorting + +Sort by complex object properties: + +```sql +-- Sort Git commits by author properties +select c.Sha, c.Author.Name, c.Author.Email, c.Committer.Name +from #git.repository('/repo') r +cross apply r.Commits c +order by c.Author.Name, c.Committer.Name, c.Date desc + +-- Sort code methods by complexity metrics +select + c.Name as 
ClassName, + m.Name as MethodName, + m.CyclomaticComplexity, + m.LinesOfCode +from #csharp.solution('/project.sln') s +cross apply s.Projects p +cross apply p.Documents d +cross apply d.Classes c +cross apply c.Methods m +order by m.CyclomaticComplexity desc, m.LinesOfCode desc +``` + +## Performance Considerations + +### Efficient Sorting Strategies + +```sql +-- Efficient: Sort by a plain column +select Name, Length +from #os.files('/data', true) +order by CreationTime desc -- Direct column sort, no per-row computation + +-- Less efficient: Complex expressions in ORDER BY +select Name, Length +from #os.files('/data', true) +order by Substring(Name, 1, 5), Length / 1024.0 / 1024.0 +``` + +### Large Dataset Sorting + +```sql +-- Use TAKE for large datasets to limit sorting overhead +select Name, Length +from #os.files('/huge-directory', true) +order by Length desc +take 100 -- Only sort enough to get top 100 + +-- Combine with filtering for better performance +select Name, Length +from #os.files('/large-dataset', true) +where Extension in ('.cs', '.js', '.py') -- Filter first +order by Length desc +take 50 +``` + +## Sorting with Window Functions + +### ROW_NUMBER for Ranking + +```sql +-- Rank files by size within each directory (conceptual - window function support and syntax may vary) +select + Directory, + Name, + Length, + Row_Number() over (partition by Directory order by Length desc) as SizeRank +from #os.files('/project', true) +order by Directory, SizeRank +``` + +## Common Sorting Patterns + +### File System Analysis + +```sql +-- Top 10 largest files +select Name, Length, Directory +from #os.files('/workspace', true) +order by Length desc +take 10 + +-- Files grouped by extension, sorted by size +select Extension, Name, Length +from #os.files('/documents', true) +where Extension is not null +order by Extension, Length desc + +-- Recently modified files first +select Name, LastWriteTime, Length +from #os.files('/active', true) +order by LastWriteTime desc, Length desc +``` + +### Git Repository Analysis + +```sql 
+-- Recent commits by all authors +select c.Author.Name, c.Date, c.Message +from #git.repository('/repo') r +cross apply r.Commits c +order by c.Date desc +take 50 + +-- Contributors by commit count +select + c.Author.Name, + Count(*) as Commits, + Max(c.Date) as LastCommit +from #git.repository('/repo') r +cross apply r.Commits c +group by c.Author.Name +order by Count(*) desc, Max(c.Date) desc + +-- Commits by message type and recency +select c.Sha, c.Message, c.Date, c.Author.Name +from #git.repository('/repo') r +cross apply r.Commits c +order by + case + when c.Message like 'fix:%' then 1 + when c.Message like 'feat:%' then 2 + else 3 + end, + c.Date desc +``` + +### Code Quality Analysis + +```sql +-- Methods sorted by complexity +select + p.Name as Project, + c.Name as Class, + m.Name as Method, + m.CyclomaticComplexity, + m.LinesOfCode +from #csharp.solution('/solution.sln') s +cross apply s.Projects p +cross apply p.Documents d +cross apply d.Classes c +cross apply c.Methods m +order by + m.CyclomaticComplexity desc, + m.LinesOfCode desc, + p.Name, + c.Name, + m.Name +``` + +### Multi-Source Sorting + +```sql +-- Combine and sort data from multiple sources +select + 'File' as Type, + f.Name as Name, + f.Length as Size, + f.CreationTime as Date +from #os.files('/project', true) f +union all +select + 'Commit' as Type, + c.Sha as Name, + Len(c.Message) as Size, + c.Date as Date +from #git.repository('/project') r +cross apply r.Commits c +order by Date desc, Type, Size desc +``` + +## Best Practices + +### Performance Guidelines +- **Limit sorting scope**: Use WHERE clauses to reduce data before sorting +- **Use TAKE/SKIP**: Limit results when you don't need the entire dataset +- **Avoid complex expressions**: Simple column sorts are more efficient than calculated expressions +- **Consider data types**: Ensure consistent data types for optimal sorting performance + +### Readability Guidelines +- **Clear sort logic**: Use meaningful column names or aliases in 
ORDER BY +- **Consistent direction**: Be explicit about ASC/DESC even when using defaults +- **Group related sorts**: Keep related sorting columns together +- **Document complex sorts**: Comment unusual or business-specific sorting logic + +### Design Patterns +- **Primary, secondary, tertiary**: Design hierarchical sorts from most to least important +- **User-friendly defaults**: Sort by the most commonly needed order first +- **Stable sorting**: Use additional columns to ensure consistent ordering for equal values + +The ORDER BY clause transforms raw data into meaningful, organized results. Master these sorting patterns to present data in the most useful and insightful way for your analysis needs. \ No newline at end of file diff --git a/.docs2/select-clause.md b/.docs2/select-clause.md new file mode 100644 index 00000000..463bf00e --- /dev/null +++ b/.docs2/select-clause.md @@ -0,0 +1,291 @@ +# SELECT Clause + +The `SELECT` clause defines what data to retrieve from your query. In Musoq, the SELECT clause supports standard SQL functionality plus powerful extensions for working with diverse data sources. 
+ +## Basic SELECT Syntax + +```sql +select column1, column2, column3 +from #schema.datasource() +``` + +### Simple Column Selection + +Select specific columns by name: + +```sql +-- Select specific columns from files +select Name, Length, Extension +from #os.files('/path/to/directory', true) +``` + +### SELECT All Columns + +Use the asterisk (`*`) to select all available columns: + +```sql +-- Select all columns from Git commits +select * +from #git.commits('/path/to/repo') +``` + +## Column Aliases + +Assign custom names to columns using the `AS` keyword: + +```sql +-- Create meaningful column names +select + Name as FileName, + Length as SizeInBytes, + Length / 1024 as SizeInKB +from #os.files('/docs', true) +``` + +The `AS` keyword is optional: + +```sql +-- Equivalent syntax without AS +select + Name FileName, + Length SizeInBytes, + Length / 1024 SizeInKB +from #os.files('/docs', true) +``` + +## Expressions in SELECT + +### Arithmetic Expressions + +Perform calculations directly in the SELECT clause: + +```sql +-- Calculate file sizes in different units +select + Name, + Length, + Length / 1024 as KiloBytes, + Length / 1024 / 1024 as MegaBytes, + Round(Length / 1024.0 / 1024.0, 2) as MegaBytesRounded +from #os.files('/data', true) +where Length > 1000000 +``` + +### String Operations + +Manipulate text data with string functions: + +```sql +-- Extract and format file information +select + Upper(Name) as UpperName, + Substring(Name, 1, 10) as FirstTenChars, + Concat(Name, ' (', Length, ' bytes)') as NameWithSize +from #os.files('/docs', false) +``` + +### Conditional Expressions + +Use CASE expressions for conditional logic: + +```sql +-- Categorize files by size +select + Name, + Length, + case + when Length < 1024 then 'Small' + when Length < 1048576 then 'Medium' + else 'Large' + end as SizeCategory +from #os.files('/downloads', true) +``` + +## Complex Property Access + +### Nested Property Navigation + +Access nested properties using dot notation: + 
+```sql +-- Access nested Git commit properties +select + c.Sha, + c.Author.Name as AuthorName, + c.Author.Email as AuthorEmail, + c.Committer.Date as CommitDate +from #git.repository('/repo') r +cross apply r.Commits c +``` + +### Self Property Access + +Access the entire object using the `Self` keyword: + +```sql +-- Work with complete objects +select + Self.Name, + Self.Length, + Self as CompleteFileInfo +from #os.files('/temp', false) +``` + +## Function Calls in SELECT + +### Built-in Functions + +Use Musoq's extensive function library: + +```sql +-- Date and mathematical functions +select + Name, + CreationTime, + DateDiff('day', CreationTime, GetDate()) as DaysOld, + Abs(Length - 1024) as DistanceFromKB, + Power(Length / 1024, 2) as KBSquared +from #os.files('/logs', true) +``` + +### Aggregate Functions + +Apply aggregation functions (typically used with GROUP BY): + +```sql +-- Summary statistics by file extension +select + Extension, + Count(*) as FileCount, + Sum(Length) as TotalSize, + Avg(Length) as AverageSize, + Max(Length) as LargestFile, + Min(Length) as SmallestFile +from #os.files('/projects', true) +group by Extension +``` + +## Advanced SELECT Features + +### Type Casting + +Convert between data types explicitly: + +```sql +-- Explicit type conversions +select + Name, + Cast(Length as 'System.String') as LengthAsString, + Cast(CreationTime as 'System.String') as CreationTimeAsString +from #os.files('/data', false) +``` + +### NULL Handling + +Handle null values gracefully: + +```sql +-- Provide defaults for null values +select + Name, + Coalesce(Extension, 'no-extension') as FileExtension, + case when Length is null then 0 else Length end as SafeLength +from #os.files('/mixed', true) +``` + +### Complex Calculations + +Combine multiple operations: + +```sql +-- Complex file analysis +select + Name, + Extension, + Length, + Round( + case + when Length = 0 then 0 + else Length / (1024.0 * 1024.0) + end, + 3 + ) as MegaBytes, + case + when 
Extension in ('.jpg', '.png', '.gif') then 'Image' + when Extension in ('.txt', '.md', '.doc') then 'Document' + when Extension in ('.exe', '.dll', '.so') then 'Executable' + else 'Other' + end as FileCategory +from #os.files('/workspace', true) +``` + +## SELECT with Multiple Data Sources + +When working with joins or multiple data sources, qualify column names: + +```sql +-- Qualified column names in joins +select + f.Name as FileName, + f.Length as FileSize, + d.Name as DirectoryName +from #os.files('/path', false) f +inner join #os.directories('/path') d on f.Directory = d.FullName +``` + +## Best Practices + +### Column Naming +- Use descriptive aliases for calculated columns +- Maintain consistent naming conventions +- Avoid reserved keywords as column names + +### Performance Considerations +- Select only the columns you need (avoid `SELECT *` in production) +- Use appropriate data types for calculations +- Consider the impact of complex expressions on query performance + +### Readability +- Format complex SELECT clauses with proper indentation +- Group related columns together +- Use meaningful aliases that describe the data + +## Common Patterns + +### File Analysis Pattern +```sql +select + Name as FileName, + Extension as FileType, + Round(Length / 1024.0, 2) as SizeKB, + CreationTime as Created, + LastWriteTime as Modified +from #os.files('/analysis', true) +``` + +### Git Analysis Pattern +```sql +select + c.Sha as CommitHash, + c.Author.Name as Developer, + c.Message as CommitMessage, + DateDiff('day', c.Date, GetDate()) as DaysAgo +from #git.repository('/repo') r +cross apply r.Commits c +``` + +### Code Analysis Pattern +```sql +select + c.Name as ClassName, + m.Name as MethodName, + m.LinesOfCode as LOC, + m.CyclomaticComplexity as Complexity +from #csharp.solution('/project.sln') s +cross apply s.Projects p +cross apply p.Documents d +cross apply d.Classes c +cross apply c.Methods m +``` + +The SELECT clause is the foundation of data retrieval in 
Musoq. Master these patterns to effectively extract and transform data from any source. \ No newline at end of file diff --git a/.docs2/where-clause-filtering.md b/.docs2/where-clause-filtering.md new file mode 100644 index 00000000..5f81a155 --- /dev/null +++ b/.docs2/where-clause-filtering.md @@ -0,0 +1,493 @@ +# WHERE Clause and Filtering + +The `WHERE` clause filters data based on specified conditions. Musoq supports standard SQL filtering operations plus extensions for working with complex data types and nested properties. + +## Basic WHERE Syntax + +```sql +select columns +from data_source +where condition +``` + +Conditions evaluate to boolean values (true/false) and determine which rows are included in the result set. + +## Comparison Operators + +### Basic Comparisons + +```sql +-- Numeric comparisons +select Name, Length +from #os.files('/downloads', true) +where Length > 1048576 -- Files larger than 1MB + +-- String comparisons +select Name, Extension +from #os.files('/documents', true) +where Extension = '.pdf' + +-- Date comparisons +select Name, CreationTime +from #os.files('/logs', true) +where CreationTime >= '2024-01-01' +``` + +### Supported Comparison Operators + +| Operator | Description | Example | +|----------|-------------|---------| +| `=` | Equal to | `Length = 1024` | +| `!=` or `<>` | Not equal to | `Extension != '.tmp'` | +| `>` | Greater than | `Length > 1000000` | +| `>=` | Greater than or equal | `CreationTime >= '2024-01-01'` | +| `<` | Less than | `Length < 1024` | +| `<=` | Less than or equal | `CreationTime <= GetDate()` | + +### String Comparisons + +```sql +-- Case-sensitive string comparison +select Name +from #os.files('/src', true) +where Extension = '.cs' + +-- String inequality +select Name +from #os.files('/temp', true) +where Name != 'temp.txt' +``` + +## Logical Operators + +### AND Operator + +Combine multiple conditions (all must be true): + +```sql +-- Multiple conditions with AND +select Name, Length, Extension +from 
#os.files('/projects', true) +where Length > 1000 + and Extension = '.cs' + and CreationTime > '2024-01-01' + +-- Complex AND conditions +select c.Sha, c.Author.Name, c.Message +from #git.repository('/repo') r +cross apply r.Commits c +where c.Author.Email = 'developer@company.com' + and c.Date >= DateAdd('month', -3, GetDate()) + and Len(c.Message) > 50 +``` + +### OR Operator + +Match any of the specified conditions: + +```sql +-- Files with multiple extensions +select Name, Extension +from #os.files('/documents', true) +where Extension = '.pdf' + or Extension = '.doc' + or Extension = '.docx' + +-- Size-based OR conditions +select Name, Length +from #os.files('/data', true) +where Length < 1024 -- Small files + or Length > 10485760 -- Large files (> 10MB) +``` + +### NOT Operator + +Negate a condition: + +```sql +-- Exclude specific file types +select Name, Extension +from #os.files('/mixed', true) +where not (Extension = '.tmp' or Extension = '.log') + +-- Exclude empty files +select Name, Length +from #os.files('/content', true) +where not Length = 0 +``` + +## Pattern Matching + +### LIKE Operator + +Pattern matching with wildcards: + +```sql +-- Wildcard patterns +select Name +from #os.files('/projects', true) +where Name like '%.cs' -- Files ending with .cs + +select Name +from #os.files('/logs', true) +where Name like 'error%' -- Files starting with 'error' + +select Name +from #os.files('/data', true) +where Name like '%temp%' -- Files containing 'temp' + +-- Single-character wildcard (underscore) +select Name +from #os.files('/files', true) +where Name like 'file_.txt' -- file1.txt, file2.txt, etc. 
+``` + +### Pattern Matching Examples + +```sql +-- Git commit message patterns +select c.Sha, c.Message +from #git.repository('/repo') r +cross apply r.Commits c +where c.Message like 'fix:%' + or c.Message like 'feat:%' + or c.Message like 'docs:%' + +-- Method name patterns in code +select c.Name, m.Name +from #csharp.solution('/project.sln') s +cross apply s.Projects p +cross apply p.Documents d +cross apply d.Classes c +cross apply c.Methods m +where m.Name like 'Get%' + or m.Name like 'Set%' + or m.Name like 'Is%' +``` + +## Set Membership + +### IN Operator + +Check if a value exists in a list: + +```sql +-- File extension filtering +select Name, Extension +from #os.files('/source', true) +where Extension in ('.cs', '.vb', '.fs', '.cpp', '.h') + +-- Specific file sizes +select Name, Length +from #os.files('/test', true) +where Length in (1024, 2048, 4096) + +-- Author filtering in Git +select c.Sha, c.Author.Name, c.Message +from #git.repository('/repo') r +cross apply r.Commits c +where c.Author.Name in ('Alice', 'Bob', 'Charlie') +``` + +### NOT IN Operator + +Exclude values from a set: + +```sql +-- Exclude system and temporary files +select Name, Extension +from #os.files('/workspace', true) +where Extension not in ('.tmp', '.log', '.cache', '.swap') + +-- Exclude specific directories +select Name, Directory +from #os.files('/project', true) +where Directory not in ('/project/bin', '/project/obj', '/project/.git') +``` + +## NULL Handling + +### IS NULL and IS NOT NULL + +Handle missing or undefined values: + +```sql +-- Find files without extension +select Name +from #os.files('/mixed', true) +where Extension is null + +-- Find files with extension +select Name, Extension +from #os.files('/documents', true) +where Extension is not null + +-- Handle nullable Git properties +select c.Sha, c.Author.Name +from #git.repository('/repo') r +cross apply r.Commits c +where c.Author.Email is not null + and c.Committer.Email is not null +``` + +### NULL Comparison 
Behavior + +```sql +-- NULL comparisons always return false +select Name +from #os.files('/test', true) +where Length = null -- This returns no results + +-- Correct NULL checking +select Name +from #os.files('/test', true) +where Length is null -- This finds NULL lengths +``` + +## Complex Property Filtering + +### Nested Property Access + +Filter on nested object properties: + +```sql +-- Git commit author properties +select c.Sha, c.Author.Name, c.Author.Email +from #git.repository('/repo') r +cross apply r.Commits c +where c.Author.Name = 'John Doe' + and c.Author.Email like '%@company.com' + +-- File system detailed properties +select Name, Length, Attributes +from #os.files('/system', true) +where Attributes.IsHidden = false + and Attributes.IsReadOnly = false +``` + +### Object Property Combinations + +```sql +-- Complex Git filtering +select c.Sha, c.Message, c.Author.Name +from #git.repository('/repo') r +cross apply r.Commits c +where c.Author.Name != c.Committer.Name -- Different author and committer + and c.Date > DateAdd('week', -1, GetDate()) + and Len(c.Message) between 10 and 100 +``` + +## Function-Based Filtering + +### String Functions in WHERE + +```sql +-- Case-insensitive filtering +select Name +from #os.files('/documents', true) +where Upper(Extension) = '.PDF' + +-- String length filtering +select Name, Extension +from #os.files('/files', true) +where Len(Name) > 20 + and Substring(Name, 1, 4) = 'long' + +-- String manipulation +select c.Sha, c.Message +from #git.repository('/repo') r +cross apply r.Commits c +where Trim(c.Message) != '' + and Left(c.Message, 4) in ('fix:', 'feat', 'docs') +``` + +### Mathematical Functions + +```sql +-- Mathematical conditions +select Name, Length +from #os.files('/data', true) +where Abs(Length - 1024) < 100 -- Files approximately 1KB + and Power(Length, 0.5) > 32 -- Mathematical calculations + and Round(Length / 1024.0, 0) = 5 -- Exactly 5KB when rounded +``` + +### Date and Time Functions + +```sql +-- 
Date-based filtering +select Name, CreationTime +from #os.files('/logs', true) +where DateDiff('day', CreationTime, GetDate()) <= 7 -- Files from last week + and DatePart('hour', CreationTime) between 9 and 17 -- Created during work hours + and DatePart('weekday', CreationTime) not in (1, 7) -- Not weekend + +-- Git commit date filtering +select c.Sha, c.Message, c.Date +from #git.repository('/repo') r +cross apply r.Commits c +where DatePart('year', c.Date) = 2024 + and DatePart('month', c.Date) in (1, 2, 3) -- First quarter +``` + +## Advanced Filtering Patterns + +### Range Filtering + +```sql +-- BETWEEN operator for ranges +select Name, Length +from #os.files('/media', true) +where Length between 1048576 and 10485760 -- 1MB to 10MB + +-- Date ranges +select c.Sha, c.Author.Name, c.Date +from #git.repository('/repo') r +cross apply r.Commits c +where c.Date between '2024-01-01' and '2024-12-31' +``` + +### Conditional Filtering with CASE + +```sql +-- Dynamic filtering based on conditions +select Name, Length, Extension +from #os.files('/mixed', true) +where case + when Extension = '.log' then Length < 10485760 -- Log files < 10MB + when Extension = '.zip' then Length > 1048576 -- ZIP files > 1MB + else Length > 0 -- Other files not empty +end +``` + +### Subquery Filtering + +```sql +-- Filter based on subquery results +select Name, Length +from #os.files('/project', true) +where Extension in ( + select Extension + from #os.files('/templates', true) + group by Extension + having Count(*) > 5 +) + +-- EXISTS filtering +select f.Name +from #os.files('/source', true) f +where exists ( + select 1 + from #os.files('/backup', true) b + where b.Name = f.Name +) +``` + +## Performance Optimization + +### Early Filtering + +```sql +-- Efficient: Filter early in the pipeline +select Name, Length +from #os.files('/large-directory', true) +where Extension = '.cs' -- Filter files first + and Length > 1000 -- Then filter by size + +-- Less efficient: Complex calculations on 
all rows (Complex_Calculation is a placeholder name, not a built-in function) +select Name, Complex_Calculation(Length) as Result +from #os.files('/large-directory', true) +where Complex_Calculation(Length) > 100 +``` + +### Index-Friendly Filtering + +```sql +-- Use direct column comparisons when possible +where CreationTime >= '2024-01-01' -- Good + +-- Avoid functions on columns in WHERE +where DatePart('year', CreationTime) = 2024 -- Less efficient +``` + +## Common Filtering Patterns + +### File Analysis Patterns + +```sql +-- Large source files outside build output directories +select Name, Length, Directory +from #os.files('/project', true) +where Length > 5242880 -- > 5MB + and Directory not like '%/bin/%' + and Directory not like '%/obj/%' + and Extension in ('.cs', '.cpp', '.h') + +-- Recently modified files +select Name, LastWriteTime +from #os.files('/active-project', true) +where DateDiff('hour', LastWriteTime, GetDate()) <= 24 + and Extension != '.tmp' +``` + +### Git Analysis Patterns + +```sql +-- Active contributors +select c.Author.Name, Count(*) as CommitCount +from #git.repository('/repo') r +cross apply r.Commits c +where c.Date >= DateAdd('month', -3, GetDate()) + and c.Author.Email like '%@company.com' + and c.Message not like 'Merge%' +group by c.Author.Name +having Count(*) >= 5 + +-- Bug fix commits +select c.Sha, c.Message, c.Author.Name +from #git.repository('/repo') r +cross apply r.Commits c +where (c.Message like '%fix%' + or c.Message like '%bug%' + or c.Message like '%issue%') + and c.Message not like '%fix typo%' + and Len(c.Message) > 10 +``` + +### Code Quality Patterns + +```sql +-- Complex methods that need refactoring +select + c.Name as ClassName, + m.Name as MethodName, + m.CyclomaticComplexity, + m.LinesOfCode +from #csharp.solution('/project.sln') s +cross apply s.Projects p +cross apply p.Documents d +cross apply d.Classes c +cross apply c.Methods m +where m.CyclomaticComplexity > 10 + or m.LinesOfCode > 50 + or (m.ParameterCount > 5 and m.CyclomaticComplexity > 5) +``` + +## Best Practices + +### 
Condition Ordering +- **Most selective first**: Place conditions that eliminate the most rows first +- **Cheapest operations first**: Simple comparisons before complex functions +- **AND before OR**: Group AND conditions before OR conditions when possible + +### Readability +- **Use parentheses**: Clarify complex logical expressions +- **Consistent formatting**: Align conditions for better readability +- **Meaningful conditions**: Write self-documenting filter logic + +### Performance +- **Avoid functions on columns**: Use direct comparisons when possible +- **Use appropriate data types**: Ensure type compatibility to avoid conversions +- **Filter early**: Apply WHERE conditions as early as possible in the query + +The WHERE clause is essential for extracting meaningful insights from your data. Master these filtering patterns to create precise, efficient queries that surface exactly the information you need. \ No newline at end of file