diff --git a/README.md b/README.md index 5cbbf74..9114f91 100644 --- a/README.md +++ b/README.md @@ -1,74 +1,260 @@ -# Understanding Different Design Choices in Training Large Time Series Models - +# LTSM-Bundle: A Toolbox and Benchmark on Large Language Models for Time Series Forecasting + +
+LTSM Model +
[![Test](https://github.com/daochenzha/ltsm/actions/workflows/test.yml/badge.svg)](https://github.com/daochenzha/ltsm/actions/workflows/test.yml) -This work investigates the transition from traditional Time Series Forecasting (TSF) to Large Time Series Models (LTSMs), leveraging universal transformer-based models. Training LTSMs on diverse time series data introduces challenges due to varying frequencies, dimensions, and patterns. We explore various design choices for LTSMs, including pre-processing, model configurations, and dataset setups. We introduce **Time Series Prompt**, a statistical prompting strategy, and $\texttt{LTSM-bundle}$, which encapsulates the most effective design practices identified. $\texttt{LTSM-bundle}$ is developed by [Data Lab at Rice University](https://cs.rice.edu/~xh37/). +> Empowering forecasts with precision and efficiency. + +## Table of Contents + +* [Overview](#overview) +* [Why LTSM-bundle](#why-ltsm-bundle) +* [Features](#features) +* [Installation](#installation) +* [Quick Start](#quick-start) +* [Project Structure](#project-structure) +* [Datasets and Prompts](#datasets-and-prompts) +* [Model Access](#model-access) +* [Cite This Work](#cite-this-work) +* [License](#license) +* [Acknowledgments](#acknowledgments) + +--- -## Resources -:star2: Please star our repo to follow the latest updates on LTSM-bundle! +## Overview -:mega: We have released our [paper](https://arxiv.org/abs/2406.14045) and source code of LTSM-bundle-v1.0! +This work investigates the transition from traditional Time Series Forecasting (TSF) to Large Time Series Models (LTSMs), leveraging large transformer-based models like GPT. Training LTSMs on diverse time series data introduces challenges due to varying frequencies, dimensions, and patterns. -:books: Follow our latest [English Tutorial](https://github.com/daochenzha/ltsm/tree/main/tutorial) or [中文教程](https://zhuanlan.zhihu.com/p/708804309) to costomize your LTSM! 
+We explore multiple design choices, including pre-processing strategies, tokenization, model architectures, and dataset setups. We introduce: -:earth_americas: For more information, please visit: -* Paper: [https://arxiv.org/abs/2406.14045](https://arxiv.org/abs/2406.14045) -* Blog: [Time Series Are Not That Different for LLMs](https://towardsdatascience.com/time-series-are-not-that-different-for-llms-56435dc7d2b1) -* Tutorial: [Build your own LTSM-bundle](https://github.com/daochenzha/ltsm/tree/main/tutorial) -* Chinese Tutorial: [https://zhuanlan.zhihu.com/p/708804309](https://zhuanlan.zhihu.com/p/708804309) -* Do you want to learn more about data pipeline search? Please check out our [data-centric AI survey](https://arxiv.org/abs/2303.10158) and [data-centric AI resources](https://github.com/daochenzha/data-centric-AI) ! +* **Time Series Prompt**: A statistical prompting strategy +* **LTSM-bundle**: A toolkit encapsulating effective design practices + +The project is developed by the [Data Lab at Rice University](https://cs.rice.edu/~xh37/). + +--- ## Why LTSM-bundle? -The LTSM-bundle package leverages the HuggingFace transformers toolkit, offering flexibility to switch between different advanced language models as the backbone. It is easy to tailor the general LTSMs to their specific time series forecasting needs by selecting the most suitable language model from a wide array of options. The flexibility enhances the adaptability of the package across different industries and data types, ensuring optimal performance in diverse scenarios. + +The LTSM-bundle leverages HuggingFace transformers, allowing flexible integration of large-scale pre-trained language models for time series tasks. Users can customize the pipeline to fit specific forecasting needs with minimal overhead, making it adaptable across various domains and industries. 
+ +Key highlights: + +* Plug-and-play with GPT-style backbones +* Modular pipeline for easy experimentation +* Support for statistical and text prompts + +--- + +## Features + +| Category | Highlights | +| ----------------- | ----------------------------------------------------------------------------------------------------------------------------------- | +| ⚙️ Architecture | Modular design, GPT-style transformers for time series | +| 📝 Prompting | Time Series Prompt & Text Prompt support | +| ⚡️ Performance | GPU acceleration, optimized pipelines | +| 🔧 Integrations | LoRA support, JSON/CSV-based dataset and prompt interfaces | +| 🔬 Testing | Unit and integration tests, GitHub Actions CI | +| 📊 Data | Built-in data loaders, scalers, and tokenizers | +| 📂 Documentation | Tutorials in [English](https://github.com/daochenzha/ltsm/tree/main/tutorial) and [Chinese](https://zhuanlan.zhihu.com/p/708804309) | + +--- ## Installation -``` + +We recommend using Conda: + +```bash conda create -n ltsm python=3.8.0 conda activate ltsm -git clone git@github.com:daochenzha/ltsm.git -cd ltsm -pip3 install -e . -pip3 install -r requirements.txt ``` -## Quick Exploration on LTSM-bundle +Then install the package: -Training on **[Time Series Prompt]** and **[Linear Tokenization]** ```bash -bash scripts/train_ltsm_csv.sh +git clone https://github.com/datamllab/ltsm.git +cd ltsm +pip install -e . 
+pip install -r requirements.txt ``` -Training on **[Text Prompt]** and **[Linear Tokenization]** -```bash -bash scripts/train_ltsm_textprompt_csv.sh +--- + +## 🔧 Training Examples +```Python +from ltsm.data_pipeline import StatisticalTrainingPipeline, get_args, seed_all +from ltsm.models.base_config import LTSMConfig +from ltsm.common.base_training_pipeline import TrainingConfig + +# Option 1: Load config via command-line arguments +config = get_args() + +# Option 2: Load config from a JSON file +config = TrainingConfig.load("example.json") + +# Option 3: Manually customize a supported model config in code +# (e.g., LTSMConfig, DLinearConfig, InformerConfig, etc.) +config = LTSMConfig(seq_len=336, pred_len=96) + +# Set random seeds for reproducibility +seed = config.train_params["seed"] +seed_all(seed) + +# Initialize the training pipeline with the loaded config +pipeline = StatisticalTrainingPipeline(config) + +# Run the training and evaluation process +pipeline.run() ``` -Training on **[Time Series Prompt]** and **[Time Series Tokenization]** -```bash -bash scripts/train_ltsm_tokenizer_csv.sh +## 🔍 Inference Examples + +```Python +import os +import torch +import pandas as pd +from huggingface_hub import hf_hub_download +from safetensors.torch import load_file +from ltsm.models import LTSMConfig, ltsm_model + +# Download model config and weights from Hugging Face +config_path = hf_hub_download("LSC2204/LTSM-bundle", "config.json") +weights_path = hf_hub_download("LSC2204/LTSM-bundle", "model.safetensors") + +# Load model and weights +model_config = LTSMConfig() +model_config.load(config_path) +model = ltsm_model.LTSM(model_config) + +state_dict = load_file(weights_path) +model.load_state_dict(state_dict) + +# Move model to device +device = torch.device("cuda" if torch.cuda.is_available() else "cpu") +model = model.to(device).eval() + +# Load your dataset (e.g., weather) +df_weather = pd.read_csv("/path/to/dataset.csv") +print("Loaded data shape:", 
df_weather.shape) + +# Load prompts per feature +feature_prompts = {} +prompt_dir = "/path/to/prompts/" +for feature, filename in { + "T (degC)": "weather_T (degC)_prompt.pth.tar", + "rain (mm)": "weather_rain (mm)_prompt.pth.tar" +}.items(): + prompt_tensor = torch.load(os.path.join(prompt_dir, filename)) + feature_prompts[feature] = prompt_tensor.squeeze(0).float().to(device) + +# Predict (custom code here depending on your model usage) +# For example: +with torch.no_grad(): + inputs = feature_prompts["T (degC)"].unsqueeze(0) + preds = model(inputs) + print("Prediction output shape:", preds.shape) ``` -## Datasets and Time Series Prompts -Download the datasets +--- + + +## Project Structure + + + +```text +└── ltsm/ + ├── datasets + │ └── README.md + ├── imgs + │ ├── ltsm_model.png + │ ├── prompt_csv_tsne.png + │ └── stat_prompt.png + ├── ltsm + │ ├── data_pipeline + │ ├── data_provider + │ ├── models + │ └── utils + ├── main_ltsm.py + ├── main_tokenizer.py + ├── prompt_bank + │ ├── prompt_data_normalize_split + │ ├── stat-prompt + │ └── text_prompt_data_csv + ├── requirements.txt + ├── scripts + │ ├── test_csv_lora.sh + │ ├── test_ltsm.sh + │ ├── train_ltsm_csv.sh + │ ├── train_ltsm_textprompt_csv.sh + │ └── train_ltsm_tokenizer_csv.sh + ├── setup.py + └── tutorial + └── README.md +``` + +--- + +## Datasets and Prompts + +Download datasets: + ```bash cd datasets -download: https://drive.google.com/drive/folders/1hLFbz0FRxdiDCzgFYtKCOPJYSBVvwW9P +# Google Drive link: +https://drive.google.com/drive/folders/1hLFbz0FRxdiDCzgFYtKCOPJYSBVvwW9P ``` -Download the time series prompts +Download time series prompts: + ```bash -cd prompt_bank/propmt_data_csv -download: https://drive.google.com/drive/folders/1hLFbz0FRxdiDCzgFYtKCOPJYSBVvwW9P +cd prompt_bank/prompt_data_csv +# Same Google Drive link applies ``` +--- + +## Model Access + +You can find our trained LTSM models on Hugging Face: + +➡️ 
[https://huggingface.co/LSC2204/LTSM-bundle](https://huggingface.co/LSC2204/LTSM-bundle) + +--- + ## Cite This Work -If you find this work useful, you may cite this work: -``` -@article{ltsm-bundle, - title={Understanding Different Design Choices in Training Large Time Series Models}, - author={Chuang*, Yu-Neng and Li*, Songchen and Yuan*, Jiayi and Wang*, Guanchu and Lai*, Kwei-Herng and Yu, Leisheng and Ding, Sirui and Chang, Chia-Yuan and Tan, Qiaoyu and Zha, Daochen and Hu, Xia}, - journal={arXiv preprint arXiv:2406.14045}, - year={2024} + +If you find this work useful, please cite: + +```bibtex +@misc{chuang2025ltsmbundletoolboxbenchmarklarge, + title={LTSM-Bundle: A Toolbox and Benchmark on Large Language Models for Time Series Forecasting}, + author={Yu-Neng Chuang and Songchen Li and Jiayi Yuan and Guanchu Wang and Kwei-Herng Lai and Songyuan Sui and Leisheng Yu and Sirui Ding and Chia-Yuan Chang and Qiaoyu Tan and Daochen Zha and Xia Hu}, + year={2025}, + eprint={2406.14045}, + archivePrefix={arXiv}, + primaryClass={cs.LG}, + url={https://arxiv.org/abs/2406.14045}, } ``` + +--- + +## License + +This project is licensed under the MIT License. See the [LICENSE](https://choosealicense.com/licenses/mit/) file for details. + +--- + +## Acknowledgments + +We thank all contributors and collaborators involved in the LTSM project. Special thanks to the Data Lab at Rice University and the open-source community for enabling fast prototyping and reproducible research. + +--- + +
+ ⬆️ Back to Top +
diff --git a/multi_agents_pipeline/agents/Planning_Agent.py b/multi_agents_pipeline/agents/Planning_Agent.py index 61b4e6e..786e2d0 100644 --- a/multi_agents_pipeline/agents/Planning_Agent.py +++ b/multi_agents_pipeline/agents/Planning_Agent.py @@ -38,7 +38,7 @@ async def send_message_to_openai(self, messages: List[SystemMessage], ctx: Messa else: raise ValueError("Response content is not a valid JSON string") - async def generate_ts_task(self, original_message: TSTaskMessage, ctx: MessageContext) -> TSMessage: + async def generate_ts_task(self, original_message: TSTaskMessage, ctx: MessageContext, message_class: Optional[bool | BaseModel] = False) -> TSMessage: """Generates a time series task message based on the original message. Args: @@ -49,25 +49,47 @@ async def generate_ts_task(self, original_message: TSTaskMessage, ctx: MessageCo """ ts_message = SystemMessage( source="user", - content=f"""The task for the time series analysis is: {original_message.description}. - The time-series data is stored at {original_message.filepath}. Provide a detailed description of the data - based on the task description. Also, provide what type of analysis would be required to complete the task among - the following types: ["statistical forecasting", "anomaly detection"]. + content=f""" + The task for the time series analysis is: {original_message.description}. + The time-series data is stored at: {original_message.filepath}. + Based on the task description, determine which one of the following types of analysis best matches the requirement: + + Task Type 1: "ts-classification": The task asks a question where analyzing the properties of the time-series would be necessary for a classification task. + Task Type 2: "ts-forecasting": The task asks a question where predicting the future values of the time-series based on the historical data would be necessary. + Task Type 3: "anomaly-detection": The task asks a question where detecting anomalies in the time-series data would be necessary. 
+ + Only reply with one of three numbers representing the task type: [1, 2, 3]. + Do not explain your reasoning. + Do not add any extra text. + Only output the chosen task type. """ ) - response_content = await self.send_message_to_openai([ts_message], ctx, json_output=TSMessage) + response_content = await self.send_message_to_openai([ts_message], ctx, json_output=message_class) + response = response_content.strip() + task_type = "ts-classification" + if response == "1": + task_type = "ts-classification" + elif response == "2": + task_type = "ts-forecasting" + elif response == "3": + task_type = "anomaly-detection" + try: - ts_task = TSMessage.model_validate_json(response_content) - ts_task.source = "planner" # Set the source to the Planning Agent - ts_task.filepath = original_message.filepath # Ensure the filepath is preserved + if message_class: + ts_task = TSMessage.model_validate_json(response_content) + ts_task.source = "planner" # Set the source to the Planning Agent + ts_task.filepath = original_message.filepath + else: + ts_task = TSMessage(source="planner", filepath=original_message.filepath, task_type=task_type, description=original_message.description) + # Send the generated task to the QA Agent return ts_task except ValidationError as e: raise ValueError(f"Response content is not a valid TextMessage: {e}") from e - async def generate_qa_task(self, original_message: TSTaskMessage, ctx: MessageContext) -> TextMessage: + async def generate_qa_task(self, original_message: TSTaskMessage, ctx: MessageContext, message_class: Optional[bool | BaseModel] = False) -> TextMessage: """Generates a QA task message based on the original message. Args: @@ -78,16 +100,35 @@ async def generate_qa_task(self, original_message: TSTaskMessage, ctx: MessageCo """ task_message = SystemMessage( source="user", - content=f"""Write a descriptive task for the following prompt: {original_message.description}. - The time-series data is stored at {original_message.filepath}. 
+ content=f"""You are given the following original task description: {original_message.description}. + The time-series data is stored at: {original_message.filepath}. + + Your task is to generate a **non-trivial multiple-choice question and answer task** that accomplishes the original goal described above. + + **Instructions:** + 1. Think carefully about what the task is asking. + 2. Write a **clear, concise multiple-choice question** relevant to the task. + 3. Include **at least two numbered answer options**, starting from **1**. + 4. If the original task description already defines specific class labels or meanings for numbered outputs, + you must reuse those exact classes as your multiple-choice answer options. + 5. **Do NOT include the correct answer.** + 6. Clearly state that the goal is for the user to answer the question. + 7. Do not bold any text in the question or answer options. + 8. **Require the user to respond in this exact format (copy this exactly): + Reason: + Answer: ** """ ) - response_content = await self.send_message_to_openai([task_message], ctx, json_output=TextMessage) + response_content = await self.send_message_to_openai([task_message], ctx, json_output=message_class) try: - qa_task = TextMessage.model_validate_json(response_content) - qa_task.source = "planner" # Set the source to the Planning Agent + if message_class: + qa_task = message_class.model_validate_json(response_content) + qa_task.source = "planner" # Set the source to the Planning Agent + else: + qa_task = TextMessage(source="planner", content=response_content) + # Send the generated task to the QA Agent return qa_task except ValidationError as e: @@ -100,18 +141,16 @@ async def handle_ts_task_message(self, message: TSTaskMessage, ctx: MessageConte Args: message (TSTaskMessage): The incoming message containing the user's query. 
""" - ts_task = await self.generate_ts_task(message, ctx) - print(f"[{self.name}] Sending TS task to TS Agent...") - await self.publish_message( - ts_task, - TopicId(type="Planner-TS", source=self.id.key) - ) - #await self.send_message(ts_task, AgentId("ts_agent", "default")) - - qa_task = await self.generate_qa_task(message, ctx) + qa_task = await self.generate_qa_task(message, ctx, False) print(f"[{self.name}] Sending QA task to QA Agent...") await self.publish_message( qa_task, TopicId(type="Planner-QA", source=self.id.key) ) - #await self.send_message(qa_task, AgentId("qa_agent", "default")) \ No newline at end of file + + ts_task = await self.generate_ts_task(message, ctx, False) + print(f"[{self.name}] Sending TS task to TS Agent...") + await self.publish_message( + ts_task, + TopicId(type="Planner-TS", source=self.id.key) + ) \ No newline at end of file diff --git a/multi_agents_pipeline/agents/QA_Agent.py b/multi_agents_pipeline/agents/QA_Agent.py index dd977c2..5e065b4 100644 --- a/multi_agents_pipeline/agents/QA_Agent.py +++ b/multi_agents_pipeline/agents/QA_Agent.py @@ -40,18 +40,26 @@ async def handle_TS(self, message: TSMessage, ctx: MessageContext) -> None: """This is the TS info given by TS Agent """ df = pd.read_csv(Path(message.filepath)) - stats = df.describe().to_string() + csv_text = df.to_csv(index=False) + #stats = df.describe().to_string() # below is the prompt that combine the task and the TS Info. # TODO : Modify according to the task type and task description. Currently just a placeholder + task_string = "" + if message.task_type == "ts-classification": + task_string = f"An expert time-series analyst has provided a description of the time-series data: {message.description}" + elif message.task_type == "ts-forecasting": + task_string = f"An expert time-series analyst has made a prediction. Here is a statistical analysis of the forecasting result: {message.description}" prompt = f""" You are a Time Series Expert. 
Here is a task given by the planner: {self._last_plan or "(no plan received)"} - Here is the output of Time-Series Agent: - {stats} + Here is the forecasting result of the Time-Series Agent: + {csv_text} + + {task_string} Please finish the task based on the above information. """ diff --git a/multi_agents_pipeline/agents/TS_Agent.py b/multi_agents_pipeline/agents/TS_Agent.py index bc145f0..39dcb19 100644 --- a/multi_agents_pipeline/agents/TS_Agent.py +++ b/multi_agents_pipeline/agents/TS_Agent.py @@ -11,7 +11,7 @@ TopicId, type_subscription ) -from autogen_core.models import ChatCompletionClient, UserMessage, AssistantMessage +from autogen_core.models import ChatCompletionClient, UserMessage, AssistantMessage, SystemMessage from pydantic import BaseModel from .custom_messages import TextMessage, TSMessage from multi_agents_pipeline.ltsm_inference import inference @@ -32,18 +32,82 @@ async def handle_TS(self, message: TSMessage, ctx: MessageContext) -> None: """This is the TS info given by Planner. LTSM will process the TS data and return the answer. 
""" file_path = message.filepath - task_type = message.task_type + task_type = message.task_type + if task_type == "ts-classification": + description = await self.write_ts_description(file_path, message.description, ctx) + print(f"[{self.name}] Generated TS classification response: {description}") - ts_response = inference( - file=file_path, - task_type=task_type + await self.publish_message(TSMessage(source=self.name, + filepath = file_path, + description = description, + task_type=task_type), TopicId(type="TS-Info", source=self.id.key)) + elif task_type == "ts-forecasting": + ts_response = inference( + file=file_path, + task_type=task_type + ) + + description = await self.write_ts_prediction_description(ts_response, message.description, ctx) + print(f"[{self.name}] Generated TS forecasting response: {description}") + + await self.publish_message(TSMessage(source=self.name, + filepath = ts_response, + description = description, + task_type=task_type), TopicId(type="TS-Info", source=self.id.key)) + elif task_type == "anomaly-detection": + # TODO: Implement anomaly detection logic + pass + + async def write_ts_description(self, file_path: str, original_message:str, ctx: MessageContext) -> None: + data = pd.read_csv(Path(file_path)).to_csv(index=False) + message = SystemMessage( + source="user", + content=f"""You are a domain expert reviewing a prediction made by an expert time-series analyst. The description of the data and problem is as follows: {original_message} + The prediction is provided here as a comma-separated list of numerical values: {data}. + Your goal is to interpret this sequence by identifying and describing key statistical and temporal patterns. + As you analyze the data, consider the following: + 1. Descriptive Statistics (mean, mode, median, standard deviation, etc.) + 2. Temporal Trends (increasing/decreasing patterns, periodicity, etc.) + 3. Seasonality and Cyclic Behavior + 4. 
Anomalies or Outliers + Focus on interpreting patterns rather than listing every number. + """ ) + + response = await self._model_client.create( + messages=[message], + cancellation_token=ctx.cancellation_token) + + if isinstance(response.content, str): + return response.content + else: + raise ValueError("Response content is not a valid string") + + async def write_ts_prediction_description(self, ts_response: str, original_message: str, ctx: MessageContext) -> None: + prediction = pd.read_csv(Path(ts_response)).to_csv(index=False) + message = SystemMessage( + source="user", + content=f"""You are a domain expert reviewing a prediction made by an expert time-series analyst. The description of the data and problem is as follows: {original_message} + The prediction is provided here as a comma-separated list of numerical values: {prediction}. + Your goal is to interpret this sequence by identifying and describing key statistical and temporal patterns. + As you analyze the data, consider the following: + 1. Descriptive Statistics (mean, mode, median, standard deviation, etc.) + 2. Temporal Trends (increasing/decreasing patterns, periodicity, etc.) + 3. Seasonality and Cyclic Behavior + 4. Anomalies or Outliers + Focus on interpreting patterns rather than listing every number. 
+ """ + ) - # publish - await self.publish_message(TSMessage(source=self.name, - filepath = ts_response, - task_type="ts-classification"), TopicId(type="TS-Info", source=self.id.key)) + response = await self._model_client.create( + messages=[message], + cancellation_token=ctx.cancellation_token) + + if isinstance(response.content, str): + return response.content + else: + raise ValueError("Response content is not a valid string") def get_last_response(self) -> Optional[str]: return self._last_ts_response diff --git a/multi_agents_pipeline/ltsm_inference.py b/multi_agents_pipeline/ltsm_inference.py index cda6e2f..9e476d9 100644 --- a/multi_agents_pipeline/ltsm_inference.py +++ b/multi_agents_pipeline/ltsm_inference.py @@ -31,9 +31,8 @@ def inference(file: str, task_type: str = "ts-classification") -> str: """ #login(token="Hugging Face Token") # Login to Hugging Face Hub if needed - config = LTSMConfig(seq_len=150, pred_len=150, prompt_len=0) - #model = get_model(config, "LTSM", local_pretrain=None, hf_hub_model="LSC2204/LTSM-bundle") - model = get_model(config, "LTSM", local_pretrain=None, hf_hub_model=None) + config = LTSMConfig() + model = get_model(config, "LTSM", local_pretrain=None, hf_hub_model="LSC2204/LTSM-bundle") task_type = task_type files = file.split() @@ -44,20 +43,14 @@ def inference(file: str, task_type: str = "ts-classification") -> str: os.makedirs(base_path, exist_ok=True) for index, file in enumerate(files): df = CSVReader(file).fetch() - processor = StandardScaler() - input_data, _, _, = processor.process( - raw_data=df.to_numpy(), - train_data=[df.to_numpy()], - val_data=[df.to_numpy()], - test_data=[df.to_numpy()], - fit_train_only=True, # Use the training data for scaling - do_anomaly=False - ) - input_data = np.array(input_data[0]) + input_data = df.to_numpy().transpose() if input_data.ndim == 1: input_data = input_data.reshape(-1, 1) tensor_data = torch.tensor(input_data, dtype=torch.float32) tensor_data = tensor_data.unsqueeze(0) + 
tensor_data_length = tensor_data.shape[1] + # Pad tensor to match pretrained LTSM input size (336 seq_len + 133 prompt_len) + tensor_data = torch.nn.functional.pad(tensor_data, (0, 0, 133+336-tensor_data_length, 0), mode='constant', value=tensor_data[0, tensor_data_length-1, 0]) with torch.no_grad(): model.eval() output = model(tensor_data)
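
The padding step in the last hunk above can be illustrated in isolation. The following is a minimal sketch, not the repository's code: it uses plain Python lists rather than tensors, and the helper name `left_pad_to_length` and its truncation behaviour for over-long inputs are assumptions made for illustration. Only the target length of 336 + 133 = 469 comes from the comment in the diff.

```python
from typing import List, Sequence

# Target length taken from the diff comment: 336 (seq_len) + 133 (prompt_len).
TARGET_LEN = 336 + 133


def left_pad_to_length(series: Sequence[float], target_len: int = TARGET_LEN) -> List[float]:
    """Hypothetical helper: left-pad a series with its last observed value."""
    values = list(series)
    if not values:
        raise ValueError("series must contain at least one value")
    if target_len <= 0:
        raise ValueError("target_len must be positive")
    if len(values) >= target_len:
        # Assumption: an over-long series keeps only its most recent values.
        return values[-target_len:]
    # Repeat the last observed value in front of the series until it is long enough.
    padding = [values[-1]] * (target_len - len(values))
    return padding + values


short = [1.0, 2.0, 3.0]
print(left_pad_to_length(short, target_len=5))  # prints [3.0, 3.0, 1.0, 2.0, 3.0]
```

Padding on the left keeps the real observations at the end of the window, next to the forecast horizon. Whether the last value is the best fill constant for a given dataset is a separate question that this sketch does not settle.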