Merge pull request #125 from alipay/dev

fix: tweak jsonl and excel name format in dataAgent
Jerry Z H
2024-07-12 12:37:58 +08:00
committed by GitHub
9 changed files with 7 additions and 7 deletions


@@ -108,7 +108,7 @@ As shown in the figure below:
![data_agent_dataset](../_picture/data_agent_dataset_en.png)
-[dataAgent sample evaluation dataset](../../../sample_standard_app/app/examples/data/dataset_turn_1_2024-07-10-15:06:24.jsonl)
+[dataAgent sample evaluation dataset](../../../sample_standard_app/app/examples/data/dataset_turn_1_2024-07-10-15-06-24.jsonl)
### Complete Evaluation Results
@@ -124,7 +124,7 @@ As shown in the figure below:
- More dimensions Score/Suggestion: similar to the Relevance dimension.
![data_agent_eval_result](../_picture/data_agent_eval_result_en.png)
-[dataAgent sample eval result](../../../sample_standard_app/app/examples/data/eval_result_turn_1_2024-07-10-15:06:24.xlsx)
+[dataAgent sample eval result](../../../sample_standard_app/app/examples/data/eval_result_turn_1_2024-07-10-15-06-24.xlsx)
@@ -139,7 +139,7 @@ As shown in the figure below:
![data_agent_eval_report](../_picture/data_agent_eval_report_en.png)
-[dataAgent sample evaluation report](../../../sample_standard_app/app/examples/data/eval_report_2024-07-10-15:06:24.xlsx)
+[dataAgent sample evaluation report](../../../sample_standard_app/app/examples/data/eval_report_2024-07-10-15-06-24.xlsx)
### Comparative Experiment
Adjust the llm model in demo_rag_agent within aU from the previous `qwen1.5-72b-chat` to `qwen1.5-7b-chat`, and after evaluation by dataAgent, the comprehensive evaluation reports are as follows:


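@@ -110,7 +110,7 @@ tips: configure the question set and the number of rows to evaluate sensibly, so as not to incur heavy compute
![data_agent_dataset](../_picture/data_agent_dataset.png)
-[Sample evaluation dataset produced by dataAgent](../../../sample_standard_app/app/examples/data/dataset_turn_1_2024-07-10-10:48:30.jsonl)
+[Sample evaluation dataset produced by dataAgent](../../../sample_standard_app/app/examples/data/dataset_turn_1_2024-07-10-10-48-30.jsonl)
### Complete Evaluation Results
After producing the evaluation dataset, dataAgent evaluates and annotates the data along multiple dimensions and outputs the complete evaluation results. If the dataAgent batch task runs for multiple rounds, it produces multiple sets of complete evaluation results.
@@ -129,7 +129,7 @@ tips: configure the question set and the number of rows to evaluate sensibly, so as not to incur heavy compute
- For example, the **relevance-dimension suggestion** for record 1: Although the answer addresses the question about the weather in Beijing, the temperature is given in Fahrenheit, which does not match the Celsius scale that users in China are accustomed to. It is recommended to convert to Celsius and provide more complete weather information, such as humidity and wind strength.
- For example, the **factuality-dimension suggestion** for record 3: The answer contains factual errors, such as wrongly placing Argentine star Lionel Messi on the England team. The suggested improvement is to make sure all data and facts mentioned are accurate, especially where specific people and events are involved.
-[Sample complete evaluation results produced by dataAgent](../../../sample_standard_app/app/examples/data/eval_result_turn_1_2024-07-10-10:48:30.xlsx)
+[Sample complete evaluation results produced by dataAgent](../../../sample_standard_app/app/examples/data/eval_result_turn_1_2024-07-10-10-48-30.xlsx)
### Comprehensive Evaluation Report
Based on the complete evaluation results of multiple rounds, a comprehensive evaluation report is generated.
@@ -141,7 +141,7 @@ tips: configure the question set and the number of rows to evaluate sensibly, so as not to incur heavy compute
- Avg Score for the other dimensions follows the same pattern
![data_agent_eval_report](../_picture/data_agent_eval_report.png)
-[Sample comprehensive evaluation report produced by dataAgent](../../../sample_standard_app/app/examples/data/eval_report_2024-07-10-10:48:30.xlsx)
+[Sample comprehensive evaluation report produced by dataAgent](../../../sample_standard_app/app/examples/data/eval_report_2024-07-10-10-48-30.xlsx)
### Comparative Experiment
Change the model in `demo_rag_agent` within aU from **qwen1.5-7b-chat**, used when producing the evaluation report above, to **qwen1.5-72b-chat**. After evaluation by dataAgent, the generated comprehensive evaluation report is as follows: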


@@ -53,7 +53,7 @@ class DataAgent(Agent):
input_object (InputObject): input parameters passed by the user.
agent_input (dict): agent input parsed from `input_object` by the user.
"""
-date = datetime.datetime.now().strftime("%Y-%m-%d-%H:%M:%S")
+date = datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
input_object.add_data('date', date)
# step1: build q&a dataset from the candidate agent which needs to be evaluated.
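
For illustration, the effect of the format change on generated file names can be sketched as follows. This is a minimal sketch, not code from the PR: `build_output_name` is a hypothetical helper, and the fixed timestamp is chosen only to match the sample file names in the docs. The PR does not state its motivation; presumably it is portability, since `:` is not allowed in file names on Windows.

```python
import datetime


def build_output_name(prefix: str, ext: str, now: datetime.datetime) -> str:
    """Hypothetical helper: join a prefix, a timestamp and an extension.

    Uses the hyphen-only timestamp format introduced by this change,
    so the resulting name contains no ':' characters.
    """
    date = now.strftime("%Y-%m-%d-%H-%M-%S")
    return f"{prefix}_{date}.{ext}"


# Fixed timestamp (an assumption for the example) so the result is reproducible.
now = datetime.datetime(2024, 7, 10, 15, 6, 24)

report_name = build_output_name("eval_report", "xlsx", now)
dataset_name = build_output_name("dataset_turn_1", "jsonl", now)

print(report_name)   # eval_report_2024-07-10-15-06-24.xlsx
print(dataset_name)  # dataset_turn_1_2024-07-10-15-06-24.jsonl
```

With the old format string `"%Y-%m-%d-%H:%M:%S"`, the same timestamp would yield `2024-07-10-15:06:24`, which is why the sample data files and their doc links are renamed alongside the code change.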