
Data Cleaning & Annotation Workflow

Complete workflow for time series datasets (Energy, Manufacturing, Climate) on Kaggle to Data Annotation platform (data.smlcrm.com). Includes downloading, cl...
deyashmukh

Overview

Simulacrum Data Annotation Workflow

Complete end-to-end workflow for time series dataset preparation and annotation on the Data Annotation platform (data.smlcrm.com).

What This Skill Does

This skill captures the precise workflow for processing time series datasets (Energy, Manufacturing, Climate) from discovery to CLEAN status:

  1. Find Dataset: Search Kaggle for Energy/Manufacturing/Climate time series data
  2. Download: Get CSV files via browser or Kaggle CLI
  3. Clean: Run Python/pandas script to handle missing values, duplicates, formatting
  4. Upload RAW: Upload original CSV with metadata (name, domain, source URL, description)
  5. Configure Headers: Set column types (Time, Target, Covariate, Group) and units
  6. Assign Groups: Select ALL variables (target + covariates), apply ALL group tags
  7. Upload Cleaned: Final upload → CLEAN status

Supported Domains

  • Energy: Power consumption, utilities, renewable energy, grid data
  • Manufacturing: Industrial processes, steel production, emissions, equipment data
  • Climate: CO2 emissions, environmental monitoring, weather correlation data

Quick Start

For the full pipeline from Kaggle to annotated dataset:

1. Find dataset on Kaggle
2. Download (browser or kaggle CLI)
3. Clean with scripts/clean_dataset.py
4. Upload RAW dataset to data.smlcrm.com (with metadata)
5. Click "Clean" and upload cleaned file
6. Configure column metadata (types, units)
7. Assign groups to variables
8. Upload cleaned dataset → CLEAN status

Workflow Steps

Step 1: Find and Download Dataset

From Kaggle (Browser Method):

  1. Navigate to kaggle.com/datasets
  2. Search for relevant dataset (e.g., "steel industry energy consumption", "manufacturing emissions", "climate CO2")
  3. Review data description, file list, and preview
  4. Click "Download" button
  5. Extract CSV file from downloaded zip

Alternative: Kaggle CLI

# Install if needed: pip install kaggle
# Configure: place your API token at ~/.kaggle/kaggle.json
# Verify: kaggle competitions list

scripts/download_kaggle.sh <dataset-name> [output-dir]
# Example: scripts/download_kaggle.sh csafrit2/steel-industry-energy-consumption
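
The contents of scripts/download_kaggle.sh are not shown in this guide. As a rough equivalent, the sketch below wraps the Kaggle CLI's `kaggle datasets download` command from Python. It assumes the CLI is installed and an API token is configured; `build_download_command` and `download_dataset` are hypothetical helper names, and the default output directory `data` is an arbitrary choice.

```python
import subprocess
from pathlib import Path


def build_download_command(dataset: str, output_dir: str = "data") -> list[str]:
    """Build the Kaggle CLI call that downloads and unzips one dataset."""
    if dataset.count("/") != 1:
        raise ValueError("dataset must look like '<owner>/<dataset-slug>'")
    return ["kaggle", "datasets", "download", "-d", dataset, "-p", output_dir, "--unzip"]


def download_dataset(dataset: str, output_dir: str = "data") -> list[Path]:
    """Run the download and return the CSV files found in output_dir."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the CLI exits with an error.
    subprocess.run(build_download_command(dataset, output_dir), check=True)
    return sorted(Path(output_dir).glob("*.csv"))
```

Calling `download_dataset("csafrit2/steel-industry-energy-consumption")` would then return the paths of the extracted CSV files, provided the download succeeds.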

Step 2: Clean the Dataset

Always run the cleaning script before upload:

python3 scripts/clean_dataset.py <input.csv> [-o <output.csv>]

What the script does:

  • Strips whitespace from column names
  • Removes duplicate rows
  • Fills missing numeric values with median
  • Fills missing categorical values with mode or 'Unknown'
  • Converts timestamp columns to datetime format
  • Outputs column summary for metadata configuration

Output:

  • Cleaned CSV file ready for upload
  • Column summary printed to console (save this for metadata config)
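
The source of scripts/clean_dataset.py is not reproduced here, so the following is only a minimal pandas sketch of the steps listed above, not the script itself. The function names and the `time_columns` parameter are assumptions made for illustration; in particular, the real script may detect timestamp columns automatically.

```python
import pandas as pd


def clean_dataframe(df: pd.DataFrame, time_columns: tuple[str, ...] = ()) -> pd.DataFrame:
    """Apply the cleaning steps listed above to a copy of df."""
    df = df.copy()
    # Strip whitespace from column names, then drop exact duplicate rows.
    df.columns = [str(c).strip() for c in df.columns]
    df = df.drop_duplicates().reset_index(drop=True)

    for col in df.columns:
        if col in time_columns:
            # Unparseable timestamps become NaT instead of raising.
            df[col] = pd.to_datetime(df[col], errors="coerce")
        elif not df[col].isna().any():
            continue  # nothing to fill
        elif pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            # Most frequent value, or 'Unknown' if the column is entirely empty.
            mode = df[col].mode(dropna=True)
            df[col] = df[col].fillna(mode.iloc[0] if not mode.empty else "Unknown")
    return df


def column_summary(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, missing count, and number of distinct values."""
    return pd.DataFrame(
        {"dtype": df.dtypes.astype(str), "missing": df.isna().sum(), "unique": df.nunique()}
    )
```

A typical use would be to read the raw file with `pd.read_csv`, pass it through `clean_dataframe`, write the result with `to_csv(..., index=False)`, and print `column_summary` for the metadata step.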

Step 3: Upload Raw Dataset to Platform

  1. Navigate to data.smlcrm.com/dashboard
  2. Click "Upload Dataset" button
  3. Fill in metadata for the RAW dataset:
    • Name: Descriptive dataset name
    • Domain: Category (Energy, Manufacturing, Climate, etc.)
    • Source URL: Kaggle or original source URL
    • Description: Brief summary of the dataset
  4. Upload the original/raw CSV file (not cleaned yet)
  5. Click Upload

Result: Dataset appears in list with RAW status

Step 4: Upload Cleaned File & Configure Metadata

  1. Find the RAW dataset in the list
  2. Click "Clean" button
  3. Upload the cleaned CSV file (from Step 2)
  4. Configure headers for each column:
| Setting | Description |
|---------|-------------|
| Name | Column name (editable) |
| Units | Measurement units (kWh, °C, %, ratio, tCO2, etc.) |
| Type | Time / Target / Covariate / Group |

Column Type Guide:

  • Time: Timestamp/datetime columns (usually required)
  • Target: Variable to predict (at least one required)
  • Covariate: Input features/independent variables
  • Group: Categorical segment variables (WeekStatus, Day_of_week, Load_Type, etc.)

Bulk Configuration:

  • Select multiple rows via checkboxes
  • Use "Apply" dropdown to set type for selected columns
  • Set units individually or in bulk

Common Unit Patterns:

  • Energy: kWh, MWh
  • Power: kW, MW
  • Reactive energy: kVarh
  • Emissions: tCO2, kgCO2
  • Ratios: ratio, %
  • Time: seconds, minutes, hours
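
Many column names already carry their unit as a suffix, so a first guess at the Units field can be derived from the name. The sketch below is a hypothetical helper, not a platform feature: `guess_unit` and the `KNOWN_UNITS` list are assumptions, and it only recognises a unit at the end of the name, after an underscore or inside parentheses. Columns without such a suffix (ratios, counters, and so on) still need their unit set by hand.

```python
import re
from typing import Optional

# Unit tokens taken from the examples in this guide; extend as needed.
KNOWN_UNITS = ("kWh", "MWh", "MW", "kW", "kVarh", "tCO2", "kgCO2")

# Matches e.g. "..._kWh" or "...(tCO2)" at the end of a column name.
_UNIT_SUFFIX = re.compile(r"[_(](" + "|".join(KNOWN_UNITS) + r")\)?$")


def guess_unit(column_name: str) -> Optional[str]:
    """Return the unit encoded at the end of a column name, or None."""
    match = _UNIT_SUFFIX.search(column_name.strip())
    return match.group(1) if match else None
```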

Step 5: Assign Groups to Variables

Purpose: Group variables define how data is segmented for analysis.

Exact Workflow:

  1. Select ALL variables by checking their checkboxes:
    • Target variable(s)
    • ALL covariate variables
  2. Apply ALL group tags to selected variables:
    • Click first group tag (e.g., WeekStatus) → all selected get this group
    • Click second group tag (e.g., Day_of_week) → all selected get this group
    • Click third group tag (e.g., Load_Type) → all selected get this group
    • Continue for all available group tags
  3. Result: All variables have all groups assigned (e.g., "WeekStatus × Day_of_week × Load_Type")

Important: Assign groups to BOTH target variables AND all covariates.
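
The outcome of this step can be modelled in a few lines of Python. The sketch does not call any platform API; it only reproduces what the manual clicks should yield, so the result shown in the UI can be checked against it. `assign_groups` is a hypothetical name, and the input is assumed to be a mapping from column name to the type chosen in Step 4.

```python
def assign_groups(column_types: dict[str, str]) -> dict[str, str]:
    """Map every Target and Covariate column to the combined label of all Group columns."""
    groups = [name for name, kind in column_types.items() if kind == "Group"]
    variables = [
        name for name, kind in column_types.items() if kind in ("Target", "Covariate")
    ]
    # Every variable receives every group tag, in the order the groups appear.
    label = " × ".join(groups)
    return {name: label for name in variables}
```

Time and Group columns are deliberately absent from the result, since only targets and covariates receive group tags.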

Step 6: Final Upload

  1. Click "Upload Cleaned Dataset" button
  2. Wait for processing
  3. Dataset status changes from RAW → CLEAN
  4. Verify data points count is correct

Example: Steel Industry Energy Dataset

Source: https://www.kaggle.com/datasets/csafrit2/steel-industry-energy-consumption

Metadata:

  • Name: Steel Industry Energy Consumption (South Korea)
  • Domain: Energy
  • Data Points: 350,400

Column Configuration:

| Column | Type | Units |
|--------|------|-------|
| Timestamps | Time | - |
| Usage_kWh | Target | kWh |
| Lagging_Current_Reactive.Power_kVarh | Covariate | kVarh |
| Leading_Current_Reactive_Power_kVarh | Covariate | kVarh |
| CO2(tCO2) | Covariate | tCO2 |
| Lagging_Current_Power_Factor | Covariate | ratio |
| Leading_Current_Power_Factor | Covariate | ratio |
| NSM | Covariate | seconds |
| WeekStatus | Group | - |
| Day_of_week | Group | - |
| Load_Type | Group | - |

Group Assignment:

  1. Select: Usage_kWh, Lagging_Current_Reactive.Power_kVarh, Leading_Current_Reactive_Power_kVarh, CO2(tCO2), Lagging_Current_Power_Factor, Leading_Current_Power_Factor, NSM
  2. Click: WeekStatus → all selected get WeekStatus
  3. Click: Day_of_week → all selected get Day_of_week
  4. Click: Load_Type → all selected get Load_Type
  5. Final: All variables show "WeekStatus × Day_of_week × Load_Type"

Reference Materials

For detailed platform configuration guidance, see references/platform_guide.md.

Troubleshooting

"Next" button disabled:

  • Check at least one Time column is set
  • Check at least one Target column is set
  • Verify all columns have types assigned
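
These three checks can be run against a planned column configuration before opening the platform. The sketch below mirrors only the checks listed above; the platform's actual validation rules are not documented here and may be stricter. `config_problems` and `VALID_TYPES` are hypothetical names.

```python
from typing import Optional

VALID_TYPES = ("Time", "Target", "Covariate", "Group")


def config_problems(column_types: dict[str, Optional[str]]) -> list[str]:
    """Return human-readable problems; an empty list means all three checks pass."""
    problems = []
    untyped = [name for name, kind in column_types.items() if kind not in VALID_TYPES]
    if untyped:
        problems.append("columns without a valid type: " + ", ".join(untyped))
    kinds = list(column_types.values())
    if "Time" not in kinds:
        problems.append("no Time column set")
    if "Target" not in kinds:
        problems.append("no Target column set")
    return problems
```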

Groups not appearing:

  • Columns must be marked as "Group" type first
  • Proceed to next step after setting Group types

Upload fails:

  • Re-run cleaning script
  • Check CSV format (comma-delimited)
  • Verify no empty column names

Scripts

| Script | Purpose |
|--------|---------|
| scripts/clean_dataset.py | Clean and prepare CSV for upload |
| scripts/download_kaggle.sh | Download datasets via Kaggle CLI |

Platform URL

Data Annotation Platform: https://data.smlcrm.com

Version History

1 version in total

  • v1.0.0 (current)
    2026-03-29 07:11
