Hugging Face Connector

Connect to Hugging Face Hub datasets. Access thousands of datasets for AI and machine learning projects.

Setup Instructions

1. Navigate to Data Integrations

Go to the Data Integrations tab in your flow.

2. Select Hugging Face Integration

Click Select an Integration, type Hugging Face in the search field, and click Connect.

3. Create a Hugging Face Access Token (Optional)

Public datasets do not require a token. For private or gated datasets, create one as follows:

Hugging Face Token
  1. Go to Hugging Face Settings
  2. Click New token
  3. Give your token a name (e.g., "AIcuFlow Connector")
  4. Select token type:
    • Read: For downloading datasets only (recommended)
    • Write: If you plan to upload data back
  5. Click Generate token
  6. Important: Copy the token immediately. You won't be able to see it again.
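
If you want to check a token outside the connector, the sketch below shows one way to handle it in your own scripts. It is not the connector's implementation. It assumes the token is stored in the `HF_TOKEN` environment variable and sent as a standard `Authorization: Bearer` header; the helper names `auth_headers` and `mask_token` are hypothetical.

```python
import os


def auth_headers(token=None):
    """Build HTTP headers for a Hugging Face Hub request.

    Falls back to the HF_TOKEN environment variable when no token is passed.
    Returns an empty dict for anonymous access (public datasets).
    """
    if token is None:
        token = os.environ.get("HF_TOKEN")
    if not token:
        return {}
    return {"Authorization": f"Bearer {token.strip()}"}


def mask_token(token, visible=4):
    """Hide all but the first few characters so the token is safe to log."""
    if len(token) <= visible:
        return "*" * len(token)
    return token[:visible] + "*" * (len(token) - visible)
```

Keeping the token in an environment variable rather than in source code makes it less likely to end up in version control.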

4. Configure the Connector

Back in the connector setup, fill in:

  • Connector Name: Give your connector a descriptive name (e.g., "HuggingFace Datasets")
  • Access Token (Optional): Paste your token (only required for private/gated datasets)
  • Folder (Optional): Select a destination folder in the file manager
    • If not specified, data will be stored in the root directory

5. Specify Dataset

Configure which dataset to download:

  • Dataset Name: The identifier of the dataset on Hugging Face Hub

    • Format: organization/dataset-name or just dataset-name
    • Examples:
      • imdb - IMDb movie reviews
      • squad - Stanford Question Answering Dataset
      • wikitext - Wikipedia text corpus
      • glue - General Language Understanding Evaluation
  • Dataset Subset (Optional): Some datasets have multiple subsets or configurations

    • Example: For the glue dataset, you can specify mrpc, sst2, etc.
  • Split (Optional): Specify which split to download

    • Common splits: train, test, validation
    • Leave empty to download all splits

You can find dataset information at: https://huggingface.co/datasets/{dataset-name}
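
To make the three fields concrete, here is a minimal sketch of how they might be validated and mapped onto a dataset request. The connector's internals are not documented here, so this only assumes that the fields correspond to the `path`, `name`, and `split` parameters of the `load_dataset` function in the Hugging Face `datasets` library. The `DatasetSpec` class is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetSpec:
    """Hypothetical container for the Dataset Name, Subset, and Split fields."""

    dataset_name: str
    subset: Optional[str] = None
    split: Optional[str] = None

    def __post_init__(self):
        # Accept "dataset-name" or "organization/dataset-name" only.
        parts = self.dataset_name.split("/")
        if len(parts) > 2 or not all(parts):
            raise ValueError(f"Invalid dataset name: {self.dataset_name!r}")

    @property
    def hub_url(self):
        """Page on the Hub where the dataset is described."""
        return f"https://huggingface.co/datasets/{self.dataset_name}"

    def load_dataset_kwargs(self):
        """Keyword arguments in the shape datasets.load_dataset expects."""
        kwargs = {"path": self.dataset_name}
        if self.subset:
            kwargs["name"] = self.subset
        if self.split:
            kwargs["split"] = self.split  # omitted means all splits
        return kwargs
```

For example, `DatasetSpec("glue", "mrpc", "train").load_dataset_kwargs()` yields arguments that could be passed as `load_dataset(**kwargs)` if the `datasets` library is installed.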

Enter the dataset ID in the Repository ID field.

6. Configure Download Options

  • Download Format: Choose how to save the data

    • Parquet: Efficient columnar format (recommended)
    • CSV: Comma-separated values
    • JSON: JavaScript Object Notation
    • Arrow: Apache Arrow format
  • Cache: Enable caching to avoid re-downloading unchanged data
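
How the connector decides that data is "unchanged" is not specified. The sketch below illustrates one plausible approach under stated assumptions: each combination of dataset, subset, split, and format gets a cache key, and a download is skipped when the stored revision for that key matches the remote one. The functions `cache_key` and `needs_download` and the dictionary-based index are hypothetical.

```python
import hashlib


def cache_key(dataset, subset, split, fmt):
    """Derive a short, stable key from the download settings."""
    raw = "|".join([dataset, subset or "", split or "", fmt.lower()])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


def needs_download(cache_index, key, remote_revision):
    """Return True when nothing is cached or the remote revision differs.

    cache_index maps cache keys to the revision that was last downloaded.
    """
    return cache_index.get(key) != remote_revision
```

Including the format in the key means that switching from Parquet to CSV triggers a fresh conversion instead of reusing files in the wrong format.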

7. Create the Connection

After filling in all details, click Create Connection.

The system will:

  • Authenticate with Hugging Face Hub (if token provided)
  • Download the specified dataset
  • Convert to your chosen format
  • Begin the initial data synchronization

8. Monitor Sync Status

  1. Navigate to Data Synchronization to see the import progress
  2. Large datasets may take time to download and process
  3. Progress will be shown for each split being downloaded

9. Access Your Data

  1. Once the sync is complete, go to File Manager
  2. Navigate to the folder you specified (or root directory)
  3. You'll see the dataset files organized by split (train, test, validation)
  4. Click on any file to preview the data
  5. The data is now ready to use in your AI pipelines and flows
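
The exact folder layout is not documented, so treat the following as an illustration only. It assumes one file per split, named after the split, inside a subfolder named after the dataset, with an extension matching the chosen download format. The `EXTENSIONS` table and the `planned_files` helper are hypothetical.

```python
from pathlib import PurePosixPath

# Assumed mapping from the Download Format options to file extensions.
EXTENSIONS = {
    "parquet": ".parquet",
    "csv": ".csv",
    "json": ".json",
    "arrow": ".arrow",
}


def planned_files(folder, dataset_name, splits, fmt):
    """List the file paths one might expect after a sync completes.

    An empty folder means the root directory. A slash in the dataset
    name is replaced so each dataset maps to a single subfolder.
    """
    ext = EXTENSIONS[fmt.lower()]
    base = PurePosixPath(folder or "/") / dataset_name.replace("/", "__")
    return [str(base / f"{split}{ext}") for split in splits]
```

A helper like this can be handy in downstream scripts that need to confirm every expected split arrived before a pipeline starts.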

What Gets Imported:

  • Dataset files in your chosen format
  • Metadata and dataset information
  • All specified splits (train, test, validation)
  • Dataset card and documentation (if available)

Best Practices:

  • Check dataset licenses before using them in production
  • Use specific dataset versions/commits for reproducibility
  • Start with small datasets to test your pipeline
  • Cache datasets to avoid repeated downloads
  • Keep access tokens secure and never share them
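
As an illustration of the reproducibility advice, the sketch below checks that a revision is a full commit hash rather than a moving branch name such as `main`. It assumes you load data yourself with the `datasets` library, whose `load_dataset` function accepts a `revision` parameter; whether the connector exposes a revision setting is not stated here. The helper names are hypothetical.

```python
import re

# A full Git commit hash: exactly 40 lowercase hexadecimal characters.
_COMMIT_RE = re.compile(r"[0-9a-f]{40}")


def is_pinned_revision(revision):
    """Return True only for a full commit hash, not a branch or tag name."""
    return bool(_COMMIT_RE.fullmatch(revision))


def pinned_kwargs(dataset_name, revision):
    """Build load_dataset-style arguments that refuse unpinned revisions."""
    if not is_pinned_revision(revision):
        raise ValueError(f"Revision {revision!r} is not a full commit hash")
    return {"path": dataset_name, "revision": revision}
```

Branch names can point to different commits over time, so recording the commit hash alongside your results makes a run repeatable.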

Popular Hugging Face Datasets:

  • imdb - Movie reviews sentiment analysis
  • squad - Question answering dataset
  • glue - Language understanding benchmark
  • conll2003 - Named entity recognition
  • wmt14 - Machine translation dataset
  • cifar10 - Image classification dataset
  • common_voice - Multilingual speech dataset
