Documentation (English)

File formats

There are many, but what do they do?

A file format defines how data is stored.

Most people think of file formats like PDF, PowerPoint (PPTX) or Excel (XLSX). For AI and data work, however, file formats are usually grouped by what the data looks like in memory, that is, how the computer holds it once the file is loaded.

Different data modalities require different file formats.

There are four simple types:

  1. Tabular data can be represented as rows and columns (e.g. a DataFrame). Examples: csv, tsv, parquet, xlsx, arrow
  2. Image data is interpreted as pixel matrices with channels. Examples: png, jpg, bmp
  3. Structured data can be parsed into key–value representations (often nested). Examples: json, yaml, xml
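
The grouping can be sketched as a simple lookup from file extension to modality. This is a minimal illustration, not part of any product API: `MODALITY_BY_EXTENSION` and `classify_format` are hypothetical names, and the extension lists are copied from the examples given here. Anything not listed falls back to "binary", the fourth type.

```python
from pathlib import Path

# Hypothetical lookup table built from the example extensions in this section.
MODALITY_BY_EXTENSION = {
    **dict.fromkeys(["csv", "tsv", "parquet", "xlsx", "arrow"], "tabular"),
    **dict.fromkeys(["png", "jpg", "bmp"], "image"),
    **dict.fromkeys(["json", "yaml", "xml"], "structured"),
}


def classify_format(filename):
    """Return the data modality for a file name, based only on its extension."""
    extension = Path(filename).suffix.lower().lstrip(".")
    # Unknown or missing extensions are treated as binary data.
    return MODALITY_BY_EXTENSION.get(extension, "binary")


for name in ["sales.csv", "photo.JPG", "config.yaml", "report.pdf"]:
    print(name, "->", classify_format(name))
```

Looking only at the extension is a simplification; it says nothing about whether the file content actually matches the format.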

Everything else falls under binary data (4.). Binary formats (e.g. pdf, docx) are containers rather than directly machine-readable representations. They contain text, tables, or images, but this structure is hidden behind the file format specification and has to be extracted before it can be used.
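
To make the container idea concrete, the sketch below uses only the Python standard library. A docx file is a ZIP archive whose main text lives in `word/document.xml`. The helper `make_minimal_docx` is a hypothetical stand-in that builds a stripped-down, docx-like archive in memory (not a complete, valid Word document), and `extract_docx_text` is an assumed name for the extraction step. Real documents contain many more parts and usually call for a dedicated library.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def make_minimal_docx(paragraphs):
    """Build a stripped-down, docx-like ZIP archive in memory (illustration only)."""
    body = "".join(
        f"<w:p><w:r><w:t>{escape(text)}</w:t></w:r></w:p>" for text in paragraphs
    )
    document_xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document_xml)
    return buffer.getvalue()


def extract_docx_text(raw_bytes):
    """Open the container and return the text of each paragraph as a list."""
    with zipfile.ZipFile(io.BytesIO(raw_bytes)) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        runs = [node.text or "" for node in paragraph.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return paragraphs


raw = make_minimal_docx(["File formats", "A container hides its structure."])
print(raw[:2])                 # ZIP signature: the bytes themselves are not readable text
print(extract_docx_text(raw))  # the text only appears after unpacking and parsing
```

The raw bytes start with the ZIP signature and reveal nothing about the paragraphs inside, which is why an extraction step comes first.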

