Tutorial: Tybles quickstart

Let’s say we want to read the following data into a pandas.DataFrame:

kepler_id,koi_name,kepler_name,status,period,radius
10666592,K00002.01,Kepler-2 b,CONFIRMED,2.204735365,16.39
6922244,K00010.01,Kepler-8 b,CONFIRMED,3.522498573,14.83
11904151,K00072.01,Kepler-10 b,CONFIRMED,0.837491331,1.45
10187017,K00082.04,Kepler-102 c,CONFIRMED,7.07136076,0.58
10187017,K00082.05,Kepler-102 b,CONFIRMED,5.28695437,0.49
10984090,K00112.02,Kepler-466 c,CONFIRMED,3.709213846,1.24
9579641,K00115.01,Kepler-105 b,CONFIRMED,5.41220713,3.28

which we provide as a string below.

We also provide an invalid version of the same CSV content, in which the kepler_id of the last row (K9579641) is not an integer.

CSV content

import io

valid_csv = """
kepler_id,koi_name,kepler_name,status,period,radius
10666592,K00002.01,Kepler-2 b,CONFIRMED,2.204735365,16.39
6922244,K00010.01,Kepler-8 b,CONFIRMED,3.522498573,14.83
11904151,K00072.01,Kepler-10 b,CONFIRMED,0.837491331,1.45
10187017,K00082.04,Kepler-102 c,CONFIRMED,7.07136076,0.58
10187017,K00082.05,Kepler-102 b,CONFIRMED,5.28695437,0.49
10984090,K00112.02,Kepler-466 c,CONFIRMED,3.709213846,1.24
9579641,K00115.01,Kepler-105 b,CONFIRMED,5.41220713,3.28
""".strip()

invalid_csv = """
kepler_id,koi_name,kepler_name,status,period,radius
10666592,K00002.01,Kepler-2 b,CONFIRMED,2.204735365,16.39
6922244,K00010.01,Kepler-8 b,CONFIRMED,3.522498573,14.83
11904151,K00072.01,Kepler-10 b,CONFIRMED,0.837491331,1.45
10187017,K00082.04,Kepler-102 c,CONFIRMED,7.07136076,0.58
10187017,K00082.05,Kepler-102 b,CONFIRMED,5.28695437,0.49
10984090,K00112.02,Kepler-466 c,CONFIRMED,3.709213846,1.24
K9579641,K00115.01,Kepler-105 b,CONFIRMED,5.41220713,3.28
""".strip()

Defining a dataframe schema

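We first describe the schema using a dataclass. Each field names a column and gives its expected type. Note that the radius column is not part of the schema.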

import tybles as tb
from dataclasses import dataclass
import numpy as np
import pandas as pd

@dataclass(frozen=True)
class Planet:
    kepler_id: np.int32
    koi_name: str
    kepler_name: str
    status: str
    period: np.float64
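
To make the link between dataclass fields and columns concrete, here is a minimal sketch of how a column-to-type mapping could be derived from a dataclass. It uses only the standard library; PlanetSketch and column_types are hypothetical names introduced for illustration, built-in types stand in for the NumPy dtypes, and this is not necessarily how Tybles does it internally.

```python
from dataclasses import dataclass, fields
from typing import get_type_hints


@dataclass(frozen=True)
class PlanetSketch:
    # Built-in types stand in for the NumPy dtypes used in the tutorial.
    kepler_id: int
    koi_name: str
    kepler_name: str
    status: str
    period: float


def column_types(row_spec):
    """Map each dataclass field name to its annotated type, in declaration order."""
    hints = get_type_hints(row_spec)
    return {f.name: hints[f.name] for f in fields(row_spec)}


spec = column_types(PlanetSketch)
print(list(spec))         # column names, in declaration order
print(spec["kepler_id"])  # <class 'int'>
```

A mapping of this shape is the kind of information a reader needs in order to request specific column types.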

Reading a CSV file

We then read the valid contents using Tybles.

from io import StringIO
schema = tb.schema(Planet)
schema.read_csv(StringIO(valid_csv))
   kepler_id   koi_name   kepler_name     status    period
0   10666592  K00002.01    Kepler-2 b  CONFIRMED  2.204735
1    6922244  K00010.01    Kepler-8 b  CONFIRMED  3.522499
2   11904151  K00072.01   Kepler-10 b  CONFIRMED  0.837491
3   10187017  K00082.04  Kepler-102 c  CONFIRMED  7.071361
4   10187017  K00082.05  Kepler-102 b  CONFIRMED  5.286954
5   10984090  K00112.02  Kepler-466 c  CONFIRMED  3.709214
6    9579641  K00115.01  Kepler-105 b  CONFIRMED  5.412207

Now, if we attempt to read the invalid CSV content without Tybles, no error is raised.

pd.read_csv(StringIO(invalid_csv))
   kepler_id   koi_name   kepler_name     status    period  radius
0   10666592  K00002.01    Kepler-2 b  CONFIRMED  2.204735   16.39
1    6922244  K00010.01    Kepler-8 b  CONFIRMED  3.522499   14.83
2   11904151  K00072.01   Kepler-10 b  CONFIRMED  0.837491    1.45
3   10187017  K00082.04  Kepler-102 c  CONFIRMED  7.071361    0.58
4   10187017  K00082.05  Kepler-102 b  CONFIRMED  5.286954    0.49
5   10984090  K00112.02  Kepler-466 c  CONFIRMED  3.709214    1.24
6   K9579641  K00115.01  Kepler-105 b  CONFIRMED  5.412207    3.28

Reading the invalid CSV content with Tybles, however, detects the data type mismatch. Strictly speaking, the error is raised by Pandas rather than by Tybles, because Tybles passes Pandas the dtype expected for each column.

import traceback
try:
    schema.read_csv(StringIO(invalid_csv))
except Exception as e:
    traceback.print_exc(limit=3)
Traceback (most recent call last):
  File "pandas/_libs/parsers.pyx", line 1113, in pandas._libs.parsers.TextReader._convert_tokens
TypeError: Cannot cast array data from dtype('O') to dtype('int32') according to the rule 'safe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/ipykernel_3153/3785629674.py", line 3, in <cell line: 2>
    schema.read_csv(StringIO(invalid_csv))
  File "/home/runner/work/tybles/tybles/src/tybles/__init__.py", line 212, in read_csv
    pd.read_csv(
  File "/home/runner/.cache/pypoetry/virtualenvs/tybles-EP-d9W_r-py3.8/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
ValueError: invalid literal for int() with base 10: 'K9579641'
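
The mechanism behind this error can be imitated without Pandas. The sketch below is an illustration only: CONVERTERS and read_typed are hypothetical names, and the per-column converters play the role of the dtype specification. Converting the offending value with int() fails in the same way as in the last line of the traceback above.

```python
import csv
from io import StringIO

# Hypothetical per-column converters, standing in for a dtype specification.
CONVERTERS = {"kepler_id": int, "period": float}


def read_typed(text):
    """Parse CSV text and convert the listed columns, failing on the first bad value."""
    rows = []
    for record in csv.DictReader(StringIO(text)):
        for column, convert in CONVERTERS.items():
            record[column] = convert(record[column])
        rows.append(record)
    return rows


good = "kepler_id,period\n10666592,2.204735365\n"
bad = "kepler_id,period\nK9579641,5.41220713\n"

print(read_typed(good))
try:
    read_typed(bad)
except ValueError as error:
    # int("K9579641") raises ValueError, so the bad row is rejected.
    print(f"rejected: {error}")
```

Without the converters, every value would simply stay a string, which is why reading the invalid content without a type specification raises no error.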

Handling missing columns

The following CSV content is missing the koi_name column. When creating the schema, you can tell Tybles how to handle missing columns.

  • missing_columns="missing" returns an incomplete DataFrame. Some features of Tybles, such as row validation or the retrieval of rows as dataclass instances, will not work.

    When using this option, validate should be set to False.

  • missing_columns="fill" fills the missing columns with the dtype default value (for example zero or the empty string).

  • missing_columns="error" raises an exception (this is the default).

missing_values_csv = """
kepler_id,kepler_name,status,period,radius
10666592,Kepler-2 b,CONFIRMED,2.204735365,16.39
6922244,Kepler-8 b,CONFIRMED,3.522498573,14.83
11904151,Kepler-10 b,CONFIRMED,0.837491331,1.45
10187017,Kepler-102 c,CONFIRMED,7.07136076,0.58
10187017,Kepler-102 b,CONFIRMED,5.28695437,0.49
10984090,Kepler-466 c,CONFIRMED,3.709213846,1.24
9579641,Kepler-105 b,CONFIRMED,5.41220713,3.28
""".strip()
schema_missing_fill = tb.schema(Planet, missing_columns="fill")
schema_missing_fill.read_csv(StringIO(missing_values_csv))
   kepler_id  koi_name   kepler_name     status    period
0   10666592              Kepler-2 b  CONFIRMED  2.204735
1    6922244              Kepler-8 b  CONFIRMED  3.522499
2   11904151             Kepler-10 b  CONFIRMED  0.837491
3   10187017            Kepler-102 c  CONFIRMED  7.071361
4   10187017            Kepler-102 b  CONFIRMED  5.286954
5   10984090            Kepler-466 c  CONFIRMED  3.709214
6    9579641            Kepler-105 b  CONFIRMED  5.412207
schema_missing_missing = tb.schema(Planet, missing_columns="missing", validate=False)
schema_missing_missing.read_csv(StringIO(missing_values_csv))
   kepler_id   kepler_name     status    period
0   10666592    Kepler-2 b  CONFIRMED  2.204735
1    6922244    Kepler-8 b  CONFIRMED  3.522499
2   11904151   Kepler-10 b  CONFIRMED  0.837491
3   10187017  Kepler-102 c  CONFIRMED  7.071361
4   10187017  Kepler-102 b  CONFIRMED  5.286954
5   10984090  Kepler-466 c  CONFIRMED  3.709214
6    9579641  Kepler-105 b  CONFIRMED  5.412207
schema_missing_error = tb.schema(Planet, missing_columns="error")
try:
    schema_missing_error.read_csv(StringIO(missing_values_csv))
except Exception as e:
    traceback.print_exc(limit=3)
Traceback (most recent call last):
  File "/tmp/ipykernel_3153/2875286452.py", line 3, in <cell line: 2>
    schema_missing_error.read_csv(StringIO(missing_values_csv))
  File "/home/runner/work/tybles/tybles/src/tybles/__init__.py", line 211, in read_csv
    return self.process_raw_data_frame(
  File "/home/runner/work/tybles/tybles/src/tybles/__init__.py", line 256, in process_raw_data_frame
    raise ValueError("Missing columns in CSV file: " + ", ".join(missing))
ValueError: Missing columns in CSV file: koi_name
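
The three missing_columns policies can be summarized in a few lines of plain Python. This is a sketch under stated assumptions, not the Tybles implementation: resolve_missing and the defaults mapping are hypothetical, and only the policy names and the error message are taken from the examples above.

```python
def resolve_missing(header, expected, defaults, missing_columns="error"):
    """Return (output columns, fill values) for the chosen missing-column policy."""
    missing = [name for name in expected if name not in header]
    if missing_columns == "error":
        if missing:
            raise ValueError("Missing columns in CSV file: " + ", ".join(missing))
        return list(expected), {}
    if missing_columns == "fill":
        # Every expected column is returned; absent ones receive a default value.
        return list(expected), {name: defaults[name] for name in missing}
    if missing_columns == "missing":
        # Only the expected columns that are actually present are returned.
        return [name for name in expected if name in header], {}
    raise ValueError(f"unknown missing_columns policy: {missing_columns!r}")


expected = ["kepler_id", "koi_name", "kepler_name", "status", "period"]
header = ["kepler_id", "kepler_name", "status", "period", "radius"]
# Assumed defaults: zero for numbers, the empty string for text.
defaults = {"kepler_id": 0, "koi_name": "", "kepler_name": "", "status": "", "period": 0.0}

print(resolve_missing(header, expected, defaults, "fill"))
print(resolve_missing(header, expected, defaults, "missing"))
```

With "fill" the result still contains every schema column, which is why typed row access keeps working, whereas "missing" yields fewer columns than the dataclass has fields.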

Handling extra columns

The extra_columns option, which takes effect when calling Schema.read_csv, accepts three values. In the examples below it is set when creating the schema.

  • extra_columns="drop" removes the extra columns from the dataframe (default).

  • extra_columns="keep" keeps the extra columns in the dataframe.

  • extra_columns="error" raises an exception.

extra_csv = """
kepler_id,koi_name,kepler_name,status,period,radius,extra
10666592,K00002.01,Kepler-2 b,CONFIRMED,2.204735365,16.39,1
6922244,K00010.01,Kepler-8 b,CONFIRMED,3.522498573,14.83,2
11904151,K00072.01,Kepler-10 b,CONFIRMED,0.837491331,1.45,3
10187017,K00082.04,Kepler-102 c,CONFIRMED,7.07136076,0.58,4
10187017,K00082.05,Kepler-102 b,CONFIRMED,5.28695437,0.49,5
10984090,K00112.02,Kepler-466 c,CONFIRMED,3.709213846,1.24,6
9579641,K00115.01,Kepler-105 b,CONFIRMED,5.41220713,3.28,7
""".strip()
schema_extra_drop = tb.schema(Planet, extra_columns="drop")
schema_extra_drop.read_csv(StringIO(extra_csv))
   kepler_id   koi_name   kepler_name     status    period
0   10666592  K00002.01    Kepler-2 b  CONFIRMED  2.204735
1    6922244  K00010.01    Kepler-8 b  CONFIRMED  3.522499
2   11904151  K00072.01   Kepler-10 b  CONFIRMED  0.837491
3   10187017  K00082.04  Kepler-102 c  CONFIRMED  7.071361
4   10187017  K00082.05  Kepler-102 b  CONFIRMED  5.286954
5   10984090  K00112.02  Kepler-466 c  CONFIRMED  3.709214
6    9579641  K00115.01  Kepler-105 b  CONFIRMED  5.412207
schema_extra_keep = tb.schema(Planet, extra_columns="keep")
schema_extra_keep.read_csv(StringIO(extra_csv))
   kepler_id   koi_name   kepler_name     status    period  radius  extra
0   10666592  K00002.01    Kepler-2 b  CONFIRMED  2.204735   16.39      1
1    6922244  K00010.01    Kepler-8 b  CONFIRMED  3.522499   14.83      2
2   11904151  K00072.01   Kepler-10 b  CONFIRMED  0.837491    1.45      3
3   10187017  K00082.04  Kepler-102 c  CONFIRMED  7.071361    0.58      4
4   10187017  K00082.05  Kepler-102 b  CONFIRMED  5.286954    0.49      5
5   10984090  K00112.02  Kepler-466 c  CONFIRMED  3.709214    1.24      6
6    9579641  K00115.01  Kepler-105 b  CONFIRMED  5.412207    3.28      7
try:
    schema_extra_error = tb.schema(Planet, extra_columns="error")
    schema_extra_error.read_csv(StringIO(extra_csv))
except Exception as e:
    traceback.print_exc(limit=3)
Traceback (most recent call last):
  File "/tmp/ipykernel_3153/3187951842.py", line 3, in <cell line: 1>
    schema_extra_error.read_csv(StringIO(extra_csv))
  File "/home/runner/work/tybles/tybles/src/tybles/__init__.py", line 211, in read_csv
    return self.process_raw_data_frame(
  File "/home/runner/work/tybles/tybles/src/tybles/__init__.py", line 265, in process_raw_data_frame
    raise ValueError("Extra columns in CSV file: " + ", ".join(extra))
ValueError: Extra columns in CSV file: extra, radius
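
The extra_columns policies follow the same pattern. The sketch below is again hypothetical (resolve_extra is not part of Tybles); it only mirrors the behaviour visible in the outputs above, including the alphabetical order of the names in the error message.

```python
def resolve_extra(header, expected, extra_columns="drop"):
    """Return the columns to keep for the chosen extra-column policy."""
    extra = sorted(set(header) - set(expected))
    if extra_columns == "drop":
        # Keep only the columns named in the schema.
        return [name for name in header if name in expected]
    if extra_columns == "keep":
        return list(header)
    if extra_columns == "error":
        if extra:
            raise ValueError("Extra columns in CSV file: " + ", ".join(extra))
        return list(header)
    raise ValueError(f"unknown extra_columns policy: {extra_columns!r}")


expected = ["kepler_id", "koi_name", "kepler_name", "status", "period"]
header = expected + ["radius", "extra"]

print(resolve_extra(header, expected, "drop"))
print(resolve_extra(header, expected, "keep"))
try:
    resolve_extra(header, expected, "error")
except ValueError as error:
    print(error)
```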

Validation (using beartype)

The beartype library provides helpers to add validation to types. Here is an example.

See Beartype Validators for an explanation of the Is[...] syntax.

import beartype.vale as bv 
from typing_extensions import Annotated
@dataclass(frozen=True)
class ValidatedPlanet:
    kepler_id: Annotated[np.int32, bv.Is[lambda x: x >= 0]]
    koi_name: Annotated[str, bv.Is[lambda x: x.strip() != ""]]
    kepler_name: Annotated[str, bv.Is[lambda x: x.strip() != ""]]
    status: Annotated[str, bv.Is[lambda x: x in {"CANDIDATE", "CONFIRMED"}]]
    period: Annotated[np.float64, bv.Is[lambda x: x >= 0]]
negative_csv = """
kepler_id,koi_name,kepler_name,status,period,radius
10666592,K00002.01,Kepler-2 b,CONFIRMED,2.204735365,16.39
6922244,K00010.01,Kepler-8 b,CONFIRMED,3.522498573,14.83
11904151,K00072.01,Kepler-10 b,CONFIRMED,0.837491331,1.45
-10187017,K00082.04,Kepler-102 c,CONFIRMED,7.07136076,0.58
10187017,K00082.05,Kepler-102 b,CONFIRMED,5.28695437,0.49
10984090,K00112.02,Kepler-466 c,CONFIRMED,3.709213846,1.24
9579641,K00115.01,Kepler-105 b,CONFIRMED,5.41220713,3.28
""".strip()
schema_validated = tb.schema(ValidatedPlanet, validate=True) # validate is true by default
try:
    schema_validated.read_csv(StringIO(negative_csv))
except Exception as e:
    traceback.print_exc(limit=3)
Traceback (most recent call last):
  File "/tmp/ipykernel_3153/3476105072.py", line 3, in <cell line: 2>
    schema_validated.read_csv(StringIO(negative_csv))
  File "/home/runner/work/tybles/tybles/src/tybles/__init__.py", line 211, in read_csv
    return self.process_raw_data_frame(
  File "/home/runner/work/tybles/tybles/src/tybles/__init__.py", line 274, in process_raw_data_frame
    self.validate_row(row)
AssertionError: False not tri-state boolean.
schema_not_validated = tb.schema(ValidatedPlanet, validate=False)
# this succeeds
schema_not_validated.read_csv(StringIO(negative_csv))
   kepler_id   koi_name   kepler_name     status    period
0   10666592  K00002.01    Kepler-2 b  CONFIRMED  2.204735
1    6922244  K00010.01    Kepler-8 b  CONFIRMED  3.522499
2   11904151  K00072.01   Kepler-10 b  CONFIRMED  0.837491
3  -10187017  K00082.04  Kepler-102 c  CONFIRMED  7.071361
4   10187017  K00082.05  Kepler-102 b  CONFIRMED  5.286954
5   10984090  K00112.02  Kepler-466 c  CONFIRMED  3.709214
6    9579641  K00115.01  Kepler-105 b  CONFIRMED  5.412207
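
To show what row validation driven by Annotated metadata involves, here is a standard-library sketch. It does not use beartype: Check is a hypothetical marker class standing in for a validator such as bv.Is[...], and validate_row is an illustrative helper, not the Tybles method of the same name.

```python
from dataclasses import dataclass
from typing import Annotated, Callable, get_type_hints


@dataclass(frozen=True)
class Check:
    """Hypothetical marker holding a predicate; a stand-in for a beartype validator."""
    predicate: Callable[[object], bool]


@dataclass(frozen=True)
class CheckedPlanet:
    kepler_id: Annotated[int, Check(lambda x: x >= 0)]
    status: Annotated[str, Check(lambda x: x in {"CANDIDATE", "CONFIRMED"})]


def validate_row(row):
    """Raise ValueError if any field fails one of its Check predicates."""
    hints = get_type_hints(type(row), include_extras=True)
    for name, hint in hints.items():
        value = getattr(row, name)
        # Annotated[...] hints expose their extra arguments as __metadata__.
        for meta in getattr(hint, "__metadata__", ()):
            if isinstance(meta, Check) and not meta.predicate(value):
                raise ValueError(f"{name}={value!r} failed validation")


validate_row(CheckedPlanet(10187017, "CONFIRMED"))  # passes silently
try:
    validate_row(CheckedPlanet(-10187017, "CONFIRMED"))
except ValueError as error:
    print(error)
```

Because the predicates live in the type annotations, the plain type (int, str) remains available for reading the data, and the checks can be switched off without changing the dataclass.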

Type-safe row access

By setting return_type="Tyble" instead of the default return_type="DataFrame", one gets a Tyble object that behaves like a sequence/list.

t = schema.read_csv(StringIO(valid_csv), return_type="Tyble")
t
Tyble: self.schema.row_spec=Planet
       self[0]=Planet(kepler_id=10666592, koi_name='K00002.01', kepler_name='Kepler-2 b', status='CONFIRMED', period=2.204735365)
       self.data_frame=
   kepler_id   koi_name   kepler_name     status    period
0   10666592  K00002.01    Kepler-2 b  CONFIRMED  2.204735
1    6922244  K00010.01    Kepler-8 b  CONFIRMED  3.522499
2   11904151  K00072.01   Kepler-10 b  CONFIRMED  0.837491
3   10187017  K00082.04  Kepler-102 c  CONFIRMED  7.071361
4   10187017  K00082.05  Kepler-102 b  CONFIRMED  5.286954
5   10984090  K00112.02  Kepler-466 c  CONFIRMED  3.709214
6    9579641  K00115.01  Kepler-105 b  CONFIRMED  5.412207

The Tyble elements are instances of schema.row_spec, which is the dataclass that provided the dataframe specification.

One can then use standard list comprehensions to handle the rows.

Such row-by-row processing is much slower than vectorized Pandas operations, but Tybles targets small datasets in programs that use Pandas for its familiarity and ease of use.

The main advantage is that row access is now typed.

[row.kepler_name for row in t if row.kepler_id != 6922244]
['Kepler-2 b',
 'Kepler-10 b',
 'Kepler-102 c',
 'Kepler-102 b',
 'Kepler-466 c',
 'Kepler-105 b']
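
The benefit of typed rows can be seen without Tybles at all. In this small sketch, Row and records are hypothetical stand-ins for a schema dataclass and for the rows of a table: once each record is a dataclass instance, attribute access replaces string keys, so editors and type checkers can see the fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    kepler_id: int
    kepler_name: str


# Untyped records, as they might come out of a table row by row.
records = [
    {"kepler_id": 10666592, "kepler_name": "Kepler-2 b"},
    {"kepler_id": 6922244, "kepler_name": "Kepler-8 b"},
    {"kepler_id": 11904151, "kepler_name": "Kepler-10 b"},
]

rows = [Row(**record) for record in records]
names = [row.kepler_name for row in rows if row.kepler_id != 6922244]
print(names)  # ['Kepler-2 b', 'Kepler-10 b']
```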

One can reconstruct a pandas.DataFrame from a sequence of rows.

planets = [
    Planet(np.int32(10666592), "K00002.01", "Kepler-2 b", "CONFIRMED", np.float64(2.204735)),
    Planet(np.int32(6922244), "K00010.01", "Kepler-8 b", "CONFIRMED", np.float64(3.522499)),
]
schema.from_rows(planets)
   kepler_id   koi_name kepler_name     status    period
0   10666592  K00002.01  Kepler-2 b  CONFIRMED  2.204735
1    6922244  K00010.01  Kepler-8 b  CONFIRMED  3.522499
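
Going from rows back to a table amounts to transposing the rows into columns. The following sketch shows that step with the standard library only; to_columns and PlanetRow are hypothetical names, and the actual from_rows method may work differently.

```python
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class PlanetRow:
    kepler_id: int
    kepler_name: str
    period: float


def to_columns(rows, row_spec):
    """Transpose a sequence of dataclass rows into a dict of column lists."""
    columns = {f.name: [] for f in fields(row_spec)}
    for row in rows:
        for name, value in asdict(row).items():
            columns[name].append(value)
    return columns


planet_rows = [
    PlanetRow(10666592, "Kepler-2 b", 2.204735),
    PlanetRow(6922244, "Kepler-8 b", 3.522499),
]
columns = to_columns(planet_rows, PlanetRow)
print(columns["kepler_name"])  # ['Kepler-2 b', 'Kepler-8 b']
```

A dict of equal-length lists like this one is a shape that the pandas.DataFrame constructor accepts directly.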