Tutorial: Tybles quickstart
Let’s say we want to read the following data into a pandas.DataFrame:
kepler_id,koi_name,kepler_name,status,period,radius
10666592,K00002.01,Kepler-2 b,CONFIRMED,2.204735365,16.39
6922244,K00010.01,Kepler-8 b,CONFIRMED,3.522498573,14.83
11904151,K00072.01,Kepler-10 b,CONFIRMED,0.837491331,1.45
10187017,K00082.04,Kepler-102 c,CONFIRMED,7.07136076,0.58
10187017,K00082.05,Kepler-102 b,CONFIRMED,5.28695437,0.49
10984090,K00112.02,Kepler-466 c,CONFIRMED,3.709213846,1.24
9579641,K00115.01,Kepler-105 b,CONFIRMED,5.41220713,3.28
which we provide as a string below.
We also provide an invalid version of the same CSV content, in which the kepler_id of the last row (K9579641) is not an integer.
CSV content¶
import io
valid_csv = """
kepler_id,koi_name,kepler_name,status,period,radius
10666592,K00002.01,Kepler-2 b,CONFIRMED,2.204735365,16.39
6922244,K00010.01,Kepler-8 b,CONFIRMED,3.522498573,14.83
11904151,K00072.01,Kepler-10 b,CONFIRMED,0.837491331,1.45
10187017,K00082.04,Kepler-102 c,CONFIRMED,7.07136076,0.58
10187017,K00082.05,Kepler-102 b,CONFIRMED,5.28695437,0.49
10984090,K00112.02,Kepler-466 c,CONFIRMED,3.709213846,1.24
9579641,K00115.01,Kepler-105 b,CONFIRMED,5.41220713,3.28
""".strip()
invalid_csv = """
kepler_id,koi_name,kepler_name,status,period,radius
10666592,K00002.01,Kepler-2 b,CONFIRMED,2.204735365,16.39
6922244,K00010.01,Kepler-8 b,CONFIRMED,3.522498573,14.83
11904151,K00072.01,Kepler-10 b,CONFIRMED,0.837491331,1.45
10187017,K00082.04,Kepler-102 c,CONFIRMED,7.07136076,0.58
10187017,K00082.05,Kepler-102 b,CONFIRMED,5.28695437,0.49
10984090,K00112.02,Kepler-466 c,CONFIRMED,3.709213846,1.24
K9579641,K00115.01,Kepler-105 b,CONFIRMED,5.41220713,3.28
"""
Defining a dataframe schema¶
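We will first describe the schema using a dataclass(). Each field names a column, and its annotation gives the expected type. The radius column is left out of the schema, so it counts as an extra column when the data is read.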
import tybles as tb
from dataclasses import dataclass
import numpy as np
import pandas as pd
@dataclass(frozen=True)
class Planet:
    kepler_id: np.int32
    koi_name: str
    kepler_name: str
    status: str
    period: np.float64
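To make the link between the dataclass and the columns concrete, the sketch below lists each field with its annotation using only dataclasses.fields. It assumes, without confirmation from the Tybles source, that a schema is derived from a column-to-type mapping of this kind; dtype_by_column is a name made up for this illustration.

```python
from dataclasses import dataclass, fields

import numpy as np


@dataclass(frozen=True)
class Planet:
    kepler_id: np.int32
    koi_name: str
    kepler_name: str
    status: str
    period: np.float64


# Hypothetical mapping, not part of Tybles: field name -> annotated type.
dtype_by_column = {field.name: field.type for field in fields(Planet)}

for column, dtype in dtype_by_column.items():
    print(f"{column}: {dtype.__name__}")
```

The field order of the dataclass is preserved, which is why the resulting tables list the columns in the order they are declared.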
Reading a CSV file¶
We then read the valid contents using Tybles.
from io import StringIO
schema = tb.schema(Planet)
schema.read_csv(StringIO(valid_csv))
 | kepler_id | koi_name | kepler_name | status | period |
---|---|---|---|---|---|
0 | 10666592 | K00002.01 | Kepler-2 b | CONFIRMED | 2.204735 |
1 | 6922244 | K00010.01 | Kepler-8 b | CONFIRMED | 3.522499 |
2 | 11904151 | K00072.01 | Kepler-10 b | CONFIRMED | 0.837491 |
3 | 10187017 | K00082.04 | Kepler-102 c | CONFIRMED | 7.071361 |
4 | 10187017 | K00082.05 | Kepler-102 b | CONFIRMED | 5.286954 |
5 | 10984090 | K00112.02 | Kepler-466 c | CONFIRMED | 3.709214 |
6 | 9579641 | K00115.01 | Kepler-105 b | CONFIRMED | 5.412207 |
Now, if we attempt to read the invalid CSV content without Tybles, no error is raised: pandas silently reads the kepler_id column as strings rather than integers.
pd.read_csv(StringIO(invalid_csv))
 | kepler_id | koi_name | kepler_name | status | period | radius |
---|---|---|---|---|---|---|
0 | 10666592 | K00002.01 | Kepler-2 b | CONFIRMED | 2.204735 | 16.39 |
1 | 6922244 | K00010.01 | Kepler-8 b | CONFIRMED | 3.522499 | 14.83 |
2 | 11904151 | K00072.01 | Kepler-10 b | CONFIRMED | 0.837491 | 1.45 |
3 | 10187017 | K00082.04 | Kepler-102 c | CONFIRMED | 7.071361 | 0.58 |
4 | 10187017 | K00082.05 | Kepler-102 b | CONFIRMED | 5.286954 | 0.49 |
5 | 10984090 | K00112.02 | Kepler-466 c | CONFIRMED | 3.709214 | 1.24 |
6 | K9579641 | K00115.01 | Kepler-105 b | CONFIRMED | 5.412207 | 3.28 |
Reading the invalid CSV content with Tybles, however, detects the data type mismatch. (Strictly speaking, the error is raised by pandas rather than by Tybles, because Tybles passes pandas the dtype expected for each column.)
import traceback
try:
    schema.read_csv(StringIO(invalid_csv))
except Exception:
    traceback.print_exc(limit=3)
Traceback (most recent call last):
  File "pandas/_libs/parsers.pyx", line 1113, in pandas._libs.parsers.TextReader._convert_tokens
TypeError: Cannot cast array data from dtype('O') to dtype('int32') according to the rule 'safe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/ipykernel_3153/3785629674.py", line 3, in <cell line: 2>
    schema.read_csv(StringIO(invalid_csv))
  File "/home/runner/work/tybles/tybles/src/tybles/__init__.py", line 212, in read_csv
    pd.read_csv(
  File "/home/runner/.cache/pypoetry/virtualenvs/tybles-EP-d9W_r-py3.8/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
ValueError: invalid literal for int() with base 10: 'K9579641'
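The same contrast can be reproduced with pandas alone, which may help show where the error comes from. This is a minimal sketch that assumes the dtype specification mentioned above corresponds to the dtype argument of pandas.read_csv; it is not Tybles' actual code, and the two-column sample is shortened for illustration.

```python
from io import StringIO

import numpy as np
import pandas as pd

sample_csv = """kepler_id,period
10666592,2.204735365
K9579641,5.41220713
"""

# Without dtypes, pandas infers a non-integer column and raises nothing.
inferred = pd.read_csv(StringIO(sample_csv))
print(inferred["kepler_id"].dtype)

# With explicit dtypes, the bad value surfaces as an exception.
try:
    pd.read_csv(
        StringIO(sample_csv),
        dtype={"kepler_id": np.int32, "period": np.float64},
    )
    dtype_error = None
except (ValueError, TypeError) as exc:
    dtype_error = exc

print(type(dtype_error).__name__)
```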
Handling missing columns¶
The following CSV content is missing the koi_name column. When creating the schema, you can tell Tybles how to handle missing columns.
- missing_columns="missing" returns an incomplete DataFrame. Some features of Tybles, for example row validation or retrieval of rows using dataclasses, will not work. When using that option, validate should be set to False.
- missing_columns="fill" fills the missing columns with the dtype default value (for example zero or the empty string).
- missing_columns="error" raises an exception (this is the default).
missing_values_csv = """
kepler_id,kepler_name,status,period,radius
10666592,Kepler-2 b,CONFIRMED,2.204735365,16.39
6922244,Kepler-8 b,CONFIRMED,3.522498573,14.83
11904151,Kepler-10 b,CONFIRMED,0.837491331,1.45
10187017,Kepler-102 c,CONFIRMED,7.07136076,0.58
10187017,Kepler-102 b,CONFIRMED,5.28695437,0.49
10984090,Kepler-466 c,CONFIRMED,3.709213846,1.24
9579641,Kepler-105 b,CONFIRMED,5.41220713,3.28
""".strip()
schema_missing_fill = tb.schema(Planet, missing_columns="fill")
schema_missing_fill.read_csv(StringIO(missing_values_csv))
 | kepler_id | koi_name | kepler_name | status | period |
---|---|---|---|---|---|
0 | 10666592 | | Kepler-2 b | CONFIRMED | 2.204735 |
1 | 6922244 | | Kepler-8 b | CONFIRMED | 3.522499 |
2 | 11904151 | | Kepler-10 b | CONFIRMED | 0.837491 |
3 | 10187017 | | Kepler-102 c | CONFIRMED | 7.071361 |
4 | 10187017 | | Kepler-102 b | CONFIRMED | 5.286954 |
5 | 10984090 | | Kepler-466 c | CONFIRMED | 3.709214 |
6 | 9579641 | | Kepler-105 b | CONFIRMED | 5.412207 |
schema_missing_missing = tb.schema(Planet, missing_columns="missing", validate=False)
schema_missing_missing.read_csv(StringIO(missing_values_csv))
 | kepler_id | kepler_name | status | period |
---|---|---|---|---|
0 | 10666592 | Kepler-2 b | CONFIRMED | 2.204735 |
1 | 6922244 | Kepler-8 b | CONFIRMED | 3.522499 |
2 | 11904151 | Kepler-10 b | CONFIRMED | 0.837491 |
3 | 10187017 | Kepler-102 c | CONFIRMED | 7.071361 |
4 | 10187017 | Kepler-102 b | CONFIRMED | 5.286954 |
5 | 10984090 | Kepler-466 c | CONFIRMED | 3.709214 |
6 | 9579641 | Kepler-105 b | CONFIRMED | 5.412207 |
schema_missing_error = tb.schema(Planet, missing_columns="error")
try:
    schema_missing_error.read_csv(StringIO(missing_values_csv))
except Exception:
    traceback.print_exc(limit=3)
Traceback (most recent call last):
  File "/tmp/ipykernel_3153/2875286452.py", line 3, in <cell line: 2>
    schema_missing_error.read_csv(StringIO(missing_values_csv))
  File "/home/runner/work/tybles/tybles/src/tybles/__init__.py", line 211, in read_csv
    return self.process_raw_data_frame(
  File "/home/runner/work/tybles/tybles/src/tybles/__init__.py", line 256, in process_raw_data_frame
    raise ValueError("Missing columns in CSV file: " + ", ".join(missing))
ValueError: Missing columns in CSV file: koi_name
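If it is useful to check a file before handing it to a schema, the header can be compared against the expected column names with the standard library alone. The helper below is a hypothetical sketch (find_missing_columns and EXPECTED_COLUMNS are names invented here, not part of Tybles), and it only inspects the first line of the CSV content.

```python
import csv
from io import StringIO

# Column names taken from the Planet dataclass above.
EXPECTED_COLUMNS = ["kepler_id", "koi_name", "kepler_name", "status", "period"]

sample_csv = """kepler_id,kepler_name,status,period,radius
10666592,Kepler-2 b,CONFIRMED,2.204735365,16.39
"""


def find_missing_columns(csv_text, expected):
    """Return the expected columns that the CSV header does not contain."""
    header = next(csv.reader(StringIO(csv_text)))
    return [name for name in expected if name not in header]


missing = find_missing_columns(sample_csv, EXPECTED_COLUMNS)
if missing:
    print("Missing columns in CSV file: " + ", ".join(missing))
```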
Handling extra columns¶
There are three values for the extra_columns option, which controls how Schema.read_csv treats columns that are not part of the schema. The examples below set it when creating the schema.
- extra_columns="drop" removes the extra columns from the dataframe (default).
- extra_columns="keep" keeps the extra columns around.
- extra_columns="error" raises an exception.
extra_csv = """
kepler_id,koi_name,kepler_name,status,period,radius,extra
10666592,K00002.01,Kepler-2 b,CONFIRMED,2.204735365,16.39,1
6922244,K00010.01,Kepler-8 b,CONFIRMED,3.522498573,14.83,2
11904151,K00072.01,Kepler-10 b,CONFIRMED,0.837491331,1.45,3
10187017,K00082.04,Kepler-102 c,CONFIRMED,7.07136076,0.58,4
10187017,K00082.05,Kepler-102 b,CONFIRMED,5.28695437,0.49,5
10984090,K00112.02,Kepler-466 c,CONFIRMED,3.709213846,1.24,6
9579641,K00115.01,Kepler-105 b,CONFIRMED,5.41220713,3.28,7
""".strip()
schema_extra_drop = tb.schema(Planet, extra_columns="drop")
schema_extra_drop.read_csv(StringIO(extra_csv))
 | kepler_id | koi_name | kepler_name | status | period |
---|---|---|---|---|---|
0 | 10666592 | K00002.01 | Kepler-2 b | CONFIRMED | 2.204735 |
1 | 6922244 | K00010.01 | Kepler-8 b | CONFIRMED | 3.522499 |
2 | 11904151 | K00072.01 | Kepler-10 b | CONFIRMED | 0.837491 |
3 | 10187017 | K00082.04 | Kepler-102 c | CONFIRMED | 7.071361 |
4 | 10187017 | K00082.05 | Kepler-102 b | CONFIRMED | 5.286954 |
5 | 10984090 | K00112.02 | Kepler-466 c | CONFIRMED | 3.709214 |
6 | 9579641 | K00115.01 | Kepler-105 b | CONFIRMED | 5.412207 |
schema_extra_keep = tb.schema(Planet, extra_columns="keep")
schema_extra_keep.read_csv(StringIO(extra_csv))
 | kepler_id | koi_name | kepler_name | status | period | radius | extra |
---|---|---|---|---|---|---|---|
0 | 10666592 | K00002.01 | Kepler-2 b | CONFIRMED | 2.204735 | 16.39 | 1 |
1 | 6922244 | K00010.01 | Kepler-8 b | CONFIRMED | 3.522499 | 14.83 | 2 |
2 | 11904151 | K00072.01 | Kepler-10 b | CONFIRMED | 0.837491 | 1.45 | 3 |
3 | 10187017 | K00082.04 | Kepler-102 c | CONFIRMED | 7.071361 | 0.58 | 4 |
4 | 10187017 | K00082.05 | Kepler-102 b | CONFIRMED | 5.286954 | 0.49 | 5 |
5 | 10984090 | K00112.02 | Kepler-466 c | CONFIRMED | 3.709214 | 1.24 | 6 |
6 | 9579641 | K00115.01 | Kepler-105 b | CONFIRMED | 5.412207 | 3.28 | 7 |
try:
    schema_extra_error = tb.schema(Planet, extra_columns="error")
    schema_extra_error.read_csv(StringIO(extra_csv))
except Exception:
    traceback.print_exc(limit=3)
Traceback (most recent call last):
  File "/tmp/ipykernel_3153/3187951842.py", line 3, in <cell line: 1>
    schema_extra_error.read_csv(StringIO(extra_csv))
  File "/home/runner/work/tybles/tybles/src/tybles/__init__.py", line 211, in read_csv
    return self.process_raw_data_frame(
  File "/home/runner/work/tybles/tybles/src/tybles/__init__.py", line 265, in process_raw_data_frame
    raise ValueError("Extra columns in CSV file: " + ", ".join(extra))
ValueError: Extra columns in CSV file: extra, radius
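For comparison, plain pandas can detect and drop unwanted columns at read time. The sketch below uses the nrows and usecols arguments of pandas.read_csv; it is not necessarily how Tybles implements extra_columns, and schema_columns is simply a hand-written list of the Planet field names.

```python
from io import StringIO

import pandas as pd

schema_columns = ["kepler_id", "koi_name", "kepler_name", "status", "period"]

extra_sample = """kepler_id,koi_name,kepler_name,status,period,radius,extra
10666592,K00002.01,Kepler-2 b,CONFIRMED,2.204735365,16.39,1
6922244,K00010.01,Kepler-8 b,CONFIRMED,3.522498573,14.83,2
"""

# Read only the header to find columns that the schema does not know about.
header = pd.read_csv(StringIO(extra_sample), nrows=0).columns
extra = sorted(set(header) - set(schema_columns))
print("Extra columns:", ", ".join(extra))

# Keep only the schema columns while parsing.
trimmed = pd.read_csv(StringIO(extra_sample), usecols=schema_columns)
print(list(trimmed.columns))
```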
Validation (using beartype)¶
The beartype library provides helpers to add validation to types. Here is an example.
See Beartype Validators for an explanation of the Is[...] syntax.
import beartype.vale as bv
from typing_extensions import Annotated
@dataclass(frozen=True)
class ValidatedPlanet:
    kepler_id: Annotated[np.int32, bv.Is[lambda x: x >= 0]]
    koi_name: Annotated[str, bv.Is[lambda x: x.strip() != ""]]
    kepler_name: Annotated[str, bv.Is[lambda x: x.strip() != ""]]
    status: Annotated[str, bv.Is[lambda x: x in {"CANDIDATE", "CONFIRMED"}]]
    period: Annotated[np.float64, bv.Is[lambda x: x >= 0]]
negative_csv = """
kepler_id,koi_name,kepler_name,status,period,radius
10666592,K00002.01,Kepler-2 b,CONFIRMED,2.204735365,16.39
6922244,K00010.01,Kepler-8 b,CONFIRMED,3.522498573,14.83
11904151,K00072.01,Kepler-10 b,CONFIRMED,0.837491331,1.45
-10187017,K00082.04,Kepler-102 c,CONFIRMED,7.07136076,0.58
10187017,K00082.05,Kepler-102 b,CONFIRMED,5.28695437,0.49
10984090,K00112.02,Kepler-466 c,CONFIRMED,3.709213846,1.24
9579641,K00115.01,Kepler-105 b,CONFIRMED,5.41220713,3.28
""".strip()
schema_validated = tb.schema(ValidatedPlanet, validate=True)  # validate=True is the default
try:
    schema_validated.read_csv(StringIO(negative_csv))
except Exception:
    traceback.print_exc(limit=3)
Traceback (most recent call last):
  File "/tmp/ipykernel_3153/3476105072.py", line 3, in <cell line: 2>
    schema_validated.read_csv(StringIO(negative_csv))
  File "/home/runner/work/tybles/tybles/src/tybles/__init__.py", line 211, in read_csv
    return self.process_raw_data_frame(
  File "/home/runner/work/tybles/tybles/src/tybles/__init__.py", line 274, in process_raw_data_frame
    self.validate_row(row)
AssertionError: False not tri-state boolean.
schema_not_validated = tb.schema(ValidatedPlanet, validate=False)
# this succeeds
schema_not_validated.read_csv(StringIO(negative_csv))
 | kepler_id | koi_name | kepler_name | status | period |
---|---|---|---|---|---|
0 | 10666592 | K00002.01 | Kepler-2 b | CONFIRMED | 2.204735 |
1 | 6922244 | K00010.01 | Kepler-8 b | CONFIRMED | 3.522499 |
2 | 11904151 | K00072.01 | Kepler-10 b | CONFIRMED | 0.837491 |
3 | -10187017 | K00082.04 | Kepler-102 c | CONFIRMED | 7.071361 |
4 | 10187017 | K00082.05 | Kepler-102 b | CONFIRMED | 5.286954 |
5 | 10984090 | K00112.02 | Kepler-466 c | CONFIRMED | 3.709214 |
6 | 9579641 | K00115.01 | Kepler-105 b | CONFIRMED | 5.412207 |
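The idea behind annotation-based validation can be sketched without beartype: predicates are attached to a field through typing.Annotated and evaluated against each row. The code below is a simplified stand-in that uses plain lambdas as metadata and requires Python 3.9 or later; CheckedPlanet and failed_fields are hypothetical names, and neither beartype nor Tybles is claimed to work this way internally.

```python
from dataclasses import dataclass
from typing import Annotated, get_args, get_origin, get_type_hints


@dataclass(frozen=True)
class CheckedPlanet:
    # Plain callables stand in for beartype's Is[...] validators.
    kepler_id: Annotated[int, lambda x: x >= 0]
    status: Annotated[str, lambda x: x in {"CANDIDATE", "CONFIRMED"}]


def failed_fields(row):
    """Return the names of the fields whose type or predicate check fails."""
    failures = []
    hints = get_type_hints(type(row), include_extras=True)
    for name, hint in hints.items():
        if get_origin(hint) is not Annotated:
            continue  # field carries no predicates
        base_type, *predicates = get_args(hint)
        value = getattr(row, name)
        type_ok = isinstance(value, base_type)
        if not type_ok or not all(check(value) for check in predicates):
            failures.append(name)
    return failures


good_row = CheckedPlanet(10666592, "CONFIRMED")
bad_row = CheckedPlanet(-10187017, "CONFIRMED")
print(failed_fields(good_row))
print(failed_fields(bad_row))
```

Reporting the failing field names, as this sketch does, is one way to make a validation failure easier to trace back to the offending row.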
Type-safe row access¶
By setting return_type="Tyble" instead of the default return_type="DataFrame", one gets a Tyble object that behaves like a sequence/list.
t = schema.read_csv(StringIO(valid_csv), return_type="Tyble")
t
Tyble: self.schema.row_spec=Planet
self[0]=Planet(kepler_id=10666592, koi_name='K00002.01', kepler_name='Kepler-2 b', status='CONFIRMED', period=2.204735365)
self.data_frame=
kepler_id koi_name kepler_name status period
0 10666592 K00002.01 Kepler-2 b CONFIRMED 2.204735
1 6922244 K00010.01 Kepler-8 b CONFIRMED 3.522499
2 11904151 K00072.01 Kepler-10 b CONFIRMED 0.837491
3 10187017 K00082.04 Kepler-102 c CONFIRMED 7.071361
4 10187017 K00082.05 Kepler-102 b CONFIRMED 5.286954
5 10984090 K00112.02 Kepler-466 c CONFIRMED 3.709214
6 9579641 K00115.01 Kepler-105 b CONFIRMED 5.412207
The Tyble elements are instances of schema.row_spec, which is the dataclass that provided the dataframe specification.
One can then use standard list comprehensions to handle the rows.
Such row-by-row processing is much slower than using pandas directly, but Tybles is made for small datasets in programs that use pandas for its familiarity and ease of use.
The main advantage is that row access is now typed.
[row.kepler_name for row in t if row.kepler_id != 6922244]
['Kepler-2 b',
'Kepler-10 b',
'Kepler-102 c',
'Kepler-102 b',
'Kepler-466 c',
'Kepler-105 b']
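To illustrate what typed row access means, a DataFrame can be converted into dataclass instances by hand with DataFrame.to_dict. This is only a sketch of the concept with plain pandas; MiniPlanet is a reduced, hypothetical row type, and the conversion shown is not claimed to be what a Tyble does internally.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class MiniPlanet:
    kepler_id: int
    kepler_name: str


df = pd.DataFrame(
    {
        "kepler_id": [10666592, 6922244],
        "kepler_name": ["Kepler-2 b", "Kepler-8 b"],
    }
)

# One dataclass instance per row; attribute access replaces string column lookups.
rows = [MiniPlanet(**record) for record in df.to_dict(orient="records")]
names = [row.kepler_name for row in rows if row.kepler_id != 6922244]
print(names)
```

With dataclass rows, a misspelled attribute such as row.kepler_nme can be flagged by a type checker or editor, whereas a misspelled column string is only caught at run time.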
One can reconstruct a pandas.DataFrame from a sequence of rows.
planets = [
Planet(np.int32(10666592), "K00002.01", "Kepler-2 b", "CONFIRMED", np.float64(2.204735)),
Planet(np.int32(6922244), "K00010.01", "Kepler-8 b", "CONFIRMED", np.float64(3.522499)),
]
schema.from_rows(planets)
 | kepler_id | koi_name | kepler_name | status | period |
---|---|---|---|---|---|
0 | 10666592 | K00002.01 | Kepler-2 b | CONFIRMED | 2.204735 |
1 | 6922244 | K00010.01 | Kepler-8 b | CONFIRMED | 3.522499 |
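A comparable result can be obtained with plain pandas by converting each row with dataclasses.asdict and then casting the columns. The sketch below is an approximation for illustration, not the implementation of from_rows; SimplePlanet is a shortened, hypothetical row type.

```python
from dataclasses import asdict, dataclass

import numpy as np
import pandas as pd


@dataclass(frozen=True)
class SimplePlanet:
    kepler_id: np.int32
    kepler_name: str
    period: np.float64


planets = [
    SimplePlanet(np.int32(10666592), "Kepler-2 b", np.float64(2.204735)),
    SimplePlanet(np.int32(6922244), "Kepler-8 b", np.float64(3.522499)),
]

# Build the frame from dictionaries, then make the numeric dtypes explicit.
frame = pd.DataFrame([asdict(planet) for planet in planets]).astype(
    {"kepler_id": np.int32, "period": np.float64}
)
print(frame.dtypes)
```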