API Reference
Abstract Base Class: Script
- class potnia.script.Script(config: str)
Bases:
object
The abstract base class for handling text transliteration and unicode conversion.
- config
Path to the configuration file or configuration data in YAML format.
- Type:
str
- config: str
- regularize(string: str) → str
Applies regularization rules to a given string.
- Parameters:
string (str) – Text string to be regularized.
- Returns:
Regularized text string.
- Return type:
str
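As a rough illustration of what rule-based regularization can look like, the sketch below applies an ordered list of regular-expression substitutions. The rules and the `RULES` name are hypothetical and chosen only for the example; they are not taken from any potnia configuration file.

```python
import re

# Hypothetical rules for illustration only: (regex pattern, replacement).
# They are NOT the rules shipped with any potnia configuration.
RULES = [
    (r"[\[\]]", ""),  # drop square brackets around damaged or restored text
    (r"\s+", " "),    # collapse runs of whitespace into a single space
]


def regularize(string: str) -> str:
    """Apply each rule in order, then trim leading/trailing whitespace."""
    for pattern, replacement in RULES:
        string = re.sub(pattern, replacement, string)
    return string.strip()


print(regularize("po-[ti]-ni-ja   ko-wo "))
```

Keeping the rules in a plain list makes their order explicit, which matters when one substitution can create input for the next.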
- to_transliteration(text: str) → str
Converts unicode text to transliteration format.
NB. This function may not work as expected for all scripts/languages because there may not be a one-to-one mapping between unicode and transliteration.
- Parameters:
text (str) – Input text in unicode format.
- Returns:
Transliterated text.
- Return type:
str
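The caveat about a missing one-to-one mapping can be seen in a small, self-contained sketch. The table below is hypothetical: it assumes two transliteration spellings ("a" and an alias "A") that resolve to the same sign, so inverting the table can only keep one of them.

```python
# Hypothetical table: two transliteration spellings map to the same sign.
# U+10000 is the Linear B syllable "a"; the alias "A" is made up for the example.
TO_SIGN = {
    "a": "\U00010000",
    "A": "\U00010000",
}

# Inverting the table keeps a single spelling per sign (the last one seen).
TO_TRANSLIT = {sign: name for name, sign in TO_SIGN.items()}


def round_trip(token: str) -> str:
    """Convert a token to its sign and back again."""
    return TO_TRANSLIT[TO_SIGN[token]]


print(round_trip("a") == "a")  # False: the original spelling is not recovered
```

Any conversion from unicode back to transliteration has to pick one spelling in such cases, which is why a round trip is not guaranteed to return the original input.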
- to_unicode(text: str, regularize: bool = False) → str
Converts transliterated text to unicode format.
- Parameters:
text (str) – Input text in transliterated format.
regularize (bool, optional) – Whether to apply regularization. Defaults to False.
- Returns:
Text converted to unicode format, optionally regularized.
- Return type:
str
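The following toy function sketches the general shape of such a conversion: split on hyphens, look each token up in a table, and optionally tidy the result. It is an assumption-laden stand-in, not potnia's implementation; the `SIGNS` table and the `_regularize` helper are hypothetical, and only four Linear B syllables are included.

```python
# Hypothetical four-entry table (Linear B syllables po, ti, ni, ja).
SIGNS = {
    "po": "\U00010021",
    "ti": "\U00010034",
    "ni": "\U0001001B",
    "ja": "\U0001000A",
}


def _regularize(string: str) -> str:
    """Stand-in regularization: collapse whitespace and trim the ends."""
    return " ".join(string.split())


def to_unicode(text: str, regularize: bool = False) -> str:
    """Map hyphen-separated tokens to signs; unknown tokens pass through."""
    words = []
    for word in text.split(" "):
        words.append("".join(SIGNS.get(tok, tok) for tok in word.split("-")))
    result = " ".join(words)
    return _regularize(result) if regularize else result


plain = to_unicode("po-ti-ni-ja  ")                   # trailing spaces kept
tidy = to_unicode("po-ti-ni-ja  ", regularize=True)   # trailing spaces removed
print(len(plain), len(tidy))
```

The flag only affects the post-processing step here, so the same table lookup is used in both cases.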
- tokenize_transliteration(text: str) → list[str]
Tokenizes transliterated text according to specific patterns.
- Parameters:
text (str) – Input text in transliterated format.
- Returns:
List of tokens
- Return type:
list[str]
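The "specific patterns" are script-dependent and are not spelled out here. As a minimal sketch, assuming hyphen-separated syllables and whitespace between words, a regular expression can do the splitting; the `TOKEN_PATTERN` name and the pattern itself are illustrative assumptions.

```python
import re

# Hypothetical pattern: a token is either a run of characters that are
# neither whitespace nor hyphens, or a run of whitespace (kept as a token).
TOKEN_PATTERN = re.compile(r"[^\s\-]+|\s+")


def tokenize_transliteration(text: str) -> list[str]:
    """Split transliterated text into syllable and whitespace tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize_transliteration("po-ti-ni-ja ko-wo"))
# ['po', 'ti', 'ni', 'ja', ' ', 'ko', 'wo']
```

Whitespace is kept as its own token in this sketch so that word boundaries survive into the converted output.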
- tokenize_unicode(text: str) → list[str]
Tokenizes unicode text according to specific patterns.
By default, it tokenizes each character as a separate token. This method can be overridden in subclasses to provide more complex tokenization.
- Parameters:
text (str) – Input text in unicode format.
- Returns:
List of tokens
- Return type:
list[str]
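The default behaviour and the override mechanism described above can be sketched with two stand-in classes. `BaseScript` and `WordScript` are hypothetical names used only for this example; they do not inherit from potnia's `Script`.

```python
class BaseScript:
    """Stand-in for the documented default: one token per character."""

    def tokenize_unicode(self, text: str) -> list[str]:
        return list(text)


class WordScript(BaseScript):
    """Hypothetical subclass that tokenizes on whitespace instead."""

    def tokenize_unicode(self, text: str) -> list[str]:
        return text.split()


signs = "\U00010021\U00010034"  # Linear B syllables po, ti
print(len(BaseScript().tokenize_unicode(signs)))     # 2
print(WordScript().tokenize_unicode("ab cd"))        # ['ab', 'cd']
```

Python strings iterate by code point, so signs outside the Basic Multilingual Plane are still returned as single tokens by the per-character default.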
Scripts Available
Linear A
- class potnia.scripts.linear_a.LinearA(config: str = 'linear_a.yaml')
Class for handling text transliteration and unicode conversion for Linear A.
To use the singleton instance, import like so:
from potnia import linear_a
- config
Path to the configuration file or configuration data in string format. By default, it uses the ‘linear_a.yaml’ file in the ‘data’ directory.
- Type:
str
- config: str = 'linear_a.yaml'
- regularize(string: str) → str
Applies regularization rules to a given string.
- Parameters:
string (str) – Text string to be regularized.
- Returns:
Regularized text string.
- Return type:
str
- to_transliteration(text: str) → str
Converts unicode text to transliteration format.
NB. This function may not work as expected for all scripts/languages because there may not be a one-to-one mapping between unicode and transliteration.
- Parameters:
text (str) – Input text in unicode format.
- Returns:
Transliterated text.
- Return type:
str
- to_unicode(text: str, regularize: bool = False) → str
Converts transliterated text to unicode format.
- Parameters:
text (str) – Input text in transliterated format.
regularize (bool, optional) – Whether to apply regularization. Defaults to False.
- Returns:
Text converted to unicode format, optionally regularized.
- Return type:
str
- tokenize_transliteration(input_string: str) → list[str]
Tokenizes transliterated text according to specific patterns.
- Parameters:
input_string (str) – Input text in transliterated format.
- Returns:
List of tokens
- Return type:
list[str]
- tokenize_unicode(text: str) → list[str]
Tokenizes a unicode string by splitting and joining words with dashes.
- Parameters:
text (str) – Input text in unicode format.
- Returns:
List of tokenized strings.
- Return type:
list[str]
Linear B
- class potnia.scripts.linear_b.LinearB(config: str = 'linear_b')
Class for handling text transliteration and unicode conversion for Linear B.
To use the singleton instance, import like so:
from potnia import linear_b
Designed especially for texts from DĀMOS (Database of Mycenaean at Oslo): https://damos.hf.uio.no/ and LiBER (Linear B Electronic Resources): https://liber.cnr.it/
- config
Path to the configuration file or configuration data in string format. By default, it uses the ‘linear_b.yaml’ file in the ‘data’ directory.
- Type:
str
- config: str = 'linear_b'
- regularize(text: str) → str
Applies regularization rules to a given string.
- Parameters:
text (str) – Text string to be regularized.
- Returns:
Regularized text string.
- Return type:
str
- to_transliteration(text: str) → str
Converts unicode text to transliteration format.
NB. This function may not work as expected for all scripts/languages because there may not be a one-to-one mapping between unicode and transliteration.
- Parameters:
text (str) – Input text in unicode format.
- Returns:
Transliterated text.
- Return type:
str
- to_unicode(text: str, regularize: bool = False) → str
Converts transliterated text to unicode format.
- Parameters:
text (str) – Input text in transliterated format.
regularize (bool, optional) – Whether to apply regularization. Defaults to False.
- Returns:
Text converted to unicode format, optionally regularized.
- Return type:
str
- tokenize_transliteration(text: str) → list[str]
Tokenizes transliterated text according to specific patterns.
- Parameters:
text (str) – Input text in transliterated format.
- Returns:
List of tokens
- Return type:
list[str]
- tokenize_unicode(text: str) → list[str]
Tokenizes a unicode string by splitting and joining words with dashes.
- Parameters:
text (str) – Input text in unicode format.
- Returns:
List of tokenized strings.
- Return type:
list[str]
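When converting many transliterated lines, for example lines copied from a corpus, a small wrapper around to_unicode can be convenient. The helper below is a hypothetical sketch: `convert_lines` and `SupportsToUnicode` are names invented for the example, and the helper relies only on the to_unicode signature documented above, so it works with any object that provides it.

```python
from typing import Iterable, Protocol


class SupportsToUnicode(Protocol):
    """Anything exposing the to_unicode signature documented above."""

    def to_unicode(self, text: str, regularize: bool = False) -> str: ...


def convert_lines(
    script: SupportsToUnicode, lines: Iterable[str], regularize: bool = False
) -> list[str]:
    """Convert each non-blank line; blank lines are skipped (hypothetical helper)."""
    return [
        script.to_unicode(line, regularize=regularize)
        for line in lines
        if line.strip()
    ]


# With potnia installed, the call would look like:
#     from potnia import linear_b
#     convert_lines(linear_b, ["po-ti-ni-ja"], regularize=True)
```

Typing the argument as a protocol rather than a concrete class keeps the helper usable with the other script singletons as well.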
Arabic
- class potnia.scripts.arabic.Arabic(config: str = 'arabic')
Class for handling text transliteration and unicode conversion for Arabic.
To use the singleton instance, import like so:
from potnia import arabic
Uses the DIN 31635 standard for Arabic transliteration.
If you need the Tim Buckwalter transliteration system, then use the PyArabic library.
- config
Path to the configuration file or configuration data in string format. By default, it uses the ‘arabic.yaml’ file in the ‘data’ directory.
- Type:
str
- config: str = 'arabic'
- regularize(string: str) → str
Applies regularization rules to a given string.
- Parameters:
string (str) – Text string to be regularized.
- Returns:
Regularized text string.
- Return type:
str
- to_transliteration(text: str) → str
Converts unicode text to transliteration format.
NB. This function may not work as expected for all scripts/languages because there may not be a one-to-one mapping between unicode and transliteration.
- Parameters:
text (str) – Input text in unicode format.
- Returns:
Transliterated text.
- Return type:
str
- to_unicode(text: str, regularize: bool = False) → str
Converts transliterated text to unicode format.
- Parameters:
text (str) – Input text in transliterated format.
regularize (bool, optional) – Whether to apply regularization. Defaults to False.
- Returns:
Text converted to unicode format, optionally regularized.
- Return type:
str
- tokenize_transliteration(text: str) → list[str]
Tokenizes transliterated text according to specific patterns.
- Parameters:
text (str) – Input text in transliterated format.
- Returns:
List of tokens
- Return type:
list[str]
- tokenize_unicode(text: str) → list[str]
Tokenizes unicode text according to specific patterns.
By default, it tokenizes each character as a separate token. This method can be overridden in subclasses to provide more complex tokenization.
- Parameters:
text (str) – Input text in unicode format.
- Returns:
List of tokens
- Return type:
list[str]
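To give a feel for the DIN 31635 correspondences mentioned in the class description, the sketch below maps a handful of Arabic consonants to their DIN 31635 romanizations. It is a toy table written for this example, not the ‘arabic.yaml’ configuration, and it ignores vowels, diacritics, and context-dependent rules.

```python
# A few DIN 31635 consonant correspondences (toy subset, not potnia's config).
DIN_31635_SUBSET = {
    "\u0628": "b",       # bāʾ
    "\u062a": "t",       # tāʾ
    "\u062b": "\u1e6f",  # ṯāʾ  -> ṯ
    "\u062c": "\u01e7",  # ǧīm  -> ǧ
    "\u062d": "\u1e25",  # ḥāʾ  -> ḥ
    "\u062e": "\u1e2b",  # ḫāʾ  -> ḫ
    "\u0634": "\u0161",  # šīn  -> š
    "\u0643": "k",       # kāf
}


def to_transliteration(text: str) -> str:
    """Replace each known letter; anything else passes through unchanged."""
    return "".join(DIN_31635_SUBSET.get(ch, ch) for ch in text)


# kāf + tāʾ + bāʾ, written without vowel marks
print(to_transliteration("\u0643\u062a\u0628"))  # ktb
```

Because unvocalized Arabic omits short vowels, the output of such a table is a consonant skeleton rather than a full romanization.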
Hittite
- class potnia.scripts.hittite.Hittite(config: str = 'hittite')
Class for handling text transliteration and unicode conversion for Hittite.
To use the singleton instance, import like so:
from potnia import hittite
Designed especially for texts from the Catalog der Texte der Hethiter (CTH): https://www.hethport.uni-wuerzburg.de/CTH/index.php
- config
Path to the configuration file or configuration data in string format. By default, it uses the ‘hittite.yaml’ file in the ‘data’ directory.
- Type:
str
- config: str = 'hittite'
- regularize(string: str) → str
Applies regularization rules to a given string.
- Parameters:
string (str) – Text string to be regularized.
- Returns:
Regularized text string.
- Return type:
str
- to_transliteration(text: str) → str
Converts unicode text to transliteration format.
NB. This function may not work as expected for all scripts/languages because there may not be a one-to-one mapping between unicode and transliteration.
- Parameters:
text (str) – Input text in unicode format.
- Returns:
Transliterated text.
- Return type:
str
- to_unicode(text: str, regularize: bool = False) → str
Converts transliterated text to unicode format.
- Parameters:
text (str) – Input text in transliterated format.
regularize (bool, optional) – Whether to apply regularization. Defaults to False.
- Returns:
Text converted to unicode format, optionally regularized.
- Return type:
str
- tokenize_transliteration(input_string: str) → list[str]
Tokenizes transliterated text according to specific patterns.
- Parameters:
input_string (str) – Input text in transliterated format.
- Returns:
List of tokens
- Return type:
list[str]
- tokenize_unicode(text: str) → list[str]
Tokenizes unicode text according to specific patterns.
By default, it tokenizes each character as a separate token. This method can be overridden in subclasses to provide more complex tokenization.
- Parameters:
text (str) – Input text in unicode format.
- Returns:
List of tokens
- Return type:
list[str]