API Reference

Abstract Base Class: Script

class potnia.script.Script(config: str)

Bases: object

The abstract base class for handling text transliteration and unicode conversion.

config

Path to the configuration file or configuration data in YAML format.

Type:

str

config: str
regularize(string: str) → str

Applies regularization rules to a given string.

Parameters:

string (str) – Text string to be regularized.

Returns:

Regularized text string.

Return type:

str
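
The reference does not say what form the regularization rules take. As a minimal sketch, assuming they behave like an ordered list of regular-expression substitutions, the idea could look like the following. The RULES list and the apply_rules helper are hypothetical and are not part of potnia.

```python
import re

# Hypothetical rules, applied in order: (pattern, replacement).
RULES = [
    (r"\[[^\]]*\]", ""),  # drop editorial brackets and their contents
    (r"\s+", " "),        # collapse runs of whitespace into one space
]


def apply_rules(string: str) -> str:
    """Apply each substitution in turn, then trim the result."""
    for pattern, replacement in RULES:
        string = re.sub(pattern, replacement, string)
    return string.strip()


cleaned = apply_rules("a-ko-ra  [unclear]   qe")
print(cleaned)
```

Keeping the rules as data rather than code matches the idea of a per-script configuration file, although the actual rule format used by the library may differ.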

to_transliteration(text: str) → str

Converts unicode text to transliteration format.

NB. This function may not work as expected for all scripts/languages because there may not be a one-to-one mapping between unicode and transliteration.

Parameters:

text (str) – Input text in unicode format.

Returns:

Transliterated text.

Return type:

str
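
The caveat about one-to-one mappings can be illustrated with a toy table. This is only a sketch: the table, the placeholder characters "X" and "Y", and the round_trip helper are all hypothetical. When two transliteration spellings map to the same character, the reverse direction can return only one of them.

```python
# Hypothetical table: two spellings of one sign map to the same placeholder character.
TO_UNICODE = {"ra2": "X", "ra₂": "X", "ko": "Y"}

# Inverting a many-to-one table can keep only one spelling per character.
TO_TRANSLITERATION: dict[str, str] = {}
for token, char in TO_UNICODE.items():
    TO_TRANSLITERATION.setdefault(char, token)  # the first spelling wins


def round_trip(tokens: list[str]) -> list[str]:
    """Map tokens to characters and back again."""
    chars = [TO_UNICODE[token] for token in tokens]
    return [TO_TRANSLITERATION[char] for char in chars]


original = ["ko", "ra₂"]
recovered = round_trip(original)
lossy = recovered != original  # the subscript spelling comes back as "ra2"
print(recovered, lossy)
```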

to_unicode(text: str, regularize: bool = False) → str

Converts transliterated text to unicode format.

Parameters:
  • text (str) – Input text in transliterated format.

  • regularize (bool, optional) – Whether to apply regularization. Defaults to False.

Returns:

Text converted to unicode format, optionally regularized.

Return type:

str
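
A rough sketch of what such a conversion involves is shown below. It assumes that tokens are separated by hyphens, that unknown tokens pass through unchanged, and that regularization is applied to the converted result. None of this is confirmed by the reference. The SIGNS table and the to_unicode_sketch function are hypothetical stand-ins, not potnia code.

```python
# Toy sign table (illustrative only): three Linear B syllabograms.
SIGNS = {
    "a": "\U00010000",   # Linear B syllable A
    "ko": "\U00010012",  # Linear B syllable KO
    "ra": "\U00010028",  # Linear B syllable RA
}


def to_unicode_sketch(text: str, regularize: bool = False) -> str:
    """Tokenize on hyphens, map each token, and optionally tidy the result."""
    words = []
    for word in text.split():
        tokens = word.split("-")
        # Assumption: tokens missing from the table are left as they are.
        words.append("".join(SIGNS.get(token, token) for token in tokens))
    result = " ".join(words)
    if regularize:
        # Stand-in for regularization: keep only mapped signs and spaces.
        allowed = set(SIGNS.values()) | {" "}
        result = "".join(ch for ch in result if ch in allowed).strip()
    return result


plain = to_unicode_sketch("a-ko-ra ?")
tidy = to_unicode_sketch("a-ko-ra ?", regularize=True)
print(plain)
print(tidy)
```

With regularize left at its default, the unmapped "?" survives in the output; with regularize=True this sketch removes it.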

tokenize_transliteration(text: str) → list[str]

Tokenizes transliterated text according to specific patterns.

Parameters:

text (str) – Input text in transliterated format.

Returns:

List of tokens.

Return type:

list[str]

tokenize_unicode(text: str) → list[str]

Tokenizes unicode text according to specific patterns.

By default, it tokenizes each character as a separate token. This method can be overridden in subclasses to provide more complex tokenization.

Parameters:

text (str) – Input text in unicode format.

Returns:

List of tokens.

Return type:

list[str]
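
To illustrate the difference between the per-character default and an overridden tokenizer, here is a minimal sketch. Both classes are hypothetical stand-ins rather than potnia classes, and the grouping of combining marks is just one example of what a subclass might choose to do.

```python
import unicodedata


class PerCharacter:
    """Hypothetical base class with the documented default behaviour."""

    def tokenize_unicode(self, text: str) -> list[str]:
        # One token per character.
        return list(text)


class WithCombiningMarks(PerCharacter):
    """Hypothetical subclass: keep combining marks attached to their base character."""

    def tokenize_unicode(self, text: str) -> list[str]:
        tokens: list[str] = []
        for char in text:
            # unicodedata.combining returns a non-zero class for combining marks.
            if tokens and unicodedata.combining(char):
                tokens[-1] += char
            else:
                tokens.append(char)
        return tokens


sample = "be\u0301"  # "b", "e", COMBINING ACUTE ACCENT
default_tokens = PerCharacter().tokenize_unicode(sample)
grouped_tokens = WithCombiningMarks().tokenize_unicode(sample)
print(len(default_tokens), len(grouped_tokens))
```

The default yields three tokens for the sample, while the override yields two because the accent stays with the letter it modifies.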

Scripts Available

Linear A

class potnia.scripts.linear_a.LinearA(config: str = 'linear_a.yaml')

Class for handling text transliteration and unicode conversion for Linear A.

To use the singleton instance, import like so: from potnia import linear_a

config

Path to the configuration file or configuration data in string format. By default, it uses the ‘linear_a.yaml’ file in the ‘data’ directory.

Type:

str

config: str = 'linear_a.yaml'
regularize(string: str) → str

Applies regularization rules to a given string.

Parameters:

string (str) – Text string to be regularized.

Returns:

Regularized text string.

Return type:

str

to_transliteration(text: str) → str

Converts unicode text to transliteration format.

NB. This function may not work as expected for all scripts/languages because there may not be a one-to-one mapping between unicode and transliteration.

Parameters:

text (str) – Input text in unicode format.

Returns:

Transliterated text.

Return type:

str

to_unicode(text: str, regularize: bool = False) → str

Converts transliterated text to unicode format.

Parameters:
  • text (str) – Input text in transliterated format.

  • regularize (bool, optional) – Whether to apply regularization. Defaults to False.

Returns:

Text converted to unicode format, optionally regularized.

Return type:

str

tokenize_transliteration(input_string: str) → list[str]

Tokenizes transliterated text according to specific patterns.

Parameters:

input_string (str) – Input text in transliterated format.

Returns:

List of tokens.

Return type:

list[str]

tokenize_unicode(text: str) → list[str]

Tokenizes a unicode string by splitting and joining words with dashes.

Parameters:

text (str) – Input text in unicode format.

Returns:

List of tokenized strings.

Return type:

list[str]
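
The description of this method is terse. One plausible reading, offered only as a guess at the behaviour and not as potnia's code, is that the text is split into words and a dash is placed between adjacent signs within each word. The helper names below are hypothetical; the code point range is that of the Unicode Linear A block.

```python
def is_linear_a(char: str) -> bool:
    """True for code points in the Unicode Linear A block (U+10600 to U+1077F)."""
    return 0x10600 <= ord(char) <= 0x1077F


def join_signs_with_dashes(text: str) -> list[str]:
    """Split on whitespace, then put a dash between adjacent Linear A signs."""
    tokens = []
    for word in text.split():
        pieces = [word[0]]  # words produced by split() are never empty
        for prev, char in zip(word, word[1:]):
            if is_linear_a(prev) and is_linear_a(char):
                pieces.append("-")
            pieces.append(char)
        tokens.append("".join(pieces))
    return tokens


sample = "\U00010600\U00010601 \U00010602"  # a two-sign word and a one-sign word
tokens = join_signs_with_dashes(sample)
print(tokens)
```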

Linear B

class potnia.scripts.linear_b.LinearB(config: str = 'linear_b')

Class for handling text transliteration and unicode conversion for Linear B.

To use the singleton instance, import like so: from potnia import linear_b

Designed especially for texts from DĀMOS (Database of Mycenaean at Oslo): https://damos.hf.uio.no/ and LiBER (Linear B Electronic Resources): https://liber.cnr.it/

config

Path to the configuration file or configuration data in string format. By default, it uses the ‘linear_b.yaml’ file in the ‘data’ directory.

Type:

str

config: str = 'linear_b'
regularize(text: str) → str

Applies regularization rules to a given string.

Parameters:

text (str) – Text string to be regularized.

Returns:

Regularized text string.

Return type:

str

to_transliteration(text: str) → str

Converts unicode text to transliteration format.

NB. This function may not work as expected for all scripts/languages because there may not be a one-to-one mapping between unicode and transliteration.

Parameters:

text (str) – Input text in unicode format.

Returns:

Transliterated text.

Return type:

str

to_unicode(text: str, regularize: bool = False) → str

Converts transliterated text to unicode format.

Parameters:
  • text (str) – Input text in transliterated format.

  • regularize (bool, optional) – Whether to apply regularization. Defaults to False.

Returns:

Text converted to unicode format, optionally regularized.

Return type:

str

tokenize_transliteration(text: str) → list[str]

Tokenizes transliterated text according to specific patterns.

Parameters:

text (str) – Input text in transliterated format.

Returns:

List of tokens.

Return type:

list[str]

tokenize_unicode(text: str) → list[str]

Tokenizes a unicode string by splitting and joining words with dashes.

Parameters:

text (str) – Input text in unicode format.

Returns:

List of tokenized strings.

Return type:

list[str]

Arabic

class potnia.scripts.arabic.Arabic(config: str = 'arabic')

Class for handling text transliteration and unicode conversion for Arabic.

To use the singleton instance, import like so: from potnia import arabic

Uses the DIN 31635 standard for Arabic transliteration.

If you need the Tim Buckwalter transliteration system, then use the PyArabic library.

config

Path to the configuration file or configuration data in string format. By default, it uses the ‘arabic.yaml’ file in the ‘data’ directory.

Type:

str

config: str = 'arabic'
regularize(string: str) → str

Applies regularization rules to a given string.

Parameters:

string (str) – Text string to be regularized.

Returns:

Regularized text string.

Return type:

str

to_transliteration(text: str) → str

Converts unicode text to transliteration format.

NB. This function may not work as expected for all scripts/languages because there may not be a one-to-one mapping between unicode and transliteration.

Parameters:

text (str) – Input text in unicode format.

Returns:

Transliterated text.

Return type:

str

to_unicode(text: str, regularize: bool = False) → str

Converts transliterated text to unicode format.

Parameters:
  • text (str) – Input text in transliterated format.

  • regularize (bool, optional) – Whether to apply regularization. Defaults to False.

Returns:

Text converted to unicode format, optionally regularized.

Return type:

str

tokenize_transliteration(text: str) → list[str]

Tokenizes transliterated text according to specific patterns.

Parameters:

text (str) – Input text in transliterated format.

Returns:

List of tokens.

Return type:

list[str]

tokenize_unicode(text: str) → list[str]

Tokenizes unicode text according to specific patterns.

By default, it tokenizes each character as a separate token. This method can be overridden in subclasses to provide more complex tokenization.

Parameters:

text (str) – Input text in unicode format.

Returns:

List of tokens.

Return type:

list[str]

Hittite

class potnia.scripts.hittite.Hittite(config: str = 'hittite')

Class for handling text transliteration and unicode conversion for Hittite.

To use the singleton instance, import like so: from potnia import hittite

Designed especially for texts from the Catalog der Texte der Hethiter (CTH): https://www.hethport.uni-wuerzburg.de/CTH/index.php

config

Path to the configuration file or configuration data in string format. By default, it uses the ‘hittite.yaml’ file in the ‘data’ directory.

Type:

str

config: str = 'hittite'
regularize(string: str) → str

Applies regularization rules to a given string.

Parameters:

string (str) – Text string to be regularized.

Returns:

Regularized text string.

Return type:

str

to_transliteration(text: str) → str

Converts unicode text to transliteration format.

NB. This function may not work as expected for all scripts/languages because there may not be a one-to-one mapping between unicode and transliteration.

Parameters:

text (str) – Input text in unicode format.

Returns:

Transliterated text.

Return type:

str

to_unicode(text: str, regularize: bool = False) → str

Converts transliterated text to unicode format.

Parameters:
  • text (str) – Input text in transliterated format.

  • regularize (bool, optional) – Whether to apply regularization. Defaults to False.

Returns:

Text converted to unicode format, optionally regularized.

Return type:

str

tokenize_transliteration(input_string: str) → list[str]

Tokenizes transliterated text according to specific patterns.

Parameters:

input_string (str) – Input text in transliterated format.

Returns:

List of tokens.

Return type:

list[str]

tokenize_unicode(text: str) → list[str]

Tokenizes unicode text according to specific patterns.

By default, it tokenizes each character as a separate token. This method can be overridden in subclasses to provide more complex tokenization.

Parameters:

text (str) – Input text in unicode format.

Returns:

List of tokens.

Return type:

list[str]