Database and contributing

The database is stored in YAML files in lib/hyperglot/data/*.yaml with script defaults stored in lib/hyperglot/extra_data/default.yaml.

Updates are committed/merged into the dev branch while the master branch holds the latest released version.

Languages

The highest level entries in the database represent languages indexed using the three-letter ISO 639-3 code. Each language entry can have the following attributes:

  • name (required): the English name of the language as defined by ISO 639-3.
  • preferred_name (optional): an override of the ISO 639-3 name. This is useful when the ISO 639-3 name is considered pejorative or racist. We also use this to simplify very long names and where we have a preference (e.g., Sami over Saami). This can be turned off when using the database via the CLI tool or module to adhere strictly to the naming as defined in ISO 639-3.
  • autonym (optional): the name of the language in the language itself.
  • orthographies (optional): a list of orthographies for the language. See the orthography entry format below.
  • speakers (optional): the number of L1 speakers. Note that this is the number of speakers, not the number of readers. Only integer values are allowed.
  • speakers_date (optional): the publication date of the source used for the speakers count.
  • status (required, default: living): one of the following:
    • living: a language that is currently in use and has some first-language (L1) speakers,
    • historical: a language with no first-language (L1) speakers, or
    • constructed: a language that has been deliberately created, such as Esperanto or Interlingue.
  • sources (required): a list of source references used to compile the data. Please read the Criteria for establishing orthographies and use APA style to format them.
  • validity (required, default: todo): one of the following:
    • todo: an entry that is a work in progress,
    • draft: a complete entry that has not yet been sufficiently verified,
    • preliminary: a complete entry that has been verified using two online sources, or
    • verified: a complete entry that has been verified by a competent reviewer or by two authoritative sources.
  • note (optional): any additional information or clarification.
  • contributors (required): a list of contributors for this file. Typically, a contributor is a person who provides data based on reliable sources rather than solely on personal knowledge of the language.
  • reviewers (optional): a reviewer is typically a competent speaker or a linguist, essentially a contributor that vouches for the data validity with their own expertise. A person can be either a contributor or a reviewer, not both.

Unless stated otherwise above, the default values are either an empty string or an empty list.
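
The attribute rules above can be expressed as a small check. The following is a minimal sketch, not Hyperglot's own validation (use hyperglot-validate for that); the function name check_language_entry and the plain-dictionary input are assumptions made for illustration.

```python
# Illustrative only: check_language_entry is a hypothetical helper, not part
# of the Hyperglot API. It mirrors the attribute rules listed above.
ALLOWED_STATUS = {"living", "historical", "constructed"}
ALLOWED_VALIDITY = {"todo", "draft", "preliminary", "verified"}


def check_language_entry(entry):
    """Return a list of problems found in a language entry (a plain dict)."""
    problems = []
    # Required attributes without a default must be present and non-empty.
    for key in ("name", "sources", "contributors"):
        if not entry.get(key):
            problems.append(f"missing required attribute: {key}")
    # status and validity fall back to their documented defaults.
    if entry.get("status", "living") not in ALLOWED_STATUS:
        problems.append("status must be living, historical, or constructed")
    if entry.get("validity", "todo") not in ALLOWED_VALIDITY:
        problems.append("validity must be todo, draft, preliminary, or verified")
    # speakers is optional, but only integer values are allowed.
    speakers = entry.get("speakers")
    if speakers not in (None, "") and (
        isinstance(speakers, bool) or not isinstance(speakers, int)
    ):
        problems.append("speakers must be an integer")
    # A person is either a contributor or a reviewer, not both.
    overlap = set(entry.get("contributors") or []) & set(entry.get("reviewers") or [])
    for person in sorted(overlap):
        problems.append(f"{person} is listed as both contributor and reviewer")
    return problems
```

An empty result means the entry satisfies the rules covered by this sketch; it says nothing about the orthographies themselves.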

Orthographies

A language can refer to one or more orthographies. An orthography specifies the script and the characters from this script used to represent the language. There can be multiple orthographies for the same language using the same or different scripts. Each entry can have these attributes, which default to an empty string or list unless stated otherwise:

  • base (required unless using inheritance): a string of space-separated Unicode characters or combinations of characters and combining marks that are required to represent the language in common texts.
  • auxiliary (optional): a string of space-separated Unicode characters or combinations of characters and combining marks that are not essential for basic language support, but occur frequently in literature.
  • marks (optional): combining marks needed for the glyph composition of base or auxiliary as well as any additional combining marks required for this orthography. Saving with hyperglot-save will also automatically add any marks that can be decomposed from characters in base or auxiliary to the marks. Marks that are not part of any characters should be added here and they will be required for fonts checking such languages.
  • punctuation (optional): required punctuation characters.
  • numerals (optional): required numeral characters. Do not include mathematical operators.
  • currency (optional): recommended (rather than required) currency symbols.
  • autonym (optional): the name of the language in the language itself using this orthography. If missing, the autonym defined in the parent language entry is used. It is expected that the autonym can be spelled with the orthography's base.
  • script (required): the English name of the script used by the orthography, e.g., Arabic, Armenian, Gujarati. When a language uses several scripts in conjunction, each script forms its own orthography. Script names should follow ISO 15924; do not use script codes.
  • combinations (optional): a list of combinations and their frequencies. This is used in script shaping checks to test whether a font's OpenType instructions handle a given combination.
  • status (required, default: primary): one of the following (there can be multiple orthographies with the same status per language):
    • primary: the current default orthography of a language,
    • secondary: a current but less frequently used orthography of a language (e.g., a competing orthography gaining or losing popularity, or an orthography specific to a geographic location),
    • historical: an orthography that is no longer in use (all orthographies for a historical language should be marked as historical),
    • transliteration: a representation of a language using an alternative script. Orthographies with secondary status are ignored during language support detection, but used when detecting orthography support.
  • preferred_as_group (optional, default: false): if set to true, all orthographies of this language with this attribute must be supported together. In other words, a language is detected as supported only if all its orthographies with this attribute are supported. For example, this is used for Serbian to require both Cyrillic-script and Latin-script orthographies to be supported and for Japanese to require Hiragana, Katakana, and Kanji orthographies to be supported.
  • note (optional): a note regarding the whole orthography, e.g., note about the use, naming, or inclusion of certain characters. Any note on design treatment of the characters should be in design_requirements.
  • design_requirements (optional): a list of design requirement records.

Design requirements

A design requirement describes considerations specific to a given orthography. It should ideally be phrased in a font-format-agnostic way. The record can be general, addressing the design of all characters in the orthography, or specific to one or more characters (specified in the alternates entry). Each entry can have these attributes:

  • note: a note describing the design considerations.
  • alternates (optional): a string of space-separated characters (or character combinations) from the orthography which may require special design treatment different from the exemplars associated with the specific code points in the Unicode specification.

Criteria for establishing orthographies

Hyperglot serves primarily as a tool for language-support discovery. This is in contrast to other tools that attempt to set standards for assessing quality in fonts or that provide examples of characters from a given language. Thus, Hyperglot’s main objective is to minimize false negatives, i.e., the number of languages falsely rejected during analysis of a font. This is the motivation behind the key principle of minimality applied to orthographies. Simply put, include only what needs to be included, nothing else.

  1. Use at least two definitive authoritative sources to establish the base character set. This set includes the minimal essential requirements needed to represent the language in writing, which typically maps to a standard alphabet or syllabary for the language or an approximation thereof. Ideally, these characters should follow the logical order used in the sources.
  2. Characters that appear in frequent loan words or personal names should be included in auxiliary.
  3. Deprecated characters can also be included in auxiliary, e.g., ş ţ for Romanian. You may also consider setting up a secondary/historical orthography instead.
  4. Contributors may decide to include digraphs or trigraphs for the sake of completeness, e.g., ch for Czech. All variations of uppercase and lowercase in common use should be included, e.g., CH, Ch, ch. These are not used during language-support analysis.
  5. Make sure to include any required or auxiliary marks that are not already in base or auxiliary in marks, e.g., the acute accent in Bulgarian Cyrillic.
  6. Common punctuation characters used in literature should be included in punctuation. The same applies to numerals.
  7. Currencies used in the countries where the language is considered official can be included as a recommendation in currency.

After the initial unbridled growth, we aim to focus on improving the authoritativeness of the data by better tracking its provenance.

Authoritative sources include, in the order of preference:

  1. national/governmental standards,
  2. official institutions,
  3. educational materials, e.g., dictionaries,
  4. literature of well-informed writers,
  5. major religious works.

If possible, prefer primary sources over secondary sources such as Wikipedia or Omniglot. All sources used to compile the language and orthographic data should be listed in sources. Please use APA style and Markdown formatting (asterisks for italics; links need no formatting). Note that APA provides guidance on referencing religious works, Wikipedia, and sources with missing reference information. Where available, provide a link or DOI. When citing an online source, try to find a permanent link (Wikipedia provides those) that refers to a particular version of the document.

An example:

sources:
- Breton language. (2024, August 21). In *Wikipedia*. https://en.wikipedia.org/wiki/Breton_language?oldid=1241510288
- '*The Unicode Common Locale Data Repository (CLDR)*. (2024, September 27). The Unicode Consortium. https://cldr.unicode.org'
- Alvestrand, Harald Tveit. (1995) *Characters and character sets for various languages.* Retrieved on September 6, 2021 from https://www.alvestrand.no/ietf/lang-chars.txt

Unless stated otherwise, speaker counts are sourced from Wikipedia.

Contribution notes

  • Languages that are not written should not be included. (This should be self-evident.)
  • Languages that have some speakers should not be marked as historical even if the ISO standard says so.
  • Languages need to have an ISO code and, more generally speaking, should have active speakers (unless historical) and not be purely scientific or theoretical in nature.
  • When adding or editing language data, use the CLI command hyperglot-validate to check that your new data is compatible and hyperglot-save to actually "save" the database in a standardized way (cleanup, sorting, etc.).
  • Note a few things that will happen automatically when saving with hyperglot-save:
    • Marks found in base or auxiliary will be automatically added to marks.
    • All marks entries will be placed on top of a dotted circle (◌) for easier readability.
    • All character list entries will be spaced with a single space between them, on one line.
    • All language and orthography attributes will be sorted alphabetically (a–z); while this might not be the most intuitive order, it ensures that data is always sorted the same way, so comparing different versions of the data (with version control) yields predictable results.
  • When contributing code make sure to install the pytest package and run pytest tests to make sure no errors are detected. Ideally, write tests for any code additions or changes you make.
  • Add yourself to any language files you edit, and add yourself to CONTRIBUTORS.txt.
  • Hyperglot uses a cache file .hyperglot-cache stored in a local directory. This may cause confusion, as it currently only gets deleted when hyperglot-save is run. We hope to improve this in the future.
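
The automatic collection of marks mentioned in the notes above can be approximated with Python's standard unicodedata module. This is a sketch of the idea only; how hyperglot-save actually derives marks is not shown here, and the function name decomposed_marks is hypothetical.

```python
import unicodedata


def decomposed_marks(characters):
    """Collect combining marks that decompose out of space-separated characters."""
    marks = []
    for entry in characters.split():
        # NFD splits precomposed characters into a base letter plus marks.
        for char in unicodedata.normalize("NFD", entry):
            # General categories Mn, Mc, and Me are the Unicode mark categories.
            if unicodedata.category(char).startswith("M") and char not in marks:
                marks.append(char)
    return marks


# å decomposes to a + U+030A; æ and ø have no canonical decomposition.
print([f"U+{ord(m):04X}" for m in decomposed_marks("å æ ø")])  # ['U+030A']
```

Marks that never appear as part of a precomposed character cannot be found this way, which is why they have to be listed in marks by hand.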

Inheritance

These attributes of an orthography can inherit from other languages/orthographies: base, auxiliary, marks, punctuation, numerals, currency, design_requirements.

Inheritance uses the ISO 639-3 code of the language to inherit from, with optional script name, orthography status, and attribute specification.

Examples:

# Inherit the base characters of eng to this orthography's base attribute
base: <eng>

# Inherit the auxiliary characters of eng, but into the base attribute of this orthography
base: <eng auxiliary>

# Inherit the ott transliteration orthography
base: <ott transliteration>

# Inherit the Cyrillic script base from srp (note script names are title case)
base: <srp Cyrillic>

# The inherited characters are inserted in place of the <...> so
base: Å <eng> À Á
# will result in:
base: Å A B C (etc. all from eng) À Á

Warning: Avoid using <g> (less/greater) character highlights elsewhere in notes etc., e.g., use ‘g’ or ‹g› (single guillemets) instead.
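
The substitution described above can be sketched with a regular expression. This is an illustration under stated assumptions, not Hyperglot's parser: DATABASE is a toy in-memory stand-in for the YAML files, the names pick and expand are hypothetical, the first matching orthography is used when several match, script names are assumed to be single words, and only string-valued attributes are handled. A pattern like this would also match a stray <g> in free text, which is one way to understand the warning above.

```python
import re

# Toy stand-in for the database: ISO 639-3 code -> list of orthographies.
DATABASE = {
    "eng": [
        {"script": "Latin", "status": "primary", "base": "A B C", "auxiliary": "Æ"},
    ],
    "srp": [
        {"script": "Cyrillic", "status": "primary", "base": "А Б В"},
        {"script": "Latin", "status": "primary", "base": "A B V"},
    ],
}
ATTRIBUTES = {"base", "auxiliary", "marks", "punctuation", "numerals",
              "currency", "design_requirements"}
STATUSES = {"primary", "secondary", "historical", "transliteration"}
REFERENCE = re.compile(r"<([^<>]+)>")


def pick(code, attribute, script=None, status=None):
    """Return the attribute of the first orthography matching the filters."""
    for orthography in DATABASE[code]:
        if script and orthography["script"] != script:
            continue
        if status and orthography.get("status", "primary") != status:
            continue
        return orthography.get(attribute, "")
    raise LookupError(f"no matching orthography for <{code}>")


def expand(value, attribute):
    """Replace each <code [script] [status] [attribute]> with inherited text."""
    def substitute(match):
        code, *options = match.group(1).split()
        script = status = None
        source_attribute = attribute
        for option in options:
            if option in ATTRIBUTES:
                source_attribute = option
            elif option in STATUSES:
                status = option
            else:
                script = option  # anything else is taken as a script name
        # Recurse so that inherited values may themselves contain references.
        return expand(pick(code, source_attribute, script, status), source_attribute)
    # Normalize to single spaces, as the saved data uses.
    return " ".join(REFERENCE.sub(substitute, value).split())


print(expand("Å <eng> À Á", "base"))      # Å A B C À Á
print(expand("<eng auxiliary>", "base"))  # Æ
print(expand("<srp Cyrillic>", "base"))   # А Б В
```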

Using defaults

Since many languages of a given script share a basic set of attributes, there are convenience defaults. When possible, use these defaults and avoid deeply nested inheritance chains. You can use the contents of lib/hyperglot/extra_data/default.yaml for inheritance as if it were an ISO code, e.g.:

numerals: <default Arabic>
punctuation: <default Latin> <default Cyrillic>

If you wish to expand the defaults file, consult the maintainers (it only makes sense to include defaults for scripts with multiple languages).

Macrolanguages

Macrolanguages are used in the ISO 639-3 standard to keep it compatible with ISO 639-2 in situations where one language entry in ISO 639-2 corresponds to a group of languages in ISO 639-3. Macrolanguages are typically not used in Hyperglot’s main database. They are stored in a separate file in other/hyperglot_macrolanguages.yaml for convenience.

However, in some situations, we prefer to include certain macrolanguages as if they were regular ISO 639-3 languages. This is done to simplify the listings or to deal with scarcity of information for its sub-languages. Besides the same attributes as language entries, macrolanguages can use the following:

  • includes (required): a list of ISO 639-3 codes referring to the sub-languages of the macrolanguage.
  • preferred_as_individual (optional, default: false): setting this to true signifies that the macrolanguage is included in the main database as if it were a regular language.

Example of an individual language with a single orthographic entry

lib/hyperglot/data/dan.yaml:

orthographies:
- base: a b c d e f g h i j k l m n o p q r s t u v w x y z å æ ø
  auxiliary: ǻ  # this character is used only in linguistic literature for Danish
  autonym: Dansk
  script: Latin
name: Danish
speakers: 6000000
sources:
- Ager, S. (2021, May 4). *Omniglot* https://www.omniglot.com
- Danish language. (2024, August 19). In *Wikipedia*. https://en.wikipedia.org/wiki/Danish_language?oldid=1241084738
- '*The Unicode Common Locale Data Repository (CLDR)*. (2024, September 27). The Unicode Consortium. https://cldr.unicode.org'
- Quotation mark. (2024, September 26). In *Wikipedia*. https://en.wikipedia.org/w/index.php?title=Quotation_mark&oldid=1247790780
todo_status: strong  # status of the database record

Example of a macrolanguage entry

lib/hyperglot/data/fas.yaml:

name: Persian
includes: [pes, prs, tgk, aiq, bhh, haz, jpr, phv, deh, jdt, ttt]
speakers: 70000000
sources:
- Iranian Persian. (2024, July 31). In *Wikipedia*. https://en.wikipedia.org/wiki/Iranian_Persian?oldid=1237740929
- Ager, S. (2021, May 4). *Omniglot* https://www.omniglot.com

Development and contributions

Contributions are most welcome. If you wish to update the database, submit a pull request with an edited and validated version of the hyperglot/data files. Ideally, use hyperglot-validate and hyperglot-save, as these check and format the data in a way consistent with the database standards.

To start a new language entry, you can use this template and include it in hyperglot/data/*.yaml as a new language draft in your pull request or GitHub issue:

name: # required
orthographies:
- autonym: # optional, name of the language in this language and orthography
  auxiliary: # optional
  base: # required
  currency: # optional
  marks: # optional
  numerals: <default>
  punctuation: <default>
  script: # required
  status: primary
sources: # required (a list)
# - an APA-style reference to the source of the data
speakers: # optional, integer
speakers_date: # optional, YYYY
note: # optional
design_requirements: # optional (a list)
# - note: optional
#   alternates: optional
status: living
validity: draft
contributors:
- Your Name # if you are contributing the data solely based on the sources
reviewers:
# - Your Name # if you claim expertise in this language

Development

To run the script during development without having to constantly reinstall the pip package, you can use:

git clone https://github.com/rosettatype/hyperglot.git && cd hyperglot
pip install --upgrade --user --editable .
pip install -r requirements-tests.txt

To test the codebase after making changes, run the pytest test suite:

pytest

To validate, sort, and verify the data integrity of hyperglot/data and generate a report of any formatting errors, run:

hyperglot-validate

To save hyperglot/data, use the following (this will format, sort, and prune the data read in from the individual YAML files):

hyperglot-save

Note that this will read and write the YAML files and may change the formatting of your files.