This document describes the module for frequency, attestation and corpus information of the OntoLex Lexicon Model for Ontologies (OntoLex-Lemon) developed by the W3C Community Group Ontology-Lexica. The module is targeted at complementing dictionaries and other linguistic resources containing lexicographic data with a vocabulary to express frequency, attestation and other corpus-derived information.
This document is an official report of the OntoLex community group. It does not represent the view of single individuals but reflects the consensus and agreement reached as part of the regular group discussions. The report should be regarded as the official specification of lemon.
If you wish to make comments regarding this document, please send them to public-ontolex@w3.org.
OntoLex-Lemon provides a core vocabulary to represent linguistic information associated with ontology and vocabulary elements. The model follows the principle of semantics by reference in the sense that the semantics of a lexical entry is expressed by reference to an individual, class or property defined in an ontology. The OntoLex module for Frequency, Attestations and Corpus-Based Information (OntoLex-FrAC) complements OntoLex-Lemon with the capability of including information drawn from or found in corpora and linguistic primary data.
In particular, the model’s primary motivation is to provide a means to link lexical resources to corpora and other collections of text, and to express the relationship between lexical information and the primary data from which it is derived. As such this module will:
This is a list of relevant namespaces that will be used in the rest of this document:
OntoLex module for frequency, attestation and corpus information
@prefix frac: <http://www.w3.org/ns/lemon/frac#> .
OntoLex (core) model and other lemon modules:
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix synsem: <http://www.w3.org/ns/lemon/synsem#> .
@prefix decomp: <http://www.w3.org/ns/lemon/decomp#> .
@prefix vartrans: <http://www.w3.org/ns/lemon/vartrans#> .
@prefix lime: <http://www.w3.org/ns/lemon/lime#> .
@prefix lexicog: <http://www.w3.org/ns/lemon/lexicog#> .
Other models:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix owl: <http://www.w3.org/2002/07/owl#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix lexinfo: <http://www.lexinfo.net/ontology/3.0/lexinfo#>.
@prefix dct: <http://purl.org/dc/terms/>.
@prefix oa: <http://www.w3.org/ns/oa#>.
@prefix dcterms: <http://purl.org/dc/terms/> .
The following diagram depicts the OntoLex module for frequency, attestation and corpus information (OntoLex-FrAC). Boxes represent classes of the model. Arrows with filled heads represent object properties. Arrows with empty heads represent rdfs:subClassOf. Vocabulary elements introduced by this module are shaded grey (classes) or set in italics.
OntoLex-FrAC provides the necessary vocabulary to express observations obtained from a language resource about any linguistic or conceptual entity that can be observed in a corpus (“observable”). By observable, we mean lexical elements (ontolex:LexicalEntry, ontolex:Form, ontolex:LexicalSense or ontolex:LexicalConcept), as well as the ontological entities they refer to (by means of an ontolex:denotes, ontolex:reference or ontolex:isConceptOf property).
The top-level concepts of OntoLex-FrAC are thus frac:Observable and frac:Observation, complemented by a property, frac:observedIn, designating where the observation has been made.
Observable (Class)
URI: http://www.w3.org/ns/lemon/frac#Observable
Observable is a superclass for any element of a lexical resource that frequency, attestation or corpus-derived information can be expressed about. This includes, among others, ontolex:LexicalEntry, ontolex:LexicalSense, ontolex:Form, and ontolex:LexicalConcept. Elements that FrAC properties apply to must be observable in a corpus or another linguistic data source.
For OntoLex, we assume that frequency, attestation and corpus information can be provided about every linguistic content element in the OntoLex-Lemon core model and in existing or forthcoming OntoLex modules. This includes ontolex:Form (for token frequency, etc.), ontolex:LexicalEntry (frequency of disambiguated lemmas), ontolex:LexicalSense (sense frequency), ontolex:LexicalConcept (e.g., synset frequency), lexicog:Entry (if used for representing homonyms: frequency of non-disambiguated lemmas), etc. (cf. Fig. 1). In particular, we consider all these elements to be countable and annotatable/attestable. For this reason, we introduce frac:Observable as a top-level element within the FrAC module that is used to define the rdfs:domain of any properties that link lexical and corpus-derived information.
The definition of frac:Observable does not posit an exhaustive list of possible observables. Instead, anything that can be observed in a corpus can be defined as a frac:Observable. This includes elements of OntoLex modules not listed here (e.g., decomp:Component, synsem:SyntacticArgument, etc.) or future OntoLex vocabularies. Likewise, it can also include URIs which have no relation to OntoLex whatsoever, as these are foreseen as external elements that OntoLex-Lemon can provide information about, but only if they are based on or linked with corpus information, attested in a document, a text or its annotations.
Observation (Class)
URI: http://www.w3.org/ns/lemon/frac#Observation
Observation is a superclass for anything that can be observed in a corpus about an Observable.
SubClassOf: frac:observedIn exactly 1, dct:description min 1, rdf:value exactly 1
Observations as understood here are empirical (quantitative) observations that are made against a corpus, a text, a document or another type of language data. Observations can be made in any kind of (collection or excerpt of) linguistic data at any scale, structured or unstructured, regardless of its physical materialization (as an electronic corpus, as a series of printed books, as a bibliographical database or as metadata record for a particular corpus).
observedIn (ObjectProperty)
URI: http://www.w3.org/ns/lemon/frac#observedIn
For a frac:Observation, the property observedIn defines the URI of the data source (or its metadata entry) that this particular observation was made in or derived from. This can be, for example, a corpus or a text represented by its access URL, a book represented by its bibliographical metadata, etc.
Domain: frac:Observation
Range: anyURI
Lexicographers use (corpus) frequency and distribution information while compiling lexical entries, as a qualitative assessment of their resources. In this module, we focus on absolute frequencies, as relative frequencies can be derived if absolute frequencies and totals are known. Absolute frequencies are used in computational lexicography, and they are an essential piece of information for NLP and corpus linguistics.
Frequency (Class)
URI: http://www.w3.org/ns/lemon/frac#Frequency
Frequency is a frac:Observation of the absolute number of attestations (rdf:value) of a particular frac:Observable (see frac:frequency) that is observed in a particular data source (frac:observedIn). Using frac:unit, frequency objects can also identify the (segmentation) unit that their counts are based on.
SubClassOf: frac:Observation
SubClassOf: rdf:value exactly 1, frac:observedIn exactly 1
A frequency should have a unit that specifies the segmentation unit of the frequency count. This can be, for example, “tokens”, “types”, “lemmas”, “sentences”, “paragraphs”, etc.
unit (Property)
URI: http://www.w3.org/ns/lemon/frac#unit
For a frac:Frequency object, the property unit provides an identifier of the respective segmentation unit.
rdfs:domain frac:Frequency
Examples of values of frac:unit include string literals such as "tokens", "sentences", etc. If a future community standard provides reference URIs for such units, frac:unit should be used as an object property. Until such a convention has been established, it is recommended to use it as a datatype property.
frequency (ObjectProperty)
URI: http://www.w3.org/ns/lemon/frac#frequency
The property frequency assigns a particular frac:Observable a frac:Frequency.
rdfs:domain frac:Observable
rdfs:range frac:Frequency
Each frequency is observed in exactly one data source, so the frequency value should aggregate over all sources and languages contained in that dataset.
The definition above only applies to absolute frequencies. For expressing relative frequencies, we expect the associated data source (frac:observedIn) object to define a total of elements contained (frac:total). In many practical applications, it is necessary to provide relative counts, and in this way, these can be easily derived from the absolute (element) frequency provided by the Frequency class and the total defined by the underlying corpus. If the real absolute values are unknown and only relative scores are provided, data providers should use percentage values for both the Frequency rdf:value and for the frac:total (i.e., 100%) of the associated corpus.
A simple example of indicating the frequency of a word in a corpus is given below:
The identifiers and data are drawn from the Open English Wordnet project; they are simplified for explanatory purposes.
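As a minimal sketch of the pattern described above, the following Turtle attaches one frac:Frequency to a lexical sense. The ex: namespace, the sense and corpus IRIs and the count are hypothetical placeholders chosen for illustration and are not taken from Open English Wordnet.

```turtle
@prefix frac:    <http://www.w3.org/ns/lemon/frac#> .
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix dct:     <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/lexicon#> .   # hypothetical namespace

# Hypothetical sense; the IRI and the count are placeholders.
ex:cat-n-sense1 a ontolex:LexicalSense ;
    frac:frequency [
        a frac:Frequency ;
        rdf:value "18"^^xsd:integer ;                       # absolute count (placeholder)
        frac:unit "tokens" ;                                # segmentation unit of the count
        dct:description "sense-annotated token count" ;     # free-text method description
        frac:observedIn <http://example.org/corpus/demo>    # hypothetical corpus IRI
    ] .
```

The blank node keeps the count, its unit and its data source together, so that several frequencies from different corpora can be attached to the same observable.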
The following example illustrates word and form frequencies for the Sumerian word a (n.) “water” from the Electronic Penn Sumerian Dictionary and the frequencies of the underlying corpus.
The example shows an orthographic variation (in the original writing system, Sumerian Cuneiform sux-Xsux, and its Latin transcription sux-Latn). It is slightly simplified insofar as the ePSD2 provides individual counts for different periods and only three of six orthographical variants are given. Note that these are orthographical variants, not morphological variants (which are not given in the dictionary).
total (ObjectProperty)
URI: http://www.w3.org/ns/lemon/frac#total
The object property total assigns any potential FrAC data source (i.e., dct:Collection, dct:Dataset, dct:Text or any other member of DCMI Type) the total number of elements that it contains as a frac:Frequency object.
Domain: class that is a dcam:memberOf DCMI Type
Range: frac:Frequency
For frac:total, users should provide both the frequency and the segmentation unit over which this frequency is obtained. For an observable, relative frequencies (for any given unit u) can then be calculated from the object values of frac:frequency/rdf:value and frac:frequency/frac:observedIn/frac:total/rdf:value if (and only if) the corresponding units match.
An example of the use of frac:total is given below:
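The following is a hedged sketch rather than a listing from an actual resource: a corpus declares its total size, and an entry declares its absolute frequency in the same corpus with the same unit. All ex: IRIs and both counts are hypothetical; typing the corpus with a DCMI Type class is omitted for brevity.

```turtle
@prefix frac:    <http://www.w3.org/ns/lemon/frac#> .
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix dct:     <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/lexicon#> .   # hypothetical namespace

# Hypothetical corpus with a total of 1,000,000 tokens.
ex:demoCorpus
    frac:total [
        a frac:Frequency ;
        rdf:value "1000000"^^xsd:integer ;
        frac:unit "tokens" ;
        dct:description "total number of tokens in the corpus" ;
        frac:observedIn ex:demoCorpus
    ] .

# Hypothetical entry observed 250 times in that corpus.
ex:water-n a ontolex:LexicalEntry ;
    frac:frequency [
        a frac:Frequency ;
        rdf:value "250"^^xsd:integer ;
        frac:unit "tokens" ;
        dct:description "token count of the lemma" ;
        frac:observedIn ex:demoCorpus
    ] .

# Both units are "tokens", so a relative frequency can be derived:
# 250 / 1000000 = 0.00025
```

Because the relative value is derivable, it is not stored; a consumer would compute it only after checking that both frac:unit values agree.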
Attestations constitute a special form of citation that provides evidence for the existence of a certain lexical phenomenon; they can elucidate meaning or illustrate various linguistic features.
In scholarly dictionaries, attestations are a representative selection from the occurrences of a headword in a textual corpus. These citations often consist of a quotation accompanied by a reference to the source. The quoted text usually contains the occurrence of the headword.
Attestation (Class)
URI: http://www.w3.org/ns/lemon/frac#Attestation
An Attestation is a frac:Observation that represents one exact or normalized quotation or excerpt from a source document that illustrates a particular form, sense, lexeme or features such as spelling variation, morphology, syntax, collocation, register. For an attestation, rdf:value represents the text of a quotation as represented in the original source.
SubClassOf: rdf:value max 1
SubClassOf: frac:Observation
Attestations are linked with the frac:attestation property to the frac:Observable they attest.
attestation (ObjectProperty)
URI: http://www.w3.org/ns/lemon/frac#attestation
The property frac:attestation associates an attestation with the frac:Observable it attests. It is a subproperty of frac:citation that uses concrete data as evidence.
Domain: Observable
Range: Attestation
SubPropertyOf: citation
As an example of an attestation, consider the following from Open English Wordnet:
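As a stand-in illustration of the pattern, the sketch below links a sense to a single attestation. The ex: IRIs, the quoted sentence and the source IRI are hypothetical placeholders and are not taken from Open English Wordnet.

```turtle
@prefix frac:    <http://www.w3.org/ns/lemon/frac#> .
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dct:     <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/lexicon#> .   # hypothetical namespace

# Hypothetical sense with one attestation.
ex:cat-n-sense1 a ontolex:LexicalSense ;
    frac:attestation [
        a frac:Attestation ;
        rdf:value "The cat sat on the mat." ;                 # quotation as found in the source (placeholder)
        dct:description "usage example for the sense" ;       # free-text description
        frac:observedIn <http://example.org/corpus/demo>      # hypothetical source IRI
    ] .
```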
In general, the object of a citation represents the successful act of citing an entity which can be referred to by a standardised bibliographic reference, cf. Peroni (2012):
[a Citation is] “a conceptual directional link from a citing entity to a cited entity, created by a human performative act of making a citation, typically instantiated by the inclusion of a bibliographic reference in the reference list of the citing entity, or by the inclusion within the citing entity of a link, in the form of an HTTP Uniform Resource Locator (URL), to a resource on the World Wide Web”.
Citations are given with the following property:
citation (ObjectProperty)
URI: http://www.w3.org/ns/lemon/frac#citation
The property frac:citation associates a citation with the Observable citing it.
Domain: Observable
However, note that FrAC does not formally define a general “Citation” class to define the range of citation, but only provides Attestation as one specific possibility. Beyond attestations, different vocabularies have been suggested for linking bibliographical information, and we advise users of FrAC to make a consistent choice among them, adequate for their respective needs and the conventions of their users’ community. frac:citation serves as an interface to these external vocabularies. If the CITO vocabulary is used in a particular resource, FrAC citations can be defined as the subclass of CITO citations having a frac:Observable as citing entity, and attestations would correspond to citations with the cito:hasCitationCharacterization value citesAsEvidence. Other relevant vocabularies include, for example, BIBFRAME, FRBR and FaBiO, but also generic vocabularies such as schema.org.
Glosses are used to give the form of the text as used in the dictionary. This property should not be used to provide direct quotations from the original data source, which should be represented by rdf:value. Instead, its recommended use is for representations that are either enriched (e.g., by annotations and metadata), amended (e.g., by expanding ligatures or omissions), simplified (e.g., by omissions from the original context, e.g., of the lexeme under consideration) or otherwise differentiated from the plain text representation of the context.
gloss (Property)
URI: http://www.w3.org/ns/lemon/frac#gloss
The gloss of an attestation contains the text content of an attestation as represented within a dictionary.
Domain: Attestation
Range: xsd:string
With frac:gloss and rdf:value, frac:Attestation provides two different properties to represent the context of an observable in any particular data source. rdf:value should provide information as found in the underlying corpus, e.g., a plain text string. If the dictionary provides a different representation, or if the attestation as given in an underlying dictionary has not yet been confirmed to match the context in the underlying corpus, applications should use frac:gloss instead of rdf:value. In other words, rdf:value corresponds to the representation of the context in the underlying corpus, frac:gloss to its representation in the underlying dictionary. If both are confirmed to be equal, use rdf:value.
As an example, for Old English hwæt-hweganunges, Bosworth (2014) gives the example "Ða niétenu ðonne beóþ hwæthuguningas [MS. Cote. -hwugununges] ...". In OntoLex-FrAC, this would be the frac:gloss because it contains additional information about spelling variation/normalized spelling not found in the quoted source (MS. Cote.):
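A minimal sketch of how this dictionary citation could be encoded is shown below. The gloss text is the one quoted above (with its elision kept as is); the ex: namespace, the entry IRI and the source IRI are hypothetical. No rdf:value is given, because the dictionary text has not been confirmed against the manuscript.

```turtle
@prefix frac:    <http://www.w3.org/ns/lemon/frac#> .
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix ex:      <http://example.org/lexicon#> .   # hypothetical namespace

# Hypothetical entry IRI for the Old English headword.
ex:hwaet-hweganunges a ontolex:LexicalEntry ;
    frac:attestation [
        a frac:Attestation ;
        # Dictionary representation, including the editorial note on the manuscript spelling.
        frac:gloss "Ða niétenu ðonne beóþ hwæthuguningas [MS. Cote. -hwugununges] ..." ;
        frac:observedIn ex:sourceManuscript     # hypothetical IRI for the quoted source
    ] .
```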
In many applications, it is desirable to specify the location of the occurrence of a headword in the quoted text of an attestation, for example, by means of character offsets. The FrAC standard supports referencing using RFC5147 character offsets, Text Fragments, NIF URIs, or by means of Web Annotation references (see Section 6). As different vocabularies can be used to establish locus objects, the FrAC vocabulary is underspecified with respect to the exact nature of the locus object. Accordingly, the locus property that links an attestation with its source takes any URI as its object.
locus (ObjectProperty)
URI: http://www.w3.org/ns/lemon/frac#locus
frac:locus points to the location at which the relevant word(s) can be found.
Domain: Attestation
frac:locus denotes a specific location within a text, e.g., a character offset or a URI pointing to a specific location in a text. In contrast, frac:observedIn can refer to a corpus or other collections of texts. frac:locus normally refers to a location identified by RFC5147 character offsets, NIF URIs, Open Annotation or Text Fragments references, whereas frac:observedIn refers to dct:Texts or dct:Collections.
A collocation is a sequence of words or terms that co-occur more often than would be expected by chance. Often, collocations are idiomatic expressions, but they can also be more general, such as “strong tea” or “heavy rain”.
Collocation analysis is an important tool for lexicographical research and instrumental for modern NLP techniques. It has been the mainstay of 1990s corpus linguistics and continues to be an area of active research in computational philology and lexicography.
Collocations are usually defined on surface-oriented criteria, i.e., as a relation between forms or lemmas (lexical entries), not between senses, but they can be analyzed on the level of word senses (the sense that gave rise to the idiom or collocation). Indeed, collocations often contain a variable part, which can be represented by an ontolex:LexicalConcept.
Collocations can involve two or more words; they are thus modelled as an rdfs:Container of frac:Observables. Collocations may have a fixed or a variable word order. Where fixed word order is required, the collocation must be defined as a sequence (rdf:Seq); otherwise, the default interpretation is as an unordered set (rdf:Bag).
Collocations obtained by quantitative methods are characterized by their method of creation (dct:description), their collocation strength (rdf:value), and the corpus or data source used to create them (frac:observedIn). Collocations share these characteristics with other frac:Observations and thus, these are inherited from the frac:Observation class.
Collocation (Class)
URI: http://www.w3.org/ns/lemon/frac#Collocation
A Collocation is a frac:Observation that describes the co-occurrence of two or more frac:Observables within the same context window and that can be characterized by their collocation score (or weight, frac:cScore) in a particular data source (frac:observedIn).
SubClassOf: frac:Observation, rdfs:Container, frac:Observable
rdfs:member: only frac:Observable
SubClassOf: frac:head max 1
Collocations are collections of frac:Observables, and formalized as rdfs:Container, i.e., rdf:Seq or rdf:Bag. The elements of any collocation can be accessed by rdfs:member. In addition, the elements of an ordered collocation (rdfs:subClassOf rdf:Seq) can be accessed by means of numerical indices (rdf:_1, rdf:_2, etc.).
By default, frac:Collocation is insensitive to word order. If a collocation is word order sensitive, it should be defined as rdfs:subClassOf rdf:Seq. Collocation analysis typically involves additional parameters such as the size of the context window considered. Such information can be provided in human-readable form in dct:description.
FrAC collocations can be used to represent collocations both in the lexicographic sense (as complex units of meaning) and in the quantitative sense (as determined by collocation metrics over a particular corpus), but the quantitative interpretation is the preferred one in the context of FrAC. To mark collocations in the lexicographic sense as such, they can be assigned a corresponding lexinfo:termType, e.g., by means of lexinfo:idiom, lexinfo:phraseologicalUnit or lexinfo:setPhrase. If explicit sense information is being provided, the recommended modelling is by means of ontolex:MultiWordExpression and the OntoLex-Decomp module rather than frac:Collocation. To provide collocation scores about an ontolex:MultiWordExpression, it can be linked via rdfs:member with a frac:Collocation.
Collocations are frac:Observables, i.e., they can be ascribed frac:frequency, frac:attestation, frac:embedding, they can be described in terms of their (embedding) similarity, and they can be nested inside larger collocations.
Collocations can be described in terms of various collocation scores. If scores for multiple metrics are being provided, these should not use the generic rdf:value property, but a designated subproperty of frac:cScore:
cScore (property)
URI: http://www.w3.org/ns/lemon/frac#cScore
Collocation score is a subproperty of rdf:value that provides the value for one specific type of collocation score for a particular collocation in its respective corpus. Note that this property should not be used directly; instead, use its respective sub-properties for scores of a particular type.
SubPropertyOf: rdf:value
domain: frac:Collocation
LexInfo defines a number of popular collocation metrics as sub-properties of frac:cScore:
lexinfo:relFreq (relative frequency; asymmetric, requires frac:head)
lexinfo:pmi (pointwise mutual information, sometimes referred to as MI-score or association ratio, cf. Church and Hanks 1990, via Evert 2005)
lexinfo:pmi2 (PMI²-score)
lexinfo:pmi3 (PMI³-score, cf. Daille 1994 in Evert 2005, p. 89)
lexinfo:pmiLogFreq (PMI.log-f, salience, formerly default metric in SketchEngine)
lexinfo:dice (Dice coefficient)
lexinfo:logDice (default metric in SketchEngine, Rychly 2008)
lexinfo:minSensitivity (minimum sensitivity, cf. Pedersen 1998)
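For orientation, the following sketch states some of these scores in the form in which they are commonly given in the literature. The notation is an assumption of this sketch and not part of FrAC or LexInfo: $f_x$ and $f_y$ are the corpus frequencies of the head $x$ and its collocate $y$, $f_{xy}$ is their co-occurrence frequency, and $N$ is the corpus size.

```latex
\begin{align*}
\mathrm{relFreq}(x,y)        &= \frac{f_{xy}}{f_x} \\
\mathrm{PMI}(x,y)            &= \log_2 \frac{f_{xy}\, N}{f_x\, f_y} \\
\mathrm{Dice}(x,y)           &= \frac{2\, f_{xy}}{f_x + f_y} \\
\mathrm{logDice}(x,y)        &= 14 + \log_2 \mathrm{Dice}(x,y) \\
\mathrm{minSensitivity}(x,y) &= \min\!\left(\frac{f_{xy}}{f_x},\ \frac{f_{xy}}{f_y}\right)
\end{align*}
```

Only relFreq depends on which element is designated as the head; the other four are symmetric in $x$ and $y$.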
In addition to collocation scores, statistical independence tests can also be employed as collocation scores:
lexinfo:logLikelihood (log likelihood, G², Dunning 1993, via Evert 2005)
lexinfo:tScore (Student’s t test, T-score, cf. Church et al. 1991, via Evert 2005, p. 82)
lexinfo:chi2 (Pearson’s Chi-square test, Manning 1999)
In addition to classical collocation metrics, as established in computational lexicography and corpus linguistics, related metrics can also be found in other disciplines and are likewise represented here as subproperties of frac:cScore. This includes metrics for association rule mining. In this context, an association rule (collocation) means that the existence of word x implies the existence of word y.
lexinfo:support (the support is an indication of how frequently the rule appears in the dataset, with N the total number of collocations)
lexinfo:confidence (the confidence is an indication of how often the rule has been found to be true)
lexinfo:lift (the lift or interest of a rule measures how many times more often x and y occur together than expected if they are statistically independent)
lexinfo:conviction (the conviction of a rule is interpreted as the ratio of the expected frequency that x occurs without y, i.e., the frequency that the rule makes an incorrect prediction, if x and y are independent, divided by the observed frequency of incorrect predictions)
As OntoLex does not provide a generic inventory for grammatical relations, scores defined for grammatical relations are omitted. However, these may be defined by the user.
Many of these metrics are asymmetric and distinguish the lexical element they are about (the head) from its collocate(s). If such metrics are provided, a collocation should explicitly identify its head:
head (property)
URI: http://www.w3.org/ns/lemon/frac#head
The head property identifies the element of a collocation that its scores are about. A collocation must not have more than one head.
domain: frac:Collocation
range: frac:Observable
As an example, the relative frequency score is the number of occurrences of a collocation relative to the overall frequency of its head.
The function of the property frac:head is restricted to indicating the directionality of asymmetric collocation scores. It must not be confused with the notion of “head” in certain fields of linguistics, e.g., dependency syntax.
The following example illustrates collocations as provided by the Wortschatz portal (scores and definitions as provided for beans, spill the beans, etc.).
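As a hedged sketch of the modelling pattern only, the Turtle below encodes an order-sensitive collocation of spill and beans with beans as its head. The ex: IRIs, the corpus IRI and both score values are hypothetical placeholders and are not Wortschatz data.

```turtle
@prefix frac:    <http://www.w3.org/ns/lemon/frac#> .
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix lexinfo: <http://www.lexinfo.net/ontology/3.0/lexinfo#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix dct:     <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/lexicon#> .   # hypothetical namespace

ex:spill-v a ontolex:LexicalEntry .
ex:beans-n a ontolex:LexicalEntry .

# Word order matters here, so the collocation is also typed as rdf:Seq.
ex:spill_beans a frac:Collocation, rdf:Seq ;
    rdf:_1 ex:spill-v ;
    rdf:_2 ex:beans-n ;
    frac:head ex:beans-n ;                               # element the asymmetric score is about
    lexinfo:relFreq "0.05"^^xsd:double ;                 # placeholder value
    lexinfo:logDice "7.5"^^xsd:double ;                  # placeholder value
    dct:description "adjacent words, sentence window" ;  # free-text method description
    frac:observedIn <http://example.org/corpus/demo> .   # hypothetical corpus IRI
```

Using the typed sub-properties of frac:cScore rather than rdf:value lets one collocation carry several scores side by side.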
The OntoLex Module for Frequency, Attestation and Corpus Information does not specify a vocabulary for annotating corpora or other data with lexical information, as this is provided by the Web Annotation Vocabulary. The following description is non-normative, as Web Annotation is defined in a separate W3C recommendation. The definitions below are reproduced only insofar as domain and range declarations have been refined to our use case.
In Web Annotation terminology, the annotated element is the ‘target’, the content of the annotation is the ‘body’, and the process and provenance of the annotation is expressed by properties of oa:Annotation.
Annotation as linked with the oa:hasBody and oa:hasTarget properties:
The Web Annotation Vocabulary supports different ways to define targets. This includes:
oa:Annotation explicitly allows n:m relations between ontolex:Elements and elements of the annotated data. It is thus sufficient for every ontolex:Element to appear in one oa:hasBody statement in order to produce a full annotation of the corpus.
As for frequency, embeddings, etc., resource-specific annotation classes can be defined by owl:Restriction so that modelling effort and verbosity are reduced. These should follow the same conventions.
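A minimal, non-normative sketch of such an annotation is given below: a lexical entry is the body, and a text span selected by character positions is the target. The ex: IRIs, the document IRI and the positions are hypothetical; oa:SpecificResource and oa:TextPositionSelector are one of several targeting mechanisms offered by the Web Annotation Vocabulary.

```turtle
@prefix oa:      <http://www.w3.org/ns/oa#> .
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:      <http://example.org/lexicon#> .   # hypothetical namespace

ex:fox-n a ontolex:LexicalEntry .

# The lexical entry is the body; the annotated text span is the target.
ex:anno1 a oa:Annotation ;
    oa:hasBody ex:fox-n ;
    oa:hasTarget [
        a oa:SpecificResource ;
        oa:hasSource <http://example.org/doc.txt> ;     # hypothetical document
        oa:hasSelector [
            a oa:TextPositionSelector ;
            oa:start "123"^^xsd:nonNegativeInteger ;    # placeholder offsets
            oa:end   "456"^^xsd:nonNegativeInteger
        ]
    ] .
```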
The NLP Interchange Format (NIF) is a standard for the representation of text annotations. It is based on RDF and allows for the representation of text, its structure, and annotations. NIF is particularly useful for the representation of text annotations in the context of the Semantic Web. The NIF standard is defined in the NIF 2.1 specification.
NIF strings can be used as a locus for an attestation as follows:
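A minimal sketch, assuming the NIF core namespace and an offset-based NIF URI scheme; the ex: IRIs and the document IRI are hypothetical, while the quotation and the positions follow the description of this example.

```turtle
@prefix frac:    <http://www.w3.org/ns/lemon/frac#> .
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix nif:     <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:      <http://example.org/lexicon#> .   # hypothetical namespace

ex:fox-n a ontolex:LexicalEntry ;
    frac:attestation [
        a frac:Attestation ;
        rdf:value "The quick brown fox jumps over the lazy dog." ;
        frac:observedIn <http://example.org/doc.txt> ;               # hypothetical document
        frac:locus <http://example.org/doc.txt#offset_123_456>      # NIF string used as locus
    ] .

# The locus itself, described as a NIF string with explicit offsets.
<http://example.org/doc.txt#offset_123_456> a nif:String ;
    nif:beginIndex "123"^^xsd:nonNegativeInteger ;
    nif:endIndex   "456"^^xsd:nonNegativeInteger .
```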
In this example, the string “The quick brown fox jumps over the lazy dog.” is annotated as an attestation at character positions 123 to 456.
Alternatively, the loci of attestations may be given as RFC5147 URIs or as Text Fragments. The following example illustrates the use of RFC5147 URIs:
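A minimal sketch using the RFC5147 `char=` fragment scheme for a plain-text document; the ex: IRIs and the document IRI are hypothetical, while the quotation and the positions follow the description of this example.

```turtle
@prefix frac:    <http://www.w3.org/ns/lemon/frac#> .
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex:      <http://example.org/lexicon#> .   # hypothetical namespace

ex:fox-n a ontolex:LexicalEntry ;
    frac:attestation [
        a frac:Attestation ;
        rdf:value "The quick brown fox jumps over the lazy dog." ;
        frac:observedIn <http://example.org/doc.txt> ;           # hypothetical document
        frac:locus <http://example.org/doc.txt#char=123,456>     # RFC5147 character range
    ] .
```

No further description of the locus is needed here, since the fragment identifier alone carries the character range.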
In this example, the string “The quick brown fox jumps over the lazy dog.” is annotated as an attestation at character positions 123 to 456.