Lexon Engineering
(Note: This work is part of the PRIME project deliverables written by Tang, Y. and Spyns, P. between 2004 and 2006.)
We group under this heading those activities that transform the domain conceptualisation which, in the form of verbalised facts in natural language, is still independent of any ontology language or representation formalism. In the following steps, the informal conceptualisation is progressively transformed into formal statements that fit the VUB DOGMA ontology engineering framework. Nevertheless, in our opinion these steps are also useful when implementing an ontology in RDF or OWL. RDF or OWL Lite statements could be a direct result of the lexon engineering stage, as the format of the lexons is very close to that of RDF or OWL Lite statements. As OWL DL statements contain semantic constraints, the application specification step has to be finished before OWL DL statements can be produced.
This stage consists of the formalisation of the verbalised facts followed by quality control checks. The terms and roles of the remaining lexons are linked to unambiguous definitions.
1.1.1.1.1 Create lexons
This activity uses the verbalised facts resulting from the previous step as input. The aim is to extract lexons. The results are represented as binary fact types or lexons in the form <γ, ti, ri-j, rj-i, tj> where the terms ti and tj refer to concepts and the roles ri-j and rj-i refer to the relationships by which these are related. Currently, the context γ refers to the particular section or input document from which the lexon has been extracted[1]. If the results of the previous activities are binary in nature (as stressed in previous sections), this exercise is significantly simplified.
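The five-part form described above can be sketched as a small data structure. This is only an illustration: the class name `Lexon`, its field names and the `inverse` helper are assumptions made for this sketch and are not part of the DOGMA software. The sample values are those of lexon 3 in Table 2‑13.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lexon:
    """A binary fact type <context, term1, role, co_role, term2>."""

    context: str   # γ: section or input document the lexon was extracted from
    term1: str     # ti
    role: str      # ri-j: how term1 relates to term2
    co_role: str   # rj-i: how term2 relates to term1
    term2: str     # tj

    def inverse(self) -> "Lexon":
        """Return the same fact read from term2 towards term1."""
        return Lexon(self.context, self.term2, self.co_role, self.role, self.term1)


# Lexon 3 of Table 2-13, copied as it appears in the table.
lexon = Lexon("PersonalDataProctect", "TechnicalMeasure", "Protect", "BeProtected", "PersonalData")
print(lexon)
print(lexon.inverse())
```

Because role and co-role are stored together, both reading directions of the fact stay available without keeping two separate records.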
Table 2‑13 Privacy Directive Lexon Table[2]
Appropriate technical and organisational measures must be implemented to protect personal data against … Article 17 of the Directive 95/46/EC

After Segmentation:
1. Appropriate technical measures must be implemented.
2. Appropriate organisational measures must be implemented.
3. Appropriate technical measures are to protect personal data.
4. Appropriate organisational measures are to protect personal data.
…

After highlighting:
1. Appropriate technical measures must be implemented.
2. Appropriate organisational measures must be implemented.
3. Appropriate technical measures are to protect personal data.
4. Appropriate organisational measures are to protect personal data.
…

Create Lexons:

| ID | γ | ti | ri-j | rj-i | tj |
| --- | --- | --- | --- | --- | --- |
| 1 | PersonalDataProctect | TechnicalMeasure | BeImplemented | Implement | (Person/Machine/etc.) |
| 2 | PersonalDataProctect | OrganisationalMeasure | BeImplemented | Implement | (Person/Machine/etc.) |
| 3 | PersonalDataProctect | TechnicalMeasure | Protect | BeProtected | PersonalData |
| 4 | PersonalDataProctect | OrganisationalMeasure | Protect | BeProtected | PersonalData |
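The introduction to this section remarks that the lexon format is very close to RDF statements. A minimal sketch of that closeness, using lexons 3 and 4 of Table 2‑13, splits each lexon into two subject-predicate-object triples, one per reading direction. The function name `to_triples` is hypothetical, the output is plain tuples rather than a real RDF serialisation, and dropping the context γ is an assumption of this sketch.

```python
from typing import List, Tuple

Lexon = Tuple[str, str, str, str, str]  # (context, term1, role, co_role, term2)
Triple = Tuple[str, str, str]           # (subject, predicate, object)

# Lexons 3 and 4 of Table 2-13, copied as they appear in the table.
LEXONS: List[Lexon] = [
    ("PersonalDataProctect", "TechnicalMeasure", "Protect", "BeProtected", "PersonalData"),
    ("PersonalDataProctect", "OrganisationalMeasure", "Protect", "BeProtected", "PersonalData"),
]


def to_triples(lexon: Lexon) -> List[Triple]:
    """Split one lexon into its two directed readings.

    Assumption of this sketch: the context is simply dropped, because a plain
    subject-predicate-object triple has no slot for it.
    """
    _context, term1, role, co_role, term2 = lexon
    return [(term1, role, term2), (term2, co_role, term1)]


for lexon in LEXONS:
    for subject, predicate, obj in to_triples(lexon):
        print(subject, predicate, obj)
```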
1.1.1.1.2 Refine lexons
Newly created lexons should undergo a quality check. We say that a lexon is a ‘good’ one when
- it is highly reusable
- it is as simple as possible
- it represents the correct information
- it cannot be broken down any further
Note that the creation of elementary sentences should almost automatically lead to good lexons. Nevertheless, it can happen that some elementary sentences have to be represented by more than one lexon, and vice versa. This mostly concerns knowledge that is implied (but not explicitly mentioned) by an elementary sentence.
Table 2‑14 shows an example based on the material from Table 2‑13.
Table 2‑14 Lexon refinement example
…
S4: A data controller collecting data about a data subject. Many citizens desire not to disclose their complete personal health information in an uncontrolled way. Accurate personal health data are crucial for high quality and personalised health care services but can also be misused to deny people services.
Segmentation of S4:
S4.1: A data controller collecting data.
…

| Lexon of S4.1 | γ | ti | ri-j | rj-i | tj |
| --- | --- | --- | --- | --- | --- |
| Original | Setting4.1 NSID:1 | DataController | Collect | beCollected | Data |
| After simplification | Setting4.1 NSID:1 | Controller | Collect | beCollected | Data |
| | Setting4.1 NSID:1 | Controller | isAbout | appliedTo | Data |
A term or role (or sometimes an expression) used in a specific context and language in principle points to an unambiguous meaning. It can happen that equal or synonymous lexons or triples have been produced in the course of the domain conceptualisation and lexon engineering steps.
As it is senseless to keep such lexons (also possibly across language borders), the synonymous or equal cases are deleted in this activity. One refined lexon is voted to represent all the literally equal lexons (all composing words are the same) and synonymous lexons (the composing words are the same or synonymous). Two lexons or triples are equal when the respective terms and roles point to the same concepts (as defined in the previous step). This also holds when the sequence of terms and roles is inverted. In the case of an inverted sequence, the role names can be antonyms.
Let’s look at three examples:
1. <bike, follow, be followed by, car> & <bicycle, go after, be followed by, automobile>
This example illustrates that two lexons are synonymous when their composing parts have the same sense. Here, bike is bicycle, follow is to go after, and a car is an automobile.
2. <dog, eat, be eaten, meat> & <meat, be eaten, eat, dog>
This example illustrates that two lexons are equal when their terms and roles are equal (possibly in inverse sequence).
3. <bike, follow, be followed, car> & <automobile, precede, is preceded by, bicycle>
This example combines the two previous examples. The terms are synonyms and appear in inverted order, while the role names are antonyms.
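The three cases can be checked mechanically once synonym and antonym information is available. The following is a minimal sketch under stated assumptions: the lookup tables `SYNONYMS` and `ANTONYMS` are hand-made for these three examples (in practice a dictionary, WordNet or a concept server would supply them), the context is left out as in the examples, and all function names are hypothetical.

```python
from typing import Tuple

Lexon4 = Tuple[str, str, str, str]  # (term1, role, co_role, term2), context omitted

# Hypothetical lookup tables covering only the three examples above.
SYNONYMS = {"bicycle": "bike", "automobile": "car", "go after": "follow"}
ANTONYMS = {
    "follow": "precede",
    "precede": "follow",
    "be followed": "is preceded by",
    "is preceded by": "be followed",
}


def canon(word: str) -> str:
    """Map a word to the representative of its synonym set."""
    word = word.lower()
    return SYNONYMS.get(word, word)


def same(a: str, b: str) -> bool:
    """True when both words are literally equal or synonymous."""
    return canon(a) == canon(b)


def opposite(a: str, b: str) -> bool:
    """True when b is listed as the antonym of a."""
    return ANTONYMS.get(canon(a)) == canon(b)


def equivalent(x: Lexon4, y: Lexon4) -> bool:
    t1, r, cr, t2 = x
    u1, s, cs, u2 = y
    # Case 1: same reading direction, all parts equal or synonymous.
    if same(t1, u1) and same(t2, u2) and same(r, s) and same(cr, cs):
        return True
    # The remaining cases require the terms to appear in inverted order.
    if not (same(t1, u2) and same(t2, u1)):
        return False
    # Case 2: role and co-role are swapped. Case 3: the role names are antonyms.
    return (same(r, cs) and same(cr, s)) or (opposite(r, s) and opposite(cr, cs))


print(equivalent(("bike", "follow", "be followed by", "car"),
                 ("bicycle", "go after", "be followed by", "automobile")))
print(equivalent(("dog", "eat", "be eaten", "meat"),
                 ("meat", "be eaten", "eat", "dog")))
print(equivalent(("bike", "follow", "be followed", "car"),
                 ("automobile", "precede", "is preceded by", "bicycle")))
```

The check is only as good as the lookup tables: a pair is reported as equivalent only if the needed synonym or antonym entries have been agreed upon beforehand.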
Simple sorting tools (e.g., the ‘sort’ and ‘uniq’ Linux shell commands or spreadsheet functionalities) can already provide a basic level of automation. More sophistication can be achieved by implementing calls to on-line dictionaries, using the WordNet API, or using, e.g., the DOGMA concept server[3] functionalities to check for synonymy/antonymy.
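The basic level of automation mentioned here can be sketched in a few lines: literally equal lexons are removed, and a lexon written in inverse sequence is treated as a copy of the original. The function names and the context label "Example" are assumptions made for this illustration; no synonym lookup is attempted.

```python
from typing import Iterable, List, Tuple

Lexon = Tuple[str, str, str, str, str]  # (context, term1, role, co_role, term2)


def orientation_key(lexon: Lexon) -> Lexon:
    """Return one fixed representative for a lexon and its inverse reading."""
    context, term1, role, co_role, term2 = lexon
    inverse = (context, term2, co_role, role, term1)
    return min(lexon, inverse)


def unique_lexons(lexons: Iterable[Lexon]) -> List[Lexon]:
    """Keep the first occurrence of each literally equal lexon, in input order."""
    seen = set()
    kept = []
    for lexon in lexons:
        key = orientation_key(lexon)
        if key not in seen:
            seen.add(key)
            kept.append(lexon)
    return kept


candidates = [
    ("Example", "dog", "eat", "be eaten", "meat"),
    ("Example", "meat", "be eaten", "eat", "dog"),   # same fact, inverse sequence
    ("Example", "dog", "eat", "be eaten", "meat"),   # literal duplicate
    ("Example", "bike", "follow", "be followed by", "car"),
]
for lexon in unique_lexons(candidates):
    print(lexon)
```

Keeping the first occurrence, rather than sorting the output, preserves the order in which the lexons were extracted, which may help traceability.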
As the core of this and the previous activities concerns the definition of meaning and linking the lexical representations to concepts, it is of primary importance that a sufficient number of stakeholders are gathered so that the refined lexons are based on widely accepted agreements.
In the case of the DOGMA engineering framework, the software allows the lexons to be recreated[4]. Therefore, it is no longer necessary to store the lexons separately. However, for reasons of traceability and later quality checks, it might be worthwhile to store the original lexons anyhow.
Here is an example from the E-Health NS - Table 2‑7:
From the NS note, we can extract one lexon: <NSID:1_note, patient, choose, isChosen, GP>. From the Narratological Schema episode E1.1.2, another lexon is extracted: <Narratological SchemaID:1 E1.1.2, user, choose, isChosen, GP>. We can recognise these two lexons as equal because the user here means the patient[5].
1.1.1.1.3 Ground lexons
Lexon grounding is a conceptual exercise that links the terms and roles that constitute a lexon to existing dictionaries, lexica or standards. If no adequate definitions exist, new definitions should be drafted by hand by the domain experts and other stakeholders, following terminological principles. In this way the vocabulary of the ontology (in the form of terms and roles) is provided with semantics. As a result, synonyms are easily detectable (i.e. they point to the same definition). As a check or additional source, the list of synonyms created with the abstraction mechanism (see section 2.2.5.2.2), if available, can be used. New labels have to be chosen for a set of synonyms or expressions having the same meaning. These labels are preferably (slightly) different from natural language words to indicate that they operate on the conceptual level rather than the language level. In the VUB DOGMA ontology engineering framework, the definitions, the labels and the synonyms are entered in the concept definition server. Other implementations offering a similar functionality can be envisaged.
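The remark that synonyms become detectable because they point to the same definition can be illustrated with a small lookup. This sketch does not use the concept definition server: the mapping `GROUNDING`, the `C_` label style and the function name are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical grounding table: natural-language word -> conceptual label.
# The labels deliberately differ from the words themselves.
GROUNDING: Dict[str, str] = {
    "bike": "C_Bicycle",
    "bicycle": "C_Bicycle",
    "car": "C_Car",
    "automobile": "C_Car",
    "meat": "C_Meat",
}


def synonym_sets(grounding: Dict[str, str]) -> Dict[str, List[str]]:
    """Group words by label and keep the labels shared by several words."""
    groups = defaultdict(list)
    for word, label in grounding.items():
        groups[label].append(word)
    return {label: sorted(words) for label, words in groups.items() if len(words) > 1}


for label, words in synonym_sets(GROUNDING).items():
    print(label, words)
```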
The same example is continued:
Table 2‑15 Lexon Dictionary
| ID | Label | Explanation |
| --- | --- | --- |
| 1 | Technical | description of software and hardware and the standards used |
| 2 | OrganisationalMeasure | Any manoeuvre that fits the organisational strategy made as part of progress toward a goal[6] |
| 3 | PersonalData | That data relating to a living individual which if in the possession of a data controller could by itself or with other data already in the possession of the data controller easily identify the living individual. (from www.nhstayside.scot.nhs.uk/FoISA/Glossary.htm) |
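Footnote 6 requires every label of the lexon table to be found in the lexon dictionary. A minimal sketch of such a coverage check follows, using the three labels of the dictionary excerpt above and lexons 3 and 4 of Table 2‑13. The function name is hypothetical, and treating roles as labels while leaving out the context γ is an assumption of this sketch.

```python
from typing import Iterable, List, Set, Tuple

Lexon = Tuple[str, str, str, str, str]  # (context, term1, role, co_role, term2)

# Labels listed in the dictionary excerpt (Table 2-15).
DICTIONARY_LABELS: Set[str] = {"Technical", "OrganisationalMeasure", "PersonalData"}

# Lexons 3 and 4 of Table 2-13, copied as they appear in the table.
LEXONS: List[Lexon] = [
    ("PersonalDataProctect", "TechnicalMeasure", "Protect", "BeProtected", "PersonalData"),
    ("PersonalDataProctect", "OrganisationalMeasure", "Protect", "BeProtected", "PersonalData"),
]


def ungrounded_labels(lexons: Iterable[Lexon], labels: Set[str]) -> List[str]:
    """Return, sorted, every term or role that has no dictionary entry."""
    used = set()
    for _context, term1, role, co_role, term2 in lexons:
        used.update((term1, role, co_role, term2))
    return sorted(used - labels)


print(ungrounded_labels(LEXONS, DICTIONARY_LABELS))
```

Since the excerpt lists only three entries, the check reports the terms and roles that would still need a definition before grounding is complete.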
The following table shows several lexons extracted from Table 2‑11.
Table 2‑16 Settings of E-Health Narratological Schema
| Setting | |
| --- | --- |
| S1 | Background on Ehealth |
| S2 | The importance of privacy protection for Health data versus the importance of accurate data for health professionals particularly in emergencies. |
| S3 | An ontology which allows personal devices to communicate health data accurately and also securely and to inform the user accurately about data processing. |
| S4 | A data controller collecting data about a data subject. Many citizens desire not to disclose their complete personal health information in an uncontrolled way. Accurate personal health data are crucial for high quality and personalised health care services but can also be misused to deny people services. |
| Segmentation of S4 | |
| S4.1 | A data controller collecting data. |
| S4.2 | Many citizens desire not to disclose their complete personal health information in an uncontrolled way. |
| S4.3 | Accurate personal health data are crucial for high quality health care services. |
| S4.4 | Accurate personal health data are crucial for personalised health care services. |
| S4.5 | Accurate personal health data can also be misused to deny people services. |
Table 2‑17 Lexon Table of E-Health Settings
| ID | Context identifier | Term1 | Role | Co-Role | Term2 |
| --- | --- | --- | --- | --- | --- |
| 1 | Settings | NS | Contain | Is the component | background |
| 2 | Settings | Background | Is about | Applied to | E-Health |
| 3 | Settings | Emergency | Need | Is needed | Data |
| 4 | Settings | Privacy protection | protect | Is protected by | Data |
| 5 | Settings | Data | Is about | Applied to | Health |
| 6 | Settings | E-health ontology | Allow | Is allowed | Communication |
| 7 | Settings | Device | Is allowed | | Communication |
| 8 | Settings | User | Get | Is granted to | Communication |
| 9 | Settings | E-health ontology | Inform | Is informed by | Processing |
| 10 | Settings | Processing | Is about | Is applied to | Data |
| 11 | Settings | E-health ontology | Inform | Is informed by | User |
| 12 | Settings | Data controller | Collect | Is collected by | Data |
| 13 | Settings | Citizen | Desire | Is desired | Information disclosure |
| 14 | Settings | Data | Is crucial for | Need crucially | Service |
| 15 | Settings | Service | Is about | Is applied to | Health care |
| 16 | Settings | Data | Is crucial for | Need crucially | Service |
| 17 | Settings | Service | Is about | Is applied to | Health care |
| 18 | Settings | Data | Is misused | | |
| 19 | Settings | People | Ask for | Is asked by | Service |
| 20 | Settings | Service | Is denied to | Is denied | People |
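One simple check that could form part of the quality control of such a table is to flag rows in which a component of the five-part lexon form is left empty. The sketch below uses four rows copied from Table 2‑17, with empty strings standing for empty cells. The function name is hypothetical, and whether an empty cell is an error or a deliberate omission remains for the stakeholders to decide.

```python
from typing import List, Tuple

# (ID, context, term1, role, co_role, term2)
Row = Tuple[int, str, str, str, str, str]

# A few rows copied from Table 2-17; empty strings stand for empty cells.
ROWS: List[Row] = [
    (6, "Settings", "E-health ontology", "Allow", "Is allowed", "Communication"),
    (7, "Settings", "Device", "Is allowed", "", "Communication"),
    (18, "Settings", "Data", "Is misused", "", ""),
    (19, "Settings", "People", "Ask for", "Is asked by", "Service"),
]


def incomplete_ids(rows: List[Row]) -> List[int]:
    """Return the IDs of rows in which at least one lexon component is empty."""
    return [row[0] for row in rows if any(not cell.strip() for cell in row[1:])]


print(incomplete_ids(ROWS))
```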
[1] γ should be specific, yet general enough, to represent all of the extracted lexons concerned. If those lexons are captured from different but semantically related documents, γ is chosen general enough to represent them all. In that sense, the context represents an actual situation of usage that is able to disambiguate the word senses. Research on this point is still on-going.
[2] This is just an example to show how γ is chosen for lexons.
[3] It is currently not publicly available.
[4] Synonymous and antonymous lexons can be reconstructed through the concept definition server.
[5] Abstraction happens during the whole domain conceptualisation activity.
[6] Every label that appears in the lexon table should be found in the lexon dictionary, which is the reason why we include all those ‘privacy irrelevant’ labels here.
