Lexis Natural Language Processing Components

AcronymaX(tm), automated extractor/tagger of acronym definitions

2006


Table of Contents

1. Introduction
1.1. System Requirements
2. AcronymaX Scanner Configuration [C++]
3. Scanner Callback Messages
4. Simplified Acronym Scanner Configuration [Other Languages]
5. Scanner Simplified Callback Messages
6. AcronymaX Files
7. Functions Quick Reference
8. Harvesting Techniques
9. Issues and Roadmap

Chapter 1. Introduction

Table of Contents

1.1. System Requirements

AcronymaX™ is a reusable software component for high-performance, high-quality extraction and/or tagging of acronym/definition pairs in a natural language text. AcronymaX™ will detect acronyms and corresponding definitions (if found nearby in the text) in any alphabetic language. The library includes specific provisions for English, German, French, Italian, Spanish, Portuguese, and Russian; however, AcronymaX™ will extract and tag acronyms in any other languages, provided that the structure of acronym/definition pair remains the same.

AcronymaX™ uses a variation of the search algorithm described in "Recognizing acronyms and their definitions" by Kazem Taghva and Jeff Gilbreth. The algorithm has been enhanced to expand the types of acronym/definition pairs detected and will be further modified in new versions of AcronymaX™. The main idea of the search algorithm is that if the acronym's definition is located near the acronym in the running text, as in the next example, it can be extracted by matching the letters of the acronym against the first letters of the words in the context window:

Example 1.1. A defined acronym in the running text

"Use this form to submit a complaint to the Federal Trade Commission (FTC) Bureau of Consumer Protection about a particular company or organization."
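The first-letter matching idea can be illustrated with a minimal C++ sketch. It is not the AcronymaX™ implementation, which uses an enhanced variation of the Taghva and Gilbreth algorithm; the names splitWords, sameLetter and minSkips are hypothetical, and the sketch assumes plain ASCII text split on whitespace.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Splits a context window into whitespace-separated words.
std::vector<std::string> splitWords(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string word;
    while (in >> word) words.push_back(word);
    return words;
}

// Case-insensitive comparison of two ASCII letters.
bool sameLetter(char a, char b) {
    return std::toupper(static_cast<unsigned char>(a)) ==
           std::toupper(static_cast<unsigned char>(b));
}

// Matches the letters of `acronym`, in order, against the first letters of
// the words in `window`. Returns the smallest number of skipped words over
// all alignments, or -1 if the acronym cannot be matched at all.
int minSkips(const std::string& acronym, const std::string& window) {
    const std::vector<std::string> words = splitWords(window);
    if (acronym.empty()) return -1;
    int best = -1;
    for (std::size_t start = 0; start < words.size(); ++start) {
        // An alignment must begin on a word that matches the first letter.
        if (!sameLetter(acronym[0], words[start][0])) continue;
        std::size_t matched = 1;
        std::size_t last = start;
        // Greedily take the earliest word that matches each next letter;
        // for a fixed start this gives the shortest possible span.
        for (std::size_t w = start + 1;
             w < words.size() && matched < acronym.size(); ++w) {
            if (sameLetter(acronym[matched], words[w][0])) {
                ++matched;
                last = w;
            }
        }
        if (matched == acronym.size()) {
            const int skips =
                static_cast<int>(last - start + 1 - acronym.size());
            if (best < 0 || skips < best) best = skips;
        }
    }
    return best;
}
```

Under these assumptions, minSkips("FTC", "Federal Trade Commission") evaluates to 0, and minSkips("NYPD", "New York City Police Department") evaluates to 1 because the word "City" is skipped.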

The results of such matching can be used to a number of ends:

  • to tag acronym/definition pairs in linguistic corpora;

  • to build databases of acronym definitions;

  • to improve text understanding by the reader - by providing clues to acronyms in cases when the definition isn't present nearby.

Although the version of this algorithm that AcronymaX™ implements will confidently detect acronym/definition pairs that commonly occur in texts, there are also several types of acronyms that AcronymaX™ won't detect in this release. These are fuzzy and creative "acronyms" (the quotes are deliberate, because these aren't acronyms by definition), such as "SNETCO - Southern New England Telephone Company" and "CONMEBOL - Confederacion Sudamericana de Futbol". Another example is industry-specific denotations that also aren't acronyms by definition, but bear a certain resemblance to acronyms. AcronymaX™ will address such cases in the near future by introducing a fuzzy harvester and a rule-based harvester.

Today, AcronymaX™ can already help battle acronym soups and other cookery disasters. The following table illustrates the types of acronyms AcronymaX™ detects and provides clues to functional variations in the sample data.

Table 1.1. Acronym/Definition Pairs Detected by AcronymaX

Example | Comment
Application Programming Interface (API); (API) Application Programming Interface | The order of the acronym and its definition does not matter.
NYPD - New York City Police Department | Limited skips are allowed in definitions.
3DES = Triple Data Encryption Standard; 3GPP = 3rd Generation Partnership Project; 4H = Head, Heart, Hands, Health; G8 = Group of Eight | Digits in acronyms are processed in many different ways and in seven different languages.
A&A = Astronomy and Astrophysics | Ampersands are also processed; the same set of languages applies.
MOR = middle-of-the-road | Compound words are okay in definitions and the case of words does not affect matching.
MOR stands for "middle-of-the-road" | Explanatory constructs are ignored to an extent.
MSB - Most Significant Bit;Most Significant Byte | Multiple matches are fine.
D.A.R.P.A. - Defense Advanced Research Projects Agency; A/S/L - Age, Sex, Language | Embedded punctuation is okay.
SFTPOVLASBBAPSPAVPAAAOWSFDITNOOACGCUTCOOUTIITCOSA stands for "Society For The Promotion Of Very Long Acronyms Supported By Bureaucrats Administrators Public Servants Politicians And Visual Poets And Advertisers All Of Whom Seek Full Disclosure In The Naming Of Organizations And Community Groups Coming Under Their Control Or Operating Under Their Influence In The Cause Of Smooth Administration" | Handles very long acronyms by cheating and performing a simpler search.

1.1. System Requirements

AcronymaX™ is written in C++, but is packaged as a Windows DLL with __stdcall calling convention, so you can use it from any modern language.

The system requirements for developing with AcronymaX™ are as follows.

Table 1.2. AcronymaX™ System Requirements

Component | Requirements
Processor | 2.0+ GHz Pentium 4-grade CPU.
Operating System | Windows XP or Windows 2000 recommended; should also work on earlier and later systems.
RAM | 512 megabytes recommended; run-time consumption depends on the size of the source texts or, when reading from a file, of the read buffers you use.
Disk Space | 15 megabytes.

Chapter 2. AcronymaX™ Scanner Configuration [C++]

When you make a call to AcronymaX™ scanning routines, you must provide a configuration structure with run-time values for the scanner. You don't necessarily need to initialize the whole set of parameters manually; AcronymaX™ lets you generate the default configuration and adjust only what needs to be adjusted. The following table describes the fields of the AcronymaxScannerConfiguration structure. Note that when interfacing with AcronymaX™ from languages other than C/C++ you should use SimplifiedAcronymaxScannerConfiguration instead; see Chapter 4.

Table 2.1. AcronymaxScannerConfiguration Fields

Field | Description | Default Value
logfileName [std::string] | Name of the log file to write messages to when running the debug version of AcronymaX™. Leave the field empty if you don't need logging. | Empty.
keepDelimitersWhenTokenizing [bool] | Instructs AcronymaX™ to store delimiters as tokens when tokenizing text. It is recommended to leave this setting at the default (off) to speed up text processing and analysis. | false
delimiters [ICU UnicodeString] | Lists characters that AcronymaX™ interprets as token delimiters. | \t\r\n=:~!?@#$%^&*()_+<>[]{}/,.;|\"'
tokenAllowedDelimiters [ICU UnicodeString] | Lists characters that are allowed to appear in acronym candidates along with letters. | &./\0123456789
normalizedAcronymAllowedDelimiters [ICU UnicodeString] | Lists characters that count as letters when calculating the normalized acronym length. Leave this setting at the default. | &
stopWords [ICU UnicodeString] | The initial set of (English) stop words. Leave this setting at the default. | -
minimumLettersInAcronym [unsigned long] | Sets the minimum length for tokens considered as acronym candidates. If a token is shorter than this value, AcronymaX™ won't analyse its surroundings even if the token satisfies all other acronym criteria. When calculating the length of an acronym candidate token, AcronymaX™ only considers letters and ignores special characters. | 2
maximumLettersInAcronym [unsigned long] | Sets the maximum length for tokens considered as acronym candidates. If a token is longer than this value, AcronymaX™ won't analyse its surroundings even if the token satisfies all other acronym criteria. When calculating the length of an acronym candidate token, AcronymaX™ only considers letters and ignores special characters. | 128
maximumDigitsAllowed [unsigned long] | Each digit in an acronym can be deciphered in several ways, and the resulting number of variations may seriously degrade performance; hence the limit on the total number of digits that an acronym candidate may contain. Only increase this value if you are sure that your corpora or source texts will contain acronyms with more than 3 digits. | 3
directMatchingLengthThreshold [unsigned long] | The longest common subsequence algorithm that AcronymaX™ uses to search for acronym definitions has exponential time complexity, so processing time grows quickly with the length of the acronym candidate. The algorithm can process acronym candidates of up to 20 letters without significantly degrading the performance of the application. Beyond this threshold, performance degrades rapidly; funny acronym candidates such as SFTPOVLASBBAPSPAVPAAAOWSFDITNOOACGCUTCOOUTIITCOSA may take weeks or even months to process. AcronymaX™ has a small built-in cheat that allows it to quickly decipher long acronyms: it assumes that if the length of an acronym candidate exceeds directMatchingLengthThreshold (20 by default, but you are free to increase this value if you have sufficient computing power), the appropriate definition, if present at all, will contain one word for every letter of the candidate. This means that, for longer acronyms, skips in definitions are not allowed. This approach allows AcronymaX™ to identify the nearby string "Society For The Promotion Of Very Long Acronyms Supported By Bureaucrats Administrators Public Servants Politicians And Visual Poets And Advertisers All Of Whom Seek Full Disclosure In The Naming Of Organizations And Community Groups Coming Under Their Control Or Operating Under Their Influence In The Cause Of Smooth Administration" as the definition for the long acronym above in only a fraction of a second. | 20
lowercaseAcronymsAllowed [bool] | Instructs AcronymaX™ to treat all tokens that only contain letters of a consistent case (all uppercase or all lowercase) and other characters allowed in acronyms as acronym candidates. Leave this setting at the default, unless you are absolutely positive that you know what you are doing. | false
suspectAcronymLength [unsigned long] | For acronym candidates with lengths exceeding this value, AcronymaX™ will check the token for the presence of excessively long sequences of repeated letters. The length of excessive sequences is specified in the excessiveSequenceLength field. | 15
excessiveSequenceLength [unsigned long] | Sets the length of excessive sequences for suspect acronym candidates. Acronym candidates containing excessive sequences will be filtered out. | 4
filterXMLTags [bool] | When extracting acronyms from a stream of HTML or XML text it is desirable to filter out XML-style tags to avoid false matches and improve the chances of actually finding the definitions contained in the stream. For the task of extracting acronyms, the tags can be safely thought of as garbage. filterXMLTags tells AcronymaX™ whether to filter out XML-style tags. Note that AcronymaX™ won't process XML-style tags longer than 1024 characters. | true
filteringWithWhitePaint [bool] | When working on a stream of HTML or XML text it may be desirable to retain the relative positions of analysed letters, for instance, when you need to tag acronyms in corpora or visualize acronyms and definitions in a user document. Simply filtering out all XML-style tags compresses the source text and changes the relative positions of most characters in the text. filteringWithWhitePaint instructs AcronymaX™ to replace XML-style tags with whitespace of equal length instead of removing them. | true
convertHTMLEntities [bool] | Instructs AcronymaX™ to convert special HTML entities, such as &spades;, into the corresponding Unicode characters. | true
contractCharacterSequences [bool] | Instructs AcronymaX™ to compress sequences of repeated special characters. For instance, by default, a sequence such as "[[[[" will be compressed to just "[". If filteringWithWhitePaint is on, the removed characters are replaced with spaces. The characters to compress are specified in contractedCharacters. | true
contractedCharacters [ICU UnicodeString] | Lists the characters to compress. See contractCharacterSequences. | []|/{}().,!+-'\
autodetectSuitableDefinitions [bool] | Instructs AcronymaX™ to perform filtering if multiple candidate definitions are found in the vicinity of an acronym. It is highly recommended to leave this setting at the default to avoid retrieving low-confidence definitions. | true
suitableDefinitionConfidenceThreshold [double] | When filtering out definition candidates (see autodetectSuitableDefinitions) AcronymaX™ will reject candidates with a confidence value lower than suitableDefinitionConfidenceThreshold. It is highly recommended to leave this setting at the default. | 0.85
performExtractionCallbacks [bool] | Instructs AcronymaX™ to return information about the acronym/definition pairs found to your application. Unless you are experimenting, set this value to true. | false
performProgressCallbacks [bool] | Instructs AcronymaX™ to return information on processing progress to your application. This is useful when processing large amounts of text. Progress callbacks report progress in percent for the current file and/or current block. | false
runIntegralHarvester [bool] | In the current version, AcronymaX™ only implements one harvester of acronym/definition pairs, the integral harvester. Unless you are trying to measure the performance of text preparation routines, do not modify the default setting. | true
runFuzzyHarvester [bool] | The fuzzy harvester will become available in the next version of AcronymaX™. For now, leave the field as is. | false
maximumSpacingForInitialisms [unsigned long] | Fuzzy harvester parameter; leave as is. | -
minimumLengthOfInitialismSequence [unsigned long] | Fuzzy harvester parameter; leave as is. | -
maximumLengthOfInitialismSequence [unsigned long] | Fuzzy harvester parameter; leave as is. | -
garbageConfidenceFactor [double] | When filtering out definition candidates (see autodetectSuitableDefinitions) AcronymaX™ will compare the candidate with the highest confidence value to the runner-up and decide whether definitions beyond the top-rated one should also be returned to the calling application. If the confidence value of the top-rated definition is more than garbageConfidenceFactor times the confidence value of the runner-up, all definition candidates except the top-rated one are discarded. | 1.8
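The interplay of suitableDefinitionConfidenceThreshold and garbageConfidenceFactor can be sketched as follows. This is an illustration of the rules as the table describes them, not the library's code: the Candidate type and the filterCandidates function are hypothetical, and the order of the two steps (threshold first, then the runner-up comparison) is an assumption.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for a definition candidate; not an AcronymaX type.
struct Candidate {
    std::string definition;
    double confidence;
};

// Applies the two documented filters with their documented defaults.
std::vector<Candidate> filterCandidates(std::vector<Candidate> candidates,
                                        double threshold = 0.85,
                                        double garbageFactor = 1.8) {
    // Step 1 (assumed to run first): reject candidates whose confidence
    // is lower than the threshold.
    candidates.erase(
        std::remove_if(candidates.begin(), candidates.end(),
                       [threshold](const Candidate& c) {
                           return c.confidence < threshold;
                       }),
        candidates.end());

    // Step 2: order the survivors from most to least confident.
    std::sort(candidates.begin(), candidates.end(),
              [](const Candidate& a, const Candidate& b) {
                  return a.confidence > b.confidence;
              });

    // Step 3: if the top candidate is more than garbageFactor times as
    // confident as the runner-up, keep only the top candidate.
    if (candidates.size() > 1 &&
        candidates[0].confidence > garbageFactor * candidates[1].confidence) {
        candidates.erase(candidates.begin() + 1, candidates.end());
    }
    return candidates;
}
```

With the default values, candidates scored 2.0, 1.0 and 0.5 reduce to the single top candidate: 0.5 falls below 0.85, and 2.0 is more than 1.8 times 1.0. Candidates scored 1.0 and 0.9 are both kept.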

Chapter 3. Scanner Callback Messages

AcronymaX™ returns results to your application by executing callback functions you provide. The callback functions' only argument is the callback message that either contains information about an acronym/definition pair found in a source text or simply reports analysis progress. The following table describes the fields of ScannerMessage used to return results from function extractAcronyms.

Table 3.1. ScannerMessage Fields

Field | Description
messageType [MessageType] | Either MTExtractCallback or MTProgressCallback.
acronymVariant [ICU UnicodeString] | Contains the specific acronym variant that has been matched. Sometimes it may not be entirely obvious how AcronymaX™ has made a particular match. For instance, digits of the original (canonical) acronym may get converted into letters to allow a match. Seeing the particular variation of the acronym candidate makes such matches more transparent.
cleanCanonicalAcronym [ICU UnicodeString] | The original acronym string with special characters stripped away.
canonicalAcronymStartIndex [unsigned long] | Starting index of the acronym in the current block of text.
canonicalAcronymLength [unsigned long] | Length of the acronym in the current block of text.
definitionStartIndex [unsigned long] | Starting index of the definition in the current block of text.
definitionLength [unsigned long] | Length of the definition.
definitionStopwordCount [unsigned long] | The number of stop words in the definition.
definitionWordCount [unsigned long] | The total number of words in the definition. Note that the definition region may contain more words than the definition itself.
definitionSkipsCount [unsigned long] | The number of words in the definition region that do not participate in the definition.
definitionDistance [unsigned long] | The number of words between the definition and the acronym in the original text.
confidence [float] | A confidence value for this acronym/definition pair. Indicates how sure AcronymaX™ is about the correctness of the match. Values close to 1.0 and above indicate strong matches.
fileProgressPercentage [double] | Processing progress for the current file. Values should range from 0.0 to 100.0. If you are processing a standalone block of text (i.e. using extractAcronyms), this value is zero.
blockProgressPercentage [float] | Processing progress for the current block of text. Values should range from 0.0 to 100.0.
source [HarvesterType] | AcronymaX™ harvester that found the match. Currently, the only possible value is HTIntegralHarvester.
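A callback typically uses the start index and length fields to cut the acronym and its definition out of the analysed block. The sketch below shows that step in isolation. MatchSpan is a hypothetical stand-in that copies four field names from the table; the sketch also assumes zero-based indices and uses std::string where the real interface uses ICU UnicodeString.

```cpp
#include <string>

// Hypothetical stand-in carrying four of the ScannerMessage fields from
// Table 3.1; the real structure is declared in the AcronymaX headers.
struct MatchSpan {
    unsigned long canonicalAcronymStartIndex;
    unsigned long canonicalAcronymLength;
    unsigned long definitionStartIndex;
    unsigned long definitionLength;
};

// True if [start, start + length) lies entirely inside `block`.
bool spanInside(const std::string& block, unsigned long start,
                unsigned long length) {
    return start <= block.size() && length <= block.size() - start;
}

// Returns "ACRONYM = definition" for a reported match, or an empty string
// if either span falls outside the block (indices assumed zero-based).
std::string describeMatch(const std::string& block, const MatchSpan& m) {
    if (!spanInside(block, m.canonicalAcronymStartIndex,
                    m.canonicalAcronymLength) ||
        !spanInside(block, m.definitionStartIndex, m.definitionLength)) {
        return std::string();
    }
    return block.substr(m.canonicalAcronymStartIndex,
                        m.canonicalAcronymLength) +
           " = " +
           block.substr(m.definitionStartIndex, m.definitionLength);
}
```

The bounds check matters because the indices refer to the current block of text, so a span applied to the wrong block could otherwise read past its end.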

Chapter 4. Simplified Acronym Scanner Configuration [Other Languages]

When you make a call to AcronymaX™ scanning routines, you must provide a configuration structure with run-time values for the scanner. You don't necessarily need to initialize the whole set of parameters manually; AcronymaX™ lets you generate the default configuration and adjust only what needs to be adjusted. The following table describes the fields of the SimplifiedAcronymaxScannerConfiguration structure. Note that when interfacing with AcronymaX™ from C/C++ you can use AcronymaxScannerConfiguration instead; see Chapter 2.

Note that in the simplified version of scanner configuration you cannot change most of the string settings, such as delimiters.

Table 4.1. SimplifiedAcronymaxScannerConfiguration Fields

Field | Description | Default Value
keepDelimitersWhenTokenizing [bool] | Instructs AcronymaX™ to store delimiters as tokens when tokenizing text. It is recommended to leave this setting at the default (off) to speed up text processing and analysis. | false
minimumLettersInAcronym [unsigned long] | Sets the minimum length for tokens considered as acronym candidates. If a token is shorter than this value, AcronymaX™ won't analyse its surroundings even if the token satisfies all other acronym criteria. When calculating the length of an acronym candidate token, AcronymaX™ only considers letters and ignores special characters. | 2
maximumLettersInAcronym [unsigned long] | Sets the maximum length for tokens considered as acronym candidates. If a token is longer than this value, AcronymaX™ won't analyse its surroundings even if the token satisfies all other acronym criteria. When calculating the length of an acronym candidate token, AcronymaX™ only considers letters and ignores special characters. | 128
maximumDigitsAllowed [unsigned long] | Each digit in an acronym can be deciphered in several ways, and the resulting number of variations may seriously degrade performance; hence the limit on the total number of digits that an acronym candidate may contain. Only increase this value if you are sure that your corpora or source texts will contain acronyms with more than 3 digits. | 3
directMatchingLengthThreshold [unsigned long] | The longest common subsequence algorithm that AcronymaX™ uses to search for acronym definitions has exponential time complexity, so processing time grows quickly with the length of the acronym candidate. The algorithm can process acronym candidates of up to 20 letters without significantly degrading the performance of the application. Beyond this threshold, performance degrades rapidly; funny acronym candidates such as SFTPOVLASBBAPSPAVPAAAOWSFDITNOOACGCUTCOOUTIITCOSA may take weeks or even months to process. AcronymaX™ has a small built-in cheat that allows it to quickly decipher long acronyms: it assumes that if the length of an acronym candidate exceeds directMatchingLengthThreshold (20 by default, but you are free to increase this value if you have sufficient computing power), the appropriate definition, if present at all, will contain one word for every letter of the candidate. This means that, for longer acronyms, skips in definitions are not allowed. This approach allows AcronymaX™ to identify the nearby string "Society For The Promotion Of Very Long Acronyms Supported By Bureaucrats Administrators Public Servants Politicians And Visual Poets And Advertisers All Of Whom Seek Full Disclosure In The Naming Of Organizations And Community Groups Coming Under Their Control Or Operating Under Their Influence In The Cause Of Smooth Administration" as the definition for the long acronym above in only a fraction of a second. | 20
lowercaseAcronymsAllowed [bool] | Instructs AcronymaX™ to treat all tokens that only contain letters of a consistent case (all uppercase or all lowercase) and other characters allowed in acronyms as acronym candidates. Leave this setting at the default, unless you are absolutely positive that you know what you are doing. | false
suspectAcronymLength [unsigned long] | For acronym candidates with lengths exceeding this value, AcronymaX™ will check the token for the presence of excessively long sequences of repeated letters. The length of excessive sequences is specified in the excessiveSequenceLength field. | 15
excessiveSequenceLength [unsigned long] | Sets the length of excessive sequences for suspect acronym candidates. Acronym candidates containing excessive sequences will be filtered out. | 4
filterXMLTags [bool] | When extracting acronyms from a stream of HTML or XML text it is desirable to filter out XML-style tags to avoid false matches and improve the chances of actually finding the definitions contained in the stream. For the task of extracting acronyms, the tags can be safely thought of as garbage. filterXMLTags tells AcronymaX™ whether to filter out XML-style tags. Note that AcronymaX™ won't process XML-style tags longer than 1024 characters. | true
filteringWithWhitePaint [bool] | When working on a stream of HTML or XML text it may be desirable to retain the relative positions of analysed letters, for instance, when you need to tag acronyms in corpora or visualize acronyms and definitions in a user document. Simply filtering out all XML-style tags compresses the source text and changes the relative positions of most characters in the text. filteringWithWhitePaint instructs AcronymaX™ to replace XML-style tags with whitespace of equal length instead of removing them. | true
convertHTMLEntities [bool] | Instructs AcronymaX™ to convert special HTML entities, such as &spades;, into the corresponding Unicode characters. | true
contractCharacterSequences [bool] | Instructs AcronymaX™ to compress sequences of repeated special characters. For instance, by default, a sequence such as "[[[[" will be compressed to just "[". If filteringWithWhitePaint is on, the removed characters are replaced with spaces. The characters to compress are specified in contractedCharacters. | true
autodetectSuitableDefinitions [bool] | Instructs AcronymaX™ to perform filtering if multiple candidate definitions are found in the vicinity of an acronym. It is highly recommended to leave this setting at the default to avoid retrieving low-confidence definitions. | true
suitableDefinitionConfidenceThreshold [double] | When filtering out definition candidates (see autodetectSuitableDefinitions) AcronymaX™ will reject candidates with a confidence value lower than suitableDefinitionConfidenceThreshold. It is highly recommended to leave this setting at the default. | 0.85
performExtractionCallbacks [bool] | Instructs AcronymaX™ to return information about the acronym/definition pairs found to your application. Unless you are experimenting, set this value to true. | false
performProgressCallbacks [bool] | Instructs AcronymaX™ to return information on processing progress to your application. This is useful when processing large amounts of text. Progress callbacks report progress in percent for the current file and/or current block. | false
runIntegralHarvester [bool] | In the current version, AcronymaX™ only implements one harvester of acronym/definition pairs, the integral harvester. Unless you are trying to measure the performance of text preparation routines, do not modify the default setting. | true
runFuzzyHarvester [bool] | The fuzzy harvester will become available in the next version of AcronymaX™. For now, leave the field as is. | false
maximumSpacingForInitialisms [unsigned long] | Fuzzy harvester parameter; leave as is. | -
minimumLengthOfInitialismSequence [unsigned long] | Fuzzy harvester parameter; leave as is. | -
maximumLengthOfInitialismSequence [unsigned long] | Fuzzy harvester parameter; leave as is. | -
garbageConfidenceFactor [double] | When filtering out definition candidates (see autodetectSuitableDefinitions) AcronymaX™ will compare the candidate with the highest confidence value to the runner-up and decide whether definitions beyond the top-rated one should also be returned to the calling application. If the confidence value of the top-rated definition is more than garbageConfidenceFactor times the confidence value of the runner-up, all definition candidates except the top-rated one are discarded. | 1.8
readBlockSize [unsigned long] | Read buffer size, in octets, to use when reading data from disk files. Note that if an acronym/definition pair straddles the boundary of a read buffer, it will be split into two parts and AcronymaX™ will have trouble extracting or tagging the occurrence. If you need to change the buffer size, increase it if possible. | 1048576
logfileName [char[256]] | Name of the log file to write messages to when running the debug version of AcronymaX™. Leave the field empty if you don't need logging. | Empty.
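The length-related settings (minimumLettersInAcronym, maximumLettersInAcronym, suspectAcronymLength and excessiveSequenceLength) can be read as a small pre-filter on candidate tokens. The sketch below is one possible reading, not the library's code: the function names are hypothetical, only ASCII letters are counted, and a run of excessiveSequenceLength or more identical letters is assumed to count as excessive.

```cpp
#include <cctype>
#include <string>

// Counts letters only, mirroring the rule that special characters are
// ignored when measuring the length of an acronym candidate.
std::size_t letterCount(const std::string& token) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < token.size(); ++i) {
        if (std::isalpha(static_cast<unsigned char>(token[i]))) ++count;
    }
    return count;
}

// Length of the longest run of identical consecutive letters
// (case-insensitive); any non-letter character ends the current run.
std::size_t longestRepeatedRun(const std::string& token) {
    std::size_t best = 0;
    std::size_t run = 0;
    int previous = 0;
    for (std::size_t i = 0; i < token.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(token[i]);
        if (!std::isalpha(c)) {
            run = 0;
            previous = 0;
            continue;
        }
        const int upper = std::toupper(c);
        run = (upper == previous) ? run + 1 : 1;
        previous = upper;
        if (run > best) best = run;
    }
    return best;
}

// Applies the documented defaults to decide whether a token survives the
// length checks. Assumption: a run of excessiveSequenceLength or more
// repeated letters is treated as excessive.
bool passesLengthChecks(const std::string& token,
                        unsigned long minimumLettersInAcronym = 2,
                        unsigned long maximumLettersInAcronym = 128,
                        unsigned long suspectAcronymLength = 15,
                        unsigned long excessiveSequenceLength = 4) {
    const std::size_t letters = letterCount(token);
    if (letters < minimumLettersInAcronym) return false;
    if (letters > maximumLettersInAcronym) return false;
    if (letters > suspectAcronymLength &&
        longestRepeatedRun(token) >= excessiveSequenceLength) {
        return false;
    }
    return true;
}
```

Under these assumptions "A&A" passes with two letters, while a 20-letter token that ends in "AAAA" is rejected as suspect.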

Chapter 5. Scanner Simplified Callback Messages

AcronymaX™ returns results to your application by executing callback functions you provide. The callback functions' only argument is the callback message that either contains information about an acronym/definition pair found in a source text or simply reports analysis progress. The following table describes the fields of SimplifiedScannerMessage used to return results from function extractAcronymsFromFile.

Table 5.1. SimplifiedScannerMessage Fields

Field | Description
messageType [MessageType] | Either MTExtractCallback or MTProgressCallback.
canonicalAcronymStartIndex [unsigned long] | Starting index of the acronym in the current block of text.
canonicalAcronymLength [unsigned long] | Length of the acronym in the current block of text.
definitionStartIndex [unsigned long] | Starting index of the definition in the current block of text.
definitionLength [unsigned long] | Length of the definition.
definitionStopwordCount [unsigned long] | The number of stop words in the definition.
definitionWordCount [unsigned long] | The total number of words in the definition. Note that the definition region may contain more words than the definition itself.
definitionSkipsCount [unsigned long] | The number of words in the definition region that do not participate in the definition.
definitionDistance [unsigned long] | The number of words between the definition and the acronym in the original text.
confidence [float] | A confidence value for this acronym/definition pair. Indicates how sure AcronymaX™ is about the correctness of the match. Values close to 1.0 and above indicate strong matches.
fileProgressPercentage [double] | Processing progress for the current file. Values should range from 0.0 to 100.0. If you are processing a standalone block of text (i.e. using extractAcronyms), this value is zero.
blockProgressPercentage [float] | Processing progress for the current block of text. Values should range from 0.0 to 100.0.
source [HarvesterType] | AcronymaX™ harvester that found the match. Currently, the only possible value is HTIntegralHarvester.

Chapter 6. AcronymaX™ Files

This section lists the files that the AcronymaX™ binary distribution contains.

AcronymaX™ DLLs. The two versions are the debug build, acronymax{V}_d.dll, and the release build, acronymax{V}_r.dll, where {V} is the current version number of AcronymaX™, for example, 1_0. Your application must be able to find and load a suitable version of the AcronymaX™ DLL at run-time, so you must make the DLL available on the search path during development and redistribute it with your application.

ICU DLLs. The AcronymaX™ DLL depends on ICU for Unicode services and will load the ICU DLLs automatically. You must make the ICU DLLs available on the search path during development and redistribute them with your application. The release versions of the ICU DLLs are: icuin34.dll, icuio34.dll, icule34.dll, iculx34.dll, icutest.dll, icutu34.dll, icuuc34.dll. The debug versions of the ICU DLLs are: icuin34d.dll, icuio34d.dll, icule34d.dll, iculx34d.dll, icutestd.dll, icutu34d.dll, icuuc34d.dll. icudt32.dll contains Unicode data and must be present for both debug and release.

AcronymaX™ Data Files. At run-time the AcronymaX™ scanner expects to find and load certain information, such as stop words and digit variation rules, from a location known as the data directory. You must redistribute the data directory with your application and call setDataDirectory to let AcronymaX™ know where the data directory is.

AcronymaX™ Header Files. May be required for developing with AcronymaX™ in C/C++ and possibly in other languages.

C# Wrapper and .Net Example Application. The C# wrapper imports functions from the AcronymaX™ DLL and defines structure equivalents for the simplified scanner configuration and callback messages. You can use the wrapper to build .Net applications with AcronymaX™, as the .Net Example Application demonstrates. The .Net Example Application is a simple GUI tool that makes acronym/definition pairs stand out in a loaded plain text by coloring them.

C++ Example Applications. The example applications demonstrate usage of AcronymaX™ with C++.

Chapter 7. Functions Quick Reference

Note: Data type String maps to ICU UnicodeString.

Table 7.1. AcronymaX™ Functions

Each function signature is followed by its description.

void setDataDirectory(const char* path)
     throw();

Sets the location of the data directory. Only needs to be called once when preparing to work with AcronymaX™. If you fail to specify the correct directory on disk or if the data files are missing, AcronymaX™ will still work; however, the results it returns will be of lower quality.

void getDefaultAcronymaxScannerConfiguration(AcronymaxScannerConfiguration& scannerConfiguration)
     throw();

Initializes an instance of the AcronymaxScannerConfiguration structure with default values. You can adjust the values as needed and use the instance in calls to extractAcronyms. For more information, see Chapter 2, AcronymaX Scanner Configuration [C++].

void extractAcronyms(const String& text,
                     const AcronymaxScannerConfiguration& scannerConfiguration,
                     void (*fpDatabaseCallback)(const ScannerMessage&))
     throw(std::exception);

Analyzes the Unicode string text using settings from scannerConfiguration, then returns results and reports progress by calling fpDatabaseCallback.

void getDefaultSimplifiedAcronymaxScannerConfiguration(SimplifiedAcronymaxScannerConfiguration& scannerConfiguration)
     throw();

Initializes an instance of the SimplifiedAcronymaxScannerConfiguration structure with default values. You can adjust the values as needed and use the instance in calls to extractAcronymsFromFile. For more information, see Chapter 4, Simplified Acronym Scanner Configuration [Other Languages].

void extractAcronymsFromFile(const char* filename,
                             const SimplifiedAcronymaxScannerConfiguration& scannerConfiguration,
                             void (*fpDatabaseCallback)(const SimplifiedScannerMessage&))
     throw(std::exception);

Analyzes the file filename using settings from scannerConfiguration, then returns results and reports progress by calling fpDatabaseCallback. Expects the file to be in UTF-8 or a compatible encoding. Note that the scanner configuration types and the reporting callback function types are different for extractAcronyms and extractAcronymsFromFile. This is because extractAcronymsFromFile was intended mostly for early integration of AcronymaX™ with programming platforms other than C++.
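The functions in this chapter are meant to be called in a fixed order: set the data directory once, obtain a default configuration, adjust it, then scan with a callback. The sketch below shows that order for extractAcronyms. So that it compiles on its own, it replaces the AcronymaX™ headers with clearly marked stand-ins: the reduced structures, the std::string typedef for String, and the stub function bodies are all assumptions for illustration and do not reflect how the library works internally.

```cpp
#include <string>
#include <vector>

// ---- Stand-ins so the sketch compiles on its own (NOT the real API). ----
// In a real project these declarations come from the AcronymaX headers,
// and String maps to ICU UnicodeString rather than std::string.
typedef std::string String;
enum MessageType { MTExtractCallback, MTProgressCallback };
struct AcronymaxScannerConfiguration {
    bool performExtractionCallbacks;
    bool performProgressCallbacks;
};
struct ScannerMessage {
    MessageType messageType;
    unsigned long canonicalAcronymStartIndex;
    unsigned long canonicalAcronymLength;
};
void setDataDirectory(const char*) {}
void getDefaultAcronymaxScannerConfiguration(AcronymaxScannerConfiguration& c) {
    c.performExtractionCallbacks = false;  // documented default
    c.performProgressCallbacks = false;    // documented default
}
// Stub: reports one fixed span instead of analysing the text.
void extractAcronyms(const String& text,
                     const AcronymaxScannerConfiguration& c,
                     void (*callback)(const ScannerMessage&)) {
    if (!c.performExtractionCallbacks || text.size() < 3) return;
    ScannerMessage m = { MTExtractCallback, 0, 3 };
    callback(m);
}
// ---- End of stand-ins. ----

// The callback is a plain function pointer, so results are collected in a
// file-level container rather than in a capturing lambda.
std::vector<ScannerMessage> g_matches;

void onScannerMessage(const ScannerMessage& message) {
    if (message.messageType == MTExtractCallback) {
        g_matches.push_back(message);
    }
}

// Runs one scan and returns the number of extraction messages received.
std::size_t scanText(const String& text, const char* dataDirectory) {
    setDataDirectory(dataDirectory);  // only needs to be called once
    AcronymaxScannerConfiguration configuration;
    getDefaultAcronymaxScannerConfiguration(configuration);
    configuration.performExtractionCallbacks = true;  // default is false
    g_matches.clear();
    extractAcronyms(text, configuration, onScannerMessage);
    return g_matches.size();
}
```

The detail worth noting is that performExtractionCallbacks defaults to false, so a scan with an unmodified default configuration reports no acronym/definition pairs to the callback.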

Chapter 8. Harvesting Techniques

When you create an application that harvests acronym/definition pairs to build a database, there are a number of important issues to consider.

Filtering the Input. When you process content from the Web or other HTML/XML sources of text, it is highly desirable to filter out special tags that would otherwise interfere with the matching process and result in many false matches. You can preprocess texts before feeding them into AcronymaX™ or let AcronymaX™ do the job. By filtering out special tags that have no chance of participating in acronym/definition pairs you also speed up processing. A similar rationale applies to scanner flags such as keepDelimitersWhenTokenizing: the more information you filter out, the faster and more precisely AcronymaX™ will process your text.
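If you choose to preprocess the text yourself, the following sketch shows one way to blank out tags while keeping character offsets unchanged, in the spirit of the filteringWithWhitePaint setting. The function name whitePaintTags is hypothetical, tags are recognized naively as anything between "<" and the next ">", and leaving spans longer than 1024 characters untouched is an assumption about how over-long tags might be handled.

```cpp
#include <string>

// Replaces every "<...>" span with the same number of spaces so that
// character offsets in the output match offsets in the input.
std::string whitePaintTags(const std::string& text,
                           std::size_t maxTagLength = 1024) {
    std::string out = text;
    std::size_t pos = 0;
    while ((pos = out.find('<', pos)) != std::string::npos) {
        const std::size_t close = out.find('>', pos + 1);
        if (close == std::string::npos) break;  // unterminated: leave as is
        const std::size_t length = close - pos + 1;
        if (length <= maxTagLength) {
            out.replace(pos, length, length, ' ');  // paint over the tag
            pos = close + 1;
        } else {
            ++pos;  // too long to be treated as a tag; skip this '<'
        }
    }
    return out;
}
```

Because the output has the same length as the input, acronym and definition indices reported for the filtered text can be applied directly to the original markup.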

Filtering False Matches and Garbage. Currently, AcronymaX™ does not implement a heuristic to distinguish natural-language definitions from garbage. False matches are linguistic garbage as far as AcronymaX™ is concerned, and hence need filtering. Fortunately, for a harvesting application, a simple garbage filter is easy to install. By keeping track of the number of documents in which a certain definition of an acronym has appeared, you can accept only those definitions whose count reaches a set threshold. Note that this approach won't work without filtering the input text. For instance, if you leave HTML tags in place, AcronymaX™ will consistently match "TD DIV" as a definition for the acronym candidate TD.
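The document-count filter described above can be sketched in a few lines. The class and member names are hypothetical, and the threshold value is left to the application; the sketch simply counts the distinct documents in which each acronym/definition pair was reported.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// An (acronym, definition) pair as reported by the scanner callback.
typedef std::pair<std::string, std::string> AcronymDefinition;

class DocumentFrequencyFilter {
public:
    // Records that `pair` was found in the document `documentId`.
    // Repeated sightings within one document are counted once.
    void record(const std::string& documentId, const AcronymDefinition& pair) {
        documents_[pair].insert(documentId);
    }

    // Returns the pairs seen in at least `minimumDocuments` distinct
    // documents, in the map's sorted order.
    std::vector<AcronymDefinition> accepted(std::size_t minimumDocuments) const {
        std::vector<AcronymDefinition> result;
        for (const auto& entry : documents_) {
            if (entry.second.size() >= minimumDocuments) {
                result.push_back(entry.first);
            }
        }
        return result;
    }

private:
    std::map<AcronymDefinition, std::set<std::string> > documents_;
};
```

Counting distinct documents rather than raw occurrences keeps a single page that repeats the same false match many times from pushing it over the threshold.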

Chapter 9. Issues and Roadmap

False Matches and Garbage. Currently, AcronymaX™ does not implement a heuristic to distinguish natural-language definitions from garbage. I'm planning to include this functionality in one of the upcoming updates.

Acronym/Definition Coordinates May Be Off. The coordinates of the acronym and definition returned in a callback will be off when using extractAcronymsFromFile, for every block of the file except the first one. This is a bug that will be fixed in the near future.

Plural Acronyms. AcronymaX™ won't pay attention to plural acronym forms, such as LANs. This is by design.

Spelling Issues. AcronymaX™ currently won't find acronym/definition pairs such as OLTP = online transaction processing. The reason is that the algorithm has no knowledge of spelling variations of words such as "online/on-line".

Spaces in Acronyms. AcronymaX™ won't correctly match acronym/definition pairs if the acronym candidate contains spaces. For instance, ACT I will be treated as the acronym ACT and a separate word I, hence matching only "application, channel, technology" instead of "application, channel, technology and industry".

Future Enhancements. In future versions AcronymaX™ will improve by implementing new algorithms and features:

  • Longest Acronyms First. Currently, AcronymaX™ processes text sequentially, which may sometimes result in incorrect matching when similar acronym candidates occur close together. Longest Acronyms First will introduce a non-linear mode of processing in which the shortest acronyms, which are the most likely to produce incorrect matches, are processed last.

  • Fuzzy/Initialisms Harvester. Currently, AcronymaX™ starts from an acronym candidate and compares it to possible definitions. The Fuzzy/Initialisms Harvester will work in the reverse direction, comparing sequences of capitalized words to acronym candidates and searching for "creative" matches.

  • Rule-Based Harvester. In many areas of knowledge, professionals use abbreviations that look like acronyms but technically aren't. Such abbreviations often follow specific rules that allow them to be deciphered. The Rule-Based Harvester is an additional tunable algorithm that identifies and returns such abbreviation/definition pairs from a technical text.

  • Garbage Filter. A simple heuristic will be put in place to perform N-gram analysis of the definition found. Definitions that fail to identify as written in a known natural language will be marked as garbage.
