Preparing texts for CLiCTagger¶
Whilst CLiCTagger can work on any plain text, for best results text should be prepared using the method below. Any texts added to the corpora repository should follow this process.
To clean texts so that they are ready for use with CLiCTagger, follow these steps:
1. Save as/convert to UTF-8 and use typographical (‘curly’) quote marks.
2. Convert to Unix line endings.
3. Reformat the book title and author to make consistent across all texts.
4. Reformat chapter headings to make consistent across all texts.
If committing to the corpora repository, each editing stage is committed separately and clearly documented with a commit message. This makes it possible to see the history of a single file; see, for example, the history of willows.txt.
Save as/convert to UTF-8 and use typographical (‘curly’) quote marks¶
CLiCTagger expects files in UTF-8. Ideally, typographical (‘curly’) quote marks should be used to avoid confusion between quote marks and apostrophes.
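Both requirements can be checked before tagging. The following is a minimal sketch, not part of CLiCTagger: the function names `is_utf8` and `straight_double_quotes` are hypothetical, and it assumes the standard `iconv` and `grep` utilities are available.

```shell
# Return 0 if the file decodes cleanly as UTF-8, non-zero otherwise.
# iconv fails when it meets a byte sequence that is not valid UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Print every line (prefixed with its line number) that still contains
# a straight double quote, i.e. a candidate for conversion to curly quotes.
straight_double_quotes() {
    grep -n '"' "$1"
}
```

For example, `is_utf8 ChiLit/willows.txt || echo "not UTF-8"` reports a file that needs converting. Straight single quotes are not checked here, as they cannot be told apart from apostrophes automatically.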
Convert to Unix line endings¶
Line endings are converted with dos2unix, using the following command:
for f in ChiLit/*.txt; do dos2unix -m "$f"; done
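To confirm that the conversion worked, the files can be scanned for leftover carriage returns. This is a minimal sketch under the assumption that a remaining carriage return character means a file was not converted; the function name `files_with_cr` is hypothetical.

```shell
# Print the name of every given file that still contains a carriage return.
files_with_cr() {
    cr=$(printf '\r')   # a literal carriage return character
    for f in "$@"; do
        if grep -q "$cr" "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Running `files_with_cr ChiLit/*.txt` should print nothing once every file has Unix line endings.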
Reformat chapter headings to make consistent across all texts¶
Chapter headings are formatted as follows: if the chapter heading begins with ‘CHAPTER’ or ‘BOOK’, it must be followed by a number in Arabic or Roman numerals and then a dot. The chapter or book number cannot be written in word form. The heading can optionally be followed by a chapter title; the chapter title must not break onto a new line. Here are some examples:
CHAPTER 1. The Old Sea-dog at the Admiral Benbow
CHAPTER 2. TRAVELLING COMPANIONS.
CHAPTER 3.
CHAPTER IV. Little Meg's Treat to Her Children
CHAPTER V.
BOOK 1.
BOOK II. Jessica's Mother
Sections beginning with ‘INTRODUCTION’, ‘PREFACE’, ‘CONCLUSION’, ‘PROLOGUE’, ‘PRELUDE’ or ‘MORAL’ are also treated as separate chapters. These do not require numbers, but do require the dot. Again, the heading can optionally be followed by a title; the title must not break onto a new line. Here are some examples:
PREFACE.
INTRODUCTION.
PROLOGUE. THE OLYMPIANS
MORAL.--_There is no moral to this chapter._
In all cases there must be no space at the beginning of the line.
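The rules above can be approximated with a regular expression and used to spot headings that need reformatting. This is a sketch of the rules as described here, not CLiCTagger's own parser; the function name `is_valid_heading` is hypothetical, and the Roman numeral check is deliberately loose (it accepts any run of the letters I, V, X, L, C, D and M).

```shell
# Return 0 if the given line matches the heading rules described above.
is_valid_heading() {
    printf '%s\n' "$1" | grep -Eq \
        -e '^(CHAPTER|BOOK) ([0-9]+|[IVXLCDM]+)\.( .*)?$' \
        -e '^(INTRODUCTION|PREFACE|CONCLUSION|PROLOGUE|PRELUDE|MORAL)\..*$'
}
```

For example, `is_valid_heading "CHAPTER IV. Little Meg's Treat to Her Children"` succeeds, whereas `is_valid_heading "CHAPTER One."` and `is_valid_heading " CHAPTER 1."` (leading space) both fail.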
Part headings are on a line before the first chapter of that part, in the same format (i.e. “PART” has to be followed by a Roman or Arabic numeral). Blank lines are allowed between the part heading and the chapter heading. The following example is from treasure:
PART 2. The Sea-cook
CHAPTER 7. I Go to Bristol
IT was longer than the squire imagined ere we were ready for the sea, and none of our first plans--not even Dr. Livesey's, of keeping me
In the CLiC dropdown menu, the part and chapter headings are joined together, i.e. this treasure chapter is shown as “PART 2. The Sea-cook CHAPTER 7. I Go to Bristol”. In treasure, the “PART” headings were already present in the original text and only had to be reformatted. In other books, “PART” (and a number) has to be added to the existing headings so that the structure of the book is represented correctly in the CLiC dropdown menu. One example where the headings had to be adjusted in this way is sketches. The table of contents in a scanned copy of the book illustrates its nested structure, although it does not reproduce all levels: for example, the chapters within “CHARACTERS” and “TALES” contain a further level of chapters. As CLiC can only handle parts and chapters, with no third level, we solved this by first adding a numbered part to the heading (“PART 4.” in the following example) and joining it with the top chapter level (“CHAPTER I. THE BOARDING-HOUSE”), so that the extra chapter level (“CHAPTER I.”) is accounted for on level 2:
PART 4. TALES CHAPTER I. THE BOARDING-HOUSE
CHAPTER I.
These extra levels are not very frequent in sketches, and when they occur they are not necessarily numbered conventionally; one, for example, is headed “CHAPTER THE SECOND”. In that instance, we added only “CHAPTER.” so that it is counted as a chapter:
The advertisement has again appeared in the morning papers. Results must be reserved for another chapter.
CHAPTER. CHAPTER THE SECOND.
‘Well!’ said little Mrs. Tibbs to herself, as she sat in the front parlour of the Coram-street mansion one morning, mending a piece of stair-carpet off the first Landings;—‘Things have not turned out so badly, either, and if I only get a favourable answer to the advertisement, we shall be full again.’
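To preview how part and chapter headings would be joined, the headings of a prepared text can be listed with a short awk script. This is only a sketch of the joining behaviour described above, not the code CLiC uses: the function name `chapter_labels` is hypothetical, and it assumes that ‘BOOK’ headings sit at chapter level and that a bare “CHAPTER.” prefix counts as a chapter heading.

```shell
# Print one label per chapter heading, prefixed by the most recent part heading.
chapter_labels() {
    awk '
        # Remember the latest part heading; it is not a chapter itself.
        /^PART ([0-9]+|[IVXLCDM]+)\./ { part = $0; next }
        # Numbered chapter or book headings, or the bare "CHAPTER." prefix.
        /^(CHAPTER|BOOK) ([0-9]+|[IVXLCDM]+)\./ || /^CHAPTER\. / {
            print (part == "" ? $0 : part " " $0)
        }
    ' "$@"
}
```

Applied to the treasure excerpt shown earlier, this prints “PART 2. The Sea-cook CHAPTER 7. I Go to Bristol”. Introductory sections such as ‘PREFACE.’ are left out of the sketch for brevity.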
Manual corrections¶
When previewing the output of CLiCTagger, you might notice that manual corrections are necessary. These could involve correcting the format so that it properly follows the steps listed above, or fixing problems in the text itself, for example missing quote marks. See this example of a manual correction (adding a missing closing quote mark) in the CLiC ArTs corpus.