Tools
Documenting a language involves quite a bit of technology:
- Database systems for tracking terms, translations, notes, etc.
- Data conversion tools for moving data from one program to another, reordering it, cleaning it, or checking it for consistency
- Sound recording, editing and playback, for documenting the data with actual spoken examples
- Publishing tools for organizing and formatting data for publication
- A list of document formats we are using
- … and miscellaneous other tools.
The information below was suggested by various participants during the Algonquian Dictionary Workshop organized by Marie-Odile Junker at Carleton University, Ottawa, ON, Canada on Feb. 19-21, 2004, and was compiled by Radu Luchian. If anyone knows more, has additional suggestions, or knows Canadian prices for the programs listed below, please let us know.
Database Management Systems (DBMS)
It’s a very good idea to keep a consistent record of the linguistic data one collects. Different database management systems offer different sets of features (entering, listing, searching, sorting and even preparing for printing), and require different types and levels of programming proficiency.
Specialized for work with corpora:
- SIL Toolbox (Unicode)(IBMc)(free)
- Annotation Graph Toolkit (AGTK) (Unicode?)(IBMc,Unix)(open)
- TshwaneLex (Unicode!!)(proprietary, IBMc/Web, EU100 and way up)
Generic (require interface development):
- FileMakerPro (no Unicode)(PC+Mac)($US 299 and up, PDA conduit US$49)
- Microsoft Access (Unicode)(IBMc)(proprietary, US$229, free viewer available)
- mySQL (partial Unicode)(open, online, all platforms)
- askSam (Unicode?)(proprietary, IBMc, US$ 150 and up)
- Runtime Revolution (all platforms, proprietary, UK £33 and up)
- DM Dictionary (no Unicode)(PocketPC)
Last Updated 2007-04-18
Publishing
How does all the data you collected reach its audience?
- Adobe Reader – view pdf files
- Adobe Acrobat or web conversion to pdf – make pdf files
- FileMakerPro Print layouts
- mySQL to PHP and (D)HTML for the Web (free)
- TEX (Unicode?)(free, truly cross-platform)
- Microsoft Word ($US229, free viewer available)
- Adobe PageMaker (Unicode)(IBMc+Mac, US$349 and up)
- Adobe FrameMaker (Unicode)(IBMc+Mac+Unix, US$799 and up)
- WebPagePaker (no Unicode)(IBMc, $49)
Data Conversion
Some programs use file formats that are incompatible with each other. Sometimes you also have to change something throughout a database (such as the spelling or the tagging/formatting of a term), and those changes have to be consistent and fast.
- Consistent Changes (IBMc)(Unicode)(open) or MacAutoFormat (Mac)(no Unicode)(free) – for Toolbox/Shoebox
- SF Converter (no Unicode)(IBMc)(free) – creates RTF from SF – for Toolbox
- MSWord Macros (IBMc+Mac) – for repetitive tasks
- MSWord(+fields) – layout for publication (IBMc+Mac)
- Notetab macros (no Unicode)(PC)(free or US$20)
- TextPipe – for askSam (Unicode?)(PC)(US$89)
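To illustrate what a "consistent change" amounts to, here is a minimal Python sketch. It is not how any of the tools above work internally. It assumes a plain-text, backslash-coded file of the kind Toolbox uses, where each line starts with a field marker; the function name `change_in_field`, the markers `\lx`, `\ge` and `\nt`, and the sample entry are all invented for illustration. The point is that the replacement is applied only in the chosen field, so the same word in a note is left alone.

```python
def change_in_field(sf_text, marker, old, new):
    """Replace `old` with `new`, but only on lines that begin with \\marker."""
    prefix = "\\" + marker + " "
    changed_lines = []
    for line in sf_text.splitlines():
        if line.startswith(prefix):
            # Touch only the field content, never the marker itself.
            line = prefix + line[len(prefix):].replace(old, new)
        changed_lines.append(line)
    return "\n".join(changed_lines)


# Invented sample record (hypothetical markers and spelling rule).
sample = "\\lx chiman\n\\ge canoe\n\\nt check chiman spelling\n"
result = change_in_field(sample, "lx", "ch", "c")
print(result)
```

Always run this kind of change on a copy of the database and compare the result with the original before replacing it.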
Sound recording
Before anything can be done with sound, it has to be collected.
- Tape recorder (involves four lossy conversions)
- DAT recorder (involves one or two lossy conversions)
- Digital Voice Recorder (involves one lossy conversion – or two, if recording directly to mp3)
analysis of portable recording tools
Sound editing
First, you physically patch the sound into your computer: use an audio cable (~CA$3 at Radioshack) to go from the output of your tape recorder/player to the sound card soundIn or SPDIF jack in your computer.
On the Mac or Unix, you’re good to go. On IBMc, you have to make sure sound IS recording. Double-click the little speaker icon in the system tray, choose menu Options, check Recording, click OK, and then select the checkbox under the channel where you plugged in the jack (soundIn or SPDIF).
Tools to digitize analog sound on a computer (anyone can do this: press the record button in the software, start the tape, wait till you get all you need, press the stop button in the software, stop the tape, then select the portions you don’t want to keep (e.g. cough) in the software and delete them, save the file):
- Audacity (free, open source program) available for Mac OS X, Microsoft Windows, GNU/Linux, and other operating systems.
- CoolEdit2000 (free evaluation, no support, IBMc) – discontinued, but still available in various archives – ask around
- Adobe Audition (IBMc, US$349) – this is in fact CoolEditPro (big brother of CoolEdit), acquired and supported by Adobe Systems Inc. (free trial available)
- Sound Forge (US$300 and up, IBMc, entry version available: US$70)
- Sound Edit 16 (MacOS – not OSX, free?, no support)
- Huge list of Mac-based sound editors (most of them free); we have not tested those; feel free to experiment
More ways to edit sound (filtering, tagging, fragmenting):
- Use CoolEdit or SoundForge for most of these. We may link here a small tutorial for the most common operations. But know that sound editing is an art. It takes one hour to learn and a lifetime to master.
- Better use an expert – ask around
- Transcriber (Unicode)(free, IBMc,Mac,Unix) – helps in transcribing sound-based speech into text
Then, to record digital sound files onto DVD or CD, for (1) backing up data or (2) actual listening, use the software that came with your CD/DVD recorder. Note that for CDs, (1) and (2) result in two different products. To use (1) you need a computer, while (2) can be played on any CD player. When you burn the CD you have to specify which you want (e.g. data CD or audio CD).
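For those who prefer a script to an editor, the "select the portion you don't want and delete it" step can be sketched with Python's standard `wave` module. This is only an illustration under assumptions: it handles uncompressed .wav files, the function name `cut_segment` is invented here, and the demo uses a synthetic silent recording rather than real speech.

```python
import io
import wave


def cut_segment(src, dst, start_s, end_s):
    """Copy a WAV from src to dst, leaving out the audio between start_s and end_s (seconds)."""
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        frames = reader.readframes(reader.getnframes())
    # One frame holds one sample for every channel.
    frame_size = params.sampwidth * params.nchannels
    first_byte = int(start_s * params.framerate) * frame_size
    last_byte = int(end_s * params.framerate) * frame_size
    kept = frames[:first_byte] + frames[last_byte:]
    with wave.open(dst, "wb") as writer:
        writer.setparams(params)
        writer.writeframes(kept)  # the header is corrected for the new length on close


# Demo on a synthetic 3-second silent recording held in memory (8 kHz, 16-bit, mono).
original = io.BytesIO()
with wave.open(original, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(8000)
    w.writeframes(b"\x00\x00" * 8000 * 3)
original.seek(0)

trimmed = io.BytesIO()
cut_segment(original, trimmed, 1.0, 2.0)
trimmed.seek(0)
with wave.open(trimmed, "rb") as r:
    remaining_seconds = r.getnframes() / r.getframerate()
print(remaining_seconds)
```

With real recordings, `src` and `dst` would be file names instead of in-memory buffers; keep the untouched original as the archival copy.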
Sound playback
We consider two main ways to deliver sound from your database to the audience (CD, Web), and two factors to balance (quality, file size).
- JavaScript player (free, all platforms) – uses the sound plugins installed in the individual system (see Quicktime and MediaPlayer below)
- Flash player (free, all platforms)
- Quicktime player (free, IBMc+Mac)
- WindowsMediaPlayer (free, IBMc)
File formats
Here are the file formats mentioned in this document. We also discuss file sizes, since size is an important factor in processing, storing and transmitting data: the larger the file, the longer it takes to do anything with it. Note that one extension can be used for a variety of formats (e.g. .typ files have different formats in Toolbox and Transcriber, and .wav files can contain mp3 data).
- .pdf – Portable Document Format (fairly small to very large)
(if you want to display and/or print your document on any platform and on any printer so that it looks exactly the same, this is the way to go)
- .ps – Postscript ( – quite large, depending on features used )
(designed as a printer language by Adobe, it used to be the format for camera-ready documents, the last digital step toward mass printing)
- .rtf – Rich Text Format ( – fairly small to very large )
(for moving text cross-platform and cross-applications while preserving formatting – fonts, bold, italics, etc.)
- .txt, and Toolbox(.db, .sf, .typ, .lng, .prj, .cct) – Text ( – very small )
(the most economical way to record and transfer text, including databases; most programs read this kind of file.)
- .doc – usually a Microsoft Word document (- quite large )
(though other programs also use this extension, which adds confusion)
(designed as a versatile, cross-platform format and program, it has flowing layout, thus images and tables require a lot of attention; layout, and sometimes formatting, tends to change:
  - from platform to platform,
  - when setting different active printers,
  - or even from program version to version; each new version of Word usually introduces a new file format, very often incompatible with previous versions.)
- .xml – eXtensible Markup Language ( – medium to large )
(a standardized text format that allows pieces of text to be marked with any number of features)
- .dtd – Document Type Definition(s) (- small to medium )
(a standardized text format that formally describes the structure and all the features a specific .xml document can contain; an alternative is .xsd – XML Schema files)
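As a small illustration of what "marking pieces of text with features" looks like in practice, the sketch below reads one dictionary entry written in XML using Python's standard `xml.etree.ElementTree` module. The element names (`entry`, `headword`, `pos`, `gloss`), the `lang` attribute values and the entry itself are invented for this example and do not come from any project schema. Note also that this module only parses the file; it does not check it against a .dtd.

```python
import xml.etree.ElementTree as ET

# Invented sample entry; element and attribute names are hypothetical.
entry_xml = """
<entry>
  <headword lang="xx">chiman</headword>
  <pos>noun</pos>
  <gloss lang="en">canoe</gloss>
  <gloss lang="fr">canot</gloss>
</entry>
"""

entry = ET.fromstring(entry_xml)

# Each piece of text is reachable through the tag that marks it.
headword = entry.findtext("headword")
part_of_speech = entry.findtext("pos")

# Attributes carry extra features, here the language of each gloss.
glosses = {g.get("lang"): g.text for g in entry.findall("gloss")}

print(headword, part_of_speech, glosses)
```

Because every piece of text carries its own label, the same file can feed a print layout, a web page or a search index without retyping anything.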
Multimedia:
The creation and playback of files in this category depend on the existence on the individual computer of small modules called codecs which allow for conversion (coding and decoding) from one format to another; formats can be designed to store data (smaller) on a computer or CD or stream data (slightly larger) on a network (i.e. Internet) – the latter means that the client can begin playback before the entire file is downloaded.
- .raw – digital representation of sound or video ready for dumping to the sound card or video card (thus uncompressed) – huge
- .au, .wav (etc.) – (lossy) sound compression formats (compression is achieved by reducing the amount of data in which each sound sample is encoded: the number of bits (8, 12, 16, 24) per sample or the sampling frequency (8k-phone, 11k-analog tape, 22k, 44.1k-CD, 48k-digital tape), and by various mathematical functions) – very large
- .mp3, .wav, .ogg, .mov (etc.) – lossy sound compression formats (compression is done using dynamic frequency reduction that is psychologically accurate (within limits) – which means that you can balance the quality of the sound with the final size of the sound file) – large, medium and small
- animation files done with Macromedia tools (Flash: .swf, .fla; Director: .dir) – can contain video streams too, but are mainly designed to be small and interactive; bad designs can result in very large files
- IBMc(.avi), Mac(.mov) – multimedia files (mainly designed to store and/or stream video) – very large
Archives:
People tend to forget about this very important milestone in the lifecycle of data. It is important for archives (as well as any data storage) to be maintained in well-labelled, well-organized ways, ready for quick retrieval.
- IBMc(.zip, .arj, .rar), Mac(.sit), Unix(.gz) (etc.) – Archive formats with their native platforms (used in order to reduce the size of other files for transmission and/or storage; StuffIt Expander on IBMc, Mac and Unix can open any of these and more) – smaller (between half the size of the original file and a few tens of characters larger; files compressed this way can be recovered identical with the original – no loss)
Note: sound files do not get compressed much with this technique. For them, use some lossy technique described above.
- .uu, .hqx, .b64 (etc.) – 7-bit encodings (sometimes, due to historical reasons of computer networks, there is a need to convert binary files (i.e. programs, archives, images), which are normally stored in a computer with 8 bits per character, into a 7-bit format) – encoded files are larger (at least 1.15 times the size of the original file)
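Both points above, lossless recovery and the size cost of 7-bit encoding, can be checked with a short Python sketch using the standard `zlib` and `base64` modules. This is an illustration only: `zlib` implements the DEFLATE method commonly used inside .zip and .gz files, it does not read or write those archive formats itself, and the sample text is invented and deliberately repetitive, so it compresses far better than typical data would.

```python
import base64
import zlib

# Invented, highly repetitive sample text standing in for a database export.
text = ("\\lx chiman\n\\ge canoe\n" * 200).encode("utf-8")

# Lossless compression: the original comes back byte for byte.
packed = zlib.compress(text)
restored = zlib.decompress(packed)

# 7-bit-safe encoding of the compressed bytes (base64 uses 4 characters per 3 bytes).
encoded = base64.b64encode(packed)
ratio = len(encoded) / len(packed)

print(len(text), len(packed), len(encoded))
print(restored == text, round(ratio, 2))
```

Base64 always costs about one third extra, which is why it is better to compress first and encode second rather than the other way around.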
Miscellaneous
Here we put tools which do not fit in previous sections.
- Wordnet – semantically organized large subset of English
- The SIL catalogue is a great resource of free or inexpensive research software
- If the Toolbox text analysis is not enough for you, see these Linguistic Annotation tools
- MSWord and MSAccess discussed above are also available as part of the Educational package (US$199 for Win, US$129 for Mac)




