Tools


Documenting a language involves quite a bit of technology

Database Management Systems (DBMS)


It's a very good idea to keep a consistent record of the linguistic data one collects. Different DBMSs offer different sets of features (entering, listing, searching, sorting and even preparing for printing), and require different types and levels of programming proficiency.

Specialized for work with corpora:
  • SIL Toolbox (Unicode)(IBMc)(free)
  • Annotation Graph Toolkit (AGTK) (Unicode?)(IBMc,Unix)(open)
  • TshwaneLex (Unicode!!)(proprietary, IBMc/Web, EU100 and way up)

Generic (require interface development):
  • FileMakerPro (no Unicode)(PC+Mac)(US$299 and up, PDA conduit US$49)
  • Microsoft Access (Unicode)(IBMc)(proprietary, US$229, free viewer available)
  • MySQL (partial Unicode)(open, online, all platforms)
  • askSam (Unicode?)(proprietary, IBMc, US$150 and up)
  • Runtime Revolution (all platforms, proprietary, UK £33 and up)
  • DM Dictionary (no Unicode)(PocketPC)
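
To give an idea of what "entering, listing, searching and sorting" looks like with a generic database, here is a minimal sketch. It uses SQLite (bundled with Python) rather than any of the products listed above, and the table layout, column names and sample words are made up for illustration only.

```python
import sqlite3

# Hypothetical schema and sample entries, for illustration only.
def build_lexicon(entries):
    """Create an in-memory database and enter (headword, gloss, pos) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE lexicon (headword TEXT, gloss TEXT, pos TEXT)")
    conn.executemany("INSERT INTO lexicon VALUES (?, ?, ?)", entries)
    conn.commit()
    return conn

def search_gloss(conn, fragment):
    """List entries whose gloss contains `fragment`, sorted by headword."""
    rows = conn.execute(
        "SELECT headword, gloss, pos FROM lexicon "
        "WHERE gloss LIKE ? ORDER BY headword",
        (f"%{fragment}%",),
    )
    return rows.fetchall()

if __name__ == "__main__":
    sample = [
        ("wata", "water", "n"),
        ("miko", "to drink water", "v"),
        ("sela", "stone", "n"),
    ]
    conn = build_lexicon(sample)
    for row in search_gloss(conn, "water"):
        print(row)
```

A real project would need more fields (examples, sources, sound file references) and a user interface on top, which is where the "interface development" mentioned above comes in.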

Last Updated 2007-04-18



Publishing


How does all the data you collected reach its audience?

  • Adobe Reader - view PDF files
  • Adobe Acrobat or web conversion to PDF - make PDF files
  • FileMakerPro Print layouts
  • MySQL to PHP and (D)HTML for the Web (free)
  • TeX (Unicode?)(free, truly cross-platform)
  • Microsoft Word (US$229, free viewer available)
  • Adobe PageMaker (Unicode)(IBMc+Mac, US$349 and up)
  • Adobe FrameMaker (Unicode)(IBMc+Mac+Unix, US$799 and up)
  • WebPagePaker (no Unicode)(IBMc, $49)

Last Updated 2007-04-18


Data Conversion


Some programs use file formats which are incompatible with each other. Sometimes you have to change something all over a database (like the spelling or the tagging/formatting of a term). And those changes have to be consistent and fast.

  • Consistent Changes (IBMc)(Unicode)(open) or MacAutoFormat (Mac)(no Unicode)(free) - for Toolbox/Shoebox
  • SF Converter (no Unicode)(IBMc)(free) - creates RTF from SF - for Toolbox
  • MSWord Macros (IBMc+Mac) - for repetitive tasks
  • MSWord(+fields) - layout for publication (IBMc+Mac)
  • Notetab macros (no Unicode)(PC)(free or US$20)
  • TextPipe - for askSam (Unicode?)(PC)(US$89)
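
The kind of database-wide change described above can be sketched in a few lines of Python. This is not how any of the listed tools work internally; it only illustrates the idea, assuming a Standard Format text file in which each field sits on its own line and starts with a backslash marker (e.g. \lx for the headword, \ge for the gloss). The function name, the sample words and the spelling rule are hypothetical.

```python
import re

def apply_changes(text, marker, rules):
    """Apply each (pattern, replacement) rule, in order, to the fields
    that carry the given marker. All other lines are left untouched."""
    prefix = "\\" + marker + " "
    out = []
    for line in text.splitlines(keepends=True):
        if line.startswith(prefix):
            content = line[len(prefix):]
            for pattern, replacement in rules:
                content = re.sub(pattern, replacement, content)
            line = prefix + content
        out.append(line)
    return "".join(out)

if __name__ == "__main__":
    sample = "\\lx tchaka\n\\ge tchair\n\\lx patcha"
    # Hypothetical spelling reform: write "c" instead of "tch",
    # in headwords only (the \ge line must not change).
    print(apply_changes(sample, "lx", [(r"tch", "c")]))
```

Restricting the change to one marker is what keeps it consistent: the same rule never leaks into glosses, notes or other fields. Always run such changes on a copy of the data first.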

Last Updated 2007-04-18


Sound recording


Before anything can be done with sound, it has to be collected.

  • Tape recorder (involves four lossy conversions)
  • DAT recorder (involves one or two lossy conversions)
  • Digital Voice Recorder (involves one lossy conversion - or two, if recording directly to mp3)
  • Analysis of portable recording tools

Last Updated 2007-04-18


Sound editing


First, you physically patch the sound into your computer: use an audio cable (~CA$3 at Radioshack) to go from the output of your tape recorder/player to the sound card soundIn or SPDIF jack in your computer.

On the Mac or Unix, you're good to go. On IBMc, you have to make sure sound IS recording. Double-click the little speaker icon in the system tray, choose menu Options, check Recording, click OK, and then select the checkbox under the channel where you plugged in the jack (soundIn or SPDIF).


Anyone can digitize analog sound on a computer: press the record button in the software, start the tape, wait until you have all you need, press the stop button in the software, stop the tape, select the portions you do not want to keep (e.g. a cough) and delete them, then save the file.

Tools to digitize analog sound on a computer:

  • Audacity (free, open source program) available for Mac OS X, Microsoft Windows, GNU/Linux, and other operating systems.
  • CoolEdit2000 (free evaluation, no support, IBMc) - discontinued, but still available in various archives - ask around
  • Adobe Audition (IBMc, US$349) - this is in fact CoolEditPro (big brother of CoolEdit), acquired and supported by Adobe Systems Inc. (free trial available)
  • Sound Forge (US$300 and up, IBMc, entry version available: US$70)
  • Sound Edit 16 (MacOS - not OSX, free?, no support)
  • Huge list of Mac-based sound editors (most of them free); we have not tested those, so feel free to experiment

    More ways to edit sound (filtering, tagging, fragmenting):

  • Use CoolEdit or SoundForge for most of these. We may link here a small tutorial for the most common operations. But know that sound editing is an art. It takes one hour to learn and a lifetime to master :) Better use an expert - ask around
  • Transcriber (Unicode)(free, IBMc,Mac,Unix) - helps in transcribing recorded speech into text

    Then, to record digital sound files onto DVD or CD, either (1) as a data backup or (2) for actual listening, use the software that came with your CD/DVD recorder. Note that for CDs, (1) and (2) result in two different products: to use (1) you need a computer, while (2) can be played on any CD player. When you burn the CD you have to specify which one you want (e.g. data CD or audio CD).


Last Updated 2007-04-18


Sound playback


We consider two main ways to deliver sound from your database to the audience (CD, Web), and two factors to balance (quality, file size).

  • JavaScript player (free, all platforms) - uses the sound plugins installed in the individual system (see Quicktime and MediaPlayer below)
  • Flash player (free, all platforms)
  • Quicktime player (free, IBMc+Mac)
  • WindowsMediaPlayer (free, IBMc)

Last Updated 2007-04-18


File formats


    Here are the file formats mentioned in this document. We also discuss file sizes, since size is an important factor in processing, storing and transmitting data: the larger the file, the longer it takes to do anything with it. Note that an extension can be used for a variety of formats (e.g. .typ files have different formats in Toolbox and Transcriber, and .wav files can contain mp3 data).

  • .pdf - Portable Document Format (fairly small to very large)

    (if you want to display and/or print your document on any platform and on any printer so that it looks exactly the same, this is the way to go)

  • .ps - Postscript ( - quite large, depending on features used )

    (designed as a printer language by Adobe, it used to be the format for camera-ready documents, the last digital step toward mass printing)

  • .rtf - Rich Text Format ( - fairly small to very large )

    (for moving text cross-platform and cross-applications while preserving formatting - fonts, bold, italics, etc.)

  • .txt, and Toolbox(.db, .sf, .typ, .lng, .prj, .cct) - Text ( - very small )

    (the most economical way to record and transfer text, including databases; most programs read this kind of files.)

  • .doc - usually a Microsoft Word document (- quite large )

    (though other programs also use this extension, which adds confusion) - designed as a versatile, cross-platform format and program, it has flowing layout, thus images and tables require a lot of attention; layout (and sometimes formatting), tends to change

    1. from platform to platform,
    2. when setting different active printers,
    3. or even from program version to version; each new version of Word usually introduces a new file format, very often incompatible with previous versions

  • .xml - eXtensible Markup Language ( - medium to large )

    (a standardized text format that allows pieces of text to be marked with any number of features)

  • .dtd - Document Type Definition(s) (- small to medium )

    (a standardized text format that formally describes the structure and all the features a specific .xml document can contain; an alternative is .xsd - XML Schema files)
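
As a small illustration of "marking pieces of text with features", here is a sketch that builds one dictionary entry as XML and reads it back, using Python's standard xml.etree.ElementTree module. The element and attribute names (entry, headword, gloss, pos, lang) and the sample word are our own invention, not part of any standard; a .dtd or .xsd file is where such a vocabulary would be formally declared. Note that this module only checks that the XML is well-formed; it does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

def make_entry(headword, gloss, pos):
    """Build an <entry> element; tag and attribute names are hypothetical."""
    entry = ET.Element("entry", {"pos": pos})
    ET.SubElement(entry, "headword").text = headword
    ET.SubElement(entry, "gloss", {"lang": "en"}).text = gloss
    return entry

if __name__ == "__main__":
    entry = make_entry("wata", "water", "n")
    # Serialize to text, as it would be stored in an .xml file.
    xml_text = ET.tostring(entry, encoding="unicode")
    print(xml_text)
    # Parse the text back and pull out one marked-up piece.
    parsed = ET.fromstring(xml_text)
    print(parsed.get("pos"), parsed.find("gloss").text)
```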



    Multimedia:
    The creation and playback of files in this category depend on the existence on the individual computer of small modules called codecs which allow for conversion (coding and decoding) from one format to another; formats can be designed to store data (smaller) on a computer or CD or stream data (slightly larger) on a network (i.e. Internet) - the latter means that the client can begin playback before the entire file is downloaded.
  • .raw - digital representation of sound or video ready for dumping to the sound card or video card (thus uncompressed) - huge
  • .au, .wav (etc.) - (lossy) sound compression formats (compression is achieved by reducing the amount of data in which each sound sample is encoded, i.e. the number of bits per sample (8, 12, 16, 24) or the sampling frequency (8k - phone, 11k - analog tape, 22k, 44.1k - CD, 48k - digital tape), and by various mathematical functions) - very large
  • .mp3, .wav, .ogg, .mov (etc.) - lossy sound compression formats (compression uses perceptual models that discard the parts of the sound listeners are least likely to notice, within limits, which means that you can balance the quality of the sound against the final size of the sound file) - large, medium and small
  • animation files done with Macromedia tools (Flash: .swf, .fla; Director: .dir) - can contain video streams too, but are mainly designed to be small and interactive; bad designs can result in very large files
  • IBMc(.avi), Mac(.mov) - multimedia files (mainly designed to store and/or stream video) - very large
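
To see why bits per sample and sampling frequency matter so much for file size, here is a back-of-the-envelope sketch for uncompressed sound: size = duration x sampling frequency x bytes per sample x number of channels. The function name is ours, and the figure ignores the small header that formats like .wav add.

```python
def pcm_size_bytes(seconds, sample_rate_hz, bits_per_sample, channels=1):
    """Approximate size of uncompressed audio, ignoring any file header."""
    return seconds * sample_rate_hz * bits_per_sample * channels // 8

if __name__ == "__main__":
    # One minute of CD-quality stereo: 44,100 samples/s, 16 bits, 2 channels.
    print(pcm_size_bytes(60, 44100, 16, 2))   # 10584000 bytes, roughly 10 MB
    # One minute of phone-quality mono: 8,000 samples/s, 8 bits, 1 channel.
    print(pcm_size_bytes(60, 8000, 8))        # 480000 bytes, under 0.5 MB
```

The same minute of speech differs in size by a factor of about 22 between the two settings, before any lossy compression such as mp3 is applied.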

    Archives:
    People tend to forget about this very important milestone in the lifecycle of data. It is important for archives (as well as any data storage), to be maintained in well-labelled, well-organized ways, ready for quick retrieval.
  • IBMc(.zip, .arj, .rar), Mac(.sit), Unix(.gz) (etc.) - archive formats with their native platforms (used to reduce the size of other files for transmission and/or storage; StuffIt Expander on IBMc, Mac and Unix can open any of these and more) - smaller (between half the size of the original file and a few tens of characters larger; files compressed this way can be recovered identical to the original, with no loss). Note: sound files do not get compressed much with this technique. For them, use one of the lossy techniques described above.
  • .uu, .hqx, .b64 (etc.) - 7-bit encodings (sometimes, due to historical reasons of computer networks, there is a need to convert binary files (i.e. programs, archives, images), which are normally stored in a computer with 8 bits per character, into a 7-bit format) - encoded files are larger (at least 1.15 times the size of the original file)
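
Both effects described above (lossless archiving makes files smaller, 7-bit encoding makes them larger) can be checked with Python's standard library. This is only a sketch: zlib implements the compression method used inside .zip and .gz files, and Base64 stands in for the 7-bit encodings. The function names and sample data are ours.

```python
import base64
import zlib

def compress_roundtrip(data):
    """Compress `data`, then check that decompressing gives it back exactly."""
    packed = zlib.compress(data)
    return packed, zlib.decompress(packed) == data

def base64_ratio(data):
    """Size of the Base64-encoded form relative to the original."""
    return len(base64.b64encode(data)) / len(data)

if __name__ == "__main__":
    # Repetitive text (like a database full of field markers) compresses well.
    text = b"\\lx wata\n\\ge water\n" * 200
    packed, identical = compress_roundtrip(text)
    print(len(text), len(packed), identical)
    # Base64 writes 4 characters for every 3 bytes of input.
    print(base64_ratio(bytes(range(256)) * 3))
```

Base64 comes out at about 1.33 times the original size, comfortably above the lower bound quoted in the list.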


Miscellaneous


Here we put tools which do not fit in previous sections.

  • Wordnet - semantically organized large subset of English
  • The SIL catalogue is a great resource of free or inexpensive research software
  • If the Toolbox text analysis is not enough for you, see these Linguistic Annotation tools
  • MSWord and MSAccess discussed above are also available as part of the Educational package (US$199 for Win, US$129 for Mac)

