Unifoundry.com Unicode Utilities

"UUU for the WWW"


I Wonder as I Wander

The GNU Unifont covers the Unicode Basic Multilingual Plane (BMP). When I first looked at it, the GNU Unifont was missing roughly 17,500 glyphs. Since then, Qianqian Fang's Unibit CJK glyphs and my own additions have reduced that to approximately 2,500 glyphs still to be completed, including the more than 1,000 glyphs that Unicode 5.1 added to the Basic Multilingual Plane. See the Unifont Glyphs page for more details on the font's latest status.

I was traveling and didn't have access to a Linux system, but I wanted to work with the GNU Unifont. I did have the Cygwin package installed on my Windows laptop but, alas, did not have Perl installed to use the Unifont creator's Perl scripts. What to do?

Necessity is the mother of invention, so I decided to write a C version of the Perl script that converts .hex files into ASCII and back for easy font editing in a text editor. I do not include that software for download here, because what I did next was far better.

I completed that software in short order, but then realized that if the glyphs were represented as bitmaps, they could be edited in a graphics editor. As a result, characters could be edited and viewed with the same aspect ratio that they'd have on final display. What better way to edit them than at the correct aspect ratio from the beginning?

The resulting software can display a full 32-bit Unicode value (leading zeroes and all) on a page, even though the GNU Unifont only supports Plane 0, the BMP, and even though Unicode itself only specifies values up to U+10FFFF. The upper two byte values are printed in the upper left-hand corner, as "U+nnnn". Characters on a page are arranged in a 16 by 16 grid (256 characters per page). Notches in the grid denote vertical and horizontal centers, and vertical and horizontal boundaries for 8 pixel wide and 16 pixel wide characters.

For example, the distance from the left-most notch in a grid square to the right-most notch in a grid square is 16 pixels, and the distance from the top-most notch in a grid square to the bottom-most notch in a grid square is also 16 pixels. The grid lines themselves are on a 32 by 32 pixel grid, providing some whitespace for clarity. (In Microsoft Paint or any other graphics editor, you can zoom in large enough to count the individual pixels for yourself.)
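The grid geometry can be sketched in a few lines of C. This is only an illustration, not the actual unihex2bmp source: the names glyph_origin and struct point are made up, the 16 by 16 glyph area is assumed to sit exactly in the center of its 32 by 32 cell, and the space taken by the row and column labels around the grid is ignored.

```c
#include <assert.h>

enum { CELL = 32, GLYPH = 16, CELLS_PER_SIDE = 16 };

struct point { int x, y; };

/* Pixel position of a glyph's top-left corner, measured from the
 * top-left corner of the grid (labels around the grid are ignored). */
struct point glyph_origin(int row, int col)
{
    assert(row >= 0 && row < CELLS_PER_SIDE);
    assert(col >= 0 && col < CELLS_PER_SIDE);
    int margin = (CELL - GLYPH) / 2;   /* assumed: 8 pixels of whitespace per side */
    struct point p = { col * CELL + margin, row * CELL + margin };
    return p;
}
```

Under these assumptions, the glyph in row 1, column 2 would start 72 pixels across and 40 pixels down from the grid's corner.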

I chose the single-pixel wide grid border format to be compatible with FontLab's BitFonter program. Yes, that's right, I use commercial font software sometimes. BitFonter will read a table of glyphs automatically (with a little hinting on your part). BitFonter will also read a .bdf file, which is generated by one of the Unifont creator's Perl scripts. However, as previously mentioned, I didn't happen to have Perl installed under Cygwin on my laptop.

Samples of unihex2bmp output can be found at the end of the Unicode Tutorial page on this website.

I began with the Wireless Bitmap file format (.wbmp) because it was the simplest graphics format I could find: a rectangular monochrome bitmap. It doesn't get any simpler than that. Once that was working, I added header processing for the Microsoft Windows Bitmap (.bmp) format. That allows editing in Microsoft Paint.

In a Wireless Bitmap file, a white pixel is always represented by a "1" bit, and a black pixel is always represented by a "0" bit. That is also the default Windows Bitmap encoding produced by Microsoft Paint, so that is the encoding that I used for pixels: white is a "1" bit, and black is a "0" bit.
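One practical consequence is that the bits have to be inverted on the way in and out. In a Unifont .hex file a "1" bit is a pixel that is turned on (black), the opposite of the bitmap encoding above. Here is a minimal sketch of that inversion; the function name invert_row is hypothetical and this is not the code from the utilities themselves.

```c
#include <assert.h>
#include <stddef.h>

/* Invert one glyph row in place, converting between .hex polarity
 * (1 = black) and bitmap polarity (1 = white). The same call works
 * in both directions. */
void invert_row(unsigned char *row, size_t nbytes)
{
    assert(row != NULL);
    assert(nbytes == 1 || nbytes == 2);   /* 8 or 16 pixels wide */
    for (size_t i = 0; i < nbytes; i++)
        row[i] = (unsigned char)~row[i];
}
```

For example, a .hex row byte of 0x18 (two black pixels in the middle) becomes 0xE7 in the bitmap.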

Some sample results appear at the bottom of the Unicode Tutorial web page on this site.

After adding that functionality, I decided to add one more option: allowing the matrix to be transposed ("flipped", going from top to bottom, then left to right, rather than from left to right, then top to bottom) to match the glyph ordering in the Unicode standard itself. (Every other system I've seen, including the professional font editing tools from FontLab, arranges characters the other way, from left to right, then top to bottom.) I realized that would allow easy comparison with the Unicode code charts to facilitate adding new glyphs.
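The two orderings differ only in which hex digit of the code point picks the row and which picks the column. A small sketch, with the hypothetical names cell_for and struct cell (this is my illustration of the idea, not the utilities' actual code):

```c
#include <assert.h>

struct cell { int row, col; };

/* Map the low byte of a code point (0x00 through 0xFF) to its grid cell. */
struct cell cell_for(unsigned low_byte, int flipped)
{
    assert(low_byte <= 0xFF);
    struct cell c;
    if (flipped) {
        /* Unicode chart order: top to bottom, then left to right */
        c.row = (int)(low_byte & 0xF);
        c.col = (int)(low_byte >> 4);
    } else {
        /* default order: left to right, then top to bottom */
        c.row = (int)(low_byte >> 4);
        c.col = (int)(low_byte & 0xF);
    }
    return c;
}
```

So U+0041 lands in row 4, column 1 by default, and in row 1, column 4 when flipped, which is where the Unicode code charts put it.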

The two main utilities, unihex2bmp and unibmp2hex, convert GNU Unifont .hex files to and from Windows Bitmap (.bmp) and Wireless Bitmap (.wbmp) files. These two utilities use the Windows Bitmap format to allow glyph editing with the Microsoft Paint accessory bundled with Windows. I was on the road with my laptop when I wrote them, and wanted software that would let me easily edit the GNU Unifont on my Windows laptop.

The utilities were written as a quick hack, without tons of robust error checking or other bullet-proofing. Chances are I won't modify the programs for a while, at least until more of the BMP is covered by the GNU Unifont. This software is written in C, and should compile and run on just about anything that has a C compiler.

Licensing

The Unifoundry.com GNU Unifont Utility Package is Copyright © 2007 Paul Hardy.

The Unifoundry.com GNU Unifont Utility Package is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

The Unifoundry.com GNU Unifont Utility Package is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with the Unifoundry.com GNU Unifont Utility Package. If not, see http://www.gnu.org/licenses/.

Utilities

There are four utility programs in this package:

Usage

The first two utilities, unihex2bmp and unibmp2hex, accept the following options:

-i
Specify the input file. The default is stdin. For example, "-iunifont.hex" specifies the input file as "unifont.hex".
-o
Specify the output file. The default is stdout. For example, "-omyoutput.bmp" specifies the output file as "myoutput.bmp". Warning: there's no check to see if an output file exists; these utilities will clobber an existing file for output.
-p
Specify a "page", or block of 256 code points, to convert. ("Page" is my term, because that's what prints on a bitmap graphics page; it isn't a standard Unicode term.) For example, -p83 specifies the range U+8300 through U+83FF. If you don't specify a page with unibmp2hex, it figures out the page by reading the row and column labels in the bitmap file. The default page is 0.
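The relationship between a page number and its code points is just a shift by eight bits. As a sketch (page_first and page_last are names I made up for illustration), assuming the page number is the hexadecimal value given after -p:

```c
#include <assert.h>

/* First code point on a page: the page number becomes the upper bits. */
unsigned long page_first(unsigned long page)
{
    assert(page <= 0xFFFFFFUL);   /* page number plus low byte fits in 32 bits */
    return page << 8;
}

/* Last code point on a page: the low byte is all ones. */
unsigned long page_last(unsigned long page)
{
    return page_first(page) | 0xFFUL;
}
```

With page 0x83 this gives 0x8300 and 0x83FF, matching the -p83 example above.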

In addition, unihex2bmp accepts the following options:

-w
Create a Wireless Bitmap graphics file instead of the default Windows Bitmap file.
-f
"Flip" (transpose) the grid to match the structure of the Unicode standard. This prints code points top to bottom, then left to right. The default order is left to right, then top to bottom.

Note that unibmp2hex will figure out whether a bitmap is flipped (transposed) or not, and whether it is in Wireless Bitmap or Windows Bitmap format. It reads the last column of numbers to the left of the grid as the format for all hex digits, then compares the other row and column headers to determine the "page", unless the page is specified with the -p command line option.

unibmp2hex outputs characters in the BMP in standard Unifont .hex format. If a character is above the BMP, it outputs hex codes preceded by an eight digit hexadecimal number rather than a four digit hexadecimal number, with everything else being the same.
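The four digit versus eight digit choice might look like the following sketch. The function name format_code_point is hypothetical, and the colon separating the code point from the bitmap digits is the usual Unifont .hex convention rather than something spelled out on this page.

```c
#include <assert.h>
#include <stdio.h>

/* Write the code point prefix of a .hex line into buf: four hex digits
 * within the BMP, eight above it, followed by a colon.
 * Returns the number of characters written (not counting the NUL). */
int format_code_point(char *buf, size_t size, unsigned long cp)
{
    assert(buf != NULL && size >= 10);   /* room for "XXXXXXXX:" plus NUL */
    if (cp <= 0xFFFFUL)
        return snprintf(buf, size, "%04lX:", cp);
    return snprintf(buf, size, "%08lX:", cp);
}
```

The bitmap digits that follow the prefix are the same either way.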

unibmp2hex only understands one height, 16 pixels; it only understands two widths, 8 or 16 pixels. When reading the center of each 32 by 32 pixel grid, it detects whether or not the second half of the center 16 by 16 pixel grid is blank. If it is, then it outputs the .hex character as a 16 row by 8 column hex code. If there is even one black pixel in the second half of the 16 by 16 grid, it outputs the .hex character as a 16 row by 16 column hex code.

Caveat Emptor. These programs were written very, very quickly over a few evenings as a hack. It wouldn't surprise me if they have bugs, but they seem to work perfectly. In addition, these programs don't contain much in the way of error checking. If you do feed these programs bogus values or anything similar, expect the unexpected.

GNU Unifont Status

I wanted to determine how many characters were still needed to complete the GNU Unifont. First, I assembled as complete a glyph collection as I could. I began with Roman Czyborra's original unifont.hex file, then applied all of the updates on his website. Then I applied all of the updates from the Debian distribution. Finally, I added the Tibetan glyphs that Rich Felker contributed last year.

One remaining difficulty in this calculation is that the unassigned characters, Specials, Noncharacters, etc. weren't marked in any special way. So I went through and added glyphs for them that look like gray boxes, based upon the Unicode 5.0 Standard. Those filler glyphs are available for download in the Unifont Glyphs section of this website. I then inserted the filler glyphs into my local copy of the unifont.hex file.

Finally, to see how many characters there were in each 256 character area, I wrote the unipagecount program to read the new .hex file.

The unipagecount utility prints the high-order nybble of each page number as row headers (in the left-most column) and the low-order nybble as column headers (in the first row). Each cell value ranges from 0 for a 256 character area with no entries to 100 (hex) for a 256 character area with all entries present.
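The counting itself is simple enough to sketch here. This is an illustration rather than the unipagecount source: the names count_pages and print_page_table are hypothetical, the code points are assumed to have been parsed from the .hex file already, and each code point is assumed to appear only once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Tally how many code points fall in each 256 character page of the BMP. */
void count_pages(const unsigned long *cps, size_t n, unsigned counts[256])
{
    assert(cps != NULL && counts != NULL);
    for (unsigned page = 0; page < 256; page++)
        counts[page] = 0;
    for (size_t i = 0; i < n; i++)
        if (cps[i] <= 0xFFFFUL)          /* skip anything above the BMP */
            counts[cps[i] >> 8]++;
}

/* Print the tallies as a 16 by 16 table of hexadecimal counts. */
void print_page_table(const unsigned counts[256], FILE *out)
{
    fputs("   ", out);
    for (unsigned col = 0; col < 16; col++)
        fprintf(out, " %3X", col);       /* low-order nybble headers */
    fputc('\n', out);
    for (unsigned row = 0; row < 16; row++) {
        fprintf(out, "  %X", row);       /* high-order nybble header */
        for (unsigned col = 0; col < 16; col++)
            fprintf(out, " %3X", counts[row * 16 + col]);
        fputc('\n', out);
    }
}
```

The page number is the code point shifted right by eight bits, so the row is its upper hex digit and the column is its lower hex digit.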

Here are the results on the 2008-01-28 unifont.hex file:

        0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
     0 100 100 100 100 100 100 100 100 100 100  5F  68  5A  62 100 100 
     1  B2  53 100 100 100 100 100  3D  65  2B 100  87 100 100 100 100 
     2 100  DF  F2  93  EB  EE  BC  BB 100   0   0  E1 100 100  73 100 
     3  EC  B5  53  64 100 100 100 100 100 100 100 100 100 100 100 100 
     4 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 
     5 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 
     6 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 
     7 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 
     8 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 
     9 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 
     A   1   0   0   0  3C 100 100 100  9C 100 100 100 100 100 100 100 
     B 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 
     C 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 
     D 100 100 100 100 100 100 100 100   0   0   0   0   0   0   0   0 
     E 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 
     F 100 100 100 100 100 100 100 100 100 100  89  FF  30  3E  E1  C7 
    

You can see that the first ten blocks of 256 characters (U+0000 through U+09FF, in the upper left-hand corner) are complete: in each block, all 100 (hex), or 256 (decimal), characters have glyphs.

Some 256 character blocks don't have any assignments. The range U+D800 through U+DFFF is reserved for surrogate pairs. The range U+E000 through U+F8FF is reserved for private use.

The Unifont Glyphs page shows a color-coded view of font coverage. This was made with the unipagecount program with the -h and -l options, to produce HTML output with links. Any box that is light green is 100 percent complete. Any box that is red or near-red has few or no glyphs complete. Yellow and orange are intermediate, with orange cells having less coverage than yellow cells.

Download the Utilities

You can download the utilities here (version 1.02 released 6 January 2008):

You'll also need a copy of Roman Czyborra's unifont.hex file. His very first, original file is at http://czyborra.com/unifont/unifont.hex.gz and the patches that I applied to the original file to derive the above unipagecount values are at unifont-patch-2007-12-31.gz. Note: Roman's website is currently down. For that reason, I'm making the whole GNU Unifont available on this website.

Combining Characters

The combining character dashed circle in the existing unifont.hex file has this pattern:

You can see this, for example, in the Combining Diacritical Marks block at U+0300 through U+036F. Not all combining circles follow this pattern precisely. For that reason, I wrote uniunmask for version 1.02. This program reads a second .hex file, masks.hex, and XORs it with the main .hex file for any matching code points. This allows combining circle marks to appear in a master file, but be easily removed for display, for example on X-Windows.
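The XOR step can be sketched as follows. This is only an illustration of the idea, not the uniunmask source; the names hex_value and xor_glyph are made up, and the sketch works on just the bitmap digits of one glyph (the part of a .hex line after the code point), assuming the glyph and its mask have the same width.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Convert one hexadecimal digit to its value, 0 through 15. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    c = (char)toupper((unsigned char)c);
    assert(c >= 'A' && c <= 'F');
    return c - 'A' + 10;
}

/* XOR the mask's bitmap digits into the glyph's bitmap digits, in place. */
void xor_glyph(char *glyph, const char *mask)
{
    static const char digits[] = "0123456789ABCDEF";
    size_t n = strlen(glyph);
    assert(n == strlen(mask));   /* same width required */
    for (size_t i = 0; i < n; i++)
        glyph[i] = digits[hex_value(glyph[i]) ^ hex_value(mask[i])];
}
```

Because XOR is its own inverse, running the same mask over the result a second time would put the dashed circle back.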

Version 1.02 also adds the program unihex2bdf, a C program that produces output identical to Roman Czyborra's Perl script. This program reads in a Unifont .hex file and writes out a font file in BDF format for X-Windows.

GNU Unifont and TrueType

Luis Alejandro González Miranda has created a utility to convert the GNU Unifont into a TrueType font by using FontForge. His website is http://www.lgm.cl/trabajos/unifont/index.en.html.

Roman Czyborra's GNU Unifont Utilities

Since Roman's website is currently down, here's a gzipped tar file of his Perl scripts (bdfimplode.pl, hex2bdf.pl, hexdraw.pl, and hexmerge.pl). I'll post more information later.

http://unifoundry.com/czyborra-perl.tar.gz

Aux Armes, Netizens!

A call to arms, or "Where do I sign up?"

As can be seen, there are plenty of areas that need work. Roman Czyborra had asked that additions be emailed in .hex format to (anti-spam version of address) unifont at his domain czyborra.com. Current news on the GNU Unifont was available at http://czyborra.com/unifont/.

However, his website is currently down. You can send any updates to unifoundry at this domain and I'll add them to my master copy for the next release. Thanks!

If you have any questions, please email unifoundry at this domain name (not spelled out because of spammers).