GeoGettingStartedwithI18n


ITS WG Collaborative editing page

Follow the conventions for editing this page.

Status: Initial Draft, i.e. please focus on technical content rather than wordsmithing at this stage.

Author: Susan K. Miller


Getting Started with Internationalization

This document is the first draft of an article on the GEO site targeted to managers, Web designers and developers who are new to internationalization. This section includes key concepts, checklists and FAQs regarding basic internationalization and localization issues.

[rr 18 may] Type out i18n and L10n to internationalization and localization.

Are you new to the world of Web site globalization, internationalization and localization? If so, this overview is for you, whether your role is as a project manager, designer or developer.

Project managers will find:

  • A brief discussion of why a Web site should be globalized.
  • Key concepts, including internationalization, localization and Unicode.
  • Frequently asked questions about internationalization.

RI I assume that the bulleted text will link to the appropriate place each time.

Web site designers will find:

  • An introduction to content, style and page layout issues for internationalizing and localizing Web sites.
  • A checklist of key issues to use as you design your Web site.

Web site developers will find:

  • An introduction to the key concepts, including language declarations and character encoding.
  • Considerations for structuring an internationalized Web site.

[RR 18 May] These items should be in bullets just like the PM section.

Managerial View

[DC 25 May] Do we agree that the manager view is about them understanding the amount of work and complexity of the issues for developers and designers as well as overview definitions? I would say we are also arguing for an internationalized architecture, which will make any sort of system more easily extendable (which in my experience is what managers want).

Why Globalization?

The key reason why companies consider offering multilingual versions of their Web sites is to respond to local market, cultural and/or legal requirements. In most countries, computer users prefer to work with Web sites in their native language. In order to establish or increase their presence in target countries, companies may decide to offer their Web sites in a local language.

[RR 18 May] "local market cultural and/or" should read "local market, cultural and/or

The globalization of Web sites has become an important issue for companies that want to increase their presence and market and sell their products in international markets. In many cases, localization has proven to be the key factor for international product acceptance and success.

[RI] doesn't this para just repeat the previous one? [DC 25 May] - agreed, seems a duplication

While the practice of web production has matured rapidly over the years, Web site globalization is still very much in its infancy.

Although nearly every American corporation now has a web presence, fewer than 15% offer more than one language. With so few examples to build upon, and few established standards, the web manager planning a multilingual site is often left with little direction and support.

[DC 25 May] But US-globalised/multilingual sites are not the only resource - seems too English/US focussed. Does this argue that more than 15% should offer more than one language?

Web site globalization is a complex process, which requires a clear project plan and often the allocation of significant time and resources.

[[RI 'is' -> 'can be'?]]

Key concepts

There are many ways to define the key concepts of Web site globalization and we will not attempt to supply the definitive version. For the purposes of this overview, we can say that internationalization is the process of building a Web site so that it can support multiple locales (a country or region and language), while localization is the process of modifying that site for a specific locale.

[DC 25 May] Hope this is not word-smithing - can we lose the first sentence?

Internationalization generally focuses on the Web site elements that work "behind the scenes" and are transparent to the end user (e.g. encoding, architecture). Localization, on the other hand, focuses on elements that are visible to the user, including the language of the text, the cultural context for the images, and the layout of the Web page. So, we can say that Web site globalization is the sum of internationalization and localization activities.

[RR] Not all are transparent, e.g. number format, etc.

[[RI 'focuses' -> 'typically/generally focuses']]

Let's consider these different facets of globalization in a little more detail.

Internationalization

The Localisation Industry Standards Association (LISA) defines internationalization as... "the process of generalizing a product so that it can handle multiple languages and cultural conventions without the need for re-design."

In general, the internationalization of any product -- including a Web site -- is most efficiently done during the product development cycle, as a precursor to localization. The biggest and most costly problem for many companies is that their English-only Web site has to be completely re-created from the ground up, because English text is embedded in the code and the applications can handle only English text.

[RI] Where?

[DC 25 May] Checking applications/systems support languages other than English needs to go higher up I think, even in the manager section.

In fact, an important aspect of internationalization is the separation of text from the source code. Translatable text, i.e. text which is visible to the user, should be moved to separate strings-only resource files. These are the files that can then be handed off to translators.
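As a minimal sketch of this separation, the Python fragment below keeps user-visible strings in per-language tables rather than in the program logic. The table contents, the message keys and the `translate` helper are all hypothetical names chosen for illustration; in a real project each table would normally live in its own resource file that can be handed to translators.

```python
# Hypothetical string tables: in practice each would be a separate
# resource file (one per language) handed off to translators.
MESSAGES = {
    "en": {"greeting": "Welcome, {name}!", "latest_news": "Latest news"},
    "fr": {"greeting": "Bienvenue, {name} !", "latest_news": "Dernières nouvelles"},
}


def translate(key: str, language: str, **values: str) -> str:
    """Look up a user-visible string by key, falling back to English."""
    table = MESSAGES.get(language, MESSAGES["en"])
    template = table.get(key, MESSAGES["en"][key])
    return template.format(**values)


print(translate("greeting", "fr", name="Amélie"))
print(translate("latest_news", "de"))  # no German table, so English is used
```

The code itself contains no translatable text, so adding a language means adding a table, not editing the logic.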

[DC 25 May] or you can provide an interface, so that translators or people working in that market can change without either accessing developers (even where files are separate). People providing the language variations often seem to want to change them slightly, eg 'up-to-the-minute news' to 'latest news'; sometimes they find the translation words/vocabulary difficult to change out of content, ie, as a list. We wish we'd provided this earlier. We use an XML config, which non-technical users if editing the file directly could easily make badly formed. I think we should be introducing and emphasising the idea of a internationalized architecture (implemented early).

RI It might help to add some wording to more carefully relate this concept to Web technologies.

Central to internationalization is the ability to display the character sets and support local standards of a particular country or region. For example, before Web pages can be translated into most Asian languages, they must support double-byte characters. If the page has been coded to support only Western European languages, it must be double-byte enabled (e.g. by using Unicode). For more on this, please refer to the Developer View below.

[RI] Suggest 'multi-byte' rather than 'double-byte'.

[RR] Suggest 'ideographic characters' and 'ideographic character enabled'; we need to get away from the DB term.

[DC 25 May] In some ways I think encoding is a non-issue for managers except in tool/system support. My feeling is that the more important issue is language declaration, e.g. primary. Managers need to know about end users being able to find information in the appropriate form, whether it be language or locale-related; also that the site can be voice-processed appropriately without alienating the end user.

[[RI 'display the character sets' -> 'support the character sets' OR -> 'display the characters']]

Internationalization is often abbreviated as “i18n”, where “18” indicates the number of letters between the "i" and the "n".

Localization

As with internationalization, many different definitions exist for the term localization. LISA notes that localization: "Involves taking a product and making it linguistically and culturally appropriate to the target locale where it will be used and sold."

The term localization is derived from the word locale, which generally means a small area or vicinity. In the context of Web site globalization, locale represents a specific combination of language, region and character encoding. For example, Portuguese as spoken in Portugal is one locale; Portuguese spoken in Brazil is a different locale.

[RI] I don't think character encoding is relevant. [RR] I agree with RI.

L10n is often used as an abbreviation for localization, where “10” indicates the number of letters between the "l" and the "n".
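The abbreviation rule described for both terms is mechanical, as the short Python sketch below illustrates. The function name `numeronym` is chosen here for illustration only; note that by convention "L10n" is usually written with a capital L to avoid confusion with the digit 1, which this sketch does not attempt to reproduce.

```python
def numeronym(word: str) -> str:
    """Abbreviate a word as first letter + number of inner letters + last letter."""
    if len(word) < 3:
        return word  # nothing to abbreviate
    return f"{word[0]}{len(word) - 2}{word[-1]}"


for term in ("internationalization", "localization", "globalization"):
    print(term, "->", numeronym(term))
```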

Globalization

Because web site internationalization is very much integrated with localization, the process of publishing multi-lingual and multicultural Web sites can be referred to as Web site globalization.

Globalization is a term used in many different ways:

  • At the highest level, we talk about the globalization of business in general as an economic process.
  • At the enterprise level, a company physically establishes a global presence by setting up local branch or distribution offices.
  • For establishing a virtual presence, an enterprise may globalize its web site, by enabling it for non-English speaking visitors, i.e. internationalizing the site’s back-end software and architecture, and localizing the site’s content.

[DC 25 May] Like the bulleted, (check) list approach. Think it's not a wordsmithing issue. Managers want to skim and tick off.

LISA defines globalization by saying it “addresses the business issues associated with taking a product global. In the globalization of high-tech products this involves integrating localization throughout a company, after proper internationalization and product design, as well as marketing, sales and support in the world market.”

[RR] Should we add g11n here for globalization?

Unicode

One important technical development that has affected localization is the Unicode standard. Unicode provides a unique number for every character, no matter what the platform, program or language. That number can be stored in several encoding forms, such as UTF-8 and UTF-16, which use between one and four bytes per character. Unicode has been implemented for most computing platforms. For more on Unicode, see the Developer view.

RI Unicode isn't a two-byte encoding any more. Maybe some of this http://www.w3.org/International/tutorials/tutorial-char-enc/en/all.html#Slide0040 tutorial text can help here.
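To make the distinction between a character's Unicode number (its code point) and its encoded bytes concrete, here is a small Python sketch. The sample characters and the helper name `encoded_lengths` are arbitrary choices for illustration.

```python
def encoded_lengths(char: str) -> tuple[int, int, int]:
    """Return (code point, bytes in UTF-8, bytes in UTF-16) for one character."""
    return (
        ord(char),
        len(char.encode("utf-8")),
        len(char.encode("utf-16-be")),  # big-endian form, without a byte order mark
    )


for char in ("A", "é", "日", "😀"):
    code_point, utf8_len, utf16_len = encoded_lengths(char)
    print(f"{char}  U+{code_point:04X}  UTF-8: {utf8_len} bytes  UTF-16: {utf16_len} bytes")
```

Each character has exactly one code point, but the number of bytes depends on the encoding form chosen.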

YOUR GLOBALIZATION PROJECT

[DC 25 May] Not quite sure how this section division fits in.

Web sites are becoming increasingly dynamic. Consequently, Web site globalization is a complex process, requiring a clear project plan and often the allocation of significant resources. Below are a few of the issues to keep in mind as you consider the ramifications of internationalization.

Increasing use of dynamic web pages

With XML, the use of databases to create web site content has become more widespread. For Web site localization, this means that where traditionally large sets of HTML pages and images needed to be localized, now database tables with structured content are translated.

[DC 25 May] We used database-back systems (with and without XML), but we have few dynamic pages. Don't understand the database tables bit.

The challenge of constant change

Many Web sites are updated at least once, if not many times, a day. For multilingual web sites this frequency of updates introduces the challenge of keeping language versions synchronized; ideally, all updates on multilingual web sites are published simultaneously in all target languages. This requires an extremely quick turn-around time for translations. Therefore, updating multilingual Web content often requires some form of automation to manage complex workflows. Even if you already have a content management system in place, you’ll need to investigate whether it can handle translated content.

[RR] Should "a content management" be "content management" or "a content management system"?

[DC 25 May] Our sites are not synchronised as such. There is no automated translation. With news the journalists in each language section need to keep content up-to-date, which they do using pan-organisation resources; sites may have different content according to the locale. Sometimes the language other than English is the resource for English news.

Translation and content management

Various tools and web content management systems have been developed to store and manage information in multiple languages on web sites. Setting up an organization and workflow that creates and manages content in multiple languages efficiently is a complicated task that is often underestimated.

Even when the technology is in place to host and manage multilingual web sites, many other issues need to be considered, such as provision of local (not localized) content, allocation of localization budgets, validation process for content translated by a third party service provider, etc.

Legalities

Keep in mind that laws and regulations in other countries can vary significantly from the practice in your country with regard to copyright, privacy, advertising and consumer protection. The country of ownership of your Web site can affect which local laws your company is subject to and how the laws are applied.

For example, if you go to the McDonald’s Corporation (U.S.) Web site and then select a non-U.S. locale, a separate page warns you: “You are leaving the McDonald's Corporation web site. The policies, including the privacy policy, on the site you are going to may vary from McDonald's. Be sure to review the policies of every site you visit.”

RR we should not use the McDonalds example if it could be seen in a negative way

You may need to consider getting local legal advice to address these issues and verify that you are in compliance. In some countries, your company may be subject to fines within thirty days of the launch of a local Web site.

FAQs

Q: Isn’t localization just another word for translation?

A: No, localization goes beyond just translation of the words. It includes allowances for locale-specific cultural references and regional standards.

Q: If I want our Web site to be international, does it have to be multilingual?

A: An “international” web site is one that is intended for an international audience, and a “multilingual” web site refers to a web site that uses more than one language. An international web site may or may not be multilingual, just as a multilingual web site may or may not be international.

For more information, please see the article defining international and multilingual sites and the FAQ that examines the trade-offs between international sites that are monolingual vs. multilingual.

SM QUESTION: WHAT OTHER Questions COULD BE ASKED AND ANSWERED HERE?

[[DRC 11May: Should naming conventions be established at the site design phase where possible so that there is a consistent way of reaching other language versions?

A: Yes. Content negotiation or sub-site naming needs to be consistent so that equivalent content can be found and updated easily.

[DC 25 May] We don't have a one-to-one mapping between pages in different languages, although some multilingual sites do. For us the user journey would be within a particular language area. I don't think language negotiation would be so useful for us, but serving pages within a particular directory sub-structure with the correct language and encoding information is.

Design View

Text content issues

The best thing you can do with your content is to ensure clarity and consistency. The text should always be clear and unambiguous:

• Use consistent phrases and terms. The importance of simple, concise language is magnified when writing for translation. For example, decide upfront whether page navigation will use “back” or “previous”, and whether instructions will say “click on”, “click”, “choose” or “select”.

• Use simple, active verbs. For example, write "Click the GO button" rather than "The GO button should now be clicked."

[RR] Not clear which is good and which is bad.

[DC 25 May] I had an idea that for usability especially voice/text browsing, that 'calls to action' should take the form of 'submit this form'.

• Avoid the use of "telegraphic English", i.e. write full sentences. For example, write "Enter your password to continue" rather than "Password required".

RR give an example of telegraphic english and write out correctly

DRC It might be worth mentioning clear writing rules, such as simplified English

Format and style issues

[[RI Allow space for text to expand in translation. Maintain a clear separation between structure and presentation (i.e. semantically marked up XHTML with CSS for presentation). Design of forms.]] RR how about design for mirroring layout for right-to-left languages

Dates

Visitors to a web site from varying locales may be confused by date formats. The format MM/DD/YY is unique to the United States. Most of Europe uses DD/MM/YY. Japan uses YY/MM/DD. The separators may be slashes, dashes or periods. Some locales print leading zeroes, others suppress them. If a native Japanese speaker is reading a US English web page from a web site in Germany that contains the date 03/04/02 how do they interpret it?

RI actually the most difficult thing is recognising dates typed into forms. Form design must therefore be thought through carefully, and perhaps alternatives prepared, according to user.

For more information, view the solution to this problem published as a Q and A [[RI 'Q & A' -> 'FAQ']] by GEO (http://www.w3.org/International/questions/qa-date-format).
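The ambiguity of a date such as 03/04/02 can be demonstrated with a short Python sketch that reads the same string under the three conventions mentioned above. The convention labels and the helper name `interpretations` are illustrative only, and the two-digit year is assumed to fall in the 2000s, which is how Python's `%y` directive treats these values.

```python
from datetime import date, datetime

AMBIGUOUS = "03/04/02"

# The three reading conventions discussed in the text.
CONVENTIONS = {
    "US (MM/DD/YY)": "%m/%d/%y",
    "Europe (DD/MM/YY)": "%d/%m/%y",
    "Japan (YY/MM/DD)": "%y/%m/%d",
}


def interpretations(text: str) -> dict[str, date]:
    """Parse the same string under each convention."""
    return {
        name: datetime.strptime(text, fmt).date()
        for name, fmt in CONVENTIONS.items()
    }


for name, value in interpretations(AMBIGUOUS).items():
    # isoformat() prints YYYY-MM-DD, which has only one reading.
    print(f"{name:20} {value.isoformat()}")
```

One string yields three different dates, while the year-first, four-digit-year output format leaves no room for misreading.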

Calendars

Be aware that many Asian countries use calendars other than the Gregorian one typically used in the Americas and Europe.

[[RI 'julian' -> 'gregorian']]

Also note that a U.S. calendar week typically displays with Saturday as the last day of the week, while in Europe it is often more common to display Sunday as the last day.
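The difference in week layout can be sketched with Python's standard `calendar` module, which lets the caller choose the first day of the week. The English day names are hardcoded here so that the output does not depend on the machine's locale settings; the helper name `week_order` is hypothetical.

```python
import calendar

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]  # index 0 = Monday, as in the calendar module


def week_order(first_weekday: int) -> list[str]:
    """Return the day names in display order for a given first day of the week."""
    week = calendar.Calendar(firstweekday=first_weekday)
    return [DAY_NAMES[day] for day in week.iterweekdays()]


print("US-style week:      ", week_order(calendar.SUNDAY))
print("European-style week:", week_order(calendar.MONDAY))
```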

Telephones and Addresses

One of the first places you may notice that your Web site doesn't work beyond your region is when a user from another locale tries to input his or her address and telephone number into your hardcoded form. Telephone numbers and addresses around the world vary substantially from the U.S. standard; these varying formats are well-documented on the Web.

When displaying addresses and telephone numbers, always include the country code and country name. Specify the time zone.

RR suggest for clarity -- When displaying addresses always include the country/region name and for telephone numbers, always include the country code. Specify the time zone.

[DC 25 May] zip code is the one that drives me mad. Also requirement of an area, eg, metropolitan UK areas such as London and Manchester, where there is no real area, though you can, if you're forced to, input 'Greater X'. Also, ordering in drop-down lists such as of countries/languages.

Cultural standards and other issues

The general rule is to avoid culturally-specific content. This includes: humorous references; references to politics, religion and sacred objects; sports and entertainment events and figures; seasons and holidays. Equivalents in other languages can probably be found for your culturally-specific content, but this requires much more effort than straightforward translation and increases both the time and costs of translation.

[[RI Hmm. I think avoid OR take on board the need to change. See my FAQ] for similar arguments.]]

Furthermore, you need to take care when referencing, or consciously or accidentally attaching meaning to, colors, animals, symbols or hand gestures.

Accessibility issues become even more complex (link to WAI site). WAI guidance calls for descriptive link text, such as the title of the linked document, rather than “click on this button”. When a link leads to a page in another language, consider pairing a local-language description with the title in the target language, and naming the target language in parentheses after the link, e.g. “(Arabic)”. For a link to an English page, this warns users that the target is in English before they wait for the download.

Avoid using acronyms and abbreviations, which may be problematic to reproduce in other languages. If acronyms have to be used, spell them out at the first occurrence in the text. For example, NBA (National Basketball Association) could be left as is in Spanish or could be translated to ANB (Asociación Nacional de Baloncesto).

RI We should say, use the abbr element and provide expansion in title attribute. This may belong in the section about text content

Consider measurement conversion. For example, consider adding “(1.6 km)” when referring to “1 mile”. This will also avoid inconsistencies between conversions done in different target languages; some translators may leave the “1 mile” unchanged, others may completely replace it with the conversion. Again, the key is to anticipate any confusion.
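A small Python sketch of this practice follows: the metric equivalent is computed and appended in the source text, so that every translation starts from the same figure. The function name `with_metric` and the one-decimal rounding are assumptions made for illustration.

```python
KM_PER_MILE = 1.609344  # the international mile is defined as exactly 1609.344 m


def with_metric(miles: float) -> str:
    """Format a distance in miles with its kilometre equivalent in parentheses."""
    unit = "mile" if miles == 1 else "miles"
    return f"{miles:g} {unit} ({miles * KM_PER_MILE:.1f} km)"


print(with_metric(1))
print(with_metric(26.2))
```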

When including examples such as a person’s name, use generic examples that are known world-wide.

RI Other cultural issues include things like whether people are used to paying for things over the internet

Images

Icons easily recognizable in the U.S. become less obvious outside its borders. The English term "home page" becomes the “initial page” in Spanish and the “welcome page” in French. So, even if a structure with a chalet roof and a chimney is recognizable in your country as a “home”, that doesn’t ensure that its use as an icon to return to the Web site’s first page is viable.

Similar difficulties arise with images of mailboxes, trash containers and shopping carts.

Page Layout

As you design Web page layout, there are a number of internationalization issues to consider. You will want to leave sufficient space for text expansion in tables, menus, online forms and illustrations. When text will be localized into double-byte languages, anticipate vertical expansion. Anticipate page layout issues when localized into languages written right-to-left.

[[RR 'double-byte' -> 'ideographic']]

Use only one space between sentences and after all punctuation marks.

RI is this an issue?

Be careful with your selection of font size. While a site displayed in an 8- or 10-point font may work for German or Finnish users and provide more content on a given page, it won't work everywhere. You can't read Japanese in an 8-point font.

[DC 25 May] We very much use different font sizes depending on the language inc feedback from the journalist users who speak the particular language. We could give some tips on multiple CSS and how to engineer effectively benefitted from the 'cascade'. Currently we have one per language, but this does not work so well when we add classes that we have to add to each stylesheet, especially when upload is via request to another part of the organisation.

As companies launch an increasing number of localized Web sites, user-friendly global navigation grows in importance. The term "global gateway" is frequently used to refer to the visual and technical devices that Web sites employ to direct visitors to their content. One of the more popular devices is a pull-down menu on the home page that includes links to localized versions of the content (eg. translations or alternative country sites). [link to JY FAQ]

RI I think navigation should go in its own section, though we could point to it from here. We should mention decisions about content negotiation vs explicit links, etc.

Be aware of possible differences in sorting rules. For example, sorting in Traditional Chinese is based on the number of strokes in the character. In Swedish, the letter Å (A-ring) sorts after Z, not near A.
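The Swedish case can be illustrated with a deliberately simplified Python sketch. It ranks letters by their position in a hardcoded Swedish alphabet; the names `SWEDISH_ALPHABET` and `swedish_key` are hypothetical, and a production site would more likely rely on a full collation library than on a hand-written key like this one.

```python
# Simplified assumption: the 29-letter Swedish alphabet, with å, ä, ö after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {letter: index for index, letter in enumerate(SWEDISH_ALPHABET)}


def swedish_key(word: str) -> list[int]:
    """Sort key: rank each letter by its place in the Swedish alphabet."""
    # Characters outside the alphabet are placed after all known letters.
    return [RANK.get(char, len(RANK)) for char in word.lower()]


places = ["Åre", "Zinkgruvan", "Arboga", "Örebro", "Älmhult"]

print(sorted(places))                   # plain code point order
print(sorted(places, key=swedish_key))  # Swedish alphabetical order
```

Both orderings place Å after Z, but only the second puts Å before Ä as Swedish readers expect, which shows that code point order alone is not a reliable substitute for locale-aware sorting.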

Checklist for internationalized Web site design

	Terminology is consistent, simple and clear.
	Acronyms and abbreviations are limited.
	Culturally-specific examples, slang and images have been removed.
	Legal text (e.g. copyright, privacy statements) is appropriate.

Development View

Note that this section is still under construction

Two of the most important issues for the developer who is new to internationalization are character encoding (including the possible use of Unicode) and language declaration.

[RR] Don't need what is in parentheses.

Some thought also should be given upfront to the physical organization of the files and directories that make up the internationalized Web site.

Character Encoding

User agents (e.g. browsers) must be able to detect the character encoding used in a Web document, so that the user isn't presented with unreadable text. You should therefore declare the character encoding of the document. For more on choosing an encoding, see the GEO Tutorial (http://www.w3.org/International/tutorials/tutorial-char-enc/).
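What "unreadable text" looks like in practice can be shown with a short Python sketch. It encodes a string as UTF-8 and then decodes the bytes with the wrong encoding, which is roughly what a user agent does when it has to guess. The sample text and the helper name `decode_as` are illustrative.

```python
text = "Let’s go"            # contains a curly apostrophe (U+2019)
raw = text.encode("utf-8")   # the bytes actually sent over the wire


def decode_as(data: bytes, encoding: str) -> str:
    """Interpret the bytes using the given character encoding."""
    return data.decode(encoding)


print(decode_as(raw, "utf-8"))   # correct encoding: Let’s go
print(decode_as(raw, "cp1252"))  # wrong guess: Letâ€™s go
```

Declaring the encoding, for example with `charset=utf-8` in the HTTP Content-Type header, removes the need for the user agent to guess.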

Unicode is the document character set for HTML and XML. This does not mean that all HTML and XML documents have to be encoded as Unicode, but it does mean that these documents can only contain characters defined by Unicode. It is a good idea to use a Unicode encoding wherever possible, since it simplifies many aspects of Web internationalization and is supported widely by HTML user agents, and by all XML processors.

Unicode is a universal character set, ie. a standard that defines, in one place, all the characters needed for writing the majority of living languages in use on computers. It aims to be, and to a large extent already is, a superset of all other character sets that have been encoded.

Language Declaration

Applications exist that can use natural language information about content to deliver to users the most relevant information based on their language preferences. The more content is tagged and tagged correctly, the more useful and pervasive such applications will become.

The language attribute unambiguously specifies the 'natural language' of web page content. It should always be used to indicate the primary language of the web page (in the main page container element). If the language changes within the main page container element this should also be reflected.

In current practice one can find XHTML documents that provide information about the language of a page in a number of different ways. One method is to use the lang and xml:lang attributes on the html tag. Other documents provide this information using a meta element. Language information may also be found in the HTTP header that is sent with a document.

So how should you do it? GEO recommends that the lang and xml:lang attributes be used for specifying the language used for processing content, and that the HTTP Content-Language setting or a meta element be used, if appropriate or needed, for identifying the audience for the document. For more on this, refer to: http://www.w3.org/International/questions/qa-http-and-lang

RI probably better to refer to the techniques doc
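The two kinds of declaration described above can be sketched in Python as a pair of small helpers that a page template might call. The function names are hypothetical, and the sketch only builds strings; it does not validate language tags.

```python
from html import escape


def xhtml_root(language: str) -> str:
    """Opening tag of an XHTML page declaring the language of its content."""
    lang = escape(language, quote=True)
    return (
        '<html xmlns="http://www.w3.org/1999/xhtml" '
        f'lang="{lang}" xml:lang="{lang}">'
    )


def content_language_header(audience_languages: list[str]) -> str:
    """HTTP header describing the intended audience of the document."""
    return "Content-Language: " + ", ".join(audience_languages)


print(xhtml_root("fr"))
print(content_language_header(["fr", "en"]))
```

The attributes carry a single language for processing the text, whereas the header may list several languages when the document is aimed at a mixed audience.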

[[RI I think we should add the following sections:

  • Bidirectional text
  • International CSS properties

]]

File Structure

According to Microsoft’s development guidelines, the most efficient way to design a software [or web] application is to split the application into two conceptual blocks: a code block and a data block. The data block contains all user interface items and no programming code, which means it is the only element that requires localization.

Externalizing all translatables to one resource file is more efficient because for each new language version of the application, only the resource file needs to be adapted. This structuring method also provides increased security, since those performing localization tasks do not need to access the source code.

[RR] I would continue to use 'block' here, because sometimes you use multiple resource files in one language for a given project.
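One way the data block might be organized is sketched below in Python, under the assumption of one table per locale with a fallback from the specific locale to the language and then to a default. The table contents, the key names and the helpers `fallback_chain` and `lookup` are hypothetical; each table stands in for a separate resource file.

```python
# Hypothetical resource tables, standing in for one resource file per locale.
RESOURCES = {
    "en": {"file": "File", "help": "Help"},
    "pt": {"file": "Ficheiro", "help": "Ajuda"},
    "pt-BR": {"file": "Arquivo"},  # only what differs from generic Portuguese
}


def fallback_chain(locale: str, default: str = "en") -> list[str]:
    """'pt-BR' -> ['pt-BR', 'pt', 'en']: most specific first, default last."""
    parts = locale.split("-")
    chain = ["-".join(parts[:count]) for count in range(len(parts), 0, -1)]
    if default not in chain:
        chain.append(default)
    return chain


def lookup(key: str, locale: str) -> str:
    """Return the first string found along the fallback chain."""
    for candidate in fallback_chain(locale):
        table = RESOURCES.get(candidate, {})
        if key in table:
            return table[key]
    raise KeyError(key)


print(lookup("file", "pt-BR"))  # Brazilian wording
print(lookup("help", "pt-BR"))  # inherited from generic Portuguese
print(lookup("help", "ja"))     # no Japanese table, so the default is used
```

With this layout a new locale only needs to supply the strings that differ, and the code block is never touched.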

User Interface

RI I think this section belongs under the Design part. Unless we express this as ways that coding needs to support the design decisions.

There are many things to consider when developing a software user interface. The most important thing is to allow for text expansion. Add approximately 30% extra space for buttons and in dialog boxes. For more on UI issues, please refer to the Designer View above.

Ensure the menu tree fits on all screen resolutions. For example, the German translation for the Edit menu is “Bearbeiten”, and “Ansicht” for the View menu. In both cases the text is substantially longer than the English; “Bearbeiten” is more than twice as long as “Edit”. For this reason, many Windows applications in European languages use a question mark symbol (?) instead of the word “Help” on menu bars.
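The arithmetic behind text expansion can be sketched in Python as follows. Character counts are used here only as a rough proxy for rendered width, and the helper names and the default 30% allowance (taken from the guideline above) are assumptions for illustration.

```python
def expansion_percent(source: str, translation: str) -> float:
    """How much longer (in characters) the translation is than the source."""
    return (len(translation) - len(source)) / len(source) * 100


def padded_width(source: str, allowance_percent: int = 30) -> int:
    """Characters to reserve: source length plus the allowance, rounded up."""
    return (len(source) * (100 + allowance_percent) + 99) // 100


print(expansion_percent("Edit", "Bearbeiten"))  # 150.0
print(expansion_percent("Help", "Hilfe"))       # 25.0
print(padded_width("Edit"))                     # 6
```

A 30% allowance reserves 6 characters for "Edit", which is still too narrow for the 10 characters of "Bearbeiten", so short labels generally need a more generous margin than longer passages.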

RI another topic that would fit here is text fragmentation - see http://www.multilingual.com/ishida43.htm

RESOURCES

A wide range of articles, techniques, tutorials and FAQs are available on this site. Please refer to the listing on the W3C Internationalization Home page.

RI We should probably point to the topic and techniques indices - and maybe explain how to use the site/what's there

Other resources include:

For more information on the Localisation Industry Standards Association, visit www.lisa.org

For more information on Unicode, visit the Unicode Consortium web site at www.unicode.org

Microsoft's Global Development portal has a wealth of information at http://www.microsoft.com/globaldev/default.mspx