亚洲免费在线-亚洲免费在线播放-亚洲免费在线观看-亚洲免费在线观看视频-亚洲免费在线看-亚洲免费在线视频

創(chuàng)造優(yōu)秀的程序之必備知識(shí):字符編碼(2)—軟件

系統(tǒng) 1969 0

軟件開(kāi)發(fā)者必須知道的Unicode和字符編碼


這是一篇翻譯自Joel Spolsky的文章“The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets”,比較經(jīng)典。

[翻譯時(shí)為增加可讀性,有少許改動(dòng)]

原文:http://www.joelonsoftware.com/articles/Unicode.html


你曾經(jīng)因?yàn)閔tml文件里的Content-Type標(biāo)簽感到迷惑嗎?比如:

    
      <
      
        meta
      
      
         http-equiv
      
      =
      
        "Content-Type" 
      
      
        content
      
      =
      
        "text/html; charset=utf-8" 
      
      
        
          /
        
      
      >
    
  
經(jīng)常見(jiàn)到吧,卻不知道它是用來(lái)做什么的?

你有沒(méi)有收到過(guò)從外國(guó)發(fā)來(lái)的郵件(比如保加利亞),標(biāo)題不能正常顯示,變成了 "???? ?????? ??? ????"?


令人沮喪的是,我發(fā)現(xiàn)很多軟件開(kāi)發(fā)者不太了解這個(gè)神秘的世界:字符集,字符編碼,Unicode等類(lèi)似的東西。2年前, FogBUGZ beta的一個(gè)測(cè)試人員想知道這個(gè)軟件能不能處理處理從日本發(fā)來(lái)的郵件,用日文寫(xiě)的郵件。[Mark,不太清楚這里要表達(dá)的意思]。當(dāng)查看那些我們用來(lái)解析MIME的電子郵件內(nèi)容時(shí),我發(fā)現(xiàn)它完全是用錯(cuò)誤的方式來(lái)進(jìn)行字符編碼的轉(zhuǎn)換,所以我只能寫(xiě)一段代碼來(lái)復(fù)原那些錯(cuò)誤的轉(zhuǎn)換并且從新以正確的方式操作字符編碼。

而當(dāng)我再看到另一個(gè)商用的函數(shù)庫(kù)時(shí),沒(méi)錯(cuò),關(guān)于字符編碼相關(guān)的實(shí)現(xiàn)也很糟糕。我聯(lián)系到了它的開(kāi)發(fā)者,他說(shuō)沒(méi)有什么解決辦法。就像很多程序員一樣,他也寄希望于那些問(wèn)題會(huì)“自己消失”,真是莫名其妙。

讓問(wèn)題“自己消失”,這是不可能的。When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues , blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough .

所以我有一個(gè)消息要告訴你:如果你現(xiàn)在是一個(gè)程序員而且對(duì)計(jì)算機(jī)字符,字符集,編碼還有 Unicode的基本內(nèi)容似乎并不了解。那我就正好逮到你了。我發(fā)誓要讓你蹲在潛艇里,罰你連續(xù)削6個(gè)月的圓蔥。

我還要告訴你:這些看似高深的東西,其實(shí)沒(méi)有那么難懂。

本文會(huì)告訴你每個(gè)程序員都應(yīng)該了解的最基本內(nèi)容。之前的一切關(guān)于“純文本==ASCII碼編碼的字符串==每個(gè)字符為一個(gè)字節(jié)的字符串”( "plain text = ascii = characters are 8 bits")不僅是錯(cuò)誤的,而且是無(wú)可救藥地錯(cuò)了。如果你現(xiàn)在還按照這樣的方式編寫(xiě)程序,那就像一個(gè)醫(yī)生不相信細(xì)菌存在一樣。所以在讀完本文之前千萬(wàn)不要去寫(xiě)程序了。

在我們開(kāi)始之前,我應(yīng)該先提醒你,如果你是那些少數(shù)幾個(gè)了解國(guó)際化的人,你可能會(huì)覺(jué)得以下討論的內(nèi)容有一點(diǎn)太簡(jiǎn)單了。我試著將門(mén)檻降到最低,這樣所有人都能理解大致原理并且可以寫(xiě)出能操縱任何語(yǔ)言的代碼,而不僅僅是英文字符串了。而且我也應(yīng)該告訴你字符相關(guān)的技術(shù)只是創(chuàng)造國(guó)際化通用軟件的一小部分,因?yàn)槲颐看沃荒軐?xiě)一樣?xùn)|西,所以今天就寫(xiě)了關(guān)于字符集、字符編碼的內(nèi)容。

從歷史講起

理解它們最簡(jiǎn)單的方式就是按照時(shí)間順序敘述它的發(fā)展。

但是我們現(xiàn)在不談EBCDIC,因?yàn)樗鼘?shí)在太老了,那段歷史已經(jīng)沒(méi)必要談了。

Back in the semi-olden days,Unix剛出現(xiàn)的時(shí)候,當(dāng)K&R還在寫(xiě) The C Programming Language 的時(shí)候,一切都是那么簡(jiǎn)單。EBCDIC也即將消失。那個(gè)時(shí)候計(jì)算機(jī)僅能表示那些不帶變音符號(hào)(注:歐洲部分語(yǔ)言在字母的頂端有變音符號(hào))的英文字符,我們把它叫做 ASCII ,在ASCII碼表中,從編號(hào)32到127承載著所有的字符??崭袷?2號(hào),字母“A”是65號(hào),等等。因此這些字符能方便地存儲(chǔ)在7個(gè)2進(jìn)制位里。那個(gè)時(shí)候大部分電腦都以8個(gè)2進(jìn)制位為一個(gè)字節(jié),所以一個(gè)字節(jié)里不僅能存儲(chǔ)所有的ASCII字符,還有1個(gè)2進(jìn)制位的空位。如果你有點(diǎn)壞,可以把它用在其他邪惡的目的:曾經(jīng)有個(gè)文字處理軟件WordStar填補(bǔ)了那個(gè)空位,以標(biāo)定單詞的尾部。 編號(hào)32以下的稱為不可打印字符。它們用來(lái)讓你發(fā)牢騷。哈哈,只是個(gè)玩笑而已,其實(shí)它們被用作控制字符,比如7可以讓電腦蜂鳴,而12可以退出正在打印的頁(yè)面再裝入新的頁(yè)面。

ASCII table

假設(shè)你使用的語(yǔ)言是美國(guó)英語(yǔ),一切看起來(lái)似乎都挺好。

由于一個(gè)字節(jié)里還剩下很多空間,很多人會(huì)想到“天哪,我們可以把128到255留給自己用”。問(wèn)題是,很多人的想法類(lèi)似,他們按照自己的意愿在128到255之間填充自己的內(nèi)容。

The IBM-PC had something that came to be known as the OEM character set which provided some accented characters for European languages and a bunch of line drawing characters ... horizontal bars, vertical bars, horizontal bars with little dingle-dangles dangling off the right side, etc., and you could use these line drawing characters to make spiffy boxes and lines on the screen, which you can still see running on the 8088 computer at your dry cleaners'. In factas soon as people started buying PCs outside of America all kinds of different OEM character sets were dreamed up, which all used the top 128 characters for their own purposes. For example on some PCs the character code 130 would display as é, but on computers sold in Israel it was the Hebrew letter Gimel ( ? ), so when Americans would send their résumés to Israel they would arrive as <nobr>r<img alt="?" src="http://www.joelonsoftware.com/pictures/unicode/gimel.png" border="0" height="9" width="5" style="margin-right:2px">sum<img alt="?" src="http://www.joelonsoftware.com/pictures/unicode/gimel.png" border="0" height="9" width="5" style="margin-right:2px">s.</nobr> In many cases, such as Russian, there were lots of different ideas of what to do with the upper-128 characters, so you couldn't even reliably interchange Russian documents.

Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages . So for example in Israel DOS used a code page called 862, while Greek users used 737. They were the same below 128 but different from 128 up, where all the funny letters resided. The national versions of MS-DOS had dozens of these code pages, handling everything from English to Icelandic and they even had a few "multilingual" code pages that could do Esperanto and Galician on the same computer! Wow! But getting, say, Hebrew and Greek on the same computer was a co 創(chuàng)造優(yōu)秀的程序之必備知識(shí):字符編碼(2)—軟件開(kāi)發(fā)者必須知道的Unicode和字符編碼 mplete impossibility unless you wrote your own custom program that displayed everything using bitmapped graphics, because Hebrew and Greek required different code pages with different interpretations of the high numbers.

Meanwhile, in Asia, even more crazy things were going on to take into account the fact that Asian alphabets have thousands of letters, which were never going to fit into 8 bits. This was usually solved by the messy system called DBCS, the "double byte character set" in which some letters were stored in one byte and others took two. It was easy to move forward in a string, but dang near impossible to move backwards. Programmers were encouraged not to use s++ and s-- to move backwards and forwards, but instead to call functions such as Windows' AnsiNext and AnsiPrev which knew how to deal with the whole mess.

But still, most people just pretended that a byte was a character and a character was 8 bits and as long as you never moved a string from one computer to another, or spoke more than one language, it would sort of always work. But of course, as soon as the Internet happened, it became quite commonplace to move strings from one computer to another, and the whole mess came tumbling down. Luckily, Unicode had been invented.




創(chuàng)造優(yōu)秀的程序之必備知識(shí):字符編碼(2)—軟件開(kāi)發(fā)者必須知道的Unicode和字符編碼


更多文章、技術(shù)交流、商務(wù)合作、聯(lián)系博主

微信掃碼或搜索:z360901061

微信掃一掃加我為好友

QQ號(hào)聯(lián)系: 360901061

您的支持是博主寫(xiě)作最大的動(dòng)力,如果您喜歡我的文章,感覺(jué)我的文章對(duì)您有幫助,請(qǐng)用微信掃描下面二維碼支持博主2元、5元、10元、20元等您想捐的金額吧,狠狠點(diǎn)擊下面給點(diǎn)支持吧,站長(zhǎng)非常感激您!手機(jī)微信長(zhǎng)按不能支付解決辦法:請(qǐng)將微信支付二維碼保存到相冊(cè),切換到微信,然后點(diǎn)擊微信右上角掃一掃功能,選擇支付二維碼完成支付。

【本文對(duì)您有幫助就好】

您的支持是博主寫(xiě)作最大的動(dòng)力,如果您喜歡我的文章,感覺(jué)我的文章對(duì)您有幫助,請(qǐng)用微信掃描上面二維碼支持博主2元、5元、10元、自定義金額等您想捐的金額吧,站長(zhǎng)會(huì)非常 感謝您的哦?。?!

發(fā)表我的評(píng)論
最新評(píng)論 總共0條評(píng)論
主站蜘蛛池模板: 欧美日韩亚洲国产一区二区三区 | 中文字幕在线免费观看视频 | 日日夜夜噜噜 | 亚洲十欧美十日韩十国产 | 久久精品免视看国产盗摄 | 国产99在线播放免费 | 久热在线视频精品网站 | 成人99| 美国美女一级毛片免费全 | 国产日韩欧美一区二区三区综合 | 国产福利福利视频 | 四虎永久免费影院在线 | 亚洲爱爱视频 | 97理论三级九七午夜在线观看 | 国偷盗摄自产福利一区在线 | 成人亚洲视频在线观看 | 日韩欧美一区二区三区在线 | 天色噜噜噜噜 | 久久国产精品亚洲一区二区 | 婷婷中文| 国产成人精品亚洲日本在线观看 | 伊人久久综合网站 | 911精品国产亚洲日本美国韩国 | 欧美亚洲日本国产 | 呦呦国产| 九九99九九精彩网站 | 4虎最新网站 | 国产精品综合视频 | 欧美午夜精品久久久久免费视 | 国产成人精品曰本亚洲78 | 五月婷婷在线视频观看 | 9久re热视频这里只有精品 | 午夜久久久久久网站 | 欧美毛片基地 | 亚洲国产二区三区 | 久久久青草青青国产亚洲免观 | 福利在线视频一区热舞 | 久草综合视频在线 | 99爱在线视频这里只有精品 | 国偷盗摄自产福利一区在线 | 中文精品北条麻妃中文 |