UTF-EBCDIC

UTF-EBCDIC is a character encoding used to represent Unicode characters. It is meant to be EBCDIC-friendly, so that legacy EBCDIC applications on mainframes may process the characters without much difficulty. Its advantages for existing EBCDIC-based systems are similar to UTF-8's advantages for existing ASCII-based systems. Details on UTF-EBCDIC are defined in Unicode Technical Report #16.

To produce the UTF-EBCDIC encoded version of a series of Unicode code points, an encoding based on UTF-8 (known in the specification as UTF-8-Mod) is applied first. The main difference between this encoding and UTF-8 is that it allows Unicode code points U+0080 through U+009F (the C1 control codes) to be represented as a single byte and therefore later mapped to the corresponding EBCDIC control codes. To achieve this, UTF-8-Mod uses 101XXXXX instead of 10XXXXXX as the format for trailing bytes in a multi-byte sequence. Because each trailing byte then holds only 5 bits rather than 6, the UTF-8-Mod encoding of a code point above U+009F is never shorter than its UTF-8 encoding and is often longer.
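The bit layout described above can be sketched in a few lines of Python. This is an illustrative sketch, not the reference algorithm from Unicode Technical Report #16: the names utf8_mod_encode and FORMS are hypothetical, the lead-byte markers are assumed to match UTF-8's (0xC0, 0xE0, 0xF0, 0xF8), and only code points up to U+10FFFF are handled (surrogates are not rejected).

```python
# Hypothetical sketch of the UTF-8-Mod step; not the reference code from UTR #16.

# (sequence length, lead-byte marker, largest code point that fits)
FORMS = [
    (2, 0xC0, 0x3FF),     # 5 lead bits + 1 * 5 trailing bits
    (3, 0xE0, 0x3FFF),    # 4 lead bits + 2 * 5 trailing bits
    (4, 0xF0, 0x3FFFF),   # 3 lead bits + 3 * 5 trailing bits
    (5, 0xF8, 0x10FFFF),  # lead byte 0xF8 or 0xF9 covers the rest of Unicode
]


def utf8_mod_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8-Mod bytes."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if code_point < 0xA0:
        # ASCII and the C1 controls (U+0080..U+009F) stay single bytes.
        return bytes([code_point])
    # Pick the shortest form whose limit is large enough.
    length, marker = next((n, m) for n, m, limit in FORMS if code_point <= limit)
    trailing = []
    for _ in range(length - 1):
        # Each trailing byte is 101xxxxx: marker bits 101 plus 5 payload bits.
        trailing.append(0xA0 | (code_point & 0x1F))
        code_point >>= 5
    # Whatever bits remain go into the lead byte.
    return bytes([marker | code_point] + trailing[::-1])


if __name__ == "__main__":
    for cp in (0x41, 0x9F, 0xA0, 0x400, 0x10FFFF):
        print(f"U+{cp:04X} -> {utf8_mod_encode(cp).hex()}")
```

Under these assumptions U+00A0 encodes as C5 A0 (two bytes, as in UTF-8), while U+0400 needs three bytes (E1 A0 A0) where UTF-8 needs only two.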

The UTF-8-Mod transformation leaves the data in an ASCII-based format (for example, U+0041 "A" is still encoded as 01000001), so each byte is fed through a reversible (one-to-one) lookup table to produce the final UTF-EBCDIC encoding. For example, 01000001 in this table maps to 11000001; thus the UTF-EBCDIC encoding of U+0041 (Unicode's "A") is 0xC1 (EBCDIC's "A").
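The byte-for-byte lookup can be sketched as follows. The 256-entry table is transcribed from the codepage layout in this article, read as "UTF-EBCDIC byte to UTF-8-Mod byte" and then inverted; it is not copied from the technical report. The function and constant names are hypothetical, and the positions given to the unused UTF-8-Mod values 0xFA through 0xFF (the empty cells of the layout, filled in ascending order) are an assumption.

```python
# Row r lists, for UTF-EBCDIC bytes 0xr0..0xrF, the UTF-8-Mod byte each stands for.
# Transcribed from the codepage layout in this article. The slots holding
# 0xFA..0xFF correspond to empty cells there; their order is an assumption.
EBCDIC_TO_UTF8_MOD = bytes.fromhex(
    "000102039C09867F978D8E0B0C0D0E0F"
    "101112139D0A08871819928F1C1D1E1F"
    "808182838485171B88898A8B8C050607"
    "909116939495960498999A9B14159E1A"
    "20A0A1A2A3A4A5A6A7A8A92E3C282B7C"
    "26AAABACADAEAFB0B1B221242A293B5E"
    "2D2FB3B4B5B6B7B8B9BABB2C255F3E3F"
    "BCBDBEBFC0C1C2C3C4603A2340273D22"
    "C5616263646566676869C6C7C8C9CACB"
    "CC6A6B6C6D6E6F707172CDCECFD0D1D2"
    "D37E737475767778797AD4D5D65BD7D8"
    "D9DADBDCDDDEDFE0E1E2E3E4E55DE6E7"
    "7B414243444546474849E8E9EAEBECED"
    "7D4A4B4C4D4E4F505152EEEFF0F1F2F3"
    "5CF4535455565758595AF5F6F7F8F9FA"
    "30313233343536373839FBFCFDFEFF9F"
)

# Invert the table: the mapping is one-to-one, so every value appears exactly once.
UTF8_MOD_TO_EBCDIC = bytes(EBCDIC_TO_UTF8_MOD.index(b) for b in range(256))


def utf8_mod_to_utf_ebcdic(data: bytes) -> bytes:
    """Map already UTF-8-Mod-encoded bytes to UTF-EBCDIC, byte by byte."""
    return data.translate(UTF8_MOD_TO_EBCDIC)


def utf_ebcdic_to_utf8_mod(data: bytes) -> bytes:
    """Reverse mapping, from UTF-EBCDIC back to UTF-8-Mod."""
    return data.translate(EBCDIC_TO_UTF8_MOD)


if __name__ == "__main__":
    print(utf8_mod_to_utf_ebcdic(b"\x41").hex())  # c1, EBCDIC "A"
```

Because the table is a permutation of the 256 byte values, the same bytes.translate call works in both directions and the round trip is lossless.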

This encoding form is rarely used, even on the EBCDIC-based mainframes for which it was designed. IBM EBCDIC-based mainframe operating systems, such as z/OS, usually use UTF-16 for complete Unicode support. For example, DB2 UDB, COBOL, PL/I, Java and the IBM XML toolkit support UTF-16 on IBM mainframes.

Codepage layout

There are 160 characters with single-byte encodings in UTF-EBCDIC (compared to 128 in UTF-8). The single-byte portion resembles IBM-1047 rather than IBM-37, as the location of the square brackets shows: UTF-EBCDIC places [ and ] at hex AD and BD, whereas CCSID 37 has them at hex BA and BB respectively.

UTF-EBCDIC

Rows give the high nibble of the byte and columns the low nibble. Each cell lists, from top to bottom: the character or byte role, a hexadecimal code point (or "+" bit value), and the byte value in decimal.

    _0     _1     _2     _3     _4     _5     _6     _7     _8     _9     _A     _B     _C     _D     _E     _F

0_  NUL    SOH    STX    ETX    ST     HT     SSA    DEL    EPA    RI     SS2    VT     FF     CR     SO     SI
    0000   0001   0002   0003   009C   0009   0086   007F   0097   008D   008E   000B   000C   000D   000E   000F
    0      1      2      3      4      5      6      7      8      9      10     11     12     13     14     15

1_  DLE    DC1    DC2    DC3    OSC    LF     BS     ESA    CAN    EM     PU2    SS3    FS     GS     RS     US
    0010   0011   0012   0013   009D   000A   0008   0087   0018   0019   0092   008F   001C   001D   001E   001F
    16     17     18     19     20     21     22     23     24     25     26     27     28     29     30     31

2_  PAD    HOP    BPH    NBH    IND    NEL    ETB    ESC    HTS    HTJ    VTS    PLD    PLU    ENQ    ACK    BEL
    0080   0081   0082   0083   0084   0085   0017   001B   0088   0089   008A   008B   008C   0005   0006   0007
    32     33     34     35     36     37     38     39     40     41     42     43     44     45     46     47

3_  DCS    PU1    SYN    STS    CCH    MW     SPA    EOT    SOS    SGCI   SCI    CSI    DC4    NAK    PM     SUB
    0090   0091   0016   0093   0094   0095   0096   0004   0098   0099   009A   009B   0014   0015   009E   001A
    48     49     50     51     52     53     54     55     56     57     58     59     60     61     62     63

4_  SP     •      •      •      •      •      •      •      •      •      •      .      <      (      +      |
    0020   +00    +01    +02    +03    +04    +05    +06    +07    +08    +09    002E   003C   0028   002B   007C
    64     65     66     67     68     69     70     71     72     73     74     75     76     77     78     79

5_  &      •      •      •      •      •      •      •      •      •      !      $      *      )      ;      ^
    0026   +0A    +0B    +0C    +0D    +0E    +0F    +10    +11    +12    0021   0024   002A   0029   003B   005E
    80     81     82     83     84     85     86     87     88     89     90     91     92     93     94     95

6_  -      /      •      •      •      •      •      •      •      •      •      ,      %      _      >      ?
    002D   002F   +13    +14    +15    +16    +17    +18    +19    +1A    +1B    002C   0025   005F   003E   003F
    96     97     98     99     100    101    102    103    104    105    106    107    108    109    110    111

7_  •      •      •      •      2      2      2      2      2      `      :      #      @      '      =      "
    +1C    +1D    +1E    +1F    0000   0020   0040   0060   0080   0060   003A   0023   0040   0027   003D   0022
    112    113    114    115    116    117    118    119    120    121    122    123    124    125    126    127

8_  2      a      b      c      d      e      f      g      h      i      2      2      2      2      2      2
    00A0   0061   0062   0063   0064   0065   0066   0067   0068   0069   00C0   00E0   0100   0120   0140   0160
    128    129    130    131    132    133    134    135    136    137    138    139    140    141    142    143

9_  2      j      k      l      m      n      o      p      q      r      2      2      2      2      2      2
    0180   006A   006B   006C   006D   006E   006F   0070   0071   0072   01A0   01C0   01E0   0200   0220   0240
    144    145    146    147    148    149    150    151    152    153    154    155    156    157    158    159

A_  2      ~      s      t      u      v      w      x      y      z      2      2      2      [      2      2
    0260   007E   0073   0074   0075   0076   0077   0078   0079   007A   0280   02A0   02C0   005B   02E0   0300
    160    161    162    163    164    165    166    167    168    169    170    171    172    173    174    175

B_  2      2      2      2      2      2      2      3      3      3      3      3      3      ]      3      3
    0320   0340   0360   0380   03A0   03C0   03E0   0000   0400   0800   0C00   1000   1400   005D   1800   1C00
    176    177    178    179    180    181    182    183    184    185    186    187    188    189    190    191

C_  {      A      B      C      D      E      F      G      H      I      3      3      3      3      3      3
    007B   0041   0042   0043   0044   0045   0046   0047   0048   0049   2000   2400   2800   2C00   3000   3400
    192    193    194    195    196    197    198    199    200    201    202    203    204    205    206    207

D_  }      J      K      L      M      N      O      P      Q      R      3      3      4      4      4      4
    007D   004A   004B   004C   004D   004E   004F   0050   0051   0052   3800   3C00   4000   8000   10000  18000
    208    209    210    211    212    213    214    215    216    217    218    219    220    221    222    223

E_  \      4      S      T      U      V      W      X      Y      Z      4      4      4      5      5
    005C   20000  0053   0054   0055   0056   0057   0058   0059   005A   28000  30000  38000  40000  100000
    224    225    226    227    228    229    230    231    232    233    234    235    236    237    238    239

F_  0      1      2      3      4      5      6      7      8      9                                         APC
    0030   0031   0032   0033   0034   0035   0036   0037   0038   0039                                      009F
    240    241    242    243    244    245    246    247    248    249    250    251    252    253    254    255

Outside row F_, a cell containing a single-digit number is the start byte for a sequence of that many bytes. The hexadecimal code point shown in the cell is the lowest character value encoded using that start byte. This value can be greater than the value that would be obtained by following the start byte with continuation bytes that are all decimal 65 (0x41, the continuation byte whose five bits are all zero), whenever that sequence would be an invalid overlong form.

Cells showing a "+" followed by a hexadecimal number are continuation bytes. The number is the value of the 5 bits they add.
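How the 5-bit groups accumulate can be illustrated with a small decoder. It is a sketch under stated assumptions: it works on UTF-8-Mod bytes (that is, after each UTF-EBCDIC byte has been mapped back through the lookup table, so the continuation values +00 to +1F appear as 0xA0 to 0xBF), the function name is hypothetical, only sequences of up to five bytes are handled, and overlong forms are deliberately not rejected.

```python
def utf8_mod_decode_one(seq: bytes) -> int:
    """Decode exactly one UTF-8-Mod sequence into a code point (no overlong check)."""
    if not seq:
        raise ValueError("empty sequence")
    lead = seq[0]
    if lead < 0xA0:
        # Single-byte forms: ASCII and the C1 controls.
        length, value = 1, lead
    elif lead < 0xC0:
        raise ValueError("a continuation byte cannot start a sequence")
    elif lead < 0xE0:
        length, value = 2, lead & 0x1F
    elif lead < 0xF0:
        length, value = 3, lead & 0x0F
    elif lead < 0xF8:
        length, value = 4, lead & 0x07
    elif lead < 0xFA:
        # 0xF8 and 0xF9 are the only 5-byte lead bytes needed up to U+10FFFF.
        length, value = 5, lead & 0x01
    else:
        raise ValueError("lead byte not handled by this sketch")
    if len(seq) != length:
        raise ValueError("wrong number of bytes for this lead byte")
    for byte in seq[1:]:
        if not 0xA0 <= byte <= 0xBF:
            raise ValueError("expected a continuation byte (101xxxxx)")
        # Each continuation byte contributes its low 5 bits.
        value = (value << 5) | (byte & 0x1F)
    return value


if __name__ == "__main__":
    print(hex(utf8_mod_decode_one(bytes([0xC5, 0xA0]))))  # 0xa0
```

A production decoder would additionally reject overlong sequences, surrogates and values above U+10FFFF; those checks are left out to keep the bit arithmetic visible.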

Start bytes 0x74 through 0x78 and 0xB7 can never appear in properly encoded UTF-EBCDIC text, because any possible continuation would result in an invalid overlong form. For example, even 0x76 0x73 (which maps to the UTF-8-Mod sequence 0xC2 0xBF), the largest value reachable from 0x76, would merely be an overlong encoding of U+005F (properly encoded as UTF-8-Mod 0x5F, UTF-EBCDIC 0x6D).
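The overlong rule can be checked with simple arithmetic on UTF-8-Mod lead bytes. The sketch below is illustrative only: the FORMS table and the function name are hypothetical, the lead-byte ranges are assumed to follow the UTF-8 pattern with 5-bit continuation bytes, and only the lead bytes needed for code points up to U+10FFFF are covered.

```python
# (sequence length, first lead byte, number of lead bytes,
#  smallest code point that actually needs this length)
FORMS = [
    (2, 0xC0, 32, 0xA0),
    (3, 0xE0, 16, 0x400),
    (4, 0xF0, 8, 0x4000),
    (5, 0xF8, 2, 0x40000),  # only 0xF8 and 0xF9 are needed up to U+10FFFF
]


def lowest_code_point(lead: int):
    """Return (length, lowest valid code point), or (length, None) if unusable."""
    for length, first, count, minimum in FORMS:
        if first <= lead < first + count:
            # Code points reachable from one lead byte: 5 bits per continuation byte.
            span = 1 << (5 * (length - 1))
            low = (lead - first) * span
            high = low + span - 1
            if high < minimum:
                # Everything this lead byte can express fits a shorter form.
                return length, None
            return length, max(low, minimum)
    raise ValueError("not a lead byte covered by this sketch")


# UTF-8-Mod lead bytes for which every continuation would be overlong.
UNUSABLE = [lead for lead in range(0xC0, 0xFA) if lowest_code_point(lead)[1] is None]

if __name__ == "__main__":
    print([hex(b) for b in UNUSABLE])
```

The unusable UTF-8-Mod lead bytes come out as 0xC0 through 0xC4 and 0xE0, which the codepage layout places at UTF-EBCDIC 0x74 through 0x78 and 0xB7. The same function returns 0x4000 for lead byte 0xF0 and 0x40000 for 0xF8, which is why those cells show a lowest value above the one an all-zero continuation would give.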
