6.6.2.4.3.1. Text encoding

There are several different standards for how computers store and represent text.

6.6.2.4.3.1.1. ASCII

ASCII stands for the American Standard Code for Information Interchange. It is one of the oldest, and still one of the most common, standards for how letters and other characters (e.g. punctuation) are represented as binary numbers.

We often refer to plain text, and by this we mean text represented by ASCII (or UTF-8, see below). ASCII is so common that it’s extremely unlikely you’ll encounter a computer system that’s in use today which wouldn’t be able to take ASCII text input and display the associated letters if you asked it to. It’s a very robust format to be working with.

ASCII uses a very simple encoding system, so simple that the full table fits at the bottom of this page. Each letter, and a number of other characters such as full stops, is given a binary number that represents it. (The binary number can of course also be written in decimal form.) If the computer sees this number, and is asked to interpret it as a character, that character is what gets displayed. ASCII uses 7 bits to store a character, so there are 2^7 = 128 possible characters that it can represent. The first 32 codes (0 to 31), and the final code 127, represent control signals for a computer rather than printable characters.
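As a small illustration of this character-to-number mapping, the sketch below uses Python's built-in `ord`, `chr` and `format` functions. The variable names are chosen just for this example.

```python
# Look up the number that represents a character, and go back again.
letter = "A"
code = ord(letter)             # the integer that represents "A"
binary = format(code, "07b")   # the same number written as 7 binary digits
back = chr(code)               # interpret the number as a character again

print(letter, code, binary, back)  # A 65 1000001 A

# The same idea for a short message: one small number per character.
message = "Hi!"
ascii_bytes = message.encode("ascii")
print(list(ascii_bytes))  # [72, 105, 33]
```

Each number printed here can be looked up in the ASCII table at the bottom of this page.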

6.6.2.4.3.1.2. Latin-1

While very robust, and recognized by essentially every computer, the big limitation of ASCII is that it basically assumes you are writing English, using the 26 letters of the English alphabet. It is not very inclusive, and does not allow characters from other languages to be used.

Latin-1 (also known as ISO 8859-1) was an extended version of ASCII which used 8 bits rather than 7 to represent each character. This allowed the character map to have numbers representing 256 characters rather than 128. The first 128 are the same as ASCII, and the extra 128 include many accented characters, as used in many Western European languages.
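To see the difference the extra bit makes, here is a minimal sketch using Python's built-in `latin-1` and `ascii` codecs. The example word and variable names are illustrative only.

```python
# "café", with the accented letter written as an escape so the example
# does not depend on how this file itself is encoded.
text = "caf\u00e9"

# The 8-bit encoding has a number (233) for the accented letter.
latin1_bytes = text.encode("latin-1")
print(list(latin1_bytes))  # [99, 97, 102, 233]

# 7-bit ASCII stops at 127, so the same text cannot be encoded.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)  # False
```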

6.6.2.4.3.1.3. Unicode and UTF-8 (and similar)

Unicode and UTF-8 (and UTF-16 and UTF-32) are how plain text is represented in modern computer systems. You may still encounter some older computer systems in use which only recognize ASCII, but in general if you say plain text today, this would probably be assumed to be Unicode encoded in UTF-8. Your programming files are being written in plain text, most likely UTF-8 plain text, and VSCode shows you the encoding used for your text file at the bottom of the screen.

Figure: Highlighting of the text encoding used in VSCode. (Screenshot of VSCode, software from Microsoft. See course copyright statement.)

The encoding scheme, mapping letters to binary numbers computers can store, is more complicated with Unicode and UTF-8, and the details are not important here. The major benefit of UTF-8/16/32 based schemes is that essentially every character from any language can be represented. It is much more inclusive and should be the default.
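Without going into the details, a quick sketch can show the practical effect: in UTF-8 different characters take up different numbers of bytes, and plain ASCII characters still take just one. The sample characters below are arbitrary choices for illustration.

```python
# A, e-acute, the euro sign, and an emoji, written as escapes.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-le")   # "-le" avoids the extra byte-order mark
    utf32 = ch.encode("utf-32-le")
    # Print the Unicode number of the character and its size in each encoding.
    print(f"U+{ord(ch):04X}", len(utf8), len(utf16), len(utf32))

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)  # [1, 2, 3, 4]

# For ASCII characters, UTF-8 produces exactly the same bytes as ASCII.
print("A".encode("utf-8") == "A".encode("ascii"))  # True
```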

The downside of UTF-8/16/32 based schemes is that there are lots of characters that look very similar. For example, ' and ’ are both quote characters: one is straight and one is curved. It’s easy to get the wrong one, especially when copying and pasting between different programs or from the Internet. In general, programming files expect the straight quote symbol '. You’ll get an error if you try to use ’ as a quote in your code.

This is just one example. You can get some very hard to spot errors in your text files from characters that look very similar, but are in fact different. Occasionally, switching the encoding to ASCII can be useful, as its more limited character set restricts what can be entered to the correct symbols.
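One way to track down such look-alike characters is to list everything in a piece of text that falls outside ASCII. The sketch below is a minimal example using Python's standard `unicodedata` module. The function name `find_non_ascii` is made up for this illustration.

```python
import unicodedata


def find_non_ascii(text):
    """Return (position, character, Unicode name) for each non-ASCII character."""
    return [
        (i, ch, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 127  # ASCII only covers the numbers 0 to 127
    ]


# A line of code where curly quotes were pasted in by mistake.
line = "print(\u2018hello\u2019)"

for position, ch, name in find_non_ascii(line):
    print(position, repr(ch), name)
# 6 '‘' LEFT SINGLE QUOTATION MARK
# 12 '’' RIGHT SINGLE QUOTATION MARK
```

A line that only uses straight quotes, such as `print('hello')`, gives an empty list.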

6.6.2.4.3.1.4. Which should I use?

Most modern files and programs will accept UTF-8. You shouldn’t need to worry about text encoding more than this until you get to more advanced programming.
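If you do need to be explicit, Python's built-in `open` function takes an `encoding` argument. The sketch below writes and reads a small file as UTF-8. The file name and text are placeholders, and a temporary folder is used so that none of your own files are touched.

```python
import os
import tempfile

text = "na\u00efve caf\u00e9"  # "naïve café", written with escapes

# Hypothetical file location, inside a fresh temporary folder.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Say which encoding to use rather than relying on the system default.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, "r", encoding="utf-8") as f:
    read_back = f.read()

print(read_back == text)  # True
```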

6.6.2.4.3.1.5. ASCII encoding table

The ASCII encoding table, where each letter is represented by a number, is given below. Some of the early ones represent control signals for a computer, rather than letters. (You won’t be asked to memorise this for the exam! It’s just an illustration of the encoding to help your understanding.)

Binary   Integer  Description                Binary   Integer  Description  Binary   Integer  Description
-------  -------  -------------------------  -------  -------  -----------  -------  -------  -----------
0        0        Null                       101011   43       +            1010110  86       V
1        1        Start of heading           101100   44       ,            1010111  87       W
10       2        Start of text              101101   45       -            1011000  88       X
11       3        End of text                101110   46       .            1011001  89       Y
100      4        End of transmission        101111   47       /            1011010  90       Z
101      5        Enquiry                    110000   48       0            1011011  91       [
110      6        Acknowledge                110001   49       1            1011100  92       \
111      7        Bell, alert                110010   50       2            1011101  93       ]
1000     8        Backspace                  110011   51       3            1011110  94       ^
1001     9        Horizontal tab             110100   52       4            1011111  95       _
1010     10       Line feed                  110101   53       5            1100000  96       `
1011     11       Vertical tab               110110   54       6            1100001  97       a
1100     12       Form feed                  110111   55       7            1100010  98       b
1101     13       Carriage return            111000   56       8            1100011  99       c
1110     14       Shift out                  111001   57       9            1100100  100      d
1111     15       Shift in                   111010   58       :            1100101  101      e
10000    16       Data link escape           111011   59       ;            1100110  102      f
10001    17       Device control one         111100   60       <            1100111  103      g
10010    18       Device control two         111101   61       =            1101000  104      h
10011    19       Device control three       111110   62       >            1101001  105      i
10100    20       Device control four        111111   63       ?            1101010  106      j
10101    21       Negative acknowledge       1000000  64       @            1101011  107      k
10110    22       Synchronous idle           1000001  65       A            1101100  108      l
10111    23       End of transmission block  1000010  66       B            1101101  109      m
11000    24       Cancel                     1000011  67       C            1101110  110      n
11001    25       End of medium              1000100  68       D            1101111  111      o
11010    26       Substitute                 1000101  69       E            1110000  112      p
11011    27       Escape                     1000110  70       F            1110001  113      q
11100    28       File separator             1000111  71       G            1110010  114      r
11101    29       Group separator            1001000  72       H            1110011  115      s
11110    30       Record separator           1001001  73       I            1110100  116      t
11111    31       Unit separator             1001010  74       J            1110101  117      u
100000   32       Space                      1001011  75       K            1110110  118      v
100001   33       !                          1001100  76       L            1110111  119      w
100010   34       "                          1001101  77       M            1111000  120      x
100011   35       #                          1001110  78       N            1111001  121      y
100100   36       $                          1001111  79       O            1111010  122      z
100101   37       %                          1010000  80       P            1111011  123      {
100110   38       &                          1010001  81       Q            1111100  124      |
100111   39       '                          1010010  82       R            1111101  125      }
101000   40       (                          1010011  83       S            1111110  126      ~
101001   41       )                          1010100  84       T            1111111  127      Delete
101010   42       *                          1010101  85       U
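If you would like to check any entry for yourself, the following sketch regenerates rows of the ASCII table in Python. The function name `ascii_rows` is invented for this example, and `repr` is used so that control characters show up as visible escape codes rather than being sent to the screen.

```python
def ascii_rows(start=0, stop=128):
    """Yield (binary, integer, character) for each ASCII code in the range."""
    for code in range(start, stop):
        # format(code, "b") writes the number in binary with no leading zeros.
        yield format(code, "b"), code, repr(chr(code))


# Print a few rows: the capital letters A to E.
for binary, number, char in ascii_rows(65, 70):
    print(binary, number, char)
# 1000001 65 'A'
# 1000010 66 'B'
# 1000011 67 'C'
# 1000100 68 'D'
# 1000101 69 'E'
```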