SECTION D
1. Explain in brief Accessing Values in Strings, Updating Strings and Escape Characters
Accessing Values in Strings
Python does not support a separate character type; a single character is treated as a string of length one, and is therefore also a substring.
To access substrings, use the square brackets for slicing along with the index or indices to obtain your substring. For example −
#!/usr/bin/python3
var1 = 'Hello World!'
var2 = "Python Programming"
print("var1[0]:", var1[0])
print("var2[1:5]:", var2[1:5])
When the above code is executed, it produces the following result −
var1[0]: H
var2[1:5]: ytho
Updating Strings
You can "update" an existing string by (re)assigning a variable to another string. The new value can be related to its previous value or to a completely different string altogether. For example −
#!/usr/bin/python3
var1 = 'Hello World!'
print("Updated String :-", var1[:6] + 'Python')
When the above code is executed, it produces the following result −
Updated String :- Hello Python
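The word "update" is in quotes because a string itself cannot be changed in place. The short sketch below (variable names are illustrative, not from the text above) shows that assigning to an index fails, while building a new string and rebinding the name works:

```python
greeting = 'Hello World!'

# Strings are immutable: assigning to an index raises TypeError.
try:
    greeting[0] = 'J'
    item_assignment_worked = True
except TypeError:
    item_assignment_worked = False

# "Updating" builds a new string and rebinds the name to it.
greeting = greeting[:6] + 'Python'

print(item_assignment_worked)  # False
print(greeting)                # Hello Python
```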
Escape Characters
The following table lists escape or non-printable characters that can be represented with backslash notation.
An escape sequence is interpreted in both single-quoted and double-quoted strings.
Backslash notation | Hexadecimal character | Description |
\a | 0x07 | Bell or alert |
\b | 0x08 | Backspace |
\cx | | Control-x |
\C-x | | Control-x |
\e | 0x1b | Escape |
\f | 0x0c | Formfeed |
\M-\C-x | | Meta-Control-x |
\n | 0x0a | Newline |
\nnn | | Octal notation, where n is in the range 0-7 |
\r | 0x0d | Carriage return |
\s | 0x20 | Space |
\t | 0x09 | Tab |
\v | 0x0b | Vertical tab |
\x | | Character x |
\xnn | | Hexadecimal notation, where n is in the range 0-9, a-f, or A-F |
Note: Python string literals do not recognize \cx, \C-x, \M-\C-x, \e or \s, and a backslash before an arbitrary character x does not yield x; an unrecognized escape is left in the string unchanged, backslash included.
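A few of these escapes can be checked directly. The snippet below is a small sketch with illustrative variable names; it compares a normal string with a raw string and shows the octal and hexadecimal notations:

```python
newline = '\n'        # one character: line feed (0x0a)
raw_newline = r'\n'   # raw string: two characters, a backslash and 'n'
hex_a = '\x41'        # hexadecimal notation for 'A'
octal_a = '\101'      # octal notation for 'A'

print(len(newline), len(raw_newline))  # 1 2
print(hex_a, octal_a)                  # A A
print('tab\there' == "tab\there")      # True: both quote styles interpret \t
```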
2. What are string special operators and string formatting operators?
String Special Operators
Assume string variable a holds 'Hello' and variable b holds 'Python', then −
Operator | Description | Example |
+ | Concatenation - Adds values on either side of the operator | a + b gives HelloPython |
* | Repetition - Creates a new string by concatenating multiple copies of the same string | a*2 gives HelloHello |
[] | Slice - Gives the character at the given index | a[1] gives e |
[ : ] | Range Slice - Gives the characters in the given range | a[1:4] gives ell |
in | Membership - Returns True if a character exists in the given string | 'H' in a gives True |
not in | Membership - Returns True if a character does not exist in the given string | 'M' not in a gives True |
r/R | Raw String - Suppresses the actual meaning of escape characters. The syntax for raw strings is exactly the same as for normal strings with the exception of the raw string operator, the letter "r", which precedes the quotation marks. The "r" can be lowercase (r) or uppercase (R) and must be placed immediately preceding the first quote mark. | print(r'\n') prints \n and print(R'\n') prints \n |
% | Format - Performs string formatting | See the next section |
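The operators in the table can be tried out as follows. This is only a quick sketch using the same values for a and b as assumed above:

```python
a = 'Hello'
b = 'Python'

print(a + b)         # HelloPython
print(a * 2)         # HelloHello
print(a[1])          # e
print(a[1:4])        # ell
print('H' in a)      # True
print('M' not in a)  # True
print(r'\n')         # \n  (backslash and n, no newline)
```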
String Formatting Operator
One of Python's most useful features is the string format operator %. This operator is unique to strings and makes up for the lack of functions from C's printf() family. Following is a simple example −
#!/usr/bin/python3
print("My name is %s and weight is %d kg!" % ('Zara', 21))
When the above code is executed, it produces the following result −
My name is Zara and weight is 21 kg!
Here is the complete list of symbols that can be used with % −
Format Symbol | Conversion |
%c | Character |
%s | String conversion via str() prior to formatting |
%i | Signed decimal integer |
%d | Signed decimal integer |
%u | Unsigned decimal integer |
%o | Octal integer |
%x | Hexadecimal integer (lowercase letters) |
%X | Hexadecimal integer (UPPERcase letters) |
%e | Exponential notation (with lowercase 'e') |
%E | Exponential notation (with UPPERcase 'E') |
%f | Floating point real number |
%g | The shorter of %f and %e |
%G | The shorter of %f and %E |
Other supported symbols and functionality are listed in the following table −
Symbol | Functionality |
* | Argument specifies width or precision |
- | Left justification |
+ | Display the sign |
<sp> | Leave a blank space before a positive number |
# | Add the octal leading zero ( '0' ) or hexadecimal leading '0x' or '0X', depending on whether 'x' or 'X' were used. |
0 | Pad from left with zeros (instead of spaces) |
% | '%%' leaves you with a single literal '%' |
(var) | Mapping variable (dictionary arguments) |
m.n | m is the minimum total width and n is the number of digits to display after the decimal point (if applicable) |
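The flags in the table are easier to follow with concrete values. The sketch below uses illustrative variable names and assumes Python 3, where the # flag gives 0x for hexadecimal:

```python
padded_right = '%5d|' % 42         # minimum width 5, right-justified
padded_left = '%-5d|' % 42         # '-' left-justifies within the width
zero_padded = '%05d' % 42          # '0' pads with zeros instead of spaces
signed = '%+d' % 42                # '+' always shows the sign
spaced = '% d' % 42                # ' ' leaves a blank before a positive number
prefixed_hex = '%#x' % 255         # '#' adds the 0x prefix
fixed_point = '%8.3f|' % 3.14159   # m.n: width 8, 3 digits after the point
star_width = '%*d' % (6, 42)       # '*' takes the width from the arguments
from_mapping = '%(name)s weighs %(kg)d kg' % {'name': 'Zara', 'kg': 21}
literal_percent = '%d%%' % 100     # '%%' gives a single literal '%'

for result in (padded_right, padded_left, zero_padded, signed, spaced,
               prefixed_hex, fixed_point, star_width, from_mapping,
               literal_percent):
    print(repr(result))
```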
3. Write some of the Built-in String Methods
Python includes the following built-in methods to manipulate strings (len(), max() and min() are built-in functions that take the string as an argument) −
Sr.No. | Methods with Description |
1 | capitalize() Capitalizes first letter of string |
2 | center(width, fillchar) Returns a string padded with fillchar (a space by default) with the original string centered to a total of width columns. |
3 | count(str, beg=0, end=len(string)) Counts how many times str occurs in string or in a substring of string if starting index beg and ending index end are given. |
4 | decode(encoding='UTF-8', errors='strict') Decodes the string using the codec registered for encoding. Encoding defaults to the default string encoding. In Python 3 this is a method of bytes objects, not of str. |
5 | encode(encoding='UTF-8', errors='strict') Returns encoded string version of string; on error, default is to raise a ValueError unless errors is given with 'ignore' or 'replace'. |
6 | endswith(suffix, beg=0, end=len(string)) Determines if string or a substring of string (if starting index beg and ending index end are given) ends with suffix; returns true if so and false otherwise. |
7 | expandtabs(tabsize=8) Expands tabs in string to multiple spaces; defaults to 8 spaces per tab if tabsize not provided. |
8 | find(str, beg=0, end=len(string)) Determines if str occurs in string or in a substring of string if starting index beg and ending index end are given; returns index if found and -1 otherwise. |
9 | index(str, beg=0, end=len(string)) Same as find(), but raises an exception if str not found. |
10 | isalnum() Returns true if string has at least 1 character and all characters are alphanumeric and false otherwise. |
11 | isalpha() Returns true if string has at least 1 character and all characters are alphabetic and false otherwise. |
12 | isdigit() Returns true if string contains only digits and false otherwise. |
13 | islower() Returns true if string has at least 1 cased character and all cased characters are in lowercase and false otherwise. |
14 | isnumeric() Returns true if a unicode string contains only numeric characters and false otherwise. |
15 | isspace() Returns true if string contains only whitespace characters and false otherwise. |
16 | istitle() Returns true if string is properly "titlecased" and false otherwise. |
17 | isupper() Returns true if string has at least one cased character and all cased characters are in uppercase and false otherwise. |
18 | join(seq) Merges (concatenates) the strings in sequence seq into one string, using the string it is called on as the separator. |
19 | len(string) Returns the length of the string |
20 | ljust(width[, fillchar]) Returns a space-padded string with the original string left-justified to a total of width columns. |
21 | lower() Converts all uppercase letters in string to lowercase. |
22 | lstrip() Removes all leading whitespace in string. |
23 | maketrans() Returns a translation table to be used in translate function. |
24 | max(str) Returns the max alphabetical character from the string str. |
25 | min(str) Returns the min alphabetical character from the string str. |
26 | replace(old, new[, max]) Replaces all occurrences of old in string with new or at most max occurrences if max given. |
27 | rfind(str, beg=0, end=len(string)) Same as find(), but search backwards in string. |
28 | rindex(str, beg=0, end=len(string)) Same as index(), but search backwards in string. |
29 | rjust(width[, fillchar]) Returns a space-padded string with the original string right-justified to a total of width columns. |
30 | rstrip() Removes all trailing whitespace of string. |
31 | split(str="", num=string.count(str)) Splits string according to delimiter str (whitespace if not provided) and returns list of substrings; split into at most num substrings if given. |
32 | splitlines([keepends]) Splits string at line boundaries and returns a list of the lines, with the line breaks removed unless keepends is true. |
33 | startswith(str, beg=0, end=len(string)) Determines if string or a substring of string (if starting index beg and ending index end are given) starts with substring str; returns true if so and false otherwise. |
34 | strip([chars]) Performs both lstrip() and rstrip() on string. |
35 | swapcase() Inverts case for all letters in string. |
36 | title() Returns "titlecased" version of string, that is, all words begin with uppercase and the rest are lowercase. |
37 | translate(table, deletechars="") Translates string according to translation table str(256 chars), removing those in the del string. |
38 | upper() Converts lowercase letters in string to uppercase. |
39 | zfill(width) Returns original string left-padded with zeros to a total of width characters; intended for numbers, zfill() retains any sign given (less one zero). |
40 | isdecimal() Returns true if a unicode string contains only decimal characters and false otherwise. |
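A handful of the methods from the table are shown in action below. This is a minimal sketch; the sample string and variable names are made up for illustration:

```python
text = '  hello world  '
clean = text.strip()                    # remove leading and trailing whitespace

capitalized = clean.capitalize()        # 'Hello world'
titled = clean.title()                  # 'Hello World'
centered = clean.center(15, '*')        # '**hello world**'
o_count = clean.count('o')              # 2
found = clean.find('world')             # 6
missing = clean.find('xyz')             # -1 when the substring is absent
replaced = clean.replace('l', 'L', 2)   # at most 2 replacements: 'heLLo world'
parts = clean.split()                   # ['hello', 'world']
joined = '-'.join(['a', 'b', 'c'])      # 'a-b-c'
zero_filled = '-42'.zfill(5)            # sign is kept: '-0042'

print(capitalized, titled, centered)
print(o_count, found, missing)
print(replaced, parts, joined, zero_filled)
print('abc123'.isalnum(), 'abc123'.isalpha(), '123'.isdigit())  # True False True
```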
4. Define Unicode string
Introduction
Models that process natural language often handle different languages with different character sets. Unicode is a standard encoding system that is used to represent characters from almost all languages. Each character is encoded using a unique integer code point between 0 and 0x10FFFF. A Unicode string is a sequence of zero or more code points.
This section shows how to represent Unicode strings in TensorFlow and manipulate them using Unicode equivalents of standard string ops. It also separates Unicode strings into tokens based on script detection.
import tensorflow as tf
The tf.string data type
The basic TensorFlow tf.string dtype allows you to build tensors of byte strings. Unicode strings are UTF-8 encoded by default.
tf.constant(u"Thanks 😊")
<tf.Tensor: shape=(), dtype=string, numpy=b'Thanks \xf0\x9f\x98\x8a'>
A tf.string tensor can hold byte strings of varying lengths because the byte strings are treated as atomic units. The string length is not included in the tensor dimensions.
tf.constant([u"You're", u"welcome!"]).shape
TensorShape([2])
Note: When using Python to construct strings, the handling of Unicode differs between v2 and v3. In v2, Unicode strings are indicated by the "u" prefix, as above. In v3, strings are Unicode-encoded by default.
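The definition of a Unicode string as a sequence of code points can be seen in plain Python 3, without TensorFlow. The sketch below uses illustrative variable names and shows the code points of a string next to its UTF-8 bytes:

```python
text = 'Thanks 😊'

# A Python 3 str is a sequence of code points; ord() gives each one as an int.
code_points = [ord(ch) for ch in text]

# Encoding turns the code points into bytes; the emoji needs 4 bytes in UTF-8.
utf8_bytes = text.encode('utf-8')

print(code_points[-1], hex(code_points[-1]))  # 128522 0x1f60a
print(utf8_bytes)                             # b'Thanks \xf0\x9f\x98\x8a'
print(len(text), len(utf8_bytes))             # 8 11
```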
5. Write in brief representation of Unicode
Representing Unicode
There are two standard ways to represent a Unicode string in TensorFlow:
- String scalar: the sequence of code points is encoded using a known character encoding.
- int32 vector: each position contains a single code point.
For example, the following three values all represent the Unicode string "语言处理" (which means "language processing" in Chinese):
# Unicode string, represented as a UTF-8 encoded string scalar.
text_utf8 = tf.constant(u"语言处理")
text_utf8
<tf.Tensor: shape=(), dtype=string, numpy=b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86'>
# Unicode string, represented as a UTF-16-BE encoded string scalar.
text_utf16be = tf.constant(u"语言处理".encode("UTF-16-BE"))
text_utf16be
<tf.Tensor: shape=(), dtype=string, numpy=b'\x8b\xed\x8a\x00Y\x04t\x06'>
# Unicode string, represented as a vector of Unicode code points.
text_chars = tf.constant([ord(char) for char in u"语言处理"])
text_chars
<tf.Tensor: shape=(4,), dtype=int32, numpy=array([35821, 35328, 22788, 29702], dtype=int32)>
Converting between representations
TensorFlow provides operations to convert between these different representations:
- tf.strings.unicode_decode: Converts an encoded string scalar to a vector of code points.
- tf.strings.unicode_encode: Converts a vector of code points to an encoded string scalar.
- tf.strings.unicode_transcode: Converts an encoded string scalar to a different encoding.
tf.strings.unicode_decode(text_utf8,
                          input_encoding='UTF-8')
<tf.Tensor: shape=(4,), dtype=int32, numpy=array([35821, 35328, 22788, 29702], dtype=int32)>
tf.strings.unicode_encode(text_chars,
                          output_encoding='UTF-8')
<tf.Tensor: shape=(), dtype=string, numpy=b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86'>
tf.strings.unicode_transcode(text_utf8,
                             input_encoding='UTF8',
                             output_encoding='UTF-16-BE')
<tf.Tensor: shape=(), dtype=string, numpy=b'\x8b\xed\x8a\x00Y\x04t\x06'>
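For comparison, the same three conversions can be sketched with plain Python strings and bytes. This is not the TensorFlow API, only an illustration of what decode, encode and transcode mean; the variable names are illustrative:

```python
utf8_bytes = '语言处理'.encode('utf-8')

# "Decode": encoded bytes -> list of code points.
decoded = [ord(ch) for ch in utf8_bytes.decode('utf-8')]

# "Encode": list of code points -> encoded bytes.
encoded = ''.join(chr(cp) for cp in decoded).encode('utf-8')

# "Transcode": bytes in one encoding -> bytes in another encoding.
transcoded = utf8_bytes.decode('utf-8').encode('utf-16-be')

print(decoded)     # [35821, 35328, 22788, 29702]
print(encoded)     # b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86'
print(transcoded)  # b'\x8b\xed\x8a\x00Y\x04t\x06'
```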
Batch dimensions
When decoding multiple strings, the number of characters in each string may not be equal. The returned result is a tf.RaggedTensor, where the length of the innermost dimension varies depending on the number of characters in each string:
# A batch of Unicode strings, each represented as a UTF8-encoded string.
batch_utf8 = [s.encode('UTF-8') for s in
              [u'hÃllo', u'What is the weather tomorrow', u'Göödnight', u'😊']]
batch_chars_ragged = tf.strings.unicode_decode(batch_utf8,
                                               input_encoding='UTF-8')
for sentence_chars in batch_chars_ragged.to_list():
  print(sentence_chars)
[104, 195, 108, 108, 111]
[87, 104, 97, 116, 32, 105, 115, 32, 116, 104, 101, 32, 119, 101, 97, 116, 104, 101, 114, 32, 116, 111, 109, 111, 114, 114, 111, 119]
[71, 246, 246, 100, 110, 105, 103, 104, 116]
[128522]
You can use this tf.RaggedTensor directly, or convert it to a dense tf.Tensor with padding or a tf.SparseTensor using the methods tf.RaggedTensor.to_tensor and tf.RaggedTensor.to_sparse.
batch_chars_padded = batch_chars_ragged.to_tensor(default_value=-1)
print(batch_chars_padded.numpy())
[[ 104 195 108 108 111 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1]
[ 87 104 97 116 32 105 115 32 116 104
101 32 119 101 97 116 104 101 114 32
116 111 109 111 114 114 111 119]
[ 71 246 246 100 110 105 103 104 116 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1]
[128522 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1]]
batch_chars_sparse = batch_chars_ragged.to_sparse()
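The idea behind padding a ragged batch can be sketched with ordinary Python lists. This is only an analogy for what to_tensor(default_value=-1) produces, not TensorFlow code, and the names are illustrative:

```python
# A "ragged" batch: rows of code points with different lengths.
batch_chars = [[ord(ch) for ch in s] for s in ['hÃllo', 'Göödnight', '😊']]

# Pad every row with -1 up to the length of the longest row.
max_len = max(len(row) for row in batch_chars)
padded = [row + [-1] * (max_len - len(row)) for row in batch_chars]

# Dropping the padding value recovers the ragged rows.
ragged_again = [[cp for cp in row if cp != -1] for row in padded]

for row in padded:
    print(row)
print(ragged_again == batch_chars)  # True
```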
When encoding multiple strings with the same lengths, a tf.Tensor may be used as input:
tf.strings.unicode_encode([[99, 97, 116], [100, 111, 103], [99, 111, 119]],
                          output_encoding='UTF-8')
<tf.Tensor: shape=(3,), dtype=string, numpy=array([b'cat', b'dog', b'cow'], dtype=object)>
When encoding multiple strings with varying length, a tf.RaggedTensor should be used as input:
tf.strings.unicode_encode(batch_chars_ragged, output_encoding='UTF-8')
<tf.Tensor: shape=(4,), dtype=string, numpy=
array([b'h\xc3\x83llo', b'What is the weather tomorrow',
       b'G\xc3\xb6\xc3\xb6dnight', b'\xf0\x9f\x98\x8a'], dtype=object)>
If you have a tensor with multiple strings in padded or sparse format, then convert it to a tf.RaggedTensor before calling unicode_encode:
tf.strings.unicode_encode(
    tf.RaggedTensor.from_sparse(batch_chars_sparse),
    output_encoding='UTF-8')
<tf.Tensor: shape=(4,), dtype=string, numpy=
array([b'h\xc3\x83llo', b'What is the weather tomorrow',
       b'G\xc3\xb6\xc3\xb6dnight', b'\xf0\x9f\x98\x8a'], dtype=object)>
tf.strings.unicode_encode(
    tf.RaggedTensor.from_tensor(batch_chars_padded, padding=-1),
    output_encoding='UTF-8')
<tf.Tensor: shape=(4,), dtype=string, numpy=
array([b'h\xc3\x83llo', b'What is the weather tomorrow',
       b'G\xc3\xb6\xc3\xb6dnight', b'\xf0\x9f\x98\x8a'], dtype=object)>
6. What are Unicode operations?
Character length
The tf.strings.length operation has a parameter unit, which indicates how lengths should be computed. unit defaults to "BYTE", but it can be set to other values, such as "UTF8_CHAR" or "UTF16_CHAR", to determine the number of Unicode codepoints in each encoded string.
# Note that the final character takes up 4 bytes in UTF8.
thanks = u'Thanks 😊'.encode('UTF-8')
num_bytes = tf.strings.length(thanks).numpy()
num_chars = tf.strings.length(thanks, unit='UTF8_CHAR').numpy()
print('{} bytes; {} UTF-8 characters'.format(num_bytes, num_chars))
11 bytes; 8 UTF-8 characters
Character substrings
Similarly, the tf.strings.substr operation accepts the "unit" parameter, and uses it to determine what kind of offsets the "pos" and "len" parameters contain.
# default: unit='BYTE'. With len=1, we return a single byte.
tf.strings.substr(thanks, pos=7, len=1).numpy()
b'\xf0'
# Specifying unit='UTF8_CHAR', we return a single character, which in this case
# is 4 bytes.
print(tf.strings.substr(thanks, pos=7, len=1, unit='UTF8_CHAR').numpy())
b'\xf0\x9f\x98\x8a'
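The difference between byte units and character units also shows up in plain Python, where slicing bytes counts bytes and slicing a str counts code points. A small sketch with illustrative names, mirroring the values above:

```python
text = 'Thanks 😊'
thanks = text.encode('utf-8')

num_bytes = len(thanks)  # length in bytes
num_chars = len(text)    # length in code points

byte_at_7 = thanks[7:8]                # one byte: only the first byte of the emoji
char_at_7 = text[7:8].encode('utf-8')  # one character: all 4 bytes of the emoji

print('{} bytes; {} characters'.format(num_bytes, num_chars))  # 11 bytes; 8 characters
print(byte_at_7)  # b'\xf0'
print(char_at_7)  # b'\xf0\x9f\x98\x8a'
```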
Split Unicode strings
The tf.strings.unicode_split operation splits unicode strings into substrings of individual characters:
tf.strings.unicode_split(thanks, 'UTF-8').numpy()
array([b'T', b'h', b'a', b'n', b'k', b's', b' ', b'\xf0\x9f\x98\x8a'],
      dtype=object)
Byte offsets for characters
To align the character tensor generated by tf.strings.unicode_decode with the original string, it's useful to know the offset for where each character begins. The method tf.strings.unicode_decode_with_offsets is similar to unicode_decode, except that it returns a second tensor containing the start offset of each character.
codepoints, offsets = tf.strings.unicode_decode_with_offsets(u"🎈🎉🎊", 'UTF-8')
for (codepoint, offset) in zip(codepoints.numpy(), offsets.numpy()):
  print("At byte offset {}: codepoint {}".format(offset, codepoint))
At byte offset 0: codepoint 127880
At byte offset 4: codepoint 127881
At byte offset 8: codepoint 127882
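The offsets can be reproduced by hand: each character starts where the previous one's UTF-8 bytes end. The following plain-Python sketch (illustrative names, not the TensorFlow op) computes the same pairs:

```python
text = '🎈🎉🎊'

offsets = []
position = 0
for ch in text:
    offsets.append(position)             # byte offset where this character starts
    position += len(ch.encode('utf-8'))  # advance by the character's UTF-8 size

pairs = list(zip(offsets, (ord(ch) for ch in text)))
for offset, codepoint in pairs:
    print("At byte offset {}: codepoint {}".format(offset, codepoint))
```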
7. Write an Example: Simple segmentation
Segmentation is the task of splitting text into word-like units. This is often easy when space characters are used to separate words, but some languages (like Chinese and Japanese) do not use spaces, and some languages (like German) contain long compounds that must be split in order to analyze their meaning. In web text, different languages and scripts are frequently mixed together, as in "NY株価" (New York Stock Exchange).
We can perform very rough segmentation (without implementing any ML models) by using changes in script to approximate word boundaries. This will work for strings like the "NY株価" example above. It will also work for most languages that use spaces, as the space characters of various scripts are all classified as USCRIPT_COMMON, a special script code that differs from that of any actual text.
# dtype: string; shape: [num_sentences]
#
# The sentences to process. Edit this line to try out different inputs!
sentence_texts = [u'Hello, world.', u'世界こんにちは']
First, we decode the sentences into character codepoints, and find the script identifier for each character.
# dtype: int32; shape: [num_sentences, (num_chars_per_sentence)]
#
# sentence_char_codepoint[i, j] is the codepoint for the j'th character in
# the i'th sentence.
sentence_char_codepoint = tf.strings.unicode_decode(sentence_texts, 'UTF-8')
print(sentence_char_codepoint)
# dtype: int32; shape: [num_sentences, (num_chars_per_sentence)]
#
# sentence_char_script[i, j] is the unicode script of the j'th character in
# the i'th sentence.
sentence_char_script = tf.strings.unicode_script(sentence_char_codepoint)
print(sentence_char_script)
<tf.RaggedTensor [[72, 101, 108, 108, 111, 44, 32, 119, 111, 114, 108, 100, 46], [19990, 30028, 12371, 12435, 12395, 12385, 12399]]>
<tf.RaggedTensor [[25, 25, 25, 25, 25, 0, 0, 25, 25, 25, 25, 25, 0], [17, 17, 20, 20, 20, 20, 20]]>
Next, we use those script identifiers to determine where word boundaries should be added. We add a word boundary at the beginning of each sentence, and for each character whose script differs from the previous character:
# dtype: bool; shape: [num_sentences, (num_chars_per_sentence)]
#
# sentence_char_starts_word[i, j] is True if the j'th character in the i'th
# sentence is the start of a word.
sentence_char_starts_word = tf.concat(
    [tf.fill([sentence_char_script.nrows(), 1], True),
     tf.not_equal(sentence_char_script[:, 1:], sentence_char_script[:, :-1])],
    axis=1)
# dtype: int64; shape: [num_words]
#
# word_starts[i] is the index of the character that starts the i'th word (in
# the flattened list of characters from all sentences).
word_starts = tf.squeeze(tf.where(sentence_char_starts_word.values), axis=1)
print(word_starts)
tf.Tensor([ 0  5  7 12 13 15], shape=(6,), dtype=int64)
We can then use those start offsets to build a RaggedTensor containing the list of words from all batches:
# dtype: int32; shape: [num_words, (num_chars_per_word)]
#
# word_char_codepoint[i, j] is the codepoint for the j'th character in the
# i'th word.
word_char_codepoint = tf.RaggedTensor.from_row_starts(
    values=sentence_char_codepoint.values,
    row_starts=word_starts)
print(word_char_codepoint)
<tf.RaggedTensor [[72, 101, 108, 108, 111], [44, 32], [119, 111, 114, 108, 100], [46], [19990, 30028], [12371, 12435, 12395, 12385, 12399]]>
And finally, we can segment the word codepoints RaggedTensor back into sentences:
# dtype: int64; shape: [num_sentences]
#
# sentence_num_words[i] is the number of words in the i'th sentence.
sentence_num_words = tf.reduce_sum(
    tf.cast(sentence_char_starts_word, tf.int64),
    axis=1)
# dtype: int32; shape: [num_sentences, (num_words_per_sentence), (num_chars_per_word)]
#
# sentence_word_char_codepoint[i, j, k] is the codepoint for the k'th character
# in the j'th word in the i'th sentence.
sentence_word_char_codepoint = tf.RaggedTensor.from_row_lengths(
    values=word_char_codepoint,
    row_lengths=sentence_num_words)
print(sentence_word_char_codepoint)
<tf.RaggedTensor [[[72, 101, 108, 108, 111], [44, 32], [119, 111, 114, 108, 100], [46]], [[19990, 30028], [12371, 12435, 12395, 12385, 12399]]]>
To make the final result easier to read, we can encode it back into UTF-8 strings:
tf.strings.unicode_encode(sentence_word_char_codepoint, 'UTF-8').to_list()
[[b'Hello', b', ', b'world', b'.'],
[b'\xe4\xb8\x96\xe7\x95\x8c',
b'\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf']]
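The same "split where the script changes" idea can be sketched in plain Python. The standard library has no script lookup, so this sketch assumes a crude stand-in: the first word of a letter's Unicode name (LATIN, CJK, HIRAGANA, ...), with every non-letter treated as COMMON. The helpers rough_script and segment are hypothetical names, not part of TensorFlow:

```python
import itertools
import unicodedata


def rough_script(ch):
    """Crude stand-in for a script code (an assumption of this sketch):
    first word of the Unicode name for letters, 'COMMON' for the rest."""
    if ch.isalpha():
        return unicodedata.name(ch, 'UNKNOWN').split()[0]
    return 'COMMON'


def segment(sentence):
    """Start a new word whenever rough_script changes between characters."""
    return [''.join(group)
            for _, group in itertools.groupby(sentence, key=rough_script)]


segmented = [segment(s) for s in ['Hello, world.', '世界こんにちは']]
print(segmented)
# [['Hello', ', ', 'world', '.'], ['世界', 'こんにちは']]
```

Real script detection is more precise than name prefixes, so this only illustrates the boundary rule itself.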
8. What are some of the String Manipulation operations explain in brief?
To manipulate strings, we can use some of Python's built-in methods.
Creation
>>> word = "Hello World"
>>> print(word)
Hello World
Accessing
Use [ ] to access characters in a string
>>> word = "Hello World"
>>> letter = word[0]
>>> print(letter)
H
Length
>>> word = "Hello World"
>>> len(word)
11
Finding
>>> word = "Hello World"
>>> print(word.count('l'))  # count how many times l is in the string
3
>>> print(word.find("H"))  # find the index of the letter H in the string
0
>>> print(word.index("World"))  # find the index where World starts in the string
6
Count
>>> s = "Count, the number of spaces"
>>> print(s.count(' '))
4
Slicing
Use [ # : # ] to get a set of letters
Keep in mind that Python, like many other languages, starts counting from 0!
word = "Hello World"
print(word[0])    # get one char of the word
print(word[0:1])  # get one char of the word (same as above)
print(word[0:3])  # get the first three chars
print(word[:3])   # get the first three chars
print(word[-3:])  # get the last three chars
print(word[3:])   # get all but the first three chars
print(word[:-3])  # get all but the last three chars
word = "Hello World"
word[start:end]  # items start through end-1
word[start:]     # items start through the rest of the string
word[:end]       # items from the beginning through end-1
word[:]          # a copy of the whole string
Split Strings
>>> word = "Hello World"
>>> word.split(' ')  # split on a single space
['Hello', 'World']
Startswith / Endswith
>>> word = "Hello World"
>>> word.startswith("H")
True
>>> word.endswith("d")
True
>>> word.endswith("w")
False
Repeat Strings
>>> print("." * 10)  # prints ten dots
..........
Replacing
>>> word = "Hello World"
>>> word.replace("Hello", "Goodbye")
'Goodbye World'
Changing Upper and Lower Case Strings
>>> string = "Hello World"
>>> print(string.upper())
HELLO WORLD
>>> print(string.lower())
hello world
>>> print(string.title())
Hello World
>>> print(string.capitalize())
Hello world
>>> print(string.swapcase())
hELLO wORLD
Reversing
>>> string = "Hello World"
>>> print(' '.join(reversed(string)))
d l r o W   o l l e H
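Two other common ways to reverse text are sketched below: a slice with a step of -1, and reversing the order of words rather than characters. The variable names are illustrative:

```python
text = "Hello World"

reversed_slice = text[::-1]                        # slice with step -1
reversed_joined = ''.join(reversed(text))          # same result, no separator
reversed_words = ' '.join(reversed(text.split()))  # reverse word order only

print(reversed_slice)   # dlroW olleH
print(reversed_joined)  # dlroW olleH
print(reversed_words)   # World Hello
```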
Strip
Python strings have the strip(), lstrip() and rstrip() methods for removing
characters from the ends of a string.
If the characters to be removed are not specified, whitespace is removed.
>>> word = "Hello World"
Strip off newline characters from the ends of the string
>>> print(word.strip('\n'))
Hello World
strip()   # removes from both ends
lstrip()  # removes leading characters (Left-strip)
rstrip()  # removes trailing characters (Right-strip)
>>> word = "  xyz  "
>>> word
'  xyz  '
>>> word.strip()
'xyz'
>>> word.lstrip()
'xyz  '
>>> word.rstrip()
'  xyz'
Concatenation
To concatenate strings in Python use the “+” operator.
"Hello " + "World"        # = "Hello World"
"Hello " + "World" + "!"  # = "Hello World!"
Join
>>> word = "Hello World"
>>> print(":".join(word))  # add a : between every char
H:e:l:l:o: :W:o:r:l:d
>>> print(" ".join(word))  # add a whitespace between every char
H e l l o   W o r l d
9. Explain Compare strings in python
You can use ( > , < , >= , <= , == , != ) to compare two strings. Python compares strings lexicographically, i.e. character by character using the Unicode code point of each character (the same as its ASCII value for ASCII text).
Suppose you have str1 as "Mary" and str2 as "Mac". The first characters of str1 and str2 (M and M) are compared. As they are equal, the second characters (a and a) are compared. Because they are also equal, the third characters (r and c) are compared. And because r has a greater ASCII value than c, str1 is greater than str2.
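The walk-through for "Mary" and "Mac" can be reproduced step by step. The sketch below (illustrative names) finds the first position where the two strings differ and shows the values that decide the comparison:

```python
str1 = "Mary"
str2 = "Mac"

# Walk both strings in parallel until the characters differ.
for ch1, ch2 in zip(str1, str2):
    if ch1 != ch2:
        first_difference = (ch1, ord(ch1), ch2, ord(ch2))
        break

print(first_difference)   # ('r', 114, 'c', 99)
print(str1 > str2)        # True, because 114 > 99
print("Zebra" < "apple")  # True: uppercase letters have smaller values
```

The last line is a reminder that the comparison is case-sensitive, so it is not the same as dictionary order.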
Here are some more examples:
>>> "tim" == "tie"
False
>>> "free" != "freedom"
True
>>> "arrow" > "aron"
True
>>> "right" >= "left"
True
>>> "teeth" < "tee"
False
>>> "yellow" <= "fellow"
False
>>> "abc" > ""
True
10. Explain String Concatenation in Python
String concatenation is the technique of combining two or more strings.
We can perform string concatenation in the following ways:
- Using + operator
- Using join() method
- Using % operator
- Using format() function
Using + Operator
It’s very easy to use the + operator for string concatenation. This operator can be used to add multiple strings together. However, all operands must be strings.
Note: Strings are immutable, so concatenation never changes an existing string; it creates a new string, which is usually assigned to a variable.
Example:
# Python program to demonstrate
# string concatenation
# Defining strings
var1 = "Hello "
var2 = "World"
# + operator is used to combine strings
var3 = var1 + var2
print(var3)
Output:
Hello World
Here, the variable var1 stores the string “Hello ” and variable var2 stores the string “World”. The + Operator combines the string that is stored in the var1 and var2 and stores in another variable var3.
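The requirement that both operands be strings can be seen in a short sketch (the names count and message are illustrative): adding a number to a string raises TypeError, and converting with str() first avoids it.

```python
count = 3

try:
    message = "Items: " + count       # str + int is not allowed
except TypeError:
    message = "Items: " + str(count)  # convert to a string first

print(message)  # Items: 3
```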
Using join() Method
The join() method is a string method and returns a string in which the elements of sequence have been joined by str separator.
Example:
# Python program to demonstrate
# string concatenation
var1 = "Hello"
var2 = "World"
# join() method is used to combine the strings
print("".join([var1, var2]))
# join() method is used here to combine
# the strings with a separator, a space (" ")
var3 = " ".join([var1, var2])
print(var3)
Output:
HelloWorld
Hello World
In the above example, the variable var1 stores the string “Hello” and variable var2 stores the string “World”. The join() method combines the strings stored in var1 and var2. The join() method accepts any iterable of strings (a list or a tuple, for example) as its argument, and the iterable can be of any length. We can store the combined string, separated by a space, in another variable var3.
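As a small illustration of the point about the argument to join(), the sketch below passes a tuple and a generator instead of a list; the variable names are made up for the example:

```python
words = ("Hello", "World")  # a tuple works as well as a list
joined_tuple = " ".join(words)

# Items must be strings, so numbers are converted with str() first.
joined_numbers = ", ".join(str(n) for n in range(1, 4))

print(joined_tuple)    # Hello World
print(joined_numbers)  # 1, 2, 3
```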
Using % Operator
The % operator is used for string formatting, and it can also be used for string concatenation. It’s useful when we want to concatenate strings and perform simple formatting.
Example:
# Python program to demonstrate
# string concatenation
var1 = "Hello"
var2 = "World"
# % operator is used here to combine the strings
print("%s %s" % (var1, var2))
Output:
Hello World
Here, the % operator combines the strings stored in var1 and var2. The %s conversion denotes a string. The values of both variables are substituted for the %s placeholders, giving “Hello World”.
Using format() function
str.format() is one of the string formatting methods in Python, which allows multiple substitutions and value formatting. This method lets us concatenate elements within a string through positional formatting.
Example:
# Python program to demonstrate
# string concatenation
var1 = "Hello"
var2 = "World"
# format function is used here to
# combine the strings
print("{} {}".format(var1, var2))
# store the result in another variable
var3 = "{} {}".format(var1, var2)
print(var3)
Output:
Hello World
Hello World
Here, the format() function combines the strings stored in var1 and var2 and the result is stored in another variable var3. The curly braces {} mark the positions of the strings. The first argument is substituted into the first pair of curly braces and the second argument into the second pair. Finally it prints the value “Hello World”.