Python - How to use Record Linkage Toolkit to compare string vectors?

 


Posted by daniel (1 post) on 14/03/2019 13:43:45
I have implemented this code:

import pandas as pd
import recordlinkage as rl
from recordlinkage.index import Full

# args.file is the path to the CSV file (parsed elsewhere, e.g. with argparse).
# Every compared column is read as a string so values are not coerced to numbers.
columns = ["City", "Country", "State", "Email", "Identifier",
           "Family", "Given", "Prefix", "Suffix", "Phone"]
dfA = pd.read_csv(args.file, index_col="Full_url", sep=",", engine="c",
                  skipinitialspace=True, encoding="utf-8",
                  dtype={column: object for column in columns})

# Deduplication: build every possible pair of records within dfA.
indexer = rl.Index()
indexer.add(Full())
candidate_links = indexer.index(dfA)

# Register one comparison per column.
compare_cl = rl.Compare()
compare_cl.exact('Identifier', 'Identifier', label='Identifier')
compare_cl.string('City', 'City', method='jarowinkler', threshold=0.85, label='City')
compare_cl.string('Country', 'Country', method='jarowinkler', threshold=0.85, label='Country')
compare_cl.string('State', 'State', method='jarowinkler', threshold=0.85, label='State')
compare_cl.string('Email', 'Email', method='damerau_levenshtein', threshold=0.80, label='Email')
compare_cl.string('Family', 'Family', method='jarowinkler', threshold=0.80, label='Family')
compare_cl.string('Given', 'Given', method='jarowinkler', threshold=0.80, label='Given')
compare_cl.string('Prefix', 'Prefix', method='jarowinkler', threshold=0.80, label='Prefix')
compare_cl.string('Suffix', 'Suffix', method='jarowinkler', threshold=0.80, label='Suffix')
compare_cl.exact('Phone', 'Phone', label='Phone')

# One row per candidate pair, one column per comparison.
features = compare_cl.compute(candidate_links, dfA)


However, I have a problem: the 'Family' column holds a list of names
of variable length.

For example, a record could be: Family=Daniel||Alex||John||Felix

The items in the list are always separated by the string "||". Can I
compare the 'Family' columns as lists rather than as single strings?
How do I specify the separator?
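
To make the goal concrete, here is a minimal sketch of the kind of comparison I mean. It is only an assumption about how it could work, not something taken from the toolkit: the helper names (split_names, name_similarity, family_similarity) are hypothetical, and difflib.SequenceMatcher from the standard library stands in for Jaro-Winkler so the snippet runs without extra packages. Each cell is split on "||", and every name in the shorter list is scored against its best match in the longer one.

```python
from difflib import SequenceMatcher

SEPARATOR = "||"  # separator used in the 'Family' column


def split_names(value, sep=SEPARATOR):
    """Split one cell into a list of stripped, non-empty names.

    Non-string values (e.g. NaN or None for a missing cell) give an empty list.
    """
    if not isinstance(value, str):
        return []
    return [part.strip() for part in value.split(sep) if part.strip()]


def name_similarity(a, b):
    """Case-insensitive similarity in [0, 1] (stand-in for Jaro-Winkler)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def family_similarity(left, right, sep=SEPARATOR):
    """Average best-match similarity of the shorter name list against the longer."""
    names_left = split_names(left, sep)
    names_right = split_names(right, sep)
    if not names_left or not names_right:
        return 0.0
    shorter, longer = sorted((names_left, names_right), key=len)
    best_scores = [max(name_similarity(n, m) for m in longer) for n in shorter]
    return sum(best_scores) / len(best_scores)


# The order of the names does not matter; only the best matches count.
score = family_similarity("Daniel||Alex||John||Felix", "Alex||Daniel")
```

A function like this would still have to be applied to every candidate pair, for instance through a custom comparison registered on the Compare object, and the result compared against a threshold such as the 0.80 used above. Whether the toolkit offers a built-in way to do this is exactly what I am asking.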