Output all rows with word count in a column greater than 3

html5 • November 22, 2022, 9:32 pm • Q&A • 166 views

I have this dummy df:

import pandas as pd

columns = ['answer', 'some_number']
data = [['hello how are you doing', '1.0'],
        ['hello', '1.0'],
        ['bye bye bye bye', '0.0'],
        ['no', '0.0'],
        ['yes', '1.0'],
        ['Who let the dogs out', '0.0'],
        ['1 + 1 + 1 + 2', '1.0']]
df = pd.DataFrame(columns=columns, data=data)

I want to output the rows with a word count greater than 3.
Here that would be the rows 'hello how are you doing', 'bye bye bye bye', 'Who let the dogs out', and '1 + 1 + 1 + 2'.

My approach doesn't work: df[len(df.answer) > 3]

Output: KeyError: True

Answer

If the separator is ' ', you can try Series.str.count; otherwise, replace the separator with whatever your data uses:

n=3
df[df['answer'].str.count(' ').gt(n-1)]

To handle multiple consecutive spaces (credit: @piRSquared), count runs of whitespace instead:

df['answer'].str.count(r'\s+').gt(2)

Or using list comprehension:

n = 3
df[[len(i.split()) > n for i in df['answer']]]  # should be faster than the above

                    answer some_number
0  hello how are you doing         1.0
2          bye bye bye bye         0.0
5     Who let the dogs out         0.0
6            1 + 1 + 1 + 2         1.0

My vote goes to `count`, as it doesn't waste resources creating lists. However, to account for possible multiple spaces: `df['answer'].str.count(r'\s+').gt(2)`
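To illustrate why the whitespace regex matters, here is a minimal sketch on made-up data (the sample rows are my own assumption, not from the question). It compares counting single spaces with counting runs of whitespace. Note that leading or trailing whitespace would still count as a separator in both variants, so strip the strings first if that can occur.

```python
import pandas as pd

# Hypothetical sample data: the second row contains a double space.
df = pd.DataFrame({'answer': ['hello how are you doing',
                              'bye  bye bye bye',
                              'hello']})
n = 3

# Counting single spaces over-counts when words are separated by several spaces.
single_space = df['answer'].str.count(' ')

# Counting runs of whitespace treats each gap between words as one separator.
whitespace_runs = df['answer'].str.count(r'\s+')

# A row has more than n words when it contains at least n separators.
result = df[whitespace_runs.gt(n - 1)]
print(result)
```

Here the double space is counted twice by the single-space version but only once by the regex version, so only the latter reflects the real number of gaps between words.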

Answer

A couple more options using str.split():

Combine with str.len():

df[df.answer.str.split().str.len().gt(n)]

Or combine with apply(len):

df[df.answer.str.split().apply(len).gt(n)]
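For completeness, here is a self-contained sketch of both split-based options run against the question's data. The intermediate names (`word_counts`, `via_str_len`, `via_apply`) are mine and only there for illustration.

```python
import pandas as pd

columns = ['answer', 'some_number']
data = [['hello how are you doing', '1.0'],
        ['hello', '1.0'],
        ['bye bye bye bye', '0.0'],
        ['no', '0.0'],
        ['yes', '1.0'],
        ['Who let the dogs out', '0.0'],
        ['1 + 1 + 1 + 2', '1.0']]
df = pd.DataFrame(columns=columns, data=data)
n = 3

# str.split() with no arguments splits on any run of whitespace,
# so each cell becomes a list of words.
word_counts = df.answer.str.split().str.len()

# Both options build the same boolean mask and select the same rows.
via_str_len = df[df.answer.str.split().str.len().gt(n)]
via_apply = df[df.answer.str.split().apply(len).gt(n)]

print(word_counts.tolist())
print(via_str_len)
```

Because `split()` without a separator collapses repeated whitespace, these variants do not need the `\s+` workaround that the counting approach does.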

What's fastest?

Fastest overall (BENY's list comprehension):

df[[x.count(' ') >= n for x in df.answer]]

Fastest pandas-based (anky's first answer):
```
df[df.answer.str.count(' ').ge(n)]
```

Timed with ~20 words per sentence. [Figure: timing plot, not recovered]
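If you want to check the ranking on your own machine, a rough benchmark could look like the sketch below. The data (10,000 identical sentences of 20 words) and the repeat counts are my assumptions, not the setup used for the original timings, and the results will vary with hardware and pandas version.

```python
import timeit
import pandas as pd

# Hypothetical benchmark data: 10,000 sentences of 20 words each.
sentence = ' '.join(['word'] * 20)
df = pd.DataFrame({'answer': [sentence] * 10_000})
n = 3

# The four approaches discussed in this thread.
candidates = {
    'list comprehension + count': lambda: df[[x.count(' ') >= n for x in df.answer]],
    'str.count': lambda: df[df.answer.str.count(' ').ge(n)],
    'str.split + str.len': lambda: df[df.answer.str.split().str.len().gt(n)],
    'str.split + apply(len)': lambda: df[df.answer.str.split().apply(len).gt(n)],
}

# Take the best of 3 repeats, each averaging 3 calls, to reduce noise.
timings = {
    name: min(timeit.repeat(fn, number=3, repeat=3)) / 3
    for name, fn in candidates.items()
}

# Print from fastest to slowest.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f'{name}: {seconds * 1000:.2f} ms per run')
```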

Why doesn't `df[len(df.answer) > 3]` work?

len(df.answer) returns the length of the answer column itself (7), not the number of words per answer (5, 1, 4, 1, 1, 5, 7).

That means the final expression evaluates to df[7 > 3] or df[True], which breaks because there is no column True:

>>> len(df.answer)
7

>>> len(df.answer) > 3     # 7 > 3
True

>>> df[len(df.answer) > 3] # df[True] doesn't exist
KeyError: True
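The difference between the scalar and the per-row check can be sketched as follows, using a trimmed copy of the question's data. The variable names are mine; the point is that indexing with a single bool is treated as a column lookup, while indexing with a boolean Series filters rows.

```python
import pandas as pd

df = pd.DataFrame({'answer': ['hello how are you doing',
                              'hello',
                              'bye bye bye bye',
                              'no']})

# len() on the column gives the number of rows, so this is one plain bool.
scalar_check = len(df.answer) > 3   # 4 > 3

# A plain bool is looked up as a column label, which does not exist.
try:
    df[scalar_check]
    raised = False
except KeyError:
    raised = True

# A per-row word count gives a boolean Series, which works as a row mask.
mask = df.answer.str.split().str.len() > 3
filtered = df[mask]

print(type(scalar_check).__name__, raised)
print(filtered)
```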

Answer

If I understand this correctly, here's one way:

>>> df.loc[df['answer'].str.split().apply(len) > 3, 'answer']
0    hello how are you doing
2            bye bye bye bye
5       Who let the dogs out
6              1 + 1 + 1 + 2
