Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python 3.x - Implement max features functionality

> As a part of this task you have to modify your fit and transform functions so that your vocab will contain only 50 terms with top idf scores.
>
> This task is similar to your previous task, just that here your vocabulary is limited to only the top 50 feature names based on their idf values. Basically your output will have exactly 50 columns and the number of rows will depend on the number of documents you have in your corpus.
>
> Here you will be given a pickle file, with file name cleaned_strings. You would have to load the corpus from this file and use it as input to your tfidf vectorizer.
>
> Steps to approach this task:
>
> 1. You would have to write both fit and transform methods for your custom implementation of tfidf vectorizer, just like in the previous task. Additionally, here you have to limit the number of features generated to 50 as described above.
> 2. Now sort your vocab in descending order of idf values and print out the words in the sorted vocab after you fit your data. Here you should be getting only 50 terms in your vocab. And make sure to print idf values for each term in your vocab.
> 3. Make sure the output of your implementation is a sparse matrix. Before generating the final output, you need to normalize your sparse matrix using L2 normalization. You can refer to this link: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html
> 4. Now check the output of a single document in your collection of documents; you can convert the sparse matrix related only to that document into a dense matrix and print it. This dense matrix should contain 1 row and 50 columns.
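The last steps of the task (sparse output, L2 normalization, inspecting one document) can be sketched roughly as follows. This is only an illustration under assumptions: the `vocab` and `idf` dictionaries here are tiny hand-written stand-ins for the fitted 50-term versions, the idf numbers are made up, and term frequency is assumed to be the word count divided by the document length. It relies on `scipy.sparse.csr_matrix` and `sklearn.preprocessing.normalize`, the function the task links to.

```python
from collections import Counter

from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize


def transform(dataset, vocab, idf):
    """Build an L2-normalized sparse tf-idf matrix restricted to `vocab`."""
    rows, cols, vals = [], [], []
    for row_idx, doc in enumerate(dataset):
        words = doc.split()
        if not words:
            continue  # an empty document stays an all-zero row
        for word, count in Counter(words).items():
            col = vocab.get(word)
            if col is None:
                continue  # word is not among the selected features
            tf = count / len(words)  # assumed tf definition
            rows.append(row_idx)
            cols.append(col)
            vals.append(tf * idf[word])
    matrix = csr_matrix((vals, (rows, cols)), shape=(len(dataset), len(vocab)))
    # Row-wise L2 normalization; all-zero rows are left as zeros.
    return normalize(matrix, norm="l2")


# Hypothetical fitted vocabulary and idf values (not taken from the real corpus).
vocab = {"second": 0, "third": 1}
idf = {"second": 1.9, "third": 1.9}
docs = [
    "this is the second second document",
    "the third one",
    "nothing relevant here",
]

tfidf = transform(docs, vocab, idf)
single_doc = tfidf[0].toarray()  # dense view of one document: 1 row, len(vocab) columns
print(tfidf.shape)
print(single_doc)
```

With the real 50-term vocab, `tfidf[0].toarray()` would have 1 row and 50 columns. Because each row is L2-normalized afterwards, the per-document denominator in the term frequency cancels out, so the exact tf definition matters less than the idf values.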

import pickle

with open('cleaned_strings', 'rb') as f:
    corpus2 = pickle.load(f)
# print the length of the loaded corpus
print("Number of documents in corpus = ", len(corpus2))

output:

Number of documents in corpus = 746

from tqdm import tqdm  # tqdm visualizes the progress of a for loop; see https://tqdm.github.io/

# fit accepts only a list of sentences
def fit(dataset):
    unique_words = set()  # start with an empty set
    # check whether the input is a list
    if isinstance(dataset, list):
        for row in dataset:  # for each review in the dataset
            for word in row.split(" "):  # split converts a string into a list of words
                if len(word) < 2:  # skip empty strings and single-character words
                    continue
                unique_words.add(word)
        unique_words = sorted(unique_words)
        vocab = {j: i for i, j in enumerate(unique_words)}
        return vocab
    else:
        print("you need to pass a list of sentences")

vocab2 = fit(corpus2)
print(vocab2)

How do I apply a max-features limit so that the vocab contains only the 50 terms with the top idf scores? At the moment my output is the full vocabulary:

{'aailiyah': 0, 'abandoned': 1, 'ability': 2, 'abroad': 3, 'absolutely': 4, 'abstruse': 5, 'abysmal': 6, 'academy': 7, 'accents': 8, 'accessible': 9, 'acclaimed': 10, 'accolades': 11, 'accurate': 12, 'accurately': 13, 'accused': 14, 'achievement': 15, 'achille': 16, 'ackerman': 17, 'act': 18, 'acted': 19, 'acting': 20, 'action': 21, 'actions': 22, 'actor': 23, 'actors': 24, 'actress': 25, 'actresses': 26, 'actually': 27, 'adams': 28, 'adaptation': 29, 'add': 30, 'added': 31, 'addition': 32, 'admins': 33, 'admiration': 34, 'admitted': 35, 'adorable': 36, 'adrift': 37, 'adventure': 38, 'advise': 39, 'aerial': 40, 'aesthetically': 41, 'affected': 42, 'affleck': 43, 'afraid': 44, 'africa': 45, 'afternoon': 46, 'age': 47, 'aged': 48, 'ages': 49, 'ago': 50, 'agree': 51, 'agreed': 52, 'aimless': 53, 'air': 54, 'aired': 55, 'akasha': 56, 'akin': 57, 'alert': 58, 'alexander': 59, 'alike': 60, 'allison': 61, 'allow': 62, 'allowing': 63, 'almost': 64, 'along': 65, 'alongside': 66, 'already': 67, 'also': 68, 'although': 69, 'always': 70, 'amateurish': 71, 'amaze': 72, 'amazed': 73, 'amazing': 74, 'amazingly': 75, 'america': 76, 'american': 77, 'americans': 78, 'among': 79, 'amount': 80, 'amusing': 81, 'amust': 82, 'anatomist': 83, 'angel': 84, 'angela': 85, 'angeles': 86, 'angelina': 87, 'angle': 88, 'angles': 89, 'angry': 90, 'anguish': 91, 'angus': 92, 'animals': 93, 'animated': 94, 'animation': 95, 'anita': 96, 'ann': 97, 'anne': 98, 'anniversary': 99, 'annoying': 100, 'another': 101, 'anthony': 102, 'antithesis': 103, 'anyone': 104, 'anything': 105, 'anyway': 106, 'apart': 107, 'appalling': 108, 'appealing': 109, 'appearance': 110, 'appears': 111, 'applauded': 112, 'applause': 113, 'appreciate': 114, 'appropriate': 115, 'apt': 116, 'argued': 117, 'armageddon': 118, 'armand': 119, 'around': 120, 'array': 121, 'art': 122, 'articulated': 123, 'artiness': 124, 'artist': 125, 'artistic': 126, 'artless': 127, 'arts': 128, 'aside': 129, 'ask': 130, 'asleep': 131, 'aspect': 132, 
'aspects': 133, 'ass': 134, 'assante': 135, 'assaulted': 136, 'assistant': 137, 'astonishingly': 138, 'astronaut': 139, 'atmosphere': 140, 'atrocious': 141, 'atrocity': 142, 'attempt': 143, 'attempted': 144, 'attempting': 145, 'attempts': 146, 'attention': 147, 'attractive': 148, 'audience': 149, 'audio': 150, 'aurv': 151, 'austen': 152, 'austere': 153, 'author': 154, 'average': 155, 'aversion': 156, 'avoid': 157, 'avoided': 158, 'award': 159, 'awarded': 160, 'awards': 161, 'away': 162, 'awesome': 163, 'awful': 164, 'awkwardly': 165, 'aye': 166, 'baaaaaad': 167, 'babbling': 168, 'babie': 169, 'baby': 170, 'babysitting': 171, 'back': 172, 'backdrop': 173, 'backed': 174, 'bad': 175, 'badly': 176, 'bag': 177, 'bailey': 178, 'bakery': 179, 'balance': 180, 'balanced': 181, 'ball': 182, 'ballet': 183, 'balls': 184, 'band': 185, 'barcelona': 186, 'barely': 187, 'barking': 188, 'barney': 189, 'barren': 190, 'based': 191, 'basic': 192, 'basically': 193, 'bat': 194, 'bates': 195, 'baxendale': 196, 'bear': 197, 'beautiful': 198, 'beautifully': 199, 'bec': 200, 'became': 201, 'bechard': 202, 'become': 203, 'becomes': 204, 'began': 205, 'begin': 206, 'beginning': 207, 'behind': 208, 'behold': 209, 'bela': 210, 'believable': 211, 'believe': 212, 'believed': 213, 'bell': 214, 'bellucci': 215, 'belly': 216, 'belmondo': 217, 'ben': 218, 'bendingly': 219, 'bennett': 220, 'bergen': 221, 'bertolucci': 222, 'best': 223, 'better': 224, 'betty': 225, 'beware': 226, 'beyond': 227, 'bible': 228, 'big': 229, 'biggest': 230, 'billy': 231, 'biographical': 232, 'bipolarity': 233, 'bit': 234, 'bitchy': 235, 'black': 236, 'blah': 237, 'blake': 238, 'bland': 239, 'blandly': 240, 'blare': 241, 'blatant': 242, 'blew': 243, 'blood': 244, 'blown': 245, 'blue': 246, 'blush': 247, 'boasts': 248, 'bob': 249, 'body': 250, 'bohemian': 251, 'boiling': 252, 'bold': 253, 'bombardments': 254, 'bond': 255, 'bonding': 256, 'bonus': 257, 'bonuses': 258, 'boobs': 259, 'boogeyman': 260, 'book': 261, 'boost': 262, 
'bop': 263, 'bordered': 264, 'borderlines': 265, 'borders': 266, 'bore': 267, 'bored': 268, 'boring': 269, 'borrowed': 270, 'boss': 271, 'bother': 272, 'bothersome': 273, 'bought': 274, 'box': 275, 'boyfriend': 276, 'boyle': 277, 'brain': 278, 'brainsucking': 279, 'brat': 280, 'breaking': 281, 'breeders': 282, 'brevity': 283, 'brian': 284, 'brief': 285, 'brigand': 286, 'bright': 287, 'brilliance': 288, 'brilliant': 289, 'brilliantly': 290, 'bring': 291, 'brings': 292, 'broad': 293, 'broke': 294, 'brooding': 295, 'brother': 296, 'brutal': 297, 'buddy': 298, 'budget': 299, 'buffalo': 300, 'buffet': 301, 'build': 302, 'builders': 303, 'buildings': 304, 'built': 305, 'bullock': 306, 'bully': 307, 'bunch': 308, 'burton': 309, 'business': 310, 'buy': 311, 'cable': 312, 'cailles': 313, 'california': 314, 'call': 315, 'called': 316, 'calls': 317, 'came': 318, 'cameo': 319, 'camera': 320, 'camerawork': 321, 'camp': 322, 'campy': 323, 'canada': 324, 'cancan': 325, 'candace': 326, 'candle': 327, 'cannot': 328, 'cant': 329, 'captain': 330, 'captured': 331, 'captures': 332, 'car': 333, 'card': 334, 'cardboard': 335, 'cardellini': 336, 'care': 337, 'carol': 338, 'carrell': 339, 'carries': 340, 'carry': 341, 'cars': 342, 'cartoon': 343, 'cartoons': 344, 'case': 345, 'cases': 346, 'cast': 347, 'casted': 348, 'casting': 349, 'cat': 350, 'catchy': 351, 'caught': 352, 'cause': 353, 'ceases': 354, 'celebration': 355, 'celebrity': 356, 'celluloid': 357, 'centers': 358, 'central': 359, 'century': 360, 'certain': 361, 'certainly': 362, 'cg': 363, 'cgi': 364, 'chalkboard': 365, 'challenges': 366, 'chance': 367, 'change': 368, 'changes': 369, 'changing': 370, 'channel': 371, 'character': 372, 'characterisation': 373, 'characters': 374, 'charisma': 375, 'charismatic': 376, 'charles': 377, 'charlie': 378, 'charm': 379, 'charming': 380, 'chase': 381, 'chasing': 382, 'cheap': 383, 'cheaply': 384, 'check': 385, 'checking': 386, 'cheek': 387, 'cheekbones': 388, 'cheerfull': 389, 'cheerless': 
390, 'cheesiness': 391, 'cheesy': 392, 'chemistry': 393, 'chick': 394, 'child': 395, 'childhood': 396, 'children': 397, 'childrens': 398, 'chills': 399, 'chilly': 400, 'chimp': 401, 'chodorov': 402, 'choice': 403, 'choices': 404, 'choked': 405, 'chosen': 406, 'chow': 407, 'christmas': 408, 'christopher': 409, 'church': 410, 'cinema': 411, 'cinematic': 412, 'cinematographers': 413, 'cinematography': 414, 'circumstances': 415, 'class': 416, 'classic': 417, 'classical': 418, 'clear': 419, 'clearly': 420, 'clever': 421, 'clich': 422, 'cliche': 423, 'clients': 424, 'cliff': 425, 'climax': 426, 'close': 427, 'closed': 428, 'clothes': 429, 'club': 430, 'co': 431, 'coach': 432, 'coal': 433, 'coastal': 434, 'coaster': 435, 'coherent': 436, 'cold': 437, 'cole': 438, 'collect': 439, 'collective': 440, 'colored': 441, 'colorful': 442, 'colours': 443, 'columbo': 444, 'come': 445, 'comedic': 446, 'comedy': 447, 'comes': 448, 'comfortable': 449, 'comforting': 450, 'comical': 451, 'coming': 452, 'commands': 453, 'comment': 454, 'commentary': 455, 'commented': 456, 'comments': 457, 'commercial': 458, 'community': 459, 'company': 460, 'compelling': 461, 'competent': 462, 'complete': 463, 'completed': 464, 'completely': 465, 'complex': 466, 'complexity': 467, 'composed': 468, 'composition': 469, 'comprehensible': 470, 'compromise': 471, 'computer': 472, 'concentrate': 473, 'conception': 474, 'conceptually': 475, 'concerning': 476, 'concerns': 477, 'concert': 478, 'conclusion': 479, 'condescends': 480, 'confidence': 481, 'configuration': 482, 'confirm': 483, 'conflict': 484, 'confuses': 485, 'confusing': 486, 'connections': 487, 'connery': 488, 'connor': 489, 'conrad': 490, 'consequences': 491, 'consider': 492, 'considerable': 493, 'considered': 494, 'considering': 495, 'considers': 496, 'consistent': 497, 'consolations': 498, 'constant': 499, 'constantine': 500, 'constructed': 501, 'contained': 502, 'containing': 503, 'contains': 504, 'content': 505, 'continually': 506, 
'continuation': 507, 'continue': 508, 'continuity': 509, 'continuously': 510, 'contract': 511, 'contrast': 512, 'contributing': 513, 'contributory': 514, 'contrived': 515, 'control': 516, 'controversy': 517, 'convention': 518, 'convey': 519, 'convince': 520, 'convincing': 521, 'convoluted': 522, 'cool': 523, 'coppola': 524, 'cords': 525, 'core': 526, 'corn': 527, 'corny': 528, 'correct': 529, 'cost': 530, 'costs': 531, 'costumes': 532, 'cotton': 533, 'could': 534, 'couple': 535, 'course': 536, 'court': 537, 'courtroom': 538, 'cover': 539, 'cowardice': 540, 'cox': 541, 'crackles': 542, 'crafted': 543, 'crap': 544, 'crash
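One possible way to get from the full vocabulary to a 50-term one is to compute an idf value for every word during fit, sort by idf in descending order, and keep only the first `max_features` words. The sketch below is a hedged illustration, not the assignment's reference solution: the function name `fit_top_idf`, the alphabetical tie-breaking, and the smoothed formula idf(t) = 1 + ln((1 + N) / (1 + df(t))) are assumptions, and a small toy corpus stands in for `corpus2`.

```python
import math
from collections import Counter


def fit_top_idf(dataset, max_features=50):
    """Return (vocab, idf) limited to the max_features terms with the highest idf."""
    if not isinstance(dataset, list):
        raise TypeError("dataset must be a list of sentences")
    n_docs = len(dataset)
    doc_freq = Counter()
    for row in dataset:
        # A set is used so each word counts once per document.
        doc_freq.update({w for w in row.split() if len(w) >= 2})
    # Smoothed idf (assumed formula, see the note above).
    idf = {w: 1.0 + math.log((1 + n_docs) / (1 + df)) for w, df in doc_freq.items()}
    # Highest idf first; ties are broken alphabetically (an assumption).
    top = sorted(idf.items(), key=lambda kv: (-kv[1], kv[0]))[:max_features]
    vocab = {word: col for col, (word, _) in enumerate(top)}
    top_idf = {word: value for word, value in top}
    return vocab, top_idf


# Toy corpus used only for illustration.
toy_corpus = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
    "is this the first document",
]
vocab, idf_values = fit_top_idf(toy_corpus, max_features=3)
print(vocab)  # {'and': 0, 'one': 1, 'second': 2}
for word, value in idf_values.items():
    print(word, value)
```

Calling `fit_top_idf(corpus2, max_features=50)` should then give a 50-entry vocab, provided the corpus has at least 50 distinct words of two or more characters. Many rare words tie for the highest idf, so which 50 survive depends on the tie-breaking rule. Note also that scikit-learn's own `max_features` option ranks terms by term frequency across the corpus rather than by idf, so it would not select the same terms.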

He who fights with dragons too long becomes a dragon himself; gaze too long into the abyss, and the abyss will gaze back…

1 Reply

Waiting for answers
