Split String To Array Of Strings With 1-3 Words Depends On Length

February 25, 2023

I have the following input string: Lorem ipsum dolor sit amet consectetur adipiscing elit sed doeiusmod tempor incididunt ut Duis aute irure dolor in reprehenderit in esse cillum dolor

Solution 1:

You can express your rules as abbreviated regular expressions, build a real regex from them and apply it to your input:

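const text = "Lorem ipsum, dolor. sit amet? consectetur,   adipiscing,  elit! sed doeiusmod tempor incididunt ut Duis aute irure dolor in reprehenderit in esse cillum dolor eu fugia bla?";

// S stands for a short word (1-5 characters), L for a long word (6 or more)
const rules = ['(SSS)', '(SS(?=L))', '(L(?=L))', '(SL)', '(LS)', '(.+)']

// expand the abbreviations into one alternation of real patterns
const regex = new RegExp(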
    rules
        .join('|')
        .replace(/S/g, '\\w{1,5}\\W+')
        .replace(/L/g, '\\w{6,}\\W+')
    , 'g')

console.log(text.match(regex))

If the rules don't change, the regex construction part is only needed once.

Note that this also handles punctuation in a reasonable way.
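
To illustrate building the regex only once, here is a minimal sketch that wraps the construction in a factory and reuses the returned function. The names buildSplitter and splitText are hypothetical, and trimming each chunk is an added assumption, not part of the answer above:

```javascript
const RULES = ['(SSS)', '(SS(?=L))', '(L(?=L))', '(SL)', '(LS)', '(.+)'];

// Expand the S/L abbreviations once and return a reusable splitter.
function buildSplitter(rules) {
  const source = rules
    .join('|')
    .replace(/S/g, '\\w{1,5}\\W+')
    .replace(/L/g, '\\w{6,}\\W+');
  const regex = new RegExp(source, 'g');
  // String.prototype.match restarts a global regex from index 0 on every call,
  // so the same regex object can safely be shared between calls.
  return (text) => (text.match(regex) || []).map((chunk) => chunk.trim());
}

const splitText = buildSplitter(RULES);

console.log(splitText('Lorem ipsum, dolor. sit amet? consectetur, adipiscing, elit! '));
// [ 'Lorem ipsum, dolor.', 'sit amet?', 'consectetur,', 'adipiscing, elit!' ]
console.log(splitText('Duis aute irure dolor'));
// [ 'Duis aute irure', 'dolor' ]
```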

Solution 2:

One option is to first create an array of rules, like:

const rules = [
  // [# of words to splice if all conditions met, condition for word1, condition for word2, condition for word3...]
  [3, 'less', 'less', 'less'],
  // the above means: splice 3 words if the next 3 words' lengths are <6, <6, <6
  [2, 'less', 'less', 'eqmore'],
  // the above means: splice 2 words if the next 3 words' lengths are <6, <6, >=6
  [1, 'eqmore', 'eqmore'],
  [2, 'eqmore', 'less'],
  [2, 'less', 'eqmore']
];

Then iterate through the array of rules, finding the rule that matches, extracting the appropriate number of words to splice from the matching rule, and push to the output array:

const rules = [
  [3, 'less', 'less', 'less'],
  [2, 'less', 'less', 'eqmore'],
  [1, 'eqmore', 'eqmore'],
  [2, 'eqmore', 'less'],
  [2, 'less', 'eqmore']
];
const s = "Lorem ipsum dolor sit amet consectetur adipiscing elit sed doeiusmod tempor incididunt ut Duis aute irure dolor in reprehenderit in esse cillum dolor eu fugia";

const words = s.split(' ');
const output = [];
const verify = (cond, word) => cond === 'less' ? word.length < 6 : word.length >= 6;
while (words.length) {
  const [wordCount] = rules.find(
    ([wordCount, ...conds]) => conds.every((cond, i) => verify(cond, words[i]))
  );
  output.push(words.splice(0, wordCount).join(' '));
}
console.log(output);

Of course, this assumes that a rule matches at every position: .find returns undefined when no rule matches, and destructuring that result throws.

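For the additional rule that any words not matched by the previous rules just be added to the output, add [1] at the bottom of the rules array, and check that words[i] exists before testing it, so that rules needing more words than remain simply fail: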

const rules = [
  [3, 'less', 'less', 'less'],
  [2, 'less', 'less', 'eqmore'],
  [1, 'eqmore', 'eqmore'],
  [2, 'eqmore', 'less'],
  [2, 'less', 'eqmore'],
  [1]
];
const s = "Lorem ipsum dolor sit amet consectetur adipiscing elit sed doeiusmod tempor incididunt ut Duis aute irure dolor in reprehenderit in esse cillum dolor eu fugia";

const words = s.split(' ');
const output = [];
const verify = (cond, word) => cond === 'less' ? word.length < 6 : word.length >= 6;
while (words.length) {
  const [wordCount] = rules.find(
    ([wordCount, ...conds]) => conds.every((cond, i) => words[i] && verify(cond, words[i]))
  );
  output.push(words.splice(0, wordCount).join(' '));
}
console.log(output);
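
As a small illustration of when the [1] fallback kicks in, the sketch below packages the same loop as a function and runs it on an input whose last two words match none of the five rules. The names RULE_TABLE, matches and splitByRuleTable are hypothetical and chosen only for this example:

```javascript
const RULE_TABLE = [
  [3, 'less', 'less', 'less'],
  [2, 'less', 'less', 'eqmore'],
  [1, 'eqmore', 'eqmore'],
  [2, 'eqmore', 'less'],
  [2, 'less', 'eqmore'],
  [1], // fallback: no conditions, so it always matches and takes one word
];

const matches = (cond, word) =>
  cond === 'less' ? word.length < 6 : word.length >= 6;

function splitByRuleTable(sentence, rules = RULE_TABLE) {
  const words = sentence.split(' ').filter(Boolean);
  const output = [];
  while (words.length) {
    // A rule fails if it needs a word beyond the end of the list.
    const [wordCount] = rules.find(([, ...conds]) =>
      conds.every((cond, i) => words[i] !== undefined && matches(cond, words[i]))
    );
    output.push(words.splice(0, wordCount).join(' '));
  }
  return output;
}

console.log(splitByRuleTable('esse cillum dolor eu'));
// [ 'esse cillum', 'dolor', 'eu' ]
```

Here 'dolor eu' is two short words with nothing after them, so neither the three-word rule nor any two-word rule applies, and the fallback emits the words one at a time.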

Solution 3:

If we define words with length <6 to have size 1 and >=6 to have size 2, we can rewrite the rules as: "if the next word would make the total size of the current row >= 4, start a new line".

function wordSize(word) {
  if (word.length < 6) 
    return 1;
  return 2;
}
let s = "Lorem ipsum dolor sit amet consectetur adipiscing elit sed doeiusmod tempor incididunt ut Duis aute irure dolor in reprehenderit in esse cillum dolor eu fugia";
let result = [];
const words = s.split(" ");
let row = [];
for (let i = 0; i < words.length; ++i) {
  // start a new row if adding the next word would bring the row's total size to 4 or more
  if (row.reduce((sum, w) => sum + wordSize(w), 0) + wordSize(words[i]) >= 4) {
    result.push(row);
    row = [];
  }
  row.push(words[i]);
}
result.push(row);
result = result.map(a => a.join(" "));
console.log(result);
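
The same size-based idea can be sketched as a function that keeps a running total instead of re-summing the row on every step. The names splitBySize and limit are hypothetical, and the running total is a variation on the code above, not the answer's own implementation:

```javascript
const wordSize = (word) => (word.length < 6 ? 1 : 2);

function splitBySize(sentence, limit = 4) {
  const rows = [];
  let row = [];
  let rowSize = 0; // total size of the words currently in `row`
  for (const word of sentence.split(' ')) {
    const size = wordSize(word);
    // Close the current row if this word would bring its size up to the limit.
    if (row.length > 0 && rowSize + size >= limit) {
      rows.push(row.join(' '));
      row = [];
      rowSize = 0;
    }
    row.push(word);
    rowSize += size;
  }
  if (row.length > 0) rows.push(row.join(' '));
  return rows;
}

console.log(splitBySize('sit amet consectetur adipiscing elit sed'));
// [ 'sit amet', 'consectetur', 'adipiscing elit', 'sed' ]
```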

Solution 4:

I also found this problem very interesting. This is a long-format answer which shows the process of how I arrived at the final program. There are several code blocks labeled sketch along the way. I hope this approach is helpful to beginners in functional style.

Using the data.maybe module, I started out with -

// sketch 1
const wordsToLines = (words = [], r = []) =>
  words.length === 0
    ? Just (r)
    : ruleA (words)
        .orElse (_ => ruleB (words))
        .orElse (_ => ruleC (words))
        .orElse (_ => ruleD (words))
        .orElse (_ => ruleE (words))
        .orElse (_ => defaultRule (words))
        .chain (({ line, next }) => 
          wordsToLines (next, [...r, line ])
        )

Then I started writing some of the rules ...

// sketch 2
const success = (line, next) =>
  Just ({ line, next })

const defaultRule = ([ line, ...next ]) =>
  success (line, next)

const ruleA = ([ a, b, c, ...more ]) =>
  small (a) && small (b) && small(c)
    ? success (line (a, b, c), more)
    : Nothing ()

const ruleB = ([ a, b, c, ...more ]) =>
  small (a) && small (b) && large (c)
    ? success (line (a, b), [c, ...more])
    : Nothing ()

// ...

Way too messy and repetitive, I thought. As the author of these functions, it's my job to make them work for me! So I restarted this time designing the rules to do the hard work -

// sketch 3
const rule = (guards = [], take = 0) =>
  // TODO: implement me...

const ruleA =
  rule
    ( [ small, small, small ] // pattern to match
    , 3                       // words to consume
    )

const ruleB =
  rule ([ small, small, large ], 2)

// ruleC, ruleD, ruleE, ...

const defaultRule =
  rule ([ always (true) ], 1)

These rules are much simpler. Next, I wanted to clean up wordsToLines a bit -

// sketch 4
const wordsToLines = (words = [], r = []) =>
  words.length === 0
    ? Just (r)
    : oneOf (ruleA, ruleB, ruleC, ruleD, ruleE, defaultRule)
        (words)
        .chain (({ line, next }) => 
          wordsToLines (next, [...r, line ])
        )

In our initial sketch, the rules constructed a {line, next} object, but a higher-order rule means we can hide even more complexity away. And the oneOf helper makes it easy to move our rules inline -

// final revision
const wordsToLines = (words = [], r = []) =>
  words.length === 0
    ? Just (r)
    : oneOf
        ( rule ([ small, small, small ], 3) // A
        , rule ([ small, small, large ], 2) // B
        , rule ([ large, large ], 1)        // C
        , rule ([ large, small ], 2)        // D
        , rule ([ small, large ], 2)        // E
        , rule ([ always (true) ], 1) // default
        )
        ([ words, r ])
        .chain (apply (wordsToLines))

Finally, we can write our main function, formatSentence -

const formatSentence = (sentence = "") =>
  wordsToLines (sentence .split (" "))
    .getOrElse ([])

The wires are mostly untangled now. We just have to supply the remaining dependencies -

const { Just, Nothing } =
  require ("data.maybe")

const [ small, large ] =
  dual ((word = "") => word.length < 6)

const oneOf = (init, ...more) => x =>
  more.reduce((r, f) => r .orElse (_ => f(x)), init (x))

const rule = (guards = [], take = 0) =>
  ([ words = [], r = [] ]) =>
    guards .every ((g, i) => g (words[i]))
      ? Just
          ( [ words .slice (take)
            , [ ...r, words .slice (0, take) .join (" ") ]
            ]
          )
      : Nothing ()
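
To see in isolation how oneOf falls through a list of alternatives, here is a minimal sketch. The Just and Nothing defined below are a hypothetical stand-in that implements only orElse and getOrElse, so the example runs without installing data.maybe; the three word classifiers are likewise made up for illustration:

```javascript
// Hypothetical stand-in for data.maybe, covering only what this demo needs.
const Just = (value) => ({
  orElse: (_) => Just(value),        // already a result: ignore the alternative
  getOrElse: (_) => value,
});
const Nothing = () => ({
  orElse: (f) => f(),                // no result yet: try the alternative
  getOrElse: (fallback) => fallback,
});

// Same definition as above: try each function in order, keep the first Just.
const oneOf = (init, ...more) => (x) =>
  more.reduce((r, f) => r.orElse((_) => f(x)), init(x));

const ifShort = (w) => (w.length < 6 ? Just('short') : Nothing());
const ifVeryLong = (w) => (w.length >= 10 ? Just('very long') : Nothing());
const otherwise = (_) => Just('medium');

const classify = oneOf(ifShort, ifVeryLong, otherwise);

console.log(classify('amet').getOrElse('?'));        // short
console.log(classify('adipiscing').getOrElse('?'));  // very long
console.log(classify('tempor').getOrElse('?'));      // medium
```

Because orElse takes a function, later alternatives are only evaluated when every earlier one returned Nothing.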

And some functional primitives -

const identity = x =>
  x

const always = x =>
  _ => x

const apply = (f = identity) =>
  (args = []) => f (...args)

const dual = f =>
  [ x => Boolean (f (x))
  , x => ! Boolean (f (x))
  ]

Let's run the program -

formatSentence ("Lorem ipsum dolor sit amet consectetur adipiscing elit sed doeiusmod tempor incididunt ut Duis aute irure dolor in reprehenderit in esse cillum dolor eu fugia ...")

// [ 'Lorem ipsum dolor'
// , 'sit amet'
// , 'consectetur'
// , 'adipiscing elit'
// , 'sed doeiusmod'
// , 'tempor'
// , 'incididunt ut'
// , 'Duis aute irure'
// , 'dolor in'
// , 'reprehenderit in'
// , 'esse cillum'
// , 'dolor eu fugia'
// , '...'
// ]


Solution 5:

(Updated to incorporate suggestion from user633183.)

I found this an interesting problem. I wanted to write a more generic version immediately, and I settled on one that accepted a list of rules, each of which described the number of words that it would gather and a test for each of those words. So with lt6 being essentially (str) => str.length < 6, the first rule (A) would look like this:

[3, lt6, lt6, lt6],

This, it turns out, is quite similar to the solution from CertainPerformance: that answer uses strings to represent the two behaviors, while this one uses actual functions. The implementation, though, is fairly different.

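const allMatch = (fns, xs) =>
  // a rule that tests more words than remain cannot match
  // (without this check, fn (undefined) would throw at the end of the input)
  fns.every ( (fn, i) => i < xs .length && fn ( xs[i] ) )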

const splitByRules = (rules) => {
  const run = 
    ( xs
    , res = []
    , [count] = rules .find 
        ( ([count, ...fns]) => 
          count <= xs .length 
          && allMatch (fns, xs)
        ) 
        || [1] // if no rules match, choose next word only
    ) => xs.length === 0
      ? res
      : run 
        ( xs .slice (count) 
        , res .concat ([xs .slice (0, count) ])
        )

  return (str) => 
    run (str .split (/\s+/) ) 
      .map (ss => ss .join (' '))
}

const shorterThan = (n) => (s) => 
  s .length < n

const atLeast = (n) => (s) =>
  s .length >= n

const lt6 = shorterThan (6)
const gte6 = atLeast (6)

const rules = [
// +------------- Number of words to select in next block 
// |        +--------- Functions to test against each word
// |   _____|_____
// V  /           \
  [3, lt6, lt6, lt6],   // A
  [2, lt6, lt6, gte6],  // B
  [1, gte6, gte6],      // C
  [2, gte6, lt6],       // D
  [2, lt6, gte6],       // E
]

const words  = 'Lorem ipsum dolor sit amet consectetur adipiscing elit sed doeiusmod tempor incididunt ut Duis aute irure dolor in reprehenderit in esse cillum dolor eu fugia ...';

console .log (
  splitByRules (rules) (words) 
)

This uses a recursive function that bottoms out when the remaining list of words is empty and otherwise searches for the first rule that matches (with, again like CertainPerformance, a default rule that simply takes the next word) and selects the corresponding number of words, recurring on the remaining words.

For simplicity, the recursive function accepts an array of words and returns an array of arrays of words. A wrapper function handles converting these to and from strings.

The only other function of substance in here is the helper function allMatch. It is essentially ([f1, f2, ... fn], [x1, x2, ..., xn, ...]) => f1(x1) && f2(x2) && ... && fn(xn).
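
A short sketch of allMatch on its own may make this concrete. The bounds check (i < xs.length) is an assumption added here so that a word list shorter than the list of tests yields false instead of throwing; lt6 and gte6 are written out directly to keep the example self-contained:

```javascript
const allMatch = (fns, xs) =>
  fns.every((fn, i) => i < xs.length && fn(xs[i]));

const lt6 = (s) => s.length < 6;
const gte6 = (s) => s.length >= 6;

// Only the first three words are tested; the fourth is ignored.
console.log(allMatch([lt6, lt6, gte6], ['sit', 'amet', 'consectetur', 'adipiscing'])); // true

// The first test already fails, so the rest are skipped.
console.log(allMatch([gte6, gte6], ['sit', 'amet'])); // false

// More tests than words: no match.
console.log(allMatch([lt6, lt6, lt6], ['eu', 'fugia'])); // false
```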

Of course the currying means that splitByRules (myRules) returns a function you can store and run against different strings.

The order of the rules might be important. If two rules could overlap, you need to put the preferred match ahead of the other.

This added generality may or may not be of interest to you, but I think this technique has a significant advantage: it's much easier to modify if the rules ever change. Say you now also want to group four words when each of them is shorter than five characters. Then we would just write const lt5 = shorterThan(5) and include the rule

[4, lt5, lt5, lt5, lt5]

at the beginning of the list.
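
Here is a minimal sketch of that change in action. To stay short it uses an iterative loop in place of the recursive run above, and allMatch carries a bounds check; both are assumptions made for this example. The names baseRules and extendedRules are hypothetical:

```javascript
const allMatch = (fns, xs) =>
  fns.every((fn, i) => i < xs.length && fn(xs[i]));

// Iterative variant of splitByRules, still curried: rules first, string second.
const splitByRules = (rules) => (str) => {
  const words = str.split(/\s+/).filter(Boolean);
  const result = [];
  let i = 0;
  while (i < words.length) {
    const rest = words.slice(i);
    // First matching rule wins; if none matches, take a single word.
    const [count] = rules.find(([, ...fns]) => allMatch(fns, rest)) || [1];
    result.push(rest.slice(0, count).join(' '));
    i += count;
  }
  return result;
};

const shorterThan = (n) => (s) => s.length < n;
const atLeast = (n) => (s) => s.length >= n;
const lt5 = shorterThan(5);
const lt6 = shorterThan(6);
const gte6 = atLeast(6);

const baseRules = [
  [3, lt6, lt6, lt6],
  [2, lt6, lt6, gte6],
  [1, gte6, gte6],
  [2, gte6, lt6],
  [2, lt6, gte6],
];
// The new four-word rule goes first so it is preferred over rule A.
const extendedRules = [[4, lt5, lt5, lt5, lt5], ...baseRules];

const sentence = 'ut Duis aute in esse cillum dolor';
console.log(splitByRules(baseRules)(sentence));
// [ 'ut Duis aute', 'in esse', 'cillum dolor' ]
console.log(splitByRules(extendedRules)(sentence));
// [ 'ut Duis aute in', 'esse cillum', 'dolor' ]
```

Only the rule list differs between the two calls; the splitting function itself is untouched.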

To me that's a big win.
